1. Supervisors

Supervisor: Joël Legrand
Team and laboratory: Synalp team, Loria, Nancy
Contact: joel.legrand@loria.fr

Co-supervisor: Camille Teruel
Company: Steerway
Contact: camille@steerway.dev

2. Context

We are developing Steerway, a new code assistant. Our solution is based on an executable journal (a hybrid between a Jupyter notebook and a terminal) in which the developer can decompose a complex task into steps and compile relevant documents, such as:
- the issue tracker ticket that describes the task,
- documentation of third-party libraries,
- related source files.

The journal itself includes developer notes, such as a short description of the current step (e.g. "write a regression test"). As the developer makes progress on the task, the context changes to include the documents relevant to each step. The journal and its referenced documents are used as context to condition the generation of short code suggestions in the edited source file.

Our main objective is to improve the efficiency of the LLMs we use: maintaining or improving their accuracy for our use case while significantly decreasing their resource consumption. We focus on methods that do not require a huge amount of compute. The four subjects below are challenging and address current research issues. The selected candidate(s) will be asked to choose from the four proposed subjects, depending on their interests and abilities.

2.1 Assessing reasoning vs memorization effects

LLMs encode a lot of knowledge from their pre-training stage, albeit in a compressed, fuzzy form. Capturing real-world knowledge can be beneficial for some tasks, such as closed-book question answering, one of the typical use cases of conversational agents such as ChatGPT. But for other tasks, like retrieval-augmented generation (RAG), it might be more of a hindrance than an advantage. When the context does not match the pre-training data, an LLM with a lot of implicit knowledge may rely on its "memory" rather than reason over the out-of-distribution context.

One concrete example is retrieval-augmented code generation. An LLM may have been trained on a significant amount of data from a previous version of a library or programming language, but when prompted with the documentation of a newer version, it may still generate code that adheres to the old version. Is this just a prompt engineering problem or something more profound? This raises some interesting research questions:
- How can we determine how much a model is reciting training data vs. reasoning over the provided context? Does a high in-context learning ability imply reasoning, or are we fooled by tasks that are actually in distribution?
- What kind of data, training objectives, regularization, or other techniques can be applied to encourage models to learn to reason rather than to memorize?
- Reducing the memorization effect might also be an opportunity to reduce the size of models, as there is less data to store. The hypothesis is that reasoning patterns are more efficient than raw fact memorization.
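To make the first question concrete, the sketch below is a rough counterfactual probe (not Steerway code; the model name, the fictional `json.parse_text` API, and the prompt are illustrative assumptions): the context asserts something that contradicts likely pre-training data, and we compare the log-probability of the context-consistent completion with that of the memory-consistent one. A large gap in favour of the memorized answer is one symptom of recitation.

```python
# Minimal sketch of a memorization-vs-context probe (hypothetical model name,
# fictional API, and prompt). We give the model documentation that contradicts
# likely pre-training knowledge and compare the log-probability of the
# completion that follows the context against the one that follows "memory".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Coder-1.5B"  # assumption: any small causal code LLM works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Counterfactual documentation: the API was renamed in a (fictional) new version.
context = (
    "Documentation (v3.0): `json.loads` was removed; use `json.parse_text` instead.\n"
    "Task: parse the string `s` into a Python object.\n"
    "data = json."
)
context_answer = "parse_text(s)"   # follows the provided documentation
memory_answer = "loads(s)"         # follows what the model likely memorized

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of token log-probabilities of `completion` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Only score the completion tokens (boundary tokenization is approximate,
    # which is good enough for a sketch).
    return token_lp[:, prompt_len - 1:].sum().item()

ctx_lp = completion_logprob(context, context_answer)
mem_lp = completion_logprob(context, memory_answer)
print(f"context-consistent: {ctx_lp:.2f}  memory-consistent: {mem_lp:.2f}")
print("model follows the context" if ctx_lp > mem_lp else "model recites its memory")
```

Counterfactual task suites such as the one in [4] generalize this idea beyond a single hand-crafted example.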
References

[1] Xie, C., Huang, Y., Zhang, C., Yu, D., Chen, X., Lin, B., Li, B., Ghazi, B. & Kumar, R. On Memorization of Large Language Models in Logical Reasoning. (2024), https://arxiv.org/abs/2410.23123
[2] Lou, S., Chen, Y., Liang, X., Lin, L. & Zhang, Q. Quantifying In-Context Reasoning Effects and Memorization Effects in LLMs. (2024), https://arxiv.org/abs/2405.11880
[3] Prabhakar, A., Griffiths, T. & McCoy, R. Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning. (2024), https://arxiv.org/abs/2407.01687
[4] Wu, Z., Qiu, L., Ross, A., Akyürek, E., Chen, B., Wang, B., Kim, N., Andreas, J. & Kim, Y. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. (2024), https://arxiv.org/abs/2307.02477

2.2 Assessing the efficiency trade-offs of post-training compression

LLM compression is a set of techniques designed to reduce the size of LLMs. Reducing the memory footprint may also increase throughput when the hardware is capable of taking advantage of it. We are interested in the combined application of two post-training compression techniques: quantization and pruning. Quantization reduces the precision of a model's parameters, and pruning removes unused or redundant parameters.

Modern GPU architectures, such as Ampere and Hopper, have dedicated low-precision tensor cores (INT8 and INT4). They can also leverage specific sparsity patterns to increase their performance. Combining quantization with semi-structured pruning is therefore an opportunity to significantly decrease resource requirements and increase decoding speed.

Although extensive research has been conducted on each of these techniques in isolation, their combination remains largely undocumented. Each technique incurs a loss in accuracy, and these losses might compound when the techniques are combined. Considering the different pruning and quantization algorithms, the hyperparameters of each algorithm, and their varying effects on different models and benchmarks, the research space becomes vast. We are interested in uncovering the efficiency trade-offs and best practices of this aggressive LLM compression:
- How much accuracy is lost when applying quantization and pruning at the same time? How much can be recovered by fine-tuning?
- Are there empirical rules for tuning the hyperparameters of these algorithms?
- How is each technique best applied to the different parts of the transformer architecture?

References

[1] Zhu, X., Li, J., Liu, Y., Ma, C. & Wang, W. A Survey on Model Compression for Large Language Models. (2024), https://arxiv.org/pdf/2308.07633

2.3 Sparse attention and cache for dynamic multi-document context

In a RAG setting, the context consists of a list of relevant retrieved documents followed by a question or instruction. This is an opportunity to apply sparse attention. Each document is independent, so there is no strong reason for the tokens of one document to attend to the tokens of other documents. The question/instruction would attend to all tokens from all documents, allowing the model to make the necessary cross-document connections.

Because a single document may be used for many queries, this opens another opportunity: caching the attention scores of each document (see Figure 1). But we must assess whether caching those attention scores would actually save time, because what is saved in compute must be paid in IO. In transformer inference, where memory bandwidth is usually the bottleneck, loading cached attention scores could take longer than recomputing them. Another problem is that a document may appear at different positions in different contexts, which may prevent caching if the attention scores depend on absolute positional information. Some research suggests that attention scores are translation invariant [1], but this must be tested in practice. A translation-invariant positional encoding such as RoPE might help.

Figure 1: We can save a lot of computation with sparse attention and caching
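As a minimal sketch of the sparsity pattern of Figure 1 (plain PyTorch with toy shapes; the helper `multi_doc_mask` is ours, not an existing API), the code below builds a block-causal mask in which each document attends only to itself while the final question/instruction attends to the whole context:

```python
# Per-document attention mask for a multi-document RAG context (illustrative).
# Document tokens attend causally within their own document; the trailing
# question/instruction tokens attend causally to everything before them.
import torch
import torch.nn.functional as F

def multi_doc_mask(doc_lens: list[int], query_len: int) -> torch.Tensor:
    """Boolean mask [T, T]; True means 'may attend'."""
    T = sum(doc_lens) + query_len
    mask = torch.zeros(T, T, dtype=torch.bool)
    start = 0
    for n in doc_lens:
        # Causal block restricted to the document's own tokens.
        mask[start:start + n, start:start + n] = torch.ones(n, n).tril().bool()
        start += n
    # Question/instruction rows: causal attention over the full context.
    mask[start:, :] = torch.ones(query_len, T).tril(diagonal=start).bool()
    return mask

# Toy example: 2 documents of 4 and 3 tokens, followed by a 2-token question.
doc_lens, query_len = [4, 3], 2
T, heads, dim = sum(doc_lens) + query_len, 2, 16
q, k, v = (torch.randn(1, heads, T, dim) for _ in range(3))
mask = multi_doc_mask(doc_lens, query_len)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # mask broadcast over heads
print(mask.int())
print(out.shape)  # torch.Size([1, 2, 9, 16])
```

The mask is dense here for clarity; to actually save compute, the pattern would have to be exploited by a block-sparse attention kernel rather than passed to a dense implementation, and whether the per-document blocks are worth caching rather than recomputing is precisely the IO-versus-compute question raised above.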
References

[1] Wennberg, U. & Henter, G. The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models. (2021), https://arxiv.org/abs/2106.01950

2.4 Attention injection

Work on mechanistic interpretability reveals that many attention heads implement simple linguistic functions (e.g. the verb-to-subject relationship) or positional functions (e.g. attending to the previous token). Positional functions are trivial to implement exactly, and so are some linguistic functions in the context of code generation, thanks to semantic analysis (e.g. the use-def relationship). This means that we could replace those heads with an exact implementation of these functions by injecting attention patterns, which skips the computation of the attention scores for these heads. If we implement such modifications, a number of questions arise:
- Can we save a significant amount of compute? Maybe the resulting sparsity is too unstructured to leverage on hardware.
- Can we save a significant amount of memory and memory bandwidth? This could reduce the size of the KV cache.
- Can exact implementations improve model accuracy? If not, how well can a model recover from such modifications?
- Do such modifications open new pruning opportunities in lower layers?

References

[1] Voita, E., Talbot, D., Moiseev, F., Sennrich, R. & Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. (2019), https://arxiv.org/abs/1905.09418
[2] Liu, Z. & Chen, N. Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information. (2023), https://arxiv.org/abs/2312.09541
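As a concrete illustration of the attention-injection idea in section 2.4 (plain PyTorch, toy shapes, hypothetical function names), the sketch below fixes one head's attention pattern to an exact "previous token" pattern: the head's output is just the value vectors shifted by one position, so its query/key projections and attention scores never have to be computed or cached.

```python
# Minimal sketch of attention injection: one head's attention pattern is
# replaced by an exact "previous token" pattern, so computing it degenerates
# into shifting the value vectors by one position.
import torch

def previous_token_head(v: torch.Tensor) -> torch.Tensor:
    """Output of a head whose attention weights are fixed to 'attend to the
    previous token' (the first token attends to itself). v: [batch, T, dim]."""
    out = v.roll(shifts=1, dims=1)   # token t receives the value of token t-1
    out[:, 0] = v[:, 0]              # position 0 has no previous token
    return out

def injected_attention(v: torch.Tensor) -> torch.Tensor:
    """Reference version with an explicit injected attention matrix, to check
    that the shift above matches a regular weighted sum over values."""
    _, T, _ = v.shape
    attn = torch.zeros(T, T)
    attn[0, 0] = 1.0
    attn[torch.arange(1, T), torch.arange(0, T - 1)] = 1.0  # row t -> column t-1
    return attn @ v                  # broadcasts over the batch dimension

v = torch.randn(2, 5, 8)
assert torch.allclose(previous_token_head(v), injected_attention(v))
print(previous_token_head(v).shape)  # torch.Size([2, 5, 8])
```

A use-def head for code could in principle be injected the same way, with the pattern supplied by static analysis instead of a fixed shift.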