1. Supervisors

Supervisor: Joël Legrand
Team and laboratory: Synalp team, Loria, Nancy
Contact: joel.legrand@loria.fr

Co-supervisor: Camille Teruel
Company: Steerway
Contact: camille@steerway.dev

2. Context

We are developing Steerway, a new code assistant. Our solution is based on an executable journal (a hybrid between a Jupyter notebook and a terminal) in which the developer can decompose a complex task into steps and compile relevant documents, such as:
- the issue tracker ticket that describes the task,
- documentation of third-party libraries,
- related source files.

The journal itself includes developer notes, such as a short description of the current step (e.g. "write a regression test"). As the developer makes progress on the task, the context changes to include the documents relevant to each step. The journal and its referenced documents are used as context to condition the generation of short code suggestions in the edited source file.

Our main objective is to improve the efficiency of the LLMs we use: maintaining or improving their accuracy for our use case while significantly decreasing their resource consumption. We focus on methods that do not require a huge amount of compute. The four subjects below are challenging and address current research issues. The selected candidate(s) will be asked to choose from the four proposed subjects, depending on their interests and abilities.

2.1 Assessing reasoning vs memorization effects

LLMs encode a lot of knowledge from their pre-training stage, albeit in a compressed, fuzzy form. Capturing real-world knowledge can be beneficial for some tasks, such as closed-book question answering, one of the typical use cases of conversational agents such as ChatGPT. But for other tasks, like retrieval-augmented generation (RAG), it might be more of a hindrance than an advantage. When the context does not match the pre-training data, an LLM with a lot of implicit knowledge may rely on its "memory" rather than reason over the out-of-distribution context.

One concrete example is retrieval-augmented code generation. An LLM may have been trained on a significant amount of data from a previous version of a library or programming language, but when prompted with the documentation of a newer version, it may still generate code that adheres to the old version. Is this just a prompt engineering problem or something more profound? This raises some interesting research questions:
- How can we determine how much a model is reciting training data vs. reasoning over the provided context? Does a high in-context learning ability imply reasoning, or are we fooled by tasks that are actually in distribution?
- What kind of data, training objectives, regularization, or other techniques can be applied to encourage models to learn to reason rather than to memorize?
- Reducing the memorization effect might also be an opportunity to reduce the size of models, as there is less data to store. The hypothesis is that reasoning patterns are more efficient than raw fact memorization.
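To make the first question concrete, the sketch below is a rough counterfactual probe (not Steerway code; the model name, the fictional `json.parse_text` API, and the prompt are illustrative assumptions): the context asserts something that contradicts likely pre-training data, and we compare the log-probability of the context-consistent completion with that of the memory-consistent one. A large gap in favour of the memorized answer is one symptom of recitation.

```python
# Minimal sketch of a memorization-vs-context probe (hypothetical model name,
# fictional API, and prompt). We give the model documentation that contradicts
# likely pre-training knowledge and compare the log-probability of the
# completion that follows the context against the one that follows "memory".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Coder-1.5B"  # assumption: any small causal code LLM works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Counterfactual documentation: the API was renamed in a (fictional) new version.
context = (
    "Documentation (v3.0): `json.loads` was removed; use `json.parse_text` instead.\n"
    "Task: parse the string `s` into a Python object.\n"
    "data = json."
)
context_answer = "parse_text(s)"   # follows the provided documentation
memory_answer = "loads(s)"         # follows what the model likely memorized

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of token log-probabilities of `completion` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Only score the completion tokens (boundary tokenization is approximate,
    # which is good enough for a sketch).
    return token_lp[:, prompt_len - 1:].sum().item()

ctx_lp = completion_logprob(context, context_answer)
mem_lp = completion_logprob(context, memory_answer)
print(f"context-consistent: {ctx_lp:.2f}  memory-consistent: {mem_lp:.2f}")
print("model follows the context" if ctx_lp > mem_lp else "model recites its memory")
```

Counterfactual task suites such as the one in [4] generalize this idea beyond a single hand-crafted example.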
References

[1] Xie, C., Huang, Y., Zhang, C., Yu, D., Chen, X., Lin, B., Li, B., Ghazi, B. & Kumar, R. On Memorization of Large Language Models in Logical Reasoning. (2024), https://arxiv.org/abs/2410.23123
[2] Lou, S., Chen, Y., Liang, X., Lin, L. & Zhang, Q. Quantifying In-Context Reasoning Effects and Memorization Effects in LLMs. (2024), https://arxiv.org/abs/2405.11880
[3] Prabhakar, A., Griffiths, T. & McCoy, R. Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning. (2024), https://arxiv.org/abs/2407.01687
[4] Wu, Z., Qiu, L., Ross, A., Akyürek, E., Chen, B., Wang, B., Kim, N., Andreas, J. & Kim, Y. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. (2024), https://arxiv.org/abs/2307.02477

2.2 Assessing the efficiency trade-offs of post-training compression

LLM compression is a set of techniques designed to reduce the size of LLMs. Reducing the memory footprint may also increase throughput when the hardware is capable of taking advantage of it. We are interested in the combined application of two post-training compression techniques: quantization and pruning. Quantization reduces the precision of a model's parameters, and pruning removes unused or redundant parameters.

Modern GPU architectures, such as Ampere and Hopper, have dedicated low-precision tensor cores (INT8 and INT4). They can also leverage specific sparsity patterns to increase their performance. Combining quantization with semi-structured pruning is therefore an opportunity to significantly decrease resource requirements and increase decoding speed.

Although extensive research has been conducted on each of these techniques in isolation, their combination remains largely undocumented. Each technique incurs a loss in accuracy, and these losses might compound when the techniques are combined. Considering the different pruning and quantization algorithms, the hyperparameters of each algorithm, and their varying effects on different models and benchmarks, the research space becomes vast. We are interested in uncovering the efficiency trade-offs and best practices of this aggressive LLM compression:
- How much accuracy is lost when applying quantization and pruning at the same time? How much can be recovered by fine-tuning?
- Are there empirical rules for tuning the hyperparameters of these algorithms?
- How is each technique best applied to the different parts of the transformer architecture?

References

[1] Zhu, X., Li, J., Liu, Y., Ma, C. & Wang, W. A Survey on Model Compression for Large Language Models. (2024), https://arxiv.org/pdf/2308.07633

2.3 Sparse attention and cache for dynamic multi-document context

In a RAG setting, the context consists of a list of relevant retrieved documents followed by a question or instruction. This is an opportunity to apply sparse attention. Each document is independent, so there is no strong reason for the tokens of one document to attend to the tokens of other documents. The question/instruction would attend to all tokens from all documents, allowing the model to make the necessary cross-document connections.

Because a single document may be used for many queries, this opens another opportunity: caching the attention scores of each document (see Figure 1). But we must assess whether caching those attention scores would actually save time, because what is saved in compute must be paid in IO. In transformer inference, where memory bandwidth is usually the bottleneck, loading cached attention scores could take longer than recomputing them. Another problem is that a document may appear at different positions in different contexts, which may prevent caching if the attention scores depend on absolute positional information. Some research suggests that attention scores are translation invariant [1], but this must be tested in practice. A translation-invariant positional encoding such as RoPE might help.

Figure 1: We can save a lot of computation with sparse attention and caching
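As a minimal sketch of the sparsity pattern of Figure 1 (plain PyTorch with toy shapes; the helper `multi_doc_mask` is ours, not an existing API), the code below builds a block-causal mask in which each document attends only to itself while the final question/instruction attends to the whole context:

```python
# Per-document attention mask for a multi-document RAG context (illustrative).
# Document tokens attend causally within their own document; the trailing
# question/instruction tokens attend causally to everything before them.
import torch
import torch.nn.functional as F

def multi_doc_mask(doc_lens: list[int], query_len: int) -> torch.Tensor:
    """Boolean mask [T, T]; True means 'may attend'."""
    T = sum(doc_lens) + query_len
    mask = torch.zeros(T, T, dtype=torch.bool)
    start = 0
    for n in doc_lens:
        # Causal block restricted to the document's own tokens.
        mask[start:start + n, start:start + n] = torch.ones(n, n).tril().bool()
        start += n
    # Question/instruction rows: causal attention over the full context.
    mask[start:, :] = torch.ones(query_len, T).tril(diagonal=start).bool()
    return mask

# Toy example: 2 documents of 4 and 3 tokens, followed by a 2-token question.
doc_lens, query_len = [4, 3], 2
T, heads, dim = sum(doc_lens) + query_len, 2, 16
q, k, v = (torch.randn(1, heads, T, dim) for _ in range(3))
mask = multi_doc_mask(doc_lens, query_len)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # mask broadcast over heads
print(mask.int())
print(out.shape)  # torch.Size([1, 2, 9, 16])
```

The mask is dense here for clarity; to actually save compute, the pattern would have to be exploited by a block-sparse attention kernel rather than passed to a dense implementation, and whether the per-document blocks are worth caching rather than recomputing is precisely the IO-versus-compute question raised above.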
References

[1] Wennberg, U. & Henter, G. The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models. (2021), https://arxiv.org/abs/2106.01950

2.4 Attention injection

Work on mechanistic interpretability reveals that many attention heads implement simple linguistic functions (e.g. the verb-to-subject relationship) or positional functions (e.g. attending to the previous token). Positional functions are trivial to implement exactly, and so are some linguistic functions in the context of code generation, thanks to semantic analysis (e.g. the use-def relationship). This means that we could replace those heads with an exact implementation of these functions by injecting attention patterns, which skips the computation of the attention scores for these heads. If we implement such modifications, a number of questions arise:
- Can we save a significant amount of compute? Maybe the resulting sparsity is too unstructured to leverage on hardware.
- Can we save a significant amount of memory and memory bandwidth? This could reduce the size of the KV cache.
- Can exact implementations improve model accuracy? If not, how well can a model recover from such modifications?
- Do such modifications open new pruning opportunities in lower layers?

References

[1] Voita, E., Talbot, D., Moiseev, F., Sennrich, R. & Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. (2019), https://arxiv.org/abs/1905.09418
[2] Liu, Z. & Chen, N. Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information. (2023), https://arxiv.org/abs/2312.09541
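As a concrete illustration of the attention-injection idea in section 2.4 (plain PyTorch, toy shapes, hypothetical function names), the sketch below fixes one head's attention pattern to an exact "previous token" pattern: the head's output is just the value vectors shifted by one position, so its query/key projections and attention scores never have to be computed or cached.

```python
# Minimal sketch of attention injection: one head's attention pattern is
# replaced by an exact "previous token" pattern, so computing it degenerates
# into shifting the value vectors by one position.
import torch

def previous_token_head(v: torch.Tensor) -> torch.Tensor:
    """Output of a head whose attention weights are fixed to 'attend to the
    previous token' (the first token attends to itself). v: [batch, T, dim]."""
    out = v.roll(shifts=1, dims=1)   # token t receives the value of token t-1
    out[:, 0] = v[:, 0]              # position 0 has no previous token
    return out

def injected_attention(v: torch.Tensor) -> torch.Tensor:
    """Reference version with an explicit injected attention matrix, to check
    that the shift above matches a regular weighted sum over values."""
    _, T, _ = v.shape
    attn = torch.zeros(T, T)
    attn[0, 0] = 1.0
    attn[torch.arange(1, T), torch.arange(0, T - 1)] = 1.0  # row t -> column t-1
    return attn @ v                  # broadcasts over the batch dimension

v = torch.randn(2, 5, 8)
assert torch.allclose(previous_token_head(v), injected_attention(v))
print(previous_token_head(v).shape)  # torch.Size([2, 5, 8])
```

A use-def head for code could in principle be injected the same way, with the pattern supplied by static analysis instead of a fixed shift.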