*Keywords*: Large Language Models, Morphology, Natural Language Processing

*Research Context and Questions*

Hofmann et al. (2025) studied how LLMs model the competition between the nominal suffixes -ity (available -> availability) and -ness (selfish -> selfishness). They find that LLMs model this competition fairly well. However, they did not study the competition between prefixes (e.g., un- and non-). This is an important research question because studying the morphological competence of LLMs allows us to measure their generalization ability (Weissweiler et al., 2023; Weller-Di Marco and Fraser, 2024; Lerner and Yvon, 2025b). Indeed, the lexicon is not a list of words that is known a priori and immutable (Corbin, 2012; Štekauer et al., 2005).

However, LLMs are probabilistic models trained to maximize the likelihood of their training data, and they model a probability distribution over a finite token vocabulary. While infrequent words in a corpus were typically filtered out in traditional approaches (Eisenstein, 2019), modern models (OpenAI, 2023; Llama Team, 2024; Gemma Team, 2024) all rely on BPE (Byte Pair Encoding) segmentation, which splits rare words into subwords by optimizing a data compression criterion (Gage, 1994; Sennrich et al., 2016; Beinborn and Pinter, 2023). Models are therefore theoretically capable of deriving or inflecting words in forms absent from their training corpus, but the reality is more complex (Hofmann et al., 2020; Weissweiler et al., 2023; Lerner and Yvon, 2025b). Morphologically competent LLMs would be useful for a wide range of NLP applications, notably Machine Translation (Ataman et al., 2019; Marco et al., 2022; Lerner and Yvon, 2025a), and more generally Natural Language Generation.

Our previous work is limited to fairly simple concatenative phenomena (e.g., the prefixation of pré+entraînement). However, several affixes can be in competition, i.e., synonymous (Corbin, 2012), for example pré- and anté-, which raises the following question: given the same definition, could we produce antéentraînement rather than préentraînement? If not, why? Is it because of a phonological constraint (e.g., the number of syllables (Plénat, 2009; Lindsay and Aronoff, 2013) or euphony (Lignon and Plénat, 2009)), lexical consistency (e.g., analogy with prétraitement), or simply historical chance (e.g., the influence of English's pretraining (Lignon and Plénat, 2009; Holeš, 2023))?

*Objectives*

These questions will be assessed by comparing the probability that LLMs assign to different affixes for pseudo-words (e.g., generated using UniPseudo (New et al., 2024)) to that of a cognitively plausible model, the Generalized Context Model (GCM; Nosofsky, 1990); rough sketches of both kinds of comparison are given below. If these results are not conclusive, we will conduct a survey with native speakers to collect acceptability judgments (comparing, e.g., "unwug" vs. "nonwug"), in the same fashion as Hofmann et al. (2025) and Copot and Bonami (2024).

Other phenomena in derivational morphology raise similar questions (Corbin, 2012), notably allomorphy, where different variants of the same morpheme are used according to morphophonological constraints (e.g., *indétruisable vs. indestructible or, conversely, traduisible vs. *traductible). These questions will be studied by comparing BPE-based LLMs with byte-based LLMs, an emerging alternative to BPE (Wang et al., 2024; Zuo et al., 2024).
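To make the first kind of comparison concrete, here is a minimal sketch that scores two competing prefixations of a pseudo-word under an off-the-shelf causal LM via the Hugging Face transformers library. The model name, carrier sentence, and pseudo-words are illustrative placeholders, not the project's actual experimental setup.

```python
# Minimal sketch: compare the log-probability a causal LM assigns to two
# competing prefixations of the same pseudo-word (illustrative placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sequence_log_prob(text: str) -> float:
    """Sum of token log-probabilities of `text` under the causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position predicts the next token, so shift logits and targets.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()

carrier = "Their decision was completely {}."
for candidate in ["unwuggish", "nonwuggish"]:  # competing prefixes un- / non-
    sentence = carrier.format(candidate)
    # Also print the BPE segmentation, which need not align with morphemes.
    print(candidate, tokenizer.tokenize(" " + candidate), sequence_log_prob(sentence))
```

Repeating this over many pseudo-word bases (and, if relevant, over shared definitions rather than carrier sentences) yields the distribution of affix preferences to be compared with the baseline below.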
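The GCM baseline could look roughly like the following sketch, under the illustrative assumptions that bases are compared through character-bigram overlap and that the exemplar lexicon and sensitivity parameter c are given; these choices are placeholders, not prescribed by the project.

```python
# Rough sketch of the Generalized Context Model (GCM; Nosofsky, 1990) for
# affix competition: the probability of an affix is proportional to the summed
# similarity of the new base to the known bases taking that affix.
# The character-bigram distance and the sensitivity parameter c are
# illustrative assumptions.
import math
from collections import Counter

def bigrams(word: str) -> Counter:
    padded = f"#{word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def distance(a: str, b: str) -> float:
    """Simple dissimilarity between two bases: 1 - Dice overlap of bigrams."""
    ba, bb = bigrams(a), bigrams(b)
    overlap = sum((ba & bb).values())
    return 1.0 - 2 * overlap / (sum(ba.values()) + sum(bb.values()))

def gcm_probabilities(base: str, exemplars: dict, c: float = 5.0) -> dict:
    """P(affix | base), with similarity exp(-c * distance) to each exemplar."""
    scores = {
        affix: sum(math.exp(-c * distance(base, known)) for known in bases)
        for affix, bases in exemplars.items()
    }
    total = sum(scores.values())
    return {affix: score / total for affix, score in scores.items()}

# Toy exemplar lexicon for the un- / non- competition (purely illustrative).
exemplars = {
    "un-": ["happy", "fair", "usual", "stable"],
    "non-": ["verbal", "linear", "standard", "native"],
}
print(gcm_probabilities("wuggish", exemplars))
```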
Byte-based models, however, must process much longer sequences (since a word is typically segmented into many characters or bytes), which limits the use of Transformers, whose complexity is quadratic with respect to sequence length (Vaswani et al., 2017). This comparison will allow us to understand why BPE-based LLMs sometimes fail or succeed at deriving new lexemes. Weller-Di Marco and Fraser (2024) found that, for inflection, the most important criterion was the consistency of the tokenization across all inflections of a given lexeme.

*Internship conditions*

The internship will be supervised by Paul Lerner (https://paullerner.github.io/), postdoctoral researcher, Leonie Weissweiler (https://leonieweissweiler.github.io/), postdoctoral researcher, and François Yvon (https://fyvo.github.io/), senior researcher. The internship may lead to a PhD thesis, provided funding is available.

The internship will take place at ISIR in the MLIA team (https://www.isir.upmc.fr/teams/mlia/presentation-mlia/?lang=en). ISIR is under the dual supervision of Sorbonne University, a world-class multidisciplinary university, and the French National Centre for Scientific Research (CNRS), one of the most important research institutions in the world. ISIR comprises 6 research teams and 226 people. The intern will be located at 4, place Jussieu, 75005 Paris.

- Remuneration: around 600€, along with reimbursement of 75% of the Navigo (public transport) card.
- Starting date: the internship is expected to start in February or March 2025.
- Duration: 5-6 months

*Requirements*

We are looking for a second-year Master's student with a strong background in Natural Language Processing/Computational Linguistics. The intern is expected to be proficient in programming, especially in Python, and to have already worked under Linux. They should also have experience with a deep learning framework, preferably PyTorch.

*Application*

Please send a resume, a cover letter (in French or English), and grade transcripts for the last two years to Paul Lerner at lerner@isir.upmc.fr. A list of pointers to example projects (e.g., via GitHub) or a letter of recommendation is a plus.