ERIC Number: ED661230
Record Type: Non-Journal
Publication Date: 2024
Pages: 132
Abstractor: As Provided
ISBN: 979-8-3840-9356-5
ISSN: N/A
EISSN: N/A
Multilingual Language Models: Analysis and Algorithms
Terra Blevins
ProQuest LLC, Ph.D. Dissertation, University of Washington
While large language models (LLMs) continue to grow in scale and gain new zero-shot capabilities, their performance for languages beyond English increasingly lags behind. This gap is due to the "curse of multilinguality," where multilingual language models perform worse on individual languages than a monolingual model trained on that language due to inter-language competition for representation. These issues are further compounded by the disparate amounts and qualities of training data for different languages, leading to increasingly degraded performance on lower-resource languages. However, because training new large language models for individual languages is compute- and data-intensive, multilingual language models remain the de facto approach for most of the world's languages. Therefore, it remains an open question as to how we can alleviate the curse of multilinguality and build multilingual models that fairly model many languages. This dissertation investigates how current language models do and don't capture multiple languages and examines how multilingual language models differ from monolingual ones. We first present an analysis method, "structural probing," used for many of this work's analyses. Then, we examine the unexpected ability of monolingual language models to exhibit cross-lingual behavior, finding that this phenomenon is due to inherent language contamination of pretraining data collected at scale. This shows that LMs can learn languages from surprisingly small subsets of their training data and implies that all language models are multilingual when trained at scale. We next characterize the pretraining dynamics of multilingual language models, showing that while multilingual models learn information about individual languages early on, cross-lingual transfer is acquired throughout the pretraining process. This analysis also demonstrates the curse of multilinguality as it develops during pretraining, causing the model to forget previously learned information. Inspired by these insights, we propose a sparse language modeling approach for training Cross-Lingual Expert Language Models (X-ELM) to explicitly allocate parameters to different languages and reduce inter-language competition for model capacity. X-ELMs improve performance for all languages we consider and provide efficiency and model adaptation benefits over prior methods. Due to these characteristics, X-ELM increases access to multilingual NLP by providing better-performing and more usable models for all languages. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml.]
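The abstract's central proposal is sparse, per-language allocation of parameters: rather than one dense model in which all languages compete for shared capacity, each input is routed to an expert model responsible for its language group. The sketch below illustrates only that routing idea in minimal Python; the class names, fields, and scoring function are illustrative assumptions and do not come from the dissertation or its released code.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExpertLM:
    """Stand-in for one expert language model trained on a cluster of languages."""
    languages: List[str]
    # Each expert keeps its own parameters, so languages in different clusters
    # no longer compete for the same model capacity.
    parameters: Dict[str, float] = field(default_factory=dict)

    def score(self, text: str) -> float:
        # A real expert would return a language-modeling score (e.g. log-likelihood);
        # this dummy value just keeps the sketch runnable.
        return -float(len(text))


class SparseRouter:
    """Routes each input to the single expert responsible for its language."""

    def __init__(self, experts: List[ExpertLM]):
        self.by_language = {lang: e for e in experts for lang in e.languages}

    def score(self, text: str, language: str) -> float:
        expert = self.by_language[language]  # sparse: only one expert is activated
        return expert.score(text)


if __name__ == "__main__":
    experts = [ExpertLM(["en", "de"]), ExpertLM(["sw", "yo"])]
    router = SparseRouter(experts)
    print(router.score("Habari ya asubuhi", language="sw"))

The routing step is what makes the approach "sparse": only the parameters of the selected expert are used per input, which is where the efficiency and reduced inter-language competition described in the abstract come from.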
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A