Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

Kelechi Ogueji, Yuxin Zhu, Jimmy Lin


Abstract
Pretrained multilingual language models have been shown to work well on many languages for a variety of downstream NLP tasks. However, these models are known to require a lot of training data. This consequently leaves out a huge percentage of the world’s languages as they are under-resourced. Furthermore, a major motivation behind these models is that lower-resource languages benefit from joint training with higher-resource languages. In this work, we challenge this assumption and present the first attempt at training a multilingual language model on only low-resource languages. We show that it is possible to train competitive multilingual language models on less than 1 GB of text. Our model, named AfriBERTa, covers 11 African languages, including the first language model for 4 of these languages. Evaluations on named entity recognition and text classification spanning 10 languages show that our model outperforms mBERT and XLM-R in several languages and is very competitive overall. Results suggest that our “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages. Code, data and models are released at https://github.com/keleog/afriberta.
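As a quick illustration of working with the released model, the sketch below loads AfriBERTa with the Hugging Face Transformers library and attaches a token-classification head for NER fine-tuning. This is not taken from the paper: the Hub identifier "castorini/afriberta_large" and the label count are assumptions; substitute the checkpoint and tag set you actually use.

    # Minimal usage sketch (assumption: the released AfriBERTa weights are available
    # on the Hugging Face Hub under "castorini/afriberta_large").
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    model_name = "castorini/afriberta_large"  # assumed Hub identifier, not stated in the abstract
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Wrap the pretrained encoder with a token-classification head for NER fine-tuning;
    # the label count (e.g. 9 for a CoNLL-style tag set) is illustrative only.
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)

    # Tokenize a sample sentence and run a forward pass to obtain per-token logits.
    inputs = tokenizer("Obafemi Awolowo ti lo si Abuja", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.logits.shape)  # (batch, sequence_length, num_labels)

From here, fine-tuning would proceed as with any Transformers token-classification model, for example with the standard Trainer API on the NER data referenced in the paper's repository.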
Anthology ID:
2021.mrl-1.11
Volume:
Proceedings of the 1st Workshop on Multilingual Representation Learning
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Duygu Ataman, Alexandra Birch, Alexis Conneau, Orhan Firat, Sebastian Ruder, Gozde Gul Sahin
Venue:
MRL
Publisher:
Association for Computational Linguistics
Pages:
116–126
URL:
https://aclanthology.org/2021.mrl-1.11
DOI:
10.18653/v1/2021.mrl-1.11
Cite (ACL):
Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. 2021. Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 116–126, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages (Ogueji et al., MRL 2021)
PDF:
https://aclanthology.org/2021.mrl-1.11.pdf
Code:
keleog/afriberta