BigScience, Laurençon, H, Saulnier, L, Wang, T, Akiki, C, Villanova del Moral, A, Le Scao, T, von Werra, L, Mou, C, González Ponferrada, E, Nguyen, H, Frohberg, J, Šaško, M, Lhoest, Q, Mcmillan-Major, A, Dupont, G, Biderman, S, Rogers, A, Ben Allal, L, de Toni, F, Pistilli, G, Nguyen, O, Nikpoor, S, Masoud, M, Colombo, P, de la Rosa, J, Villegas, P, Thrush, T, Longpre, S, Nagel, S, Weber, L, Romero Muñoz, M, Zhu, J, van Strien, D, Alyafeai, Z, Almubarak, K, Chien, VM, Gonzalez-Dios, I, Soroa, A, Lo, K, Dey, M, Ortiz Suarez, P, Gokaslan, A, Bose, S, Adelani, DI, Phan, L, Tran, H, Yu, I, Pai, S, Chim, J & Lepercq, V 2022,
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset. in
Thirty-Sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. New Orleans, United States. <https://openreview.net/pdf?id=UoEw6KigkUn>