
IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages.

  • Aman Kumar
  • Himani Shrotriya
  • Prachi Sahu
  • Amogh Mishra
  • Raj Dabre
  • Ratish Puduppully
  • Anoop Kunchukuttan
  • Mitesh M. Khapra
  • Pratyush Kumar

Research output: Conference Article in Proceeding or Book/Report chapter; Article in proceedings; Research; peer-review

Abstract

Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. We present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results exhibit the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our dataset for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as scraping news articles and Wikipedia infoboxes, light cleaning, and pivoting through machine translation data. To the best of our knowledge, the IndicNLG Benchmark is the first NLG benchmark for Indic languages and the most diverse multilingual NLG dataset, with approximately 8M examples across 5 tasks and 11 languages. The datasets and models will be publicly available.
Original language: English
Title of host publication: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Publisher: Association for Computational Linguistics
Publication date: 2022
Pages: 5363-5394
Publication status: Published - 2022
Externally published: Yes
Event: Conference on Empirical Methods in Natural Language Processing - Abu Dhabi, United Arab Emirates
Duration: 7 Dec 2022 to 11 Dec 2022
https://2022.emnlp.org/

Conference

Conference: Conference on Empirical Methods in Natural Language Processing
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 07/12/2022 to 11/12/2022

Keywords

  • Natural Language Generation
  • Multilingual Benchmarks
  • Indic Languages
  • Pre-trained Sequence-to-Sequence Models
  • Low-resource Languages Data Collection
