Shubham's blog

Indic NLP research: AI4Bharat papers - Part 2

Building on the Foundation: IndicBART and IndicNLG Suite

This is a continuation of the AI4Bharat papers series. If you haven't read Part 1, I highly recommend starting there, as each paper builds upon the previous work.


The AI4Bharat Journey: Where Part 2 Fits In

In Part 1, we explored two foundational papers from AI4Bharat:

These laid the groundwork for understanding Indian languages and translating between them. Part 2 continues this journey by focusing on Natural Language Generation (NLG) - teaching models not just to understand or translate, but to create new content in Indian languages.

An overview of how all these research papers sit in the larger scheme of things (image generated with Nano Banana Pro from Gemini)

The AI4Bharat Research Timeline (2020-2023)

2020
├── IndicNLP Corpus (May 2020) - 2.7B words, word embeddings
└── IndicNLPSuite (Nov 2020) - IndicCorp v1 (8.8B tokens), IndicBERT, IndicGLUE

2021
├── Samanantar (Apr 2021) - 49.7M parallel sentences, IndicTrans
└── IndicBART (Sep 2021) - First multilingual seq2seq model for Indic languages

2022
└── IndicNLG Suite (Mar 2022) - First comprehensive NLG benchmark for Indic languages

2023
├── IndicCorp v2 (Jun 2023) - 20.9B tokens, 24 languages, 2.3x increase
├── IndicBERT v2 (Jun 2023) - 278M parameters, 23 languages
├── Naamapadam (Jul 2023) - 400K+ annotated sentences for NER
└── IndicTrans2 (Sep 2023) - 22 Indic languages, SOTA translation

Part 2 covers the critical transition from NLU to NLG through:

  1. IndicBART (Sept 2021): The first pre-trained sequence-to-sequence model designed specifically for generating text in Indic languages
  2. IndicNLG Benchmark (March 2022): A comprehensive evaluation framework for NLG tasks across 11 Indic languages

Paper 1: IndicBART - A Pre-trained Model for Indic Natural Language Generation

Published: September 2021
Key Innovation: First multilingual encoder-decoder model optimized for Indian languages with script unification

Why IndicBART Matters

Before IndicBART, generating high-quality content in Indian languages faced two major challenges:

  1. Model Size vs Performance: Large multilingual models like mBART50 (611M parameters) worked across many languages but were computationally expensive and showed suboptimal performance on Indian languages
  2. Limited Script Awareness: Most models didn't leverage the orthographic similarity between Indian scripts, missing opportunities for cross-lingual transfer

IndicBART addressed both by creating a compact, Indic-focused model with intelligent script handling.


Data: Building on IndicCorp

Pre-training Corpus

Script Unification: A Key Innovation

One of IndicBART's most distinctive features is converting all Indic scripts to Devanagari using the IndicNLP library. This decision had profound implications:

Why Script Unification?

Example:

Original Scripts:
- Hindi (Devanagari): विद्यालय
- Bengali (Bengali): বিদ্যালয়
- Gujarati (Gujarati): વિદ્યાલય
- Punjabi (Gurmukhi): ਵਿਦਿਆਲਾ

All converted to Devanagari → Better vocabulary sharing
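
The mechanics behind this unification can be sketched in a few lines. Most Indic scripts occupy parallel 128-codepoint Unicode blocks, so mapping a character to Devanagari is a fixed offset shift. The paper uses the IndicNLP library for this; the standalone function below is an illustrative approximation that ignores script-specific exceptions:

```python
# Illustrative approximation of script unification: most Indic scripts
# occupy parallel 128-codepoint Unicode blocks, so a character maps to
# Devanagari by a fixed offset. (The paper itself uses the IndicNLP
# library, which also handles script-specific exceptions.)

SCRIPT_BASES = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

def to_devanagari(text: str, source_script: str) -> str:
    """Shift characters from the source script's block into Devanagari."""
    base = SCRIPT_BASES[source_script]
    chars = []
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x80:      # inside the source block
            chars.append(chr(0x0900 + (cp - base)))
        else:                             # digits, punctuation, etc. pass through
            chars.append(ch)
    return "".join(chars)

# Bengali "বিদ্যালয়" now lands in the Devanagari block, so its subwords
# can be shared with Hindi "विद्यालय" in a single vocabulary.
print(to_devanagari("বিদ্যালয়", "bengali"))
```

Because the converted Bengali word shares most of its Devanagari subwords with the Hindi spelling, one subword vocabulary covers both, which is exactly the cross-lingual sharing IndicBART exploits.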

Impact:

Alternative Variant: SSIndicBART (Separate Script IndicBART) was also trained using original scripts for comparison.

Fine-tuning Datasets

For Neural Machine Translation:

  1. PMI subset (WAT 2021): Low-resource, domain-specific (health, tourism)
  2. CVIT-PIB: Mid-resource, domain-specific (press releases)
  3. Samanantar: High-resource, general-domain
  4. Guzmán et al.: Parallel data for Nepali and Sinhala to English

For Summarization:


Model Architecture & Training

Model Specifications

Why 244M parameters?
Significantly smaller than mBART50 (611M) and mT5-base (580M), making it:

Training Details

IndicALBART: The Compact Variant


Evaluation & Results

Tasks Evaluated

  1. Neural Machine Translation (NMT)

    • Low-resource settings
    • Multilingual translation
    • Zero-shot translation (languages not seen during training)
  2. Extreme Summarization

    • XL-Sum dataset
    • Single-sentence summaries of news articles

Evaluation Metrics

Key Comparisons

IndicBART was benchmarked against:


Key Findings

1. Competitive with Much Larger Models

2. Script Unification is Highly Effective

3. Data Size Matters, But Pre-training Helps More in Low-Resource Settings

4. Domain Adaptation is Effective

5. Summarization Performance

6. IndicALBART: Size-Performance Trade-off


Impact & Practical Implications

For Researchers:

For Practitioners:

For the Indic NLP Ecosystem:


Paper 2: IndicNLG Suite - Multilingual Datasets for Diverse NLG Tasks

Published: March 2022
Key Innovation: First comprehensive benchmark for evaluating Natural Language Generation across Indian languages

Why a Benchmark Was Needed

After IndicBART provided a capable NLG model, the natural question arose: How do we systematically evaluate NLG performance for Indian languages?

The Problem:

The Solution: IndicNLG Suite


Data: Creating 8.5 Million Examples

The IndicNLG Benchmark contains ~8.5 million examples across 5 tasks and 11 languages, making it the largest multilingual NLG dataset of its time.

Languages Covered

Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Odia (or), Punjabi (pa), Tamil (ta), Telugu (te)

The Five NLG Tasks

1. Biography Generation (WikiBio) - 57,426 examples

Task: Generate the first sentence of a Wikipedia article from an infobox

Data Source:

Example:

Input (Infobox):
name: ਏ. ਆਰ. ਰਹਮਾਨ
occupation: ਸੰਗੀਤਕਾਰ, ਗਾਇਕ
born: 6 ਜਨਵਰੀ 1967
birthplace: ਚੇਨਈ, ਤਮਿਲਨਾਡੂ
(English: name: A. R. Rahman; occupation: composer, singer; born: 6 January 1967; birthplace: Chennai, Tamil Nadu)

Output (Biography sentence):
ਏ. ਆਰ. ਰਹਮਾਨ ਇੱਕ ਭਾਰਤੀ ਸੰਗੀਤਕਾਰ, ਗਾਇਕ ਅਤੇ ਸੰਗੀਤ ਨਿਰਮਾਤਾ ਹੈ।
(English: "A. R. Rahman is an Indian composer, singer and music producer.")

Challenge: Converting structured data into fluent, coherent text

2. Headline Generation (HG) - 1.31 million examples

Task: Generate a news headline from an article

Data Sources:

Dataset Statistics:

Quality Control:

3. Sentence Summarization (SS) - 431K examples

Task: Generate a single-sentence summary of a news article

Data Source:

Rationale:

4. Paraphrase Generation (PG) - 5.57 million examples

Task: Generate alternative phrasings of a sentence with same meaning

Creation Method: Pivoting through English

  1. Take parallel corpus (Samanantar)
  2. For each Indic sentence, identify its English translation
  3. Find other Indic sentences with the same English translation
  4. These Indic sentences are paraphrases of each other

Example (simplified):

Hindi sentence 1: मुझे पानी चाहिए ("I want water")
English: I want water
Hindi sentence 2: मुझे जल की आवश्यकता है ("I need water")
→ Sentences 1 and 2 are paraphrases
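
The pivoting steps above are easy to prototype. The sketch below groups sentences of a toy parallel corpus by their English side and emits every within-group pair as a paraphrase; the data and normalization are illustrative, and the actual pipeline over Samanantar involves additional filtering:

```python
from collections import defaultdict

# Toy (indic_sentence, english_translation) pairs -- illustrative data,
# not drawn from Samanantar.
parallel = [
    ("mujhe paani chahiye", "I want water"),
    ("mujhe jal ki aavashyakta hai", "I want water"),
    ("yah ek kitaab hai", "This is a book"),
]

# Step 2-3: group Indic sentences by their (normalized) English pivot.
groups = defaultdict(list)
for indic, english in parallel:
    groups[english.strip().lower()].append(indic)

# Step 4: any two sentences sharing an English pivot are paraphrases.
pairs = [
    (a, b)
    for sents in groups.values() if len(sents) > 1
    for i, a in enumerate(sents) for b in sents[i + 1:]
]
print(pairs)  # the single pair from the "I want water" group
```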

Size: Largest component of the benchmark (5.57M of 8.5M total examples)

Quality Considerations:

5. Question Generation (QG) - 1.08M examples

Task: Generate a question for which a text passage provides the answer

Creation Method:

  1. Took English SQuAD dataset (100K+ question-answer pairs)
  2. Translated to 11 Indic languages using IndicTrans
  3. Validated translations through sampling and human evaluation

Example:

Context (English):
"The Taj Mahal was commissioned by Shah Jahan in 1632."

Question (English): "Who commissioned the Taj Mahal?"
Answer: "Shah Jahan"

Question (Hindi): "ताज महल को किसने बनवाया?"

Challenges:


Dataset Creation Philosophy

Three Key Principles:

  1. Automatic Creation Where Possible

    • Leveraged existing structure (Wikipedia, news websites)
    • Used MT for augmentation (SQuAD translation)
    • Enabled scale: 8.5M examples
  2. Language-Agnostic Methods

    • Approaches transferable to other language families
    • No language-specific rules or tools
    • Facilitates future expansion
  3. Quality Validation

    • Human evaluation on subsets (WikiBio, HG, PG)
    • Statistical cleaning (regex, frequency analysis)
    • Automated quality checks (length, script validation)
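
As a concrete illustration of the "length, script validation" checks, the hypothetical filter below keeps a sentence only if its word count is in range and most of its letters fall in the expected script's Unicode block. The thresholds are assumptions for illustration, not the paper's settings:

```python
# Hypothetical quality filter: length check plus script validation.
# Thresholds are illustrative assumptions, not the paper's settings.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x0980),
    "bengali": (0x0980, 0x0A00),
    "tamil": (0x0B80, 0x0C00),
}

def passes_checks(sentence: str, script: str,
                  min_words: int = 3, max_words: int = 80,
                  min_script_ratio: float = 0.8) -> bool:
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return False
    lo, hi = SCRIPT_RANGES[script]
    letters = [c for c in sentence if c.isalpha()]
    if not letters:
        return False
    in_script = sum(lo <= ord(c) < hi for c in letters)
    return in_script / len(letters) >= min_script_ratio

print(passes_checks("मुझे पानी चाहिए", "devanagari"))     # True
print(passes_checks("hello world example", "devanagari"))  # False: wrong script
```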

Model Baselines

Models Evaluated

1. mT5 (Multilingual T5)

2. IndicBART

3. SSIndicBART

Training Configurations

Monolingual Models:

Multilingual Models:

Fine-tuning Details:


Evaluation Methodology

Metrics

Primary Metric: ROUGE-L F1
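
ROUGE-L scores the longest common subsequence (LCS) between a generated text and its reference, combined into an F1 score. A minimal word-level implementation:

```python
def lcs_len(a: list, b: list) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """Word-level ROUGE-L F1: harmonic mean of LCS precision and recall."""
    c, r = candidate.split(), reference.split()
    l = lcs_len(c, r)
    if l == 0:
        return 0.0
    precision, recall = l / len(c), l / len(r)
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l_f1("the cat sat on the mat",
                       "the cat lay on the mat"), 3))  # 0.833
```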

Paraphrasing Metric: iBLEU
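
iBLEU (Sun and Zhou, 2012) rewards overlap with the reference while penalizing copying of the input, which plain BLEU would reward. A sketch, with α = 0.9 assumed here rather than taken from the paper:

```python
def ibleu(bleu_vs_reference: float, bleu_vs_input: float,
          alpha: float = 0.9) -> float:
    """iBLEU = alpha * BLEU(output, reference) - (1 - alpha) * BLEU(output, input).

    A paraphrase that parrots its input is penalized even when it
    happens to overlap well with the reference. alpha = 0.9 is an
    assumed setting for illustration.
    """
    return alpha * bleu_vs_reference - (1 - alpha) * bleu_vs_input

print(round(ibleu(40.0, 100.0), 2))  # 26.0 -- output copies the input verbatim
print(round(ibleu(40.0, 10.0), 2))   # 35.0 -- output genuinely rephrases
```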

Evaluation Dimensions

The paper explored three key questions:

1. Impact of Multilingualism

2. Impact of Language Family

3. Impact of Task Nature


Results & Key Findings

1. Multilingual Models Generally Superior

Headline Generation (ROUGE-L):

Language    Monolingual    Multilingual    Improvement
Hindi          23.4           25.8            +2.4
Bengali        21.7           24.1            +2.4
Tamil          19.2           21.5            +2.3
Gujarati       20.8           22.9            +2.1
Average        ~21.3          ~23.6           +2.3

Key Insights:

Exception: Biography generation showed smaller gaps, possibly because the task is more template-like

2. Language-Specific Pre-training Matters (Sometimes)

IndicBART vs mT5:

Monolingual Setting:

Multilingual Setting:

Conclusion: Language-family-specific pre-training most beneficial for monolingual applications

3. Task Difficulty Varies Significantly

ROUGE-L Scores by Task (Multilingual IndicBART):

Easier Tasks:

Harder Tasks:

Pattern: Extractive/template-based tasks easier than abstractive/creative tasks

4. Language-Specific Performance Patterns

High-Performing Languages:

Lower-Performing Languages:

Implication: Pre-training corpus size directly impacts downstream task performance

5. Human Evaluation Results

Conducted on subsets of WikiBio, Headline Generation, and Paraphrasing:

Quality Ratings (1-5 scale):

Conclusion: Generated text generally high-quality, though room for improvement in creativity


Transfer Learning Experiments

The paper also explored using one NLG task to improve another:

Experiment: Headline Generation → Document Summarization

Setup:

  1. Fine-tune IndicBART on Headline Generation (1.31M examples)
  2. Further fine-tune on XL-Sum extreme summarization (30K examples)

Results:

Other Successful Transfers:


Impact & Contributions

For the Research Community

1. Standardized Evaluation

2. Dataset Availability

3. Baseline Models

For Practitioners

1. Production-Ready Datasets

2. Model Selection Guidance

For Low-Resource Languages

1. Methodology Transfer

2. Multilingual Benefits


Connecting the Papers: From Foundation to Application

The AI4Bharat Stack (2020-2022)

Layer 4: Applications & Transfer Learning (2022)
├── IndicNLG Benchmark → Systematic NLG evaluation
└── Transfer learning demonstrated across tasks

Layer 3: NLG Models (2021)
├── IndicBART → First Indic seq2seq pre-trained model
└── Script unification → Cross-lingual transfer

Layer 2: NLU Models & Translation (2020-2021)
├── IndicBERT → Understanding 
├── Samanantar → Parallel data
└── IndicTrans → Translation

Layer 1: Foundation (2020)
├── IndicCorp → Monolingual corpora (8.8B tokens)
└── IndicGLUE → NLU evaluation

Key Themes Across Papers

1. Data First

2. Efficient Modeling

3. Evaluation-Driven

4. Open Source Philosophy


Practical Implications for 2025

What These Papers Enabled

1. Commercial Applications

2. Further Research

3. Democratization

Current State (2025)

The work from 2021-2022 has evolved significantly:

IndicBART → IndicTrans2 (2023)

IndicCorp v1 → IndicCorp v2 (2023)

IndicNLG → Comprehensive Evaluation


Lessons for Building Language Technology

1. Script Engineering Matters

2. Specialization Has Value

3. Benchmark Early and Often

4. Transfer Learning is Powerful

5. Practical Dataset Creation


Looking Forward

The foundation laid by IndicBART and IndicNLG continues to influence Indic NLP:

Current Directions (2024-2025):

Open Challenges:


Conclusion

Papers 3 and 4 in the AI4Bharat series represent a crucial evolution from understanding to generation. IndicBART demonstrated that compact, well-designed models can rival much larger systems, while IndicNLG provided the evaluation framework necessary to measure progress systematically.

Together, these papers:

The impact extends beyond academic metrics. By making high-quality NLG accessible for Indian languages, this work has contributed to digital inclusion for hundreds of millions of users. It's a testament to the importance of focused, systematic research in building language technology for underserved communities.


Resources

Papers

Code & Models

Documentation


Citation

If you use IndicBART or IndicNLG in your research, please cite:

@misc{dabre2021indicbart,
  title={IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages},
  author={Raj Dabre and Himani Shrotriya and Anoop Kunchukuttan and Ratish Puduppully and Mitesh M. Khapra and Pratyush Kumar},
  year={2021},
  eprint={2109.02903},
  archivePrefix={arXiv}
}

@misc{kumar2022indicnlg,
  title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages},
  author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
  year={2022},
  eprint={2203.05437},
  archivePrefix={arXiv}
}

Stay tuned for Part 3, where we'll explore IndicTrans2 and IndicBERT v2 - the next evolution in the AI4Bharat journey!