Indic NLP research: AI4Bharat papers — Part 1

27 Aug, 2025

The series aims to delve deeper into each paper and understand what's being done in Indic language space. One of the pioneers in this has been the research institution 'AI4Bharat'.

This is an attempt at exploring the first 2 papers that came out in the AI4Bharat series. As part of the series, I am only exploring papers that concern the text modality. I have not considered papers involving any other modality like speech. That excludes papers discussing ASR (Automatic speech recognition/Speech to text) or Text to speech.

I will be covering papers chronologically starting from 2020. I am covering 2 papers as part of this blog post:

Nov 2020: IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages
Apr 2021: Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Framework

Almost all of the papers from AI4Bharat can be thought of spanning one or multiple of the 3 buckets: Data, Model, Evaluation.

Data → Model → Evaluation

Data

In the realm of machine learning and majorly deep learning, any model is dependent on the quality of the data. This includes:

Collection of data — where to collect data from, how to collect data
Pre-processing data — cleaning of data, representing data in a way suited for our task

This will form the bulk of the discussion in most of the AI4Bharat papers.

Model

This includes details about:

Was the model pre-trained or was it trained from scratch?
- Pre-trained models — these are models which leverage a model which has already been trained on some data and is then fine-tuned on custom data
- Training from scratch involves training the model only on the custom data that has been produced
What are the kind of embeddings fed into the model?
- Embeddings play a huge role in the model behaviour. Embeddings can be thought of as a series of numbers representing a character/word/sentence. These embeddings seek to embody semantic relationships between words via numbers.
What kind of architecture was used in the model?
What is the model trained for? What should the model be capable of doing?

NLP models are capable of either or both of 2 major tasks — NLU and NLG.

NLU (Natural Language Understanding)

This refers to tasks where the model is supposed to understand the text and then answer:

Sentiment analysis — requires understanding the text to tell about the sentiment. Output: sentiment types like good, bad, poor
Text classification — understanding the text & slotting it into certain pre-defined categories. Output: as defined by the task
Named entity recognition: Output: names, locations, places from a piece of text
Paraphrase detection: Output: yes/no in terms of whether 2 pieces of text are similar in meaning or not

NLG (Natural Language Generation)

This is focussed on generating new text basis what the model has read:

Translation — Generating text in language B having been fed text in language A
Summarization
Question-answering

Models like ChatGPT, Claude — are both NLU & NLG, however that is facilitated by their massive model size (> trn tokens).

However, not every task requires a model proficient in both understanding and generation of text. Models specialised in either of the tasks can help reduce costs, work better with multilingual languages, offer a chance to be used on edge devices (mobiles, low-resource computers etc.). Hence for a specialised task, it makes sense to use specific models which can do these.

Evaluation

To determine how good an NLP model is working, evaluations are very important.

How do you evaluate how good is your model working — this is what evaluations cover
Which evaluation benchmarks to use — are new ones needed? can existing ones be used? can existing benchmarks be modified?

I will try and use this framework to explain the next 2 papers. This will hopefully help the significance of any paper that I cover in the larger scheme of things.

For AI4Bharat, a lot of the papers have significant focus on gathering data and processing data. This is natural since building Indic LLMs require massive Indic language data.

Paper 1: IndicNLP Suite

Motivation

The introduction points out to the lack of large, publicly available monolingual corpora for many major Indian languages, despite being spoken by over a billion people and including 8 of the top 20 most spoken languages. Progress in Indic NLP has been hindered by the scarcity of large-scale monolingual corpora and evaluation benchmarks. The paper aims to address this by creating:

Large, general-domain monolingual corpora for 11 Indian languages
Word embeddings and multilingual language models trained on these corpora
An evaluation benchmark comprising various NLU tasks

Data

This paper introduces IndicCorp — Large-scale sentence-level monolingual corpora (corpora, corpus — all refer to datasets) for 11 Indian languages and Indian English, totaling 8.8 billion tokens primarily from news crawls.

For comparison: GPT-4 (the model that powers ChatGPT) was trained on 13000 billion or 1.3 trn tokens.

IndicCorp: Indian Language Corpora

Data sources: The goal was to collect corpora reflecting contemporary use and covering a wide range of topics. The primary focus was on crawling news articles, magazines, and blog posts from popular Indian language news websites. This was important since the model, if trained on too specific a domain, will not generalise well for common tasks.
Article Extraction: For extracting the main content from news websites, the tool BoilerPipe was used with BeautifulSoup a Python library for parsing HTML/XML documents. Filters based on content length and script were applied to ensure good quality articles.
Text Processing: The text underwent several processing steps, including canonicalization (to handle multiple Unicode representations of characters), sentence splitting and tokenization (taking into account Indic punctuations and sentence delimiters).

Model

Word Embeddings

The authors discuss existing word embeddings trained for Indian languages on limited corpora, such as Polyglot and FastText (trained on Wikipedia and Wikipedia + CommonCrawl).

The authors introduce new pre-trained word embeddings, IndicFT, based on FastText (a library developed by Facebook in 2016 to generate embeddings from words). These embeddings outperformed other embeddings trained for Indian languages on most evaluation tasks. FastText was used to be able to handle the morphological complexity of Indian languages.

Morphological complexity refers to the degree of internal structure and variation in the words of a language. For ex: Hindi nouns change form based on gender (masculine/feminine), number (singular/plural).

The quality of these embeddings was evaluated on word similarity, word analogy, text classification, and bilingual lexicon induction tasks.

IndicBERT: Multilingual NLU Model

The focus on the paper is on NLU and hence the model is trained for NLU tasks. Instead of training a model from scratch, the authors used a variation of ALBERT, which is a less resource-heavy version of BERT. This has been introduced as IndicBERT. Pre-trained models are valuable for initialization and transfer learning in various NLP tasks. The quality of these models heavily relies on the size of the monolingual corpora used for training.

This section introduces IndicBERT, a multilingual NLU model trained on IndicCorp and evaluated on IndicGLUE. The ALBERT model was chosen as the base due to its compact size. A single model was trained for all Indian languages to leverage their relatedness, which could be particularly beneficial for under-represented languages.

Pre-training: First, a sentence piece tokenizer was trained to tokenize the sentences in each language. A multilingual ALBERT model was then pre-trained using the masked language model (MLM) objective, without the Sentence Order Prediction objective used in the original ALBERT. Exponentially smoothed weighting was applied to the data across languages to improve representation for low-resource languages.

A vocabulary of 200k was used. Both base and large versions of ALBERT were trained on a TPU v3. A smaller maximum sequence length of 128 and a batch size of 2048 (for the large model) were used due to memory constraints. The model was trained for 400k steps.

Fine-tuning: IndicBERT was fine-tuned independently for each task in IndicGLUE and for each language using their respective training sets. The fine-tuning procedures for each task are described:

Headline Prediction and Wikipedia Section Title Prediction: The article/section and candidate headline/title are fed to the model with a SEP token. A classification head assigns a score, and cross-entropy loss is used. The candidate with the highest score is predicted.
Named Entity Recognition: Each sentence is fed as a single sequence. A softmax layer at the output computes a probability distribution over NER classes for each token.
Cloze-style Multiple-choice QA: The masked text is input, and a softmax layer predicts a probability distribution over the candidates. Cross-entropy loss is used.
Cross-lingual Sentence Retrieval: No fine-tuning is required. Sentence representations are obtained by mean-pooling the last hidden layer outputs, and cosine distance is used for similarity. Sentence vectors are also centered to remove language-specific bias.
Winograd NLI, COPA, Paraphrase Detection: Sentence pairs are input as segment A and B. The [CLS] representation from the last layer is fed to a classification layer.
News Category Classification, Discourse Mode Classification, Sentiment Analysis: The [CLS] token representation is fed to a linear classifier with a softmax layer to predict the category distribution. Multi-class cross-entropy loss is used.

Evaluation

The authors introduce IndicGLUE (Indic General Language Understanding Evaluation Benchmark), a new Multilingual NLU Benchmark, which comprises a collection of various tasks designed to evaluate the NLU capabilities across multiple Indian languages. 2 major methods were used to create the evaluations:

Existing datasets were used for some tasks, although these were available for only 4–5 Indian languages. Some English datasets were also manually translated into a few Indian languages.
Second, new datasets were created that span all major Indian languages. These were curated semi-automatically using external metadata like website/Wikipedia structure to create reasonably complex NLU tasks.

The tasks involved within IndicGLUE are:

News Category Classification: The task is to predict the genre/topic of a given news article or headline. Datasets were created using IndicCorp for 9 languages, with categories determined from URL components. Generic categories like entertainment, sports, business, lifestyle, technology, politics, and crime were chosen.
Headline Prediction Task: The task is to predict the correct headline for a news article from a list of four candidates (one correct, three incorrect). The dataset was generated from news article crawls containing articles and their headlines.
Wikipedia Section-title Prediction: The task is to predict the correct title for a Wikipedia section from four candidates.
Cloze-style Multiple-choice QA: Given a text with a masked entity, the task is to predict the entity from four candidates. Text was obtained from Wikipedia, and entities were identified using Wikidata. This task assesses if language models can be used as knowledge bases.
Named Entity Recognition: The WikiAnn NER dataset, containing NER data for 282 languages (including Indian languages), was used. The task considers coarse-grained labels: Person (PER), Organisation (ORG), and Location (LOC).
Cross-lingual Sentence Retrieval: Given an English sentence, the task is to retrieve its translation from a set of candidate sentences in an Indian language. The CVIT-Mann Ki Baat dataset was used for this task.
Winograd NLI (WNLI): This task, part of the GLUE benchmark, involves pairs of sentences where a pronoun in the second sentence is replaced with a possible referent from the first. The task is to predict if the second sentence is entailed by the first. The dataset was manually translated into three Indic languages (hi, mr, gu).
COPA: The Choice Of Plausible Alternatives task evaluates commonsense causal reasoning. It presents a premise and two alternatives, and the task is to select the more plausible cause or effect. The dataset was translated into three Indic languages (hi, mr, gu).
Paraphrase Detection: The Amritha paraphrase dataset, comprising four Indic languages (hi, pa, ta, ml), was used. Two subtasks are included: (1) classifying sentence pairs as paraphrases or not, and (2) identifying if they are completely equivalent, roughly equivalent, or not equivalent.
Discourse Mode Classification: Given a sentence, the task is to classify it into one of the following discourse categories: argumentative, descriptive, dialogic, informative, narrative. The MIDAS Hindi Discourse Analysis dataset was used.
Sentiment Analysis: Several publicly available datasets were used, including the IIT-Patna Movie and Product Sentiment Analysis dataset (Hindi) and the ACTSA Sentiment Analysis corpus (Telugu).

Model (IndicBERT vs others) performance on IndicGLUE:

IndicBERT models outperform XLM-R and mBERT on most tasks, being competitive on Wikipedia Section Title prediction.
mBERT shows relatively higher performance on Wikipedia-based tasks (NER, Wikipedia Section Title prediction, and Multiple-choice QA), suggesting it benefits from Wikipedia data during pre-training, which was deliberately excluded from IndicCorp.
Multiple-choice QA and Cross-Lingual Sentence Retrieval are identified as more challenging tasks, where IndicBERT models show improvement over XLM-R and mBERT.
The performance of IndicBERT large on monolingual tasks is generally lower for Assamese and Odia, the languages with the smallest corpora, and highest for Hindi and Bengali, which have the largest corpora, indicating the impact of corpus size.

Paper 2: Samanantar

The previous paper talked about NLU for Indic languages whereas this paper talks about translation (NLG) amongst Indic languages. Whereas the last paper focussed on a monolingual corpora for each language, the focus is here on creating parallel corpora for Indic languages. Parallel corpora means having translations mapped amongst different languages.

3 sentence pairs of parallel data (En-Hindi) where src = source is English and tgt=target is Hindi. Samanantar has this for ~49mn sentence pairs across 11 languages (i.e. target is any one of 11 Indic Languages — Gujarati, Marathi, Malayalam, Telugu, Tamil, Assamese, Oriya, Hindi, Bengali, Kannada, Punjabi)

This paper presents Samanantar, the largest publicly available parallel corpora collection for 11 Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages. This represented a 4× increase over existing publicly available parallel corpora for these languages.

The authors compiled 12.4 million sentence pairs from existing public sources and additionally mined 37.4 million sentence pairs from the web. The web mining process involved combining several corpora, tools, and methods.

The paper details the process of mining parallel sentences (explained below). To evaluate the quality of the newly mined corpus, the authors conducted human evaluation of samples across the 11 languages, which validated the high quality of the parallel sentences.

Furthermore, the paper introduces IndicTrans, a multilingual NMT model trained on the Samanantar corpus. This model spans all 11 Indic languages and English, with a design choice to represent all Indic language data in a single script (Devanagari) to improve lexical sharing.

IndicTrans was compared with commercial translation systems (Google, Microsoft) and other publicly available models on various benchmarks like FLORES, WAT, and WMT16 (FLORES — Facebook Low Resource (FLoRes) machine translation benchmark, WAT — Workshop on Asian Translation, and WMT16 - Workshop on Machine Translation 2016, are common translation benchmarks used to evaluate the quality of translations from a model)

The results demonstrate that IndicTrans outperforms existing open-source models and even surpasses commercial systems on many benchmarks, establishing the utility of Samanantar. Notably, IndicTrans showed higher performance gains for low-resource languages.

Data

The paper introduces Samanantar, containing a total of 49.7 million sentence pairs between English and these languages. This represents a 4x increase over existing publicly available data.

The corpus was built by collating 12.4 million sentence pairs from existing public sources like OPUS, WAT 2021, and various non-OPUS sources. These sources were not cleaned or post-processed by the authors.

37.4 million new sentence pairs were mined from the web using several methods:

Machine-readable comparable corpora

Parallel sentences were extracted from news websites, educational platforms (NPTEL, Coursera, Khan Academy), and science YouTube channels by extracting articles/subtitles, tokenizing using the Indic NLP Library, and using cosine similarity of LaBSE embeddings (LAS) to identify potential parallel sentences with a threshold of 0.75.

Non-machine readable comparable corpora

Text was extracted from scanned documents (government sources from Tamil Nadu, Bangladesh, West Bengal, Andhra Pradesh, and Telangana) using OCR (Google's Vision API), followed by tokenization and parallel sentence extraction using LAS on corresponding documents.

Web-scale monolingual corpora (IndicCorp)

FAISS was used for efficient nearest neighbor search in the English portion of IndicCorp based on LaBSE embeddings with product quantization. The top-1 matching English sentence for each Indic sentence was retrieved, and then LAS was computed on the full embeddings with a higher threshold of 0.80 for filtering. Wikipedia was also processed similarly. This LaBSE-based alignment was chosen over methods like Vecalign and Bleualign because those methods assume/require parallel documents.

83.4 million parallel sentences between all 55 Indic language pairs were mined by pivoting through English from the English-centric corpus. A strict deduplication method was used.

The largest contributor to the newly mined data was IndicCorp (67%).

Model

The paper presents IndicTrans, a multilingual NMT model for translating between English and 11 Indic languages, and between the Indic languages themselves.

A key design choice was to represent all Indic language data in a single script (Devanagari) using the Indic NLP Library to enhance lexical sharing and reduce vocabulary fragmentation. Source and target languages are indicated using special tokens.

IndicTrans was trained using all the parallel data in the Samanantar corpus between English and the 11 Indic languages, after removing overlaps with test and validation sets.

Separate vocabularies (32K BPE) were learned for English and Indic languages from the English-centric training data using subword-nmt.

The model uses a transformer-based architecture with 6 encoder and decoder layers, input embeddings of size 1536, 16 attention heads, and a feedforward dimension of 4096.

Training was performed using fairseq with the Adam optimizer, label smoothing, gradient clipping, mixed precision training on 8 V100 GPUs, and early stopping.

Beam search (beam size 5, length penalty 1) was used for decoding.

Evaluation

The quality of the mined parallel corpus was evaluated through a human annotation task.

9,566 English-Indic sentence pairs were sampled across 11 languages and different mining sources, stratified based on their LAS score.

38 native speakers annotated the semantic textual similarity (STS) on a scale of 0 to 5, based on SemEval-2016 guidelines.

The results showed a high mean STS score of 4.27 for all accepted pairs, indicating high semantic similarity. The LAS thresholds were shown to regulate quality effectively.

A moderate positive correlation (0.37) was found between LAS and STS, suggesting potential for improving multilingual representations.

Error analysis revealed an overall extraction accuracy of 79.5% for accepted pairs.

The performance of IndicTrans was evaluated on various publicly available benchmarks: WAT2020, WAT2021, WMT, UFAL Entam, and FLORES. A new test set was also created for the en-as pair.

The primary evaluation metric was BLEU score, calculated using SacreBLEU with specific tokenization settings for Indic-English and English-Indic directions.

IndicTrans was compared against commercial MT systems (Google, Microsoft), publicly available open-source NMT systems (OPUS-MT, mBART50), and models trained on all existing parallel data (Transformer, mT5).

The results demonstrated that IndicTrans trained on Samanantar outperforms nearly all existing open-source models and often outperforms commercial systems on most benchmarks. Significant gains were observed for low-resource languages. The performance on the independently created FLORES test set highlighted the utility of Samanantar across different domains.

The authors also suggest areas for future work, including improving multilingual representations for low-resource languages and longer sentences, optimizing training strategies, and pre-training multilingual models for Indic languages.

The paper concludes by highlighting the three main contributions:

Samanantar, the largest parallel corpora collection for Indic languages
IndicTrans, a multilingual translation model
Human judgments on cross-lingual textual similarity