The success of technology based on large language models (LLMs), such as ChatGPT, has started a race to revolutionize AI. So far, these systems have been trained on datasets that are mostly in English, which limits their usefulness for Indic languages. Tech Mahindra, however, is working on an indigenous foundational model based on Indic languages. Project Indus is potentially one of the biggest projects the company has worked on.
LLM-based tools like OpenAI's ChatGPT can work in multiple languages, but the data they are trained on is primarily in English. Project Indus aims to initially support 40 Hindi dialects, along with other languages. The model has the capacity to serve 25 percent of the world's population, which would benefit a large number of non-English speakers.
The biggest challenge for the project is the lack of datasets in local languages. Most of the datasets that exist are either untranslated or incomplete. Hindi datasets are also mostly fragmented, which makes them difficult to process. To address this, the company has started a campaign called “Bhasha Daan”, which aims to collect samples of local dialects and languages. Its portal allows speakers of a particular dialect to record information that would be used later.




