Building an annotated Nheengatu-Portuguese parallel corpus • Latin American and Iberian Languages Open Corpora Forum

Name	Nheengatu-Portuguese parallel corpus
Link	github.com/juliana-gurgel/yrl
Title	Building an annotated Nheengatu-Portuguese parallel corpus
Presented by	Gurgel, J., Alexandre, D., Alencar, L.
Languages	Brazilian Portuguese, Nheengatu
Language codes	pt-BR, yrl
Category	resource
Status	available
Type	corpus
Year	2021

Natural Language Processing (NLP) resources have mostly been developed for major languages such as Portuguese and English. For endangered languages, annotated corpora and NLP tools are both necessary resources for language preservation. The aim of this ongoing research is therefore to build the first electronic corpus for Nheengatu, an endangered indigenous language spoken in Brazil, Colombia and Venezuela by several ethnic groups living in the Amazon region. So far, we have compiled 2,207 sentences in the Nheengatu-Portuguese language pair and a dictionary containing 522 Nheengatu words. The sentences and the dictionary were extracted from the books Curso de Língua Geral and Noções de língua geral ou nheengatu. The parallel corpus was compiled in the following steps: manual and automatic extraction of data, sentence splitting and alignment, and normalization. The non-annotated corpus and the dictionary will later be used in the implementation of Nheenga-Tagger, a part-of-speech tagger for Nheengatu.
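
The splitting, alignment and normalization steps named in the abstract could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, sentences are assumed to end in ".", "!" or "?", normalization is assumed to mean Unicode NFC plus whitespace cleanup, and alignment is assumed to be strictly one-to-one by position. The input strings are placeholders, not corpus data.

```python
import re
import unicodedata
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace (assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence: str) -> str:
    """Assumed normalization: Unicode NFC, whitespace runs collapsed to one space."""
    sentence = unicodedata.normalize("NFC", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def align(source: str, target: str) -> List[Tuple[str, str]]:
    """Pair sentences by position; refuse to guess when the counts differ."""
    src = [normalize(s) for s in split_sentences(source)]
    tgt = [normalize(s) for s in split_sentences(target)]
    if len(src) != len(tgt):
        raise ValueError(
            f"cannot align 1:1: {len(src)} source vs {len(tgt)} target sentences"
        )
    return list(zip(src, tgt))


if __name__ == "__main__":
    # Placeholder text standing in for a Nheengatu passage and its translation.
    yrl_text = "YRL  sentence one.\nYRL sentence two?"
    pt_text = "PT sentence one. PT sentence two?"
    for yrl, pt in align(yrl_text, pt_text):
        print(f"{yrl}\t{pt}")
```

Raising an error on mismatched sentence counts, rather than truncating, leaves those passages for manual review, which fits a compilation process that combines manual and automatic extraction.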