Building an annotated Nheengatu-Portuguese parallel corpus • Latin American and Iberian Languages Open Corpora Forum

Name	Nheengatu-Portuguese parallel corpus
Link	github.com/juliana-gurgel/yrl
Title	Building an annotated Nheengatu-Portuguese parallel corpus
Presented by	Gurgel, J., Alexandre, D., Alencar, L.
Languages	Brazilian Portuguese, Nheengatu
Language codes	pt-BR, yrl
Category	resource
Status	available
Type	corpora
Year	2021

Natural Language Processing (NLP) resources have mostly been developed for major languages such as Portuguese and English. For endangered languages, annotated corpora and NLP tools are both necessary resources for language preservation. The aim of this ongoing research is therefore to build the first electronic corpus for Nheengatu, an endangered indigenous language spoken in Brazil, Colombia and Venezuela by several ethnic groups living in the Amazon region. So far, we have compiled 2,207 sentences in the Nheengatu-Portuguese language pair and a dictionary containing 522 Nheengatu words. The sentences and the dictionary were extracted from the books Curso de Língua Geral and Noções de língua geral ou nheengatu. The parallel corpus was compiled in the following steps: manual and automatic extraction of data, sentence splitting and alignment, and normalization. The non-annotated corpus and the dictionary will later be used to implement Nheenga-Tagger, a part-of-speech tagger for Nheengatu.
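
The abstract names sentence splitting, alignment, and normalization as compilation steps but does not describe how they were carried out. The Python sketch below is only an illustration of what such steps might look like, not the authors' pipeline. It assumes Unicode NFC plus whitespace collapsing as the normalization, a naive punctuation-based splitter, and a strict one-to-one alignment by position. The function names `normalize`, `split_sentences`, and `align` are hypothetical, and the example strings are placeholders rather than corpus data.

```python
import re
import unicodedata
from typing import List, Tuple


def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFC plus whitespace collapsing."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Abbreviations and quoted speech would need extra handling in practice.
    """
    parts = re.split(r"(?<=[.!?])\s+", normalize(text))
    return [part for part in parts if part]


def align(source: str, target: str) -> List[Tuple[str, str]]:
    """Pair sentences by position, assuming a strict 1:1 correspondence."""
    src = split_sentences(source)
    tgt = split_sentences(target)
    if len(src) != len(tgt):
        # A mismatch means the passage needs manual review.
        raise ValueError(f"cannot align {len(src)} source with {len(tgt)} target sentences")
    return list(zip(src, tgt))


if __name__ == "__main__":
    # Placeholder text, not taken from the corpus.
    for pair in align("Source one.  Source two?", "Target one. Target two?"):
        print(pair)
```

Raising an error on a length mismatch, rather than truncating silently, keeps misaligned passages visible so they can be corrected by hand, which fits a workflow that combines manual and automatic extraction.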