Building an annotated Nheengatu-Portuguese parallel corpus

Name: Nheengatu-Portuguese parallel corpus
Link: github.com/juliana-gurgel/yrl
Title: Building an annotated Nheengatu-Portuguese parallel corpus
Submitted by: Gurgel, J., Alexandre, D., Alencar, L.
Languages: Brazilian Portuguese, Nheengatu
Language codes: pt-BR, yrl
Category: resource
Status: available
Type: corpora
Year: 2021

Natural Language Processing (NLP) resources have mostly been developed for major languages such as Portuguese and English. For endangered languages, annotated corpora and NLP tools are both necessary resources for language preservation. The aim of this ongoing research is therefore to build the first electronic corpus for Nheengatu, an endangered indigenous language spoken in Brazil, Colombia, and Venezuela by several ethnic groups living in the Amazon region. So far, we have compiled 2,207 sentences in the Nheengatu-Portuguese language pair and a dictionary containing 522 Nheengatu words. The sentences and the dictionary were extracted from the books Curso de Língua Geral and Noções de língua geral ou nheengatu. Compiling the parallel corpus involved the following steps: manual and automatic extraction of data, sentence splitting and alignment, and normalization. The non-annotated corpus and the dictionary will later be used to implement Nheenga-Tagger, a part-of-speech tagger for Nheengatu.
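
The sentence splitting, alignment, and normalization steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`normalize`, `split_sentences`, `align`), the punctuation-based splitting rule, the choice of Unicode NFC, and the one-to-one alignment by position are all assumptions made for illustration, and the sample strings are placeholders rather than real Nheengatu or Portuguese data.

```python
from __future__ import annotations

import re
import unicodedata

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def normalize(text: str) -> str:
    """Compose characters (Unicode NFC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split normalized text into sentences at sentence-final punctuation."""
    return [s for s in _SENTENCE_END.split(normalize(text)) if s]


def align(source: str, target: str) -> list[tuple[str, str]]:
    """Pair sentences one-to-one by position (hypothetical alignment rule).

    Raises ValueError when the two sides have different sentence counts,
    so that mismatched passages can be checked by hand.
    """
    src = split_sentences(source)
    tgt = split_sentences(target)
    if len(src) != len(tgt):
        raise ValueError(
            f"cannot align {len(src)} source with {len(tgt)} target sentences"
        )
    return list(zip(src, tgt))


if __name__ == "__main__":
    # Placeholder text only, standing in for a Nheengatu passage and its
    # Portuguese translation.
    pairs = align("SRC one.   SRC two!", "TGT one. TGT two!")
    for src_sentence, tgt_sentence in pairs:
        print(f"{src_sentence}\t{tgt_sentence}")
```

Raising an error on a sentence-count mismatch, rather than guessing, fits a workflow that combines automatic and manual steps: passages that do not align cleanly are set aside for manual correction.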