Natural Language Processing (NLP) resources have mostly been developed for major languages such as Portuguese and English. For endangered languages, annotated corpora and NLP tools are both necessary resources for language preservation. The aim of this ongoing research is therefore to build the first electronic corpus for Nheengatu, an endangered indigenous language spoken in Brazil, Colombia and Venezuela by several ethnic groups that live in the Amazon region. So far, we have compiled 2,207 sentences in the language pair Nheengatu-Portuguese and a dictionary containing 522 words in Nheengatu. The sentences and the dictionary were extracted from the books Curso de Língua Geral and Noções de língua geral ou nheengatu. The compilation of the parallel corpus involved the following steps: manual and automatic extraction of data, sentence splitting and alignment, and normalization. The non-annotated corpus and the dictionary will later be used to implement Nheenga-Tagger, a part-of-speech tagger for Nheengatu.
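The compilation steps named in the abstract (normalization, sentence splitting, and alignment) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `normalize`, `split_sentences` and `align`, the punctuation-based splitting rule, and the one-to-one positional alignment are all assumptions made for illustration, and the input strings are placeholders rather than corpus data.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", normalize(text))
    return [p for p in parts if p]


def align(source: str, target: str) -> list[tuple[str, str]]:
    """Pair sentences by position; refuse passages whose counts differ."""
    src, tgt = split_sentences(source), split_sentences(target)
    if len(src) != len(tgt):
        raise ValueError(f"cannot align {len(src)} sentences with {len(tgt)}")
    return list(zip(src, tgt))


# Placeholder strings (hypothetical), not real corpus data.
pairs = align(
    "YRL sentence one.  YRL sentence two?",
    "Frase um. Frase dois?",
)
for yrl, pt in pairs:
    print(f"{yrl}\t{pt}")
```

Raising an error on mismatched sentence counts, rather than guessing, leaves such passages for manual alignment, which matches the mix of manual and automatic work the abstract describes.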
Building an annotated Nheengatu-Portuguese parallel corpus
Name | Nheengatu-Portuguese parallel corpus |
Link | github.com/juliana-gurgel/yrl |
Title | Building an annotated Nheengatu-Portuguese parallel corpus |
Presented by | Gurgel, J., Alexandre, D., Alencar, L. |
Languages | Brazilian Portuguese, Nheengatu |
Language codes | pt-BR, yrl |
Category | resource |
Status | available |
Type | corpus |
Year | 2021 |