The Wixarika-Spanish Parallel Corpus • Latin American and Iberian Languages Open Corpora Forum

Name	The Wixarika-Spanish Parallel Corpus
Link	https://github.com/pywirrarika/wixarikacorpora
Title	The Wixarika-Spanish Parallel Corpus
Presented by	Mager, J., Carrillo, D., Meza, I.
Language	Wixarika
Language code	hch-MX (ISO 639-3)
Category	resource
Status	available
Type	corpora
Year	2018

Introduction

Wixarika is an indigenous language spoken in central-west Mexico¹ by approximately fifty thousand people². For indigenous languages like Wixarika, digital resources are generally scarce, since native speakers do not necessarily leave a digital footprint on public forums.

The lack of resources is even more noticeable for NLP-related tasks. The corpus presented here aims to be the seed of a larger future effort to close this gap, especially for data-driven machine translation (MT) [3, 4]. Since our collection has only 8,967 parallel phrases, it can be considered a low-resource corpus, which may be limiting for certain research purposes. Nevertheless, it remains of interest for several reasons:

– Wixarika has inherent linguistic properties that make it interesting to study for the sake of understanding the inner workings of languages.
– Low-resource scenarios offer an opportunity to imagine and create new tools for transferring or exploiting knowledge from other languages.
– It requires defining new methodologies for collecting corpora within native speaker communities.

Wixarika

Wixarika belongs to the Coracholan subgroup of the Uto-Aztecan language family [1]. It has a subject-object-verb (SOV) structure, and its morphological typology is polysynthetic. This means that it has a high morpheme-to-word ratio and, consequently, a large overall number of distinct words, which allows a great amount of information to be incorporated at the morphological level [2]. Native speakers use 18 symbols, Σwixarika = {a, e, h, i, +, k, m, n, p, r, t, s, u, w, x, y, ’}, of which five denote vowels, {a, e, i, u, +}, each with long and short variants. Although most linguists prefer a dashed i to denote the fourth vowel, in practice native speakers use a plus symbol (+). The corpus uses the latter in its orthographic transcription of Wixarika.
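The alphabet described above can be encoded as a set, which gives a simple check that a transcribed word uses only the corpus orthography. The following is a minimal sketch and not part of the corpus tooling: the names `WIXARIKA_ALPHABET`, `VOWELS`, `normalize`, and `uses_corpus_orthography` are illustrative, the set contains exactly the symbols listed in the text, and mapping the typographic apostrophe to the ASCII one is an assumption.

```python
# Symbols exactly as listed in the text; "+" stands for the fourth vowel.
WIXARIKA_ALPHABET = set("aehi+kmnprtsuwxy") | {"'"}
VOWELS = set("aeiu+")


def normalize(word: str) -> str:
    """Lowercase the word and map the typographic apostrophe to "'" (assumption)."""
    return word.lower().replace("\u2019", "'")


def uses_corpus_orthography(word: str) -> bool:
    """Return True if every character of the word belongs to the alphabet."""
    return all(ch in WIXARIKA_ALPHABET for ch in normalize(word))


def count_vowel_symbols(word: str) -> int:
    """Count the vowel symbols in a word (long/short length is not distinguished)."""
    return sum(ch in VOWELS for ch in normalize(word))
```

Such a check could be used, for instance, to flag lines that mix Spanish and Wixarika text, since Spanish words often contain symbols outside this set (such as o).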

To illustrate the large amount of information contained in a single Wixarika word, let us analyze the word nep+ka’ukats+k+, which means “I don’t have a dog”. This word is composed of the morphs ne|p+|ka|’u|ka|ts+k+³. Although the word is a verb, its polysynthetic nature makes it a full sentence: ts+k+ is the stem and means “dog”, ne is a first-person possessive, ka marks negation, ’u refers to a visual object, and the second ka is the second part of the negation.
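The segmentation above can be handled programmatically by splitting on the | delimiter. The sketch below is illustrative only: `split_morphs` and `surface_form` are hypothetical helper names, and the glosses are copied from the text. The text gives no gloss for p+, so it is left as `None` rather than guessed.

```python
def split_morphs(segmented: str, delimiter: str = "|") -> list[str]:
    """Split a delimited word into its morphs, ignoring stray whitespace."""
    return [m.strip() for m in segmented.split(delimiter) if m.strip()]


def surface_form(morphs: list[str]) -> str:
    """Concatenate morphs back into the written word."""
    return "".join(morphs)


morphs = split_morphs("ne|p+|ka|’u|ka|ts+k+")

# Glosses in morph order, as given in the text; "p+" is not glossed there.
glosses = [
    "first-person possessive",
    None,
    "negation",
    "refers to a visual object",
    "second part of the negation",
    "stem: dog",
]

for morph, gloss in zip(morphs, glosses):
    print(f"{morph}\t{gloss if gloss is not None else '(not glossed in the text)'}")
```

The glosses are stored by position rather than in a dictionary because ka occurs twice with different roles.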

Corpus

The corpus consists of a parallel collection of sentences drawn from the classic fairy tales of Hans Christian Andersen and the Brothers Grimm. A native Wixarika speaker fluent in Spanish carefully translated sentences from the tales. Table 1 summarizes the main statistics of the corpus. Although the corpus is small, it contains a large number of token types, owing to the rich morphology of the Wixarika language.

The corpus is freely available from http://anonymized/wixarikacorpora⁴. It has already been used to create two machine translation systems⁵.

Phrases    11,562    Unique phrases    8,967
Tokens     56,037    Token types       17,131

Table 1. Number of phrases, unique phrases, tokens, and token types contained in the Wixarika-Spanish parallel corpus.
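The four quantities in Table 1 can be computed from a list of phrases, as sketched below. This is a hedged illustration, not the authors' procedure: the source does not say how tokens were counted, so whitespace tokenization is an assumption, and `corpus_statistics` and `type_token_ratio` are hypothetical names. The toy phrases are placeholders, not corpus data.

```python
def corpus_statistics(phrases: list[str]) -> dict[str, int]:
    """Count phrases, unique phrases, tokens, and token types.

    Tokens are obtained by whitespace splitting (an assumption).
    """
    tokens = [tok for phrase in phrases for tok in phrase.split()]
    return {
        "phrases": len(phrases),
        "unique_phrases": len(set(phrases)),
        "tokens": len(tokens),
        "token_types": len(set(tokens)),
    }


def type_token_ratio(token_types: int, tokens: int) -> float:
    """Ratio of distinct token types to running tokens."""
    return token_types / tokens


# Toy placeholder input, only to show the shape of the result.
toy_stats = corpus_statistics(["w1 w2", "w1 w2", "w3 w1"])

# Type-token ratio from the figures reported in Table 1.
reported_ratio = type_token_ratio(17_131, 56_037)
print(f"{reported_ratio:.2f}")
```

With the figures from Table 1, the type-token ratio is roughly 0.31, which is one way to quantify the observation that the corpus has many token types relative to its size.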

Conclusions

The Wixarika-Spanish parallel corpus is an effort to foster research in machine translation for this language pair. Moreover, it can serve as a seed to promote further data collection for other indigenous languages. The main aim of creating such datasets is to feed data-driven MT systems.

¹Wixarika is spoken in the states of Jalisco, Nayarit, Durango, and Zacatecas.

² Wixarika is also known as Huichol, a name close to the Nahuatl denomination of the language.

³ Note that we use the | symbol to delimit morphemes.

⁴ The correct link will be provided in the final version.

⁵ Links to be provided in the final version.

References

1. Baker, M.C.: Complex predicates and agreement in polysynthetic languages. Complex predicates, pp. 247–288 (1997)
2. Iturrio, J.L., Gómez López, P.: Gramática Wixarika I. Archivo de lenguas indígenas de México, Lincom Europa (1999)
3. Mager, M., Dionico, C., Ivan, M.: Probabilistic finite-state morphological segmenter for the Wixarika (Huichol) language. Journal of Intelligent & Fuzzy Systems (Special Issue) (2018)
4. Mager Hois, J.M., Barron Romero, C., Meza Ruíz, I.V.: Traductor estadístico wixarika - español usando descomposición morfológica. COMTEL (6) (2016)