On the Building of the Large Scale Corpus of Southern Qichwa

Name: Corpus of Southern Qichwa
Link: https://siminchikkunarayku.pe/raw_audio.html
Title: On the Building of the Large Scale Corpus of Southern Qichwa
Presented by: Camacho, L., Zevallos, R., Melgarejo, N.
Language: Southern Qichwa
Language code: qu-PE
Category: resource
Status: available
Type: corpora
Year: 2018

Introduction

To the best of our knowledge, no Quechua speech dataset exists yet. Nevertheless, various groups in Latin America and abroad have been working on Quechua language technology for the last few years.

The Instituto de Lengua y Literatura Andina Amazonica (ILLA)1 has been working on the construction of electronic dictionaries for Quechua, Aymara and Guarani. The group Hinantin2 at the Universidad Nacional San Antonio Abad del Cusco (UNSAAC) has produced a text-to-speech system for Cusco Quechua, a Quechua spell checker plug-in for LibreOffice [5], and a morphological analyzer for Ashaninka, an indigenous language whose speakers are scattered across the Amazonian rainforest in Peru and Brazil.

Rios [6] describes a language technology toolkit that includes the first morphological analyzer for Quechua, a hybrid machine translation system in the Spanish-Quechua direction, and the first Quechua dependency treebank.

The Quechua Language Family

Qichwa, or Quechua (the Spanish form), is a family of languages spoken in South America by around 10 million speakers, not only in the Andean regions but also along the valleys and plains connecting the Amazonian forest to the Pacific coastline. Quechua languages are considered highly agglutinative, with subject-object-verb (SOV) sentence structure, and are mostly post-positional. Table 1 contains an example of standard Quechua.

        Quechua       Qichwa siminchik kan
                      Qichwa simi-nchik ka-n
        Lit. trans.   Quechua mouth-ours is
        Translation   Quechua is our language.

        Table 1. Example sentence in standard Quechua Chanca
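
To make the agglutinative structure in Table 1 concrete, the hyphen-segmented line can be split into words and morphemes. This is only a minimal sketch: the function name `segment` is hypothetical, and it assumes that hyphens mark morpheme boundaries exactly as written in the table. It is not a morphological analyzer.

```python
def segment(gloss_line: str) -> list[list[str]]:
    """Split a hyphen-segmented line into words, then into morphemes.

    Assumption: whitespace separates words and '-' separates morphemes,
    as in the segmented row of Table 1.
    """
    return [word.split("-") for word in gloss_line.split()]


# Segmented row from Table 1.
morphemes = segment("Qichwa simi-nchik ka-n")
print(morphemes)  # [['Qichwa'], ['simi', 'nchik'], ['ka', 'n']]
```

Each inner list corresponds to one word of the sentence, so the root and its suffixes stay grouped together.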

Even though the classification of Quechua languages remains open to research [3,4], recent work in language technology for Quechua [6,2] has adopted the categorization system described by Torero [8]. This categorization divides the Quechua languages into two main branches, QI and QII. Branch QI corresponds to the dialects spoken in central Peru. QII is further divided into three branches: QIIA, QIIB and QIIC. QIIA groups the dialects spoken in Northern Peru, and QIIB those spoken in Ecuador and Colombia. In this paper, we focus on the QIIC dialects, which are spoken in Southern Peru, Bolivia, Chile and Argentina. Speakers of QI and QII dialects do not always understand each other. However, QII dialects are close enough to one another to allow mutual intelligibility (see Figure 1).
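
The branch structure described above can be written down as a small lookup table. The sketch below only restates the regions named in the text. The dictionary layout and the helper name `main_branch` are illustrative assumptions, not part of Torero's work or of the corpus tooling.

```python
# Torero's branches as summarized in the text, mapped to the regions
# where the corresponding dialects are spoken.
TORERO_BRANCHES = {
    "QI": ["Central Peru"],
    "QIIA": ["Northern Peru"],
    "QIIB": ["Ecuador", "Colombia"],
    "QIIC": ["Southern Peru", "Bolivia", "Chile", "Argentina"],
}


def main_branch(branch: str) -> str:
    """Return the top-level branch (QI or QII) for a branch label."""
    if branch not in TORERO_BRANCHES:
        raise KeyError(f"unknown branch: {branch}")
    return "QII" if branch.startswith("QII") else "QI"


# The dialects this paper focuses on belong to QIIC.
print(main_branch("QIIC"), TORERO_BRANCHES["QIIC"])
```

Keeping the sub-branches as flat keys makes it easy to look up regions directly, while `main_branch` recovers the two-way QI/QII split when it is needed.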

Two dialects are spoken in Southern Peru. The first, Quechua Chanca, is mainly spoken in Ayacucho and the surrounding departments of Peru. The second, Quechua Collao, is spoken in the departments of Cusco and Puno and in some northern regions of Bolivia. The main difference between these dialects is the occurrence of glottalized and aspirated stops in Quechua Collao, a phonetic distinction that Quechua Chanca lacks.

1 http://www.illa-a.org/wp/

2 http://hinant.in

References

  1. Cerrón-Palomino, R.: Quechua sureño. Diccionario unificado, Lima, Perú, Biblioteca Nacional del Perú (1994)

  2. Gonzales, A.R., Mamani, R.A.C.: Morphological disambiguation and text normalization for southern Quechua varieties. In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects. pp. 39–47 (2014)

  3. Heggarty, P., Valko, M.L., Huarcaya, S.M., Jerez, O., Pilares, G., Paz, E.P., Noli, E., Usandizaga, H.: Enigmas en el origen de las lenguas andinas: aplicando nuevas técnicas a las incógnitas por resolver. Revista Andina 40, 9–57 (2005)

  4. Landerman, P.N.: Quechua dialects and their classification (1992)

  5. Rios, A.: Spell checking an agglutinative language: Quechua. In: 5th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. pp. 51–55 (2011)

  6. Rios, A.: A basic language technology toolkit for Quechua (2016)

  7. Soria, C., Pretorius, L., Declerck, T., Mariani, J., Scannell, K., Wandl-Vogt, E.: CCURL 2016 Collaboration and Computing for Under-Resourced Languages: Towards an alliance for digital language diversity (2016)

  8. Torero, A.: Los dialectos quechuas. Univ. Agraria (1964)