Towards a twitter corpus of the indigenous languages of the Americas • Latin American and Iberian Languages Open Corpora Forum

Title	Towards a twitter corpus of the indigenous languages of the Americas
Presented by	Rosales, M., Mager, M., Meza, I.
Language	Mexican Native
Language code	??
Category	methodology
Status	available
Type	text
Year	2019

Internet communication has become an important social phenomenon. However, minority languages have been largely ignored in the design of social networks. Likewise, the natural language processing (NLP) and computational linguistics (CL) communities have done only a small amount of research on these languages in recent years [5]. To close this gap, we propose collecting a linguistic corpus of the indigenous languages of the Americas from Internet-mediated communication (IMC). IMC spans a wide variety of platforms, but on many of them it is difficult to collect linguistic data for study purposes. Based on the success of previous work [4, 10], we decided to use Twitter.

First, we will manually collect accounts that regularly post in any indigenous language of the Americas⁴, regardless of the language or variant. To speed up the search, keywords will be extracted from available grammars [7]. This approach will give us a first look at the type of data that users generate on Twitter. We also expect to identify specific issues in the collected data. With this information, we will propose a consistent annotation schema. This preliminary compilation will cover a minimum of 100 accounts, labeled by the type of language manifestation in the tweets, the type of account (institutional, personal, community, activist, or diffusion), and the most frequently used languages.
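The labeling described above could be captured in a small record type. The following is only a sketch under stated assumptions: the class and field names (`AccountAnnotation`, `manifestations`, `languages`), the `summarize` helper, and the free-form string labels are hypothetical and are not the annotation schema the authors will propose. Only the five account types and the 100-account minimum come from the text.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AccountType(Enum):
    """Account types named in the abstract."""
    INSTITUTIONAL = "institutional"
    PERSONAL = "personal"
    COMMUNITY = "community"
    ACTIVIST = "activist"
    DIFFUSION = "diffusion"


@dataclass
class AccountAnnotation:
    """One manually labeled account (hypothetical field names)."""
    handle: str
    account_type: AccountType
    # Free-form labels for the kinds of language use seen in the tweets.
    manifestations: List[str] = field(default_factory=list)
    # Most frequently used languages; the label format is an assumption.
    languages: List[str] = field(default_factory=list)


MIN_ACCOUNTS = 100  # minimum size of the preliminary compilation


def summarize(annotations):
    """Count accounts by type and language, and check the minimum size."""
    return {
        "accounts": len(annotations),
        "meets_minimum": len(annotations) >= MIN_ACCOUNTS,
        "by_type": Counter(a.account_type.value for a in annotations),
        "by_language": Counter(
            lang for a in annotations for lang in a.languages
        ),
    }
```

A summary like this would make it easy to see early on whether one account type or language dominates the manual sample.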

The next step is an automated account search that recognizes tweets containing indigenous languages of the Americas. The intention is to perform a data rectification phase in which collection problems and possible biases are examined. Ideally, we hope to obtain different types of language manifestations: code-switching, vocabulary tweets (indigenous-language words with their translation, or with the word or phrase closest in meaning), parallel data, literary expressions, and monolingual indigenous tweets. At the same time, we hope to identify different types of users: broadcasting pages, activists, native speakers, heritage speakers, etc. The collected data will be used in accordance with the Twitter developer agreement, and therefore it will not be publicly available. However, we will publish the annotations and the references to the original data source to make the dataset reproducible.
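One possible way to publish annotations together with a reference to the original data, without redistributing tweet content, is to release only tweet identifiers alongside the labels. The sketch below assumes that the reference is a tweet ID and that each collected tweet is a dictionary with `id` and `text` keys. Both are assumptions made for illustration, and the function names are hypothetical.

```python
import json


def to_public_record(tweet, annotation):
    """Keep the annotation labels and the tweet ID; drop text and user data.

    `tweet` is assumed to be a dict with at least an "id" key.
    `annotation` is a dict of labels produced by the annotators.
    """
    return {**annotation, "tweet_id": str(tweet["id"])}


def export_jsonl(pairs):
    """Serialize (tweet, annotation) pairs as one JSON object per line."""
    return "\n".join(
        json.dumps(to_public_record(tweet, annotation), sort_keys=True)
        for tweet, annotation in pairs
    )
```

Anyone with access to the platform could then retrieve the original tweets from the identifiers, subject to the same terms of use.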

The foreseen tasks that will arise from the resulting data are automatic language identification, parallel phrase extraction, and sociolinguistic studies of the usage of these languages on the web [8, 2]. To train models that perform these tasks, we plan to use existing complementary resources such as monolingual data [3], Bible parallel data [6], and web corpora [9].
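To illustrate the language identification task, here is a minimal baseline sketch that compares character trigram profiles with cosine similarity. This is a generic technique chosen for illustration, not the model the authors plan to train, and all function names are hypothetical. In practice the profiles would be built from monolingual resources such as those cited above.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(samples):
    """Build one n-gram profile per language from {label: [texts]}."""
    profiles = {}
    for label, texts in samples.items():
        profile = Counter()
        for text in texts:
            profile.update(char_ngrams(text))
        profiles[label] = profile
    return profiles


def identify(text, profiles):
    """Return the label whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))
```

Short, code-switched tweets are likely to be hard for a baseline like this, which is one reason annotated Twitter data would be useful for training and evaluation.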

⁴ To collect these accounts, we will use the index of indigenous Twitter users [1] at http://www.indigenoustweets.com

References

1. Bhroin, N.N.: Social media-innovation: The case of indigenous tweets. The Journal of Media Innovations 2(1), 89–106 (2015)
2. Eleta, I., Golbeck, J.: Multilingual use on Twitter: Social networks at the language frontier. Computers in Human Behavior 41, 424–432 (2014)
3. Goldhahn, D., Eckart, T., Quasthoff, U.: Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In: LREC. vol. 29, pp. 31–43 (2012)
4. Keegan, T.T., Mato, P., Ruru, S.: Using Twitter in an indigenous language: An analysis of te reo Māori tweets. AlterNative: An International Journal of Indigenous Peoples 11(1), 59–75 (2015)
5. Mager, M., Gutierrez-Vasques, X., Sierra, G., Meza-Ruiz, I.: Challenges of language technologies for the indigenous languages of the Americas. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 55–69 (2018)
6. Mayer, T., Cysouw, M.: Creating a massively parallel Bible corpus. Oceania 135(273), 40 (2014)
7. Neubig, G., Mori, S., Mizukami, M.: A framework and tool for collaborative extraction of reliable information. In: Proceedings of the Workshop on Language Processing and Crisis Information 2013. pp. 26–35 (2013)
8. Nguyen, D., Trieschnigg, D., Cornips, L.: Audience and the use of minority languages on Twitter. In: Ninth International AAAI Conference on Web and Social Media (2015)
9. Scannell, K.P.: The Crúbadán project: Corpus building for under-resourced languages. In: Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop. vol. 4, pp. 5–15 (2007)
10. Ungerleider, N.: Preserving indigenous languages via Twitter. Fast Company 14 (2011)