Portuguese Universal Dependencies via Bosque • Latin American and Iberian Languages Open Corpora Forum

Name	The Bosque Corpus
Link	https://github.com/UniversalDependencies/UD_Portuguese-Bosque/
Title	Portuguese Universal Dependencies via Bosque
Submitted by	de Paiva, V., Freitas, C., Rademaker, A., Real, L., Chalub, F.
Language	Brazilian Portuguese
Language code	pt-BR
Category	resource
Status	available
Type	corpora
Year	2018

A corpus of popular language

CorPop

This research proposes a corpus of popular Brazilian Portuguese, called CorPop [1], with texts selected according to the average literacy level of the country's readers. CorPop's theoretical and methodological bases are interdisciplinary, drawing on Language Studies and related disciplines such as Corpus Studies, Text Linguistics, Psycholinguistics and Natural Language Processing. CorPop was developed by compiling data on the literacy level of Brazilian readers and on the characteristics of a standard of textual simplicity in a corpus of texts suitable for these readers. The data were collected from the surveys Indicador de Analfabetismo Funcional (Inaf) and Retratos da Leitura no Brasil [6], as well as from a questionnaire administered to readers.

In Brazil, most corpus research has used materials from the mainstream media, represented by outlets such as Folha de São Paulo, O Estado de São Paulo, O Globo and Zero Hora, among others. CorPop, using distinctive source materials, represents popular Brazilian Portuguese as used by most Brazilians. CorPop aims to serve as reference material for linguistic research connected with the reality of low-literacy writers. It differs from other current corpora of Portuguese not only in its size, which is small, making it a lean corpus, but especially in the way it was planned and composed, text by text, segment by segment.

The main criterion for including a text in the corpus is that it belongs to the reading universe of the average Brazilian reader, whose socio-demographic profile is quite specific. Thus, it was necessary to identify the reading proficiency and literacy profile of Brazilian readers and, consequently, of the average Brazilian reader, in order to pre-select the texts to be included in CorPop. On this basis, we selected texts according to what average readers would or would not understand, given the average level of literacy and schooling of Brazilians. The texts selected for CorPop are: (1) popular journalism from the PorPopular Project [2] (the newspaper Diário Gaúcho) and from the newspaper A Hora de Santa Catarina, both massively consumed by the C and D classes (1 to 5 minimum wages), which correspond to the average Brazilian reader; (2) texts and authors most read by respondents to the most recent editions of the survey Retratos da Leitura no Brasil; (3) the collection “É Só o Começo” (classics of Brazilian literature adapted by linguists for readers with low literacy); (4) texts from Boca de Rua, a newspaper produced by people with low schooling and low literacy; and (5) texts from Diário da Causa Operária, a Brazilian working-class publication also produced by people within the country's average literacy range. After collecting, preparing and processing the corpus, we performed a series of experiments with its frequency list. The results show promising applications of CorPop in several linguistic tasks, such as text simplification and use as a controlled vocabulary for writing dictionary definitions [5; see Chapter 4 for details on all tests]. CorPop also shows that a small corpus can have the same legitimacy as a large one. The table below summarizes the contents of CorPop by module:

Table 1. Total number of types and tokens per CorPop module.

Module	Types	Tokens
PorPopular - Diário Gaúcho newspaper	6,378	30,944
Hora de Santa Catarina newspaper	4,118	18,303
Boca de Rua newspaper	8,913	71,454
Diário da Causa Operária newspaper	7,841	59,785
Retratos da Leitura no Brasil	22,463	430,806
Coleção “É Só o Começo”	8,161	73,507
Total	32,138	684,799
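
The distinction between types and tokens in Table 1 can be illustrated with a minimal Python sketch. It assumes a deliberately naive tokenizer (lowercased runs of letters) and two invented toy texts; the function names `tokenize` and `module_stats` and the module labels are hypothetical and are not part of CorPop or its tooling. The sketch shows why module token counts add up to the corpus total while module type counts do not: a type shared by several modules is counted only once for the corpus as a whole.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer (an assumption): lowercase runs of letters,
    keeping hyphenated words together. Digits and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())


def module_stats(modules: dict[str, str]):
    """Return per-module (types, tokens), corpus-wide (types, tokens),
    and the merged frequency list."""
    per_module = {}
    corpus_freq = Counter()
    for name, text in modules.items():
        freq = Counter(tokenize(text))
        per_module[name] = (len(freq), sum(freq.values()))
        corpus_freq.update(freq)  # shared types are merged, not duplicated
    totals = (len(corpus_freq), sum(corpus_freq.values()))
    return per_module, totals, corpus_freq


# Invented toy texts, for illustration only (not CorPop data).
toy_modules = {
    "jornal_a": "O time venceu o jogo. O jogo foi bom.",
    "jornal_b": "O bairro pede água. A água não chegou.",
}

per_module, totals, freq = module_stats(toy_modules)
print(per_module)           # (types, tokens) for each module
print(totals)               # (types, tokens) for the whole toy corpus
print(freq.most_common(3))  # top of the frequency list
```

In this toy example the two modules have 6 and 7 types, yet the combined corpus has 12, because "o" occurs in both; the token counts (9 and 8) simply add up to 17. The merged counter is also a frequency list in the simplest sense.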

CorPop was inspired by the PorPopular project, developed at UFRGS since 2008, whose goal is to describe and study vocabulary patterns in texts from popular newspapers aimed at low-income readers [3]. The project aims to collect a corpus of popular Brazilian Portuguese from the printed editions of popular newspapers [4] to serve as a reference for studies and research on popular language. CorPop is closely linked to PorPopular, uses part of its collected corpora, and is hosted at http://www.ufrgs.br/textecc/porlexbras/corpop/index.php, a “sister” site to the PorPopular project at http://www.ufrgs.br/textecc/porlexbras/porpopular/.

References

1. CorPop Homepage, http://www.ufrgs.br/textecc/porlexbras/corpop/index.php, last accessed 2018/07/30.
2. PorPopular Homepage, http://www.ufrgs.br/textecc/porlexbras/porpopular/, last accessed 2018/07/30.
3. Finatto, M. J. B.: Complexidade textual em artigos científicos: contribuições para o estudo do texto científico em português. Organon 50, 30-45 (2011).
4. Finatto, M. J. B.; Evers, A.; Pasqualini, B.; Kuhn, T. Z.; Maciel, A. P.: Vocabulário controlado e redação de definições em dicionários de português para estrangeiros: ensaios para uma léxico-estatística textual. Trama 10, 53-68 (2014).
5. Pasqualini, B. F.: CorPop: um corpus de referência do português popular escrito do Brasil. 250 pp. Advisor: Maria José Bocorny Finatto. Doctoral thesis, Universidade Federal do Rio Grande do Sul, Instituto de Letras, Programa de Pós-Graduação em Letras, Porto Alegre, BR-RS (2018).
6. Amorim, G.: Retratos da leitura no Brasil. Instituto Pró-livro / Imprensa Oficial do Estado de São Paulo, São Paulo (2012).