
OBras: a fully annotated and partially human-revised corpus of Brazilian literary works in the public domain

Name: Obras Brasileiras
Link: http://www.linguateca.pt/
Title: OBras: a fully annotated and partially human-revised corpus of Brazilian literary works in the public domain
Presented by: Santos, D., Freitas, C., Bick, E.
Language: Brazilian Portuguese
Language code: pt-BR
Category: resource
Status: available
Type: corpora
Year: 2018

The AC/DC project

Since 1999, Linguateca has developed the AC/DC project so that people can query annotated corpora in Portuguese, with steadily improving annotation quality, a wider range of genres, and more kinds of information (compare [2] with [3]).

Even though all material is fully available for querying, not all corpora included in AC/DC can be distributed in their entirety, due to copyright limitations. Literature has been one of the genres included in AC/DC since its creation, but it is especially prone to availability restrictions. As a result, most literary corpora either include only old texts that are already in the public domain or come with restrictive conditions, as COMPARA does.

For this reason, at least in a first phase, we invested in literature that was already in the public domain, which, as far as Brazil is concerned, basically spans one century. We named the corpus “Obras Brasileiras”, with the acronym OBras, and we are still in the process of adding more texts. The page http://www.linguateca.pt/OBras lists the works included, together with their metadata and size in tokens, and explains how to download the corpus.

Corpus description

The annotations allow a corpus to be described on many levels. We start with a description in terms of part of speech, together with the size and variety of proper names. We then present the counts for the different kinds of semantic data, for version 5.3 of 18 June 2018; see [4] for the annotation.

Table 1. Quantification of OBras according to annotation fields. NB: the multiword cases of colour, body and clothing are not included in the type counts.

Annotation      Tokens          Types (lemmas)
Size            ca. 5 million   151,676
Verbs           842,736         17,134
Nouns           965,805         26,426
Adjectives      289,507         11,087
Proper names    132,210         21,320
Colours         11,932          258
Clothing        10,395          208
Body            54,762          242
Saying verbs    78,219          825
Emotions        132,336         2,185
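
One simple way to read Table 1 is to divide the token count of each annotation field by its number of lemma types, which gives the average number of occurrences per lemma. The sketch below does this in Python with the figures copied from the table; the names `TABLE_1`, `tokens_per_type` and `ranked_by_repetition` are illustrative and not part of AC/DC or OBras. Since multiword cases of colour, body and clothing were left out of the type counts, the ratios for those three fields should be taken as approximate.

```python
# Token and lemma-type counts copied from Table 1 (OBras version 5.3).
# The dictionary name and layout are assumptions made for this sketch.
TABLE_1 = {
    "Verbs": (842_736, 17_134),
    "Nouns": (965_805, 26_426),
    "Adjectives": (289_507, 11_087),
    "Proper names": (132_210, 21_320),
    "Colours": (11_932, 258),
    "Clothing": (10_395, 208),
    "Body": (54_762, 242),
    "Saying verbs": (78_219, 825),
    "Emotions": (132_336, 2_185),
}


def tokens_per_type(tokens: int, types: int) -> float:
    """Average number of occurrences per lemma type."""
    return tokens / types


def ranked_by_repetition(table: dict) -> list:
    """Return (field, tokens per type) pairs, most repeated field first."""
    rows = [
        (name, tokens_per_type(tokens, types))
        for name, (tokens, types) in table.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in ranked_by_repetition(TABLE_1):
        print(f"{name:<14}{ratio:>8.1f}")
```

With these figures, body terms come out as the most repeated field (a small set of lemmas used very often) and proper names as the least repeated (many distinct names, each used only a few times).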

*Thanks to all who have contributed to text preparation and annotation within the scope of Linguateca.

References

  1. Bick, Eckhard: The Parsing System “Palavras”: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus University Press, Aarhus, Denmark (2000)
  2. Santos, Diana, Bick, Eckhard: Providing Internet access to Portuguese corpora: the AC/DC project. In: Gavrilidou, Maria et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), pp. 205-210. ELRA (2000)
  3. Santos, Diana: Corpora at Linguateca: Vision and Roads Taken. In: Berber Sardinha, Tony, Ferreira, Telma de Lurdes São Bento (eds.), Working with Portuguese Corpora, pp. 219-236. Bloomsbury (2014)
  4. Anotação. http://www.linguateca.pt/acesso/anotacao.html. Last accessed 29 July 2018