C-ORAL-BRASIL and its resources

Name: C-ORAL-BRASIL
Link: www.c-oral-brasil.org
Title: C-ORAL-BRASIL and its resources
Presented by: Mello, H., Raso, T., Ferrari, L., Bick, E., Chaves, H.
Language: Brazilian Portuguese
Language code: pt-BR
Category: resource
Status: available
Type: corpora
Year: 2020

The C-ORAL-BRASIL project (www.c-oral-brasil.org) is dedicated to compiling corpora of spontaneous Brazilian Portuguese speech and to developing minicorpora annotated for information structure. Spontaneous speech is understood here as non-planned speech that is composed while it is being performed (Cresti, 2000). It therefore differs not only from read or task-based speech, but also from sociolinguistic interviews and narratives, which comprise a single, mostly monologic and planned type of interaction with little diaphasic variation. The project stemmed from the European C-ORAL-ROM project (Cresti and Moneglia, 2005). The corpus files include: transcriptions in CHAT format annotated with conclusive and non-conclusive prosodic breaks (// and /, respectively); metadata in .txt format; text-to-speech time alignment produced with WinPitch (Martin, 2015) in .xml format; and transcriptions tagged for part of speech and syntax with PALAVRAS (Bick, 2000). The corpora can be queried through a dedicated tool, DB-CoM (www.c-oral-brasil.org/db-com), which supports KWIC, lemma, POS and regular-expression searches that can be restricted by metadata and utterance type. Among the project's corpora, the C-ORAL-BRASIL I informal spontaneous speech corpus (Raso & Mello, 2012) and the C-ORAL-BRASIL II corpus (which comprises the Natural Context Formal, Media and Telephonic subcorpora) are especially relevant. Their sizes are given in Table 1 below.
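
As an illustration of how the two prosodic break markers structure a transcription, the following is a minimal Python sketch that splits one transcribed line into utterances (closed by //) and, within each utterance, into tone units (separated by /). It rests on assumptions not taken from the project documentation: the function name `segment` is hypothetical, the sample line is invented, the speaker prefix is assumed to follow the general CHAT layout (`*ABC:`), and any other annotation symbols the corpora may use are ignored.

```python
import re


def segment(line: str) -> list[list[str]]:
    """Split one transcribed line into utterances, each a list of tone units.

    Only the two markers named in the text are handled:
    '//' (conclusive break) and '/' (non-conclusive break).
    """
    # Drop a CHAT-style speaker prefix such as "*ABC:" if present (assumed layout).
    text = re.sub(r"^\*\w+:\s*", "", line.strip())
    utterances = []
    # Split on the conclusive break first, so '//' is never read as two '/'.
    for chunk in text.split("//"):
        units = [u.strip() for u in chunk.split("/") if u.strip()]
        if units:
            utterances.append(units)
    return utterances


# Invented sample line, for illustration only.
sample = "*ABC: first unit / second unit // next utterance //"
for number, units in enumerate(segment(sample), start=1):
    print(number, units)
```

Splitting on // before / keeps the two break types apart without needing a more elaborate tokenizer; a real reader for the corpus files would have to follow the project's full transcription conventions.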

Table 1. C-ORAL-BRASIL corpora overall size

| Corpus | Number of files | Number of words | Number of utterances |
|---|---|---|---|
| Informal | 139 | 208,130 | 31,442 |
| Natural Context Formal | 74 | 121,396 | 10,599 |
| Media | 101 | 139,647 | 13,005 |
| Telephonic | 79 | 31,308 | 5,850 |
| Total | 393 | 500,481 | 60,896 |
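
The figures in Table 1 can be cross-checked with a few lines of Python. The sketch below copies the per-corpus counts from the table, confirms that they add up to the reported totals, and derives the mean number of words per utterance for each corpus. The derived averages are not reported in the source, and the variable names are illustrative.

```python
# (files, words, utterances) per corpus, copied from Table 1.
rows = {
    "Informal": (139, 208_130, 31_442),
    "Natural Context Formal": (74, 121_396, 10_599),
    "Media": (101, 139_647, 13_005),
    "Telephonic": (79, 31_308, 5_850),
}
reported_total = (393, 500_481, 60_896)

# Sum each column and compare it with the "Total" row of the table.
computed_total = tuple(sum(column) for column in zip(*rows.values()))
assert computed_total == reported_total

# Derived figure (not in the source): mean utterance length in words.
for name, (files, words, utterances) in rows.items():
    print(f"{name}: {words / utterances:.1f} words per utterance")
```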

References

  1. Bick, E. The Parsing System “Palavras”: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. University of Århus, Århus (2000).
  2. Cresti, E. Corpus di italiano parlato. Presso L’Accad. della Crusca, Firenze (2000).
  3. Cresti, E., Moneglia, M. C-ORAL-ROM: integrated reference corpora for spoken Romance languages. John Benjamins, Amsterdam/New York (2005).
  4. Martin, P. WinPitch. In: The Structure of Spoken Language: Intonation in Romance, pp. 259-271. Cambridge University Press, Cambridge (2015).
  5. Raso, T., Mello, H. C-ORAL-BRASIL I: corpus de referência do português brasileiro falado informal. Editora UFMG, Belo Horizonte (2012).