The C-ORAL-BRASIL project (www.c-oral-brasil.org) is dedicated to the compilation of Brazilian Portuguese spontaneous speech corpora and to the development of informationally annotated minicorpora. Spontaneous speech is understood as unplanned speech, constructed while it is being performed (Cresti, 2000). It therefore differs not only from read or task-based speech, but also from sociolinguistic interviews or narratives, which comprise a single type of mostly monologic, planned interaction with little diaphasic variation. The project stemmed from the European C-ORAL-ROM project (Cresti and Moneglia, 2005). The corpus files include: transcriptions in CHAT format, enriched with annotations of conclusive and non-conclusive prosodic breaks (marked // and /, respectively); metadata in .txt format; text-to-speech time alignment produced with WinPitch (Martin, 2015) in .xml format; and part-of-speech (PoS) and syntactically tagged transcriptions produced with PALAVRAS (Bick, 2000). The corpora can be queried through a dedicated tool, the DB-CoM (www.c-oral-brasil.org/db-com), which supports KWIC, lemma, PoS and regular-expression searches that can be restricted by metadata and utterance type. Among the C-ORAL-BRASIL project corpora, the C-ORAL-BRASIL I informal spontaneous speech corpus (Raso and Mello, 2012) and the C-ORAL-BRASIL II corpus (which includes the Natural Context Formal, the Media and the Telephonic subcorpora) are especially relevant. Their sizes are given in Table 1 below.
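The break annotation described above can be illustrated with a minimal Python sketch that splits one transcribed line into utterances (at //) and into the segments between non-conclusive breaks (at /). The sample line, the speaker tag and the function name `split_breaks` are invented for illustration and are not taken from the corpus; the sketch also assumes that no other transcription symbols contain a slash, which may not hold for the actual files.

```python
import re

# Hypothetical line: speaker tag and wording are invented for illustration.
SAMPLE = "*ANA: well / I went there yesterday // and it was closed //"

def split_breaks(line):
    """Return (speaker, utterances); each utterance is a list of segments."""
    # Assumption: a CHAT-style speaker tag of the form "*XXX:" opens the line.
    match = re.match(r"\*(\w+):\s*(.*)", line)
    speaker, text = (match.group(1), match.group(2)) if match else (None, line)
    utterances = []
    for chunk in text.split("//"):  # a conclusive break closes an utterance
        # non-conclusive breaks separate segments inside the utterance
        units = [u.strip() for u in chunk.split("/") if u.strip()]
        if units:
            utterances.append(units)
    return speaker, utterances

speaker, utterances = split_breaks(SAMPLE)
print(speaker, utterances)
```

Splitting on // before / keeps the two break types apart without a more elaborate tokenizer; a real parser would need to handle the remaining transcription conventions as well.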
Table 1. Overall size of the C-ORAL-BRASIL corpora
Corpus | Number of files | Number of words | Number of utterances |
---|---|---|---|
Informal | 139 | 208,130 | 31,442 |
Natural Context Formal | 74 | 121,396 | 10,599 |
Media | 101 | 139,647 | 13,005 |
Telephonic | 79 | 31,308 | 5,850 |
Total | 393 | 500,481 | 60,896 |
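As a quick consistency check, the short Python sketch below sums the per-corpus figures from Table 1 and compares them with the reported totals; it also derives the mean number of words per utterance for each corpus. The variable names are illustrative, and the derived ratio is only a calculation from the table, not a figure reported by the project.

```python
# Figures copied from Table 1: (files, words, utterances)
rows = {
    "Informal": (139, 208_130, 31_442),
    "Natural Context Formal": (74, 121_396, 10_599),
    "Media": (101, 139_647, 13_005),
    "Telephonic": (79, 31_308, 5_850),
}
reported_total = (393, 500_481, 60_896)

# Sum each column across the four corpora
computed_total = tuple(sum(column) for column in zip(*rows.values()))
print("totals match:", computed_total == reported_total)

# Derived measure: mean words per utterance in each corpus
for name, (files, words, utterances) in rows.items():
    print(f"{name}: {words / utterances:.1f} words per utterance")
```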
References
- Bick, E. The Parsing System “Palavras”: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. University of Århus, Århus (2000).
- Cresti, E. Corpus di italiano parlato. Presso L’Accad. della Crusca, Firenze (2000).
- Cresti, E., Moneglia, M. C-ORAL-ROM: integrated reference corpora for spoken Romance languages. John Benjamins, Amsterdam/New York (2005).
- Martin, P. WinPitch. In: The Structure of Spoken Language: Intonation in Romance, pp. 259-271. Cambridge University Press, Cambridge (2015).
- Raso, T., Mello, H. C-ORAL-BRASIL I: corpus de referência do português brasileiro falado informal. Editora UFMG, Belo Horizonte (2012).