Corpus de Referência do Português Contemporâneo: freely available subcorpora

Name: CRPC
Title: Corpus de Referência do Português Contemporâneo: freely available subcorpora
Presented by: Mendes, A.
Languages: Brazilian Portuguese, Portuguese
Language codes: pt-BR, pt-PT
Category: resource
Status: available
Type: corpora
Year: 2020

The CRPC corpus and free subcorpora

The CRPC is a large electronic corpus of European Portuguese and other national varieties [1]. It contains 311.4 million words and covers several types of written texts (literary, newspaper, technical, etc.) and spoken texts (formal and informal). Due to copyright restrictions, the written subpart of the CRPC (309 million words) can only be searched online. Specific subparts that are free from copyright restrictions have been made freely available for academic use and are described below.

Português Fundamental: A spoken corpus of European Portuguese, collected between 1970 and 1974, composed of 137 recordings that are transcribed, aligned in Exmaralda and tagged with PoS. catalogue.elra.info/en-us/repository/browse/ELRA-S0346

Português Falado: A spoken corpus of Portuguese varieties around the world, with 86 recordings: Portugal (30), Brazil (20), the five African countries with Portuguese as their official language (5 each), Macao (5), Goa (3) and East Timor (3). The corpus is transcribed, aligned with Exmaralda and tagged with PoS. catalogue.elra.info/en-us/repository/browse/ELRA-S0345

LT Corpus: The Literary Corpus contains approximately 1,781,083 running words from 70 copyright-free classics of European and Brazilian literature (61 from Portugal and 9 from Brazil) published before 1940. https://catalogue.elra.info/en-us/repository/browse/ELRA-W0059/

PTPARL Corpus: The PTPARL Corpus contains 1,076 transcriptions of Portuguese Parliament sessions. The corpus contains 1,000,441 tokens, annotated with PoS and NP chunks. We plan to freely release the entire subpart of parliamentary sessions in the near future. catalogue.elra.info/en-us/repository/browse/ELRA-W0060

References

  1. Généreux, M., Hendrickx, I., Mendes, A.: Introducing the Reference Corpus of Contemporary Portuguese On-Line. In: Calzolari, N., Choukri, K., Declerck, T., Dogan, M.U., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S. (eds.) LREC’2012 – Eighth International Conference on Language Resources and Evaluation. pp. 2237–2244. European Language Resources Association (ELRA), Istanbul, Turkey (May 2012)