The CRPC corpus and free subcorpora
The CRPC is a large electronic corpus of European Portuguese and other national varieties [1]. It contains 311.4 million words and covers several types of written texts (literary, newspaper, technical, etc.) and spoken texts (formal and informal). Due to copyright restrictions, the written subpart of the CRPC (309 million words) can only be searched online. Specific subparts that are free from copyright restrictions have been made freely available for academic use and are described below.
Português Fundamental: A spoken corpus of European Portuguese, collected between 1970 and 1974, composed of 137 recordings that are transcribed, aligned in Exmaralda and tagged with PoS. catalogue.elra.info/en-us/repository/browse/ELRA-S0346
Português Falado: A spoken corpus of Portuguese varieties around the world with 86 recordings: Portugal (30), Brazil (20), five African countries with Portuguese as an official language (5 each), Macao (5), Goa (3) and East Timor (3). The corpus is transcribed, aligned with Exmaralda and tagged with PoS. catalogue.elra.info/en-us/repository/browse/ELRA-S0345
LT Corpus: The Literary Corpus contains approximately 1,781,083 running words from 70 copyright-free classics of European and Brazilian literature (61 from Portugal and 9 from Brazil) published before 1940. https://catalogue.elra.info/en-us/repository/browse/ELRA-W0059/
PTPARL Corpus: The PTPARL Corpus contains 1,076 transcriptions of sessions of the Portuguese Parliament. The corpus contains 1,000,441 tokens and is annotated with PoS tags and NP chunks. We plan to freely release the full subpart of parliamentary sessions in the near future. catalogue.elra.info/en-us/repository/browse/ELRA-W0060
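As a rough illustration of how the freely available subcorpora listed above could be catalogued programmatically, the following Python sketch records each subcorpus with its ELRA identifier and checks that the regional breakdown of Português Falado adds up to the 86 recordings reported. The names Subcorpus and catalogue_url are hypothetical, and the sketch assumes that all four catalogue pages follow the https URL pattern given for the LT Corpus; it does not download or query anything.

```python
from dataclasses import dataclass

# Assumption: every subcorpus page follows the URL pattern shown for the LT Corpus.
ELRA_BASE = "https://catalogue.elra.info/en-us/repository/browse/"


@dataclass(frozen=True)
class Subcorpus:
    """Hypothetical record for one freely available CRPC subcorpus."""
    name: str
    elra_id: str
    modality: str  # "spoken" or "written"


SUBCORPORA = [
    Subcorpus("Português Fundamental", "ELRA-S0346", "spoken"),
    Subcorpus("Português Falado", "ELRA-S0345", "spoken"),
    Subcorpus("LT Corpus", "ELRA-W0059", "written"),
    Subcorpus("PTPARL Corpus", "ELRA-W0060", "written"),
]

# Recordings per region in Português Falado, as listed in the description.
FALADO_RECORDINGS = {
    "Portugal": 30,
    "Brazil": 20,
    "African countries (5 countries, 5 recordings each)": 5 * 5,
    "Macao": 5,
    "Goa": 3,
    "East Timor": 3,
}


def catalogue_url(subcorpus: Subcorpus) -> str:
    """Build the (assumed) ELRA catalogue URL for a subcorpus."""
    return f"{ELRA_BASE}{subcorpus.elra_id}/"


falado_total = sum(FALADO_RECORDINGS.values())

for sub in SUBCORPORA:
    print(f"{sub.name} ({sub.modality}): {catalogue_url(sub)}")
print(f"Português Falado recordings: {falado_total}")
```

The regional counts sum to 86, which matches the total stated for Português Falado.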
References
[1] Généreux, M., Hendrickx, I., Mendes, A.: Introducing the Reference Corpus of Contemporary Portuguese On-Line. In: Calzolari, N., Choukri, K., Declerck, T., Dogan, M.U., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S. (eds.) LREC’2012 – Eighth International Conference on Language Resources and Evaluation. pp. 2237–2244. European Language Resources Association (ELRA), Istanbul, Turkey (May 2012)