
OpenCor 2020


March 2, 2020. Colégio Espírito Santo, University of Evora. Room 120

  • 9h30-9h50 Carvalho, um corpus diacrónico em ortografia original para português, espanhol e inglês, José Ramom Pichel, Pablo Gamallo Otero, Marco Neves and Iñaki Alegria

  • 9h50-9h55 Video: Brands.Br – a Portuguese Reviews Corpus, Evandro Fonseca, Amanda Oliveira, Carolina Gadelha and Valter H. Guandaline

  • 9h55-10h20 B2W-Reviews01-Opinion, an annotated review sample, Livy Real, Alissa Bento, Karina Soares, Marcio Oshiro and Alexandre Mafra

  • 10h20-10h25 Video: Pre-trained Portuguese BERT models, Fábio Souza, Rodrigo Nogueira and Roberto Lotufo

  • 10h25-10h55 Petrolês: primeiros passos na construção de um corpus de domínio, Cláudia Freitas

  • 10h55-11h Video: English-Portuguese parallel corpus made of song lyrics, Valter Martins and Larissa Freitas

  • 11h-11h30 Coffee Break

  • 11h30-12h Corpus de Referência do Português Contemporâneo: freely available subcorpora, Amália Mendes

  • 12h-13h Invited Speaker: An overview of open language resources for Galician, Marcos Garcia

Invited Talk

An overview of open language resources for Galician, Marcos Garcia (Universidade da Coruña)

In this talk I will present an overview of open and freely available language resources for Galician, including electronic dictionaries, corpora, and annotated datasets for different tasks. I will also present the results of some experiments which exploit the strong similarity between Galician and Portuguese to perform some NLP tasks and to speed up the construction of annotated resources.


Recent years have seen a move in Computational Linguistics towards bigger, better, and more reliably annotated corpora. However, the availability of such reliably annotated corpora is one of the big bottlenecks for processing natural language. Producing and maintaining corpora is a hard task that usually requires sizeable funding and the cooperation of several experts. Although having such corpora available is clearly essential, the many difficulties and the amount of work involved make producing this data and making it available a non-trivial proposition. While “big data” is a trend, producing reliable corpora continues to be an invisible task in Natural Language Processing. Especially when working on languages other than English, on smaller datasets not immediately suitable for machine learning approaches, or on a new release of a previous dataset, it is not obvious to corpus creators how to publish and properly discuss their work. Most of the biggest Natural Language Processing venues are not open to accepting corpus descriptions. The situation is even worse for minority and endangered languages, since most of them do not have a related venue where this work can be discussed.

The Latin American and Iberian communities that produce open corpora do not have an established event that makes it possible for experts to share ideas, discuss difficulties, and get feedback on their work. Various meetings have been held in recent years, but either they were not generic enough to embrace all the corpus work done in these communities, or there was no continuation or support for future editions. As a result, it is not rare for groups that share related interests or face the same difficulties to be unaware of other groups and their recent work within these communities.

This forum aims both to fill the gap by providing a permanent venue for the construction, annotation, and maintenance of open corpora for Latin American and Iberian languages and to create an extensive list of these resources. OpenCor welcomes discussions on Portuguese, Spanish, indigenous languages, creoles, Galician, Catalan, Aragonese, Astur-Leonese, Aranese and any other language spoken in Latin American and Iberian countries. Work on endangered, minority, and/or less-resourced languages is particularly welcome.

The venue

This is the third edition of the OpenCor Forum, an attempt to gather the community that produces, maintains and makes freely available language resources for the large variety of languages spoken in Iberian countries and in Latin America. All accepted works will also be part of the OpenCor list, an initiative to catalog open resources produced for the targeted languages. This forum welcomes, but is not restricted to, the following topics:

  • releases of new open data sets
  • descriptions of established open corpora
  • guidelines creation, annotation strategies, and best practices discussion
  • corpora maintenance and management
  • corpora curation and assessment
  • corpora design and evaluation
  • corpora creation strategies and difficulties faced by the community


We invite submissions of anonymized extended abstracts of up to one page, with references. Documents must follow the Springer LNCS format and must be written in an Iberian or Latin American language or in English. Accepted extended abstracts will serve as the description of the submitted corpus on the OpenCor List. OpenCor is non-archival; therefore, works that have been or are planned to be published elsewhere are also welcome.

Authors must submit a link to their resources together with the extended abstract. One of the goals of OpenCor is to provide a full list of resources and described languages by the end of the forum. We hope this list will be helpful in keeping track of freely available resources for our targeted languages.

Since raising funds is one of the main challenges for these communities, all accepted works will be available on the forum page and will appear in the resources list, even if no author can attend the forum. Authors who attend will have the chance to give an extended talk during the forum; those who cannot will be asked for a five-minute video, hosted on any online video platform, to be shown during the event. Authors must indicate at submission time how they want to participate in the OpenCor Forum 2020. This is an attempt to create an extensive list of available open corpora that does not depend on how much funding the working groups have.

Submission link:

LaTeX stylesheet
MS Word stylesheet


NEW submission deadline: February 14, 2020 (OLD: February 10, 2020)

Acceptance: February 17, 2020

Session: March 2nd to 4th, 2020

Forum Registration

This edition is part of the PROPOR 2020 conference. For registration fees, check the official website.


  • Livy Real – University of São Paulo / GLiC
  • Ivan Vladimir Meza Ruiz – Universidad Nacional Autónoma de Mexico / IIMAS

Program Committee

  • Aline Villavicencio
  • Esau Villatoro-Tello
  • Fernanda López
  • Gabriela Ramírez-De-La-Rosa
  • Ivan Vladimir Meza Ruiz
  • Jesús Mager
  • Livy Real
  • Luís Trigo
  • Maria Jose Bocorny Finatto
  • Pablo Gamallo
  • Thiago Pardo


Questions may be sent to the organizers: livyreal [at]; ivanvladimir [at]