CFP: OpenCor 2024 • Latin American and Iberian Languages Open Corpora Forum

This will be the fifth edition of OpenCor, an annual venue that aims to gather the community to work on freely available language resources for the variety of languages spoken in Iberian countries and Latin America.

Recent years have seen a move in Computational Linguistics towards bigger, better, and more reliably annotated corpora. However, the scarcity of such reliably annotated corpora remains one of the serious bottlenecks for processing natural language. Producing and maintaining corpora is a difficult task that usually requires sizeable funding and the cooperation of several experts. Although having such corpora available is essential, the many difficulties and the amount of work involved make creating this data and making it available a non-trivial proposition. Producing reliable corpora continues to be an invisible task in Natural Language Processing. Especially when working on languages other than English, on smaller datasets not immediately suitable for machine learning approaches, or on a new release of a previous dataset, it is often unclear to corpus creators how to publish and properly discuss their work. Most of the biggest Natural Language Processing venues do not accept corpus descriptions. The situation is even worse for minority and endangered languages, since most of them have no related venue where these works can be discussed.

The Latin American and Iberian communities that produce open corpora have yet to establish an event that allows experts to share ideas, discuss difficulties, and get feedback on their work. Different meetings have been held in recent years, but either they were not generic enough to embrace all the corpus work done in these communities, or they lacked continuity and support for future editions. As a result, it is common for groups that share related interests or face the same difficulties to be unaware of other groups and their recent work within these communities.

This forum aims both to fill the gap left by the lack of a permanent venue for the construction, annotation, and maintenance of open corpora for Latin American and Iberian languages and to create an extensive list of these resources. OpenCor welcomes discussions on Portuguese, Spanish, indigenous languages, creoles, Galician, Catalan, Aragonese, Astur-Leonese, Aranese, and other languages spoken in Latin America or Iberian countries. Work on endangered, minority, and/or less-resourced languages is particularly welcome.

The venue

This is the fifth edition of the OpenCor Forum, which gathers the community that produces, maintains, and distributes freely available language resources for the large variety of languages spoken in Iberian countries and Latin America. All accepted works will also be part of the OpenCor list, an initiative to catalogue the open resources produced for the targeted languages. This forum welcomes, but is not restricted to, the following topics:

releases of new open data sets
descriptions of established open corpora
guidelines creation, annotation strategies, and best practices discussion
corpora maintenance and management
corpora curation and assessment
corpora design and evaluation
corpora creation strategies and difficulties faced by the community
ethical aspects of corpora creation

This edition of the OpenCor Forum will be held on site, as part of PROPOR 2024 - 16th International Conference on Computational Processing of Portuguese, in Santiago de Compostela (Galiza) from 14th to 15th March.

Submission

We invite submissions of anonymized extended abstracts of up to one page, with references. Documents must follow Springer LNCS guidelines and must be written in an Iberian or Latin American language or in English. Accepted extended abstracts will serve as the description of the submitted corpus on the OpenCor List. OpenCor is non-archival; therefore, works that have been or are planned to be published elsewhere are also welcome.

Authors need to submit a link to their resources together with the extended abstract. One of the goals of OpenCor is to provide a full list of resources and described languages by the end of the forum. We hope this list will help keep track of freely available resources for our targeted languages.

Considering that one of the main challenges for these communities is raising funds, all accepted works will be available on the forum page and will appear in the resources list, even if no author can attend the forum. Authors who can attend will have the chance to give a talk during the forum. Those who cannot will be asked for a five-minute video, hosted on any online video platform, which will be shown during the event. Authors must indicate at submission how they want to participate in the OpenCor Forum 2024. This is an attempt to create an extensive list of available open corpora that does not depend on how much funding the working groups have.

Schedule

Deadline for submission: January 21st (extended from January 17th)
Acceptance notification: February 1st
Workshop: Day to be announced, March 14th or 15th

Forum Registration

To be announced

Organization

Livy Real – americanas s.a. Digital Lab - livyreal [at] gmail.com
Ivan Vladimir Meza Ruiz – IIMAS/UNAM
Valeria de Paiva - Topos Institute

Program Committee

Carlos Ricardo Cruz Mendoza, IIMAS, UNAM
Carlos Hernandez-Mena, BSC
Fernanda López, ENaCiF-UNAM
Amália Mendes,
Victor Mijangos, FC-UNAM
Manuel Alejandro Sanchez Fernandez, UABC

Contact

Questions can be sent to the organizers: livyreal [at] gmail.com; ivanvladimir [at] turing.iimas.unam.mx