The Catalan Language CLUB • Latin American and Iberian Languages Open Corpora Forum

Name	Catalan Language Understanding Benchmark (CLUB)
Link	https://github.com/projecte-aina/club
Title	The Catalan Language CLUB
Presented by	Rodríguez-Penagos, C., Armentano-Oller, C., Villegas, M., Melero, M., González-Agirre, A., Gibert, O., Carrino, C.
Language	Catalan
Language code	cat
Category	resource
Status	available
Type	benchmark
Year	2021

Two public funding initiatives, PlanTL and AINA, provide the Catalan language with the tooling and resources needed to bring modern AI models to industry, commerce and society in general. These efforts incorporate corpus annotation best practices and, at the same time, foster local annotation companies that can handle the complex tasks required for data science and modern AI.

The Catalan Language Understanding Benchmark (CLUB) enables the evaluation of models and downstream applications for real, practical use. It includes the following datasets:

TECa, Textual Entailment for Catalan, containing more than 20,000 annotated pairs of sentences, with Neutral, Inference and Contradiction labels.
TeCla, Textual Classification for Catalan, a news corpus for thematic text classification, with 153,265 newswire articles classified under 30 different categories.
VilaQuAD and ViquiQuAD, two extractive Question Answering datasets from newswire and Wikipedia, comprising more than 20,000 questions and answer segments, in addition to a professionally translated version of the XQuAD dataset for Catalan.
STS-ca, a corpus for evaluating Semantic Textual Similarity in Catalan, with more than 3,000 sentence pairs, each annotated with the semantic similarity between the two sentences on a scale from 0 (no similarity at all) to 5 (semantic equivalence).
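
As an illustration of how the 0 to 5 similarity scale can be used in evaluation, the sketch below scores a model by the Pearson correlation between its predicted similarities and the gold annotations. Correlation is a common choice for STS tasks in general; this entry does not state which metric CLUB uses, so treat the metric as an assumption. The `pearson` helper and the `gold` and `pred` values are hypothetical and are not taken from STS-ca.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for illustration only (not real STS-ca data).
gold = [0.0, 1.5, 3.0, 4.5, 5.0]  # annotated similarity on the 0-5 scale
pred = [0.4, 1.0, 3.2, 4.0, 4.9]  # scores produced by some model

score = pearson(gold, pred)
print(f"Pearson correlation: {score:.3f}")
```

A score close to 1 means the model ranks and spaces sentence pairs much as the annotators did; a score near 0 means its predictions carry little information about the annotated similarity.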