The Catalan Language CLUB

Name: Catalan Language Understanding Benchmark (CLUB)
Link: https://github.com/projecte-aina/club
Title: The Catalan Language CLUB
Presented by: Rodríguez-Penagos, C., Armentano-Oller, C., Villegas, M., Melero, M., González-Agirre, A., Gibert, O., Carrino, C.
Language: Catalan
Language code: cat
Category: resource
Status: available
Type: benchmark
Year: 2021

Two public funding initiatives, PlanTL and AINA, provide the Catalan language with the tooling and resources needed to bring modern AI models to industry, commerce and society in general. These efforts have incorporated corpus annotation best practices and, at the same time, fostered local annotation companies that can handle the complex tasks required for data science and modern AI.

The Catalan Language Understanding Benchmark (CLUB) enables the evaluation of models and downstream applications for real, practical use. It includes:

  • TECa, Textual Entailment for Catalan, containing more than 20,000 annotated pairs of sentences, with Neutral, Inference and Contradiction labels.
  • TeCla, Textual Classification for Catalan, a news corpus for thematic text classification, with 153,265 newswire articles classified under 30 different categories.
  • VilaQuAD and ViquiQuAD, two extractive Question Answering datasets from newswire and Wikipedia, comprising more than 20,000 questions and answer segments, in addition to a professionally translated version of the XQuAD dataset for Catalan.
  • STS-ca, a corpus for evaluating Semantic Textual Similarity in Catalan, with more than 3,000 sentence pairs, each annotated with the semantic similarity between the two sentences on a scale from 0 (no similarity at all) to 5 (semantic equivalence).
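
To make the annotation schemes above concrete, here is a minimal Python sketch of how a TECa-style entailment pair and an STS-ca-style similarity pair might be represented in memory. The class and field names (`EntailmentPair`, `premise`, `hypothesis`, `SimilarityPair`, `sentence1`, `sentence2`, `score`) are assumptions chosen for illustration and are not the published schema of either dataset. The example sentences are invented and are not taken from the corpora. Only the three labels and the 0 to 5 scale come from the descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class EntailmentLabel(Enum):
    # The three labels named in the TECa description.
    NEUTRAL = "neutral"
    INFERENCE = "inference"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class EntailmentPair:
    # Hypothetical field names, for illustration only.
    premise: str
    hypothesis: str
    label: EntailmentLabel


@dataclass(frozen=True)
class SimilarityPair:
    # Hypothetical field names, for illustration only.
    sentence1: str
    sentence2: str
    score: float  # 0 = no similarity at all, 5 = semantic equivalence

    def __post_init__(self) -> None:
        # Reject scores outside the 0-5 scale described for STS-ca.
        if not 0.0 <= self.score <= 5.0:
            raise ValueError(f"score must be between 0 and 5, got {self.score}")


# Invented examples, not drawn from the actual datasets.
entailment_example = EntailmentPair(
    premise="El gat dorm al sofà.",
    hypothesis="Un animal descansa.",
    label=EntailmentLabel.INFERENCE,
)
similarity_example = SimilarityPair(
    sentence1="El gat dorm al sofà.",
    sentence2="El gat dorm al sofà.",
    score=5.0,
)
```

Validating the score at construction time is a design choice of this sketch: it surfaces out-of-range annotations early, before they reach an evaluation script.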