The Multilingual Corpus of Survey Questionnaires

Name: The Multilingual Corpus of Survey Questionnaires (MCSQ)
Link: https://www.upf.edu/web/mcsq/
Title: The Multilingual Corpus of Survey Questionnaires
Presented by: Zavala-Rojas, D.; Sorato, D.
Languages: English, Catalan, Czech, French, German, Norwegian, Portuguese, Spanish, Russian (29 language varieties)
Language codes: eng, cat, ces, fra, deu, nor, por, spa, rus
Category: resource
Status: available
Type: corpora
Year: 2021

The Multilingual Corpus of Survey Questionnaires (MCSQ) is the first publicly available corpus of survey questionnaires. Its third version, entitled Rosalind Franklin, contains approximately 766,000 sentences and more than 4 million tokens. It comprises 306 distinct questionnaires designed in (British) English as the source language, together with their translations into Catalan, Czech, French, German, Norwegian, Portuguese, Spanish, and Russian, for a total of 29 country-language combinations (e.g., Switzerland-French). The MCSQ was designed and implemented following the FAIR principles, and its contents are freely available through a specially tailored user interface.

The MCSQ covers more than 40 years of survey research from large-scale comparative survey projects that provide cross-national and cross-cultural data to the Social Sciences and Humanities (SSH), namely the European Social Survey (ESS), the European Values Study (EVS), the Survey of Health, Ageing and Retirement in Europe (SHARE), and the WageIndicator Survey (WIS). All questionnaires in the MCSQ are composed of survey items. A survey item is a request for an answer with a set of answer options, and it may include additional textual elements that guide interviewers and clarify the information that respondents should understand and provide. Except in the case of the WIS, the translations were produced following the TRAPD (Translation, Review, Adjudication, Pretesting, and Documentation) method, a team approach to the translation of survey questionnaires.
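The structure of a survey item described above can be pictured as a small record. The following is only an illustrative sketch: the class name, the field names, and the example text are assumptions made up for this illustration and do not reflect the MCSQ's actual data model or contents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SurveyItem:
    """Hypothetical container for one survey item (names are illustrative)."""

    request: str  # the request for an answer
    answer_options: List[str] = field(default_factory=list)
    # additional text guiding interviewers or clarifying the request
    instructions: List[str] = field(default_factory=list)

    def segments(self) -> List[str]:
        """Return every text segment of the item in reading order."""
        return [*self.instructions, self.request, *self.answer_options]


# Made-up example text, not taken from the corpus.
item = SurveyItem(
    request="How often do you read a newspaper?",
    answer_options=["Every day", "Once a week", "Less often", "Never"],
    instructions=["Ask all respondents."],
)

for segment in item.segments():
    print(segment)
```

Keeping the instructions, the request, and the answer options in separate fields makes it straightforward to treat each kind of text segment differently in later processing.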

Questionnaires included in the MCSQ were obtained from the survey projects' archives in distinct formats, such as spreadsheets, XML, and PDF files. The PDF files had to be converted to plain text before entering the preprocessing pipeline. The texts were then extracted from the input files, preprocessed, sentence-aligned with respect to the English source, and annotated with part-of-speech (POS) and named entity recognition (NER) tags.
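To give a rough idea of what aligning translated segments to the English source can look like, the sketch below pairs segments that share an item identifier and a segment type, in order of appearance. This pairing rule, the function names, the tuple layout, and the example sentences are all assumptions for illustration; the description above does not specify the alignment strategy actually used for the MCSQ.

```python
from itertools import zip_longest
from typing import Dict, List, Optional, Tuple

# (item_id, segment_type, text); all identifiers here are hypothetical.
Segment = Tuple[str, str, str]
AlignedRow = Tuple[str, str, Optional[str], Optional[str]]


def group(segments: List[Segment]) -> Dict[Tuple[str, str], List[str]]:
    """Group segment texts by (item_id, segment_type), preserving order."""
    groups: Dict[Tuple[str, str], List[str]] = {}
    for item_id, seg_type, text in segments:
        groups.setdefault((item_id, seg_type), []).append(text)
    return groups


def align_to_source(source: List[Segment], target: List[Segment]) -> List[AlignedRow]:
    """Pair source and target segments by position within each group.

    Segments without a counterpart are paired with None.
    """
    src, tgt = group(source), group(target)
    # Source groups first, then any groups that exist only in the target.
    keys = list(src) + [key for key in tgt if key not in src]
    rows: List[AlignedRow] = []
    for item_id, seg_type in keys:
        pairs = zip_longest(src.get((item_id, seg_type), []),
                            tgt.get((item_id, seg_type), []))
        for source_text, target_text in pairs:
            rows.append((item_id, seg_type, source_text, target_text))
    return rows


# Made-up example data, not taken from the corpus.
source = [
    ("Q1", "request", "How satisfied are you with your job?"),
    ("Q1", "response", "Satisfied"),
    ("Q1", "response", "Dissatisfied"),
]
target = [
    ("Q1", "request", "Wie zufrieden sind Sie mit Ihrer Arbeit?"),
    ("Q1", "response", "Zufrieden"),
]

rows = align_to_source(source, target)
for row in rows:
    print(row)
```

In this sketch a source segment with no translated counterpart is kept and paired with `None`, so gaps in a translation stay visible instead of shifting later segments out of alignment.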