
A brief description of SICK-BR

Name: SICK-BR
Link: https://github.com/livyreal/SICK-BR
Title: A brief description of SICK-BR
Presented by: Real, L., Rodrigues, A., Silva, A., Thalenberg, B., Guide, B., Silva, C., Câmara, I., Lima, G., Souza, R., Paiva, V.
Language: Brazilian Portuguese
Language code: pt-BR
Category: resource
Status: available
Type: corpora
Year: 2018

We aim at a simple corpus for NLI/RTE in Portuguese. A resource like the one we aim for is already available for English: SICK (Sentences Involving Compositional Knowledge) [4]. Since building such a resource from scratch is very time-consuming and needs financial support, we decided to bootstrap the creation of a Portuguese corpus by translating and adapting SICK, giving rise to SICK-BR. This approach has the added value of yielding a parallel corpus, since the pairs of SICK and SICK-BR are aligned and carry the same labels.

SICK is simplified in aspects of language processing not fundamentally related to compositionality: there are no named entities, the tenses have been simplified to the progressive only, there are few modifiers, etc. The data set consists of 9840 English sentence pairs (composed of some 6k unique sentences), generated from existing sets of image captions.
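As an illustration of the scale involved, the sketch below counts the pairs and the unique sentences to be translated. It assumes the standard tab-separated SICK release with columns named sentence_A and sentence_B; the file name SICK.txt is only a placeholder.

    import csv

    # Load the original SICK release. The file name and the tab-separated
    # layout with sentence_A / sentence_B columns are assumptions about the
    # standard distribution, not something defined by SICK-BR itself.
    with open("SICK.txt", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))

    unique_sentences = {r["sentence_A"] for r in rows} | {r["sentence_B"] for r in rows}

    print(len(rows), "sentence pairs")                # expected: 9840
    print(len(unique_sentences), "unique sentences")  # expected: roughly 6k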

To obtain an open corpus like SICK for Portuguese, and especially to make use of the human annotations in SICK, we started the process of translating and adapting SICK for Portuguese. Due to the nature of SICK, it is not possible to ‘simply’ translate the original pairs and directly obtain properly assigned inference and relatedness labels. We needed to be sure that our translations preserve exactly the same truth-conditional semantics as the original pairs. We also wanted to keep, as much as possible, the same kinds of phenomena that SICK covers. Another goal was to preserve the relatedness between the paired sentences, which imposes challenges on lexical choices.

We started from an automatic translation⁴ of the 6k unique sentences that compose SICK. We then reviewed all these translations against our goals: i. the translation should keep the same truth value as the original sentence; ii. we keep, as much as possible, the same lexical choices across the corpus; iii. we keep the same phenomena the original sentence showcases; iv. we keep the Portuguese sentences sounding as natural as possible.

To assure the quality of this work, we adopted strategies such as maintaining a glossary and a forum for annotation discussions. Our annotators also always had the option of not annotating anything they were in doubt about.
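A hypothetical sketch of the kind of check such a glossary enables: flagging a translation whose English source contains a glossary term but whose Portuguese version does not use the agreed rendering. The glossary entries and example sentences below are purely illustrative and are not taken from the actual SICK-BR glossary.

    # Illustrative glossary mapping English terms to agreed Portuguese
    # renderings; the real SICK-BR glossary is not reproduced here.
    GLOSSARY = {
        "wakeboard": "prancha de wakeboard",
        "cliff": "penhasco",
    }

    def glossary_violations(english: str, portuguese: str):
        """Return glossary entries whose English term occurs in the source
        sentence but whose Portuguese rendering is missing in the translation."""
        return [
            (en, pt)
            for en, pt in GLOSSARY.items()
            if en in english.lower() and pt not in portuguese.lower()
        ]

    # The second sentence drops the agreed rendering, so the entry is flagged.
    print(glossary_violations(
        "A man is riding a wakeboard",
        "Um homem está surfando",
    ))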

When all the unique sentences had been translated and checked, the corpus was reconstructed: the sentences were paired as in the original, and the original labels were assigned to the Portuguese pairs. We then verified how well the original labels fit our translations, checking 400 relatedness labels and 800 inference labels, chosen randomly but equally distributed across the different label types. This last step showed that we do not always agree with the original SICK label, especially for relatedness, which is a subtle feature. We also found some inconsistencies in the inference labels, as [3, 2] had already shown. Despite these issues, for all original labels we agree with, we also agree with the Portuguese labels, which shows that our strategies were sufficient to create a Portuguese resource from a pre-existing English one.
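A minimal sketch of these two steps, reconstruction and stratified sampling for the label check, assuming the reviewed translations are kept in a dictionary keyed by the English sentence and that each SICK row carries pair_ID, sentence_A, sentence_B, entailment_judgment and relatedness_score fields (both are assumptions made for illustration).

    import random
    from collections import defaultdict

    def rebuild_pairs(sick_rows, pt_of):
        """Pair the Portuguese sentences exactly as in SICK and carry over
        the original inference and relatedness labels."""
        return [
            {
                "pair_ID": r["pair_ID"],
                "sentence_A": pt_of[r["sentence_A"]],
                "sentence_B": pt_of[r["sentence_B"]],
                "entailment_judgment": r["entailment_judgment"],
                "relatedness_score": r["relatedness_score"],
            }
            for r in sick_rows
        ]

    def sample_for_checking(pairs, per_label, seed=0):
        """Draw the same number of pairs for each inference label, for a
        manual check of how well the original labels fit the translations."""
        by_label = defaultdict(list)
        for p in pairs:
            by_label[p["entailment_judgment"]].append(p)
        rng = random.Random(seed)
        return [p for group in by_label.values()
                for p in rng.sample(group, per_label)]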

We described the construction of an NLI/RTE corpus for Portuguese, SICK-BR⁵, which is based on and aligned with SICK. We focused on linguistic strategies to guarantee (i) the reuse of the original NLI/relatedness labels of SICK for SICK-BR; (ii) a natural register in Portuguese; and (iii) that the same linguistic phenomena found in SICK are present in SICK-BR. The label issues found in SICK-BR had already been found in SICK, which suggests, as a next step, investigating how to correct these annotations, perhaps following the work in [1]. In future work, we hope to test different approaches to automatically detecting inference relations in SICK-BR.

References

  1. Kalouli, A.L., Real, L., de Paiva, V.: WordNet for ’easy’ textual inferences. In: GLOBALEX (2018)
  2. Kalouli, A.L., Real, L., de Paiva, V.: Correcting contradictions. In: Proceedings of Computing Natural Language Inference (CONLI) Workshop (2017)
  3. Kalouli, A.L., Real, L., de Paiva, V.: Textual inference: getting logic from humans. In: Proceedings of the 12th International Conference on Computational Semantics (IWCS) (2017)
  4. Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., Zamparelli, R.: A SICK cure for the evaluation of compositional distributional semantic models. In: Proceedings of LREC 2014 (2014)

⁴ We thank Milos Stanojevic for producing the initial machine translations that we worked from.

⁵ Available at https://github.com/livyreal/SICK-BR