A English-Portuguese parallel corpus made of song lyrics • Latin American and Iberian Languages Open Corpora Forum

Name	The English-Portuguese Parallel Corpus
Link	https://github.com/vaaaltin/PLN/tree/master/Trabalho_2
Title	A English-Portuguese parallel corpus made of song lyrics
Presented by	Martins, V.; Freitas, L.
Languages	Brazilian Portuguese, English
Language codes	pt-BR, en
Category	resource
Status	available
Type	corpora
Year	2020

Dataset

This paper presents a parallel corpus constructed from the translations of lyrics available on Letras (https://www.letras.mus.br/). To develop the corpus, we identified the patterns in the website's HTML. For this task, which was divided into three phases, we used the BeautifulSoup (https://pypi.org/project/beautifulsoup4/) library.

The first phase was to collect all the artists. In the website source, all of them are listed under the HTML "class" attribute named "home-artistas g-1 g-fix". The second phase was to collect all lyrics that have a translation. Each artist page has an attribute called "data-action" with the class name "translation", containing the URL for the lyrics. For each lyric with the "translation" class, we searched for the "song-name" class, which returned a list of URLs for all lyrics with a translation. The third phase was to find, on each lyrics page, the "div" elements with the classes "cnt-trad-l" and "cnt-trad-r", which hold the original lyrics of the song and the lyrics translated to Portuguese, respectively.
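The third phase can be illustrated with a minimal sketch, which is not the authors' code. It assumes a simplified, hypothetical page fragment (`SAMPLE_PAGE`) in which each lyric line is a separate text node. The real Letras markup is more complex and may have changed since the corpus was built. The function name `extract_pair` is also an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a Letras translation page.
SAMPLE_PAGE = (
    '<div class="cnt-trad-l"><p>Hello world<br/>How are you</p></div>'
    '<div class="cnt-trad-r"><p>Olá mundo<br/>Como vai você</p></div>'
)

def extract_pair(html):
    """Return (original_lines, translated_lines) from one translation page."""
    soup = BeautifulSoup(html, "html.parser")
    left = soup.find("div", class_="cnt-trad-l")   # original lyrics
    right = soup.find("div", class_="cnt-trad-r")  # Portuguese translation
    if left is None or right is None:
        # Page has no side-by-side translation block.
        return [], []
    # stripped_strings yields each non-empty text node with whitespace trimmed.
    return list(left.stripped_strings), list(right.stripped_strings)

original, translated = extract_pair(SAMPLE_PAGE)
```

Here `original` and `translated` are lists of lines that can be paired position by position, provided both sides have the same number of lines.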

After these steps, we obtained the full dataset, with 936 artists and 1,933,696 sentences in Portuguese. To classify the language of the lyrics, we used the Polyglot (https://pypi.org/project/polyglot/) library. The English-Portuguese parallel corpus contains 23,912 sentences.
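A rough sketch of the language-filtering step follows, assuming that pairs are kept when the original side is classified as English. The helper names (`keep_english_pairs`, `toy_detect`) and the word list are hypothetical. The detector is a toy stand-in so that the example runs without extra dependencies, and it is not how Polyglot works internally.

```python
def keep_english_pairs(pairs, detect):
    """Keep (original, translation) pairs whose original is detected as 'en'."""
    return [(src, tgt) for src, tgt in pairs if detect(src) == "en"]

# Toy stand-in detector (an assumption for illustration only).
# With Polyglot installed, a detector along the lines of
#   from polyglot.detect import Detector
#   detect = lambda text: Detector(text).language.code
# could be passed instead.
ENGLISH_HINTS = {"the", "you", "and", "is", "are", "love"}

def toy_detect(text):
    words = set(text.lower().split())
    return "en" if words & ENGLISH_HINTS else "und"

pairs = [
    ("How are you", "Como vai você"),
    ("Te quiero mucho", "Eu te amo muito"),
]
english_pairs = keep_english_pairs(pairs, toy_detect)
```

Passing the detector as a parameter keeps the filtering logic independent of any particular language-identification library.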

In the literature, other works on English-Portuguese corpora include [1], [3], [4], and [2]. The parallel corpus constructed from the translations of lyrics is available at https://github.com/vaaaltin/PLN/tree/master/Trabalho_2 .

References

Frankenberg-Garcia, A., Santos, D. (2003): “Introducing Compara: the Portuguese-English parallel corpus”. In: Zanettin, F., Bernardini, S. and Stewart, D. (eds.), Corpora in translator education. Manchester Northampton, St. Jerome Publishing, pp. 71–87.
Barreiro, A., Mota, C. (2017): “e-PACT: eSPERTo Paraphrase Aligned Corpus of EN-EP/BP Translations”. In: Tradução em Revista, pp. 87–102.
Koehn, P. (2005): “Europarl: A Parallel Corpus for Statistical Machine Translation”. In: 10th Machine Translation Summit, pp. 79–86.
Lison, P., Tiedemann, J. (2016): “OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles”. In: 10th International Conference on Language Resources and Evaluation, pp. 923–929.