Dataset
This paper presents a parallel corpus constructed from the lyric translations available on the Letras website (https://www.letras.mus.br/). To build the corpus, we identified the relevant patterns in the website's HTML and extracted them with the BeautifulSoup library (https://pypi.org/project/beautifulsoup4/). The extraction was divided into three phases.
The first phase collected all the artists. In the website source, they are listed under the HTML “class” attribute “home-artistas g-1 g-fix”. The second phase collected all lyrics that have a translation. Each artist page has elements with an HTML attribute called “data-action” and the class name “translation”, which contain the URL for the lyrics. For each lyric carrying the “translation” class, we searched for the “song-name” class, which returned a list of URLs for all lyrics with a translation. The third phase located, on each lyrics page, the “div” elements with the classes “cnt-trad-l” and “cnt-trad-r”, which hold the original lyrics of the song and the lyrics translated into Portuguese, respectively.
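The third phase described above can be sketched as follows. This is a minimal illustration, not the code used to build the corpus: it swaps in Python's standard-library html.parser for BeautifulSoup so that it runs without extra dependencies, and the sample markup (one line per “br” or “p” inside each “div”) is an assumption about the page layout. The names ColumnExtractor, extract_columns, and SAMPLE_PAGE are hypothetical.

```python
from html.parser import HTMLParser


class ColumnExtractor(HTMLParser):
    """Collect the text lines inside <div> elements carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while the parser is inside a target <div>
        self.lines = []     # finished text lines
        self._buffer = []   # text fragments of the line being read

    def _flush(self):
        # Join the buffered fragments, normalise whitespace, keep non-empty lines.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1          # nested <div> inside the target
            elif tag in ("br", "p"):
                self._flush()            # assumed line separators
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "p":
            self._flush()
        elif tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self._flush()

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def extract_columns(html):
    """Return (original_lines, translated_lines) from one lyrics page."""
    columns = []
    for css_class in ("cnt-trad-l", "cnt-trad-r"):
        parser = ColumnExtractor(css_class)
        parser.feed(html)
        parser.close()
        columns.append(parser.lines)
    return columns[0], columns[1]


# Hypothetical page fragment; the real Letras markup may differ.
SAMPLE_PAGE = """
<div class="cnt-trad-l"><p>Hello world<br>Goodbye moon</p></div>
<div class="cnt-trad-r"><p>Olá mundo<br>Adeus lua</p></div>
"""

original, translated = extract_columns(SAMPLE_PAGE)
```

With BeautifulSoup, the same lookup would typically be a call such as find_all("div", class_="cnt-trad-l") on the parsed page; the class names are the ones reported above, while everything else in the sketch is illustrative.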
After that, we obtained the full dataset, with 936 artists and 1,933,696 sentences in Portuguese. To classify the language of the lyrics, we used the Polyglot library (https://pypi.org/project/polyglot/). The English-Portuguese parallel corpus contains 23,912 sentences.
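One way to derive an English-Portuguese subset from the scraped songs is sketched below. It is only an illustration under stated assumptions: the language detector is passed in as a function, and toy_detect is a hypothetical keyword-based stand-in rather than Polyglot; skipping songs whose two columns differ in length is likewise an assumption, since the alignment procedure is not described here.

```python
def english_portuguese_pairs(songs, detect):
    """Return (english, portuguese) line pairs from songs whose original is English.

    `songs` is an iterable of (original_lines, translated_lines) tuples;
    `detect` maps a text to a language code such as "en".
    """
    pairs = []
    for original, translated in songs:
        if len(original) != len(translated):
            continue  # assumption: drop songs whose columns do not line up
        if detect(" ".join(original)) != "en":
            continue  # keep only songs detected as English
        pairs.extend(zip(original, translated))
    return pairs


# Hypothetical stand-in for a real language detector such as Polyglot.
ENGLISH_HINTS = {"the", "and", "you", "love"}


def toy_detect(text):
    words = set(text.lower().split())
    return "en" if words & ENGLISH_HINTS else "other"


songs = [
    (["I love you", "And the night"], ["Eu te amo", "E a noite"]),
    (["Te quiero"], ["Eu te amo"]),
]
pairs = english_portuguese_pairs(songs, toy_detect)
```

Passing the detector as an argument keeps the filtering logic independent of any particular language-identification library.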
In the literature, other works on English-Portuguese corpora include [1], [3], [4], and [2]. The parallel corpus constructed from the lyric translations is available at https://github.com/vaaaltin/PLN/tree/master/Trabalho_2.
References
- Frankenberg-Garcia, A., Santos, D. (2003): “Introducing Compara: the Portuguese-English parallel corpus”. In: Zanettin, F., Bernardini, S. and Stewart, D. (eds.), Corpora in translator education. Manchester Northampton, St. Jerome Publishing, pp. 71–87.
- Barreiro, A., Mota, C. (2017): “e-PACT: eSPERTo Paraphrase Aligned Corpus of EN-EP/BP Translations”. In: Tradução em Revista, pp. 87–102.
- Koehn, P. (2005): “Europarl: A Parallel Corpus for Statistical Machine Translation”. In: 10th Machine Translation Summit, pp. 79–86.
- Lison, P., Tiedemann, J. (2016): “OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles”. In: 10th International Conference on Language Resources and Evaluation, pp. 923–929.