Portuguese ranks among the ten most spoken languages in the world. This popularity, however, is not reflected in the availability of resources for Natural Language Processing (NLP): annotated Portuguese corpora are usually small. The scarcity of annotated data is a significant obstacle to applying state-of-the-art techniques, such as deep learning models, which are commonly data-hungry. Recent work on transfer learning using pre-trained neural language models (LMs) has been shown to improve performance on a variety of NLP tasks and to reduce the amount of annotated data required [2,3,4]. However, pre-training LMs requires vast amounts of computation time and specialized hardware (TPUs3). In addition, pre-trained models are oftentimes made publicly available only for high-resource languages, such as English and Chinese. Given their task-agnostic architecture, pre-trained LMs are a valuable asset for less-resourced languages and can be applied to a number of NLP tasks with minimal architecture modifications.
In this work, we pre-train BERT (Bidirectional Encoder Representations from Transformers) [2] models for the Portuguese language, which we make available to the community.4 We first generate a cased Portuguese WordPiece [7] vocabulary of 30k subword units from 200k randomly sampled Wikipedia articles. We then train BERT Base and BERT Large models on unlabeled data from brWaC (Brazilian Web as Corpus) [9], a large corpus of Brazilian webpage texts totaling 17 GB. Training takes 4 days for BERT Base and 7 days for BERT Large on a cloud TPU v3-8 instance. More detailed information on the pre-training procedure can be found in our paper [8].
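For illustration, the vocabulary generation step can be reproduced with the HuggingFace tokenizers library. This is a minimal sketch under stated assumptions, not the exact tooling used for the paper; the input file name and the min_frequency threshold are hypothetical:

```python
# Sketch: training a cased WordPiece vocabulary of 30k subword units.
# `wiki_articles.txt` is a hypothetical plain-text dump of the sampled
# Wikipedia articles, one document per line.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    lowercase=False,       # cased vocabulary, as in the paper
    strip_accents=False,   # keep Portuguese diacritics (e.g. "ação")
)
tokenizer.train(
    files=["wiki_articles.txt"],
    vocab_size=30_000,     # 30k subword units
    min_frequency=2,       # assumption: a typical default threshold
)
tokenizer.save_model(".")  # writes vocab.txt in the format BERT expects
```

The resulting vocab.txt file follows the format consumed by BERT's pre-training and fine-tuning code.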
We then evaluate the trained models on the downstream task of named entity recognition (NER) using the popular First HAREM [5] corpus. Our Portuguese BERT achieves a new state of the art, improving micro F1-score by up to 4 absolute points over a Multilingual BERT model and the previously best published results [1,6]. For HAREM, we provide a script5 that preprocesses the datasets into a version more suitable for modeling NER as a sequence tagging problem. The script selects a single true target for entities that have multiple identification and/or classification solutions, standardizing decisions that otherwise can hinder comparison across related works. We hope that by making these models publicly available, others will be able to benchmark and improve the performance of many other NLP tasks in Portuguese.
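To make the sequence tagging formulation concrete, the sketch below shows one way to map single-solution entity spans, such as those produced by the preprocessing script, onto per-token BIO labels. The token and span representations here are illustrative assumptions, not the exact HAREM file format:

```python
# Sketch: casting NER as sequence tagging via BIO labels.
def spans_to_bio(tokens, entities):
    """tokens: list of (text, start, end) character offsets;
    entities: list of (start, end, category), one solution per entity."""
    labels = ["O"] * len(tokens)
    for ent_start, ent_end, category in entities:
        # Indices of tokens fully contained in the entity span.
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s >= ent_start and e <= ent_end]
        for j, i in enumerate(inside):
            labels[i] = ("B-" if j == 0 else "I-") + category
    return labels

# Hypothetical example sentence with two entities.
tokens = [("Fabio", 0, 5), ("mora", 6, 10), ("em", 11, 13), ("Campinas", 14, 22)]
entities = [(0, 5, "PESSOA"), (14, 22, "LOCAL")]
print(spans_to_bio(tokens, entities))
# ['B-PESSOA', 'O', 'O', 'B-LOCAL']
```

The resulting label sequences can be fed to any token-level classifier, with one label predicted per token.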
3 https://cloud.google.com/tpu/
4 Models available on GitHub at https://github.com/neuralmind-ai/portuguese-bert
5 Script available at https://github.com/fabiocapsouza/harem_preprocessing
References
1. Castro, P.V.Q.d., Silva, N.F.F.d., Soares, A.d.S.: Portuguese named entity recognition using LSTM-CRF. In: Computational Processing of the Portuguese Language. pp. 83–92. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-99722-3_9
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423
3. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 328–339 (2018)
4. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 2227–2237 (2018)
5. Santos, D., Seco, N., Cardoso, N., Vilela, R.: HAREM: An advanced NER evaluation contest for Portuguese. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06). European Language Resources Association (ELRA), Genoa, Italy (2006)
6. Santos, J., Consoli, B., dos Santos, C., Terra, J., Collonini, S., Vieira, R.: Assessing the impact of contextual embeddings for Portuguese named entity recognition. In: 8th Brazilian Conference on Intelligent Systems, BRACIS, Bahia, Brazil, October 15-18. pp. 437–442 (2019)
7. Schuster, M., Nakajima, K.: Japanese and Korean voice search. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5149–5152. IEEE (2012)
8. Souza, F., Nogueira, R., Lotufo, R.: Portuguese named entity recognition using BERT-CRF. arXiv preprint arXiv:1909.10649 (2019)
9. Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: A new open resource for Brazilian Portuguese. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan (2018), https://www.aclweb.org/anthology/L18-1686