CPLP:tuítes – The pluricentric corpus of tweets in Portuguese language

Andressa Rodrigues Gomide
2023 Zenodo  
This work presents the process of collecting, preparing and publishing the Pluricentric Corpus of Tweets in Portuguese Language (CPLP:tuítes). CPLP:tuítes is a corpus composed of 125,827 tweets and a total of 2,633,507 tokens. The tweets come from 53 accounts of newspapers or news providers in Angola, Brazil, Cape Verde, Guinea-Bissau, Mozambique, Portugal, and São Tomé and Príncipe. This corpus is part of the Portuguese Database (BDP), a repository that will offer free access to corpora, as well as the instruments used to prepare them, with content in Portuguese produced in the 11 countries where Portuguese is an official language. The first version of CPLP:tuítes was lemmatized and tagged for grammatical classes and is available via CQPweb, a corpus search and statistical analysis program that offers a user-friendly interface accessible through a web browser, with no installation required. The article also presents a brief discussion of the decisions to be made when preparing a reference corpus that represents a pluricentric language in its many varieties.
doi:10.5281/zenodo.7649627 fatcat:uhmtqe6u6zh2noyg56d4wfdmfm