The dataset viewer is disabled because the authors forbid processing this dataset automatically and require the users to download the dataset files manually.
Dataset Card for BrWaC
Dataset Summary
BrWaC (Brazilian Portuguese Web as Corpus) is a large corpus constructed following the WaCky framework and made public for research purposes. The current version, released in January 2017, is composed of 3.53 million documents, 2.68 billion tokens and 5.79 million types. This resource is available solely for academic research purposes, and by using it you agree not to use it for any commercial applications. The files must be downloaded manually from https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC
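To give a feel for the scale, the summary figures can be turned into two rough ratios. This is only a back-of-the-envelope sketch: the inputs are the rounded counts quoted in the summary, and the variable names are illustrative, so the results are approximate.

```python
# Rounded corpus statistics quoted in the dataset summary.
documents = 3.53e6   # 3.53 million documents
tokens = 2.68e9      # 2.68 billion tokens
types = 5.79e6       # 5.79 million types (distinct word forms)

# Average document length and type/token ratio implied by those figures.
tokens_per_document = tokens / documents
type_token_ratio = types / tokens

print(f"about {tokens_per_document:.0f} tokens per document on average")
print(f"about {type_token_ratio:.4f} types per token")
```

With these rounded inputs the average comes out at roughly 759 tokens per document.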
Supported Tasks and Leaderboards
[More Information Needed]
Languages
Portuguese
Dataset Structure
Data Instances
An example from the BrWaC dataset looks as follows:
{
  "doc_id": "netg-1afc73",
  "text": {
    "paragraphs": [
      [
        "Conteúdo recente"
      ],
      [
        "ESPUMA MARROM CHAMADA \"NINGUÉM MERECE\""
      ],
      [
        "31 de Agosto de 2015, 7:07 , por paulo soavinski - | No one following this article yet."
      ],
      [
        "Visualizado 202 vezes"
      ],
      [
        "JORNAL ELETRÔNICO DA ILHA DO MEL"
      ],
      [
        "Uma espuma marrom escuro tem aparecido com frequência na Praia de Fora.",
        "Na faixa de areia ela aparece disseminada e não chama muito a atenção.",
        "No Buraco do Aipo, com muitas pedras, ela aparece concentrada.",
        "É fácil saber que esta espuma estranha está lá, quando venta.",
        "Pequenos algodões de espuma começam a flutuar no espaço, pertinho da Praia do Saquinho.",
        "Quem pode ajudar na coleta deste material, envio a laboratório renomado e pagamento de análises, favor entrar em contato com o site."
      ]
    ]
  },
  "title": "ESPUMA MARROM CHAMADA ‟NINGUÃÂM MERECE‟ - paulo soavinski",
  "uri": "http://blogoosfero.cc/ilhadomel/pousadasilhadomel.com.br/espuma-marrom-chamada-ninguem-merece"
}
Data Fields
doc_id: The document ID
title: The document title
uri: The URI the document was extracted from
text: The document body, stored under the key paragraphs as a list of paragraphs, where each paragraph is a list of sentences and each sentence is a string
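Because the text field is nested (paragraphs, then sentences), it usually has to be flattened before it is passed to a tokenizer or written out as plain text. The following is a minimal sketch, assuming a record shaped like the example instance above; record_to_text and its separator parameters are hypothetical names, not part of the dataset or of any library, and the sample record is abbreviated.

```python
def record_to_text(record: dict, sentence_sep: str = " ", paragraph_sep: str = "\n") -> str:
    """Join the nested paragraphs and sentences of one record into a single string."""
    paragraphs = record["text"]["paragraphs"]  # list of paragraphs, each a list of sentence strings
    return paragraph_sep.join(sentence_sep.join(sentences) for sentences in paragraphs)


# Abbreviated record that follows the structure of the example instance.
sample = {
    "doc_id": "netg-1afc73",
    "text": {
        "paragraphs": [
            ["Conteúdo recente"],
            ["Visualizado 202 vezes"],
            [
                "Uma espuma marrom escuro tem aparecido com frequência na Praia de Fora.",
                "Na faixa de areia ela aparece disseminada e não chama muito a atenção.",
            ],
        ]
    },
}

flat = record_to_text(sample)
print(flat)
```

Here each paragraph ends up on its own line and sentences within a paragraph are separated by a single space. Other separators can be passed in if a different layout is needed.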
Data Splits
The data has a single split, train, which contains 3,530,796 samples.
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
[More Information Needed]
Citation Information
@inproceedings{wagner2018brwac,
title={The BrWaC Corpus: A New Open Resource for Brazilian Portuguese},
author={Wagner Filho, Jorge A and Wilkens, Rodrigo and Idiart, Marco and Villavicencio, Aline},
booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
year={2018}
}
Contributions
Thanks to @jonatasgrosman for adding this dataset.