Dataset Card for blogspot raw dataset
Dataset Summary
This dataset is a corpus of raw blog posts from blogspot, mostly in English. It was obtained by scraping web corpora from the Internet Archive (archive.org) and Common Crawl (commoncrawl.org).
Supported Tasks and Leaderboards
The dataset may be used for training language models or serve other research interests.
Languages
The texts are mostly in English, but posts in other languages may occur.
Dataset Structure
The distribution of the blog posts over time can be viewed at ./blogspot_dist_comm.png
Data Instances
[More Information Needed]
Data Fields
text: string
URL: string
date: string
comment: int
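The four fields above can be sketched as a typed record with a small type check. This is only an illustration: the names `BlogspotRecord` and `is_valid_record` are hypothetical and not part of the dataset, and the sample values are made up.

```python
from typing import TypedDict


class BlogspotRecord(TypedDict):
    """One row of the dataset, mirroring the field list above."""
    text: str     # raw text of the post or comment section
    URL: str      # source URL of the page
    date: str     # timestamp as a string (format not documented here)
    comment: int  # comment label


# Expected Python type for each field.
EXPECTED_TYPES = {"text": str, "URL": str, "date": str, "comment": int}


def is_valid_record(row: dict) -> bool:
    """Return True if every expected field is present with the expected type."""
    return all(isinstance(row.get(name), kind) for name, kind in EXPECTED_TYPES.items())


# Made-up sample row, for illustration only.
sample = {"text": "Hello world", "URL": "http://example.blogspot.com/2007/01/hello.html",
          "date": "2007-01-15", "comment": 0}
print(is_valid_record(sample))
```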
Data Splits
[More Information Needed]
Dataset Creation
Curation Rationale
The dataset was constructed with the WARC-dl pipeline, which was run on a compute cluster. The corpora of archive.org and commoncrawl.org consist of WARC files that contain HTML documents. The pipeline extracts the HTML from the WARC files and applies distributed filtering to select the desired content efficiently.
Source Data
Initial Data Collection and Normalization
The corpora "corpus-commoncrawl-main-2022-05" and "corpus-iwo-internet-archive-wide00001" were searched for the content present in this dataset. Search terms were passed to the previously mentioned pipeline to filter URLs for "blogspot.com" and for characteristic timestamp information contained in the URL (e.g. "/01/2007"). The HTML documents were parsed for specific tags to obtain the timestamps. Further, the "comment" label was set according to whether the URL contained comment markers, indicating whether the retrieved text comes from the main text of a blog post or from the comments section. The texts are stored raw and no further processing has been done.
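The URL filtering and comment labeling described above could look roughly like the following. This is a minimal sketch, not the WARC-dl pipeline's actual code: the regular expression, the marker list `COMMENT_MARKERS`, the function names, and the convention that 1 means "comment" are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Assumption: a date hint is a month/year pair such as "/01/2007",
# or the reverse order "/2007/01", followed by "/" or the end of the path.
DATE_IN_PATH = re.compile(
    r"/(?:(?:0[1-9]|1[0-2])/(?:19|20)\d{2}|(?:19|20)\d{2}/(?:0[1-9]|1[0-2]))(?=/|$)"
)

# Hypothetical comment markers; the markers actually used are not documented.
COMMENT_MARKERS = ("showcomment", "comment")


def is_blogspot_post(url: str) -> bool:
    """True if the URL is on blogspot.com and its path carries a date hint."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    on_blogspot = host == "blogspot.com" or host.endswith(".blogspot.com")
    return on_blogspot and DATE_IN_PATH.search(parsed.path) is not None


def comment_label(url: str) -> int:
    """Return 1 if the URL contains a comment marker, else 0 (assumed convention)."""
    lowered = url.lower()
    return int(any(marker in lowered for marker in COMMENT_MARKERS))


url = "http://example.blogspot.com/2007/01/hello.html?showComment=1#c1"
print(is_blogspot_post(url), comment_label(url))
```

Matching on the parsed host rather than on the raw URL string avoids false hits where "blogspot.com" only appears in a query parameter of an unrelated site.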
Who are the source language producers?
Since blogspot provides a high-level framework that allows people anywhere in the world to set up and maintain a blog, the producers of the texts cannot be specified further.
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
Texts are raw and unfiltered, thus personal and sensitive information, as well as explicit language, may be present in the dataset.
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
The retrieval of the timestamps from the HTML documents was not 100% accurate, so a small proportion of wrong or nonsensical timestamps may be present in the data. We therefore cannot guarantee the correctness of the timestamps or of the "comment" labels.
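Users who want to screen out obviously implausible timestamps could apply a coarse check such as the one below. It is only a sketch under stated assumptions: the format of the date string is not documented, so the check merely looks for a standalone four-digit year, and the bounds (1999 as the earliest year, 2022 taken from the name of the Common Crawl corpus) are assumptions that can be changed. The function name `has_plausible_year` is hypothetical.

```python
import re

# A standalone four-digit number, not embedded in a longer digit run.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def has_plausible_year(date: str, earliest: int = 1999, latest: int = 2022) -> bool:
    """True if the string contains a four-digit year within the assumed bounds.

    Compact forms without separators (e.g. "20070115") are not recognized.
    """
    years = [int(match) for match in YEAR.findall(date)]
    return any(earliest <= year <= latest for year in years)


for value in ["2007-01-15", "01/2007", "0000-00-00", ""]:
    print(repr(value), has_plausible_year(value))
```

A check like this only flags nonsensical values; it cannot confirm that a plausible-looking timestamp is the correct one for a given post.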
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
The dataset was constructed during the course "Big Data and Language Technologies" of the Text Mining and Retrieval Group, Department of Computer Science at the University of Leipzig.
Licensing Information
[More Information Needed]
Citation Information
[More Information Needed]
Contributions
Thanks to @jonaskonig, @maschirmer and @1BlattPapier for contributing.