Datasets:
The dataset viewer is disabled because the authors forbid processing this dataset automatically and require the users to download the dataset files manually.
Dataset Card for arXiv Dataset
Dataset Summary
A dataset of 1.7 million arXiv articles for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
The language supported is English
Dataset Structure
Data Instances
This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. An example is given below
{'id': '0704.0002',
'submitter': 'Louis Theran',
'authors': 'Ileana Streinu and Louis Theran',
'title': 'Sparsity-certifying Graph Decompositions',
'comments': 'To appear in Graphs and Combinatorics',
'journal-ref': None,
'doi': None,
'report-no': None,
'categories': 'math.CO cs.CG',
'license': 'http://arxiv.org/licenses/nonexclusive-distrib/1.0/',
'abstract': ' We describe a new algorithm, the $(k,\\ell)$-pebble game with colors, and use\nit obtain a characterization of the family of $(k,\\ell)$-sparse graphs and\nalgorithmic solutions to a family of problems concerning tree decompositions of\ngraphs. Special instances of sparse graphs appear in rigidity theory and have\nreceived increased attention in recent years. In particular, our colored\npebbles generalize and strengthen the previous results of Lee and Streinu and\ngive a new proof of the Tutte-Nash-Williams characterization of arboricity. We\nalso present a new decomposition that certifies sparsity based on the\n$(k,\\ell)$-pebble game with colors. Our work also exposes connections between\npebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and\nWestermann and Hendrickson.\n',
'update_date': '2008-12-13'}
Data Fields
id
: ArXiv ID (can be used to access the paper)submitter
: Who submitted the paperauthors
: Authors of the papertitle
: Title of the papercomments
: Additional info, such as number of pages and figuresjournal-ref
: Information about the journal the paper was published indoi
: Digital Object Identifierreport-no
: Report Numberabstract
: The abstract of the papercategories
: Categories / tags in the ArXiv system
Data Splits
The data was not splited.
Dataset Creation
Curation Rationale
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth. In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more is presented to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
Source Data
This data is based on arXiv papers. [More Information Needed]
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
This dataset contains no annotations.
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
The original data is maintained by ArXiv
Licensing Information
The data is under the Creative Commons CC0 1.0 Universal Public Domain Dedication
Citation Information
@misc{clement2019arxiv,
title={On the Use of ArXiv as a Dataset},
author={Colin B. Clement and Matthew Bierbaum and Kevin P. O'Keeffe and Alexander A. Alemi},
year={2019},
eprint={1905.00075},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
Contributions
Thanks to @tanmoyio for adding this dataset.
- Downloads last month
- 710