Dataset Card for [financial-reports-sec]
Dataset Summary
The dataset contains the annual reports of US public firms filing with the SEC EDGAR system from 1993 to 2020. Each annual report (10-K filing) is broken into 20 sections, and each section is split into individual sentences. Sentiment labels are provided on a per-filing basis and are derived from the market reaction around the filing date for 3 different time windows: [t-1, t+1], [t-1, t+5] and [t-1, t+30]. Additional metadata for each filing is included in the dataset.
Dataset Configurations
Four configurations are available:
- large_lite:
- Contains only the basic features needed. Extra metadata is omitted.
- Features List:
- cik
- sentence
- section
- labels
- filingDate
- docID
- sentenceID
- sentenceCount
- large_full:
- All features are included.
- Features List (excluding those already in the lite version above):
- name
- tickers
- exchanges
- entityType
- sic
- stateOfIncorporation
- tickerCount
- acceptanceDateTime
- form
- reportDate
- returns
- small_lite:
- Same as large_lite version except that only (200,000/20,000/20,000) sentences are loaded for (train/test/validation) splits.
- small_full:
- Same as large_full version except that only (200,000/20,000/20,000) sentences are loaded for (train/test/validation) splits.
Usage
import datasets
# Load the lite configuration of the dataset
raw_dataset = datasets.load_dataset("JanosAudran/financial-reports-sec", "large_lite")
# Load a specific split
raw_dataset = datasets.load_dataset("JanosAudran/financial-reports-sec", "small_full", split="train")
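The `section` and `labels` features are class labels (see Data Fields), so a loaded row is expected to hold integer ids rather than names. The sketch below shows one way to map those ids back to names in plain Python. It assumes that rows store the ids as integers in the order the class names are listed on this card; `decode_row` and `example_row` are hypothetical and not part of the dataset.

```python
# Class names copied from the Data Fields section of this card.
SECTION_NAMES = [
    "section_1", "section_10", "section_11", "section_12", "section_13",
    "section_14", "section_15", "section_1A", "section_1B", "section_2",
    "section_3", "section_4", "section_5", "section_6", "section_7",
    "section_7A", "section_8", "section_9", "section_9A", "section_9B",
]
# Note the order: id 0 is "positive" and id 1 is "negative".
LABEL_NAMES = ["positive", "negative"]


def decode_row(row):
    """Return a copy of `row` with integer class ids replaced by names."""
    decoded = dict(row)
    decoded["section"] = SECTION_NAMES[row["section"]]
    decoded["labels"] = {
        window: LABEL_NAMES[label_id]
        for window, label_id in row["labels"].items()
    }
    return decoded


# Hypothetical row shaped like a lite-configuration record (assumption).
example_row = {
    "cik": "0000001750",
    "section": 7,
    "labels": {"1d": 0, "5d": 0, "30d": 1},
}
decoded_row = decode_row(example_row)
```

With the `datasets` library, the `int2str` method of the corresponding `ClassLabel` feature should give the same mapping without hard-coding the names.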
Supported Tasks
The dataset can be used directly for the following tasks:
- Masked Language Modelling
- A model like BERT can be fine-tuned on this corpus of financial text.
- Sentiment Analysis
- For each annual report a label ["positive", "negative"] is provided based on the market reaction around the filing date (refer to Annotations).
- Next Sentence Prediction/Sentence Order Prediction
- Sentences extracted from the filings are in their original order and as such the dataset can be adapted very easily for either of these tasks.
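As a rough illustration of the sentence order adaptation, the sketch below builds ordered and swapped sentence pairs from rows that are already in their original order. It assumes that pairs should stay within a single filing and section, and the function name, the toy rows and the 1/0 target convention are all hypothetical choices rather than part of the dataset.

```python
def sentence_order_examples(rows):
    """Build (first, second, target) triples from rows in original order.

    target is 1 when the pair is in its original order and 0 when swapped.
    Pairs never cross a filing (docID) or section boundary.
    """
    groups = {}
    for row in rows:
        key = (row["docID"], row["section"])
        groups.setdefault(key, []).append(row["sentence"])

    examples = []
    for sentences in groups.values():
        for first, second in zip(sentences, sentences[1:]):
            examples.append((first, second, 1))  # original order
            examples.append((second, first, 0))  # swapped order
    return examples


# Toy rows for illustration only; real rows come from the loaded dataset.
toy_rows = [
    {"docID": "doc_a", "section": 0, "sentence": "A1."},
    {"docID": "doc_a", "section": 0, "sentence": "A2."},
    {"docID": "doc_a", "section": 0, "sentence": "A3."},
    {"docID": "doc_b", "section": 0, "sentence": "B1."},
]
examples = sentence_order_examples(toy_rows)
```

For next sentence prediction, the swapped pairs could instead be replaced by pairs whose second sentence is sampled from a different filing.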
Languages
All sentences are in English.
Dataset Structure
Data Instances
Refer to dataset preview.
Data Fields
Feature Name
- Description
- Data type
- Example/Structure
cik
- 10 digit identifier used by SEC for a firm.
- string
- '0000001750'
sentence
- A single sentence from the 10-K filing.
- string
- 'The finance agreement is secured by a first priority security interest in all insurance policies, all unearned premium, return premiums, dividend payments and loss payments thereof.'
section
- The section of the 10-K filing in which the sentence is located.
- ClassLabel
- ClassLabel(names=['section_1', 'section_10', 'section_11', 'section_12', 'section_13', 'section_14', 'section_15', 'section_1A', 'section_1B', 'section_2', 'section_3', 'section_4', 'section_5', 'section_6', 'section_7', 'section_7A', 'section_8', 'section_9', 'section_9A', 'section_9B'], id=None)
labels
- The sentiment label for the entire filing (positive or negative) for each of the three time windows.
- Dict of ClassLabels
- { '1d': ClassLabel(names=['positive', 'negative'], id=None), '5d': ClassLabel(names=['positive', 'negative'], id=None), '30d': ClassLabel(names=['positive', 'negative'], id=None) }
filingDate
- The date the 10-K report was filed with the SEC.
- string
- '2021-03-10'
docID
- Unique ID for identifying the exact 10-K filing. Unique across all configs and splits. Can be used to identify the document from which a sentence came.
- string
- '0000001750_10-K_2020'
sentenceID
- Unique ID for identifying the exact sentence. Unique across all configs and splits.
- string
- '0000001750_10-K_2020_section_1_100'
sentenceCount
- Integer identifying the running sequence for the sentence. Unique only for a given config and split.
- int
- 123
name
- The name of the filing entity.
- string
- 'Investar Holding Corp'
tickers
- List of ticker symbols for the filing entity.
- List of strings
- ['ISTR']
exchanges
- List of exchanges for the filing entity.
- List of strings
- ['Nasdaq']
entityType
- The type of entity as identified in the 10-K filing.
- string
- 'operating'
sic
- Four digit SIC code for the filing entity.
- string
- '6022'
stateOfIncorporation
- Two character code for the state of incorporation for the filing entity.
- string
- 'LA'
tickerCount
- Internal use. Count of ticker symbols. Always 1.
- int
- 1
acceptanceDateTime
- The full timestamp of when the filing was accepted into the SEC EDGAR system.
- string
- '2021-03-10T14:26:11.000Z'
form
- The type of filing. Always 10-K in the dataset.
- string
- '10-K'
reportDate
- The last date in the fiscal year for which the entity is filing the report.
- string
- '2020-12-31'
returns
- Internal use. The prices and timestamps used to calculate the sentiment labels.
- Dict
- {'1d': { 'closePriceEndDate': 21.45746421813965, 'closePriceStartDate': 20.64960479736328, 'endDate': '2021-03-11T00:00:00-05:00', 'startDate': '2021-03-09T00:00:00-05:00', 'ret': 0.03912226855754852 }, '5d': { 'closePriceEndDate': 21.743167877197266, 'closePriceStartDate': 20.64960479736328, 'endDate': '2021-03-15T00:00:00-04:00', 'startDate': '2021-03-09T00:00:00-05:00', 'ret': 0.052958063781261444 }, '30d': { 'closePriceEndDate': 20.63919448852539, 'closePriceStartDate': 20.64960479736328, 'endDate': '2021-04-09T00:00:00-04:00', 'startDate': '2021-03-09T00:00:00-05:00', 'ret': -0.0005041408003307879}}
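The example values for docID and sentenceID suggest that a sentenceID is the docID followed by the section name and a running index. The sketch below splits an ID on that basis. The layout is inferred only from the single example above and is an assumption, not a documented format; `parse_sentence_id` and the group names are hypothetical.

```python
import re

# Layout inferred from the example '0000001750_10-K_2020_section_1_100':
# <10-digit cik>_<form>_<4-digit year>_<section name>_<running index>.
# This is an assumption, not a documented specification.
SENTENCE_ID_PATTERN = re.compile(
    r"^(?P<docID>(?P<cik>\d{10})_(?P<form>[^_]+)_(?P<year>\d{4}))"
    r"_(?P<section>section_\d+[A-Z]?)_(?P<index>\d+)$"
)


def parse_sentence_id(sentence_id):
    """Split a sentenceID into its assumed components."""
    match = SENTENCE_ID_PATTERN.match(sentence_id)
    if match is None:
        raise ValueError(f"Unexpected sentenceID layout: {sentence_id!r}")
    parts = match.groupdict()
    parts["index"] = int(parts["index"])
    return parts


parsed = parse_sentence_id("0000001750_10-K_2020_section_1_100")
```

Raising on a non-matching ID, rather than returning partial results, makes it obvious if the inferred layout does not hold for some rows.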
Data Splits
Config | train | validation | test |
---|---|---|---|
large_full | 67,316,227 | 1,585,561 | 2,965,174 |
large_lite | 67,316,227 | 1,585,561 | 2,965,174 |
small_full | 200,000 | 20,000 | 20,000 |
small_lite | 200,000 | 20,000 | 20,000 |
Dataset Summary Statistics
Variable | count | mean | std | min | 1% | 25% | 50% | 75% | 99% | max |
---|---|---|---|---|---|---|---|---|---|---|
Unique Firm Count | 4,677 | | | | | | | | | |
Filings Count | 55,349 | | | | | | | | | |
Sentence Count | 71,866,962 | | | | | | | | | |
Filings per Firm | 4,677 | 12 | 9 | 1 | 1 | 4 | 11 | 19 | 27 | 28 |
Return per Filing - 1d | 55,349 | 0.008 | 0.394 | -0.973 | -0.253 | -0.023 | 0 | 0.02 | 0.367 | 77.977 |
Return per Filing - 5d | 55,349 | 0.013 | 0.584 | -0.99 | -0.333 | -0.034 | 0 | 0.031 | 0.5 | 100 |
Return per Filing - 30d | 55,349 | 0.191 | 22.924 | -0.999 | -0.548 | -0.068 | 0.001 | 0.074 | 1 | 5,002.748 |
Sentences per Filing | 55,349 | 1,299 | 654 | 0 | 110 | 839 | 1,268 | 1,681 | 3,135 | 8,286 |
Sentences by Section - section_1 | 55,349 | 221 | 183 | 0 | 0 | 97 | 180 | 293 | 852 | 2,724 |
Sentences by Section - section_10 | 55,349 | 24 | 40 | 0 | 0 | 4 | 6 | 20 | 173 | 1,594 |
Sentences by Section - section_11 | 55,349 | 16 | 47 | 0 | 0 | 3 | 3 | 4 | 243 | 808 |
Sentences by Section - section_12 | 55,349 | 9 | 14 | 0 | 0 | 3 | 4 | 8 | 56 | 1,287 |
Sentences by Section - section_13 | 55,349 | 8 | 20 | 0 | 0 | 3 | 3 | 4 | 79 | 837 |
Sentences by Section - section_14 | 55,349 | 22 | 93 | 0 | 0 | 3 | 3 | 8 | 413 | 3,536 |
Sentences by Section - section_15 | 55,349 | 177 | 267 | 0 | 0 | 9 | 26 | 315 | 1,104 | 4,140 |
Sentences by Section - section_1A | 55,349 | 197 | 204 | 0 | 0 | 3 | 158 | 292 | 885 | 2,106 |
Sentences by Section - section_1B | 55,349 | 4 | 31 | 0 | 0 | 1 | 3 | 3 | 13 | 2,414 |
Sentences by Section - section_2 | 55,349 | 16 | 45 | 0 | 0 | 6 | 8 | 13 | 169 | 1,903 |
Sentences by Section - section_3 | 55,349 | 14 | 36 | 0 | 0 | 4 | 5 | 12 | 121 | 2,326 |
Sentences by Section - section_4 | 55,349 | 7 | 17 | 0 | 0 | 3 | 3 | 4 | 66 | 991 |
Sentences by Section - section_5 | 55,349 | 20 | 41 | 0 | 0 | 10 | 15 | 21 | 87 | 3,816 |
Sentences by Section - section_6 | 55,349 | 8 | 29 | 0 | 0 | 3 | 4 | 7 | 43 | 2,156 |
Sentences by Section - section_7 | 55,349 | 265 | 198 | 0 | 0 | 121 | 246 | 373 | 856 | 4,539 |
Sentences by Section - section_7A | 55,349 | 18 | 52 | 0 | 0 | 3 | 9 | 21 | 102 | 3,596 |
Sentences by Section - section_8 | 55,349 | 257 | 296 | 0 | 0 | 3 | 182 | 454 | 1,105 | 4,431 |
Sentences by Section - section_9 | 55,349 | 5 | 33 | 0 | 0 | 3 | 3 | 4 | 18 | 2,330 |
Sentences by Section - section_9A | 55,349 | 17 | 16 | 0 | 0 | 8 | 15 | 23 | 50 | 794 |
Sentences by Section - section_9B | 55,349 | 4 | 18 | 0 | 0 | 2 | 3 | 4 | 23 | 813 |
Word count per Sentence | 71,866,962 | 28 | 22 | 1 | 2 | 16 | 24 | 34 | 98 | 8,675 |
Dataset Creation
Curation Rationale
To create this dataset, multiple sources of information had to be cleaned, processed and merged. Starting from the raw filings:
- Useful metadata about the filing and firm was added.
- Time windows around the filing date were carefully created.
- Stock price data was then added for the windows.
- Ambiguous/duplicate records were removed.
Source Data
Initial Data Collection and Normalization
Initial data was collected and processed by the authors of the research paper EDGAR-CORPUS: Billions of Tokens Make The World Go Round. Market price and returns data was collected from Yahoo Finance. Additional metadata was collected from the SEC.
Who are the source language producers?
US public firms filing with the SEC.
Annotations
Annotation process
Labels for sentiment classification are based on buy-and-hold returns over a fixed time window around the date the filing is submitted to the SEC, i.e. when the data becomes public. Returns are chosen for this process because they reflect the combined intelligence of the market in parsing the new information in the filings. For each filing date t, the stock prices at t-1 and t+W are used to calculate the return, where W is the window length in days. If the return is positive, a label of positive is assigned; otherwise a label of negative is assigned. Three different windows are used to assign the labels:
- 1d: [t-1, t+1]
- 5d: [t-1, t+5]
- 30d: [t-1, t+30]
The windows are based on calendar days and are adjusted for weekends and holidays. The rationale behind using 3 windows is as follows:
- A very short window may not give enough time for all the information contained in the filing to be reflected in the stock price.
- A very long window may capture other events that drive stock price for the firm.
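The labelling rule can be sketched in a few lines. The prices below are the closing prices from the `returns` example in Data Fields, and computing the return as the end price divided by the start price, minus one, reproduces the `ret` values shown there. The function names are hypothetical, and this is an illustration of the rule as described, not the curator's original code.

```python
def buy_and_hold_return(close_start, close_end):
    """Simple return from holding between the start and end closing prices."""
    return close_end / close_start - 1.0


def sentiment_label(ret):
    """Positive only for strictly positive returns; zero counts as negative."""
    return "positive" if ret > 0 else "negative"


# (closePriceStartDate, closePriceEndDate) from the `returns` example.
window_prices = {
    "1d": (20.64960479736328, 21.45746421813965),
    "5d": (20.64960479736328, 21.743167877197266),
    "30d": (20.64960479736328, 20.63919448852539),
}

window_returns = {
    window: buy_and_hold_return(start, end)
    for window, (start, end) in window_prices.items()
}
window_labels = {
    window: sentiment_label(ret) for window, ret in window_returns.items()
}
```

In this example the 1d and 5d windows are labelled positive while the 30d window is labelled negative, which shows how the label for the same filing can differ across windows.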
Who are the annotators?
Financial market participants.
Personal and Sensitive Information
The dataset contains public filings data from the SEC. Market returns data was collected from Yahoo Finance.
Considerations for Using the Data
Social Impact of Dataset
Low to none.
Discussion of Biases
The dataset is about financial information of public companies and as such the tone and style of text is in line with financial literature.
Other Known Limitations
NA
Additional Information
Dataset Curators
Aman Khan
Licensing Information
This dataset is provided under Apache 2.0.
References
- Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, & Prodromos Malakasiotis. (2021). EDGAR-CORPUS [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5589195
Citation Information
Please use the following to cite this dataset:
@ONLINE{financial-reports-sec,
author = "Aman Khan",
title = "Financial Reports SEC",
url = "https://huggingface.co/datasets/JanosAudran/financial-reports-sec"
}