Dataset Card for [financial-reports-sec]

Name: financial-reports-sec
Creator: Aman Khan
License: https://choosealicense.com/licenses/apache-2.0/

Dataset Summary

The dataset contains the annual reports of US public firms filing with the SEC EDGAR system from 1993 to 2020. Each annual report (10-K filing) is broken into 20 sections, and each section is split into individual sentences. Sentiment labels are provided per filing, derived from the market reaction around the filing date over three time windows: [t-1, t+1], [t-1, t+5] and [t-1, t+30]. Additional metadata for each filing is also included in the dataset.

Dataset Configurations

Four configurations are available:

large_lite:
- Contains only the basic features needed. Extra metadata is omitted.
- Features List:
  - cik
  - sentence
  - section
  - labels
  - filingDate
  - docID
  - sentenceID
  - sentenceCount
large_full:
- All features are included.
- Features List (excluding those already in the lite version above):
  - name
  - tickers
  - exchanges
  - entityType
  - sic
  - stateOfIncorporation
  - tickerCount
  - acceptanceDateTime
  - form
  - reportDate
  - returns
small_lite:
- Same as large_lite version except that only (200,000/20,000/20,000) sentences are loaded for (train/test/validation) splits.
small_full:
- Same as large_full version except that only (200,000/20,000/20,000) sentences are loaded for (train/test/validation) splits.

Usage

import datasets

# Load every split of the large_lite configuration
raw_dataset = datasets.load_dataset("JanosAudran/financial-reports-sec", "large_lite")

# Load only the train split of the small_full configuration
train_dataset = datasets.load_dataset("JanosAudran/financial-reports-sec", "small_full", split="train")

Supported Tasks

The dataset can be used directly for the following tasks:

Masked Language Modelling
- A model like BERT can be fine-tuned on this corpus of financial text.
Sentiment Analysis
- For each annual report a label ["positive", "negative"] is provided based on the market reaction around the filing date (refer to Annotations).
Next Sentence Prediction/Sentence Order Prediction
- Sentences extracted from the filings are in their original order and as such the dataset can be adapted very easily for either of these tasks.
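
As a rough illustration of that adaptation, the sketch below builds next-sentence-prediction pairs from a few toy records. It relies only on the docID and sentence fields described under Data Fields. The helper name make_nsp_pairs, the toy sentences, and the 50/50 sampling of positive and negative pairs are hypothetical choices for illustration, not part of the dataset.

```python
import random
from itertools import groupby


def make_nsp_pairs(records, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples.

    Assumes `records` keeps the original filing order, so consecutive rows
    sharing a docID are adjacent sentences of the same filing.
    """
    rng = random.Random(seed)
    all_sentences = [r["sentence"] for r in records]
    pairs = []
    for _, group in groupby(records, key=lambda r: r["docID"]):
        sentences = [r["sentence"] for r in group]
        for first, second in zip(sentences, sentences[1:]):
            if rng.random() < 0.5:
                # Positive example: the true next sentence.
                pairs.append((first, second, 1))
            else:
                # Negative example: any sentence other than the pair itself.
                candidates = [s for s in all_sentences if s not in (first, second)]
                pairs.append((first, rng.choice(candidates), 0))
    return pairs


# Toy records standing in for rows of the dataset (hypothetical content).
toy_records = [
    {"docID": "doc_a", "sentence": "We operate two segments."},
    {"docID": "doc_a", "sentence": "Revenue grew in both segments."},
    {"docID": "doc_a", "sentence": "Costs were flat."},
    {"docID": "doc_b", "sentence": "The company leases its offices."},
    {"docID": "doc_b", "sentence": "No material litigation is pending."},
]

pairs = make_nsp_pairs(toy_records)
```

Pairs never cross a docID boundary, so the last sentence of one filing is not paired with the first sentence of the next.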

Languages

All sentences are in English.

Dataset Structure

Data Instances

Refer to dataset preview.

Data Fields

Feature Name

Description
Data type
Example/Structure

cik

10 digit identifier used by SEC for a firm.
string
'0000001750'

sentence

A single sentence from the 10-K filing.
string
'The finance agreement is secured by a first priority security interest in all insurance policies, all unearned premium, return premiums, dividend payments and loss payments thereof.'

section

The section of the 10-K filing in which the sentence is located.
ClassLabel

ClassLabel(names=['section_1', 'section_10', 'section_11', 'section_12', 'section_13', 'section_14', 'section_15', 'section_1A', 'section_1B', 'section_2','section_3', 'section_4', 'section_5', 'section_6', 'section_7', 'section_7A','section_8', 'section_9', 'section_9A', 'section_9B'], id=None)
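
Note that the names above are in string-sorted order rather than filing order, so index 1 is section_10, not section_1A. The following sketch, assuming the section field is stored as an integer index into that list (the usual ClassLabel behaviour), converts between indices and names and recovers the filing order. The helper names are hypothetical.

```python
# Section names copied from the ClassLabel definition above.
SECTION_NAMES = [
    "section_1", "section_10", "section_11", "section_12", "section_13",
    "section_14", "section_15", "section_1A", "section_1B", "section_2",
    "section_3", "section_4", "section_5", "section_6", "section_7",
    "section_7A", "section_8", "section_9", "section_9A", "section_9B",
]


def section_name(index):
    """Map an integer class index to its section name."""
    return SECTION_NAMES[index]


def section_index(name):
    """Map a section name back to its integer class index."""
    return SECTION_NAMES.index(name)


def filing_order_key(name):
    """Sort key placing sections in filing order: 1, 1A, 1B, 2, ..., 15."""
    item = name.split("_", 1)[1]                      # e.g. "7A"
    digits = "".join(ch for ch in item if ch.isdigit())
    return int(digits), item[len(digits):]


filing_order = sorted(SECTION_NAMES, key=filing_order_key)
```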

labels

The sentiment label for the entire filing (positive or negative) based on different time windows.
Dict of ClassLabels

{
  '1d': ClassLabel(names=['positive', 'negative'], id=None),
  '5d': ClassLabel(names=['positive', 'negative'], id=None),
  '30d': ClassLabel(names=['positive', 'negative'], id=None)
}
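
Because 'positive' is listed first, a stored index of 0 means positive and 1 means negative, which is the reverse of a common 0 = negative convention. A minimal decoding sketch, assuming the labels are stored as integer indices into the names list above; the example values are hypothetical.

```python
# Label names in the order given by the ClassLabel definition above.
LABEL_NAMES = ["positive", "negative"]


def decode_labels(encoded):
    """Turn {'1d': 0, ...} style integer labels into their string names."""
    return {window: LABEL_NAMES[index] for window, index in encoded.items()}


# Hypothetical encoded labels for one filing.
example_labels = {"1d": 0, "5d": 0, "30d": 1}
decoded = decode_labels(example_labels)
```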

filingDate

The date the 10-K report was filed with the SEC.
string
'2021-03-10'

docID

Unique ID for identifying the exact 10-K filing. Unique across all configs and splits. Can be used to identify the document from which the sentence came.
string
'0000001750_10-K_2020'

sentenceID

Unique ID for identifying the exact sentence. Unique across all configs and splits.
string
'0000001750_10-K_2020_section_1_100'
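
Judging from the two examples above, a sentenceID appears to be the docID followed by the section name and a number. The sketch below splits an ID on that assumption. The layout is inferred from the single example shown and is not documented here, and the names SentenceRef, parse_sentence_id, and position are hypothetical.

```python
from typing import NamedTuple


class SentenceRef(NamedTuple):
    doc_id: str     # e.g. '0000001750_10-K_2020'
    cik: str
    form: str
    year: int       # year component of the docID
    section: str    # e.g. 'section_1'
    position: int   # trailing number of the sentenceID


def parse_sentence_id(sentence_id):
    """Split an ID assumed to look like <cik>_<form>_<year>_<section>_<number>."""
    cik, form, year, rest = sentence_id.split("_", 3)
    # Section names contain an underscore, so split the number off the right.
    section, position = rest.rsplit("_", 1)
    return SentenceRef(f"{cik}_{form}_{year}", cik, form, int(year), section, int(position))


ref = parse_sentence_id("0000001750_10-K_2020_section_1_100")
```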

sentenceCount

Integer identifying the running sequence for the sentence. Unique only for a given config and split.
int
123

name

The name of the filing entity.
string
'Investar Holding Corp'

tickers

List of ticker symbols for the filing entity.
List of strings
['ISTR']

exchanges

List of exchanges for the filing entity.
List of strings
['Nasdaq']

entityType

The type of entity as identified in the 10-K filing.
string
'operating'

sic

Four digit SIC code for the filing entity.
string
'6022'

stateOfIncorporation

Two character code for the state of incorporation for the filing entity.
string
'LA'

tickerCount

Internal use. Count of ticker symbols. Always 1.
int
1

acceptanceDateTime

The full timestamp of when the filing was accepted into the SEC EDGAR system.
string
'2021-03-10T14:26:11.000Z'

form

The type of filing. Always 10-K in the dataset.
string
'10-K'

reportDate

The last date in the fiscal year for which the entity is filing the report.
string
'2020-12-31'

returns

Internal use. The prices and timestamps used to calculate the sentiment labels.
Dict

{'1d': {
  'closePriceEndDate': 21.45746421813965,
  'closePriceStartDate': 20.64960479736328,
  'endDate': '2021-03-11T00:00:00-05:00',
  'startDate': '2021-03-09T00:00:00-05:00',
  'ret': 0.03912226855754852
  },
'5d': {
  'closePriceEndDate': 21.743167877197266,
  'closePriceStartDate': 20.64960479736328,
  'endDate': '2021-03-15T00:00:00-04:00',
  'startDate': '2021-03-09T00:00:00-05:00',
  'ret': 0.052958063781261444
  },
'30d': {
  'closePriceEndDate': 20.63919448852539,
  'closePriceStartDate': 20.64960479736328,
  'endDate': '2021-04-09T00:00:00-04:00',
  'startDate': '2021-03-09T00:00:00-05:00',
  'ret': -0.0005041408003307879}}
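
The ret values in this example are consistent with a simple buy-and-hold return, end price divided by start price minus one. The sketch below recomputes them from the prices shown above. The function name is hypothetical, and the small tolerance allows for the limited float precision of the stored values.

```python
def buy_and_hold_return(start_price, end_price):
    """Simple return from holding between the start and end close prices."""
    return end_price / start_price - 1.0


# Prices and returns copied from the example above.
example_returns = {
    "1d": {"closePriceStartDate": 20.64960479736328,
           "closePriceEndDate": 21.45746421813965,
           "ret": 0.03912226855754852},
    "5d": {"closePriceStartDate": 20.64960479736328,
           "closePriceEndDate": 21.743167877197266,
           "ret": 0.052958063781261444},
    "30d": {"closePriceStartDate": 20.64960479736328,
            "closePriceEndDate": 20.63919448852539,
            "ret": -0.0005041408003307879},
}

recomputed = {
    window: buy_and_hold_return(v["closePriceStartDate"], v["closePriceEndDate"])
    for window, v in example_returns.items()
}
```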

Data Splits

Config	train	validation	test
large_full	67,316,227	1,585,561	2,965,174
large_lite	67,316,227	1,585,561	2,965,174
small_full	200,000	20,000	20,000
small_lite	200,000	20,000	20,000

Dataset Summary Statistics

Variable	count	mean	std	min	1%	25%	50%	75%	99%	max
Unique Firm Count	4,677
Filings Count	55,349
Sentence Count	71,866,962
Filings per Firm	4,677	12	9	1	1	4	11	19	27	28
Return per Filing - 1d	55,349	0.008	0.394	-0.973	-0.253	-0.023	0	0.02	0.367	77.977
Return per Filing - 5d	55,349	0.013	0.584	-0.99	-0.333	-0.034	0	0.031	0.5	100
Return per Filing - 30d	55,349	0.191	22.924	-0.999	-0.548	-0.068	0.001	0.074	1	5,002.748
Sentences per Filing	55,349	1,299	654	0	110	839	1,268	1,681	3,135	8,286
Sentences by Section - section_1	55,349	221	183	0	0	97	180	293	852	2,724
Sentences by Section - section_10	55,349	24	40	0	0	4	6	20	173	1,594
Sentences by Section - section_11	55,349	16	47	0	0	3	3	4	243	808
Sentences by Section - section_12	55,349	9	14	0	0	3	4	8	56	1,287
Sentences by Section - section_13	55,349	8	20	0	0	3	3	4	79	837
Sentences by Section - section_14	55,349	22	93	0	0	3	3	8	413	3,536
Sentences by Section - section_15	55,349	177	267	0	0	9	26	315	1,104	4,140
Sentences by Section - section_1A	55,349	197	204	0	0	3	158	292	885	2,106
Sentences by Section - section_1B	55,349	4	31	0	0	1	3	3	13	2,414
Sentences by Section - section_2	55,349	16	45	0	0	6	8	13	169	1,903
Sentences by Section - section_3	55,349	14	36	0	0	4	5	12	121	2,326
Sentences by Section - section_4	55,349	7	17	0	0	3	3	4	66	991
Sentences by Section - section_5	55,349	20	41	0	0	10	15	21	87	3,816
Sentences by Section - section_6	55,349	8	29	0	0	3	4	7	43	2,156
Sentences by Section - section_7	55,349	265	198	0	0	121	246	373	856	4,539
Sentences by Section - section_7A	55,349	18	52	0	0	3	9	21	102	3,596
Sentences by Section - section_8	55,349	257	296	0	0	3	182	454	1,105	4,431
Sentences by Section - section_9	55,349	5	33	0	0	3	3	4	18	2,330
Sentences by Section - section_9A	55,349	17	16	0	0	8	15	23	50	794
Sentences by Section - section_9B	55,349	4	18	0	0	2	3	4	23	813
Word count per Sentence	71,866,962	28	22	1	2	16	24	34	98	8,675

Dataset Creation

Curation Rationale

To create this dataset, multiple sources of information had to be cleaned, processed, and merged. Starting from the raw filings:

Useful metadata about the filing and firm was added.
Time windows around the filing date were carefully created.
Stock price data was then added for the windows.
Ambiguous/duplicate records were removed.

Source Data

Initial Data Collection and Normalization

Initial data was collected and processed by the authors of the research paper EDGAR-CORPUS: Billions of Tokens Make The World Go Round. Market price and returns data was collected from Yahoo Finance. Additional metadata was collected from the SEC.

Who are the source language producers?

US public firms filing with the SEC.

Annotations

Annotation process

Labels for sentiment classification are based on buy-and-hold returns over a fixed time window around the date of filing with the SEC, i.e. when the data becomes public. Returns are chosen for this process because they reflect the combined market intelligence in parsing the new information in the filings. For each filing date t, the stock prices at t-1 and t+W are used to calculate the return. If the return is positive, a label of positive is assigned; otherwise a label of negative is assigned. Three different windows are used to assign the labels:

1d: [t-1, t+1]
5d: [t-1, t+5]
30d: [t-1, t+30]

The windows are based on calendar days and are adjusted for weekends and holidays. The rationale behind using 3 windows is as follows:

A very short window may not give enough time for all the information contained in the filing to be reflected in the stock price.
A very long window may capture other events that drive stock price for the firm.
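
The labelling rule and the three windows can be sketched as follows. This is only an illustration: the weekend and holiday adjustment is not specified in this card, so the sketch returns unadjusted calendar dates, and the names WINDOWS, raw_window, and label_from_return are hypothetical.

```python
from datetime import date, timedelta

# Window name -> W, the number of calendar days after the filing date t.
WINDOWS = {"1d": 1, "5d": 5, "30d": 30}


def raw_window(filing_date, days_after):
    """Unadjusted (start, end) calendar dates for the window [t-1, t+W]."""
    start = filing_date - timedelta(days=1)
    end = filing_date + timedelta(days=days_after)
    return start, end


def label_from_return(ret):
    """Positive only for a strictly positive return; zero counts as negative."""
    return "positive" if ret > 0 else "negative"


filing_date = date(2021, 3, 10)
windows = {name: raw_window(filing_date, w) for name, w in WINDOWS.items()}
```

For a filing date of 2021-03-10, these unadjusted dates coincide with the startDate and endDate values in the returns example under Data Fields.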

Who are the annotators?

Financial market participants.

Personal and Sensitive Information

The dataset contains public filing data from the SEC. Market returns data was collected from Yahoo Finance.

Considerations for Using the Data

Social Impact of Dataset

Low to none.

Discussion of Biases

The dataset covers financial information of public companies, and as such the tone and style of the text are in line with financial literature.

Other Known Limitations

Additional Information

Dataset Curators

Aman Khan

Licensing Information

This dataset is provided under Apache 2.0.

References

Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, & Prodromos Malakasiotis. (2021). EDGAR-CORPUS [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5589195

Citation Information

Please use the following to cite this dataset:

@ONLINE{financial-reports-sec,
author = "Aman Khan",
title = "Financial Reports SEC",
url = "https://huggingface.co/datasets/JanosAudran/financial-reports-sec"
}
