Dataset Card for [financial-reports-sec]

Dataset Summary

The dataset contains the annual reports of US public firms filing with the SEC EDGAR system from 1993 to 2020. Each annual report (10-K filing) is broken into 20 sections, and each section is split into individual sentences. Sentiment labels are provided per filing, based on the market reaction around the filing date, for 3 different time windows: [t-1, t+1], [t-1, t+5] and [t-1, t+30]. Additional metadata for each filing is included in the dataset.

Dataset Configurations

Four configurations are available:

  • large_lite:
    • Contains only the basic features needed. Extra metadata is omitted.
    • Features List:
      • cik
      • sentence
      • section
      • labels
      • filingDate
      • docID
      • sentenceID
      • sentenceCount
  • large_full:
    • All features are included.
    • Features List (excluding those already in the lite version above):
      • name
      • tickers
      • exchanges
      • entityType
      • sic
      • stateOfIncorporation
      • tickerCount
      • acceptanceDateTime
      • form
      • reportDate
      • returns
  • small_lite:
    • Same as large_lite version except that only (200,000/20,000/20,000) sentences are loaded for (train/test/validation) splits.
  • small_full:
    • Same as large_full version except that only (200,000/20,000/20,000) sentences are loaded for (train/test/validation) splits.

Usage

import datasets

# Load the lite configuration of the dataset
raw_dataset = datasets.load_dataset("JanosAudran/financial-reports-sec", "large_lite")

# Load a specific split
raw_dataset = datasets.load_dataset("JanosAudran/financial-reports-sec", "small_full", split="train")

Supported Tasks

The dataset can be used directly for the following tasks:

  • Masked Language Modelling
    • A model like BERT can be fine-tuned on this corpus of financial text.
  • Sentiment Analysis
    • For each annual report a label ["positive", "negative"] is provided based on the market reaction around the filing date (refer to Annotations).
  • Next Sentence Prediction/Sentence Order Prediction
    • Sentences extracted from the filings are in their original order and as such the dataset can be adapted very easily for either of these tasks.
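
As a rough illustration of the last task, the sketch below builds next-sentence-prediction pairs from rows that are already in their original order. The function name `make_nsp_pairs`, the choice to group by (docID, section), and the negative-sampling rule are assumptions made for this example; they are not part of the dataset or of any library.

```python
import random
from itertools import groupby


def make_nsp_pairs(rows, seed=0):
    """Return (sentence_a, sentence_b, is_next) triples.

    `rows` is an iterable of dicts with 'docID', 'section' and 'sentence'
    keys, assumed to be in original filing order.
    """
    rng = random.Random(seed)
    rows = list(rows)
    corpus = [r["sentence"] for r in rows]
    pairs = []
    # groupby only merges consecutive rows, which is what we want because
    # the rows are assumed to be in their original order.
    for _, group in groupby(rows, key=lambda r: (r["docID"], r["section"])):
        sentences = [r["sentence"] for r in group]
        for first, second in zip(sentences, sentences[1:]):
            pairs.append((first, second, 1))  # true successor
            # Negative example: any sentence that is neither `first` nor the
            # true successor. The linear scan is fine for a toy input but
            # far too slow for the full corpus.
            candidates = [s for s in corpus if s not in (first, second)]
            if candidates:
                pairs.append((first, rng.choice(candidates), 0))
    return pairs


toy_rows = [
    {"docID": "d1", "section": 0, "sentence": "A"},
    {"docID": "d1", "section": 0, "sentence": "B"},
    {"docID": "d1", "section": 0, "sentence": "C"},
    {"docID": "d2", "section": 0, "sentence": "D"},
    {"docID": "d2", "section": 0, "sentence": "E"},
]
toy_pairs = make_nsp_pairs(toy_rows)
```

For sentence order prediction, the negative example would instead be the swapped pair (second, first, 0).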

Languages

All sentences are in English.

Dataset Structure

Data Instances

Refer to dataset preview.

Data Fields

Feature Name

  • Description
  • Data type
  • Example/Structure

cik

  • 10 digit identifier used by SEC for a firm.
  • string
  • '0000001750'

sentence

  • A single sentence from the 10-K filing.
  • string
  • 'The finance agreement is secured by a first priority security interest in all insurance policies, all unearned premium, return premiums, dividend payments and loss payments thereof.'

section

  • The section of the 10-K filing in which the sentence is located.
  • ClassLabel
  • ClassLabel(names=['section_1', 'section_10', 'section_11', 'section_12', 'section_13', 'section_14', 'section_15', 'section_1A', 'section_1B', 'section_2','section_3', 'section_4', 'section_5', 'section_6', 'section_7', 'section_7A','section_8', 'section_9', 'section_9A', 'section_9B'], id=None)
    

labels

  • The sentiment label for the entire filing (positive or negative) based on different time windows.
  • Dict of ClassLabels
  • {
      '1d': ClassLabel(names=['positive', 'negative'], id=None),
      '5d': ClassLabel(names=['positive', 'negative'], id=None),
      '30d': ClassLabel(names=['positive', 'negative'], id=None)
    }
    

filingDate

  • The date the 10-K report was filed with the SEC.
  • string
  • '2021-03-10'

docID

  • Unique ID for identifying the exact 10-K filing. Unique across all configs and splits. Can be used to identify the document from which the sentence came.
  • string
  • '0000001750_10-K_2020'

sentenceID

  • Unique ID for identifying the exact sentence. Unique across all configs and splits.
  • string
  • '0000001750_10-K_2020_section_1_100'

sentenceCount

  • Integer identifying the running sequence for the sentence. Unique only for a given config and split.
  • int
  • 123

name

  • The name of the filing entity.
  • string
  • 'Investar Holding Corp'

tickers

  • List of ticker symbols for the filing entity.
  • List of strings
  • ['ISTR']

exchanges

  • List of exchanges for the filing entity.
  • List of strings
  • ['Nasdaq']

entityType

  • The type of entity as identified in the 10-K filing.
  • string
  • 'operating'

sic

  • Four digit SIC code for the filing entity.
  • string
  • '6022'

stateOfIncorporation

  • Two character code for the state of incorporation for the filing entity.
  • string
  • 'LA'

tickerCount

  • Internal use. Count of ticker symbols. Always 1.
  • int
  • 1

acceptanceDateTime

  • The full timestamp of when the filing was accepted into the SEC EDGAR system.
  • string
  • '2021-03-10T14:26:11.000Z'

form

  • The type of filing. Always 10-K in the dataset.
  • string
  • '10-K'

reportDate

  • The last date in the fiscal year for which the entity is filing the report.
  • string
  • '2020-12-31'

returns

  • Internal use. The prices and timestamps used to calculate the sentiment labels.
  • Dict
  • {'1d': {
      'closePriceEndDate': 21.45746421813965,
      'closePriceStartDate': 20.64960479736328,
      'endDate': '2021-03-11T00:00:00-05:00',
      'startDate': '2021-03-09T00:00:00-05:00',
      'ret': 0.03912226855754852
      },
    '5d': {
      'closePriceEndDate': 21.743167877197266,
      'closePriceStartDate': 20.64960479736328,
      'endDate': '2021-03-15T00:00:00-04:00',
      'startDate': '2021-03-09T00:00:00-05:00',
      'ret': 0.052958063781261444
      },
    '30d': {
      'closePriceEndDate': 20.63919448852539,
      'closePriceStartDate': 20.64960479736328,
      'endDate': '2021-04-09T00:00:00-04:00',
      'startDate': '2021-03-09T00:00:00-05:00',
      'ret': -0.0005041408003307879}}
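
The two ClassLabel fields are stored as integer indices into their `names` lists, so with the definitions above a label of 0 means positive and 1 means negative, and the section names are in string-sorted order (index 1 is section_10, not section_1A). The sketch below turns those integers back into names and splits a sentenceID into its parts. The helper names are illustrative, and the ID layout `<cik>_<form>_<year>_section_<section>_<index>` is inferred from the single example shown above rather than from a documented specification.

```python
# Names copied from the ClassLabel definitions in the field list above.
SECTION_NAMES = [
    "section_1", "section_10", "section_11", "section_12", "section_13",
    "section_14", "section_15", "section_1A", "section_1B", "section_2",
    "section_3", "section_4", "section_5", "section_6", "section_7",
    "section_7A", "section_8", "section_9", "section_9A", "section_9B",
]
LABEL_NAMES = ["positive", "negative"]


def decode_row(row):
    """Return a copy of `row` with ClassLabel integers replaced by names."""
    decoded = dict(row)
    decoded["section"] = SECTION_NAMES[row["section"]]
    decoded["labels"] = {
        window: LABEL_NAMES[code] for window, code in row["labels"].items()
    }
    return decoded


def parse_sentence_id(sentence_id):
    """Split a sentenceID, assuming the layout inferred from the example."""
    doc_id, sep, rest = sentence_id.partition("_section_")
    if not sep:
        raise ValueError(f"unexpected sentenceID format: {sentence_id!r}")
    section, _, index = rest.rpartition("_")
    cik, form, year = doc_id.split("_")
    return {
        "docID": doc_id,
        "cik": cik,
        "form": form,
        "year": int(year),  # the card does not say which year this is
        "section": "section_" + section,
        "index": int(index),
    }


row = {"section": 0, "labels": {"1d": 0, "5d": 0, "30d": 1}}
print(decode_row(row)["labels"])
# {'1d': 'positive', '5d': 'positive', '30d': 'negative'}
print(parse_sentence_id("0000001750_10-K_2020_section_1_100")["docID"])
# 0000001750_10-K_2020
```

When the data is loaded with the `datasets` library, the `ClassLabel.int2str` method on the split's features performs the same integer-to-name mapping.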
    

Data Splits

Config train validation test
large_full 67,316,227 1,585,561 2,965,174
large_lite 67,316,227 1,585,561 2,965,174
small_full 200,000 20,000 20,000
small_lite 200,000 20,000 20,000

Dataset Summary Statistics

Variable count mean std min 1% 25% 50% 75% 99% max
Unique Firm Count 4,677
Filings Count 55,349
Sentence Count 71,866,962
Filings per Firm 4,677 12 9 1 1 4 11 19 27 28
Return per Filing - 1d 55,349 0.008 0.394 -0.973 -0.253 -0.023 0 0.02 0.367 77.977
Return per Filing - 5d 55,349 0.013 0.584 -0.99 -0.333 -0.034 0 0.031 0.5 100
Return per Filing - 30d 55,349 0.191 22.924 -0.999 -0.548 -0.068 0.001 0.074 1 5,002.748
Sentences per Filing 55,349 1,299 654 0 110 839 1,268 1,681 3,135 8,286
Sentences by Section - section_1 55,349 221 183 0 0 97 180 293 852 2,724
Sentences by Section - section_10 55,349 24 40 0 0 4 6 20 173 1,594
Sentences by Section - section_11 55,349 16 47 0 0 3 3 4 243 808
Sentences by Section - section_12 55,349 9 14 0 0 3 4 8 56 1,287
Sentences by Section - section_13 55,349 8 20 0 0 3 3 4 79 837
Sentences by Section - section_14 55,349 22 93 0 0 3 3 8 413 3,536
Sentences by Section - section_15 55,349 177 267 0 0 9 26 315 1,104 4,140
Sentences by Section - section_1A 55,349 197 204 0 0 3 158 292 885 2,106
Sentences by Section - section_1B 55,349 4 31 0 0 1 3 3 13 2,414
Sentences by Section - section_2 55,349 16 45 0 0 6 8 13 169 1,903
Sentences by Section - section_3 55,349 14 36 0 0 4 5 12 121 2,326
Sentences by Section - section_4 55,349 7 17 0 0 3 3 4 66 991
Sentences by Section - section_5 55,349 20 41 0 0 10 15 21 87 3,816
Sentences by Section - section_6 55,349 8 29 0 0 3 4 7 43 2,156
Sentences by Section - section_7 55,349 265 198 0 0 121 246 373 856 4,539
Sentences by Section - section_7A 55,349 18 52 0 0 3 9 21 102 3,596
Sentences by Section - section_8 55,349 257 296 0 0 3 182 454 1,105 4,431
Sentences by Section - section_9 55,349 5 33 0 0 3 3 4 18 2,330
Sentences by Section - section_9A 55,349 17 16 0 0 8 15 23 50 794
Sentences by Section - section_9B 55,349 4 18 0 0 2 3 4 23 813
Word count per Sentence 71,866,962 28 22 1 2 16 24 34 98 8,675

Dataset Creation

Curation Rationale

Creating this dataset required cleaning, processing and merging multiple sources of information. Starting from the raw filings:

  • Useful metadata about the filing and firm was added.
  • Time windows around the filing date were carefully created.
  • Stock price data was then added for the windows.
  • Ambiguous/duplicate records were removed.

Source Data

Initial Data Collection and Normalization

Initial data was collected and processed by the authors of the research paper EDGAR-CORPUS: Billions of Tokens Make The World Go Round. Market price and returns data was collected from Yahoo Finance. Additional metadata was collected from SEC.

Who are the source language producers?

US public firms filing with the SEC.

Annotations

Annotation process

Labels for sentiment classification are based on buy-and-hold returns over a fixed time window around the date the filing is submitted to the SEC, i.e. when the data becomes public. Returns are chosen because they reflect the combined market intelligence in parsing the new information in the filings. For each filing date t, the stock prices at t-1 and t+W are used to calculate the return. If the return is positive, a label of positive is assigned; otherwise a label of negative is assigned. Three different windows are used to assign the labels:

  • 1d: [t-1, t+1]
  • 5d: [t-1, t+5]
  • 30d: [t-1, t+30]

The windows are based on calendar days and are adjusted for weekends and holidays. The rationale behind using 3 windows is as follows:

  • A very short window may not give enough time for all the information contained in the filing to be reflected in the stock price.
  • A very long window may capture other events that drive stock price for the firm.
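
The labelling rule can be sketched as follows, assuming the return is the simple buy-and-hold return `end / start - 1`. That assumption reproduces the `ret` values in the `returns` example under Data Fields to within floating-point rounding. The function names are illustrative, and the weekend and holiday adjustment is not reproduced because the card does not specify how it is done.

```python
from datetime import date, timedelta

# Window name -> W, the number of calendar days after the filing date t.
WINDOWS = {"1d": 1, "5d": 5, "30d": 30}


def raw_window(filing_date, days_after):
    """Return (t-1, t+W) in calendar days, before any market-day adjustment."""
    return (
        filing_date - timedelta(days=1),
        filing_date + timedelta(days=days_after),
    )


def buy_and_hold_return(start_close, end_close):
    """Simple return from the close at t-1 to the close at t+W."""
    return end_close / start_close - 1.0


def label_from_return(ret):
    """Positive only for strictly positive returns, as described above."""
    return "positive" if ret > 0 else "negative"


# Closing prices taken from the `returns` example for the filing of 2021-03-10.
start_close = 20.64960479736328
end_closes = {
    "1d": 21.45746421813965,
    "5d": 21.743167877197266,
    "30d": 20.63919448852539,
}

for name, days in WINDOWS.items():
    start, end = raw_window(date(2021, 3, 10), days)
    ret = buy_and_hold_return(start_close, end_closes[name])
    print(name, start, end, round(ret, 4), label_from_return(ret))
```

For this filing the 1d and 5d windows are labelled positive and the 30d window negative, which shows how the label for one filing can differ across windows.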

Who are the annotators?

Financial market participants.

Personal and Sensitive Information

The dataset contains public filings data from SEC. Market returns data was collected from Yahoo Finance.

Considerations for Using the Data

Social Impact of Dataset

Low to none.

Discussion of Biases

The dataset is about financial information of public companies and as such the tone and style of text is in line with financial literature.

Other Known Limitations

NA

Additional Information

Dataset Curators

Aman Khan

Licensing Information

This dataset is provided under Apache 2.0.

References

Citation Information

Please use the following to cite this dataset:

@ONLINE{financial-reports-sec,
author = "Aman Khan",
title = "Financial Reports SEC",
url = "https://huggingface.co/datasets/JanosAudran/financial-reports-sec"
}