long-t5-tglobal-base-16384 + BookSum
Summarize long text and get a SparkNotes-esque summary of arbitrary topics!
- Generalizes reasonably well to academic & narrative text.
- A simple example/use case on ASR is here.
- Example notebook in Colab (click on the icon above).
Cheeky Proof-of-Concept
A summary of the infamous navy seals copypasta:
The narrator tells us that he's graduated from the Navy seals and has been involved in many secret raids. He's also one of the best snipers in the entire U.S. military. He promises to "wipe you out with precision" when they meet again.
Contents
- Model description
- How-To in Python
- Intended uses & limitations
- Training and evaluation data
- FAQ
- Training procedure
- Citation info
Model description
A fine-tuned version of google/long-t5-tglobal-base on the kmfoda/booksum
dataset:
- 30+ epochs of fine-tuning from the base model on V100/A100 GPUs
- Training used 16384 token input / 1024 max output
Read the paper by Guo et al. here: LongT5: Efficient Text-To-Text Transformer for Long Sequences
How-To in Python
Install/update transformers: `pip install -U transformers`
Summarize text with pipeline:
```python
import torch
from transformers import pipeline

# load the checkpoint; use GPU 0 if available, otherwise CPU
summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)

long_text = "Here is a lot of text I don't want to read. Replace me"

result = summarizer(long_text)
print(result[0]["summary_text"])
```
Pass other beam search / text generation parameters when calling summarizer to get even higher-quality results.
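For example, a hedged sketch with illustrative values (these are standard transformers `generate()` kwargs that the pipeline forwards; the specific numbers are examples, not tuned recommendations):

```python
# Illustrative beam search settings; values are examples, not tuned defaults.
result = summarizer(
    long_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    repetition_penalty=2.5,
    num_beams=4,
    early_stopping=True,
)
print(result[0]["summary_text"])
```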
Intended uses & limitations
- The current checkpoint is fairly well converged but will be updated if further improvements can be made.
- Compare performance to LED-base trained on the same dataset (API gen parameters are the same).
- While this model appears to improve factual consistency, do not assume summaries are foolproof; check anything that seems odd.
Training and evaluation data
The kmfoda/booksum dataset on Model Database (read the original paper here). Summaries longer than 1024 LongT5 tokens were filtered out to prevent the model from learning to generate "partial" summaries.
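A minimal sketch of that filtering step, assuming the reference summaries live in a `summary_text` column (the exact preprocessing code isn't published here):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
dataset = load_dataset("kmfoda/booksum", split="train")

def summary_fits(example):
    # keep rows whose reference summary is at most 1024 LongT5 tokens
    return len(tokenizer(example["summary_text"]).input_ids) <= 1024

dataset = dataset.filter(summary_fits)
```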
FAQ
How to run inference over a very long (30k+ tokens) document in batches?
See summarize.py in the code for my hf space, Document Summarization :)
You can also use the same code to split a document into batches of 4096 tokens (or another size) and run the model over those batches. This is useful when CUDA memory is limited.
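If you want to roll your own, here is a minimal sketch of the batching idea (not the Space's exact summarize.py logic): tokenize once, slice into fixed-size chunks, and summarize each chunk independently.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "pszemraj/long-t5-tglobal-base-16384-book-summary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def summarize_long(text: str, chunk_size: int = 4096) -> str:
    token_ids = tokenizer(text, truncation=False).input_ids
    parts = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = torch.tensor([token_ids[start : start + chunk_size]], device=device)
        output = model.generate(chunk, max_length=512, num_beams=4)
        parts.append(tokenizer.decode(output[0], skip_special_tokens=True))
    # naive join; the Space merges per-batch summaries with more care
    return "\n".join(parts)
```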
How to fine-tune further?
See train with a script and the summarization scripts.
This model was originally tuned on Google Colab with a heavily modified variant of the longformer training notebook, the key enabler being DeepSpeed. You can try this as an alternate route to fine-tuning the model without using the command line.
Are there simpler ways to run this?
Yes. To make this easier, I created a Python package utility called textsum, which you can use to load models and summarize text in a few lines of code.

`pip install textsum`

Use textsum in Python with this model:
```python
from textsum.summarize import Summarizer

# load this checkpoint via textsum's wrapper
summarizer = Summarizer(
    model_name_or_path="pszemraj/long-t5-tglobal-base-16384-book-summary"
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")
```
This package provides easy-to-use interfaces for applying summarization models to text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo application.
For details, explanations, and documentation, see the README (linked above) or the wiki.
Training procedure
Updates:
- July 22, 2022: updated to a fairly converged checkpoint
- July 3, 2022: Added a new version with several epochs of additional general training that is more performant.
Training hyperparameters
NOTE: Early checkpoints of this model were trained on a "smaller" subset of the dataset because the summaries were mistakenly filtered to 1024 characters rather than 1024 tokens. Once caught, the filter was corrected to 1024 tokens and the model was trained for a further 10+ epochs.
The following hyperparameters were used during the most recent training round* (a rough code mapping follows the list):
- learning_rate: 0.0005
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 128
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.01
- num_epochs: 2
* Prior training sessions used roughly similar parameters; multiple sessions were required as this takes eons to train
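For reference, a rough mapping of those hyperparameters onto transformers' `Seq2SeqTrainingArguments` (a hedged sketch; the actual setup, including DeepSpeed and the multi-GPU launch configuration, is not reproduced here):

```python
from transformers import Seq2SeqTrainingArguments

# hedged sketch: output_dir is hypothetical, and the DeepSpeed/multi-GPU
# launch details from the actual training runs are omitted
training_args = Seq2SeqTrainingArguments(
    output_dir="./long-t5-booksum-ft",
    learning_rate=5e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    seed=42,
    gradient_accumulation_steps=128,  # 1 x 128 -> effective batch size 128
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    num_train_epochs=2,
)
```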
Framework versions
- Transformers 4.20.1
- Pytorch 1.10.0+cu113
- Datasets 2.3.2
- Tokenizers 0.12.1
Citation info
If you find pszemraj/long-t5-tglobal-base-16384-book-summary
useful in your work, please consider citing this model :)
```
@misc{peter_szemraj_2022,
  author    = {{Peter Szemraj}},
  title     = {long-t5-tglobal-base-16384-book-summary (Revision 4b12bce)},
  year      = 2022,
  url       = {https://modeldatabase.com/pszemraj/long-t5-tglobal-base-16384-book-summary},
  doi       = {10.57967/hf/0100},
  publisher = {Model Database}
}
```
Evaluation results
All metrics are self-reported.

| Dataset | Split | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-LSUM | Loss | Gen. Len. |
|---|---|---|---|---|---|---|---|
| kmfoda/booksum | test | 36.408 | 6.065 | 16.721 | 33.340 | NaN | 252.810 |
| samsum | test | 30.905 | 7.471 | 22.396 | 26.909 | NaN | 46.797 |
| cnn_dailymail | test | 30.594 | 7.252 | 17.716 | 27.288 | NaN | 125.251 |
| xsum | test | 20.365 | 3.413 | 13.617 | 15.831 | NaN | 82.218 |
| billsum | test | 39.638 | 13.002 | 23.026 | 32.994 | 1.943 | 162.359 |
| big_patent | test | 34.764 | 7.874 | 19.983 | 29.208 | 2.832 | 132.748 |
| launch/gov_report | validation | 37.925 | 8.584 | 18.027 | 34.082 | 2.567 | 220.375 |
| launch/gov_report | test | 37.444 | 8.291 | 17.689 | 33.714 | 2.578 | 214.969 |