long-t5-tglobal-xl + BookSum

Summarize long text and get a SparkNotes-like summary of any topic!

Generalizes reasonably well to academic & narrative text.
This is the XL checkpoint, which produces even better summaries from a human evaluation perspective.

A simple example/use case with the base model on ASR is here.

Cheeky Proof-of-Concept

A summary of the infamous navy seals copypasta:

In this chapter, the monster explains how he intends to exact revenge on "the little b****" who insulted him. He tells the kiddo that he is a highly trained and experienced killer who will use his arsenal of weapons--including his access to the internet--to exact justice on the little brat.

While this is a crude example, try running this copypasta through other summarization models to see the difference in comprehension (even though it's not even a "long" text!).

Contents

Description
How-To in Python
- Beyond the basics
  - Adjusting parameters
  - LLM.int8 Quantization
About
FAQ
Training procedure

Description

A fine-tuned version of google/long-t5-tglobal-xl on the kmfoda/booksum dataset.

Read the paper by Guo et al. here: LongT5: Efficient Text-To-Text Transformer for Long Sequences

How-To in Python

Install/update transformers:

pip install -U transformers

Summarize text with pipeline:

import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)
long_text = "Here is a lot of text I don't want to read. Replace me"

result = summarizer(long_text)
print(result[0]["summary_text"])

Beyond the basics

There are two additional points to consider beyond simple inference: adjusting decoding parameters for improved performance, and quantization for reduced memory consumption.

Adjusting parameters

Pass additional text-generation parameters (e.g., beam search settings) when calling summarizer to get higher-quality results.
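As a rough sketch of what passing such parameters might look like: the keyword arguments below are standard `generate()` options in transformers, but the specific values are illustrative assumptions rather than tuned recommendations, and `summarize_with_params` is a hypothetical helper name. The `summarizer` argument is assumed to be the pipeline object created earlier.

```python
# Illustrative decoding settings (assumed values, not tuned recommendations).
GEN_KWARGS = {
    "max_length": 512,
    "min_length": 8,
    "num_beams": 4,
    "no_repeat_ngram_size": 3,
    "encoder_no_repeat_ngram_size": 4,
    "repetition_penalty": 2.5,
    "early_stopping": True,
}


def summarize_with_params(summarizer, text, **overrides):
    """Call a summarization pipeline with beam-search settings.

    `summarizer` is any callable with the pipeline's interface, i.e. it
    returns a list of dicts that each contain a "summary_text" key.
    """
    params = {**GEN_KWARGS, **overrides}  # per-call overrides win
    result = summarizer(text, **params)
    return result[0]["summary_text"]
```

For example, `summarize_with_params(summarizer, long_text, num_beams=8)` would widen the beam for a single call while keeping the other defaults. Larger beams generally cost more memory and time, so it may be worth starting small.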

LLM.int8 Quantization

Alternative section title: how to get this monster to run inference on free Colab runtimes

Via this PR, LLM.int8 is now supported for long-t5 models.

Per initial tests, the summarization quality seems to hold while using significantly less memory! *
A version of this model quantized to int8 is already on the hub here, so if you're using the 8-bit version anyway, you can start there for a download of only 3.5 GB!

First, make sure you have the latest versions of the relevant packages:

pip install -U transformers bitsandbytes accelerate

load in 8-bit (magic completed by bitsandbytes behind the scenes)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/long-t5-tglobal-xl-16384-book-summary"
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    load_in_8bit=True,
    device_map="auto",
)

The above is already present in the Colab demo linked at the top of the model card.
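Once the model and tokenizer are loaded, inference can bypass the pipeline and call `generate()` directly. The following is a minimal sketch under that assumption; `summarize_8bit` is a hypothetical helper name, and the 16384-token default simply mirrors the input length mentioned in this card.

```python
def summarize_8bit(model, tokenizer, text, max_input_tokens=16384, **gen_kwargs):
    """Summarize `text` with an already-loaded seq2seq model and tokenizer."""
    # Tokenize, truncating anything beyond the maximum input length.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    # Move the input tensors to the model's device.
    inputs = inputs.to(model.device)
    # Extra keyword arguments (e.g. num_beams) are forwarded to generate().
    output_ids = model.generate(**inputs, **gen_kwargs)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

A call might look like `summarize_8bit(model, tokenizer, long_text, max_length=512, num_beams=4)`, using the `model` and `tokenizer` objects from the loading snippet.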

* More rigorous metrics-based research comparing beam-search summarization with and without LLM.int8 will take place over time.

About

Intended uses & limitations

While this model seems to improve factual consistency, don't take summaries as foolproof and check things that seem odd.

Specifically: negation statements (i.e., the model says: this thing does not have [ATTRIBUTE], when instead it should have said this thing has lots of [ATTRIBUTE]).

I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually check this by comparing a particular statement with what the surrounding sentences imply.

Training and evaluation data

The kmfoda/booksum dataset on the Hugging Face Hub; read the original paper here.

- For initial fine-tuning, only examples with at most 12288 input tokens and at most 1024 output tokens were used (i.e., longer examples were dropped before training) for memory reasons. A quick analysis showed that examples in the 12288-16384 token range are a small minority in this dataset.
  - In addition, this initial training combined the training and validation sets and trained on them in aggregate to increase the functional dataset size. Therefore, take the validation set results with a grain of salt; primary metrics should (always) be the test set.
- The final stages of fine-tuning used the standard 16384-token input/1024-token output lengths, truncating longer sequences. This did not seem to change the loss/performance much.
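The length-based filtering described above could be sketched roughly as follows. This is not the actual preprocessing code used for training: `within_length_limits`, the `input_key`/`output_key` field names, and the `count_tokens` callable are all assumptions made for illustration.

```python
def within_length_limits(
    example,
    count_tokens,
    input_key="text",        # hypothetical field name for the source text
    output_key="summary",    # hypothetical field name for the reference summary
    max_input_tokens=12288,
    max_output_tokens=1024,
):
    """Return True if an example fits the initial fine-tuning limits.

    `count_tokens` is any callable mapping a string to its token count,
    e.g. lambda s: len(tokenizer(s).input_ids) with a real tokenizer.
    """
    return (
        count_tokens(example[input_key]) <= max_input_tokens
        and count_tokens(example[output_key]) <= max_output_tokens
    )


def drop_long_examples(examples, count_tokens, **limits):
    """Keep only the examples that satisfy the length limits."""
    return [ex for ex in examples if within_length_limits(ex, count_tokens, **limits)]
```

With a `datasets.Dataset`, the same predicate could presumably be passed to `.filter(...)` instead of the list comprehension.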

Eval results

Official results with the model evaluator will be computed and posted here.

Please read the note above: because the training and validation sets were combined during training, performance on the validation set looks better than it will on the test set. The model achieves the following results on the evaluation set:

eval_loss: 1.2756
eval_rouge1: 41.8013
eval_rouge2: 12.0895
eval_rougeL: 21.6007
eval_rougeLsum: 39.5382
eval_gen_len: 387.2945
eval_runtime: 13908.4995
eval_samples_per_second: 0.107
eval_steps_per_second: 0.027

***** predict/test metrics (initial) *****

predict_gen_len: 506.4368
predict_loss: 2.028
predict_rouge1: 36.8815
predict_rouge2: 8.0625
predict_rougeL: 17.6161
predict_rougeLsum: 34.9068
predict_runtime: 2:04:14.37
predict_samples: 1431
predict_samples_per_second: 0.192
predict_steps_per_second: 0.048

* Evaluating a model this large is not as easy as it seems; a bit more investigation is underway.

FAQ

How can I run inference with this on CPU?

lol

How to run inference over a very long (30k+ tokens) document in batches?

See summarize.py in the code for my hf space Document Summarization :)

You can also use the same code to split a document into batches of 4096, etc., and iterate over them with the model. This is useful in situations where CUDA memory is limited.

Update: see the section on the textsum package below.
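The batching idea mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code from summarize.py: `chunk_token_ids` and `summarize_in_batches` are hypothetical names, and `encode`, `decode`, and `summarize_fn` stand in for a real tokenizer and model call.

```python
def chunk_token_ids(token_ids, chunk_size=4096):
    """Split a list of token ids into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        token_ids[start : start + chunk_size]
        for start in range(0, len(token_ids), chunk_size)
    ]


def summarize_in_batches(text, encode, decode, summarize_fn, chunk_size=4096):
    """Summarize each chunk of a long text independently.

    encode: str -> list of token ids, decode: list of token ids -> str,
    summarize_fn: str -> str (for example, a wrapper around the pipeline).
    Returns one summary per chunk, in document order.
    """
    token_ids = encode(text)
    return [
        summarize_fn(decode(chunk))
        for chunk in chunk_token_ids(token_ids, chunk_size)
    ]
```

Because each chunk is summarized without context from its neighbors, the per-chunk summaries may read as somewhat disjoint; a smaller `chunk_size` trades coherence for lower CUDA memory use.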

How to fine-tune further?

See train with a script and the summarization scripts

Are there simpler ways to run this?

For this reason, I created a Python package utility. It's called textsum, and you can use it to load models and summarize things in a few lines of code.

pip install textsum

Use textsum in python with this model:

from textsum.summarize import Summarizer

summarizer = Summarizer(
    model_name_or_path="pszemraj/long-t5-tglobal-xl-16384-book-summary"
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")

This package provides easy-to-use interfaces for applying summarization models to text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo application.

For details, explanations, and documentation, see the README (linked above) or the wiki.

Training procedure

Updates

Updates to this model/model card will be posted here when relevant. The model seems to be fairly converged; if updates/improvements are possible using the BookSum dataset, this repo will be updated.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0006
train_batch_size: 1
eval_batch_size: 1
seed: 10350
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 32
total_train_batch_size: 128
total_eval_batch_size: 4
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
num_epochs: 1.0

*Prior training sessions used roughly similar parameters (learning rates were higher); multiple sessions were required as this takes eons to train.
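For reference, the total batch sizes listed above follow from the per-device values. The short check below only restates that arithmetic using the numbers from the hyperparameter list.

```python
# Values taken from the hyperparameter list above.
train_batch_size = 1             # per device
eval_batch_size = 1              # per device
num_devices = 4
gradient_accumulation_steps = 32

# Effective batch size per optimizer step during training.
total_train_batch_size = (
    train_batch_size * num_devices * gradient_accumulation_steps
)
# Evaluation does not accumulate gradients, so only devices multiply.
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size)  # 128
print(total_eval_batch_size)   # 4
```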

Framework versions

Transformers 4.25.0.dev0
Pytorch 1.13.0+cu117
Datasets 2.6.1
Tokenizers 0.13.1

Evaluation results

multi_news (test set, verified):

ROUGE-1: 36.204
ROUGE-2: 8.424
ROUGE-L: 17.372
ROUGE-LSUM: 32.399
loss: 2.084
gen_len: 248.357

billsum (test set, self-reported):

ROUGE-1: 41.364
ROUGE-2: 16.144
ROUGE-L: 24.298
ROUGE-LSUM: 35.323
loss: 1.282
gen_len: 291.816

ccdv/arxiv-summarization (test set, self-reported):

ROUGE-1: 36.322
ROUGE-2: 9.374
ROUGE-L: 19.840
ROUGE-LSUM: 32.253
loss: 2.147
gen_len: 186.297