Donut (base-sized model, fine-tuned on DocVQA)

Donut model fine-tuned on DocVQA. It was introduced in the paper OCR-free Document Understanding Transformer by Geewook Kim et al. and first released in this repository.

Disclaimer: The team releasing Donut did not write a model card for this model, so this model card has been written by the Model Database team.

Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape (batch_size, seq_len, hidden_size). The decoder then autoregressively generates text, one token at a time, conditioned on those embeddings.
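
The encode-then-decode flow can be illustrated with a toy greedy decoding loop. This is only a sketch: `greedy_decode` and `toy_step` are hypothetical names, the token ids are made up, and `toy_step` is a stand-in for the real BART decoder rather than anything taken from the Donut code. It shows how each new token depends on both the encoder output and the tokens generated so far.

```python
from typing import Callable, List

# Encoder output for ONE image: a list of seq_len vectors, each of hidden_size floats.
EncoderStates = List[List[float]]


def greedy_decode(
    encoder_states: EncoderStates,
    step: Callable[[EncoderStates, List[int]], List[float]],
    bos_id: int,
    eos_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Generate token ids one at a time, always taking the highest-scoring one."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        # The decoder sees the encoder output AND every token produced so far.
        scores = step(encoder_states, tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_step(encoder_states: EncoderStates, tokens: List[int]) -> List[float]:
    """Hypothetical decoder stand-in. Vocabulary: 0=BOS, 1=EOS, 2 and 3 are words."""
    scores = [0.0, 0.0, 0.0, 0.0]
    generated = len(tokens) - 1
    if generated >= len(encoder_states):
        scores[1] = 1.0  # nothing left to describe: emit EOS
    else:
        # Choose a word from the sign of the matching encoder vector's sum.
        word = 2 if sum(encoder_states[generated]) >= 0 else 3
        scores[word] = 1.0
    return scores


# seq_len = 3, hidden_size = 2 (the batch dimension is dropped for brevity).
states = [[0.5, 0.1], [-1.0, 0.2], [0.3, 0.3]]
output_ids = greedy_decode(states, toy_step, bos_id=0, eos_id=1)
print(output_ids)
```

In the real model the scores come from the decoder's cross-attention over the encoder embeddings, and generation may use a strategy other than greedy search.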

Intended uses & limitations

This model is fine-tuned on DocVQA, a document visual question answering dataset.

See the documentation, which includes code examples.
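
As a rough illustration of how a question might be passed to a DocVQA-style Donut checkpoint, the sketch below builds a decoder prompt and parses a generated sequence. The tag layout (`<s_docvqa>`, `<s_question>`, `<s_answer>`) and the special tokens `</s>` and `<pad>` are assumptions based on the commonly documented Donut DocVQA format and should be checked against this checkpoint's tokenizer. The helper names `build_docvqa_prompt` and `parse_docvqa_output` are hypothetical and not part of any library.

```python
import re


def build_docvqa_prompt(question: str) -> str:
    """Build the decoder prompt (assumed tag layout; verify against the tokenizer)."""
    return f"<s_docvqa><s_question>{question}</s_question><s_answer>"


def parse_docvqa_output(
    sequence: str, eos_token: str = "</s>", pad_token: str = "<pad>"
) -> dict:
    """Extract the question and answer from a decoded output string."""
    # Remove special tokens, then drop the leading task-start tag.
    sequence = sequence.replace(eos_token, "").replace(pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
    match = re.search(
        r"<s_question>(.*?)</s_question>\s*<s_answer>(.*?)(?:</s_answer>|$)",
        sequence,
        flags=re.DOTALL,
    )
    if match is None:
        # Fall back to the raw text if the expected tags are missing.
        return {"text_sequence": sequence}
    return {"question": match.group(1).strip(), "answer": match.group(2).strip()}


prompt = build_docvqa_prompt("When is the coffee break?")

# Example decoded string, written by hand for illustration (not real model output).
decoded = (
    "<s_docvqa><s_question> When is the coffee break?</s_question>"
    "<s_answer> 11:14 to 11:39 a.m.</s_answer></s>"
)
result = parse_docvqa_output(decoded)
print(prompt)
print(result)
```

With the transformers library, the prompt would typically be tokenized without extra special tokens and passed as the decoder input to a `VisionEncoderDecoderModel`, alongside pixel values produced by a `DonutProcessor`. The exact calls are described in the documentation.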