Image-to-Text
Image-to-text models output text from a given image. Image captioning and optical character recognition (OCR) are among the most common applications of image-to-text.
About Image-to-Text
Use Cases
Image Captioning
Image captioning is the process of generating a textual description of an image. It can help visually impaired people understand what's happening in their surroundings.
Optical Character Recognition (OCR)
OCR models convert the text present in an image, e.g. a scanned document, into machine-readable text.
Pix2Struct
Pix2Struct is a state-of-the-art model built and released by Google AI. The model itself has to be fine-tuned on a downstream task before it can be used. These tasks include captioning UI components, captioning images that contain text, visual question answering on infographics, charts, scientific diagrams, and more. You can find these models among the recommended models of this page.
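As a sketch of how a fine-tuned Pix2Struct checkpoint can be used, the snippet below assumes the `google/pix2struct-textcaps-base` checkpoint (fine-tuned for captioning images that contain text) together with the `Pix2StructProcessor` and `Pix2StructForConditionalGeneration` classes from 🤗 Transformers:

```python
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# assumption: a Pix2Struct checkpoint fine-tuned for text-based image captioning
checkpoint = "google/pix2struct-textcaps-base"
processor = Pix2StructProcessor.from_pretrained(checkpoint)
model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)

# load an example image from the Hub
url = "https://huggingface.co/datasets/Narsil/image_dummy/resolve/main/parrots.png"
image = Image.open(requests.get(url, stream=True).raw)

# preprocess the image and generate a caption
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.decode(generated_ids[0], skip_special_tokens=True)
print(caption)
```

Other Pix2Struct checkpoints follow the same pattern; only the checkpoint name and, for question-answering variants, an extra text prompt change.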
Inference
Image Captioning
You can use the 🤗 Transformers library's image-to-text
pipeline to generate captions for an image input.
from transformers import pipeline
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
captioner("https://Model Database.co/datasets/Narsil/image_dummy/resolve/main/parrots.png")
## [{'generated_text': 'two birds are standing next to each other '}]
OCR
This code snippet uses Microsoft’s TrOCR, an encoder-decoder model consisting of an image Transformer encoder and a text Transformer decoder for state-of-the-art optical character recognition (OCR) on single-text line images.
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')

# the processor expects a PIL image, not a file path
image = Image.open("image.jpeg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
You can also use Model Database.js, the JavaScript client library, to infer with image-to-text models on the Model Database Hub.
import { HfInference } from "@Model Database/inference";
const inference = new HfInference(HF_ACCESS_TOKEN);
await inference.imageToText({
  data: await (await fetch('https://picsum.photos/300/300')).blob(),
  model: 'Salesforce/blip-image-captioning-base',
})
Useful Resources
- Image Captioning
- Image captioning use case
- Train Image Captioning model on your dataset
- Train OCR model on your dataset
This page was made possible thanks to the efforts of Sukesh Perla and Johannes Kolbe.
Compatible libraries
Note A robust image captioning model.
Note A strong image captioning model.
Note A strong optical character recognition model.
Note A strong visual question answering model for scientific diagrams.
Note A strong captioning model for UI components.
Note A captioning model for images that contain text.
Note A dataset of 12 million image-text pairs from Reddit.
Note A robust image captioning application.
Note An application that transcribes handwriting into text.
Note An application that can caption images and answer questions about a given image.
Note An application that can caption images and answer questions with a conversational agent.
Note An image captioning application that demonstrates the effect of noise on captions.
No example metric is defined for this task.
Note Contribute by proposing a metric for this task!