Image-to-Text
Image-to-text models output text from a given image. Image captioning and optical character recognition (OCR) are among the most common applications of image-to-text.
About Image-to-Text
Use Cases
Image Captioning
Image captioning is the process of generating a textual description of an image. It can help visually impaired people understand what's happening in their surroundings.
Optical Character Recognition (OCR)
OCR models convert the text present in an image, e.g. a scanned document, into machine-readable text.
Pix2Struct
Pix2Struct is a state-of-the-art model built and released by Google AI. The model itself has to be fine-tuned on a downstream task before it can be used. These tasks include captioning UI components, captioning images that contain text, visual question answering on infographics, charts, scientific diagrams, and more. You can find these models among the recommended models of this page.
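As a sketch of how a fine-tuned Pix2Struct checkpoint can be used, the snippet below assumes the `google/pix2struct-textcaps-base` checkpoint (fine-tuned for captioning images that contain text) together with the `Pix2StructProcessor` and `Pix2StructForConditionalGeneration` classes from 🤗 Transformers:

```python
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# assumption: a Pix2Struct checkpoint fine-tuned for text-based image captioning
checkpoint = "google/pix2struct-textcaps-base"
processor = Pix2StructProcessor.from_pretrained(checkpoint)
model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)

# load an example image from the Hub
url = "https://huggingface.co/datasets/Narsil/image_dummy/resolve/main/parrots.png"
image = Image.open(requests.get(url, stream=True).raw)

# preprocess the image and generate a caption
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.decode(generated_ids[0], skip_special_tokens=True)
print(caption)
```

Other Pix2Struct checkpoints follow the same pattern; only the checkpoint name and, for question-answering variants, an extra text prompt change.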
Inference
Image Captioning
You can use the 🤗 Transformers library's image-to-text
pipeline to generate captions for an image input.
from transformers import pipeline
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
captioner("https://Model Database.co/datasets/Narsil/image_dummy/resolve/main/parrots.png")
## [{'generated_text': 'two birds are standing next to each other '}]
OCR
This code snippet uses Microsoft’s TrOCR, an encoder-decoder model consisting of an image Transformer encoder and a text Transformer decoder for state-of-the-art optical character recognition (OCR) on single-text line images.
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')

# the processor expects a PIL image, not a file path
image = Image.open("image.jpeg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
You can also use Model Database.js, the JavaScript client library, to infer with image-to-text models on the Model Database Hub.
import { HfInference } from "@Model Database/inference";
const inference = new HfInference(HF_ACCESS_TOKEN);
await inference.imageToText({
  data: await (await fetch('https://picsum.photos/300/300')).blob(),
  model: 'Salesforce/blip-image-captioning-base',
})
Useful Resources
- Image Captioning
- Image captioning use case
- Train Image Captioning model on your dataset
- Train OCR model on your dataset
This page was made possible thanks to the efforts of Sukesh Perla and Johannes Kolbe.
Compatible libraries
Note A robust image captioning model.
Note A strong image captioning model.
Note A strong optical character recognition model.
Note A strong visual question answering model for scientific diagrams.
Note A strong captioning model for UI components.
Note A captioning model for images that contain text.
Note A dataset of 12 million image-text pairs from Reddit.
Note A robust image captioning application.
Note An application that transcribes handwriting into text.
Note An application that can caption images and answer questions about a given image.
Note An application that can caption images and answer questions with a conversational agent.
Note An image captioning application that demonstrates the effect of noise on captions.
No example metric is defined for this task.
Note Contribute by proposing a metric for this task!