Visual Question Answering
Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. VQA models take an image and a natural language question as input and output a natural language answer.
About Visual Question Answering
Use Cases
Aiding Visually Impaired People
VQA models can be used to reduce visual barriers for visually impaired individuals by allowing them to get information about images from the web and the real world.
Education
VQA models can be used to improve experiences at museums by allowing visitors to directly ask the questions they are interested in.
Improved Image Retrieval
Visual question answering models can be used to retrieve images with specific characteristics. For example, the user can ask "Is there a dog?" to find all images with dogs from a set of images.
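The retrieval idea above can be sketched as a simple filter. This is a minimal sketch, assuming the VQA model is exposed as a callable that behaves like the visual-question-answering pipeline from the Inference section, returning a list of dictionaries with 'score' and 'answer' keys. The helper name filter_images_by_question, the expected answer "yes", and the 0.5 score threshold are illustrative choices, not part of any library.

```python
from typing import Any, Callable, Iterable, List


def filter_images_by_question(
    vqa: Callable[..., List[dict]],
    images: Iterable[Any],
    question: str,
    expected_answer: str = "yes",
    min_score: float = 0.5,
) -> List[Any]:
    """Return the images whose top answer to `question` matches `expected_answer`.

    `vqa` is assumed to be called as vqa(image, question, top_k=1) and to
    return a list such as [{'score': 0.99, 'answer': 'yes'}].
    """
    wanted = expected_answer.strip().lower()
    matches = []
    for image in images:
        predictions = vqa(image, question, top_k=1)
        if not predictions:
            continue  # the model returned no answer for this image
        top = predictions[0]
        # Keep the image only if the best answer matches and is confident enough
        if top["answer"].strip().lower() == wanted and top["score"] >= min_score:
            matches.append(image)
    return matches
```

With a real model, `vqa` could be the object returned by pipeline("visual-question-answering") and `images` a list of PIL images. Yes/no questions keep the comparison simple; free-form answers would need a looser match than exact string equality.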
Video Search
Specific snippets/timestamps of a video can be retrieved based on search queries. For example, the user can ask "At which part of the video does the guitar appear?" and get a specific timestamp range from the whole video.
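One naive way to approximate this with an image VQA model is to sample frames at known timestamps, ask the question about each frame, and merge neighbouring matches into ranges. The sketch below covers only the merging step. Frame sampling and the per-frame VQA call are assumed to happen elsewhere, positive_time_ranges and max_gap are hypothetical names, and dedicated video models may work quite differently.

```python
from typing import List, Sequence, Tuple


def positive_time_ranges(
    timestamps: Sequence[float],
    frame_matches: Sequence[bool],
    max_gap: float = 1.0,
) -> List[Tuple[float, float]]:
    """Merge matching frames into (start, end) ranges in seconds.

    timestamps[i] is the time of sampled frame i (in ascending order) and
    frame_matches[i] says whether the model answered "yes" for that frame.
    Matching frames at most `max_gap` seconds apart end up in the same range.
    """
    if len(timestamps) != len(frame_matches):
        raise ValueError("timestamps and frame_matches must have the same length")

    ranges: List[Tuple[float, float]] = []
    for t, matched in zip(timestamps, frame_matches):
        if not matched:
            continue
        if ranges and t - ranges[-1][1] <= max_gap:
            # Close enough to the previous match: extend the current range
            ranges[-1] = (ranges[-1][0], t)
        else:
            # Too far from the previous match (or first match): start a new range
            ranges.append((t, t))
    return ranges
```

A larger max_gap tolerates frames where the object is briefly missed, at the cost of merging appearances that are actually separate.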
Task Variants
Video Question Answering
Video Question Answering aims to answer questions asked about the content of a video.
Inference
You can infer with Visual Question Answering models using the vqa (or visual-question-answering) pipeline. This pipeline requires the Python Image Library (PIL) to process images. You can install it with pip install pillow.
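from PIL import Image
from transformers import pipeline

# Create the pipeline with its default model
vqa_pipeline = pipeline("visual-question-answering")

# Replace with the path to your own image
image = Image.open("elephant.jpeg")
question = "Is there an elephant?"

# top_k=1 returns only the highest-scoring answer
vqa_pipeline(image, question, top_k=1)
# [{'score': 0.9998154044151306, 'answer': 'yes'}]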
Useful Resources
The contents of this page are contributed by Bharat Raghunathan and Jose Londono Botero.
A visual question answering model trained to convert charts and plots to text.
A strong visual question answering model that answers questions about book covers.
A widely used dataset containing questions (with answers) about images.
A dataset to benchmark visual reasoning based on text in images.
An application that can answer questions based on images.
An application that can caption images and answer questions about a given image.
- accuracy
- Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP is the number of true positives, TN of true negatives, FP of false positives, and FN of false negatives.
- Wu-Palmer similarity
- Measures how much a predicted answer differs from the ground truth based on the difference in their semantic meaning.
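Both metrics can be illustrated with a short sketch. Accuracy is computed here as the fraction of exactly matching answers, which is the same proportion as (TP + TN) / (TP + TN + FP + FN). Lowercasing the answers before comparison is an assumption. Wu-Palmer similarity is normally computed over a lexical database such as WordNet; to stay self-contained, this sketch uses a tiny taxonomy invented for illustration (TOY_PARENT) and the standard formula 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest common ancestor and the root has depth 1.

```python
from typing import Dict, List, Optional, Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predicted answers that exactly match the reference answer."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Toy taxonomy (child -> parent), invented for illustration only.
TOY_PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(concept: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return [concept, its parent, ..., root]."""
    path = [concept]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def wu_palmer(a: str, b: str, parent: Dict[str, Optional[str]] = TOY_PARENT) -> float:
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    path_a = path_to_root(a, parent)
    ancestors_b = set(path_to_root(b, parent))
    # Walking up from `a`, the first node shared with `b` is the deepest common ancestor
    lcs = next(node for node in path_a if node in ancestors_b)
    depth_lcs = len(path_to_root(lcs, parent))
    return 2 * depth_lcs / (len(path_a) + len(path_to_root(b, parent)))
```

In this toy taxonomy, answering "cat" when the ground truth is "dog" scores higher than answering "car", because the two animals share a deeper common ancestor. Exact-match accuracy would count both answers as equally wrong.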