What libraries can I use for Fill-Mask?

The transformers and transformers.js libraries are compatible with Fill-Mask.

What models can I use for Fill-Mask?

The distilbert-base-uncased and xlm-roberta-base models can be used for Fill-Mask.

What datasets can I use for Fill-Mask?

The wikipedia and c4 datasets can be used for Fill-Mask.

What metrics can I use for Fill-Mask?

The cross_entropy and perplexity metrics can be used for Fill-Mask.

Fill-Mask

Masked language modeling is the task of masking some of the words in a sentence and predicting which words should replace those masks. These models are useful when we want a statistical understanding of the language on which the model was trained.
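To make the masking step concrete, here is a minimal sketch in plain Python. It is not how any particular library implements masking: real models mask subword tokens rather than whitespace-separated words, and the `mask_words` helper, the `<mask>` string, and the 15% masking rate are assumptions chosen for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # assumption: RoBERTa-style mask token, as in the example on this page


def mask_words(sentence, mask_prob=0.15, seed=None):
    """Hide a random subset of whitespace-separated words behind MASK_TOKEN.

    Returns the masked sentence and a dict mapping word positions to the
    hidden words, which are the labels a model would be trained to predict.
    """
    rng = random.Random(seed)
    words = sentence.split()
    labels = {}
    for i, word in enumerate(words):
        if rng.random() < mask_prob:
            labels[i] = word
            words[i] = MASK_TOKEN
    # Make sure at least one word is masked so every sentence gives a training signal
    if not labels and words:
        i = rng.randrange(len(words))
        labels[i] = words[i]
        words[i] = MASK_TOKEN
    return " ".join(words), labels


masked, labels = mask_words("The dog barked at me", seed=0)
print(masked)   # the sentence with one or more words replaced by <mask>
print(labels)   # position -> original word
```

The labels come from the text itself, which is why no manual annotation is needed.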

Input: The <mask> barked at me

Output (Fill-Mask model predictions with scores):

wolf: 0.487
dog: 0.061
cat: 0.058
fox: 0.047
squirrel: 0.025

About Fill-Mask

Use Cases

Domain Adaptation 👩‍⚕️

Masked language models do not require labelled data! They are trained by masking a few words in each sentence and asking the model to guess the masked words. This makes them very practical to train!

For example, masked language modeling is used to train large models for domain-specific problems. If you have to work on a domain-specific task, such as retrieving information from medical research papers, you can train a masked language model using those papers. 📄

The resulting model has a statistical understanding of the language used in medical research papers. It can be further trained, in a process called fine-tuning, to solve different tasks such as Text Classification or Question Answering, for example to build an information extraction system for medical research papers. 👩‍⚕️ Pre-training on domain-specific data tends to yield better results (see this paper for an example).

If you don't have the data to train a masked language model, you can also use an existing domain-specific masked language model from the Hub and fine-tune it with your smaller task dataset. That's the magic of Open Source and sharing your work! 🎉

Inference with Fill-Mask Pipeline

You can use the fill-mask pipeline from the Transformers library to do inference with masked language models. If a model name is not provided, the pipeline will be initialized with distilroberta-base. You provide text containing a mask token, and the pipeline returns a list of candidate values for the mask, ranked by score.

from transformers import pipeline

# No model name given, so the default model is used
unmasker = pipeline("fill-mask")
unmasker("Paris is the <mask> of France.")

# Illustrative output (other fields omitted):
# [{'score': 0.7, 'sequence': 'Paris is the capital of France.'},
#  {'score': 0.2, 'sequence': 'Paris is the birthplace of France.'},
#  {'score': 0.1, 'sequence': 'Paris is the heart of France.'}]
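The mask token is not the same for every model: distilbert-base-uncased expects `[MASK]`, while distilroberta-base and xlm-roberta-base expect `<mask>`. The sketch below shows one way to avoid hard-coding it, by reading the token from the pipeline's tokenizer. The `fill_template` and `predict` helpers and the `{mask}` placeholder are hypothetical names introduced here for illustration; `top_k` is passed when calling the pipeline to limit the number of candidates.

```python
def fill_template(template, mask_token):
    """Insert a model-specific mask token into a template containing '{mask}'."""
    return template.format(mask=mask_token)


def predict(template, model_name="distilroberta-base", top_k=3):
    """Return (token, score) pairs for the masked position in the template."""
    # Imported lazily so fill_template can be used without transformers installed
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name)
    # Ask the tokenizer which mask token this model expects
    text = fill_template(template, unmasker.tokenizer.mask_token)
    results = unmasker(text, top_k=top_k)
    return [(r["token_str"].strip(), r["score"]) for r in results]


# Example call (downloads the model on first use):
# predict("Paris is the {mask} of France.", model_name="distilbert-base-uncased")
```

Writing the input as a template keeps the same calling code usable across models with different mask tokens.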

Useful Resources

Would you like to learn more about the topic? Awesome! Here you can find some curated resources that can be helpful to you!

Notebooks

Scripts for training

Documentation

Masked language modeling task guide

Compatible libraries

Transformers, Transformers.js

Fill-Mask demo

using distilroberta-base

Models for Fill-Mask

Browse Models (8,529)

distilbert-base-uncased

Fill-Mask • Updated Aug 18 • 6.95M • 261

Note: A faster and smaller model than the famous BERT model.

xlm-roberta-base

Fill-Mask • Updated Apr 7 • 9.85M • 381

Note: A multilingual model trained on 100 languages.

Datasets for Fill-Mask

Browse Datasets (248)

wikipedia

Preview • Updated Jun 1 • 37.6k • 282

Note: A common dataset that is used to train models for many languages.

c4

Viewer • Updated Nov 3, 2022 • 81.6k • 146

Note: A large English dataset with text crawled from the web.

Spaces using Fill-Mask

No example Space is defined for this task.

Note: Contribute by proposing a Space for this task!

Metrics for Fill-Mask

cross_entropy: Cross entropy is a metric that measures the difference between two probability distributions. For Fill-Mask, these are the distribution over words predicted by the model for a masked position and the true distribution, which puts all probability on the original word.

perplexity: Perplexity is the exponential of the cross-entropy loss. It evaluates the probabilities the model assigns to the word being predicted (the masked word, in the Fill-Mask setting). Lower perplexity indicates better performance.
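As a small worked example of how the two metrics relate, the sketch below computes both from the probabilities a model assigned to the correct words. The helper names and the probability values are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import math


def cross_entropy(true_token_probs):
    """Mean negative log-probability assigned to the correct words."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


def perplexity(true_token_probs):
    """Exponential of the cross-entropy."""
    return math.exp(cross_entropy(true_token_probs))


# Hypothetical probabilities assigned to the correct word at three masked positions
probs = [0.5, 0.25, 0.125]

print(cross_entropy(probs))  # 2 * ln(2), about 1.386
print(perplexity(probs))     # about 4.0
```

A perplexity of 4 can be read as the model being, on average, as uncertain as if it were choosing uniformly among four words.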