gte-large
General Text Embeddings (GTE) model, from the paper "Towards General Text Embeddings with Multi-stage Contrastive Learning".
The GTE models are trained by Alibaba DAMO Academy. They are based on the BERT framework and come in three sizes: GTE-large, GTE-base, and GTE-small. The models are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which allows them to be applied to a variety of downstream text-embedding tasks, including information retrieval, semantic textual similarity, and text reranking.
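All three sizes are published on the Hugging Face Hub under the same interface. As a minimal sketch (assuming the sentence-transformers package is installed), the checkpoints can be loaded and compared like this; the embedding dimensions (1024 / 768 / 384) follow the Metrics table below:

from sentence_transformers import SentenceTransformer

# Load each published GTE checkpoint and print its embedding dimension
for name in ["thenlper/gte-large", "thenlper/gte-base", "thenlper/gte-small"]:
    model = SentenceTransformer(name)
    emb = model.encode(["hello world"])
    print(name, emb.shape)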
Metrics
We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the MTEB leaderboard.
Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) | Classification (12) |
---|---|---|---|---|---|---|---|---|---|---|---|
gte-large | 0.67 | 1024 | 512 | 63.13 | 46.84 | 85.00 | 59.13 | 52.22 | 83.35 | 31.66 | 73.33 |
gte-base | 0.22 | 768 | 512 | 62.39 | 46.2 | 84.57 | 58.61 | 51.14 | 82.3 | 31.17 | 73.01 |
e5-large-v2 | 1.34 | 1024 | 512 | 62.25 | 44.49 | 86.03 | 56.61 | 50.56 | 82.05 | 30.19 | 75.24 |
e5-base-v2 | 0.44 | 768 | 512 | 61.5 | 43.80 | 85.73 | 55.91 | 50.29 | 81.05 | 30.28 | 73.84 |
gte-small | 0.07 | 384 | 512 | 61.36 | 44.89 | 83.54 | 57.7 | 49.46 | 82.07 | 30.42 | 72.31 |
text-embedding-ada-002 | - | 1536 | 8192 | 60.99 | 45.9 | 84.89 | 56.32 | 49.25 | 80.97 | 30.8 | 70.93 |
e5-small-v2 | 0.13 | 384 | 512 | 59.93 | 39.92 | 84.67 | 54.32 | 49.04 | 80.39 | 31.16 | 72.94 |
sentence-t5-xxl | 9.73 | 768 | 512 | 59.51 | 43.72 | 85.06 | 56.42 | 42.24 | 82.63 | 30.08 | 73.42 |
all-mpnet-base-v2 | 0.44 | 768 | 514 | 57.78 | 43.69 | 83.04 | 59.36 | 43.81 | 80.28 | 27.49 | 65.07 |
sgpt-bloom-7b1-msmarco | 28.27 | 4096 | 2048 | 57.59 | 38.93 | 81.9 | 55.65 | 48.22 | 77.74 | 33.6 | 66.19 |
all-MiniLM-L12-v2 | 0.13 | 384 | 512 | 56.53 | 41.81 | 82.41 | 58.44 | 42.69 | 79.8 | 27.9 | 63.21 |
all-MiniLM-L6-v2 | 0.09 | 384 | 512 | 56.26 | 42.35 | 82.37 | 58.04 | 41.95 | 78.9 | 30.81 | 63.05 |
contriever-base-msmarco | 0.44 | 768 | 512 | 56.00 | 41.1 | 82.54 | 53.14 | 41.88 | 76.51 | 30.36 | 66.68 |
sentence-t5-base | 0.22 | 768 | 512 | 55.27 | 40.21 | 85.18 | 53.09 | 33.63 | 81.14 | 31.39 | 69.81 |
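The numbers above come from the public MTEB benchmark. As a rough illustration of how a single task score might be reproduced locally, here is a hedged sketch using the mteb package; the chosen task and output folder are illustrative, and the exact API may differ between mteb versions:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Load gte-large through sentence-transformers so MTEB can call .encode()
model = SentenceTransformer("thenlper/gte-large")

# Evaluate on a single classification task; results are written as JSON files
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/gte-large")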
Usage
Code example
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings, ignoring padding positions
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Scaled cosine similarity of the first text (the query) against the remaining texts
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
Use with sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['That is a happy person', 'That is a very happy person']
model = SentenceTransformer('thenlper/gte-large')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
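Since retrieval is one of the intended downstream tasks, a short semantic-search sketch with sentence-transformers may also be useful. The corpus and query below are made up for illustration; semantic_search is a utility from sentence_transformers.util that ranks corpus embeddings by cosine similarity to the query:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer('thenlper/gte-large')

# Illustrative corpus and query (not from the original card)
corpus = [
    "Beijing is the capital of China.",
    "Quicksort is a divide-and-conquer sorting algorithm.",
    "The Great Wall is a historic fortification in northern China."
]
query = "what is the capital of China?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Return the top-3 corpus passages for the query
hits = semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 4))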
Limitation
This model is intended for English texts only, and any long input is truncated to a maximum of 512 tokens.
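Because of the 512-token limit, longer documents have to be handled by the caller. One possible workaround (an assumption of this sketch, not part of the original card) is to split a long document into chunks that fit under the limit, embed each chunk, and aggregate the chunk embeddings, for example by averaging:

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = SentenceTransformer("thenlper/gte-large")

def chunk_text(text, max_tokens=500):
    # Split a long text into pieces that fit under the 512-token limit
    # (500 leaves headroom for the special tokens added by the model)
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

long_document = "..."  # placeholder for a document longer than 512 tokens
chunks = chunk_text(long_document)

# A simple (lossy) aggregation: average the normalized chunk embeddings
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
doc_embedding = chunk_embeddings.mean(axis=0)
print(doc_embedding.shape)  # (1024,) for gte-large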
Citation
If you find our paper or models helpful, please consider citing them as follows:
@misc{li2023general,
  title={Towards General Text Embeddings with Multi-stage Contrastive Learning},
  author={Zehan Li and Xin Zhang and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang},
  year={2023},
  eprint={2308.03281},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Evaluation results
All values below are self-reported scores on the MTEB test sets.

- MTEB AmazonCounterfactualClassification (en): accuracy 72.627, ap 34.469, f1 66.237
- MTEB AmazonPolarityClassification: accuracy 92.518, ap 89.498, f1 92.511
- MTEB AmazonReviewsClassification (en): accuracy 49.074, f1 48.448
- MTEB ArguAna:
  - map_at_1 32.077, map_at_3 43.184, map_at_5 46.072, map_at_10 48.153, map_at_100 48.963, map_at_1000 48.966
  - mrr_at_1 33.073, mrr_at_3 43.563, mrr_at_5 46.383, mrr_at_10 48.540, mrr_at_100 49.335, mrr_at_1000 49.338
  - ndcg_at_1 32.077, ndcg_at_3 46.934, ndcg_at_5 52.158, ndcg_at_10 57.158, ndcg_at_100 60.325, ndcg_at_1000 60.402
  - precision_at_1 32.077, precision_at_3 19.275, precision_at_5 14.111, precision_at_10 8.592, precision_at_100 0.991, precision_at_1000 0.100
  - recall_at_1 32.077, recall_at_3 57.824, recall_at_5 70.555, recall_at_10 85.917, recall_at_100 99.075, recall_at_1000 99.644
- MTEB ArxivClusteringP2P: v_measure 48.619
- MTEB ArxivClusteringS2S: v_measure 43.357
- MTEB AskUbuntuDupQuestions: map 63.064, mrr 76.156
- MTEB BIOSSES: cos_sim_pearson 90.254, cos_sim_spearman 88.651, euclidean_pearson 88.149, euclidean_spearman 88.507, manhattan_pearson 87.965, manhattan_spearman 88.212
- MTEB Banking77Classification: accuracy 86.058, f1 86.016
- MTEB BiorxivClusteringP2P: v_measure 39.105