SpanMarker with bert-base-uncased on SourceData
This is a SpanMarker model trained on the SourceData dataset that can be used for Named Entity Recognition. This SpanMarker model uses bert-base-uncased as the underlying encoder.
Model Details
Model Description
- Model Type: SpanMarker
- Encoder: bert-base-uncased
- Maximum Sequence Length: 256 tokens
- Maximum Entity Length: 8 words
- Training Dataset: SourceData
- Language: en
- License: cc-by-4.0
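SpanMarker treats NER as span classification: every candidate span of up to the maximum entity length (8 words here) is scored against the label set. The enumeration this limit implies can be sketched as follows; `enumerate_spans` is a hypothetical helper for illustration, not the library's internal API:

```python
def enumerate_spans(words, max_length=8):
    """Yield every contiguous span of up to `max_length` words,
    as (start, end) index pairs over the word list (end exclusive)."""
    spans = []
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_length, len(words)) + 1):
            spans.append((start, end))
    return spans

words = "FREE1 interacts with CPL1 in Arabidopsis".split()
spans = enumerate_spans(words)  # each span is then classified against the labels
```

For this 6-word sentence every span fits under the 8-word cap, giving 6 + 5 + ... + 1 = 21 candidates; longer sentences are pruned to spans of at most 8 words.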
Model Sources
Model Labels
| Label | Examples |
|:---------------|:----------------------------------------------------|
| CELL_LINE | "293T", "WM266.4 451Lu", "501mel" |
| CELL_TYPE | "BMDMs", "protoplasts", "epithelial" |
| DISEASE | "melanoma", "lung metastasis", "breast prostate cancer" |
| EXP_ASSAY | "interactions", "Yeast two-hybrid", "BiFC" |
| GENEPROD | "CPL1", "FREE1 CPL1", "FREE1" |
| ORGANISM | "Arabidopsis", "yeast", "seedlings" |
| SMALL_MOLECULE | "polyacrylamide", "CHX", "SDS polyacrylamide" |
| SUBCELLULAR | "proteasome", "D-bodies", "plasma" |
| TISSUE | "Colon", "roots", "serum" |
Evaluation
Metrics
| Label | Precision | Recall | F1 |
|:---------------|:----------|:-------|:-------|
| **all** | 0.8345 | 0.8328 | 0.8336 |
| CELL_LINE | 0.9060 | 0.8866 | 0.8962 |
| CELL_TYPE | 0.7365 | 0.7746 | 0.7551 |
| DISEASE | 0.6204 | 0.6531 | 0.6363 |
| EXP_ASSAY | 0.7224 | 0.7096 | 0.7160 |
| GENEPROD | 0.8944 | 0.8960 | 0.8952 |
| ORGANISM | 0.8752 | 0.8902 | 0.8826 |
| SMALL_MOLECULE | 0.8304 | 0.8223 | 0.8263 |
| SUBCELLULAR | 0.7859 | 0.7699 | 0.7778 |
| TISSUE | 0.8134 | 0.8056 | 0.8094 |
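The F1 column is the harmonic mean of precision and recall, which can be verified against any row of the table:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# "all" row from the table above
round(f1_score(0.8345, 0.8328), 4)  # 0.8336
```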
Uses
Direct Use for Inference
```python
from span_marker import SpanMarkerModel

# Download from the Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")
# Run inference
entities = model.predict("Comparison of ENCC-derived neurospheres treated with intestinal extract from hypoganglionosis rats, hypoganglionosis treated with Fecal microbiota transplantation (FMT) sham rat. Comparison of neuronal markers. (J) Immunofluorescence stain number of PGP9.5+. Nuclei were stained blue with DAPI; Triangles indicate PGP9.5+.")
```
Downstream Use
You can finetune this model on your own dataset.
```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")

# Load a labeled NER dataset to finetune on
dataset = load_dataset("conll2003")

# Initialize a Trainer using the pretrained model and the dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-bert-base-uncased-sourcedata-finetuned")
```
Training Details
Training Set Metrics
| Training set | Min | Median | Max |
|:----------------------|:----|:--------|:-----|
| Sentence length | 4 | 71.0253 | 2609 |
| Entities per sentence | 0 | 8.3186 | 162 |
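Statistics of this kind can be reproduced with the standard library alone; a minimal sketch over a toy list of per-sentence token counts (the real numbers above come from the SourceData training split):

```python
import statistics

def length_stats(lengths):
    """Return (min, median, max) for a list of per-sentence counts."""
    return min(lengths), statistics.median(lengths), max(lengths)

# toy example, not the actual SourceData distribution
sentence_lengths = [4, 12, 71, 2609]
length_stats(sentence_lengths)  # (4, 41.5, 2609)
```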
Training Hyperparameters
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
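The linear scheduler with a 0.1 warmup ratio means the learning rate climbs from 0 to 5e-05 over the first 10% of steps, then decays linearly back to 0. A sketch of the schedule these settings imply (not the library's own implementation):

```python
def linear_schedule_with_warmup(step, total_steps, warmup_ratio=0.1, base_lr=5e-05):
    """LR rises linearly during warmup, then decays linearly to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 1000
linear_schedule_with_warmup(100, total)   # 5e-05: peak at the end of warmup
linear_schedule_with_warmup(1000, total)  # 0.0: fully decayed at the final step
```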
Training Results
| Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:-------|:------|:----------------|:---------------------|:------------------|:--------------|:--------------------|
| 0.5237 | 3000 | 0.0162 | 0.7972 | 0.8162 | 0.8065 | 0.9520 |
| 1.0473 | 6000 | 0.0155 | 0.8188 | 0.8251 | 0.8219 | 0.9560 |
| 1.5710 | 9000 | 0.0155 | 0.8213 | 0.8324 | 0.8268 | 0.9563 |
| 2.0946 | 12000 | 0.0163 | 0.8315 | 0.8347 | 0.8331 | 0.9581 |
| 2.6183 | 15000 | 0.0167 | 0.8303 | 0.8378 | 0.8340 | 0.9582 |
Framework Versions
- Python: 3.9.16
- SpanMarker: 1.3.1.dev
- Transformers: 4.33.0
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.0
- Tokenizers: 0.13.2
Citation
BibTeX
```bibtex
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```