Dataset Card for Images of Cervical Cells with AgNOR Stain Technique

Dataset Summary

The CCAgT (Images of Cervical Cells with AgNOR Stain Technique) dataset contains 9339 images (1600x1200 resolution, where each pixel is 0.111 µm x 0.111 µm) from 15 different slides stained using the AgNOR technique. Each image has at least one label. In total, the dataset has more than 63K annotated object instances. The images come from patients of the Gynecology and Colposcopy Outpatient Clinic of the Polydoro Ernani de São Thiago University Hospital of the Universidade Federal de Santa Catarina (HU-UFSC).

Supported Tasks and Leaderboards

  • image-segmentation: The dataset can be used to train a model for semantic segmentation or instance segmentation. Semantic segmentation consists of classifying each pixel of the image; success on this task is typically measured by high values of mean IoU or F-score computed over pixels. Instance segmentation consists of detecting objects first and then applying a semantic segmentation model inside each detected object; success is typically measured by high values of recall, precision and F-score computed over instances.

  • object-detection: The dataset can be used to train a model for object detection to detect the nuclei categories or the nucleolus organizer regions (NORs), which consists of locating instances of objects and then classifying each one. Success on this task is typically measured by high values of recall, precision and F-score.
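
As a rough illustration of the metrics named in the tasks above, the sketch below computes mean IoU over flat pixel labels and precision, recall and F-score from match counts. The function names (`mean_iou`, `precision_recall_f1`) and the toy inputs are illustrative assumptions, not part of the dataset tooling. Classes absent from both sequences are skipped when averaging, which is one common convention among several.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes for flat label sequences."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union:  # skip classes that appear in neither sequence
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy pixel labels: 0 = background, 1 = nucleus (values chosen for illustration only)
truth = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
miou = mean_iou(truth, pred, num_classes=2)

# Hypothetical instance-level counts for one category
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

For instance-level results, how a predicted object is matched to a ground-truth object (for example, by an IoU threshold) determines the `tp`, `fp` and `fn` counts; that matching step is left out of this sketch.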

Languages

The class labels in the dataset are in English.

Dataset Structure

Data Instances

An example looks like the one below:

semantic segmentation (default configuration)

{
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1200x1600 at 0x276021C5EB8>,
  'annotation': <PIL.PngImagePlugin.PngImageFile image mode=L size=1200x1600 at 0x385021C5ED7>
}

object detection

{
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1200x1600 at 0x276021C5EB8>,
  'objects': {
    'bbox': [
      [36, 7, 13, 32],
      [50, 7, 12, 32]
    ], 
    'label': [1, 5]
  }
}

instance segmentation

{
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1200x1600 at 0x276021C5EB8>,
  'objects': {
    'bbox': [
      [13.3, 7.5, 47.6, 38.3],
      [10.2, 7.5, 50.7, 38.3]
    ], 
    'segment': [
      [[36.2, 7.5, 13.3, 32.1, 52.1, 40.6, 60.9, 45.8, 50.1, 40, 40, 33.2, 35.2]],
      [[10.2, 7.5, 10.3, 32.1, 52.1, 40.6, 60.9, 45.8, 50.1, 40, 40, 33.2, 35.2]],
    ],
    'label': [1, 5]
  }
}

Data Fields

The data annotations have the following fields:

semantic segmentation (default configuration)

  • image: A PIL.Image.Image object containing the image. Note that when the image column is accessed, as in dataset[0]["image"], the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the "image" column, i.e. dataset[0]["image"] should always be preferred over dataset["image"][0].
  • annotation: A PIL.Image.Image object containing the annotation mask. The mask has a single channel and the following pixel values are possible: BACKGROUND (0), NUCLEUS (1), CLUSTER (2), SATELLITE (3), NUCLEUS_OUT_OF_FOCUS (4), OVERLAPPED_NUCLEI (5), NON_VIABLE_NUCLEUS (6) and LEUKOCYTE_NUCLEUS (7).
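
The mask values listed above can be mapped back to class names with a small lookup. The sketch below counts pixels per class from a flat sequence of mask values; `MASK_CLASSES` and `count_mask_pixels` are illustrative names, and the toy mask is made up. For a real sample, the flat values could be obtained from the decoded mask, for example with `numpy.asarray(mask).ravel().tolist()`, assuming NumPy is installed.

```python
from collections import Counter

# Pixel values of the semantic segmentation mask, as listed in this card.
MASK_CLASSES = {
    0: "BACKGROUND",
    1: "NUCLEUS",
    2: "CLUSTER",
    3: "SATELLITE",
    4: "NUCLEUS_OUT_OF_FOCUS",
    5: "OVERLAPPED_NUCLEI",
    6: "NON_VIABLE_NUCLEUS",
    7: "LEUKOCYTE_NUCLEUS",
}


def count_mask_pixels(pixels):
    """Count pixels per class name, given an iterable of integer mask values."""
    counts = Counter(pixels)
    # A value outside 0..7 raises KeyError, which flags an unexpected mask.
    return {MASK_CLASSES[value]: n for value, n in sorted(counts.items())}


# Toy 2x3 mask flattened row by row (illustrative values, not real data)
toy_mask = [0, 0, 1, 3, 1, 0]
pixel_counts = count_mask_pixels(toy_mask)
```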

object detection

  • image: A PIL.Image.Image object containing the image. Note that when the image column is accessed, as in dataset[0]["image"], the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the "image" column, i.e. dataset[0]["image"] should always be preferred over dataset["image"][0].
  • objects: a dictionary containing bounding boxes and labels of the cell objects
    • bbox: a list of bounding boxes (in the COCO format) corresponding to the objects present in the image
    • label: a list of integers representing the category (7 categories to describe the objects in total; two to differentiate nucleolus organizer regions), with the possible values including NUCLEUS (0), CLUSTER (1), SATELLITE (2), NUCLEUS_OUT_OF_FOCUS (3), OVERLAPPED_NUCLEI (4), NON_VIABLE_NUCLEUS (5) and LEUKOCYTE_NUCLEUS (6).
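
A minimal sketch of how the object detection fields can be read, assuming the usual COCO box layout of [x_min, y_min, width, height]. The names `OBJECT_CLASSES` and `coco_to_corners` are illustrative; the sample values are copied from the object detection example in this card.

```python
# Object-level label ids as listed in this card (there is no background class here).
OBJECT_CLASSES = [
    "NUCLEUS",
    "CLUSTER",
    "SATELLITE",
    "NUCLEUS_OUT_OF_FOCUS",
    "OVERLAPPED_NUCLEI",
    "NON_VIABLE_NUCLEUS",
    "LEUKOCYTE_NUCLEUS",
]


def coco_to_corners(bbox):
    """Convert a COCO box [x_min, y_min, width, height] to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


# Values from the object detection example in this card
objects = {"bbox": [[36, 7, 13, 32], [50, 7, 12, 32]], "label": [1, 5]}
decoded = [
    (OBJECT_CLASSES[label], coco_to_corners(box))
    for box, label in zip(objects["bbox"], objects["label"])
]
```

Note that these label ids are offset by one relative to the semantic segmentation mask values, where 0 is reserved for BACKGROUND.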

instance segmentation

  • image: A PIL.Image.Image object containing the image. Note that when the image column is accessed, as in dataset[0]["image"], the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the "image" column, i.e. dataset[0]["image"] should always be preferred over dataset["image"][0].
  • objects: a dictionary containing bounding boxes, segments and labels of the cell objects
    • bbox: a list of bounding boxes (in the COCO format) corresponding to the objects present in the image
    • segment: a list of segments in format of [polygon_0, ..., polygon_n], where each polygon is [x0, y0, ..., xn, yn].
    • label: a list of integers representing the category (7 categories to describe the objects in total; two to differentiate nucleolus organizer regions), with the possible values including NUCLEUS (0), CLUSTER (1), SATELLITE (2), NUCLEUS_OUT_OF_FOCUS (3), OVERLAPPED_NUCLEI (4), NON_VIABLE_NUCLEUS (5) and LEUKOCYTE_NUCLEUS (6).
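
The flat polygon layout [x0, y0, ..., xn, yn] can be unpacked into coordinate pairs as sketched below. `polygon_points` and `polygon_area` are hypothetical helper names, and the rectangle is made-up test data. The area uses the standard shoelace formula and assumes a simple (non-self-intersecting) polygon.

```python
def polygon_points(flat):
    """Turn [x0, y0, ..., xn, yn] into [(x0, y0), ..., (xn, yn)]."""
    if len(flat) % 2:
        raise ValueError("a polygon needs an even number of coordinates")
    return list(zip(flat[0::2], flat[1::2]))


def polygon_area(flat):
    """Area enclosed by the polygon, computed with the shoelace formula."""
    pts = polygon_points(flat)
    # Pair each vertex with the next one, wrapping around to close the polygon.
    twice_area = sum(
        x0 * y1 - x1 * y0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1])
    )
    return abs(twice_area) / 2


# Hypothetical 4x3 rectangle, used only to exercise the helpers
rect = [0, 0, 4, 0, 4, 3, 0, 3]
rect_points = polygon_points(rect)
rect_area = polygon_area(rect)
```

Since one object's segment is a list of polygons, its total area under this sketch would be the sum of `polygon_area` over that list, assuming the polygons do not overlap.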

Data Splits

The data is split randomly, using a fixed seed, into training, test and validation sets. The training set contains 70% of the images, and the test and validation sets contain 15% each. In total, the training set contains 6533 images and the test and validation sets contain 1403 images each.
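
A seeded random split of this kind can be sketched as follows. This is not the authors' split script: the function name `split_indices` and the seed value are placeholders, since the card states neither the seed nor the exact procedure. Only the set sizes are taken from the text above.

```python
import random


def split_indices(n_images, n_test, n_val, seed):
    """Shuffle image indices with a fixed seed and cut them into three disjoint sets."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # same seed, same order on every run
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# Sizes from this card; seed=0 is a placeholder, not the seed used by the authors
train, val, test = split_indices(9339, n_test=1403, n_val=1403, seed=0)
```

With these sizes the training set receives the remaining 6533 indices, which is about 70% of 9339.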

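Additional statistics per slide:
| Slide id | Diagnosis | Images | Annotations | NUCLEUS | CLUSTER | SATELLITE | NUCLEUS_OUT_OF_FOCUS | OVERLAPPED_NUCLEI | NON_VIABLE_NUCLEUS | LEUKOCYTE_NUCLEUS |
|---|---|---|---|---|---|---|---|---|---|---|
| A | CIN 3 | 1311 | 3164 | 763 | 1038 | 922 | 381 | 46 | 14 | 0 |
| B | SCC | 561 | 911 | 224 | 307 | 112 | 132 | 5 | 1 | 130 |
| C | AC | 385 | 11420 | 2420 | 3584 | 1112 | 1692 | 228 | 477 | 1907 |
| D | CIN 3 | 2125 | 1258 | 233 | 337 | 107 | 149 | 12 | 8 | 412 |
| E | CIN 3 | 506 | 11131 | 2611 | 6249 | 1648 | 476 | 113 | 34 | 0 |
| F | CIN 1 | 318 | 3365 | 954 | 1406 | 204 | 354 | 51 | 326 | 70 |
| G | CIN 2 | 249 | 2759 | 691 | 1279 | 336 | 268 | 49 | 51 | 85 |
| H | CIN 2 | 650 | 5216 | 993 | 983 | 425 | 2562 | 38 | 214 | 1 |
| I | No lesion | 309 | 474 | 56 | 55 | 19 | 170 | 2 | 23 | 149 |
| J | CIN 1 | 261 | 1786 | 355 | 304 | 174 | 743 | 18 | 33 | 159 |
| K | No lesion | 1503 | 13102 | 2464 | 6669 | 638 | 620 | 670 | 138 | 1903 |
| L | CIN 2 | 396 | 3289 | 842 | 796 | 387 | 1209 | 27 | 23 | 5 |
| M | CIN 2 | 254 | 1500 | 357 | 752 | 99 | 245 | 16 | 12 | 19 |
| N | CIN 3 | 248 | 911 | 258 | 402 | 67 | 136 | 10 | 6 | 32 |
| O | AC | 262 | 2904 | 792 | 1549 | 228 | 133 | 88 | 52 | 62 |
| Total | - | 9339 | 63190 | 14013 | 25710 | 6478 | 9270 | 1373 | 1412 | 4934 |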

Lesion types:

  • Cervical intraepithelial neoplasia 1 - CIN 1
  • Cervical intraepithelial neoplasia 2 - CIN 2
  • Cervical intraepithelial neoplasia 3 - CIN 3
  • Squamous cell carcinoma - SCC
  • Adenocarcinoma - AC
  • No lesion

Dataset Creation

Curation Rationale

CCAgT was built to provide a dataset for machines to learn how to identify nuclei and nucleolus organizer regions (NORs).

Source Data

Initial Data Collection and Normalization

The images were collected as patches/tiles of whole slide images (WSIs) from cervical samples stained with the AgNOR technique, which allows the detection of nucleolus organizer regions (NORs). NORs are DNA loops, located in the cell nucleolus, that contain the genes responsible for the transcription of ribosomal RNA. They contain a set of argyrophilic proteins that are selectively stained by silver nitrate and can be identified as black dots located throughout the nucleolar area; these dots are called AgNORs.

Who are the source language producers?

The dataset was built using images from examinations (a gynecological exam, colposcopy and biopsy) of 15 female patients who were treated at the Gynecology and Colposcopy Outpatient Clinic of the University Hospital Professor Polydoro Ernani de São Thiago of the Federal University of Santa Catarina (HU-UFSC) and who received 6 different diagnoses in their oncological exams. The samples were collected by members of the Clinical Analyses Department: Ane Francyne Costa, Fabiana Botelho De Miranda Onofre, and Alexandre Sherlley Casimiro Onofre.

Annotations

Annotation process

The instances were annotated using the Labelbox tool. The satellite category was labeled as a single dot, and the other categories were labeled as polygons. After the annotation process, all annotations were reviewed.

Who are the annotators?

The annotators are members of the Clinical Analyses Department and of the Image Processing and Computer Graphics Lab (LAPiX) at Universidade Federal de Santa Catarina (UFSC):

  • Tainee Bottamedi
  • Vinícius Sanches
  • João H. Telles de Carvalho
  • Ricardo Thisted

Personal and Sensitive Information

This research was approved by the UFSC Research Ethics Committee (CEPSH), protocol number 57423616.3.0000.0121. All involved patients were informed about the study's objectives, and those who agreed to participate signed an informed consent form.

Considerations for Using the Data

Social Impact of Dataset

This dataset's purpose is to help promote the AgNOR technique as a supporting method for cancer diagnosis, since the technique is not yet standardized among pathologists.

Discussion of Biases

[More Information Needed]

Other Known Limitations

Because satellites were annotated as single points, their pixel-level representation is less accurate than that of the categories annotated as polygons.

Additional Information

Dataset Curators

Members of the Clinical Analyses Department from Universidade Federal de Santa Catarina (UFSC) collected the dataset samples: Ane Francyne Costa, Fabiana Botelho De Miranda Onofre, and Alexandre Sherlley Casimiro Onofre.

Licensing Information

The files associated with this dataset are licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0) license.

Users are free to adapt, copy or redistribute the material as long as they attribute it appropriately and do not use it for commercial purposes.

Citation Information

% Dataset official page
@misc{CCAgTDataset,
  doi = {10.17632/WG4BPM33HJ.2},
  url = {https://data.mendeley.com/datasets/wg4bpm33hj/2},
  author =  {Jo{\~{a}}o Gustavo Atkinson Amorim and Andr{\'{e}} Vict{\'{o}}ria Matias and Tainee Bottamedi and Vin{\'{i}}cius Sanches and Ane Francyne Costa and Fabiana Botelho De Miranda Onofre and Alexandre Sherlley Casimiro Onofre and Aldo von Wangenheim},
  title = {CCAgT: Images of Cervical Cells with AgNOR Stain Technique},
  publisher = {Mendeley},
  year = {2022},
  copyright = {Attribution-NonCommercial 3.0 Unported}
}


% Dataset second version
% pre-print:
@article{AtkinsonAmorim2022,
  doi = {10.2139/ssrn.4126881},
  url = {https://doi.org/10.2139/ssrn.4126881},
  year = {2022},
  publisher = {Elsevier {BV}},
  author = {Jo{\~{a}}o Gustavo Atkinson Amorim and Andr{\'{e}} Vict{\'{o}}ria Matias and Allan Cerentini and Fabiana Botelho de Miranda Onofre and Alexandre Sherlley Casimiro Onofre and Aldo von Wangenheim},
  title = {Semantic Segmentation for the Detection of Very Small Objects on Cervical Cell Samples Stained with the {AgNOR} Technique},
  journal = {{SSRN} Electronic Journal}
}


% Dataset first version
% Link: https://arquivos.ufsc.br/d/373be2177a33426a9e6c/
% Paper:
@inproceedings{AtkinsonSegmentationAgNORCBMS2020,  
    author={Jo{\~{a}}o Gustavo Atkinson Amorim and Luiz Antonio Buschetto Macarini and Andr{\'{e}} Vict{\'{o}}ria Matias and Allan Cerentini and Fabiana Botelho De Miranda Onofre and Alexandre Sherlley Casimiro Onofre and Aldo von Wangenheim},
    booktitle={2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS)},
    title={A Novel Approach on Segmentation of AgNOR-Stained Cytology Images Using Deep Learning},
    year={2020},
    pages={552-557}, 
    doi={10.1109/CBMS49503.2020.00110},
    url={https://doi.org/10.1109/CBMS49503.2020.00110}
}

Contributions

Thanks to @johnnv1 for adding this dataset.
