mednet.data.classify.tbx11k
TBX11k database for TB detection.
Database reference: [LWB+20]
The original database contains samples of healthy, sick (no TB), active TB and latent TB cases, for a total of 11702 samples. Healthy and sick individuals are kept in separate folders. Latent and active TB cases are merged in the same directory, so one must check the radiological annotations to determine whether a sample shows signs of latent TB, active TB, or both.
There is one patient case (file imgs/tb/tb1199.png) that is inside the tb folder but contains no annotations. This sample was excluded from our splits.
There are 30 patient cases across the entire database that have both active and latent TB radiological signs. Those samples were also excluded from our splits:
imgs/tb/tb0135.png
imgs/tb/tb0142.png
imgs/tb/tb0154.png
imgs/tb/tb0167.png
imgs/tb/tb0190.png
imgs/tb/tb0246.png
imgs/tb/tb0255.png
imgs/tb/tb0279.png
imgs/tb/tb0284.png
imgs/tb/tb0350.png
imgs/tb/tb0378.png
imgs/tb/tb0392.png
imgs/tb/tb0395.png
imgs/tb/tb0501.png
imgs/tb/tb0506.png
imgs/tb/tb0526.png
imgs/tb/tb0543.png
imgs/tb/tb0639.png
imgs/tb/tb0640.png
imgs/tb/tb0667.png
imgs/tb/tb0676.png
imgs/tb/tb0713.png
imgs/tb/tb0786.png
imgs/tb/tb0870.png
imgs/tb/tb0875.png
imgs/tb/tb0945.png
imgs/tb/tb0949.png
imgs/tb/tb0968.png
imgs/tb/tb1104.png
imgs/tb/tb1143.png
Original train database samples:
Healthy: 3000
Sick (but no TB): 3000
Active TB only: 473
Latent TB only: 103
Both active and latent TB: 23
Unknown: 1
Total: 6600
Original validation database samples:
Healthy: 800
Sick (but no TB): 800
Latent TB only: 36
Active TB only: 157
Both active and latent TB: 7
Total: 1800
Original test database samples:
Unknown: 3302
Total: 3302
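The per-class counts above can be cross-checked with a few lines of arithmetic. This is only a sketch: the dictionary keys are illustrative names, not identifiers from the package, and the "usable" figures assume that the exclusions described earlier (the 30 mixed cases and the single unannotated file) are the only samples removed.

```python
# Counts copied from the lists above; the key names are illustrative only.
original_train = {
    "healthy": 3000,
    "sick_no_tb": 3000,
    "active_tb_only": 473,
    "latent_tb_only": 103,
    "active_and_latent_tb": 23,
    "unknown": 1,
}
original_validation = {
    "healthy": 800,
    "sick_no_tb": 800,
    "latent_tb_only": 36,
    "active_tb_only": 157,
    "active_and_latent_tb": 7,
}
original_test = {"unknown": 3302}

train_total = sum(original_train.values())
validation_total = sum(original_validation.values())
test_total = sum(original_test.values())
grand_total = train_total + validation_total + test_total

# Samples showing both active and latent TB signs, over the whole database.
mixed_cases = (
    original_train["active_and_latent_tb"]
    + original_validation["active_and_latent_tb"]
)

# Samples left once the mixed cases and the unannotated file are removed
# (assumption: nothing else is excluded).
usable_train = (
    train_total
    - original_train["active_and_latent_tb"]
    - original_train["unknown"]
)
usable_validation = validation_total - original_validation["active_and_latent_tb"]

print(train_total, validation_total, test_total, grand_total)
print(mixed_cases, usable_train, usable_validation)
```

The totals agree with the figures quoted in the text (6600, 1800, 3302 and 11702 overall), and the mixed cases add up to the 30 exclusions listed earlier.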
Because the test set does not have annotations, we generated train, validation and test databases as follows:
The original validation database becomes our test set.
The original train database is split into new train and validation splits (validation ratio = 0.203 w.r.t. the original train database size). The selection of samples is stratified (see the comments in our split code, which is shipped alongside this file).
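To illustrate what a stratified selection with this ratio could look like, here is a minimal sketch. It is not the split code shipped with the package: the function name, the fixed seed, the rounding rule, the file names, and the choice of the four retained classes as strata are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(samples, validation_ratio, seed=0):
    """Split (filename, stratum) pairs so each stratum keeps the same ratio.

    Hypothetical helper, not part of mednet.
    """
    by_stratum = defaultdict(list)
    for sample in samples:
        by_stratum[sample[1]].append(sample)

    rng = random.Random(seed)
    train, validation = [], []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        rng.shuffle(group)
        # Number of validation samples taken from this stratum.
        n_validation = round(len(group) * validation_ratio)
        validation.extend(group[:n_validation])
        train.extend(group[n_validation:])
    return train, validation


# Toy population mirroring the retained classes of the original train set.
# File names are made up for the example.
counts = {"healthy": 3000, "sick": 3000, "active_tb": 473, "latent_tb": 103}
population = [
    (f"{name}-{index:04d}.png", name)
    for name, count in counts.items()
    for index in range(count)
]

train, validation = stratified_split(population, validation_ratio=0.203)
validation_share = len(validation) / len(population)
```

Splitting per stratum keeps the rare classes (active and latent TB) represented in both subsets, which a plain random split of the whole list would not guarantee.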
Important
Raw data organization
The TBX11k base datadir, which you should configure following the Setup instructions, must contain at least these two subdirectories:
imgs/ (directory containing sub-directories and images in PNG format)
annotations/ (directory containing labels in JSON and XML format)
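A quick sanity check of the data directory can catch a misconfigured path before any loading is attempted. The helper below is a hypothetical sketch, not a function provided by the package; it only tests for the two sub-directories named above and uses a temporary directory to stand in for the real database root.

```python
import tempfile
from pathlib import Path

# Sub-directories the text above says must exist under the base datadir.
REQUIRED_SUBDIRS = ("imgs", "annotations")


def missing_subdirs(datadir):
    """Return the required sub-directories that are absent from datadir."""
    root = Path(datadir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


# Demonstration on a temporary directory instead of a real installation.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "imgs").mkdir()
    before = missing_subdirs(tmp)  # annotations/ not created yet
    (Path(tmp) / "annotations").mkdir()
    after = missing_subdirs(tmp)  # both sub-directories present
```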
Data specifications:
Raw data input (on disk): PNG images 8 bits RGB, 512 x 512 pixels
Output image:
Transforms:
Load raw PNG with PIL
Convert to torch tensor
Final specifications:
RGB, encoded as a 3-plane tensor using 32-bit floats, square (512x512 pixels)
Labels: 0 (healthy, latent TB or sick but no TB, depending on the protocol), 1 (active tuberculosis), as a torch float tensor.
Bounding boxes: indicating regions of the image that corroborate the diagnosis (active or latent TB).
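The two transforms listed above can be sketched as follows. This is an illustration rather than the loader used by the package: the function name is hypothetical, the intermediate NumPy conversion is one of several possible routes, and scaling pixel values to the [0, 1] range is an assumption, since the specification only states that the output uses 32-bit floats.

```python
import tempfile
from pathlib import Path

import numpy as np
import torch
from PIL import Image


def load_png_as_tensor(path):
    """Load an image with PIL and return a float32 tensor shaped (3, H, W)."""
    with Image.open(path) as image:
        rgb = image.convert("RGB")
    # (H, W, 3) array of float32; the division by 255 is an assumption.
    pixels = np.asarray(rgb, dtype=np.float32) / 255.0
    # Move the channel axis first to obtain a 3-plane tensor.
    return torch.from_numpy(pixels).permute(2, 0, 1).contiguous()


# Demonstration on a synthetic 512 x 512 red image, not a database file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.png"
    Image.new("RGB", (512, 512), color=(255, 0, 0)).save(demo_path)
    tensor = load_png_as_tensor(demo_path)
```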
Note
JSON Encoding
Details of the encoding of database splits in JSON format.
For healthy/sick (no TB)/latent TB cases, each sample is represented by a filename, relative to the root of the installed database, followed by the number 0 (negative class).
For active TB cases, each sample is represented by a filename, followed by the number 1, and then by 1 or more 5-tuples with radiological finding locations, as described above.
This module contains the base declaration of common data modules and raw-data loaders for this database. All configured splits inherit from this definition.
Module Attributes
DATABASE_SLUG | Pythonic name of this database.
CONFIGURATION_KEY_DATADIR | Key to search for in the configuration file for the root directory of this database.
DatabaseSample | Type of objects in our JSON representation for this database.
Functions
custom_collate_fn(batch) | Collate samples that include bounding boxes.
Classes
DataModule(split_path, ignore_bboxes=False) | TBX11k database for TB detection.
RawDataLoader(ignore_bboxes=False) | A specialized raw-data-loader for the TBX11k database.
- mednet.data.classify.tbx11k.DATABASE_SLUG = 'tbx11k'
Pythonic name of this database.
- mednet.data.classify.tbx11k.CONFIGURATION_KEY_DATADIR = 'datadir.tbx11k'
Key to search for in the configuration file for the root directory of this database.
- mednet.data.classify.tbx11k.DatabaseSample: TypeAlias = tuple[str, int] | tuple[str, int, tuple[tuple[int, int, int, int, int]]]
Type of objects in our JSON representation for this database.
For healthy/sick (no TB)/latent TB cases, each sample is represented by a filename, relative to the root of the installed database, followed by the number 0 (negative class).
For active TB cases, each sample is represented by a filename, followed by the number 1, and then by 1 or more 5-tuples with radiological finding locations, as described above.
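As an illustration of the two entry shapes, the sketch below parses a small, made-up list of samples. The file names and box values are hypothetical, the flat list layout is an assumption about how entries might be stored, and the five integers of each finding are treated as opaque because their meaning is not spelled out in this section.

```python
import json

# Hypothetical entries: one negative sample and one active TB sample
# carrying two findings. None of these values come from the real database.
raw = """
[
  ["imgs/example/negative-0001.png", 0],
  ["imgs/example/active-0001.png", 1, [[1, 10, 20, 30, 40], [1, 50, 60, 70, 80]]]
]
"""


def parse_sample(entry):
    """Normalise a JSON entry to (filename, label, list of 5-tuples)."""
    filename, label = str(entry[0]), int(entry[1])
    findings = [tuple(item) for item in entry[2]] if len(entry) > 2 else []
    for finding in findings:
        if len(finding) != 5:
            raise ValueError(f"expected a 5-tuple, got {finding!r}")
    return filename, label, findings


parsed = [parse_sample(entry) for entry in json.loads(raw)]
negative, active = parsed
```

Normalising both shapes to a three-element tuple lets downstream code handle samples uniformly, with an empty list standing for "no findings".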
- mednet.data.classify.tbx11k.custom_collate_fn(batch)
Collate samples that include bounding boxes.
This allows us to have torchvision.tv_tensors.BoundingBoxes that can contain zero to multiple boxes, which is not supported by the default collate function that uses torch.stack() for batching.
- Returns:
The given batch.
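The behaviour described here can be reproduced with a small stand-alone sketch. It is not the package's implementation: the function name, the sample layout (an image tensor paired with a metadata dictionary) and the use of plain tensors in place of torchvision.tv_tensors.BoundingBoxes are all assumptions made to keep the example self-contained.

```python
import torch
from torch.utils.data import DataLoader


def identity_collate(batch):
    """Return the list of samples unchanged instead of stacking them."""
    return batch


# Two toy samples: the first has no box, the second has two boxes.
samples = [
    (torch.zeros(3, 8, 8), {"target": torch.tensor(0.0), "boxes": torch.zeros((0, 4))}),
    (torch.zeros(3, 8, 8), {"target": torch.tensor(1.0), "boxes": torch.ones((2, 4))}),
]

# Stacking box tensors of different lengths is what breaks default batching.
try:
    torch.stack([meta["boxes"] for _, meta in samples])
    stack_failed = False
except RuntimeError:
    stack_failed = True

# With an identity collate function, the loader yields a plain list of samples.
loader = DataLoader(samples, batch_size=2, collate_fn=identity_collate)
batch = next(iter(loader))
```

The trade-off is that the training step then receives a list and must gather images and targets itself, while the boxes stay attached to their own sample.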
- class mednet.data.classify.tbx11k.RawDataLoader(ignore_bboxes=False)
Bases: RawDataLoader
A specialized raw-data-loader for the TBX11k database.
- Parameters:
ignore_bboxes (bool) – If True, sample() does not return bounding boxes.
- sample(sample)
Load a single image sample from the disk.
- Parameters:
sample (Any) – A tuple containing the path suffix, within the database root folder, where to find the image to be loaded, an integer representing the sample target, and possible radiological findings represented by bounding boxes.
- Return type:
- Returns:
The sample representation.
- class mednet.data.classify.tbx11k.DataModule(split_path, ignore_bboxes=False)
Bases: CachingDataModule
TBX11k database for TB detection.
- Parameters:
split_path (Path | Traversable) – Path or traversable (resource) with the JSON split description to load.
ignore_bboxes (bool) – If True, sample() does not return bounding boxes.