mednet.data.classify.nih_cxr14

NIH CXR14 (relabeled) DataModule for computer-aided diagnosis.

This dataset was extracted from the clinical PACS database at the National Institutes of Health Clinical Center (USA) and represents 60% of all their radiographs. It contains labels for 14 common radiological signs, in this order: cardiomegaly, emphysema, effusion, hernia, infiltration, mass, nodule, atelectasis, pneumothorax, pleural thickening, pneumonia, fibrosis, edema and consolidation. Training and validation data come from the relabeled version created in the [RIB+18] study. Test data uses the original annotations from [WPL+17].

  • Database references:

    • Original data: [WPL+17], containing 112’120 chest X-ray images with up to 14 associated radiological findings each.

    • Labels and split references: We use the train and validation splits published with [RIB+18], which are available here. These differ from the file lists provided with the original [WPL+17] study (train/val set: 86’523 samples; test set: 25’595 samples; plus 2 samples that are not listed, making up 112’120 samples). The [RIB+18] splits, which we copied into this library, contain 104’987 relabeled samples, divided into a training set of 98’637 samples and a validation set of 6’350 samples. Note that the relabeling work of [RIB+18] does not provide test set annotations (only training and validation). Our test set therefore consists of all CXR8 samples that were not relabeled, for which we reuse the original CXR8 annotations (7’133 samples).
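The sample counts quoted above can be cross-checked with a few lines of arithmetic. The snippet below is only a sanity check on the numbers in the text; the variable names are illustrative and not part of the library.

```python
# Sample counts quoted in the text above (variable names are illustrative).
TOTAL = 112_120

# Original [WPL+17] file lists: train/val + test + 2 unlisted samples.
orig_train_val = 86_523
orig_test = 25_595
unlisted = 2
assert orig_train_val + orig_test + unlisted == TOTAL

# Relabeled [RIB+18] splits: training + validation.
relabeled_train = 98_637
relabeled_val = 6_350
relabeled_total = relabeled_train + relabeled_val
assert relabeled_total == 104_987

# The test set is made of every sample that was not relabeled.
test_size = TOTAL - relabeled_total
print(test_size)  # 7133
```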

Important

Raw data organization

The CXR8 base datadir, which you should configure following the Setup instructions, must contain at least the directory “images/” with all the images of the database.

The labels from [RIB+18] (available here) are already incorporated in this library and do not need to be re-downloaded.

The flag idiap_folder_structure makes the loader search for files named, e.g. images/00030621_006.png, as images/00030/00030621_006.png.
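As a rough illustration of the path rewriting described above, the sketch below reproduces the single documented example. It assumes the sub-folder name is the first five characters of the file name, which is inferred from that one example only; the helper `to_idiap_layout` is hypothetical and is not the library's loader code.

```python
from pathlib import PurePosixPath


def to_idiap_layout(relative_path: str) -> str:
    """Hypothetical helper: insert a sub-folder between directory and file.

    Assumption: the sub-folder is named after the first five characters
    of the file name (inferred from the documented example only).
    """
    path = PurePosixPath(relative_path)
    subfolder = path.name[:5]
    return str(path.parent / subfolder / path.name)


print(to_idiap_layout("images/00030621_006.png"))
# images/00030/00030621_006.png
```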

  • Raw data input (on disk):

    • PNG RGB 8-bit depth images

    • Original resolution: 1024 x 1024 pixels

    • Non-exclusive labels organized in a (compact) string list encoded as such:

      1. car: cardiomegaly

      2. emp: emphysema

      3. eff: effusion

      4. her: hernia

      5. inf: infiltration

      6. mas: mass

      7. nod: nodule

      8. ate: atelectasis

      9. pnt: pneumothorax

      10. plt: pleural thickening

      11. pne: pneumonia

      12. fib: fibrosis

      13. ede: edema

      14. con: consolidation

    • Patient age (integer)

    • Patient gender (“M” or “F”)

    • Total samples available: 112’120

  • Output image:

    • Transforms:

      • Load raw PNG with PIL, with auto-conversion to grayscale

      • Convert to torch tensor

    • Final specifications:

      • RGB, encoded as a 3-plane tensor, 32-bit floats, square (1024x1024 px)

        The decoder loads the compact label list described above and converts it to a binary multi-label representation.

This module contains the base declaration of common data modules and raw-data loaders for this database. All configured splits inherit from this definition.

Module Attributes

DATABASE_SLUG

Pythonic name of this database.

CONFIGURATION_KEY_DATADIR

Key to search for in the configuration file for the root directory of this database.

CONFIGURATION_KEY_IDIAP_FILESTRUCTURE

Key to search for in the configuration file indicating if the loader should use standard or idiap-based file organisation structure.

RADIOLOGICAL_FINDINGS

List of radiological findings (abbreviations) supported on this database.

Functions

binarize_findings(lst)

Binarize the input list of radiological findings.

Classes

DataModule(split_path)

NIH CXR14 (relabeled) DataModule for computer-aided diagnosis.

RawDataLoader()

A specialized raw-data-loader for the NIH CXR-14 dataset.

mednet.data.classify.nih_cxr14.DATABASE_SLUG = 'nih_cxr14'

Pythonic name of this database.

mednet.data.classify.nih_cxr14.CONFIGURATION_KEY_DATADIR = 'datadir.cxr8'

Key to search for in the configuration file for the root directory of this database.

mednet.data.classify.nih_cxr14.CONFIGURATION_KEY_IDIAP_FILESTRUCTURE = 'cxr8.idiap_folder_structure'

Key to search for in the configuration file indicating if the loader should use standard or idiap-based file organisation structure.

It causes the internal loader to search for files in a slightly different folder structure, that was adapted to Idiap’s requirements (number of files per folder to be less than 10k).
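To make the two configuration keys more concrete, here is a minimal sketch of how dotted keys such as these could be resolved against a nested configuration. The configuration contents, the `/path/to/cxr8` value, and the `lookup` helper are all assumptions for illustration; how the library actually reads its configuration file is not described on this page.

```python
from typing import Any

CONFIGURATION_KEY_DATADIR = "datadir.cxr8"
CONFIGURATION_KEY_IDIAP_FILESTRUCTURE = "cxr8.idiap_folder_structure"

# Hypothetical configuration contents, shown as a nested mapping.
config = {
    "datadir": {"cxr8": "/path/to/cxr8"},
    "cxr8": {"idiap_folder_structure": True},
}


def lookup(cfg: dict, dotted_key: str, default: Any = None) -> Any:
    """Hypothetical helper: walk a nested mapping following a dotted key."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


datadir = lookup(config, CONFIGURATION_KEY_DATADIR)
use_idiap = lookup(config, CONFIGURATION_KEY_IDIAP_FILESTRUCTURE, default=False)
print(datadir, use_idiap)
```

In this sketch a missing key falls back to the given default, so an unset folder-structure flag would be read as the standard layout.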

mednet.data.classify.nih_cxr14.RADIOLOGICAL_FINDINGS = ['car', 'emp', 'eff', 'her', 'inf', 'mas', 'nod', 'ate', 'pnt', 'plt', 'pne', 'fib', 'ede', 'con']

List of radiological findings (abbreviations) supported on this database.

mednet.data.classify.nih_cxr14.binarize_findings(lst)

Binarize the input list of radiological findings.

The output list contains zeros and ones, respecting the findings order in RADIOLOGICAL_FINDINGS.

Parameters:

lst (list[str]) – A list of radiological findings that will be converted.

Return type:

Tensor

Returns:

A tensor containing a binarized version of the input list.
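The binarization described here can be sketched in plain Python as follows. This is not the library's implementation: the function name `binarize_findings_sketch` is hypothetical, and it returns a list of floats so that it runs with the standard library only, whereas the documented function returns a Tensor (wrapping the result in `torch.tensor(...)` would be one way to obtain that).

```python
RADIOLOGICAL_FINDINGS = [
    "car", "emp", "eff", "her", "inf", "mas", "nod",
    "ate", "pnt", "plt", "pne", "fib", "ede", "con",
]


def binarize_findings_sketch(lst: list[str]) -> list[float]:
    """Hypothetical stand-in: one position per finding, 1.0 if present."""
    present = set(lst)
    return [1.0 if finding in present else 0.0 for finding in RADIOLOGICAL_FINDINGS]


encoded = binarize_findings_sketch(["eff", "car"])
print(encoded)
# [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The output order follows RADIOLOGICAL_FINDINGS, not the order of the input list, so `["eff", "car"]` and `["car", "eff"]` produce the same encoding.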

class mednet.data.classify.nih_cxr14.RawDataLoader

Bases: RawDataLoader

A specialized raw-data-loader for the NIH CXR-14 dataset.

datadir: Path

This variable contains the base directory where the database raw data is stored.

idiap_file_organisation: bool

Whether to use Idiap’s filesystem organisation when looking up data.

This variable will be True if the user has set the configuration parameter named by CONFIGURATION_KEY_IDIAP_FILESTRUCTURE (cxr8.idiap_folder_structure) in the global configuration file. It causes the internal loader to search for files in a slightly different folder structure, which was adapted to Idiap’s requirements (fewer than 10k files per folder).

sample(sample)

Load a single image sample from the disk.

Parameters:

sample (Any) – A tuple containing the path suffix, within the dataset root folder, where to find the image to be loaded, and the raw sample target (the list of radiological findings present in the sample).

Return type:

Mapping[str, Any]

Returns:

The sample representation.

target(sample)

Load only sample target from its raw representation.

The raw representation contains zero or more (unique) instances of the radiological findings listed in RADIOLOGICAL_FINDINGS. This list is binarized (into 14 binary positions) before it is returned.

Parameters:

sample (Any) – A tuple containing the path suffix, within the dataset root folder, where to find the image to be loaded, and the raw sample target (the list of radiological findings present in the sample).

Return type:

Tensor

Returns:

The labels corresponding to all radiological signs present in the specified sample, encapsulated as a 1D torch float tensor.

class mednet.data.classify.nih_cxr14.DataModule(split_path)

Bases: CachingDataModule

NIH CXR14 (relabeled) DataModule for computer-aided diagnosis.

Parameters:

split_path (Path | Traversable) – Path or traversable (resource) with the JSON split description to load.