mednet.engine.segment.evaluator

Defines functionality for the evaluation of predictions.

Module Attributes

SUPPORTED_METRIC_TYPE

Supported metrics for evaluation of counts.

Functions

accuracy(tp, fp, tn, fn)

Calculate the accuracy given true/false positive/negative counts.

all_metrics(tp, fp, tn, fn)

Compute all available metrics at once.

compare_annotators(a1, a2)

Compare annotators and output all supported metrics.

compute_metric(counts, metric)

Compute metric for every row of counts.

f1_score(tp, fp, tn, fn)

Calculate the F1 score given true/false positive/negative counts.

get_counts_for_threshold(pred, gt, mask, ...)

Calculate counts on a single sample, for a specific threshold.

jaccard(tp, fp, tn, fn)

Calculate the Jaccard index given true/false positive/negative counts.

load_count(prediction_path, predictions, ...)

Count true/false positives/negatives for the subset.

load_predictions(prediction_path, predictions)

Load predictions and ground-truth from HDF5 files.

make_plots(eval_data)

Create plots for all curves in eval_data.

make_table(eval_data, threshold, format_)

Extract and format table from pre-computed evaluation data.

name2metric(name)

Convert a string name to a callable for summarizing counts.

precision(tp, fp, tn, fn)

Calculate the precision given true/false positive/negative counts.

recall(tp, fp, tn, fn)

Calculate the recall given true/false positive/negative counts.

run(predictions, steps, threshold, metric)

Evaluate a segmentation model.

specificity(tp, fp, tn, fn)

Calculate the specificity given true/false positive/negative counts.

tfpn_masks(pred, gt, threshold)

Calculate true and false positives and negatives.

validate_threshold(threshold, splits)

Validate the user threshold selection and return the parsed threshold.

mednet.engine.segment.evaluator.SUPPORTED_METRIC_TYPE

Supported metrics for evaluation of counts.

alias of Literal['precision', 'recall', 'specificity', 'accuracy', 'jaccard', 'f1']

mednet.engine.segment.evaluator.precision(tp, fp, tn, fn)[source]

Calculate the precision given true/false positive/negative counts.

P, AKA positive predictive value (PPV). It corresponds arithmetically to tp/(tp+fp). In the case tp+fp == 0, this function returns zero for precision.

Parameters:
  • tp (int) – True positive count, AKA “hit”.

  • fp (int) – False positive count, AKA “false alarm”, or “Type I error”.

  • tn (int) – True negative count, AKA “correct rejection”.

  • fn (int) – False negative count, AKA “miss”, or “Type II error”.

Return type:

float

Returns:

The precision.
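
A minimal usage sketch with made-up counts, showing the arithmetic and the degenerate tp+fp == 0 case:

    from mednet.engine.segment.evaluator import precision

    # 90 hits and 10 false alarms: P = 90 / (90 + 10) = 0.9
    print(precision(90, 10, 880, 20))   # 0.9
    # no positive predictions at all: returns 0.0 instead of dividing by zero
    print(precision(0, 0, 1000, 0))     # 0.0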

mednet.engine.segment.evaluator.recall(tp, fp, tn, fn)[source]

Calculate the recall given true/false positive/negative counts.

R, AKA sensitivity, hit rate, or true positive rate (TPR). It corresponds arithmetically to tp/(tp+fn). In the special case where tp+fn == 0, this function returns zero for recall.

Parameters:
  • tp (int) – True positive count, AKA “hit”.

  • fp (int) – False positive count, AKA “false alarm”, or “Type I error”.

  • tn (int) – True negative count, AKA “correct rejection”.

  • fn (int) – False negative count, AKA “miss”, or “Type II error”.

Return type:

float

Returns:

The recall.

mednet.engine.segment.evaluator.specificity(tp, fp, tn, fn)[source]

Calculate the specificity given true/false positive/negative counts.

S, AKA selectivity or true negative rate (TNR). It corresponds arithmetically to tn/(tn+fp). In the special case where tn+fp == 0, this function returns zero for specificity.

Parameters:
  • tp (int) – True positive count, AKA “hit”.

  • fp (int) – False positive count, AKA “false alarm”, or “Type I error”.

  • tn (int) – True negative count, AKA “correct rejection”.

  • fn (int) – False negative count, AKA “miss”, or “Type II error”.

Return type:

float

Returns:

The specificity.

mednet.engine.segment.evaluator.accuracy(tp, fp, tn, fn)[source]

Calculate the accuracy given true/false positive/negative counts.

A, see Accuracy. It is the proportion of correct predictions (both true positives and true negatives) among the total number of pixels examined. It corresponds arithmetically to (tp+tn)/(tp+tn+fp+fn). This measure includes both true negatives and true positives in the numerator, which makes it sensitive to data or regions without annotations.

Parameters:
  • tp (int) – True positive count, AKA “hit”.

  • fp (int) – False positive count, AKA “false alarm”, or “Type I error”.

  • tn (int) – True negative count, AKA “correct rejection”.

  • fn (int) – False negative count, AKA “miss”, or “Type II error”.

Return type:

float

Returns:

The accuracy.

mednet.engine.segment.evaluator.jaccard(tp, fp, tn, fn)[source]

Calculate the Jaccard index given true/false positive/negative counts.

J, see Jaccard Index or Similarity. It corresponds arithmetically to tp/(tp+fp+fn). In the special case where tp+fp+fn == 0, this function returns zero for the Jaccard index. The Jaccard index depends on a TP-only numerator, similarly to the F1 score. For regions where there are no annotations, the Jaccard index will always be zero, irrespective of the model output. Accuracy may be a better proxy if one needs to consider the true absence of annotations in a region as part of the measure.

Parameters:
  • tp (int) – True positive count, AKA “hit”.

  • fp (int) – False positive count, AKA “false alarm”, or “Type I error”.

  • tn (int) – True negative count, AKA “correct rejection”.

  • fn (int) – False negative count, AKA “miss”, or “Type II error”.

Return type:

float

Returns:

The Jaccard index.

mednet.engine.segment.evaluator.f1_score(tp, fp, tn, fn)[source]

Calculate the F1 score given true/false positive/negative counts.

F1, see F1-score. It corresponds arithmetically to 2*P*R/(P+R) or 2*tp/(2*tp+fp+fn). In the special case where P+R == (2*tp+fp+fn) == 0, this function returns zero for the F1 score. The F1 or Dice score depends on a TP-only numerator, similarly to the Jaccard index. For regions where there are no annotations, the F1-score will always be zero, irrespective of the model output. Accuracy may be a better proxy if one needs to consider the true absence of annotations in a region as part of the measure.

Parameters:
  • tp (int) – True positive count, AKA “hit”.

  • fp (int) – False positive count, AKA “false alarm”, or “Type I error”.

  • tn (int) – True negative count, AKA “correct rejection”.

  • fn (int) – False negative count, AKA “miss”, or “Type II error”.

Return type:

float

Returns:

The F1-score.
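
To illustrate the note above, a small sketch with hypothetical counts for a region with no annotated foreground (tp == fn == 0): the F1 score and Jaccard index are zero regardless of the model output, while accuracy rewards the correctly rejected background:

    from mednet.engine.segment.evaluator import accuracy, f1_score, jaccard

    # hypothetical counts: no annotated foreground, a few false alarms
    tp, fp, tn, fn = 0, 5, 995, 0
    print(f1_score(tp, fp, tn, fn))  # 0.0 (TP-only numerator)
    print(jaccard(tp, fp, tn, fn))   # 0.0 (TP-only numerator)
    print(accuracy(tp, fp, tn, fn))  # 0.995 (credits the true negatives)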

mednet.engine.segment.evaluator.name2metric(name)[source]

Convert a string name to a callable for summarizing counts.

Parameters:

name (Literal['precision', 'recall', 'specificity', 'accuracy', 'jaccard', 'f1']) – The name of the metric to be looked up.

Return type:

Callable[[int, int, int, int], float]

Returns:

A callable that summarizes counts into a single floating-point number.
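
A minimal sketch, using made-up counts, of looking up a metric by name and applying it:

    from mednet.engine.segment.evaluator import name2metric

    summarize = name2metric("f1")
    print(summarize(90, 10, 880, 20))  # 2*90 / (2*90 + 10 + 20) ~= 0.857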

mednet.engine.segment.evaluator.all_metrics(tp, fp, tn, fn)[source]

Compute all available metrics at once.

Parameters:
  • tp (int) – True positive count, AKA “hit”.

  • fp (int) – False positive count, AKA “false alarm”, or “Type I error”.

  • tn (int) – True negative count, AKA “correct rejection”.

  • fn (int) – False negative count, AKA “miss”, or “Type II error”.

Return type:

list[float]

Returns:

All supported metrics in the order defined by SUPPORTED_METRIC_TYPE.
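
Since the output order follows SUPPORTED_METRIC_TYPE, the values can be paired with the metric names, for example via typing.get_args. A small sketch with hypothetical counts:

    from typing import get_args

    from mednet.engine.segment.evaluator import SUPPORTED_METRIC_TYPE, all_metrics

    values = all_metrics(90, 10, 880, 20)  # made-up counts
    # the documented ordering matches the literal alias, so names and values can be zipped
    print(dict(zip(get_args(SUPPORTED_METRIC_TYPE), values)))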

mednet.engine.segment.evaluator.tfpn_masks(pred, gt, threshold)[source]

Calculate true and false positives and negatives.

All input arrays should have matching sizes.

Parameters:
  • pred (ndarray[tuple[int, ...], dtype[float32]]) – Pixel-wise predictions as output by your model.

  • gt (ndarray[tuple[int, ...], dtype[bool]]) – Ground-truth (annotations).

  • threshold (float) – A particular threshold at which to calculate the performance measures. Values at this threshold are counted as positives.

Return type:

tuple[ndarray[tuple[int, ...], dtype[bool]], ndarray[tuple[int, ...], dtype[bool]], ndarray[tuple[int, ...], dtype[bool]], ndarray[tuple[int, ...], dtype[bool]]]

Returns:

  • tp – Boolean array with true positives, considering all observations.

  • fp – Boolean array with false positives, considering all observations.

  • tn – Boolean array with true negatives, considering all observations.

  • fn – Boolean array with false negatives, considering all observations.
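
A small sketch with a hypothetical 2x2 prediction: summing each returned boolean mask yields the per-sample counts (cf. get_counts_for_threshold()):

    import numpy

    from mednet.engine.segment.evaluator import tfpn_masks

    # hypothetical 2x2 probability map and matching ground-truth
    pred = numpy.array([[0.9, 0.4], [0.7, 0.1]], dtype=numpy.float32)
    gt = numpy.array([[True, False], [False, False]])
    tp, fp, tn, fn = tfpn_masks(pred, gt, 0.5)
    # summing each boolean mask yields the counts for this sample
    print(int(tp.sum()), int(fp.sum()), int(tn.sum()), int(fn.sum()))  # 1 1 2 0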

mednet.engine.segment.evaluator.get_counts_for_threshold(pred, gt, mask, threshold)[source]

Calculate counts on a single sample, for a specific threshold.

Parameters:
  • pred (ndarray[tuple[int, ...], dtype[float32]]) – Pixel-wise predictions as output by your model.

  • gt (ndarray[tuple[int, ...], dtype[bool]]) – Ground-truth (annotations).

  • mask (ndarray[tuple[int, ...], dtype[bool]]) – Boolean mask; elements outside the mask are not considered during counting.

  • threshold (float) – A particular threshold at which to calculate the counts. Values at this threshold are counted as positives.

Return type:

tuple[int, int, int, int]

Returns:

The true positives, false positives, true negatives and false negatives, in that order.

mednet.engine.segment.evaluator.load_count(prediction_path, predictions, thresholds)[source]

Count true/false positives/negatives for the subset.

This function will load predictions from their store location and will cumulatively count the number of true positives, false positives, true negatives and false negatives across the various thresholds. This alternative provides a memory-bound way to compute the performance of splits with potentially very large images or including a large/very large number of samples. Unfortunately, sklearn does not provide functions to compute standard metrics from true/false positive/negative counts, which implies one needs to make use of further functions defined in this module to compute such metrics. Alternatively, you may look into load_predictions(), if you want to use sklearn functions to compute metrics.

Parameters:
  • prediction_path (Path) – Base directory where the prediction files (HDF5) were stored.

  • predictions (Sequence[str]) – A list of relative sample prediction paths to consider for measurement.

  • thresholds (ndarray[tuple[int, ...], dtype[float64]]) – A sequence of thresholds to be applied on predictions, when evaluating true/false positive/negative counts.

Return type:

ndarray[tuple[int, ...], dtype[uint64]]

Returns:

A 2-D array with shape (len(thresholds), 4), where each row contains the counts of true positives, false positives, true negatives and false negatives for the corresponding threshold, computed over the whole dataset.

mednet.engine.segment.evaluator.load_predictions(prediction_path, predictions)[source]

Load predictions and ground-truth from HDF5 files.

Loading pixel-data as simple binary predictions with associated labels allows using the sklearn library to compute most metrics defined in this module. Note however that computing metrics this way requires pre-allocation of a potentially large vector, which depends on the number of samples and the size of said samples. This may not work well for very large datasets of large/very large images. Currently, the evaluation system uses load_count() instead, which loads and pre-computes the number of true/false positives/negatives using a list of candidate thresholds.

Parameters:
  • prediction_path (Path) – Base directory where the prediction files (HDF5) were stored.

  • predictions (Sequence[str]) – A list of relative sample prediction paths to consider for measurement.

Return type:

tuple[ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[bool]]]

Returns:

Two 1-D arrays containing a linearized version of pixel predictions (probability) and matching ground-truth.
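
As mentioned above, the two arrays can be fed directly to sklearn metric functions. A sketch assuming hypothetical prediction file locations:

    import pathlib

    from sklearn.metrics import average_precision_score, roc_auc_score

    from mednet.engine.segment.evaluator import load_predictions

    # hypothetical locations; real values come from the prediction step's output
    prediction_path = pathlib.Path("results/predictions")
    predictions = ["sample-001.hdf5", "sample-002.hdf5"]

    y_score, y_true = load_predictions(prediction_path, predictions)
    print(roc_auc_score(y_true, y_score))
    print(average_precision_score(y_true, y_score))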

mednet.engine.segment.evaluator.compute_metric(counts, metric)[source]

Compute metric for every row of counts.

Parameters:
  • counts (ndarray[tuple[int, ...], dtype[uint64]]) – A 2-D array with shape (*, 4), where each row contains the counts of true positives, false positives, true negatives and false negatives that need to be evaluated.

  • metric (Union[Callable[[int, int, int, int], float], Callable[[int, int, int, int], tuple[float, ...]]]) – A callable that takes 4 integers representing true positives, false positives, true negatives and false negatives, and outputs one or more floating-point metrics.

Return type:

ndarray[tuple[int, ...], dtype[float64]]

Returns:

An array containing the provided metric computed along the first dimension, with as many columns as metric provides in each call.
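
A sketch combining load_count() and compute_metric() to pick the threshold that maximizes the F1 score, similar in spirit to what run() does (paths and file names below are hypothetical):

    import pathlib

    import numpy

    from mednet.engine.segment.evaluator import compute_metric, f1_score, load_count

    # hypothetical inputs; real values come from the prediction step's output
    prediction_path = pathlib.Path("results/predictions")
    predictions = ["sample-001.hdf5", "sample-002.hdf5"]
    thresholds = numpy.linspace(0.0, 1.0, 101)

    counts = load_count(prediction_path, predictions, thresholds)  # shape: (101, 4)
    f1_per_threshold = compute_metric(counts, f1_score)
    best = int(numpy.argmax(f1_per_threshold))
    print(thresholds[best], f1_per_threshold[best])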

mednet.engine.segment.evaluator.validate_threshold(threshold, splits)[source]

Validate the user threshold selection and return the parsed threshold.

Parameters:
  • threshold (float | str) – The threshold to validate.

  • splits (list[str]) – List of available splits.

Returns:

The validated threshold.

mednet.engine.segment.evaluator.compare_annotators(a1, a2)[source]

Compare annotators and output all supported metrics.

Parameters:
  • a1 (Path) – Annotator 1 annotations in the form of a JSON file mapping split-names to lists of lists, each containing the sample name and the (relative) location of an HDF5 file containing at least one boolean dataset named target. This dataset is considered as the annotations from the first annotator. If a boolean mask is available, it is also loaded. All elements outside the mask are not considered during the metrics calculations.

  • a2 (Path) – Annotator 2 annotations in the form of a JSON file mapping split-names to lists of lists, each containing the sample name and the (relative) location of an HDF5 file containing at least one boolean dataset named target. This dataset is considered as the annotations from the second annotator.

Return type:

dict[str, dict[str, float]]

Returns:

A dictionary that maps split-names to another dictionary with metric names and values computed by comparing the targets in a1 and a2.
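
A sketch of a call, assuming two hypothetical JSON files laid out as described above:

    import pathlib

    from mednet.engine.segment.evaluator import compare_annotators

    # each (hypothetical) JSON file maps split-names to [sample-name, hdf5-path] pairs,
    # e.g. {"test": [["sample-001", "annotator1/sample-001.hdf5"], ...]}
    result = compare_annotators(pathlib.Path("annotator1.json"), pathlib.Path("annotator2.json"))
    for split, metrics in result.items():
        print(split, metrics)  # e.g. "test" {"precision": ..., "recall": ..., ...}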

mednet.engine.segment.evaluator.run(predictions, steps, threshold, metric)[source]

Evaluate a segmentation model.

Parameters:
  • predictions (Path) – Path to the file predictions.json, containing the list of predictions to be evaluated.

  • steps (int) – The number of steps in the interval [0, 1] from which to build a threshold list. The thresholds in this list are applied to the probability outputs, and true/false positive/negative counts are generated from those.

  • threshold (str | float) – Which threshold to apply when generating unary summaries of the performance. This can be a value between [0, 1], or the name of a split in predictions at which a threshold will be calculated.

  • metric (Literal['precision', 'recall', 'specificity', 'accuracy', 'jaccard', 'f1']) – The name of a supported metric that will be used to evaluate the best threshold from a threshold list uniformly split in steps, and for which unary summaries are generated.

Return type:

tuple[dict[str, dict[str, Any]], float]

Returns:

A JSON-able summary with all figures of merit pre-calculated, for all splits. This is a dictionary where keys are split-names contained in predictions, and values are dictionaries with the following keys:

  • counts: dictionary where keys are thresholds, and values are sequences of integers containing the TP, FP, TN, FN (in this order).

  • roc_auc: a float indicating the area under the ROC curve for the split. It is calculated using a trapezoidal rule.

  • average_precision: a float indicating the area under the precision-recall curve, calculated using a rectangle rule.

  • curves: dictionary with 2 keys:

    • roc: dictionary with 3 keys:

      • fpr: a list of floats with the false-positive rate

      • tpr: a list of floats with the true-positive rate

      • thresholds: a list of thresholds uniformly separated by steps, at which both fpr and tpr are evaluated.

    • precision_recall: a dictionary with 3 keys:

      • precision: a list of floats with the precision

      • recall: a list of floats with the recall

      • thresholds: a list of thresholds uniformly separated by steps, at which both precision and recall are evaluated.

  • threshold_a_priori: boolean indicating whether the threshold used for unary metrics was chosen a priori or a posteriori for this split.

  • <metric-name>: a float representing the supported metric at the threshold that maximizes metric. There will be one entry of this type for each metric in SUPPORTED_METRIC_TYPE.

Also returns the threshold considered for all splits.
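
A minimal usage sketch; the predictions path, split name and metric below are hypothetical placeholders:

    import pathlib

    from mednet.engine.segment.evaluator import run

    eval_data, chosen_threshold = run(
        predictions=pathlib.Path("results/predictions/predictions.json"),  # hypothetical path
        steps=100,
        threshold="validation",  # hypothetical split name; a float in [0, 1] also works
        metric="f1",
    )
    for split, data in eval_data.items():
        print(split, data["roc_auc"], data["average_precision"])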

mednet.engine.segment.evaluator.make_table(eval_data, threshold, format_)[source]

Extract and format table from pre-computed evaluation data.

Extracts elements from eval_data that can be displayed in a terminal-style table, formats them, and returns the result.

Parameters:
  • eval_data (dict[str, dict[str, Any]]) – Evaluation data as returned by run().

  • threshold (float) – The threshold value used to compute unary metrics on all splits.

  • format_ – A supported tabulate format.

Return type:

str

Returns:

A string representation of a table.

mednet.engine.segment.evaluator.make_plots(eval_data)[source]

Create plots for all curves in eval_data.

Parameters:

eval_data (dict[str, dict[str, Any]]) – Evaluation data as returned by run().

Return type:

list

Returns:

A list of figures to record to file.
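
Continuing the run() sketch above, the evaluation data can be rendered as a table and as plots; "rst" is just one of the formats supported by tabulate, and saving the figures assumes they are matplotlib figures:

    from mednet.engine.segment.evaluator import make_plots, make_table

    # "eval_data" and "chosen_threshold" as obtained from run() above
    print(make_table(eval_data, chosen_threshold, "rst"))

    for index, figure in enumerate(make_plots(eval_data)):
        # assuming matplotlib figures, which expose savefig()
        figure.savefig(f"curves-{index}.pdf")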