mednet.engine.segment.evaluator¶
Defines functionality for the evaluation of predictions.
Module Attributes

- SUPPORTED_METRIC_TYPE: Supported metrics for evaluation of counts.

Functions

- accuracy(tp, fp, tn, fn): Calculate the accuracy given true/false positive/negative counts.
- all_metrics(tp, fp, tn, fn): Compute all available metrics at once.
- compare_annotators(a1, a2): Compare annotators and output all supported metrics.
- compute_metric(counts, metric): Compute metric for every row of counts.
- f1_score(tp, fp, tn, fn): Calculate the F1 score given true/false positive/negative counts.
- get_counts_for_threshold(pred, gt, mask, threshold): Calculate counts on one single sample, for a specific threshold.
- jaccard(tp, fp, tn, fn): Calculate the Jaccard index given true/false positive/negative counts.
- load_count(prediction_path, predictions, thresholds): Count true/false positive/negatives for the subset.
- load_predictions(prediction_path, predictions): Load predictions and ground-truth from HDF5 files.
- Create plots for all curves in ….
- Extract and format table from pre-computed evaluation data.
- name2metric(name): Convert a string name to a callable for summarizing counts.
- precision(tp, fp, tn, fn): Calculate the precision given true/false positive/negative counts.
- recall(tp, fp, tn, fn): Calculate the recall given true/false positive/negative counts.
- run(predictions, steps, threshold, metric): Evaluate a segmentation model.
- specificity(tp, fp, tn, fn): Calculate the specificity given true/false positive/negative counts.
- tfpn_masks(pred, gt, threshold): Calculate true and false positives and negatives.
- validate_threshold(threshold, splits): Validate the user threshold selection and return the parsed threshold.
- mednet.engine.segment.evaluator.SUPPORTED_METRIC_TYPE¶
Supported metrics for evaluation of counts.
alias of Literal['precision', 'recall', 'specificity', 'accuracy', 'jaccard', 'f1']
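Because SUPPORTED_METRIC_TYPE is a plain typing.Literal alias, the allowed metric names can be enumerated with the standard library. A minimal sketch:

    from typing import get_args

    from mednet.engine.segment.evaluator import SUPPORTED_METRIC_TYPE

    # get_args() on a Literal alias returns the tuple of allowed values.
    print(get_args(SUPPORTED_METRIC_TYPE))
    # ('precision', 'recall', 'specificity', 'accuracy', 'jaccard', 'f1')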
- mednet.engine.segment.evaluator.precision(tp, fp, tn, fn)[source]¶
Calculate the precision given true/false positive/negative counts.
P, AKA positive predictive value (PPV). It corresponds arithmetically to tp/(tp+fp). In the special case where tp+fp == 0, this function returns zero for precision.
- mednet.engine.segment.evaluator.recall(tp, fp, tn, fn)[source]¶
Calculate the recall given true/false positive/negative counts.
R, AKA sensitivity, hit rate, or true positive rate (TPR). It corresponds arithmetically to tp/(tp+fn). In the special case where tp+fn == 0, this function returns zero for recall.
- mednet.engine.segment.evaluator.specificity(tp, fp, tn, fn)[source]¶
Calculate the specificity given true/false positive/negative counts.
S, AKA selectivity or true negative rate (TNR). It corresponds arithmetically to tn/(tn+fp). In the special case where tn+fp == 0, this function returns zero for specificity.
- mednet.engine.segment.evaluator.accuracy(tp, fp, tn, fn)[source]¶
Calculate the accuracy given true/false positive/negative counts.
A, see Accuracy. It is the proportion of correct predictions (both true positives and true negatives) among the total number of pixels examined. It corresponds arithmetically to (tp+tn)/(tp+tn+fp+fn). This measure includes both true negatives and true positives in the numerator, which makes it sensitive to data or regions without annotations.
- mednet.engine.segment.evaluator.jaccard(tp, fp, tn, fn)[source]¶
Calculate the Jaccard index given true/false positive/negative counts.
J, see Jaccard Index or Similarity. It corresponds arithmetically to tp/(tp+fp+fn). In the special case where tp+fp+fn == 0, this function returns zero for the Jaccard index. The Jaccard index depends on a TP-only numerator, similarly to the F1 score. For regions where there are no annotations, the Jaccard index will always be zero, irrespective of the model output. Accuracy may be a better proxy if one needs to consider the true absence of annotations in a region as part of the measure.
- mednet.engine.segment.evaluator.f1_score(tp, fp, tn, fn)[source]¶
Calculate the F1 score given true/false positive/negative counts.
F1, see F1-score. It corresponds arithmetically to 2*P*R/(P+R) or 2*tp/(2*tp+fp+fn). In the special case where P+R == (2*tp+fp+fn) == 0, this function returns zero for the F1 score. The F1 (or Dice) score depends on a TP-only numerator, similarly to the Jaccard index. For regions where there are no annotations, the F1 score will always be zero, irrespective of the model output. Accuracy may be a better proxy if one needs to consider the true absence of annotations in a region as part of the measure.
- mednet.engine.segment.evaluator.name2metric(name)[source]¶
Convert a string name to a callable for summarizing counts.
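A minimal usage sketch, assuming the returned callable takes the same four counts as the metric functions above:

    from mednet.engine.segment.evaluator import name2metric

    # Resolve the "f1" summarizer by name and apply it to illustrative counts.
    f1 = name2metric("f1")
    print(f1(80, 20, 890, 10))  # same value as f1_score(80, 20, 890, 10)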
- mednet.engine.segment.evaluator.all_metrics(tp, fp, tn, fn)[source]¶
Compute all available metrics at once.
- Parameters:
- Return type:
- Returns:
All supported metrics, in the order defined by SUPPORTED_METRIC_TYPE.
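A minimal sketch, assuming the return value iterates in the order given by SUPPORTED_METRIC_TYPE:

    from typing import get_args

    from mednet.engine.segment.evaluator import SUPPORTED_METRIC_TYPE, all_metrics

    # Pair every value with its metric name (illustrative counts).
    values = all_metrics(80, 20, 890, 10)
    print(dict(zip(get_args(SUPPORTED_METRIC_TYPE), values)))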
- mednet.engine.segment.evaluator.tfpn_masks(pred, gt, threshold)[source]¶
Calculate true and false positives and negatives.
All input arrays should have matching sizes.
- Parameters:
pred (ndarray[tuple[int, ...], dtype[float32]]) – Pixel-wise predictions as output by your model.
gt (ndarray[tuple[int, ...], dtype[bool]]) – Ground-truth (annotations).
threshold (float) – A particular threshold at which to calculate the performance measures. Values at this threshold are counted as positives.
- Return type:
tuple[ndarray[tuple[int, ...], dtype[bool]], ndarray[tuple[int, ...], dtype[bool]], ndarray[tuple[int, ...], dtype[bool]], ndarray[tuple[int, ...], dtype[bool]]]
- Returns:
tp – Boolean array with true positives, considering all observations.
fp – Boolean array with false positives, considering all observations.
tn – Boolean array with true negatives, considering all observations.
fn – Boolean array with false negatives, considering all observations.
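A minimal sketch on a toy 2x2 image, assuming values at or above the threshold count as positives (as stated above):

    import numpy

    from mednet.engine.segment.evaluator import tfpn_masks

    # Model probabilities and boolean ground-truth for a 2x2 image.
    pred = numpy.array([[0.9, 0.2], [0.7, 0.1]], dtype=numpy.float32)
    gt = numpy.array([[True, False], [False, False]])

    tp, fp, tn, fn = tfpn_masks(pred, gt, threshold=0.5)
    print(tp.sum(), fp.sum(), tn.sum(), fn.sum())  # 1 1 2 0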
- mednet.engine.segment.evaluator.get_counts_for_threshold(pred, gt, mask, threshold)[source]¶
Calculate counts on one single sample, for a specific threshold.
- Parameters:
pred (ndarray[tuple[int, ...], dtype[float32]]) – Array with pixel-wise predictions.
gt (ndarray[tuple[int, ...], dtype[bool]]) – Array with ground-truth (annotations).
mask (ndarray[tuple[int, ...], dtype[bool]]) – Array with region mask marking parts to ignore.
threshold (float) – A particular threshold at which to calculate the performance measures.
- Return type:
- Returns:
The true positives, false positives, true negatives and false negatives, in that order.
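A minimal sketch, assuming the mask marks the region that is evaluated (elements outside the mask are ignored, as noted for compare_annotators() below):

    import numpy

    from mednet.engine.segment.evaluator import get_counts_for_threshold

    pred = numpy.array([[0.9, 0.2], [0.7, 0.1]], dtype=numpy.float32)
    gt = numpy.array([[True, False], [False, False]])
    # Hypothetical mask restricting evaluation to the first column.
    mask = numpy.array([[True, False], [True, False]])

    tp, fp, tn, fn = get_counts_for_threshold(pred, gt, mask, threshold=0.5)
    print(tp, fp, tn, fn)  # counts restricted to the masked-in region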
- mednet.engine.segment.evaluator.load_count(prediction_path, predictions, thresholds)[source]¶
Count true/false positive/negatives for the subset.
This function will load predictions from their storage location and will cumulatively count the number of true positives, false positives, true negatives and false negatives across the various thresholds. This alternative provides a memory-bound way to compute the performance of splits with potentially very large images, or including a large/very large number of samples. Unfortunately, sklearn does not provide functions to compute standard metrics from true/false positive/negative counts, which implies one needs to make use of further functions defined in this module to compute such metrics. Alternatively, you may look into load_predictions(), if you want to use sklearn functions to compute metrics.
- Parameters:
prediction_path (Path) – Base directory where the prediction files (HDF5) were stored.
predictions (Sequence[str]) – A list of relative sample prediction paths to consider for measurement.
thresholds (ndarray[tuple[int, ...], dtype[float64]]) – A sequence of thresholds to be applied on predictions, when evaluating true/false positive/negative counts.
- Return type:
- Returns:
A 2-D array with shape (len(thresholds), 4), where each row contains the counts of true positives, false positives, true negatives and false negatives, for the related threshold, and for the whole dataset.
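A usage sketch; the prediction directory and sample paths below are hypothetical placeholders for the HDF5 files written by the prediction step:

    from pathlib import Path

    import numpy

    from mednet.engine.segment.evaluator import load_count

    base = Path("results/predictions")  # hypothetical base directory
    samples = ["test/sample-001.hdf5", "test/sample-002.hdf5"]

    # 100 candidate thresholds uniformly spread over [0, 1].
    thresholds = numpy.linspace(0.0, 1.0, 100)

    counts = load_count(base, samples, thresholds)
    print(counts.shape)  # (100, 4): TP, FP, TN, FN per threshold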
- mednet.engine.segment.evaluator.load_predictions(prediction_path, predictions)[source]¶
Load predictions and ground-truth from HDF5 files.
Loading pixel data as simple binary predictions with associated labels allows using the sklearn library to compute most metrics defined in this module. Note however that computing metrics this way requires pre-allocation of a potentially large vector, which depends on the number of samples and the size of said samples. This may not work well for very large datasets of large/very large images. Currently, the evaluation system uses load_count() instead, which loads and pre-computes the number of true/false positives/negatives using a list of candidate thresholds.
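A sketch of the sklearn-based route described above. The exact return structure of load_predictions() is not documented here, so unpacking it into per-pixel scores and boolean labels is an assumption made purely for illustration:

    from pathlib import Path

    import sklearn.metrics

    from mednet.engine.segment.evaluator import load_predictions

    base = Path("results/predictions")  # hypothetical base directory
    samples = ["test/sample-001.hdf5", "test/sample-002.hdf5"]

    # ASSUMPTION: predictions and labels come back as flat arrays usable by sklearn.
    scores, labels = load_predictions(base, samples)
    print(sklearn.metrics.average_precision_score(labels, scores))
    print(sklearn.metrics.roc_auc_score(labels, scores))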
- mednet.engine.segment.evaluator.compute_metric(counts, metric)[source]¶
Compute metric for every row of counts.
- Parameters:
counts (ndarray[tuple[int, ...], dtype[uint64]]) – A 2-D array with shape (*, 4), where each row contains the counts of true positives, false positives, true negatives and false negatives that need to be evaluated.
metric (Union[Callable[[int, int, int, int], float], Callable[[int, int, int, int], tuple[float, ...]]]) – A callable that takes 4 integers representing true positives, false positives, true negatives and false negatives, and outputs one or more floating-point metrics.
- Return type:
- Returns:
A 1-D array containing the provided metric computed along the first dimension, with as many columns as metric provides in each call.
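A minimal sketch evaluating one of the count-based metrics defined above over a small counts table:

    import numpy

    from mednet.engine.segment.evaluator import compute_metric, precision

    # One row of TP, FP, TN, FN counts per threshold (illustrative values).
    counts = numpy.array(
        [
            [80, 20, 890, 10],
            [70, 10, 900, 20],
        ],
        dtype=numpy.uint64,
    )

    # Evaluate precision for every row of counts.
    print(compute_metric(counts, precision))  # e.g. [0.8, 0.875]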
- mednet.engine.segment.evaluator.validate_threshold(threshold, splits)[source]¶
Validate the user threshold selection and return the parsed threshold.
- mednet.engine.segment.evaluator.compare_annotators(a1, a2)[source]¶
Compare annotators and outputs all supported metrics.
- Parameters:
a1 (Path) – Annotator 1 annotations in the form of a JSON file mapping split-names to lists of lists, each containing the sample name and the (relative) location of an HDF5 file containing at least one boolean dataset named target. This dataset is considered as the annotations from the first annotator. If a boolean mask is available, it is also loaded. All elements outside the mask are not considered during the metrics calculations.
a2 (Path) – Annotator 2 annotations in the form of a JSON file mapping split-names to lists of lists, each containing the sample name and the (relative) location of an HDF5 file containing at least one boolean dataset named target. This dataset is considered as the annotations from the second annotator.
- Return type:
- Returns:
A dictionary that maps split-names to another dictionary with metric names and values computed by comparing the targets in a1 and a2.
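A usage sketch; the JSON file locations below are hypothetical:

    from pathlib import Path

    from mednet.engine.segment.evaluator import compare_annotators

    # Hypothetical annotation manifests for each annotator.
    result = compare_annotators(
        Path("annotations/annotator-1.json"),
        Path("annotations/annotator-2.json"),
    )

    # One entry per split, each mapping metric names to values.
    for split, metrics in result.items():
        print(split, metrics)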
- mednet.engine.segment.evaluator.run(predictions, steps, threshold, metric)[source]¶
Evaluate a segmentation model.
- Parameters:
predictions (Path) – Path to the file predictions.json, containing the list of predictions to be evaluated.
steps (int) – The number of steps between [0, 1] to build a threshold list from. This list will be applied to the probability outputs and true/false positive/negative counts generated from those.
threshold (str | float) – Which threshold to apply when generating unary summaries of the performance. This can be a value between [0, 1], or the name of a split in predictions at which a threshold will be calculated.
metric (Literal['precision', 'recall', 'specificity', 'accuracy', 'jaccard', 'f1']) – The name of a supported metric that will be used to evaluate the best threshold from a threshold list uniformly split in steps, and for which unary summaries are generated.
- Return type:
- Returns:
A JSON-able summary with all figures of merit pre-calculated, for all splits. This is a dictionary where keys are split-names contained in predictions, and values are dictionaries with the following keys:
- counts: dictionary where keys are thresholds and values are sequences of integers containing the TP, FP, TN, FN (in this order).
- roc_auc: a float indicating the area under the ROC curve for the split. It is calculated using a trapezoidal rule.
- average_precision: a float indicating the area under the precision-recall curve, calculated using a rectangle rule.
- curves: dictionary with 2 keys:
  - roc: dictionary with 3 keys:
    - fpr: a list of floats with the false-positive rate.
    - tpr: a list of floats with the true-positive rate.
    - thresholds: a list of thresholds uniformly separated by steps, at which both fpr and tpr are evaluated.
  - precision_recall: a dictionary with 3 keys:
    - precision: a list of floats with the precision.
    - recall: a list of floats with the recall.
    - thresholds: a list of thresholds uniformly separated by steps, at which both precision and recall are evaluated.
- threshold_a_priori: boolean indicating whether the threshold for unary metrics was chosen a priori or a posteriori in this split.
- <metric-name>: a float representing the supported metric at the threshold that maximizes metric. There will be one entry of this type for each of the SUPPORTED_METRIC_TYPEs.
Also returns the threshold considered for all splits.
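A usage sketch; the predictions.json path and the "validation" split name are hypothetical, and since the docstring above notes that the considered threshold is also returned, the exact return shape should be checked before unpacking:

    from pathlib import Path

    from mednet.engine.segment.evaluator import run

    result = run(
        predictions=Path("results/predictions.json"),  # hypothetical path
        steps=100,
        threshold="validation",  # compute the threshold a priori on this split
        metric="f1",
    )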