Statistical Validation of Image Segmentation Quality Based on a Spatial Overlap Index: Scientific Reports


Zou KH, Warfield SK, Bharatha A, Tempany CM, Kaus MR, Haker SJ, Wells WM, Jolesz FA, Kikinis R. Statistical Validation of Image Segmentation Quality Based on a Spatial Overlap Index: Scientific Reports. Acad Radiol. 2004;11(2):178-89.

Date Published: 2004 Feb


RATIONALE AND OBJECTIVES: To examine a statistical validation method based on the spatial overlap between two sets of segmentations of the same anatomy.

MATERIALS AND METHODS: The Dice similarity coefficient (DSC) was used as a statistical validation metric to evaluate both the reproducibility of manual segmentations and the spatial overlap accuracy of automated probabilistic fractional segmentation of MR images, illustrated on two clinical examples. Example 1: 10 consecutive prostate brachytherapy patients underwent both preoperative 1.5T and intraoperative 0.5T MR imaging. For each case, 5 repeated manual segmentations of the prostate peripheral zone were performed separately on the preoperative and intraoperative images. Example 2: A semi-automated probabilistic fractional segmentation algorithm was applied to MR imaging of 9 cases with 3 types of brain tumors. DSC values were computed, and the means of the logit-transformed values were compared using analysis of variance (ANOVA).

RESULTS: Example 1: The mean DSC was 0.883 (range, 0.876-0.893) with 1.5T preoperative MRI and 0.838 (range, 0.819-0.852) with 0.5T intraoperative MRI (P < .001); these values were within and at the margin of the range of good reproducibility, respectively. Example 2: Wide ranges of DSC were observed in brain tumor segmentations: meningiomas (0.519-0.893), astrocytomas (0.487-0.972), and other mixed gliomas (0.490-0.899).

CONCLUSION: The DSC value is a simple and useful summary measure of spatial overlap, which can be applied to studies of reproducibility and accuracy in image segmentation. We observed generally satisfactory but variable validation results in two clinical applications. This metric may be adapted for similar validation tasks.
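For readers unfamiliar with the metric, the DSC of two binary segmentations A and B is conventionally defined as 2|A ∩ B| / (|A| + |B|), and the logit transform maps a DSC in (0, 1) to ln(DSC / (1 - DSC)). The following is a minimal Python sketch of both quantities, not the authors' implementation. The function names `dice` and `logit`, the flattened-list mask representation, the handling of two empty masks, and the toy masks are all assumptions made for illustration; none of the values come from the study.

```python
import math


def dice(mask_a, mask_b):
    """Dice similarity coefficient of two binary masks.

    Masks are assumed to be equal-length flat sequences of 0/1 (or
    False/True) voxel labels; this representation is illustrative only.
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must contain the same number of voxels")
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        # Assumed convention: two empty masks are treated as full agreement.
        return 1.0
    return 2.0 * overlap / (size_a + size_b)


def logit(dsc):
    """Logit transform ln(dsc / (1 - dsc)); defined only for 0 < dsc < 1."""
    if not 0.0 < dsc < 1.0:
        raise ValueError("logit is undefined for DSC values of 0 or 1")
    return math.log(dsc / (1.0 - dsc))


# Hypothetical toy masks: each labels 4 voxels, and 3 of them coincide.
seg_1 = [0, 1, 1, 1, 1, 0, 0, 0]
seg_2 = [0, 0, 1, 1, 1, 1, 0, 0]

dsc = dice(seg_1, seg_2)   # 2 * 3 / (4 + 4) = 0.75
print(dsc, logit(dsc))     # logit(0.75) = ln(3)
```

The logit step maps the bounded DSC onto the whole real line, which is generally better suited to a comparison of means such as ANOVA; the logit-transformed values could then be passed to any standard ANOVA routine.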

Last updated on 05/04/2017