Comparison of segmentation algorithms for organs at risk delineation on head-and-neck CT images
Abstract
Purpose or Objective
To compare the performance of several atlas-based auto-segmentation (ABAS) methods with that of a deep learning (DL) method for contouring head-and-neck organs at risk (OARs).
Material and Methods
Ten heterogeneous head-and-neck (HN) patients, with body mass index ranging from 18.9 to 30.7, were selected for the atlas collection. On each patient's CT data set, 20 OARs were manually delineated by expert physicians. Three different ABAS label-fusion algorithms were tested on 15 HN patients using the Advanced Medical Imaging Registration Engine (ADMIRE) v3.26 (Elekta AB, Stockholm, Sweden): Simultaneous Truth and Performance Level Estimation (STAPLE), Patch Fusion (PF) and Random Forest (RF) label fusion. Their performance was evaluated against the manually contoured OARs in terms of the Dice similarity coefficient (DICE), the Hausdorff distance (HD) and the 95th-percentile Hausdorff distance (HD95). The results were compared with two commercially available solutions: one ABAS solution (MIM-Maestro 7.0.2, Cleveland, USA), which uses a majority-vote algorithm for label fusion, and one DL solution trained on multi-centric data (ART-plan Annotate, Therapanacea, Paris, France).
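For readers unfamiliar with the evaluation metrics, the following is a minimal sketch of how DICE, HD and HD95 can be computed for two binary masks. It is not the evaluation code used in this study. The function names, the erosion-based surface extraction, the default unit voxel spacing and the choice to take the larger of the two directed percentiles for HD95 are all assumptions made for illustration; other HD95 conventions exist and the abstract does not specify which one was applied.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / total


def surface_points(mask, spacing):
    """Physical coordinates of the boundary voxels of a binary mask."""
    m = np.asarray(mask, dtype=bool)
    # Boundary = voxels removed by a single binary erosion.
    border = m & ~binary_erosion(m)
    return np.argwhere(border) * np.asarray(spacing, dtype=float)


def hausdorff(mask_a, mask_b, spacing=None, percentile=100.0):
    """Symmetric Hausdorff distance between mask surfaces.

    percentile=100 gives the classical HD, percentile=95 gives HD95
    (assumed convention: maximum of the two directed percentiles).
    Both masks must be non-empty. Pairwise distances are held in
    memory, so this sketch suits small structures only.
    """
    if spacing is None:
        spacing = (1.0,) * np.ndim(mask_a)  # assumed unit voxel spacing
    pts_a = surface_points(mask_a, spacing)
    pts_b = surface_points(mask_b, spacing)
    d = cdist(pts_a, pts_b)
    a_to_b = d.min(axis=1)  # each surface point of A to its nearest point of B
    b_to_a = d.min(axis=0)  # each surface point of B to its nearest point of A
    return float(max(np.percentile(a_to_b, percentile),
                     np.percentile(b_to_a, percentile)))
```

In practice, `spacing` would be taken from the CT voxel size so that the distances are expressed in physical units rather than in voxels.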
Results
All solutions outperformed the MIM-Maestro ABAS software for every OAR. Among the three Elekta ABAS label-fusion algorithms, RF label fusion, which incorporates artificial-intelligence learning features, gave the best results for the majority of structures but did not segment the optic nerves or the cochleae (Fig. 1). Compared to the DL solution, ADMIRE RF had equal, better and worse results for 3, 7 and 5 of the 15 common OARs, respectively. ADMIRE RF obtained better DICE (ADMIRE RF vs DL) for the eyes (0.91 vs 0.87), the larynx (0.79 vs 0.77), the oral cavity (0.87 vs 0.84), the mandible (0.92 vs 0.89), one submandibular gland (0.78 vs 0.77) and the external contour (0.99 vs 0.76). The HD metrics were consistent with the DICE results, showing smaller distances for these OARs, particularly for the oral cavity (HD: 10.71 vs 13.93; HD95: 6.53 vs 9.70) and the mandible (HD: 8.10 vs 10.71; HD95: 2.22 vs 3.56). The DL model outperformed ADMIRE RF (values given as DL vs ADMIRE RF) for one parotid gland (DICE: 0.82 vs 0.80; HD: 13.45 vs 13.33; HD95: 7.49 vs 6.98), one submandibular gland (DICE: 0.81 vs 0.75; HD: 8.47 vs 8.14; HD95: 4.76 vs 4.87), the esophagus (DICE: 0.73 vs 0.61; HD: 30.44 vs 34.06; HD95: 23.62 vs 24.71) and the thyroid (DICE: 0.85 vs 0.74; HD: 8.43 vs 13.73; HD95: 3.04 vs 8.00).
Conclusion
The ADMIRE multi-atlas ABAS was capable of segmenting OARs in the HN region with accuracy comparable or superior to that of a DL model. Further work is in progress to evaluate the clinical value of these results by assessing the impact of contouring inaccuracies on dose coverage and by measuring the time needed to manually correct the automatically delineated contours.