Learning a CNN on multiple sclerosis lesion segmentation with self-supervision

Multiple sclerosis (MS) is a chronic, often disabling, autoimmune disease affecting the central nervous system and characterized by demyelination and neuropathic alterations. Magnetic resonance (MR) imaging plays a pivotal role in the diagnosis and screening of MS. MR images make it possible to identify and localize demyelinating lesions (or plaques) and possible associated atrophic lesions, whose MR appearance is related to the evolution of the disease. We propose a novel MS lesion segmentation method for MR images, based on convolutional neural networks (CNNs) and partial self-supervision, and study the pros and cons of using self-supervision for this segmentation task. Investigating transferability by freezing the first convolutional layers, we found that improvements are obtained when the CNN is retrained from the first layers. We believe these results suggest that MRI segmentation is a singular task needing high-level analysis from the very first stages of the vision process, as opposed to vision tasks aimed at day-to-day life such as face recognition or traffic sign classification. Segmentation quality was evaluated on full-size binary maps assembled from predictions on image patches from an unseen database.


Introduction
Multiple sclerosis (MS) is an autoimmune disease of the central nervous system. It affects between 1 and more than 200 in 100 000 people depending on the region [1], generally appears around 30 years of age [2] and can rapidly induce severe disability [3]. Magnetic resonance (MR) imaging is one of the most valuable exams for the diagnosis, prognosis and follow-up of MS [4]. MR images make it possible to identify, localize and count demyelinating lesions and to determine their activity. This procedure is repetitive and time-consuming, is often accomplished with computer vision-based virtual assistance, and is subject to inter-observer variability [5].
Interest in automatic white matter (WM) lesion segmentation, and especially MS lesion segmentation, has grown significantly in the past decade. Several challenges, such as the Medical Image Computing & Computer Assisted Intervention (MICCAI) MS lesion segmentation challenge 2016 [6], have been organized for better performance evaluation within the computer vision community. Annotated patient datasets have also been made publicly available, making it easier to explore the capacity of machine learning algorithms such as CNNs to synthesize semantics from medical images. Recent research in this field studies the importance of some parameters and suggests different techniques to improve segmentation, most of them based on CNNs. Nair et al. [7] proposed to use Monte Carlo dropout in a CNN to obtain segmentation indicators such as prediction variability. Roy et al. [8] presented a convolutional network and showed improvements by increasing the patch size. Hashemi et al. [9] adapted a 3D U-net with fully connected ones in the encoder and decoder paths, obtained better results, and studied the influence of the loss function parameters and of the patch fusion strategy. Valverde et al. [10] were interested in the reusability of their CNN for images from other centers with other MR scanners and protocols, and showed that good results can be obtained with few new annotations and parameter fine-tuning. McKinley et al. [11] demonstrated that simultaneously segmenting WM lesions and brain tissues improves the quality of segmentation, and Brosch et al. [12] pre-trained their CNN with convolutional restricted Boltzmann machines in an unsupervised way to improve segmentation performance.
All of the aforementioned studies used CNNs on either 2D slices or 3D volumes, some using patches and others complete slices or volumes from multi-modal MR images. Most of them used an encoder-decoder architecture with skip connections, more or less close to a U-Net [13]. Only Valverde et al. [10] treated the segmentation task as a voxel-by-voxel classification.
While deep convolutional networks have gradually become a reference in computer vision, as a strongly data-driven supervised technique [14,15], their use in medical imaging is often limited by the small number of samples available. Another difficulty is obtaining expert annotations from radiologists, as this requires a lot of time that is not always available [16]. Fortunately, techniques such as data augmentation and transfer learning have been proposed to overcome such limitations [17].
Because of the large amount of unannotated MR images available in hospitals, we propose to improve segmentation results by leveraging available unannotated data, without requiring more annotations. Using self-supervision, such a process can be adapted to almost all techniques and tasks in MR and computed tomography (CT) imaging.
The self-supervised technique introduced by Doersch et al. in [18] aims at reducing the need for large numbers of annotated samples. This technique is based on context learning [19] and transfer learning [20]. It relies on training a neural network on an unsupervised, seemingly useless task (i.e. one which does not need annotation) to learn context, and on partially reusing the resulting network to learn inference on the initial, supervised target task. As the network is already trained to understand incoming image data on a similar but simpler task, fewer annotated samples are needed.
To the best of our knowledge, only a few studies such as [21,22] have reported the use of self-supervision in medical image analysis. Therefore, we propose a novel strategy to learn context for self-supervision in medical imaging and evaluate its performance for multiple sclerosis lesion segmentation.
We assume that our implementation of self-supervision can guide the supervised learning task within a CNN already designed for MR image analysis, thus reducing potentially bad solutions and improving the overall quality of segmentation.
Investigating the potential of self-supervision in MR imaging is important because this technology evolves very quickly, leading to an exponential increase of already high-dimension data spaces for which only few samples are available. In our case we expect to work on multiple MRI sequences in the coming years to study MS, starting with very few patients. In this context we will need new strategies for data augmentation or advanced model learning.

Material and method
The neural network described by Isensee et al. [23] was chosen because of its proven efficiency in segmentation. It ranked third in the MICCAI Brain Tumor Segmentation (BraTS) challenge 2017 and uses an encoder-decoder architecture with skip connections, as in most recent papers on medical image segmentation. We adapted it to take five 3D patches as input, respectively extracted from the T1-weighted (T1W), T1-weighted with contrast-enhancing agent (T1Wc), T2-weighted (T2W), T2-weighted fluid-attenuated inversion recovery (FLAIR), and proton density-weighted (PDW) images, as these sequences are generally analyzed by the radiologist to evaluate the presence of multiple sclerosis lesions.
The initial, unsupervised task, aimed at self-learning from the data, consists in predicting the location of a local patch within the full image. We expect this multi-output regression task to share self-trained features with a model dedicated to brain lesion segmentation. Our segmentation architecture is an encoder-decoder, so we only trained the encoder part, as no reconstruction is needed for this task. The encoder had to predict the x, y and z coordinates of one patch lying in the whole image. The trained encoder weights were then transferred to the segmentation encoder for the supervised task of MS lesion segmentation. During training, the segmentation was performed and evaluated on patches. The method is summarized in figure 1.
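The pretext task described above can be illustrated with a minimal sketch of how one training sample might be drawn. The function name `sample_location_patch`, the use of the patch corner as the target, and the normalization of coordinates to [0, 1] are assumptions made for illustration; the text only states that the encoder predicts the x, y and z coordinates of a patch.

```python
import numpy as np

def sample_location_patch(volume, patch_size=32, rng=None):
    """Draw one random cubic patch and its (x, y, z) position as a regression target.

    volume: array of shape (modalities, X, Y, Z).
    The target convention (normalized corner position) is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Last valid corner index along each spatial axis.
    max_start = np.array(volume.shape[-3:]) - patch_size
    # Upper bound is exclusive, hence the +1.
    start = rng.integers(0, max_start + 1)
    x, y, z = start
    patch = volume[..., x:x + patch_size, y:y + patch_size, z:z + patch_size]
    # Normalize the corner position to [0, 1] along each axis.
    target = (start / np.maximum(max_start, 1)).astype(np.float32)
    return patch, target

# Five modalities resized to 128 x 128 x 128 voxels, as in the text.
volume = np.zeros((5, 128, 128, 128), dtype=np.float32)
patch, target = sample_location_patch(volume, rng=np.random.default_rng(0))
```

No annotation is involved: the target is derived from the sampling position itself, which is what allows unannotated exams to be used for pretraining.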
We worked with normalized images aligned to each other and resized to 128 × 128 × 128 voxels. The patch size is set to 32 × 32 × 32, which is a good trade-off between capturing lesions with their surrounding context and having sub-images small enough to emulate sample data augmentation. This choice has already shown good results in Roy et al. [8]. Our CNNs were not trained with data augmentation, since it drastically increases training time, even though it has been shown to improve results. Samples missing a given input modality (e.g. T1Wc or PDW) in the dataset were given a zero-valued image as a substitute, and the sequence dropout technique detailed in [24] was also applied. This technique consists in randomly setting input modalities to zero, so that the fitted CNN model is generic enough to provide reliable predictions even when modalities are missing. We applied repeated random sub-sampling cross-validation during training. A batch consists of 12 randomly selected patches from the same patient, each containing at least 75% of brain, so that patches without brain are not taken into account. The CNN is trained to maximize the Dice score (equation 3), which is a compromise between precision (equation 2) and sensitivity (equation 1), as detailed in equation 4, and which reduces the impact of unbalanced data distribution on result evaluation. The Adam optimizer [25] is used to reduce convergence time.
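As a reminder of how the three metrics relate, the sketch below computes sensitivity, precision and Dice score on binary masks using their standard definitions. It is assumed here that these match the numbered equations referred to in the text, which are not reproduced in this section. The function name `segmentation_scores` and the handling of empty masks are illustrative choices, and a differentiable (soft) variant would be needed to use the Dice score as a training objective.

```python
import numpy as np

def segmentation_scores(pred, truth):
    """Return (sensitivity, precision, dice) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)    # lesion voxels correctly detected
    fp = np.count_nonzero(pred & ~truth)   # voxels wrongly marked as lesion
    fn = np.count_nonzero(~pred & truth)   # lesion voxels missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Convention (assumed): two empty masks count as perfect agreement.
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return sensitivity, precision, dice

pred = np.array([1, 1, 1, 0, 0, 0])
truth = np.array([1, 1, 0, 1, 0, 0])
sens, prec, dice = segmentation_scores(pred, truth)
# Here tp = 2, fp = 1, fn = 1, so all three scores equal 2/3.
```

True negatives do not appear in any of the three formulas, which is why the Dice score is less affected by the large proportion of non-lesion voxels. The Dice score is also the harmonic mean of precision and sensitivity, 2 · precision · sensitivity / (precision + sensitivity).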

Results
Four different public data sets of multiple sclerosis lesion segmentation were gathered, namely those of the MICCAI MS lesion segmentation challenges 2008 [26] and 2016 [6], the International Symposium on Biomedical Imaging (ISBI) 2015 MS segmentation challenge [5], consisting of 21 exams from 5 different patients at different time points, and the public data set of Lesjak et al. [27]. Two other public data sets, a manually selected subset of OASIS3 [28] with cognitively normal and declining patients and the MICCAI BraTS challenge 2017 [29] with brain tumors, were used as additional image sources. More information about the datasets is given in table 1. The MICCAI 2016 dataset was kept as the test dataset, so that results are obtained on MR input images as varied as possible.
IS&T International Symposium on Electronic Imaging 2020, 3D Measurement and Data Processing
All images were preprocessed to be as comparable as possible. The preprocessing closely follows the one performed in [5] for the ISBI challenge. Whenever possible, the unprocessed images were used. After a first N4 bias field correction [30], we chose to register all images to the FLAIR image, as most lesions in our datasets were segmented in the FLAIR space. Registration was performed with the FSL FLIRT tool [31,32,33]. Before skull stripping [34], histogram matching [35] was applied to each modality separately. The reference images for the histograms were chosen among visually satisfactory images from our datasets at this step of the preprocessing. The T1 image is then skull-stripped and registered to a 1 mm MNI brain template [36]. The brain mask and the transformation are then applied to the other image modalities. The global pipeline is illustrated in figure 2. We arbitrarily resized all our images to 128 × 128 × 128 voxels with a 1.422 × 1.703 × 1.422 mm³ resolution, in order to work with a reasonable image size while keeping a resolution sufficient to distinguish lesions.
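The voxel size quoted for the resized images can be checked with a short calculation. It assumes that the 1 mm MNI template spans a 182 × 218 × 182 voxel grid, which is the grid of the MNI152 1 mm template distributed with FSL; the exact template used in [36] is not specified here, so this figure is an assumption.

```python
# Assumed field of view of the 1 mm MNI template, in mm (one voxel = 1 mm).
mni_extent_mm = (182, 218, 182)
# Target grid used in the paper.
target_shape = (128, 128, 128)

# Resampling the same field of view onto fewer voxels enlarges each voxel.
voxel_size_mm = tuple(round(extent / n, 3)
                      for extent, n in zip(mni_extent_mm, target_shape))
print(voxel_size_mm)  # (1.422, 1.703, 1.422)
```

Under this assumption the computed values agree with the stated resolution, and they show that the resized voxels are anisotropic, with the coarsest sampling along the second axis.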

Figure 2. Steps of the preprocessing pipeline, illustrated with a T2 image.
Pretraining on the unsupervised regression task was performed with all training data sets, including those containing no patients with MS. A ratio of 50% of MS exams was maintained during each epoch, to ensure that the network can build general feature maps from different brain MRI textures while avoiding an imbalance between MS and non-MS exams. During the second, supervised learning stage on lesion segmentation, only MS datasets were used. The ground truth was randomly selected from the manually annotated maps, of which there can be as many as four for a single record, depending on the number of medical experts involved. As manual lesion delineation tolerance can vary from one expert to another, we assume that using such individual maps instead of combinations can help the CNN build its own consensus.
For each patient, 150 randomly chosen patches in the brain area were segmented. The final segmentation was obtained by averaging all overlapping predictions. This ensured a consensus prediction covering more than twice the brain volume.
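The two sampling rules described above can be sketched as follows. The function names, the exam identifiers and the choice of sampling with replacement are hypothetical; the text only specifies the 50% ratio of MS exams per epoch and the random choice of one expert map per record.

```python
import random

def build_pretraining_epoch(ms_exams, other_exams, n_exams, rng):
    """Build one epoch in which half of the exams come from MS datasets."""
    half = n_exams // 2
    # Sampling with replacement (an assumption) keeps the ratio at 50%
    # even when one group is much smaller than the other.
    epoch = rng.choices(ms_exams, k=half) + rng.choices(other_exams, k=n_exams - half)
    rng.shuffle(epoch)
    return epoch

def pick_ground_truth(rater_masks, rng):
    """Pick one expert annotation at random instead of merging them."""
    return rng.choice(rater_masks)

rng = random.Random(0)
epoch = build_pretraining_epoch(
    ["ms_a", "ms_b"],                     # hypothetical MS exam identifiers
    ["oasis_a", "brats_a", "brats_b"],    # hypothetical non-MS exam identifiers
    n_exams=10,
    rng=rng,
)
mask_id = pick_ground_truth(["rater_1", "rater_2", "rater_3"], rng)
```

Drawing a different expert map each time a record is seen exposes the network to the spread of expert opinions, which is the mechanism the text relies on for the CNN to build its own consensus.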
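The assembly of the full-size binary map from overlapping patch predictions can be sketched as below. The function name `assemble_segmentation`, the 0.5 threshold and the treatment of uncovered voxels as background are assumptions for illustration. The coverage figure follows from the stated sizes: 150 patches of 32³ voxels amount to about 2.34 times the 128³ grid, and therefore to more than twice the brain volume it contains.

```python
import numpy as np

def assemble_segmentation(patch_probs, corners, shape=(128, 128, 128), threshold=0.5):
    """Average overlapping patch predictions, then binarize.

    patch_probs: list of 3D arrays of lesion probabilities.
    corners: list of (x, y, z) positions of each patch in the full image.
    """
    prob_sum = np.zeros(shape, dtype=np.float32)
    counts = np.zeros(shape, dtype=np.float32)
    for probs, (x, y, z) in zip(patch_probs, corners):
        dx, dy, dz = probs.shape
        prob_sum[x:x + dx, y:y + dy, z:z + dz] += probs
        counts[x:x + dx, y:y + dy, z:z + dz] += 1
    # Voxels never covered by a patch keep probability 0 (assumed background).
    mean = np.divide(prob_sum, counts, out=np.zeros_like(prob_sum), where=counts > 0)
    return mean >= threshold

# Toy example: two 2x2x2 patches overlapping on a single voxel of a 4x4x4 grid.
patches = [np.full((2, 2, 2), 0.9, dtype=np.float32),
           np.full((2, 2, 2), 0.2, dtype=np.float32)]
corners = [(0, 0, 0), (1, 1, 1)]
mask = assemble_segmentation(patches, corners, shape=(4, 4, 4))

# Ratio between the total patch volume and the full image grid.
coverage = 150 * 32**3 / 128**3  # 2.34375
```

In the toy example the shared voxel receives the mean of 0.9 and 0.2, which stays above the threshold, while voxels seen only by the second patch are rejected.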
Five transfer strategies are evaluated:
• Weight transfer with the first six convolutional layers frozen (6CF)
• Weight transfer with the first three convolutional layers frozen (3CF)
• Weight transfer with the first two convolutional layers frozen (2CF)
• Weight transfer with the first convolutional layer frozen (1CF)
• Transferred weights used for initialization before fine-tuning (0CF)
For comparison, we trained a vanilla version without self-supervision, only on MS lesion segmentation (VAN). The similarity metric used for comparison is the Dice score.
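The five strategies differ only in how many of the transferred encoder convolutions are kept fixed. A framework-agnostic sketch, with a hypothetical helper `trainable_mask` and a hypothetical number of encoder convolutions, could look like this. In practice each flag would be mapped to the mechanism of the chosen framework, for example the `requires_grad` attribute of parameters in PyTorch or the `trainable` attribute of layers in Keras; the framework actually used is not stated here.

```python
# Number of leading convolutional layers frozen by each strategy.
FROZEN_LAYERS = {"6CF": 6, "3CF": 3, "2CF": 2, "1CF": 1, "0CF": 0}

def trainable_mask(strategy, n_encoder_convs):
    """Return one flag per encoder convolution: False = frozen, True = fine-tuned."""
    n_frozen = FROZEN_LAYERS[strategy]
    return [index >= n_frozen for index in range(n_encoder_convs)]

# Hypothetical encoder with 10 convolutional layers.
for name in FROZEN_LAYERS:
    print(name, trainable_mask(name, n_encoder_convs=10))
```

The VAN baseline does not appear in the table because it involves no transfer at all: its weights are initialized without pretraining and all layers are trained on the segmentation task.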
The mean results are presented in table 2, and the box plot showing the distribution of results on the test set is presented in figure 3, where the median is represented by the orange horizontal line. The evaluation was computed on a test set acquired with different machines and protocols, with a ground truth made of the consensus of the segmentations of different radiologists. We achieved good performance compared to recent publications, keeping in mind that the test sets differ and that our results cannot be fully compared with those of other studies such as Valverde et al. [37], who obtained a Dice score of 53.5%, Valcarcel et al. [38], with 56% and 57%, or Roy et al. [8], with 56.39%. The good quality of segmentation for all versions indicates that the CNN architecture, the overall training method and the training set were adapted to our task. The best Dice score and sensitivity are reached with the 0CF method, and the best precision is reached with the 1CF method.
Figure 3 shows that increasing the number of frozen layers lowers the Dice score and the sensitivity and widens the spread of the results. Precision also decreases with the number of frozen layers, but less than the other metrics.

Discussion
Only the 0CF version of self-supervision outperformed the VAN method on every metric. Our initial hypothesis and motivation for self-supervision were that the unsupervised regression and supervised segmentation tasks would share the first convolutional layers of the deep network model, as expected in multiple computer vision tasks. The results however indicate that this is not the case, and that the very first layers need to be retrained for better segmentation precision. A possible hypothesis is that MS lesion segmentation is a very specialized task which needs a fully dedicated prediction model, as opposed to day-to-day visual object recognition or classification tasks, which are assumed to share common ground.
The box plot in figure 3 shows that freezing convolutional layers decreased the Dice score and the sensitivity from the first frozen layer onwards. The loss of precision seems to appear only after the first two convolutional layers are frozen. These results suggest that freezing the first convolutional layers mainly affects the sensitivity of the CNN. This supports the assumption that the very first layers of the CNN capture much of the useful image information. It also suggests that the selection and gathering of the most discriminative features are achieved gradually after the first layer.
A visualisation of the segmentation achieved with our technique is shown in figure 4. In this picture, increasing the number of frozen layers decreases the size of the segmented area and increases the size of the area corresponding to false positives in the ellipse at the left frontal border of the ventricle. This may reveal underfitting, because this area usually contains lesions.

Our results give further support to the idea that regression and segmentation are very different tasks, even if they are performed on similar images. One possible explanation is that the training set was too large for the benefits of pretraining to be observed. Our observations agree with the hypotheses stated by Yosinski et al. [39], who argue that the specialization of a neural network increases with depth and that transfer learning seems to always provide improvements, even after fine-tuning from the first layer. In contrast to their findings, the first two layers are not general for both tasks in our method; however, Yosinski et al. studied two very close classification tasks, whereas our study involves a regression task and a segmentation task.

Conclusion
We propose a novel way to implement and use self-supervision in medical imaging, using the localization of partial image content as an unsupervised task whose trained weights are reused to fine-tune a supervised task of interest. We obtained good overall results compared to the state of the art. Our technique improves the quality of MS lesion segmentation, although not as much as expected and not in the way we anticipated. We observed that even the first layers of our CNN appear to be specialized for the segmentation task and are not as general as might be thought. The first layers seem to be more involved in the sensitivity of our CNN, indicating that the discrimination of meaningful information is carried out after those layers.
Our conclusions support the idea that using self-supervision for high-level human vision tasks such as medical imaging diagnosis is not as straightforward as it is for day-to-day vision tasks. Further investigation should be conducted to define the limits of vision task similarity and hierarchy. However, we believe that self-supervision techniques deserve wider use in this field, and that finding a good first unsupervised task to learn can lead to great improvements in medical imaging.