Convolutional neural network for smoke and fire semantic segmentation

In recent decades, global warming has contributed to an increase in the number and intensity of wildfires, destroying millions of hectares of forest and causing many casualties each year. Firemen must therefore have the most effective means to prevent any wildfire from breaking out and to fight the blaze before being unable to contain and extinguish it. This article presents a new network architecture based on a Convolutional Neural Network to detect and locate smoke and fire. This network generates fire and smoke masks in an RGB image by segmentation. The purpose of this work is to help firemen assess the extent of a fire or monitor an incipient fire in real time with a camera embedded in a vehicle. To train this network, a database with the corresponding images and masks has been created. Such a database will allow the performances of different networks to be compared. A comparison of this network with the best segmentation networks, such as the U-Net and Yuan networks, has highlighted its efficiency in terms of localization accuracy and reduction of false positive classifications such as clouds or haze. This architecture is also efficient in terms of segmentation time, making real-time processing possible.


INTRODUCTION
Each year, the news highlights the importance of fire detection when it comes to saving lives, wild forests and homes. Video images make it possible to detect and locate smoke and fire in real time and help firemen act quickly. Most of the time, smoke is the first sign of a fire outbreak. Smoke detection and localization provide information such as starting points, size and type, which is essential for the firemen to organize the action plan to protect the population and to put out the fire as quickly as possible. In the event of a wildfire, responsiveness is a very important factor in saving lives and protecting nature. Yann LeCun pioneered the use of Convolutional Neural Networks (CNN) for image classification. The accuracy of this type of neural network has kept growing for two decades, with a substantial rate of improvement for image classification [1,2]. CNN enhancements relate not only to the classification of images but also to the localization of objects [3][4][5]. Kaiming He et al. combine bounding boxes and segmentation to improve object localization [6].
In recent years, semantic segmentation methods have been proposed using convolution and deconvolution architectures [7]. The main advantage of semantic segmentation of RGB images is to detect and locate objects in a single operation, promptly and accurately. Generally, the network is trained by supervised learning on example pairs of input RGB images and output masks. We suggest studying and comparing different convolutional-deconvolutional segmentation architectures to detect and locate smoke and fire in RGB frames. Our goal is to find the best structure to segment smoke and fire compatible with real time.
Inspired by the success of fully convolutional network segmentation, we introduce in this article a new architecture based on the VGG16 [8] for the convolution phase. To increase the depth of our network and the size of the receptive field, we have replaced the fully connected layers of the VGG16 structure with a convolution with a 7×7 kernel (Table 1).
Removing the fully connected layers frees us from the constraint on the input image size. For the decoding phase, we have chosen to use only three transposed convolutions so that the output masks reach the size of the input data. The outputs of the first and second up-sampling operations are combined with feature maps of the coding path [9] and followed by a convolution operation. This sharing of feature maps in the decoding path propagates context information to the higher resolution layers. The paper is organized as follows: in Section 2, we first review work related to convolutional neural networks applied to semantic segmentation, as well as the evolution of smoke and fire detection techniques. Then, in the same section, we describe our distinctive network architecture, the composition of our smoke/fire database and the evaluation parameters chosen to compare our network to the Yuan [10] and U-Net [11] networks. The experimental results and discussion of our study are presented in Section 3. Finally, the last section summarizes our work and lists ways to improve the semantic segmentation of smoke and fire.

Convolution neural network for semantic segmentation
Historically, the first methods for fire and smoke detection in an image or video relied exclusively on colours, and gave satisfactory results. Some interesting works can be mentioned. Toreyin et al. did extensive work in this field [12][13][14][15]. In [12], as an initial step in their fire and flame detection system, they used a hybrid background estimation for moving region detection. Afterwards, the colours of moving pixels are compared with a colour distribution obtained from sample images containing fire regions. In the third step, a temporal wavelet analysis determines high activity regions within these moving regions. Finally, a spatial wavelet analysis of moving regions containing fire mask pixels captures colour variations in pixel values. These last two steps are crucial in Toreyin's approach because of the turbulent high frequency behaviour on the boundary and inside a fire region. In [13] and [14], they enhanced their model by using separate Markov models for flame and non-flame moving pixels. They also carried out a flicker analysis using HMMs and a wavelet domain analysis of object contours. Finally, in [15], they updated their work using the least-mean-square (LMS) algorithm to combine the decisions from four sub-algorithms: (i) detection of fire-coloured moving objects, (ii) temporal and (iii) spatial wavelet analysis for flicker detection, and (iv) contour analysis of flame boundaries. Similarly, Celik made a significant contribution in this area [16][17][18]. The main originality of his work was the use of the YCbCr colour space instead of the RGB one to construct a generic chrominance model for flame pixel classification. Moreover, he developed new rules in the YCbCr colour space to alleviate the harmful effects of changing illumination and improved detection performance. Other methods similar to and derived from those presented above can be found in [19].
These methods have brought an advanced solution to the field of fire and smoke detection in videos and images. Unfortunately, they remain sensitive to the problem of false alarms. Moreover, these methods require an expert to set the rules and features of the object classification pipeline. On the other hand, methods based on neural networks make it possible to overcome these weaknesses. In 2012, Alex Krizhevsky's work [1] highlighted this type of method, the so-called "deep learning". This field predates Krizhevsky's work and was initiated by Hinton [20], LeCun [21], Bengio [22] and others.
For a few years now, deep learning has become an essential tool for detecting smoke and fire in images or videos, owing to the robustness of the algorithms used and the increasing availability of data. Sebastien [23] is one of the first researchers who used a Convolutional Neural Network (CNN) to detect fire and smoke in a video stream. The CNN model, inspired by AlexNet [1], operates directly on raw RGB frames without the need for a feature extraction stage: the CNN automatically learns a set of visual features from the training data. A classification accuracy of 97% was achieved. Similarly, Muhammad [24] used a fine-tuned CNN derived from SqueezeNet [25], which allows detection, localization and semantic interpretation of the fire scene to be carried out at the same time. More recently, Kim [26] proposed a network based on Faster Region-based Convolutional Neural Network (R-CNN) [27] to detect suspected regions of fire (SRoFs) and non-fire regions based on their spatial features. He also used Long Short-Term Memory (LSTM) [28] to interpret dynamic fire behaviour. These last methods give very good results, but that is still not enough, as the location of the fire or smoke region is not precise and is only characterized by a bounding box. To overcome this weakness, we have moved toward full semantic segmentation. Indeed, semantic segmentation classifies all the pixels of the image, thus making the location of the fire or smoke very accurate.
The technical achievements of Convolutional Neural Networks applied to semantic segmentation (CNN segmentation) [9] have led us to apply this type of architecture to detect and locate smoke and fire. Smoke and fire are difficult to segment due to their non-constant shape and colour characteristics. Fire seems easier to segment than smoke due to its hues, but fire is less present at the start of a wildfire and consequently less represented in database images, which makes it more difficult to classify.
The U-Net network architecture [11] is composed of an encoder-decoder with the distinctive particularity of sharing feature maps from the convolution phase to the deconvolution phase.
Feiniu Yuan et al. [10] propose smoke segmentation using a CNN with an architecture composed of two different paths merging at the end to create the smoke mask. Both coding paths are based on the VGG16 architecture [8]. Each coding path is followed by successive up-sampling operations with concatenations of coding feature maps. The first path, which is deeper, provides global contextual information for smoke segmentation. The second, shallower path gives rich local information for smoke localization and object details.

Our architecture
We assume a camera onboard a drone or a helicopter to locate a fire or smoke. Vehicle movements might not allow us to track the same spatial pixels in successive frames, which prevents us from exploiting the temporal dynamic texture of fire and smoke. Our CNN architecture therefore segments fire and smoke in each video frame without taking into account the temporal history of the pixels. Our network (Figure 1) is based on the VGG16 architecture [8] for the coding phase. VGG16 is an architecture model proposed by K. Simonyan and A. Zisserman from the University of Oxford, used for large-scale image recognition with good accuracy. We have chosen this structure for the coding phase due to its feature extraction performance across a large diversity of object classification tasks. VGG16 is composed of 13 convolution operations with 3×3 kernels followed by three fully connected layers allowing object classification. For the coding phase, we kept from the VGG16 architecture the five convolution blocks with 3×3 kernels, each followed by a max-pooling operation. The dense layers of the VGG16 structure set the input size of the image at 224×224 pixels. We have chosen to replace the fully connected layers with a convolution with a 7×7 kernel giving 1024 feature maps. This approach avoids the issue of input image size (Table 1). We tested different sizes for this last kernel (1×1, 3×3, 5×5, 7×7, 9×9); the 7×7 kernel was the best compromise between segmentation accuracy and computation time.
The purpose of the coding phase is to extract local information relating to fire and smoke. The deeper layers lose localization detail but increase the generalization capacity of the classification process. The decoding phase aims to recreate a high resolution segmentation of fire and smoke with good generalization. To achieve this objective, like the U-Net network, we concatenate feature maps of the coding phase with those of the decoding phase to propagate contextual information to the higher layers.
The decoding phase is composed of two transposed convolutions (up-sampling operations) with a 4×4 kernel and a last transposed convolution with a 16×16 kernel (Table 2). The wide receptive field of the transposed convolution kernels aims to increase the generalization capacity of the mask construction. The first and second up-sampling operations are each followed by a concatenation with the feature maps of the coding phase and by a convolution with a 3×3 kernel. All convolution operations are followed by the Rectified Linear Unit (ReLU) activation function. Our network architecture has 57 million training parameters. While the U-Net architecture uses four up-convolutions and Yuan eight, with 2×2 kernels, we use only three, with kernels of 4×4, 4×4 and 16×16, respectively. Our coding path based on VGG16 differs from U-Net's. Our network differs from Yuan's network by its single coding-decoding path, as well as by the 7×7 kernel of the last convolution of the coding phase.
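The encoder-decoder described above can be sketched as follows. This is a minimal illustration written with the modern Keras API, not the authors' exact implementation; in particular, the strides of the three transposed convolutions (2, 2 and 8) and the decoder filter counts are our assumptions, chosen so that the up-sampling exactly recovers the ×32 downsampling of the five pooling stages.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_net(num_classes=3):
    """Sketch of the described architecture: VGG16-style encoder, a 7x7
    convolution with 1024 feature maps replacing the dense layers, and a
    three-step decoder with skip concatenations (strides are assumptions)."""
    inp = layers.Input(shape=(None, None, 3))
    x, skips = inp, []
    # VGG16 encoder: five blocks of 3x3 convolutions, each before a 2x2 pooling
    for n_convs, filters in [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
        for _ in range(n_convs):
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)                      # feature maps shared with decoder
        x = layers.MaxPooling2D(2)(x)
    # Fully connected layers replaced by a 7x7 convolution, 1024 feature maps
    x = layers.Conv2D(1024, 7, padding="same", activation="relu")(x)
    # Decoder: two 4x4 transposed convolutions with skip concatenation + 3x3
    # convolution, then a final 16x16 transposed convolution to full resolution
    x = layers.Conv2DTranspose(512, 4, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skips[4]])
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(256, 4, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skips[3]])
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(num_classes, 16, strides=8, padding="same")(x)
    return Model(inp, x)    # per-pixel logits; apply Softmax at inference
```

Since the dense layers are gone, the input spatial size is left as `(None, None)`: the same model accepts any frame size whose dimensions are multiples of 32.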

Our database
Database quality is of paramount importance to train a deep network with good accuracy. We use internet images of different sizes and qualities. The presence in the database of different types of smoke, rather whitish or blackish, is also important to detect and segment most types of fire correctly. We segmented 366 images and labelled them manually with the Labelme software under Linux [29]. We performed offline data augmentation by flipping, cropping, rotating, adding noise, changing contrast/brightness and combining these transformations to reach 8784 images (Figures 2 and 3). The 8784 images are divided into 82% to train our network (7224 images) and 18% (1560 images) to validate it. The validation image set is only used to follow the IoU (Intersection over Union) metric for each class and avoid over-fitting. The weights and biases of the network are selected during the validation phase.
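The listed augmentations can be sketched as below. This is an illustrative NumPy-only version, not the authors' pipeline; the crop fraction, noise level and contrast/brightness factors are arbitrary assumptions. The key point it demonstrates is that geometric transformations must be applied identically to the image and its mask, while photometric ones touch the image only.

```python
import numpy as np

def augment(image, mask, rng):
    """Apply one random transformation to an (image, mask) pair.
    Geometric ops transform both arrays; photometric ops only the image."""
    op = rng.integers(0, 5)
    if op == 0:                                   # horizontal flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    elif op == 1:                                 # 90-degree rotation
        image, mask = np.rot90(image), np.rot90(mask)
    elif op == 2:                                 # random crop (resize afterwards)
        h, w = mask.shape[:2]
        y, x = rng.integers(0, h // 4 + 1), rng.integers(0, w // 4 + 1)
        image = image[y:y + 3 * h // 4, x:x + 3 * w // 4]
        mask = mask[y:y + 3 * h // 4, x:x + 3 * w // 4]
    elif op == 3:                                 # additive Gaussian noise
        noisy = image.astype(np.float32) + rng.normal(0, 10, image.shape)
        image = np.clip(noisy, 0, 255).astype(np.uint8)
    else:                                         # contrast/brightness change
        image = np.clip(1.2 * image.astype(np.float32) - 20, 0, 255).astype(np.uint8)
    return image, mask
```

Chaining several calls on the same pair reproduces the "combination of these transformations" used to grow 366 labelled images into 8784.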

Evaluation parameters
We used the Python libraries TensorFlow 1.12.0 and OpenCV 3.4.0 under Linux 18.04 to train our network and augment the number of images. We worked with the GPU of an Nvidia GeForce 1080 graphics card with 11 GB RAM. We initialized the parameters of the coder part of the network (weights of the first 13 convolution operations) using a VGG16 model pre-trained on the ImageNet database. We trained our model on our training dataset with the Adam optimizer [30], a learning rate of 5×10⁻⁵ and a cross-entropy-with-logits loss function.
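A minimal sketch of this training configuration is given below. It is written against the modern TensorFlow 2 Keras API rather than the TensorFlow 1.12 used in the paper, and the model and tensor shapes are placeholders; only the optimizer, learning rate and loss choice come from the text.

```python
import tensorflow as tf

# Adam optimizer with the stated learning rate of 5e-5, and softmax
# cross-entropy computed from per-pixel logits (integer class labels).
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(model, images, masks):
    """One optimization step: images (B, H, W, 3), masks (B, H, W) int labels."""
    with tf.GradientTape() as tape:
        logits = model(images, training=True)      # (B, H, W, num_classes)
        loss = loss_fn(masks, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```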
We compared our architecture with the U-Net [11] and Yuan [10] networks. To measure the respective performances of these networks fairly, we trained the three networks on our dataset. Unfortunately, we were unable to test our network with the Yuan team's dataset due to the absence of fire masks.
The training parameters for the U-Net and Yuan networks follow the procedures explained in their research articles.

Standard accuracy metrics
This section describes the metric criteria used to compare the segmentation performances of the different networks [31]. The confusion matrix (Figure 4) allows us, for each class and over all the validation images, to calculate standard metrics to evaluate the performance of the pixel classification. Accuracy reports the percentage of correctly classified pixels in the image. We have chosen to report accuracy for each class. We calculated the average accuracy (1) over the N validation images for each class c, where TP_i, TN_i, FP_i and FN_i are, for the ith image, respectively the true positives, true negatives, false positives and false negatives:

Accuracy_c = (1/N) Σ_{i=1}^{N} (TP_i + TN_i) / (TP_i + TN_i + FP_i + FN_i)    (1)
Precision (2) measures the agreement of the positive labels given by the classifier with the data labels for a class:

Precision_c = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + FP_i)    (2)
Recall (3) assesses the effectiveness of the network in identifying positive labels with respect to the ground truth labels:

Recall_c = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + FN_i)    (3)
We calculated metrics on the validation images for each class rather than global metrics, because global metrics are not appropriate when the representative frequencies of the classes are unbalanced. Our database is unbalanced: the pixels of the smoke class are more frequent than those of the fire class.

Intersection over Union
The Jaccard index, or Intersection over Union (IoU) criterion (4), allows a quantitative evaluation of the segmentation accuracy. We used this criterion on the validation dataset by calculating the average IoU for each class (background, smoke and fire):

IoU_c = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + FP_i + FN_i)    (4)
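The per-class metrics (1)–(4) can be computed directly from predicted and ground-truth label maps; the sketch below does so for a single image (in practice the values are then averaged over the N validation images). The function name and dictionary layout are ours, for illustration only.

```python
import numpy as np

def per_class_metrics(pred, truth, num_classes=3):
    """Per-class accuracy, precision, recall and IoU for one image,
    following Equations (1)-(4); pred/truth are integer label maps."""
    metrics = {}
    total = pred.size
    for c in range(num_classes):
        tp = np.sum((pred == c) & (truth == c))   # true positives for class c
        fp = np.sum((pred == c) & (truth != c))   # false positives
        fn = np.sum((pred != c) & (truth == c))   # false negatives
        tn = total - tp - fp - fn                 # true negatives
        metrics[c] = {
            "accuracy": (tp + tn) / total,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
        }
    return metrics
```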

ROC curves
The Receiver Operating Characteristic (ROC) curve [32] (Figure 5) is a graphical representation of a model's performance as a function of the classification threshold. We used a Softmax function on the last feature maps to evaluate the likelihood that each pixel belongs to a given class. In addition to the area under the ROC curves [33], this evaluation method determines the behaviour of the model toward false negatives and false positives. Finally, we selected two methods to define the optimal threshold giving the maximum of correct pixel classifications (Figure 5). The first consists in finding the classification threshold minimizing the distance d between the point (FPR=0, TPR=1) and the point (FPR, TPR) for a given threshold. The second method is based on maximizing the Youden index J [34], which maximizes the distance between the random chance line and the point (FPR, TPR) for a given threshold. The maximal J criterion is commonly used because it gives the threshold which maximizes the TPR and minimizes the FPR [35].
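Both threshold-selection rules can be implemented in a few lines once the ROC points are available; the sketch below assumes arrays of FPR, TPR and threshold values (for example as returned by a ROC routine) and returns the two candidate thresholds.

```python
import numpy as np

def optimal_thresholds(fpr, tpr, thresholds):
    """Two ways to pick the operating threshold from ROC points:
    minimal distance d to the ideal corner (FPR=0, TPR=1),
    and maximal Youden index J = TPR - FPR."""
    fpr, tpr = np.asarray(fpr, float), np.asarray(tpr, float)
    thresholds = np.asarray(thresholds)
    d = np.sqrt(fpr ** 2 + (1.0 - tpr) ** 2)   # distance to (0, 1)
    j = tpr - fpr                              # Youden index
    return thresholds[np.argmin(d)], thresholds[np.argmax(j)]
```

The two criteria often agree; when they do not, the maximal-J threshold is the one favoured in the text.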

Other criterion
We chose to plot the accuracy and IoU versus the threshold to evaluate the probability distribution of the pixel classification for a given class. We use the Softmax function at the output of the networks to calculate the probability of the pixel prediction. In addition, by observing the shape of the accuracy or IoU curves versus the threshold, we can compare the ability of the networks to segment the classes. A decrease in the accuracy curve at high thresholds indicates a low proportion of pixels with a high probability of belonging to class c, and therefore a lower segmentation performance.

EXPERIMENTAL RESULTS
In this section, we compare the segmentation performance for smoke, fire and background in RGB images for the different networks. We have chosen the two best recent architectures for image segmentation, the U-Net network [11] and the Yuan et al. network [10]. We have used the same validation images, not yet seen by the networks, to compare network performances. Tables 3-5 show that the U-Net network achieved the lowest performance in the classification of background, fire and smoke pixels. Fire IoU is lower than that of the smoke and background classes for all the networks. The first explanation of this low value lies in the manual segmentation of the ground truth in our database. It seems easier to segment fire according to its distinctive red or orange colour, but it is not. When we segment an image containing fire and smoke, it is difficult to separate the boundary between fire and smoke. Sometimes, we can see the fire behind the smoke; in this case, do we classify these areas as fire or smoke? The network sometimes detects fire where we had segmented smoke because it finds areas related to fire characteristics (Figure 6). This segmentation is not really false, but the misinterpretation by the network decreases the value of the intersection over union for fire and smoke. The second explanation is the unbalanced number of pixels between the three classes. Fire is less present in images than smoke and background. Therefore, a fire segmentation error has a greater effect due to the small number of fire pixels in the database.
To improve the IoU on our unbalanced database, we trained our network with a weighted cross-entropy loss [38]. The three classes are weighted by w_c = median_freq / freq(c) to create a more balanced version of our model, where freq(c) is the total number of pixels of class c divided by the total number of pixels of the images where c is present, and median_freq is the median of these class frequencies (Table 6). The IoU results for the weighted loss showed a very small increase of all metrics for the fire class and a very small decrease of all metrics for the smoke and background classes (Table 7). Smoke is the first fundamental information visible to detect a wildfire outbreak. The decrease in smoke metrics and the weak improvement in fire metrics led us to keep an unweighted loss function to train the networks. The discrete ROC curves of our network for each class are superior to those of the U-Net and Yuan networks. The areas under the curve for our model (Table 8) have the highest values, close to one, which indicates the superiority of our prediction model for the three classes. Moreover, near the point of origin, the ROC curves of our network increase faster, pointing out a lower false positive rate, whether for the fire, smoke or background class. The curve of the Youden index versus the classification threshold provides information on the shape of the ROC curve: the faster the curve increases for low thresholds, the closer the ROC curve is to the perfect classification model. A value of the Youden index close to one also indicates a good classification. The Youden index curves (Figures 7 and 8) highlight a long plateau at high values for our network, for both smoke and fire, which means a wide range of classification thresholds achieving an excellent segmentation with a maximum true positive rate and a minimum false positive rate. We have chosen not to draw the d measures because they are strongly correlated with the Youden index.
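Median-frequency balancing as used above can be sketched as follows; `class_weights` is our illustrative name, and the sketch assumes every class appears in at least one mask (otherwise freq(c) is undefined).

```python
import numpy as np

def class_weights(masks, num_classes=3):
    """Median-frequency balancing: w_c = median_freq / freq(c), where freq(c)
    is the pixel count of class c divided by the total pixel count of the
    images in which c appears; `masks` is a list of integer label maps."""
    pixels = np.zeros(num_classes)           # pixels of class c over all masks
    present_pixels = np.zeros(num_classes)   # pixels of images containing c
    for m in masks:
        for c in range(num_classes):
            count = np.sum(m == c)
            if count:
                pixels[c] += count
                present_pixels[c] += m.size
    freq = pixels / present_pixels
    return np.median(freq) / freq
```

Rare classes (here, fire) get weights above one and frequent classes weights below one, which is exactly the rebalancing effect described in the text.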
For our network and the U-Net and Yuan networks, the accuracy and IoU curves are plotted according to the threshold, which is directly related to the predicted probability of a pixel belonging to class c. We use the Softmax function at the output of the networks to calculate this probability.
The accuracy with respect to the threshold provides information on the percentage of correctly classified pixels for a class c. A large plateau between a low threshold and a threshold close to one indicates that the majority of pixels in class c keep a high accuracy at high thresholds, and de facto have high prediction probabilities. Regarding the smoke accuracy curves (Figure 9), we notice a large plateau between a threshold of a few percent and 100% for our network, compared with the Yuan and U-Net networks. The large and constant value of the plateau seems to indicate a clustering of high classification probabilities of the pixels. The same pattern as for the accuracy curves can be observed with the IoU curves (Figures 11 and 13). We can relate this large plateau of the curve between a threshold of a few percent and a threshold of 100% to a very high probability of classification of the smoke pixels. For the U-Net and Yuan networks, the IoU curves decrease from, respectively, 60% and 80%, indicating a drop of the localization accuracy of the segmentation for the high probabilities of smoke pixel classification. The same analysis can be done for the segmentation of fire (Figures 14 and 15). Nevertheless, the drop of the curve has a lower impact than for the smoke curve for the U-Net and Yuan networks, which reveals a better pixel segmentation for the fire class than for the smoke class. The IoU and accuracy versus threshold curves assert, for our method, a better segmentation of fire and smoke with fewer false positives and false negatives (Table 9). Table 10 compares the three network characteristics. Our network is the deepest with 57 million training parameters. However, our network is the fastest to segment images with the three classes due to the smaller number of up-sampling operations, the smaller number of high resolution feature maps and a single coding-decoding path.
Our network is almost two times faster than the U-Net network and almost four times faster than the Yuan network.
Our architecture, with a segmentation rate greater than 20 frames per second, is able to segment fire and smoke in a 640×480 video in real time. Figure 16 exhibits different images which clearly show the predicted smoke mask in green and the predicted fire mask in red for our network, U-Net and the Yuan network.
Our network possesses the architecture with the lowest number of up-sampling operations (5+3 for Yuan, 4 for U-Net and 3 decoding transformations for our network). It can be assumed that the number of up-sampling operations is not an essential parameter for creating accurate smoke and fire segmentation.
The effective size of the receptive field is an important parameter in deep learning [36]. For a dense prediction task such as image segmentation, it is essential for each pixel class of the output mask to have a large receptive field on the input image to capture enough context. Our network possesses the largest receptive field of the encoding phase due to the last 7×7 convolution operation: the effective receptive fields are, respectively, 404, 140 and 196 for our network, U-Net and Yuan. For our network, a mask pixel PM of coordinates (x, y) is influenced by the information given by the pixels of the input RGB image in a 404×404 window centred on the PM position. Our generated masks are close to the ground truth masks (e.g. Figure 16). False positives for the fire and smoke classes are less prevalent with our network than with the Yuan and U-Net networks. In addition, our method misclassifies a small number of cloud pixels compared to the U-Net and Yuan methods. The quality of the segmentation can be explained by the large size of the receptive field and the depth of our network.
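The receptive field figures quoted above can be reproduced with the standard recurrence r_out = r_in + (k − 1)·j, where k is the kernel size of a layer and j the cumulative stride of all layers before it. The sketch below applies it to our encoder (five VGG16 blocks of 3×3 convolutions with 2×2 pooling, then the 7×7 convolution); the helper name is ours.

```python
def receptive_field(layer_specs):
    """Effective receptive field of a stack of (kernel, stride) layers,
    using r <- r + (k - 1) * j, with j the cumulative stride."""
    r, j = 1, 1
    for k, s in layer_specs:
        r += (k - 1) * j
        j *= s
    return r

# Our encoder: five VGG16 blocks (3x3 convs + 2x2 max-pooling), then 7x7 conv
encoder = []
for n_convs in [2, 2, 3, 3, 3]:
    encoder += [(3, 1)] * n_convs + [(2, 2)]
encoder.append((7, 1))
```

Running `receptive_field(encoder)` gives 404, matching the value stated in the text; the U-Net encoder (four blocks of two 3×3 convolutions with 2×2 pooling, plus two 3×3 convolutions at the bottom) gives 140 the same way.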
This article presents a new network architecture for segmenting smoke and fire in RGB images. We mainly compared our architecture with that of Yuan. However, to prove that the good performances achieved by our network architecture are independent of the database we created, we decided to test it on the Yuan database.
We trained our network and Yuan's on the Yuan database [39]. The latter is made up of 70,632 synthetic RGB images of size 256×256 pixels and their corresponding smoke masks. We split it into two sets: the training set (80%) and the validation set (20%). The Yuan database contains three test datasets named DS01, DS02 and DS03. Each test set consists of 1000 256×256 RGB images and the corresponding 8-bit alpha channel ground truths. Using the alpha channel ground truths, we created smoke masks (Figure 17). We had to choose a threshold to create the smoke masks from the alpha channel ground truths because low alpha values of the smoke were not visible on the RGB images. We chose the value 20 for this threshold; that is, pixels of the alpha channels with values under 20 were considered as background and values greater than or equal to 20 were considered as smoke.
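The binarization of the alpha channels described above amounts to a one-line threshold; the function name is ours, for illustration.

```python
import numpy as np

def alpha_to_mask(alpha, threshold=20):
    """Binarize an 8-bit alpha channel into a smoke mask: values below the
    threshold become background (0), values >= threshold become smoke (1)."""
    return (np.asarray(alpha) >= threshold).astype(np.uint8)
```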
We tested the performances of the networks by calculating the IoU (4) and the mMse (5), the average squared difference per pixel between the prediction and the ground truth, on the test datasets (DS01, DS02 and DS03):

mMse = (1/N) Σ_{i=1}^{N} (1/(h·w)) Σ_{k=1}^{h·w} (pred(X_k) − Gtruth(X_k))²    (5)

where N is the number of images of the test set, h and w are, respectively, the height and the width of the images, pred(X_k) is the prediction for the pixel X_k, and Gtruth(X_k) is the ground truth of the pixel X_k.

Tables 11 and 12 show that the segmentation performances of both architectures on DS01 are almost similar. On the other hand, the results achieved on the DS02 and DS03 datasets by our network architecture outperform those of Yuan's. Moreover, Table 13 indicates the execution time of the smoke mask prediction for a 256×256 px RGB image: our network is twice as fast as the Yuan network. We can argue that in the case of higher definition images, our architecture would still produce results in real time. This study has proven the quality of our network architecture for semantic segmentation compatible with real time.

CONCLUSION
Recently, fully convolutional networks have provided architectures to accurately segment objects in an image. Fire and smoke are objects with a wide variety of shapes and colours. Despite the difficulty of detecting and locating such objects, our network, composed of a coding and a decoding phase, achieves a much better segmentation than the Yuan and U-Net networks. Our method has demonstrated its accuracy in classifying pixels, with few false positives such as clouds or haze. Processing time is also an important factor in segmenting fire and smoke with respect to real-time compatibility, and our network outperforms the other architectures in segmentation time.
To improve the segmentation accuracy of the fire class, we could increase the number of fire images in our database (for example by adding fire images coming from other databases such as [37]). We could also, when the camera is almost static, use 3D convolutions to capture the dynamics of smoke and fire in successive frames of a video.
Our network outperforms the U-Net and Yuan networks for the semantic segmentation of smoke and fire in terms of localization accuracy and segmentation rate.