Detecting Selected Network Covert Channels Using Machine Learning

Network covert channels break a computer’s security policy to establish a stealthy communication. They are a threat that is increasingly used by malicious software. Most previous studies on detecting network covert channels using Machine Learning (ML) were tested with a dataset created using a single covert channel tool, and they are ineffective at classifying covert channels into patterns. In this paper, selected ML methods are applied to detect popular network covert channels. The capability to detect and classify covert channels with high precision is demonstrated. A dataset was created from nine standard covert channel tools, and the covert channels were then classified into patterns and labelled accordingly. Half of the generated dataset is used to train three different ML algorithms. The remaining half is used to verify the algorithms’ performance. The tested ML algorithms are Support Vector Machines (SVM), k-Nearest Neighbors (k-NN) and Deep Neural Networks (DNN). The k-NN model demonstrated the highest precision rate, at 98% detection of a given covert channel, with a low false positive rate of 1%.

Existing wardens have two fundamental deficiencies. Firstly, they require human intervention to insert rules capable of identifying ambiguities in the network traffic. Secondly, a warden can only detect threats that are already known. A Network Anomaly Detection System (NADS) circumvents these deficiencies by identifying unusual patterns in the network traffic that do not comply with expected normal behavior. NADS can be statistically-based, classification-based, clustering-based or information theory-based. Currently, anomalies fall into three groups: collective, contextual and point. Point anomalies refer to any deviation of particular data from the normal pattern of a dataset. Anomalies that occur in a particular context are designated as contextual. Collective anomalies are the correlation of similar anomalies within an entire dataset [54], [36]. The main limitations of NADS are: 1) The inherent need to define the notion of traffic "normality": an object is considered anomalous if its degree of deviation from the defined normal profile is sufficiently high. 2) NADS usually requires human intervention for analyzing, interpreting and acting on the alerts generated by the infected systems. 3) The detection by NADS of variations of network traffic with known anomalies has not been dealt with in depth.
Indeed, one of the most powerful and cheapest tactics for a cyber-attacker to evade security countermeasures is to develop new variants of existing covert channel techniques. The related academic literature on detecting NCCs with ML has two main limitations. Firstly, the proposed detection schemes were largely tested on a very limited set of covert channel techniques. Secondly, little attention has been given to classifying covert channels into patterns.
The primary contributions of this paper are: (1) analyzing eleven popular NCC tools and classifying them accordingly into patterns [28]; (2) designing a proof of concept for the detection and classification of NCCs using three different ML algorithms; (3) comparative evaluation of the used ML algorithms; and (4) identification and discussion of the results and indicating possible future research directions.
We present our analysis of related work in Section II. In subsequent sections (Sections III-IV), we describe the detection scheme and discuss the experimental measurements. Our findings are discussed in Section V and conclusion in Section VI.

II. FUNDAMENTALS & RELATED WORK
The traditional approach to detecting NCCs (i.e. signature-based, behavior-based, heuristic-based) relies on the manual definition of signatures. This approach often fails to detect novel threats. To circumvent this problem, ML is widely used [14]: it is capable of modelling the normal behaviour of network traffic and can consequently detect any unexpected behavior within that traffic without (or with minor) human intervention. ML aims at making computer systems adapt their actions so that these actions become more accurate [58]. ML techniques are generally grouped into two categories: supervised and unsupervised. The supervised category learns from a set of labelled data and encodes that learning into a model to predict an attribute for new data. On the other hand, unsupervised ML is used to find patterns within data that have no specified target variable [24]. Supervised ML, which is also called classification, is characterized by the use of a labelled dataset. Classification consists of creating a model by training on the labelled dataset. The created model is then used to predict the labels of a dataset whose labels are unknown. We distinguish three families of datasets as follows [1]: 1) Synthetic: Created to fulfill special requirements in relation to real data circumstances. 2) Benchmark: Generated in a simulated environment along with network devices. 3) Real life: Prepared by collecting network traffic during a certain period of time. The identification and the detection of covert channels are distinct. Identification aims to determine a shared resource that could be utilized as a covert carrier, whereas detection examines the event flow in order to reveal a covert channel in operation. To minimize any negative impact on performance, the detection mechanism should be implemented before the elimination one.
There are mainly three strategies to detect network covert channels: signature-based, anomaly-based and specification-based [19]. Signature-based detection requires the creation of a baseline of signatures that need regular maintenance and updates (signatures are typically the patterns that a warden should monitor to detect covert channels). The anomaly-based detection approach identifies any deviation from the normal traffic. Lastly, the specification-based approach matches traffic against the predefined specifications of a protocol to verify any misuse or attacks. The capabilities of the listed detection approaches are limited to known NCCs. These limitations could be circumvented by the use of ML, which is referred to as the study of automatic techniques for learning to make accurate predictions based on past observations [45]. There are three types of Intrusion Detection System (IDS) using ML: single, hybrid and ensemble. An IDS that uses an individual ML algorithm is called single, whereas a hybrid IDS uses several algorithms. An ensemble IDS employs a combination of several weak ML algorithms [27].
Various ML algorithms have been proposed in the covert channel literature. For instance, a Hidden Markov Model (HMM) was used to detect covert communication on the TCP stack [30]. Its main pitfall, though, is its limited ability to detect covert communication in applications that use tunneling. Gilbert and Bhattacharya [50] suggested a twofold detection system that features both covert channel profiling and anomaly detection. Genetic algorithms were first used in IDS in 1995, by applying a hybrid approach of multiple agents and genetic programming in order to detect anomalies [13], [33]. Some enhanced ML techniques include the Intelligent Heuristic Algorithm (IHA), based on Naive Bayes classifiers, to detect covert channels in IPv6 [41]. Salih et al. [22] improved the detection rate and reached an accuracy of 94% by using enhanced C4.5 decision trees with a very low false negative rate. C4.5 was also applied to detect protocol switching covert channels (PSCC) in [55]. The main inconvenience of the supervised method is that it requires labelled information for efficient learning. Additionally, it can hardly deal with the relationship between consecutive variations of learning inputs without additional preprocessing.
A considerable number of works have used Support Vector Machines (SVM) to classify network anomalies [24], [16], [2], [4], [10], [11], [12], [20], [23], [35]. Compared with other ML algorithms, SVM has faster processing and is capable of both supervised and unsupervised learning [26], [17], [6]. For instance, SVM was used in a passive warden to detect TCP anomalies within the TCP ISN and IP ID fields [25], and for IPv4 network anomaly detection [29], [8]. SVM can be used to classify patterns based on statistical learning techniques for regression and categorization [23], [4]. This algorithm aims to find the optimal separating hyperplane in a higher-dimensional feature space by using a kernel function.
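For reference, the kernel-based decision rule mentioned above is commonly written as follows. This is the standard textbook form, not a formula taken from this paper; the RBF kernel is shown only as one frequently used choice, and the symbols $\alpha_i$, $b$ and $\gamma$ are the usual learned coefficients, bias and kernel width.

```latex
% Decision function of a kernel SVM over training pairs (x_i, y_i), y_i \in \{-1, +1\}
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b \right)

% Example kernel (assumption: RBF; the kernel used in the paper is not stated here)
K(x, x') = \exp\!\left( -\gamma \, \lVert x - x' \rVert^{2} \right)
```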
On the other hand, it has been demonstrated that k-NN is one of the simplest ML algorithms [31]. Firstly, it splits the entire dataset into training and testing data points. Secondly, it evaluates the distance from all training points to the testing points. The point with the lowest distance is named the nearest neighbor. Tsai et al. [34] suggested a hybrid method based on a triangle area using the k-NN approach to detect attacks. They extract a number of cluster centers, where each cluster center constitutes one specific type of attack. Then, the triangle area is calculated from two clusters chosen randomly and one data point from the dataset. Finally, the resulting triangle represents one new feature for measuring similar attacks. The k-NN classifier then uses the triangle-area features to detect intrusions. Most studies on the detection of storage covert channels were tested with a single popular tool (e.g. Covert-TCP) [47], [48], [51], [49], [42], [52], [53], with captured traffic [45], [46], [44], [48], or with a tool developed specifically for the research work [47]. The authors of this paper propose a detection concept that uses ML with three different algorithms and that is based not on purpose-built techniques but on popular tools instead.
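The k-NN procedure described above (measure the distance to every training point, then let the closest points decide) can be sketched in a few lines. This is a minimal illustration and not the implementation used in the paper: the function name `knn_predict`, the Euclidean distance and the toy two-feature data are all assumptions, and `k=4` simply mirrors the value reported in the experiments.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=4):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count how often each label occurs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-feature data (hypothetical values, not taken from the paper's dataset).
train_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
                (5.0, 5.1), (5.2, 4.9), (4.8, 5.3), (5.1, 5.0)]
train_labels = ["normal"] * 4 + ["covert"] * 4

prediction_a = knn_predict(train_points, train_labels, (0.2, 0.2))
prediction_b = knn_predict(train_points, train_labels, (4.9, 5.0))
```

With an even k and two classes a tie is possible; this sketch resolves it by whichever label `Counter` returns first, which a production classifier would handle more deliberately.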

III. DETECTION APPROACH
The proposed approach is based on three steps: (1) generating the datasets containing network traffic, (2) training and feature extraction, and (3) testing the models with different tools.

A. Generating the datasets
In order to train the ML algorithms, datasets were created from a mixture of real-life and benchmark network packets [1] (generated in a lab environment). For the benchmark dataset, we collected a set of tools, as listed in Tab. II, and produced covert traffic with them. Then, PCAP files were labelled according to the type of pattern the covert packets belonged to. Wendzel et al. introduced a pattern-based classification of covert channels in which 109 covert channels were categorized into 11 distinct patterns based on their similarities [28]. For example, the pattern P7 represents covert channels that encode data into a reserved or unused field. To train the ML models, large labelled datasets (supervised) that represent each type of pattern were used. As shown in Tab. I, this research work focuses only on selected patterns; the additional label Non-P0-P1-P5-P7 covers patterns different from those listed. Labels are saved in a CSV file, so that each packet of the PCAP file is numbered and labelled accordingly.
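The pairing of numbered packets with labels from a CSV file can be illustrated as follows. This is only a sketch under stated assumptions: the column names `packet_no` and `label`, the label value `normal`, and the byte strings standing in for packets are hypothetical, since the actual file layout is not given in the text.

```python
import csv
import io

# Hypothetical label file: one row per packet, numbered in PCAP order.
LABELS_CSV = """packet_no,label
1,normal
2,P1
3,P7
4,normal
"""


def load_labels(csv_text):
    """Return {packet_number: label} parsed from the label CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["packet_no"]): row["label"] for row in reader}


def pair_packets_with_labels(packets, labels):
    """Attach a label to each packet by its 1-based position in the capture."""
    return [(packet, labels[number])
            for number, packet in enumerate(packets, start=1)]


# Placeholder byte strings stand in for packets read from a PCAP file.
packets = [b"pkt-a", b"pkt-b", b"pkt-c", b"pkt-d"]
labels = load_labels(LABELS_CSV)
labelled = pair_packets_with_labels(packets, labels)
```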

B. Training and features extraction
As we use supervised ML, our training process requires a pair of files (PCAP and CSV) and uses the Pcap2scikit class together with scikit-learn (Python ML library). Each time the model is to be trained, the script checks for previous training data and combines it with the new data. If there is no previous data, the model is created from scratch (Fig. 1). At the same time, features are extracted from each packet while the packet data is preprocessed. For example, the TTL field can be preprocessed to determine whether the packet is involved in TTL value modulation. The TTL value of each packet is compared with the TTL value of the previous packet sent by the same source address. If the TTL value has changed, then the feature is the percentage of packets with modified TTLs out of the total packets sent from the same source address.
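The TTL feature described above can be sketched as a per-source running count. The following is an illustrative reading of the description, not the authors' code: the `(source, ttl)` tuple format and the function name are hypothetical, the current packet is assumed to be included in both counts, and the feature is assumed to be 0.0 when the TTL is unchanged.

```python
from collections import defaultdict


def ttl_change_features(packets):
    """Return one TTL-modulation feature per (source_address, ttl) packet."""
    last_ttl = {}               # most recent TTL seen per source
    changed = defaultdict(int)  # packets whose TTL differed from the previous one
    total = defaultdict(int)    # all packets seen per source
    features = []
    for source, ttl in packets:
        total[source] += 1
        feature = 0.0
        if source in last_ttl and ttl != last_ttl[source]:
            changed[source] += 1
            # Percentage of this source's packets that carried a modified TTL.
            feature = 100.0 * changed[source] / total[source]
        last_ttl[source] = ttl
        features.append(feature)
    return features


# Hypothetical traffic: one source toggles its TTL, the other keeps it stable.
sample = [("10.0.0.1", 64), ("10.0.0.1", 64), ("10.0.0.1", 63),
          ("10.0.0.2", 128), ("10.0.0.1", 64), ("10.0.0.2", 128)]
ttl_features = ttl_change_features(sample)
```

Keeping the state in dictionaries keyed by source address means the feature can be computed in a single pass over the capture.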
k-NN and SVM have a similar training process, and each trained model is stored in a single file. The cross-validation is executed using utility methods provided by scikit-learn. For DNN we used TensorFlow, which does not store models in a single file. Instead, a directory is used to store several metadata and graph files. Since TensorFlow does not provide convenient methods for cross-validation or predictions, we have created a method to make these possible with both TensorFlow and scikit-learn.

C. Testing
ML models are also called classifiers, and aim at learning the correspondence between inputs and classes [38]. Classifiers are widely used to detect general network anomalies and also covert channels. By building knowledge from normal packets, the classifiers treat any activity that differs from the normal packet attributes as covert. Therefore, novel covert channel techniques can be detected with minimum effort (as they also deviate from normal packets). Selecting the appropriate classifier is a challenging task and is generally based on the accuracy of the prediction. In this paper only supervised ML algorithms were used (k-NN, SVM and DNN) for the following reasons:
• k-NN is characterized as one of the most straightforward instance-based learning algorithms [32].
• SVM belongs to the newest supervised machine learning techniques; it is well suited to a large number of features and is very useful in insolvency analysis (when data are irregular) [37].
• DNN is capable of learning features automatically at any level of abstraction by mapping the input to the output directly from data, with negligible human-crafted features [39], [40].
The detection process requires a pair of (PCAP, CSV) files and starts by extracting the features from the packets, similarly to the training process. Secondly, it loads the model from a pkl file. Thirdly, it calculates the accuracy. Lastly, it creates both the metrics and the confusion matrix. An output file is then produced which contains metric values such as the number of packets and the prediction rate (whether a packet is normal or covert), as well as the classification of a given covert packet into covert channel patterns.
To train the ML models we used large labelled datasets (supervised) that represent each type of pattern (Tab. II). The following 20 features are extracted by preprocessing the packet data:

D. Metrics
The ML model is first evaluated using 3-fold cross-validation via the cross_val_score function in scikit-learn.
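A 3-fold cross-validation with `cross_val_score` might look like the sketch below. It assumes scikit-learn is installed; the toy feature vectors, the label names and the choice of `KNeighborsClassifier` with four neighbours are illustrative assumptions rather than the paper's actual configuration.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy feature vectors (hypothetical): 12 "normal" rows near 0, 12 "covert" rows near 10.
X = ([[i * 0.1, i * 0.2] for i in range(12)]
     + [[10 + i * 0.1, 10 + i * 0.2] for i in range(12)])
y = ["normal"] * 12 + ["covert"] * 12

model = KNeighborsClassifier(n_neighbors=4)

# cv=3 splits the data into three folds; each fold is held out once for scoring.
scores = cross_val_score(model, X, y, cv=3)
mean_score = scores.mean()
```

For classifiers, an integer `cv` makes scikit-learn use stratified folds, so each fold keeps roughly the same class proportions as the full dataset.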

E. Model fitting and persistence
Once the model has been fitted with data, it is stored in the models directory. The scores from cross-validation and the label encoder are also added. The original input PCAP and CSV files are saved to the model's directory. The model must be saved to disk so that it can be reloaded and used for testing (prediction). The training data is saved so that it can be combined with additional training data in the future.
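One possible way to persist a model together with its label encoding and cross-validation scores is a single pickled bundle, sketched below with the standard library only. The function names, the bundle keys, the file naming and the placeholder values are assumptions for illustration; a plain dictionary stands in for a fitted estimator.

```python
import pickle
import tempfile
from pathlib import Path


def save_model_bundle(models_dir, name, model, label_classes, cv_scores):
    """Pickle the model with its label classes and cross-validation scores."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    path = models_dir / f"{name}.pkl"
    bundle = {"model": model, "label_classes": label_classes, "cv_scores": cv_scores}
    with path.open("wb") as handle:
        pickle.dump(bundle, handle)
    return path


def load_model_bundle(path):
    """Reload a bundle written by save_model_bundle (trusted files only)."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Stand-in "model" and placeholder scores; a fitted estimator would go here.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model_bundle(tmp, "knn", {"k": 4}, ["normal", "P1"], [0.9, 0.8, 0.7])
    restored = load_model_bundle(saved_path)
```

Because unpickling can execute arbitrary code, a bundle like this should only be loaded from a location the operator controls.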

IV. EXPERIMENTS
A. Experimental setup
1) The dataset: The covert dataset was created using ten popular tools and the normal dataset was obtained from http://mawi.wide.ad.jp. Afterwards, they were classified into four patterns. Network packets were generated with each tool and were systematically labelled as belonging to one of four patterns as described in Tab. I. The global dataset is a consolidation of all packets created and labelled. The distribution of normal and covert network packets (dataset) used for training and testing the classification is summarized in Tab. IV.
2) Evaluation methodology: To measure the performance of NCC detection, the confusion matrix (Tab. X) and the accuracy, false positive, detection and precision rates (Tab. XI) were calculated using the following metrics:
2) k-NN: After having tested many values of k, k=4 was identified as being the best value. As shown in Tab. VI, the rate of detection, accuracy, precision and false positive with SVM are 89%, 90%, 96% and 1%, respectively.
3) DNN: The detection rate is high (92%) with a precision of 85%; the accuracy is 67% and the false positive rate is 4%.
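The four reported rates can be derived from a binary confusion matrix. The sketch below assumes the usual definitions (detection rate as recall, false positive rate as FP over all actual negatives); the paper's own formulas are not reproduced in this text, and the counts used are hypothetical.

```python
def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, tn, fp, fn):
    """Compute accuracy, detection rate, precision and false positive rate."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "detection_rate": _ratio(tp, tp + fn),       # covert packets correctly flagged
        "precision": _ratio(tp, tp + fp),            # flagged packets that are covert
        "false_positive_rate": _ratio(fp, fp + tn),  # normal packets wrongly flagged
    }


# Hypothetical counts, not results from the paper.
metrics = detection_metrics(tp=90, tn=95, fp=5, fn=10)
```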

V. DISCUSSION
This section provides a comparison between the different classifiers based on their performance metrics: accuracy, detection rate, precision, TP, TN, FP and FN. This comparison provides a basis of evaluation in order to identify the best ML algorithm to detect NCCs. Tab. X shows the performance on the training dataset. k-NN performs the best, with the highest rates of detection, precision, accuracy and TP and the lowest rates of FN and FP. Tab. IV provides the detection capabilities for the different patterns over the training dataset. The results reveal that k-NN is capable of classifying NCCs into patterns with high accuracy and precision. The measurement results of k-NN on the testing dataset demonstrate a difference in the capability to classify NCCs into patterns (Tab. IX). While NP0157, P1 and P0 allowed for the highest classification accuracy and precision rates, P5 and P7 resulted in the lowest values. On average, compared with DNN and SVM, k-NN provides the best accuracy and precision rates and the lowest FP value. DNN has the highest detection and FP rates. The results indicate that k-NN differs significantly from DNN and SVM over both the training and testing datasets. Therefore, a reliable conclusion can be drawn that k-NN performs the best at detecting and classifying NCCs by a considerable margin.
Our work is limited in several ways. First, the dataset of the study was restricted to selected NCC tools. Second, we obtained the training data from the tools themselves, which we believe is a problem for real-world applications of machine learning algorithms. Third, the work was limited to three machine learning algorithms (i.e. SVM, k-NN, DNN).

VI. CONCLUSION
The rapid growth of computer networks has driven the need for security policies that ensure confidentiality, integrity and availability of information. This has led cyber-attackers to find ways to break security policies and infiltrate or exfiltrate information using network covert channel techniques. Mechanisms for detecting covert channels are based on identifying nonstandard or abnormal behavior.
In this paper, selected ML methods were applied to detect popular network covert channels. The capability not only to detect but also to classify covert channels with high precision is demonstrated. A dataset was created from eleven standard covert channel tools, and the covert channels were then classified into patterns and labelled accordingly. Half of the generated dataset is used to train three different ML algorithms. The remaining half is used to verify the algorithms' precision. The tested ML algorithms are Support Vector Machines (SVM), k-Nearest Neighbors (k-NN) and Deep Neural Networks (DNN). The k-NN model demonstrated the highest precision rate, at 98% detection of a given covert channel, with a low false positive rate of 1%. DNN has the highest rate of FP and SVM has the lowest precision on the testing dataset.
The findings of this research suggest several areas of future work. Firstly, additional covert channel patterns could be tested with the detection scheme. Secondly, further investigation with other ML algorithms could be considered. Most importantly, however, research is needed to study how the discussed classifiers impact legitimate network communications while detecting and classifying covert channels on a large scale.