Wednesday, July 1, 2009

A Survey of Techniques for Internet Traffic Classification using Machine Learning

Authors: Thuy T.T. Nguyen and Grenville Armitage

Summary:


The paper surveys works that apply machine learning (ML) techniques to the classification of IP network traffic.

To facilitate the review of the papers, the authors grouped the surveyed works into the following categories:

a) Clustering approaches
In this section, the authors gave a summary of the usage framework and results for the following algorithms:

  1. Expectation Maximization (for flow clustering)
  2. Unsupervised Bayesian classification (coupled with expectation maximization for automated application identification)
  3. Simple K-means (one for TCP-based application identification and one for identifying Web and P2P traffic in the network core)
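
To make the flow-clustering idea concrete, here is a minimal sketch of plain K-means in Python. It is not the setup used in any of the surveyed papers: the two features (mean packet size and mean inter-arrival time), the sample values, and the helper names `sq_dist`, `assign` and `kmeans` are all assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            for p in points]

def kmeans(points, k, iters=50, seed=0):
    """Plain K-means (Lloyd's iteration) on equal-length feature tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k observed flows
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # New centroid is the per-feature mean of the cluster members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical flows: (mean packet size in bytes, mean inter-arrival time in ms)
flows = [(1200.0, 5.0), (1300.0, 4.0), (1250.0, 6.0),
         (90.0, 150.0), (110.0, 170.0), (100.0, 160.0)]
centroids, labels = kmeans(flows, k=2)
```

The clusters carry no application names; a separate step is still needed to decide which application each cluster represents. With real traffic the features would normally be scaled first, since the unscaled packet size dominates the distance here.
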
b) Supervised Learning Approaches
The same treatment as in the review of clustering algorithms was applied to the following ML techniques:

(The first three have been used for mapping network applications to predetermined QoS traffic classes.)
  1. Neural Networks
  2. Linear Discriminant Analysis
  3. Quadratic Discriminant Analysis
  4. Supervised Bayesian classification (one for classifying Internet traffic by application, one coupled with Multiple Sub-flows features for real-time traffic classification, and another coupled with Multiple Synthetic Sub-flows Pairs, also for real-time classification)
  5. Genetic algorithms (for feature selection and flow classification)
  6. Statistical techniques (coupled with so-called "protocol fingerprints" for flow classification)
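
As an illustration of the supervised Bayesian idea, here is a minimal Gaussian Naive Bayes sketch in Python. It is a generic textbook version, not the classifier from any surveyed paper; the class labels ("bulk", "interactive"), the two features, and the function names `train_gnb` and `predict_gnb` are assumptions.

```python
import math
from collections import defaultdict

def train_gnb(samples, labels):
    """Estimate the prior, feature means and feature variances of each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant avoids a zero variance on constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[y] = (n / len(samples), means, variances)
    return model

def predict_gnb(model, x):
    """Return the class with the highest log-posterior for feature tuple x."""
    def log_posterior(y):
        prior, means, variances = model[y]
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # Log of the Gaussian density, features treated as independent.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Hypothetical labelled flows: (mean packet size in bytes, mean inter-arrival time in ms)
train_x = [(1200.0, 5.0), (1300.0, 4.0), (1250.0, 6.0),
           (90.0, 150.0), (110.0, 170.0), (100.0, 160.0)]
train_y = ["bulk", "bulk", "bulk", "interactive", "interactive", "interactive"]
model = train_gnb(train_x, train_y)
guess = predict_gnb(model, (1280.0, 5.5))
```

Unlike clustering, this needs flows that have already been labelled with their application or traffic class before training.
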

c) Hybrid Approaches

Under this category, the authors report a proposed semi-supervised classification technique. It is a two-step method that combines maximum likelihood estimation (via a Bayesian-like statistic) with K-means clustering.

d) Comparison and Related Work

The last category covers works that compare the algorithms mentioned earlier. In summary, the papers compared clustering methods against one another, clustering against supervised methods, and statistical methods (particularly Pearson's chi-square test) against supervised methods (particularly Naive Bayes). This section also presented "novel" ML-based methods: ACAS (ML techniques applied to application signatures) and BLINC (application classification based on the behavior of the source host at the transport layer).

Finally, the authors assessed the surveyed works against the following "challenges for operational deployment":

  1. Timely and Continuous Classification
    Some have explored the performance of ML classifiers that utilise only the first few packets of a flow, but they cannot cope with missing the flow’s initial packets. Others have explored techniques for continuous classification of flows using a small sliding window across time, without needing to see the initial packets of a flow.

  2. Directional Neutrality
    The assumption that application flows are bi-directional, and the application’s direction may be inferred prior to classification, permeates many of the works published to date. Most work has assumed that they will see the first packet of each bi-directional flow, and that this initial packet is from a client to a server. The classification model is trained using this assumption, and subsequent evaluations have presumed the ML classifier can calculate features with the correct sense of forward and reverse direction.

  3. Efficient Use of Memory and Processors
    There are definite trade-offs to be made between the classification performance of a classifier and the resource consumption of the actual implementation... The overhead of computing complex features (such as effective bandwidth based upon entropy, or Fourier Transform of the packet inter-arrival time) must be considered against the potential loss of accuracy if one simply did without those features.

  4. Portability and Robustness
    None of the reviewed works has addressed or evaluated its model’s robustness in terms of classification performance with the introduction of packet loss, packet fragmentation, delay and jitter.
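
The first two challenges can be pictured with a small Python sketch: features are computed over a sliding window of the most recent packets, so the start of the flow need not be seen, and the flow key is built without deciding which end is the client. This is only one possible reading of those challenges, not a method from the paper; `flow_key`, `SlidingWindow`, the window size and the two features are assumptions.

```python
from collections import deque

def flow_key(src, dst, proto):
    """Order the two endpoints so both directions map to the same key."""
    a, b = sorted([src, dst])
    return (a, b, proto)

class SlidingWindow:
    """Keeps the most recent packets of one flow, whichever way they travel."""

    def __init__(self, size=5):
        self.packets = deque(maxlen=size)  # older packets fall out automatically

    def add(self, timestamp, length):
        self.packets.append((timestamp, length))

    def features(self):
        """Mean packet length and mean inter-arrival time over the window."""
        if not self.packets:
            return None
        times = [t for t, _ in self.packets]
        lengths = [n for _, n in self.packets]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        mean_len = sum(lengths) / len(lengths)
        mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
        return mean_len, mean_gap

# Hypothetical endpoints as (address, port) pairs
client = ("10.0.0.1", 5000)
server = ("10.0.0.2", 80)

window = SlidingWindow(size=3)
for timestamp, length in [(0.0, 100), (1.0, 200), (2.0, 300), (4.0, 400)]:
    window.add(timestamp, length)
recent = window.features()  # computed from the last three packets only
```

Because the features ignore which side sent each packet, a classifier fed with them does not depend on the forward/reverse sense being inferred correctly, at the cost of discarding whatever information that direction carries.
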


Critique:

Even though the paper's main purpose is to report on the status of ML usage for traffic classification, it also points to other directions for network-related research. One of the (obvious?) things that merits some research is the wide array of network classification tasks (e.g. flow classification, application identification). A potential topic that comes to mind is a synthesis of the outputs of these different classification tasks into a unified view of the profile of a network. Another is feature selection (i.e. the task of identifying the attributes needed as input). Although the "standard" set of features is the usual 5-tuple (source and destination addresses, source and destination ports, and protocol), network transactions are now more complex, so a study could look for another "optimal" set of features that supports better network traffic classification.
