Thursday, October 29, 2009

My opinions and reactions on the class DPI project

When we first started the class DPI project, the first thing that came into my mind was how easy it would be. I mean, how difficult could it be to open packets, inspect their contents, and classify them accordingly? You could say it would be the equivalent of your friendly postman opening your mail and classifying whether it contained something important, a postcard, some cash, or perhaps spam and even anthrax. And we're not looking for passwords, credit card information, or information for the spooks. No, we are more benevolent than that.

Our purpose would be to properly classify the applications running on the network. If we're going to monitor our networks, we must have a complete picture of the applications that our users are running on them. Most of these applications are "visible" and can easily be blocked. However, many applications running on the Internet have acquired the ability to bypass firewalls and proxies. The corporate, academic, technical, and what-not policies that have governed networks for the past few years have pushed applications to use proxies or to encrypt their communications in order to bypass the usual roadblocks that network administrators have put in place over our networks today.

The wisdom of such blocks has been in serious question, on both the technical and user levels. However, the reality is that these blocks are here to stay, and the target applications of such blocks have adapted to the current Internet landscape. A great example of such a versatile program is Skype.

The purpose of our class project was to detect peer-to-peer traffic that has managed to pass through the roadblocks that the university network administrators have put in place.

While we did manage to get a sample of the network traces, we have yet to detect any peer-to-peer activity in the university network. So far, the university network administrators appear to have succeeded in their "quest" to block all kinds of peer-to-peer traffic.

We also applied some machine learning techniques to the traces, but I think we largely failed there because we had no training data to use, since no peer-to-peer traffic was detected. We need data that we positively know contains peer-to-peer traffic. If we can't detect any, then we should run the applications ourselves and actively look for holes in the university network. Once we "detect" our own traces, we can feed them into the machine learning tool as training data to detect the peer-to-peer traffic that does not belong to us.
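As a rough sketch of what that training step could look like (a hypothetical illustration assuming scikit-learn; the file names and flow features are placeholders, not our actual project code):

    # Hypothetical sketch: train a classifier on labeled flows from our own
    # P2P captures, then apply it to the unlabeled university traces.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Each row: per-flow features, e.g. mean packet size, duration, up/down ratio
    X_train = np.loadtxt("our_labeled_flows.csv", delimiter=",")
    y_train = np.loadtxt("our_labels.csv")  # 1 = P2P, 0 = everything else

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    # Flag suspected P2P flows in the traces collected from the university
    X_unknown = np.loadtxt("university_flows.csv", delimiter=",")
    print(clf.predict(X_unknown))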

The other technique that the class investigated, which is actually reading the packet contents, is a hit-or-miss affair. We can argue that reading the first few bytes of the payload can give us the name of the actual application; however, once the traffic is encrypted, all bets are off. I believe this technique will only be useful in the near-to-medium term, and will work only against simple applications that have not acquired the variety of users who need special methods to bypass proxies and firewalls. A minimal sketch of such signature matching follows below.
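For the unencrypted case, first-bytes matching amounts to a prefix table. The unencrypted BitTorrent handshake really does begin with the byte 19 followed by "BitTorrent protocol"; the rest of this table is only illustrative:

    # Minimal first-bytes signature matcher; useless once traffic is encrypted.
    SIGNATURES = {
        b"\x13BitTorrent protocol": "BitTorrent",  # unencrypted BT handshake
        b"GET ": "HTTP",
        b"POST ": "HTTP",
    }

    def classify_payload(payload: bytes) -> str:
        for prefix, app in SIGNATURES.items():
            if payload.startswith(prefix):
                return app
        return "unknown"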

We are not there yet, but we have learned "what not to do in DPI." This may sound like an Edisonian way of thinking, but I believe that as we continue to refine our techniques and code, we will find a way to detect peer-to-peer traffic without reading the payload.

Wednesday, August 12, 2009

rSim network simulation results

Here are the results from my network simulation.

Thursday, July 9, 2009

Network Traces



If the slideshow is not showing, you can view it here.

Wednesday, July 8, 2009

Circumventing P2P blocks

Assumptions
  • port 22/SSH allowed outgoing
  • Squid proxy on ports 443 and 80
  • NAT support
  • outgoing VPN allowed

SOCKS4/5 proxy

  • using ssh -D8080 root@remote-host.com
  • using proxifiers (HTTP/SOCKS), or stunnel to encrypt any TCP connection (single-port service) over SSL
  • (then use as a SOCKS/HTTP proxy in the BitTorrent client; a sketch of the SOCKS handshake follows below)
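To illustrate what the proxifier does on the client's behalf, here is a hand-rolled SOCKS4 CONNECT through the ssh -D listener. The local port matches the ssh -D8080 example above; everything else is a sketch, not production code:

    # Sketch: open a TCP connection to a destination through the local
    # SOCKS4 proxy created by "ssh -D8080 root@remote-host.com".
    import socket
    import struct

    def socks4_connect(dst_ip: str, dst_port: int) -> socket.socket:
        s = socket.create_connection(("127.0.0.1", 8080))  # the ssh -D listener
        # VN=4, CD=1 (CONNECT), destination port + IPv4 address, empty user id
        s.sendall(struct.pack(">BBH", 4, 1, dst_port)
                  + socket.inet_aton(dst_ip) + b"\x00")
        reply = s.recv(8)
        if len(reply) < 2 or reply[1] != 0x5A:  # 0x5A means request granted
            raise ConnectionError("SOCKS4 proxy refused the connection")
        return s  # all traffic on s now flows through the SSH tunnel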

Bypassing SQUID

  • HTTP CONNECT to specified FQDN peers (to bypass filters on CONNECT-to-IP-address); the peers themselves are HTTP proxies. A sketch of the CONNECT request follows below.
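On the wire the trick looks like this; the proxy host, port, and peer name are placeholders. Note that the tunnel target is a hostname rather than an IP address, to slip past the filter:

    # Sketch: tunnel a raw TCP connection through Squid via HTTP CONNECT.
    import socket

    def connect_via_proxy(peer_fqdn: str, peer_port: int) -> socket.socket:
        s = socket.create_connection(("proxy.example.edu", 3128))
        request = (f"CONNECT {peer_fqdn}:{peer_port} HTTP/1.1\r\n"
                   f"Host: {peer_fqdn}:{peer_port}\r\n\r\n")
        s.sendall(request.encode("ascii"))
        status_line = s.recv(4096).split(b"\r\n", 1)[0]
        if b" 200 " not in status_line:  # "HTTP/1.1 200 Connection established"
            raise ConnectionError("proxy denied the CONNECT")
        return s  # the socket is now a raw tunnel to the peer
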
P2P on VPN (OpenVPN, IPsec)

  • OpenVPN multiplexes on a single TCP/UDP port
  • IPsec, a security scheme at layer 3 (the OSI Network layer / the Internet layer)

NAT on tcp/443
  • all browser sessions use the proxy

A measurement study on video acceleration service

P. Pan, Y. Cui, and B. Liu, "A measurement study on video acceleration service," in IEEE CCNC, 2009.

Relevance
  • Pipes getting bigger.
  • Bandwidth and storage getting cheaper.
  • Browsers getting smarter.
  • People getting closer / social media.
  • VoD: YouTube / Tudou / Hulu / etc. rely on streaming.
Performance

Buffer
- long time to buffer / multi-connection download, P2P
  • multiple connections over the same data (e.g. TV show)
  • caching at peering points
  • TCP/UDP data transfer
  • intelligent P2P routing between peering points
- buffer may stop / auto-reconnect download session
- multiple instances of buffered data / cache sharing

Result highlights


Conclusion
  • Accelerator in the browser.
  • ISP peering/caching technology
  • Partial Net neutrality?

Wednesday, July 1, 2009

Longitudinal Study of Internet traffic in 1998-2003

M. Fomenkov, K. Keys, D. Moore, and k. claffy, "Longitudinal study of Internet traffic in 1998-2003," in WISICT, 2004.

Introduction

This research presents a longitudinal study of Internet traffic behavior at a number of institutions over a span of four and a half years (1998-2003).
It cited previous works such as:
McCreary and Claffy, who analyzed IP traffic at the NASA Ames Internet eXchange point (AIX) for 8 months.
Thompson, Miller, and Wilder, who discussed characteristics of MCI's commercial Internet backbone over periods ranging from one day to a week.
Fraleigh et al., who described the IPMON traffic monitoring system and reported observations from the Sprint E-Solutions backbone network for a day.
The WAND network research group at the University of Waikato, which conducted measurements on OC3 links between the University of Auckland and the public Internet.

Data
They obtained 4000 traffic samples from various sites connected to High Performance Computing networks.
At each site, packet headers were captured from one to eight times a day, every month. The average duration of each measurement ranges from 60 to 120 seconds.

Four metrics were used to measure traffic:
1. number of bytes
2. number of packets where packets are actual quanta of traffic
3. number of flows
4. number of source-destination pairs (port numbers and protocols ignored)

A flow is a sequence of packets sharing the same source IP address, source port, destination IP address, destination port, and protocol (the flow key).
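A minimal sketch of grouping packets into flows under that definition (the packet-record fields here are assumed for illustration, not taken from the paper's tooling):

    # Sketch: count flows by grouping packets on the 5-tuple flow key.
    from collections import defaultdict

    def flow_key(pkt: dict) -> tuple:
        return (pkt["src_ip"], pkt["src_port"],
                pkt["dst_ip"], pkt["dst_port"], pkt["protocol"])

    def count_flows(packets) -> int:
        flows = defaultdict(int)
        for pkt in packets:
            flows[flow_key(pkt)] += 1  # packets seen per flow
        # Metric 4 (source-destination pairs) would key on (src_ip, dst_ip) only
        return len(flows)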

They discovered that traces captured with an FATM card often have problems with the accuracy of time measurements, such as apparent clock resets and delays. They solved this by checking timestamps, properly converting absolute counts to rates, and averaging the rates.

Results and Conclusions

Variations in bit rate are large and mostly without trends, which reflects the burstiness of Internet traffic. No cycle or consistent long-term growth was observed.

The quality of the available data is often insufficient for other qualitative measurements (e.g., traffic fluctuations can be caused by any number of reasons).

Assuming the data is representative of overall traffic evolution, they conclude that the data do not support the claim that Internet traffic was universally and rapidly increasing, both before and after the Internet bubble burst.

TCP is the predominant transport protocol:
TCP traffic is between 60% and 90% of the total load,
UDP is between 10% and 40% of the total load,
and all other protocols combined amount to less than 5%.

By bytes, the average proportion of TCP to UDP traffic is 5 to 1; by packets, it is 3 to 1.

Packet rate is a sublinear function of bit rate: packet rate ~ bitrate^0.75, while the counts of flows and IP pairs behave as bitrate^0.5. (For example, doubling the bit rate raises the packet rate by a factor of about 2^0.75 ≈ 1.68.)

Analysis of Internet Backbone Traffic and Header Anomalies Observed

Authors: W. John, S. Tafvelin

This is a comparison with their later paper.

Differences
  • length of study
    • this study collected data from spring 2006 (April), 7.5 TB of data
    • their later study included data from spring 2006, plus newer data from fall 2006 (September to November), 5 TB of data
  • focus
    • this study: headers used and anomalies
    • their later study: traffic classes, also observed some header anomalies
This study:
  • ECN deployment is still small: 0.2% of tested clients
  • more UDP packets are fragmented (97%) than TCP (3%) among incoming fragmented traffic; not surprising, since path MTU discovery applies to TCP only
Their later study:
  • P2P is more aggressive in using SACK
  • the Window Scale (WS) and Timestamps (TS) options are more established in HTTP
Common:
This study represents the initial results of their overall work, focusing on the headers used and their effect on the applications in play. Their later paper presents a more in-depth study and shows how the header anomalies can be used to improve the monitoring of applications using the network and the detection of malicious attacks.

Micro Transport Protocol

Micro Transport Protocol is basically BitTorrent over UDP.

Traditionally, when an application needs to communicate over a network, it chooses between TCP and UDP for its transport protocol. When the need for reliability is more important than speed, TCP is the right choice. Otherwise, it can use UDP to take advantage of its strengths. BitTorrent, which deals with the reliable transfer of data, obviously should use TCP.

However, in recent years, BitTorrent has started to dominate the Internet. Regardless of the legality of the files being transferred, BitTorrent has become a bandwidth hog. This is a concern for ISPs, particularly in the US, where there are no download limits. To combat the BitTorrent onslaught, ISPs started shaping their traffic. This was the start of the net neutrality debate.

Two simple examples of traffic shaping are TCP resets and random packet discard. With TCP resets, the ISP looks for P2P traffic (a long session between two peers involving large packets) and sends a TCP reset to one or both peers. With random packet discard, the ISP simply drops random packets.

To defeat the traffic-shaping techniques of the ISPs, the BitTorrent designers turned to UDP, since UDP traffic is much harder to shape. ISPs would have to look inside the UDP packets to interfere with the traffic, but such deep packet inspection is almost akin to wiretapping, which is illegal in most countries. Besides, looking for long TCP sessions is so much easier than inspecting UDP packets; it is analogous to counting truck trailers on the highway versus counting the occupants of every vehicle passing by.

Using UDP, however, has its own problems. Without the congestion control of TCP, there is a danger of flooding, which can slow down other applications that use the Internet. Furthermore, TCP features such as retransmission of lost packets have to be reimplemented.

The Micro Transport Protocol addresses the congestion issue by controlling the transfer rate of its connections using timing information gathered from the transport. It aims to decrease the latency caused by applications using the protocol, while maximizing bandwidth when latency is not excessive. This way, there is no need for the user to set the upload/download rate, since the protocol automatically adjusts to the network. A sketch of this style of delay-based control follows below.
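The actual uTorrent implementation is not public, so the following is only a generic sketch of delay-based rate control of the kind described; the target delay and gain constants are invented for illustration:

    # Sketch: LEDBAT-style delay-based rate control. Back off when measured
    # one-way delay says queues are building; speed up when the path is idle.
    TARGET_DELAY = 0.100   # seconds of queuing delay the sender will tolerate
    GAIN = 10_000          # bytes/second of rate adjustment per control step

    def adjust_rate(rate: float, base_delay: float, current_delay: float) -> float:
        queuing_delay = current_delay - base_delay   # delay this sender adds
        # Positive when under target (speed up), negative when over (back off)
        off_target = (TARGET_DELAY - queuing_delay) / TARGET_DELAY
        return max(1_000.0, rate + GAIN * off_target)  # never stall completely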

Since the protocol is still being implemented, its features are typically hidden or obscure. The lack of an open-source implementation of this protocol, or even a standard surrounding it, will, in my opinion, slow down its adoption and may prolong the debate over whether UDP is right for BitTorrent or not.

Analysis of Internet Backbone Traffic and Header Anomalies observed

Wolfgang John and Sven Tafvelin
Chalmers University of Technology

Introduction

In order to support research and further development, the Internet community needs to understand the nature of Internet traffic. In this paper, an analysis of IP and TCP traffic was done using headers from two OC-192 links.

Methodology

Collection of Traces.
- April 7-26, 2006
- Optical splitters were used on two OC-192 links attached to Endace DAG6.2SE cards
- The first 120 bytes of each PoS (Packet over SONET) frame were captured by the DAG cards
- Four traces of 20 minutes each day (2AM, 10AM, 2PM, 8PM)

Processing and Analysis.
- Payload beyond the transport layer was removed.
- Traces were sanitized, checked for inconsistencies.
- Traces were desensitized, stripped of all sensitive information to ensure privacy.

Results
- 148 traces
- 10.77 billion PoS frames
- 7.6 TB of data; 99.97% of the frames contain IPv4 packets

IP packet size distribution
- bimodal
- 44% is between 40 and 100 bytes
- 37% is between 1400 and 1500 bytes

Transport Protocols
- TCP: 90-95% of the data volume
- the largest fraction of TCP and the lowest of UDP occur during the 2PM traces
- a potential UDP DoS attack was detected through high UDP traffic during April 16-17, and later confirmed

Analysis of IP properties
- IP options are virtually unused
- only 68 packets carrying IP options were observed
- only 0.06% of traffic was IP fragmented, contrary to previous reports of up to 0.67%

Analysis of TCP properties
- The MSS and SACK-permitted options are widely used on connection establishment (on average 99.2% and 89.9%, respectively)
- TCP option misbehavior was also observed, including undefined option types and inconsistencies between the option header length value and the actual option header length

Conclusions
- Current trends in Internet backbone traffic are useful in protocol and application design.
- The anomalies detected were caused by: buggy and misbehaving applications and protocol stacks; active OS fingerprinting; and network attacks exploiting vulnerabilities.

Critique
The results of this paper apply only to the particular Internet backbone links used in the collection of the data. A much wider source of packet traces (say, hundreds of OC links on different continents) is needed to generalize the properties of Internet traffic.

A Survey of Techniques for Internet Traffic Classification using Machine Learning

Authors: Thuy T.T. Nguyen and Grenville Armitage

Summary:


The paper is a survey of works that involved machine learning techniques in classifying IP network traffic.

To facilitate the review of the papers, the authors grouped the surveyed works into the following categories:

a) Clustering approaches
In this section, the authors gave a summary of the usage framework and results for the following algorithms:

  1. Expectation Maximization (for flow clustering)
  2. Unsupervised Bayesian classification (coupled with expectation maximization for automated application identification)
  3. Simple K-means (one for TCP-based application identification and one for identifying Web and P2P traffic in the network core)
b) Supervised Learning Approaches
The same treatment as in the review of the clustering algorithms was applied to the following ML techniques:

(The following three have been used for mapping network apps to predetermined QoS traffic classes.)
  1. Neural Networks
  2. Linear Discriminant Analysis
  3. Quadratic Discriminant Analysis

  4. Supervised Bayesian classification (one for classifying network traffic based on application, one coupled with Multiple Sub-flows features for real-time traffic classification, and another coupled with Multiple Synthetic Sub-flow Pairs, also for real-time classification)
  5. Genetic algorithms (for feature selection and flow classification)
  6. Statistical techniques (coupled with so-called "protocol fingerprints" for flow classification)

c) Hybrid Approaches

Under this type, a proposed semi-supervised classification technique is reported. It is a two-step method involving maximum likelihood estimation (via a Bayesian-style statistic) and, later, K-means clustering. A rough sketch of the idea follows below.
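One way such a cluster-then-label scheme could work, sketched with stand-in data (this is my own illustration, not the surveyed paper's exact algorithm):

    # Sketch: cluster all flows, then label each cluster by majority vote
    # among the few flows whose application we already know.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 4)                 # stand-in flow features
    known_idx = np.arange(50)                   # the few labeled flows
    known_labels = np.random.randint(0, 2, 50)  # stand-in labels (1 = P2P)

    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
    cluster_label = {}
    for c in range(8):
        members = [lbl for i, lbl in zip(known_idx, known_labels)
                   if km.labels_[i] == c]
        cluster_label[c] = (max(set(members), key=members.count)
                            if members else None)

    predicted = [cluster_label[c] for c in km.labels_]  # labels for all flows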

d) Comparison and Related Work

The last category reported works that included comparisons of the algorithms mentioned earlier. In summary, the papers compared clustering vs. other clustering methods, clustering vs. supervised methods, and statistical methods (particularly Pearson's chi-square test) vs. supervised methods (particularly Naive Bayes). Another concept under this section was the presentation of "novel" ML-based methods: ACAS (ML techniques on application signatures) and BLINC (application classification based on the behavior of the source host at the transport layer).

Finally, they gave an assessment of sorts on the works they surveyed based on the following "challenges for operational deployment":

  1. Timely and Continuous Classification
    Some have explored the performance of ML classifiers that utilise only the first few packets of a flow, but they cannot cope with missing the flow’s initial packets. Others have explored techniques for continuous classification of flows using a small sliding window across time, without needing to see the initial packets of a flow.

  2. Directional Neutrality
    The assumption that application flows are bi-directional, and that the application's direction may be inferred prior to classification, permeates many of the works published to date. Most works have assumed that they will see the first packet of each bi-directional flow and that this initial packet is sent from a client to a server. The classification model is trained using this assumption, and subsequent evaluations have presumed the ML classifier can calculate features with the correct sense of forward and reverse direction.

  3. Efficient Use of Memory and Processors
    There are definite trade-offs to be made between the classification performance of a classifier and the resource consumption of the actual implementation... The overhead of computing complex features (such as effective bandwidth based upon entropy, or Fourier Transform of the packet inter-arrival time) must be considered against the potential loss of accuracy if one simply did without those features.

  4. Portability and Robustness
    None of the reviewed works has addressed and evaluated its model's robustness in terms of classification performance under packet loss, packet fragmentation, delay, and jitter.


Critique:

Even though the paper's main purpose is to report on the status of ML usage for traffic classification, it also presents other opportunities toward which network-related research may be directed. One of the (obvious?) things that merit some research is the wide array of network classification tasks (e.g., flow classification, application identification). A potential topic that comes to mind would be a synthesis of the outputs of these different classification tasks into a unified view of the profile of a network. Another is feature selection (i.e., the task of identifying the attributes needed for input). Although the "standard" set of features is the usual 5-tuple, network transactions are now more complex, so a study could be conducted on another "optimal" set of features to carry out network traffic classification better.

Tuesday, June 30, 2009

PBS: Periodic Behavioral Spectrum of P2P Applications

Authors: Tom Z.J. Fu, Yan Hu, Xingang Shi, Dah Ming Chiu, and John C.S. Lui


Introduction

This paper discusses a new approach to identifying P2P traffic: profiling the specific traffic patterns that the P2P overlay introduces into the network. From these profiles we can extract the "periodic behavior" of the overlay, and these behaviors can help us identify the system running on the network without port monitoring or inspecting the payload of the traffic.

The paper introduces a novel approach, the Two-Phase Transformation.

Experiment Design

The research distinguishes two kinds of periodic group communication:
1. control plane - control signals for the overlay
2. data plane - actual data flows in the overlay network

The research also identified three major types of periodic behavior or pattern:
1. Buffermap exchange
  • typical on P2P streaming
  • peers exchange buffer information periodically using buffer maps
2. Content flow control
  • mechanism for limiting the download rate of peers
  • introduces periodic data flows
3. Synchronized Link Activation and Deactivation
  • used in BitTorrent
Methodology

  • PBS pattern identification done on a selected PC on the network
  • packet capture using Wireshark
Two-Phase Transformation (a sketch of steps 3 and 4 follows below)
1. capture inbound and outbound packets
2. graph packet traffic on a timeline
3. autocorrelation of the timeline
4. Discrete Fourier Transform
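A small NumPy sketch of steps 3 and 4 on a binned packet-count series (the function and variable names are mine, not the paper's):

    # Sketch: autocorrelate a binned packet-count series, then take the DFT;
    # peaks in the spectrum mark the periodic behavior PBS is looking for.
    import numpy as np

    def periodic_spectrum(counts: np.ndarray, bin_seconds: float):
        x = counts - counts.mean()
        # Phase 1: autocorrelation (keep non-negative lags only)
        ac = np.correlate(x, x, mode="full")[len(x) - 1:]
        # Phase 2: DFT of the autocorrelation
        spectrum = np.abs(np.fft.rfft(ac))
        freqs = np.fft.rfftfreq(len(ac), d=bin_seconds)
        return freqs, spectrum  # a peak at f means a period of 1/f seconds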

Results
  • PBS profiles for a majority of P2PTV clients such as TVAnts, SopCast, PPStream, eMule, Joost, PPMate, PPLive, TVKoo, and UUSee
  • Tested using 2 scenarios: a computer inside the LAN and a computer accessing through a DSL connection
  • Tested the usefulness of PBS profiles by capturing traffic for two days at the campus gateway. The results identified running P2P traffic with 100% accuracy
Critique
  • Testing of P2P traffic identification using PBS was not sufficient in terms of the number of experiments
  • PBS profiles were generated using traffic inbound and outbound of a single node, not gateway traffic; this could introduce inaccuracy into the PBS profiles
  • Packet header confirmation is still needed.

Thursday, June 25, 2009

State of the Art in Traffic Classification: A Research Review

M. Zhang, W. John, k. claffy, and N. Brownlee, "State of the art in traffic classification: A research review," PAM Student Workshop, 2009.


They surveyed 64 papers with over 80 data sets to create a structured taxonomy of traffic classification papers. The taxonomy is based on the following definition of traffic classification:

"Methods of classifying traffic data sets based on features passively observed in the traffic, according to specific classification goals."

They grouped the papers into 5 categories: analysis, surveys, tools, methodology, and others. They used five attributes from the definition to categorize the papers.

Data sets:
- can be classified based on what type of traffic it is, where it was collected, etc.

Classification goals:
- can be coarse-grained (P2P, transaction-oriented) or fine-grained (from a specific application)

Methods:
- exact methods (via port numbers; a minimal sketch follows this list)
- heuristics (based on patterns)
- machine learning methods: supervised or unsupervised learning
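For the exact method, the sketch is just a table lookup (the port table here is a tiny illustrative subset of the IANA assignments):

    # Sketch: port-based "exact" classification, the oldest method surveyed.
    WELL_KNOWN_PORTS = {
        20: "FTP-data", 21: "FTP", 22: "SSH", 25: "SMTP",
        53: "DNS", 80: "HTTP", 110: "POP3", 443: "HTTPS",
    }

    def classify_by_port(src_port: int, dst_port: int) -> str:
        for port in (dst_port, src_port):  # the server side is usually dst
            if port in WELL_KNOWN_PORTS:
                return WELL_KNOWN_PORTS[port]
        return "unknown"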

Features:
- choosing features to use for traffic classification is related to trends in application development. A good example given in the paper is the trend of modern applications to use UDP instead of TCP and to change ports from time to time. Because of this, mere examination of port numbers may not be enough, and we might need to look at payloads, flows, etc.

Using the taxonomy that they developed, they tried to answer the following question: how much of the modern Internet is P2P?

The following are the observations they have gathered from the papers they've surveyed:
- 1.2% to 93% of traffic is due to P2P file sharing (the range observed across 18 of the 64 papers)
- the fractions have increased from 2002 to 2006
- P2P is more popular in Europe
- P2P traffic varies by time of day with higher percentages at night
- P2P is used more at home than in the office

Based on this, they cannot make any conclusive claims to answer the question above. All they can say is,

"there is a wide range of P2P traffic on Internet links; see your specific link of interest and classification technique you trust for more details."

Shortcomings of current traffic classification:
- lack of shared current data sets
- lack of standardized measure and classification

Wednesday, June 24, 2009

Trends and Differences in Connection-behavior within Classes of Internet Backbone Traffic

Authors: W. John, S. Tafvelin, T. Olovsson

The focus of the paper is on three main traffic classes:
  • P2P file-sharing protocols
  • Web traffic
  • Malicious and attack traffic
Results reported:
  • P2P and HTTP traffic exhibit different peak times
    • HTTP traffic has its main activities during office hours
    • P2P traffic during the night, up to 90% of transfer volumes
  • SACK option has been deployed mostly on clients, but
    • Web servers neglect its usage
    • Most P2P hosts use it
  • Malicious attacks continue all day without rest
    • They remain constant, even as traffic volume increases
Critiques:
  • The basis for choosing the three main classes is not explained
  • What about real-time applications? Or have they been lumped together with the HTTP and/or P2P categories?