A simple solution to filtering IP information using Disco

Computer Science Department - University of Puerto Rico

Prepared by: José R. Ortiz-Ubarri

Network monitoring

A simple way to monitor your network is to keep aggregated information about its traffic.

  1. Aggregated IP traffic can hint at anomalies in your network:
    • An IP in use that has not been delegated.
    • An IP generating more traffic than usual.
    • An IP that is not generating traffic.

IP that has not been delegated

In a private network, an IP that has not been delegated can be a machine hijacking your network.

Or a savvy user who does not like to follow the rules.

IP generating more traffic than usual

An IP generating more traffic than usual can be a sign of a DoS attack.

A compromised computer.

A user using the network for personal purposes at work.

A computer generating spam.

An IP that is not generating traffic

An IP that is expected to generate traffic but does not might belong to a machine that has crashed or is powered off.

A network service on that machine might have crashed or been turned off.
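
Once per-IP traffic totals are available, the three checks described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name `find_anomalies`, the inputs `delegated` and `baseline`, and the rule "more than `factor` times the baseline" are all hypothetical and are not part of the original material.

```python
def find_anomalies(traffic, delegated, baseline, factor=3.0):
    """Classify IPs using aggregated traffic totals (octets per IP).

    traffic:   dict mapping IP -> octets seen in the current window
    delegated: set of IPs that have been assigned to known machines
    baseline:  dict mapping IP -> typical octets per window (hypothetical input)
    factor:    multiple of the baseline treated as "more than usual" (assumption)
    """
    # IPs that appear in the traffic but were never delegated
    undelegated = sorted(ip for ip in traffic if ip not in delegated)
    # IPs whose traffic exceeds factor * their usual volume
    heavy = sorted(ip for ip in traffic
                   if ip in baseline and traffic[ip] > factor * baseline[ip])
    # Delegated IPs that usually generate traffic but generated none
    silent = sorted(ip for ip in delegated
                    if baseline.get(ip, 0) > 0 and traffic.get(ip, 0) == 0)
    return undelegated, heavy, silent


# Illustrative values only, not measurements
traffic = {"10.0.0.1": 500, "10.0.0.2": 90000, "10.0.0.9": 120}
delegated = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
baseline = {"10.0.0.1": 400, "10.0.0.2": 1000, "10.0.0.3": 800}

undelegated, heavy, silent = find_anomalies(traffic, delegated, baseline)
print(undelegated)  # ['10.0.0.9']
print(heavy)        # ['10.0.0.2']
print(silent)       # ['10.0.0.3']
```

A flagged IP is only a hint to investigate; the threshold would need tuning for a real network.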

IP aggregation and Filtering

The larger the network, the harder it becomes to aggregate the traffic of a single IP, or of every IP in the network.

For example, a file containing 5 minutes of NetFlow data from the University of Puerto Rico's traffic can be as large as 6.5 MB, or 363149 lines of flows.
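
To get a feeling for the scale, the figures above can be turned into a rate and extrapolated to a full day. The extrapolation assumes every 5-minute window looks like the sample, which real traffic will not; treat the daily numbers as a rough order of magnitude. The snippet assumes Python 3 division.

```python
# Figures quoted above for one 5-minute capture
flows_per_file = 363149
megabytes_per_file = 6.5
window_seconds = 5 * 60

# Average flow rate during the sampled window
flows_per_second = flows_per_file / window_seconds

# Number of 5-minute windows in a day
files_per_day = 24 * 60 * 60 // window_seconds

# Extrapolation: assumes every window matches the sample (an assumption)
flows_per_day = flows_per_file * files_per_day
megabytes_per_day = megabytes_per_file * files_per_day

print(round(flows_per_second))  # 1210
print(files_per_day)            # 288
print(flows_per_day)            # 104586912
print(megabytes_per_day)        # 1872.0
```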

NetFlows

NetFlow is a network protocol developed by Cisco that has become a de facto standard for traffic monitoring. It runs on some network devices and collects aggregated information about the network traffic.

This information is exported to a collector for analysis.

A NetFlow record represents a unidirectional sequence of packets. It contains the source IP, the destination IP, the source port, the destination port, the sum of the payload sizes of the packets, and a timestamp, among other fields.

A NetFlow

136.145.155.233  200.125.49.166   6     80       60239    52          1         
136.145.182.24   190.144.252.102  6     9637     80       80          2         
136.145.182.24   107.15.34.28     6     56387    38973    52          1         
167.8.226.10     136.145.33.149   6     80       51965    52          1         
206.248.76.205   136.145.30.60    6     49839    443      245         4         
173.194.37.118   136.145.170.186  6     443      51991    40          1         
136.145.33.149   167.8.226.10     6     51965    80       40          1         
136.145.230.223  31.13.69.80      6     62206    443      1668        4         
136.145.180.248  54.234.145.249   17    53       48266    160         1         
136.145.115.196  221.219.225.133  6     80       64844    52          1         
136.145.62.155   109.105.242.199  6     35432    51413    60          1         
136.145.240.59   173.194.37.118   6     64488    443      2960        5         
136.145.215.1    136.145.230.222  1     0        2816     1064        19        
136.145.226.5    64.178.214.6     6     49598    443      40          1         
136.145.193.56   204.93.223.146   6     1435     80       40          1         
136.145.249.201  157.56.23.42     6     60301    443      2057        4         
72.21.91.79      136.145.144.211  6     80       50126    3147        5         
72.21.91.79      136.145.144.211  6     80       50125    2150        4         
136.145.95.2     121.97.142.136   17    49349    43112    58          1         
136.145.95.2     122.201.18.193   17    49349    23963    58          1 

Note: This is the output of flow-print from the flow-tools package. The columns are the source IP, destination IP, protocol, source port, destination port, octets, and packets.
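
The column layout in the note above can be made explicit by parsing one line into a named record. This is a minimal sketch: `FlowRecord` and `parse_flow` are hypothetical names, and it assumes every line has exactly the seven whitespace-separated columns shown in the listing.

```python
from collections import namedtuple

# Field names follow the column order given in the note (names are illustrative)
FlowRecord = namedtuple(
    "FlowRecord",
    ["src_ip", "dst_ip", "protocol", "src_port", "dst_port", "octets", "packets"],
)


def parse_flow(line):
    """Parse one flow-print line into a FlowRecord."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 columns, got %d" % len(fields))
    # The two addresses stay as strings; the remaining columns are integers
    numbers = [int(field) for field in fields[2:]]
    return FlowRecord(fields[0], fields[1], *numbers)


sample = "136.145.155.233  200.125.49.166   6     80       60239    52          1"
flow = parse_flow(sample)
print(flow.src_ip, flow.dst_ip, flow.octets)  # 136.145.155.233 200.125.49.166 52
```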

A simple solution using Disco

The map function


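def map(line, params):
    # Split the flow-print line into its whitespace-separated columns
    data = line.split()
    # Emit the flow's octets (column 5) once for the source IP
    # and once for the destination IP
    yield data[0], int(data[5])
    yield data[1], int(data[5])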
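
To see what the map step emits, the same logic can be run locally on two lines from the sample listing, without a Disco cluster. In this sketch the function is renamed `flow_map` (a hypothetical name) only to avoid shadowing Python's built-in `map`; the unused `params` argument is kept to mirror the signature above.

```python
def flow_map(line, params=None):
    # Same logic as the map function above
    data = line.split()
    yield data[0], int(data[5])
    yield data[1], int(data[5])


# Two flows from the sample listing: both directions of one connection
lines = [
    "167.8.226.10     136.145.33.149   6     80       51965    52          1",
    "136.145.33.149   167.8.226.10     6     51965    80       40          1",
]

pairs = [pair for line in lines for pair in flow_map(line)]
for pair in pairs:
    print(pair)
# ('167.8.226.10', 52)
# ('136.145.33.149', 52)
# ('136.145.33.149', 40)
# ('167.8.226.10', 40)
```

Each flow contributes its octets to both endpoints, so the final total per IP is incoming plus outgoing traffic.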

The reduce function


def reduce(iter, params):
    from disco.util import kvgroup
    # Sort the (ip, octets) pairs so equal IPs are adjacent, then sum per IP
    for ip, traffic in kvgroup(sorted(iter)):
        yield ip, sum(traffic)
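
The reduce step can also be tried locally. The sketch below replaces `kvgroup` with a stand-in built on `itertools.groupby`; it assumes `kvgroup` groups the values of consecutive equal keys, which is why the input is sorted first. `local_kvgroup` and `local_reduce` are hypothetical names, not part of Disco.

```python
from itertools import groupby
from operator import itemgetter


def local_kvgroup(pairs):
    # Stand-in for disco.util.kvgroup: group values of consecutive equal keys
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, (value for _, value in group)


def local_reduce(pairs):
    # Sorting puts all pairs for the same IP next to each other
    for ip, traffic in local_kvgroup(sorted(pairs)):
        yield ip, sum(traffic)


# (ip, octets) pairs as a map step would emit them for two sample flows
pairs = [
    ("167.8.226.10", 52),
    ("136.145.33.149", 52),
    ("136.145.33.149", 40),
    ("167.8.226.10", 40),
]

totals = list(local_reduce(pairs))
print(totals)  # [('136.145.33.149', 92), ('167.8.226.10', 92)]
```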

Creating the job and linking the map and reduce functions


from disco.core import Job, result_iterator

if __name__ == '__main__':
    job = Job().run(input=["file:///bccd/home/jortiz/netflow-print.txt"],
                    map=map,
                    reduce=reduce)

    # Print the input/output traffic per IP
    for ip, traffic in result_iterator(job.wait(show=True)):
        print(ip, traffic)
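
Once the job has produced one total per IP, the result can be filtered further. The sketch below keeps only the addresses inside a given network and lists the heaviest first. The name `top_talkers` is hypothetical, the `136.145.0.0/16` prefix is an assumption read off the sample listing, and the totals are the sums for four addresses from that listing. It uses the Python 3 `ipaddress` module.

```python
import ipaddress


def top_talkers(totals, network, limit=10):
    """Return up to `limit` (ip, octets) pairs inside `network`, heaviest first."""
    net = ipaddress.ip_network(network)
    inside = [(ip, octets) for ip, octets in totals
              if ipaddress.ip_address(ip) in net]
    return sorted(inside, key=lambda pair: pair[1], reverse=True)[:limit]


# Per-IP totals for four addresses, summed from the sample listing
totals = [
    ("72.21.91.79", 5297),
    ("136.145.144.211", 5297),
    ("136.145.240.59", 2960),
    ("173.194.37.118", 3000),
]

local_top = top_talkers(totals, "136.145.0.0/16")
print(local_top)  # [('136.145.144.211', 5297), ('136.145.240.59', 2960)]
```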

References

NetFlow (RFC 3954), http://www.ietf.org/rfc/rfc3954.txt

Disco project, http://discoproject.org/

Python, http://www.python.org/

MapReduce (Wikipedia), http://en.wikipedia.org/wiki/MapReduce