CYB-674 Cyber Data Fusion

School Name: Utica College

Title: Final Project

Student Name: Henry J. Hu

Professor Name: Nikolas Rebovich

Due Date: December 18th, 2020

Project Description:

Analytical Project Assignment

Detecting network intrusions

Associated data set and document

network_traffic.csv
data_description.txt

Data description

XYZ Bank is a large and profitable bank in Saint Louis, Missouri. Like any large corporation, XYZ Bank has a very large and intricate infrastructure that supports its networking system. A network analyst recently discovered unusual network activity. Then, poring over a year's worth of logs, the team of analysts discovered many instances of anomalous network activity that resulted in significant sums of money being siphoned from bank accounts. The Chief Networking Officer has come to our group for help in developing a system that can automatically detect and warn of such known, as well as other unknown, anomalous network activities.

The network_traffic.csv file is a synopsis of logged network activity. It contains labeled examples of benign network sessions as well as examples of sessions involving intrusions. It is important to note that it is likely that there exist many different intrusion types in the data, but we will treat all intrusions as the same. The data_description.txt file provides explanations of each of the attributes found in the network_traffic dataset.

This is a short description of the features contained in the network_traffic dataset.

feature name description type
duration length (number of seconds) of the connection continuous
protocol_type type of protocol, e.g. tcp, udp, etc. discrete
service network service on the destination, e.g., http, telnet, etc. discrete
src_bytes number of data bytes from source to destination continuous
dst_bytes number of data bytes from destination to source continuous
flag normal or error status of the connection discrete
land 1 if connection is from/to the same host/port; 0 otherwise discrete
wrong_fragment number of "wrong" fragments continuous
urgent number of urgent packets continuous
hot number of "hot" indicators continuous
num_failed_logins number of failed login attempts continuous
logged_in 1 if successfully logged in; 0 otherwise discrete
num_compromised number of "compromised" conditions continuous
root_shell 1 if root shell is obtained; 0 otherwise discrete
su_attempted 1 if "su root" command attempted; 0 otherwise discrete
num_root number of "root" accesses continuous
num_file_creations number of file creation operations continuous
num_shells number of shell prompts continuous
num_access_files number of operations on access control files continuous
num_outbound_cmds number of outbound commands in an ftp session continuous
is_host_login 1 if the login belongs to the "hot" list; 0 otherwise discrete
is_guest_login 1 if the login is a "guest'' login; 0 otherwise discrete
is_intrusion 1 if the session resulted in an intrusion; 0 otherwise discrete

Tasks

  1. The overall goal is to use data analysis techniques to identify differences between benign and malicious network data. Each section pairs code that performs the analysis with a written portion describing what we are doing with the data and why we are doing it. Someone without a coding background should be able to understand our project.

  2. Import and Explore the data. Give an overview of the data structure, how it is organized, and statistical summaries.

  3. Clean the data. Identify data elements that are incorrect and decide how to replace them. Identify any nulls in the data and clean them appropriately.

  4. Using our data summaries, identify data to visualize to better understand the difference between intrusion and benign data. Use effective visualization techniques to illustrate our analysis.

  5. Write a conclusion that brings all our analysis together. Outline the techniques we used to support our conclusions.

1. Import and Explore the Data

We will be using a data analysis tool called Jupyter Notebook to carry out this exercise. The purpose of this section is to import the network data set into Jupyter Notebook and explore it.

1.A. Import libraries into Jupyter Notebook

In this section, we import the libraries necessary for this project into Jupyter Notebook. These imports will allow us to reference these libraries via their aliases of pd, np, plt and sns.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1.B. Import the network data set into Jupyter Notebook

In this section, we import the network security data into Jupyter Notebook. This import loads the data into a pandas dataframe, which allows for quick data access and processing.

In [4]:
network_data_df = pd.read_csv('C:/Users/henry/Henry_J_Hu/network_data_CYB_674.csv')
network_data_df
Out[4]:
duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot ... root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_host_login is_guest_login is_intrusion
0 NaN udp private SF 105 146 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 0.0 udp private SF 105 105 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
2 0.0 udp private NaN 105 146 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 0.0 udp private SF 105 146 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 0.0 udp private SF 105 147 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
694 0.0 udp domain_u SF 35 91 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
695 0.0 udp domain_u SF 44 44 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
696 0.0 tcp http REJ 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
697 0.0 udp domain_u SF 35 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
698 21.0 tcp ftp SF 239 777 0 0 0 4 ... 0 0 0 0 0 0 0 0 1 0

699 rows × 23 columns

1.C. Find the shape of the dataframe

In this section, we output the shape of the dataframe. The shape gives the dimensions of the dataframe. This exercise allows us to see the total number of rows and the total number of fields the dataframe has.

In [5]:
network_data_df.shape
Out[5]:
(699, 23)

1.D. Find the data field types

In this section, we output the data type of each data field in the dataframe. The purpose of this exercise is so that we can see the data type pandas has inferred for each field.

In [6]:
network_data_df.dtypes
Out[6]:
duration              float64
protocol_type          object
service                object
flag                   object
src_bytes               int64
dst_bytes               int64
land                    int64
wrong_fragment          int64
urgent                  int64
hot                     int64
num_failed_logins       int64
logged_in               int64
num_compromised         int64
root_shell              int64
su_attempted            int64
num_root                int64
num_file_creations      int64
num_shells              int64
num_access_files        int64
num_outbound_cmds       int64
is_host_login           int64
is_guest_login          int64
is_intrusion            int64
dtype: object

1.E. Set standard output format

The purpose of this code is to set the floating point number precision in the output dataframe to two decimal places.

In [7]:
pd.set_option('display.float_format', lambda x: '%.2f' % x) 

1.F. Produce a statistical summary of the dataframe

In this section, we output the statistical summary of the dataframe. The purpose of this exercise is so that we can see the central tendency and dispersion of the data, as well as possible outliers in the data.

In [8]:
network_data_df.describe(include = "all")
Out[8]:
duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot ... root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_host_login is_guest_login is_intrusion
count 683.00 699 699 695 699.00 699.00 699.00 699.00 699.00 699.00 ... 699.00 699.00 699.00 699.00 699.00 699.00 699.00 699.00 699.00 699.00
unique nan 3 12 5 nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
top nan tcp http SF nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
freq nan 532 365 581 nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
mean 190.05 NaN NaN NaN 18032.05 1806.35 0.00 0.00 0.00 0.15 ... 0.00 0.00 1.43 0.00 0.00 0.01 0.00 0.00 0.05 0.43
std 824.38 NaN NaN NaN 59040.02 8271.11 0.00 0.00 0.00 1.07 ... 0.04 0.08 36.88 0.00 0.00 0.19 0.00 0.00 0.22 0.50
min 0.00 NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 0.00 NaN NaN NaN 105.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 0.00 NaN NaN NaN 217.00 147.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
75% 1.00 NaN NaN NaN 330.50 760.50 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
max 15122.00 NaN NaN NaN 283618.00 176690.00 0.00 0.00 0.00 25.00 ... 1.00 2.00 975.00 0.00 0.00 5.00 0.00 0.00 1.00 1.00

11 rows × 23 columns

1.G. See which fields have null values

In this section, we check the dataframe to see which fields have null values. The purpose of this exercise is so that we can determine which fields need to be remediated.

In [9]:
network_data_df.isnull().any()
Out[9]:
duration               True
protocol_type         False
service               False
flag                   True
src_bytes             False
dst_bytes             False
land                  False
wrong_fragment        False
urgent                False
hot                   False
num_failed_logins     False
logged_in             False
num_compromised       False
root_shell            False
su_attempted          False
num_root              False
num_file_creations    False
num_shells            False
num_access_files      False
num_outbound_cmds     False
is_host_login         False
is_guest_login        False
is_intrusion          False
dtype: bool

1.H. See how many null values each field of the dataframe has

In this section, we calculate the number of null values in each field of the dataframe. The purpose of this exercise is so that we can see the severity of missing values in each field.

In [10]:
network_data_df.isnull().sum()
Out[10]:
duration              16
protocol_type          0
service                0
flag                   4
src_bytes              0
dst_bytes              0
land                   0
wrong_fragment         0
urgent                 0
hot                    0
num_failed_logins      0
logged_in              0
num_compromised        0
root_shell             0
su_attempted           0
num_root               0
num_file_creations     0
num_shells             0
num_access_files       0
num_outbound_cmds      0
is_host_login          0
is_guest_login         0
is_intrusion           0
dtype: int64

1.I. See the total number of null values in the dataframe

In this section, we output the total number of all identified null values in the dataframe. The purpose of this exercise is so that we can gauge the severity of missing values across the entire dataframe.

In [11]:
network_data_df.isnull().sum().sum()
Out[11]:
20

1.J. Check to see if the text strings 'nan' or 'NaN' exist in the dataframe

In [12]:
#Making sure there is no existence of string value 'nan' in the dataframe
network_data_df.eq('nan', axis=0).sum().sum()
Out[12]:
0
In [13]:
#Making sure there is no existence of string value 'NaN' in the dataframe
network_data_df.eq('NaN', axis=0).sum().sum()
Out[13]:
0

2. Clean the data

In this section, based on what we found in section 1 above, we remove uninformative fields from the dataframe and replace missing or incorrect values with meaningful ones. The purpose of this exercise is to prepare the data for further analysis and data visualization.

2.A. View the first and last few rows of the dataframe

In this section, we inspect the first and last few rows of the dataframe to make sure we have spotted all the data anomalies to be remediated or removed.

In [14]:
network_data_df.head()
Out[14]:
duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot ... root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_host_login is_guest_login is_intrusion
0 nan udp private SF 105 146 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 0.00 udp private SF 105 105 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
2 0.00 udp private NaN 105 146 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 0.00 udp private SF 105 146 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 0.00 udp private SF 105 147 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 23 columns

In [15]:
network_data_df.tail()
Out[15]:
duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot ... root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_host_login is_guest_login is_intrusion
694 0.00 udp domain_u SF 35 91 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
695 0.00 udp domain_u SF 44 44 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
696 0.00 tcp http REJ 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
697 0.00 udp domain_u SF 35 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
698 21.00 tcp ftp SF 239 777 0 0 0 4 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 23 columns

2.B. Remove empty, unnecessary or corrupted data fields from the dataframe

In this section, based on what we found in section 1 above, we remove all empty or unnecessary data fields from the dataframe. The purpose of this exercise is to rid the dataframe of data that cannot inform the analysis. The maximum values, minimum values, and standard deviations of 0 reported by the describe() function above reveal that these data fields hold only values of 0. Therefore, these data fields contribute nothing to the analysis and should be excluded from the dataframe.
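As a cross-check on reading the describe() output by eye, the all-zero fields can also be identified programmatically. A minimal sketch on a small synthetic frame (the column names here are illustrative, not the full dataset):

```python
import pandas as pd

# Illustrative frame: 'land' and 'urgent' contain only zeros,
# 'src_bytes' does not.
df = pd.DataFrame({
    'src_bytes': [105, 0, 239],
    'land':      [0, 0, 0],
    'urgent':    [0, 0, 0],
})

# A numeric column is droppable when every one of its values equals 0.
numeric = df.select_dtypes('number')
zero_cols = numeric.columns[(numeric == 0).all()].tolist()
print(zero_cols)  # the columns holding only zeros

# Drop them in a single call.
df = df.drop(columns=zero_cols)
print(list(df.columns))
```

Run on the real dataframe, this would recover the same list of fields that the manual inspection of describe() produced.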

In [16]:
network_data_df = network_data_df.drop(
    columns=['land', 'wrong_fragment', 'urgent', 'num_file_creations',
             'num_shells', 'num_outbound_cmds', 'is_host_login'])
network_data_df
Out[16]:
duration protocol_type service flag src_bytes dst_bytes hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_access_files is_guest_login is_intrusion
0 nan udp private SF 105 146 0 0 0 0 0 0 0 0 0 1
1 0.00 udp private SF 105 105 0 0 0 0 0 0 0 0 0 1
2 0.00 udp private NaN 105 146 0 0 0 0 0 0 0 0 0 1
3 0.00 udp private SF 105 146 0 0 0 0 0 0 0 0 0 1
4 0.00 udp private SF 105 147 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
694 0.00 udp domain_u SF 35 91 0 0 0 0 0 0 0 0 0 0
695 0.00 udp domain_u SF 44 44 0 0 0 0 0 0 0 0 0 0
696 0.00 tcp http REJ 0 0 0 0 0 0 0 0 0 0 0 0
697 0.00 udp domain_u SF 35 0 0 0 0 0 0 0 0 0 0 0
698 21.00 tcp ftp SF 239 777 4 0 1 0 0 0 0 0 1 0

699 rows × 16 columns

2.C. Identify data elements that are incorrect and decide how to replace them. Identify any nulls in the data and clean them appropriately.

In this section, based on the fields with null or incorrect values identified in section 1 above, we replace these anomalies with meaningful values. The purpose of this exercise is to remediate the identified anomalies.

2.C.1. Remediate object variables

In this section, we remediate the null or incorrect values in the object variables. The newly remediated variables will no longer have these anomalies.

In [17]:
# Replacing the missing value with the text string 'unknown'
network_data_df['flag'] = network_data_df['flag'].fillna('unknown')
network_data_df
Out[17]:
duration protocol_type service flag src_bytes dst_bytes hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_access_files is_guest_login is_intrusion
0 nan udp private SF 105 146 0 0 0 0 0 0 0 0 0 1
1 0.00 udp private SF 105 105 0 0 0 0 0 0 0 0 0 1
2 0.00 udp private unknown 105 146 0 0 0 0 0 0 0 0 0 1
3 0.00 udp private SF 105 146 0 0 0 0 0 0 0 0 0 1
4 0.00 udp private SF 105 147 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
694 0.00 udp domain_u SF 35 91 0 0 0 0 0 0 0 0 0 0
695 0.00 udp domain_u SF 44 44 0 0 0 0 0 0 0 0 0 0
696 0.00 tcp http REJ 0 0 0 0 0 0 0 0 0 0 0 0
697 0.00 udp domain_u SF 35 0 0 0 0 0 0 0 0 0 0 0
698 21.00 tcp ftp SF 239 777 4 0 1 0 0 0 0 0 1 0

699 rows × 16 columns

2.C.2. Remediate integer and float variables

In this section, we remediate the null or incorrect values in the integer and float variables. The newly remediated variables will no longer have these anomalies.

In [18]:
# Replacing the missing value with the mean
network_data_df["duration"] = network_data_df["duration"].fillna(network_data_df["duration"].mean())
network_data_df
Out[18]:
duration protocol_type service flag src_bytes dst_bytes hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_access_files is_guest_login is_intrusion
0 190.05 udp private SF 105 146 0 0 0 0 0 0 0 0 0 1
1 0.00 udp private SF 105 105 0 0 0 0 0 0 0 0 0 1
2 0.00 udp private unknown 105 146 0 0 0 0 0 0 0 0 0 1
3 0.00 udp private SF 105 146 0 0 0 0 0 0 0 0 0 1
4 0.00 udp private SF 105 147 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
694 0.00 udp domain_u SF 35 91 0 0 0 0 0 0 0 0 0 0
695 0.00 udp domain_u SF 44 44 0 0 0 0 0 0 0 0 0 0
696 0.00 tcp http REJ 0 0 0 0 0 0 0 0 0 0 0 0
697 0.00 udp domain_u SF 35 0 0 0 0 0 0 0 0 0 0 0
698 21.00 tcp ftp SF 239 777 4 0 1 0 0 0 0 0 1 0

699 rows × 16 columns

2.C.3. Double checking the remediated variables to make sure they no longer have null or incorrect values

In this section, we carry out a sanity check to make sure that all null or incorrect values were properly remediated. This exercise is just to ensure that our work was properly carried out.

In [19]:
network_data_df.describe(include = "all")
Out[19]:
duration protocol_type service flag src_bytes dst_bytes hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_access_files is_guest_login is_intrusion
count 699.00 699 699 699 699.00 699.00 699.00 699.00 699.00 699.00 699.00 699.00 699.00 699.00 699.00 699.00
unique nan 3 12 6 nan nan nan nan nan nan nan nan nan nan nan nan
top nan tcp http SF nan nan nan nan nan nan nan nan nan nan nan nan
freq nan 532 365 581 nan nan nan nan nan nan nan nan nan nan nan nan
mean 190.05 NaN NaN NaN 18032.05 1806.35 0.15 0.00 0.59 1.26 0.00 0.00 1.43 0.01 0.05 0.43
std 814.87 NaN NaN NaN 59040.02 8271.11 1.07 0.00 0.49 33.44 0.04 0.08 36.88 0.19 0.22 0.50
min 0.00 NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 0.00 NaN NaN NaN 105.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 0.00 NaN NaN NaN 217.00 147.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
75% 1.00 NaN NaN NaN 330.50 760.50 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
max 15122.00 NaN NaN NaN 283618.00 176690.00 25.00 0.00 1.00 884.00 1.00 2.00 975.00 5.00 1.00 1.00
In [20]:
network_data_df.isnull().sum()
Out[20]:
duration             0
protocol_type        0
service              0
flag                 0
src_bytes            0
dst_bytes            0
hot                  0
num_failed_logins    0
logged_in            0
num_compromised      0
root_shell           0
su_attempted         0
num_root             0
num_access_files     0
is_guest_login       0
is_intrusion         0
dtype: int64
In [21]:
network_data_df.flag.value_counts()
Out[21]:
SF         581
RSTR        64
S0          33
REJ         15
unknown      4
S3           2
Name: flag, dtype: int64
In [22]:
network_data_df.duration.value_counts()
Out[22]:
0.00       508
280.00      30
282.00      22
1.00        20
190.05      16
          ... 
803.00       1
809.00       1
938.00       1
2077.00      1
2595.00      1
Name: duration, Length: 64, dtype: int64
In [23]:
network_data_df.src_bytes.value_counts()
Out[23]:
105       120
0          48
12         37
283618     30
160        12
         ... 
305         1
303         1
301         1
300         1
280         1
Name: src_bytes, Length: 252, dtype: int64
In [24]:
network_data_df.dst_bytes.value_counts()
Out[24]:
0        216
146       59
147       39
105       14
597       10
        ... 
526        1
505        1
501        1
1524       1
13700      1
Name: dst_bytes, Length: 287, dtype: int64

3. Summarizing, subsetting, and plotting the data for visualization

In this section, we summarize and subset the data that we cleaned in section 2 above. We then plot the results on various plots, graphs, and charts so that they can easily be visualized, interpreted, and understood by the audience. For this exercise, we did not include continuous variables with standard deviations of less than 1, because no meaningful analysis can be carried out when values vary so little. We also chose to summarize the data by only two categorical variables, 'service' and 'is_intrusion'. We chose 'service' because it has the largest number of categories, which allows us to draw the most distinctions between benign and malicious connections. We chose 'is_intrusion' because it directly separates benign connections from malicious ones. We excluded the other categorical variables from our summaries because these two were sufficient to distinguish benign from malicious connections.
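The "standard deviation below 1" screening described above can be expressed directly in pandas rather than read off the summary table. A sketch on synthetic data (the column names and values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    'duration':   [0.0, 280.0, 0.0, 21.0],  # high variance, keep
    'root_shell': [0, 0, 0, 1],             # near-constant, exclude
})

# Standard deviation of each numeric column.
stds = df.std(numeric_only=True)

# Columns whose std falls below the chosen threshold of 1.
low_variance = stds[stds < 1].index.tolist()
print(low_variance)
```

Applied to the cleaned dataframe, this would list the continuous fields excluded from the visualizations below.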

3.A. Exploring relationships between continuous variables

We used the pairwise scatterplots below to explore the relationships between the continuous variables. We could not use seaborn's pairplot() because it crashed our Jupyter Notebook.

In [25]:
intrusion = network_data_df['is_intrusion'] == 1
benign = network_data_df['is_intrusion'] == 0
In [26]:
g = sns.lmplot(data=network_data_df, x="duration", y="src_bytes", hue="is_intrusion")
plt.xlabel("Duration (in seconds)")
plt.ylabel("Source Bytes")
plt.title("Source Bytes Vs. Duration" , fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
In [27]:
g = sns.lmplot(data=network_data_df, x="duration", y="dst_bytes", hue="is_intrusion")
plt.xlabel("Duration (in seconds)")
plt.ylabel("Destination Bytes")
plt.title("Destination Bytes Vs. Duration", fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
In [28]:
g = sns.lmplot(data=network_data_df, x="duration", y="num_root", hue="is_intrusion")
plt.xlabel("Duration (in seconds)")
plt.ylabel("Number of Root Access")
plt.title("Number of Root Access Vs. Duration", fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
In [29]:
g = sns.lmplot(data=network_data_df, x="src_bytes", y="dst_bytes", hue="is_intrusion")
plt.xlabel("Source Bytes")
plt.ylabel("Destination Bytes")
plt.title("Destination Bytes Vs. Source Bytes", fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
In [30]:
g = sns.lmplot(data=network_data_df, x="src_bytes", y="num_root", hue="is_intrusion")
plt.xlabel("Source Bytes")
plt.ylabel("Number of Root Access")
plt.title("Number of Root Access\n Vs.\n Source Bytes", fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
In [31]:
g = sns.lmplot(data=network_data_df, x="dst_bytes", y="num_root", hue="is_intrusion")
plt.xlabel("Destination Bytes")
plt.ylabel("Number of Root Access")
plt.xticks(rotation="90")
plt.title("Number of Root Access\n Vs.\n Destination Bytes", fontsize=17)

# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
Section Short Summary

As a result of exploring the relationships between the four chosen continuous variables (duration, src_bytes, dst_bytes, and num_root), there does not appear to be any significant correlation among them.
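The visual impression of weak association can be checked numerically with a correlation matrix. A hedged sketch on a small synthetic frame; the real call would simply be network_data_df[['duration', 'src_bytes', 'dst_bytes', 'num_root']].corr():

```python
import pandas as pd

# Illustrative values only, loosely echoing the dataset's columns.
df = pd.DataFrame({
    'duration':  [0.0, 280.0, 1.0, 21.0, 0.0],
    'src_bytes': [105, 283618, 12, 239, 0],
    'dst_bytes': [146, 0, 0, 777, 91],
    'num_root':  [0, 0, 0, 0, 5],
})

# Pairwise Pearson correlations; values near 0 indicate weak
# linear association, supporting the scatterplot reading above.
corr = df.corr()
print(corr.round(2))
```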

3.B. Summarizing and subsetting by count

We summarized and subsetted the data by counting the number of connections under each categorical value or interval of continuous values, and by intrusion type. Then, we used bar plots and histograms to visualize these results. Either a categorical variable or a continuous variable can appear on the x-axis of these plots.

In [32]:
sns.histplot(data=network_data_df, x = "is_intrusion", bins=3, hue='is_intrusion', stat = 'probability', multiple="stack", edgecolor=".3", linewidth=1)  
plt.title("Connection Type Distribution", fontsize=17)
plt.xlabel("Connection Type")
plt.ylabel("Connection %")
plt.yticks(np.arange(0,0.6,0.03))
plt.xticks([0.17,0.84],['Intrusion', 'Benign'])
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
In [33]:
connections_by_service_type = network_data_df.value_counts(['service','is_intrusion']) 
connections_by_service_type
Out[33]:
service   is_intrusion
http      0               265
private   1               100
http      1               100
ftp_data  1                67
smtp      0                41
ftp       1                33
domain_u  0                27
ftp_data  0                21
private   0                20
other     0                 9
ntp_u     0                 6
eco_i     0                 4
ftp       0                 3
urp_i     0                 1
telnet    0                 1
finger    0                 1
dtype: int64
In [34]:
connections_by_service_type_df = connections_by_service_type.unstack()
connections_by_service_type_df
Out[34]:
is_intrusion 0 1
service
domain_u 27.00 nan
eco_i 4.00 nan
finger 1.00 nan
ftp 3.00 33.00
ftp_data 21.00 67.00
http 265.00 100.00
ntp_u 6.00 nan
other 9.00 nan
private 20.00 100.00
smtp 41.00 nan
telnet 1.00 nan
urp_i 1.00 nan
In [35]:
plt.bar(connections_by_service_type_df.index ,connections_by_service_type_df[1], color='orange', alpha = 1,label = "Intrusion", edgecolor='black', bottom = connections_by_service_type_df[0])
plt.bar(connections_by_service_type_df.index ,connections_by_service_type_df[0], color='blue', alpha = 1,label = "Benign", edgecolor='black')
plt.xlabel("Service Type")
plt.ylabel("Number of Connections")
plt.xticks(connections_by_service_type_df.index, connections_by_service_type_df.index, rotation="90")
plt.yticks(np.arange(0,400,20))
plt.title("Connection Demographic\n by\n Service Type and Connection Type", fontsize=17)
plt.legend(bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
In [36]:
sns.histplot(data=network_data_df, x = "duration", bins=30, hue='is_intrusion', multiple="stack", edgecolor=".3", linewidth=1) 
plt.title("Duration Distribution\n by\n Connection Type", fontsize=17)
plt.xlabel("Duration (number of seconds)")
plt.ylabel("Number of Connections")
plt.xticks(np.arange(0,16000,500), rotation="90")
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(True)
plt.show()
In [37]:
sns.histplot(data=network_data_df, x = "src_bytes", bins=30, hue='is_intrusion', multiple="stack", edgecolor=".3", linewidth=1)   
plt.title("Source Bytes Distribution\n by\n Connection Type", fontsize=17)
plt.xlabel("Source Bytes")
plt.ylabel("Number of Connections")
plt.xticks(np.arange(0,300000,10000), rotation="90")
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(True)
plt.show()
In [38]:
sns.histplot(data=network_data_df, x = "dst_bytes", bins=30, hue='is_intrusion', multiple="stack", edgecolor=".3", linewidth=1)   
plt.title("Destination Bytes Distribution\n by\n Connection Type", fontsize=17)
plt.xlabel("Destination Bytes")
plt.ylabel("Number of Connections")
plt.xticks(np.arange(0,190000,10000), rotation="90")
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(True)
plt.show()
In [39]:
sns.histplot(data=network_data_df, x = "num_root", bins=30, hue='is_intrusion', multiple="stack", edgecolor=".3", linewidth=1)   
plt.title("Number of Root Access Distribution\n by\n Connection Type", fontsize=17)
plt.xlabel("Number of Root Access")
plt.ylabel("Number of Connections")
plt.xticks(np.arange(0,1100,35), rotation="90")
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(True)
plt.show()
Section Short Summary

Plotting the summaries reveals that, across the entire dataframe, benign connections outnumber malicious ones by 14 percentage points: benign connections make up 57% of all connections, while malicious connections make up only 43%. It also reveals that while benign connections appear under all network service types, malicious connections appear only under the service types ftp, ftp_data, http, and private, occurring most frequently with http and private. A majority of malicious connections last between 0 and 500 seconds, have source byte sizes between 0 and 10,000 bytes, and have destination byte sizes between 0 and 5,000 bytes. Last, all malicious connections have between 0 and 30 root accesses.
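Class-balance percentages like those quoted above can be read directly from value_counts. A minimal sketch on synthetic labels (not the real is_intrusion column; the 4-to-3 split here is chosen only to mirror the 57%/43% shape):

```python
import pandas as pd

# Synthetic label column: 4 benign (0) and 3 malicious (1) sessions.
is_intrusion = pd.Series([0, 0, 0, 0, 1, 1, 1])

# normalize=True returns proportions rather than raw counts.
share = is_intrusion.value_counts(normalize=True).sort_index()
print((share * 100).round(1))  # percent of connections per class
```

On the real dataframe, network_data_df['is_intrusion'].value_counts(normalize=True) yields the 57%/43% split discussed above.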

3.C. Summarizing and subsetting by mean

We first summarized and subsetted the data by averaging the values of the chosen continuous variable, grouped by the value of the chosen categorical variable and by intrusion type. Then, we used the catplot() function of the seaborn library to visualize these results.
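The per-service averages that the plots below draw can also be tabulated with a groupby. A sketch on illustrative data (services and durations invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    'service':      ['http', 'http', 'ftp', 'ftp'],
    'is_intrusion': [0, 1, 0, 1],
    'duration':     [1.0, 300.0, 2.0, 20.0],
})

# Mean duration per (service, intrusion-type) pair; unstack() pivots
# is_intrusion into columns, one per connection type.
mean_duration = df.groupby(['service', 'is_intrusion'])['duration'].mean().unstack()
print(mean_duration)
```

The same table, computed on network_data_df, underlies each of the bar charts in this section.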

In [40]:
g = sns.catplot(data=network_data_df, kind="bar", x="service", y="duration", hue="is_intrusion", edgecolor=".3", linewidth=1)
plt.title('Average Duration\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Average Duration (in seconds)")
plt.xticks(rotation="90")
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(axis='y')
plt.show()
In [41]:
g = sns.catplot(data=network_data_df, kind="bar", x="service", y="src_bytes", hue="is_intrusion", edgecolor=".3", linewidth=1)
plt.title('Average Source Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Average Source Bytes")
plt.xticks(rotation="90")
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(axis='y')
plt.show()
In [42]:
g = sns.catplot(data=network_data_df, kind="bar", x="service", y="dst_bytes", hue="is_intrusion", edgecolor=".3", linewidth=1)
plt.title('Average Destination Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Average Destination Bytes")
plt.xticks(rotation=90)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(axis='y')
plt.show()
In [43]:
g = sns.catplot(data=network_data_df, kind="bar", x="service", y="num_root", hue="is_intrusion", edgecolor=".3", linewidth=1) 
plt.title('Average Number of Root Access\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Average Number of Root Access")
plt.xticks(rotation=90)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.15, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(axis='y')
plt.show()
Section Short Summary

Plotting these summaries reveals that malicious connections have the longest average duration under service type Http and the largest average source byte size under service type Ftp_data. On average, malicious connections last longer and carry more source bytes than benign connections. It also shows that, on average, both the destination bytes and the number of root accesses are nearly zero. Finally, while benign connections have their longest average duration under network service type Telnet, malicious connections have theirs under Http.

3.D. Summarizing and subsetting by max

We first summarized and subsetted the data by taking the maximum of the chosen continuous variable within each level of the chosen categorical variable and each connection type. Then we used the pandas plot() method, which draws through the Matplotlib library, to visualize these results.

In [44]:
network_data_df.groupby(['service','is_intrusion']).max()['duration'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Max Duration\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Max Duration (in seconds)")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
In [45]:
network_data_df.groupby(['service','is_intrusion']).max()['src_bytes'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Max Source Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Max Source Bytes")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
In [46]:
network_data_df.groupby(['service','is_intrusion']).max()['dst_bytes'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Max Destination Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Max Destination Bytes")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
In [47]:
network_data_df.groupby(['service','is_intrusion']).max()['num_root'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Max Number of Root Access\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Max Number of Root Access")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
Section Short Summary

Plotting these summaries reveals that malicious connections associated with service type Http have the largest maximum duration, and that malicious connections associated with service type Ftp_data transmit the largest maximum source byte size. It also shows that the destination bytes of malicious connections are detectable only under service types Ftp and Private. Moreover, while benign connections reach their maximum duration under network service type Telnet, malicious connections reach theirs under Ftp. Finally, both the destination byte sizes and the number of root accesses appear to have very little association with malicious connections.
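Each cell above selects one column after groupby(...).max(); passing the list of columns once yields all four maxima in a single table, which can be a more compact way to inspect the values. A sketch on synthetic stand-in data with the assumed columns of network_data_df:

```python
import pandas as pd

# Synthetic stand-in for network_data_df with the assumed columns.
network_data_df = pd.DataFrame({
    "service": ["http", "http", "ftp", "ftp"],
    "is_intrusion": [0, 1, 0, 1],
    "duration": [10, 200, 5, 50],
    "src_bytes": [100, 9000, 40, 300],
    "dst_bytes": [500, 0, 20, 0],
    "num_root": [0, 0, 0, 2],
})

# One groupby().max() over all four columns replaces four separate cells.
cols = ["duration", "src_bytes", "dst_bytes", "num_root"]
max_table = network_data_df.groupby(["service", "is_intrusion"])[cols].max()
print(max_table)
```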

3.E. Summarizing and subsetting by min

We first summarized and subsetted the data by taking the minimum of the chosen continuous variable within each level of the chosen categorical variable and each connection type. Then we used the pandas plot() method, which draws through the Matplotlib library, to visualize these results.

In [48]:
network_data_df.groupby(['service','is_intrusion']).min()['duration'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Min Duration\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Min Duration (in seconds)")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
In [49]:
network_data_df.groupby(['service','is_intrusion']).min()['src_bytes'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Min Source Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Min Source Bytes")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
In [50]:
network_data_df.groupby(['service','is_intrusion']).min()['dst_bytes'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Min Destination Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Min Destination Bytes")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
In [51]:
network_data_df.groupby(['service','is_intrusion']).min()['num_root'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Min Number of Root Access\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Min Number of Root Access")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
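The mean, max, and min aggregations of Sections 3.C through 3.E can also be produced together with a single agg() call, which makes side-by-side comparison of the three statistics easier. A minimal sketch on synthetic stand-in data with the assumed columns of network_data_df:

```python
import pandas as pd

# Synthetic stand-in for network_data_df with the assumed columns.
network_data_df = pd.DataFrame({
    "service": ["http", "http", "http", "ftp"],
    "is_intrusion": [1, 1, 0, 0],
    "duration": [100.0, 300.0, 10.0, 5.0],
})

# min, mean, and max for every service/connection-type group in one table.
summary = (network_data_df
           .groupby(["service", "is_intrusion"])["duration"]
           .agg(["min", "mean", "max"]))
print(summary)
```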