School Name: Utica College
Title: Final Project
Student Name: Henry J. Hu
Professor Name: Nikolas Rebovich
Due Date: December 18th, 2020
network_traffic.csv
data_description.txt
XYZ Bank is a large and profitable bank in Saint Louis, Missouri. Like any large corporation, XYZ Bank has a large and intricate infrastructure supporting its networking system. A network analyst recently discovered unusual network activity. Poring over a year's worth of logs, the team of analysts then found many instances of anomalous network activity that resulted in significant sums of money being siphoned from bank accounts. The Chief Networking Officer has come to our group for help in developing a system that can automatically detect and warn of such known, as well as other unknown, anomalous network activities.
The network_traffic.csv file is a synopsis of logged network activity. It contains labeled examples of benign network sessions as well as examples of sessions involving intrusions. It is important to note that it is likely that there exist many different intrusion types in the data, but we will treat all intrusions as the same. The data_description.txt file provides explanations of each of the attributes found in the network_traffic dataset.
This is a short description of the features contained in the network_traffic dataset.
feature name | description | type |
---|---|---|
duration | length (number of seconds) of the connection | continuous |
protocol_type | type of protocol, e.g. tcp, udp, etc. | discrete |
service | network service on the destination, e.g., http, telnet, etc. | discrete |
src_bytes | number of data bytes from source to destination | continuous |
dst_bytes | number of data bytes from destination to source | continuous |
flag | normal or error status of the connection | discrete |
land | 1 if connection is from/to the same host/port; 0 otherwise | discrete |
wrong_fragment | number of "wrong" fragments | continuous |
urgent | number of urgent packets | continuous |
hot | number of "hot" indicators | continuous |
num_failed_logins | number of failed login attempts | continuous |
logged_in | 1 if successfully logged in; 0 otherwise | discrete |
num_compromised | number of "compromised" conditions | continuous |
root_shell | 1 if root shell is obtained; 0 otherwise | discrete |
su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete |
num_root | number of "root" accesses | continuous |
num_file_creations | number of file creation operations | continuous |
num_shells | number of shell prompts | continuous |
num_access_files | number of operations on access control files | continuous |
num_outbound_cmds | number of outbound commands in an ftp session | continuous |
is_host_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete |
is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete |
is_intrusion | 1 if the session resulted in an intrusion; 0 otherwise | discrete |
The overall goal is to use data analysis techniques to identify differences between benign network sessions and malicious intrusions. Each section pairs code that performs the analysis with a written portion describing what we are doing with the data and why. Someone without a coding background should be able to understand our project.
1. Import and explore the data. Give an overview of the data structure, how it is organized, and statistical summaries.
2. Clean the data. Identify data elements that are incorrect and decide how to replace them. Identify any nulls in the data and clean them appropriately.
3. Using our data summaries, identify data to visualize to better understand the difference between intrusion and benign data. Use effective visualization techniques to illustrate our analysis.
4. Write a conclusion that brings all our analysis together. Outline the techniques we used to support our conclusions.
We will be using a data analysis tool called Jupyter Notebook to carry out this exercise. The purpose of this section is to import the network data set into Jupyter Notebook and explore it.
In this section, we import the libraries necessary for this project into Jupyter Notebook. These imports will allow us to reference these libraries via their aliases of pd, np, plt and sns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In this section, we import the network security data into Jupyter Notebook. Reading the CSV file into a pandas dataframe loads the data into memory and allows for quick access and processing.
network_data_df = pd.read_csv('C:/Users/henry/Henry_J_Hu/network_data_CYB_674.csv')
network_data_df
In this section, we output the shape of the dataframe, which is its dimensions. This exercise shows us the number of rows and the number of fields the dataframe contains.
network_data_df.shape
In this section, we output the data type of each field in the dataframe. The purpose of this exercise is to confirm that each field was read in with an appropriate data type.
network_data_df.dtypes
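A related call, info(), reports each column's dtype together with its non-null count in one table. This is an optional sketch added here for reference, not part of the original analysis.
# Optional sketch: dtype plus non-null count for every column in a single call
network_data_df.info()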
The purpose of this code is to set the floating point number precision in the output dataframe to two decimal places.
pd.set_option('display.float_format', lambda x: '%.2f' % x)
In this section, we output the statistical summary of the dataframe. The purpose of this exercise is to see the central tendency and dispersion of the data, as well as possible outliers.
network_data_df.describe(include = "all")
In this section, we check which fields in the dataframe contain null values. The purpose of this exercise is to determine which fields need remediation.
network_data_df.isnull().any()
In this section, we count the number of null values in each field of the dataframe. The purpose of this exercise is to gauge the severity of missing values in each field.
network_data_df.isnull().sum()
In this section, we output the total number of null values in the entire dataframe. The purpose of this exercise is to gauge the severity of missing values across the whole dataset.
network_data_df.isnull().sum().sum()
# Check that the string value 'nan' does not appear anywhere in the dataframe
network_data_df.eq('nan').sum().sum()
# Check that the string value 'NaN' does not appear anywhere in the dataframe
network_data_df.eq('NaN').sum().sum()
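Equivalently, both string variants could be checked in a single pass with isin(). This is an optional sketch, not part of the original checks.
# Optional sketch: count cells equal to either 'nan' or 'NaN' in one pass
network_data_df.isin(['nan', 'NaN']).sum().sum()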
In this section, based on what we found in section 1 above, we clear the dataframe of missing values and incorrect values of 0, and replace them with meaningful values. The purpose of this exercise is to prepare the data for further analysis and visualization.
In this section, we inspect the first and last rows of the dataframe to make sure we have spotted all the data anomalies to be remediated or removed.
network_data_df.head()
network_data_df.tail()
In this section, based on what we found in section 1 above, we remove all empty or unnecessary data fields from the dataframe. The purpose of this exercise is to declutter the dataframe. The maximum, minimum, and standard deviation values of 0 reported by the describe() function above reveal that these data fields hold only zeros. Therefore, these fields contribute nothing to the analysis and should be excluded from the dataframe.
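As a quick cross-check of that observation, the short sketch below (an addition, not part of the original notebook) lists every numeric column whose values are all zero; the result should match the fields dropped next.
# Sketch: numeric columns that contain only zeros, per the describe() output above
numeric_cols = network_data_df.select_dtypes(include='number')
print(list(numeric_cols.columns[(numeric_cols == 0).all()]))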
# Drop the all-zero fields identified above
columns_to_drop = ['land', 'wrong_fragment', 'urgent', 'num_file_creations',
                   'num_shells', 'num_outbound_cmds', 'is_host_login']
network_data_df = network_data_df.drop(columns=columns_to_drop)
network_data_df
In this section, based on the fields with null or incorrect values identified in section 1 above, we replace these anomalies with meaningful values. The purpose of this exercise is to remediate the identified anomalies.
In this section, we remediate the null or incorrect values in the object variables. The newly remediated variables will no longer have these anomalies.
# Replacing the missing value with the text string 'unknown'
network_data_df['flag'] = network_data_df['flag'].fillna('unknown')
network_data_df
In this section, we remediate the null or incorrect values in the integer and float variables. The newly remediated variables will no longer have these anomalies.
# Replacing the missing value with the mean
network_data_df["duration"].fillna(network_data_df["duration"].mean(), inplace = True)
network_data_df
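If more than one numeric field needed the same treatment, a generic loop like the sketch below could apply the mean-fill rule to every numeric column that still has missing values. It is illustrative only and assumes mean imputation is appropriate for each such column.
# Sketch: mean-fill any remaining numeric columns that still contain nulls (illustrative only)
for col in network_data_df.select_dtypes(include='number').columns:
    if network_data_df[col].isnull().any():
        network_data_df[col] = network_data_df[col].fillna(network_data_df[col].mean())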
In this section, we carry out a sanity check to make sure that all null or incorrect values were properly remediated. This exercise simply ensures that our work was carried out correctly.
network_data_df.describe(include = "all")
network_data_df.isnull().sum()
network_data_df.flag.value_counts()
network_data_df.duration.value_counts()
network_data_df.src_bytes.value_counts()
network_data_df.dst_bytes.value_counts()
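An automated check can complement the visual inspection above. The assertion below is an optional addition that raises an error if any nulls remain anywhere in the dataframe.
# Optional sketch: fail loudly if any nulls remain after remediation
assert network_data_df.isnull().sum().sum() == 0, "Null values remain in the dataframe"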
In this section, we summarize and subset the data that we cleaned and fixed in section 2 above. We then plot the results on various plots, graphs, and charts so they can easily be visualized, interpreted, and understood by the audience. For this exercise, we did not include continuous variables with standard deviations of less than 1, because no meaningful analysis can be carried out when the values have so little variance. We also chose to summarize the data by only two categorical variables, 'service' and 'is_intrusion'. We chose 'service' because it has the largest number of categories, which allows us to draw the most distinctions between benign and malicious connections. We chose 'is_intrusion' because it directly separates benign connections from malicious ones. We excluded the other categorical variables from our summaries because these two were sufficient to distinguish benign from malicious connections.
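To make the low-variance exclusion concrete, the short sketch below (an addition, not part of the original notebook) prints the standard deviation of every numeric column; the continuous fields with values below 1 are the ones left out of the plots that follow (the discrete 0/1 fields also show up in this list and can be ignored).
# Sketch: standard deviation of each numeric column; continuous fields below 1
# were excluded from the visualizations (discrete 0/1 fields also appear here)
numeric_std = network_data_df.select_dtypes(include='number').std()
print(numeric_std[numeric_std < 1])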
We use the pairwise scatterplots below to explore the relationships between the continuous variables. We could not use Seaborn's pairplot() because it crashed our Jupyter Notebook.
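As an alternative to a full pairplot (not attempted here), plotting a random sample of rows can keep the figure manageable on limited memory; the sample size below is purely illustrative.
# Alternative sketch: pairplot on a random sample to limit memory use (sample size is illustrative)
sample_df = network_data_df.sample(n=min(1000, len(network_data_df)), random_state=42)
sns.pairplot(sample_df, vars=["duration", "src_bytes", "dst_bytes", "num_root"], hue="is_intrusion")
plt.show()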
# Boolean masks separating intrusion and benign connections
intrusion = network_data_df['is_intrusion'] == 1
benign = network_data_df['is_intrusion'] == 0
g = sns.lmplot(data=network_data_df, x="duration", y="src_bytes", hue="is_intrusion")
plt.xlabel("Duration (in seconds)")
plt.ylabel("Source Bytes")
plt.title("Source Bytes Vs. Duration" , fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
g = sns.lmplot(data=network_data_df, x="duration", y="dst_bytes", hue="is_intrusion")
plt.xlabel("Duration (in seconds)")
plt.ylabel("Destination Bytes")
plt.title("Destination Bytes Vs. Duration", fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
g = sns.lmplot(data=network_data_df, x="duration", y="num_root", hue="is_intrusion")
plt.xlabel("Duration (in seconds)")
plt.ylabel("Number of Root Access")
plt.title("Number of Root Access Vs. Duration", fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
g = sns.lmplot(data=network_data_df, x="src_bytes", y="dst_bytes", hue="is_intrusion")
plt.xlabel("Source Bytes")
plt.ylabel("Destination Bytes")
plt.title("Destination Bytes Vs. Source Bytes", fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
g = sns.lmplot(data=network_data_df, x="src_bytes", y="num_root", hue="is_intrusion")
plt.xlabel("Source Bytes")
plt.ylabel("Number of Root Access")
plt.title("Number of Root Access\n Vs.\n Source Bytes", fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 0.93))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
g = sns.lmplot(data=network_data_df, x="dst_bytes", y="num_root", hue="is_intrusion")
plt.xlabel("Destination Bytes")
plt.ylabel("Number of Root Access")
plt.xticks(rotation="90")
plt.title("Number of Root Access\n Vs.\n Destination Bytes", fontsize=17)
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(True)
plt.show()
Exploring the relationships among the four chosen continuous variables (duration, src_bytes, dst_bytes, and num_root) suggests that there is no significant correlation between any of them.
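The visual impression above can also be checked numerically. The sketch below, an addition to the original analysis, prints the pairwise Pearson correlations among the four continuous variables and renders them as a heatmap.
# Sketch: pairwise Pearson correlations among the four continuous variables
corr = network_data_df[["duration", "src_bytes", "dst_bytes", "num_root"]].corr()
print(corr.round(2))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix of Continuous Variables")
plt.show()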
We summarized and subsetted the data by counting the number of connections within each category, or within each interval of the continuous variables, split by connection type. We then used bar plots and histograms to visualize these results; each plot's x-axis shows either a categorical variable or binned values of a continuous variable.
sns.histplot(data=network_data_df, x = "is_intrusion", bins=3, hue='is_intrusion', stat = 'probability', multiple="stack", edgecolor=".3", linewidth=1)
plt.title("Connection Type Distribution", fontsize=17)
plt.xlabel("Connection Type")
plt.ylabel("Connection %")
plt.yticks(np.arange(0,0.6,0.03))
plt.xticks([0.17,0.84],['Benign', 'Intrusion'])
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
connections_by_service_type = network_data_df.value_counts(['service','is_intrusion'])
connections_by_service_type
connections_by_service_type_df = connections_by_service_type.unstack()
connections_by_service_type_df
# Stacked bars: benign counts form the base, intrusion counts are stacked on top
plt.bar(connections_by_service_type_df.index, connections_by_service_type_df[1], color='orange', alpha=1, label="Intrusion", edgecolor='black', bottom=connections_by_service_type_df[0])
plt.bar(connections_by_service_type_df.index, connections_by_service_type_df[0], color='blue', alpha=1, label="Benign", edgecolor='black')
plt.xlabel("Service Type")
plt.ylabel("Number of Connections")
plt.xticks(connections_by_service_type_df.index, connections_by_service_type_df.index, rotation="90")
plt.yticks(np.arange(0,400,20))
plt.title("Connection Demographic\n by\n Service Type and Connection Type", fontsize=17)
plt.legend(bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
sns.histplot(data=network_data_df, x = "duration", bins=30, hue='is_intrusion', multiple="stack", edgecolor=".3", linewidth=1)
plt.title("Duration Distribution\n by\n Connection Type", fontsize=17)
plt.xlabel("Duration (number of seconds)")
plt.ylabel("Number of Connections")
plt.xticks(np.arange(0,16000,500), rotation="90")
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(True)
plt.show()
sns.histplot(data=network_data_df, x = "src_bytes", bins=30, hue='is_intrusion', multiple="stack", edgecolor=".3", linewidth=1)
plt.title("Source Bytes Distribution\n by\n Connection Type", fontsize=17)
plt.xlabel("Source Bytes")
plt.ylabel("Number of Connections")
plt.xticks(np.arange(0,300000,10000), rotation="90")
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(True)
plt.show()
sns.histplot(data=network_data_df, x = "dst_bytes", bins=30, hue='is_intrusion', multiple="stack", edgecolor=".3", linewidth=1)
plt.title("Destination Bytes Distribution\n by\n Connection Type", fontsize=17)
plt.xlabel("Destination Bytes")
plt.ylabel("Number of Connections")
plt.xticks(np.arange(0,190000,10000), rotation="90")
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(True)
plt.show()
sns.histplot(data=network_data_df, x = "num_root", bins=30, hue='is_intrusion', multiple="stack", edgecolor=".3", linewidth=1)
plt.title("Number of Root Access Distribution\n by\n Connection Type", fontsize=17)
plt.xlabel("Number of Root Access")
plt.ylabel("Number of Connections")
plt.xticks(np.arange(0,1100,35), rotation="90")
plt.legend(labels=['Intrusion','Benign'],bbox_to_anchor=(1.30, 1.02))
plt.grid(True)
plt.show()
Plotting these summaries reveals that, across the entire dataframe, benign connections outnumber malicious connections by 14 percentage points: benign connections make up 57% of all connections, while malicious connections make up only 43%. It also shows that while benign connections appear under every network service type, malicious connections appear only under the service types Ftp, Ftp_data, Http, and Private, and occur most frequently with Http and Private. A majority of malicious connections last between 0 and 500 seconds, have source byte sizes between 0 and 10,000 bytes, and destination byte sizes between 0 and 5,000 bytes. Finally, all malicious connections have between 0 and 30 root accesses.
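The proportions quoted above can be read directly from the data rather than off the chart. The sketch below is an optional addition that prints the benign/intrusion share and the service types that appear among intrusion sessions.
# Sketch: connection-type shares and the service types seen among intrusions
print(network_data_df["is_intrusion"].value_counts(normalize=True).round(2))
print(network_data_df.loc[network_data_df["is_intrusion"] == 1, "service"].unique())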
We first summarized and subsetted the data by averaging each chosen continuous variable, grouped by service type and by connection type. Then, we used the catplot() function of the Seaborn library to visualize these results.
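The same summary can also be produced in table form with a groupby. The sketch below (not part of the original notebook) computes the mean of each continuous variable by service type and connection type, mirroring the bar heights in the catplots that follow.
# Sketch: mean of each continuous variable by service type and connection type
mean_summary = network_data_df.groupby(["service", "is_intrusion"])[
    ["duration", "src_bytes", "dst_bytes", "num_root"]].mean().round(2)
print(mean_summary)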
g = sns.catplot(data=network_data_df, kind="bar", x="service", y="duration", hue="is_intrusion", edgecolor=".3", linewidth=1)
plt.title('Average Duration\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Average Duration (in seconds)")
plt.xticks(rotation="90")
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(axis='y')
plt.show()
g = sns.catplot(data=network_data_df, kind="bar", x="service", y="src_bytes", hue="is_intrusion", edgecolor=".3", linewidth=1)
plt.title('Average Source Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Average Source Bytes")
plt.xticks(rotation="90")
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(axis='y')
plt.show()
g = sns.catplot(data=network_data_df, kind="bar", x="service", y="dst_bytes", hue="is_intrusion", edgecolor=".3", linewidth=1)
plt.title('Average Destination Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Average Destination Bytes")
plt.xticks(rotation="90")
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.10, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(axis='y')
plt.show()
g = sns.catplot(data=network_data_df, kind="bar", x="service", y="num_root", hue="is_intrusion", edgecolor=".3", linewidth=1)
plt.title('Average Number of Root Access\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Average Number of Root Access")
plt.xticks(rotation="90")
# title
new_title = 'Connection Type'
g._legend.set_title(new_title)
# replace labels
new_labels = ['Benign', 'Intrusion']
# relocate legend
g._legend.set_bbox_to_anchor((1.15, 1.00))
for t, l in zip(g._legend.texts, new_labels): t.set_text(l)
plt.grid(axis='y')
plt.show()
Plotting these summaries reveals that malicious connections have the longest average duration with service type Http and the largest average source byte size with service type Ftp_data. On average, malicious connections have longer durations and larger source byte sizes than benign connections. The average destination bytes and root accesses, by contrast, are so small that they are barely visible on the plots. Finally, while benign connections have their longest average duration with service type Telnet, malicious connections have theirs with service type Http.
We next summarized and subsetted the data by taking the maximum of each chosen continuous variable, grouped by service type and by connection type. Then, we used the pandas plot() method, which draws with the Matplotlib library, to visualize these results.
network_data_df.groupby(['service','is_intrusion']).max()['duration'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Max Duration\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Max Duration (in seconds)")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
network_data_df.groupby(['service','is_intrusion']).max()['src_bytes'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Max Source Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Max Source Bytes")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
network_data_df.groupby(['service','is_intrusion']).max()['dst_bytes'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Max Destination Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Max Destination Bytes")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
network_data_df.groupby(['service','is_intrusion']).max()['num_root'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Max Number of Root Access\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Max Number of Root Access")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
Plotting these summaries reveals that malicious connections associated with service type Http have the largest maximum duration, and malicious connections associated with service type Ftp_data transmit the largest maximum source byte size. The destination byte sizes of malicious connections are noticeable only for service types Ftp and Private. Moreover, while benign connections reach their maximum duration with service type Telnet, malicious connections reach theirs with service type Http. Finally, both the destination byte sizes and the number of root accesses appear to have very little association with malicious connections.
We last summarized and subsetted the data by taking the minimum of each chosen continuous variable, grouped by service type and by connection type. Then, we used the pandas plot() method, which draws with the Matplotlib library, to visualize these results.
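As an alternative to separate groupby calls for the maxima and minima, a single agg() call can produce both summaries at once. The sketch below is illustrative and was not part of the original notebook.
# Sketch: minimum and maximum of each continuous variable in one pass
minmax_summary = network_data_df.groupby(["service", "is_intrusion"])[
    ["duration", "src_bytes", "dst_bytes", "num_root"]].agg(["min", "max"])
print(minmax_summary)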
network_data_df.groupby(['service','is_intrusion']).min()['duration'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Min Duration\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Min Duration (in seconds)")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
network_data_df.groupby(['service','is_intrusion']).min()['src_bytes'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Min Source Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Min Source Bytes")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
network_data_df.groupby(['service','is_intrusion']).min()['dst_bytes'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Min Destination Bytes\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Min Destination Bytes")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()
network_data_df.groupby(['service','is_intrusion']).min()['num_root'].unstack().plot(kind='bar', edgecolor='black')
plt.title('Min Number of Root Access\n by\n Service Type and Connection Type', fontsize=17)
plt.xlabel("Service Type")
plt.ylabel("Min Number of Root Access")
plt.legend(labels=['Benign','Intrusion'],bbox_to_anchor=(1.30, 1.02))
plt.grid(axis='y')
plt.show()