K-means clustering and its use cases in the security domain

Hello friends, hope you all are fantastic!!

In this article, I would like to tell you what clustering is, what K-means is, and how K-means helps to prevent many cyber attacks. So let's get started.

What is clustering in Machine Learning?

You have probably heard the terms supervised learning and unsupervised learning in Machine Learning. In unsupervised learning, the data has no labels, so there is no one to guide the ML model toward the right answer. Clustering is one of the techniques that comes under unsupervised learning. Clustering, or cluster analysis, is the grouping of the data points in a dataset into different groups called clusters, so that similar points end up in the same cluster.

What is K-means Clustering?

K-means is the name of an algorithm that automatically forms clusters, or groups, from a dataset using a distance measure such as the squared Euclidean distance. We only have to tell the algorithm how many clusters (K) to form. It then repeatedly assigns each data point to the nearest cluster center (centroid) and moves each centroid to the mean of the points assigned to it, until the groups stop changing. The process of forming clusters with the K-means algorithm is called K-means clustering.
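To make the assign-and-update idea concrete, here is a minimal sketch of K-means in plain Python. It is only an illustration under my own assumptions: the function names (`k_means`, `squared_distance`), the random initialization, and the toy 2-D points are all made up for this example, and a real project would normally use a tested library implementation instead.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Toy K-means: returns (centroids, cluster index for each point)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct points picked at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clusters are stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Made-up data: three points near (1, 1) and three points near (8.5, 9).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)     # cluster index per point; which group is 0 or 1 is arbitrary
print(centroids)
```

The first three points end up in one cluster and the last three in the other. The cluster numbers themselves carry no meaning, they are just names for the groups.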

So finally you can relate the explanation with the following diagram:

How does K-means clustering help in preventing cyber-attacks?

Let me explain this with the help of an example. Suppose we have hosted a website on our web server. Whenever anyone connects to our website, the request is recorded in the log file of the web server. Inside the log file, we would see something like the picture below:

As you can see, the log file contains information about every user that has tried connecting to our website. If we want to form groups based on the data of each user, we first have to choose which data the groups will be formed on. For example, we could choose the error code that the user got while trying to access our webpage. So from the image, we can group the entries that have a 404 error into one cluster, the entries that have a 544 error code into another cluster, and so on, as shown below:

The K-means algorithm forms these clusters automatically. This helps the cybersecurity experts or the SecOps team of an organization to analyze the clusters, spot patterns that repeat again and again and might be a security threat, and carry out Root Cause Analysis. This is how the K-means algorithm helps in preventing cyber attacks by hackers. Some of the common attack categories against which K-means clustering is used are DoS (Denial of Service), Probe, U2R (User to Root), R2L (Remote to Local), etc.
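K-means works on numbers, so before clustering the log we have to turn each client into a small numeric feature vector. The sketch below shows one possible way to do that. Everything in it is an assumption for illustration: the log lines are made up, they are assumed to follow the Apache common log format, and the two features (number of requests, share of requests with a 4xx/5xx status) are just my choice.

```python
import re
from collections import defaultdict

# Assumed format: Apache common log format (client IP first, status near the end).
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+')

# Made-up sample lines using documentation-only IP addresses.
sample_log = [
    '192.0.2.10 - - [10/Mar/2021:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '192.0.2.10 - - [10/Mar/2021:10:00:05 +0000] "GET /about.html HTTP/1.1" 200 2210',
    '198.51.100.7 - - [10/Mar/2021:10:00:06 +0000] "GET /admin HTTP/1.1" 404 209',
    '198.51.100.7 - - [10/Mar/2021:10:00:06 +0000] "GET /wp-login.php HTTP/1.1" 404 209',
    '198.51.100.7 - - [10/Mar/2021:10:00:07 +0000] "GET /.env HTTP/1.1" 404 209',
]


def features_per_client(lines):
    """Map each client IP to (request count, fraction of 4xx/5xx responses)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ip, status = match.group(1), int(match.group(2))
        totals[ip] += 1
        if status >= 400:
            errors[ip] += 1
    return {ip: (totals[ip], errors[ip] / totals[ip]) for ip in totals}


features = features_per_client(sample_log)
for ip, vector in features.items():
    print(ip, vector)
```

The resulting vectors can then be given to a K-means implementation. In practice the features should be scaled to comparable ranges first, otherwise the one with the largest numbers dominates the distance.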

Some more use cases of K-means clustering in the security world

Intrusion Detection System (IDS)

An intrusion detection system (IDS) is a device or application that monitors a network for malicious activity or policy violations. Anomaly detection is the basic method for defending against new attacks in intrusion detection, but on its own it has only moderate accuracy, suffers from a high false alarm rate, and can fail to detect many attacks. K-means clustering helps to overcome this issue: it groups the data into related clusters before a classifier is applied, which keeps the false alarm rate reasonable. This approach has produced high accuracy and good detection rates, though still with moderate false alarms on novel attacks.
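One simple way to picture "cluster first, then classify" is shown below. This is not the method of any particular IDS, just a sketch under my own assumptions: the centroids are hypothetical values pretending to come from an earlier K-means run, the two features (connections per second, fraction of failed logins) and the training labels are made up, and the "classifier" is simply a majority vote of the labelled points inside each cluster.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))


def label_clusters(points, labels, centroids):
    """Give each cluster the most common label among its labelled members."""
    votes = {}
    for point, label in zip(points, labels):
        votes.setdefault(nearest_centroid(point, centroids), Counter())[label] += 1
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items()}


def classify(point, centroids, cluster_labels):
    """Classify a new point by the label of its nearest cluster."""
    return cluster_labels.get(nearest_centroid(point, centroids), "unknown")


# Hypothetical centroids, as if produced by K-means with K=3.
# Features (made up): (connections per second, fraction of failed logins).
centroids = [(2.0, 0.05), (150.0, 0.10), (5.0, 0.90)]

# Hypothetical labelled traffic used to name the clusters.
training_points = [(1.0, 0.0), (3.0, 0.1), (140.0, 0.1), (160.0, 0.1), (4.0, 0.95), (6.0, 0.85)]
training_labels = ["normal", "normal", "dos", "dos", "r2l", "r2l"]

cluster_labels = label_clusters(training_points, training_labels, centroids)
print(cluster_labels)
print(classify((155.0, 0.2), centroids, cluster_labels))
```

A real system would scale the features and use a proper classifier inside or on top of the clusters. The point here is only that the clusters narrow down which group a new connection belongs to before a label is decided.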

Spam Filtering

Electronic mail (email) has become an essential tool for Internet users. Unwanted messages are known as spam emails. These messages are sent in bulk to a huge number of recipients. The growing volume of spam makes a very common task, keeping an inbox manageable, much harder. Spam is a significant problem for the internet community because it wastes resources and clutters our online environment. To prevent these harmful effects of spam email, spam filtering is an essential task.

K-means clustering is an effective way of identifying spam. It works by looking at the different sections of the email (header, sender, and content). The emails are then grouped into clusters, and these clusters can then be classified to identify which ones are spam. Including clustering in the classification process reportedly improves the accuracy of the filter to 97%.

There are many more use cases like these of K-means clustering in the security world. Thanks for reading :)

Connect with me on LinkedIn: https://www.linkedin.com/in/shivam-prasad-upadhyay/

Learner, Tech Enthusiast