Insider Threat Detection: Detecting Variance in User Behavior using an Ensemble Approach

June 23, 2017 Anirudh Kondaveeti

Insider threat detection is a topic of growing interest due to the increasing number of cyber attacks. Understanding user activity within an organization is crucial to detecting malicious insiders. Some of the questions that prove useful in this regard are:

  1. What features are useful to characterize user behavior for detecting insider threats?
  2. Which users in the organization are:
    • accessing an abnormally high number of servers that they haven’t accessed before, or
    • logging into an unusual number of user accounts, or
    • logging in from locations that are not typical of their usual behavior?
  3. How can user behavioral features be incorporated into a final model to risk score users?

In this blog post, we illustrate a framework in which we develop an ensemble of models to risk-rank users based on a subset of user behavioral features. The intuition behind this approach is that if a user starts accessing new servers that he/she hasn’t accessed in the past few months, he/she is given a higher risk score compared to others. If the same user also starts to spin up processes that haven’t been executed in the past, there is an additional boost to his/her score or rank. As more features contribute to a user’s anomaly score, he/she is ranked higher relative to others.

Using a single method to generate an anomaly or risk score can lead to a lot of false alarms. Ensemble approaches combine the predictive power of individual models to generate a better risk score with a lower false-positive rate. We use machine learning methods such as Principal Component Analysis (PCA) and regression analysis to detect variance in user behavior. This framework provides the flexibility to incorporate additional user behavioral features and machine learning algorithms to generate a final risk rank for each user.

Data Description and Preprocessing

The data set used for this analysis is the anonymized dataset provided by Los Alamos National Laboratory (LANL), which consists of 58 days of event data from five different data sources. We use only Kerberos-based authentication events from individual computers and centralized domain controllers for this analysis.

The timestamp is an integer in the original data, and we converted it to a date format, with the start date being 2017-01-01 and the end date 2017-02-27. We removed events where the user name ends with a ‘$’, since those correspond to computer accounts rather than user accounts. The resulting dataset consists of 110,913,044 Kerberos authentication events, of which 130,581 correspond to failed authentications. There are 10,326 users and 13,647 computers in the data.

A sample of the dataset used is shown in Fig. 1. The various columns in the data are described below:

  • time_col : an integer corresponding to time in seconds from the start time

  • user_src : the user who is performing the authentication operation

  • user_dest : the target user whose account is being accessed

  • src : the source computer used for authentication

  • dest : the destination computer which is being accessed

  • auth_type : the type of authentication; since we are only concerned with Kerberos authentication, this value is always ‘Kerberos’

  • logon_type : the type of logon

  • auth_orientation : the orientation of the authentication event (e.g., a logon or a logoff)

  • pass_fail : whether it is a successful or failed authentication

  • date_time_col :  time_col converted to date format 

Fig. 1 : Sample of the preprocessed Kerberos authentication data used

Feature Engineering

Behavioral features are important for summarizing a user’s activity and detecting abnormal changes in it. For example, if a user suddenly starts logging into a large number of servers when he/she hasn’t done so in the past, it could indicate a possible user account compromise. We used the following five User Behavioral Features (UBF) for our analysis.

  • UBF-1 : Number of distinct destination computers that a user logs on to each day.

  • UBF-2: Number of distinct source computers that a user logs in from each day.

  • UBF-3: Number of distinct destination user accounts that a user logs into each day.

  • UBF-4: Number of distinct processes that a user starts each day.

  • UBF-5 : Time-constrained diameter of the authentication graph of a user each day. Authentication graphs are formed by tracing the path of each user among the different servers he/she authenticates to each day. The time-constrained diameter is the diameter of this graph, calculated under temporal constraints on the login behavior: the servers accessed by a user must follow a temporal sequence, such that servers farther along a path have timestamps greater than the ones accessed earlier in the path.
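As an illustration, the time-constrained diameter in UBF-5 can be computed with a temporal breadth-first search. This is a sketch rather than the exact implementation used in the post; the edge representation (source, destination, timestamp) is our assumption:

```python
from collections import defaultdict, deque

def temporal_hops(edges, start):
    """Fewest-hop time-respecting paths from `start`.
    A path is time-respecting if edge timestamps are non-decreasing,
    matching the temporal-sequence constraint described above."""
    out = defaultdict(list)
    for s, d, t in edges:
        out[s].append((d, t))
    dist = {start: 0}
    # BFS over (node, time-of-last-edge) states, so that arriving later
    # never hides a continuation that only exists at an earlier time.
    frontier = deque([(start, float("-inf"), 0)])
    seen = {(start, float("-inf"))}
    while frontier:
        node, t_arr, hops = frontier.popleft()
        for nxt, t in out.get(node, []):
            if t >= t_arr and (nxt, t) not in seen:
                seen.add((nxt, t))
                dist[nxt] = min(dist.get(nxt, hops + 1), hops + 1)
                frontier.append((nxt, t, hops + 1))
    return dist

def time_constrained_diameter(edges):
    """Longest time-respecting shortest path over all start nodes."""
    nodes = {s for s, _, _ in edges} | {d for _, d, _ in edges}
    return max(max(temporal_hops(edges, n).values()) for n in nodes)

# Hypothetical daily graphs: a one-hop star vs. a two-hop chain.
star  = [("C1", "C2", 1), ("C1", "C3", 2)]
chain = [("C1", "C2", 1), ("C2", "C3", 2)]
print(time_constrained_diameter(star), time_constrained_diameter(chain))  # 1 2
```

Note that reversing the timestamps in `chain` would drop its diameter back to 1, since the second hop would no longer respect the temporal order.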

Model Development

Anomaly detection is a well researched area in machine learning. We use the following two methods for detecting anomalies.

  • Method-A : Principal Component Analysis (PCA) is a commonly used dimensionality reduction technique that reduces high-dimensional data into lower dimensions by finding the principal components that capture the most variance in the data. The input consists of a user behavior matrix B of dimensions n x d, where n represents the number of users and d represents the number of days. It is recommended to have at least six months of data to summarize a user’s behavior accurately. Also, to avoid differences between weekdays and weekends, the model could be run on weekly aggregates of the data instead of daily values. The values in the matrix correspond to the feature being monitored, which could be any one of UBF-1 through UBF-5 described above. We apply PCA to matrix B to calculate the top k principal components, with the intent of finding the users who exhibit maximum variation across days in the specific feature being monitored. The matrix B in Fig. 2 shows the user behavior matrix for a particular user behavioral feature, UBF-1, i.e. the number of distinct destination computers that a user accesses each day. The resulting matrix D is also shown, which contains the top k principal components, with dimensions n x k, such that each principal component is an n-dimensional vector. To rank the users, we use the matrix D to generate a score for each user; the user with the highest score is the most anomalous and is given a rank of 1. The method is then repeated with the other user behavioral features (UBF-2 through UBF-5) to rank the users on those features as well. Each user therefore has five different ranks from Method-A, one for each of UBF-1 through UBF-5.

Fig. 2 : Principal components of a matrix B capturing the no. of distinct destination computers accessed by n users over d days
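A minimal sketch of the Method-A scoring using NumPy is shown below; scoring users by the norm of their projection onto the top-k components is our assumption about how matrix D is turned into a single score per user:

```python
import numpy as np

def pca_risk_ranks(B, k=2):
    """Rank users by how strongly they project onto the top-k principal
    components of the n x d user behavior matrix B (rows = users,
    columns = days). Rank 1 is the most anomalous user."""
    Bc = B - B.mean(axis=0)                 # center each day (column)
    _, _, Vt = np.linalg.svd(Bc, full_matrices=False)
    D = Bc @ Vt[:k].T                       # n x k projection matrix
    scores = np.linalg.norm(D, axis=1)      # high variation -> high score
    order = np.argsort(-scores)             # descending by score
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

# Hypothetical example: 10 users over 30 days; user 3 spikes to 70 servers
# on the last day while everyone else hovers around 5 per day.
rng = np.random.default_rng(0)
B = rng.poisson(5, size=(10, 30)).astype(float)
B[3, -1] = 70.0
print(pca_risk_ranks(B)[3])  # 1  (user 3 is ranked most anomalous)
```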

  • Method-B : Regression analysis is used to predict an outcome variable that depends on a set of predictor variables. The equation for linear regression with a single predictor x and an outcome variable y is of the form y = mx + c, where m is the slope and c is the intercept. The slope m indicates the trend, or variation, of y with respect to x. The input for this method is the time series of a single user behavioral feature, where y is the UBF and x is the time index (i.e. day ID). The model learns the slope parameter m and the intercept term c, which are used to rank each user. Each user therefore has five different ranks from Method-B, one for each of UBF-1 through UBF-5.
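The per-user fit in Method-B can be sketched with an ordinary least-squares regression; ranking by the absolute slope is our assumption about how m and c translate into a rank:

```python
def slope_intercept(y):
    """OLS fit of y = m*x + c with x = 0, 1, ..., len(y)-1 (the day index)."""
    n = len(y)
    mean_x = (n - 1) / 2
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (yi - mean_y) for x, yi in enumerate(y))
    m = sxy / sxx
    return m, mean_y - m * mean_x

def regression_ranks(series_by_user):
    """Rank users by |slope| of their daily UBF time series; a steeper
    trend, up or down, is treated as a larger behavioral shift (rank 1)."""
    slopes = {u: abs(slope_intercept(y)[0]) for u, y in series_by_user.items()}
    ordered = sorted(slopes, key=slopes.get, reverse=True)
    return {u: r for r, u in enumerate(ordered, start=1)}

# Hypothetical daily counts for three users over five days.
series = {"U1": [1, 1, 1, 1, 1],        # flat: no trend
          "U2": [1, 3, 5, 7, 9],        # strong upward trend
          "U3": [2, 2, 3, 3, 3]}        # mild upward trend
print(regression_ranks(series))  # {'U2': 1, 'U3': 2, 'U1': 3}
```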

Each method (Method-A and Method-B) uses a single user behavioral feature (UBF-1 to UBF-5) at a time to rank every user, so each user has ten different ranks from the method and feature combinations. We finally combine the ranks for each user across these features and models to get a final rank for each user.

We use the Robust Rank Aggregation (RRA) method to get the final rank. RRA normalizes each of a user’s ranks to a value between 0 and 1, based on the number of users used to generate that rank. Under the null hypothesis, the normalized ranks are uniformly distributed between 0 and 1. A p-value score is used to compare each user’s ranks with this random distribution and obtain a final rank for each user.
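A simplified, standard-library sketch of the RRA score is shown below. Real RRA adjusts for partial lists; here missing (null) ranks are simply dropped, which is a simplification:

```python
from math import comb

def rho_score(norm_ranks):
    """RRA rho score for one user. `norm_ranks` are the user's ranks, each
    divided by the number of users in its list, so they lie in (0, 1].
    Small rho means the user sits near the top of several lists more often
    than uniform chance would allow."""
    r = sorted(norm_ranks)
    n = len(r)

    def order_stat_p(k, x):
        # P(k-th smallest of n uniforms <= x), via the binomial tail.
        return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))

    return min(order_stat_p(k, r[k - 1]) for k in range(1, n + 1))

def aggregate_ranks(ranks_by_user, n_users):
    """Final rank per user: rank 1 has the smallest rho score.
    None entries model a user not ranked by some feature/method."""
    rho = {}
    for user, ranks in ranks_by_user.items():
        rho[user] = rho_score([x / n_users for x in ranks if x is not None])
    ordered = sorted(rho, key=rho.get)
    return {u: i for i, u in enumerate(ordered, start=1)}

# Hypothetical ranks from five feature/method combinations over 1000 users.
ranks = {"U1": [1, 2, 1, 3, 2],              # consistently at the top
         "U2": [500, 480, None, 510, 490],   # middling, one missing rank
         "U3": [50, 400, 30, 450, 200]}      # top on some lists only
final = aggregate_ranks(ranks, n_users=1000)
print(final["U1"])  # 1
```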

The framework for the final model is shown in Fig. 3. The framework can be easily extended to include additional user behavior features and outlier detection methods.

Fig. 3 : The ensemble framework to get the risk rank for each user from multiple features and methods

Results and Discussion

We discuss the results from individual methods and user behavior features in terms of the top users flagged, along with the final results using the aggregated ranks. Each of the method and feature combinations is able to detect unusual or suspicious activity by itself. We show only the results from Method-A and the combined results after ranking users based on both methods (Method-A and Method-B).

Consider Method-A with UBF-1, i.e. using PCA to detect abnormal variance in the number of destination computers accessed by a user. Some of the notable cases in the top 40 ranked outliers are shown below along with their ranks.

UserID - U1723, Rank - 1 : As shown in Fig. 4(a), this user (U1723) shows periodic behavior, with a sudden increase in the number of servers being accessed at regular intervals. This could indicate malicious behavior; perhaps a script is being run using this user account at specific times.

Fig. 4(a) : Time series plot of no. of destination computers accessed per day by user U1723

UserID - U7998, Rank - 2 : The user U7998, as seen in Fig. 4(b), has a sudden decrease in activity after 2017-01-28. This could indicate a possible change in job function, or a possible account takeover in which the account was being accessed by someone else until 2017-01-28.

Fig. 4(b) : Time series plot of no. of destination computers accessed per day by user U7998

UserID - U3840, Rank - 4 : The user U3840, as seen in Fig. 4(c), usually accesses fewer than ten destination computers. However, we see a sudden increase in activity on 2017-01-06, when the user accesses close to 70 servers, which is abnormal behavior.

Fig. 4(c) : Time series plot of no. of destination computers accessed per day by user U3840

For Method-A with UBF-2, i.e. using PCA to detect abnormal variance in the number of source computers used by a user, some of the notable cases in the top 40 ranked outliers are shown below in Fig. 5[a-b] along with their ranks.

UserID - U1653, Rank - 11 : The user U1653, as seen in Fig. 5(a), usually logs in from fewer than five source computers. However, we see a sudden increase in activity on 2017-01-15, when the user logs in from close to 44 computers. This could indicate a possible account compromise, where an adversary has gotten hold of the user's credentials.

Fig. 5(a) : Time series plot of no. of source computers accessed per day by user U1653

UserID - U3328, Rank - 27 : As shown in Fig. 5(b), this user (U3328) shows periodic behavior, with sudden increases in the number of computers the user logs in from. This could indicate a script being run at regular intervals, using the user’s credentials to log in to various computers in the network.

Fig. 5(b) : Time series plot of no. of source computers accessed per day by user U3328

For Method-A with UBF-3, i.e. using PCA to detect abnormal variance in the number of destination user accounts that a user logs into each day, some of the notable cases in the top 40 ranked outliers are shown in Fig. 6[a-b] along with their ranks.

UserID - U3, Rank - 1 : The number of destination accounts this user (U3) logs into shows periodic behavior, as seen in Fig. 6(a). This could indicate a possible policy violation, where the user is logging in with other users’ login credentials.

Fig. 6(a) : Time series plot of no. of destination user accounts accessed per day by user U3

UserID - U25@DOM3, Rank - 5 : The number of destination user accounts that the user logs into changes from one account to two on Feb 6, 2017. This could possibly indicate the user’s account being used by someone else to perform malicious activities.

Fig. 6(b) : Time series plot of no. of destination user accounts accessed per day by user U25

For Method-A with UBF-4, i.e. using PCA to detect abnormal variance in the number of processes that a user starts each day, a notable case in the top 40 ranked outliers is shown in Fig. 7 along with its rank.

UserID - U9614, Rank - 1 : As can be seen from Fig. 7, an abnormally high number of processes (181) is started by the user on Feb 1, 2017 and Feb 10, 2017, while the user generally spins up fewer than 10 processes. This could indicate possible malware on the user’s system.

Fig. 7 : Time series plot of no. of processes started per day by user U9614

For Method-A with UBF-5, i.e. using PCA to detect abnormal variance in the time-constrained diameter of a user’s daily authentication graph, we depict the time series plot of the diameter for the top outlier (rank 1) in Fig. 8(a), as well as the corresponding authentication graphs in Fig. 8[b-c].

UserID - U4273, Rank - 1 : Fig. 8(a) shows the time series of the diameter of the authentication graph of U4273 per day. As can be seen, the diameter of the graph is mostly one, but it is two on 2017-01-20. Next, we plot and compare the authentication graphs on 2017-01-18 and 2017-01-20.

Fig. 8(a) : Time series plot of time constrained diameter per day for user U4273

The authentication graph for the user U4273 on 2017-01-18 is shown in Fig. 8(b). The blue node in the graph corresponds to the only source computer, C1570, while the green nodes are all destination computers. As can be seen, the user logs in from C1570 to access destinations C2106 and C486 separately.

Fig. 8(b) : Authentication graph for user U4273 on 2017-01-18

The authentication graph for the same user U4273 on 2017-01-20 is shown in Fig. 8(c). The red node in the graph corresponds to a computer that is both a source and a destination. It can be seen that the user logs in from C1570 and accesses C486. The user then uses C486 as the source to access server C2106, making the diameter of the graph two. This is unusual behavior for the user, who normally accesses server C2106 directly from C1570 without having to hop through C486. Also, a new server, C612, which wasn’t accessed previously, is accessed from C486.

Fig. 8(c) : Authentication graph for user U4273 on 2017-01-20

Combining results to get final ranks

All users are scored using the five user behavioral features and two anomaly detection methods. We use the Robust Rank Aggregation (RRA) method described earlier to get the final rank for the users. One advantage of using RRA to aggregate ranks is that if a user is not ranked by a specific feature or method, he/she has a null value for that rank, and RRA can take such null values into account when calculating the final rank. The top 20 most anomalous users, based on an aggregation of their ranks from the two methods, are shown in Fig. 9 below. The column headings correspond to the method and UBF, i.e. A-UBF1 is Method-A with UBF-1. The final rank column contains the final aggregated rank based on the ranks from the two methods.

Fig. 9 : Top 20 users with lowest aggregated final ranks

The user with the lowest aggregated rank is U1723, driven by high contributions from all of the UBFs except the ‘number of destination users’. The authentication graphs for this user on two days, 2017-01-06 and 2017-01-08, are shown in Fig. 10(a) and Fig. 10(b) respectively.

Fig. 10(a) : Authentication graph of user U1723 on 2017-01-06

Fig. 10(b) : Authentication graph of user U1723 on 2017-01-08

In the plots above, the between-day variance in behavior and activity is clearly visible for user U1723. Additional information, including job role, the type of servers accessed, and the type of processes started by this user, would also be useful in a forensic analysis to categorize the user as either benign or malicious. For instance, if this user belongs to the IT department and regularly conducts software testing, this kind of variation might be deemed normal. However, for users in other departments, like sales and marketing, this type of behavior may be deemed unusual.

Comparing with known ground truth

The red team at LANL also provided ground truth of 81 compromised users (ignoring the domain) in the Kerberos authentication logs. The user U1723 appears in this known compromised user list, confirming that the user is malicious. We compared the top 1000 ranked users from our model with this ground truth and were able to detect 49 out of the 81 users, i.e. about 60% of known compromised users can be detected by investigating the top 1000 ranked users from our model.

Conclusion

In this post, we have described a framework to automatically detect users exhibiting unusual behavior by capturing the variance in their behavior across multiple behavioral features. The framework is flexible enough to incorporate additional user behavioral features and machine learning algorithms. The proposed ensemble approach helps reduce false positives, since we use more than one feature to summarize a user’s activity. Each user behavioral feature is like a sensor that triggers independently, and having multiple sensors trigger for a user makes that user more suspicious. The users are ranked based on these features and the top outliers are flagged. Once users are flagged, the next step is for the security team to investigate them and identify the malicious users among them.

About the Author

Anirudh Kondaveeti

Anirudh Kondaveeti is a Principal Data Scientist and the lead for Security Data Science at Pivotal. Prior to joining Pivotal, he received his B.Tech from Indian Institute of Technology (IIT) Madras and Ph.D. from Arizona State University (ASU) specializing in the area of machine learning and spatio-temporal data mining. He has developed statistical models and machine learning algorithms to detect insider and external threats and "needle-in-hay-stack" anomalies in machine generated network data for leading industries.
