Insider threat detection is a topic of growing interest these days due to the increasing number of cyber attacks. Understanding user activity within an organization is crucial to detect malicious insiders. Some of the questions that prove useful in this regard are:
- What features are useful to characterize user behavior for detecting insider threats?
- Which users in the organization are:
  - accessing an abnormally high number of servers that they haven’t accessed before, or
  - logging into an unusual number of user accounts, or
  - logging in from locations that are not typical of their usual behavior?
- How can user behavioral features be incorporated into a final model to risk score users?
In this blog post, we illustrate a framework, in which we develop an ensemble of models to risk rank users, based on a subset of user behavioral features. The intuition behind this approach is that, if a user starts accessing new servers that he/she hasn’t accessed in the past few months, he/she is given a higher risk score compared to others. If the same user also starts to spin up processes that haven’t been executed in the past, there would be an additional boost in his/her score or rank. As more and more additional features contribute to the anomaly score of the user, he/she is given a higher rank compared to others.
Using a single method to get an anomaly or risk score could lead to a lot of false alarms. Ensemble approaches combine the predictive power of individual models to generate a better risk score with a lower false positive rate. We use machine learning methods like Principal Component Analysis (PCA) and Regression Analysis to detect variance in user behavior. This framework provides the flexibility to incorporate additional user behavior features and machine learning algorithms to generate a final risk rank for each user.
Data Description and Preprocessing
The data set used for this analysis is the anonymized dataset provided by Los Alamos National Laboratory (LANL), which consists of 58 days of event data from five different data sources. We specifically use only Kerberos based authentication events from individual computers and centralized domain controllers for this analysis.
The timestamp was an integer in the original data, and we converted it to a date format, with the start date being 2017-01-01 and the end date 2017-02-27. We removed the events where the user name ends with a ‘$’, since they correspond to computer accounts and not user accounts. The resulting dataset consists of 110,913,044 Kerberos authentication events, of which 130,581 correspond to failed authentications. There are 10,326 users and 13,647 computers in the data.
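The two preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the record layout (dicts keyed by `time_col` and `user_src`) and the sample rows are hypothetical, and it assumes that user names may carry an `@domain` suffix (as in `U25@DOM3`), which is stripped before the ‘$’ check.

```python
from datetime import datetime, timedelta

# Start of the observation window, per the text above.
START = datetime(2017, 1, 1)

def preprocess(rows):
    """Drop computer accounts and attach a datetime to each event."""
    cleaned = []
    for row in rows:
        # Assumption: names look like "name@domain"; check the name part only.
        name = row["user_src"].split("@")[0]
        if name.endswith("$"):
            continue  # computer account, not a user account
        event = dict(row)
        event["date_time_col"] = START + timedelta(seconds=int(row["time_col"]))
        cleaned.append(event)
    return cleaned

# Hypothetical records, for illustration only.
sample = [
    {"time_col": 1, "user_src": "U1@DOM1"},
    {"time_col": 86400, "user_src": "C5$@DOM1"},
    {"time_col": 90000, "user_src": "U2@DOM1"},
]
events = preprocess(sample)
for e in events:
    print(e["user_src"], e["date_time_col"])
```

With 58 days of data, an offset of 57 full days from 2017-01-01 lands on 2017-02-27, which matches the end date quoted above.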
A sample of the dataset used is shown in Fig. 1. The various columns in the data are described below:
time_col : an integer corresponding to time in seconds from the start time
user_src : the user who is performing the authentication operation
user_dest : the target user whose account is being accessed
src : the source computer used for authentication
dest : the destination computer which is being accessed
auth_type : the type of authentication; since we are only concerned with Kerberos authentication, this value is always ‘Kerberos’
logon_type : the type of logon
auth_orientation : the orientation of the authentication event (for example, a logon or a logoff)
pass_fail : whether it is a successful or failed authentication
date_time_col : time_col converted to date format
Fig. 1 : Sample of the preprocessed Kerberos authentication data used
Behavioral features are important for summarizing a user’s activity and detecting abnormal changes in it. For example, if a user suddenly starts logging into a large number of servers and he/she hasn’t done so in the past, it could indicate a possible user account compromise. We used the following five User Behavioral Features (UBF) for our analysis.
UBF-1 : Number of distinct destination computers that a user logs on to each day.
UBF-2: Number of distinct source computers that a user logs in from each day.
UBF-3: Number of distinct destination user accounts that a user logs into each day.
UBF-4: Number of distinct processes that a user starts each day.
UBF-5: Time constrained diameter of the authentication graph of a user each day. Authentication graphs are formed by tracing the path of each user among the different servers he/she authenticates to each day. The time constrained diameter is the diameter of this graph, calculated with temporal constraints on the login behavior: the servers along a path must follow a temporal sequence, such that servers accessed farther along the path have a timestamp greater than the ones accessed earlier in the path.
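UBF-1 to UBF-3 are all distinct counts per user per day, so one helper can cover them. The sketch below assumes each event is a dict using the column names listed earlier (`user_src`, `src`, `dest`, `user_dest`, `date_time_col`). The helper name `distinct_per_day` and the sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def distinct_per_day(events, value_key):
    """Count distinct values of `value_key` for every (user, day) pair."""
    seen = defaultdict(set)
    for e in events:
        day = e["date_time_col"].date()
        seen[(e["user_src"], day)].add(e[value_key])
    return {key: len(values) for key, values in seen.items()}

# Hypothetical events for a single user over two days.
events = [
    {"user_src": "U1", "src": "C1", "dest": "C2", "user_dest": "U1",
     "date_time_col": datetime(2017, 1, 1, 9)},
    {"user_src": "U1", "src": "C1", "dest": "C3", "user_dest": "U1",
     "date_time_col": datetime(2017, 1, 1, 10)},
    {"user_src": "U1", "src": "C1", "dest": "C2", "user_dest": "U1",
     "date_time_col": datetime(2017, 1, 1, 11)},
    {"user_src": "U1", "src": "C4", "dest": "C2", "user_dest": "U9",
     "date_time_col": datetime(2017, 1, 2, 9)},
]

ubf1 = distinct_per_day(events, "dest")       # distinct destination computers
ubf2 = distinct_per_day(events, "src")        # distinct source computers
ubf3 = distinct_per_day(events, "user_dest")  # distinct destination accounts
```

Laying these counts out with one row per user and one column per day gives the kind of user behavior matrix that the anomaly detection methods take as input.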
Anomaly detection is a well researched area in machine learning. We use the following two methods for detecting anomalies.
Method-A : Principal Component Analysis (PCA) is a commonly used dimensionality reduction technique that projects high-dimensional data onto a small number of principal components, the directions that capture the most variance in the data. The input is a user behavior matrix B of dimensions n x d, where n represents the number of users and d represents the number of days. Each value in the matrix corresponds to the feature being monitored, which is one of UBF-1, UBF-2, … or UBF-5 described above. It is recommended to have at least six months of data to summarize a user’s behavior accurately. Also, to avoid differences between weekdays and weekends, the model could be run on weekly aggregates of the data instead of daily values.
We apply PCA to matrix B to calculate the top k principal components, with the intent of finding the users who exhibit the most variation across days in the feature being monitored. Fig. 2 shows the user behavior matrix B for UBF-1, i.e. the number of distinct destination computers that a user accesses each day. The resulting matrix D, also shown, has dimensions n x k and contains the top k principal components, each of which is an n-dimensional vector. To rank the users, we use the matrix D to generate a score for each user. The user with the highest score is the most anomalous and is given a rank of 1.
The method is then repeated with the other user behavior features (UBF-2, UBF-3, ... UBF-5) to rank the users based on those features as well. Each user therefore has five different ranks from Method-A, one for each of UBF-1, UBF-2, ... UBF-5.
Fig. 2 : Principal components of a matrix B capturing the no.of distinct destination computers logged in by n users in d days
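To make the PCA step concrete, here is a small sketch using only the Python standard library. The post does not spell out how the score is derived from D, so this sketch makes two assumptions: it uses only the first principal component (k = 1, found by power iteration), and it takes a user's score to be the absolute value of that user's projection onto the component. The matrix values and user names are hypothetical.

```python
import math
import random

def top_component(centered, iters=200, seed=0):
    """Approximate the first principal direction by power iteration."""
    d = len(centered[0])
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        # Multiply v by B^T B without forming the d x d matrix.
        proj = [sum(x * w for x, w in zip(row, v)) for row in centered]
        nxt = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # no variance in the data
        v = [x / norm for x in nxt]
    return v

def pca_scores(B):
    """Score each row (user) by |projection| on the first component."""
    n, d = len(B), len(B[0])
    means = [sum(row[j] for row in B) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in B]
    v = top_component(centered)
    return [abs(sum(x * w for x, w in zip(row, v))) for row in centered]

# Hypothetical n x d matrix: 4 users, 5 days of UBF-1 counts.
users = ["user_a", "user_b", "user_c", "user_d"]
B = [
    [3, 3, 4, 3, 3],
    [2, 2, 2, 3, 2],
    [5, 4, 5, 5, 4],
    [3, 3, 70, 3, 3],  # sudden spike on day 3
]
scores = pca_scores(B)
ranking = [u for _, u in sorted(zip(scores, users), reverse=True)]
print(ranking)  # the user with the spike comes first
```

In practice a library implementation of PCA, with more than one component, would be the natural choice. The pure-Python version is only meant to show where the per-user score comes from.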
Method-B : Regression analysis is used to predict an outcome variable that depends on a set of predictor variables. The equation for linear regression involving a single predictor x and an outcome variable y is of the form y = mx + c, where m is the slope and c is the intercept. The slope m is an indication of the trend or variation of y with respect to x. The input for this method consists of the time series of a single user behavioral feature, where y is the UBF and x is the time index (i.e. day ID). The model learns the slope parameter m and the intercept term c, which are used to rank each user. Each user therefore has five different ranks using Method-B and features UBF-1, UBF-2, ... UBF-5 respectively.
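As an illustration of Method-B, the sketch below fits y = mx + c by ordinary least squares for each user's daily series and orders users by the magnitude of the slope. Ranking by |m| alone is an assumption made here for simplicity, since the post says only that m and c are used for ranking. The series and user names are hypothetical.

```python
def fit_line(y):
    """Least-squares slope m and intercept c of y against day index 0..d-1."""
    d = len(y)
    x_mean = (d - 1) / 2
    y_mean = sum(y) / d
    sxx = sum((x - x_mean) ** 2 for x in range(d))
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y))
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c

# Hypothetical daily UBF values for three users.
series = {
    "user_a": [2, 2, 2, 2],   # flat
    "user_b": [1, 3, 5, 7],   # steadily increasing
    "user_c": [4, 3, 4, 3],   # slight downward trend
}
slopes = {user: fit_line(y)[0] for user, y in series.items()}

# Assumption: a larger absolute slope means a more anomalous trend.
ranking = sorted(series, key=lambda u: abs(slopes[u]), reverse=True)
print(ranking)
```

A slope captures sustained drift but not a one-day spike, which is one reason it is useful to combine this method with the PCA-based ranking.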
Each method (Method-A & Method-B) uses a single user behavior feature (UBF-1 to UBF-5) at a time to rank every user. Each user thus has ten different ranks using the method and user behavior feature combinations. We finally combine the ranks for each user from these multiple features and models to get a final rank of each user.
We use the Robust Rank Aggregation (RRA) method to get the final rank. The RRA method normalizes each rank of a user to a value between 0 and 1, based on the number of users used to generate the rank. Under the null hypothesis of random ranking, the normalized ranks are assumed to be uniformly distributed between 0 and 1. A p-value score is used to compare the ranks of each user with this random distribution and obtain a final rank for each user.
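The aggregation step can be sketched as follows. This is a simplified illustration of the rank-aggregation idea, not a reference implementation: for each user it sorts the normalized ranks, computes for every order statistic the probability of seeing a value that small under the uniform null, and keeps the smallest probability as the score. Published implementations may additionally apply a multiple-testing correction to this score. Treating a missing (null) rank as the worst normalized rank of 1.0 is an assumption made here, and the ranks shown are hypothetical.

```python
from math import comb

def rho_score(norm_ranks):
    """Smallest order-statistic p-value under a uniform null."""
    r = sorted(norm_ranks)
    n = len(r)
    best = 1.0
    for k, x in enumerate(r, start=1):
        # P(at least k of n uniform draws fall at or below x).
        p = sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))
        best = min(best, p)
    return best

def aggregate(rank_lists, n_users):
    """Map {user: [rank or None, ...]} to (ordering, scores)."""
    scores = {}
    for user, ranks in rank_lists.items():
        # Assumption: a missing rank counts as the worst normalized rank.
        norm = [1.0 if r is None else r / n_users for r in ranks]
        scores[user] = rho_score(norm)
    return sorted(scores, key=scores.get), scores

# Hypothetical ranks out of 100 users from three method/feature combinations.
rank_lists = {
    "user_a": [1, 2, 50],
    "user_b": [40, 60, None],
}
final_order, scores = aggregate(rank_lists, n_users=100)
print(final_order, scores)
```

A user who ranks near the top in several lists gets a very small score even if one list places them in the middle, which is the behavior the ensemble relies on.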
The framework for the final model is shown in Fig.3. The framework can be easily extended to include additional user behavior features and outlier detection methods.
Fig. 3: The ensemble framework to get the risk rank for each user from multiple features and methods
Results and Discussion
We discuss the results from individual methods and user behavior features in terms of the top users flagged, along with the final results using the aggregated ranks. Each of the method and feature combinations is able to detect unusual or suspicious activity by itself. We show only the results from Method-A and the combined results after ranking users based on both methods (Method-A and Method-B).
For Method-A with UBF-1, i.e. using PCA to detect abnormal variance in the number of destination computers accessed by a user, some of the notable cases in the top 40 ranked outliers are shown below along with their ranks.
UserID - U1723, Rank - 1 : As shown in Fig. 4(a), this user (U1723) shows a periodic behavior with a sudden increase in the number of servers being accessed by the user at regular intervals. This could indicate malicious behavior -- perhaps a script is being run using this user account at specific times.
Fig. 4(a) : Time series plot of no. of destination computers accessed per day by user U1723
UserID - U7998, Rank - 2 : The user U7998, as seen in Fig. 4(b), has a sudden decrease in activity after 2017-01-28. This could indicate a possible change in job function or a possible account takeover, with the user account being accessed by someone else until 2017-01-28.
Fig. 4(b) : Time series plot of no. of destination computers accessed per day by user U7998
UserID - U3840, Rank - 4 : The user U3840, as seen in Fig. 4(c), usually accesses fewer than ten destination computers. However, we see a sudden increase in activity on 2017-01-06 -- the user accesses close to 70 servers, which is abnormal behavior.
Fig. 4(c) : Time series plot of no. of destination computers accessed per day by user U3840
For Method-A with UBF-2 i.e. using PCA to detect abnormal variance in the number of source computers accessed by a user, some of the notable cases in the top 40 ranked outliers are shown below in Fig. 5[a-b] along with their ranks.
UserID - U1653, Rank - 11 : The user U1653, as seen in Fig. 5(a), usually logs in from fewer than five source computers. However, we see a sudden increase in activity on 2017-01-15, where the user logs in from close to 44 computers. This could indicate a possible account compromise, where an adversary has got hold of the user's credentials.
Fig. 5(a) : Time series plot of no. of source computers accessed per day by user U1653
UserID - U3328, Rank - 27: As shown in Fig. 5(b), this user (U3328) shows a periodic behavior with a sudden increase in the number of computers the user logs in from. This could indicate a script being run at regular intervals that uses the user’s credentials to log in to various computers in the network.
Fig. 5(b) : Time series plot of no. of source computers accessed per day by user U3328
For Method-A with UBF-3 i.e. using PCA to detect abnormal variance in the number of destination user accounts that a user logs into each day, some of the notable cases in the top 40 ranked outliers are shown in Fig. 6[a-b] along with their ranks.
UserID - U3, Rank - 1: The number of destination accounts this user (U3) logs into has a periodic behavior as shown in Fig. 6(a). This could indicate a possible policy violation, where the user is logging in with other users’ login credentials.
Fig. 6(a) : Time series plot of no. of destination user accounts accessed per day by user U3
UserID - U25@DOM3, Rank - 5 : As shown in Fig. 6(b), the number of destination user accounts that the user logs into changes from 1 account to 2 accounts on Feb 6, 2017. This could possibly indicate that the user’s account is being used by someone else to perform malicious activities.
Fig. 6(b) : Time series plot of no. of destination user accounts accessed per day by user U25
For Method-A with UBF-4, i.e. using PCA to detect abnormal variance in the number of processes that a user starts each day, a notable case from the top 40 ranked outliers is shown in Fig. 7 along with its rank.
UserID - U9614, Rank - 1: As can be seen from Fig. 7, an abnormally high number of processes (181) is started by the user on Feb 1, 2017 and Feb 10, 2017, while the user generally spins up fewer than 10 processes. This could indicate possible malware on the user’s system.
Fig. 7 : Time series plot of no. of processes started per day by user U9614
For Method-A with UBF-5 i.e. using PCA to detect abnormal variance in the time-constrained diameter of the authentication graph of a user per day, we depict the time series plot of diameter for the top outlier with rank 1 in Fig. 8(a) as well as its corresponding authentication graphs in Fig. 8[b-c].
UserID - U4273, Rank - 1: Fig. 8(a) shows the time series of the diameter of the authentication graph of U4273 per day. As can be seen, the diameter of the graph is mostly one, but it is two on 2017-01-20. Next, we plot and compare the authentication graphs on 2017-01-18 and 2017-01-20.
Fig. 8(a) : Time series plot of time constrained diameter per day for user U4273
The authentication graph for the user U4273 on 2017-01-18 is shown in Fig. 8(b). The blue nodes in the graph correspond to source computers, and there is only one source computer, i.e. C1570, while the green nodes are all destination computers. As can be seen from the animation below, the user logs in from C1570 to access destinations C2106 and C486 separately.
Fig. 8(b) : Authentication graph for user U4273 on 2017-01-18
The authentication graph for the same user U4273 on 2017-01-20 is shown in Fig. 8(c). Red nodes in the graph correspond to computers that are both sources and destinations. It can be seen from the animation below that the user logs in from C1570 and accesses C486. The user then uses C486 as the source to access server C2106, thereby making the diameter of the graph 2. This is unusual behavior for the user, as normally he/she can directly access server C2106 from C1570 without having to hop through C486. Also, a new server, C612, is accessed from C486 that wasn’t accessed previously.
Fig. 8(c) : Authentication graph for user U4273 on 2017-01-20
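The difference between the two days can be reproduced with a short sketch of the time constrained diameter. It uses one possible reading of the feature: the number of hops in the longest path whose logons occur in strictly increasing time order, without revisiting a computer. The computer names come from the description of Fig. 8, but the timestamps are hypothetical and only encode the order of events.

```python
def time_constrained_diameter(events):
    """Longest time-respecting simple path, in hops.

    `events` is a list of (timestamp, source, destination) tuples for one
    user on one day.
    """
    events = sorted(events)

    def extend(node, t, visited):
        best = 0
        for t2, u, v in events:
            # Follow only later logons that start where we currently are.
            if u == node and t2 > t and v not in visited:
                best = max(best, 1 + extend(v, t2, visited | {v}))
        return best

    sources = {u for _, u, _ in events}
    return max((extend(s, float("-inf"), {s}) for s in sources), default=0)

# 2017-01-18: both destinations are reached directly from C1570.
day_18 = [(1, "C1570", "C2106"), (2, "C1570", "C486")]

# 2017-01-20: C486 is reached first and then used as a source.
day_20 = [(1, "C1570", "C486"), (2, "C486", "C2106"), (3, "C486", "C612")]

print(time_constrained_diameter(day_18))
print(time_constrained_diameter(day_20))
```

The exhaustive search is exponential in the worst case. That is acceptable for a single user's daily graph of a handful of computers, but a dynamic-programming pass over time-sorted events would be a better fit for large graphs.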
Combining results to get final ranks
All users are scored using the five user behavior features and two anomaly detection methods. We use the Robust Rank Aggregation (RRA) method described earlier to get the final rank for the users. One advantage of using RRA to aggregate ranks is that, if a user is not ranked by a specific feature or method, that rank is simply null for the user, and RRA can take the null values into account when calculating the final rank. The top 20 most anomalous users, based on an aggregation of their ranks from the two methods, are shown in Fig. 9 below. The column headings correspond to the method and UBF, i.e. A-UBF1 is Method-A with UBF-1. The final rank column contains the aggregated rank based on the ranks from both methods.
Fig. 9 : Top 20 users with lowest aggregated final ranks
The user with the lowest aggregated rank is U1723, driven by a high contribution from all of the UBFs except the ‘number of destination users’. The authentication graphs for this user on two days 2017-01-06 and 2017-01-08 are shown in Fig. 10(a) and Fig. 10(b) respectively.
Fig. 10(a) : Authentication graph of user U1723 on 2017-01-06
Fig. 10(b) : Authentication graph of user U1723 on 2017-01-08
In the plots above, the between-day variance in behavior and activity is clearly visible for user U1723. Additional information, including job role, type of servers accessed, and type of processes started by this user, would also be useful for forensic analysis to categorize the user as either benign or malicious. For instance, if this user belongs to the IT department and regularly conducts software testing, this kind of variation might be deemed normal. However, for users in other departments like sales and marketing, this type of behavior may be deemed unusual.
Comparing with known ground truth
The red team at LANL also provided ground truth of 81 compromised users (ignoring the domain) in the Kerberos authentication logs. The user U1723 appears in this list of known compromised users, confirming that this account was compromised. We compared the top 1000 ranked users from our model with this ground truth. We were able to detect 49 out of 81 users, i.e. about 60% of the known compromised users can be detected by investigating the top 1000 ranked users from our model.
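The comparison above is a recall-at-k measurement: the fraction of known compromised users that appear among the top k ranked users. A minimal sketch, with a hypothetical helper name and toy user IDs, followed by the arithmetic for the figures reported in this post:

```python
def recall_at_k(ranked_users, compromised, k):
    """Fraction of known compromised users found in the top k of the ranking."""
    hits = set(ranked_users[:k]) & set(compromised)
    return len(hits) / len(compromised)

# Toy example with hypothetical user IDs.
ranked = ["U5", "U2", "U9", "U1", "U7"]
known_bad = {"U2", "U7", "U8"}
toy_recall = recall_at_k(ranked, known_bad, k=3)

# Figures reported above: 49 of 81 compromised users in the top 1000.
reported_recall = 49 / 81
print(round(toy_recall, 3), round(reported_recall, 3))
```

Since the data contains 10,326 users, the top 1000 corresponds to reviewing roughly 10% of the user population.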
In this post, we have described a framework to automatically detect users exhibiting unusual behavior by capturing the variance in their behavior across multiple behavioral features. The framework is flexible enough to incorporate additional user behavioral features and machine learning algorithms. The proposed ensemble approach helps reduce false positives, as we are using more than one feature to summarize the user's activity. Each user behavior feature is like a sensor that triggers independently, and having multiple sensors trigger for a user makes that user more suspicious. The users are ranked based on these features and the top outliers are flagged. Once the users are flagged, the next step is for the IT folks to investigate these flagged users and identify malicious users among them.
About the Author
Anirudh Kondaveeti is a Principal Data Scientist and the lead for Security Data Science at Pivotal. Prior to joining Pivotal, he received his B.Tech from Indian Institute of Technology (IIT) Madras and Ph.D. from Arizona State University (ASU) specializing in the area of machine learning and spatio-temporal data mining. He has developed statistical models and machine learning algorithms to detect insider and external threats and "needle-in-hay-stack" anomalies in machine generated network data for leading industries.