Insider threat detection is an important topic for all enterprises. After all, nobody wants to be the next Target. In that instance, criminal hackers used credentials stolen from a Target “insider”  the company’s refrigeration vendor  to break into Target’s corporate network and swipe 40 million customer credit card details and personal information of 70 million. In a previous blog post, we discussed unsupervised learning methods for detecting insider threats. In this post, we explain a semisupervised learning approach for insider threat detection. We make use of a known blacklist of users to detect other risky users, computers, and processes in an organization’s network. We primarily address the following questions:

How can we link users together based on their interactions with other assets and users in an organization?

How can the interactions between users and assets (like computers, processes, etc.) be used to develop risk scores for users and assets?

Should the most recent interactions among users be given more importance compared to past interactions to detect risky users?
We use a probabilistic graphical model called Belief Propagation to model the interactions between various entities in an organization. A variant of this algorithm is used to incorporate temporal information so as to give more importance to recent interactions. The method is flexible to incorporate additional attributes, e.g. rarity of the interactions, and also scales well as new users and assets are added to the network.
Data Description and Preprocessing
We use the process start and stop event dataset provided by Los Alamos National Laboratory (LANL). We remove the events where the user name ends with a ‘$’ or user name is ‘LOCAL SERVICE’, since they do not correspond to actual user accounts. We only retain the events that correspond to ‘start of process’, and filter the events corresponding to ‘process stop’ for this analysis. The resulting process logs consist of 44,786,745 events, corresponding to 9,925 users and 24,381 processes.
The column definitions of the process log table are described below and a sample of this data is shown in Fig. 1.

time_col: an integer value corresponding to time, in raw data.

date_time_col: time_col column converted into a date time format.

user_src: a combination of user and the domain.

user_name: user name extracted from the user_src column.

src: the source computer used.

proc_nm: name of the process that is run by the user on the source computer.

start_end: indicator corresponding to start/stop of the process.
_{Fig. 1 : Sample of process logs used for analysis}
As a first data processing step, we remove processes that are started by and source computers that are accessed by 100 or more users. This threshold was chosen based on a frequency plot of the number of users per process. The frequency plot of number of users per process is shown in Fig. 2. As we can see, the frequency of number of users decreases drastically when we pass the 100 or more users threshold. A total of 497 processes and 20 source computers are filtered based on this threshold.
_{ Fig. 2: Frequency count of number of distinct users per process}
Semisupervised learning approaches need a list of users who are blacklisted based on known malicious behavior. As described in a previous blog, we obtain the top 500 ranked users based on user behavioral features showing the maximum variance and use this as a surrogate user blacklist.
Constructing the Graph
The interactions between various entities  users, computers, and processes  are modeled by constructing a graph. The nodes of the graph can be a user, computer or process. An edge is created between a user and a computer if the user ever logs into the computer. The timestamp associated with the edge corresponds to the maximum or the most recent time among all the connections between the specific user and computer. Similarly, an edge is created between a computer and process, and the most recent timestamp between the pair is associated with the edge.
The weights of the edges are dynamic and timevariant. The weight of each edge is calculated using an exponential decay function and the difference between the current week and the week corresponding to the timestamp associated with that edge. If the timestamp associated with an edge corresponds to week t and the current week is t_{0}, the weight associated with the edge would be e^{(tt0)}.
The sample graph is shown in Fig. 3 where each pair is an edge. s corresponds to the source node, t corresponds to the target node and w is the weight of the edge. The graph has a total of 238,938 edges and 44,110 nodes.
_{Fig. 3 : Sample graph constructed from the process logs}
Model Building
The Belief Propagation (BP) algorithm is a popular method used to propagate risk in a connected graph, and is described as follows. Each node in a graph can belong to one of multiple states. In our case, we consider each node to be in one of two states: risky (R) or nonrisky (NR). A twodimensional vector is associated with each node, which corresponds to the probability of being risky and nonrisky i.e (x_{R}, x_{NR}) such that (x_{R} + x_{NR} = 1). Each node has a prior probability to be risky. For known risky nodes, this prior probability would be greater than 0.5. And for nonrisky nodes this probability will be equal to 0.5. Each message that is passed between a pair of nodes, is a multidimensional vector based on the number of states the target node has, e.g. m_{21}(x_{1}) is a twodimensional vector corresponding to the message from node 2 to node 1.
The belief of a particular node is the product of local evidence (based on observed variable) and evidence from all the messages coming to that node. Since they are independent probabilities, the final probability can be obtained by multiplying the individual probabilities. The belief or probability that a node i is in a risky state with a given probability (i.e p(x_{i} = x_{R})) is described in the following equation:
where đť›Ľ is a normalization constant, and É¸_{i} is the local evidence of node i.
The messages are updated as follows: A message from node i to node j would be a product of the information coming to i from all of its other neighbors except j, its local evidence and also the transition function đťś“_{ij} from i to j
A faster and scalable version of BP called linearized belief propagation (LinBP) was introduced recently. The beliefs and transition probabilities are centered by subtracting the mean, and the final beliefs are calculated iteratively using the equation below: Here, B' is the centered matrix of final beliefs, E' is the matrix of centered initial probabilities, H' is the centered transition probability matrix, A is the adjacency matrix and D is the diagonal degree matrix.
We introduce a modification of the LinBP algorithm called timevariant LinBP by incorporating the temporal information of the most recent time when two nodes are connected. The standard implementation of LinBP does not take time into consideration. However, for insider threat detection, it is important to take into account the latest time when a malicious user has logged into a specific computer or started a specific process in order to identify other risky assets. This is because the riskiness of an entity decreases with time. For example, if two users P and Q are connected to a known malicious user X, and the connection between P to X is more recent compared to the connection between Q to X, then user P should be considered riskier than user Q.
In addition to temporal information, additional attributes, like rarity of the edge, can also be incorporated into the belief propagation model. Meta information of the nodes like user group, computer group, process type, etc. can be used to define the rarity of an edge, e.g. how probable is a user belonging to a specific group to start a process of specific type. The rarity information can then be used as an additional attribute to calculate the belief.
Results and Discussion
We ran both the LinBP and timevariant LinBP algorithms on the graph described earlier. The run time of the algorithms was less than five minutes. The top 10 processes and computers with a high probability of being risky using both algorithms are shown in Fig. 4. As we can see, the algorithms produce considerably different results, suggesting that incorporating temporal information has a significant impact on the calculated final beliefs.
_{Fig. 4: The top 10 processes and computers with the highest probability of being risky}
The timevariant LinBP approach associated process P20 with a high probability of risk, and the subset of the graph corresponding to process P20 is shown in Fig. 5(a). We can see that process P20 is being started from 48 different computers. Most of these computers are connected to a red node, indicating that a known risky user has logged onto this computer recently. Hence, the algorithm determines this process to have a high probability of risk.
_{Fig. 5(a) : Subset of a graph showing the connections of process P20. Yellow nodes correspond to process, blue nodes correspond to computers, green nodes correspond to users and red nodes corresponds to blacklisted users.}
The timevariant LinBP approach associated process P7422 with a low probability of risk, and the subset of the graph for process P7422is shown in Fig. 5(b). Process P7422 was started by eight computers, out of which seven computers never had a login from any blacklisted user. Computer C18580, however, had a login from blacklisted user U7374. On investigating the logs, we find that this login occurred during the week of January 9th, which was seven weeks before the current week, i.e. week of February 27th. Therefore, the edge from node U7374 to node C18580 has a low weight of e^{7}or 0.000912, and hence does not contribute significantly to the final belief of process P7422 being risky.
_{Fig. 5(b) : Subset of a graph showing the connections for process P7422. Yellow nodes correspond to process, blue nodes correspond to computers, green nodes correspond to users and red nodes corresponds to blacklisted users.}
The graph for computer C3070, having a high probability of being risky, is shown in Fig. 6(a). As we can see, five out of the 60 users who have recently logged onto this computer are blacklisted. Incorporating temporal information is important in this regard, since more recent logins to a computer by a blacklisted user should be given more weight than less recent logins.
_{Fig. 6(a) : Subset of a graph showing the connections to computer C3070. Blue nodes correspond to computers, green nodes correspond to users and red nodes correspond to blacklisted users. }
The graph for computer C16001, having a low probability of being risky, is shown in Fig. 6(b). The computer has logins from 20 users, none of them blacklisted, thereby making the computer C16001 the least risky.
_{Fig. 6(b) : Subset of a graph showing the connections to computer C16001. Blue nodes correspond to computers and green node correspond to users. }
Conclusion
In this post, we described a method for detecting risky users and assets in an organization’s network given a subset of blacklisted users. The method scales well with large organizations having hundreds of thousands of users and assets. We also demonstrated how to incorporate additional information such as recency of interaction, rarity of the type of interaction, etc. into the model. The users and assets flagged as risky by the model should be investigated by forensic experts to truly categorize them as malicious or benign. This feedback loop with the forensic team can also be used to add validated risky users to the blacklist to bootstrap and improve the efficacy of the model.
About the Author
Anirudh Kondaveeti is a Principal Data Scientist and the lead for Security Data Science at Pivotal. Prior to joining Pivotal, he received his B.Tech from Indian Institute of Technology (IIT) Madras and Ph.D. from Arizona State University (ASU) specializing in the area of machine learning and spatiotemporal data mining. He has developed statistical models and machine learning algorithms to detect insider and external threats and "needleinhaystack" anomalies in machine generated network data for leading industries.
More Content by Anirudh Kondaveeti