*Joint work by Jin Yu and Anirudh Kondaveeti *

In previous blog posts, we have discussed how data science can be used to detect network security threats, such as using sequential pattern mining to detect watering hole attacks. Sequential pattern mining is shown to be effective in revealing “chains of attacks” in a network security breach. This blog post introduces the use of graph mining techniques to discover suspicious domain interaction structures, which may not necessarily be captured in sequential patterns. For example, a small group of domains that only interact with themselves but not others outside their “social networks” can be part of a botnet. Specifically, we are interested in detecting highly connected domain social networks to flag potential coordinated attacks such as watering hole attacks and also small isolated networks that resemble a botnet-like structure.

Fig. 1 shows the high-level workflow for mining domain social networks. The first two steps involve the application of the familiar association rule mining technique to establish pairwise correlations between domains. Two domains are linked if the conditional probability of a user visiting one domain given the fact that the user also visits the other domain within a short proximity of time is greater than a certain threshold. The domain social graph is built from such pairwise correlations (Step 3) and mined for specific structures (Step 4). The entire workflow has been implemented in Pivotal Greenplum Database and tested on proxy logs from a large enterprise network with promising results.

Fig. 1 High-level workflow for mining domain social networks

## Pairwise Domain Associations

Association rule mining is a data mining technique widely used in areas such as market basket analysis. It finds rules of the form “if A, then B” (denoted A => B). For example, “if buy chips and beer, then also buy diaper.” A = {chips, beer} and B = {diaper} are so called itemsets, containing items purchased in a single transaction. Unlike sequential patterns, association rules do not imply sequential orders. As a preparation step for mining domain association rules, domains visited by the same user in a certain time window are collected to form a “basket”, analogous to items purchased in a single transaction as in market basket analysis. To that end, a sliding window is run through the proxy logs to collect *distinct* domains. As an example, the domains in sliding window 1 in Fig. 2: {googlevideo.com, twitter.com, youtube.com, doubleclick.net, google.com}, form a basket. The example uses a 60s window. One can set the time interval of the sliding window to a different value depending on the problem in hand.

Fig. 2 A sliding window (60s-window in the example) is run through the data to form “baskets”.

Given domain-to-basket assignments, pairwise association rule mining mainly involves the evaluation of support and confidence of each rule Domain A => Domain B:

- Ÿ Support: Frequency of A and B in a common basket.
- Ÿ Confidence: Conditional probability P(B|A) of seeing B in the same basket given that A is also present.

Table 1 lists example pairwise domain association rules. High confidence rules (i.e., rules with high P(A|B) or P(B|A)) that involve multiple users over several days—for example the highlighted rule—are generally more interesting than one-off or single-user event.

Table 1 Example pairwise domain association rules.

## Exploring Interactions Between Domains

Building upon association rule mining, we adopt a graph-based approach to explore the interactions between domains. An undirected domain correlation graph is built from pairwise domain association rules. Fig. 3 shows an example correlation graph. Each node represents a domain. An edge connects two domains A and B if the max confidence of the rule A => B and B => A exceeds a user-specified threshold. In the example the max confidence associated with each edge is greater than 0.2. One can also impose a condition on domain co-occurrence frequency (support) to focus on rare (low support) domain associations.

Fig. 3 Example domain correlation graph.

Based on the study of social network structures of known attacks, a tightly connected network such as the one shown in Fig. 3 could indicate coordinated watering hole or botnet attacks. The question arises, “How to quantify the connectivity of a network?”

We take the OddBall approach to measure network connectivity. The first step is to identify a domain’s one-step neighborhood (also called ego-net). Two graph features are then extracted from the ego-net:

- Ÿ
*N*: Number of one-step neighbors - Ÿ
*E*: Number of edges in the ego-net

*N *and *E *follow a power law:

*E* ∝ *N ^{α}*

*,*1 ≤

*α*≤ 2.

Fig. 4 plots *N* vs. *E* in a log-log scale for different ego-nets. Slope = 1, i.e., *E = N*, indicates a star-like network, while slope = 2 indicates a fully connected network, a clique. To approximate the slope, we calculate the *OddBall ratio*: log(*E*)/log(*N*). Given the same number of neighbors, the higher the ratio the higher degree of network connectivity. In practice, domains with an OddBall ratio of greater than, say 1.5, can be given a higher priority for investigation.

Additionally, one can also compute the *clique percentage*, the ratio between *E* and the number of edges needed to form a clique: E/((N^{2}+N)/2). Clique percentage of close to 100% indicates a highly connected network.

Fig. 4 Number of neighbors vs. number of edges in log-log scale. Slope = 1 indicates a star-like network; slope = 2 indicates a fully connected network, a clique (Picture Source: ICDM’12 tutorial on graph anomaly detection).

Fig. 5 provides an example of a domain’s (central orange node) fully connected social network. The one-step neighbors of the central node are fully connected to each other, resulting in an OddBall ratio of log(10)/log(4) =1.66 and a clique percentage of 100%. Evidence from known attacks indicates that highly connected social networks as the one in Fig. 5 can potentially contain malicious domains. In addition, if some domains in the network were already known malicious, then the ego-net would help forensic export to easily identify their associates.

Fig. 5 Example of a domain’s (orange node) fully connected social network.

## Detecting Isolated Clusters

Given a domain correlation graph (cf. Fig. 3), one can also identify isolated groups of domains that only interact with domains in the same group but not others. This resembles a botnet-like structure. Detecting such isolated groups can be formulated as the task of finding connected components (CCs) in a graph. Domains in the example CC shown in Fig. 6 do not interact with the rest of the community. From the network structures of known attacks, the observation is that malicious sites tend to exist in small CCs.

Fig. 6 Malicious sites tend to exist in small CCs.

## Operationalization

Fig. 7 provides an operationalization vision for the analytics approaches introduced in this blog. The models are expected to be used by the security team, refined based on the feedback from the field, and updated periodically using new data loaded into the database.

Fig. 7 Operationalization vision.

In this blog post, we have introduced graph mining techniques for detecting network security threats. The approach builds domain social networks from pairwise domain association rules, and identifies suspicious domain interaction structures by leveraging well-established graph mining techniques. The approach has been applied to proxy logs from a large enterprise network, and flagged malicious domains that were previously unknown. Compared to the sequential pattern mining technique introduced in a related blog post, the graph-based approach focuses on analyzing domain social network structures, while sequential pattern mining more on capturing chains of attacks. The two approaches reveal different aspects of a potential attack, and can be used in conjunction to complement each other.

**Read more:**

- Learn more about Pivotal Greenplum Database
- Learn more about Pivotal Data Science
- Read more about data science-driven security techniques

## About the Author

More Content by Anirudh Kondaveeti