Sequential Pattern Mining Approach for Watering Hole Attack Detection

October 22, 2015 Anirudh Kondaveeti

sfeatured-watercooler-bugsWork jointly performed by Anirudh Kondaveeti & Jin Yu

As malware techniques continue to evolve, it becomes increasingly challenging to detect network security threats, especially Advanced Persistent Threats (APTs) that are orchestrated by sophisticated adversaries. An increasingly common strategy adopted by APT actors to carry out targeted attacks is the watering hole technique. Watering hole attacks target a group of users in an organization by infesting the websites that are most often visited by these users. An example of watering hole attacks is shown in Fig. 1, where the user visits watering hole domain A, then gets redirected to a relatively unknown malicious domain B within a short time window, typically lasting a few seconds. In this blog post, we will discuss the application of sequential pattern mining to detect coordinated network attacks such as the watering hole variety.


Fig. 1: An example of Watering hole attack

Outbound network traffic data are readily available in enterprise firewall/proxy server logs where we observe internal users’ web access activities. In the following example, we will analyze billions of connection records between internal users and external destinations. Data preprocessing steps, as shown in Fig. 2 and listed below, are applied to the proxy logs before the subsequent analysis:

Fig. 2 : Preprocessing proxy logs

Fig. 2 : Preprocessing proxy logs

  • Host name normalization: Appropriate host names are normalized to the second level; for example, is normalized to The goal is to establish a stable list of identifiers for destination hosts, instead of tracking different variations of a single entity. There are cases where host names are kept un-normalized. For example, subdomains of (a popular dynamic DNS site) are treated as different entities.
  • Filtering invalid host names: To account for data collection errors, host names containing invalid characters are filtered out.
  • Identifying unpopular domains: Malicious domains are typically unpopular among users. The analysis therefore focuses on finding associations with domains that are visited by only a small number of users.
  • User-specific sessionization: Sessionization is a data processing procedure, widely used for analyzing user activities within a certain time window. We sessionize proxy logs to create time-ordered domain sequences, in order to analyze sequential patterns later. A new session is created for a user if the time to the previous domain visit is greater than a pre-specified timeout threshold. Fig. 3 shows an example where the domains visited by a user are converted into two sessions, using a timeout threshold of 60 seconds. The time threshold (60 secs) to create the sessions varies depending on the application. Since watering hole attacks happen within a span of few seconds, we chose a threshold of 60 seconds or lesser.

Sequential Pattern Mining (SPM) aims to discover sequential patterns in time-ordered data. It is applied to various domains such as market basket analysis, border security, vehicle trajectory analysis and health care. The quality of a sequential pattern is typically measured by its support and confidence. The support of a pattern <A, B => C> is the proportion of sequences in which the pattern occurs. The confidence is the ratio between support(<A, B, C>) and support(<A, B>), essentially estimating the conditional probability of the sequence <A, B, C> given the prior presence of <A, B>.

Fig. 3: Sessions created from a user's web access records

Fig. 3: Sessions created from a user’s web access records

In order to detect watering hole attacks, we aim to find low support and high confidence sequential patterns of domains. These patterns correspond to uncommon domain sequences (low support) that however have high conditional probabilities of redirect (e.g. traffic to a particular website almost always gets redirected to another particular site). Especially, we are interested in capturing low support and high confidence patterns that are present in multiple users’ web access records.

The steps involved in the modeling process are outlined in Fig. 4. First, time-ordered domain sequences are created from user-specific sessions (cf. Fig. 3). Earliest connection time is used for domain ordering in case of multiple visits to the same domain within a session. For example, the sequence from Session 1 in Fig. 3 is <{}, {}, {}, {}, {,}>, where the last element contains two domains visited at the same time. We then select subset of sequences that contain domains of interest (e.g. unpopular domains visited by a small number of users), and distribute them across multiple segment servers of Pivotal Greenplum Database (GPDB). Leveraging the MPP architecture of GPDB, a modified Sequential Pattern Mining algorithm (m-SPM) is run on sequences residing in different segment servers in parallel to detect correlated domains, that are potentially part of watering hole attacks.

Fig. 4: Model execution of sequential pattern mining

Fig. 4: Model execution of sequential pattern mining

A standard implementation of SPM (e.g. in R) follows the Apriori principle: If a sequence is not frequent (has low support), then none of its super-sequences is frequent, i.e., support has the anti-monotone property: <A>⊂<A,B>, hence support(<A>) ≥ support(<A,B>). Given sequences found so far, SPM grows each sequence by one element at each iteration, using Apriori-based candidate sequence generation to ensure minimum support. In our case, minimum support is enforced to prune domain sequences that are not sufficiently supported by the data. For example, sequences that are seen in one occasion or two are generally not interesting. Furthermore, taking into account the fact that the number of users also follows the anti-monotone property: #users(<A>) ≥ #users(<A,B>), we modify the standard SPM implementation to also filter out sequences that do not meet the minimum-number-of-user condition at each iteration, e.g. sequences from just a single user are less interesting, hence can be pruned.

For each domain of interest, the modified SPM (m-SPM) is run only on the subset of sequences containing that domain. This may cause some sequential patterns to have artificially high confidence. Recall that the confidence of a pattern <A, B => C> is defined as the ratio of support(<A,B,C>) to support(<A,B>), which is equivalent to the ratio of #(<A, B, C>) (the number of sequences containing <A, B, C>) to #(<A, B>) (the number of sequences containing <A, B>) i.e.


If #(<A, B>) is evaluated on the subset of sequences, it may not reflect the popularity of <A, B> in the whole dataset. A high confidence value as evaluated on the subset can therefore be misleading. We therefore design a new confidence metric, called relative confidence, to better estimate the popularity the left-hand-side sequences. The relative confidence of a pattern <A, B => C> is calculated as the ratio of #(<A, B, C>) in the subset to the minimum of #(<A>) and #(<B>) in the whole dataset i.e.


Intuitively, relative confidence favors the pattern whose left hand side contains less popular domains. For example, in Fig. 5, even though the confidence of both patterns are high, the relative confidence for the pattern containing and on the left hand side is lower. The resultant patterns returned by m-SPM are ordered by the relative confidence value, with more interesting patterns having higher values of relative confidence.

Fig. 5: Relative confidence favors patterns with unpopular left-hand-side domains

Fig. 5: Relative confidence favors patterns with unpopular left-hand-side domains

The model execution steps as depicted in Fig. 4 can be easily adapted to obtain new patterns from new proxy logs in an incremental fashion. One additional step is to merge the new results with the existing set of patterns, mainly involving simple summation of sequence counts from two sets of results for the recalculation of support and confidence measures.

In this blog post, we have introduced a data science approach which involves sequential pattern mining to perform watering hole attack detection. The approach has been applied to proxy logs from a large enterprise network. It not only captured known watering hole attacks, but also flagged new attacks that were previously unknown to the more traditional network security solutions. In our next blog post, we will discuss a graph mining based approach for a similar use case.

Read more:

About the Author

Anirudh Kondaveeti

Anirudh Kondaveeti is a Principal Data Scientist and the lead for Security Data Science at Pivotal. Prior to joining Pivotal, he received his B.Tech from Indian Institute of Technology (IIT) Madras and Ph.D. from Arizona State University (ASU) specializing in the area of machine learning and spatio-temporal data mining. He has developed statistical models and machine learning algorithms to detect insider and external threats and "needle-in-hay-stack" anomalies in machine generated network data for leading industries.

More Content by Anirudh Kondaveeti
Pivotal Demonstrates Cloud-Native Apps at JavaOne Conference
Pivotal Demonstrates Cloud-Native Apps at JavaOne Conference

During sessions and presentations, Pivotal will demonstrate its implementations of Cloud-Native Java techno...

A Look At Spring Cloud Services For Pivotal Cloud Foundry
A Look At Spring Cloud Services For Pivotal Cloud Foundry

Developing modern software emphasizes loosely coupled services, resilience, ease of scaling and the like. B...

Be an Early Bird for the Best Rates at SpringOne Platform

Register Now