Demystifying Data Science for Network Security

July 30, 2013 Derek Lin


I am often asked what value Pivotal adds over existing commercial solutions for network security. The short answer is data science: it provides a new ability to proactively and automatically discover adversarial events in data that might otherwise go undetected for some time. In this post, I’ll explain what this means in more detail. But first, let’s take a look at the current state of analytics for security.

Deterministic rule-based analytics

Security Information and Event Management (SIEM) products were generally the first products to attempt to apply rule-based analytics at any scale to aid in security operations. But SIEM tools struggled to incorporate the scale, performance or breadth of data sources needed in today’s highly complex IT environments and rapidly changing threat landscape. As a result, they generally found a more comfortable role in compliance than in security operations. Newer technologies such as RSA’s Security Analytics platform address those shortcomings, combining log, packet, endpoint and other data sources to provide far greater situational awareness, and bringing compliance and security requirements back into one solution.

These new platforms combine rules-based analytics with heuristic approaches and real-time threat intelligence feeds to help speed up identification of known threats, and provide an investigative workbench for the analyst to spot other unusual activity that could indicate new forms of attack.

Most rules are based on deterministic signature matching, while more advanced rules correlate different sources for complex event processing. Other rules compute metrics involving count, max, and average operations over a window of data. For example, the system flags an alert if the number of password failures in the last five minutes is greater than 20, or if there are more than 100 ports scanned in the last hour. These rules are the bread-and-butter of creating an enterprise security defense posture. They are easy to understand and work well in protecting the enterprise in real-time from known threats and known tactics.
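A windowed count rule like the password-failure example can be sketched in a few lines of Python. This is only an illustration: the class name `WindowCountRule` and its parameters are hypothetical, and events are assumed to arrive in time order.

```python
from collections import deque


class WindowCountRule:
    """Fire when more than `threshold` events fall inside a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def observe(self, timestamp):
        """Record one event (seconds since epoch) and return True if the rule fires."""
        self._timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


# "More than 20 password failures in the last five minutes."
password_rule = WindowCountRule(window_seconds=300, threshold=20)
# Simulated input: one failure every 10 seconds, 25 failures in total.
alerts = [password_rule.observe(t) for t in range(0, 250, 10)]
```

In practice one such counter would likely be kept per user or per source IP. The fixed threshold of 20 is exactly the kind of value that needs the manual tuning discussed next.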

However, signature-matching rules alone can be ineffective against threats that are polymorphic in nature. Out-of-the-box correlation or simple metric-based rules are generic, so enterprises are left on their own to refine or extend them to cover new variants. This requires manual engineering and constant threshold tuning and, if not done effectively, produces a high false positive rate. In addition, these rules cannot practically use long-term contextual information collected over more than a few days.

Statistical rule-based analytics

Going beyond deterministic and pattern-matching based rules, statistics-driven analytics has emerged in the marketplace. In these cases, rules are no longer about simple arithmetic operations of count, max, average, etc., but are defined by monitoring trends and the distribution of tracked entities. For example, a high number of destination IPs from a source no longer causes an alert if this is probabilistically consistent with the source IP’s baseline trend. An increase in activity from a user is not deemed anomalous if others in the same group behave similarly. Like deterministic rules, most statistical rules are univariate, only including a single variable from one data source at a time. While they add another layer to the overall defense strategy, univariate statistical models can be problematic because their simplistic nature tends to produce high false positive rates.
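To make the contrast with a fixed threshold concrete, here is a minimal sketch of a univariate baseline rule. It assumes a simple z-score against each source's own history; the function name, the cut-off of 3.0, and the sample counts are illustrative assumptions, not a description of any particular product.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it is far above the entity's own baseline.

    history: past daily counts of distinct destination IPs for one source.
    z_threshold: hypothetical cut-off, chosen here only for illustration.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_threshold


def fixed_rule(count):
    """Deterministic rule for comparison: alert above 100 destinations."""
    return count > 100


# Made-up daily counts for two sources.
quiet_host = [12, 15, 11, 14, 13, 12, 16]
busy_proxy = [480, 510, 495, 530, 470, 505, 520]
```

With these made-up numbers, a count of 515 from `busy_proxy` trips the fixed rule but is consistent with its baseline, while a count of 60 from `quiet_host` passes the fixed rule yet stands out sharply against its own history.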

As mentioned earlier, rule-based analytics (whether deterministic signature- or statistical-based) will always play an important role in commercial enterprise security systems, because they are easy to understand and simple to construct. Many organizations have successfully deployed these techniques today.

But today there exists the opportunity to complement these powerful approaches with another layer of analytics, driven by data science, that can aid in spotting new forms of attack quickly and effectively.

Data science-centric analytics

The use of data science for network security monitoring isn’t new in academic literature. However, the adoption of data science-centric analytics in industry has traditionally been difficult for two primary reasons:

  1. First, there is a lack of standardized benchmark data sets, which are critical for academic research publication. Enterprise network traffic data is private and isn’t available for public research. Even if this data were made available, examples of advanced attacks, which would be required for forensic research validation, are either unknown or obsolete.
  2. Second, with the volume, velocity, and variety of data generated from modern enterprises, security research becomes a Big Data problem. Infrastructure support for this data phenomenon has not been available in the past.

At Pivotal, because of the combination of our technology and the data science team, we have the opportunity to accelerate adoption of data science-centric analytics in a variety of market segments and also provide technological solutions that specifically address the Big Data problem. In conjunction with our colleagues at RSA, we are working to bring both these capabilities to the security analytics market. Because of the heightened threat climate and complex characteristics of large enterprise IT implementations, practitioners are eager to push the boundaries of current capabilities and collaborate with researchers and data scientists.

Where to begin

How do data science-centric analytics work? It starts with defining the right use case. Two of the most popular use cases we see today are malware beaconing detection and anomalous user lateral movement detection. Other use cases include phishing attack detection, at-risk enterprise asset identification, and more. Use case definition determines the approach across data source acquisition and selection, data transformation, and modeling. Let’s look at these stages in more detail.

Data sources

Take the malware beacon detection use case, for example. The goal is to collect historically siloed data or knowledge in one single data repository for later modeling. The useful available data include various network device logs that record events of internal devices communicating with external locations outside the firewall. Firewall and proxy server logs are the first standard logs to consider. Next generation firewall software, if available in the local environment, provides additional granularity in the data thanks to traffic classification. We can also use DHCP log information for IP-to-device identity resolution. We may leverage asset tag information, if available, to determine whether a device is an end-user device or a server. Network topology information or even naming convention documents, if available, can also be leveraged for the use case at hand. These additional data sources add valuable context that aids in the investigation.
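IP-to-device resolution is a good example of why combining sources matters: the same IP address can belong to different machines at different times. A minimal sketch, assuming DHCP leases have already been parsed into records with a start and end time (the `Lease` type, the field names, and the sample data are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    ip: str
    mac: str
    start: int  # lease start, seconds since epoch
    end: int    # lease end, seconds since epoch


def resolve_device(ip, timestamp, leases) -> Optional[str]:
    """Return the MAC address that held `ip` at `timestamp`, or None."""
    for lease in leases:
        if lease.ip == ip and lease.start <= timestamp < lease.end:
            return lease.mac
    return None


leases = [
    Lease("10.0.0.5", "aa:aa:aa:aa:aa:01", start=0, end=3600),
    Lease("10.0.0.5", "aa:aa:aa:aa:aa:02", start=3600, end=7200),
]
proxy_events = [
    {"ts": 100, "src_ip": "10.0.0.5", "dest": "example.com"},
    {"ts": 4000, "src_ip": "10.0.0.5", "dest": "example.com"},
]
# Enrich each proxy event with the device that held the address at that time.
for event in proxy_events:
    event["device"] = resolve_device(event["src_ip"], event["ts"], leases)
```

Here the two proxy events share a source IP but resolve to two different devices. A linear scan is fine for a sketch; at real log volumes this would more likely be a join inside the data platform.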

We then determine the desired data length in time, from a few weeks to many months, depending on subsequent modeling choices and performance constraints. Identifying and defining the data sources for a specific use case is a process by itself.

Key takeaway: Instead of analyzing a few standard fields from a single data source, such as the firewall or proxy server log, we consider all potential data sources and exploit local information and nuances as much as possible for the use case at hand.

Data transformation

Data transformation facilitates or enables the downstream data modeling process. For example, for web clickstream analytics of online activities, a required step is sessionizing raw click events into logical session-based groups. This is often where statistical features or metadata are derived. Sometimes data scientists can’t determine which features are useful until well after the first iterations of models. To allow for changing data transformation requirements in later model iterations, we must store data in its raw form.
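As an illustration of the sessionizing step, the sketch below groups raw click events into sessions using an inactivity gap. The 30-minute gap is a common convention rather than something prescribed here, and the function name and event layout are assumptions made for the example.

```python
def sessionize(events, gap_seconds=1800):
    """Group (user, timestamp) click events into sessions.

    A new session starts when a user has been idle for more than `gap_seconds`.
    Returns a dict mapping user -> list of sessions, each a list of timestamps.
    """
    sessions = {}
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap_seconds:
            user_sessions[-1].append(ts)   # continue the current session
        else:
            user_sessions.append([ts])     # idle too long: start a new one
    return sessions


clicks = [("alice", 0), ("alice", 60), ("bob", 30), ("alice", 4000), ("alice", 4100)]
result = sessionize(clicks)

# Per-session features, derived from the raw events rather than stored in their place.
features = {
    user: [{"clicks": len(s), "duration": s[-1] - s[0]} for s in user_sessions]
    for user, user_sessions in result.items()
}
```

Because the raw events are kept, the gap or the derived features can be changed in a later iteration simply by re-running the transformation.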

Key takeaway: Repeated processing of raw data is made possible by today’s Big Data storage and processing support, found in RSA Security Analytics and Pivotal’s Hadoop or MPP database products.

Modeling


The goal of modeling is to find anomalous events by using machine learning frameworks and data mining methods. Data types, data behavior, desired output characteristics, and performance considerations generally guide modeling choices. For example, security data is local: not all enterprises route traffic from all their data centers through proxy servers. A product’s detection framework that relies only on HTTP data misses signals from data centers not covered by proxy servers.

In another example, user behavior in a manufacturing business is expected to differ from that in a software development environment. Software developers tend to exhibit a wider variety of network activities than factory workers. A behavior-based detection framework tuned to the normal usage behavior of the former will underperform for the latter.
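For the beaconing use case mentioned earlier, one simple signal is how regular the gaps between a device's connections to a destination are. The sketch below scores that regularity with the coefficient of variation. It is an illustrative assumption about what a first-pass feature could look like, not the model used in our engagements, and the sample timestamps are made up.

```python
from statistics import mean, pstdev


def beacon_score(timestamps):
    """Coefficient of variation of gaps between connections; lower = more regular.

    Returns None when there are too few connections to judge.
    """
    if len(timestamps) < 3:
        return None
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    avg = mean(gaps)
    if avg == 0:
        return None
    return pstdev(gaps) / avg


beacon_like = [0, 600, 1201, 1799, 2400, 3001]  # roughly every 10 minutes
human_like = [0, 12, 15, 900, 905, 4000]        # bursty browsing
```

A score near zero suggests machine-like periodicity, while bursty human browsing scores much higher. Whether such a feature is useful depends on the local environment, which is exactly the point of the examples above.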

Modeling exercises are as much an art as a science. Just as an artist chooses the right medium for her project, a data scientist’s modeling decisions depend heavily on experience and tooling preference. The best modeling decisions arise from the cross-pollination of ideas amongst data scientists.

Key takeaways:

  1. Modeling decisions are local; choose the right framework for the right task.
  2. We find the best modeling choices are the outcome of joint, collaborative discussions among data scientists, each bringing a different analytical background and experience.

Feedback loop


There is a lack of ground truth (labeled incidents, known or unknown) in security analytics. As a result, data modeling in security is an iterative process. Security analysts must validate the alerts from model output so that the model can be tuned or maintained accordingly. Feedback can also come from past closed incidents.

Key takeaway: A feedback loop is a critical step in model development.
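One way to picture the feedback loop is as threshold tuning driven by analyst verdicts. The sketch below assumes each reviewed alert carries a model score and a true/false positive label; the function name, the target precision of 0.8, and the sample verdicts are hypothetical.

```python
def pick_threshold(labeled_alerts, candidates, min_precision=0.8):
    """Return the lowest candidate threshold whose alerts meet `min_precision`.

    labeled_alerts: list of (score, is_true_positive) pairs from analyst review.
    Returns None if no candidate reaches the target precision.
    """
    for threshold in sorted(candidates):
        flagged = [is_tp for score, is_tp in labeled_alerts if score >= threshold]
        if flagged and sum(flagged) / len(flagged) >= min_precision:
            return threshold
    return None


# Made-up analyst verdicts: (model score, was it a real incident?)
reviewed = [(0.95, True), (0.90, True), (0.80, False), (0.75, True),
            (0.60, False), (0.55, False)]
new_threshold = pick_threshold(reviewed, candidates=[0.5, 0.7, 0.85])
```

Choosing the lowest threshold that still meets the precision target keeps as many true detections as possible while limiting the false positives analysts must triage.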

As pointed out in Annika Jimenez’s blog post, data science is a disruptive transformation. It is affecting many industries. The cyber security industry is no different.

Technology supporting Big Data analytics in security is already here, in the form of products such as RSA Security Analytics and Pivotal’s Hadoop distribution and database. Now we want to make data science-centric analytics available to the security industry. To date, Pivotal has succeeded in applying these techniques and helping clients find adversarial events that were previously undetected, and we are doing so with very low false positive rates.

Data science-centric analytics will continue to prove their value in the security world. Enterprises that want to extend their security analytics reach are thinking more and more holistically about a strategy combining deterministic, statistical, and data science-centric approaches in conjunction with the investigative capabilities of their security operations teams. This is not just a platform or technology issue; it also depends on building human capability. Sooner rather than later, data science will be a central component of the cyber defense infrastructure within enterprises. Enterprises will need to build the resources to create, maintain, and deepen this capability.
