No longer merely the unexciting but essential back door plumbing of an enterprise, IT will increasingly take a front-and-center role in company strategy. Businesses that execute IT better and more innovatively will gain a competitive advantage. Those that do not risk a loss of confidence and reputation, as we recently saw in the poor execution of healthcare.gov and the Target data breach. As IT operational needs grow more complex and attacks against the infrastructure grow more sophisticated, a new data science-driven way of thinking will become necessary.
Applications, network components, storage devices, servers, middleware, and virtual machines all generate a large volume of logs for health monitoring, alerting, or reporting. But so far, an increase in data hasn’t translated to greater value from that data. For example, when a business service goes down, it prompts a mad scramble among operations folks as they try to find where things went wrong. In such a case, we should ask whether we can leverage data to do more efficient root cause analysis. In another example, IT analysts manually examine thousands of individual alerts that come their way every day. Can we gain insights into which alert or incident topics command most of the analysts’ time, so we can focus on the corresponding network entities?
I firmly believe that if we apply data science to the log data, the answer to these questions is yes. Some example cases that Pivotal’s Data Science Labs team has worked on are:
- Leveraging VMware’s vCenter data, such as memory, disk, and CPU periodic readings, to perform capacity planning.
- Processing centrally collected alerts and incidents using advanced text-based clustering methods to reveal hidden topics or themes, along with interesting facts such as mean time to repair per topic.
- Performing impact analysis from enterprise software and hardware topology data to assess the damage or the users affected if a node or a technology component goes down.
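The impact analysis in the last item can be thought of as a graph traversal over the topology data. Below is a minimal sketch, assuming the topology has already been reduced to a mapping from each component to the components that depend directly on it. The component names and the `impacted_components` helper are hypothetical and purely illustrative; they do not describe the team's actual implementation.

```python
from collections import deque

# Hypothetical topology: each key is a component, each value lists the
# components that depend directly on it (illustrative names only).
DEPENDENTS = {
    "storage-array-1": ["db-server-1"],
    "db-server-1": ["billing-app", "reporting-app"],
    "billing-app": ["customer-portal"],
    "reporting-app": [],
    "customer-portal": [],
    "web-server-2": ["customer-portal"],
}


def impacted_components(dependents, failed):
    """Return every component downstream of `failed`, found breadth-first."""
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            # Skip components already visited and guard against cycles
            # that lead back to the failed node itself.
            if child not in seen and child != failed:
                seen.add(child)
                queue.append(child)
    return seen


impact = impacted_components(DEPENDENTS, "db-server-1")
print(sorted(impact))  # ['billing-app', 'customer-portal', 'reporting-app']
```

In practice the same traversal could be joined against user or service ownership data to estimate who is affected, not just which components are.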
Examples such as the above leverage log data in ways not seen before, but many more possibilities remain. Ultimately, the ability to predict hardware failures and then recommend resolutions is the Holy Grail for IT operations folks. To capture the full complexity and interdependency of network components for failure prediction, the new thinking demands that data be collected from multiple silos and applications. This will enable the innovative predictive analytics that data science affords.
Data science finds abundant applications in IT security as well. There are several major areas where data science applies: cyber defense through countering internal and external threats, identity access management, auditing, and asset management.
- Cyber defense: Statistical language models have been used to spot algorithmically generated domain names. Malware exhibiting fast-flux behavior has been identified using social network analysis over 30 days of firewall or DNS logs. Anomalous user-to-resource access incidents have been identified using more than 6 months of Active Directory data for baseline profiling and outlier detection.
- Identity access management: By joining static role-based access control data against the dynamic behavior logs, we can see the discrepancy between the two. Similarly, physical access data from badging and remote access via VPN can be joined to identify unusual behavior.
- Asset management: We can’t protect what we don’t know we have. Text-based topic mining has been applied to categorize documents stored on servers and assess their sensitivity, for example legal documents, source code, and PII.
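The identity access management item above comes down to comparing what users are allowed to do with what they actually do. Here is a minimal sketch of that comparison, assuming both data sets have been normalized to (user, resource) pairs. The sample records and the `access_discrepancies` helper are hypothetical, and a real deployment would run the same join over far larger Active Directory, badging, or VPN logs.

```python
# Hypothetical static RBAC grants: (user, resource) pairs a user is entitled to.
granted = {
    ("alice", "payroll-db"),
    ("alice", "hr-share"),
    ("bob", "build-server"),
}

# Hypothetical dynamic access log: (user, resource) pairs actually observed.
observed = [
    ("alice", "payroll-db"),
    ("bob", "build-server"),
    ("bob", "payroll-db"),
    ("alice", "payroll-db"),
]


def access_discrepancies(granted, observed):
    """Compare entitlements with behavior and return both kinds of mismatch."""
    used = set(observed)  # collapse repeated accesses to distinct pairs
    return {
        # Accesses with no matching grant: candidates for investigation.
        "used_without_grant": sorted(used - granted),
        # Grants never exercised: candidates for revocation.
        "granted_but_unused": sorted(granted - used),
    }


report = access_discrepancies(granted, observed)
print(report["used_without_grant"])  # [('bob', 'payroll-db')]
print(report["granted_but_unused"])  # [('alice', 'hr-share')]
```

Both directions of the mismatch are useful: unexplained access points to possible misuse, while unused grants point to entitlements that could be trimmed.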
These are examples of data science applications for IT security that are already happening. These use cases can’t be addressed by typical point-in-time analytics. A new way of thinking will enable long-term data analysis and advanced modeling, as I have blogged about in earlier posts.
Data science has already made inroads in various enterprise business units, especially in retail and marketing areas. While IT departments traditionally drive data growth through logging volume, they have been slow to harness the power of data. As interest in applying data science to IT data increases, a new way of thinking about data collection and processing will be required.
Data science demands constant experimentation and a “fail fast” mentality, meaning that raw data must be placed in one central location. It no longer makes sense to store data in different silos for different use cases. This calls for the creation of a data lake. In a data lake, each data source supports a wide variety of use cases, whether operational, security, or retail analytics. For example, application logs are equally relevant to operational staff building failure prediction models, security staff studying and modeling after breach events, and business folks performing product recommendation or fraud detection. HR records can support identity access management policies while also serving as contextual data in incident response.
Data science will drive new possibilities in IT. For this to happen, data science must sit at the center of the corporate IT strategic roadmap. Data silos must be integrated, and data owners and stakeholders will need to collaborate with one another. A scalable Big Data infrastructure must be chosen carefully, with data science capabilities built in from the start. As we’ve seen in recent months, competitive, effective, and secure enterprises will need to instill a data science-centric analytics culture that leverages IT data for long-term success.
About the Author
Derek Lin