In a previous blog post, I talked about the improvements data science can bring to IT operations. The ultimate goal of a data science-driven IT infrastructure is one capable of performing automated root cause analysis and failure prediction. To achieve this goal, some foundational blocks must be built. One of these foundational blocks is the automatic clustering of IT alerts. To demonstrate this in greater detail, I'll walk through a patented approach the Pivotal Data Science team developed for a client.
Large enterprise IT infrastructure technology components—such as network, storage, or database—generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Yet humans' limited capacity is quickly overwhelmed by the sea of red alerts.
In addition, alerts are each reviewed by individual support persons, which does not create a shared, central corpus of insights for the organization. Such higher-level insights might include: What are the categories of alerts? Which classes of alerts generate the most volume or have the longest mean-time-to-repair? Which alert categories require immediate attention?
Insights such as these are useful for setting up alert review policies and can help automate response prioritization. Operationalizing the categorization of alerts enables their automatic routing to the right support person. In addition, root cause analysis and failure prediction use cases can also benefit from signals from these alert clusters.
While broad class labels of alerts may be available, often they are too general to be useful. So how can we use data science to achieve finer categorization of alerts?
Semi-structured text alerts are generated by IT infrastructure components such as storage devices, network devices, servers, etc. For the client-facing data science engagement I detail in this post, we leveraged only the textual information. To cluster the text data, the following steps were performed:
Text pre-processing

No two alerts are exactly alike. Text tokens exhibiting high variability, such as IP addresses, dates, and times, are either removed or replaced by a constant string (e.g., replacing an IP address such as 126.96.36.199 with "IP"). High-frequency, common tokens (e.g., "scom:" or "splunk:"), so-called stop words, are also removed. This substantially reduces the alert space and eliminates a lot of noise in the data.
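The normalization step above can be sketched as follows. The regular expressions and the stop-token list are illustrative assumptions, not the client's actual rules:

```python
import re

# Hypothetical stop tokens; the real list would come from the alert sources.
STOP_TOKENS = {"scom:", "splunk:"}

IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
TS_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(?::\d{2})?\b")

def normalize_alert(text: str) -> list[str]:
    """Replace high-variability tokens with constants and drop stop tokens."""
    text = IP_RE.sub("IP", text)          # collapse all IP addresses
    text = TS_RE.sub("TIMESTAMP", text)   # collapse date-time stamps
    return [tok for tok in text.split() if tok not in STOP_TOKENS]

tokens = normalize_alert("scom: disk failure on 10.0.0.1 at 2015-06-01 12:30:05")
```

After normalization, many superficially distinct alerts collapse onto the same token set, which is what makes the clustering in the next steps tractable.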
Distance metric computation
To perform clustering in general, a distance metric must be introduced to measure the closeness between a pair of samples. For alert text clustering, we treat each alert as a set of words and define the distance between two alerts as the Jaccard distance (one minus the Jaccard index) of the respective sets. For N alerts, we can then construct an N×N distance matrix recording all pairwise distances.
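A minimal sketch of the pairwise distance matrix, using small hypothetical token sets in place of real alerts:

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|: 0 for identical token sets, 1 for disjoint ones."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Toy alerts, already normalized into token sets.
alerts = [
    {"disk", "failure", "IP"},
    {"disk", "error", "IP"},
    {"cpu", "load", "high"},
]

# N x N matrix of all pairwise Jaccard distances.
n = len(alerts)
dist = [[jaccard_distance(alerts[i], alerts[j]) for j in range(n)] for i in range(n)]
```

The first two alerts share two of four distinct tokens (distance 0.5), while the third shares none with either (distance 1.0).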
Graph-based clustering

Given the N×N distance matrix, several clustering methods apply. After evaluation, we settled on a combination of graph-theoretic methods. We first run a connected-component-finding algorithm, part of the Pivotal Data Science tool bag we use for client engagements, to identify an initial set of clusters. We create a graph whose vertices correspond to alerts and establish an edge between two vertices if the Jaccard distance between them is less than an empirically chosen threshold. The algorithm then finds the parts of the graph not connected to one another; such parts are natural clusters of alerts.
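The component search over the thresholded graph can be sketched with a plain breadth-first search. The Pivotal tool-bag implementation is not public, and the threshold value here is purely illustrative:

```python
from collections import deque

def connected_components(dist: list[list[float]], threshold: float) -> list[list[int]]:
    """Treat alerts as vertices, with an edge when distance < threshold;
    return the connected components as lists of alert indices."""
    n = len(dist)
    seen: set[int] = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in range(n):
                if v not in seen and dist[u][v] < threshold:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(component))
    return components

# Alerts 0 and 1 are close; alert 2 is far from both.
comps = connected_components(
    [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.8],
     [0.9, 0.8, 0.0]],
    threshold=0.5,
)
```

Each returned component is one candidate alert cluster.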
An initial run of the connected-component algorithm yields many tight, homogeneous-looking clusters of alert texts. Yet some clusters are large, with alerts that can benefit from further clustering. For these large clusters, we run another algorithm from the Pivotal Data Science tool bag, the graph cut, to subdivide them further. Given an N×N distance matrix, the graph-cut algorithm solves an eigensystem problem to recursively partition the graph, yielding more refined clusters. A good reference for this task is the seminal 2000 paper by Jianbo Shi and Jitendra Malik, "Normalized Cuts and Image Segmentation."
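One level of a normalized-cut-style partition can be sketched with NumPy. The Gaussian affinity and the sigma value are assumptions for illustration; the actual tool-bag implementation may differ:

```python
import numpy as np

def normalized_cut_split(dist: list[list[float]], sigma: float = 1.0) -> np.ndarray:
    """One bipartition in the spirit of Shi & Malik (2000): split by the sign
    of the second-smallest eigenvector of the normalized graph Laplacian."""
    # Convert distances to affinities with an (assumed) Gaussian kernel.
    W = np.exp(-np.asarray(dist, dtype=float) ** 2 / sigma**2)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)                       # vertex degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L_sym)         # eigenvalues in ascending order
    fiedler = D_inv_sqrt @ vecs[:, 1]       # second-smallest eigenvector
    return fiedler >= 0                     # boolean mask: the two sub-clusters

# Two tight pairs (0,1) and (2,3), far from each other.
mask = normalized_cut_split(
    [[0.0, 0.1, 0.9, 0.9],
     [0.1, 0.0, 0.9, 0.9],
     [0.9, 0.9, 0.0, 0.1],
     [0.9, 0.9, 0.1, 0.0]]
)
```

Applied recursively to each half, and stopped by a cluster-quality criterion, this subdivides the oversized components from the previous step.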
Such work gives never-before-seen visibility and insight into IT alerts. The figure below presents a visualization of the clusters from four prior broad classes. The size of each bubble corresponds to the number of alerts belonging to a particular cluster. The intensity of the coloring corresponds to the number of incident tickets (escalated alerts) created from the alerts of a particular cluster. Large-to-medium, intensely colored clusters are of particular interest to the support analysts.
Here is another way to gain insight from the clustering information. A historical incident has a quantity known as time-to-resolve, the time taken for that particular issue to be resolved. We can compute mean-time-to-resolve from the incidents created from each cluster. The figure below shows mean-time-to-resolve for some alert clusters, along with the total number of incidents created within each cluster. For example, the first cluster has 4248 alerts, of which only 1407 became incidents. This implies there is room for improvement: paying attention to the rest of the alerts before they become bigger issues.
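Computing mean-time-to-resolve per cluster is a simple aggregation over historical incidents; the cluster IDs and hour values below are made up for illustration:

```python
from statistics import mean

# Hypothetical incident records: (cluster_id, time_to_resolve_in_hours).
incidents = [
    ("cluster_a", 4.0), ("cluster_a", 6.0),
    ("cluster_b", 30.0), ("cluster_b", 26.0),
]

# Group time-to-resolve values by cluster, then average each group.
by_cluster: dict[str, list[float]] = {}
for cluster_id, ttr in incidents:
    by_cluster.setdefault(cluster_id, []).append(ttr)

mttr = {cluster_id: mean(ttrs) for cluster_id, ttrs in by_cluster.items()}
```

Ranking clusters by mean-time-to-resolve (and by their incident-to-alert ratio) is one way to surface the clusters that deserve the most attention.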
This result yields an important insight: the cluster information can be used to improve infrastructure system management, for example by allowing system managers to prioritize areas for problem solving. But more importantly, this work serves as a foundational block in a data science-driven approach to IT infrastructure, enabling future root-cause analysis and failure prediction use cases. In a future blog post, I'll go deeper into how this transformational approach to IT operations is implemented.
About the Author: Derek Lin