In the previous blog, we discussed a model for forecasting univariate time series with multiple seasonal periods. In this blog, we discuss modeling approaches for forecasting multivariate time series in applications like virtual machine capacity planning. The same techniques apply broadly to other areas as well, such as monitoring industrial equipment or vehicle engines.

Virtual machines (VMs) are used in many organizations to power many different kinds of business-critical applications. When setting these applications up, it is important to plan the capacity of the VMs required to run each application. This helps to take full advantage of the available resources without performance degradation of the VMs.

## Collecting The Data & Modeling Plan

In this blog, we discuss the modeling steps involved in VM capacity planning. Given a set of metrics that capture the behavior of VMs, the goal is to forecast these metrics and predict in advance when capacity will hit a threshold. The output of the model is a list of VMs that are predicted to fail in the near future.

The data used for analysis is obtained from the vCenter adapter, which collects CPU usage, disk, memory, and network-related metrics for the virtual machines in the network. The data collected consisted of a total of 34 metrics averaged over 5-minute intervals for each VM. The data was collected across tens of thousands of VMs for approximately 3 months, from January to March, for a total of 78 days. Some of the metrics include CPU capacity entitlement, CPU usage, CPU idle summation, disk read, memory overhead, etc.

The various steps involved in modeling are:

- Selecting variables to reduce the total number of metrics
- Generating a multivariate time series from the metrics by sampling them at regular intervals
- Forecasting the multivariate time series signal using a vector autoregressive (VAR) model
- Setting thresholds to flag alerts when a forecasted time series value exceeds the threshold

## Variable Selection

As a first step in the analysis, feature selection was done to identify the important features to incorporate into the model. If two features *x* and *y* are highly correlated with each other, retaining both in the model would be redundant for forecasting, since feature *x* would provide the same information as feature *y*. Hence, the pairwise correlation between the metrics is calculated to identify relevant metrics with low correlation.

Fig. 1: Pearson’s correlation between the 13 metrics retained in the model. The labels on the x-axis are the same as the labels on the y-axis.

The Pearson correlation coefficient, which reflects the degree of linear relationship between two metrics, was calculated for each pair of metrics. This was done using the MADlib correlation function, which computes the coefficients in a distributed manner. The correlation coefficient varies from -1 to 1: a coefficient of 1 implies perfect correlation (if one variable increases, then the other variable also increases), 0 means no correlation, and -1 means perfect anti-correlation. Among the complete set of 34 metrics that describe VM performance, correlation was exploited to reduce them to 13 metrics (i.e., if two metrics have a correlation greater than a threshold, only one of the two is retained in the model). A threshold of 0.8 was chosen arbitrarily.

The correlations between the 13 metrics that are retained in the model are shown in Fig. 1, which shows that the pairwise correlation between any two metrics is less than 0.8.
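As a sketch, the pruning rule described above (of every pair of metrics with correlation above 0.8, keep only one) can be expressed in a few lines. The blog computes the correlations in-database with MADlib; a local pandas equivalent, with hypothetical metric names and toy data, might look like this:

```python
import numpy as np
import pandas as pd

def prune_correlated(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """Greedily drop one metric of every pair whose absolute Pearson
    correlation exceeds `threshold`; return the retained column names."""
    corr = df.corr().abs()
    keep = []
    for col in df.columns:
        # Retain the metric only if it is not highly correlated
        # with any metric already retained.
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return keep

# Toy example: three metrics, two of them nearly identical.
rng = np.random.default_rng(0)
cpu = rng.normal(size=200)
data = pd.DataFrame({
    "cpu_usage": cpu,
    "cpu_entitlement": cpu + rng.normal(scale=0.01, size=200),  # ~perfectly correlated
    "disk_read": rng.normal(size=200),
})
print(prune_correlated(data))  # cpu_entitlement is dropped
```

Applied to the full 34-metric table, the same rule would yield the reduced set of 13 metrics.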

## Generating Multivariate Time Series

Once feature selection is done, the next step is to formulate the right model and learn its coefficients from the training data. We use a vector autoregressive (VAR) model to forecast the multivariate time series corresponding to the VM metrics. The VAR model extends the univariate autoregressive (AR) model by capturing the linear interdependencies between multiple variables. Each variable has a regression-like equation in which it is regressed against its own lagged values and the lagged values of the other variables.

In its basic form, a VAR consists of a set of *K* variables *y_{t} = (y_{1t}, y_{2t}, … y_{Kt})*. The *VAR(p)* process is then defined for *p* lags as

*y_{t} = A_{1}y_{t-1} + A_{2}y_{t-2} + … + A_{p}y_{t-p} + u_{t}*

where the *A_{i}* are *(K x K)* coefficient matrices for *i = 1, 2, … p*, and *u_{t}* is a *K*-dimensional process with *E(u_{t}) = 0* and a time-invariant positive definite covariance matrix *E(u_{t}u_{t}^{T}) = Σ_{u}*.

For our specific problem of interest, since we have 13 VM performance metrics, the value of *K* is 13. Let *y_{t}^{h} = (y_{1t}^{h}, y_{2t}^{h}, … y_{13t}^{h})* be the vector of the 13 performance metrics for a specific VM at a particular time of day *h* on day *t*. For each metric of interest *y_{it}^{h}*, at a particular time of day *h* on day *t*, the current value of the metric is regressed against its own lagged values at time of day *h* for the previous *p* days and against the lagged values of the other metrics at time of day *h* for the previous *p* days. Out of the 78 days of data available, the first 68 days were used for training and the remaining 10 days were used for testing.
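The sampling scheme above (one observation per day, at a fixed time of day, per VM) can be sketched with pandas; the column names and toy values here are hypothetical:

```python
import pandas as pd

# Hypothetical 5-minute samples: one row per (vm, timestamp), one column per metric.
samples = pd.DataFrame({
    "vm": ["vm1"] * 6,
    "ts": pd.to_datetime([
        "2024-01-01 14:15", "2024-01-02 14:15", "2024-01-03 14:15",
        "2024-01-01 09:00", "2024-01-02 09:00", "2024-01-03 09:00",
    ]),
    "cpu_usage": [0.4, 0.5, 0.6, 0.1, 0.2, 0.1],
    "disk_usage": [10.0, 11.0, 12.0, 3.0, 3.5, 3.2],
})

samples["day"] = samples["ts"].dt.date
samples["time_of_day"] = samples["ts"].dt.time

# One multivariate daily series per (vm, time-of-day) pair: rows are days,
# columns are the metrics -- i.e., the sequence of y_t^h vectors.
series = {
    key: grp.sort_values("day").set_index("day")[["cpu_usage", "disk_usage"]]
    for key, grp in samples.groupby(["vm", "time_of_day"])
}
print(len(series))  # one series per (vm, h) key
```

Each entry of `series` is then the training input for one VAR model.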

As an example, the training data consisting of the time series for a sample of 3 metrics (disk usage, CPU idle summation, and CPU capacity entitlement average) at 2:15 pm each day for 68 days for a particular VM is shown in Fig. 2. The dotted line in the figure shows the increasing trend of these values.

Fig. 2: Time series of VM metrics at 2:15 pm each day. The top plot is for CPU_capacity_entitlement_average, middle plot is for CPU_idle_summation and the bottom plot is for disk_usage_average.

## Forecasting Using Vector Autoregressive (VAR) Model

We develop a VAR model using the performance metrics for each VM separately, corresponding to a time of day *h*. Let *VAR(p)^{h(k)}* denote the model developed for time of day *h* using data from the past *p* days and the performance metrics of the *k*-th VM, i.e.,

*VAR(p)^{h(k)}: y_{t}^{h(k)} = A_{1}^{h(k)}y_{t-1}^{h(k)} + A_{2}^{h(k)}y_{t-2}^{h(k)} + … + A_{p}^{h(k)}y_{t-p}^{h(k)}*

where *y_{t}^{h(k)}* denotes the vector of performance metrics for the *k*-th VM at a particular time of day *h* on day *t*, and *y_{t-1}^{h(k)}* denotes the same vector at time of day *h* on day *t-1*.

The total number of models depends on *k* and *h*:

- Total number of VMs *(k)*: if there are *K* VMs, then *k = 1, 2, … K*
- Time resolution at which the forecasts are desired *(h)*: if a forecast for each hour is desired, then *h = 1, 2, … 24*, and if forecasts for every quarter hour are desired, then *h = 1, 2, … 96*

As can be seen, each model built for a specific VM *k* at hour *h* is independent of the others. As a result of this independence, we can take advantage of the MPP architecture of the Greenplum Database and HAWQ to run each model (corresponding to each *k* and *h*) on a separate Greenplum segment, making the computation feasible. This is done by distributing the data corresponding to a virtual machine *k* and time of day *h* across segments and running the model via PL/R.

The pseudo-code for this is shown below, where the *time_series_data* column in the *input_table* consists of the time series corresponding to a particular VM and a specific time of the day. The function *var_model* is the PL/R function that calculates the coefficients of the VAR model from the input data.

Code 1: Function to evaluate the VAR model in a distributed manner.

In a VAR model, the error in the forecast increases the farther ahead in the future we forecast. This is because the forecasted value *y_{t+1}* (itself prone to error) is used to obtain the forecast for *y_{t+2}*. Fig. 3 shows the fit and residual values for CPU_usage_average of a particular VM at 2:15 pm using the VAR model on the 68 days of training data. The x-axis in the plot denotes the day index. Blue dotted lines in the top plot are forecasted values and solid lines are actual values. The black line in the middle plot (with a red line at value 0) shows the residuals for the corresponding forecasts. The bottom two plots are the autocorrelation (ACF) and partial autocorrelation (PACF) of the residuals. If there is significant autocorrelation in the residuals, it means that the model can be improved further. From Fig. 3, it can be seen that there is very little autocorrelation in the residuals, indicating that the model fits well.

Fig. 3: Fitted and residual values from a particular VAR model.
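This compounding of forecast error can be made concrete with a scalar AR(1) analogue (a deliberate simplification of the VAR; *a* and *sigma* below are illustrative values): because each step reuses the previous forecast, the h-step-ahead forecast variance grows with the horizon.

```python
# Scalar AR(1) illustration of error compounding in recursive forecasts.
# For y_t = a * y_{t-1} + e_t with Var(e_t) = sigma^2, the h-step-ahead
# forecast variance is sigma^2 * (1 + a^2 + a^4 + ... + a^(2(h-1))).
a, sigma = 0.9, 1.0

def forecast_variance(h: int) -> float:
    return sigma**2 * sum(a**(2 * i) for i in range(h))

print([round(forecast_variance(h), 3) for h in (1, 2, 5, 10)])
# -> [1.0, 1.81, 3.428, 4.623]: uncertainty grows with the horizon
```

This is why the 10-day test horizon is kept short relative to the 68-day training window.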

## Generating Thresholds For Alerts

After forecasting the multivariate time series signal, a thresholding mechanism is developed to flag alerts when the metrics exceed a threshold value. The Hotelling T^2 statistic measures the deviation of a given vector from the sample mean vector. It is generally used to monitor a multivariate time series signal and is defined as

*T^2 = c (y_{t}^{h} – m)^{T} S^{-1} (y_{t}^{h} – m)*

where *y_{t}^{h}* is the vector of the VM performance metrics, *m* is the mean vector obtained from the training data, and *S* is the covariance matrix.

The *T^2* statistic is calculated from the multivariate time series in the training data as well as the forecasted data. Once the forecasted values are summarized using the *T^2* statistic, a threshold, chosen based on the *T^2* values from the training data, is used to flag alerts when the VM capacity exceeds it. The Phase I threshold for single observations, commonly used in process control, is calculated based on the number of observations and the beta distribution. This threshold determines the Upper Control Limit (UCL) and Lower Control Limit (LCL) for the statistic being monitored. If the statistic is either higher than the UCL or lower than the LCL, it needs to be investigated as a possible out-of-control signal. The values of the *T^2* statistic for a particular VM are shown in Fig. 4. The x-axis denotes the day index and the y-axis denotes the value of the *T^2* statistic. Again, the first 68 days are used for training, and the data from the 69th day onward is obtained by forecasting. The UCL, or the threshold, for this VM is 29.2.

The high value of the *T^2* statistic on the 70th day is an indication of an abnormally behaving VM, showing that we are able to forecast the metric well in advance and alert if a VM is expected to behave abnormally in the future.

Fig. 4: T2 statistic values for a particular VM in training and forecasted data. High values of T2 statistic in forecasted data are flagged as outliers. The Upper Control Limit (UCL) is 29.2 and Lower Control Limit (LCL) is 0.
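As a sketch of how the *T^2* statistic and a Phase I limit might be computed (the post does not show its exact formula; this uses the standard beta-distribution limit for individual observations, takes the constant *c* as 1, and uses random stand-in data):

```python
import numpy as np
from scipy.stats import beta

def hotelling_t2(Y: np.ndarray, m: np.ndarray, S_inv: np.ndarray) -> np.ndarray:
    """T^2 = (y - m)^T S^{-1} (y - m) for each row y of Y."""
    d = Y - m
    return np.einsum("ij,jk,ik->i", d, S_inv, d)

def phase1_ucl(n_obs: int, n_vars: int, alpha: float = 0.0027) -> float:
    """Phase I UCL for individual observations:
    UCL = ((n-1)^2 / n) * Beta_{1-alpha}(p/2, (n-p-1)/2)."""
    b = beta.ppf(1 - alpha, n_vars / 2, (n_obs - n_vars - 1) / 2)
    return (n_obs - 1) ** 2 / n_obs * b

rng = np.random.default_rng(1)
train = rng.normal(size=(68, 13))   # stand-in: 68 training days, 13 metrics
m = train.mean(axis=0)
S_inv = np.linalg.inv(np.cov(train, rowvar=False))

ucl = phase1_ucl(n_obs=68, n_vars=13)
t2_train = hotelling_t2(train, m, S_inv)
print(t2_train.shape, round(ucl, 1))
```

With the real training data, forecasted days whose *T^2* exceeds the UCL are the ones flagged as alerts.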

## Using The Model

In this blog, we have outlined a data-science-driven approach for flagging capacity outages in a network of virtual machines by analyzing their performance metrics. The performance data from hundreds of thousands of VMs can be fed into a mobile application backed by this model to output a daily list of VMs in the network that could fail at a particular hour the next day.

The model we developed is also highly scalable, since the computations for each VM at a particular time of day occur on a separate segment, independent of the rest of the VMs. As an example, consider an organization with 10,000 VMs that wants to monitor the hourly performance of these VMs: the number of models to be built is 10,000 VMs x 24 hours, i.e., 240,000 models. By making use of the parallel architecture of Greenplum, each of these models can be run in parallel on an independent Greenplum segment. This is far more efficient than running the models serially on a single segment.

## About the Author

Anirudh Kondaveeti is a Principal Data Scientist and the lead for Security Data Science at Pivotal. Prior to joining Pivotal, he received his B.Tech from Indian Institute of Technology (IIT) Madras and Ph.D. from Arizona State University (ASU) specializing in the area of machine learning and spatio-temporal data mining. He has developed statistical models and machine learning algorithms to detect insider and external threats and "needle-in-hay-stack" anomalies in machine generated network data for leading industries.
