Time series data is produced in domains such as IT operations, manufacturing, and telecommunications. Examples of time series data include the number of client logins to a website on a daily basis, cell phone traffic collected per minute, and temperature variation in a region by the hour. Forecasting a time series signal ahead of time helps us make decisions such as planning capacity and estimating demand. Previous time series analysis blog posts focused on processing time series data that resides on Greenplum database using SQL functions. In this post, I will examine the modeling steps involved in forecasting a time series sequence with multiple seasonal periods. The various steps involved are outlined below:
- Multiple seasonality is modelled with the help of fourier series with different periods
- External regressors in the form of fourier terms are added to an ARIMA model to account for the seasonal behavior
- Akaike Information Criteria (AIC) is used to find the best fit model
An example of time series data which shows the number of people logging in a gaming website over two months is shown in Figure 1. The graph shows the number of gamers logging in each hour. The X-axis represents the hour of the day and the Y-axis tracks the number of gamers who are logging in the website in that hour. This time series demonstrates seasonal behavior, with the number of gamers logging in following both daily and weekly seasonality. The highlighted regions in the plot correspond to weekends, during which logins are considerably higher than the weekdays. Within a single day, the number of logins increases towards mid-day and decreases thereafter. We explain the modeling steps to forecast this time series signal using the best line of fit.
Fig. 1: Time series plot of number of gamer logins per hour
Autoregressive integrated moving average (ARIMA) models are generally used to model time series data, however they do not directly handle seasonality. The ARIMA model regresses the current data value against historical data value(s) in the time series. In order to deal with multiple seasonality, external regressors need to be added to the ARIMA model. To incorporate the multiple seasonality in the gamer login behavior, additional Fourier terms are added to the ARIMA model, where Nt is an ARIMA process.
The seasonal pattern is modeled through the addition of Fourier terms which are used as external regressors. This approach is flexible, allowing you to incorporate multiple periods. For example, if there are ‘M’ periods (p1, p2, p3, pM) in the data, we would have different fourier series corresponding to each of the ‘M’ periods. In this particular example, there are two seasonal periods, daily and weekly, with p1 including 24 hours and p2 including 168 hours. For each of the periods ‘pi’, the number of fourier terms (Ki) are chosen to find the best statistical model for a given set of data. Given a set of models, AIC (which is derived from information theory) estimates the quality of model compared to other models, and thus provides a means for model selection. The value of Ki is chosen so as to minimize the AIC criteria.
Fig. 2: Heat Map showing the variation of AIC values with number of Fourier terms. It can be seen that the minimum AIC value is at point (1,2)
In order to find the right number of Fourier terms corresponding to each of the periods, the AIC values of the ARIMA model with varying Fourier terms are calculated. The scatter plot for the AIC values with varying number of Fourier terms for the two periods p1 (24 hours) and p2 (168 hours) is shown in Fig. 2. From the figure, it can be seen the best model, which minimizes the AIC criteria, is one which has one Fourier term corresponding to a period of 24 hours and two Fourier terms corresponding to a period of 168 hours with an AIC value of ~ 10450.
The R code to tune the number of Fourier terms and find the minimum AIC values is shown below. The variable “xreg1” generates a Fourier series with a period of 24 hours and the number of terms are varied using the variable “i”, from 1 to 5. Similarly, the variable “xreg2” generates a fourier series with a period of 168 hours and the number of terms are varied using the variable “j” from 1 to 5. For each combination of “i” and “j”, an ARIMA model is fit using the variables “xreg1” and “xreg2” as external regressors and the corresponding AIC value is calculated. Once all the AIC values are obtained, the best parameters “i” and “j” are those that correspond to the minimum AIC value.
Code 1: Finding the optimum number of fourier terms for a given time series
Fig. 3: Plot showing the fitted values for the time series. Red line is the fitted value and the black line is the actual value
The best line of fit for the time series is an ARIMA (4, 0, 3) model , including four autoregressive terms and three moving terms, with one Fourier term corresponding to a period of 24 hours and two Fourier terms corresponding to a period of hours. The plot of the fitted values is shown in Fig. 3, wherein the highlighted gray regions correspond to the weekends, the red line corresponds to the fitted values, and the black line corresponds to the actual values. The coefficients and the standard errors (s.e.) for the various terms are shown below.
The equation corresponding to the final model is
where yt-1, yt-2, yt-3, yt-4 correspond to the autoregressive terms and zt-1, zt-2, zt-3 correspond to the moving average terms. The external regressors correspond to fourier terms with a period of 24 hours (single frequency) and period of 168 hours (two different frequencies).
To visualize the effect of Fourier terms which are used as external regressors in the model, a plot of these values with time ranging from 1 to 350 hours is shown in Fig. 4. The value in the figure is the sum of sine and cosine terms shown in the equation above for each period. As can be seen from the figure, the red line corresponds to the external regressors that capture the weekly seasonality in the model (a period of 168 hours), and the blue line corresponds to regressors that capture the daily seasonality (a period of 24 hours). The forecast from the time series is thus the sum of the forecasts obtained from the ARIMA (4, 0, 3) model and the additional effect from the fourier signal.
Fig. 4 : Fourier terms used as external regressors in the forecasting model. Red line corresponds to weekly period of 24*7 hours and blue line corresponds to a daily period of 24 hours
Once the appropriate model is fit to the time series, the next step is to use this model to predict the future values. The code used for predicting the next ten values is shown below. The data from 1 to 750 hours is used for training, while the data for the next 10 hours from 751 to 760th hour is used for testing. The variable “xreg1” in the code below creates fourier terms with a period of 24 hours and “xreg2” creates fourier terms with a period of 168 hours. The variable “xtest”, which is a combination of terms from the variables “xreg1” and “xreg2”, is used as external regressor to the predict function with the model built using training data as the input.
Code 2 : Forecasting the time series values using the fitted model
We have outlined the modeling approach to forecast time series with two seasonal periods: daily and weekly. This method can be easily extended if we have more seasonal periods by adding additional Fourier terms corresponding to each period. In general, it is recommended to forecast less than 5 values in advance, because the error in forecasts increases if we forecast far ahead in the future. This is because the ARIMA model uses the past values to predict the future values, and if we predict too much into the future, the error-prone predictions obtained previously would be used as training data.
The massively parallel processing (MPP) capabilities of Pivotal Greenplum Database and HAWQ can be used to forecast multiple time series at different nodes in a scalable and parallel manner. As an example, consider predicting the cell phone outages by analyzing the time series data of cellphone usage from various data centers. A separate forecasting model can be built for each data center using Procedural Language R (PL/R) to monitor each data center independently. To illustrate with an example, consider the table below called input_table, which consists the time series of performance values aggregated in the form of an array for different data centers.
input_table consisting of time series of performance values of different data centers
The PL/R function below can be used to find the optimum number of Fourier terms required for model selection for each of the data centers. All the computations are performed in-database. As it can be seen from the code below (Code 3), the PL/R function is similar to Code 2 with an SQL like wrapper.
Code 3: PL/R function calculate_optimum_fourier_terms to find the optimum number of fourier terms
The above function calculate_optimum_fourier_terms can be used to find the optimum number of fourier terms for each of the time series from the data centers, by executing the function on each row of the input_table as below. The results are stored in a table called optimum_parameter_vals.
Code 4: Create table optimum_parameter_vals to store the optimum number of fourier terms for each time series
The table optimum_parameter_vals created contains the optimum number of fourier terms for the time series from each data center. A sample of the table optimum_parameter_vals is shown below:
optimum_parameter_vals table consisting of the optimum fourier terms for each time series
Once the optimum number of Fourier terms for the time series are obtained, the following function forecast_time_series can be used to fit the appropriate model. This uses the optimum number of Fourier terms and then forecasts the time series signal using the learned model. The function takes the time series signal, optimum parameters (Fourier terms), and the number of desired forecasted points as the inputs and outputs of the forecasted values.
Code 5: Function forecast_time_series to learn and forecast the time series signal
The above function forecast_time_series can be used to obtain the next 5 values for each of the time series signal from the different data centers, by executing it on each row of the input data. The results are stored in forecasted_vals table.
Code 6 : Create table forecasted_vals to find the next 5 forecasted values for each time series.
The table forecasted_val created contains the next 5 forecasted values for the time series from each data center. A sample of the table forecasted_val is shown below. The column Forecasted_Vals consists the forecasted values from the time series.
forecasted_vals table with the forecasted values for each time series
In this blog post, we have shown how to use the Pivotal Greenplum Database to develop a univariate time series forecasting model in a scalable and parallelizable fashion. In the next blog post, we will examine forecasting a more advanced time series signal involving multiple variables using Vector Autoregressive (VAR) models for specific IT operations applications, such as virtual machine capacity planning.
About the Author
Anirudh Kondaveeti is a Principal Data Scientist and the lead for Security Data Science at Pivotal. Prior to joining Pivotal, he received his B.Tech from Indian Institute of Technology (IIT) Madras and Ph.D. from Arizona State University (ASU) specializing in the area of machine learning and spatio-temporal data mining. He has developed statistical models and machine learning algorithms to detect insider and external threats and "needle-in-hay-stack" anomalies in machine generated network data for leading industries.More Content by Anirudh Kondaveeti