The previous blog posts in this series introduced how window functions can be used for many types of ordered data analysis. Time series data appears in many real-world applications, including clickstream processing, financial analysis, and sensor data. This post elaborates on how these techniques can be extended to handle time series resampling and interpolation.
Converting Raw Time Series Data into Discrete Intervals
Many time series analysis problems assume observations over uniform time intervals. In the real world, incoming data often does not arrive this way. As a result, one common prerequisite for time series analytics is to transform an initially raw input into discrete intervals, or to resample an input at one frequency into an input at a different frequency. The same basic techniques serve both use cases. There are a number of primitive operations that need to be understood, including:
- Mapping – What time slice does an observation at a particular timestamp map to?
- Gap Filling & Interpolation – When no observations map to a particular time slice, how do you fill in the missing value?
- Aggregation – When multiple observations occur in the same time slice, what value do you decide to use?
Mapping
Mapping is the process of taking a given timestamp and assigning it to an interval in a given interval sequence. There are a couple of small variations in how this is accomplished, depending on whether you consider a given time interval inclusive of its start time or inclusive of its end time. The basic calculations required are:
```
step_number = (time - start) / step_size
new_time    = start + step_number * step_size
```
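To make the calculation concrete, here is a minimal Python sketch of the same arithmetic (the function name is ours, not part of any library). Note that the division is an integer (floor) division, which is what makes every timestamp snap down to the start of its slice:

```python
# Map a timestamp onto the start of its time slice.
# All values are plain numbers (e.g. seconds since an epoch);
# step_size is the slice width in the same units.
def interval_start(time, start, step_size):
    # Floor division gives the zero-based slice index...
    step_number = (time - start) // step_size
    # ...and the slice start is that many whole steps after `start`.
    return start + step_number * step_size

# Observations at 3s, 5s, and 23s map onto 10-second slice starts:
print(interval_start(3, 0, 10))   # 0
print(interval_start(5, 0, 10))   # 0
print(interval_start(23, 0, 10))  # 20
```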
Pivotal Greenplum Database 4.3 provides a function named interval_bound to facilitate this transformation. The following query demonstrates the interval_bound function using a 10 second interval:
```sql
select ts, interval_bound(ts, '10 second')
from observations
order by ts;

           ts            |   interval_bound
-------------------------+---------------------
 2014-03-25 16:21:03.402 | 2014-03-25 16:21:00
 2014-03-25 16:21:05.026 | 2014-03-25 16:21:00
 2014-03-25 16:21:20     | 2014-03-25 16:21:20
 2014-03-25 16:21:23.555 | 2014-03-25 16:21:20
 2014-03-25 16:21:26.684 | 2014-03-25 16:21:20
 2014-03-25 16:21:27.700 | 2014-03-25 16:21:20
 2014-03-25 16:21:28.660 | 2014-03-25 16:21:20
 2014-03-25 16:21:33.813 | 2014-03-25 16:21:30
 2014-03-25 16:21:40.605 | 2014-03-25 16:21:40
```
The interval_bound function also provides mechanisms for shifting the time window based on an offset or providing an alternate starting point. This functionality will be discussed in more depth in the Aggregation section.
Gap Filling & Interpolation
After mapping the input series to our time intervals, there may still be time slices that are missing because no observations occurred within those intervals. To address this problem we leverage two other pieces of functionality: the generate_series() function to generate a dense time series, and an outer join between our input data and the dense result.
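Before turning to SQL, the idea can be sketched in plain Python (the helper name dense_slices is ours): generate every slice start in the window, then outer-join observations against it so that empty slices show up explicitly as gaps.

```python
from datetime import datetime, timedelta

# Generate a dense sequence of slice starts, analogous to generate_series().
def dense_slices(start, stop, step):
    t = start
    while t <= stop:
        yield t
        t += step

slices = list(dense_slices(datetime(2014, 3, 25, 16, 21, 0),
                           datetime(2014, 3, 25, 16, 21, 40),
                           timedelta(seconds=5)))
print(len(slices))  # 9 slices: 16:21:00 through 16:21:40

# An outer join then pairs each slice with its observed value, leaving
# slices with no observations as explicit gaps (None values).
observed = {datetime(2014, 3, 25, 16, 21, 0): 4.56}
joined = [(s, observed.get(s)) for s in slices]
```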
```sql
with bounded as (
    select ts,
           interval_bound(ts, '5 second') as slice,
           value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40', '5 second') s(slice)
)
select *
from bounded
natural right join dense
order by slice, ts;

        slice        |             ts             | value
---------------------+----------------------------+-------
 2014-03-25 16:21:00 | 2014-03-25 16:21:03.402357 |  4.56
 2014-03-25 16:21:05 | 2014-03-25 16:21:05.026502 |  8.88
 2014-03-25 16:21:10 |                            |
 2014-03-25 16:21:15 |                            |
 2014-03-25 16:21:20 | 2014-03-25 16:21:20        |  7.32
 2014-03-25 16:21:20 | 2014-03-25 16:21:23.555933 |   7.7
 2014-03-25 16:21:25 | 2014-03-25 16:21:26.684565 |  9.31
 2014-03-25 16:21:25 | 2014-03-25 16:21:27.700739 |  8.33
 2014-03-25 16:21:25 | 2014-03-25 16:21:28.660447 | 10.63
 2014-03-25 16:21:30 | 2014-03-25 16:21:33.813349 |  9.56
 2014-03-25 16:21:35 |                            |
 2014-03-25 16:21:40 | 2014-03-25 16:21:40.605952 |  3.25
```
Here you can see that the time slices for 16:21:10, 16:21:15 and 16:21:35 were filled in as missing values, since no observations in the input mapped into those slices. There are multiple techniques that can be applied to fill these values, ranging from relatively simple “last value seen” calculations to more elaborate interpolation techniques, most of which require leveraging window functions.
Lead and lag functions, which were introduced in the first post in this series, might seem like a good first choice, but they look back a fixed number of rows. This becomes an issue when a gap is more than one row wide. Addressing it requires a simple new aggregate that returns the last non-null value seen:
```sql
create function last_known_t(prev float8, current float8)
returns float8 as $$
    # "or" carries the previous state forward when current is NULL (None)
    return current or prev
$$ language plpythonu;

create aggregate last_known(float8) (
    stype = float8,
    sfunc = last_known_t,
    prefunc = last_known_t
);

create function last_known_t(prev timestamp, current timestamp)
returns timestamp as $$
    return current or prev
$$ language plpythonu;

create aggregate last_known(timestamp) (
    stype = timestamp,
    sfunc = last_known_t,
    prefunc = last_known_t
);
In practice, implementing the function in Java would be more performant; however, the Python version is much shorter and better suited to a blog post.
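To see what the transition function is actually doing, here is the same logic in plain Python, folded over a column of values. Note one caveat of the `current or prev` idiom: Python treats None as falsy, which is exactly the NULL-skipping behaviour we want, but a legitimate value of 0.0 would also be treated as missing.

```python
# Transition function: keep the current value unless it is missing,
# in which case carry the previous state forward.
def last_known_t(prev, current):
    return current or prev

# Folding it over a column reproduces the running "last value seen".
values = [4.56, 8.88, None, None, 7.32]
running = []
state = None
for v in values:
    state = last_known_t(state, v)
    running.append(state)
print(running)  # [4.56, 8.88, 8.88, 8.88, 7.32]
```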
With this aggregate defined, we can use it for our simplest form of gap-filling: last value seen.
```sql
with bounded as (
    select ts,
           interval_bound(ts, '5 second') as slice,
           value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40', '5 second') s(slice)
)
select slice, value,
       last_known(value) over (order by slice, ts)
from bounded
right join dense using (slice)
order by slice, ts;

        slice        | value | last_known
---------------------+-------+------------
 2014-03-25 16:21:00 |  4.56 |       4.56
 2014-03-25 16:21:05 |  8.88 |       8.88
 2014-03-25 16:21:10 |       |       8.88
 2014-03-25 16:21:15 |       |       8.88
 2014-03-25 16:21:20 |  7.32 |       7.32
 2014-03-25 16:21:20 |   7.7 |        7.7
 2014-03-25 16:21:25 |  9.31 |       9.31
 2014-03-25 16:21:25 |  8.33 |       8.33
 2014-03-25 16:21:25 | 10.63 |      10.63
 2014-03-25 16:21:30 |  9.56 |       9.56
 2014-03-25 16:21:35 |       |       9.56
 2014-03-25 16:21:40 |  3.25 |       3.25
```
This can be extended to also calculate the next value seen, as well as the timestamps at which those values occurred. With the last value, last value timestamp, next value, and next value timestamp in hand, we can apply linear interpolation to fill in our missing values.
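The interpolation itself is the standard two-point formula. Here is a minimal Python sketch (the function name mirrors the SQL linear_interpolate used in the queries, but this implementation is ours): given a target time t and the surrounding known points (t0, v0) and (t1, v1), the result is v0 plus the elapsed fraction of the gap times the change in value.

```python
# Linearly interpolate a value at time t between two known points
# (t0, v0) and (t1, v1). Times here are plain numbers (e.g. epoch seconds).
def linear_interpolate(t, t0, v0, t1, v1):
    if t1 == t0:
        return v0  # degenerate case: both known points coincide
    fraction = (t - t0) / (t1 - t0)
    return v0 + fraction * (v1 - v0)

# Halfway between (0, 8.88) and (20, 7.32):
print(linear_interpolate(10, 0, 8.88, 20, 7.32))  # ~8.1
```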
```sql
with bounded as (
    select ts,
           interval_bound(ts, '5 second') as slice,
           value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40', '5 second') s(slice)
)
select slice, value,
       coalesce(
           value,
           linear_interpolate(
               slice,
               last_known(ts)    over (lookback),
               last_known(value) over (lookback),
               last_known(ts)    over (lookforward),
               last_known(value) over (lookforward)
           )
       ) interpolated
from bounded
right join dense using (slice)
window lookback    as (order by slice, ts),
       lookforward as (order by slice desc, ts desc)
order by slice, ts;

        slice        | value |   interpolated
---------------------+-------+------------------
 2014-03-25 16:21:00 |  4.56 |             4.56
 2014-03-25 16:21:05 |  8.88 |             8.88
 2014-03-25 16:21:10 |       |  8.3618407241915
 2014-03-25 16:21:15 |       | 7.84092036209575
 2014-03-25 16:21:20 |  7.32 |             7.32
 2014-03-25 16:21:20 |   7.7 |              7.7
 2014-03-25 16:21:25 |  9.31 |             9.31
 2014-03-25 16:21:25 |  8.33 |             8.33
 2014-03-25 16:21:25 | 10.63 |            10.63
 2014-03-25 16:21:30 |  9.56 |             9.56
 2014-03-25 16:21:35 |       |  8.4576585544599
 2014-03-25 16:21:40 |  3.25 |             3.25
```
Aggregation
So far we have mapped our raw time series data into uniform time intervals and interpolated the missing values. The final piece of the puzzle is to resolve duplicate entries within a given time slice. As with gap filling, there are many ways this can be done in practice. Starting with the simplest example, we can report the minimum, maximum, and average recorded value in each time slice.
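The aggregation step is an ordinary GROUP BY; in Python terms it is equivalent to something like the following (the sample rows are abbreviated from the output above, with slice starts reduced to seconds for brevity):

```python
from collections import defaultdict

# Group values by their time slice, then summarize each group.
rows = [(20, 7.32), (20, 7.7), (25, 9.31), (25, 8.33), (25, 10.63)]
by_slice = defaultdict(list)
for slice_start, value in rows:
    by_slice[slice_start].append(value)

# One (min, max, avg) triple per slice, like the SQL GROUP BY below.
summary = {s: (min(vs), max(vs), sum(vs) / len(vs))
           for s, vs in sorted(by_slice.items())}
print(summary[20])  # min, max, avg for the 16:21:20 slice
```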
```sql
with bounded as (
    select ts,
           interval_bound(ts, '5 second') as slice,
           value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40', '5 second') s(slice)
),
filled as (
    select ts, slice, value,
           coalesce(
               value,
               linear_interpolate(
                   slice,
                   last_known(ts)    over (lookback),
                   last_known(value) over (lookback),
                   last_known(ts)    over (lookforward),
                   last_known(value) over (lookforward)
               )
           ) interpolated
    from bounded
    right join dense using (slice)
    window lookback    as (order by slice, ts),
           lookforward as (order by slice desc, ts desc)
)
select slice, min(interpolated), max(interpolated), avg(interpolated)
from filled
group by slice
order by slice;

        slice        | min  |  max  |       avg
---------------------+------+-------+------------------
 2014-03-25 16:21:00 | 4.56 |  4.56 |             4.56
 2014-03-25 16:21:05 | 8.88 |  8.88 |             8.88
 2014-03-25 16:21:10 |      |       |  8.3618407241915
 2014-03-25 16:21:15 |      |       | 7.84092036209575
 2014-03-25 16:21:20 | 7.32 |   7.7 |             7.51
 2014-03-25 16:21:25 | 8.33 | 10.63 | 9.42333333333333
 2014-03-25 16:21:30 | 9.56 |  9.56 |             9.56
 2014-03-25 16:21:35 |      |       |  8.4576585544599
```
Another common approach is to report only the last value seen within each time slice. To accomplish this, it helps to create one more aggregate.
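The aggregate defined next is an "argmax over time": within each group it returns the value paired with the greatest timestamp. Conceptually, in Python (names and sample data are ours):

```python
# "Last value within the slice": pick the value paired with the
# latest timestamp, scanning the (ts, value) pairs once.
def argmax_by_ts(pairs):
    best_ts, best_value = pairs[0]
    for ts, value in pairs[1:]:
        if ts > best_ts:
            best_ts, best_value = ts, value
    return best_value

# Three observations in the 16:21:25 slice (ts as seconds past the minute):
slice_rows = [(26.68, 9.31), (27.70, 8.33), (28.66, 10.63)]
print(argmax_by_ts(slice_rows))  # 10.63, the value with the latest ts
```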
```sql
create type argmax_ts_float as (
    ts    timestamp,
    value float
);

create function argmax_t(
    prev argmax_ts_float,
    current_ts timestamp,
    current_value float8)
returns argmax_ts_float as $$
    select case when ($1).ts > $2
                then $1
                else row($2, $3)::argmax_ts_float
           end
$$ language sql;

create function argmax_m(
    s1 argmax_ts_float,
    s2 argmax_ts_float)
returns argmax_ts_float as $$
    select case when ($1).ts > ($2).ts or $2 is null
                then $1
                else $2
           end
$$ language sql;

create function argmax_f(s1 argmax_ts_float)
returns float8 as $$
    select ($1).value
$$ language sql;

create aggregate argmax(timestamp, float8) (
    stype     = argmax_ts_float,
    sfunc     = argmax_t,
    prefunc   = argmax_m,
    finalfunc = argmax_f
);
```
This aggregate can then be used directly in the query:
```sql
with bounded as (
    select ts,
           interval_bound(ts, '5 second') as slice,
           value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40', '5 second') s(slice)
),
filled as (
    select ts, slice, value,
           coalesce(
               value,
               linear_interpolate(
                   slice,
                   last_known(ts)    over (lookback),
                   last_known(value) over (lookback),
                   last_known(ts)    over (lookforward),
                   last_known(value) over (lookforward)
               )
           ) interpolated
    from bounded
    right join dense using (slice)
    window lookback    as (order by slice, ts),
           lookforward as (order by slice desc, ts desc)
)
select slice, argmax(ts, interpolated)
from filled
group by slice
order by slice;

        slice        |      argmax
---------------------+------------------
 2014-03-25 16:21:00 |             4.56
 2014-03-25 16:21:05 |             8.88
 2014-03-25 16:21:10 |  8.3618407241915
 2014-03-25 16:21:15 | 7.84092036209575
 2014-03-25 16:21:20 |              7.7
 2014-03-25 16:21:25 |            10.63
 2014-03-25 16:21:30 |             9.56
 2014-03-25 16:21:35 |  8.4576585544599
 2014-03-25 16:21:40 |             3.25
```
One more time, now with Grouping
If you have multiple tags, labels, or groups within your time series, the same set of approaches still applies, but additional columns will appear in the group by and partition by clauses. This is similar to how grouping has been handled in the previous articles in this series. The main additional complication is generating the densified time series, which can be built as a cross product of the generated series and the distinct set of groups to be supported.
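In Python terms, that densified series is just the cross product of slices and tags:

```python
from itertools import product

slices = [0, 5, 10]   # slice starts (seconds, for brevity)
tags = ['a', 'b']     # distinct groups observed in the window

# Every (slice, tag) combination must exist so that gaps can be
# detected and filled independently within each tag.
dense = list(product(slices, tags))
print(len(dense))  # 6 rows: 3 slices x 2 tags
```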
```sql
with dense as (
    select slice, tag
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40', '5 second') s(slice),
         (select distinct tag
          from observations
          where ts >= '2014-03-25 16:21:00'::timestamp
            and ts <= '2014-03-25 16:21:40') tags
)
```
The final query, with grouping, becomes:
```sql
with bounded as (
    select tag, ts,
           interval_bound(ts, '5 second') as slice,
           value
    from observations
),
dense as (
    select slice, tag
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40', '5 second') s(slice),
         (select distinct tag
          from observations
          where ts >= '2014-03-25 16:21:00'::timestamp
            and ts <= '2014-03-25 16:21:40') tags
),
filled as (
    select tag, ts, slice, value,
           coalesce(
               value,
               linear_interpolate(
                   slice,
                   last_known(ts)    over (lookback),
                   last_known(value) over (lookback),
                   last_known(ts)    over (lookforward),
                   last_known(value) over (lookforward)
               )
           ) interpolated
    from bounded
    right join dense using (tag, slice)
    window lookback    as (partition by tag order by slice, ts),
           lookforward as (partition by tag order by slice desc, ts desc)
)
select tag, slice, argmax(ts, interpolated) as value
from filled
group by tag, slice
order by tag, slice;

 tag |        slice        |      value
-----+---------------------+------------------
 a   | 2014-03-25 16:21:00 |             4.56
 a   | 2014-03-25 16:21:05 |             8.88
 a   | 2014-03-25 16:21:10 |  8.3618407241915
 …
 b   | 2014-03-25 16:21:00 |             5.29
 b   | 2014-03-25 16:21:05 |             9.32
 b   | 2014-03-25 16:21:10 |  8.6889085743358
 …
```
This only touches on the wide variety of calculations that may be desired when discretizing or resampling a time series within SQL, but it provides a solid foundation: many of those variations can be implemented using one or more of the techniques outlined in this post.