Time Series Analysis Part 3: Resampling and Interpolation

April 9, 2014 Caleb Welton

featured-TimeSeries

The previous blog posts in this series introduced how Window Functions can be used for many types of ordered data analysis. Time series data can be found in many real world applications, including clickstream processing, financial analysis, and sensor data. This post further elaborates how these techniques can be expanded to handle time series resampling and interpolation.

Converting Raw Time Series Data into Discrete Intervals

Many time series analysis problems assume observations over uniform time intervals. In the real world, incoming data often does not arrive in this way. As a result, one common prerequisite for Times Series analytics is to take an initially raw input and transform it into discrete intervals, or to resample an input at one frequency into an input of a different frequency. The same basic techniques can be used for both use cases. There are a number of primitive operations that need to be understood, including:

  • Mapping – What time slice does an observation at a particular timestamp map to?
  • Gap Filling & Interpolation – When no observations map to a particular time slice, how do you fill in the missing value?
  • Aggregation – When multiple observations occur in the same time slice, what value do you decide to use?

Mapping

Mapping is the process of taking a given time stamp and mapping it into a given interval sequence. There are a couple small variations around how this is accomplished, depending on whether you consider a given time interval as inclusive of the start or inclusive of the end time. The basic calculations required are:

step_number = (time - start) / step_size
new_time    = start + step_number * step_size

The Pivotal Greenplum Database 4.3 provides a function named interval_bound to help facilitate this transformation. The following query demonstrates the interval_bound function using a 10 second interval:


select
  ts,
  interval_bound(ts, '10 second')
from observations
order by ts;
             ts          |   interval_bound
-------------------------+---------------------
 2014-03-25 16:21:03.402 | 2014-03-25 16:21:00
 2014-03-25 16:21:05.026 | 2014-03-25 16:21:00
 2014-03-25 16:21:20     | 2014-03-25 16:21:20
 2014-03-25 16:21:23.555 | 2014-03-25 16:21:20
 2014-03-25 16:21:26.684 | 2014-03-25 16:21:20
 2014-03-25 16:21:27.700 | 2014-03-25 16:21:20
 2014-03-25 16:21:28.660 | 2014-03-25 16:21:20
 2014-03-25 16:21:33.813 | 2014-03-25 16:21:30
 2014-03-25 16:21:40.605 | 2014-03-25 16:21:40

The interval_bound function also provides mechanisms for shifting the time window based on an offset or providing an alternate starting point. This functionality will be discussed in more depth in the Aggregation section.

Gap Filling & Interpolation

After mapping the input series to our time intervals there could still be time slices that are missing due to a lack of input data which occurred within that interval. To aid in addressing this problem we leverage two other pieces of functionality: the generate_series() function to generate a dense timeseries, and an outer join between our input data and the dense result.

with
  bounded as (
    select
      ts,
      interval_bound(ts, ‘5 second’) as slice,
      value
    from observations
  ),
  dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
  )
select *
from bounded natural right join dense
order by slice, ts;
        slice        |             ts             | value
---------------------+----------------------------+-------
 2014-03-25 16:21:00 | 2014-03-25 16:21:03.402357 |  4.56
 2014-03-25 16:21:05 | 2014-03-25 16:21:05.026502 |  8.88
 2014-03-25 16:21:10 |                            |
 2014-03-25 16:21:15 |                            |
 2014-03-25 16:21:20 | 2014-03-25 16:21:20        |  7.32
 2014-03-25 16:21:20 | 2014-03-25 16:21:23.555933 |   7.7
 2014-03-25 16:21:25 | 2014-03-25 16:21:26.684565 |  9.31
 2014-03-25 16:21:25 | 2014-03-25 16:21:27.700739 |  8.33
 2014-03-25 16:21:25 | 2014-03-25 16:21:28.660447 | 10.63
 2014-03-25 16:21:30 | 2014-03-25 16:21:33.813349 |  9.56
 2014-03-25 16:21:35 |                            |
 2014-03-25 16:21:40 | 2014-03-25 16:21:40.605952 |  3.25

Here you can see that time slices for 16:21:10, 16:21:15 and 16:21:35 were filled in as missing values since there were no observations in the input that mapped into those time slices. There are multiple techniques that can be applied to fill these values in ranging from relatively simple “last value seen” calculations to more elaborate interpolation techniques, most of which will require leveraging window functions.

Lead and lag functions, which were introduced in the first post in this series, might seem like a good first choice, but they only look back a single row. This becomes an issue if the gap is more than one row wide. Addressing this issue requires a simple new aggregate to return the last value seen:


create function
  last_known_t(prev float8, current float8)
  returns float8
  as $$return current or prev$$
  language plpythonu;

create aggregate last_known(float8) (
  stype =float8,
  sfunc = last_known_t,
  prefunc = last_known_t
);

create function
  last_known_t(prev timestamp, current timestamp)
  returns timestamp
  as $$return current or prev$$
  language plpythonu;

create aggregate last_known(timestamp) (
  stype = timestamp,
  sfunc = last_known_t,
  prefunc = last_known_t
);

In practice creating the function in java will be more performant, however the python version is much shorter and fitting for a blog post.

With this aggregate defined, we can use it for our simplest form of gap-filling: last value seen.

with
  bounded as (
    select
    ts,
    interval_bound(ts, '5 second') as slice,
    value
  from observations
  ),
  dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
  )
select
  slice,
  value,
  last_known(value) over (order by slice, ts)
from bounded right join dense using (slice)
order by slice,ts;
        slice        | value | last_known
---------------------+-------+------------
 2014-03-25 16:21:00 |  4.56 |       4.56
 2014-03-25 16:21:05 |  8.88 |       8.88
 2014-03-25 16:21:10 |       |       8.88
 2014-03-25 16:21:15 |       |       8.88
 2014-03-25 16:21:20 |  7.32 |       7.32
 2014-03-25 16:21:20 |   7.7 |        7.7
 2014-03-25 16:21:25 |  9.31 |       9.31
 2014-03-25 16:21:25 |  8.33 |       8.33
 2014-03-25 16:21:25 | 10.63 |      10.63
 2014-03-25 16:21:30 |  9.56 |       9.56
 2014-03-25 16:21:35 |       |       9.56
 2014-03-25 16:21:40 |  3.25 |       3.25

This can be extended to also calculate the next_value seen as well as the timestamps at which those values occurred. With last value, last value timestamp, next value, and next value timestamp we can apply linear interpolation to calculate an interpolated result for our missing values.

with
  bounded as (
    select
    ts,
    interval_bound(ts, '5 second') as slice,
    value
  from observations
  ),
  dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
  )
select
  slice,
  value,
  coalesce(
    value,
    linear_interpolate(
      slice,
      last_known(ts) over (lookback),
      last_known(value) over (lookback),
      last_known(ts) over (lookforward),
      last_known(value) over (lookforward)
    )
  ) interpolated
from bounded right join dense using (slice)
window
  lookback as (order by slice, ts),
  lookforward as (order by slice desc, ts desc)
order by slice, ts;

        slice        | value |   interpolated
---------------------+-------+------------------
 2014-03-25 16:21:00 |  4.56 |             4.56
 2014-03-25 16:21:05 |  8.88 |             8.88
 2014-03-25 16:21:10 |       |  8.3618407241915
 2014-03-25 16:21:15 |       | 7.84092036209575
 2014-03-25 16:21:20 |  7.32 |             7.32
 2014-03-25 16:21:20 |   7.7 |              7.7
 2014-03-25 16:21:25 |  9.31 |             9.31
 2014-03-25 16:21:25 |  8.33 |             8.33
 2014-03-25 16:21:25 | 10.63 |            10.63
 2014-03-25 16:21:30 |  9.56 |             9.56
 2014-03-25 16:21:35 |       |  8.4576585544599
 2014-03-25 16:21:40 |  3.25 |             3.25

Aggregation

So far we have mapped our raw time series data into uniform time intervals and interpolated the missing values. The final step in the puzzle is to remove the duplicate entries within a given time slice. As with gap filling, there are many ways this can be done in practice. Starting with the simplest example, we can report the min, max, and average recorded value in each time slice.


with
  bounded as (
    select
    ts,
    interval_bound(ts, '5 second') as slice,
    value
  from observations
  ),
  dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
  ),
  filled as (
    select
      ts,
      slice,
      value,
      coalesce(
        value,
        linear_interpolate(
          slice,
          last_known(ts) over (lookback),
          last_known(value) over (lookback),
          last_known(ts) over (lookforward),
          last_known(value) over (lookforward)
        )
      ) interpolated
    from bounded right join dense using (slice)
    window
      lookback as (order by slice, ts),
      lookforward as (order by slice desc, ts desc)
  )
select
  slice,
  min(interpolated),
  max(interpolated),
  avg(interpolated)
from filled
group by slice
order by slice;

        slice        | min  |  max  |       avg
---------------------+------+-------+-----------------
 2014-03-25 16:21:00 | 4.56 |  4.56 |             4.56
 2014-03-25 16:21:05 | 8.88 |  8.88 |             8.88
 2014-03-25 16:21:10 |      |       |  8.3618407241915
 2014-03-25 16:21:15 |      |       | 7.84092036209575
 2014-03-25 16:21:20 | 7.32 |   7.7 |             7.51
 2014-03-25 16:21:25 | 8.33 | 10.63 | 9.42333333333333
 2014-03-25 16:21:30 | 9.56 |  9.56 |             9.56
 2014-03-25 16:21:35 |      |       |  8.4576585544599

Another common approach is reporting only the last value seen within a time slice. To accomplish this it can be helpful to create one more useful aggregate.

create type argmax_ts_float as (
  ts timestamp,
  value float
);

create function argmax_t(
  prev argmax_ts_float,
  current_ts timestamp,
  current_value float8)
  returns argmax_ts_float as
  $$ select case when ($1).ts > $2 then $1
            else row($2,$3)::argmax_ts_float
            end
  $$ language sql;

create function argmax_m(
  s1 argmax_ts_float,
  s2 argmax_ts_float)
  returns argmax_ts_float as
  $$ select case when ($1).ts > ($2).ts or $2 is null
                 then $1 else $2 end
  $$ language sql;

create function argmax_f(s1 argmax_ts_float)
  returns float8 as
  $$ select ($1).value $$ language sql;

create aggregate argmax(timestamp, float8) (
  stype = argmax_ts_float,
  sfunc = argmax_t,
  prefunc = argmax_m,
  finalfunc = argmax_f
);

Which can then be used directly in the query:

with
  bounded as (
    select
    ts,
    interval_bound(ts, '5 second') as slice,
    value
  from observations
  ),
  dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
  ),
  filled as (
    select
      ts,
      slice,
      value,
      coalesce(
        value,
        linear_interpolate(
          slice,
          last_known(ts) over (lookback),
          last_known(value) over (lookback),
          last_known(ts) over (lookforward),
          last_known(value) over (lookforward)
        )
      ) interpolated
    from bounded right join dense using (slice)
    window
      lookback as (order by slice, ts),
      lookforward as (order by slice desc, ts desc)
  )
select
  slice,
  argmax(ts, interpolated)
from filled
group by slice
order by slice;

        slice        |      argmax
---------------------+------------------
 2014-03-25 16:21:00 |             4.56
 2014-03-25 16:21:05 |             8.88
 2014-03-25 16:21:10 |  8.3618407241915
 2014-03-25 16:21:15 | 7.84092036209575
 2014-03-25 16:21:20 |              7.7
 2014-03-25 16:21:25 |            10.63
 2014-03-25 16:21:30 |             9.56
 2014-03-25 16:21:35 |  8.4576585544599
 2014-03-25 16:21:40 |             3.25

Download the Pivotal Data Science Lab datasheet.

One more time, now with Grouping

If you have multiple tags, labels or groups within your time series, the same set of approaches still apply but more columns will show up in group by and partition by clauses. This is similar to how grouping has been handled in the previous articles in this series. The main additional complication is the generation of the densified time series, which can be completed as a cross product of the generated series and the distinct set of groups to be supported.


with
   dense as (
    select slice, tag
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice),
      (select distinct tag
       from observations
       where ts >= '2014-03-25 16:21:00'::timestamp
         and ts <= '2014-03-25 16:21:40') tags
   )

The final query, with grouping, becomes:

with
  bounded as (
    select
    tag,
    ts,
    interval_bound(ts, '5 second') as slice,
    value
  from observations
  ),
  dense as (
    select slice, tag
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice),
      (select distinct tag
       from observations
       where ts >= '2014-03-25 16:21:00'::timestamp
         and ts <= '2014-03-25 16:21:40') tags
   ),
  filled as (
    select
      tag,
      ts,
      slice,
      value,
      coalesce(
        value,
        linear_interpolate(
          slice,
          last_known(ts) over (lookback),
          last_known(value) over (lookback),
          last_known(ts) over (lookforward),
          last_known(value) over (lookforward)
        )
      ) interpolated
    from bounded right join dense using (tag, slice)
    window
      lookback as (partition by tag order by slice, ts),
      lookforward as (partition by tag order by slice desc, ts desc)
  )
select
  tag,
  slice,
  argmax(ts, interpolated) as value
from filled
group by tag, slice
order by tag, slice;

 tag |        slice        |      value
-----+---------------------+------------------
 a   | 2014-03-25 16:21:00 |             4.56
 a   | 2014-03-25 16:21:05 |             8.88
 a   | 2014-03-25 16:21:10 |  8.3618407241915
…
 b   | 2014-03-25 16:21:00 |             5.29
 b   | 2014-03-25 16:21:05 |             9.32
 b   | 2014-03-25 16:21:10 |  8.6889085743358
…

This only touches on the wide variety of permutations of calculations that may be desired in the discretization/resampling of a time series within SQL, but it provides a solid foundation. Many of those permutations can be implemented using one or more of the techniques outlined in this post.

About the Author

Biography

Previous
Feedback in conversation: An adventure in meta-process
Feedback in conversation: An adventure in meta-process

Development is all about feedback loops. We use analytics and AB tests to see how people respond to new fea...

Next
We are All Programmers
We are All Programmers

While my wife was in business school I often found myself the only software engineer at cocktail parties. I...