# Time Series Analysis Part 3: Resampling and Interpolation

April 9, 2014

The previous blog posts in this series introduced how window functions can be used for many types of ordered data analysis. Time series data appears in many real-world applications, including clickstream processing, financial analysis, and sensor data. This post elaborates on how these techniques can be extended to handle time series resampling and interpolation.

## Converting Raw Time Series Data into Discrete Intervals

Many time series analysis problems assume observations over uniform time intervals. In the real world, incoming data often does not arrive this way. As a result, one common prerequisite for time series analytics is to transform a raw input into discrete intervals, or to resample an input at one frequency into an input at a different frequency. The same basic techniques apply to both use cases. There are a number of primitive operations that need to be understood, including:

- Mapping – What time slice does an observation at a particular timestamp map to?
- Gap Filling & Interpolation – When no observations map to a particular time slice, how do you fill in the missing value?
- Aggregation – When multiple observations occur in the same time slice, which value do you use?

### Mapping

Mapping is the process of taking a given timestamp and mapping it into a given interval sequence. There are a couple of small variations on how this is accomplished, depending on whether you consider a given time interval inclusive of its start time or inclusive of its end time. The basic calculations required are:

```
step_number = (time - start) / step_size
new_time    = start + step_number * step_size
```
```
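The same calculation can be sketched as a standalone Python function. This is an illustration of the arithmetic only; the function name `interval_bound_py` and the sample values are illustrative, not part of Greenplum:

```python
from datetime import datetime, timedelta

def interval_bound_py(ts, start, step):
    # Number of whole steps between the series start and this observation;
    # floor division makes the interval inclusive of its start time.
    step_number = (ts - start) // step
    return start + step_number * step

start = datetime(2014, 3, 25, 16, 21, 0)
step = timedelta(seconds=10)
print(interval_bound_py(datetime(2014, 3, 25, 16, 21, 23, 555000), start, step))
# 2014-03-25 16:21:20
```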

Pivotal Greenplum Database 4.3 provides a function named interval_bound to facilitate this transformation. The following query demonstrates the interval_bound function using a 10 second interval:

```
select
    ts,
    interval_bound(ts, '10 second')
from observations
order by ts;
ts          |   interval_bound
-------------------------+---------------------
2014-03-25 16:21:03.402 | 2014-03-25 16:21:00
2014-03-25 16:21:05.026 | 2014-03-25 16:21:00
2014-03-25 16:21:20     | 2014-03-25 16:21:20
2014-03-25 16:21:23.555 | 2014-03-25 16:21:20
2014-03-25 16:21:26.684 | 2014-03-25 16:21:20
2014-03-25 16:21:27.700 | 2014-03-25 16:21:20
2014-03-25 16:21:28.660 | 2014-03-25 16:21:20
2014-03-25 16:21:33.813 | 2014-03-25 16:21:30
2014-03-25 16:21:40.605 | 2014-03-25 16:21:40
```

The interval_bound function also provides mechanisms for shifting the time window based on an offset or providing an alternate starting point. This functionality will be discussed in more depth in the Aggregation section.

### Gap Filling & Interpolation

After mapping the input series to our time intervals, there may still be missing time slices because no observations occurred within them. To address this problem we leverage two other pieces of functionality: the generate_series() function to generate a dense time series, and an outer join between our input data and the dense result.

```
with
bounded as (
    select
        ts,
        interval_bound(ts, '5 second') as slice,
        value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
)
select *
from bounded natural right join dense
order by slice, ts;
slice        |             ts             | value
---------------------+----------------------------+-------
2014-03-25 16:21:00 | 2014-03-25 16:21:03.402357 |  4.56
2014-03-25 16:21:05 | 2014-03-25 16:21:05.026502 |  8.88
2014-03-25 16:21:10 |                            |
2014-03-25 16:21:15 |                            |
2014-03-25 16:21:20 | 2014-03-25 16:21:20        |  7.32
2014-03-25 16:21:20 | 2014-03-25 16:21:23.555933 |   7.7
2014-03-25 16:21:25 | 2014-03-25 16:21:26.684565 |  9.31
2014-03-25 16:21:25 | 2014-03-25 16:21:27.700739 |  8.33
2014-03-25 16:21:25 | 2014-03-25 16:21:28.660447 | 10.63
2014-03-25 16:21:30 | 2014-03-25 16:21:33.813349 |  9.56
2014-03-25 16:21:35 |                            |
2014-03-25 16:21:40 | 2014-03-25 16:21:40.605952 |  3.25
```

Here you can see that the time slices for 16:21:10, 16:21:15, and 16:21:35 were filled in with missing values, since no observations in the input mapped into those time slices. There are multiple techniques for filling in these values, ranging from relatively simple “last value seen” calculations to more elaborate interpolation techniques, most of which require leveraging window functions.

Lead and lag functions, which were introduced in the first post in this series, might seem like a good first choice, but they only look back a fixed number of rows (one by default). This becomes an issue when a gap spans more than one row. Addressing this issue requires a simple new aggregate that returns the last value seen:

```
create function
last_known_t(prev float8, current float8)
returns float8
as $$return current if current is not None else prev$$
language plpythonu;

create aggregate last_known(float8) (
    stype = float8,
    sfunc = last_known_t,
    prefunc = last_known_t
);

create function
last_known_t(prev timestamp, current timestamp)
returns timestamp
as $$return current if current is not None else prev$$
language plpythonu;

create aggregate last_known(timestamp) (
    stype = timestamp,
    sfunc = last_known_t,
    prefunc = last_known_t
);
```

In practice, implementing the function in Java would be more performant; however, the Python version is much shorter and better suited to a blog post.
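Conceptually, the aggregate is just a running “last non-null value seen” fold over the ordered rows. A minimal sketch of the same logic in plain Python (the function name is illustrative):

```python
def last_known_fill(values):
    # Carry the most recent non-null value forward over an ordered sequence.
    out, prev = [], None
    for current in values:
        prev = current if current is not None else prev
        out.append(prev)
    return out

print(last_known_fill([4.56, 8.88, None, None, 7.32]))
# [4.56, 8.88, 8.88, 8.88, 7.32]
```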

With this aggregate defined, we can use it for our simplest form of gap-filling: last value seen.

```
with
bounded as (
    select
        ts,
        interval_bound(ts, '5 second') as slice,
        value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
)
select
    slice,
    value,
    last_known(value) over (order by slice, ts)
from bounded right join dense using (slice)
order by slice, ts;
slice        | value | last_known
---------------------+-------+------------
2014-03-25 16:21:00 |  4.56 |       4.56
2014-03-25 16:21:05 |  8.88 |       8.88
2014-03-25 16:21:10 |       |       8.88
2014-03-25 16:21:15 |       |       8.88
2014-03-25 16:21:20 |  7.32 |       7.32
2014-03-25 16:21:20 |   7.7 |        7.7
2014-03-25 16:21:25 |  9.31 |       9.31
2014-03-25 16:21:25 |  8.33 |       8.33
2014-03-25 16:21:25 | 10.63 |      10.63
2014-03-25 16:21:30 |  9.56 |       9.56
2014-03-25 16:21:35 |       |       9.56
2014-03-25 16:21:40 |  3.25 |       3.25
```

This can be extended to also calculate the next value seen, as well as the timestamps at which those values occurred. With the last value, last value timestamp, next value, and next value timestamp, we can apply linear interpolation to calculate an interpolated result for our missing values.
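Linear interpolation here is the standard two-point formula: the value at the slice is the last known value plus the fraction of the time gap elapsed, applied to the value range. A plain Python sketch (the function name is illustrative), checked against the 16:21:10 gap, which lies between the last known point (16:21:05.026502, 8.88) and the next known point (16:21:20, 7.32):

```python
from datetime import datetime

def lerp(x, x0, y0, x1, y1):
    # Fraction of the way from x0 to x1, applied to the value range.
    frac = (x - x0).total_seconds() / (x1 - x0).total_seconds()
    return y0 + frac * (y1 - y0)

y = lerp(datetime(2014, 3, 25, 16, 21, 10),
         datetime(2014, 3, 25, 16, 21, 5, 26502), 8.88,
         datetime(2014, 3, 25, 16, 21, 20), 7.32)
print(round(y, 4))  # 8.3618
```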

```
with
bounded as (
    select
        ts,
        interval_bound(ts, '5 second') as slice,
        value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
)
select
    slice,
    value,
    coalesce(
        value,
        linear_interpolate(
            slice,
            last_known(ts) over (lookback),
            last_known(value) over (lookback),
            last_known(ts) over (lookforward),
            last_known(value) over (lookforward)
        )
    ) interpolated
from bounded right join dense using (slice)
window
    lookback as (order by slice, ts),
    lookforward as (order by slice desc, ts desc)
order by slice, ts;

slice        | value |   interpolated
---------------------+-------+------------------
2014-03-25 16:21:00 |  4.56 |             4.56
2014-03-25 16:21:05 |  8.88 |             8.88
2014-03-25 16:21:10 |       |  8.3618407241915
2014-03-25 16:21:15 |       | 7.84092036209575
2014-03-25 16:21:20 |  7.32 |             7.32
2014-03-25 16:21:20 |   7.7 |              7.7
2014-03-25 16:21:25 |  9.31 |             9.31
2014-03-25 16:21:25 |  8.33 |             8.33
2014-03-25 16:21:25 | 10.63 |            10.63
2014-03-25 16:21:30 |  9.56 |             9.56
2014-03-25 16:21:35 |       |  8.4576585544599
2014-03-25 16:21:40 |  3.25 |             3.25
```

### Aggregation

So far we have mapped our raw time series data into uniform time intervals and interpolated the missing values. The final piece of the puzzle is collapsing the multiple entries within a given time slice down to a single value. As with gap filling, there are many ways this can be done in practice. Starting with the simplest example, we can report the min, max, and average value in each time slice.

```
with
bounded as (
    select
        ts,
        interval_bound(ts, '5 second') as slice,
        value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
),
filled as (
    select
        ts,
        slice,
        value,
        coalesce(
            value,
            linear_interpolate(
                slice,
                last_known(ts) over (lookback),
                last_known(value) over (lookback),
                last_known(ts) over (lookforward),
                last_known(value) over (lookforward)
            )
        ) interpolated
    from bounded right join dense using (slice)
    window
        lookback as (order by slice, ts),
        lookforward as (order by slice desc, ts desc)
)
select
    slice,
    min(value),
    max(value),
    avg(interpolated)
from filled
group by slice
order by slice;

slice        | min  |  max  |       avg
---------------------+------+-------+-----------------
2014-03-25 16:21:00 | 4.56 |  4.56 |             4.56
2014-03-25 16:21:05 | 8.88 |  8.88 |             8.88
2014-03-25 16:21:10 |      |       |  8.3618407241915
2014-03-25 16:21:15 |      |       | 7.84092036209575
2014-03-25 16:21:20 | 7.32 |   7.7 |             7.51
2014-03-25 16:21:25 | 8.33 | 10.63 | 9.42333333333333
2014-03-25 16:21:30 | 9.56 |  9.56 |             9.56
2014-03-25 16:21:35 |      |       |  8.4576585544599
2014-03-25 16:21:40 | 3.25 |  3.25 |             3.25
```

Another common approach is to report only the last value seen within a time slice. To accomplish this, it helps to define one more aggregate.

```
create type argmax_ts_float as (
    ts timestamp,
    value float
);

create function argmax_t(
    prev argmax_ts_float,
    current_ts timestamp,
    current_value float8)
returns argmax_ts_float as
$$ select case when ($1).ts > $2 then $1
          else row($2, $3)::argmax_ts_float
          end
$$ language sql;

create function argmax_m(
    s1 argmax_ts_float,
    s2 argmax_ts_float)
returns argmax_ts_float as
$$ select case when ($1).ts > ($2).ts or $2 is null
          then $1 else $2 end
$$ language sql;

create function argmax_f(s1 argmax_ts_float)
returns float8 as
$$ select ($1).value $$ language sql;

create aggregate argmax(timestamp, float8) (
    stype = argmax_ts_float,
    sfunc = argmax_t,
    prefunc = argmax_m,
    finalfunc = argmax_f
);
```
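The semantics of this aggregate, sketched in plain Python: of all the (timestamp, value) pairs in a slice, keep the value attached to the greatest timestamp. The function name below is illustrative, and the sample pairs mirror the 16:21:25 slice from the earlier output:

```python
def argmax_last(pairs):
    # pairs: iterable of (ts, value); return the value with the latest ts.
    ts, value = max(pairs, key=lambda p: p[0])
    return value

print(argmax_last([("16:21:26.684", 9.31),
                   ("16:21:27.700", 8.33),
                   ("16:21:28.660", 10.63)]))
# 10.63
```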

This aggregate can then be used directly in the query:

```
with
bounded as (
    select
        ts,
        interval_bound(ts, '5 second') as slice,
        value
    from observations
),
dense as (
    select slice
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice)
),
filled as (
    select
        ts,
        slice,
        value,
        coalesce(
            value,
            linear_interpolate(
                slice,
                last_known(ts) over (lookback),
                last_known(value) over (lookback),
                last_known(ts) over (lookforward),
                last_known(value) over (lookforward)
            )
        ) interpolated
    from bounded right join dense using (slice)
    window
        lookback as (order by slice, ts),
        lookforward as (order by slice desc, ts desc)
)
select
    slice,
    argmax(ts, interpolated)
from filled
group by slice
order by slice;

slice        |      argmax
---------------------+------------------
2014-03-25 16:21:00 |             4.56
2014-03-25 16:21:05 |             8.88
2014-03-25 16:21:10 |  8.3618407241915
2014-03-25 16:21:15 | 7.84092036209575
2014-03-25 16:21:20 |              7.7
2014-03-25 16:21:25 |            10.63
2014-03-25 16:21:30 |             9.56
2014-03-25 16:21:35 |  8.4576585544599
2014-03-25 16:21:40 |             3.25
```

### One more time, now with Grouping

If you have multiple tags, labels, or groups within your time series, the same set of approaches still applies, but more columns will show up in the group by and partition by clauses. This is similar to how grouping was handled in the previous articles in this series. The main additional complication is generating the densified time series, which can be built as a cross product of the generated series and the distinct set of groups to be supported.
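The cross product itself is straightforward; in Python terms it is just itertools.product over the slices and the distinct tags. The slice range and tag set below are illustrative sample values:

```python
from datetime import datetime, timedelta
from itertools import product

start = datetime(2014, 3, 25, 16, 21, 0)
slices = [start + timedelta(seconds=5 * i) for i in range(9)]  # 16:21:00 .. 16:21:40
tags = ['a', 'b']  # distinct tags observed in the window

dense = list(product(tags, slices))  # one (tag, slice) row per combination
print(len(dense))  # 18
```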

```
with
dense as (
    select slice, tag
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice),
         (select distinct tag
          from observations
          where ts >= '2014-03-25 16:21:00'::timestamp
            and ts <= '2014-03-25 16:21:40') tags
)
```

The final query, with grouping, becomes:

```
with
bounded as (
    select
        tag,
        ts,
        interval_bound(ts, '5 second') as slice,
        value
    from observations
),
dense as (
    select slice, tag
    from generate_series('2014-03-25 16:21:00'::timestamp,
                         '2014-03-25 16:21:40',
                         '5 second') s(slice),
         (select distinct tag
          from observations
          where ts >= '2014-03-25 16:21:00'::timestamp
            and ts <= '2014-03-25 16:21:40') tags
),
filled as (
    select
        tag,
        ts,
        slice,
        value,
        coalesce(
            value,
            linear_interpolate(
                slice,
                last_known(ts) over (lookback),
                last_known(value) over (lookback),
                last_known(ts) over (lookforward),
                last_known(value) over (lookforward)
            )
        ) interpolated
    from bounded right join dense using (tag, slice)
    window
        lookback as (partition by tag order by slice, ts),
        lookforward as (partition by tag order by slice desc, ts desc)
)
select
    tag,
    slice,
    argmax(ts, interpolated) as value
from filled
group by tag, slice
order by tag, slice;

tag |        slice        |      value
-----+---------------------+------------------
a   | 2014-03-25 16:21:00 |             4.56
a   | 2014-03-25 16:21:05 |             8.88
a   | 2014-03-25 16:21:10 |  8.3618407241915
…
b   | 2014-03-25 16:21:00 |             5.29
b   | 2014-03-25 16:21:05 |             9.32
b   | 2014-03-25 16:21:10 |  8.6889085743358
…
```

This post only touches on the wide variety of calculations that may be desired when discretizing or resampling a time series in SQL, but it provides a solid foundation: many of those variations can be implemented using one or more of the techniques outlined here.
