Thread: finding gaps in temporal data

finding gaps in temporal data

From
Samuel Gendler
Date:
I have a fact table that looks like this:

dim1_fk bigint,
time_fk bigint,
tstamp timestamp without timezone
value numeric(16,2)

The tstamp column is completely redundant to a colume in the time_utc table, but I like to leave it there for convenience when writing ad-hoc queries in psql - it allows me to skip the join to a time dimension table.  The fact table is actually partitioned into 1-month child tables, with check constraints on both time_fk and the tstamp column, since there is a 1:1 relationship between those columns.

dim1_fk refers to a dim1 table which has two varchar columns we care about:

dim1_pk bigint,
label1 varchar,
label2 varchar

the time_utc table contains the usual time dimension columns, and I've structured the primary key to be an integer in the form YYYYMMDDHH24mm - so 2011-06-15 15:35 has a primary key of 201106151535 and there is a row every 5 minutes. All data in the fact table is assigned to a given 5 minute window. There is a row in the time_utc table for every possible time value, regardless of whether there is data in a fact table for that 5 minute interval.  For our purposes, we only need care about 2 columns

time_pk bigint,
tstamp timstamp without time zone

I'm looking to run a report which will show me any gaps in the data for any label1/label2 pair that appears in the dim1 table - there are 0 or more rows for each label1/label2 combination in each 5 minute window and I don't actually care about the duplicates (all my queries aggregate multiple rows for a given timestamp via the avg() function).  I can find every missing row in the fact table data by just cross joining dim1 and time_utc for a given date range and then left joining between that and the fact table. Since every possible time value appears in the time table and every possible value1/value2 combination appears in the dim1 table, I will see NULLs in the fact table columns for any row that is missing in the data set.

However, it would be far nicer to get a report which just lists gaps, with a start and end, rather than one row for each missing 5 minute interval, even if they are sequential.  Is there any way to accomplish this?  I'm running 8.4.x at the moment, but would be willing to upgrade to 9.0.x, if necessary.  If I can do it without the left outer join from time_utc to the fact table, then so much the better.

A query to grab just the missing rows looks something like this:

SELECT t.tstamp,
       d.label1,
       d.label2
FROM dimensions.dim1 d
INNER JOIN dimensions.time_utc t
       ON t.tstamp between '2011-06-01' and '2011-06-05'
LEFT OUTER JOIN facts.parent_fact f
             ON f.time_fk = t.time_pk
            AND f.dim1_fk = d.dim1_pk
GROUP BY 3,2,1
HAVING avg(value) IS NULL
ORDER BY 3,2,1

so all I really need to do is figure out how to combine sequential rows into a single row with a start and end time.  Rows are sequential if row1.tstamp - row2.tstamp = 5 minutes


Re: finding gaps in temporal data

From
Samuel Gendler
Date:


On Wed, Jun 15, 2011 at 10:23 AM, Samuel Gendler <sgendler@ideasculptor.com> wrote:
I have a fact table that looks like this:

dim1_fk bigint,
time_fk bigint,
tstamp timestamp without timezone
value numeric(16,2)

The tstamp column is completely redundant to a colume in the time_utc table, but I like to leave it there for convenience when writing ad-hoc queries in psql - it allows me to skip the join to a time dimension table.  The fact table is actually partitioned into 1-month child tables, with check constraints on both time_fk and the tstamp column, since there is a 1:1 relationship between those columns.

dim1_fk refers to a dim1 table which has two varchar columns we care about:

dim1_pk bigint,
label1 varchar,
label2 varchar

the time_utc table contains the usual time dimension columns, and I've structured the primary key to be an integer in the form YYYYMMDDHH24mm - so 2011-06-15 15:35 has a primary key of 201106151535 and there is a row every 5 minutes. All data in the fact table is assigned to a given 5 minute window. There is a row in the time_utc table for every possible time value, regardless of whether there is data in a fact table for that 5 minute interval.  For our purposes, we only need care about 2 columns

time_pk bigint,
tstamp timstamp without time zone

I'm looking to run a report which will show me any gaps in the data for any label1/label2 pair that appears in the dim1 table - there are 0 or more rows for each label1/label2 combination in each 5 minute window and I don't actually care about the duplicates (all my queries aggregate multiple rows for a given timestamp via the avg() function).

OK, I figured this one out on my own.  It looks like this (plus a union all to a query to show entries from dim1 for which there is no data at all):

SELECT d2.label1,
       d2.label2,
       tstamp - gap as start_time,
       tstamp - '5 minute'::interval as end_time 
FROM (
    SELECT d.dim1_pk,
           t.tstamp,
           t.tstamp - lag(t.tstamp,1, '2011-05-15 00:00:00') OVER w AS gap
    FROM
        dim1 d 
        JOIN facts.fact_tbl f
             ON f.dim1_fk = d.dim1_pk    
        JOIN time_utc t 
             ON t.time_pk = f.time_fk
    WHERE f.tstamp between '2011-05-15 00:00:00' and '2011-06-03 23:55:00'
    GROUP BY 1,2
    WINDOW w AS (PARTITION BY d.dim1_pk
                 ORDER BY d.dim1_pk)
    ORDER BY 1,2
) AS q JOIN dim1 d2 ON d2.dim1_pk = q.dim1_pk
WHERE q.gap > '5 minute'::interval
;

That subtracts the tstamp of the previous row from the tstamp of the current row, within a window defined on individual dim1_pk values.  The outer query then selects only rows where the gap is greater than 5 minutes, since sequential values will show a 5 minute interval.  It also joins to dim1 again in order to pull out the label1 and label2 values, since including those in the window instead of dim1_pk resulted in a much slower query, presumably because I don't have indexes on the label columns.

This avoids all of the cross join and left join craziness I was doing, which is useful when attempting to plug gaps with default values, but a pain in the neck when just attempting to determine where the gaps are and how large they are.
 
window functions are a seriously useful tool!

--sam