Thread: optimizing for temporal data behind a view

optimizing for temporal data behind a view

From: Richard Henwood
Hi All,

I have a large quantity of temporal data, 6 billion rows, which I would
like to put into a table so I can exploit SQL datetime queries. Each row
represents a geophysical observation at a particular time and place. The
data is effectively read-only, i.e. updates will be very infrequent. The
rows are very 'narrow' (~24 bytes of data per row).

When I ingested the data into PostgreSQL one row at a time, I discovered
that the per-row overhead is significant (pg 8.3.7). The projected
resources required to host this table make this simple approach prohibitive.

To reduce the cost of the row overhead, I tried storing a whole minute's
worth of data in an array, so now I only require one row per minute. The
total row count decreased by a factor of 60, and the resources required
became realistic.

My schema is thus:

CREATE TABLE geodata1sec (
    obstime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    statid SMALLINT NOT NULL,
    geovalue_array REAL[3][60] NOT NULL
);
and after ingesting, I add these indexes:
ALTER TABLE geodata1sec ADD PRIMARY KEY (obstime, statid);
CREATE INDEX geodata1sec_statid_idx ON geodata1sec (statid);

Storing a whole minute per row, with the data in an array, has the desired
effect of making the table size on disk, and the index size in memory,
manageable. However, my queries now need to be aware of this schema design
decision. The following query runs nice and quick, but obviously doesn't
return all the relevant results (because the predicate is at second
resolution while the rows are stored one per minute):

EXPLAIN ANALYZE SELECT * FROM geodata1sec WHERE obstime BETWEEN
'2004-10-21 02:03:04' AND '2004-10-21 02:04:08';
QUERY PLAN
--------------------------------------------------------------------------
 Index Scan using geodata1sec_pkey on geodata1sec  (cost=0.00..38.19
rows=12 width=762) (actual time=0.071..0.148 rows=13 loops=1)
   Index Cond: ((obstime >= '2004-10-21 02:03:04'::timestamp without
time zone) AND (obstime <= '2004-10-21 02:04:08'::timestamp without time
zone))
Total runtime: 0.292 ms
(3 rows)
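
To actually cover that second-resolution range against the minute rows, I
would have to widen the predicate to whole minutes myself, roughly like this
(just a sketch; the application would still have to pull the relevant
elements back out of geovalue_array):

SELECT obstime, statid, geovalue_array
  FROM geodata1sec
 WHERE obstime >= date_trunc('minute', TIMESTAMP '2004-10-21 02:03:04')
   AND obstime <= TIMESTAMP '2004-10-21 02:04:08';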


... So, I constructed a view that presents my data as I originally
intended. This also means that I don't have to give my applications
detailed knowledge of the schema. The view is:


CREATE VIEW geodataview AS
SELECT obstime + (s.a*5 || ' seconds')::INTERVAL AS obstime,
       statid,
       geovalue_array[s.a+1][1] AS x_mag,
       geovalue_array[s.a+1][2] AS y_mag,
       geovalue_array[s.a+1][3] AS z_mag
  FROM generate_series(0, 11) AS s(a), geodata1sec;

Queries against this view return _all_ the relevant data. However, they
take a very long time. If I analyse the query I get:

EXPLAIN ANALYZE SELECT * FROM geodataview WHERE obstime BETWEEN
'2004-10-21 02:03:04' AND '2004-10-21 02:04:08';
QUERY PLAN
--------------------------------------------------------------------------
 Nested Loop  (cost=13.50..2314276295.50 rows=4088000000 width=766)
(actual time=2072612.668..3081010.104 rows=169 loops=1)
   Join Filter: (((geodata1sec.obstime + ((((s.a * 5))::text || '
seconds'::text))::interval) >= '2004-10-21 02:03:04'::timestamp without
time zone) AND ((geodata1sec.obstime + ((((s.a * 5))::text || '
seconds'::text))::interval) <= '2004-10-21 02:04:08'::timestamp without
time zone))
   ->  Seq Scan on geodata1sec  (cost=0.00..4556282.00 rows=36792000
width=762) (actual time=17.072..414620.213 rows=36791999 loops=1)
   ->  Materialize  (cost=13.50..23.50 rows=1000 width=4) (actual
time=0.002..0.027 rows=12 loops=36791999)
       ->  Function Scan on generate_series s  (cost=0.00..12.50
rows=1000 width=4) (actual time=0.075..0.102 rows=12 loops=1)
Total runtime: 3081010.613 ms
(6 rows)


This is clearly not going to perform well enough for any practical
application. However, it struck me that others might have needed similar
functionality for time data, so I thought I would air my experience here.

Is it feasible to modify the query planner to make better decisions when
dealing with time data behind a view?

Are there any alternatives to vanilla PostgreSQL for storing this type
of data? I'm imagining something like PostGIS, but for time-based data.


Your time and thoughts are appreciated,
Cheers,
Richard

Re: optimizing for temporal data behind a view

From: Hrishikesh (हृषीकेश मेहेंदळे)
Hi Richard,

> CREATE VIEW geodataview AS SELECT obstime + (s.a*5 || '
> seconds')::INTERVAL AS obstime, statid, geovalue_array[s.a+1][1] AS
> x_mag, geovalue_array[s.a+1][2] AS y_mag, geovalue_array[s.a+1][3] AS
> z_mag FROM generate_series(0, 11) AS s(a), geodata1sec;

To my (admittedly untrained) eye, it seems that the implicit JOIN between
generate_series(0, 11) and geodata1sec will be evaluated over all records
in geodata1sec, and the EXPLAIN ANALYZE of the view you posted seems to
corroborate that. (I suspect the JOIN also defeats the time filter on
geodata1sec, because the view's obstime is a computed expression that the
planner can't push down to the base table's index, which makes things worse.)

> EXPLAIN ANALYZE SELECT * FROM geodataview WHERE obstime BETWEEN
> '2004-10-21 02:03:04' AND '2004-10-21 02:04:08';

>  Nested Loop  (cost=13.50..2314276295.50 rows=4088000000 width=766)
> (actual time=2072612.668..3081010.104 rows=169 loops=1)
>   Join Filter: (((geodata1sec.obstime + ((((s.a * 5))::text || '
> seconds'::text))::interval) >= '2004-10-21 02:03:04'::timestamp without
> time zone) AND ((geodata1sec.obstime + ((((s.a * 5))::text || '
> seconds'::text))::interval) <= '2004-10-21 02:04:08'::timestamp without
> time zone))
>   ->  Seq Scan on geodata1sec  (cost=0.00..4556282.00 rows=36792000
> width=762) (actual time=17.072..414620.213 rows=36791999 loops=1)

The seq scan should ideally touch only the dozen or so rows that matched in
your original EXPLAIN ANALYZE output, but here it actually returns all 37
million rows in the table.

> This is clearly not going to perform for any practical applications.
> However, it struck me that others might have needed similar
> functionality for time data so I thought I would air my experience here.
>
> Is it feasible to modify the query planner to make better decisions when
> dealing with time data behind a view?

You could use table partitioning and split your geodata1sec table into
(say) one table per hour; each partition then holds far fewer records to
JOIN with. (With PG 8.3.7 you need to explicitly enable
constraint_exclusion in postgresql.conf for this to work.)
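
Something like the following, perhaps (an untested sketch of 8.3-style
inheritance partitioning; the child table and index names are made up, and
you'd also need a trigger or explicit routing to send inserts to the right
child table):

CREATE TABLE geodata1sec_y2004m10d21h02 (
    CHECK (obstime >= TIMESTAMP '2004-10-21 02:00:00'
       AND obstime <  TIMESTAMP '2004-10-21 03:00:00')
) INHERITS (geodata1sec);

CREATE INDEX geodata1sec_y2004m10d21h02_obstime_idx
    ON geodata1sec_y2004m10d21h02 (obstime, statid);

-- in postgresql.conf, or per session:
SET constraint_exclusion = on;

With constraint_exclusion on, queries that filter directly on the stored
obstime column can skip the partitions whose CHECK constraints rule them
out. Note that the computed obstime in your view won't get that benefit
by itself.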

You could replace the view with a set-returning stored procedure, but I'm
guessing you don't want to (or cannot) change the application that issues
the query.

You could also change the view to call a stored procedure that does, in
essence, "for i in 0..11: query geodata1sec for t + i" and returns the
resulting record set, which might be faster.
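
A rough, untested sketch of what I mean (geodata_between is a name I made
up; it assumes the 12-element, 5-second array layout implied by your view,
and uses OUT parameters because 8.3 has no RETURNS TABLE):

CREATE OR REPLACE FUNCTION geodata_between(
    t_start TIMESTAMP, t_end TIMESTAMP,
    OUT obstime TIMESTAMP, OUT statid SMALLINT,
    OUT x_mag REAL, OUT y_mag REAL, OUT z_mag REAL)
RETURNS SETOF record AS $$
DECLARE
    r geodata1sec%ROWTYPE;
    i INTEGER;
BEGIN
    -- Sargable predicate on the stored minute timestamps, so the
    -- primary-key index can be used; the second-resolution filter
    -- is re-applied below after expanding the array.
    FOR r IN
        SELECT * FROM geodata1sec
        WHERE geodata1sec.obstime >= date_trunc('minute', t_start)
          AND geodata1sec.obstime <= t_end
    LOOP
        FOR i IN 0..11 LOOP
            obstime := r.obstime + (i * 5) * INTERVAL '1 second';
            IF obstime BETWEEN t_start AND t_end THEN
                statid := r.statid;
                x_mag  := r.geovalue_array[i + 1][1];
                y_mag  := r.geovalue_array[i + 1][2];
                z_mag  := r.geovalue_array[i + 1][3];
                RETURN NEXT;
            END IF;
        END LOOP;
    END LOOP;
END;
$$ LANGUAGE plpgsql STABLE;

The application would then run e.g.
SELECT * FROM geodata_between('2004-10-21 02:03:04', '2004-10-21 02:04:08');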

Since you're dealing with mostly invariant-after-insert data, you can use
partitioning and then CLUSTER any partitions that won't be touched again on
an appropriate index, so that any remaining seq scan is faster, and VACUUM
ANALYZE each table once it has been clustered.
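
For example, for a partition that is no longer being written to (names
follow the partitioning sketch above; CLUSTER takes an exclusive lock and
rewrites the table, so it's best done once per "closed" partition):

CLUSTER geodata1sec_y2004m10d21h02
    USING geodata1sec_y2004m10d21h02_obstime_idx;
VACUUM ANALYZE geodata1sec_y2004m10d21h02;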

> Are there any alternatives to vanilla Postgresql for storing this type
> of data? I'm imagining PostGIS but for time based data?

I recently had to deal with something similar (though not on your
scale) for network monitoring - the thread is available at
  http://archives.postgresql.org/pgsql-performance/2009-08/msg00275.php


Cheers,
Hrishi