Advice requested on structuring aggregation queries

From: Dave Crooke
Date: 02/22/2010 07:01 PM
Hi folks

I have an application which collects performance stats at time intervals, to which I am retrofitting a table partitioning scheme in order to improve scalability.

The original data is keyed by a 3-ary tuple of strings .... to keep the row size down, in the new data model I'm actually storing 32-bit ints in Postgres. The new schema for each table looks like this:

(a integer,
 b integer,
 c integer,
 ts timestamp without time zone,
 value double precision)

with two indexes: (a, b, ts) and (b, ts)

If the table had a primary key, it would be (a, b, c, ts) but I don't need one, and I don't want to carry column "c" in the index as it isn't necessary ... the number of distinct values of c for any (a,b) pair is small (less than 10). Queries that read the data for graphing are of the form:

..... where a=<constant> and b=<constant> and ts between <x> and <y>

One of the things I need to do periodically is roll up this data, e.g. calculate min / max / average values of a given datum over a period for use in graphing at larger time scales. The rollup tables use the same schema as the raw data ones.

There are about 60 different values of b, and for each such value there is exactly one type of rollup. The old code is doing the rollups in Postgres with 60 bulk "insert into .... select" statements, hence the need for the second index.

For rollups which are intrinsic SQL functions, and only depend on the value column, this is straightforward, e.g. to take the averages:

insert into rollup_table
select a, 17, c, '2009-02-22 19:00', avg(value)
from data_table
where b=17
and ts between '2009-02-22 18:00' and '2009-02-22 19:00'
group by a, c;

So far so good.

This gets slightly more tricky if the rollup you want to do isn't a simple SQL function .... I have two such cases, (a) the most recent value and (b) the modal (most common) value. In both cases, I've been doing this using a dependent subquery in the select clause, e.g. for the most recent value:

insert into rollup_table
select a, 23, c, '2009-02-23 00:00',
   (select y.value from
      (select z.value, z.ts from data_table z where z.a=x.a and z.b=23 and z.c=x.c) y
   order by y.ts desc limit 1)
from data_table x
where b=23
group by a, c;

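The modal-value case uses the same pattern, e.g. (a sketch; in practice the mode rollup applies to different b values, 23 is just for illustration):

insert into rollup_table
select a, 23, c, '2009-02-23 00:00',
   (select z.value from data_table z
    where z.a=x.a and z.b=23 and z.c=x.c
    group by z.value
    order by count(*) desc limit 1)
from data_table x
where b=23
group by a, c;
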
Due to the second index, this actually performs quite well, but it of course depends on doing a small index scan on the same data_table for each row of the outer select. I have a few questions:

Q1. Is this the right place for the subquery, or would it be better to move it to the where clause, e.g.
 
insert into rollup_table
select a, 23, c, '2009-02-23 00:00', value
from data_table x
where b=23 and
   ts=(select max(y.ts) from data_table y where y.a=x.a and y.b=23 and y.c=x.c);

or does the answer depend on row stats? Is there a rule of thumb on whether to prefer "limit 1" vs. "max"? There will be between 20 and 300 rows with different timestamps for each (a,b,c) tuple.
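A third possible form, which I haven't tried, would use PostgreSQL's DISTINCT ON to pick the latest row per (a, c) group and avoid the correlated subquery entirely (a sketch, untested):

insert into rollup_table
select distinct on (a, c)
       a, 23, c, '2009-02-23 00:00', value
from data_table
where b=23
order by a, c, ts desc;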


For better scalability, I am partitioning these tables by time .... I am not using PG's table inheritance and triggers to do the partitioning, but instead dynamically generating the SQL and table names in the application code (Java). In most cases, the rollups will still happen from a single source "data_table" and I plan to continue using the existing SQL, but I have a few cases where the source "data_table" rows may actually come from two adjacent tables.

For intrinsic SQL rollup functions like avg / min / max this is pretty easy:

insert into rollup_table
select a, 17, c, '... date ...', max(value) from (
   select ........ from data_table_1 where b=17 and .......
union all
   select ........ from data_table_2 where b=17 and .......
) u
group by a, c;


Q2. Is there any benefit to doing rollups (max() and group by) in the inner queries inside the UNION ALL, or is it a wash? For now I'm expecting each inner select to produce 20-200 rows per unique (a,b,c) combination, and to be unioning no more than 2 tables at a time. (I'm aware that max() and min() are composable, but that doing this with avg() and getting correct results involves using sums and counts in the subqueries and doing a division in the outer select; AFAICT that's orthogonal to the performance considerations of whether to do the inner rollups or not.)
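Concretely, the composable version of avg() would look something like this (a sketch, with the same elided predicates as above):

insert into rollup_table
select a, 17, c, '... date ...', sum(s) / sum(n) from (
   select a, c, sum(value) as s, count(*) as n
   from data_table_1 where b=17 and ....... group by a, c
union all
   select a, c, sum(value) as s, count(*) as n
   from data_table_2 where b=17 and ....... group by a, c
) u
group by a, c;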


Q3. (You saw it coming) What really has me wrapped around the axle is how to do the "most recent" and "mode" rollups efficiently in this scenario .... in the examples in Q1 above, in both cases the SQL references the data_table twice, which is fine if it's a real table, but in the partitioned scenario the data has to be pulled together at run time by the UNION ALL subquery, and I'd prefer not to execute it multiple times - I'm not much of a SQL wordsmith but I can't see a way to write this in plain SQL without repeating the subquery, e.g.

insert into rollup_table
select a, 23, c, '2009-02-23 00:00',
   (select y.value from
      (select z.value, z.ts from (... union all ...) z where z.a=x.a and z.b=23 and z.c=x.c) y
   order by y.ts desc limit 1)
from (... union all ...) x
group by a, c;

a. Is there a way to write this in plain SQL without syntactically repeating the "union all" subquery, or in a way that will ensure PG only runs that subquery once? (A candidate sketch follows after c. below.)

b. Is the PG query planner smart enough to figure out that the union all subquery will return the same result every time (assuming ISOLATION_SERIALIZABLE) and so only run it once?

c. Other ideas I had were
   (i) temporarily create a VIEW which bridges the two tables, and use one of the query structures from Q1 above; or
   (ii) write the results of some kind of subquery to a temporary table, run a query against that, then drop the temporary table; or
   (iii) pull the data back to the application layer and deal with it there; or
   (iv) maybe a stored procedure?
Is there something better?
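
Regarding a., the closest thing I can think of is a WITH clause (CTE), assuming PostgreSQL 8.4 or later; as I understand it, PG materializes a CTE's result once per query even when it is referenced twice. A sketch, untested (note the materialized result has no indexes, so the per-group probe becomes a scan of it):

insert into rollup_table
with merged as (
   select a, c, ts, value from data_table_1 where b=23
   union all
   select a, c, ts, value from data_table_2 where b=23
)
select a, 23, c, '2009-02-23 00:00',
   (select y.value from merged y
    where y.a=x.a and y.c=x.c
    order by y.ts desc limit 1)
from merged x
group by a, c;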


Q4. When partitioning tables for speed / performance reasons, what are good rules of thumb to use to determine how big to make the partitions?


P.S. I realized while drafting this email that instead of running a bulk insert ... select statement for each of the 60 values of b, I could group them according to the rollup algorithm being used, and switch the second index to being just (ts). Since the data is being laid down in time series order, the I/O pattern this creates on the table pages should look like a sequential scan even though it's index driven.
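
For instance (hypothetical b values), all the avg-type rollups could then share one pass driven by the (ts) index:

create index data_table_ts_idx on data_table (ts);

insert into rollup_table
select a, b, c, '2009-02-22 19:00', avg(value)
from data_table
where b in (17, 18, 19)   -- all the b values whose rollup is avg()
  and ts between '2009-02-22 18:00' and '2009-02-22 19:00'
group by a, b, c;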


All advice would be most gratefully appreciated!

Cheers
Dave

Re: Advice requested on structuring aggregation queries

From: Joe Conway
Date:
On 02/22/2010 07:01 PM, Dave Crooke wrote:
> The original data is keyed by a 3-ary tuple of strings .... to keep the
> row size down, in the new data model I'm actually storing 32-bit ints
> in Postgres. The new schema for each table looks like this:
>
> (a integer,
>  b integer,
>  c integer,
>  ts timestamp without time zone,
>  value double precision)
>
> with two indexes: (a, b, ts) and (b, ts)

[...snip...]

> There are about 60 different values of b, and for each such value there
> is exactly one type of rollup. The old code is doing the rollups in
> Postgres with 60 bulk "insert into .... select" statements, hence the
> need for the second index.

[...snip...]

> For better scalability, I am partitioning these tables by time .... I am
> not using PG's table inheritance and triggers to do the partitioning,
> but instead dynamically generating the SQL and table names in the
> application code (Java). In most cases, the rollups will still happen
> from a single source "data_table" and I plan to continue using the
> existing SQL, but I have a few cases where the source "data_table" rows
> may actually come from two adjacent tables.

Without going through your very long set of questions in detail, it
strikes me that you might be better off if you:

1) use PostgreSQL partitioning (constraint exclusion)
2) partition by ts range
3) consider also including b in your partitioning scheme
4) create one index as (ts, a)
5) use dynamically generated SQL and table names in the application
   code to create (conditionally) and load the tables
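
For example, a minimal sketch of 1), 2) and 4), with hypothetical table names:

create table data_parent (
   a integer,
   b integer,
   c integer,
   ts timestamp without time zone,
   value double precision
);

create table data_2010_02 (
   check (ts >= '2010-02-01' and ts < '2010-03-01')
) inherits (data_parent);

create index data_2010_02_ts_a_idx on data_2010_02 (ts, a);

-- with constraint_exclusion = on, a query against data_parent that
-- filters on ts skips partitions whose CHECK range cannot match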

But of course test both this and your proposed method and compare ;-)

Also you might consider PL/R for some of your analysis (e.g. mode would
be simple, but perhaps not as fast):
  http://www.joeconway.com/web/guest/pl/r

HTH,

Joe

