Thread: SELECT max(time) group by problem

SELECT max(time) group by problem

From

Heigo Niilop

Date:

30 July 2009, 06:38:43

hi,



I have table



CREATE TABLE table

(

  id integer NOT NULL,

  timest timestamp with time zone NOT NULL,

  db_time timestamp with time zone NOT NULL DEFAULT now(),

  "values" text[],

  CONSTRAINT table_pkey PRIMARY KEY (id, timest)

)



=84id=93 have foreign key with table1



and when I try to do



SELECT MAX(table.timest)  FROM table, table1  WHERE

table.id=3Dtable1.id and

table1.id in (1,2,3) GROUP BY table.id



then it is terrible slow, when I use strange syntax



SELECT table.timest  FROM table,table1 WHERE

table.id=3Dtable1.id and table1.id in(1,2,3) and table.timest=3D

(SELECT max(timest) FROM table WHERE table.id=3Dtable1.id)



I receive all needed data very fast.



My questions are

1)      why this first query is slow and what I can do to make it faster
(some more indexes??)?

2)      what kind of danger I have with second query (so far I have right
data)?





I have Postgres 8.3 and table have over million rows.



Regards,

Heigo

Re: SELECT max(time) group by problem

From

nha

Date:

30 July 2009, 21:48:27

Hello,

Le 30/07/09 11:38, Heigo Niilop a écrit :
> hi,
> 
> I have table
> 
> CREATE TABLE table
> (
>   id integer NOT NULL,
>   timest timestamp with time zone NOT NULL,
>   db_time timestamp with time zone NOT NULL DEFAULT now(),
>   "values" text[],
>   CONSTRAINT table_pkey PRIMARY KEY (id, timest)
> )
> [...] 
> SELECT MAX(table.timest)  FROM table, table1  WHERE
> table.id=table1.id and table1.id in (1,2,3) GROUP BY table.id
> 
> [...] it is terrible slow,
> when I use strange syntax
> 
> SELECT table.timest  FROM table,table1 WHERE
> table.id=table1.id and table1.id in(1,2,3) and table.timest=
> (SELECT max(timest) FROM table WHERE table.id=table1.id)  
> 
> I receive all needed data very fast.
> 
> My questions are
> 1)      why this first query is slow and what I can do to make it faster
> (some more indexes??)?
> 2)      what kind of danger I have with second query (so far I have
> right data)?   
> 
> I have Postgres 8.3 and table have over million rows. [...]

Q1) EXPLAIN ANALYZE output would likely give (at least partially) some
clarification about observed performance.

According to "table" definition, the implicit index created on
"table_pkey" would not be efficiently used in the 1st query because of
explicit aggregation on column "id" (ie. partial key of "table"). Full
scan of "table" is assumed for the join despite index scan on "table1"
with "id" index. Each "table1" row is then joined with million of rows
of "table" before matching WHERE clauses (as these latter apply for each
row resulting from join). Slowness is expected.

3 ideas (with or without combination) of improvement come to my mind at
this point:
ID1- Definition of explicit index on "table.id": this would lightly
quicken aggregation by "id" and join on "id";
ID2- Aggregation on "table1.id" instead of "table.id": because of
reference declaration of "table.id" on "table1.id", "table1.id" is
assumed to be a (primary) key of "table1" (and thence bound to an index,
speeding up aggregation);
ID3- Integration of WHERE clause "table1.id IN (1,2,3)" into a subquery
on "table1" (because this filter is independent from "table") and use of
this subquery instead of "table1" call for join. This would reduce the
size of the table to join to "table" and thence reduce the number of
join rows at a sooner stage of query execution.

A possible rewritten query would express as follows:
SELECT MAX(t.timest)
FROM table t
INNER JOIN (SELECT id FROM table1 WHERE id IN (1,2,3)) t1
ON t.id = t1.id
GROUP BY t1.id;
(In this case, indexing on "table1.id" is not necessary although
recommended.)

Inviting you to assess this proposal (or a derivative according to ideas
1, 2, and/or 3).

A 4th idea may consist in directly looking up "table.id IN (1,2,3)" in
case of certainty on these 3 values in column "table1.id". Such a way
strengthens the efficiency of an index to declare on "table.id" so that
lookup quickens.

The query would look like the following:
SELECT MAX(timest)
FROM table
WHERE id IN (1,2,3)
GROUP BY id;

Q2) On the other hand, the 2nd query seems to take advantage of "table"
index on its primay key (id,timest) as the 2 columns are explicitely
referred in WHERE clauses. The sub-select would be a bottleneck; but the
WHERE clause of this sub-select refers to a literal against column "id",
ie. hash join is assumed to be efficient and used in this case by the
database engine for a faster evaluation.

Eventually I do not guess what you have in mind by "danger" from this
2nd query. Maybe I missed some elements.

Regards.
--
nha / Lyon / France.