Re: SELECT max(time) group by problem - Mailing list pgsql-sql
From | nha |
---|---|
Subject | Re: SELECT max(time) group by problem |
Date | |
Msg-id | 4A723EF7.20003@free.fr Whole thread Raw |
In response to | SELECT max(time) group by problem (Heigo Niilop <heigo.niilop@oskando.ee>) |
List | pgsql-sql |
Hello, Le 30/07/09 11:38, Heigo Niilop a écrit : > hi, > > I have table > > CREATE TABLE table > ( > id integer NOT NULL, > timest timestamp with time zone NOT NULL, > db_time timestamp with time zone NOT NULL DEFAULT now(), > "values" text[], > CONSTRAINT table_pkey PRIMARY KEY (id, timest) > ) > [...] > SELECT MAX(table.timest) FROM table, table1 WHERE > table.id=table1.id and table1.id in (1,2,3) GROUP BY table.id > > [...] it is terrible slow, > when I use strange syntax > > SELECT table.timest FROM table,table1 WHERE > table.id=table1.id and table1.id in(1,2,3) and table.timest= > (SELECT max(timest) FROM table WHERE table.id=table1.id) > > I receive all needed data very fast. > > My questions are > 1) why this first query is slow and what I can do to make it faster > (some more indexes??)? > 2) what kind of danger I have with second query (so far I have > right data)? > > I have Postgres 8.3 and table have over million rows. [...] Q1) EXPLAIN ANALYZE output would likely give (at least partially) some clarification about observed performance. According to "table" definition, the implicit index created on "table_pkey" would not be efficiently used in the 1st query because of explicit aggregation on column "id" (ie. partial key of "table"). Full scan of "table" is assumed for the join despite index scan on "table1" with "id" index. Each "table1" row is then joined with million of rows of "table" before matching WHERE clauses (as these latter apply for each row resulting from join). Slowness is expected. 3 ideas (with or without combination) of improvement come to my mind at this point: ID1- Definition of explicit index on "table.id": this would lightly quicken aggregation by "id" and join on "id"; ID2- Aggregation on "table1.id" instead of "table.id": because of reference declaration of "table.id" on "table1.id", "table1.id" is assumed to be a (primary) key of "table1" (and thence bound to an index, speeding up aggregation); ID3- Integration of WHERE clause "table1.id IN (1,2,3)" into a subquery on "table1" (because this filter is independent from "table") and use of this subquery instead of "table1" call for join. This would reduce the size of the table to join to "table" and thence reduce the number of join rows at a sooner stage of query execution. A possible rewritten query would express as follows: SELECT MAX(t.timest) FROM table t INNER JOIN (SELECT id FROM table1 WHERE id IN (1,2,3)) t1 ON t.id = t1.id GROUP BY t1.id; (In this case, indexing on "table1.id" is not necessary although recommended.) Inviting you to assess this proposal (or a derivative according to ideas 1, 2, and/or 3). A 4th idea may consist in directly looking up "table.id IN (1,2,3)" in case of certainty on these 3 values in column "table1.id". Such a way strengthens the efficiency of an index to declare on "table.id" so that lookup quickens. The query would look like the following: SELECT MAX(timest) FROM table WHERE id IN (1,2,3) GROUP BY id; Q2) On the other hand, the 2nd query seems to take advantage of "table" index on its primay key (id,timest) as the 2 columns are explicitely referred in WHERE clauses. The sub-select would be a bottleneck; but the WHERE clause of this sub-select refers to a literal against column "id", ie. hash join is assumed to be efficient and used in this case by the database engine for a faster evaluation. Eventually I do not guess what you have in mind by "danger" from this 2nd query. Maybe I missed some elements. Regards. -- nha / Lyon / France.