Query Planner Filtering Of Specified Value From 'On Distinct' Clause - Mailing list pgsql-general
From | Alex Satrapa |
---|---|
Subject | Query Planner Filtering Of Specified Value From 'On Distinct' Clause |
Date | |
Msg-id | 3FB88D16.30007@lintelsys.com.au Whole thread Raw |
Responses |
Re: Query Planner Filtering Of Specified Value From 'On Distinct' Clause
|
List | pgsql-general |
I have a table from which I'm trying to extract certain information. For historical reasons, we archive every action on a particular thing ('thing' is identified, funnily enough, by 'id'). So the only way to find out the current state of a particular combination of attributes is to "select distinct on (id, ...) ... order by date desc". In the examples below, I've taken real output from psql and done a global search/replace on various names to convince myself that I'm not leaking information. The numbers are intact, just the column/table/view names are different. For sake of argument, here is the table structure: database=> \d souce Table "public.source" Column | Type | Modifiers ---------+-----------------------------+--------------- id | integer | not null date | timestamp without time zone | default 'now' second | integer | not null third | text | not null Indexes: source btree (id) This table contains about 98000 rows. Let's create a view into the "source" table: database=> create view myview as select distinct on (id, second, third) * from source order by id, second, third, date desc; CREATE VIEW Now let's look at the query plan for "select * from myview": database=> explain analyze select * from myview; QUERY PLAN -------------------------------------------------------------- Subquery Scan myview (cost=12483.55..13463.96 rows=9804 width=63) (actual time=1262.08..1450.68 rows=21089 loops=1) -> Unique (cost=12483.55..13463.96 rows=9804 width=63) (actual time=1262.06..1406.71 rows=21089 loops=1) -> Sort (cost=12483.55..12728.65 rows=98041 width=63) (actual time=1262.06..1339.34 rows=98063 loops=1) Sort Key: source.id, source.second, source."third", source.date -> Seq Scan on source (cost=0.00..2247.41 rows=98041 width=63) (actual time=0.02..269.07 rows=98063 loops=1) Total runtime: 1467.78 msec (6 rows) I'm not sure if there's really anything I can do to make that go faster. But look what happens when I specify a particular value for one of the fields in the "on distinct" clause: database=> explain analyze select * from myview where id=12345; QUERY PLAN -------------------------------------------------------------- Subquery Scan myview (cost=12483.55..13463.96 rows=9804 width=63) (actual time=1446.58..1446.81 rows=2 loops=1) Filter: (id = 12345) -> Unique (cost=12483.55..13463.96 rows=9804 width=63) (actual time=1258.87..1434.39 rows=21089 loops=1) -> Sort (cost=12483.55..12728.65 rows=98041 width=63) (actual time=1258.87..1336.89 rows=98063 loops=1) Sort Key: source.id, source.second, source."third", source.date -> Seq Scan on source (cost=0.00..2247.41 rows=98041 width=63) (actual time=0.02..268.54 rows=98063 loops=1) Total runtime: 1457.53 msec (7 rows) Hmm.. not as efficient as I would have liked. The planner happily executed the whole view, and only then filtered out just the bits that I wanted. The most expensive parts of the query were done on the unfiltered set. The result set is only 0.002% of the whole data set, so I was a little shocked that it took *longer* to generate the results that I wanted, than it took to create the unfiltered results! Is there any reason why it would not be sensible to push the filter down to just after the first sequential scan of source? Since "id" is one of the fields in the "on distinct" condition, doesn't it make sense that pushing the filter further down would reduce the amount of work done to get exactly the same result? The same would be true if I was to filter on a specific "second" or "third" value. I understand that filtering on columns not present in the "on distinct" clause would have drastically different effects when done before or after the "Unique" processing. However, since the column(s) I'm filtering on are only those present in the "on distinct" clause, it makes no difference whether the filter is before or after the "Unique"-ifying. Am I totally stark, raving mad? Alex Satrapa
pgsql-general by date: