Query Planner Filtering Of Specified Value From 'On Distinct' Clause - Mailing list pgsql-general

From Alex Satrapa
Subject Query Planner Filtering Of Specified Value From 'On Distinct' Clause
Date
Msg-id 3FB88D16.30007@lintelsys.com.au
Whole thread Raw
Responses Re: Query Planner Filtering Of Specified Value From 'On Distinct' Clause
List pgsql-general
I have a table from which I'm trying to extract certain information. For
historical reasons, we archive every action on a particular thing
('thing' is identified, funnily enough, by 'id'). So the only way to
find out the current state of a particular combination of attributes is
to "select distinct on (id, ...) ... order by date desc".

In the examples below, I've taken real output from psql and done a
global search/replace on various names to convince myself that I'm not
leaking information. The numbers are intact, just the column/table/view
names are different.

For sake of argument, here is the table structure:
database=> \d souce
              Table "public.source"
  Column  |            Type             | Modifiers
---------+-----------------------------+---------------
  id      | integer                     | not null
  date    | timestamp without time zone | default 'now'
  second  | integer                     | not null
  third   | text                        | not null
Indexes: source btree (id)


This table contains about 98000 rows. Let's create a view into the
"source" table:

   database=> create view myview as select distinct
   on (id, second, third) *
   from source
   order by id, second, third, date desc;
   CREATE VIEW

Now let's look at the query plan for "select * from myview":

   database=> explain analyze select * from myview;
                                 QUERY PLAN
   --------------------------------------------------------------
   Subquery Scan myview  (cost=12483.55..13463.96 rows=9804 width=63)
   (actual time=1262.08..1450.68 rows=21089 loops=1)
      ->  Unique  (cost=12483.55..13463.96 rows=9804 width=63)
   (actual time=1262.06..1406.71 rows=21089 loops=1)
        ->  Sort  (cost=12483.55..12728.65 rows=98041 width=63)
   (actual time=1262.06..1339.34 rows=98063 loops=1)
             Sort Key: source.id, source.second, source."third",
   source.date
          ->  Seq Scan on source  (cost=0.00..2247.41 rows=98041
   width=63) (actual time=0.02..269.07 rows=98063 loops=1)
   Total runtime: 1467.78 msec
   (6 rows)

I'm not sure if there's really anything I can do to make that go faster.
But look what happens when I specify a particular value for one of the
fields in the "on distinct" clause:

   database=> explain analyze select * from myview where id=12345;
                                 QUERY PLAN
   --------------------------------------------------------------
   Subquery Scan myview  (cost=12483.55..13463.96 rows=9804 width=63)
   (actual time=1446.58..1446.81 rows=2 loops=1)
   Filter: (id = 12345)
     ->  Unique  (cost=12483.55..13463.96 rows=9804 width=63)
   (actual time=1258.87..1434.39 rows=21089 loops=1)
       ->  Sort  (cost=12483.55..12728.65 rows=98041 width=63)
   (actual time=1258.87..1336.89 rows=98063 loops=1)
             Sort Key: source.id, source.second, source."third",
   source.date
         ->  Seq Scan on source  (cost=0.00..2247.41 rows=98041 width=63)
   (actual time=0.02..268.54 rows=98063 loops=1)
   Total runtime: 1457.53 msec
   (7 rows)


Hmm.. not as efficient as I would have liked. The planner happily
executed the whole view, and only then filtered out just the bits that I
wanted. The most expensive parts of the query were done on the
unfiltered set. The result set is only 0.002% of the whole data set, so
I was a little shocked that it took *longer* to generate the results
that I wanted, than it took to create the unfiltered results!

Is there any reason why it would not be sensible to push the filter down
to just after the first sequential scan of source? Since "id" is one of
the fields in the "on distinct" condition, doesn't it make sense that
pushing the filter further down would reduce the amount of work done to
get exactly the same result?  The same would be true if I was to filter
on a specific "second" or "third" value.

I understand that filtering on columns not present in the "on distinct"
clause would have drastically different effects when done before or
after the "Unique" processing. However, since the column(s) I'm
filtering on are only those present in the "on distinct" clause, it
makes no difference whether the filter is before or after the
"Unique"-ifying.

Am I totally stark, raving mad?
Alex Satrapa


pgsql-general by date:

Previous
From: Doug McNaught
Date:
Subject: Re: uploading files
Next
From: "Yury Shvetsov"
Date:
Subject: SQL text of view