Thread: PostgreSQL 8.3.3 chooses wrong query plan when LIMIT 1 added?

PostgreSQL 8.3.3 chooses wrong query plan when LIMIT 1 added?

From
Mark Cave-Ayland
Date:
Hi everyone,

I'm experiencing a strange issue with PostgreSQL 8.3.3 whereby adding
"LIMIT 1" to the query increases the query time from several 10s of ms
to over 5s, and was wondering if anyone with more planner-fu can shed
some light on this.

The database in question is being used to store information from SVN
logs in a very simple format using a revision_files table given below:


svnlog=# \d revision_files
   Table "public.revision_files"
    Column    |  Type  | Modifiers
-------------+--------+-----------
  revision_id | bigint |
  file_id     | bigint |
Indexes:
     "revision_files_file_id_idx" btree (file_id)
     "revision_files_revision_id_idx" btree (revision_id)


What happens is that for each SVN revision, a row is inserted for each
file currently within that revision so that obtaining a list of all the
individual files within a given revision is nice and fast. As a result
of this, the table grows quite quickly:


svnlog=# select count(*) from revision_files ;
   count
----------
  73886238
(1 row)


Now the problem comes with one particular query where I want to find the
id of the last SVN revision that touched one particular file, which
should be a reasonably instant query. However, what I am actually seeing
is this:


svnlog=# SELECT revision_id FROM revision_files WHERE file_id=(SELECT
file_id
FROM files WHERE filepath='/trunk/app/widgets/gimptoolbox-dnd.c' LIMIT 1)
ORDER BY revision_id DESC LIMIT 1;
  revision_id
-------------
        15011
(1 row)

Time: 5090.449 ms
svnlog=# explain analyze SELECT revision_id FROM revision_files WHERE
file_id=(SELECT file_id
FROM files WHERE filepath='/trunk/app/widgets/gimptoolbox-dnd.c' LIMIT 1)
ORDER BY revision_id DESC LIMIT 1;

            QUERY PLAN


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=16.32..354.70 rows=1 width=8) (actual
time=5018.176..5018.177 rows=1 loops=1)
    InitPlan
      ->  Limit  (cost=0.00..16.32 rows=1 width=8) (actual
time=0.091..0.092 rows=1 loops=1)
            ->  Index Scan using files_filepath_idx on files
(cost=0.00..16.32 rows=1 width=8) (actual time=0.088..0.088 rows=1 loops=1)
                  Index Cond: ((filepath)::text =
'/trunk/app/widgets/gimptoolbox-dnd.c'::text)
    ->  Index Scan Backward using revision_files_revision_id_idx on
revision_files  (cost=0.00..3756450.27 rows=11101 width=8) (actual
time=5018.171..5018.171 rows=1 loops=1)
          Filter: (file_id = $0)
  Total runtime: 5018.237 ms
(8 rows)

Time: 5019.056 ms


Hmmmm so it seems to be favouring the backwards ordered index scan on
revision_files_revision_id_idx rather than just pulling out the rows
using revision_files_file_id_idx. This seems a strange choice given that
the overall table size is quite large. The interesting part is that if I
remove the LIMIT 1 I get this:


svnlog=# SELECT revision_id FROM revision_files WHERE file_id=(SELECT
file_id
FROM files WHERE filepath='/trunk/app/widgets/gimptoolbox-dnd.c' LIMIT 1)
ORDER BY revision_id DESC;
  revision_id
-------------
        15011
        15010
    ...
        14961
        14960
Time: 22.816 ms
svnlog=# explain analyze SELECT revision_id FROM revision_files WHERE
file_id=(SELECT file_id
FROM files WHERE filepath='/trunk/app/widgets/gimptoolbox-dnd.c' LIMIT 1)
ORDER BY revision_id DESC;

   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=57994.92..58022.67 rows=11101 width=8) (actual
time=42.904..50.153 rows=5297 loops=1)
    Sort Key: revision_files.revision_id
    Sort Method:  quicksort  Memory: 441kB
    InitPlan
      ->  Limit  (cost=0.00..16.32 rows=1 width=8) (actual
time=0.076..0.078 rows=1 loops=1)
            ->  Index Scan using files_filepath_idx on files
(cost=0.00..16.32 rows=1 width=8) (actual time=0.074..0.074 rows=1 loops=1)
                  Index Cond: ((filepath)::text =
'/trunk/app/widgets/gimptoolbox-dnd.c'::text)
    ->  Index Scan using revision_files_file_id_idx on revision_files
(cost=0.00..57232.71 rows=11101 width=8) (actual time=0.130..22.528
rows=5297 loops=1)
          Index Cond: (file_id = $0)
  Total runtime: 57.310 ms
(10 rows)

Time: 58.246 ms


So the part I'm not sure is why the revision_files_revision_id_idx index
is favoured over revision_files_file_id_idx when LIMIT 1 is present?
AFAICT the planner should favour the index scan on file_id given that
the estimated cost in the second query is much lower than the one
estimated for the revision_id index on the first query.


Many thanks,

Mark.

--
Mark Cave-Ayland
Sirius Corporation - The Open Source Experts
http://www.siriusit.co.uk
T: +44 870 608 0063

Re: PostgreSQL 8.3.3 chooses wrong query plan when LIMIT 1 added?

From
Simon Riggs
Date:
On Mon, 2008-10-27 at 16:39 +0000, Mark Cave-Ayland wrote:

> I'm experiencing a strange issue with PostgreSQL 8.3.3 whereby adding
> "LIMIT 1" to the query increases the query time from several 10s of ms
> to over 5s, and was wondering if anyone with more planner-fu can shed
> some light on this.

Sounds like one for the performance list.

LIMIT prevents the planner from transforming subselects. Maybe you want
EXISTS.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: PostgreSQL 8.3.3 chooses wrong query plan when LIMIT 1 added?

From
Tom Lane
Date:
Mark Cave-Ayland <mark.cave-ayland@siriusit.co.uk> writes:
> I'm experiencing a strange issue with PostgreSQL 8.3.3 whereby adding
> "LIMIT 1" to the query increases the query time from several 10s of ms
> to over 5s, and was wondering if anyone with more planner-fu can shed
> some light on this.

It's hoping that the backwards scan will hit a row with the requested
file_id quickly; which might be true on average but isn't true for this
particular file_id (nor, presumably, any file_id that hasn't been
updated recently).

You might consider a two-column index on (file_id, revision_id) to
make this type of query fast.

            regards, tom lane

Re: PostgreSQL 8.3.3 chooses wrong query plan when LIMIT 1 added?

From
Mark Cave-Ayland
Date:
Tom Lane wrote:

> It's hoping that the backwards scan will hit a row with the requested
> file_id quickly; which might be true on average but isn't true for this
> particular file_id (nor, presumably, any file_id that hasn't been
> updated recently).

Right. In the case of this schema, that is not true; here revision_files
contains the (revision_id, file_id) pair for every file that is
*present* within the given revision, not just the files that were
touched for each revision.

> You might consider a two-column index on (file_id, revision_id) to
> make this type of query fast.

Interesting. Adding this index seems to bring the query time down to
around 1s:


svnlog=# SELECT revision_id FROM revision_files WHERE file_id=(SELECT
file_id
FROM files WHERE filepath='/trunk/app/widgets/gimptoolbox-dnd.c' LIMIT 1)
ORDER BY revision_id DESC LIMIT 1;
  revision_id
-------------
        15011
(1 row)

Time: 935.816 ms


However, some more searching came up with this "ORDER BY x + 0"
variation which seems to consistently perform the fastest for varying
flavours of revision_id by forcing use of the file_id index:


svnlog=# SELECT revision_id FROM revision_files WHERE file_id=(SELECT
file_id
FROM files WHERE filepath='/trunk/app/widgets/gimptoolbox-dnd.c' LIMIT 1)
ORDER BY revision_id + 0 DESC LIMIT 1;
  revision_id
-------------
        15011
(1 row)

Time: 11.446 ms


Ah well. Even though it seems a bit of a kludge, it seems to keep the
application performing as expected so I'll have to stick with it.


ATB,

Mark.

--
Mark Cave-Ayland
Sirius Corporation - The Open Source Experts
http://www.siriusit.co.uk
T: +44 870 608 0063

Re: PostgreSQL 8.3.3 chooses wrong query plan when LIMIT 1 added?

From
Tom Lane
Date:
Mark Cave-Ayland <mark.cave-ayland@siriusit.co.uk> writes:
> svnlog=# SELECT revision_id FROM revision_files WHERE file_id=(SELECT
> file_id
> FROM files WHERE filepath='/trunk/app/widgets/gimptoolbox-dnd.c' LIMIT 1)
> ORDER BY revision_id DESC LIMIT 1;
>   revision_id
> -------------
>         15011
> (1 row)

> Time: 935.816 ms

Hmm, I'd expect it to be more or less instantaneous given the right
index.  What does EXPLAIN ANALYZE say about this?

> However, some more searching came up with this "ORDER BY x + 0"
> variation which seems to consistently perform the fastest for varying
> flavours of revision_id by forcing use of the file_id index:

And that?

            regards, tom lane

Re: PostgreSQL 8.3.3 chooses wrong query plan when LIMIT 1 added?

From
Gregory Stark
Date:
This looks like another form of the cross-column dependency problem. Postgres
is assuming that the revisions for all files will be evenly spread throughout
the date range and apparently there's a larger variety of dates than files so
it expects to find the last revision for that file fairly quickly scanning
backwards through the dates.

In fact of course files tend to be hot for a period of time and then mostly
idle, so depending on which file you pick that may work well if it's currently
hot or be absolutely terrible if it's a file that hasn't been touched
recently.

With the LIMIT Postgres favours the plan it thinks will return one row quickly
without sorting. Without it it's favouring the plan that will return all the
rows for that file_id most quickly.

I'm not sure what to suggest for this case if you can't change the data model
except perhaps increasing the statistics target.

One thing that comes to mind though, I would have defined one of those two
indexes to include both columns. Probably the file_id index, so you would have
an index on <revision_id> and an index on <file_id,revision_id>. That would
be a huge win for this query.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!