Re: Unexpected sequential scan on an indexed column - Mailing list pgsql-performance

From Eddy Escardo-Raffo
Subject Re: Unexpected sequential scan on an indexed column
Date
Msg-id 4eaa4a5e0911160318r68a3b61er5406f0886a7de0@mail.gmail.com
Whole thread Raw
In response to Unexpected sequential scan on an indexed column  (Eddy Escardo-Raffo <eescardo@kikini.com>)
Responses Re: Unexpected sequential scan on an indexed column
List pgsql-performance
OK, I think that after reading this doc (which I hadn't encountered before) about the optimizer, something clicked in my brain and I think I can answer my own question. I was basically thinking from my own perspective rather than from the query planner's perspective:
- From my perspective I know that the subselect will return very few values, so naively I expected that the planner would be able to do a bitmap index scan with the small set of values returned, without needing to do a join (such as the nested loop join it ended up choosing).
- However (and this is probably obvious to all of you), the query planner doesn't really know for a fact that a sub-select will result in a small number of rows, so it guesses based on its statistics what the best kind of join would be. A 'bitmap index scan' is not one of the choices for a join, I'm guessing because a 'nested loop join with inner index scan' is a more generally applicable strategy that can get the same order of magnitude of performance in restriction cases that end up being as simple as an IN (list) restriction. However, there are more competing possibilities for picking an appropriate join strategy than for picking a strategy to apply an IN (list) restriction, so the planner may not pick the 'nested loop join with inner index scan' if the ANALYZE statistics don't guide it that way, even if that would be the best strategy in the end.
I guess the only way I can think of to make a generic planner that would have performend well even in the lopsided statistics case is to create some plan nodes with contingency conditions. E.g.:
 
Plan: Nested loop join with sequential scan
Assumption: all table values are the same
Contingency plan: nested loop join with index scan
 
Then, if the assumption for the plan is violated early enough while executing the plan, the query executor would abort that plan node execution and start over with the contingency plan.
 
I guess implementing this kind of system in a generic way could get pretty hairy, and given my limited experience I don't know if the proportion of query plans that would be improved by having these kinds of contingency plans is significant enough to warrant the cost of developing this system, but I'm gathering that most query planners (including the postgres planner) don't do this kind of contingency planning :)
 
Thanks!
Eddy
On Sun, Nov 15, 2009 at 5:46 PM, Eddy Escardo-Raffo <eescardo@kikini.com> wrote:
I was using VALUES in my examples to more closely mirror the results of a sub-select (I abstracted everything else away and noticed that even just using VALUES syntax instead of a sub-select, the performance was bad). The full statement I had that led me into this more narrow investigation in the first place looks more like:
 
explain analyze SELECT u.userid FROM users u, (SELECT locid FROM locations WHERE ...) l WHERE u.location = l.locid LIMIT 10;
 
Based on the investigation so far, it seems like this kind of statement will perform well when the users.location distribution is not overwhelmingly lopsided, but not otherwise. However, using the IN (list) notation with a list of integer literals seems to perform well no matter what is the specific distribution of values in the users.location column.
 
I would like to understand why this is so, to help me write better queries in the future.
 
Thanks,
Eddy
On Sun, Nov 15, 2009 at 5:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Eddy Escardo-Raffo <eescardo@kikini.com> writes:
> For C, the planner estimated 10 thousand rows. For D, the planner estimated
> 100 thousand rows, yet for E the planner estimated only 1 row, which is the
> closest to reality. So, is there any way to specify a query that has a
> SUB-SELECT that returns a small set of values so that the planner treats it
> similar to how it treats statement E, or does statement E get its additional
> edge precisely from the fact that the restriction is defined by integer
> literals?

Currently there is no attempt to look at the exact contents of a VALUES
construct for planning purposes.  For the examples you're showing it
seems like the IN (list) notation is more compact and more widely used,
so improving the VALUES alternative doesn't seem that exciting.

                       regards, tom lane


pgsql-performance by date:

Previous
From: Laurent Laborde
Date:
Subject: Re: limiting performance impact of wal archiving.
Next
From: Wayne Beaver
Date:
Subject: Re: Manual vacs 5x faster than autovacs?