Re: Adding additional index causes 20,000x slowdown for certain select queries - postgres 9.0.3 - Mailing list pgsql-performance

From Robert Haas
Subject Re: Adding additional index causes 20,000x slowdown for certain select queries - postgres 9.0.3
Date
Msg-id BANLkTim2W3c-2J31J7=7kr=cHs4SapprMA@mail.gmail.com
In response to Re: Adding additional index causes 20,000x slowdown for certain select queries - postgres 9.0.3  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On Wed, Mar 16, 2011 at 1:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That isn't going to dissuade the planner from using that index for this
> query.  It would result in the scan being a forward indexscan instead of
> backwards.  Now it'd be worth trying that, to see if you and Kevin are
> right that it's the backwards aspect that's hurting.  I'm not convinced
> though.  I suspect the issue is that the planner is expecting the target
> records (the ones selected by the filter condition) to be approximately
> equally distributed in the month ordering, but really there is a
> correlation which causes them to be much much further back in the index
> than it expects.  So a lot more of the index has to be scanned than it's
> expecting.

This has got to be one of the five most frequently reported planner
problems, and it's nearly always with a *backward* index scan.  So I
agree with Kevin that we probably ought to have a TODO item to make
backward index scans look more expensive than forward index scans,
maybe related in some way to the correlation estimates for the
relevant columns.
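
To make the shape of the problem concrete, here is a sketch; the table,
column, and index names below are invented for illustration, not taken
from the original report:

    -- Hypothetical schema: an equality filter on one column, ordering
    -- on another column that has its own index.
    CREATE TABLE events (
        id         bigserial PRIMARY KEY,
        account_id integer NOT NULL,
        month      date    NOT NULL
    );
    CREATE INDEX events_month_idx ON events (month);

    -- The planner can satisfy the ORDER BY with a backward scan of
    -- events_month_idx, applying the account_id filter as it goes.
    -- That is cheap only if rows with account_id = 42 are spread evenly
    -- through the month ordering; if they cluster far from the end of
    -- the index, the scan reads most of the index before the LIMIT is
    -- satisfied.
    SELECT *
      FROM events
     WHERE account_id = 42
     ORDER BY month DESC
     LIMIT 10;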

But I don't really think that's the root of the problem.  When
confronted with this type of query, you can either filter-then-sort,
or index-scan-in-desired-order-then-filter.  I think the heart of the
problem is that we're able to estimate the cost of the first plan much
more accurately than the cost of the second one.  In many cases, the
filter is done using a sequential scan, which is easy to cost, and
even if it's done using a bitmap index scan the cost of that is also
pretty simple to estimate, as long as our selectivity estimate is
somewhere in the ballpark.  The cost of the sort depends primarily on
how many rows we need to sort, and if the qual is something like an
equality condition, as in this case, then we'll know that pretty
accurately as well.  So we're good.

On the other hand, when we use an index scan to get the rows in order,
and then apply the filter condition to them, the cost of the index
scan is heavily dependent on how far we have to scan through the
index, and that depends on the distribution of values in the qual
column relative to the distribution of values in the index column.  We
have no data that allow us to estimate that, so we are basically
shooting in the dark.  This is a multi-column statistics problem, but
I think it's actually harder than what we usually mean by multi-column
statistics, where we only need to estimate selectivity.  A system that
can perfectly estimate the selectivity of state = $1 and zipcode = $2
might still be unable to tell us much about how many zipcodes we'd
have to read in ascending order to find a given number in some
particular state.
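
To spell that last example out as a query (again, a hypothetical schema,
not anything from this thread):

    CREATE TABLE addresses (
        id      bigserial PRIMARY KEY,
        state   text NOT NULL,
        zipcode text NOT NULL
    );
    CREATE INDEX addresses_zipcode_idx ON addresses (zipcode);

    -- A perfect selectivity estimate tells us how many rows satisfy
    -- state = 'AK', but not where those rows fall in zipcode order,
    -- and that is what determines how much of addresses_zipcode_idx an
    -- ordered index scan has to read before it finds enough of them.
    SELECT *
      FROM addresses
     WHERE state = 'AK'
     ORDER BY zipcode
     LIMIT 50;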

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
