Re: Less selective index chosen unexpectedly - Mailing list pgsql-bugs

From James Coleman
Subject Re: Less selective index chosen unexpectedly
Date
Msg-id CAAaqYe-dinGV6Gh65jXWByQJHH9evU_ONrsxAgY96j9iLLBkNA@mail.gmail.com
In response to Re: Less selective index chosen unexpectedly  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Less selective index chosen unexpectedly  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-bugs
On Tue, May 18, 2021 at 5:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> James Coleman <jtc331@gmail.com> writes:
> > Specifically we have a table (simplified repro case):
> > create table items(d date, t text, fk integer);
> > create index on items(d);
> > create index on items(t, fk, d);
>
> > For a query like:
> > select * from items where d = '2021-05-18' and fk = 1 and t = 'type0' limit 1;
>
> > It's possible to get either an index scan on items_d_idx with a filter on
> > "fk" and "t" or an index scan on items_t_fk_d_idx without the need for a
> > filter. Even if both plans estimate a low cost and a single row, it seems
> > to be that the scan on the index containing more columns (and no filter)
> > ought to be pretty strongly preferred unless the cost or estimate rows is
> > dramatically higher.
>
> Actually not.  The multi-column index is going to be physically larger,
> which means that the estimated cost to descend into it is going to be
> larger just on the grounds of more I/O.  The extra comparisons to
> include the additional columns in that search aren't free either.
> Since you've specified LIMIT 1, you've also given up much of any cost
> advantage that might accrue to scanning items after the first match.
> Hence, the only way that the multi-column index is going to win out is
> if the extra filter conditions are estimated to be selective enough
> (relative to the condition on "d") that we have to scan multiple
> entries in the d-only index before getting the first match.
>
> Experimenting by adding
>
> explain select * from items where d = '2021-05-18' limit 1;
>
> (to see what the estimated selectivity of just that condition is)
> at each step of your script, I see that in the trouble cases,
> the "d" condition is by itself estimated to match only one row.
> If that were accurate, the planner would be quite right to pick
> the smaller index.
>
> The only thing I see that's really going wrong here is marginally
> inaccurate stats, especially right after a big insertion that's
> not reflected into the stats yet.  I'm not sure there's much to
> improve there.  You could increase the stats target some more,
> though of course that just pushes out the size of table where
> the issue will appear.

There are two axes that matter here in costing (I should have
communicated this better in my original email): 1) the correctness of
the selectivity estimate, and 2) the relative risk of getting that
estimate wrong.

What I'm interested in here is that, at least in this case (and I'm
theorizing it's more generalizable, but that's why I wanted to get
discussion on it), the risk of getting the selectivity wrong is quite
high. And it seems to me that the more specific index (though larger
and costlier to descend) is generally lower risk, albeit higher cost
in the cases that are best for the less specific index: if the
one-row estimate for "d" turns out to be off, the d-only plan has to
filter an unpredictable number of non-matching entries, while the
multi-column plan's worst case is still a single descent. In this
particular example it wouldn't even matter which permutation of
column ordering you have in the more specific index -- it's always
guaranteed to return the first row that's found by descending the
tree (excluding vacuumable rows).
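If it's helpful, here's a rough way to compare the two plans side by
side (this assumes the default index names items_d_idx and
items_t_fk_d_idx from the repro; the drop is rolled back, so it's
only for experimentation, though it does hold an exclusive lock on
the table for the duration of the transaction):

begin;
-- with both indexes present the planner may choose items_d_idx and
-- apply "fk" and "t" as a filter
explain select * from items
  where d = '2021-05-18' and fk = 1 and t = 'type0' limit 1;
-- temporarily hide the d-only index to see the cost attributed to the
-- fully-covering items_t_fk_d_idx scan (no filter needed)
drop index items_d_idx;
explain select * from items
  where d = '2021-05-18' and fk = 1 and t = 'type0' limit 1;
rollback;  -- restores items_d_idx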

James Coleman


