Thread: Strange Behaviour with multicolumn indexes
[PostgreSQL 11.5 (Ubuntu 11.5-1.pgdg18.04+1) on x86_64-pc-linux-gnu]

I have a table with many columns and many indexes (actually many tables
with many columns and many indexes), but for the sake of this posting,
we'll consider just three columns, which we unimaginatively call a, b,
and c. There are also three indexes:

    t_a_idx btree (a) WHERE a IS NOT NULL
    t_b_idx btree (b) WHERE b IS NOT NULL
    t_a_b_idx btree (a, b) WHERE a IS NOT NULL AND b IS NOT NULL

Now I have a query:

    select c from t where a='A' and b='B';

This uses t_b_idx, not - as I expected - t_a_b_idx.

If I create an index with the columns swapped:

    t_b_a_idx btree (b, a) WHERE b IS NOT NULL AND a IS NOT NULL

this index will be used.

The distribution of values in columns a and b is quite different: a has
10 different values of similar frequency (and no null values). b has
only a single non-null value, with a frequency of about 1 %.

So I definitely understand why it would prefer t_b_idx to t_a_idx, but
surely t_a_b_idx should be even better? After all, it would have to read
only 1/1000 of the rows instead of 1/100. It would also have to scan
much less of the index, so the fact that the index is a bit larger
shouldn't make a difference.

EXPLAIN shows that the row estimates are spot on, but the cost for using
t_a_b_idx is higher than for t_b_idx (which is in turn higher than for
t_b_a_idx).

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
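For readers who want to try this themselves, a minimal sketch of a setup
matching the description above (table name, column names, and data
volume are my own guesses; the real table is much wider):

```sql
-- Hypothetical reproduction of the setup described above.
CREATE TABLE t (a text, b text, c text);

INSERT INTO t (a, b, c)
SELECT chr(65 + (i % 10)),                   -- 10 values of a, similar frequency
       CASE WHEN i % 100 = 0 THEN 'B' END,   -- b non-null for ~1 % of rows
       'payload'
FROM generate_series(1, 1000000) AS s(i);

CREATE INDEX t_a_idx   ON t (a)    WHERE a IS NOT NULL;
CREATE INDEX t_b_idx   ON t (b)    WHERE b IS NOT NULL;
CREATE INDEX t_a_b_idx ON t (a, b) WHERE a IS NOT NULL AND b IS NOT NULL;

ANALYZE t;
EXPLAIN SELECT c FROM t WHERE a = 'A' AND b = 'B';
```

As noted later in the thread, a narrow table like this may not show
exactly the same plan choice as the real, wide table.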
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes: > we'll consider just three columns, which we unimaginatively call a, b, > and c. There are also three indexes: > t_a_idx btree (a) WHERE a IS NOT NULL > t_b_idx btree (b) WHERE b IS NOT NULL > t_a_b_idx btree (a, b) WHERE a IS NOT NULL AND b IS NOT NULL > Nowe I have a query > select c from t where a='A' and b='B'; > This uses t_b_idx, not - as I expected - t_a_b_idx. > The distribution of values in columns a and b is quite different: a has > 10 different values of similar frequency (and no null values). b has > only a single non-null value which with a frequency of about 1 %. > So I definitely understand why it would prefer t_b_idx to t_a_idx, but > certainly t_a_b_idx should be even better? Not necessarily --- t_a_b_idx is (presumably) physically bigger than t_b_idx, which makes it more expensive to search. The additional selectivity gain apparently doesn't outweigh that. > If I create an index with the columns swapped: > t_b_a_idx btree (b, a) WHERE b IS NOT NULL and a IS NOT NULL > this index will be used. Hmm. Probably that has something to do with a calculation about the selectivity of the leading index column, ie do you have to scan 10% of the index or 1% of the index. It's not taking the partial-index filter into account in that, I suspect, which skews the results in this case --- but that would be hard to account for accurately. Anyway I can't get excited about optimizing for a single non-null value. regards, tom lane
On 2019-09-12 12:54:55 -0400, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> > we'll consider just three columns, which we unimaginatively call a, b,
> > and c. There are also three indexes:
> >
> > t_a_idx btree (a) WHERE a IS NOT NULL
> > t_b_idx btree (b) WHERE b IS NOT NULL
> > t_a_b_idx btree (a, b) WHERE a IS NOT NULL AND b IS NOT NULL
> >
> > Now I have a query
> > select c from t where a='A' and b='B';

Just in case somebody wants to play along at home: A narrow table with
just these three columns and synthetic data shows the problem but
doesn't use the flipped multicolumn index either. I guess you have to
pad the table a bit to sufficiently change the costs.

> > This uses t_b_idx, not - as I expected - t_a_b_idx.
> >
> > The distribution of values in columns a and b is quite different: a has
> > 10 different values of similar frequency (and no null values). b has
> > only a single non-null value, with a frequency of about 1 %.
> >
> > So I definitely understand why it would prefer t_b_idx to t_a_idx, but
> > surely t_a_b_idx should be even better?
>
> Not necessarily --- t_a_b_idx is (presumably) physically bigger than
> t_b_idx, which makes it more expensive to search. The additional
> selectivity gain apparently doesn't outweigh that.

The index is about 20 % larger, but only 1/10 of it has to be searched,
so a search should take about 1.2/10 = 12 % as long - i.e. 88 % less
time.

> > If I create an index with the columns swapped:
> > t_b_a_idx btree (b, a) WHERE b IS NOT NULL AND a IS NOT NULL
> > this index will be used.
>
> Hmm. Probably that has something to do with a calculation about
> the selectivity of the leading index column, ie do you have to
> scan 10% of the index or 1% of the index.

Does it use only the selectivity of the first column? That would explain
it. The total selectivity of both columns is obviously the same.
> It's not taking the partial-index filter into account in that, I
> suspect, which skews the results in this case --- but that would be
> hard to account for accurately.

Hmm. Wouldn't that be a problem for partial indexes in general? They
usually cover only a small portion of the table, and if the selectivity
is computed relative to the whole table, the result may be way off.

I think it should be possible to adjust for a "WHERE column IS NOT NULL"
filter, because null_frac is in the statistics. For the general case you
would need an estimate of the number of rows covered by the index, which
I don't think we have.

> Anyway, I can't get excited about optimizing for a single non-null
> value.

That's certainly an extreme case, but I've seen bad plans for similar
queries every now and then. At least now I have an idea what causes it
and maybe a solution/workaround.

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
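The null_frac figure mentioned above is visible per column in the
pg_stats view (after ANALYZE). A sketch, assuming the table and column
names from earlier in the thread:

```sql
-- Fraction of NULLs per column, as recorded by ANALYZE.
-- For b, null_frac should be close to 0.99 in the scenario described,
-- which is exactly the information a "WHERE b IS NOT NULL" partial-index
-- filter could be adjusted with.
SELECT attname, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 't' AND attname IN ('a', 'b');
```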
On 2019-09-12 21:04:25 +0200, Peter J. Holzer wrote:
> On 2019-09-12 12:54:55 -0400, Tom Lane wrote:
> > It's not taking the partial-index filter into account in that, I
> > suspect, which skews the results in this case --- but that would be
> > hard to account for accurately.
>
> Hmm. Wouldn't that be a problem for partial indexes in general? They
> usually cover only a small portion of the table and if the selectivity
> is computed relative to the whole table the result may be way off.
>
> I think it should be possible to adjust for a "WHERE column IS NOT NULL"
> filter, because null_frac is in the statistics. For the general case you
> would need an estimate of the number of rows covered by the index, which
> I don't think we have.

Looking through the source, I see that the planner does estimate the
number of tuples in the index.

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
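That per-index tuple estimate is also visible from SQL: pg_class keeps a
reltuples figure for each index relation, maintained by VACUUM and
ANALYZE. A sketch, again assuming the index names from the thread:

```sql
-- For a partial index, reltuples counts only the rows the index covers,
-- so t_b_idx should show roughly 1 % of the table's row count here.
SELECT relname, reltuples
FROM pg_class
WHERE relname IN ('t_b_idx', 't_a_b_idx')
ORDER BY relname;
```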