Thread: Bad estimate with partial index

From: André Hänsel

Hi list,

I have a case where Postgres chooses the wrong index and I'm not sure what
to do about it:

https://dbfiddle.uk/?rdbms=postgres_14&fiddle=f356fd56a920ea8a93c192f5a8c16b1c

Setup:

CREATE TABLE t (
    filename int,
    cropped bool not null default false,
    resized bool not null default false,
    create_date date not null default '1970-01-01'
);

INSERT INTO t
SELECT generate_series(1, 1000000);

UPDATE t SET cropped = true, resized = true
WHERE filename IN (SELECT filename FROM t ORDER BY random() LIMIT 900000);
UPDATE t SET resized = false
WHERE filename IN (SELECT filename FROM t WHERE cropped = true
                   ORDER BY random() LIMIT 1000);

VACUUM FULL t;
ANALYZE t;

Data now looks like this:

SELECT cropped, resized, count(*)
FROM t
GROUP BY 1,2;
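
With that setup the counts come out like this (row order may differ):

 cropped | resized | count
---------+---------+--------
 f       | f       | 100000
 t       | f       |   1000
 t       | t       | 899000
(3 rows)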

I create two indexes:

CREATE INDEX idx_resized ON t(resized) WHERE NOT resized;
CREATE INDEX specific ON t(cropped,resized) WHERE cropped AND NOT resized;

And then run my query:

EXPLAIN ANALYZE
    SELECT count(*) FROM t
    WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;

Aggregate  (cost=4001.25..4001.26 rows=1 width=8) (actual time=478.557..478.558 rows=1 loops=1)
  ->  Index Scan using idx_resized on t  (cost=0.29..3777.71 rows=89415 width=0) (actual time=478.177..478.480 rows=1000 loops=1)
        Filter: (cropped AND (create_date < CURRENT_DATE))
        Rows Removed by Filter: 100000

It takes 478 ms on dbfiddle.uk (on my machine it's faster but the difference
is still visible).

Now I delete an index:

DROP INDEX idx_resized;

and run the same query again and I get a much better plan:

Aggregate  (cost=11876.27..11876.28 rows=1 width=8) (actual time=0.315..0.316 rows=1 loops=1)
  ->  Bitmap Heap Scan on t  (cost=35.50..11652.73 rows=89415 width=0) (actual time=0.054..0.250 rows=1000 loops=1)
        Recheck Cond: (cropped AND (NOT resized))
        Filter: (create_date < CURRENT_DATE)
        Heap Blocks: exact=6
        ->  Bitmap Index Scan on specific  (cost=0.00..13.15 rows=89415 width=0) (actual time=0.040..0.040 rows=1000 loops=1)

which uses the "specific" index and completes in under a millisecond on both
dbfiddle.uk and my machine.

Additional mystery: when I set the values not with an UPDATE but with a
DEFAULT, the correct index is chosen. What is going on?
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=dc7d8aea14e90f08ab6537a855f34d8c

Regards,
André




Re: Bad estimate with partial index

From: Tom Lane

André Hänsel <andre@webkr.de> writes:
> I have a case where Postgres chooses the wrong index and I'm not sure what
> to do about it:

The core problem here seems to be a poor estimate for the selectivity
of "WHERE cropped AND NOT resized":

regression=# EXPLAIN ANALYZE
SELECT count(*) FROM t
WHERE cropped AND NOT resized ;
...
   ->  Bitmap Heap Scan on t  (cost=35.26..6352.26 rows=91100 width=0) (actual time=0.121..0.190 rows=1000 loops=1)
         Recheck Cond: (cropped AND (NOT resized))
...

I think this is because the planner expects those two columns to be
independent, which they are completely not in your test data.  Perhaps
that assumption is more true in your real-world data, but since you're
here complaining, I suppose not :-(.  What you can do about that, in
recent Postgres versions, is to create extended statistics on the
combination of the columns:

regression=# create statistics t_stats on cropped, resized from t;
CREATE STATISTICS
regression=# analyze t;
ANALYZE
regression=# EXPLAIN ANALYZE
SELECT count(*) FROM t
WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3145.15..3145.16 rows=1 width=8) (actual time=9.765..9.766 rows=1 loops=1)
   ->  Index Scan using idx_resized on t  (cost=0.29..3142.65 rows=1000 width=0) (actual time=9.608..9.735 rows=1000 loops=1)
         Filter: (cropped AND (create_date < CURRENT_DATE))
         Rows Removed by Filter: 100000
 Planning Time: 0.115 ms
 Execution Time: 9.779 ms

Better estimate, but it's still using the wrong index :-(.  If we force
use of the other one:

regression=# drop index idx_resized;
DROP INDEX
regression=# EXPLAIN ANALYZE
regression-# SELECT count(*) FROM t
regression-# WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6795.38..6795.39 rows=1 width=8) (actual time=0.189..0.191 rows=1 loops=1)
   ->  Bitmap Heap Scan on t  (cost=13.40..6792.88 rows=1000 width=0) (actual time=0.047..0.147 rows=1000 loops=1)
         Recheck Cond: (cropped AND (NOT resized))
         Filter: (create_date < CURRENT_DATE)
         Heap Blocks: exact=6
         ->  Bitmap Index Scan on specific  (cost=0.00..13.15 rows=91565 width=0) (actual time=0.035..0.035 rows=1000 loops=1)
                                                              ^^^^^^^^^^
 Planning Time: 0.154 ms
 Execution Time: 0.241 ms

it looks like the problem is that the extended stats haven't been used
while forming the estimate of the number of index entries retrieved,
so we overestimate the cost of using this index.

That seems like a bug.  Tomas?

In the meantime, maybe you could dodge the problem by combining
"cropped" and "resized" into one multivalued column, so that there's
not a need to depend on extended stats to arrive at a decent estimate.
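
For instance, a rough sketch of that (the column and value names here are
just made up for illustration):

-- Fold the two booleans into a single state column:
ALTER TABLE t ADD COLUMN state text NOT NULL DEFAULT 'new';
UPDATE t SET state = CASE WHEN cropped AND NOT resized THEN 'cropped_only'
                          WHEN cropped AND resized     THEN 'cropped_resized'
                          ELSE 'new' END;

-- Now the interesting condition is a single-column equality, so the
-- ordinary per-column statistics describe its selectivity directly:
CREATE INDEX t_cropped_only ON t (state) WHERE state = 'cropped_only';
ANALYZE t;

EXPLAIN ANALYZE
    SELECT count(*) FROM t
    WHERE state = 'cropped_only' AND create_date < CURRENT_DATE;

With only three distinct values, the regular MCV statistics should capture
the small 'cropped_only' fraction well enough to get a sane row estimate.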

            regards, tom lane



Re: Bad estimate with partial index

From: Tom Lane

I wrote:
> it looks like the problem is that the extended stats haven't been used
> while forming the estimate of the number of index entries retrieved, so
> we overestimate the cost of using this index.
> That seems like a bug.  Tomas?

I dug into this enough to locate the source of the problem.
btcostestimate includes the partial index clauses in what it
sends to clauselist_selectivity, but per the comments for
add_predicate_to_index_quals:

 * Note that indexQuals contains RestrictInfo nodes while the indpred
 * does not, so the output list will be mixed.  This is OK for both
 * predicate_implied_by() and clauselist_selectivity(), but might be
 * problematic if the result were passed to other things.

That comment was true when it was written, but it's been falsified
by the extended-stats patches, which have added a whole lot of logic
in and under clauselist_selectivity that ignores clauses that are not
RestrictInfos.

While we could perhaps fix this by having add_predicate_to_index_quals
add RestrictInfos, I'm inclined to feel that the extended-stats code
is in the wrong.  The contract for clauselist_selectivity has always
been that it could optimize if given RestrictInfos rather than bare
clauses, not that it would fail to work entirely without them.
There are probably more places besides add_predicate_to_index_quals
that are relying on that.

            regards, tom lane



Re: Bad estimate with partial index

From: Tomas Vondra

On 4/19/22 23:08, Tom Lane wrote:
> I wrote:
>> it looks like the problem is that the extended stats haven't been used
>> while forming the estimate of the number of index entries retrieved, so
>> we overestimate the cost of using this index.
>> That seems like a bug.  Tomas?
> 
> I dug into this enough to locate the source of the problem.
> btcostestimate includes the partial index clauses in what it
> sends to clauselist_selectivity, but per the comments for
> add_predicate_to_index_quals:
> 
>  * Note that indexQuals contains RestrictInfo nodes while the indpred
>  * does not, so the output list will be mixed.  This is OK for both
>  * predicate_implied_by() and clauselist_selectivity(), but might be
>  * problematic if the result were passed to other things.
> 
> That comment was true when it was written, but it's been falsified
> by the extended-stats patches, which have added a whole lot of logic
> in and under clauselist_selectivity that ignores clauses that are not
> RestrictInfos.
> 
> While we could perhaps fix this by having add_predicate_to_index_quals
> add RestrictInfos, I'm inclined to feel that the extended-stats code
> is in the wrong.  The contract for clauselist_selectivity has always
> been that it could optimize if given RestrictInfos rather than bare
> clauses, not that it would fail to work entirely without them.
> There are probably more places besides add_predicate_to_index_quals
> that are relying on that.
> 

Yes, that seems like a fair assessment. I'll look into fixing this, not
sure how invasive it will get, though.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Bad estimate with partial index

From: Tomas Vondra

On 4/20/22 09:58, Tomas Vondra wrote:
> On 4/19/22 23:08, Tom Lane wrote:
>> I wrote:
>>> it looks like the problem is that the extended stats haven't been used
>>> while forming the estimate of the number of index entries retrieved, so
>>> we overestimate the cost of using this index.
>>> That seems like a bug.  Tomas?
>>
>> I dug into this enough to locate the source of the problem.
>> btcostestimate includes the partial index clauses in what it
>> sends to clauselist_selectivity, but per the comments for
>> add_predicate_to_index_quals:
>>
>>  * Note that indexQuals contains RestrictInfo nodes while the indpred
>>  * does not, so the output list will be mixed.  This is OK for both
>>  * predicate_implied_by() and clauselist_selectivity(), but might be
>>  * problematic if the result were passed to other things.
>>
>> That comment was true when it was written, but it's been falsified
>> by the extended-stats patches, which have added a whole lot of logic
>> in and under clauselist_selectivity that ignores clauses that are not
>> RestrictInfos.
>>
>> While we could perhaps fix this by having add_predicate_to_index_quals
>> add RestrictInfos, I'm inclined to feel that the extended-stats code
>> is in the wrong.  The contract for clauselist_selectivity has always
>> been that it could optimize if given RestrictInfos rather than bare
>> clauses, not that it would fail to work entirely without them.
>> There are probably more places besides add_predicate_to_index_quals
>> that are relying on that.
>>
> 
> Yes, that seems like a fair assessment. I'll look into fixing this, not
> sure how invasive it will get, though.
> 

So, here's a WIP fix that improves the example shared by Andre, and does
not seem to break anything (or at least not any regression test).

The whole idea is that instead of bailing out for non-RestrictInfo case,
it calculates the necessary information for the clause from scratch.
This means relids and pseudoconstant flag, which are checked to decide
if the clause is compatible with extended stats.

But when inspecting how to calculate pseudoconstant, I realized that
maybe that's not really needed. Per distribute_qual_to_rels() we only
set it to 'true' when bms_is_empty(relids), and we already check that
relids is a singleton, so it can't be empty - which means pseudoconstant
can't be true either.


Andre, are you in position to test this fix with your application? Which
Postgres version are you using, actually?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Bad estimate with partial index

From: Tom Lane

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> The whole idea is that instead of bailing out for non-RestrictInfo case,
> it calculates the necessary information for the clause from scratch.
> This means relids and pseudoconstant flag, which are checked to decide
> if the clause is compatible with extended stats.

Right.

> But when inspecting how to calculate pseudoconstant, I realized that
> maybe that's not really needed. Per distribute_qual_to_rels() we only
> set it to 'true' when bms_is_empty(relids), and we already check that
> relids is a singleton, so it can't be empty - which means pseudoconstant
> can't be true either.

Yeah, I would not bother with the pseudoconstant-related tests for a
bare clause.  Patch looks reasonably sane in a quick once-over otherwise,
and the fact that it fixes the presented test case is promising.
(If you set enable_indexscan = off, you can verify that the estimate
for the number of index entries retrieved is now sane.)  I did not look
to see if there were any other RestrictInfo dependencies, though.
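
Concretely, something like:

SET enable_indexscan = off;
EXPLAIN ANALYZE
    SELECT count(*) FROM t
    WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;
-- With the fix, the Bitmap Index Scan on "specific" should estimate
-- roughly the actual 1000 rows instead of the ~91000 seen before.
RESET enable_indexscan;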

            regards, tom lane



RE: Bad estimate with partial index

From: André Hänsel

Tomas Vondra wrote:
> Andre, are you in position to test this fix with your application? Which Postgres version are you using, actually?

There's a test case in my original email, which obviously was synthetic, but I could also test this with my original
application data if I can get a Postgres running with your patch.

I guess I could probably run the official Dockerfile and apply your patch somewhere between these lines?
https://github.com/docker-library/postgres/blob/e8ebf74e50128123a8d0220b85e357ef2d73a7ec/14/bullseye/Dockerfile#L138

André




Re: Bad estimate with partial index

From: Tomas Vondra

On 4/20/22 16:15, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> The whole idea is that instead of bailing out for non-RestrictInfo case,
>> it calculates the necessary information for the clause from scratch.
>> This means relids and pseudoconstant flag, which are checked to decide
>> if the clause is compatible with extended stats.
> 
> Right.
> 
>> But when inspecting how to calculate pseudoconstant, I realized that
>> maybe that's not really needed. Per distribute_qual_to_rels() we only
>> set it to 'true' when bms_is_empty(relids), and we already check that
>> relids is a singleton, so it can't be empty - which means pseudoconstant
>> can't be true either.
> 
> Yeah, I would not bother with the pseudoconstant-related tests for a
> bare clause.  Patch looks reasonably sane in a quick once-over otherwise,
> and the fact that it fixes the presented test case is promising.

Attached is a slightly more polished patch, adding a couple of comments and
removing the unnecessary pseudoconstant checks.

> (If you set enable_indexscan = off, you can verify that the estimate
> for the number of index entries retrieved is now sane.) 

I did that. Sorry for not mentioning that explicitly in my message.

> I did not look to see if there were any other RestrictInfo
> dependencies, though.

I checked the places messing with RestrictInfo in clausesel.c and
src/backend/statistics. The code in clausesel.c seems fine, and mcv.c
seems fine too (it merely strips the RestrictInfo).

But dependencies.c might need a fix too, although the issue is somewhat
inverse to this one, because it looks like this:

    if (IsA(clause, RestrictInfo))
    {
        ... do some checks ...
    }

so if there's no RestrictInfo on top, we just accept the clause. I guess
this should do the same relids check as the fix does, but I've been
unable to construct an example demonstrating the issue (it'd have to be
either pseudoconstant or reference multiple rels, which seems hard to
get in btcostestimate).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Bad estimate with partial index

From: Tom Lane

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> But dependencies.c might need a fix too, although the issue is somewhat
> inverse to this one, because it looks like this:

>     if (IsA(clause, RestrictInfo))
>     {
>         ... do some checks ...
>     }

> so if there's no RestrictInfo on top, we just accept the clause. I guess
> this should do the same relids check as the fix does, but I've been
> unable to construct an example demonstrating the issue (it'd have to be
> either pseudoconstant or reference multiple rels, which seems hard to
> get in btcostestimate).

Hm.  You could get an indexqual referencing other rels when considering
doing a join via a nestloop with parameterized inner indexscan.  However,
that would always be a query WHERE clause, which'd have a RestrictInfo.
At least in this code path, a bare clause would have to be a partial
index's predicate, which could not reference any other rels.  The
pseudoconstant case would require a predicate reducing to WHERE FALSE
or WHERE TRUE, which is at best pointless, though I'm not sure that
we prevent it.

You might have to go looking for other code paths that can pass a
bare clause if you want a test case for this.

            regards, tom lane