Re: Bad estimate with partial index - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Bad estimate with partial index
Date
Msg-id 2568735.1650391306@sss.pgh.pa.us
Whole thread Raw
In response to Bad estimate with partial index  (André Hänsel <andre@webkr.de>)
Responses Re: Bad estimate with partial index
List pgsql-hackers
=?iso-8859-1?Q?Andr=E9_H=E4nsel?= <andre@webkr.de> writes:
> I have a case where Postgres chooses the wrong index and I'm not sure what
> to do about it:

The core problem here seems to be a poor estimate for the selectivity
of "WHERE cropped AND NOT resized":

regression=# EXPLAIN ANALYZE
SELECT count(*) FROM t
WHERE cropped AND NOT resized ;
...
   ->  Bitmap Heap Scan on t  (cost=35.26..6352.26 rows=91100 width=0) (actual time=0.121..0.190 rows=1000 loops=1)
         Recheck Cond: (cropped AND (NOT resized))
...

I think this is because the planner expects those two columns to be
independent, which they are completely not in your test data.  Perhaps
that assumption is more true in your real-world data, but since you're
here complaining, I suppose not :-(.  What you can do about that, in
recent Postgres versions, is to create extended statistics on the
combination of the columns:

regression=# create statistics t_stats on cropped, resized from t;
CREATE STATISTICS
regression=# analyze t;
ANALYZE
regression=# EXPLAIN ANALYZE
SELECT count(*) FROM t
WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;
                                                          QUERY PLAN
      

------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3145.15..3145.16 rows=1 width=8) (actual time=9.765..9.766 rows=1 loops=1)
   ->  Index Scan using idx_resized on t  (cost=0.29..3142.65 rows=1000 width=0) (actual time=9.608..9.735 rows=1000
loops=1)
         Filter: (cropped AND (create_date < CURRENT_DATE))
         Rows Removed by Filter: 100000
 Planning Time: 0.115 ms
 Execution Time: 9.779 ms

Better estimate, but it's still using the wrong index :-(.  If we force
use of the other one:

regression=# drop index idx_resized;
DROP INDEX
regression=# EXPLAIN ANALYZE
regression-# SELECT count(*) FROM t
regression-# WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;
                                                          QUERY PLAN
       

-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6795.38..6795.39 rows=1 width=8) (actual time=0.189..0.191 rows=1 loops=1)
   ->  Bitmap Heap Scan on t  (cost=13.40..6792.88 rows=1000 width=0) (actual time=0.047..0.147 rows=1000 loops=1)
         Recheck Cond: (cropped AND (NOT resized))
         Filter: (create_date < CURRENT_DATE)
         Heap Blocks: exact=6
         ->  Bitmap Index Scan on specific  (cost=0.00..13.15 rows=91565 width=0) (actual time=0.035..0.035 rows=1000
loops=1)
                                                              ^^^^^^^^^^
 Planning Time: 0.154 ms
 Execution Time: 0.241 ms

it looks like the problem is that the extended stats haven't been used
while forming the estimate of the number of index entries retrieved,
so we overestimate the cost of using this index.

That seems like a bug.  Tomas?

In the meantime, maybe you could dodge the problem by combining
"cropped" and "resized" into one multivalued column, so that there's
not a need to depend on extended stats to arrive at a decent estimate.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: TRAP: FailedAssertion("tabstat->trans == trans", File: "pgstat_relation.c", Line: 508
Next
From: Mahesh Gouru
Date:
Subject: DBT-5 Stored Procedure Development (2022)