
From: Tom Lane
Subject: Re: Postgres picks suboptimal index after building of an extended statistics
Date:
Msg-id: 3119052.1657231656@sss.pgh.pa.us
In response to: Re: Postgres picks suboptimal index after building of an extended statistics (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Responses: Re: Postgres picks suboptimal index after building of an extended statistics
List: pgsql-hackers

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
> On 12/8/21 04:26, Tomas Vondra wrote:
>> I wonder if we should teach clauselist_selectivity about UNIQUE indexes, 
>> and improve the cardinality estimates directly, not just costing for 
>> index scans.

> I tried to implement this in different ways, but it causes additional
> overhead and code complexity: analyzing the list of indexes and matching
> each index's clauses against the input clauses on every selectivity
> estimation.  I don't like that approach and propose a new patch in the
> attachment.

I looked at this briefly.  I do not think that messing with
btcostestimate/genericcostestimate is the right response at all.
The problem can be demonstrated with no index whatever, as in the
attached shortened version of the original example.  I get

                    QUERY PLAN                     
---------------------------------------------------
 Seq Scan on a  (cost=0.00..46.02 rows=1 width=12)
   Filter: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)

before adding the extended stats, and

                     QUERY PLAN                     
----------------------------------------------------
 Seq Scan on a  (cost=0.00..46.02 rows=28 width=12)
   Filter: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)

afterwards.  So the extended stats have made the rowcount
estimate significantly worse, which seems like an indicator of a
bug somewhere in extended stats.  The more so because I can crank
default_statistics_target all the way to 10000 without these
estimates changing.  If we can't get a dead-on estimate for a
2001-row table at that stats level, we're doing something wrong,
surely?
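
As a cross-check on where that rows=1 comes from: with no extended stats
the planner essentially multiplies the three per-column selectivities,
each about 100/2001 here, and 2001 * (100/2001)^3 is about 0.25, which is
clamped to one row.  Those fractions can be verified directly (this query
is only for illustration):

SELECT count(*) FILTER (WHERE x = 1) AS nx,
       count(*) FILTER (WHERE y = 1) AS ny,
       count(*) FILTER (WHERE z = 1) AS nz,
       count(*) AS ntotal,
       round(count(*)
             * (count(*) FILTER (WHERE x = 1))::numeric / count(*)
             * (count(*) FILTER (WHERE y = 1))::numeric / count(*)
             * (count(*) FILTER (WHERE z = 1))::numeric / count(*), 2)
           AS rows_if_independent
FROM a;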

Also, I found that if I ask only for ndistinct stats,
I still get rows=1.  The fishiness seems to be directly
a problem with dependencies stats.
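
To put the finger on dependencies alone, the two kinds can be built one
at a time (statistics names below are arbitrary; I'd expect the
dependencies-only object to reproduce the inflated estimate):

DROP STATISTICS IF EXISTS aestat;
CREATE STATISTICS aestat_nd (ndistinct) ON x, y, z FROM a;
ANALYZE a;
EXPLAIN SELECT * FROM a WHERE x = 1 AND y = 1 AND z = 1;  -- still rows=1
DROP STATISTICS aestat_nd;
CREATE STATISTICS aestat_dep (dependencies) ON x, y, z FROM a;
ANALYZE a;
EXPLAIN SELECT * FROM a WHERE x = 1 AND y = 1 AND z = 1;  -- presumably shows the inflated estimate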

            regards, tom lane

DROP TABLE IF EXISTS a CASCADE;
DROP STATISTICS IF EXISTS aestat;

CREATE TABLE a AS (
    SELECT
        gs % 10 AS x,
        (gs % 10 + (gs/10::int4) % 10) % 10 AS y,
        (gs / 100)::int4 AS z
    FROM generate_series(1,1000) AS gs
);
INSERT INTO a (SELECT gs,gs,gs FROM generate_series(1000,2000) AS gs);

-- ALTER TABLE a ADD PRIMARY KEY (x,y,z);
-- CREATE INDEX ON a(x);
ANALYZE a;

EXPLAIN SELECT * FROM a WHERE x=1 AND y=1 AND z=1;

CREATE STATISTICS aestat(dependencies,ndistinct) ON x,y,z FROM a;
ANALYZE a;

EXPLAIN SELECT * FROM a WHERE x=1 AND y=1 AND z=1;
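
-- For reference (needs v12+, where pg_stats_ext exists): the dependency
-- degrees and ndistinct coefficients that ANALYZE stored can be inspected
-- directly.
SELECT statistics_name, n_distinct, dependencies
FROM pg_stats_ext
WHERE statistics_name = 'aestat';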
