Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
> On 12/8/21 04:26, Tomas Vondra wrote:
>> I wonder if we should teach clauselist_selectivity about UNIQUE indexes,
>> and improve the cardinality estimates directly, not just costing for
>> index scans.
> I tried to implement this in different ways. But it causes additional
> overhead and code complexity - analyzing a list of indexes and match
> clauses of each index with input clauses in each selectivity estimation.
> I don't like that way and propose a new patch in attachment.
I looked at this briefly. I do not think that messing with
btcostestimate/genericcostestimate is the right response at all.
The problem can be demonstrated with no index whatever, as in the
attached shortened version of the original example. I get
QUERY PLAN
---------------------------------------------------
Seq Scan on a (cost=0.00..46.02 rows=1 width=12)
Filter: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)
before adding the extended stats, and
QUERY PLAN
----------------------------------------------------
Seq Scan on a (cost=0.00..46.02 rows=28 width=12)
Filter: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)
afterwards. So the extended stats have made the rowcount
estimate significantly worse, which seems like an indicator of a
bug somewhere in extended stats. The more so because I can crank
default_statistics_target all the way to 10000 without these
estimates changing. If we can't get a dead-on estimate for a
2001-row table at that stats level, we're doing something wrong,
surely?
Also, I found that if I ask only for ndistinct stats,
I still get rows=1. The fishiness seems to be directly
a problem with dependencies stats.
regards, tom lane
DROP TABLE IF EXISTS a CASCADE;
DROP STATISTICS IF EXISTS aestat;
CREATE TABLE a AS (
SELECT
gs % 10 AS x,
(gs % 10 + (gs/10::int4) % 10) % 10 AS y,
(gs / 100)::int4 AS z
FROM generate_series(1,1000) AS gs
);
INSERT INTO a (SELECT gs,gs,gs FROM generate_series(1000,2000) AS gs);
-- ALTER TABLE a ADD PRIMARY KEY (x,y,z);
-- CREATE INDEX ON a(x);
ANALYZE a;
EXPLAIN SELECT * FROM a WHERE x=1 AND y=1 AND z=1;
CREATE STATISTICS aestat(dependencies,ndistinct) ON x,y,z FROM a;
ANALYZE a;
EXPLAIN SELECT * FROM a WHERE x=1 AND y=1 AND z=1;