Bogus ANALYZE results for an otherwise-unique column with many nulls - Mailing list pgsql-hackers
From:    Tom Lane
Subject: Bogus ANALYZE results for an otherwise-unique column with many nulls
Msg-id:  16143.1470350371@sss.pgh.pa.us
List:    pgsql-hackers
I looked into the problem described at
https://www.postgresql.org/message-id/flat/VisenaEmail.26.df42f82acae38a58.156463942b8%40tc7-visena
and I believe I've reproduced it: the requirement is that the inner join
column for the antijoin must contain a lot of NULL values, and what isn't
NULL must be unique or nearly so.  If ANALYZE doesn't come across any
duplicated values, it will set the column's stadistinct value to "-1",
which causes the planner to believe that each row of the inner table
produces a unique value, resulting in a bogus answer from
get_variable_numdistinct() and thence a bogus join size estimate.

Here's an example in the regression database, making use of the existing
unique column tenk1.unique1:

regression=# create table manynulls as select case when random() < 0.1 then unique1 else null end as unique1 from tenk1;
SELECT 10000
regression=# analyze manynulls;
ANALYZE
regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..756.50 rows=1 width=244) (actual time=4.632..14.729 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.015..2.683 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.553..4.553 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.019..2.668 rows=10000 loops=1)
 Planning time: 0.808 ms
 Execution time: 15.670 ms
(8 rows)

So the antijoin size estimate is way off, but it's hardly the planner's
fault, because the stats are insane:

regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |         -1
(1 row)

With the patch attached below, ANALYZE produces

regression=# analyze manynulls;
ANALYZE
regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |    -0.1027
(1 row)

and now the join size estimate is dead on:

regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..847.69 rows=8973 width=244) (actual time=4.501..13.888 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.031..4.959 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.404..4.404 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.034..2.576 rows=10000 loops=1)
 Planning time: 1.388 ms
 Execution time: 14.542 ms
(8 rows)

What I did in the patch is to scale the formerly fixed "-1.0" stadistinct
estimate to discount the fraction of nulls we found.  An alternative
possibility might be to decree that a fractional stadistinct considers
only non-nulls, but then all the other paths through ANALYZE would be
wrong.  The spec for it in pg_statistic.h doesn't suggest any such
interpretation, either.

Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

This is a bit reminiscent of the nulls-accounting problem we fixed in
commit be4b4dc75, though that tended to result in underestimates, not
overestimates, of the number of distinct values.  We didn't back-patch
that fix, so probably we shouldn't back-patch this either.
On the other hand, it is far more open-and-shut that this is wrong.

Thoughts?

			regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*************** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 ****
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
  				 nmultiple == track_cnt)
--- 2049,2059 ----
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
  				 nmultiple == track_cnt)
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2426,2433 ****
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
--- 2429,2439 ----
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2753,2759 ****
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0;
  	}
  	else if (null_cnt > 0)
  	{
--- 2759,2765 ----
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  	}
  	else if (null_cnt > 0)
  	{