
From Tom Lane
Subject Bogus ANALYZE results for an otherwise-unique column with many nulls
Date
Msg-id 16143.1470350371@sss.pgh.pa.us
List pgsql-hackers
I looked into the problem described at
https://www.postgresql.org/message-id/flat/VisenaEmail.26.df42f82acae38a58.156463942b8%40tc7-visena
and I believe I've reproduced it: the requirement is that the inner join
column for the antijoin must contain a lot of NULL values, and what isn't
NULL must be unique or nearly so.  If ANALYZE doesn't come across any
duplicated values, it will set the column's stadistinct value to "-1",
which causes the planner to believe that each row of the inner table
produces a unique value, resulting in a bogus answer from
get_variable_numdistinct() and thence a bogus join size estimate.
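
(For reference: a negative stadistinct is read as a fraction of the table's row
count, so get_variable_numdistinct() essentially computes
ndistinct = -stadistinct * ntuples.  With stadistinct = -1 that comes out as
the full row count, nulls included, which is wildly too high for a column that
is mostly null.)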

Here's an example in the regression database, making use of the existing
unique column tenk1.unique1:

regression=# create table manynulls as select case when random() < 0.1 then unique1 else null end as unique1 from tenk1;
SELECT 10000
regression=# analyze manynulls;
ANALYZE
regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..756.50 rows=1 width=244) (actual time=4.632..14.729 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.015..2.683 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.553..4.553 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.019..2.668 rows=10000 loops=1)
 Planning time: 0.808 ms
 Execution time: 15.670 ms
(8 rows)

So the antijoin size estimate is way off, but it's hardly the planner's
fault because the stats are insane:

regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |         -1
(1 row)
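
As a quick sanity check, counting the column directly shows how far off that is:

regression=# select count(*), count(unique1), count(distinct unique1) from manynulls;

With this data that should report 10000 rows, of which only about 1027 are
non-null (the Hash node's actual rowcount above says as much), all of them
distinct; so the true n_distinct is on the order of 1000, not the 10000 that
stadistinct = -1 implies to the planner.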

With the patch attached below, ANALYZE produces

regression=# analyze manynulls;
ANALYZE
regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |    -0.1027
(1 row)

and now the join size estimate is dead on:

regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..847.69 rows=8973 width=244) (actual time=4.501..13.888 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.031..4.959 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.404..4.404 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.034..2.576 rows=10000 loops=1)
 Planning time: 1.388 ms
 Execution time: 14.542 ms
(8 rows)
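
(Checking the arithmetic: -(-0.1027) * 10000 is about 1027 distinct values,
matching the 1027 non-null rows the Hash node actually loads, and the antijoin
estimate of 8973 rows now agrees with the actual result.)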

What I did in the patch is to scale the formerly fixed "-1.0" stadistinct
estimate to discount the fraction of nulls we found.  An alternative
possibility might be to decree that a fractional stadistinct considers
only non-nulls, but then all the other paths through ANALYZE would be
wrong.  The spec for it in pg_statistic.h doesn't suggest any such
interpretation, either.
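
Spelled out, the convention the patch keeps is stadistinct = -(1 - stanullfrac)
for this case, so the planner's existing -stadistinct * ntuples arithmetic
yields the count of distinct non-null values without touching any consumers of
the statistic; the rejected alternative would have left stadistinct at -1 and
required every reader of the statistic to fold in the null fraction itself.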

Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

This is a bit reminiscent of the nulls-accounting problem we fixed in
commit be4b4dc75, though that tended to result in underestimates not
overestimates of the number of distinct values.  We didn't back-patch
that fix, so probably we shouldn't back-patch this either.  On the other
hand, it is far more open-and-shut that this is wrong.  Thoughts?

            regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*************** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 ****

          if (nmultiple == 0)
          {
!             /* If we found no repeated values, assume it's a unique column */
!             stats->stadistinct = -1.0;
          }
          else if (track_cnt < track_max && toowide_cnt == 0 &&
                   nmultiple == track_cnt)
--- 2049,2059 ----

          if (nmultiple == 0)
          {
!             /*
!              * If we found no repeated non-null values, assume it's a unique
!              * column; but be sure to discount for any nulls we found.
!              */
!             stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
          }
          else if (track_cnt < track_max && toowide_cnt == 0 &&
                   nmultiple == track_cnt)
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2426,2433 ****

          if (nmultiple == 0)
          {
!             /* If we found no repeated values, assume it's a unique column */
!             stats->stadistinct = -1.0;
          }
          else if (toowide_cnt == 0 && nmultiple == ndistinct)
          {
--- 2429,2439 ----

          if (nmultiple == 0)
          {
!             /*
!              * If we found no repeated non-null values, assume it's a unique
!              * column; but be sure to discount for any nulls we found.
!              */
!             stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
          }
          else if (toowide_cnt == 0 && nmultiple == ndistinct)
          {
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2753,2759 ****
          else
              stats->stawidth = stats->attrtype->typlen;
          /* Assume all too-wide values are distinct, so it's a unique column */
!         stats->stadistinct = -1.0;
      }
      else if (null_cnt > 0)
      {
--- 2759,2765 ----
          else
              stats->stawidth = stats->attrtype->typlen;
          /* Assume all too-wide values are distinct, so it's a unique column */
!         stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
      }
      else if (null_cnt > 0)
      {
