Re: ERROR: corrupt MVNDistinct entry - Mailing list pgsql-hackers

From Richard Guo
Subject Re: ERROR: corrupt MVNDistinct entry
Date
Msg-id CAMbWs4_euoGhR+ZEYAFLduVw1U+Ww3LY7LRpCZPqdxKg1AYy3Q@mail.gmail.com
Whole thread Raw
In response to Re: ERROR: corrupt MVNDistinct entry  (Richard Guo <guofenglinux@gmail.com>)
List pgsql-hackers
On Wed, Dec 25, 2024 at 6:36 PM Richard Guo <guofenglinux@gmail.com> wrote:
> In v16 and later, the nullingrels within the expression "t2.a + t2.b"
> prevent it from being matched to the corresponding expression in
> extended statistics, forcing us to use DEFAULT_UNK_SEL(0.005).

Furthermore, even without extended statistics or expressional index
columns, the nullingrels can still cause issues with selectivity
estimates.  They may cause examine_variable to fail in identifying a
valid RelOptInfo for an expression, making the relation size not
available, which is required in many cases.  For instance,

create table t (a int, b int);
insert into t select i, i from generate_series(1,10)i;
analyze t;

-- in v16 and later
explain (costs on)
select * from t t1
    left join t t2 on true
    left join t t3 on t1.a = t3.a and t3.a < 8
where t1.a = coalesce(t2.a);
                              QUERY PLAN
-----------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..4.94 rows=1 width=24)
   Join Filter: (t1.a = t3.a)
   ->  Nested Loop Left Join  (cost=0.00..3.73 rows=1 width=16)
         Filter: (t1.a = COALESCE(t2.a))
         ->  Seq Scan on t t1  (cost=0.00..1.10 rows=10 width=8)
         ->  Materialize  (cost=0.00..1.15 rows=10 width=8)
               ->  Seq Scan on t t2  (cost=0.00..1.10 rows=10 width=8)
   ->  Seq Scan on t t3  (cost=0.00..1.12 rows=7 width=8)
         Filter: (a < 8)
(9 rows)

-- in v15 and before
explain (costs on)
select * from t t1
    left join t t2 on true
    left join t t3 on t1.a = t3.a and t3.a < 8
where t1.a = coalesce(t2.a);
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Left Join  (cost=1.21..5.04 rows=10 width=24)
   Hash Cond: (t1.a = t3.a)
   ->  Nested Loop Left Join  (cost=0.00..3.73 rows=10 width=16)
         Filter: (t1.a = COALESCE(t2.a))
         ->  Seq Scan on t t1  (cost=0.00..1.10 rows=10 width=8)
         ->  Materialize  (cost=0.00..1.15 rows=10 width=8)
               ->  Seq Scan on t t2  (cost=0.00..1.10 rows=10 width=8)
   ->  Hash  (cost=1.12..1.12 rows=7 width=8)
         ->  Seq Scan on t t3  (cost=0.00..1.12 rows=7 width=8)
               Filter: (a < 8)
(10 rows)

In v16 and later, when calculating the join selectivity for "t1.a =
coalesce(t2.a)", eqjoinsel sets nd2 — the number of distinct values of
coalesce(t2.a) — to DEFAULT_NUM_DISTINCT (200) because the
corresponding RelOptInfo is not identifiable.  This results in very
inaccurate join selectivity.

I'm wondering if we also need to strip out the nullingrels from the
expression in examine_variable().  I tried doing so and noticed a plan
diff in regression test join.sql.

@@ -2573,10 +2573,11 @@
                ->  Materialize
                      ->  Seq Scan on int4_tbl t2
                            Filter: (f1 > 1)
-         ->  Seq Scan on int4_tbl t3
+         ->  Materialize
+               ->  Seq Scan on int4_tbl t3
    ->  Materialize
          ->  Seq Scan on int4_tbl t4
-(13 rows)
+(14 rows)

Thanks
Richard



pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Re: Logical replication timeout
Next
From: Yurii Rashkovskii
Date:
Subject: Re: Add Postgres module info