Thread: Query result differences between PostgreSQL 17 vs 16

Query result differences between PostgreSQL 17 vs 16

From

Ronald Cruz

Date:

21 February, 19:13:17

Hi,

We have observed an issue after upgrading to PostgreSQL 17 that caused us to roll back to 16. Some of our queries are returning what I believe to be erroneous results.

I've attached two files here that can be used to reproduce what I'm seeing:

schema_plus_data.sql - This contains a pg_dump of a reproducible test case with a contrived schema and dataset of our use case.

reproducer.sql - This isn't necessary, but perhaps you find it helpful. This is a SQL script used to generate the schema and random data that hits this edge case. It was used to generate the previous file and it has reliably hit the issue so far from the handful of times I've tried.

The query I'm observing issues for is the following:

SELECT * FROM rc1
    LEFT JOIN rc2 ON rc2.rc1_reference = rc1.description
    LEFT JOIN rc3 ON rc2.id = rc3.rc2_reference
    LEFT JOIN LATERAL rc_select(rc3.id) ON rc3.id IS NOT NULL;

Under PostgreSQL 17, I'm seeing ~400k results returned, whereas in PostgreSQL 16, I see ~6k. The results I believe to be erroneous are those that have 'BUG HIT' in the output for PostgreSQL 17. These are results joined from rc_select where rc3.id is null. I'm not expecting to see any of these rows as is the case in PostgreSQL 16 output (and 15 as well from prior experience).

I've observed this behavior in the latest PostgreSQL 17.4 but have also encountered this in 17.2 and 17.3. The OS being used is RHEL 9.5 (plow). Please let me know if you need any more information.

Thank you in advance for any help,
Ron

Attachment

Re: Query result differences between PostgreSQL 17 vs 16

From

Bruce Momjian

Date:

22 February, 02:05:10

On Fri, Feb 21, 2025 at 11:13:17AM -0500, Ronald Cruz wrote:
> Hi,
> 
> We have observed an issue after upgrading to PostgreSQL 17 that caused us to
> roll back to 16. Some of our queries are returning what I believe to be
> erroneous results.
> 
> I've attached two files here that can be used to reproduce what I'm seeing:
> 
> schema_plus_data.sql - This contains a pg_dump of a reproducible test case with
> a contrived schema and dataset of our use case.
> 
> reproducer.sql - This isn't necessary, but perhaps you find it helpful. This is
> a SQL script used to generate the schema and random data that hits this edge
> case. It was used to generate the previous file and it has reliably hit the
> issue so far from the handful of times I've tried.
> 
> The query I'm observing issues for is the following:
> 
> SELECT * FROM rc1
>     LEFT JOIN rc2 ON rc2.rc1_reference = rc1.description
>     LEFT JOIN rc3 ON rc2.id = rc3.rc2_reference
>     LEFT JOIN LATERAL rc_select(rc3.id) ON rc3.id IS NOT NULL;
> 
> Under PostgreSQL 17, I'm seeing ~400k results returned, whereas in PostgreSQL
> 16, I see ~6k. The results I believe to be erroneous are those that have 'BUG
> HIT' in the output for PostgreSQL 17. These are results joined from rc_select
> where rc3.id is null. I'm not expecting to see any of these rows as is the case
> in PostgreSQL 16 output (and 15 as well from prior experience).
> 
> I've observed this behavior in the latest PostgreSQL 17.4 but have also
> encountered this in 17.2 and 17.3. The OS being used is RHEL 9.5 (plow). Please
> let me know if you need any more information.

We have a known problem with composite types and NULL constraints in PG
17 that I think we are fixing in PG 18.  I saw IS NOT NULL in your
query so I thought I would mention it:

    https://www.postgresql.org/message-id/Z37p0paENWWUarj-%40momjian.us

We do have several NULL optimizations in PG 17.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.

Re: Query result differences between PostgreSQL 17 vs 16

From

Tom Lane

Date:

22 February, 02:50:58

Bruce Momjian <bruce@momjian.us> writes:
> On Fri, Feb 21, 2025 at 11:13:17AM -0500, Ronald Cruz wrote:
>> We have observed an issue after upgrading to PostgreSQL 17 that caused us to
>> roll back to 16. Some of our queries are returning what I believe to be
>> erroneous results.

> We have a known problem with composite types and NULL constraints in PG
> 17 that I think we are fixing in PG 18.

There's no composite type at hand here.  I think the problem is an
erroneous deduction from a column NOT NULL constraint.  I can
reproduce a faulty plan in the regression database with

explain (costs off)
select * from tenk1
  left join int4_tbl i on (unique1 = f1)
  left join customer on (i.f1 = cid)
  left join int4_tbl j on cid is not null;

v16 produces

 Hash Left Join
   Hash Cond: (tenk1.unique1 = i.f1)
   ->  Seq Scan on tenk1
   ->  Hash
         ->  Nested Loop Left Join
               Join Filter: (customer.cid IS NOT NULL)
               ->  Hash Right Join
                     Hash Cond: (customer.cid = i.f1)
                     ->  Seq Scan on customer
                     ->  Hash
                           ->  Seq Scan on int4_tbl i
               ->  Materialize
                     ->  Seq Scan on int4_tbl j

but HEAD produces

 Hash Left Join
   Hash Cond: (tenk1.unique1 = i.f1)
   ->  Seq Scan on tenk1
   ->  Hash
         ->  Nested Loop Left Join
               ->  Hash Right Join
                     Hash Cond: (customer.cid = i.f1)
                     ->  Seq Scan on customer
                     ->  Hash
                           ->  Seq Scan on int4_tbl i
               ->  Materialize
                     ->  Seq Scan on int4_tbl j

Note the lack of any IS NOT NULL test.  I think the planner has
convinced itself that the not null constraint on customer.cid
makes that test redundant, despite the fact that what it is
testing is a post-outer-join value that most certainly could
be null.

"git bisect" fingers this commit:

b262ad440edecda0b1aba81d967ab560a83acb8a is the first bad commit
commit b262ad440edecda0b1aba81d967ab560a83acb8a
Author: David Rowley <drowley@postgresql.org>
Date:   Tue Jan 23 18:09:18 2024 +1300

    Add better handling of redundant IS [NOT] NULL quals


I've not looked at the code, but I suspect that it is failing
to check varnullingrels before believing that it can trust
the applicability of table constraints.

            regards, tom lane

Re: Query result differences between PostgreSQL 17 vs 16

From

Richard Guo

Date:

22 February, 04:01:13

On Sat, Feb 22, 2025 at 8:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "git bisect" fingers this commit:
>
> b262ad440edecda0b1aba81d967ab560a83acb8a is the first bad commit
> commit b262ad440edecda0b1aba81d967ab560a83acb8a
> Author: David Rowley <drowley@postgresql.org>
> Date:   Tue Jan 23 18:09:18 2024 +1300
>
>     Add better handling of redundant IS [NOT] NULL quals
>
>
> I've not looked at the code, but I suspect that it is failing
> to check varnullingrels before believing that it can trust
> the applicability of table constraints.

Hmm, we do check varnullingrels in expr_is_nonnullable().  My best
guess is that we have generated two versions of the qual 'customer.cid
IS NOT NULL': one with customer.cid marked as nullable by the left
join to customer, and one without.  The latter is dropped because of
the not null constraint on customer.cid, while the former fails to be
applied on the left join to int4_tbl j.

Thanks
Richard

Re: Query result differences between PostgreSQL 17 vs 16

From

Tom Lane

Date:

22 February, 05:32:39

Richard Guo <guofenglinux@gmail.com> writes:
> On Sat, Feb 22, 2025 at 8:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I've not looked at the code, but I suspect that it is failing
>> to check varnullingrels before believing that it can trust
>> the applicability of table constraints.

> Hmm, we do check varnullingrels in expr_is_nonnullable().  My best
> guess is that we have generated two versions of the qual 'customer.cid
> IS NOT NULL': one with customer.cid marked as nullable by the left
> join to customer, and one without.  The latter is dropped because of
> the not null constraint on customer.cid, while the former fails to be
> applied on the left join to int4_tbl j.

If the check against table not-null constraints is applied after we
clone outer-join quals, that's probably bad.  I think there are
assumptions in there that every clone qual will have doppelgangers,
so filtering NOT NULLs later would break that.  Maybe not applying
the filter to quals marked has_clone or is_clone would help?

            regards, tom lane

Re: Query result differences between PostgreSQL 17 vs 16

From

Richard Guo

Date:

22 February, 06:08:35

On Sat, Feb 22, 2025 at 11:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > On Sat, Feb 22, 2025 at 8:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I've not looked at the code, but I suspect that it is failing
> >> to check varnullingrels before believing that it can trust
> >> the applicability of table constraints.
>
> > Hmm, we do check varnullingrels in expr_is_nonnullable().  My best
> > guess is that we have generated two versions of the qual 'customer.cid
> > IS NOT NULL': one with customer.cid marked as nullable by the left
> > join to customer, and one without.  The latter is dropped because of
> > the not null constraint on customer.cid, while the former fails to be
> > applied on the left join to int4_tbl j.
>
> If the check against table not-null constraints is applied after we
> clone outer-join quals, that's probably bad.  I think there are
> assumptions in there that every clone qual will have doppelgangers,
> so filtering NOT NULLs later would break that.  Maybe not applying
> the filter to quals marked has_clone or is_clone would help?

Yeah, I think this approach can fix the issue.  Perhaps we should
check whether the RestrictInfo is a clone clause and avoid assuming
that it's always true in that case, maybe by adding something like
below at the start of restriction_is_always_true.

+   if (restrictinfo->has_clone || restrictinfo->is_clone)
+       return false;

We may lose some optimization opportunities with NOT NULL quals, as
this could prevent us from reducing certain such quals to constant
true, but I think correctness should be our top priority.

I'll go ahead and write a patch.

Thanks
Richard

Re: Query result differences between PostgreSQL 17 vs 16

From

Richard Guo

Date:

26 February, 09:14:25

On Sat, Feb 22, 2025 at 12:08 PM Richard Guo <guofenglinux@gmail.com> wrote:
> On Sat, Feb 22, 2025 at 11:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > If the check against table not-null constraints is applied after we
> > clone outer-join quals, that's probably bad.  I think there are
> > assumptions in there that every clone qual will have doppelgangers,
> > so filtering NOT NULLs later would break that.  Maybe not applying
> > the filter to quals marked has_clone or is_clone would help?

> Yeah, I think this approach can fix the issue.  Perhaps we should
> check whether the RestrictInfo is a clone clause and avoid assuming
> that it's always true in that case, maybe by adding something like
> below at the start of restriction_is_always_true.
>
> +   if (restrictinfo->has_clone || restrictinfo->is_clone)
> +       return false;
>
> We may lose some optimization opportunities with NOT NULL quals, as
> this could prevent us from reducing certain such quals to constant
> true, but I think correctness should be our top priority.
>
> I'll go ahead and write a patch.

Here is the patch.

Thanks
Richard

Attachment

v1-0001-Do-not-reduce-a-clone-clause-to-a-constant-true.patch

Re: Query result differences between PostgreSQL 17 vs 16

From

Tom Lane

Date:

26 February, 09:26:09

Richard Guo <guofenglinux@gmail.com> writes:
> Here is the patch.

Thanks for that.  The code and comment added to
restriction_is_always_true look good, but I can't help wondering
whether we don't need the same in restriction_is_always_false.
Not very sure what a query proving the need for that would
look like, but leaving it asymmetric feels wrong.

            regards, tom lane

Re: Query result differences between PostgreSQL 17 vs 16

From

Richard Guo

Date:

26 February, 10:03:30

On Wed, Feb 26, 2025 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > Here is the patch.

> Thanks for that.  The code and comment added to
> restriction_is_always_true look good, but I can't help wondering
> whether we don't need the same in restriction_is_always_false.
> Not very sure what a query proving the need for that would
> look like, but leaving it asymmetric feels wrong.

Yeah, I think you are right.  The thing here is that we don't have a
reliable way to determine if the input expression of a NullTest is
non-nullable if it's a clone clause.  This applies to both
restriction_is_always_true and restriction_is_always_false.  So I
think we should add the same check in restriction_is_always_false too.

I'll give it a try to find a query that shows this is necessary.

Thanks
Richard

Re: Query result differences between PostgreSQL 17 vs 16

From

Richard Guo

Date:

26 February, 11:46:27

On Wed, Feb 26, 2025 at 4:03 PM Richard Guo <guofenglinux@gmail.com> wrote:
> On Wed, Feb 26, 2025 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thanks for that.  The code and comment added to
> > restriction_is_always_true look good, but I can't help wondering
> > whether we don't need the same in restriction_is_always_false.
> > Not very sure what a query proving the need for that would
> > look like, but leaving it asymmetric feels wrong.

> Yeah, I think you are right.  The thing here is that we don't have a
> reliable way to determine if the input expression of a NullTest is
> non-nullable if it's a clone clause.  This applies to both
> restriction_is_always_true and restriction_is_always_false.  So I
> think we should add the same check in restriction_is_always_false too.
>
> I'll give it a try to find a query that shows this is necessary.

After some attempts, I failed to find such a query.  An IS NULL qual
is not strict, and it makes the outer join not commutable with other
joins per outer-join identity 3.  We could add another strict qual to
the join clauses to make the outer join commutable.  However, in this
case whether we reduce the IS NULL qual to a constant false would not
affect the join outputs, due to the strict qual.

I'm starting to believe that reducing an IS NULL clone clause to a
constant false should be safe.  Any thoughts?

Thanks
Richard

Re: Query result differences between PostgreSQL 17 vs 16

From

Tom Lane

Date:

27 February, 07:28:14

Richard Guo <guofenglinux@gmail.com> writes:
>> On Wed, Feb 26, 2025 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Thanks for that.  The code and comment added to
>>> restriction_is_always_true look good, but I can't help wondering
>>> whether we don't need the same in restriction_is_always_false.

> After some attempts, I failed to find such a query.  An IS NULL qual
> is not strict, and it makes the outer join not commutable with other
> joins per outer-join identity 3.  We could add another strict qual to
> the join clauses to make the outer join commutable.  However, in this
> case whether we reduce the IS NULL qual to a constant false would not
> affect the join outputs, due to the strict qual.

> I'm starting to believe that reducing an IS NULL clone clause to a
> constant false should be safe.  Any thoughts?

Not sure.  I think the risk factor here is fooling the clone-clause
mechanisms by failing to install some clone of an original outer
join clause in a rel's restriction list.  Looking at the callers
of restriction_is_always_false:

add_base_clause_to_rel
add_join_clause_to_rels

    These two replace the clause with a Const "false" that is
    marked with the same rinfo_serial.  Therefore the expectation
    that every group of clones will have a representative is
    still satisfied, even though the representative doesn't
    contain quite what you might expect.

apply_child_basequals

    This just summarily rejects the child rel as not needing
    to be built.  I think that might be okay, since we don't
    really make any further deductions with an appendrel
    child's quals.

I've not actually worked through the bug case to verify my hypothesis
that it's the lack of a representative for a clone group that is
breaking things.  (Maybe you did?)  But if that is the direct
cause of the failure, then it's plausible that none of these
callers are subject to it.

I'm a bit inclined to argue that we should put in the no-clone-quals
limitation anyway, to protect future callers that might not manage
to dodge the problem.  OTOH, if we get rid of these functions
altogether as I expressed my desire to do nearby, it's all moot.

            regards, tom lane

Re: Query result differences between PostgreSQL 17 vs 16

From

Richard Guo

Date:

28 February, 11:47:09

On Thu, Feb 27, 2025 at 1:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've not actually worked through the bug case to verify my hypothesis
> that it's the lack of a representative for a clone group that is
> breaking things.  (Maybe you did?)  But if that is the direct
> cause of the failure, then it's plausible that none of these
> callers are subject to it.

Yeah, I think this is the cause of the failure.  Specifically, when we
start with (A/B)/C and transform to A/(B/C) per OJ identity 3, we need
a way to prevent upper clauses that use only C Vars from being pushed
down and applied as a filter clause at the lower B/C join.  To achieve
this, we consider that the pushed-down B/C join has not completed, and
hence do not include B/C join's ojrelid in its relid set (it will be
added when we form the A/B join).

This is why, in this case, the version of 'customer.cid IS NOT NULL'
that is not marked as nullable by the left join to customer is chosen:
we've assumed that that left join has not completed yet.  However,
performing the NOT NULL deduction on this clone would be unsafe.  It
seems to me that, for clone clauses, the nullingrels mark is not that
reliable for performing such deductions.

> I'm a bit inclined to argue that we should put in the no-clone-quals
> limitation anyway, to protect future callers that might not manage
> to dodge the problem.

Now I'm inclined to agree with you.  It seems we've been lucky so far
not to encounter issues with IS NULL deductions on clone clauses, but
this feels like a problem waiting to happen in the future.

The no-clone-quals limitation in restriction_is_always_false makes the
fix in f00ab1fd1 unnecessary.  I think we should revert those changes.

Attached is the updated patch.

> OTOH, if we get rid of these functions
> altogether as I expressed my desire to do nearby, it's all moot.

Yeah, if we can find a way to perform the NullTest deduction during
eval_const_expressions, it will be much cleaner.

Thanks
Richard

Attachment

v2-0001-Avoid-NullTest-deduction-for-clone-clauses.patch

Re: Query result differences between PostgreSQL 17 vs 16

From

Tom Lane

Date:

28 February, 18:48:18

Richard Guo <guofenglinux@gmail.com> writes:
> On Thu, Feb 27, 2025 at 1:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I've not actually worked through the bug case to verify my hypothesis
>> that it's the lack of a representative for a clone group that is
>> breaking things.  (Maybe you did?)

> Yeah, I think this is the cause of the failure.  Specifically, when we
> start with (A/B)/C and transform to A/(B/C) per OJ identity 3, we need
> a way to prevent upper clauses that use only C Vars from being pushed
> down and applied as a filter clause at the lower B/C join.  To achieve
> this, we consider that the pushed-down B/C join has not completed, and
> hence do not include B/C join's ojrelid in its relid set (it will be
> added when we form the A/B join).

> This is why, in this case, the version of 'customer.cid IS NOT NULL'
> that is not marked as nullable by the left join to customer is chosen:
> we've assumed that that left join has not completed yet.

Oh!  So I found the right solution through the wrong chain of
reasoning ;=).  It's not that the clone group is missed during qual
selection: it's that the selected representative lacks a nullingrel
bit, so when add_join_clause_to_rel acts on it later, it thinks it's
okay to throw away.

> However,
> performing the NOT NULL deduction on this clone would be unsafe.  It
> seems to me that, for clone clauses, the nullingrels mark is not that
> reliable for performing such deductions.

Precisely.  We have to remove the nullingrel bit from the Var, else
the placement cross-checks in setrefs.c will (correctly) complain that
it's being evaluated at the wrong level.  But relying on it in this
particular way at this particular point in the proceedings gives the
wrong conclusion.

>> I'm a bit inclined to argue that we should put in the no-clone-quals
>> limitation anyway, to protect future callers that might not manage
>> to dodge the problem.

> Now I'm inclined to agree with you.  It seems we've been lucky so far
> not to encounter issues with IS NULL deductions on clone clauses, but
> this feels like a problem waiting to happen in the future.

Yes, I'm on board with that too.  I think the comments need more work
than you've done though.  I think they should say something about
"nullingrel bits in clone clauses may not reflect reality, so we dare
not draw conclusions from clones about whether Vars are guaranteed
not-null".  Also, I'm not convinced any more that this has anything to
do with missing clone members, so we should remove that claim.

> The no-clone-quals limitation in restriction_is_always_false makes the
> fix in f00ab1fd1 unnecessary.  I think we should revert those changes.

Hmm, I'm less on board with that.  The saving-and-restoring of
last_rinfo_serial looks correct to me, even though this fix might
cause the test case you found before to not tickle that bug.

> Attached is the updated patch.

I'm confused by the changes to existing test cases in this?  It looks
like you're undoing some test changes that are not from f00ab1fd1, so
I don't understand why.  My inclination here would be to add a new
test that exposes the current bug, but leave the rest alone, even
though they may no longer be testing exactly what they were originally
meant to do.

            regards, tom lane

Re: Query result differences between PostgreSQL 17 vs 16

From

Tom Lane

Date:

01 March, 01:16:14

I wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
>> Yeah, I think this is the cause of the failure.  Specifically, when we
>> start with (A/B)/C and transform to A/(B/C) per OJ identity 3, we need
>> a way to prevent upper clauses that use only C Vars from being pushed
>> down and applied as a filter clause at the lower B/C join.  To achieve
>> this, we consider that the pushed-down B/C join has not completed, and
>> hence do not include B/C join's ojrelid in its relid set (it will be
>> added when we form the A/B join).

>> This is why, in this case, the version of 'customer.cid IS NOT NULL'
>> that is not marked as nullable by the left join to customer is chosen:
>> we've assumed that that left join has not completed yet.

> Oh!  So I found the right solution through the wrong chain of
> reasoning ;=).  It's not that the clone group is missed during qual
> selection: it's that the selected representative lacks a nullingrel
> bit, so when add_join_clause_to_rel acts on it later, it thinks it's
> okay to throw away.

I found the time to trace through this example in detail, and your
description is accurate.  This is explained (perhaps not at sufficient
length) in optimizer/README:

  As usual, outer join identity 3 complicates matters.  If we start with
      (A leftjoin B on (Pab)) leftjoin C on (Pbc)
  then the parser will have marked any C Vars appearing above these joins
  with the RT index of the B/C join.  If we now transform to
      A leftjoin (B leftjoin C on (Pbc)) on (Pab)
  then it would appear that a clause using only such Vars could be pushed
  down and applied as a filter clause (not a join clause) at the lower
  B/C join.  But *this might not give the right answer* since the clause
  might see a non-null value for the C Var that will be replaced by null
  once the A/B join is performed.  We handle this by saying that the
  pushed-down join hasn't completely performed the work of the B/C join
  and hence is not entitled to include that outer join relid in its
  relid set.  When we form the A/B join, both outer joins' relids will
  be added to its relid set, and then the upper clause will be applied
  at the correct join level.  (Note there is no problem when identity 3
  is applied in the other direction: if we started with the second form
  then upper C Vars are marked with both outer join relids, so they
  cannot drop below whichever join is applied second.)  Similarly,
  Vars representing the output of a pushed-down join do not acquire
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  nullingrel bits for that join until after the upper join is performed.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So in the case at hand, we generate a clone of the "rc3.id IS NOT
NULL" clause that doesn't have a nullingrel bit for the left join to
rc3, but nonetheless is expected to be applied after that join is
performed.  There is also a clone that does have a nullingrel bit, but
that doesn't save us in this example because it will only be applied
in join orders other than the one that gets chosen.

It'd certainly be nice to make that cleaner, but I'm not sure how,
and we wouldn't risk back-patching the nontrivial changes that would
be needed.  So I think rejecting has_clone/is_clone clauses is the
only path forward for now.

Also, I found a test case proving that restriction_is_always_false
needs it too.  If we patch only restriction_is_always_true, then
try this modification of the submitted repro:

regression=# explain (costs off)
SELECT * FROM rc1
     LEFT JOIN rc2 ON rc2.rc1_reference = rc1.description
     LEFT JOIN rc3 ON rc2.id = rc3.rc2_reference
     LEFT JOIN LATERAL rc_select(rc3.id) rs ON rc3.id IS NOT NULL and rc3.rc2_reference IS NULL;
                             QUERY PLAN
--------------------------------------------------------------------
 Hash Right Join
   Hash Cond: ((rc2.rc1_reference)::text = (rc1.description)::text)
   ->  Nested Loop Left Join
         Join Filter: ((rc3.id IS NOT NULL) AND false)
         ->  Hash Right Join
               Hash Cond: (rc3.rc2_reference = rc2.id)
               ->  Seq Scan on rc3
               ->  Hash
                     ->  Seq Scan on rc2
         ->  Result
               One-Time Filter: false
   ->  Hash
         ->  Seq Scan on rc1
(13 rows)

restriction_is_always_false has decided that "rc3.rc2_reference IS
NULL" is always false, although for exactly the same reasons discussed
here, it isn't.  One other interesting point here is that the code has
also decided that the scan on rc_select can be reduced to dummy
because the ON qual is unsatisfiable.  I think that is actually
correct in this case, but the conclusion is based on very faulty
logic, so I don't have any faith that there aren't nearby cases where
that would be another bug.  With both is_always functions patched,
I get

regression=# explain (costs off)
SELECT * FROM rc1
     LEFT JOIN rc2 ON rc2.rc1_reference = rc1.description
     LEFT JOIN rc3 ON rc2.id = rc3.rc2_reference
     LEFT JOIN LATERAL rc_select(rc3.id) rs ON rc3.id IS NOT NULL and rc3.rc2_reference IS NULL;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Hash Right Join
   Hash Cond: ((rc2.rc1_reference)::text = (rc1.description)::text)
   ->  Nested Loop Left Join
         Join Filter: ((rc3.id IS NOT NULL) AND (rc3.rc2_reference IS NULL))
         ->  Hash Right Join
               Hash Cond: (rc3.rc2_reference = rc2.id)
               ->  Seq Scan on rc3
               ->  Hash
                     ->  Seq Scan on rc2
         ->  Function Scan on rc_select rs
   ->  Hash
         ->  Seq Scan on rc1
(12 rows)

Maybe that's leaving something on the table, but I'm content to call
it good for today.

            regards, tom lane

Re: Query result differences between PostgreSQL 17 vs 16

From

Richard Guo

Date:

03 March, 09:40:47

On Sat, Mar 1, 2025 at 12:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > Now I'm inclined to agree with you.  It seems we've been lucky so far
> > not to encounter issues with IS NULL deductions on clone clauses, but
> > this feels like a problem waiting to happen in the future.
>
> Yes, I'm on board with that too.  I think the comments need more work
> than you've done though.  I think they should say something about
> "nullingrel bits in clone clauses may not reflect reality, so we dare
> not draw conclusions from clones about whether Vars are guaranteed
> not-null".  Also, I'm not convinced any more that this has anything to
> do with missing clone members, so we should remove that claim.

Agreed.  Thanks for the suggestion.

> > The no-clone-quals limitation in restriction_is_always_false makes the
> > fix in f00ab1fd1 unnecessary.  I think we should revert those changes.
>
> Hmm, I'm less on board with that.  The saving-and-restoring of
> last_rinfo_serial looks correct to me, even though this fix might
> cause the test case you found before to not tickle that bug.

Fair point.

> > Attached is the updated patch.
>
> I'm confused by the changes to existing test cases in this?  It looks
> like you're undoing some test changes that are not from f00ab1fd1, so
> I don't understand why.  My inclination here would be to add a new
> test that exposes the current bug, but leave the rest alone, even
> though they may no longer be testing exactly what they were originally
> meant to do.

Yeah, the patch modifies some existing test cases that are intended to
test a join clause that is IS_NOT_NULL/IS_NULL qual can or cannot be
reduced depending on whether its Var is nullable by any outer joins.
And it happens that the join clauses in these test cases are clone
clauses, which are excluded from reduction by the fix here.  As a
result, they are no longer testing what they are intended to.  This
patch modifies them to ensure that those join clauses are not clone
clauses.

This patch also adds a new test that exposes the current bug, and
tests that clone clauses are not reduced to constant true/false.

Here is an updated patch, which also includes the test case you found
proving the no-clone-quals limitation in restriction_is_always_false
is also needed.

Thanks
Richard

Attachment

v3-0001-Avoid-NullTest-deduction-for-clone-clauses.patch

Re: Query result differences between PostgreSQL 17 vs 16

From

Richard Guo

Date:

03 March, 09:50:38

On Sat, Mar 1, 2025 at 7:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So in the case at hand, we generate a clone of the "rc3.id IS NOT
> NULL" clause that doesn't have a nullingrel bit for the left join to
> rc3, but nonetheless is expected to be applied after that join is
> performed.  There is also a clone that does have a nullingrel bit, but
> that doesn't save us in this example because it will only be applied
> in join orders other than the one that gets chosen.

Exactly!

> It'd certainly be nice to make that cleaner, but I'm not sure how,
> and we wouldn't risk back-patching the nontrivial changes that would
> be needed.  So I think rejecting has_clone/is_clone clauses is the
> only path forward for now.

Yeah, a better fix might be to find a way to perform this NullTest
deduction during eval_const_expressions, as you mentioned in the other
thread.  However, that would also require some nontrivial changes.

> Also, I found a test case proving that restriction_is_always_false
> needs it too.

Thanks.  I've included this test case in v3 patch.

Thanks
Richard

Re: Query result differences between PostgreSQL 17 vs 16

From

Tom Lane

Date:

03 March, 18:22:31

Richard Guo <guofenglinux@gmail.com> writes:
> Here is an updated patch, which also includes the test case you found
> proving the no-clone-quals limitation in restriction_is_always_false
> is also needed.

LGTM.

            regards, tom lane

Re: Query result differences between PostgreSQL 17 vs 16

From

Richard Guo

Date:

04 March, 10:30:26

On Tue, Mar 4, 2025 at 12:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > Here is an updated patch, which also includes the test case you found
> > proving the no-clone-quals limitation in restriction_is_always_false
> > is also needed.

> LGTM.

Pushed and back-patched to v17.  The fix will be included in next
minor release (17.5).

Thanks
Richard

Re: Query result differences between PostgreSQL 17 vs 16

From

Bruce Momjian

Date:

12 March, 01:03:40

On Tue, Mar  4, 2025 at 04:30:26PM +0900, Richard Guo wrote:
> On Tue, Mar 4, 2025 at 12:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Richard Guo <guofenglinux@gmail.com> writes:
> > > Here is an updated patch, which also includes the test case you found
> > > proving the no-clone-quals limitation in restriction_is_always_false
> > > is also needed.
> 
> > LGTM.
> 
> Pushed and back-patched to v17.  The fix will be included in next
> minor release (17.5).

Uh, if we fixed this for NULL optimization and backpatched, should we do
the same for:

    Parameter NOT NULL to CREATE DOMAIN not the same as CHECK (VALUE IS NOT NULL)
    https://www.postgresql.org/message-id/flat/173591158454.714.7664064332419606037%40wrigleys.postgresql.org

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.

Re: Query result differences between PostgreSQL 17 vs 16

From

Tom Lane

Date:

12 March, 01:19:49

Bruce Momjian <bruce@momjian.us> writes:
> Uh, if we fixed this for NULL optimization and backpatched, should we do
> the same for:
>     Parameter NOT NULL to CREATE DOMAIN not the same as CHECK (VALUE IS NOT NULL)
>     https://www.postgresql.org/message-id/flat/173591158454.714.7664064332419606037%40wrigleys.postgresql.org

This is unrelated to that.

            regards, tom lane

Re: Query result differences between PostgreSQL 17 vs 16

From

Bruce Momjian

Date:

12 March, 01:58:27

On Tue, Mar 11, 2025 at 06:19:49PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Uh, if we fixed this for NULL optimization and backpatched, should we do
> > the same for:
> >     Parameter NOT NULL to CREATE DOMAIN not the same as CHECK (VALUE IS NOT NULL)
> >     https://www.postgresql.org/message-id/flat/173591158454.714.7664064332419606037%40wrigleys.postgresql.org
> 
> This is unrelated to that.

Well, I was trying to figure out if we should be changing the NULL
handling in 17.X, but now that you have done it with one patch, it seems
the above URL patch should also be applied and backpatched to 17, no?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.

Re: Query result differences between PostgreSQL 17 vs 16

From

Tom Lane

Date:

12 March, 03:57:06

Bruce Momjian <bruce@momjian.us> writes:
> Well, I was trying to figure out if we should be changing the NULL
> handling in 17.X, but now that you have done it with one patch, it seems
> the above URL patch should also be applied and backpatched to 17, no?

If you want somebody to review that patch, shouldn't it be in the CF?

            regards, tom lane

Re: Query result differences between PostgreSQL 17 vs 16

From

Bruce Momjian

Date:

12 March, 04:56:03

On Tue, Mar 11, 2025 at 08:57:06PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Well, I was trying to figure out if we should be changing the NULL
> > handling in 17.X, but now that you have done it with one patch, it seems
> > the above URL patch should also be applied and backpatched to 17, no?
> 
> If you want somebody to review that patch, shouldn't it be in the CF?

Uh, I wasn't sure if it would make it in the next minor release if I did
that.  Is it a good idea?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.