Thread: Query result differences between PostgreSQL 17 vs 16

Query result differences between PostgreSQL 17 vs 16

From
Ronald Cruz
Date:

Hi,

We have observed an issue after upgrading to PostgreSQL 17 that caused us to roll back to 16. Some of our queries are returning what I believe to be erroneous results.

I've attached two files here that can be used to reproduce what I'm seeing:

schema_plus_data.sql - This contains a pg_dump of a reproducible test case with a contrived schema and dataset of our use case.

reproducer.sql - This isn't necessary, but perhaps you find it helpful. This is a SQL script used to generate the schema and random data that hits this edge case. It was used to generate the previous file and it has reliably hit the issue so far from the handful of times I've tried.

The query I'm observing issues for is the following:

SELECT * FROM rc1
    LEFT JOIN rc2 ON rc2.rc1_reference = rc1.description
    LEFT JOIN rc3 ON rc2.id = rc3.rc2_reference
    LEFT JOIN LATERAL rc_select(rc3.id) ON rc3.id IS NOT NULL;

Under PostgreSQL 17, I'm seeing ~400k results returned, whereas in PostgreSQL 16, I see ~6k. The results I believe to be erroneous are those that have 'BUG HIT' in the output for PostgreSQL 17. These are results joined from rc_select where rc3.id is null. I'm not expecting to see any of these rows as is the case in PostgreSQL 16 output (and 15 as well from prior experience).

I've observed this behavior in the latest PostgreSQL 17.4 but have also encountered this in 17.2 and 17.3. The OS being used is RHEL 9.5 (plow). Please let me know if you need any more information.

Thank you in advance for any help,
Ron


Attachment

Re: Query result differences between PostgreSQL 17 vs 16

From
Bruce Momjian
Date:
On Fri, Feb 21, 2025 at 11:13:17AM -0500, Ronald Cruz wrote:
> Hi,
> 
> We have observed an issue after upgrading to PostgreSQL 17 that caused us to
> roll back to 16. Some of our queries are returning what I believe to be
> erroneous results.
> 
> I've attached two files here that can be used to reproduce what I'm seeing:
> 
> schema_plus_data.sql - This contains a pg_dump of a reproducible test case with
> a contrived schema and dataset of our use case.
> 
> reproducer.sql - This isn't necessary, but perhaps you find it helpful. This is
> a SQL script used to generate the schema and random data that hits this edge
> case. It was used to generate the previous file and it has reliably hit the
> issue so far from the handful of times I've tried.
> 
> The query I'm observing issues for is the following:
> 
> SELECT * FROM rc1
>     LEFT JOIN rc2 ON rc2.rc1_reference = rc1.description
>     LEFT JOIN rc3 ON rc2.id = rc3.rc2_reference
>     LEFT JOIN LATERAL rc_select(rc3.id) ON rc3.id IS NOT NULL;
> 
> Under PostgreSQL 17, I'm seeing ~400k results returned, whereas in PostgreSQL
> 16, I see ~6k. The results I believe to be erroneous are those that have 'BUG
> HIT' in the output for PostgreSQL 17. These are results joined from rc_select
> where rc3.id is null. I'm not expecting to see any of these rows as is the case
> in PostgreSQL 16 output (and 15 as well from prior experience).
> 
> I've observed this behavior in the latest PostgreSQL 17.4 but have also
> encountered this in 17.2 and 17.3. The OS being used is RHEL 9.5 (plow). Please
> let me know if you need any more information.

We have a known problem with composite types and NULL constraints in PG
17 that I think we are fixing in PG 18.  I saw IS NOT NULL in your
query so I thought I would mention it:

    https://www.postgresql.org/message-id/Z37p0paENWWUarj-%40momjian.us

We do have several NULL optimizations in PG 17.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: Query result differences between PostgreSQL 17 vs 16

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Fri, Feb 21, 2025 at 11:13:17AM -0500, Ronald Cruz wrote:
>> We have observed an issue after upgrading to PostgreSQL 17 that caused us to
>> roll back to 16. Some of our queries are returning what I believe to be
>> erroneous results.

> We have a known problem with composite types and NULL constraints in PG
> 17 that I think we are fixing in PG 18.

There's no composite type at hand here.  I think the problem is an
erroneous deduction from a column NOT NULL constraint.  I can
reproduce a faulty plan in the regression database with

explain (costs off)
select * from tenk1
  left join int4_tbl i on (unique1 = f1)
  left join customer on (i.f1 = cid)
  left join int4_tbl j on cid is not null;

v16 produces

 Hash Left Join
   Hash Cond: (tenk1.unique1 = i.f1)
   ->  Seq Scan on tenk1
   ->  Hash
         ->  Nested Loop Left Join
               Join Filter: (customer.cid IS NOT NULL)
               ->  Hash Right Join
                     Hash Cond: (customer.cid = i.f1)
                     ->  Seq Scan on customer
                     ->  Hash
                           ->  Seq Scan on int4_tbl i
               ->  Materialize
                     ->  Seq Scan on int4_tbl j

but HEAD produces

 Hash Left Join
   Hash Cond: (tenk1.unique1 = i.f1)
   ->  Seq Scan on tenk1
   ->  Hash
         ->  Nested Loop Left Join
               ->  Hash Right Join
                     Hash Cond: (customer.cid = i.f1)
                     ->  Seq Scan on customer
                     ->  Hash
                           ->  Seq Scan on int4_tbl i
               ->  Materialize
                     ->  Seq Scan on int4_tbl j

Note the lack of any IS NOT NULL test.  I think the planner has
convinced itself that the not null constraint on customer.cid
makes that test redundant, despite the fact that what it is
testing is a post-outer-join value that most certainly could
be null.

"git bisect" fingers this commit:

b262ad440edecda0b1aba81d967ab560a83acb8a is the first bad commit
commit b262ad440edecda0b1aba81d967ab560a83acb8a
Author: David Rowley <drowley@postgresql.org>
Date:   Tue Jan 23 18:09:18 2024 +1300

    Add better handling of redundant IS [NOT] NULL quals


I've not looked at the code, but I suspect that it is failing
to check varnullingrels before believing that it can trust
the applicability of table constraints.

            regards, tom lane



Re: Query result differences between PostgreSQL 17 vs 16

From
Richard Guo
Date:
On Sat, Feb 22, 2025 at 8:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "git bisect" fingers this commit:
>
> b262ad440edecda0b1aba81d967ab560a83acb8a is the first bad commit
> commit b262ad440edecda0b1aba81d967ab560a83acb8a
> Author: David Rowley <drowley@postgresql.org>
> Date:   Tue Jan 23 18:09:18 2024 +1300
>
>     Add better handling of redundant IS [NOT] NULL quals
>
>
> I've not looked at the code, but I suspect that it is failing
> to check varnullingrels before believing that it can trust
> the applicability of table constraints.

Hmm, we do check varnullingrels in expr_is_nonnullable().  My best
guess is that we have generated two versions of the qual 'customer.cid
IS NOT NULL': one with customer.cid marked as nullable by the left
join to customer, and one without.  The latter is dropped because of
the not null constraint on customer.cid, while the former fails to be
applied on the left join to int4_tbl j.

Thanks
Richard



Re: Query result differences between PostgreSQL 17 vs 16

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> On Sat, Feb 22, 2025 at 8:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I've not looked at the code, but I suspect that it is failing
>> to check varnullingrels before believing that it can trust
>> the applicability of table constraints.

> Hmm, we do check varnullingrels in expr_is_nonnullable().  My best
> guess is that we have generated two versions of the qual 'customer.cid
> IS NOT NULL': one with customer.cid marked as nullable by the left
> join to customer, and one without.  The latter is dropped because of
> the not null constraint on customer.cid, while the former fails to be
> applied on the left join to int4_tbl j.

If the check against table not-null constraints is applied after we
clone outer-join quals, that's probably bad.  I think there are
assumptions in there that every clone qual will have doppelgangers,
so filtering NOT NULLs later would break that.  Maybe not applying
the filter to quals marked has_clone or is_clone would help?

            regards, tom lane



Re: Query result differences between PostgreSQL 17 vs 16

From
Richard Guo
Date:
On Sat, Feb 22, 2025 at 11:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > On Sat, Feb 22, 2025 at 8:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I've not looked at the code, but I suspect that it is failing
> >> to check varnullingrels before believing that it can trust
> >> the applicability of table constraints.
>
> > Hmm, we do check varnullingrels in expr_is_nonnullable().  My best
> > guess is that we have generated two versions of the qual 'customer.cid
> > IS NOT NULL': one with customer.cid marked as nullable by the left
> > join to customer, and one without.  The latter is dropped because of
> > the not null constraint on customer.cid, while the former fails to be
> > applied on the left join to int4_tbl j.
>
> If the check against table not-null constraints is applied after we
> clone outer-join quals, that's probably bad.  I think there are
> assumptions in there that every clone qual will have doppelgangers,
> so filtering NOT NULLs later would break that.  Maybe not applying
> the filter to quals marked has_clone or is_clone would help?

Yeah, I think this approach can fix the issue.  Perhaps we should
check whether the RestrictInfo is a clone clause and avoid assuming
that it's always true in that case, maybe by adding something like
below at the start of restriction_is_always_true.

+   if (restrictinfo->has_clone || restrictinfo->is_clone)
+       return false;

We may lose some optimization opportunities with NOT NULL quals, as
this could prevent us from reducing certain such quals to constant
true, but I think correctness should be our top priority.

I'll go ahead and write a patch.

Thanks
Richard



Re: Query result differences between PostgreSQL 17 vs 16

From
Richard Guo
Date:
On Sat, Feb 22, 2025 at 12:08 PM Richard Guo <guofenglinux@gmail.com> wrote:
> On Sat, Feb 22, 2025 at 11:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > If the check against table not-null constraints is applied after we
> > clone outer-join quals, that's probably bad.  I think there are
> > assumptions in there that every clone qual will have doppelgangers,
> > so filtering NOT NULLs later would break that.  Maybe not applying
> > the filter to quals marked has_clone or is_clone would help?

> Yeah, I think this approach can fix the issue.  Perhaps we should
> check whether the RestrictInfo is a clone clause and avoid assuming
> that it's always true in that case, maybe by adding something like
> below at the start of restriction_is_always_true.
>
> +   if (restrictinfo->has_clone || restrictinfo->is_clone)
> +       return false;
>
> We may lose some optimization opportunities with NOT NULL quals, as
> this could prevent us from reducing certain such quals to constant
> true, but I think correctness should be our top priority.
>
> I'll go ahead and write a patch.

Here is the patch.

Thanks
Richard

Attachment

Re: Query result differences between PostgreSQL 17 vs 16

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> Here is the patch.

Thanks for that.  The code and comment added to
restriction_is_always_true look good, but I can't help wondering
whether we don't need the same in restriction_is_always_false.
Not very sure what a query proving the need for that would
look like, but leaving it asymmetric feels wrong.

            regards, tom lane



Re: Query result differences between PostgreSQL 17 vs 16

From
Richard Guo
Date:
On Wed, Feb 26, 2025 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > Here is the patch.

> Thanks for that.  The code and comment added to
> restriction_is_always_true look good, but I can't help wondering
> whether we don't need the same in restriction_is_always_false.
> Not very sure what a query proving the need for that would
> look like, but leaving it asymmetric feels wrong.

Yeah, I think you are right.  The thing here is that we don't have a
reliable way to determine if the input expression of a NullTest is
non-nullable if it's a clone clause.  This applies to both
restriction_is_always_true and restriction_is_always_false.  So I
think we should add the same check in restriction_is_always_false too.

I'll give it a try to find a query that shows this is necessary.

Thanks
Richard



Re: Query result differences between PostgreSQL 17 vs 16

From
Richard Guo
Date:
On Wed, Feb 26, 2025 at 4:03 PM Richard Guo <guofenglinux@gmail.com> wrote:
> On Wed, Feb 26, 2025 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thanks for that.  The code and comment added to
> > restriction_is_always_true look good, but I can't help wondering
> > whether we don't need the same in restriction_is_always_false.
> > Not very sure what a query proving the need for that would
> > look like, but leaving it asymmetric feels wrong.

> Yeah, I think you are right.  The thing here is that we don't have a
> reliable way to determine if the input expression of a NullTest is
> non-nullable if it's a clone clause.  This applies to both
> restriction_is_always_true and restriction_is_always_false.  So I
> think we should add the same check in restriction_is_always_false too.
>
> I'll give it a try to find a query that shows this is necessary.

After some attempts, I failed to find such a query.  An IS NULL qual
is not strict, and it makes the outer join not commutable with other
joins per outer-join identity 3.  We could add another strict qual to
the join clauses to make the outer join commutable.  However, in this
case whether we reduce the IS NULL qual to a constant false would not
affect the join outputs, due to the strict qual.

I'm starting to believe that reducing an IS NULL clone clause to a
constant false should be safe.  Any thoughts?

Thanks
Richard



Re: Query result differences between PostgreSQL 17 vs 16

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
>> On Wed, Feb 26, 2025 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Thanks for that.  The code and comment added to
>>> restriction_is_always_true look good, but I can't help wondering
>>> whether we don't need the same in restriction_is_always_false.

> After some attempts, I failed to find such a query.  An IS NULL qual
> is not strict, and it makes the outer join not commutable with other
> joins per outer-join identity 3.  We could add another strict qual to
> the join clauses to make the outer join commutable.  However, in this
> case whether we reduce the IS NULL qual to a constant false would not
> affect the join outputs, due to the strict qual.

> I'm starting to believe that reducing an IS NULL clone clause to a
> constant false should be safe.  Any thoughts?

Not sure.  I think the risk factor here is fooling the clone-clause
mechanisms by failing to install some clone of an original outer
join clause in a rel's restriction list.  Looking at the callers
of restriction_is_always_false:

add_base_clause_to_rel
add_join_clause_to_rels

    These two replace the clause with a Const "false" that is
    marked with the same rinfo_serial.  Therefore the expectation
    that every group of clones will have a representative is
    still satisfied, even though the representative doesn't
    contain quite what you might expect.

apply_child_basequals

    This just summarily rejects the child rel as not needing
    to be built.  I think that might be okay, since we don't
    really make any further deductions with an appendrel
    child's quals.

I've not actually worked through the bug case to verify my hypothesis
that it's the lack of a representative for a clone group that is
breaking things.  (Maybe you did?)  But if that is the direct
cause of the failure, then it's plausible that none of these
callers are subject to it.

I'm a bit inclined to argue that we should put in the no-clone-quals
limitation anyway, to protect future callers that might not manage
to dodge the problem.  OTOH, if we get rid of these functions
altogether as I expressed my desire to do nearby, it's all moot.

            regards, tom lane