Thread: Inconsistency between try_mergejoin_path and create_mergejoin_plan

Inconsistency between try_mergejoin_path and create_mergejoin_plan

From
Alexander Pyhalov
Date:
Hi.

There's the following inconsistency between try_mergejoin_path() and 
create_mergejoin_plan().
When clause operator has no commutator, we can end up with mergejoin 
path.
Later create_mergejoin_plan() will call get_switched_clauses(). This 
function can error out with

ERROR:  could not find commutator for operator XXX

The similar behavior seems to be present also for hash join.

Attaching a test case (in patch) and a possible fix.

-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachment
On Mon, Jun 17, 2024 at 10:51 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
> There's the following inconsistency between try_mergejoin_path() and
> create_mergejoin_plan().
> When clause operator has no commutator, we can end up with mergejoin
> path.
> Later create_mergejoin_plan() will call get_switched_clauses(). This
> function can error out with
>
> ERROR:  could not find commutator for operator XXX

Interesting.  This error can be reproduced with table 'ec1' from
sql/equivclass.sql.

set enable_indexscan to off;

explain select * from ec1 t1 join ec1 t2 on t2.ff = t1.f1;
ERROR:  could not find commutator for operator 30450

The column ec1.f1 has a type of 'int8alias1', a new data type created in
this test file.  Additionally, there is also a newly created operator
'int8 = int8alias1' which is mergejoinable but lacks a valid commutator.
Therefore, there is no problem generating the mergejoin path, but when
we create the mergejoin plan, get_switched_clauses would notice the
absence of a valid commutator needed to commute the clause.

It seems to me that the new operator is somewhat artificial, since it is
designed to support a mergejoin but lacks a valid commutator.  So before
we proceed to discuss the fix, I'd like to know whether this is a valid
issue that needs fixing.

Any thoughts?

Thanks
Richard



Richard Guo <guofenglinux@gmail.com> writes:
> On Mon, Jun 17, 2024 at 10:51 PM Alexander Pyhalov
> <a.pyhalov@postgrespro.ru> wrote:
>> ERROR:  could not find commutator for operator XXX

> It seems to me that the new operator is somewhat artificial, since it is
> designed to support a mergejoin but lacks a valid commutator.  So before
> we proceed to discuss the fix, I'd like to know whether this is a valid
> issue that needs fixing.

Well, there's no doubt that the case is artificial: nobody who knew
what they were doing would install an incomplete opclass like this
in a production setting.  However, there are several parts of the
planner that take pains to avoid this type of failure.  I am pretty
sure that we are careful about flipping around candidate indexscan
quals for instance.  And the "broken equivalence class" mechanism
is all about that, which is what equivclass.sql is setting up this
opclass to test.  So I find it a bit sad if mergejoin creation is
tripping over this case.

I do not think we should add a great deal of complexity or extra
planner cycles to deal with this; but if it can be fixed at low
cost, we should.

            regards, tom lane



On Wed, Jun 19, 2024 at 12:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > It seems to me that the new operator is somewhat artificial, since it is
> > designed to support a mergejoin but lacks a valid commutator.  So before
> > we proceed to discuss the fix, I'd like to know whether this is a valid
> > issue that needs fixing.

> I do not think we should add a great deal of complexity or extra
> planner cycles to deal with this; but if it can be fixed at low
> cost, we should.

I think we can simply verify the validity of commutators for clauses in
the form "inner op outer" when selecting mergejoin/hash clauses.  If a
clause lacks a commutator, we should consider it unusable for the given
pair of outer and inner rels.  Please see the attached patch.

Thanks
Richard

Attachment

Re: Inconsistency between try_mergejoin_path and create_mergejoin_plan

From
Alexander Pyhalov
Date:
Richard Guo писал(а) 2024-06-19 16:30:
> On Wed, Jun 19, 2024 at 12:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Richard Guo <guofenglinux@gmail.com> writes:
>> > It seems to me that the new operator is somewhat artificial, since it is
>> > designed to support a mergejoin but lacks a valid commutator.  So before
>> > we proceed to discuss the fix, I'd like to know whether this is a valid
>> > issue that needs fixing.
> 
>> I do not think we should add a great deal of complexity or extra
>> planner cycles to deal with this; but if it can be fixed at low
>> cost, we should.
> 
> I think we can simply verify the validity of commutators for clauses in
> the form "inner op outer" when selecting mergejoin/hash clauses.  If a
> clause lacks a commutator, we should consider it unusable for the given
> pair of outer and inner rels.  Please see the attached patch.
> 

This seems to be working for my test cases.
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional



On Wed, Jun 19, 2024 at 10:24 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
> Richard Guo писал(а) 2024-06-19 16:30:
> > I think we can simply verify the validity of commutators for clauses in
> > the form "inner op outer" when selecting mergejoin/hash clauses.  If a
> > clause lacks a commutator, we should consider it unusable for the given
> > pair of outer and inner rels.  Please see the attached patch.

> This seems to be working for my test cases.

Thank you for confirming.  Here is an updated patch with some tweaks to
the comments and commit message.  I've parked this patch in the July
commitfest.

Thanks
Richard

Attachment
Richard Guo <guofenglinux@gmail.com> writes:
> Thank you for confirming.  Here is an updated patch with some tweaks to
> the comments and commit message.  I've parked this patch in the July
> commitfest.

I took a brief look at this.  I think the basic idea is sound,
but I have a couple of nits:

* It's not entirely obvious that the checks preceding these additions
are sufficient to ensure that the clauses are OpExprs.  They probably
are, since the clauses are marked hashable or mergeable, but that test
is mighty far away.  More to the point, if they ever weren't OpExprs
the result would likely be to pass a bogus OID to get_commutator and
thus silently fail, allowing the problem to go undetected for a long
time.  I'd suggest using castNode() rather than a hard-wired
assumption that the clause is an OpExpr.

* Do we really need to construct a whole new set of bogus operators
and opclasses to test this?  As you noted, the regression tests
already set up an incomplete opclass for other purposes.  Why can't
we extend that test, to reduce the amount of cycles wasted forevermore
on this rather trivial point?

(I'm actually wondering whether we really need to memorialize this
with a regression test case at all.  But I'd settle for minimizing
the amount of test cycles added.)

            regards, tom lane



On Thu, Jul 25, 2024 at 12:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I took a brief look at this.  I think the basic idea is sound,
> but I have a couple of nits:

Thank you for reviewing this patch!

> * It's not entirely obvious that the checks preceding these additions
> are sufficient to ensure that the clauses are OpExprs.  They probably
> are, since the clauses are marked hashable or mergeable, but that test
> is mighty far away.  More to the point, if they ever weren't OpExprs
> the result would likely be to pass a bogus OID to get_commutator and
> thus silently fail, allowing the problem to go undetected for a long
> time.  I'd suggest using castNode() rather than a hard-wired
> assumption that the clause is an OpExpr.

Good point.  I've modified the code to use castNode(), and added
comment accordingly.

> * Do we really need to construct a whole new set of bogus operators
> and opclasses to test this?  As you noted, the regression tests
> already set up an incomplete opclass for other purposes.  Why can't
> we extend that test, to reduce the amount of cycles wasted forevermore
> on this rather trivial point?
>
> (I'm actually wondering whether we really need to memorialize this
> with a regression test case at all.  But I'd settle for minimizing
> the amount of test cycles added.)

At first I planned to use the alias type 'int8alias1' created in
equivclass.sql for this test.  However, I found that this type is not
created yet when running join.sql.  Perhaps it's more convenient to
place this test in equivclass.sql to leverage the bogus operators
created there, but the test does not seem to be related to 'broken'
ECs.  I'm not sure if equivclass.sql is the appropriate place.

Do you think it works if we place this test in equivclass.sql and
write a comment explaining why it's there, like attached?  Now I’m
also starting to wonder if this change actually warrants such a test.

BTW, do you think this patch is worth backpatching?

Thanks
Richard

Attachment

Re: Inconsistency between try_mergejoin_path and create_mergejoin_plan

From
Richard Guo
Date:
On Tue, Sep 3, 2024 at 5:51 PM Richard Guo <guofenglinux@gmail.com> wrote:
> The new test case fails starting from adf97c156, and we have to
> install a hash opfamily and a hash function for the hacked int8alias1
> type to make the test case work again.
>
> Now, I'm more dubious about whether we really need to add a test case
> for this change.

I pushed this patch with the test case remaining, as it adds only a
minimal number of test cycles.  I explained in the commit message why
the test case is included in equivclass.sql rather than in join.sql.

I did not do backpatch because this bug cannot be reproduced without
installing an incomplete opclass, which is unlikely to happen in
practice.

Thanks for the report and review.

Thanks
Richard



Re: Inconsistency between try_mergejoin_path and create_mergejoin_plan

From
Alexander Lakhin
Date:
Hello Richard and Tom,

04.09.2024 06:50, Richard Guo wrote:
> I pushed this patch with the test case remaining, as it adds only a
> minimal number of test cycles.  I explained in the commit message why
> the test case is included in equivclass.sql rather than in join.sql.

While playing with the equivclass test, I've discovered that the next step
to define a complete set of operators in the test:
@@ -65,6 +65,7 @@
      procedure = int8alias1eq,
      leftarg = int8, rightarg = int8alias1,
      restrict = eqsel, join = eqjoinsel,
+commutator = =,
      merges
  );

produces an internal error:
ERROR:  XX000: operator 32312 is not a member of opfamily 1976
LOCATION:  get_op_opfamily_properties, lsyscache.c:149

pg_regress/equivclass BACKTRACE:
get_op_opfamily_properties at lsyscache.c:149:3
MJExamineQuals at nodeMergejoin.c:228:19
ExecInitMergeJoin at nodeMergejoin.c:1608:25
ExecInitNode at execProcnode.c:303:27
InitPlan at execMain.c:964:14

Maybe the error itself is not that unexpected, but I'm confused by a
comment above the function:
  * Caller should already have verified that opno is a member of opfamily,
  * therefore we raise an error if the tuple is not found.
  */
void
get_op_opfamily_properties(Oid opno, Oid opfamily, bool ordering_op,
                            int *strategy,
                            Oid *lefttype,
                            Oid *righttype)
{
     HeapTuple   tp;
     Form_pg_amop amop_tup;

     tp = SearchSysCache3(AMOPOPID,
                          ObjectIdGetDatum(opno),
                          CharGetDatum(ordering_op ? AMOP_ORDER : AMOP_SEARCH),
                          ObjectIdGetDatum(opfamily));
     if (!HeapTupleIsValid(tp))
         elog(ERROR, "operator %u is not a member of opfamily %u",
              opno, opfamily);

This behavior reproduced on commit a33cf1041, dated 2007-01-23, which
added "ALTER OPERATOR FAMILY", but I think it also can be reproduced on
previous commits with manual catalog editing. (The comment was added by
a78fcfb51 from 2006-12-23, which introduced operator families.)

Best regards,
Alexander