Thread: Inconsistency between try_mergejoin_path and create_mergejoin_plan
Hi. There's the following inconsistency between try_mergejoin_path() and create_mergejoin_plan(). When clause operator has no commutator, we can end up with mergejoin path. Later create_mergejoin_plan() will call get_switched_clauses(). This function can error out with ERROR: could not find commutator for operator XXX The similar behavior seems to be present also for hash join. Attaching a test case (in patch) and a possible fix. -- Best regards, Alexander Pyhalov, Postgres Professional
Attachment
On Mon, Jun 17, 2024 at 10:51 PM Alexander Pyhalov <a.pyhalov@postgrespro.ru> wrote: > There's the following inconsistency between try_mergejoin_path() and > create_mergejoin_plan(). > When clause operator has no commutator, we can end up with mergejoin > path. > Later create_mergejoin_plan() will call get_switched_clauses(). This > function can error out with > > ERROR: could not find commutator for operator XXX Interesting. This error can be reproduced with table 'ec1' from sql/equivclass.sql. set enable_indexscan to off; explain select * from ec1 t1 join ec1 t2 on t2.ff = t1.f1; ERROR: could not find commutator for operator 30450 The column ec1.f1 has a type of 'int8alias1', a new data type created in this test file. Additionally, there is also a newly created operator 'int8 = int8alias1' which is mergejoinable but lacks a valid commutator. Therefore, there is no problem generating the mergejoin path, but when we create the mergejoin plan, get_switched_clauses would notice the absence of a valid commutator needed to commute the clause. It seems to me that the new operator is somewhat artificial, since it is designed to support a mergejoin but lacks a valid commutator. So before we proceed to discuss the fix, I'd like to know whether this is a valid issue that needs fixing. Any thoughts? Thanks Richard
Richard Guo <guofenglinux@gmail.com> writes: > On Mon, Jun 17, 2024 at 10:51 PM Alexander Pyhalov > <a.pyhalov@postgrespro.ru> wrote: >> ERROR: could not find commutator for operator XXX > It seems to me that the new operator is somewhat artificial, since it is > designed to support a mergejoin but lacks a valid commutator. So before > we proceed to discuss the fix, I'd like to know whether this is a valid > issue that needs fixing. Well, there's no doubt that the case is artificial: nobody who knew what they were doing would install an incomplete opclass like this in a production setting. However, there are several parts of the planner that take pains to avoid this type of failure. I am pretty sure that we are careful about flipping around candidate indexscan quals for instance. And the "broken equivalence class" mechanism is all about that, which is what equivclass.sql is setting up this opclass to test. So I find it a bit sad if mergejoin creation is tripping over this case. I do not think we should add a great deal of complexity or extra planner cycles to deal with this; but if it can be fixed at low cost, we should. regards, tom lane
On Wed, Jun 19, 2024 at 12:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Richard Guo <guofenglinux@gmail.com> writes: > > It seems to me that the new operator is somewhat artificial, since it is > > designed to support a mergejoin but lacks a valid commutator. So before > > we proceed to discuss the fix, I'd like to know whether this is a valid > > issue that needs fixing. > I do not think we should add a great deal of complexity or extra > planner cycles to deal with this; but if it can be fixed at low > cost, we should. I think we can simply verify the validity of commutators for clauses in the form "inner op outer" when selecting mergejoin/hash clauses. If a clause lacks a commutator, we should consider it unusable for the given pair of outer and inner rels. Please see the attached patch. Thanks Richard
Attachment
Richard Guo писал(а) 2024-06-19 16:30: > On Wed, Jun 19, 2024 at 12:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Richard Guo <guofenglinux@gmail.com> writes: >> > It seems to me that the new operator is somewhat artificial, since it is >> > designed to support a mergejoin but lacks a valid commutator. So before >> > we proceed to discuss the fix, I'd like to know whether this is a valid >> > issue that needs fixing. > >> I do not think we should add a great deal of complexity or extra >> planner cycles to deal with this; but if it can be fixed at low >> cost, we should. > > I think we can simply verify the validity of commutators for clauses in > the form "inner op outer" when selecting mergejoin/hash clauses. If a > clause lacks a commutator, we should consider it unusable for the given > pair of outer and inner rels. Please see the attached patch. > This seems to be working for my test cases. -- Best regards, Alexander Pyhalov, Postgres Professional
On Wed, Jun 19, 2024 at 10:24 PM Alexander Pyhalov <a.pyhalov@postgrespro.ru> wrote: > Richard Guo писал(а) 2024-06-19 16:30: > > I think we can simply verify the validity of commutators for clauses in > > the form "inner op outer" when selecting mergejoin/hash clauses. If a > > clause lacks a commutator, we should consider it unusable for the given > > pair of outer and inner rels. Please see the attached patch. > This seems to be working for my test cases. Thank you for confirming. Here is an updated patch with some tweaks to the comments and commit message. I've parked this patch in the July commitfest. Thanks Richard
Attachment
Richard Guo <guofenglinux@gmail.com> writes: > Thank you for confirming. Here is an updated patch with some tweaks to > the comments and commit message. I've parked this patch in the July > commitfest. I took a brief look at this. I think the basic idea is sound, but I have a couple of nits: * It's not entirely obvious that the checks preceding these additions are sufficient to ensure that the clauses are OpExprs. They probably are, since the clauses are marked hashable or mergeable, but that test is mighty far away. More to the point, if they ever weren't OpExprs the result would likely be to pass a bogus OID to get_commutator and thus silently fail, allowing the problem to go undetected for a long time. I'd suggest using castNode() rather than a hard-wired assumption that the clause is an OpExpr. * Do we really need to construct a whole new set of bogus operators and opclasses to test this? As you noted, the regression tests already set up an incomplete opclass for other purposes. Why can't we extend that test, to reduce the amount of cycles wasted forevermore on this rather trivial point? (I'm actually wondering whether we really need to memorialize this with a regression test case at all. But I'd settle for minimizing the amount of test cycles added.) regards, tom lane
On Thu, Jul 25, 2024 at 12:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I took a brief look at this. I think the basic idea is sound, > but I have a couple of nits: Thank you for reviewing this patch! > * It's not entirely obvious that the checks preceding these additions > are sufficient to ensure that the clauses are OpExprs. They probably > are, since the clauses are marked hashable or mergeable, but that test > is mighty far away. More to the point, if they ever weren't OpExprs > the result would likely be to pass a bogus OID to get_commutator and > thus silently fail, allowing the problem to go undetected for a long > time. I'd suggest using castNode() rather than a hard-wired > assumption that the clause is an OpExpr. Good point. I've modified the code to use castNode(), and added comment accordingly. > * Do we really need to construct a whole new set of bogus operators > and opclasses to test this? As you noted, the regression tests > already set up an incomplete opclass for other purposes. Why can't > we extend that test, to reduce the amount of cycles wasted forevermore > on this rather trivial point? > > (I'm actually wondering whether we really need to memorialize this > with a regression test case at all. But I'd settle for minimizing > the amount of test cycles added.) At first I planned to use the alias type 'int8alias1' created in equivclass.sql for this test. However, I found that this type is not created yet when running join.sql. Perhaps it's more convenient to place this test in equivclass.sql to leverage the bogus operators created there, but the test does not seem to be related to 'broken' ECs. I'm not sure if equivclass.sql is the appropriate place. Do you think it works if we place this test in equivclass.sql and write a comment explaining why it's there, like attached? Now I’m also starting to wonder if this change actually warrants such a test. BTW, do you think this patch is worth backpatching? Thanks Richard
Attachment
On Tue, Sep 3, 2024 at 5:51 PM Richard Guo <guofenglinux@gmail.com> wrote: > The new test case fails starting from adf97c156, and we have to > install a hash opfamily and a hash function for the hacked int8alias1 > type to make the test case work again. > > Now, I'm more dubious about whether we really need to add a test case > for this change. I pushed this patch with the test case remaining, as it adds only a minimal number of test cycles. I explained in the commit message why the test case is included in equivclass.sql rather than in join.sql. I did not do backpatch because this bug cannot be reproduced without installing an incomplete opclass, which is unlikely to happen in practice. Thanks for the report and review. Thanks Richard
Hello Richard and Tom, 04.09.2024 06:50, Richard Guo wrote: > I pushed this patch with the test case remaining, as it adds only a > minimal number of test cycles. I explained in the commit message why > the test case is included in equivclass.sql rather than in join.sql. While playing with the equivclass test, I've discovered that the next step to define a complete set of operators in the test: @@ -65,6 +65,7 @@ procedure = int8alias1eq, leftarg = int8, rightarg = int8alias1, restrict = eqsel, join = eqjoinsel, +commutator = =, merges ); produces an internal error: ERROR: XX000: operator 32312 is not a member of opfamily 1976 LOCATION: get_op_opfamily_properties, lsyscache.c:149 pg_regress/equivclass BACKTRACE: get_op_opfamily_properties at lsyscache.c:149:3 MJExamineQuals at nodeMergejoin.c:228:19 ExecInitMergeJoin at nodeMergejoin.c:1608:25 ExecInitNode at execProcnode.c:303:27 InitPlan at execMain.c:964:14 Maybe the error itself is not that unexpected, but I'm confused by a comment above the function: * Caller should already have verified that opno is a member of opfamily, * therefore we raise an error if the tuple is not found. */ void get_op_opfamily_properties(Oid opno, Oid opfamily, bool ordering_op, int *strategy, Oid *lefttype, Oid *righttype) { HeapTuple tp; Form_pg_amop amop_tup; tp = SearchSysCache3(AMOPOPID, ObjectIdGetDatum(opno), CharGetDatum(ordering_op ? AMOP_ORDER : AMOP_SEARCH), ObjectIdGetDatum(opfamily)); if (!HeapTupleIsValid(tp)) elog(ERROR, "operator %u is not a member of opfamily %u", opno, opfamily); This behavior reproduced on commit a33cf1041, dated 2007-01-23, which added "ALTER OPERATOR FAMILY", but I think it also can be reproduced on previous commits with manual catalog editing. (The comment was added by a78fcfb51 from 2006-12-23, which introduced operator families.) Best regards, Alexander