Thread: Better management of mergejoinable operators

Better management of mergejoinable operators

From
Tom Lane
Date:
I noticed today that process_implied_equality() still contains an ugly
hack that should have been got rid of awhile ago: it assumes that
mergejoinable operators are named "=".  This has been a bogus assumption
for several releases, as illustrated by this failure:

regression=# select * from text_tbl a,text_tbl b,text_tbl c where a.f1 ~=~ b.f1 and b.f1 ~=~ c.f1;
ERROR:  equality operator for types text and text should be merge-joinable, but isn't

It can also be fooled by schema-search-path issues, if the needed
operator exists but isn't in the path.  Since we've not heard complaints
from the field about this, I'm not feeling urgent about having a
back-patchable solution, but I want to find one going forward.

What is actually needed in this function is to be able to find a
mergejoinable equality operator whose oprlsortop and oprrsortop are the
two sortops already known for the input pathkey columns.  We have a
couple of problems doing that though: first, with the present system
catalog layout there seems no way to do that short of a seqscan through
all of pg_operator; and second, what if there's not a unique answer,
ie, multiple equality operators alleging the same lsortop/rsortop?

Right offhand I cannot see a reason why there should be different
equality operators with the same sortops.  (If anyone can come up with
a plausible scenario for that, stop me here...)  So what I'm thinking
about is a unique index on oprlsortop/oprrsortop; that would both allow
efficient search, and prevent multiple answers.

Now we can't do that directly because most of the entries in pg_operator
in fact contain zeroes in these columns, and would cause uniqueness
failures.  Probably the cleanest answer would be to allow these two
columns to be NULL, not zero, when not meaningful; but that would be a
bit of a mess to implement because of the code's assumption of fixed
layout for pg_operator tuples.

What I'm considering doing is moving the oprlsortop/oprrsortop/
oprltcmpop/oprgtcmpop fields out of pg_operator and into a new auxiliary
catalog, named say pg_mergejoinop, that would have entries only for
mergejoinable equality operators.  This would have the same kind of
relationship to pg_operator that pg_aggregate has to pg_proc: if a
pg_operator entry has "oprcanmerge" true, then there's an extension
row for it in pg_mergejoinop.  The catalog would be fairly small and
cheap to search (48 entries in a default install, as of CVS head),
and could support a unique index to constrain the oprlsortop/oprrsortop
columns.

Comments?
        regards, tom lane


Re: Better management of mergejoinable operators

From
Michael Glaesemann
Date:
On Dec 13, 2006, at 7:56 , Tom Lane wrote:

> Right offhand I cannot see a reason why there should be different
> equality operators with the same sortops.  (If anyone can come up with
> a plausible scenario for that, stop me here...)  So what I'm thinking
> about is a unique index on oprlsortop/oprrsortop; that would both  
> allow
> efficient search, and prevent multiple answers.

I think this makes sense. Would this be affected at all by equality  
of text strings, taking into account locale? Or would there be  
equality for text in each locale (so oprlsortop and oprrsortop would  
always be not only the same type (text) but also of the same locale)?  
I'd think this is would be the case so it wouldn't end up being a  
problem.

Michael Glaesemann
grzm seespotcode net




Re: Better management of mergejoinable operators

From
Tom Lane
Date:
Michael Glaesemann <grzm@seespotcode.net> writes:
> On Dec 13, 2006, at 7:56 , Tom Lane wrote:
>> Right offhand I cannot see a reason why there should be different
>> equality operators with the same sortops.  (If anyone can come up with
>> a plausible scenario for that, stop me here...)

> I think this makes sense. Would this be affected at all by equality  
> of text strings, taking into account locale?

If it is, then we'd have far greater problems to deal with than just
this one --- the entire operator/function structure is built on the
assumption that there is, say, only one "=" between any two datatypes.
I think if locale wants actually different operators then it will have
to make strings of different locales be distinct datatypes.

It's probably a lot more practical to keep text as just one datatype and
store the locale indicator as part of each value.  (There's also been
some blue sky thoughts about trying to keep it in typmod, but that
wouldn't result in multiple operators either.)
        regards, tom lane


Re: Better management of mergejoinable operators

From
Tom Lane
Date:
I wrote:
>>> Right offhand I cannot see a reason why there should be different
>>> equality operators with the same sortops.  (If anyone can come up with
>>> a plausible scenario for that, stop me here...)

BTW, I think it's possible to prove that there need never be two for the
case of both sides the same datatype.  If we have a sortop "A < B" on a
single datatype, then its commutator is well defined: "A > B" if and
only if "B < A".  And by the trichotomy law, "A = B" must be true in
exactly those cases for which neither "A < B" nor "A > B".  So there is
only one possible behavior for an equality operator that is consistent
with the sortop.  (This is, in fact, the reason that we can get away
with considering a single sortop as fully specifying a sort order.)

This argument doesn't immediately go through if A and B are of different
datatypes, but it's pretty hard to think of a case where it wouldn't
hold.
        regards, tom lane


Re: Better management of mergejoinable operators

From
Michael Glaesemann
Date:
On Dec 13, 2006, at 8:45 , Tom Lane wrote:

> the entire operator/function structure is built on the
> assumption that there is, say, only one "=" between any two datatypes.

You mean only on "=" between any two values of a given datatype? Or  
is there something else I'm missing? So what you're doing will just  
reinforce that.

Michael Glaesemann
grzm seespotcode net




Re: Better management of mergejoinable operators

From
Michael Glaesemann
Date:
On Dec 13, 2006, at 12:33 , Michael Glaesemann wrote:

>
> On Dec 13, 2006, at 8:45 , Tom Lane wrote:
>
>> the entire operator/function structure is built on the
>> assumption that there is, say, only one "=" between any two  
>> datatypes.
>
> You mean only on "=" between any two values of a given datatype?

Ignore that. :) if that were true, you wouldn't need to have both  
left and right argument types. I think I got it now.


Michael Glaesemann
grzm seespotcode net




Re: Better management of mergejoinable operators

From
Andrew - Supernews
Date:
On 2006-12-13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>>>> Right offhand I cannot see a reason why there should be different
>>>> equality operators with the same sortops.  (If anyone can come up with
>>>> a plausible scenario for that, stop me here...)
>
> BTW, I think it's possible to prove that there need never be two for the
> case of both sides the same datatype.  If we have a sortop "A < B" on a
> single datatype, then its commutator is well defined: "A > B" if and
> only if "B < A".  And by the trichotomy law, "A = B" must be true in
> exactly those cases for which neither "A < B" nor "A > B".  So there is
> only one possible behavior for an equality operator that is consistent
> with the sortop.

Counterexample even for a single data type: define an operator x =* y
which is true when 2x = y.  This is mergejoinable using the following
operators: SORT1 = <, SORT2 = <, LTCMP = (2x < y), RTCMP = (2x > y)
(which is of course the same sortops as for regular =).

The LTCMP and GTCMP operators imply a unique join operator due to
trichotomy, but this is not true for the sortops. While the above is
a bit contrived, I think non-contrived examples could be found too.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Better management of mergejoinable operators

From
Tom Lane
Date:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
> On 2006-12-13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, I think it's possible to prove that there need never be two for the
>> case of both sides the same datatype.

> Counterexample even for a single data type: define an operator x =* y
> which is true when 2x = y.  This is mergejoinable using the following
> operators: SORT1 = <, SORT2 = <, LTCMP = (2x < y), RTCMP = (2x > y)
> (which is of course the same sortops as for regular =).

I think not --- the corresponding sort operators would have to be
"2x < y" etc, else the trichotomy law fails, and so do all standard
sort algorithms.
        regards, tom lane


Re: Better management of mergejoinable operators

From
Andrew - Supernews
Date:
On 2006-12-13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew - Supernews <andrew+nonews@supernews.com> writes:
>> On 2006-12-13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> BTW, I think it's possible to prove that there need never be two for the
>>> case of both sides the same datatype.
>
>> Counterexample even for a single data type: define an operator x =* y
>> which is true when 2x = y.  This is mergejoinable using the following
>> operators: SORT1 = <, SORT2 = <, LTCMP = (2x < y), RTCMP = (2x > y)
>> (which is of course the same sortops as for regular =).
>
> I think not --- the corresponding sort operators would have to be
> "2x < y" etc, else the trichotomy law fails, and so do all standard
> sort algorithms.

No, because if x < x' then 2x < 2x'.  Or to put it another way, doing
a merge join on (2x = y) simply requires matching the sorted lists of
x's and y's against each other in a different place, rather than changing
the sort order of either.

You're suffering from a fundamental confusion between the ltcmp/rtcmp
operators (which indeed must be trichotomous with the join operator)
and the sort operators.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Better management of mergejoinable operators

From
Tom Lane
Date:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
> You're suffering from a fundamental confusion between the ltcmp/rtcmp
> operators (which indeed must be trichotomous with the join operator)
> and the sort operators.

[ thinks for awhile... ]  OK, you have a point, but if we want to take
that seriously then we have to invent a different concept that supports
what the planner needs to do, ie, draw transitive-equality inferences
("if A = B and B = C then A = C").

The real question on the table is whether it's worth distinguishing
between mergejoinable equality operators and transitive equality
operators.  I suggest that it probably isn't --- do you have any
examples with more real-world application than the x = 2y case?
        regards, tom lane


Re: Better management of mergejoinable operators

From
Tom Lane
Date:
I wrote:
> The real question on the table is whether it's worth distinguishing
> between mergejoinable equality operators and transitive equality
> operators.  I suggest that it probably isn't --- do you have any
> examples with more real-world application than the x = 2y case?

The proposal I just sent in effectively eliminates the concept of a
mergejoinable operator as such --- instead, it uses btree opclass
semantics to decide what's mergejoinable.  I believe this eliminates
the possibility of using mergejoins for cases like Andrew's x = 2y
operator.  Again, has anyone got any real-world examples where it'd
be important to be able to handle such things via mergejoin?

(Note: you can of course mergejoin a query like "WHERE x = 2*y", because
the *operator* is still the vanilla mergejoinable equality.  Funny stuff
in the computation of the merge keys isn't a problem.)
        regards, tom lane