Thread: Better management of mergejoinable operators
I noticed today that process_implied_equality() still contains an ugly hack that should have been got rid of awhile ago: it assumes that mergejoinable operators are named "=". This has been a bogus assumption for several releases, as illustrated by this failure: regression=# select * from text_tbl a,text_tbl b,text_tbl c where a.f1 ~=~ b.f1 and b.f1 ~=~ c.f1; ERROR: equality operator for types text and text should be merge-joinable, but isn't It can also be fooled by schema-search-path issues, if the needed operator exists but isn't in the path. Since we've not heard complaints from the field about this, I'm not feeling urgent about having a back-patchable solution, but I want to find one going forward. What is actually needed in this function is to be able to find a mergejoinable equality operator whose oprlsortop and oprrsortop are the two sortops already known for the input pathkey columns. We have a couple of problems doing that though: first, with the present system catalog layout there seems no way to do that short of a seqscan through all of pg_operator; and second, what if there's not a unique answer, ie, multiple equality operators alleging the same lsortop/rsortop? Right offhand I cannot see a reason why there should be different equality operators with the same sortops. (If anyone can come up with a plausible scenario for that, stop me here...) So what I'm thinking about is a unique index on oprlsortop/oprrsortop; that would both allow efficient search, and prevent multiple answers. Now we can't do that directly because most of the entries in pg_operator in fact contain zeroes in these columns, and would cause uniqueness failures. Probably the cleanest answer would be to allow these two columns to be NULL, not zero, when not meaningful; but that would be a bit of a mess to implement because of the code's assumption of fixed layout for pg_operator tuples. What I'm considering doing is moving the oprlsortop/oprrsortop/ oprltcmpop/oprgtcmpop fields out of pg_operator and into a new auxiliary catalog, named say pg_mergejoinop, that would have entries only for mergejoinable equality operators. This would have the same kind of relationship to pg_operator that pg_aggregate has to pg_proc: if a pg_operator entry has "oprcanmerge" true, then there's an extension row for it in pg_mergejoinop. The catalog would be fairly small and cheap to search (48 entries in a default install, as of CVS head), and could support a unique index to constrain the oprlsortop/oprrsortop columns. Comments? regards, tom lane
On Dec 13, 2006, at 7:56 , Tom Lane wrote: > Right offhand I cannot see a reason why there should be different > equality operators with the same sortops. (If anyone can come up with > a plausible scenario for that, stop me here...) So what I'm thinking > about is a unique index on oprlsortop/oprrsortop; that would both > allow > efficient search, and prevent multiple answers. I think this makes sense. Would this be affected at all by equality of text strings, taking into account locale? Or would there be equality for text in each locale (so oprlsortop and oprrsortop would always be not only the same type (text) but also of the same locale)? I'd think this is would be the case so it wouldn't end up being a problem. Michael Glaesemann grzm seespotcode net
Michael Glaesemann <grzm@seespotcode.net> writes: > On Dec 13, 2006, at 7:56 , Tom Lane wrote: >> Right offhand I cannot see a reason why there should be different >> equality operators with the same sortops. (If anyone can come up with >> a plausible scenario for that, stop me here...) > I think this makes sense. Would this be affected at all by equality > of text strings, taking into account locale? If it is, then we'd have far greater problems to deal with than just this one --- the entire operator/function structure is built on the assumption that there is, say, only one "=" between any two datatypes. I think if locale wants actually different operators then it will have to make strings of different locales be distinct datatypes. It's probably a lot more practical to keep text as just one datatype and store the locale indicator as part of each value. (There's also been some blue sky thoughts about trying to keep it in typmod, but that wouldn't result in multiple operators either.) regards, tom lane
I wrote: >>> Right offhand I cannot see a reason why there should be different >>> equality operators with the same sortops. (If anyone can come up with >>> a plausible scenario for that, stop me here...) BTW, I think it's possible to prove that there need never be two for the case of both sides the same datatype. If we have a sortop "A < B" on a single datatype, then its commutator is well defined: "A > B" if and only if "B < A". And by the trichotomy law, "A = B" must be true in exactly those cases for which neither "A < B" nor "A > B". So there is only one possible behavior for an equality operator that is consistent with the sortop. (This is, in fact, the reason that we can get away with considering a single sortop as fully specifying a sort order.) This argument doesn't immediately go through if A and B are of different datatypes, but it's pretty hard to think of a case where it wouldn't hold. regards, tom lane
On Dec 13, 2006, at 8:45 , Tom Lane wrote: > the entire operator/function structure is built on the > assumption that there is, say, only one "=" between any two datatypes. You mean only on "=" between any two values of a given datatype? Or is there something else I'm missing? So what you're doing will just reinforce that. Michael Glaesemann grzm seespotcode net
On Dec 13, 2006, at 12:33 , Michael Glaesemann wrote: > > On Dec 13, 2006, at 8:45 , Tom Lane wrote: > >> the entire operator/function structure is built on the >> assumption that there is, say, only one "=" between any two >> datatypes. > > You mean only on "=" between any two values of a given datatype? Ignore that. :) if that were true, you wouldn't need to have both left and right argument types. I think I got it now. Michael Glaesemann grzm seespotcode net
On 2006-12-13, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >>>> Right offhand I cannot see a reason why there should be different >>>> equality operators with the same sortops. (If anyone can come up with >>>> a plausible scenario for that, stop me here...) > > BTW, I think it's possible to prove that there need never be two for the > case of both sides the same datatype. If we have a sortop "A < B" on a > single datatype, then its commutator is well defined: "A > B" if and > only if "B < A". And by the trichotomy law, "A = B" must be true in > exactly those cases for which neither "A < B" nor "A > B". So there is > only one possible behavior for an equality operator that is consistent > with the sortop. Counterexample even for a single data type: define an operator x =* y which is true when 2x = y. This is mergejoinable using the following operators: SORT1 = <, SORT2 = <, LTCMP = (2x < y), RTCMP = (2x > y) (which is of course the same sortops as for regular =). The LTCMP and GTCMP operators imply a unique join operator due to trichotomy, but this is not true for the sortops. While the above is a bit contrived, I think non-contrived examples could be found too. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services
Andrew - Supernews <andrew+nonews@supernews.com> writes: > On 2006-12-13, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, I think it's possible to prove that there need never be two for the >> case of both sides the same datatype. > Counterexample even for a single data type: define an operator x =* y > which is true when 2x = y. This is mergejoinable using the following > operators: SORT1 = <, SORT2 = <, LTCMP = (2x < y), RTCMP = (2x > y) > (which is of course the same sortops as for regular =). I think not --- the corresponding sort operators would have to be "2x < y" etc, else the trichotomy law fails, and so do all standard sort algorithms. regards, tom lane
On 2006-12-13, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew - Supernews <andrew+nonews@supernews.com> writes: >> On 2006-12-13, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> BTW, I think it's possible to prove that there need never be two for the >>> case of both sides the same datatype. > >> Counterexample even for a single data type: define an operator x =* y >> which is true when 2x = y. This is mergejoinable using the following >> operators: SORT1 = <, SORT2 = <, LTCMP = (2x < y), RTCMP = (2x > y) >> (which is of course the same sortops as for regular =). > > I think not --- the corresponding sort operators would have to be > "2x < y" etc, else the trichotomy law fails, and so do all standard > sort algorithms. No, because if x < x' then 2x < 2x'. Or to put it another way, doing a merge join on (2x = y) simply requires matching the sorted lists of x's and y's against each other in a different place, rather than changing the sort order of either. You're suffering from a fundamental confusion between the ltcmp/rtcmp operators (which indeed must be trichotomous with the join operator) and the sort operators. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services
Andrew - Supernews <andrew+nonews@supernews.com> writes: > You're suffering from a fundamental confusion between the ltcmp/rtcmp > operators (which indeed must be trichotomous with the join operator) > and the sort operators. [ thinks for awhile... ] OK, you have a point, but if we want to take that seriously then we have to invent a different concept that supports what the planner needs to do, ie, draw transitive-equality inferences ("if A = B and B = C then A = C"). The real question on the table is whether it's worth distinguishing between mergejoinable equality operators and transitive equality operators. I suggest that it probably isn't --- do you have any examples with more real-world application than the x = 2y case? regards, tom lane
I wrote: > The real question on the table is whether it's worth distinguishing > between mergejoinable equality operators and transitive equality > operators. I suggest that it probably isn't --- do you have any > examples with more real-world application than the x = 2y case? The proposal I just sent in effectively eliminates the concept of a mergejoinable operator as such --- instead, it uses btree opclass semantics to decide what's mergejoinable. I believe this eliminates the possibility of using mergejoins for cases like Andrew's x = 2y operator. Again, has anyone got any real-world examples where it'd be important to be able to handle such things via mergejoin? (Note: you can of course mergejoin a query like "WHERE x = 2*y", because the *operator* is still the vanilla mergejoinable equality. Funny stuff in the computation of the merge keys isn't a problem.) regards, tom lane