Thread: IN, BETWEEN, spec compliance, and odd operator names
I was looking just now at gram.y's handling of various peculiar SQL constructs, and was reminded of a point that's bothered me before, but I don't recall if it's ever been discussed explicitly on -hackers. As an example, take the production for BETWEEN ASYMMETRIC: a_expr BETWEEN opt_asymmetric b_expr AND b_expr { $$ = (Node *) makeA_Expr(AEXPR_AND,NIL, (Node *) makeSimpleA_Expr(AEXPR_OP, ">=", $1, $4, @2), (Node *) makeSimpleA_Expr(AEXPR_OP, "<=", $1, $6, @2), @2); } Okay, this is a pretty direct implementation of how SQL99 defines the construct: 8.3 <between predicate> syntax rule 6 saith "X BETWEEN ASYMMETRIC Y AND Z" is equivalent to "X>=Y AND X<=Z". But it leaves me feeling dissatisfied. What if the datatype has standard comparison operators (as identified by a default btree opclass) but they're not named ">=" and "<=" ? Perhaps more plausibly, what if those operators exist but aren't in the current search path? The production for NOT BETWEEN is even more troubling: a_expr NOT BETWEEN opt_asymmetric b_expr AND b_expr { $$ = (Node *) makeA_Expr(AEXPR_OR,NIL, (Node *) makeSimpleA_Expr(AEXPR_OP, "<", $1, $5, @2), (Node *) makeSimpleA_Expr(AEXPR_OP, ">", $1, $7, @2), @2); } I can't object too much to the hardwired application of DeMorgan's law (NOT (A AND B) => (NOT A) OR (NOT B)) but what this also has is a hardwired assumption that "<" and ">" exist and are the negators of ">=" and "<=" respectively. Probably true, but let's see you find chapter and verse in the SQL spec to support that... Seems to me that what this boils down to is whether we want to read the spec literally ("it says the construct is defined in terms of operators named >= and <=, therefore we should do that") or by intent (obviously what they *want* is a construct that behaves sensibly in terms of the datatype's semantics). We are more than a bit schizophrenic on this point --- in different parts of the system you can find these things being done both ways. There is plenty of code that insists on finding a default btree opclass to define notions of "less" or "greater"; but we have these purely name-based transformations in gram.y, and I think there are some other parts of the parser that do similar things. I'm not particularly eager to start changing things in this area right now, but it seems to me that it'd be a good idea to establish a project policy about what we consider to be the preferred behavior, with an eye to eventually migrating the parts of the system that don't conform. My own feeling is that we should avoid imputing particular semantics to particular operator names, and so these constructs should always be defined by reference to operators found in a default opclass for the datatype, rather than by specific operator names. However, that way will likely take more code and cycles to implement than purely name-based definitions; and there is also the argument that it violates the in-so-many-words definitions given by the spec. Comments? regards, tom lane PS: there are some other issues here, like whether BETWEEN should be allowed to cause double evaluation of its left-hand argument, and whether we wouldn't like it to get reverse-listed by ruleutils.c in the original BETWEEN format rather than as an expanded version. However, what I'd like to focus on in this particular thread is the narrow issue of defining the constructs in terms of operator names vs operator semantics.
On Sun, Aug 24, 2008 at 09:24:23PM -0400, Tom Lane wrote: > My own feeling is that we should avoid imputing particular semantics > to particular operator names, and so these constructs should always be > defined by reference to operators found in a default opclass for the > datatype, rather than by specific operator names. However, that way > will likely take more code and cycles to implement than purely > name-based definitions; and there is also the argument that it violates > the in-so-many-words definitions given by the spec. ISTM the problem is that there's no easy way to refer to "operators found in a default opclass", so perhaps we could invent a construct: A OPERATOR(btree,2) B Which would refer to the second operator in the default btree operator class. The problem is inferring the type, if A and B are different types, which operator class do you use? When the BETWEEN construct is expanded, is there currently any guarentee that the chosen operators will actually be from the same operator class? As for the negators, I think the parser should simply wrap the whole expression in NOT and let the optimiser sort it out. You could then use it in other places, like LIKE optimisation or the ORDER BY clause. Essentially anywhere where currently there are assumptions about the real operator name is required. And then the optimiser can fill in the actual operator by which time it should be clear what it is. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Le lundi 25 août 2008, Martijn van Oosterhout a écrit : > ISTM the problem is that there's no easy way to refer to "operators > found in a default opclass", so perhaps we could invent a construct: Perhaps here again we're showing a need for some higher level semantics about operators and types. The case was recently raised about equality operators and operators transitivity, but I can't find the archive reference just now. Maybe we should add some common semantics to operators. CREATE OPERATOR would support some new clauses: IS_TRANSITIVE IS_EQUALITY IS_LT IS_LE ... Not sure about how to name those new clauses, obviously. It seems to me OPERATOR CLASS/FAMILY are all about how to index a given type, not how to offer semantics to type operators, so adding those clauses is "orthogonal" to the existing operator advanced notions. We could require any given type to provide at least an equality operator, whatever the name, and we could even provide equality operators for different types, or for types in the same categories. So int4 = int8 could use or could avoid using the same operator as int8 = int8, e.g. > assumptions about the real operator name is required. And then the > optimiser can fill in the actual operator by which time it should be > clear what it is. How would the planner get the estimated costs associated to any operator choice in this case? Regards, -- dim
"Dimitri Fontaine" <dfontaine@hi-media.com> writes: > Le lundi 25 août 2008, Martijn van Oosterhout a écrit : >> ISTM the problem is that there's no easy way to refer to "operators >> found in a default opclass", so perhaps we could invent a construct: > > Perhaps here again we're showing a need for some higher level semantics about > operators and types. The case was recently raised about equality operators > and operators transitivity, but I can't find the archive reference just now. > Maybe we should add some common semantics to operators. CREATE OPERATOR would > support some new clauses: > IS_TRANSITIVE > IS_EQUALITY > IS_LT > IS_LE > ... I'm not sure it's made explicit anywhere in the documentation but those properties *are* what btree operator classes define. You would end up duplicating the same structures (like, LT is meaningless unless you associate it with the corresponding EQUALITY, LE, GT, and GE operators). >> assumptions about the real operator name is required. And then the >> optimiser can fill in the actual operator by which time it should be >> clear what it is. > > How would the planner get the estimated costs associated to any operator > choice in this case? We don't need to evaluate costs until well after this point. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB'sPostgreSQL training!
Le lundi 25 août 2008, Gregory Stark a écrit : > I'm not sure it's made explicit anywhere in the documentation but those > properties *are* what btree operator classes define. You would end up > duplicating the same structures (like, LT is meaningless unless you > associate it with the corresponding EQUALITY, LE, GT, and GE operators). But, IIRC, only in the context of index searches, not at the planner level. ISTM the planner can't currently make strong assumptions on operators in order to simplify or analyze some parts of queries... > >> assumptions about the real operator name is required. And then the > >> optimiser can fill in the actual operator by which time it should be > >> clear what it is. > > > > How would the planner get the estimated costs associated to any operator > > choice in this case? > > We don't need to evaluate costs until well after this point. Oh, perfect then. I guess it simply shows I need to read up on PG architecture wrt to how planner/optimiser talk to each other. Regards, -- dim
On Monday 25 August 2008 04:24:23 Tom Lane wrote: > Seems to me that what this boils down to is whether we want to read the > spec literally ("it says the construct is defined in terms of operators > named >= and <=, therefore we should do that") or by intent (obviously > what they *want* is a construct that behaves sensibly in terms of the > datatype's semantics). I would be generally in favor of moving more toward a semantics-based approach instead of a name-based approach. It would be interesting to see an enumeration of all the parts of the system that hardcode operator names. Based on that we might gain more insight into what it would take to go one way or another.
Martijn van Oosterhout <kleptog@svana.org> writes: > On Sun, Aug 24, 2008 at 09:24:23PM -0400, Tom Lane wrote: >> My own feeling is that we should avoid imputing particular semantics >> to particular operator names, and so these constructs should always be >> defined by reference to operators found in a default opclass for the >> datatype, rather than by specific operator names. > ISTM the problem is that there's no easy way to refer to "operators > found in a default opclass", so perhaps we could invent a construct: > A OPERATOR(btree,2) B Huh? I don't understand why you think we need to expose this to users. A user would presumably just write the name of the operator he wants, if he's writing out a direct operator call. To me the issue is what we consider IN and BETWEEN and similar constructs to "mean", which in a datatype world boils down to choosing which of the datatype's operators to implement the construct with. > The problem is inferring the type, if A and B are > different types, which operator class do you use? Yeah, the cross-type problem occurred to me this morning too. For instance consider int4_var BETWEEN int8_var AND numeric_var The current implementation of BETWEEN acts fairly sanely in this case; although it ends up choosing two entirely unrelated operators (int48_ge and numeric_le, which are not in any of the same operator families), which is not great for subsequent optimization purposes. I'm not sure about how we'd make a good choice using an opclass-driven approach. Would we want to insist that both operators are found in the same family? Perhaps so, because otherwise it's not real clear that you've created a meaningful range constraint. Yet it's definitely possible that such a requirement would cause the query to fail where it used to work (for some value of "work"). Another way to approach it would be to consider the problem as being similar to overloaded-function resolution, ie think of it as trying to match an implicit or explicit function between(anyelement, anyelement, anyelement). But that would fail to take into account whether there are actually any suitable comparison operators for whichever common type it chooses. There's also an issue of not wanting to coerce variables unnecessarily. In the above example, suppose int4_var has an index. The derived clause int4_var int48_ge int8_var will be indexable using that index, whereas int4_var::numeric numeric_le numeric_var won't be, If we coerce all three inputs to numeric to enforce that we have consistent semantics on both ends of the range check, then neither end will be indexable; which seems like a step backwards. So it's all a bit harder than it looks. Thoughts? regards, tom lane
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Le lundi 25 ao�t 2008, Gregory Stark a �crit�: >> I'm not sure it's made explicit anywhere in the documentation but those >> properties *are* what btree operator classes define. You would end up >> duplicating the same structures (like, LT is meaningless unless you >> associate it with the corresponding EQUALITY, LE, GT, and GE operators). > But, IIRC, only in the context of index searches, not at the planner level. No, that's not true at all. There are lots and lots of places now where we use btree and/or hash operator classes to reason about the properties of operators. The most in-your-face example right now is ORDER BY: if you say ORDER BY x, that's gonna fail outright unless x's type has a default btree opclass. And if it does, the < member of the opclass is what defines the sort order. So we have certainly set some precedents involving using opclasses to make semantic choices. What's really bothering me is that we aren't consistent about it: some things like ORDER BY are opclass-driven, and some things like BETWEEN are operator-name-driven. regards, tom lane
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, Le 25 août 08 à 16:48, Tom Lane a écrit : >> But, IIRC, only in the context of index searches, not at the >> planner level. > > No, that's not true at all. There are lots and lots of places now > where > we use btree and/or hash operator classes to reason about the > properties > of operators. Yes, but always in relation to operator classes, so from BTrees opclass or such, which I refered to as "the context of index searches", as I don't really see any theorical need for opclass if it's not for indexing. My formulation was "outright wrong", as you would say, but I hope to have explained a little better what I'm on: there's not enough direct semantic information concerning operators for the planner to take full profit out if it. It this assertion more true? Regards, - -- dim -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Darwin) iEYEARECAAYFAkizBdsACgkQlBXRlnbh1blP5wCgh5h3vAn8EUonABN0ZYV58JQe xjMAoMpBNMiBLat1lfwGEz0w6rQip8LP =Lgxd -----END PGP SIGNATURE-----
Dimitri Fontaine <dfontaine@hi-media.com> writes: > ... I don't really see any theorical need for opclass if > it's not for indexing. Well, if we were starting from a green field, we might design it differently, but I see little point in changing the system structure now that it's established. You need *some* representation for the properties and relationships of sets of operators, and so far the btree and hash index opclasses have matched up quite well with what other parts of the system wanted to know; so why duplicate that information? > My formulation was "outright wrong", as you would say, but I hope to > have explained a little better what I'm on: there's not enough direct > semantic information concerning operators for the planner to take full > profit out if it. It this assertion more true? Not really. I can see an argument that there might be something we wish to know about groups of operators that can't reasonably be expressed within the opclass infrastructure, and in that situation maybe it would be time to invent something else. But it hasn't come up yet. Note that new properties of *individual* operators would just be handled by adding columns to pg_operator, and aren't really relevant to this discussion. regards, tom lane
> Yes, but always in relation to operator classes, so from BTrees opclass or > such, which I refered to as "the context of index searches", as I don't > really see any theorical need for opclass if it's not for indexing. IIRC analyze uses the btree op class to build the selectivity histogram. -Nathan
On Mon, Aug 25, 2008 at 10:43:30AM -0400, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > ISTM the problem is that there's no easy way to refer to "operators > > found in a default opclass", so perhaps we could invent a construct: > > > A OPERATOR(btree,2) B > > Huh? I don't understand why you think we need to expose this to users. > A user would presumably just write the name of the operator he wants, > if he's writing out a direct operator call. The user would, yes. I was talking about the internal representation. We're talking about the operator name that the parser is going to use when expanding the BETWEEN clause. Instead of using the string "<=" you use the string "btree.2" and then in the operator resolution code you treat that as an operator class lookup. The above construct doesn't actually work for the end-user because of a parser error. > To me the issue is what we consider IN and BETWEEN and similar > constructs to "mean", which in a datatype world boils down to choosing > which of the datatype's operators to implement the construct with. I was thinking the problem was using real operator names. Here we make up an unambiguous name that is not currently in use that can be resolved to the required operator on demand. If you define the lookup to always match the type of the left-hand value you can force consistant semantics. Whether its the semantics you want is another question. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout <kleptog@svana.org> writes: > On Mon, Aug 25, 2008 at 10:43:30AM -0400, Tom Lane wrote: >> To me the issue is what we consider IN and BETWEEN and similar >> constructs to "mean", which in a datatype world boils down to choosing >> which of the datatype's operators to implement the construct with. > I was thinking the problem was using real operator names. Here we make > up an unambiguous name that is not currently in use that can be > resolved to the required operator on demand. No, I don't think that really helps anything. The parser would just immediately move to the next step of trying to resolve the operator, so you've only introduced another bit of notation without actually solving any problems. The issue of concern here is *how* to identify the required operator, not exactly when/where it happens. > If you define the lookup to always match the type of the left-hand > value you can force consistant semantics. Whether its the semantics you > want is another question. Probably not, since it would fail outright in cases where the LHS has to be promoted in order to get a match (such as my int4 vs numeric example). We've allowed that in the past, so rejecting it would probably be a hard sell. regards, tom lane