Thread: IN, BETWEEN, spec compliance, and odd operator names

IN, BETWEEN, spec compliance, and odd operator names

From

Tom Lane

Date:

24 August 2008, 22:24:30

I was looking just now at gram.y's handling of various peculiar SQL
constructs, and was reminded of a point that's bothered me before,
but I don't recall if it's ever been discussed explicitly on -hackers.

As an example, take the production for BETWEEN ASYMMETRIC:
           a_expr BETWEEN opt_asymmetric b_expr AND b_expr               {                   $$ = (Node *)
makeA_Expr(AEXPR_AND,NIL,                       (Node *) makeSimpleA_Expr(AEXPR_OP, ">=", $1, $4, @2),
    (Node *) makeSimpleA_Expr(AEXPR_OP, "<=", $1, $6, @2),                                            @2);
}
 

Okay, this is a pretty direct implementation of how SQL99 defines the
construct: 8.3 <between predicate> syntax rule 6 saith
        "X BETWEEN ASYMMETRIC Y AND Z" is equivalent to "X>=Y AND X<=Z".

But it leaves me feeling dissatisfied.  What if the datatype has
standard comparison operators (as identified by a default btree opclass)
but they're not named ">=" and "<=" ?  Perhaps more plausibly, what if
those operators exist but aren't in the current search path?

The production for NOT BETWEEN is even more troubling:
           a_expr NOT BETWEEN opt_asymmetric b_expr AND b_expr               {                   $$ = (Node *)
makeA_Expr(AEXPR_OR,NIL,                       (Node *) makeSimpleA_Expr(AEXPR_OP, "<", $1, $5, @2),
  (Node *) makeSimpleA_Expr(AEXPR_OP, ">", $1, $7, @2),                                            @2);
}

I can't object too much to the hardwired application of DeMorgan's law
(NOT (A AND B) => (NOT A) OR (NOT B)) but what this also has is a
hardwired assumption that "<" and ">" exist and are the negators of
">=" and "<=" respectively.  Probably true, but let's see you find
chapter and verse in the SQL spec to support that...


Seems to me that what this boils down to is whether we want to read the
spec literally ("it says the construct is defined in terms of operators
named >= and <=, therefore we should do that") or by intent (obviously
what they *want* is a construct that behaves sensibly in terms of the
datatype's semantics).

We are more than a bit schizophrenic on this point --- in different
parts of the system you can find these things being done both ways.
There is plenty of code that insists on finding a default btree opclass
to define notions of "less" or "greater"; but we have these purely
name-based transformations in gram.y, and I think there are some other
parts of the parser that do similar things.

I'm not particularly eager to start changing things in this area right
now, but it seems to me that it'd be a good idea to establish a project
policy about what we consider to be the preferred behavior, with an eye
to eventually migrating the parts of the system that don't conform.

My own feeling is that we should avoid imputing particular semantics
to particular operator names, and so these constructs should always be
defined by reference to operators found in a default opclass for the
datatype, rather than by specific operator names.  However, that way
will likely take more code and cycles to implement than purely
name-based definitions; and there is also the argument that it violates
the in-so-many-words definitions given by the spec.

Comments?
        regards, tom lane

PS: there are some other issues here, like whether BETWEEN should be
allowed to cause double evaluation of its left-hand argument, and
whether we wouldn't like it to get reverse-listed by ruleutils.c
in the original BETWEEN format rather than as an expanded version.
However, what I'd like to focus on in this particular thread is the
narrow issue of defining the constructs in terms of operator names
vs operator semantics.

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Martijn van Oosterhout

Date:

25 August 2008, 03:31:33

On Sun, Aug 24, 2008 at 09:24:23PM -0400, Tom Lane wrote:
> My own feeling is that we should avoid imputing particular semantics
> to particular operator names, and so these constructs should always be
> defined by reference to operators found in a default opclass for the
> datatype, rather than by specific operator names.  However, that way
> will likely take more code and cycles to implement than purely
> name-based definitions; and there is also the argument that it violates
> the in-so-many-words definitions given by the spec.

ISTM the problem is that there's no easy way to refer to "operators
found in a default opclass", so perhaps we could invent a construct:

A OPERATOR(btree,2) B

Which would refer to the second operator in the default btree
operator class. The problem is inferring the type, if A and B are
different types, which operator class do you use? When the BETWEEN
construct is expanded, is there currently any guarentee that the chosen
operators will actually be from the same operator class?

As for the negators, I think the parser should simply wrap the whole
expression in NOT and let the optimiser sort it out.

You could then use it in other places, like LIKE optimisation or the
ORDER BY clause. Essentially anywhere where currently there are
assumptions about the real operator name is required. And then the
optimiser can fill in the actual operator by which time it should be
clear what it is.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Dimitri Fontaine

Date:

25 August 2008, 05:40:33

Le lundi 25 août 2008, Martijn van Oosterhout a écrit :
> ISTM the problem is that there's no easy way to refer to "operators
> found in a default opclass", so perhaps we could invent a construct:

Perhaps here again we're showing a need for some higher level semantics about
operators and types. The case was recently raised about equality operators
and operators transitivity, but I can't find the archive reference just now.
Maybe we should add some common semantics to operators. CREATE OPERATOR would
support some new clauses: IS_TRANSITIVE IS_EQUALITY IS_LT IS_LE ...

Not sure about how to name those new clauses, obviously.
It seems to me OPERATOR CLASS/FAMILY are all about how to index a given type,
not how to offer semantics to type operators, so adding those clauses
is "orthogonal" to the existing operator advanced notions.

We could require any given type to provide at least an equality operator,
whatever the name, and we could even provide equality operators for different
types, or for types in the same categories. So int4 = int8 could use or could
avoid using the same operator as int8 = int8, e.g.

> assumptions about the real operator name is required. And then the
> optimiser can fill in the actual operator by which time it should be
> clear what it is.

How would the planner get the estimated costs associated to any operator
choice in this case?

Regards,
--
dim

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Gregory Stark

Date:

25 August 2008, 06:23:51

"Dimitri Fontaine" <dfontaine@hi-media.com> writes:

> Le lundi 25 août 2008, Martijn van Oosterhout a écrit :
>> ISTM the problem is that there's no easy way to refer to "operators
>> found in a default opclass", so perhaps we could invent a construct:
>
> Perhaps here again we're showing a need for some higher level semantics about
> operators and types. The case was recently raised about equality operators
> and operators transitivity, but I can't find the archive reference just now.
> Maybe we should add some common semantics to operators. CREATE OPERATOR would
> support some new clauses:
>   IS_TRANSITIVE
>   IS_EQUALITY
>   IS_LT
>   IS_LE
>   ...

I'm not sure it's made explicit anywhere in the documentation but those
properties *are* what btree operator classes define. You would end up
duplicating the same structures (like, LT is meaningless unless you associate
it with the corresponding EQUALITY, LE, GT, and GE operators).

>> assumptions about the real operator name is required. And then the
>> optimiser can fill in the actual operator by which time it should be
>> clear what it is.
>
> How would the planner get the estimated costs associated to any operator
> choice in this case?

We don't need to evaluate costs until well after this point.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about
EnterpriseDB'sPostgreSQL training!

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Dimitri Fontaine

Date:

25 August 2008, 07:15:03

Le lundi 25 août 2008, Gregory Stark a écrit :
> I'm not sure it's made explicit anywhere in the documentation but those
> properties *are* what btree operator classes define. You would end up
> duplicating the same structures (like, LT is meaningless unless you
> associate it with the corresponding EQUALITY, LE, GT, and GE operators).

But, IIRC, only in the context of index searches, not at the planner level.
ISTM the planner can't currently make strong assumptions on operators in
order to simplify or analyze some parts of queries...

> >> assumptions about the real operator name is required. And then the
> >> optimiser can fill in the actual operator by which time it should be
> >> clear what it is.
> >
> > How would the planner get the estimated costs associated to any operator
> > choice in this case?
>
> We don't need to evaluate costs until well after this point.

Oh, perfect then. I guess it simply shows I need to read up on PG architecture
wrt to how planner/optimiser talk to each other.

Regards,
--
dim

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Peter Eisentraut

Date:

25 August 2008, 11:19:32

On Monday 25 August 2008 04:24:23 Tom Lane wrote:
> Seems to me that what this boils down to is whether we want to read the
> spec literally ("it says the construct is defined in terms of operators
> named >= and <=, therefore we should do that") or by intent (obviously
> what they *want* is a construct that behaves sensibly in terms of the
> datatype's semantics).

I would be generally in favor of moving more toward a semantics-based approach 
instead of a name-based approach.

It would be interesting to see an enumeration of all the parts of the system 
that hardcode operator names.  Based on that we might gain more insight into 
what it would take to go one way or another.

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Tom Lane

Date:

25 August 2008, 11:44:30

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sun, Aug 24, 2008 at 09:24:23PM -0400, Tom Lane wrote:
>> My own feeling is that we should avoid imputing particular semantics
>> to particular operator names, and so these constructs should always be
>> defined by reference to operators found in a default opclass for the
>> datatype, rather than by specific operator names.

> ISTM the problem is that there's no easy way to refer to "operators
> found in a default opclass", so perhaps we could invent a construct:

> A OPERATOR(btree,2) B

Huh?  I don't understand why you think we need to expose this to users.
A user would presumably just write the name of the operator he wants,
if he's writing out a direct operator call.

To me the issue is what we consider IN and BETWEEN and similar
constructs to "mean", which in a datatype world boils down to choosing
which of the datatype's operators to implement the construct with.

> The problem is inferring the type, if A and B are
> different types, which operator class do you use?

Yeah, the cross-type problem occurred to me this morning too.  For
instance consider
int4_var BETWEEN int8_var AND numeric_var

The current implementation of BETWEEN acts fairly sanely in this case;
although it ends up choosing two entirely unrelated operators (int48_ge
and numeric_le, which are not in any of the same operator families),
which is not great for subsequent optimization purposes.  I'm not sure
about how we'd make a good choice using an opclass-driven approach.
Would we want to insist that both operators are found in the same
family?  Perhaps so, because otherwise it's not real clear that you've
created a meaningful range constraint.  Yet it's definitely possible
that such a requirement would cause the query to fail where it used to
work (for some value of "work").

Another way to approach it would be to consider the problem as being
similar to overloaded-function resolution, ie think of it as trying
to match an implicit or explicit function between(anyelement,
anyelement, anyelement).  But that would fail to take into account
whether there are actually any suitable comparison operators for
whichever common type it chooses.

There's also an issue of not wanting to coerce variables unnecessarily.
In the above example, suppose int4_var has an index.  The derived
clause int4_var int48_ge int8_var will be indexable using that index,
whereas int4_var::numeric numeric_le numeric_var won't be,  If we coerce
all three inputs to numeric to enforce that we have consistent semantics
on both ends of the range check, then neither end will be indexable;
which seems like a step backwards.

So it's all a bit harder than it looks.  Thoughts?
        regards, tom lane

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Tom Lane

Date:

25 August 2008, 11:49:45

Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Le lundi 25 ao�t 2008, Gregory Stark a �crit�:
>> I'm not sure it's made explicit anywhere in the documentation but those
>> properties *are* what btree operator classes define. You would end up
>> duplicating the same structures (like, LT is meaningless unless you
>> associate it with the corresponding EQUALITY, LE, GT, and GE operators).

> But, IIRC, only in the context of index searches, not at the planner level. 

No, that's not true at all.  There are lots and lots of places now where
we use btree and/or hash operator classes to reason about the properties
of operators.  The most in-your-face example right now is ORDER BY: if
you say ORDER BY x, that's gonna fail outright unless x's type has a
default btree opclass.  And if it does, the < member of the opclass is
what defines the sort order.

So we have certainly set some precedents involving using opclasses to
make semantic choices.  What's really bothering me is that we aren't
consistent about it: some things like ORDER BY are opclass-driven,
and some things like BETWEEN are operator-name-driven.
        regards, tom lane

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Dimitri Fontaine

Date:

25 August 2008, 16:20:10

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

Le 25 août 08 à 16:48, Tom Lane a écrit :
>> But, IIRC, only in the context of index searches, not at the
>> planner level.
>
> No, that's not true at all.  There are lots and lots of places now
> where
> we use btree and/or hash operator classes to reason about the
> properties
> of operators.

Yes, but always in relation to operator classes, so from BTrees
opclass or such, which I refered to as "the context of index
searches", as I don't really see any theorical need for opclass if
it's not for indexing.
My formulation was "outright wrong", as you would say, but I hope to
have explained a little better what I'm on: there's not enough direct
semantic information concerning operators for the planner to take full
profit out if it. It this assertion more true?

Regards,
- --
dim


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iEYEARECAAYFAkizBdsACgkQlBXRlnbh1blP5wCgh5h3vAn8EUonABN0ZYV58JQe
xjMAoMpBNMiBLat1lfwGEz0w6rQip8LP
=Lgxd
-----END PGP SIGNATURE-----

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Tom Lane

Date:

25 August 2008, 16:44:19

Dimitri Fontaine <dfontaine@hi-media.com> writes:
> ... I don't really see any theorical need for opclass if  
> it's not for indexing.

Well, if we were starting from a green field, we might design it
differently, but I see little point in changing the system structure
now that it's established.  You need *some* representation for the
properties and relationships of sets of operators, and so far the btree
and hash index opclasses have matched up quite well with what other
parts of the system wanted to know; so why duplicate that information?

> My formulation was "outright wrong", as you would say, but I hope to  
> have explained a little better what I'm on: there's not enough direct  
> semantic information concerning operators for the planner to take full  
> profit out if it. It this assertion more true?

Not really.  I can see an argument that there might be something we wish
to know about groups of operators that can't reasonably be expressed
within the opclass infrastructure, and in that situation maybe it would
be time to invent something else.  But it hasn't come up yet.

Note that new properties of *individual* operators would just be handled
by adding columns to pg_operator, and aren't really relevant to this
discussion.
        regards, tom lane

Re: IN, BETWEEN, spec compliance, and odd operator names

From

"Nathan Boley"

Date:

25 August 2008, 16:51:21

> Yes, but always in relation to operator classes, so from BTrees opclass or
> such, which I refered to as "the context of index searches", as I don't
> really see any theorical need for opclass if it's not for indexing.

IIRC analyze uses the btree op class to build the selectivity histogram.

-Nathan

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Martijn van Oosterhout

Date:

25 August 2008, 16:54:13

On Mon, Aug 25, 2008 at 10:43:30AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > ISTM the problem is that there's no easy way to refer to "operators
> > found in a default opclass", so perhaps we could invent a construct:
>
> > A OPERATOR(btree,2) B
>
> Huh?  I don't understand why you think we need to expose this to users.
> A user would presumably just write the name of the operator he wants,
> if he's writing out a direct operator call.

The user would, yes. I was talking about the internal representation.
We're talking about the operator name that the parser is going to use
when expanding the BETWEEN clause. Instead of using the string "<=" you
use the string "btree.2" and then in the operator resolution code you
treat that as an operator class lookup. The above construct doesn't
actually work for the end-user because of a parser error.

> To me the issue is what we consider IN and BETWEEN and similar
> constructs to "mean", which in a datatype world boils down to choosing
> which of the datatype's operators to implement the construct with.

I was thinking the problem was using real operator names. Here we make
up an unambiguous name that is not currently in use that can be
resolved to the required operator on demand.

If you define the lookup to always match the type of the left-hand
value you can force consistant semantics. Whether its the semantics you
want is another question.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: IN, BETWEEN, spec compliance, and odd operator names

From

Tom Lane

Date:

25 August 2008, 17:01:43

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Mon, Aug 25, 2008 at 10:43:30AM -0400, Tom Lane wrote:
>> To me the issue is what we consider IN and BETWEEN and similar
>> constructs to "mean", which in a datatype world boils down to choosing
>> which of the datatype's operators to implement the construct with.

> I was thinking the problem was using real operator names. Here we make
> up an unambiguous name that is not currently in use that can be
> resolved to the required operator on demand.

No, I don't think that really helps anything.  The parser would just
immediately move to the next step of trying to resolve the operator,
so you've only introduced another bit of notation without actually
solving any problems.  The issue of concern here is *how* to identify
the required operator, not exactly when/where it happens.

> If you define the lookup to always match the type of the left-hand
> value you can force consistant semantics. Whether its the semantics you
> want is another question.

Probably not, since it would fail outright in cases where the LHS has
to be promoted in order to get a match (such as my int4 vs numeric
example).  We've allowed that in the past, so rejecting it would
probably be a hard sell.
        regards, tom lane