Thread: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Zeugswetter Andreas SB
Date:
> > Yes, column = NULL should *never* return true according to the spec (it
> > should always return NULL in fact as stated).  The reason for breaking
> > with the spec is AFAIK to work with broken microsoft clients that seem to
> > think that =NULL is a meaningful test and generate queries using that.
> 
> Microsoft Access is the guilty party, IIRC.  I recently tried to stir up
> some interest in changing this behavior back to the standard, but
> apparently there are still too many people using broken versions of
> Access.

Actually I am not sure whether the column = NULL syntax is even defined 
or allowed in SQL92 (e.g. Informix interprets the NULL as column name in 
this context and errs out).
If not it would simply be an extension to the standard with the defined
behavior of beeing identical to "column is null".

Andreas


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Ivar Helbekkmo
Date:
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

> Actually I am not sure whether the column = NULL syntax is even defined 
> or allowed in SQL92 (e.g. Informix interprets the NULL as column name in 
> this context and errs out).

I don't have the standard handy, but I do have Joe Celko's book, "Data
& Databases: Concepts in Practice".  He says (in section 8.2, under
the heading "Multivalued Logic"):
       A NULL cannot be compared to another NULL or to a value       with what Dr. Codd called a theta operator and
what      programmers call a comparison operator (equal, not equal,       less than, greater than, and so forth).  This
resultsin       a three-valued logic, which has an UNKNOWN in addition       to TRUE and FALSE.  [...]  UNKNOWN is a
logicalvalue and       not the same as a NULL, which is a data value.  That is       why you have to say X IS [NOT]
NULLin SQL and not use       X = NULL instead.  Theta operators are expressions of the       form X <comp op> Y; when X
orY or both are NULL, theta       operators will return an UNKNOWN and not a NULL.
 

He goes on to explain three-valued logic in more detail, showing truth
tables according to Jan Lukasiewicz (the inventor of RPN), and says,
of SQL-92, that it "is comforting to see that [it has] the same truth
tables as the three-valued system of Lukasiewicz".  Further, he says:
       SQL-92 added a new predicate of the form
       <search condition> IS [NOT] TRUE | FALSE | UNKNOWN
       which will let you map any combination of three-valued       logic to the two Boolean values.

A quick test run with psql shows that PostgreSQL does not properly
implement three-valued logic: it does not recognize the UNKNOWN
keyword alongside TRUE and FALSE, in any situation.  It will also
return boolean truth values for comparisons with NULL values, using
them as "real" data values in the comparison.  Worse (IMHO), this is
not consistent: while a test for "column = NULL" will return rows
where that is true, and a test for "not column = NULL" will return the
rest, "column <> NULL" returns no rows!  This means that the theta
operators are not all treated the same way, which is surely wrong!

It seems to me that the idea of NULL as an unkown data value and
UNKNOWN as the corresponding truth value, combined with the rules for
propagation of NULL in mathematical operations, of UNKNOWN in truth
operations, and from NULL to UNKNOWN by theta operators, is a very
clean, intuitive way of handling these issues.  It feels right!  :-)

-tih
-- 
The basic difference is this: hackers build things, crackers break them.


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Stephan Szabo
Date:
On 7 Jun 2001, Tom Ivar Helbekkmo wrote:

> Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
> 
> > Actually I am not sure whether the column = NULL syntax is even defined 
> > or allowed in SQL92 (e.g. Informix interprets the NULL as column name in 
> > this context and errs out).
> 
> He goes on to explain three-valued logic in more detail, showing truth
> tables according to Jan Lukasiewicz (the inventor of RPN), and says,
> of SQL-92, that it "is comforting to see that [it has] the same truth
> tables as the three-valued system of Lukasiewicz".  Further, he says:
> 
>         SQL-92 added a new predicate of the form
> 
>         <search condition> IS [NOT] TRUE | FALSE | UNKNOWN
> 
>         which will let you map any combination of three-valued
>         logic to the two Boolean values.
> 
> A quick test run with psql shows that PostgreSQL does not properly
> implement three-valued logic: it does not recognize the UNKNOWN
> keyword alongside TRUE and FALSE, in any situation.  It will also
> return boolean truth values for comparisons with NULL values, using
> them as "real" data values in the comparison.  Worse (IMHO), this is
> not consistent: while a test for "column = NULL" will return rows
> where that is true, and a test for "not column = NULL" will return the
> rest, "column <> NULL" returns no rows!  This means that the theta
> operators are not all treated the same way, which is surely wrong!

That's the nature of the hack we're talking about.  It's a grammar level
hack to turn a specific sequence of tokens (= NULL) into IS NULL due
to a client's generated queries.  If you're comparing something other
than the constant NULL, it should do what is expected (ie, a comparison
between a NULL in a table or even CAST(NULL as INT4) does the "right"
thing).

I think adding IS UNKNOWN would probably be trivial (I think the code is
basically there in IS NULL.)




Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Peter Eisentraut
Date:
Zeugswetter Andreas SB writes:

> Actually I am not sure whether the column = NULL syntax is even defined
> or allowed in SQL92

I think you're right.

> (e.g. Informix interprets the NULL as column name in this context and
> errs out).

NULL is a reserved word, so this would be wrong.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Ivar Helbekkmo
Date:
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:

> That's the nature of the hack we're talking about.  It's a grammar
> level hack to turn a specific sequence of tokens (= NULL) into IS
> NULL due to a client's generated queries.

Aha!  Sorry -- I jumped in late in the discussion without checking
back to see how it started...

OK, I've already said that I like the cleanliness and orthogonality of
NULL is a missing data value, UNKNOWN is a missing truth value, both
propagate in expressions, comparisons with NULL generate UNKNOWN, and
you can use the special comparisons IS [NOT] NULL and IS [NOT] UNKNOWN
to get plain, two-valued Boolean truth values out of them.

The Microsoft compatibility hack is ugly, and should be either a)
removed, b) expanded to include the other comparison operators and
documented as a PostgreSQL proprietary extension, or c) made into a
special feature that's turned on at will by a SET command.  I would
applaud a), approve of c), and be dismayed by b).

> I think adding IS UNKNOWN would probably be trivial (I think the
> code is basically there in IS NULL.)

But if it's implemented, shouldn't the code also differentiate between
UNKNOWN and NULL, by not (as now) using the latter to represent the
former?  Or do I misunderstand how it's handled now?

-tih
-- 
The basic difference is this: hackers build things, crackers break them.


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Ivar Helbekkmo
Date:
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

> Actually I am not sure whether the column = NULL syntax is even
> defined or allowed in SQL92

I've just checked, by reading the relevant paragraphs and studying the
BNF, and the standard says that any comparison of the form X <comp op>
Y is unknown if X or Y (or both) is NULL, including the case where
NULL is given as an explicit constant.  So, SQL92 clearly demands that
"column = NULL" is UNKNOWN, never TRUE or FALSE.

-tih
-- 
The basic difference is this: hackers build things, crackers break them.


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Stephan Szabo
Date:
On 7 Jun 2001, Tom Ivar Helbekkmo wrote:

> Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> 
> > That's the nature of the hack we're talking about.  It's a grammar
> > level hack to turn a specific sequence of tokens (= NULL) into IS
> > NULL due to a client's generated queries.
> 
> Aha!  Sorry -- I jumped in late in the discussion without checking
> back to see how it started...
> 
> OK, I've already said that I like the cleanliness and orthogonality of
> NULL is a missing data value, UNKNOWN is a missing truth value, both
> propagate in expressions, comparisons with NULL generate UNKNOWN, and
> you can use the special comparisons IS [NOT] NULL and IS [NOT] UNKNOWN
> to get plain, two-valued Boolean truth values out of them.
>
> The Microsoft compatibility hack is ugly, and should be either a)
> removed, b) expanded to include the other comparison operators and
> documented as a PostgreSQL proprietary extension, or c) made into a
> special feature that's turned on at will by a SET command.  I would
> applaud a), approve of c), and be dismayed by b).

c is the most likely thing to happen probably.

> > I think adding IS UNKNOWN would probably be trivial (I think the
> > code is basically there in IS NULL.)
> 
> But if it's implemented, shouldn't the code also differentiate between
> UNKNOWN and NULL, by not (as now) using the latter to represent the
> former?  Or do I misunderstand how it's handled now?

Within what's there (using null as unknown), the two tests are nearly
identical and would probably be just a grammar change.  Creating a
separate unknown would be more difficult, and I'm not sure it's necessary
to make the distinction.  NULL is an unknown value, I'm not sure that
you'd need a separate unknown value specifically for booleans.



Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Lane
Date:
Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
> Actually I am not sure whether the column = NULL syntax is even defined 
> or allowed in SQL92 (e.g. Informix interprets the NULL as column name in 
> this context and errs out).

Strictly speaking, SQL92 would require you to writefoo = CAST (NULL AS type-of-foo)
However, we allow unadorned NULL in other contexts as a shorthand for
the CAST notation, so it's inconsistent of us to say that in this
context it means something different.

The real problem with accepting this Microsoftism is that it's a trap
for unwary programmers.  Case 1: someone who's not studied SQL in detail
might experiment with examples involving "foo = NULL" and jump to
reasonable but entirely incorrect conclusions about how comparisons
involving NULL operate.  Case 2: someone who *has* studied SQL, and is
also aware that we accept unadorned NULLs, will also draw the wrong
conclusions about what this construct will do.  Bottom line: this kluge
surprises everyone except those who already know it exists.  I don't
like systems that surprise their users in inconsistent ways.
        regards, tom lane


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Lane
Date:
Tom Ivar Helbekkmo <tih@kpnQwest.no> quotes:
>         ... This results in
>         a three-valued logic, which has an UNKNOWN in addition
>         to TRUE and FALSE.  [...]  UNKNOWN is a logical value and
>         not the same as a NULL, which is a data value.

SQL92 is not very clear about whether NULL and UNKNOWN are distinct,
but it is worth noticing that their truth tables for comparison
operators, and/or/not, etc, only mention unknown --- never null ---
as a possible value of a boolean condition.  SQL99 clarifies the
intent:
        The data type boolean comprises the distinct truth values true and        false. Unless prohibited by a NOT
NULLconstraint, the boolean        data type also supports the unknown truth value as the null value.        This
specificationdoes not make a distinction between the null        value of the boolean data type and the unknown truth
valuethat is        the result of an SQL <predicate>, <search condition>, or <boolean        value expression>; they
maybe used interchangeably to mean exactly        the same thing.
 

Which in fact is what Postgres does.

> A quick test run with psql shows that PostgreSQL does not properly
> implement three-valued logic: it does not recognize the UNKNOWN
> keyword alongside TRUE and FALSE, in any situation.

We do not currently have correct implementations of IS TRUE, IS FALSE,
or IS UNKNOWN (IS TRUE/FALSE are in there but give the wrong result
for null inputs).  This is on my to-do list to fix; not sure if the
master TODO list mentions it or not.  Actually it'd be a good project
for a newbie hacker who wants to learn about the backend's
expression-handling machinery.  Anyone want to take it on?

It's also worth noticing that our implementation of IS NULL isn't really
up to speed: the spec allows the argument to be a row value constructor,
not just a scalar.  But we mostly don't have support for row-value-
constructor expressions anyway (it's not an Entry SQL feature).
        regards, tom lane


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
"Joe Conway"
Date:
> We do not currently have correct implementations of IS TRUE, IS FALSE,
> or IS UNKNOWN (IS TRUE/FALSE are in there but give the wrong result
> for null inputs).  This is on my to-do list to fix; not sure if the
> master TODO list mentions it or not.  Actually it'd be a good project
> for a newbie hacker who wants to learn about the backend's
> expression-handling machinery.  Anyone want to take it on?
>

I'd like to finish up the has_table_privilege function over the next week or
so and then take this on. Can you point me in a direction to start looking?

Thanks,

Joe




Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Lane
Date:
"Joe Conway" <joe@conway-family.com> writes:
>> We do not currently have correct implementations of IS TRUE, IS FALSE,
>> or IS UNKNOWN (IS TRUE/FALSE are in there but give the wrong result
>> for null inputs).  This is on my to-do list to fix; not sure if the
>> master TODO list mentions it or not.  Actually it'd be a good project
>> for a newbie hacker who wants to learn about the backend's
>> expression-handling machinery.  Anyone want to take it on?

> I'd like to finish up the has_table_privilege function over the next week or
> so and then take this on. Can you point me in a direction to start looking?

The way things currently work is that gram.y translates "x IS TRUE" etc
to "x = true" etc.  This is wrong because it does the wrong thing for
null input.  Another objection is that it's impossible for ruleutils.c
to reverse-list the expression tree in its original form.

IS [NOT] NULL is handled a little differently: gram.y generates a
specialized Expr node, which parse_expr.c translates to a function call
on the specialized functions nullvalue() and nonnullvalue()
respectively.  I don't much care for this implementation either, again
partly because ruleutils.c has to be uglified to deal with it, but
partly because the optimizer can't cheaply recognize IS NULL tests
either.

I'd like to see all eight of these guys translated into a specialized
kind of expression node, called perhaps BooleanTest.  Actually, it'd
probably be wise to keep IS NULL separate from the six boolean tests,
with an eye to the future when it will need to support nonscalar
arguments.  So maybe BooleanTest and NullTest node types, each with a
field showing exactly which test is wanted.

Adding a new expression node type is a straightforward but rather
tedious exercise in teaching some dozens of places what to do with it.
A grep for existing expression node types, such as CaseExpr or
FieldSelect or RelabelType, will give you a good idea what needs to be
done.
        regards, tom lane


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Thomas Lockhart
Date:
> The real problem with accepting this Microsoftism is that it's a trap
> for unwary programmers.  Case 1: someone who's not studied SQL in detail
> might experiment with examples involving "foo = NULL" and jump to
> reasonable but entirely incorrect conclusions about how comparisons
> involving NULL operate.  Case 2: someone who *has* studied SQL, and is
> also aware that we accept unadorned NULLs, will also draw the wrong
> conclusions about what this construct will do.  Bottom line: this kluge
> surprises everyone except those who already know it exists.  I don't
> like systems that surprise their users in inconsistent ways.

These were all points that were brought up and discussed when the hack
was implemented. At the time, the trade between implementing a construct
which is *not allowed* in SQL9x vs enabling the M$Access community to
migrate to PostgreSQL was hashed over and a consensus was reached that
the benefits to allowing it outweighed the drawbacks.

I was of the initial opinion that we should not support M$ pathelogical
non-standard constructs (jeez, they should know better, and probably do,
so it is likely a pathetic attempt to lock in users). But since the
construct is not allowed (or useless), why would anyone feel they need
to use it?

Clearly it is not the case that "this kluge surprises everyone except
those who already know it exists." We have had a strong consensus that
SQL9x standard constructs should be the norm for PostgreSQL, so no need
to rehash that. The issue boils down to, as it did when it first came
up, whether we will make it easier for M$Access users to start migrating
to PostgreSQL. If newer versions of Access emit standard constructs,
then it would be even easier to say that we should jettison the
construct. But if not, istm that we are killing a usability feature on
principle, not from need, and it may be premature to do that.

Using a "SET xxx" switch is not a step in the right direction istm since
it requires an extra PostgreSQL-specific command to get things working
with Access. We aren't teaching SQL, we are trying to let people get to
their data.

btw, I *was* suprised to see the "IS UNKNOWN" construct. It's been
lurking in my reference books for years. afaict it is very uncommonly
used, since most RDBMSes accept IS NULL as an equivalent. But is is
trivial to add to the grammar.
                     - Thomas


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Thomas Lockhart
Date:
> The way things currently work is that gram.y translates "x IS TRUE" etc
> to "x = true" etc.  This is wrong because it does the wrong thing for
> null input.  Another objection is that it's impossible for ruleutils.c
> to reverse-list the expression tree in its original form.

fyi, in case it helps: we used to have gram.y translate these into
function calls, rather than the operator expression. But that precluded
the optimizer from ever having a shot at optimizing out "const = const"
kinds of expressions and other fluff.

If we go to a specialized node in the parse tree, then the optimizer
could be taught to handle that, which is better than the original
straight function call which would have masked things too much.
                   - Thomas


Re: Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Ivar Helbekkmo
Date:
Thomas Lockhart <lockhart@fourpalms.org> writes:

> But since the construct is not allowed (or useless), why would
> anyone feel they need to use it?

Because it isn't entirely useless, actually.  I agree that no
programmer in his right mind would write, by hand, a comparison
involving NULL, knowing that the truth value of that comparison is
required by the standard to be UNKNOWN (i.e. NULL).  However, I'm
looking at using machine generated SQL code (generated on the fly in
an interactive application) to implement a dynamically adapting set of
tables, rules/triggers and their supporting stored procedures, and
it's just a matter of time before the first "= NULL" happens to show
up in code generated like that.  I'd like it to behave according to
the standard when that situation occurs, and the standard says that
any comparison with NULL, even "NULL = NULL", is UNKNOWN.

-tih
-- 
The basic difference is this: hackers build things, crackers break them.


Re: Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Lane
Date:
Thomas Lockhart <lockhart@fourpalms.org> writes:
> Clearly it is not the case that "this kluge surprises everyone except
> those who already know it exists."

How can you argue that, when the topic comes up again every couple of
months?
        regards, tom lane


Re: Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Thomas Lockhart
Date:
> > Clearly it is not the case that "this kluge surprises everyone except
> > those who already know it exists."
> How can you argue that, when the topic comes up again every couple of
> months?

I'm not looking for an argument. But the statement is not factually
true: "suprises everyone" (most folks don't notice, and don't care much)
and "every couple of months" (??) stray far enough from facts that we
should get back on topic. Whatever the facts, the messages that we do
*not* get are just as significant: a relatively large portion of
feedback from users of both Access and PostgreSQL at the time the
"feature" was added indicated that it was a significant stumbling block
for those users, and those messages will start up anew without adequate
planning and implementation.

Since back then, we have some additional clarification that it occurs
only for users of Access/Forms, and that others are worried that it will
lead to difficulties for others under different circumstances (please
remember, or note if it is too long ago, that at the time I argued
against adding the "feature", but tried to listen to those who would
find it useful). I'm not ignoring those worries, but in the battle
between purity and usefulness we shouldn't always line up with the
strongest or loudest voice(s) that day, but try to respect and keep in
mind the others who have contributed to the discussion in the past.

That said, the issues should be broken down into managable chunks, but
imho forcing the last step (removal of the "= NULL" accomodation) first
is premature. How about phasing this so that we can accomodate the
long-standing issue of M$ interoperability (via ODBC improvements to
make it transparent) before ripping out the current workaround?

From Andreas' comments, it seems that for his application he would like
a different behavior, but frankly I'm not certain why the current
behavior would be detrimental in the use case he mentioned. If SQL92
requires that any query with "= NULL" be rejected as illegal (my books
aren't nearby at the moment), he might still want to have code at the
application level for some graceful handling of illegal values.
                           - Thomas


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
"Joe Conway"
Date:
> IS [NOT] NULL is handled a little differently: gram.y generates a
> specialized Expr node, which parse_expr.c translates to a function call
> on the specialized functions nullvalue() and nonnullvalue()
> respectively.  I don't much care for this implementation either, again
> partly because ruleutils.c has to be uglified to deal with it, but
> partly because the optimizer can't cheaply recognize IS NULL tests
> either.
>
> I'd like to see all eight of these guys translated into a specialized
> kind of expression node, called perhaps BooleanTest.  Actually, it'd
> probably be wise to keep IS NULL separate from the six boolean tests,
> with an eye to the future when it will need to support nonscalar
> arguments.  So maybe BooleanTest and NullTest node types, each with a
> field showing exactly which test is wanted.
>

Attached is a patch for a new NullTest node type for review and comment.
Since it didn't seem like there was consensus regarding removal of the "a =
null" conversion to "a is null" behavior, I left it in. It is worth
mentioning, however, that neither Oracle 8.1.6 or MSSQL 7 seem to support
this -- see below:

Oracle:
****************************************
SQL> select f1,f2 from foo where f2 = null;

no rows selected

MSSQL 7
****************************************
select f1,f2 from foo where f2 = null
f1          f2
----------- --------------------------------------------------

(0 row(s) affected)

PostgreSQL
****************************************
test=# select f1,f2 from foo where f2 = null;f1 | f2
----+---- 1 | 4 |
(2 rows)

In all 3 cases table foo has 4 rows, 2 of which have null values for f2.
Based on this, should support for the converting "a = null" to "a is null"
be dropped?

I also noticed that in PostgreSQL I can do the following (both before and
after this patch):   select f2 is null from foo;
whereas in both Oracle and MSSQL it causes a syntax error. Any thoughts on
this?

Thanks,

-- Joe

Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Lane
Date:
"Joe Conway" <joseph.conway@home.com> writes:
> Attached is a patch for a new NullTest node type for review and comment.

I assume you are just looking for review at this point; I would not
recommend applying to CVS until the BooleanTest part is done too.
(Since parsetree changes affect stored rules, the change really should
include a catversion.h increment, and thus it's best to bunch this sort
of change together to avoid forcing extra initdbs on other hackers.)
I'll look through the code later, but...

> Based on this, should support for the converting "a = null" to "a is null"
> be dropped?

My opinion on that is already on record ;-)

> I also noticed that in PostgreSQL I can do the following (both before and
> after this patch):
>     select f2 is null from foo;
> whereas in both Oracle and MSSQL it causes a syntax error. Any thoughts on
> this?

Curious; I'd have said that that is clearly within the spec.  Can anyone
check it on some other engines?
        regards, tom lane


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
"Joe Conway"
Date:
> I assume you are just looking for review at this point; I would not
> recommend applying to CVS until the BooleanTest part is done too.
> (Since parsetree changes affect stored rules, the change really should
> include a catversion.h increment, and thus it's best to bunch this sort
> of change together to avoid forcing extra initdbs on other hackers.)
> I'll look through the code later, but...

OK -- I'll get started on BooleanTest in the next day or two.

-- Joe



Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Lane
Date:
"Joe Conway" <joseph.conway@home.com> writes:
> I also noticed that in PostgreSQL I can do the following (both before and
> after this patch):
>     select f2 is null from foo;
> whereas in both Oracle and MSSQL it causes a syntax error. Any thoughts on
> this?

I dug into this further and discovered that indeed it is not SQL92
... but it is SQL99.  Amazingly enough, SQL92 doesn't allow boolean
expressions as a possible type of general expression:
        <value expression> ::=               <numeric value expression>             | <string value expression>
   | <datetime value expression>             | <interval value expression>
 

It only allows them as <search condition>s, which is to say WHERE,
HAVING, CASE WHEN, CHECK, and one or two other places.

But SQL99 gets it right:
        <value expression> ::=               <numeric value expression>             | <string value expression>
   | <datetime value expression>             | <interval value expression>             | <boolean value expression>
       | <user-defined type value expression>             | <row value expression>             | <reference value
expression>            | <collection value expression>
 

Looks like we're ahead of the curve here...
        regards, tom lane


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
"Joe Conway"
Date:
> > I assume you are just looking for review at this point; I would not
> > recommend applying to CVS until the BooleanTest part is done too.
> > (Since parsetree changes affect stored rules, the change really should
> > include a catversion.h increment, and thus it's best to bunch this sort
> > of change together to avoid forcing extra initdbs on other hackers.)
> > I'll look through the code later, but...
>

OK -- here's the full patch for review. It includes two new expression node
types: NullTest and BooleanTest. I have a couple of questions/comments
remaining WRT this:

-- Should I increment catversion.h as part of the patch (I didn't in this
patch), or is that usually centrally controlled by Bruce (or whomever
commits the change)?

-- IMHO, if we are going to keep the (a = null) to (a is null) conversion,
then there should also be a similar conversion from (a != null) to (a is not
null). Otherwise the two operations which may be expected to be
complimentary (as evidenced by at least one recent post) are not.

-- If I have interpreted SQL92 correctly UNKNOWN IS TRUE should return
FALSE, and UNKNOWN IS NOT TRUE is equivalent to NOT (UNKNOWN IS TRUE) ==>
TRUE. Is this correct?

Thanks,

-- Joe

Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Lane
Date:
"Joe Conway" <joseph.conway@home.com> writes:
> -- Should I increment catversion.h as part of the patch (I didn't in this
> patch), or is that usually centrally controlled by Bruce (or whomever
> commits the change)?

It's good to put the catversion bump into the patch, else the committer
might forget to do it.

> -- IMHO, if we are going to keep the (a = null) to (a is null) conversion,
> then there should also be a similar conversion from (a != null) to (a is not
> null). Otherwise the two operations which may be expected to be
> complimentary (as evidenced by at least one recent post) are not.

I'd resist this.  The only reason the =NULL hack is in there at all is
to support Access97.  We shouldn't extend the deviation from standards
further than the minimum needed to do that.  The hack is fundamentally
inconsistent anyway, and breaking our standards compliance further in
pursuit of bogus consistency seems misguided.

Personally I'd rather take out the =NULL conversion anyway...

> -- If I have interpreted SQL92 correctly UNKNOWN IS TRUE should return
> FALSE, and UNKNOWN IS NOT TRUE is equivalent to NOT (UNKNOWN IS TRUE) ==>
> TRUE. Is this correct?

Yes.  Table 15 is pretty illegible in the ASCII draft copies of SQL92
and SQL99, but the PDF version of SQL99 is okay, and it makes clear
what you'd expect:
input        IS TRUE        IS FALSE    IS UNKNOWN
true        true        false        falsefalse        false        true        falseunknown        false        false
     true
 

and then the NOT variants are defined as
x IS NOT foo    ==    NOT (x IS foo)


I'll try to look over and commit the patch later today.

For extra credit ;-) ... if you'd like to learn a little bit about the
optimizer, think about teaching clause_selectivity() in
optimizer/path/clausesel.c how to estimate the selectivity of these new
expression nodes.  In the case where the argument is a boolean column
that we have statistics for, it should be possible to derive the correct
answer (including accounting for NULLs).  If the argument is more
complex than that, you probably can't do anything really intelligent,
but you could handwave away NULLs and then compute the appropriate
function of the clause_selectivity() of the argument.
        regards, tom lane


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
"Joe Conway"
Date:
> > -- Should I increment catversion.h as part of the patch (I didn't in
this
> > patch), or is that usually centrally controlled by Bruce (or whomever
> > commits the change)?
>
> It's good to put the catversion bump into the patch, else the committer
> might forget to do it.
>OK -- will do next time.

> further than the minimum needed to do that.  The hack is fundamentally
> inconsistent anyway, and breaking our standards compliance further in
> pursuit of bogus consistency seems misguided.
>
> Personally I'd rather take out the =NULL conversion anyway...

I'd second that.


> Yes.  Table 15 is pretty illegible in the ASCII draft copies of SQL92
> and SQL99, but the PDF version of SQL99 is okay, and it makes clear
> what you'd expect:
>

OT -- I need to buy a copy of SQL99, but it seems to be split into several
parts (that didn't exist for SQL92). Which one (or more) are the most useful
for PostgreSQL hacking?


> For extra credit ;-) ... if you'd like to learn a little bit about the
> optimizer, think about teaching clause_selectivity() in
> optimizer/path/clausesel.c how to estimate the selectivity of these new
> expression nodes.  In the case where the argument is a boolean column
> that we have statistics for, it should be possible to derive the correct
> answer (including accounting for NULLs).  If the argument is more
> complex than that, you probably can't do anything really intelligent,
> but you could handwave away NULLs and then compute the appropriate
> function of the clause_selectivity() of the argument.

Sure, I love a challenge ;) -- I'll take a look.

One issue I noticed this morning with this patch is that 't' and 'f' are no
longer being implicitly cast into boolean, i.e. test=# select 't' is true;?column?
----------f
(1 row)

test=# select 't'::bool is true;?column?
----------t
(1 row)

Any thoughts on where look to fix this?

Thanks,

-- Joe




Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Lane
Date:
"Joe Conway" <joseph.conway@home.com> writes:
> OT -- I need to buy a copy of SQL99, but it seems to be split into several
> parts (that didn't exist for SQL92). Which one (or more) are the most useful
> for PostgreSQL hacking?

I find that Part 2 is almost the only one I ever look at.  I'm not even
sure what's in the other parts ...

> One issue I noticed this morning with this patch is that 't' and 'f' are no
> longer being implicitly cast into boolean, i.e. test=# select 't' is true;
>  ?column?
> ----------
>  f
> (1 row)

Now that you mention it, it looks like all of our constructs that expect
boolean fail to coerce unknown-type literals into bools.  The rest of
them raise errors, eg:
regression=# select not 't';ERROR:  argument to NOT is type 'unknown', not 'bool'

but this is pretty bogus.  ISTM all these places should try to coerce
their inputs to bool before they complain.  This involves calling
can_coerce_type and then coerce_type; it's tedious enough that it'd be
worth setting up a subroutine to do it.  I'll add something for that,
and fix the other places too.
        regards, tom lane


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
"Joe Conway"
Date:
> For extra credit ;-) ... if you'd like to learn a little bit about the
> optimizer, think about teaching clause_selectivity() in
> optimizer/path/clausesel.c how to estimate the selectivity of these new
> expression nodes.  In the case where the argument is a boolean column
> that we have statistics for, it should be possible to derive the correct
> answer (including accounting for NULLs).  If the argument is more
> complex than that, you probably can't do anything really intelligent,
> but you could handwave away NULLs and then compute the appropriate
> function of the clause_selectivity() of the argument.

Attached is my initial attempt to teach clause_selectivity about BooleanTest
nodes, for review and comment. Please let me know if I'm headed in the right
direction.

Thanks,

Joe


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
"Joe Conway"
Date:
Thanks for your thorough review and comments, Tom.

Here's a new patch for review. Summary of changes/response to earlier
comments:
- add a routine for NullTest nodes -- done.
- declare selectivity functions without fmgr notation -- done.
- create selfuncs.h for declarations -- done, but I didn't move anything
else out of builtins.h
- use DatumGetBool() and adjust style -- done
- create better default selectivities -- done:    - DEFAULT_UNK_SEL = 0.005    - DEFAULT_NOT_UNK_SEL = 1 -
DEFAULT_UNK_SEL   - DEFAULT_BOOL_SEL = 0.5
 
- recurse clause_selectivity() for non-Var input -- done
- simplify MCV logic -- done, used 2nd approach (always use the first most
common val's frequency)

Questions:
- I added a debug define (BOOLTESTDEBUG) to selfuncs.h, and a corresponding
ifdef/elog NOTICE to clause_selectivity(). This was to help me debug/verify
the calculations. Should this be left in the code when I create a patch (it
is in this one), and if so, is there a preferred "standard" approach to this
type of debug code?
- Using the debug code mentioned above, I noted that clause_selectivity()
did not seem to get called at all for clauses like "where myfield = 0" or
"where myfield > 0". I haven't looked too closely at it yet, but I was
wondering if this is expected behavior?

Thanks,

-- Joe


Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From
Tom Lane
Date:
"Joe Conway" <joseph.conway@home.com> writes:
> Attached is my initial attempt to teach clause_selectivity about BooleanTest
> nodes, for review and comment. Please let me know if I'm headed in the right
> direction.

Looks like you're headed the right way.  A few comments ---

1. Don't forget to add a routine for NullTest nodes, too.  These can be
simpler, since you know that all you care about is the stanullfrac; no
need to look at common values.

2. I don't see any real good reason to force booltestsel() (likewise
nulltestsel(), when you add it) to have the same call signature as the
various operator selectivity routines.  Those have to be looked up in a
table and called via fmgr, but there's no reason to think that
booltestsel will ever be called in the same way.  We could certainly
lose the useless "operator" argument.  I'd be inclined to declare
booltestsel as
Selectivity booltestsel(Query *root,                        BooleanTest *clause,                        int varRelid);

and skip *all* the fmgr notational cruft.  The only trouble with this is
we'd not want to put such a declaration into builtins.h, because it'd
force including a bunch more .h files into builtins.h, which is ugly
and might even cause some circularity problems.  There are already some
selectivity-related things in builtins.h that don't really belong there.
I am starting to think that selfuncs.c deserves its own header file
separate from builtins.h, so that it doesn't clutter builtins.h with a
bunch of selectivity-specific stuff.  (Comments anyone?)

3. You should be using tests like "if (DatumGetBool(values[i]))" or
"if (!DatumGetBool(values[i]))" to examine the MCV-list values, not
"if (values[i] == true)".  The latter is a type cheat: it assumes that
Datum can be implicitly converted to bool, which is only sort-of true at
the moment and might be not-at-all-true in the future.  You could write
"if (DatumGetBool(values[i]) == true)" and satisfy the strict-typing
part of this concern, but there's also a more stylistic thing here.
Personally I find "if (boolean == true)" or "if (boolean == false)"
to be poor style, it should be just "if (boolean)" or "if (!boolean)".
This gets back to whether you consider boolean to be a logically
distinct type from int, I suppose.  Feel free to ignore that part
if you strongly disagree.

4. I don't think that DEFAULT_EQ_SEL is an appropriate default result
for this routine.  Arguably, it should elog(ERROR) if not passed a
BooleanTest node, so the first one or two occurrences are nonissues
anyway.  However, if you find a non-Var argument or a Var that you
can't get statistics for, you do need to return some sort of reasonable
default, and that I think is not the best.  In any case the default should
take the test type into account.  For IS_NULL and IS_UNKNOWN, I'd think
the default assumption should be small but not zero (0.01 maybe?  Any
thoughts out there?).  IS_NOT_NULL/UNKNOWN should obviously yield one
minus whatever we use for IS_NULL.  It seems reasonable to use 0.5
as default for IS_TRUE and the other three BooleanTests.

5. Actually, for IS_TRUE and the other three BooleanTests, what you
should probably do with a non-Var input is ignore the possibility of a
NULL value and just try to estimate the probability of TRUE vs FALSE,
which you can do by recursing back to clause_selectivity() on the
argument (note this would be invalid for NullTest where the argument
isn't necessarily boolean, but it's legit for BooleanTests).  Then
you use either that result or one minus that result, depending on
which BooleanTest you're dealing with.

6. The way you are working with the MCV values seems unnecessarily
complicated.  I had a hard time seeing whether it was generating the
right answer (actually, it demonstrably isn't, since for example you
produce the same result for IS_TRUE and IS_NOT_FALSE, which ought to
differ by the stanullfrac amount; but it's too complex anyway).
I'd be inclined to do this in one of two ways:1. Scan the MCV array to find the entry for "true", and work   with its
frequencyand the stanullfrac frequency to derive   the appropriate answer depending on the test type.2. Always use the
first(most common val's) frequency.  This   value is either true or false, so adjust accordingly.   The simplest way
wouldbe to derive true's frequency as   either freq[0] or 1 - freq[0] - stanullfrac, and then   proceed as in #1.
 
#2 is a little more robust, since it would still work if for some reason
the MCV list contains only an entry for true or only an entry for false.
(I believe that could happen, if the column contains only one value and
possibly some nulls.)


You're off to an excellent start though; you seem to be finding your
way through the code very well.
        regards, tom lane