Thread: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards
> > Yes, column = NULL should *never* return true according to the spec (it > > should always return NULL in fact as stated). The reason for breaking > > with the spec is AFAIK to work with broken microsoft clients that seem to > > think that =NULL is a meaningful test and generate queries using that. > > Microsoft Access is the guilty party, IIRC. I recently tried to stir up > some interest in changing this behavior back to the standard, but > apparently there are still too many people using broken versions of > Access. Actually I am not sure whether the column = NULL syntax is even defined or allowed in SQL92 (e.g. Informix interprets the NULL as column name in this context and errs out). If not it would simply be an extension to the standard with the defined behavior of beeing identical to "column is null". Andreas
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes: > Actually I am not sure whether the column = NULL syntax is even defined > or allowed in SQL92 (e.g. Informix interprets the NULL as column name in > this context and errs out). I don't have the standard handy, but I do have Joe Celko's book, "Data & Databases: Concepts in Practice". He says (in section 8.2, under the heading "Multivalued Logic"): A NULL cannot be compared to another NULL or to a value with what Dr. Codd called a theta operator and what programmers call a comparison operator (equal, not equal, less than, greater than, and so forth). This resultsin a three-valued logic, which has an UNKNOWN in addition to TRUE and FALSE. [...] UNKNOWN is a logicalvalue and not the same as a NULL, which is a data value. That is why you have to say X IS [NOT] NULLin SQL and not use X = NULL instead. Theta operators are expressions of the form X <comp op> Y; when X orY or both are NULL, theta operators will return an UNKNOWN and not a NULL. He goes on to explain three-valued logic in more detail, showing truth tables according to Jan Lukasiewicz (the inventor of RPN), and says, of SQL-92, that it "is comforting to see that [it has] the same truth tables as the three-valued system of Lukasiewicz". Further, he says: SQL-92 added a new predicate of the form <search condition> IS [NOT] TRUE | FALSE | UNKNOWN which will let you map any combination of three-valued logic to the two Boolean values. A quick test run with psql shows that PostgreSQL does not properly implement three-valued logic: it does not recognize the UNKNOWN keyword alongside TRUE and FALSE, in any situation. It will also return boolean truth values for comparisons with NULL values, using them as "real" data values in the comparison. Worse (IMHO), this is not consistent: while a test for "column = NULL" will return rows where that is true, and a test for "not column = NULL" will return the rest, "column <> NULL" returns no rows! This means that the theta operators are not all treated the same way, which is surely wrong! It seems to me that the idea of NULL as an unkown data value and UNKNOWN as the corresponding truth value, combined with the rules for propagation of NULL in mathematical operations, of UNKNOWN in truth operations, and from NULL to UNKNOWN by theta operators, is a very clean, intuitive way of handling these issues. It feels right! :-) -tih -- The basic difference is this: hackers build things, crackers break them.
On 7 Jun 2001, Tom Ivar Helbekkmo wrote: > Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes: > > > Actually I am not sure whether the column = NULL syntax is even defined > > or allowed in SQL92 (e.g. Informix interprets the NULL as column name in > > this context and errs out). > > He goes on to explain three-valued logic in more detail, showing truth > tables according to Jan Lukasiewicz (the inventor of RPN), and says, > of SQL-92, that it "is comforting to see that [it has] the same truth > tables as the three-valued system of Lukasiewicz". Further, he says: > > SQL-92 added a new predicate of the form > > <search condition> IS [NOT] TRUE | FALSE | UNKNOWN > > which will let you map any combination of three-valued > logic to the two Boolean values. > > A quick test run with psql shows that PostgreSQL does not properly > implement three-valued logic: it does not recognize the UNKNOWN > keyword alongside TRUE and FALSE, in any situation. It will also > return boolean truth values for comparisons with NULL values, using > them as "real" data values in the comparison. Worse (IMHO), this is > not consistent: while a test for "column = NULL" will return rows > where that is true, and a test for "not column = NULL" will return the > rest, "column <> NULL" returns no rows! This means that the theta > operators are not all treated the same way, which is surely wrong! That's the nature of the hack we're talking about. It's a grammar level hack to turn a specific sequence of tokens (= NULL) into IS NULL due to a client's generated queries. If you're comparing something other than the constant NULL, it should do what is expected (ie, a comparison between a NULL in a table or even CAST(NULL as INT4) does the "right" thing). I think adding IS UNKNOWN would probably be trivial (I think the code is basically there in IS NULL.)
Zeugswetter Andreas SB writes: > Actually I am not sure whether the column = NULL syntax is even defined > or allowed in SQL92 I think you're right. > (e.g. Informix interprets the NULL as column name in this context and > errs out). NULL is a reserved word, so this would be wrong. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > That's the nature of the hack we're talking about. It's a grammar > level hack to turn a specific sequence of tokens (= NULL) into IS > NULL due to a client's generated queries. Aha! Sorry -- I jumped in late in the discussion without checking back to see how it started... OK, I've already said that I like the cleanliness and orthogonality of NULL is a missing data value, UNKNOWN is a missing truth value, both propagate in expressions, comparisons with NULL generate UNKNOWN, and you can use the special comparisons IS [NOT] NULL and IS [NOT] UNKNOWN to get plain, two-valued Boolean truth values out of them. The Microsoft compatibility hack is ugly, and should be either a) removed, b) expanded to include the other comparison operators and documented as a PostgreSQL proprietary extension, or c) made into a special feature that's turned on at will by a SET command. I would applaud a), approve of c), and be dismayed by b). > I think adding IS UNKNOWN would probably be trivial (I think the > code is basically there in IS NULL.) But if it's implemented, shouldn't the code also differentiate between UNKNOWN and NULL, by not (as now) using the latter to represent the former? Or do I misunderstand how it's handled now? -tih -- The basic difference is this: hackers build things, crackers break them.
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes: > Actually I am not sure whether the column = NULL syntax is even > defined or allowed in SQL92 I've just checked, by reading the relevant paragraphs and studying the BNF, and the standard says that any comparison of the form X <comp op> Y is unknown if X or Y (or both) is NULL, including the case where NULL is given as an explicit constant. So, SQL92 clearly demands that "column = NULL" is UNKNOWN, never TRUE or FALSE. -tih -- The basic difference is this: hackers build things, crackers break them.
On 7 Jun 2001, Tom Ivar Helbekkmo wrote: > Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > > > That's the nature of the hack we're talking about. It's a grammar > > level hack to turn a specific sequence of tokens (= NULL) into IS > > NULL due to a client's generated queries. > > Aha! Sorry -- I jumped in late in the discussion without checking > back to see how it started... > > OK, I've already said that I like the cleanliness and orthogonality of > NULL is a missing data value, UNKNOWN is a missing truth value, both > propagate in expressions, comparisons with NULL generate UNKNOWN, and > you can use the special comparisons IS [NOT] NULL and IS [NOT] UNKNOWN > to get plain, two-valued Boolean truth values out of them. > > The Microsoft compatibility hack is ugly, and should be either a) > removed, b) expanded to include the other comparison operators and > documented as a PostgreSQL proprietary extension, or c) made into a > special feature that's turned on at will by a SET command. I would > applaud a), approve of c), and be dismayed by b). c is the most likely thing to happen probably. > > I think adding IS UNKNOWN would probably be trivial (I think the > > code is basically there in IS NULL.) > > But if it's implemented, shouldn't the code also differentiate between > UNKNOWN and NULL, by not (as now) using the latter to represent the > former? Or do I misunderstand how it's handled now? Within what's there (using null as unknown), the two tests are nearly identical and would probably be just a grammar change. Creating a separate unknown would be more difficult, and I'm not sure it's necessary to make the distinction. NULL is an unknown value, I'm not sure that you'd need a separate unknown value specifically for booleans.
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes: > Actually I am not sure whether the column = NULL syntax is even defined > or allowed in SQL92 (e.g. Informix interprets the NULL as column name in > this context and errs out). Strictly speaking, SQL92 would require you to writefoo = CAST (NULL AS type-of-foo) However, we allow unadorned NULL in other contexts as a shorthand for the CAST notation, so it's inconsistent of us to say that in this context it means something different. The real problem with accepting this Microsoftism is that it's a trap for unwary programmers. Case 1: someone who's not studied SQL in detail might experiment with examples involving "foo = NULL" and jump to reasonable but entirely incorrect conclusions about how comparisons involving NULL operate. Case 2: someone who *has* studied SQL, and is also aware that we accept unadorned NULLs, will also draw the wrong conclusions about what this construct will do. Bottom line: this kluge surprises everyone except those who already know it exists. I don't like systems that surprise their users in inconsistent ways. regards, tom lane
Tom Ivar Helbekkmo <tih@kpnQwest.no> quotes: > ... This results in > a three-valued logic, which has an UNKNOWN in addition > to TRUE and FALSE. [...] UNKNOWN is a logical value and > not the same as a NULL, which is a data value. SQL92 is not very clear about whether NULL and UNKNOWN are distinct, but it is worth noticing that their truth tables for comparison operators, and/or/not, etc, only mention unknown --- never null --- as a possible value of a boolean condition. SQL99 clarifies the intent: The data type boolean comprises the distinct truth values true and false. Unless prohibited by a NOT NULLconstraint, the boolean data type also supports the unknown truth value as the null value. This specificationdoes not make a distinction between the null value of the boolean data type and the unknown truth valuethat is the result of an SQL <predicate>, <search condition>, or <boolean value expression>; they maybe used interchangeably to mean exactly the same thing. Which in fact is what Postgres does. > A quick test run with psql shows that PostgreSQL does not properly > implement three-valued logic: it does not recognize the UNKNOWN > keyword alongside TRUE and FALSE, in any situation. We do not currently have correct implementations of IS TRUE, IS FALSE, or IS UNKNOWN (IS TRUE/FALSE are in there but give the wrong result for null inputs). This is on my to-do list to fix; not sure if the master TODO list mentions it or not. Actually it'd be a good project for a newbie hacker who wants to learn about the backend's expression-handling machinery. Anyone want to take it on? It's also worth noticing that our implementation of IS NULL isn't really up to speed: the spec allows the argument to be a row value constructor, not just a scalar. But we mostly don't have support for row-value- constructor expressions anyway (it's not an Entry SQL feature). regards, tom lane
> We do not currently have correct implementations of IS TRUE, IS FALSE, > or IS UNKNOWN (IS TRUE/FALSE are in there but give the wrong result > for null inputs). This is on my to-do list to fix; not sure if the > master TODO list mentions it or not. Actually it'd be a good project > for a newbie hacker who wants to learn about the backend's > expression-handling machinery. Anyone want to take it on? > I'd like to finish up the has_table_privilege function over the next week or so and then take this on. Can you point me in a direction to start looking? Thanks, Joe
"Joe Conway" <joe@conway-family.com> writes: >> We do not currently have correct implementations of IS TRUE, IS FALSE, >> or IS UNKNOWN (IS TRUE/FALSE are in there but give the wrong result >> for null inputs). This is on my to-do list to fix; not sure if the >> master TODO list mentions it or not. Actually it'd be a good project >> for a newbie hacker who wants to learn about the backend's >> expression-handling machinery. Anyone want to take it on? > I'd like to finish up the has_table_privilege function over the next week or > so and then take this on. Can you point me in a direction to start looking? The way things currently work is that gram.y translates "x IS TRUE" etc to "x = true" etc. This is wrong because it does the wrong thing for null input. Another objection is that it's impossible for ruleutils.c to reverse-list the expression tree in its original form. IS [NOT] NULL is handled a little differently: gram.y generates a specialized Expr node, which parse_expr.c translates to a function call on the specialized functions nullvalue() and nonnullvalue() respectively. I don't much care for this implementation either, again partly because ruleutils.c has to be uglified to deal with it, but partly because the optimizer can't cheaply recognize IS NULL tests either. I'd like to see all eight of these guys translated into a specialized kind of expression node, called perhaps BooleanTest. Actually, it'd probably be wise to keep IS NULL separate from the six boolean tests, with an eye to the future when it will need to support nonscalar arguments. So maybe BooleanTest and NullTest node types, each with a field showing exactly which test is wanted. Adding a new expression node type is a straightforward but rather tedious exercise in teaching some dozens of places what to do with it. A grep for existing expression node types, such as CaseExpr or FieldSelect or RelabelType, will give you a good idea what needs to be done. regards, tom lane
> The real problem with accepting this Microsoftism is that it's a trap > for unwary programmers. Case 1: someone who's not studied SQL in detail > might experiment with examples involving "foo = NULL" and jump to > reasonable but entirely incorrect conclusions about how comparisons > involving NULL operate. Case 2: someone who *has* studied SQL, and is > also aware that we accept unadorned NULLs, will also draw the wrong > conclusions about what this construct will do. Bottom line: this kluge > surprises everyone except those who already know it exists. I don't > like systems that surprise their users in inconsistent ways. These were all points that were brought up and discussed when the hack was implemented. At the time, the trade between implementing a construct which is *not allowed* in SQL9x vs enabling the M$Access community to migrate to PostgreSQL was hashed over and a consensus was reached that the benefits to allowing it outweighed the drawbacks. I was of the initial opinion that we should not support M$ pathelogical non-standard constructs (jeez, they should know better, and probably do, so it is likely a pathetic attempt to lock in users). But since the construct is not allowed (or useless), why would anyone feel they need to use it? Clearly it is not the case that "this kluge surprises everyone except those who already know it exists." We have had a strong consensus that SQL9x standard constructs should be the norm for PostgreSQL, so no need to rehash that. The issue boils down to, as it did when it first came up, whether we will make it easier for M$Access users to start migrating to PostgreSQL. If newer versions of Access emit standard constructs, then it would be even easier to say that we should jettison the construct. But if not, istm that we are killing a usability feature on principle, not from need, and it may be premature to do that. Using a "SET xxx" switch is not a step in the right direction istm since it requires an extra PostgreSQL-specific command to get things working with Access. We aren't teaching SQL, we are trying to let people get to their data. btw, I *was* suprised to see the "IS UNKNOWN" construct. It's been lurking in my reference books for years. afaict it is very uncommonly used, since most RDBMSes accept IS NULL as an equivalent. But is is trivial to add to the grammar. - Thomas
> The way things currently work is that gram.y translates "x IS TRUE" etc > to "x = true" etc. This is wrong because it does the wrong thing for > null input. Another objection is that it's impossible for ruleutils.c > to reverse-list the expression tree in its original form. fyi, in case it helps: we used to have gram.y translate these into function calls, rather than the operator expression. But that precluded the optimizer from ever having a shot at optimizing out "const = const" kinds of expressions and other fluff. If we go to a specialized node in the parse tree, then the optimizer could be taught to handle that, which is better than the original straight function call which would have masked things too much. - Thomas
Thomas Lockhart <lockhart@fourpalms.org> writes: > But since the construct is not allowed (or useless), why would > anyone feel they need to use it? Because it isn't entirely useless, actually. I agree that no programmer in his right mind would write, by hand, a comparison involving NULL, knowing that the truth value of that comparison is required by the standard to be UNKNOWN (i.e. NULL). However, I'm looking at using machine generated SQL code (generated on the fly in an interactive application) to implement a dynamically adapting set of tables, rules/triggers and their supporting stored procedures, and it's just a matter of time before the first "= NULL" happens to show up in code generated like that. I'd like it to behave according to the standard when that situation occurs, and the standard says that any comparison with NULL, even "NULL = NULL", is UNKNOWN. -tih -- The basic difference is this: hackers build things, crackers break them.
Thomas Lockhart <lockhart@fourpalms.org> writes: > Clearly it is not the case that "this kluge surprises everyone except > those who already know it exists." How can you argue that, when the topic comes up again every couple of months? regards, tom lane
> > Clearly it is not the case that "this kluge surprises everyone except > > those who already know it exists." > How can you argue that, when the topic comes up again every couple of > months? I'm not looking for an argument. But the statement is not factually true: "suprises everyone" (most folks don't notice, and don't care much) and "every couple of months" (??) stray far enough from facts that we should get back on topic. Whatever the facts, the messages that we do *not* get are just as significant: a relatively large portion of feedback from users of both Access and PostgreSQL at the time the "feature" was added indicated that it was a significant stumbling block for those users, and those messages will start up anew without adequate planning and implementation. Since back then, we have some additional clarification that it occurs only for users of Access/Forms, and that others are worried that it will lead to difficulties for others under different circumstances (please remember, or note if it is too long ago, that at the time I argued against adding the "feature", but tried to listen to those who would find it useful). I'm not ignoring those worries, but in the battle between purity and usefulness we shouldn't always line up with the strongest or loudest voice(s) that day, but try to respect and keep in mind the others who have contributed to the discussion in the past. That said, the issues should be broken down into managable chunks, but imho forcing the last step (removal of the "= NULL" accomodation) first is premature. How about phasing this so that we can accomodate the long-standing issue of M$ interoperability (via ODBC improvements to make it transparent) before ripping out the current workaround? From Andreas' comments, it seems that for his application he would like a different behavior, but frankly I'm not certain why the current behavior would be detrimental in the use case he mentioned. If SQL92 requires that any query with "= NULL" be rejected as illegal (my books aren't nearby at the moment), he might still want to have code at the application level for some graceful handling of illegal values. - Thomas
> IS [NOT] NULL is handled a little differently: gram.y generates a > specialized Expr node, which parse_expr.c translates to a function call > on the specialized functions nullvalue() and nonnullvalue() > respectively. I don't much care for this implementation either, again > partly because ruleutils.c has to be uglified to deal with it, but > partly because the optimizer can't cheaply recognize IS NULL tests > either. > > I'd like to see all eight of these guys translated into a specialized > kind of expression node, called perhaps BooleanTest. Actually, it'd > probably be wise to keep IS NULL separate from the six boolean tests, > with an eye to the future when it will need to support nonscalar > arguments. So maybe BooleanTest and NullTest node types, each with a > field showing exactly which test is wanted. > Attached is a patch for a new NullTest node type for review and comment. Since it didn't seem like there was consensus regarding removal of the "a = null" conversion to "a is null" behavior, I left it in. It is worth mentioning, however, that neither Oracle 8.1.6 or MSSQL 7 seem to support this -- see below: Oracle: **************************************** SQL> select f1,f2 from foo where f2 = null; no rows selected MSSQL 7 **************************************** select f1,f2 from foo where f2 = null f1 f2 ----------- -------------------------------------------------- (0 row(s) affected) PostgreSQL **************************************** test=# select f1,f2 from foo where f2 = null;f1 | f2 ----+---- 1 | 4 | (2 rows) In all 3 cases table foo has 4 rows, 2 of which have null values for f2. Based on this, should support for the converting "a = null" to "a is null" be dropped? I also noticed that in PostgreSQL I can do the following (both before and after this patch): select f2 is null from foo; whereas in both Oracle and MSSQL it causes a syntax error. Any thoughts on this? Thanks, -- Joe
"Joe Conway" <joseph.conway@home.com> writes: > Attached is a patch for a new NullTest node type for review and comment. I assume you are just looking for review at this point; I would not recommend applying to CVS until the BooleanTest part is done too. (Since parsetree changes affect stored rules, the change really should include a catversion.h increment, and thus it's best to bunch this sort of change together to avoid forcing extra initdbs on other hackers.) I'll look through the code later, but... > Based on this, should support for the converting "a = null" to "a is null" > be dropped? My opinion on that is already on record ;-) > I also noticed that in PostgreSQL I can do the following (both before and > after this patch): > select f2 is null from foo; > whereas in both Oracle and MSSQL it causes a syntax error. Any thoughts on > this? Curious; I'd have said that that is clearly within the spec. Can anyone check it on some other engines? regards, tom lane
> I assume you are just looking for review at this point; I would not > recommend applying to CVS until the BooleanTest part is done too. > (Since parsetree changes affect stored rules, the change really should > include a catversion.h increment, and thus it's best to bunch this sort > of change together to avoid forcing extra initdbs on other hackers.) > I'll look through the code later, but... OK -- I'll get started on BooleanTest in the next day or two. -- Joe
"Joe Conway" <joseph.conway@home.com> writes: > I also noticed that in PostgreSQL I can do the following (both before and > after this patch): > select f2 is null from foo; > whereas in both Oracle and MSSQL it causes a syntax error. Any thoughts on > this? I dug into this further and discovered that indeed it is not SQL92 ... but it is SQL99. Amazingly enough, SQL92 doesn't allow boolean expressions as a possible type of general expression: <value expression> ::= <numeric value expression> | <string value expression> | <datetime value expression> | <interval value expression> It only allows them as <search condition>s, which is to say WHERE, HAVING, CASE WHEN, CHECK, and one or two other places. But SQL99 gets it right: <value expression> ::= <numeric value expression> | <string value expression> | <datetime value expression> | <interval value expression> | <boolean value expression> | <user-defined type value expression> | <row value expression> | <reference value expression> | <collection value expression> Looks like we're ahead of the curve here... regards, tom lane
> > I assume you are just looking for review at this point; I would not > > recommend applying to CVS until the BooleanTest part is done too. > > (Since parsetree changes affect stored rules, the change really should > > include a catversion.h increment, and thus it's best to bunch this sort > > of change together to avoid forcing extra initdbs on other hackers.) > > I'll look through the code later, but... > OK -- here's the full patch for review. It includes two new expression node types: NullTest and BooleanTest. I have a couple of questions/comments remaining WRT this: -- Should I increment catversion.h as part of the patch (I didn't in this patch), or is that usually centrally controlled by Bruce (or whomever commits the change)? -- IMHO, if we are going to keep the (a = null) to (a is null) conversion, then there should also be a similar conversion from (a != null) to (a is not null). Otherwise the two operations which may be expected to be complimentary (as evidenced by at least one recent post) are not. -- If I have interpreted SQL92 correctly UNKNOWN IS TRUE should return FALSE, and UNKNOWN IS NOT TRUE is equivalent to NOT (UNKNOWN IS TRUE) ==> TRUE. Is this correct? Thanks, -- Joe
"Joe Conway" <joseph.conway@home.com> writes: > -- Should I increment catversion.h as part of the patch (I didn't in this > patch), or is that usually centrally controlled by Bruce (or whomever > commits the change)? It's good to put the catversion bump into the patch, else the committer might forget to do it. > -- IMHO, if we are going to keep the (a = null) to (a is null) conversion, > then there should also be a similar conversion from (a != null) to (a is not > null). Otherwise the two operations which may be expected to be > complimentary (as evidenced by at least one recent post) are not. I'd resist this. The only reason the =NULL hack is in there at all is to support Access97. We shouldn't extend the deviation from standards further than the minimum needed to do that. The hack is fundamentally inconsistent anyway, and breaking our standards compliance further in pursuit of bogus consistency seems misguided. Personally I'd rather take out the =NULL conversion anyway... > -- If I have interpreted SQL92 correctly UNKNOWN IS TRUE should return > FALSE, and UNKNOWN IS NOT TRUE is equivalent to NOT (UNKNOWN IS TRUE) ==> > TRUE. Is this correct? Yes. Table 15 is pretty illegible in the ASCII draft copies of SQL92 and SQL99, but the PDF version of SQL99 is okay, and it makes clear what you'd expect: input IS TRUE IS FALSE IS UNKNOWN true true false falsefalse false true falseunknown false false true and then the NOT variants are defined as x IS NOT foo == NOT (x IS foo) I'll try to look over and commit the patch later today. For extra credit ;-) ... if you'd like to learn a little bit about the optimizer, think about teaching clause_selectivity() in optimizer/path/clausesel.c how to estimate the selectivity of these new expression nodes. In the case where the argument is a boolean column that we have statistics for, it should be possible to derive the correct answer (including accounting for NULLs). If the argument is more complex than that, you probably can't do anything really intelligent, but you could handwave away NULLs and then compute the appropriate function of the clause_selectivity() of the argument. regards, tom lane
> > -- Should I increment catversion.h as part of the patch (I didn't in this > > patch), or is that usually centrally controlled by Bruce (or whomever > > commits the change)? > > It's good to put the catversion bump into the patch, else the committer > might forget to do it. >OK -- will do next time. > further than the minimum needed to do that. The hack is fundamentally > inconsistent anyway, and breaking our standards compliance further in > pursuit of bogus consistency seems misguided. > > Personally I'd rather take out the =NULL conversion anyway... I'd second that. > Yes. Table 15 is pretty illegible in the ASCII draft copies of SQL92 > and SQL99, but the PDF version of SQL99 is okay, and it makes clear > what you'd expect: > OT -- I need to buy a copy of SQL99, but it seems to be split into several parts (that didn't exist for SQL92). Which one (or more) are the most useful for PostgreSQL hacking? > For extra credit ;-) ... if you'd like to learn a little bit about the > optimizer, think about teaching clause_selectivity() in > optimizer/path/clausesel.c how to estimate the selectivity of these new > expression nodes. In the case where the argument is a boolean column > that we have statistics for, it should be possible to derive the correct > answer (including accounting for NULLs). If the argument is more > complex than that, you probably can't do anything really intelligent, > but you could handwave away NULLs and then compute the appropriate > function of the clause_selectivity() of the argument. Sure, I love a challenge ;) -- I'll take a look. One issue I noticed this morning with this patch is that 't' and 'f' are no longer being implicitly cast into boolean, i.e. test=# select 't' is true;?column? ----------f (1 row) test=# select 't'::bool is true;?column? ----------t (1 row) Any thoughts on where look to fix this? Thanks, -- Joe
"Joe Conway" <joseph.conway@home.com> writes: > OT -- I need to buy a copy of SQL99, but it seems to be split into several > parts (that didn't exist for SQL92). Which one (or more) are the most useful > for PostgreSQL hacking? I find that Part 2 is almost the only one I ever look at. I'm not even sure what's in the other parts ... > One issue I noticed this morning with this patch is that 't' and 'f' are no > longer being implicitly cast into boolean, i.e. test=# select 't' is true; > ?column? > ---------- > f > (1 row) Now that you mention it, it looks like all of our constructs that expect boolean fail to coerce unknown-type literals into bools. The rest of them raise errors, eg: regression=# select not 't';ERROR: argument to NOT is type 'unknown', not 'bool' but this is pretty bogus. ISTM all these places should try to coerce their inputs to bool before they complain. This involves calling can_coerce_type and then coerce_type; it's tedious enough that it'd be worth setting up a subroutine to do it. I'll add something for that, and fix the other places too. regards, tom lane
> For extra credit ;-) ... if you'd like to learn a little bit about the > optimizer, think about teaching clause_selectivity() in > optimizer/path/clausesel.c how to estimate the selectivity of these new > expression nodes. In the case where the argument is a boolean column > that we have statistics for, it should be possible to derive the correct > answer (including accounting for NULLs). If the argument is more > complex than that, you probably can't do anything really intelligent, > but you could handwave away NULLs and then compute the appropriate > function of the clause_selectivity() of the argument. Attached is my initial attempt to teach clause_selectivity about BooleanTest nodes, for review and comment. Please let me know if I'm headed in the right direction. Thanks, Joe
Thanks for your thorough review and comments, Tom. Here's a new patch for review. Summary of changes/response to earlier comments: - add a routine for NullTest nodes -- done. - declare selectivity functions without fmgr notation -- done. - create selfuncs.h for declarations -- done, but I didn't move anything else out of builtins.h - use DatumGetBool() and adjust style -- done - create better default selectivities -- done: - DEFAULT_UNK_SEL = 0.005 - DEFAULT_NOT_UNK_SEL = 1 - DEFAULT_UNK_SEL - DEFAULT_BOOL_SEL = 0.5 - recurse clause_selectivity() for non-Var input -- done - simplify MCV logic -- done, used 2nd approach (always use the first most common val's frequency) Questions: - I added a debug define (BOOLTESTDEBUG) to selfuncs.h, and a corresponding ifdef/elog NOTICE to clause_selectivity(). This was to help me debug/verify the calculations. Should this be left in the code when I create a patch (it is in this one), and if so, is there a preferred "standard" approach to this type of debug code? - Using the debug code mentioned above, I noted that clause_selectivity() did not seem to get called at all for clauses like "where myfield = 0" or "where myfield > 0". I haven't looked too closely at it yet, but I was wondering if this is expected behavior? Thanks, -- Joe
"Joe Conway" <joseph.conway@home.com> writes: > Attached is my initial attempt to teach clause_selectivity about BooleanTest > nodes, for review and comment. Please let me know if I'm headed in the right > direction. Looks like you're headed the right way. A few comments --- 1. Don't forget to add a routine for NullTest nodes, too. These can be simpler, since you know that all you care about is the stanullfrac; no need to look at common values. 2. I don't see any real good reason to force booltestsel() (likewise nulltestsel(), when you add it) to have the same call signature as the various operator selectivity routines. Those have to be looked up in a table and called via fmgr, but there's no reason to think that booltestsel will ever be called in the same way. We could certainly lose the useless "operator" argument. I'd be inclined to declare booltestsel as Selectivity booltestsel(Query *root, BooleanTest *clause, int varRelid); and skip *all* the fmgr notational cruft. The only trouble with this is we'd not want to put such a declaration into builtins.h, because it'd force including a bunch more .h files into builtins.h, which is ugly and might even cause some circularity problems. There are already some selectivity-related things in builtins.h that don't really belong there. I am starting to think that selfuncs.c deserves its own header file separate from builtins.h, so that it doesn't clutter builtins.h with a bunch of selectivity-specific stuff. (Comments anyone?) 3. You should be using tests like "if (DatumGetBool(values[i]))" or "if (!DatumGetBool(values[i]))" to examine the MCV-list values, not "if (values[i] == true)". The latter is a type cheat: it assumes that Datum can be implicitly converted to bool, which is only sort-of true at the moment and might be not-at-all-true in the future. You could write "if (DatumGetBool(values[i]) == true)" and satisfy the strict-typing part of this concern, but there's also a more stylistic thing here. Personally I find "if (boolean == true)" or "if (boolean == false)" to be poor style, it should be just "if (boolean)" or "if (!boolean)". This gets back to whether you consider boolean to be a logically distinct type from int, I suppose. Feel free to ignore that part if you strongly disagree. 4. I don't think that DEFAULT_EQ_SEL is an appropriate default result for this routine. Arguably, it should elog(ERROR) if not passed a BooleanTest node, so the first one or two occurrences are nonissues anyway. However, if you find a non-Var argument or a Var that you can't get statistics for, you do need to return some sort of reasonable default, and that I think is not the best. In any case the default should take the test type into account. For IS_NULL and IS_UNKNOWN, I'd think the default assumption should be small but not zero (0.01 maybe? Any thoughts out there?). IS_NOT_NULL/UNKNOWN should obviously yield one minus whatever we use for IS_NULL. It seems reasonable to use 0.5 as default for IS_TRUE and the other three BooleanTests. 5. Actually, for IS_TRUE and the other three BooleanTests, what you should probably do with a non-Var input is ignore the possibility of a NULL value and just try to estimate the probability of TRUE vs FALSE, which you can do by recursing back to clause_selectivity() on the argument (note this would be invalid for NullTest where the argument isn't necessarily boolean, but it's legit for BooleanTests). Then you use either that result or one minus that result, depending on which BooleanTest you're dealing with. 6. The way you are working with the MCV values seems unnecessarily complicated. I had a hard time seeing whether it was generating the right answer (actually, it demonstrably isn't, since for example you produce the same result for IS_TRUE and IS_NOT_FALSE, which ought to differ by the stanullfrac amount; but it's too complex anyway). I'd be inclined to do this in one of two ways:1. Scan the MCV array to find the entry for "true", and work with its frequencyand the stanullfrac frequency to derive the appropriate answer depending on the test type.2. Always use the first(most common val's) frequency. This value is either true or false, so adjust accordingly. The simplest way wouldbe to derive true's frequency as either freq[0] or 1 - freq[0] - stanullfrac, and then proceed as in #1. #2 is a little more robust, since it would still work if for some reason the MCV list contains only an entry for true or only an entry for false. (I believe that could happen, if the column contains only one value and possibly some nulls.) You're off to an excellent start though; you seem to be finding your way through the code very well. regards, tom lane