Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards |
Date | |
Msg-id | 8714.993329917@sss.pgh.pa.us Whole thread Raw |
In response to | Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards ("Joe Conway" <joseph.conway@home.com>) |
List | pgsql-hackers |
"Joe Conway" <joseph.conway@home.com> writes: > Attached is my initial attempt to teach clause_selectivity about BooleanTest > nodes, for review and comment. Please let me know if I'm headed in the right > direction. Looks like you're headed the right way. A few comments --- 1. Don't forget to add a routine for NullTest nodes, too. These can be simpler, since you know that all you care about is the stanullfrac; no need to look at common values. 2. I don't see any real good reason to force booltestsel() (likewise nulltestsel(), when you add it) to have the same call signature as the various operator selectivity routines. Those have to be looked up in a table and called via fmgr, but there's no reason to think that booltestsel will ever be called in the same way. We could certainly lose the useless "operator" argument. I'd be inclined to declare booltestsel as Selectivity booltestsel(Query *root, BooleanTest *clause, int varRelid); and skip *all* the fmgr notational cruft. The only trouble with this is we'd not want to put such a declaration into builtins.h, because it'd force including a bunch more .h files into builtins.h, which is ugly and might even cause some circularity problems. There are already some selectivity-related things in builtins.h that don't really belong there. I am starting to think that selfuncs.c deserves its own header file separate from builtins.h, so that it doesn't clutter builtins.h with a bunch of selectivity-specific stuff. (Comments anyone?) 3. You should be using tests like "if (DatumGetBool(values[i]))" or "if (!DatumGetBool(values[i]))" to examine the MCV-list values, not "if (values[i] == true)". The latter is a type cheat: it assumes that Datum can be implicitly converted to bool, which is only sort-of true at the moment and might be not-at-all-true in the future. You could write "if (DatumGetBool(values[i]) == true)" and satisfy the strict-typing part of this concern, but there's also a more stylistic thing here. Personally I find "if (boolean == true)" or "if (boolean == false)" to be poor style, it should be just "if (boolean)" or "if (!boolean)". This gets back to whether you consider boolean to be a logically distinct type from int, I suppose. Feel free to ignore that part if you strongly disagree. 4. I don't think that DEFAULT_EQ_SEL is an appropriate default result for this routine. Arguably, it should elog(ERROR) if not passed a BooleanTest node, so the first one or two occurrences are nonissues anyway. However, if you find a non-Var argument or a Var that you can't get statistics for, you do need to return some sort of reasonable default, and that I think is not the best. In any case the default should take the test type into account. For IS_NULL and IS_UNKNOWN, I'd think the default assumption should be small but not zero (0.01 maybe? Any thoughts out there?). IS_NOT_NULL/UNKNOWN should obviously yield one minus whatever we use for IS_NULL. It seems reasonable to use 0.5 as default for IS_TRUE and the other three BooleanTests. 5. Actually, for IS_TRUE and the other three BooleanTests, what you should probably do with a non-Var input is ignore the possibility of a NULL value and just try to estimate the probability of TRUE vs FALSE, which you can do by recursing back to clause_selectivity() on the argument (note this would be invalid for NullTest where the argument isn't necessarily boolean, but it's legit for BooleanTests). Then you use either that result or one minus that result, depending on which BooleanTest you're dealing with. 6. The way you are working with the MCV values seems unnecessarily complicated. I had a hard time seeing whether it was generating the right answer (actually, it demonstrably isn't, since for example you produce the same result for IS_TRUE and IS_NOT_FALSE, which ought to differ by the stanullfrac amount; but it's too complex anyway). I'd be inclined to do this in one of two ways:1. Scan the MCV array to find the entry for "true", and work with its frequencyand the stanullfrac frequency to derive the appropriate answer depending on the test type.2. Always use the first(most common val's) frequency. This value is either true or false, so adjust accordingly. The simplest way wouldbe to derive true's frequency as either freq[0] or 1 - freq[0] - stanullfrac, and then proceed as in #1. #2 is a little more robust, since it would still work if for some reason the MCV list contains only an entry for true or only an entry for false. (I believe that could happen, if the column contains only one value and possibly some nulls.) You're off to an excellent start though; you seem to be finding your way through the code very well. regards, tom lane
pgsql-hackers by date: