Thread: Strict-typing benefits/costs
I continue to hope that 8.4 later (or 8.3 patched) will provide behavior / compatibility controls.. I personally have requested others in past, but one outstanding one now, seems to be an ability to turn back-on implicit type conversion. I'm struggling to see Mysql, Mssql, Oracle, and others following PG and turning off their implicit conversion. People will have to choose whether to re-code (explicit casts) their apps for sake of PG. Granted, some of the other databases have bugs in their CAST implementations (see http://bugs.mysql.com/bug.php?id=34562 & http://bugs.mysql.com/bug.php?id=34564).... bugs which makes writing PG 8.3 portable code arduous or impossible when dealing with legacy table designs and program which counted on implicit casts. But even when mysql gets fixed, one will ask: "should I re-code my apps just so they will work with PG>=8.3 (and test the code on other DBs), or should I only support PG<8.3? From http://www.postgresql.org/docs/8.3/static/release-8-3.html: "these expressions formerly worked: 1) substr(current_date, 1, 4) 2) 23 LIKE '2%' but will now draw "function does not exist" and "operator does not exist" errors respectively. Use an explicit cast instead. 3) current_date < 2017-11-17 " (the section also prominently cites a non-portable cast syntax) Questions: For case 1, regarding type safety: we know use of LIKE (and SUBSTR) requires *implicit or explicit* conversion to a text type. Why require that explicitly? Closet answer I can see, is that for (example #1) text-formed dates in some servers are not iso8601 format. But, even explicit datetime->text cast has this same hazard -- it doesn't require a format argument. For case #2, explicit or implicit again have same risk: that base-10 and with/out decimal, should be the default format. For case #3, I see "in the presence of automatic casts both sides were cast to text", but can that not be fixed? Operand 2 yields an integer, and integer and date compares should failfast, they are not comparable. (I believe sql requires delimited iso8601 fields, so both 20080414120000 and '20080414120000' should failfast) int compare(datetime a, object b) { if (typeof(b=="charsequence") return compare(cast(b as datetime), a); if (typeof(b)=="date") return compare(cast(b as datetime), a); if (typeof(b)=="time") //allowed? return compare(cast(b as datetime), a); throw "cannot compare "+a.getClass()+" and "+b.getClass(); } WHERE (current_date < 2017-11-17) -> FAIL WHERE (current_date < CAST(2017-11-17 AS datetime)) -> FAIL WHERE (current_date < '2017-11-17') -> PASS WHERE (current_date < CAST('2017-11-17' AS datetime)) -> PASS Sure, we know stricter typing will help performance, by encouraging more-correct design. For example it may be prudent to ALWAYS failfast when trying to join key/indexed table columns of differing types (one char to another's int). But on non-key where conditions? Or make that a behavior option. Perhaps An SQLWarning could be set when mismatches occur, yet make a best effort at autocast. A char to int key comparison affects performance horridly and indicates (though not always) a potential mis-design, yet it MAY be safe to autocast the text to numeric before compare, because the narrowing conversion will assert that only base-10 chars exist. So, are there other examples of why, esp for runtime/type safety, we should force explicit conversion?
Ken Johanson <pg-user@kensystem.com> writes: > For case 1, regarding type safety: we know use of LIKE (and SUBSTR) > requires *implicit or explicit* conversion to a text type. Why require > that explicitly? Because it's way too easy to burn yourself with implicit conversions. Cases comparable to the one mentioned (current_date < 2017-11-17 silently doing something very different than the user expected) have been cropping up every month or two for *years* --- try trolling the PG list archives for awhile for examples. After you've wasted a day or three chasing a problem like that, or pehaps had your app fail in the field because of a problem like that, you'll realize that having to write a few explicit casts is a small price to pay for not having such booby-traps in the system. As for the lack of a compatibility switch, we would probably have provided one if we could do so reasonably; but a large part of the change consisted of altering the initial contents of pg_cast and some other system catalogs. There isn't any good way to flip that on and off. regards, tom lane
On Thu, 2008-02-14 at 15:55 -0700, Ken Johanson wrote: > Granted, some of the other databases have bugs in their CAST > implementations (see http://bugs.mysql.com/bug.php?id=34562 & > http://bugs.mysql.com/bug.php?id=34564).... bugs which makes writing PG > 8.3 portable code arduous or impossible when dealing with legacy table > designs and program which counted on implicit casts. But even when mysql > gets fixed, one will ask: "should I re-code my apps just so they will > work with PG>=8.3 (and test the code on other DBs), or should I only > support PG<8.3? If postgresql were to revert to 8.2 implicit casting behavior, would that actually improve compatibility with other DBMSs? Every DBMS probably has it's own rules for implicit casting, different from every other DBMS. So are you sure it wouldn't just introduce more compatibility problems somewhere else? Or worse, it could hide the problems during migration/testing, and they could surface after you put it into production. > From > http://www.postgresql.org/docs/8.3/static/release-8-3.html: > > "these expressions formerly worked: > 1) substr(current_date, 1, 4) > 2) 23 LIKE '2%' > but will now draw "function does not exist" and "operator does not > exist" errors respectively. Use an explicit cast instead. > 3) current_date < 2017-11-17 > " > > (the section also prominently cites a non-portable cast syntax) > > Questions: > > For case 1, regarding type safety: we know use of LIKE (and SUBSTR) > requires *implicit or explicit* conversion to a text type. Why require Not this substr() function: create function substr(date, int, int) returns text as $$ begin return 'foo'; end; $$ language plpgsql; Are you saying we make special cases for all of the "obvious" functions, and they should behave differently from user-defined functions? > For case #3, I see "in the presence of automatic casts both sides were > cast to text", but can that not be fixed? Operand 2 yields an integer, > and integer and date compares should failfast, they are not comparable. > (I believe sql requires delimited iso8601 fields, so both 20080414120000 > and '20080414120000' should failfast) > > int compare(datetime a, object b) > { > if (typeof(b=="charsequence") > return compare(cast(b as datetime), a); > if (typeof(b)=="date") > return compare(cast(b as datetime), a); > if (typeof(b)=="time") //allowed? > return compare(cast(b as datetime), a); > throw "cannot compare "+a.getClass()+" and "+b.getClass(); > } > > WHERE (current_date < 2017-11-17) -> FAIL > WHERE (current_date < CAST(2017-11-17 AS datetime)) -> FAIL > WHERE (current_date < '2017-11-17') -> PASS > WHERE (current_date < CAST('2017-11-17' AS datetime)) -> PASS > > Sure, we know stricter typing will help performance, by encouraging I don't think performance was the goal of removing implicit type casts. > more-correct design. For example it may be prudent to ALWAYS failfast > when trying to join key/indexed table columns of differing types (one > char to another's int). But on non-key where conditions? Or make that a > behavior option. Perhaps An SQLWarning could be set when mismatches > occur, yet make a best effort at autocast. A few comments: * Keys should not behave differently from non-keys in comparisons. * I think the distinction you're trying to make is the casting of literals versus the casting of variables. All of the examples you gave involved literals. * If it's making its "best effort" to autocast, and it fails, how could it merely issue a warning? If it can't find a match it needs to error, because there's nothing it can do to continue even if we wanted it to. * It can almost always find a match, the question is whether it is the match that the user intended or not. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > Are you saying we make special cases for all of the "obvious" functions, > and they should behave differently from user-defined functions? It's possible to special-case any particular function you really feel you need this behavior for. We did special-case || (the string concatenation operator), and there was some discussion of also putting in a built-in special case for LIKE, but we desisted from sliding any further down that slippery slope. Since it's possible for users to install such hacks for themselves, as in the example here, http://archives.postgresql.org/pgsql-general/2007-11/msg00538.php there didn't seem to be a convincing case for weakening the type checking for everybody. regards, tom lane
Jeff Davis wrote: > If postgresql were to revert to 8.2 implicit casting behavior, would > that actually improve compatibility with other DBMSs? Every DBMS > probably has it's own rules for implicit casting, different from every > other DBMS. > > So are you sure it wouldn't just introduce more compatibility problems > somewhere else? Or worse, it could hide the problems during > migration/testing, and they could surface after you put it into > production. In my opinion the autocasting behaviors of the database are probably more consistent and fined tuned, than their explicit cast function. Or in the least, they may actually *save* mistakes that (lay) programmers would make (by adding casts for the sake of PG).
Tom Lane wrote: > > It's possible to special-case any particular function you really feel > you need this behavior for. We did special-case || (the string > concatenation operator), and there was some discussion of also putting > in a built-in special case for LIKE, but we desisted from sliding any > further down that slippery slope. Since it's possible for users to > install such hacks for themselves, as in the example here, > http://archives.postgresql.org/pgsql-general/2007-11/msg00538.php > there didn't seem to be a convincing case for weakening the type > checking for everybody. > Tom, is it accurate to assume that newer PG versions will further tighten type-strictness (say, '2008-01-01' presently being comparable to a datetime)? Also, do you know of any other vendors that are heading in this direction (removing by default the autocasts)?
On Feb 15, 2008, at 18:11 , Ken Johanson wrote: > Tom, is it accurate to assume that newer PG versions will further > tighten type-strictness (say, '2008-01-01' presently being > comparable to a datetime)? Also, do you know of any other vendors > that are heading in this direction (removing by default the > autocasts)? '2008-01-01' does not indicate some kind of string: it's just an untyped literal. Postgres will determine its type in context. For example: test=# select 'today is ' || '2008-02-16'; ?column? --------------------- today is 2008-02-16 (1 row) || is a string concatenation operator, so '2008-02-16' is treated as a string: it's not cast from date to text: from Postgres' eyes, it never was a date. test=# select '2008-02-15' > CURRENT_DATE; ?column? ---------- f (1 row) Here, we're comparing against a date type, so Postgres treats '2008-02-15' as a date. Michael Glaesemann grzm seespotcode net
Michael Glaesemann <grzm@seespotcode.net> writes: > test=# select '2008-02-15' > CURRENT_DATE; > Here, we're comparing against a date type, so Postgres treats > '2008-02-15' as a date. It might be worth pointing out that this is not magic, but an application of the general rule mentioned at step 2a here: http://www.postgresql.org/docs/8.3/static/typeconv-oper.html In any case where Postgres is trying to interpret a binary operator, and one input has a known type while the other is an unknown-type literal constant, the preferred interpretation will be that the constant has the same type as the known-type input. In a case like 'today is ' || '2008-02-16', *both* inputs are initially unknown-type literals. There is a fallback heuristic that prefers to resolve such cases as type text, which is why you get text concatenation rather than a "couldn't resolve operator" error. In no case does Postgres look at the content of an unknown literal to determine its type. '2008-02-16' is not treated differently from 'foobar'. regards, tom lane
Michael Glaesemann wrote: > > On Feb 15, 2008, at 18:11 , Ken Johanson wrote: > >> Tom, is it accurate to assume that newer PG versions will further >> tighten type-strictness (say, '2008-01-01' presently being comparable >> to a datetime)? Also, do you know of any other vendors that are >> heading in this direction (removing by default the autocasts)? > > '2008-01-01' does not indicate some kind of string: it's just an untyped > literal. Postgres will determine its type in context. Exactly, it is performing a context based auto conversion, what some will call a cast. select 5<'6' -> true select 5>'6' -> false select 15<'60' -> true select 15>'60' -> false So one can argue that is is convenient, and safe, to perform the same implicit/auto conversion for many functions which no longer do that. And that even if looses-typing / auto cast it allows/encourages bad design, that does not mean that the all designs cases will be bad. Some users prefer convenience over type safety, and some of those same users *will* produce error free code. On the other hand, should we go the extra mile and failfast when comparing 5 and '6'? No, because there is clearly only one appropriate conversion path (cast string to numeric) for the context. Or, some might argue we should not allow that comparison. select position('5' in 5) select position('.' in 5.1) select position('2008' in current_timestamp) Numbers and datetime in sql have exactly prescribed standard char representations (even if others dbs don't use them for datetimes). So one can argue implicit conversion to char IS safe for these types and any char-consuming functions.
Ken Johanson <pg-user@kensystem.com> writes: > select 5<'6' -> true > select 5>'6' -> false > select 15<'60' -> true > select 15>'60' -> false These examples miss the point, because they'd give the same answer whether you think the values are text or integer. Consider instead these cases: regression=# select 7 > '60'; -- int > int ?column? ---------- f (1 row) regression=# select '7' > '60'; -- text > text ?column? ---------- t (1 row) regression=# select 7 > '08'; -- int > int ?column? ---------- f (1 row) regression=# select '7' > '08'; -- text > text ?column? ---------- t (1 row) All of a sudden it seems much more important to be clear about what data type is involved, no? > Numbers and datetime in sql have exactly prescribed standard char > representations (even if others dbs don't use them for datetimes). See the datestyle parameter before you maintain that Postgres should assume that. regards, tom lane
Tom Lane wrote: > Ken Johanson <pg-user@kensystem.com> writes: >> select 5<'6' -> true >> select 5>'6' -> false >> select 15<'60' -> true >> select 15>'60' -> false > > These examples miss the point, because they'd give the same answer > whether you think the values are text or integer. Consider instead > these cases: > > regression=# select 7 > '60'; -- int > int > ?column? > ---------- > f > (1 row) > > regression=# select '7' > '60'; -- text > text > ?column? > ---------- > t > (1 row) > > regression=# select 7 > '08'; -- int > int > ?column? > ---------- > f > (1 row) > > regression=# select '7' > '08'; -- text > text > ?column? > ---------- > t > (1 row) > > All of a sudden it seems much more important to be clear about > what data type is involved, no? Agreed, so should we disallow 7 > '08'? Because that is (tell me if you disagree), much more hazardous than allowing, say TRIM(7) or POSITION('7' IN 7). Or for non-failfast comparison of two columns of dissimilar types (say bigint, integer, real, char). select 'ba'>'ab' -> true select 'ab'>'ba' -> false select '0.5'=.5 -> true (is char comparator or numeric to laymen?) select '7a'<'070' -> true (is char comparator or numeric to laymen?) select '7a'<70 -> failfast, good. > >> Numbers and datetime in sql have exactly prescribed standard char >> representations (even if others dbs don't use them for datetimes). > > See the datestyle parameter before you maintain that Postgres > should assume that. > > I agree. Unless the date style is know to always be iso8601, which is not true owed to datestyle. Unless sql spec allows for it, could this be an argument for removing the datestyle implict (non-iso8601) feature?
Ken Johanson <pg-user@kensystem.com> writes: > Tom Lane wrote: >> These examples miss the point, because they'd give the same answer >> whether you think the values are text or integer. ... > Agreed, so should we disallow 7 > '08'? Maybe, but the usability ramifications would be enormous --- you'd also be talking about breaking sale_date > '2008-01-20' and other cases where "leave off the quotes" is not an available option. In practice the current handling of one-known-and-one-unknown-input has not seemed to cause problems for people; it almost always does what they're expecting it to do. It's the cases where the values are of *known and incompatible* types that silently casting to make them match has proven to be dangerous. >>> Numbers and datetime in sql have exactly prescribed standard char >>> representations (even if others dbs don't use them for datetimes). >> >> See the datestyle parameter before you maintain that Postgres >> should assume that. > I agree. Unless the date style is know to always be iso8601, which is > not true owed to datestyle. Unless sql spec allows for it, could this be > an argument for removing the datestyle implict (non-iso8601) feature? I doubt we'll be doing that. It would not fix the problem anyway, since now that I think about it, your argument doesn't hold for numbers either. Consider regression=# select 2.0 / 3.0; ?column? ------------------------ 0.66666666666666666667 (1 row) regression=# select 10 ^ 14; ?column? ----------------- 100000000000000 (1 row) regression=# select 10 ^ 15; ?column? ---------- 1e+15 (1 row) regression=# select 1.230e15; ?column? ------------------ 1230000000000000 (1 row) There are a large number of details here that are not prescribed by the SQL standard (much less actually standardized across systems). They make little or no difference in the numeric domain, but they sure do as soon as you start doing textual operations. regards, tom lane