Thread: [HACKERS] adding an immutable variant of to_date
Hello everybody,
following up on this thread: https://www.postgresql.org/message-id/flat/d297e048-ac49-9bed-32e3-9dd4e65d0978%40mail.de
specifically on this mail: https://www.postgresql.org/message-id/baef819f-acf0-a64d-c1eb-d2c5da1e5030%40mail.de
I hope this idea fulfills the requirements. So let's see what you have to say.
to_date is not an immutable function but the best way to parse dates besides casting. Unfortunately, creating an functional index with to_date or casting will fail with an error.
Researching the topic in the past and currently, the situation hasn't improved. Many SO solutions or blog posts suggest to roll your own UDF to wrap up an mutable expression and marking it immutable by doing so. This approach has several disadvantages:
1) unawareness of the locale and timezone issues date/timestamp parsing involves
2) hiding errors in production when environment variables are missing/set differently than on dev machines
3) many re-implementations of the same problem
I consider 1) the biggest issue. Developers need to be aware of these kinds of issues to solve them properly. This is especially problematic when defining functional indexes where you don't have access to the actual date values.
For this purpose, I would like to adhere to the PostgreSQL roadmap and "scratch my own itch" by implementing the following solution which I think best solves the issue at hand:
** Idea **
An additional, immutable variant of to_date with a third parameter to specify the locale by which the date should be parsed. The additional documentation line (https://www.postgresql.org/docs/9.6/static/functions-formatting.html) would look like this:
to_date(text, text, text) date convert string to date given a specific locale to_date('05 Dez 2000', 'DD Mon YYYY', 'de_DE')
I think this approach has the following advantages:
1) a single, recommended and safe way to parse dates
2) make people aware of the locale/timezone issue but give them a standard tool to solve it
3) eventually make all those and related Google entries (https://www.google.de/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=postgresql+to_date+immutable&*) point to the same and safe solution
** Rejected Ideas **
Geoff suggest to simply mark to_date as stable and being able to parse dates by trying all locales. Two arguments work against this ideas: 1) it would create prohibitively large lookup tables and 2) some mappings are ambiguous: https://www.postgresql.org/message-id/aba44f78-2b84-e752-9b6f-3784bd0f981c%40mail.de
Another suggestion from Geoff >>format string with an additional locale "{locale=en_US}"<< couldn't fly as well because a function cannot be set immutable for some inputs.
What do you think?
Regards,
Sven
following up on this thread: https://www.postgresql.org/message-id/flat/d297e048-ac49-9bed-32e3-9dd4e65d0978%40mail.de
specifically on this mail: https://www.postgresql.org/message-id/baef819f-acf0-a64d-c1eb-d2c5da1e5030%40mail.de
I hope this idea fulfills the requirements. So let's see what you have to say.
to_date is not an immutable function but the best way to parse dates besides casting. Unfortunately, creating an functional index with to_date or casting will fail with an error.
Researching the topic in the past and currently, the situation hasn't improved. Many SO solutions or blog posts suggest to roll your own UDF to wrap up an mutable expression and marking it immutable by doing so. This approach has several disadvantages:
1) unawareness of the locale and timezone issues date/timestamp parsing involves
2) hiding errors in production when environment variables are missing/set differently than on dev machines
3) many re-implementations of the same problem
I consider 1) the biggest issue. Developers need to be aware of these kinds of issues to solve them properly. This is especially problematic when defining functional indexes where you don't have access to the actual date values.
For this purpose, I would like to adhere to the PostgreSQL roadmap and "scratch my own itch" by implementing the following solution which I think best solves the issue at hand:
** Idea **
An additional, immutable variant of to_date with a third parameter to specify the locale by which the date should be parsed. The additional documentation line (https://www.postgresql.org/docs/9.6/static/functions-formatting.html) would look like this:
to_date(text, text, text) date convert string to date given a specific locale to_date('05 Dez 2000', 'DD Mon YYYY', 'de_DE')
I think this approach has the following advantages:
1) a single, recommended and safe way to parse dates
2) make people aware of the locale/timezone issue but give them a standard tool to solve it
3) eventually make all those and related Google entries (https://www.google.de/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=postgresql+to_date+immutable&*) point to the same and safe solution
** Rejected Ideas **
Geoff suggest to simply mark to_date as stable and being able to parse dates by trying all locales. Two arguments work against this ideas: 1) it would create prohibitively large lookup tables and 2) some mappings are ambiguous: https://www.postgresql.org/message-id/aba44f78-2b84-e752-9b6f-3784bd0f981c%40mail.de
Another suggestion from Geoff >>format string with an additional locale "{locale=en_US}"<< couldn't fly as well because a function cannot be set immutable for some inputs.
What do you think?
Regards,
Sven
On 03/03/2017 10:41 PM, Sven R. Kunze wrote: > What do you think? I have some thoughts: 1) I do not think we currently allow setting the locale like this anywhere, so this will introduce a new concept to PostgreSQL. And you will probably need to add support for caching per locale. 2) As far as I can tell from reading the code to_date currently ignores the M suffix which indicates that you want localized month/day names, so i guess that to_date is currently immutable. Maybe it is stable due to the idea that we may want to support the M suffix in the future. 3) I do not like the to_date function. It is much too forgiving with invalid input. For example 2017-02-30 because 2017-03-02. Also just ignoring the M suffix in the format string seems pretty bad Personally I would rather see a new date parsing function which is easier to work with or somehow fix to_date without pissing too many users off, but I have no idea if this is a view shared with the rest of the community. Andreas
On 07.03.2017 03:21, Andreas Karlsson wrote: > 1) I do not think we currently allow setting the locale like this > anywhere, so this will introduce a new concept to PostgreSQL. And you > will probably need to add support for caching per locale. Good to know. Could you explain what you mean by "caching per locale"? > 2) As far as I can tell from reading the code to_date currently > ignores the M suffix which indicates that you want localized month/day > names, so i guess that to_date is currently immutable. Maybe it is > stable due to the idea that we may want to support the M suffix in the > future. I think that's the case. > 3) I do not like the to_date function. It is much too forgiving with > invalid input. For example 2017-02-30 because 2017-03-02. That's indeed a funny parsing result. Why does to_date perform this kind of error-correction? > Also just ignoring the M suffix in the format string seems pretty bad. > > Personally I would rather see a new date parsing function which is > easier to work with or somehow fix to_date without pissing too many > users off, but I have no idea if this is a view shared with the rest > of the community. Neither do I. Many business applications have to deal with date(times). I came from the practical issue of indexing json objects. Regards, Sven
On Tue, Mar 7, 2017 at 3:56 PM, Sven R. Kunze <srkunze@mail.de> wrote: >> 2) As far as I can tell from reading the code to_date currently ignores >> the M suffix which indicates that you want localized month/day names, so i >> guess that to_date is currently immutable. Maybe it is stable due to the >> idea that we may want to support the M suffix in the future. > > I think that's the case. The commit message for 64353640e87b54250f1b8a1d2a708d270dff4bfd has some interesting perspective on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 03/07/2017 09:56 PM, Sven R. Kunze wrote: > On 07.03.2017 03:21, Andreas Karlsson wrote: >> 1) I do not think we currently allow setting the locale like this >> anywhere, so this will introduce a new concept to PostgreSQL. And you >> will probably need to add support for caching per locale. > > Good to know. Could you explain what you mean by "caching per locale"? The current code for to_char will on the first call to to_char build arrays with the localized names of the week days and the months. I suspect that you may need to build something similar but a set of arrays per locale. See the DCH_to_char function and its call to cache_locale_time. Andreas
Thanks! On 08.03.2017 17:37, Andreas Karlsson wrote: > The current code for to_char will on the first call to to_char build > arrays with the localized names of the week days and the months. I > suspect that you may need to build something similar but a set of > arrays per locale. Not sure what the most C-like way of solving this is. Coming from a Python background, I would tend to use a dict. The keys are the locales, the values are the arrays. I would use something which is described here: http://stackoverflow.com/questions/3269881/how-to-represent-a-python-like-dictionary-in-c . Preferred libc-related, like hcreate and hsearch. > See the DCH_to_char function and its call to cache_locale_time. Alright. Found the cache here: **** src/backend/utils/adt/pg_locale.c /* lc_time localization cache */ char *localized_abbrev_days[7]; char *localized_full_days[7]; char *localized_abbrev_months[12]; char *localized_full_months[12]; /* .... */ void cache_locale_time(void) { char *save_lc_time; time_t timenow; struct tm *timeinfo; int i; #ifdef WIN32 char *save_lc_ctype; #endif /* did we do this already? */ if (CurrentLCTimeValid) return; I would replace the invalidation check with a hsearch lookup. Grepping for the cache variables, I found: ~/src/postgres$ grep -r localized_abbrev_days * src/backend/utils/adt/pg_locale.c:char *localized_abbrev_days[7]; src/backend/utils/adt/pg_locale.c: cache_single_time(&localized_abbrev_days[i], "%a", timeinfo); src/backend/utils/adt/formatting.c: char *str = str_toupper_z(localized_abbrev_days[tm->tm_wday], collid); src/backend/utils/adt/formatting.c: char *str = str_initcap_z(localized_abbrev_days[tm->tm_wday], collid); src/backend/utils/adt/formatting.c: char *str = str_tolower_z(localized_abbrev_days[tm->tm_wday], collid); src/include/utils/pg_locale.h:extern char *localized_abbrev_days[]; ~/src/postgres$ grep -r localized_full_days * src/backend/utils/adt/pg_locale.c:char *localized_full_days[7]; src/backend/utils/adt/pg_locale.c: cache_single_time(&localized_full_days[i], "%A", timeinfo); src/backend/utils/adt/formatting.c: char *str = str_toupper_z(localized_full_days[tm->tm_wday], collid); src/backend/utils/adt/formatting.c: char *str = str_initcap_z(localized_full_days[tm->tm_wday], collid); src/backend/utils/adt/formatting.c: char *str = str_tolower_z(localized_full_days[tm->tm_wday], collid); src/include/utils/pg_locale.h:extern char *localized_full_days[]; ~/src/postgres$ grep -r localized_abbrev_months * src/backend/utils/adt/pg_locale.c:char *localized_abbrev_months[12]; src/backend/utils/adt/pg_locale.c: cache_single_time(&localized_abbrev_months[i], "%b", timeinfo); src/backend/utils/adt/formatting.c: char *str = str_toupper_z(localized_abbrev_months[tm->tm_mon - 1], collid); src/backend/utils/adt/formatting.c: char *str = str_initcap_z(localized_abbrev_months[tm->tm_mon - 1], collid); src/backend/utils/adt/formatting.c: char *str = str_tolower_z(localized_abbrev_months[tm->tm_mon - 1], collid); src/include/utils/pg_locale.h:extern char *localized_abbrev_months[]; srkunze@mexico:~/src/postgres$ grep -r localized_full_months * src/backend/utils/adt/pg_locale.c:char *localized_full_months[12]; src/backend/utils/adt/pg_locale.c: cache_single_time(&localized_full_months[i], "%B", timeinfo); src/backend/utils/adt/formatting.c: char *str = str_toupper_z(localized_full_months[tm->tm_mon - 1], collid); src/backend/utils/adt/formatting.c: char *str = str_initcap_z(localized_full_months[tm->tm_mon - 1], collid); src/backend/utils/adt/formatting.c: char *str = str_tolower_z(localized_full_months[tm->tm_mon - 1], collid); src/include/utils/pg_locale.h:extern char *localized_full_months[]; It seems to me, that I would need to make them point to the correct array address at the beginning of those functions by doing an hsearch. That step is basically independent of the actual immutable to_date idea. What do you think? Best, Sven
On 08.03.2017 03:37, Robert Haas wrote: > The commit message for 64353640e87b54250f1b8a1d2a708d270dff4bfd has > some interesting perspective on this. """ Also, mark to_date() and to_char(interval) as stable; although these appear not to depend on any GUC variables as of CVS HEAD, that seems a property unlikely to survive future improvements. It seems best to mark all the formatting functions stable and be done with it. """ My take away from this commit is the following: Historically, the immutability of "to_date(text, text)" was an emergent feature. Proven to be possibly mutable, the parsing feature had a higher priority, so the immutability needed to be removed. The proposed variant on the other hand should be immutable first before everything else. Thus, future improvements cannot violate that. This might restrict the possible parsing functionality but that shouldn't be a problem in practice. Regards, Sven