Thread: [HACKERS] adding an immutable variant of to_date

[HACKERS] adding an immutable variant of to_date

From
"Sven R. Kunze"
Date:
Hello everybody,

following up on this thread: https://www.postgresql.org/message-id/flat/d297e048-ac49-9bed-32e3-9dd4e65d0978%40mail.de
specifically on this mail: https://www.postgresql.org/message-id/baef819f-acf0-a64d-c1eb-d2c5da1e5030%40mail.de
I hope this idea fulfills the requirements. So let's see what you have to say.

to_date is not an immutable function but the best way to parse dates besides casting. Unfortunately, creating an functional index with to_date or casting will fail with an error.

Researching the topic in the past and currently, the situation hasn't improved. Many SO solutions or blog posts suggest to roll your own UDF to wrap up an mutable expression and marking it immutable by doing so. This approach has several disadvantages:

1) unawareness of the locale and timezone issues date/timestamp parsing involves
2) hiding errors in production when environment variables are missing/set differently than on dev machines
3) many re-implementations of the same problem

I consider 1) the biggest issue. Developers need to be aware of these kinds of issues to solve them properly. This is especially problematic when defining functional indexes where you don't have access to the actual date values.

For this purpose, I would like to adhere to the PostgreSQL roadmap and "scratch my own itch" by implementing the following solution which I think best solves the issue at hand:

** Idea **

An additional, immutable variant of to_date with a third parameter to specify the locale by which the date should be parsed. The additional documentation line (https://www.postgresql.org/docs/9.6/static/functions-formatting.html) would look like this:

to_date(text, text, text)    date    convert string to date given a specific locale    to_date('05 Dez 2000', 'DD Mon YYYY', 'de_DE')


I think this approach has the following advantages:

1) a single, recommended and safe way to parse dates
2) make people aware of the locale/timezone issue but give them a standard tool to solve it
3) eventually make all those and related Google entries (https://www.google.de/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=postgresql+to_date+immutable&*) point to the same and safe solution

** Rejected Ideas **
Geoff suggest to simply mark to_date as stable and being able to parse dates by trying all locales. Two arguments work against this ideas: 1) it would create prohibitively large lookup tables and 2) some mappings are ambiguous: https://www.postgresql.org/message-id/aba44f78-2b84-e752-9b6f-3784bd0f981c%40mail.de

Another suggestion from Geoff >>format string with an additional locale "{locale=en_US}"<< couldn't fly as well because a function cannot be set immutable for some inputs.


What do you think?


Regards,
Sven

Re: [HACKERS] adding an immutable variant of to_date

From
Andreas Karlsson
Date:
On 03/03/2017 10:41 PM, Sven R. Kunze wrote:
> What do you think?

I have some thoughts:

1) I do not think we currently allow setting the locale like this 
anywhere, so this will introduce a new concept to PostgreSQL. And you 
will probably need to add support for caching per locale.

2) As far as I can tell from reading the code to_date currently ignores 
the M suffix which indicates that you want localized month/day names, so 
i guess that to_date is currently immutable. Maybe it is stable due to 
the idea that we may want to support the M suffix in the future.

3) I do not like the to_date function. It is much too forgiving with 
invalid input. For example 2017-02-30 because 2017-03-02. Also just 
ignoring the M suffix in the format string seems pretty bad

Personally I would rather see a new date parsing function which is 
easier to work with or somehow fix to_date without pissing too many 
users off, but I have no idea if this is a view shared with the rest of 
the community.

Andreas



Re: [HACKERS] adding an immutable variant of to_date

From
"Sven R. Kunze"
Date:
On 07.03.2017 03:21, Andreas Karlsson wrote:
> 1) I do not think we currently allow setting the locale like this 
> anywhere, so this will introduce a new concept to PostgreSQL. And you 
> will probably need to add support for caching per locale.

Good to know. Could you explain what you mean by "caching per locale"?

> 2) As far as I can tell from reading the code to_date currently 
> ignores the M suffix which indicates that you want localized month/day 
> names, so i guess that to_date is currently immutable. Maybe it is 
> stable due to the idea that we may want to support the M suffix in the 
> future.

I think that's the case.

> 3) I do not like the to_date function. It is much too forgiving with 
> invalid input. For example 2017-02-30 because 2017-03-02.

That's indeed a funny parsing result. Why does to_date perform this kind 
of error-correction?

> Also just ignoring the M suffix in the format string seems pretty bad.
>
> Personally I would rather see a new date parsing function which is 
> easier to work with or somehow fix to_date without pissing too many 
> users off, but I have no idea if this is a view shared with the rest 
> of the community.

Neither do I.

Many business applications have to deal with date(times). I came from 
the practical issue of indexing json objects.

Regards,
Sven




Re: [HACKERS] adding an immutable variant of to_date

From
Robert Haas
Date:
On Tue, Mar 7, 2017 at 3:56 PM, Sven R. Kunze <srkunze@mail.de> wrote:
>> 2) As far as I can tell from reading the code to_date currently ignores
>> the M suffix which indicates that you want localized month/day names, so i
>> guess that to_date is currently immutable. Maybe it is stable due to the
>> idea that we may want to support the M suffix in the future.
>
> I think that's the case.

The commit message for 64353640e87b54250f1b8a1d2a708d270dff4bfd has
some interesting perspective on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] adding an immutable variant of to_date

From
Andreas Karlsson
Date:
On 03/07/2017 09:56 PM, Sven R. Kunze wrote:
> On 07.03.2017 03:21, Andreas Karlsson wrote:
>> 1) I do not think we currently allow setting the locale like this
>> anywhere, so this will introduce a new concept to PostgreSQL. And you
>> will probably need to add support for caching per locale.
>
> Good to know. Could you explain what you mean by "caching per locale"?

The current code for to_char will on the first call to to_char build 
arrays with the localized names of the week days and the months. I 
suspect that you may need to build something similar but a set of arrays 
per locale.

See the DCH_to_char function and its call to cache_locale_time.

Andreas



Re: [HACKERS] adding an immutable variant of to_date

From
"Sven R. Kunze"
Date:
Thanks!

On 08.03.2017 17:37, Andreas Karlsson wrote:
> The current code for to_char will on the first call to to_char build 
> arrays with the localized names of the week days and the months. I 
> suspect that you may need to build something similar but a set of 
> arrays per locale.

Not sure what the most C-like way of solving this is. Coming from a 
Python background, I would tend to use a dict. The keys are the locales, 
the values are the arrays.

I would use something which is described here: 
http://stackoverflow.com/questions/3269881/how-to-represent-a-python-like-dictionary-in-c 
. Preferred libc-related, like hcreate and hsearch.

> See the DCH_to_char function and its call to cache_locale_time.

Alright. Found the cache here:

**** src/backend/utils/adt/pg_locale.c
/* lc_time localization cache */
char       *localized_abbrev_days[7];
char       *localized_full_days[7];
char       *localized_abbrev_months[12];
char       *localized_full_months[12];

/* .... */

void
cache_locale_time(void)
{        char       *save_lc_time;        time_t          timenow;        struct tm  *timeinfo;        int
      i;
 

#ifdef WIN32        char       *save_lc_ctype;
#endif
        /* did we do this already? */        if (CurrentLCTimeValid)                return;

I would replace the invalidation check with a hsearch lookup.



Grepping for the cache variables, I found:

~/src/postgres$ grep -r localized_abbrev_days *
src/backend/utils/adt/pg_locale.c:char *localized_abbrev_days[7];
src/backend/utils/adt/pg_locale.c: 
cache_single_time(&localized_abbrev_days[i], "%a", timeinfo);
src/backend/utils/adt/formatting.c:                    char *str = 
str_toupper_z(localized_abbrev_days[tm->tm_wday], collid);
src/backend/utils/adt/formatting.c:                    char *str = 
str_initcap_z(localized_abbrev_days[tm->tm_wday], collid);
src/backend/utils/adt/formatting.c:                    char *str = 
str_tolower_z(localized_abbrev_days[tm->tm_wday], collid);
src/include/utils/pg_locale.h:extern char *localized_abbrev_days[];

~/src/postgres$ grep -r localized_full_days *
src/backend/utils/adt/pg_locale.c:char *localized_full_days[7];
src/backend/utils/adt/pg_locale.c: 
cache_single_time(&localized_full_days[i], "%A", timeinfo);
src/backend/utils/adt/formatting.c:                    char *str = 
str_toupper_z(localized_full_days[tm->tm_wday], collid);
src/backend/utils/adt/formatting.c:                    char *str = 
str_initcap_z(localized_full_days[tm->tm_wday], collid);
src/backend/utils/adt/formatting.c:                    char *str = 
str_tolower_z(localized_full_days[tm->tm_wday], collid);
src/include/utils/pg_locale.h:extern char *localized_full_days[];

~/src/postgres$ grep -r localized_abbrev_months *
src/backend/utils/adt/pg_locale.c:char *localized_abbrev_months[12];
src/backend/utils/adt/pg_locale.c: 
cache_single_time(&localized_abbrev_months[i], "%b", timeinfo);
src/backend/utils/adt/formatting.c:                    char *str = 
str_toupper_z(localized_abbrev_months[tm->tm_mon - 1], collid);
src/backend/utils/adt/formatting.c:                    char *str = 
str_initcap_z(localized_abbrev_months[tm->tm_mon - 1], collid);
src/backend/utils/adt/formatting.c:                    char *str = 
str_tolower_z(localized_abbrev_months[tm->tm_mon - 1], collid);
src/include/utils/pg_locale.h:extern char *localized_abbrev_months[];

srkunze@mexico:~/src/postgres$ grep -r localized_full_months *
src/backend/utils/adt/pg_locale.c:char *localized_full_months[12];
src/backend/utils/adt/pg_locale.c: 
cache_single_time(&localized_full_months[i], "%B", timeinfo);
src/backend/utils/adt/formatting.c:                    char *str = 
str_toupper_z(localized_full_months[tm->tm_mon - 1], collid);
src/backend/utils/adt/formatting.c:                    char *str = 
str_initcap_z(localized_full_months[tm->tm_mon - 1], collid);
src/backend/utils/adt/formatting.c:                    char *str = 
str_tolower_z(localized_full_months[tm->tm_mon - 1], collid);
src/include/utils/pg_locale.h:extern char *localized_full_months[];


It seems to me, that I would need to make them point to the correct 
array address at the beginning of those functions by doing an hsearch.

That step is basically independent of the actual immutable to_date idea.


What do you think?

Best,
Sven




Re: [HACKERS] adding an immutable variant of to_date

From
"Sven R. Kunze"
Date:
On 08.03.2017 03:37, Robert Haas wrote:
> The commit message for 64353640e87b54250f1b8a1d2a708d270dff4bfd has
> some interesting perspective on this.

"""
Also, mark to_date() and to_char(interval) as stable; although these appear
not to depend on any GUC variables as of CVS HEAD, that seems a property
unlikely to survive future improvements.  It seems best to mark all the
formatting functions stable and be done with it.
"""

My take away from this commit is the following:

Historically, the immutability of "to_date(text, text)" was an emergent 
feature. Proven to be possibly mutable, the parsing feature had a higher 
priority, so the immutability needed to be removed.

The proposed variant on the other hand should be immutable first before 
everything else. Thus, future improvements cannot violate that. This 
might restrict the possible parsing functionality but that shouldn't be 
a problem in practice.

Regards,
Sven