Re: Allow to_date() and to_timestamp() to accept localized names - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Allow to_date() and to_timestamp() to accept localized names |
Date | |
Msg-id | 7212.1580177517@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Allow to_date() and to_timestamp() to accept localized names (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>) |
Responses |
Re: Allow to_date() and to_timestamp() to accept localized names
Re: Allow to_date() and to_timestamp() to accept localized names |
List | pgsql-hackers |
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > For the record, the correct form of that would appear to be > select to_date('Ιανουάριος', 'TMMonth'); > with the accent. I had tried different variations of that and they all > failed. OK, so for anyone who is as confused as I was, the main point here seems to be this: the upper case form of Greek sigma is 'Σ', and the lower case form is 'σ' ... except as the final letter of a word, where it is supposed to be written like 'ς'. If I set lc_collate, lc_ctype, and lc_time to 'el_GR.utf8', then (on a somewhat hoary glibc platform) I get u8=# select to_char('2020-01-01'::timestamptz, 'TMMONTH'); to_char ---------------------- ΙΑΝΟΥΆΡΙΟΣ (1 row) u8=# select to_char('2020-01-01'::timestamptz, 'TMMonth'); to_char ---------------------- Ιανουάριος (1 row) u8=# select to_char('2020-01-01'::timestamptz, 'TMmonth'); to_char ---------------------- ιανουάριος (1 row) which is correct AFAICS ... but u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH')); lower ---------------------- ιανουάριοσ (1 row) So what we actually have here, ISTM, is a bug in lower() not to_char(). The bug is unsurprising because str_tolower() simply applies towlower_l() to each character independently, so there's no way for it to account for the word-final rule. I'm not aware that glibc provides any API whereby that could be done correctly. On the other hand, we get it right when using an ICU collation for lower(): u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH') collate "el-gr-x-icu"); lower ---------------------- ιανουάριος (1 row) because that code path passes the whole string to ICU at once, and of course getting this right is ICU's entire job. I haven't double-checked, but I imagine that the reason that to_char gets the month name case-folding right is that what comes out of strftime(..."%B"...) is "Ιανουάριος" which we are able to upcase correctly, while the downcasing code paths don't affect 'ς'. I thought for a little bit about trying to dodge this issue in the patch by folding to upper case, not lower, before comparing month/day names. I fear that that would just shift the problem cases to some other language(s). However, it makes Greek better, and I think it makes German better (does 'ß' appear in any month/day names there?), so maybe we should just roll with that. In the end, it doesn't seem right to reject this patch just because lower() is broken on some platforms. The other question your example raises is whether we should be trying to de-accent before comparison, ie was it right for 'Ιανουάριος' to be treated differently from 'Ιανουαριος'. I don't know enough Greek to say, but it kind of feels like that should be outside to_date's purview. regards, tom lane
pgsql-hackers by date: