Re: Allow to_date() and to_timestamp() to accept localized names - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Allow to_date() and to_timestamp() to accept localized names
Msg-id 7212.1580177517@sss.pgh.pa.us
In response to Re: Allow to_date() and to_timestamp() to accept localized names  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses Re: Allow to_date() and to_timestamp() to accept localized names  (Mark Dilger <mark.dilger@enterprisedb.com>)
Re: Allow to_date() and to_timestamp() to accept localized names  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> For the record, the correct form of that would appear to be
> select to_date('Ιανουάριος', 'TMMonth');
> with the accent.  I had tried different variations of that and they all
> failed.

OK, so for anyone who is as confused as I was, the main point here
seems to be this: the upper case form of Greek sigma is 'Σ',
and the lower case form is 'σ' ... except as the final letter of
a word, where it is supposed to be written like 'ς'.

If I set lc_collate, lc_ctype, and lc_time to 'el_GR.utf8',
then (on a somewhat hoary glibc platform) I get

u8=# select to_char('2020-01-01'::timestamptz, 'TMMONTH');
       to_char
----------------------
 ΙΑΝΟΥΆΡΙΟΣ
(1 row)

u8=# select to_char('2020-01-01'::timestamptz, 'TMMonth');
       to_char
----------------------
 Ιανουάριος
(1 row)

u8=# select to_char('2020-01-01'::timestamptz, 'TMmonth');
       to_char
----------------------
 ιανουάριος
(1 row)

which is correct AFAICS ... but

u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH'));
        lower
----------------------
 ιανουάριοσ
(1 row)

So what we actually have here, ISTM, is a bug in lower() not to_char().
The bug is unsurprising because str_tolower() simply applies towlower_l()
to each character independently, so there's no way for it to account for
the word-final rule.  I'm not aware that glibc provides any API whereby
that could be done correctly.  On the other hand, we get it right when
using an ICU collation for lower():

u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH') collate "el-gr-x-icu");
        lower
----------------------
 ιανουάριος
(1 row)

because that code path passes the whole string to ICU at once, and
of course getting this right is ICU's entire job.
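The difference between the two code paths can be sketched outside PostgreSQL. The following is only an illustration, not the server's code: Python's str.lower() is used as a stand-in for a whole-string case mapper (it applies the Unicode final-sigma rule), and a per-character loop stands in for calling towlower_l() on each character independently. The helper name lower_per_char is hypothetical.

```python
# Illustrative sketch only; not PostgreSQL's str_tolower().
# Assumption: Python's str.lower() plays the role of a whole-string
# mapper such as ICU, and the loop plays the role of per-character
# towlower_l() calls.

upper_name = "ΙΑΝΟΥΆΡΙΟΣ"


def lower_per_char(s: str) -> str:
    """Lowercase each character in isolation, with no word context."""
    return "".join(ch.lower() for ch in s)


per_char = lower_per_char(upper_name)  # final letter has no context
whole = upper_name.lower()             # final letter sees the whole word

print(per_char)
print(whole)
```

The two results differ only in the last letter: the per-character version cannot know that the sigma is word-final, whereas the whole-string version can.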

I haven't double-checked, but I imagine that the reason that to_char
gets the month name case-folding right is that what comes out of
strftime(..."%B"...) is "Ιανουάριος" which we are able to upcase
correctly, while the downcasing code paths don't affect 'ς'.
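As a rough check of that guess, the sketch below starts from the mixed-case form and applies per-character mappings in both directions. It assumes, as the paragraph above does without having verified it, that the month name arrives already ending in 'ς'. The helper map_per_char is hypothetical and Python's per-character mappings only approximate what towupper_l()/towlower_l() would do.

```python
# Illustrative sketch only. Assumption: the locale's month name is
# delivered in mixed case and already ends in final sigma.

month_name = "Ιανουάριος"


def map_per_char(s: str, fn) -> str:
    """Apply a case-mapping function to each character independently."""
    return "".join(fn(ch) for ch in s)


upcased = map_per_char(month_name, str.upper)    # 'ς' maps to 'Σ'
downcased = map_per_char(month_name, str.lower)  # 'ς' is left alone

print(upcased)
print(downcased)
```

Upcasing needs no word context because both lowercase sigmas map to the same capital, and downcasing leaves the already-lowercase final sigma untouched, so neither direction runs into the word-final rule.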

I thought for a little bit about trying to dodge this issue in the
patch by folding to upper case, not lower, before comparing month/day
names.  I fear that that would just shift the problem cases to some
other language(s).  However, it makes Greek better, and I think it
makes German better (does 'ß' appear in any month/day names there?),
so maybe we should just roll with that.  In the end, it doesn't seem
right to reject this patch just because lower() is broken on some
platforms.
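A minimal sketch of why upper-case folding sidesteps the Greek problem, assuming a per-character comparison like the one discussed above. The function names are hypothetical and this is not the patch's code; it only shows that 'σ' and 'ς' both fold to 'Σ', so every spelling compares equal after upcasing, while per-character downcasing keeps two distinct forms.

```python
# Illustrative sketch only; helper names are hypothetical and Python's
# per-character mappings stand in for towupper_l()/towlower_l().

def fold_upper_per_char(s: str) -> str:
    return "".join(ch.upper() for ch in s)


def fold_lower_per_char(s: str) -> str:
    return "".join(ch.lower() for ch in s)


def names_match(a: str, b: str) -> bool:
    """Compare two names after per-character upper-case folding."""
    return fold_upper_per_char(a) == fold_upper_per_char(b)


reference = "Ιανουάριος"
candidates = ["Ιανουάριος", "ΙΑΝΟΥΆΡΙΟΣ", "ιανουάριος", "ιανουάριοσ"]

upper_results = [names_match(reference, c) for c in candidates]

# With lower-case folding the all-caps input ends in 'σ', not 'ς'.
lower_mismatch = (
    fold_lower_per_char("ΙΑΝΟΥΆΡΙΟΣ") != fold_lower_per_char(reference)
)

print(upper_results)
print(lower_mismatch)
```

This says nothing about other languages, where upper-case folding may introduce its own problem cases.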

The other question your example raises is whether we should be trying
to de-accent before comparison, i.e., was it right for 'Ιανουάριος' to
be treated differently from 'Ιανουαριος'.  I don't know enough Greek
to say, but it kind of feels like that should be outside to_date's
purview.
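For readers wondering what de-accenting would involve, here is one common approach, sketched under the assumption that "de-accent" means dropping combining marks after canonical decomposition. It is not something to_date() is known to do, and the helper name strip_accents is hypothetical.

```python
# Illustrative sketch only: one possible meaning of "de-accenting".
import unicodedata


def strip_accents(s: str) -> str:
    """Decompose to NFD, drop nonspacing marks, recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", kept)


accented = "Ιανουάριος"
plain = "Ιανουαριος"

print(strip_accents(accented) == plain)
print(accented == plain)
```

Whether treating the two spellings as equal is linguistically acceptable for Greek is exactly the open question raised above; the sketch only shows the mechanics.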

            regards, tom lane


