Re: TM format can mix encodings in to_char() - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: TM format can mix encodings in to_char() |
Date | |
Msg-id | 15600.1555692459@sss.pgh.pa.us |
In response to | TM format can mix encodings in to_char() (Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>) |
Responses | Re: TM format can mix encodings in to_char() |
List | pgsql-hackers |
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
> The problem is that the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5),
> while the test runs in UTF8. So the following code will raise an error:

> SET lc_time TO 'tr_TR';
> SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
> ERROR: invalid byte sequence for encoding "UTF8": 0xde 0x75

Ugh.

> The problem seems to be in the code touched in the attached patch.

Hmm. I'd always imagined that the way that libc works is that LC_CTYPE determines the encoding (codeset) it's using across the board, so that functions like strftime would deliver data in that encoding. That's mainly based on the observation that nl_langinfo(CODESET) is specified to depend on LC_CTYPE, and it would be monumentally stupid for any libc functions to be operating according to a codeset that there's no way to discover.

However, your example shows that at least glibc is indeed monumentally stupid about this :-(.

But ... perhaps other implementations are not so silly? I went looking into the POSIX spec to see if it says anything about this, and discovered (in Base Definitions section 7, Locale):

    If different character sets are used by the locale categories, the
    results achieved by an application utilizing these categories are
    undefined. Likewise, if different codesets are used for the data
    being processed by interfaces whose behavior is dependent on the
    current locale, or the codeset is different from the codeset assumed
    when the locale was created, the result is also undefined.

"Undefined" is a term of art here: it means the library can misbehave arbitrarily badly, up to and including abort() or halt-and-catch-fire. We do *not* want to be invoking undefined behavior, even if particular implementations seem to behave sanely. Your proposed patch isn't getting us out of that, and what it is doing instead is embedding an assumption that the implementation handles this in a particular way.
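To make the mismatch discussed above concrete, here is a minimal standalone sketch (not PostgreSQL code) that sets LC_CTYPE and LC_TIME independently and prints what nl_langinfo(CODESET) reports next to the raw bytes strftime() returns. The helper names abbreviated_february and show_codeset_vs_strftime are hypothetical, and which locales are installed is an assumption about the test machine.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: format the abbreviated name of February under the
 * current LC_TIME setting.  Returns the number of bytes written, or 0 on
 * failure.
 */
static size_t
abbreviated_february(char *buf, size_t buflen)
{
    struct tm   tm;

    assert(buf != NULL && buflen > 0);
    memset(&tm, 0, sizeof(tm));
    tm.tm_year = 2010 - 1900;
    tm.tm_mon = 1;              /* months count from 0, so 1 is February */
    tm.tm_mday = 1;
    return strftime(buf, buflen, "%b", &tm);
}

/*
 * Hypothetical helper: set LC_CTYPE and LC_TIME independently, then print
 * the codeset libc reports next to the raw bytes strftime() produced.
 * Returns 0 on success, -1 if either locale is not available.
 */
static int
show_codeset_vs_strftime(const char *ctype_locale, const char *time_locale)
{
    char        buf[64];
    size_t      len;
    size_t      i;

    if (setlocale(LC_CTYPE, ctype_locale) == NULL ||
        setlocale(LC_TIME, time_locale) == NULL)
        return -1;

    /* nl_langinfo(CODESET) follows LC_CTYPE, not LC_TIME */
    printf("nl_langinfo(CODESET): %s\n", nl_langinfo(CODESET));

    len = abbreviated_february(buf, sizeof(buf));
    printf("strftime(\"%%b\") bytes:");
    for (i = 0; i < len; i++)
        printf(" 0x%02x", (unsigned char) buf[i]);
    printf("\n");
    return 0;
}
```

On a glibc system that has both locales installed, calling show_codeset_vs_strftime("en_US.UTF-8", "tr_TR") should, going by the error quoted above, report a UTF-8 codeset while the month name starts with the LATIN5 bytes 0xde 0x75. On other platforms the result may differ, which is the point of the "undefined" wording.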
So what I'm thinking really needs to be done here is to force it to work according to the LC_CTYPE-determines-the-codeset-for-everything model. Note that that model is embedded into PG in quite a few ways besides the one at stake here; for instance, pg_perm_setlocale thinks it should make gettext track the LC_CTYPE encoding, not anything else.

If we're willing to assume a lot about how locale names are spelled, we could imagine fixing this in cache_locale_time by having it strip any encoding spec from the given LC_TIME string and then adding on the codeset name from nl_langinfo(CODESET). Not sure about how well that'd play on Windows, though. We'd also need to adjust check_locale so that it does the same dance.

BTW, it seems very likely that we have similar issues with LC_MONETARY and LC_NUMERIC in PGLC_localeconv(). There's an interesting Windows-only hack in there now that seems to be addressing more or less the same issue; I wonder whether that would be rendered unnecessary if we approached it like this?

I'm also wondering why we have not noticed any comparable problem with LC_MESSAGES or LC_COLLATE. It's not so surprising that we haven't understood this hazard before with LC_TIME/LC_MONETARY/LC_NUMERIC given their limited usage in PG, but the same can't be said of LC_MESSAGES or LC_COLLATE.

regards, tom lane
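For illustration only, the name rewriting suggested above for cache_locale_time could be sketched roughly as follows. This is not PostgreSQL code: force_locale_codeset is a hypothetical helper, it assumes POSIX-style names of the form language[_territory][.codeset][@modifier], it leaves "C" and "POSIX" untouched (an assumption), and it does not address Windows locale names at all.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy "locale" into "dst" with its codeset part
 * replaced by "codeset", preserving any "@modifier" suffix.
 * Returns 0 on success, -1 if the result does not fit in "dst".
 */
static int
force_locale_codeset(const char *locale, const char *codeset,
                     char *dst, size_t dstlen)
{
    const char *dot;
    const char *at;
    size_t      baselen;
    int         n;

    assert(locale != NULL && codeset != NULL && dst != NULL);

    /* Assumption: "C" and "POSIX" carry no codeset, so pass them through. */
    if (strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0)
        n = snprintf(dst, dstlen, "%s", locale);
    else
    {
        dot = strchr(locale, '.');
        at = strchr(locale, '@');

        /* The base name ends at the first '.' or '@', whichever is first. */
        if (dot != NULL && (at == NULL || dot < at))
            baselen = (size_t) (dot - locale);
        else if (at != NULL)
            baselen = (size_t) (at - locale);
        else
            baselen = strlen(locale);

        /* base + "." + codeset + optional "@modifier" */
        n = snprintf(dst, dstlen, "%.*s.%s%s",
                     (int) baselen, locale, codeset,
                     at != NULL ? at : "");
    }

    return (n < 0 || (size_t) n >= dstlen) ? -1 : 0;
}
```

In the scheme described above, the codeset argument would come from nl_langinfo(CODESET) under the current LC_CTYPE, and the rewritten name would then be handed to setlocale(LC_TIME, ...). Whether the resulting locale actually exists on a given system would still have to be checked, which is why check_locale would need the same treatment.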