Re: Windows and locales and UTF-8 (oh my) - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: Windows and locales and UTF-8 (oh my)
Date
Msg-id 20071015114010.GD5806@svr2.hagander.net
In response to Re: Windows and locales and UTF-8 (oh my)  (Magnus Hagander <magnus@hagander.net>)
Responses Re: Windows and locales and UTF-8 (oh my)
List pgsql-hackers
On Mon, Oct 15, 2007 at 01:26:00PM +0200, Magnus Hagander wrote:
> On Mon, Oct 15, 2007 at 11:09:54AM +0200, Magnus Hagander wrote:
> > On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> > > I am thinking that Dave's discovery explains some previously unsolved
> > > bug reports, such as
> > > http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> > > If Windows returns LC_CTYPE=C in a situation like this, then
> > > the various single-byte-charset optimization paths that are enabled by
> > > lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> > > upper()/lower() and other places.  ISTM we had better hack
> > > lc_ctype_is_c() so that on Windows (only), if the database encoding
> > > is UTF-8 then it returns FALSE regardless of what setlocale says.
> > 
> > Yes, I think we need a change to that routine.
> > 
> > But what about the case when we actually *have* locale=C and
> > encoding=UTF8? We need to handle that one somehow. Perhaps we should look
> > at LC_COLLATE instead (again, on Windows only; possibly even only in the
> > windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?
> 
> Hmm. Looking more at that, may there be another problem? Looking at
> WriteControlFile(), it writes out what setlocale(LC_CTYPE) returns, which
> will then be "C" - even if the database isn't in C.
> 
> But I don't really know when that code is called, or if I'm just looking at
> things wrong. Just starting up and shutting down the database leaves it at
> Swedish_Sweden.1252, not C.
> (1252 is still the wrong encoding specifier, but it'll work anyway since we
> convert to UTF16)

Gah, got that backwards. Of course it does, because setlocale() only returns
"C" if we set it to Swedish_Sweden.65001, and we don't *do* that with the
patch I sent in earlier. We set it to Swedish_Sweden, which is a perfectly
valid LC_CTYPE.

And given that, do we even need to special-case lc_ctype_is_c() at all, if
we never pass in a .65001 locale (which we don't, because it fails)?
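
To make the trade-off concrete, here is a rough sketch of the override Tom
proposed. It is not the real lc_ctype_is_c(): the function name
ctype_is_c_sketch and its explicit parameters are made up for illustration,
and the "C"/"POSIX" comparison is an assumption about what counts as the C
locale. The platform and encoding are passed in as flags so the logic can be
read (and compiled) anywhere.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for the check discussed in this thread.
 *
 * reported_ctype: what setlocale(LC_CTYPE, NULL) is assumed to have returned
 * db_is_utf8:     true if the database encoding is UTF-8
 * on_windows:     true if we are running on Windows
 */
static bool
ctype_is_c_sketch(const char *reported_ctype, bool db_is_utf8, bool on_windows)
{
    /* Assumption: "C" and "POSIX" are both treated as the C locale. */
    bool reported_c = strcmp(reported_ctype, "C") == 0 ||
                      strcmp(reported_ctype, "POSIX") == 0;

    /*
     * Proposed override: on Windows with a UTF-8 database, never trust
     * a "C" answer, so the single-byte fast paths stay disabled.
     */
    if (on_windows && db_is_utf8)
        return false;

    return reported_c;
}
```

With this sketch, ctype_is_c_sketch("C", true, true) is false, which is
exactly the genuine locale=C plus encoding=UTF8 case that would be
misclassified. ctype_is_c_sketch("Swedish_Sweden", true, true) is false with
or without the override, which is why the special case may be unnecessary as
long as a .65001 locale is never passed to setlocale().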

//Magnus

