Re: UTF-8 encoding problem w/ libpq - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: UTF-8 encoding problem w/ libpq
Date	June 3, 2013 18:41:44
Msg-id	51ACE361.2050006@vmware.com
In response to	Re: UTF-8 encoding problem w/ libpq (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: UTF-8 encoding problem w/ libpq
List	pgsql-hackers

On 03.06.2013 21:28, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
>> backend is supposed to leave bytes with the high-bit set alone, ie. in
>> UTF-8 encoding, it's supposed to leave ä and ß alone.
>
> Well, actually, downcase_truncate_identifier() is doing this:
>
>         unsigned char ch = (unsigned char) ident[i];
>
>         if (ch >= 'A' && ch <= 'Z')
>             ch += 'a' - 'A';
>         else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>             ch = tolower(ch);
>
> There's basically no way that that second case can give pleasant results
> in a multibyte encoding, other than by not doing anything.

Hmph, I see.

> I suspect
> that Windows' libc has fewer defenses than other implementations and
> performs some transformation that we don't get elsewhere.  This may also
> explain the gripe yesterday in -general about funny results in OS X.

Can't really blame Windows for that. On Windows, we don't require that
the encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in 
the server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE 
implies WIN1252 encoding. We allow that and it generally works on 
Windows because in varstr_cmp, we use MultiByteToWideChar() followed by 
wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE. 
But for isupper(), it matters.
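
[Editor's illustration, not part of the original message.] The mismatch described above can be sketched without depending on any installed locale. The helper below, cp1252_tolower(), is a hypothetical stand-in for what tolower() would plausibly do under a Windows-1252 LC_CTYPE; it covers only the high-bit-set uppercase letters of that code page and is an assumption, not Windows' actual implementation. Applying it byte by byte to UTF-8 text shows why the result is unpleasant.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for tolower() under a Windows-1252 LC_CTYPE.
 * Only the high-bit-set uppercase letters of that code page are mapped.
 */
static unsigned char cp1252_tolower(unsigned char ch)
{
    if (ch >= 0xC0 && ch <= 0xDE && ch != 0xD7)   /* À..Þ, skipping × */
        return (unsigned char) (ch + 0x20);
    if (ch == 0x8A || ch == 0x8C || ch == 0x8E)   /* Š, Œ, Ž */
        return (unsigned char) (ch + 0x10);
    if (ch == 0x9F)                               /* Ÿ maps to ÿ */
        return 0xFF;
    return ch;
}

/*
 * Apply the per-byte mapping in place to every high-bit-set byte,
 * the way a byte-wise downcasing loop would.
 */
static void downcase_bytes_cp1252(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
    {
        if (buf[i] & 0x80)
            buf[i] = cp1252_tolower(buf[i]);
    }
}
```

Under this sketch, the UTF-8 bytes of ä (0xC3 0xA4) become 0xE3 0xA4, and those of ß (0xC3 0x9F) become 0xE3 0xFF. The lead byte 0xC3 is read as the Windows-1252 letter Ã and lowered to ã (0xE3), which in UTF-8 announces a three-byte sequence, so the identifier no longer contains the original character.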

> We talked about this before and went off into the weeds about whether
> it was sensible to try to use towlower() and whether that wouldn't
> create undesirably platform-sensitive results.  I wonder though if we
> couldn't just fix this code to not do anything to high-bit-set bytes
> in multibyte encodings.

Yeah, we should do that. It makes no sense to call isupper or tolower on 
bytes belonging to multi-byte characters.

- Heikki
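
[Editor's illustration, not part of the original message.] The fix agreed on above, leaving high-bit-set bytes untouched in multibyte encodings, might look roughly like the sketch below. The function name downcase_ident_sketch and the encoding_is_multibyte parameter are hypothetical; how the backend would determine that flag is deliberately left out, and this is not the code that was committed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Downcase an identifier in place.  ASCII letters are always folded.
 * High-bit-set bytes are passed to isupper()/tolower() only when the
 * encoding is single-byte; in a multibyte encoding they are left alone,
 * because each such byte is only a fragment of a character.
 */
static void downcase_ident_sketch(char *ident, size_t len,
                                  bool encoding_is_multibyte)
{
    assert(ident != NULL);
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch = (unsigned char) (ch + ('a' - 'A'));
        else if (!encoding_is_multibyte && (ch & 0x80) && isupper(ch))
            ch = (unsigned char) tolower(ch);

        ident[i] = (char) ch;
    }
}
```

With encoding_is_multibyte set to true, an identifier such as "TÄB" in UTF-8 has its ASCII letters folded to lower case while the two bytes of Ä (0xC3 0x84) come through unchanged, regardless of what LC_CTYPE is in effect. The trade-off is that non-ASCII letters are never case-folded in multibyte encodings.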
