Re: UTF-8 encoding problem w/ libpq - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: UTF-8 encoding problem w/ libpq
Date
Msg-id 51ACC2E3.9020309@vmware.com
In response to Re: UTF-8 encoding problem w/ libpq  ("ktm@rice.edu" <ktm@rice.edu>)
Responses Re: UTF-8 encoding problem w/ libpq  (Andrew Dunstan <andrew@dunslane.net>)
Re: UTF-8 encoding problem w/ libpq  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 03.06.2013 18:27, ktm@rice.edu wrote:
> On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote:
>>
>>>> If I change the strCreate query and add double quotes around the column
>>>> name, then the problem disappears. But the original name is already in
>>>> lowercase, so I think it should also work without quoting the column name.
>>>> Am I missing some setup in either the database or in the use of libpq?
>>>>
>>>> I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
>>>>
>>>> The database uses:
>>>> ENCODING = 'UTF8'
>>>> LC_COLLATE = 'English_United Kingdom.1252'
>>>> LC_CTYPE = 'English_United Kingdom.1252'
>>>>
>>>> Thanks for any help,
>>>>
>>>> Martin
>>>>
>>>
>>> Hi Martin,
>>>
>>> If you do not want the lowercase behavior, you must put double-quotes
>>> around the column name per the documentation:
>>>
>>> http://www.postgresql.org/docs/9.2/interactive/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS
>>>
>>> section 4.1.1.
>>>
>>> Regards,
>>> Ken
>>
>> The original name 'id_äß' is already in lowercase. The backend should leave it unchanged IMO.
>
> Only in utf-8 which needs to be double-quoted for a column name as you have
> seen, otherwise the value will be lowercased per byte.

He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
backend is supposed to leave bytes with the high bit set alone, i.e. in
UTF-8 encoding, it's supposed to leave ä and ß alone.
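
To illustrate the intended behaviour, here is a minimal sketch of ASCII-only downcasing. It is not the backend's actual code; the function name downcase_ascii_only is made up for this example. Only the bytes 'A'..'Z' are folded, so every byte of a multi-byte UTF-8 sequence (all of which have the high bit set) passes through unchanged.

```cpp
#include <string>

// Hypothetical helper, for illustration only: fold ASCII 'A'..'Z' to
// lowercase and leave every other byte untouched, including all bytes
// >= 0x80 that make up multi-byte UTF-8 sequences.
std::string downcase_ascii_only(std::string ident)
{
    for (std::string::size_type i = 0; i < ident.size(); i++) {
        unsigned char ch = static_cast<unsigned char>(ident[i]);
        if (ch >= 'A' && ch <= 'Z')
            ident[i] = static_cast<char>(ch - 'A' + 'a');
        // Anything else (digits, '_', high-bit bytes) is copied as-is.
    }
    return ident;
}
```

With this rule, the UTF-8 bytes c3 a4 c3 9f for äß come out exactly as they went in, whether or not the identifier is quoted.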

I suspect that the conversion to UTF-8, before the string is sent to the
server, is not being done correctly. I'm not sure what's wrong there,
but I'd suggest printing the actual byte sequence sent to the server, to
check whether it is in fact valid UTF-8, i.e. replace the PQexec() line with
something like:
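    /* Keep the converted string alive while s points into it
     * (assumes ToUtf8() returns a std::string). */
    std::string utf8 = ToUtf8(strCreate.c_str());
    const char *s = utf8.c_str();
    int i;

    /* Print every byte that will be sent, as two hex digits. */
    for (i = 0; s[i]; i++)
        printf("%02x", (unsigned char) s[i]);
    printf("\n");

    pResult = PQexec(pConn, s);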
 

The printed output should contain the UTF-8 byte sequence for äß, "c3a4c39f".
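
If eyeballing the hex dump is not enough, the check can be automated. The following is a rough sketch, not part of libpq; the names to_hex and is_valid_utf8 are invented for this example. The first helper produces the same lowercase hex string as the printf loop, and the second tests whether a byte string is well-formed UTF-8 (rejecting truncated sequences, stray continuation bytes, overlong forms, surrogates, and code points above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: lowercase hex dump of every byte in s.
std::string to_hex(const std::string &s)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); i++) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

// Hypothetical helper: true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string &s)
{
    const unsigned char *p =
        reinterpret_cast<const unsigned char *>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n) {
        unsigned char c = p[i];
        std::size_t len;
        unsigned long cp;

        if (c < 0x80) { i++; continue; }                      // plain ASCII
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;          // stray continuation or invalid lead byte

        if (i + len > n) return false;                        // truncated
        for (std::size_t k = 1; k < len; k++) {
            unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80) return false;            // not 10xxxxxx
            cp = (cp << 6) | (cc & 0x3F);
        }

        // Reject overlong encodings, surrogates, and out-of-range values.
        if (len == 2 && cp < 0x80) return false;
        if (len == 3 && cp < 0x800) return false;
        if (len == 4 && cp < 0x10000) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;

        i += len;
    }
    return true;
}
```

A string that was left in Windows-1252 would carry äß as the two bytes e4 df, which this check rejects, whereas the correctly converted c3 a4 c3 9f passes.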

- Heikki


