Thread: UTF-8 encoding problem w/ libpq

UTF-8 encoding problem w/ libpq

From

Martin Schäfer

Date:

03 June 2013, 14:40:26

<div class="WordSection1"><p class="MsoNormal">I try to create database columns with umlauts, using the UTF8 client
encoding.However, the server seems to mess up the column names. In particular, it seems to perform a lowercase
operationon each byte of the UTF-8 multi-byte sequence.<p class="MsoNormal"> <p class="MsoNormal">Here is my code:<p
class="MsoNormal"> <pclass="MsoNormal">    const wchar_t *strName = L"id_äß";<p class="MsoNormal">    wstring strCreate
=wstring(L"create table test_umlaut(") + strName + L" integer primary key)";<p class="MsoNormal"> <p
class="MsoNormal">   PGconn *pConn = PQsetdbLogin("", "", NULL, NULL, "dev503", "postgres", "******");<p
class="MsoNormal">   if (!pConn) FAIL;<p class="MsoNormal">    if (PQsetClientEncoding(pConn, "UTF-8")) FAIL;<p
class="MsoNormal"> <pclass="MsoNormal">    PGresult *pResult = PQexec(pConn, "drop table test_umlaut");<p
class="MsoNormal">   if (pResult) PQclear(pResult);<p class="MsoNormal"> <p class="MsoNormal">    pResult =
PQexec(pConn,ToUtf8(strCreate.c_str()).c_str());<p class="MsoNormal">    if (pResult) PQclear(pResult);<p
class="MsoNormal"> <pclass="MsoNormal">    pResult = PQexec(pConn, "select * from test_umlaut");<p
class="MsoNormal">   if (!pResult) FAIL;<p class="MsoNormal">    if (PQresultStatus(pResult)!=PGRES_TUPLES_OK) FAIL;<p
class="MsoNormal">   if (PQnfields(pResult)!=1) FAIL;<p class="MsoNormal">    const char *fName = PQfname(pResult,0);<p
class="MsoNormal"> <pclass="MsoNormal">    ShowW("Name:     ", strName);<p class="MsoNormal">    ShowA("in UTF8:  ",
ToUtf8(strName).c_str());<pclass="MsoNormal">    ShowA("from DB:  ", fName);<p class="MsoNormal">    ShowW("in UTF16:
",ToWide(fName).c_str());<p class="MsoNormal"> <p class="MsoNormal">    PQclear(pResult);<p class="MsoNormal">   
PQreset(pConn);<pclass="MsoNormal"> <p class="MsoNormal">(ShowA/W call OutputDebugStringA/W, and ToUtf8/ToWide use
WideCharToMultiByte/MultiByteToWideCharwith CP_UTF8.)<p class="MsoNormal"> <p class="MsoNormal">And this is the output
generated:<pclass="MsoNormal"> <p class="MsoNormal">Name:     id_äß<p class="MsoNormal">in UTF8:  id_Ã¤ÃŸ<p
class="MsoNormal">fromDB:  id_ã¤ãÿ<p class="MsoNormal">in UTF16: id_???<p class="MsoNormal"> <p class="MsoNormal">It
seemslike the backend thinks the name is in ANSI encoding, not in UTF-8.<p class="MsoNormal">If I change the strCreate
queryand add double quotes around the column name, then the problem disappears. But the original name is already in
lowercase,so I think it should also work without quoting the column name.<p class="MsoNormal">Am I missing some setup
ineither the database or in the use of libpq?<p class="MsoNormal"> <p class="MsoNormal">I’m using PostgreSQL 9.2.1,
compiledby Visual C++ build 1600, 64-bit<p class="MsoNormal"> <p class="MsoNormal">The database uses:<p
class="MsoNormal">ENCODING= 'UTF8'<p class="MsoNormal">LC_COLLATE = 'English_United Kingdom.1252'<p
class="MsoNormal">LC_CTYPE= 'English_United Kingdom.1252'<p class="MsoNormal"> <p class="MsoNormal">Thanks for any
help,<pclass="MsoNormal"> <p class="MsoNormal">Martin<p class="MsoNormal"> </div>

Re: UTF-8 encoding problem w/ libpq

From

"ktm@rice.edu"

Date:

03 June 2013, 14:48:13

On Mon, Jun 03, 2013 at 03:40:14PM +0100, Martin Schäfer wrote:
> I try to create database columns with umlauts, using the UTF8 client encoding. However, the server seems to mess up
thecolumn names. In particular, it seems to perform a lowercase operation on each byte of the UTF-8 multi-byte
sequence.
> 
> Here is my code:
> 
>     const wchar_t *strName = L"id_äß";
>     wstring strCreate = wstring(L"create table test_umlaut(") + strName + L" integer primary key)";
> 
>     PGconn *pConn = PQsetdbLogin("", "", NULL, NULL, "dev503", "postgres", "******");
>     if (!pConn) FAIL;
>     if (PQsetClientEncoding(pConn, "UTF-8")) FAIL;
> 
>     PGresult *pResult = PQexec(pConn, "drop table test_umlaut");
>     if (pResult) PQclear(pResult);
> 
>     pResult = PQexec(pConn, ToUtf8(strCreate.c_str()).c_str());
>     if (pResult) PQclear(pResult);
> 
>     pResult = PQexec(pConn, "select * from test_umlaut");
>     if (!pResult) FAIL;
>     if (PQresultStatus(pResult)!=PGRES_TUPLES_OK) FAIL;
>     if (PQnfields(pResult)!=1) FAIL;
>     const char *fName = PQfname(pResult,0);
> 
>     ShowW("Name:     ", strName);
>     ShowA("in UTF8:  ", ToUtf8(strName).c_str());
>     ShowA("from DB:  ", fName);
>     ShowW("in UTF16: ", ToWide(fName).c_str());
> 
>     PQclear(pResult);
>     PQreset(pConn);
> 
> (ShowA/W call OutputDebugStringA/W, and ToUtf8/ToWide use WideCharToMultiByte/MultiByteToWideChar with CP_UTF8.)
> 
> And this is the output generated:
> 
> Name:     id_äß
> in UTF8:  id_Ã¤ÃŸ
> from DB:  id_ã¤ãÿ
> in UTF16: id_???
> 
> It seems like the backend thinks the name is in ANSI encoding, not in UTF-8.
> If I change the strCreate query and add double quotes around the column name, then the problem disappears. But the
originalname is already in lowercase, so I think it should also work without quoting the column name.
 
> Am I missing some setup in either the database or in the use of libpq?
> 
> I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
> 
> The database uses:
> ENCODING = 'UTF8'
> LC_COLLATE = 'English_United Kingdom.1252'
> LC_CTYPE = 'English_United Kingdom.1252'
> 
> Thanks for any help,
> 
> Martin
> 

Hi Martin,

If you do not want the lowercase behavior, you must put double-quotes around the
column name per the documentation:

http://www.postgresql.org/docs/9.2/interactive/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

section 4.1.1.

Regards,
Ken

Re: UTF-8 encoding problem w/ libpq

From

Martin Schäfer

Date:

03 June 2013, 15:09:39


> -----Original Message-----
> From: ktm@rice.edu [mailto:ktm@rice.edu]
> Sent: 03 June 2013 16:48
> To: Martin Schäfer
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] UTF-8 encoding problem w/ libpq
> 
> On Mon, Jun 03, 2013 at 03:40:14PM +0100, Martin Schäfer wrote:
> > I try to create database columns with umlauts, using the UTF8 client
> encoding. However, the server seems to mess up the column names. In
> particular, it seems to perform a lowercase operation on each byte of the
> UTF-8 multi-byte sequence.
> >
> > Here is my code:
> >
> >     const wchar_t *strName = L"id_äß";
> >     wstring strCreate = wstring(L"create table test_umlaut(") +
> > strName + L" integer primary key)";
> >
> >     PGconn *pConn = PQsetdbLogin("", "", NULL, NULL, "dev503", "postgres",
> "******");
> >     if (!pConn) FAIL;
> >     if (PQsetClientEncoding(pConn, "UTF-8")) FAIL;
> >
> >     PGresult *pResult = PQexec(pConn, "drop table test_umlaut");
> >     if (pResult) PQclear(pResult);
> >
> >     pResult = PQexec(pConn, ToUtf8(strCreate.c_str()).c_str());
> >     if (pResult) PQclear(pResult);
> >
> >     pResult = PQexec(pConn, "select * from test_umlaut");
> >     if (!pResult) FAIL;
> >     if (PQresultStatus(pResult)!=PGRES_TUPLES_OK) FAIL;
> >     if (PQnfields(pResult)!=1) FAIL;
> >     const char *fName = PQfname(pResult,0);
> >
> >     ShowW("Name:     ", strName);
> >     ShowA("in UTF8:  ", ToUtf8(strName).c_str());
> >     ShowA("from DB:  ", fName);
> >     ShowW("in UTF16: ", ToWide(fName).c_str());
> >
> >     PQclear(pResult);
> >     PQreset(pConn);
> >
> > (ShowA/W call OutputDebugStringA/W, and ToUtf8/ToWide use
> > WideCharToMultiByte/MultiByteToWideChar with CP_UTF8.)
> >
> > And this is the output generated:
> >
> > Name:     id_äß
> > in UTF8:  id_Ã¤ÃŸ
> > from DB:  id_ã¤ãÿ
> > in UTF16: id_???
> >
> > It seems like the backend thinks the name is in ANSI encoding, not in UTF-8.
> > If I change the strCreate query and add double quotes around the column
> name, then the problem disappears. But the original name is already in
> lowercase, so I think it should also work without quoting the column name.
> > Am I missing some setup in either the database or in the use of libpq?
> >
> > I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
> >
> > The database uses:
> > ENCODING = 'UTF8'
> > LC_COLLATE = 'English_United Kingdom.1252'
> > LC_CTYPE = 'English_United Kingdom.1252'
> >
> > Thanks for any help,
> >
> > Martin
> >
> 
> Hi Martin,
> 
> If you do not want the lowercase behavior, you must put double-quotes
> around the column name per the documentation:
> 
> http://www.postgresql.org/docs/9.2/interactive/sql-syntax-
> lexical.html#SQL-SYNTAX-IDENTIFIERS
> 
> section 4.1.1.
> 
> Regards,
> Ken

The original name 'id_äß' is already in lowercase. The backend should leave it unchanged IMO.

Regards,
Martin

Re: UTF-8 encoding problem w/ libpq

From

"ktm@rice.edu"

Date:

03 June 2013, 15:28:08

On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote:
> 
> > > If I change the strCreate query and add double quotes around the column
> > name, then the problem disappears. But the original name is already in
> > lowercase, so I think it should also work without quoting the column name.
> > > Am I missing some setup in either the database or in the use of libpq?
> > >
> > > I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
> > >
> > > The database uses:
> > > ENCODING = 'UTF8'
> > > LC_COLLATE = 'English_United Kingdom.1252'
> > > LC_CTYPE = 'English_United Kingdom.1252'
> > >
> > > Thanks for any help,
> > >
> > > Martin
> > >
> > 
> > Hi Martin,
> > 
> > If you do not want the lowercase behavior, you must put double-quotes
> > around the column name per the documentation:
> > 
> > http://www.postgresql.org/docs/9.2/interactive/sql-syntax-
> > lexical.html#SQL-SYNTAX-IDENTIFIERS
> > 
> > section 4.1.1.
> > 
> > Regards,
> > Ken
> 
> The original name 'id_äß' is already in lowercase. The backend should leave it unchanged IMO.
> 
> Regards,
> Martin
> 

Only in utf-8 which needs to be double-quoted for a column name as you have
seen, otherwise the value will be lowercased per byte.

Regards,
Ken

Re: UTF-8 encoding problem w/ libpq

From

Heikki Linnakangas

Date:

03 June 2013, 16:23:10

On 03.06.2013 18:27, ktm@rice.edu wrote:
> On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote:
>>
>>>> If I change the strCreate query and add double quotes around the column
>>> name, then the problem disappears. But the original name is already in
>>> lowercase, so I think it should also work without quoting the column name.
>>>> Am I missing some setup in either the database or in the use of libpq?
>>>>
>>>> I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
>>>>
>>>> The database uses:
>>>> ENCODING = 'UTF8'
>>>> LC_COLLATE = 'English_United Kingdom.1252'
>>>> LC_CTYPE = 'English_United Kingdom.1252'
>>>>
>>>> Thanks for any help,
>>>>
>>>> Martin
>>>>
>>>
>>> Hi Martin,
>>>
>>> If you do not want the lowercase behavior, you must put double-quotes
>>> around the column name per the documentation:
>>>
>>> http://www.postgresql.org/docs/9.2/interactive/sql-syntax-
>>> lexical.html#SQL-SYNTAX-IDENTIFIERS
>>>
>>> section 4.1.1.
>>>
>>> Regards,
>>> Ken
>>
>> The original name 'id_äß' is already in lowercase. The backend should leave it unchanged IMO.
>
> Only in utf-8 which needs to be double-quoted for a column name as you have
> seen, otherwise the value will be lowercased per byte.

He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the 
backend is supposed to leave bytes with the high-bit set alone, ie. in 
UTF-8 encoding, it's supposed to leave ä and ß alone.

I suspect that the conversion to UTF-8, before the string is sent to the 
server, is not being done correctly. I'm not sure what's wrong there, 
but I'd suggest printing the actual byte sequence sent to the server, to 
check if it's in fact valid UTF-8. ie. replace the PQexec() line with 
something like:
    const char *s = ToUtf8(strCreate.c_str()).c_str();    int i;    for (i=0; s[i]; i++)      printf("%02x", (unsigned
char)s[i]);    printf("\n");    pResult = PQexec(pConn, s);
 

That should contain the UTF-8 byte sequence for äß, "c3a4c39f"

- Heikki

Re: UTF-8 encoding problem w/ libpq

From

Andrew Dunstan

Date:

03 June 2013, 17:44:57

On 06/03/2013 12:22 PM, Heikki Linnakangas wrote:
> On 03.06.2013 18:27, ktm@rice.edu wrote:
>> On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote:
>>>
>>>>> If I change the strCreate query and add double quotes around the
>>>>> column
>>>> name, then the problem disappears. But the original name is already in
>>>> lowercase, so I think it should also work without quoting the
>>>> column name.
>>>>> Am I missing some setup in either the database or in the use of
>>>>> libpq?
>>>>>
>>>>> I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
>>>>>
>>>>> The database uses:
>>>>> ENCODING = 'UTF8'
>>>>> LC_COLLATE = 'English_United Kingdom.1252'
>>>>> LC_CTYPE = 'English_United Kingdom.1252'
>>>>>
>>>>> Thanks for any help,
>>>>>
>>>>> Martin
>>>>>
>>>>
>>>> Hi Martin,
>>>>
>>>> If you do not want the lowercase behavior, you must put double-quotes
>>>> around the column name per the documentation:
>>>>
>>>> http://www.postgresql.org/docs/9.2/interactive/sql-syntax-
>>>> lexical.html#SQL-SYNTAX-IDENTIFIERS
>>>>
>>>> section 4.1.1.
>>>>
>>>> Regards,
>>>> Ken
>>>
>>> The original name 'id_äß' is already in lowercase. The backend
>>> should leave it unchanged IMO.
>>
>> Only in utf-8 which needs to be double-quoted for a column name as
>> you have
>> seen, otherwise the value will be lowercased per byte.
>
> He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
> backend is supposed to leave bytes with the high-bit set alone, ie. in
> UTF-8 encoding, it's supposed to leave ä and ß alone.
>
> I suspect that the conversion to UTF-8, before the string is sent to
> the server, is not being done correctly. I'm not sure what's wrong
> there, but I'd suggest printing the actual byte sequence sent to the
> server, to check if it's in fact valid UTF-8. ie. replace the PQexec()
> line with something like:
>
>     const char *s = ToUtf8(strCreate.c_str()).c_str();
>     int i;
>     for (i=0; s[i]; i++)
>       printf("%02x", (unsigned char) s[i]);
>     printf("\n");
>     pResult = PQexec(pConn, s);
>
> That should contain the UTF-8 byte sequence for äß, "c3a4c39f"
>
>


Umm, no, the backend code doesn't do it right. Some time ago I suggested
a fix for this - see
<http://www.postgresql.org/message-id/50ACF7FA.7070108@dunslane.net>.
Tom thought there might be other places that need fixing, and I haven't
had time to look for them. But maybe we should just fix this one for now
at least.

cheers

andrew

Re: UTF-8 encoding problem w/ libpq

From

Tom Lane

Date:

03 June 2013, 18:28:38

Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the 
> backend is supposed to leave bytes with the high-bit set alone, ie. in 
> UTF-8 encoding, it's supposed to leave ä and ß alone.

Well, actually, downcase_truncate_identifier() is doing this:
    unsigned char ch = (unsigned char) ident[i];
    if (ch >= 'A' && ch <= 'Z')        ch += 'a' - 'A';    else if (IS_HIGHBIT_SET(ch) && isupper(ch))        ch =
tolower(ch);

There's basically no way that that second case can give pleasant results
in a multibyte encoding, other than by not doing anything.  I suspect
that Windows' libc has fewer defenses than other implementations and
performs some transformation that we don't get elsewhere.  This may also
explain the gripe yesterday in -general about funny results in OS X.

We talked about this before and went off into the weeds about whether
it was sensible to try to use towlower() and whether that wouldn't
create undesirably platform-sensitive results.  I wonder though if we
couldn't just fix this code to not do anything to high-bit-set bytes
in multibyte encodings.
        regards, tom lane

Re: UTF-8 encoding problem w/ libpq

From

Heikki Linnakangas

Date:

03 June 2013, 18:41:44

On 03.06.2013 21:28, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>> He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
>> backend is supposed to leave bytes with the high-bit set alone, ie. in
>> UTF-8 encoding, it's supposed to leave Ã¤ and ÃŸ alone.
>
> Well, actually, downcase_truncate_identifier() is doing this:
>
>         unsigned char ch = (unsigned char) ident[i];
>
>         if (ch>= 'A'&&  ch<= 'Z')
>             ch += 'a' - 'A';
>         else if (IS_HIGHBIT_SET(ch)&&  isupper(ch))
>             ch = tolower(ch);
>
> There's basically no way that that second case can give pleasant results
> in a multibyte encoding, other than by not doing anything.

Hmph, I see.

> I suspect
> that Windows' libc has fewer defenses than other implementations and
> performs some transformation that we don't get elsewhere.  This may also
> explain the gripe yesterday in -general about funny results in OS X.

Can't really blame Windows on that. On Windows, we don't require that 
the encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in 
the server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE 
implies WIN1252 encoding. We allow that and it generally works on 
Windows because in varstr_cmp, we use MultiByteToWideChar() followed by 
wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE. 
But for isupper(), it matters.

> We talked about this before and went off into the weeds about whether
> it was sensible to try to use towlower() and whether that wouldn't
> create undesirably platform-sensitive results.  I wonder though if we
> couldn't just fix this code to not do anything to high-bit-set bytes
> in multibyte encodings.

Yeah, we should do that. It makes no sense to call isupper or tolower on 
bytes belonging to multi-byte characters.

- Heikki

Re: UTF-8 encoding problem w/ libpq

From

Andrew Dunstan

Date:

03 June 2013, 18:42:19

On 06/03/2013 02:28 PM, Tom Lane wrote:
> . I wonder though if we couldn't just fix this code to not do anything 
> to high-bit-set bytes in multibyte encodings.


That's exactly what I suggested back in November.

cheers

andrew

Re: UTF-8 encoding problem w/ libpq

From

Martin Schäfer

Date:

04 June 2013, 06:40:09

> Can't really blame Windows on that. On Windows, we don't require that the
> encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in the
> server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE implies
> WIN1252 encoding. We allow that and it generally works on Windows
> because in varstr_cmp, we use MultiByteToWideChar() followed by
> wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE.
> But for isupper(), it matters.

Does this mean that the UTF-8 messing up would disappear if the database were using a different locale for LC_CTYPE? If
so,which locale should I use?
 
This would be useful for a temporary workaround.

> > We talked about this before and went off into the weeds about whether
> > it was sensible to try to use towlower() and whether that wouldn't
> > create undesirably platform-sensitive results.  I wonder though if we
> > couldn't just fix this code to not do anything to high-bit-set bytes
> > in multibyte encodings.
> 
> Yeah, we should do that. It makes no sense to call isupper or tolower on
> bytes belonging to multi-byte characters.

Actually, I would expect that 'create table HÄUSER (...)' would create a table named 'häuser', and not a table named
'hÄuser',so towlower seems the right choice IMHO.
 

Martin

Re: UTF-8 encoding problem w/ libpq

From

Andrew Dunstan

Date:

08 June 2013, 14:43:18

On 06/03/2013 02:41 PM, Andrew Dunstan wrote:
>
> On 06/03/2013 02:28 PM, Tom Lane wrote:
>> . I wonder though if we couldn't just fix this code to not do 
>> anything to high-bit-set bytes in multibyte encodings.
>
>
> That's exactly what I suggested back in November.

This thread seems to have gone cold, so I have applied the fix I 
originally suggested along these lines to all live branches.

At least that means we won't produce junk, but we still need to work out 
how to downcase multi-byte characters.

If anyone thinks there are other places in the code that need similar 
treatment, they are welcome to find them. I have not yet found one.

cheers

andrew

Re: UTF-8 encoding problem w/ libpq

From

Martin Schäfer

Date:

10 June 2013, 07:08:26

Thanks Andrew. I will test the next release.

Martin

> -----Original Message-----
> From: Andrew Dunstan [mailto:andrew@dunslane.net]
> Sent: 08 June 2013 16:43
> To: Tom Lane
> Cc: Heikki Linnakangas; ktm@rice.edu; Martin Schäfer; pgsql-
> hackers@postgresql.org
> Subject: Re: [HACKERS] UTF-8 encoding problem w/ libpq
>
>
> On 06/03/2013 02:41 PM, Andrew Dunstan wrote:
> >
> > On 06/03/2013 02:28 PM, Tom Lane wrote:
> >> . I wonder though if we couldn't just fix this code to not do
> >> anything to high-bit-set bytes in multibyte encodings.
> >
> >
> > That's exactly what I suggested back in November.
>
>
> This thread seems to have gone cold, so I have applied the fix I originally
> suggested along these lines to all live branches.
>
> At least that means we won't produce junk, but we still need to work out
> how to downcase multi-byte characters.
>
> If anyone thinks there are other places in the code that need similar
> treatment, they are welcome to find them. I have not yet found one.
>
>
> cheers
>
> andrew
>

Re: UTF-8 encoding problem w/ libpq

From

Heikki Linnakangas

Date:

10 June 2013, 09:38:58

On 04.06.2013 09:39, Martin Schäfer wrote:
>> Can't really blame Windows on that. On Windows, we don't require that the
>> encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in the
>> server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE implies
>> WIN1252 encoding. We allow that and it generally works on Windows
>> because in varstr_cmp, we use MultiByteToWideChar() followed by
>> wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE.
>> But for isupper(), it matters.
>
> Does this mean that the UTF-8 messing up would disappear if the database were using a different locale for LC_CTYPE?
Ifso, which locale should I use?

> This would be useful for a temporary workaround.

Maybe, not sure. The logical thing to do would be to set LC_CTYPE to 
"English_United Kingdom.65001", which tell Windows to expect UTF-8 
charset. However, old discussions on this subject suggest that Windows 
won't accept that:

http://www.postgresql.org/message-id/20071015090954.GD4653@svr2.hagander.net

It's still worth a try, I think. Things might've changed since then. If 
that doesn't work, you could also try some other random codepages as a 
workaround. If you're lucky, one of them might work better, even though 
it would still be the wrong codepage for UTF-8.

- Heikki