Thread: UTF-8 encoding problem w/ libpq
<div class="WordSection1"><p class="MsoNormal">I try to create database columns with umlauts, using the UTF8 client encoding.However, the server seems to mess up the column names. In particular, it seems to perform a lowercase operationon each byte of the UTF-8 multi-byte sequence.<p class="MsoNormal"> <p class="MsoNormal">Here is my code:<p class="MsoNormal"> <pclass="MsoNormal"> const wchar_t *strName = L"id_äß";<p class="MsoNormal"> wstring strCreate =wstring(L"create table test_umlaut(") + strName + L" integer primary key)";<p class="MsoNormal"> <p class="MsoNormal"> PGconn *pConn = PQsetdbLogin("", "", NULL, NULL, "dev503", "postgres", "******");<p class="MsoNormal"> if (!pConn) FAIL;<p class="MsoNormal"> if (PQsetClientEncoding(pConn, "UTF-8")) FAIL;<p class="MsoNormal"> <pclass="MsoNormal"> PGresult *pResult = PQexec(pConn, "drop table test_umlaut");<p class="MsoNormal"> if (pResult) PQclear(pResult);<p class="MsoNormal"> <p class="MsoNormal"> pResult = PQexec(pConn,ToUtf8(strCreate.c_str()).c_str());<p class="MsoNormal"> if (pResult) PQclear(pResult);<p class="MsoNormal"> <pclass="MsoNormal"> pResult = PQexec(pConn, "select * from test_umlaut");<p class="MsoNormal"> if (!pResult) FAIL;<p class="MsoNormal"> if (PQresultStatus(pResult)!=PGRES_TUPLES_OK) FAIL;<p class="MsoNormal"> if (PQnfields(pResult)!=1) FAIL;<p class="MsoNormal"> const char *fName = PQfname(pResult,0);<p class="MsoNormal"> <pclass="MsoNormal"> ShowW("Name: ", strName);<p class="MsoNormal"> ShowA("in UTF8: ", ToUtf8(strName).c_str());<pclass="MsoNormal"> ShowA("from DB: ", fName);<p class="MsoNormal"> ShowW("in UTF16: ",ToWide(fName).c_str());<p class="MsoNormal"> <p class="MsoNormal"> PQclear(pResult);<p class="MsoNormal"> PQreset(pConn);<pclass="MsoNormal"> <p class="MsoNormal">(ShowA/W call OutputDebugStringA/W, and ToUtf8/ToWide use WideCharToMultiByte/MultiByteToWideCharwith CP_UTF8.)<p class="MsoNormal"> <p class="MsoNormal">And this is the output generated:<pclass="MsoNormal"> <p class="MsoNormal">Name: id_äß<p class="MsoNormal">in UTF8: id_äß<p class="MsoNormal">fromDB: id_ã¤ãÿ<p class="MsoNormal">in UTF16: id_???<p class="MsoNormal"> <p class="MsoNormal">It seemslike the backend thinks the name is in ANSI encoding, not in UTF-8.<p class="MsoNormal">If I change the strCreate queryand add double quotes around the column name, then the problem disappears. But the original name is already in lowercase,so I think it should also work without quoting the column name.<p class="MsoNormal">Am I missing some setup ineither the database or in the use of libpq?<p class="MsoNormal"> <p class="MsoNormal">I’m using PostgreSQL 9.2.1, compiledby Visual C++ build 1600, 64-bit<p class="MsoNormal"> <p class="MsoNormal">The database uses:<p class="MsoNormal">ENCODING= 'UTF8'<p class="MsoNormal">LC_COLLATE = 'English_United Kingdom.1252'<p class="MsoNormal">LC_CTYPE= 'English_United Kingdom.1252'<p class="MsoNormal"> <p class="MsoNormal">Thanks for any help,<pclass="MsoNormal"> <p class="MsoNormal">Martin<p class="MsoNormal"> </div>
On Mon, Jun 03, 2013 at 03:40:14PM +0100, Martin Schäfer wrote: > I try to create database columns with umlauts, using the UTF8 client encoding. However, the server seems to mess up thecolumn names. In particular, it seems to perform a lowercase operation on each byte of the UTF-8 multi-byte sequence. > > Here is my code: > > const wchar_t *strName = L"id_äß"; > wstring strCreate = wstring(L"create table test_umlaut(") + strName + L" integer primary key)"; > > PGconn *pConn = PQsetdbLogin("", "", NULL, NULL, "dev503", "postgres", "******"); > if (!pConn) FAIL; > if (PQsetClientEncoding(pConn, "UTF-8")) FAIL; > > PGresult *pResult = PQexec(pConn, "drop table test_umlaut"); > if (pResult) PQclear(pResult); > > pResult = PQexec(pConn, ToUtf8(strCreate.c_str()).c_str()); > if (pResult) PQclear(pResult); > > pResult = PQexec(pConn, "select * from test_umlaut"); > if (!pResult) FAIL; > if (PQresultStatus(pResult)!=PGRES_TUPLES_OK) FAIL; > if (PQnfields(pResult)!=1) FAIL; > const char *fName = PQfname(pResult,0); > > ShowW("Name: ", strName); > ShowA("in UTF8: ", ToUtf8(strName).c_str()); > ShowA("from DB: ", fName); > ShowW("in UTF16: ", ToWide(fName).c_str()); > > PQclear(pResult); > PQreset(pConn); > > (ShowA/W call OutputDebugStringA/W, and ToUtf8/ToWide use WideCharToMultiByte/MultiByteToWideChar with CP_UTF8.) > > And this is the output generated: > > Name: id_äß > in UTF8: id_äß > from DB: id_ã¤ãÿ > in UTF16: id_??? > > It seems like the backend thinks the name is in ANSI encoding, not in UTF-8. > If I change the strCreate query and add double quotes around the column name, then the problem disappears. But the originalname is already in lowercase, so I think it should also work without quoting the column name. > Am I missing some setup in either the database or in the use of libpq? > > I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit > > The database uses: > ENCODING = 'UTF8' > LC_COLLATE = 'English_United Kingdom.1252' > LC_CTYPE = 'English_United Kingdom.1252' > > Thanks for any help, > > Martin > Hi Martin, If you do not want the lowercase behavior, you must put double-quotes around the column name per the documentation: http://www.postgresql.org/docs/9.2/interactive/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS section 4.1.1. Regards, Ken
> -----Original Message----- > From: ktm@rice.edu [mailto:ktm@rice.edu] > Sent: 03 June 2013 16:48 > To: Martin Schäfer > Cc: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] UTF-8 encoding problem w/ libpq > > On Mon, Jun 03, 2013 at 03:40:14PM +0100, Martin Schäfer wrote: > > I try to create database columns with umlauts, using the UTF8 client > encoding. However, the server seems to mess up the column names. In > particular, it seems to perform a lowercase operation on each byte of the > UTF-8 multi-byte sequence. > > > > Here is my code: > > > > const wchar_t *strName = L"id_äß"; > > wstring strCreate = wstring(L"create table test_umlaut(") + > > strName + L" integer primary key)"; > > > > PGconn *pConn = PQsetdbLogin("", "", NULL, NULL, "dev503", "postgres", > "******"); > > if (!pConn) FAIL; > > if (PQsetClientEncoding(pConn, "UTF-8")) FAIL; > > > > PGresult *pResult = PQexec(pConn, "drop table test_umlaut"); > > if (pResult) PQclear(pResult); > > > > pResult = PQexec(pConn, ToUtf8(strCreate.c_str()).c_str()); > > if (pResult) PQclear(pResult); > > > > pResult = PQexec(pConn, "select * from test_umlaut"); > > if (!pResult) FAIL; > > if (PQresultStatus(pResult)!=PGRES_TUPLES_OK) FAIL; > > if (PQnfields(pResult)!=1) FAIL; > > const char *fName = PQfname(pResult,0); > > > > ShowW("Name: ", strName); > > ShowA("in UTF8: ", ToUtf8(strName).c_str()); > > ShowA("from DB: ", fName); > > ShowW("in UTF16: ", ToWide(fName).c_str()); > > > > PQclear(pResult); > > PQreset(pConn); > > > > (ShowA/W call OutputDebugStringA/W, and ToUtf8/ToWide use > > WideCharToMultiByte/MultiByteToWideChar with CP_UTF8.) > > > > And this is the output generated: > > > > Name: id_äß > > in UTF8: id_äß > > from DB: id_ã¤ãÿ > > in UTF16: id_??? > > > > It seems like the backend thinks the name is in ANSI encoding, not in UTF-8. > > If I change the strCreate query and add double quotes around the column > name, then the problem disappears. But the original name is already in > lowercase, so I think it should also work without quoting the column name. > > Am I missing some setup in either the database or in the use of libpq? > > > > I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit > > > > The database uses: > > ENCODING = 'UTF8' > > LC_COLLATE = 'English_United Kingdom.1252' > > LC_CTYPE = 'English_United Kingdom.1252' > > > > Thanks for any help, > > > > Martin > > > > Hi Martin, > > If you do not want the lowercase behavior, you must put double-quotes > around the column name per the documentation: > > http://www.postgresql.org/docs/9.2/interactive/sql-syntax- > lexical.html#SQL-SYNTAX-IDENTIFIERS > > section 4.1.1. > > Regards, > Ken The original name 'id_äß' is already in lowercase. The backend should leave it unchanged IMO. Regards, Martin
On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote: > > > > If I change the strCreate query and add double quotes around the column > > name, then the problem disappears. But the original name is already in > > lowercase, so I think it should also work without quoting the column name. > > > Am I missing some setup in either the database or in the use of libpq? > > > > > > I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit > > > > > > The database uses: > > > ENCODING = 'UTF8' > > > LC_COLLATE = 'English_United Kingdom.1252' > > > LC_CTYPE = 'English_United Kingdom.1252' > > > > > > Thanks for any help, > > > > > > Martin > > > > > > > Hi Martin, > > > > If you do not want the lowercase behavior, you must put double-quotes > > around the column name per the documentation: > > > > http://www.postgresql.org/docs/9.2/interactive/sql-syntax- > > lexical.html#SQL-SYNTAX-IDENTIFIERS > > > > section 4.1.1. > > > > Regards, > > Ken > > The original name 'id_äß' is already in lowercase. The backend should leave it unchanged IMO. > > Regards, > Martin > Only in utf-8 which needs to be double-quoted for a column name as you have seen, otherwise the value will be lowercased per byte. Regards, Ken
On 03.06.2013 18:27, ktm@rice.edu wrote: > On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote: >> >>>> If I change the strCreate query and add double quotes around the column >>> name, then the problem disappears. But the original name is already in >>> lowercase, so I think it should also work without quoting the column name. >>>> Am I missing some setup in either the database or in the use of libpq? >>>> >>>> I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit >>>> >>>> The database uses: >>>> ENCODING = 'UTF8' >>>> LC_COLLATE = 'English_United Kingdom.1252' >>>> LC_CTYPE = 'English_United Kingdom.1252' >>>> >>>> Thanks for any help, >>>> >>>> Martin >>>> >>> >>> Hi Martin, >>> >>> If you do not want the lowercase behavior, you must put double-quotes >>> around the column name per the documentation: >>> >>> http://www.postgresql.org/docs/9.2/interactive/sql-syntax- >>> lexical.html#SQL-SYNTAX-IDENTIFIERS >>> >>> section 4.1.1. >>> >>> Regards, >>> Ken >> >> The original name 'id_äß' is already in lowercase. The backend should leave it unchanged IMO. > > Only in utf-8 which needs to be double-quoted for a column name as you have > seen, otherwise the value will be lowercased per byte. He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the backend is supposed to leave bytes with the high-bit set alone, ie. in UTF-8 encoding, it's supposed to leave ä and ß alone. I suspect that the conversion to UTF-8, before the string is sent to the server, is not being done correctly. I'm not sure what's wrong there, but I'd suggest printing the actual byte sequence sent to the server, to check if it's in fact valid UTF-8. ie. replace the PQexec() line with something like: const char *s = ToUtf8(strCreate.c_str()).c_str(); int i; for (i=0; s[i]; i++) printf("%02x", (unsigned char)s[i]); printf("\n"); pResult = PQexec(pConn, s); That should contain the UTF-8 byte sequence for äß, "c3a4c39f" - Heikki
On 06/03/2013 12:22 PM, Heikki Linnakangas wrote: > On 03.06.2013 18:27, ktm@rice.edu wrote: >> On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote: >>> >>>>> If I change the strCreate query and add double quotes around the >>>>> column >>>> name, then the problem disappears. But the original name is already in >>>> lowercase, so I think it should also work without quoting the >>>> column name. >>>>> Am I missing some setup in either the database or in the use of >>>>> libpq? >>>>> >>>>> I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit >>>>> >>>>> The database uses: >>>>> ENCODING = 'UTF8' >>>>> LC_COLLATE = 'English_United Kingdom.1252' >>>>> LC_CTYPE = 'English_United Kingdom.1252' >>>>> >>>>> Thanks for any help, >>>>> >>>>> Martin >>>>> >>>> >>>> Hi Martin, >>>> >>>> If you do not want the lowercase behavior, you must put double-quotes >>>> around the column name per the documentation: >>>> >>>> http://www.postgresql.org/docs/9.2/interactive/sql-syntax- >>>> lexical.html#SQL-SYNTAX-IDENTIFIERS >>>> >>>> section 4.1.1. >>>> >>>> Regards, >>>> Ken >>> >>> The original name 'id_äß' is already in lowercase. The backend >>> should leave it unchanged IMO. >> >> Only in utf-8 which needs to be double-quoted for a column name as >> you have >> seen, otherwise the value will be lowercased per byte. > > He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the > backend is supposed to leave bytes with the high-bit set alone, ie. in > UTF-8 encoding, it's supposed to leave ä and ß alone. > > I suspect that the conversion to UTF-8, before the string is sent to > the server, is not being done correctly. I'm not sure what's wrong > there, but I'd suggest printing the actual byte sequence sent to the > server, to check if it's in fact valid UTF-8. ie. replace the PQexec() > line with something like: > > const char *s = ToUtf8(strCreate.c_str()).c_str(); > int i; > for (i=0; s[i]; i++) > printf("%02x", (unsigned char) s[i]); > printf("\n"); > pResult = PQexec(pConn, s); > > That should contain the UTF-8 byte sequence for äß, "c3a4c39f" > > Umm, no, the backend code doesn't do it right. Some time ago I suggested a fix for this - see <http://www.postgresql.org/message-id/50ACF7FA.7070108@dunslane.net>. Tom thought there might be other places that need fixing, and I haven't had time to look for them. But maybe we should just fix this one for now at least. cheers andrew
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the > backend is supposed to leave bytes with the high-bit set alone, ie. in > UTF-8 encoding, it's supposed to leave ä and ß alone. Well, actually, downcase_truncate_identifier() is doing this: unsigned char ch = (unsigned char) ident[i]; if (ch >= 'A' && ch <= 'Z') ch += 'a' - 'A'; else if (IS_HIGHBIT_SET(ch) && isupper(ch)) ch = tolower(ch); There's basically no way that that second case can give pleasant results in a multibyte encoding, other than by not doing anything. I suspect that Windows' libc has fewer defenses than other implementations and performs some transformation that we don't get elsewhere. This may also explain the gripe yesterday in -general about funny results in OS X. We talked about this before and went off into the weeds about whether it was sensible to try to use towlower() and whether that wouldn't create undesirably platform-sensitive results. I wonder though if we couldn't just fix this code to not do anything to high-bit-set bytes in multibyte encodings. regards, tom lane
On 03.06.2013 21:28, Tom Lane wrote: > Heikki Linnakangas<hlinnakangas@vmware.com> writes: >> He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the >> backend is supposed to leave bytes with the high-bit set alone, ie. in >> UTF-8 encoding, it's supposed to leave ä and ß alone. > > Well, actually, downcase_truncate_identifier() is doing this: > > unsigned char ch = (unsigned char) ident[i]; > > if (ch>= 'A'&& ch<= 'Z') > ch += 'a' - 'A'; > else if (IS_HIGHBIT_SET(ch)&& isupper(ch)) > ch = tolower(ch); > > There's basically no way that that second case can give pleasant results > in a multibyte encoding, other than by not doing anything. Hmph, I see. > I suspect > that Windows' libc has fewer defenses than other implementations and > performs some transformation that we don't get elsewhere. This may also > explain the gripe yesterday in -general about funny results in OS X. Can't really blame Windows on that. On Windows, we don't require that the encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in the server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE implies WIN1252 encoding. We allow that and it generally works on Windows because in varstr_cmp, we use MultiByteToWideChar() followed by wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE. But for isupper(), it matters. > We talked about this before and went off into the weeds about whether > it was sensible to try to use towlower() and whether that wouldn't > create undesirably platform-sensitive results. I wonder though if we > couldn't just fix this code to not do anything to high-bit-set bytes > in multibyte encodings. Yeah, we should do that. It makes no sense to call isupper or tolower on bytes belonging to multi-byte characters. - Heikki
On 06/03/2013 02:28 PM, Tom Lane wrote: > . I wonder though if we couldn't just fix this code to not do anything > to high-bit-set bytes in multibyte encodings. That's exactly what I suggested back in November. cheers andrew
> Can't really blame Windows on that. On Windows, we don't require that the > encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in the > server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE implies > WIN1252 encoding. We allow that and it generally works on Windows > because in varstr_cmp, we use MultiByteToWideChar() followed by > wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE. > But for isupper(), it matters. Does this mean that the UTF-8 messing up would disappear if the database were using a different locale for LC_CTYPE? If so,which locale should I use? This would be useful for a temporary workaround. > > We talked about this before and went off into the weeds about whether > > it was sensible to try to use towlower() and whether that wouldn't > > create undesirably platform-sensitive results. I wonder though if we > > couldn't just fix this code to not do anything to high-bit-set bytes > > in multibyte encodings. > > Yeah, we should do that. It makes no sense to call isupper or tolower on > bytes belonging to multi-byte characters. Actually, I would expect that 'create table HÄUSER (...)' would create a table named 'häuser', and not a table named 'hÄuser',so towlower seems the right choice IMHO. Martin
On 06/03/2013 02:41 PM, Andrew Dunstan wrote: > > On 06/03/2013 02:28 PM, Tom Lane wrote: >> . I wonder though if we couldn't just fix this code to not do >> anything to high-bit-set bytes in multibyte encodings. > > > That's exactly what I suggested back in November. This thread seems to have gone cold, so I have applied the fix I originally suggested along these lines to all live branches. At least that means we won't produce junk, but we still need to work out how to downcase multi-byte characters. If anyone thinks there are other places in the code that need similar treatment, they are welcome to find them. I have not yet found one. cheers andrew
Thanks Andrew. I will test the next release. Martin > -----Original Message----- > From: Andrew Dunstan [mailto:andrew@dunslane.net] > Sent: 08 June 2013 16:43 > To: Tom Lane > Cc: Heikki Linnakangas; ktm@rice.edu; Martin Schäfer; pgsql- > hackers@postgresql.org > Subject: Re: [HACKERS] UTF-8 encoding problem w/ libpq > > > On 06/03/2013 02:41 PM, Andrew Dunstan wrote: > > > > On 06/03/2013 02:28 PM, Tom Lane wrote: > >> . I wonder though if we couldn't just fix this code to not do > >> anything to high-bit-set bytes in multibyte encodings. > > > > > > That's exactly what I suggested back in November. > > > This thread seems to have gone cold, so I have applied the fix I originally > suggested along these lines to all live branches. > > At least that means we won't produce junk, but we still need to work out > how to downcase multi-byte characters. > > If anyone thinks there are other places in the code that need similar > treatment, they are welcome to find them. I have not yet found one. > > > cheers > > andrew >
On 04.06.2013 09:39, Martin Schäfer wrote: >> Can't really blame Windows on that. On Windows, we don't require that the >> encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in the >> server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE implies >> WIN1252 encoding. We allow that and it generally works on Windows >> because in varstr_cmp, we use MultiByteToWideChar() followed by >> wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE. >> But for isupper(), it matters. > > Does this mean that the UTF-8 messing up would disappear if the database were using a different locale for LC_CTYPE? Ifso, which locale should I use? > This would be useful for a temporary workaround. Maybe, not sure. The logical thing to do would be to set LC_CTYPE to "English_United Kingdom.65001", which tell Windows to expect UTF-8 charset. However, old discussions on this subject suggest that Windows won't accept that: http://www.postgresql.org/message-id/20071015090954.GD4653@svr2.hagander.net It's still worth a try, I think. Things might've changed since then. If that doesn't work, you could also try some other random codepages as a workaround. If you're lucky, one of them might work better, even though it would still be the wrong codepage for UTF-8. - Heikki