Thread: Re: psqlODBC-Driver Test / text fields
> -----Original Message----- > From: Johann Zuschlag [mailto:zuschlag2@online.de] > Sent: 27 March 2006 13:46 > To: Dave Page > Cc: pgsql-odbc@postgresql.org > Subject: psqlODBC-Driver Test / text fields > > Hi Dave, > > here is another one, same problem like Miguel: Miguel's problem is somewhat unique to BDE from what I recall - specifically it ignores Unicode text columns which the Unicode driver will offer by default. The same problem has been reported with Oracle and other DBMS's on usenet. > PostgreSQL 8.0X on Debian Sarge > Unicode Database > Application with psqlodbc+ 7.02.02.60 on Win XP Sp2 > > Operation: Reading and writing of a, A, u, U, o, O-Umlaut and > sz (german > characters) > Data type: text > Remarks: none > Protocol: 7.4+ > Result: failed, insert works, querying doesn't work OK, that sounds like the old bug we were seeing. What is the difference between those and the previous results you reported a few minutes ago? Is it just the text/varchar difference? Regards, Dave.
Dave Page schrieb: > > OK, that sounds like the old bug we were seeing. What is the difference > between those and the previous results you reported a few minutes ago? > Is it just the text/varchar difference? > > > Yes, I forgot to test the text field. I shall set debug higher to see what is going on. Regards, Johann
Dave Page schrieb: >OK, that sounds like the old bug we were seeing. What is the difference >between those and the previous results you reported a few minutes ago? >Is it just the text/varchar difference? > >Regards, Dave. > > > Hi Dave, the problem doesn't seem to be related to the text field. Please note the examples below: 1. Searching for a string starting with a 't' seems to work fine: 2006-03-27 16:49:48 [2931] LOG: statement: declare "SQL_CUR0210FD50" cursor with hold for SELECT t6.* FROM KUNDE t6 WHERE t6.name >= 't' AND t6.name <= 'tz' ORDER BY t6.name ASC, t6.kundenid ASC 2. Searching for a string starting with o-Umlaut (german character) doesn't return any results (in my app.). 2006-03-27 16:50:37 [2931] LOG: statement: declare "SQL_CUR0210FD50" cursor with hold for SELECT t6.* FROM KUNDE t6 WHERE t6.name >= 'ö' AND t6.name <= 'öz' ORDER BY t6.name ASC, t6.kundenid ASC Maybe the WHERE-statement is not parsed by the driver. But the hex representation of 'ö' is 'C3B6', that is the correct UTF8 code (not unicode) of o-Umlaut. 3. Furthermore I noticed that I wouldn't get anything back if the searched string contains any "Umlauts". So I guess there could be a problem with the result set too. SELECT without a WHERE-statement works. Any ideas? A mylog ist available, as well as the tiny data set (6 lines).. Hope that helps. Regards, Johann
Johann Zuschlag wrote: > Dave Page schrieb: > >> OK, that sounds like the old bug we were seeing. What is the difference >> between those and the previous results you reported a few minutes ago? >> Is it just the text/varchar difference? >> >> Regards, Dave. >> >> >> > > Hi Dave, > > the problem doesn't seem to be related to the text field. Please note > the examples below: > > 2. Searching for a string starting with o-Umlaut (german character) > doesn't return any results (in my app.). > > 2006-03-27 16:50:37 [2931] LOG: statement: declare "SQL_CUR0210FD50" > cursor with hold for SELECT t6.* FROM KUNDE t6 WHERE t6.name >= 'ö' > AND t6.name <= 'öz' ORDER BY t6.name ASC, t6.kundenid ASC > > Maybe the WHERE-statement is not parsed by the driver. But the hex > representation of 'ö' is 'C3B6', that is the correct UTF8 code (not > unicode) of o-Umlaut. Hi Johann, Could you try the same query using psql with the client_encoding 'UTF8' ? regards, Hiroshi Inoue
Hiroshi Inoue schrieb: > >> >> 2. Searching for a string starting with o-Umlaut (german character) >> doesn't return any results (in my app.). >> >> 2006-03-27 16:50:37 [2931] LOG: statement: declare "SQL_CUR0210FD50" >> cursor with hold for SELECT t6.* FROM KUNDE t6 WHERE t6.name >= 'ö' >> AND t6.name <= 'öz' ORDER BY t6.name ASC, t6.kundenid ASC >> >> Maybe the WHERE-statement is not parsed by the driver. But the hex >> representation of 'ö' is 'C3B6', that is the correct UTF8 code (not >> unicode) of o-Umlaut. > > > Hi Johann, > Could you try the same query using psql with the client_encoding 'UTF8' ? > > regards, > Hiroshi Inoue > Hi Hiroshi, Do you mean psql on the Linux-server? my locales: de_DE.ISO-8859-1 (default) de_DE.UTF-8 de_DE.UTF-8@euro de_DE.ISO-8859-15@euro I call psql: - set client_encoding='UTF8'; - select name from kunde; result e.g: 'öä-test' (=o-Umlaut, a-Umlaut, -test) - select name from kunde where name >= 'ö' and name <= 'öz' order by name asc; result: ERROR: Unicode characters greater than or equal to 0x10000 are not supported export LANG=de_DE.UTF-8 doesn't change the behavior. (select name from kunde where name >= 'ö' and name <= 'öz' order by name asc; works of course) regards, Johann
Johann Zuschlag schrieb: > > > (select name from kunde where name >= 'ö' and name <= 'öz' order by > name asc; works of course) > > Maybe that is more precise: select name from kunde where name >= 'ö' and name <= 'öz' order by name asc; Does not give an error, but 0 result lines. regards, Johann
Johann Zuschlag wrote: > Johann Zuschlag schrieb: > >> >> >> (select name from kunde where name >= 'ö' and name <= 'öz' order by >> name asc; works of course) >> >> > Maybe that is more precise: > > select name from kunde where name >= 'ö' and name <= 'öz' order by > name asc; > > Does not give an error, but 0 result lines. Thanks. Could you issue the following 2 queries select name from kunde where name >= 'ö' order by name asc; select name from kunde where name <= 'öz' order by name asc; and see the results ? regards, Hiroshi Inoue
Hiroshi Inoue schrieb: > > Thanks. > Could you issue the following 2 queries > > select name from kunde where name >= 'ö' order by name asc; > name -------- öä-test öäüÃà ÃÃ-test (2 Zeilen) > select name from kunde where name <= 'öz' order by name asc; > name -------- Hühne Müller Täst test test-2 (5 Zeilen) The complete "data" set: select name from kunde order by name asc; name --------------- Hühne Müller Täst test test-2 öä-test öäüÃà ÃÃ-test (7 Zeilen) Seems, that the query is not correct my application is sending. Regards, Johann
Johann Zuschlag wrote: > Hiroshi Inoue schrieb: > >> >> Thanks. >> Could you issue the following 2 queries >> >> select name from kunde where name >= 'ö' order by name asc; >> > name > -------- > öä-test > öäüÃà > ÃÃ-test > (2 Zeilen) > >> select name from kunde where name <= 'öz' order by name asc; >> > name > -------- > Hühne > Müller > Täst > test > test-2 > (5 Zeilen) Hmm utf8 code of a-umlaut seems bigger than 'z' . Well how is the result of the following(original ?) query under default encoding ? select name from kunde where name >= 'ö' and name <= 'öz' order by name asc; regards, Hiroshi inoue
When I look at a character map, for instance with Windows Start, Run, All Programs, Accessories, System Tools, Character Map and a extended character set like Arial. I see that all the accented characters (umlauts, graves, etc.) are higher that unaccented letters. I also can see the UTF8 equivalents. Hiroshi Inoue wrote: > Johann Zuschlag wrote: > >> Hiroshi Inoue schrieb: >> >>> >>> Thanks. >>> Could you issue the following 2 queries >>> >>> select name from kunde where name >= 'ö' order by name asc; >>> >> name >> -------- >> öä-test >> öäüÃà >> ÃÃ-test >> (2 Zeilen) >> >>> select name from kunde where name <= 'öz' order by name asc; >>> >> name >> -------- >> Hühne >> Müller >> Täst >> test >> test-2 >> (5 Zeilen) > > > > Hmm utf8 code of a-umlaut seems bigger than 'z' . > > Well how is the result of the following(original ?) query > under default encoding ? > > select name from kunde where name >= 'ö' and name <= 'öz' order by name > asc; > > regards, > Hiroshi inoue > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend
Attachment
Hiroshi Inoue schrieb: > > Hmm utf8 code of a-umlaut seems bigger than 'z' . > > Well how is the result of the following(original ?) query > under default encoding ? > > select name from kunde where name >= 'ö' and name <= 'öz' order by > name asc; > ERROR: Unicode characters greater than or equal to 0x10000 are not supported :-) Regards, Johann
Johann Zuschlag wrote: > Hiroshi Inoue schrieb: > >> >> Hmm utf8 code of a-umlaut seems bigger than 'z' . >> >> Well how is the result of the following(original ?) query >> under default encoding ? >> >> select name from kunde where name >= 'ö' and name <= 'öz' order by >> name asc; >> > ERROR: Unicode characters greater than or equal to 0x10000 are not > supported With the default encoding(ISO-8859-1 ?) ? regards, Hiroshi Inoue
Campbell, Greg wrote: > When I look at a character map, for instance with Windows Start, Run, > All Programs, Accessories, System Tools, Character Map and a extended > character set like Arial. > I see that all the accented characters (umlauts, graves, etc.) are > higher that unaccented letters. I also can see the UTF8 equivalents. > Do you mean the behabior reported by Johann is reasonable ? I'm not familiar with LATIN encoding. regards, Hiroshi Inoue
I think, yes. When I look at characters and numeric representations of them, SORTS and comparison operators seem to be behaving correctly. Hiroshi Inoue wrote: > Campbell, Greg wrote: > >> When I look at a character map, for instance with Windows Start, Run, >> All Programs, Accessories, System Tools, Character Map and a extended >> character set like Arial. >> I see that all the accented characters (umlauts, graves, etc.) are >> higher that unaccented letters. I also can see the UTF8 equivalents. >> > > Do you mean the behabior reported by Johann is reasonable ? > I'm not familiar with LATIN encoding. > > regards, > Hiroshi Inoue > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster
Attachment
"Campbell, Greg" <greg.campbell@us.michelin.com> writes: > When I look at a character map, for instance with Windows Start, Run, > All Programs, Accessories, System Tools, Character Map and a extended > character set like Arial. > I see that all the accented characters (umlauts, graves, etc.) are > higher that unaccented letters. I also can see the UTF8 equivalents. > Not 100% sure I understood this issue, but it seems you are confusing encoding and collation. See strcmp() and strcoll().
Hiroshi Inoue schrieb: > Johann Zuschlag wrote: > >> Hiroshi Inoue schrieb: >> >>> >>> Hmm utf8 code of a-umlaut seems bigger than 'z' . >>> >>> Well how is the result of the following(original ?) query >>> under default encoding ? >>> >>> select name from kunde where name >= 'ö' and name <= 'öz' order by >>> name asc; >>> >> ERROR: Unicode characters greater than or equal to 0x10000 are not >> supported > > > With the default encoding(ISO-8859-1 ?) ? > Oh, I misunderstood you. Do you mean the database default encoding? IIRC ISO-8859-1 would be equivalent to LATIN1? I generated a new database with the same data in LATIN1: test-latin1=# select name from kunde order by name asc; name -------------- Hühne Müller Täst test test-2 öä-test öäüÖÄÜß-test (7 Zeilen) test-latin1=# select name from kunde where name >= 'ö' and name <= 'öz' order by name asc; name ------ (0 Zeilen) And please note the following: test-latin1=# select name from kunde where name >= 'ö' and name <= 'ö' order by name asc; name ------ (0 Zeilen) Hmm... Regards, Johann
> test-latin1=# select name from kunde order by name asc; > name > -------------- > Hühne > Müller > Täst > test > test-2 > öä-test > öäüÖÄÜß-test > (7 Zeilen) > > test-latin1=# select name from kunde where name >= 'ö' and name <= 'öz' > order by name asc; > name > ------ > (0 Zeilen) > > And please note the following: > > test-latin1=# select name from kunde where name >= 'ö' and name <= 'ö' > order by name asc; > name > ------ > (0 Zeilen) > > Hmm... The last query is equivalent with: select name from kunde where name = 'ö' So I'm not surprised with the result ;-) What is the locale for backend? Isn't this the real problem? When you have the problem in psql client feel free to ask in another pgsql-* mailing list. It isn't odbc related bug. Maybe we solve your problem but maybe you get the right advice faster in another list. Regards, Luf
> -----Original Message----- > From: Ludek Finstrle [mailto:luf@pzkagis.cz] > Sent: 29 March 2006 15:35 > To: Johann Zuschlag > Cc: Hiroshi Inoue; Dave Page; pgsql-odbc@postgresql.org > Subject: Re: [ODBC] psqlODBC-Driver Test / text fields > > When you have the problem in psql client feel free to ask > in another pgsql-* mailing list. It isn't odbc related bug. > Maybe we solve your problem but maybe you get the right advice > faster in another list. Hi Luf, Johann is trying out some queries in psql to help Hiroshi Inoue track down a psqlODBC problem (the bug that caused us to release both ANSI and Unicode builds of the driver). Regards, Dave
> > When you have the problem in psql client feel free to ask > > in another pgsql-* mailing list. It isn't odbc related bug. > > Maybe we solve your problem but maybe you get the right advice > > faster in another list. > > Johann is trying out some queries in psql to help Hiroshi Inoue track > down a psqlODBC problem (the bug that caused us to release both ANSI and > Unicode builds of the driver). Hello, maybe I remember it wrong but I remember that there is same problem in psql client. Regards, Luf
> -----Original Message----- > From: Ludek Finstrle [mailto:luf@pzkagis.cz] > Sent: 29 March 2006 15:40 > To: Dave Page > Cc: Ludek Finstrle; Johann Zuschlag; Hiroshi Inoue; > pgsql-odbc@postgresql.org > Subject: Re: [ODBC] psqlODBC-Driver Test / text fields > > > > When you have the problem in psql client feel free to ask > > > in another pgsql-* mailing list. It isn't odbc related bug. > > > Maybe we solve your problem but maybe you get the right advice > > > faster in another list. > > > > Johann is trying out some queries in psql to help Hiroshi > Inoue track > > down a psqlODBC problem (the bug that caused us to release > both ANSI and > > Unicode builds of the driver). > > Hello, > > maybe I remember it wrong but I remember that there is same problem > in psql client. Haven't heard of it. I would think that's the sort of thing that would get fixed pretty quickly to be honest. Regards, Dave
Ludek Finstrle schrieb: >> The last query is equivalent with: >> select name from kunde where name = 'ö' >> >> So I'm not surprised with the result ;-) >> >> What is the locale for backend? Isn't this the real problem? >> Hi Luf, sure, that is the same, but the result is 0 lines for both. :-) test-latin1=# select name from kunde where name = 'ö'; name ------ (0 Zeilen) the psql driver is definitely better than ever. In my (humble) opinion we don't need two drivers anymore. Thanks a lot for your work. So far all my tests with UNICODE, LATIN1 and SQL-ASCII databases seem to work with the new 7.02.260 driver on Win XP, (i.e. insert, update, select) My set-up is: PostgreSQL 8.0X on Debian Sarge Unicode or LATIN1 or SQL-ASCII Database locale=de_DE de_DE.ISO-8859-1 (default) de_DE.UTF-8 de_DE.UTF-8@euro de_DE.ISO-8859-15@euro changing to de_DE didn't change anything, client_encoding didn't help, since the problem doesn't seem to be related to the driver. Hiroshi tracked down the problem by doing the same query with the psql.-client. So you may be right that this an old problem. regards, Johann
> > > > When you have the problem in psql client feel free to ask > > > > in another pgsql-* mailing list. It isn't odbc related bug. > > > > Maybe we solve your problem but maybe you get the right advice > > > > faster in another list. > > > > > > Johann is trying out some queries in psql to help Hiroshi > > Inoue track > > > down a psqlODBC problem (the bug that caused us to release > > both ANSI and > > > Unicode builds of the driver). > > > > Hello, > > > > maybe I remember it wrong but I remember that there is same problem > > in psql client. > > Haven't heard of it. I would think that's the sort of thing that would > get fixed pretty quickly to be honest. Let's try read and it's ancestor: http://archives.postgresql.org/pgsql-odbc/2006-03/msg00188.php Regards, Luf
> -----Original Message----- > From: Ludek Finstrle [mailto:luf@pzkagis.cz] > Sent: 29 March 2006 16:23 > To: Dave Page > Cc: Ludek Finstrle; Johann Zuschlag; Hiroshi Inoue; > pgsql-odbc@postgresql.org > Subject: Re: [ODBC] psqlODBC-Driver Test / text fields > > Let's try read and it's ancestor: > http://archives.postgresql.org/pgsql-odbc/2006-03/msg00188.php I'm not sure I understand that that test is actually valid anyway. Consider the test query: select name from kunde where name >= 'ö'; If 'ö' is 'ö', then isn't the query above mixing single and a multibyte encoding? Ie. It should all be single byte - e.g. select name from kunde where name >= 'ö' order by name asc; Or all multibyte (displayed byte by byte) whatever that results in: s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e* *>*=* *'*ö'*;* Of course, we all know how well I grok encoding issues :-) Regards, Dave.
> >>The last query is equivalent with: > >>select name from kunde where name = 'ö' > >> > >>So I'm not surprised with the result ;-) > >> > >>What is the locale for backend? Isn't this the real problem? > > sure, that is the same, but the result is 0 lines for both. :-) It's ok result. I see no 'ö' in data you have posted. I see only 'ösomething'. > we don't need two drivers anymore. Thanks a lot for your work. So far It's not my work. I only fixed 08.01 branch since 08.01.0102. The enahnced branch is Hiroshi (Inoue and Saito) work. > PostgreSQL 8.0X on Debian Sarge > Unicode or LATIN1 or SQL-ASCII Database You tried this all? > locale=de_DE I think it could be the problem. You have unicode data but you try sort it in de_DE.ISO-8859-1. Could you try change the locale for _backend_ process to de_DE.UTF-8 or de_DE.UTF-8@euro? (Maybe also in postmaster.conf configuration file of postgresql). I'm not sure if it isn't even initdb time related. > de_DE.ISO-8859-1 (default) > de_DE.UTF-8 > de_DE.UTF-8@euro > de_DE.ISO-8859-15@euro > > changing to de_DE didn't change anything, client_encoding didn't help, > since the problem doesn't seem to be related to the driver. I don't speak about psql (client or driver) locale. I speak about backend locale (and settings in postmaster.conf). The sort operation is performed in backend process. I suggest to try ask for this (with psql client problem) in pgsql-general. Regards, Luf
> I'm not sure I understand that that test is actually valid anyway. ... > Of course, we all know how well I grok encoding issues :-) I know the encoding issue (UTF8, UNICODE, ...) at same level as you. My good knowledge ending at single byte encoding. I see no progress for some time so I hope someone in pgsql-general (maybe peter_e) could help when the (or similar) problem is in psql client too. Regards, Luf
Hi Luv,<br /> Hi Hiroshi,<br /> Hi Dave,<br /><br /> I'm sorry, but I made a mistake during my last tests.<br /><blockquotecite="mid20060329153519.GL18148@soptik.pzkagis.cz" type="cite"><pre wrap="">It's not my work. I only fixed 08.01branch since 08.01.0102. The enahnced branch is Hiroshi (Inoue and Saito) work. </pre></blockquote> Yeah, I know. But I hope your work for 8.01.0102 can be included in the future driver<br /><blockquotecite="mid20060329153519.GL18148@soptik.pzkagis.cz" type="cite"><pre wrap=""></pre><blockquote type="cite"><prewrap="">PostgreSQL 8.0X on Debian Sarge Unicode or LATIN1 or SQL-ASCII Database </pre></blockquote><pre wrap=""> You tried this all? </pre></blockquote> Yes, and while doing so many initdb's I made a mistake with my locales for the last tests.<br /> Thanksfor pointing me in the right direction.<br /><br /> So my first test results where indeed correct. I'm sorry for keepingyou all so busy. But I wanted to supply you with complete test results. So finally: concerning the ANSI/UTF8 problemthe driver seems to work (at least for me :-) ). WinME test will follow later.<br /><br /> Only some minor issuesare left:<br /><br /> - sometimes my application gives an (memory) error when leaving. (Will try to trace that)<br/> - continuously I get 'unknown configuration parameter "max_identifier_length"<br /> - Typ "lo" is not existing(I don't think my app is doing that)<br /><br /> I never got the last two with the former drivers. But I'll checkthe mylog.<br /><br /> For further tests I switched back my Debian test system to PostgreSQL 7.4.X since my productionserver is still running that version.<br /><br /> Regards,<br /> Johann<br /><br />
Johann Zuschlag wrote: >Hi Luv, >Hi Hiroshi, >Hi Dave, > >I'm sorry, but I made a mistake during my last tests. > >>It's not my work. I only fixed 08.01 branch since 08.01.0102. >>The enahnced branch is Hiroshi (Inoue and Saito) work. >> >> >> >Yeah, I know. But I hope your work for 8.01.0102 can be included in the future >driver > >>>PostgreSQL 8.0X on Debian Sarge >>>Unicode or LATIN1 or SQL-ASCII Database >>> >>> >> >>You tried this all? >> >> >> >Yes, and while doing so many initdb's I made a mistake with my locales for the >last tests. >Thanks for pointing me in the right direction. > >So my first test results where indeed correct. I'm sorry for keeping you all so >busy. But I wanted to supply you with complete test results. So finally: >concerning the ANSI/UTF8 problem the driver seems to work (at least for me :-) >). WinME test will follow later. > > Thanks for your testing. What we've had so far is *locale* support which I've never used and we've had no *collation* support unfortunaltely. >Only some minor issues are left: >- sometimes my application gives an (memory) error when leaving. (Will try to >trace that) > > I think this is fixed in the dll at http://www.geocities.jp/inocchichichi/psqlodbc/index.html . >- continuously I get 'unknown configuration parameter "max_identifier_length" > > 7.4 doesn't seems to have the max_identifier_length parameter. This may be fixed in the abov dll also.. >- Typ "lo" is not existing (I don't think my app is doing that) > > This isn't an error at all, just checking the existence of lo type when connecting to the server. regards, Hiroshi Inoue
Dave Page schrieb: > If 'ö' is 'ö', then isn't the query above mixing single and a multibyte encoding? Ie. It should all be single byte - e.g. > > select name from kunde where name >= 'ö' order by name asc; > > Or all multibyte (displayed byte by byte) whatever that results in: > > s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e* *>*=* *'*ö'*;* > > Of course, we all know how well I grok encoding issues :-) > Hi Dave, I can understand you. This encoding issues drive me also crazy some times. :-) The problem with UTF-8 is that all ASCII characters are represented by one byte and all non ASCII characters, e.g. German Umlauts, are represented by two bytes. That's why UTF-8 is called a "variable-length multibyte encoding". In a pure Unicode world, e.g. U+xxxx with two bytes, every character is represented by two bytes (fixed-length multibyte encoding). So Unicode is not equal to UTF-8, even though the PostgreSQL documentation is stating that. If you like, see: http://www.utf8-chartable.de/ or some explanation at http://czyborra.com/utf/ Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian. Unfortunately (or fortunately?) Windows seems to use UTF-8 for European languages. Hiroshi can you explain that? I guess the Japanese edition of Windows XP is using pure 2 byte Unicode. I can't say anything about psql. But the new psqlodbc driver 7.03.26X seems to handle that situation very well. So I suppose the test was valid to a certain extend, since the characters are handled in this mixed way in Win XP. I still have some funny behaviour with Unicode in psql (even after setting LC_COLLATE correctly :-) ). For my production machines I will anyway use ISO-8859-1 (or ISO-8859-15). Then the driver will convert all characters to single byte avoiding all kind of problems. But feel free to ask me for tests... ;-) Regards, Johann
Johann Zuschlag wrote: > Dave Page schrieb: > >> If 'ö' is 'ö', then isn't the query above mixing single and a >> multibyte encoding? Ie. It should all be single byte - e.g. >> >> select name from kunde where name >= 'ö' order by name asc; >> >> Or all multibyte (displayed byte by byte) whatever that results in: >> >> s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e* >> *>*=* *'*ö'*;* >> >> Of course, we all know how well I grok encoding issues :-) >> > > Hi Dave, > > I can understand you. This encoding issues drive me also crazy some > times. :-) > > The problem with UTF-8 is that all ASCII characters are represented by > one byte and all non ASCII characters, e.g. German Umlauts, are > represented by two bytes. That's why UTF-8 is called a > "variable-length multibyte encoding". In a pure Unicode world, e.g. > U+xxxx with two bytes, every character is represented by two bytes > (fixed-length multibyte encoding). So Unicode is not equal to UTF-8, > even though the PostgreSQL documentation is stating that. > > If you like, see: http://www.utf8-chartable.de/ or some explanation at > http://czyborra.com/utf/ > > Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian. > Unfortunately (or fortunately?) Windows seems to use UTF-8 for > European languages. Hiroshi can you explain that? I guess the Japanese > edition of Windows XP is using pure 2 byte Unicode. Unicode ODBC drivers handle UCS-2 not UTF-8 even in European environemt. Unfortunately PostgreSQL doesn't handle UCS-2 directly(because it could contain NULL bytes in the string), the unicode driver sets the client_encoding to UTF-8 automatically and converts from UCS-2 data to UTF-8 data which the PostgreSQL backend can understands when sending queries. So what you can see in the backend log is UTF-8. Then the backend converts from UTF-8 data to the server encoding data. After all, the locale (especially LC_COLLATE) setting you need is the one which matches the backend encoding. > > I can't say anything about psql. But the new psqlodbc driver 7.03.26X > seems to handle that situation very well. > > So I suppose the test was valid to a certain extend, Yes thanks. I can't test the LATINxx encoding by myself. regards, Hiroshi Inoue
Johann Zuschlag wrote: > The problem with UTF-8 is that all ASCII characters are represented by > one byte and all non ASCII characters, e.g. German Umlauts, are > represented by two bytes. That's why UTF-8 is called a "variable-length > multibyte encoding". In a pure Unicode world, e.g. U+xxxx with two > bytes, every character is represented by two bytes (fixed-length > multibyte encoding). So Unicode is not equal to UTF-8, even though the > PostgreSQL documentation is stating that. Well, it's actually even more complicated, because Unicode is actually a 32-bit character set. There is actually UTF8 (variable-length multibyte, 8 bits per unit), UTF16 (variable-length multibyte) and UTF32 (fixed-length multibyte). There is also UCS2 (fixed-length 16-bit), which is limited to the 16 bits of the Basic Multilingual Plane, and UCS4, which is functionally identical to UTF32. UTF-8 actually supports up to 4 bytes per character, so it is more complete than the purely 16-bit UCS-2. Any of the variable-length encodings, and the 32-bit UTF-32 and UCS-4 encodings can represent the whole of the character set. A pure Unicode world can use any of those encodings, so it's a tradeoff. If you want a direct relationship between the number of characters in a string and the number of bytes taken, use a fixed-length encoding. If you want to be able to encode everything, use a variable-length encoding or a 32-bit encoding. If you want to use little space, use an 8-bit encoding. That's it. > Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian. > Unfortunately (or fortunately?) Windows seems to use UTF-8 for European > languages. Hiroshi can you explain that? I guess the Japanese edition of > Windows XP is using pure 2 byte Unicode. In fact, the Win32 API is UTF-16 even in European languages(started out as UCS-2 but became UTF-16 when Unicode went 32-bit :-) ), but it provides an 8-bit compatibility interface. Don't know if te 8-bit encoding is UTF-8 or plain 8-bit code pages though. Reference: http://en.wikipedia.org/wiki/Unicode Cheers, Bart
Hiroshi Inoue schrieb: > > Unicode ODBC drivers handle UCS-2 not UTF-8 even in European > environemt. Unfortunately PostgreSQL doesn't handle UCS-2 > directly(because it could contain NULL bytes in the string), the > unicode driver sets the client_encoding to UTF-8 automatically and > converts from UCS-2 data to UTF-8 data which the PostgreSQL backend > can understands when sending queries. So what you > can see in the backend log is UTF-8. Then the backend converts from > UTF-8 data to the server encoding data. After all, the locale > (especially LC_COLLATE) setting you need is the one which matches the > backend encoding. > Hmm..., so Windows XP uses UCS-2 or do be more correct (like Bart mentioned) UTF-16 (which is nearly the same, except for the surrogates). That is converted to UTF-8, sent to the backend and then converted to the proper locale and stored. I've read about the problems with the NULL bytes on Unix machines. Let's have two examples: 1. backend-1 = ISO8859-1 backend-2 = UTF-8 'A' = U+0041 (does windows use big-endian?) Win UCS-2: U+0041 ODBC UTF-8: U+41 backend-1 stores = 0x41 backend-2 stores = U+41 2. 'Ä' = U+00C4 (german A-Umlaut) Win UCS-2: U+00C4 ODBC UTF-8: U+C384 backend-1 stores = 0xC4 backend-2 stores = U+C384 Did I get that right? So I have to be really careful when testing. Regards, Johann
Johann Zuschlag schrieb: > Let's have two examples: > 1. > backend-1 = ISO8859-1 > backend-2 = UTF-8 > > 'A' = U+0041 (does windows use big-endian?) > > Win UCS-2: U+0041 > ODBC UTF-8: U+41 > backend-1 stores = 0x41 > backend-2 stores = U+41 > > 2. > 'Ä' = U+00C4 (german A-Umlaut) > > Win UCS-2: U+00C4 > ODBC UTF-8: U+C384 > backend-1 stores = 0xC4 > backend-2 stores = U+C384 > > Did I get that right? So I have to be really careful when testing. > No, again wrong. Or is it more like this: 1. a) locale = ISO8859-1 backend-1 = LATIN1 b) locale = UTF-8 backend-2 = Unicode 'A' = U+0041 (does windows use big-endian?) Win UCS-2: U+0041 ODBC UTF-8: U+41 backend-1 stores = U+41 backend-2 stores = U+0041 2. 'Ä' = U+00C4 (german A-Umlaut) Win UCS-2: U+00C4 ODBC UTF-8: U+C384 backend-1 stores = 0xC4 backend-2 stores = U+00C4 Did I get that right? Regards, Johann
Johann Zuschlag <zuschlag2@online.de> writes: > 'A' = U+0041 (does windows use big-endian?) Argh, please do not make it even more complex than it needs to be! Endianness is by chance an _independent_ issue. You just care about it at the low-low level when dealing with files or network sockets, but then it's over and you never want to hear about it anymore at a higher level. So U+0041 is an integer whose value is: zero thousand zero hundred forty one and this is always true, whatever is the byte ordering used by the processor. You don't need to know more than this, even when converting to UTF-8 or anything else.
Johann Zuschlag <zuschlag2@online.de> writes: > Hmm..., so Windows XP uses UCS-2 or do be more correct (like Bart > mentioned) UTF-16 (which is nearly the same, except for the > surrogates). It's nearly the same... but that makes a huge difference. The reason why you use fixed-character length encoding in memory is speed. This saves you a lot of time when computing string lengths, look for some characters (isalnum(),...), collating etc. If don't care about all this speed then you better stay in a variable-length encoding like UTF-8 which saves you A LOT of space, especially with small occidental alphabets. I think that by moving from UCS-2 to UTF-16 you lose on BOTH sides [insert some missing benchmarks here] And you can be sure that it brings a lot of bugs: one bug every time some string code has been "forgotten" and not updated, still assuming UCS-2. Anyway those bugs are only for far-away and unknown countries out of the BMP so who cares? :-/ So it really looks like a poor compatibility hack to me (java does it too).
Johann Zuschlag <zuschlag2@online.de> writes: > I've read about the problems with the NULL bytes on Unix machines. This problem is not related to Unix at all but to the programming language used. Most standard C functions use the zero byte convention as a string terminator, so it becomes a forbidden character in C. On the other hand String objects in C++ and Java use a separate length field, and having NULLs inside a string is a no brainer there. The ODBC API has been designed for C and Cobol. Cobol does not forbid zero as a character either. When browsing the ODBC spec you'll notice it carefully caters for the two ways. Guess which programming language is used PostgreSQL. I suspect unicode does not care at all about this. After all unicode is just about characters not about strings.
Marc Herbert wrote: > Johann Zuschlag <zuschlag2@online.de> writes: > >> I've read about the problems with the NULL bytes on Unix machines. > > This problem is not related to Unix at all but to the programming > language used. Most standard C functions use the zero byte convention > as a string terminator, so it becomes a forbidden character in C. > > On the other hand String objects in C++ and Java use a separate length > field, and having NULLs inside a string is a no brainer there. > > The ODBC API has been designed for C and Cobol. Cobol does not forbid > zero as a character either. When browsing the ODBC spec you'll notice > it carefully caters for the two ways. > > > Guess which programming language is used PostgreSQL. C++ even introduced a special alternative character type "wchar_t" for this, just so that people could handle both 8-bit char* and 16-bit wchar_t* strings. In wchar_t* strings, 8-bit NULs are not a problem because only 16-bit NULs count (and AFAIK the Unicode standard does allows this to be interpreted as a NUL aka end-of-string). The downside of this solution is that no application actually uses it, and everybody is stuck with 8-bit ASCII plus a random local codepage unless special support is added. Why didn't they just upgrade chars to 32 bits and be done with it... :-/ Cheers, Bart
Johann Zuschlag wrote: > Johann Zuschlag schrieb: > >> > No, again wrong. Or is it more like this: > > 1. > a) locale = ISO8859-1 > backend-1 = LATIN1 > > b) locale = UTF-8 > backend-2 = Unicode What do you mean by the Unicode and are you really setting b) as above ? First note that in PostgreSQL the encoding has nothing to do with the locale setting. Though PostgreSQL manages the encoding settings by itself, as for the locale setting it completely relies on the OS environment. There exists an essential flaw from the first. Anyway you can change the encoding as you like per database at createdb time but the locale setting LC_COLLATE and LC_CTYPE are fixed at initdb time. regards, Hiroshi Inoue
Bart Samwel <bart@samwel.tk> writes: > > C++ even introduced a special alternative character type "wchar_t" for > this, just so that people could handle both 8-bit char* and 16-bit > wchar_t* strings. In wchar_t* strings, 8-bit NULs are not a problem > because only 16-bit NULs count (and AFAIK the Unicode standard does > allows this to be interpreted as a NUL aka end-of-string). The > downside of this solution is that no application actually uses it, and > everybody is stuck with 8-bit ASCII plus a random local codepage > unless special support is added. wchar_t is not defined as 16-bits, but as "wide enough to hold any character of the platform". For instance if the platform uses UCS-4, then wchar_t is 32 bits wide. (UTF-16 wchar_t violates this) I don't clearly see how you want to use a 8-bit NULL to terminate a (wider) wchar_t array... ? > Why didn't they just upgrade chars to 32 bits and be done with > it... :-/ Because "char" was and is still used to store multibyte / variable-length / encoded characters.
Marc Herbert wrote: > Bart Samwel <bart@samwel.tk> writes: > wchar_t is not defined as 16-bits, but as "wide enough to hold any > character of the platform". For instance if the platform uses UCS-4, > then wchar_t is 32 bits wide. > > (UTF-16 wchar_t violates this) Ahhh, this explains a lot. The same assumption used to be true for char until they came up with UTF-8 char. And they couldn't just upgrade char because too much code assumed that char was one byte. Then platforms started to use UCS-2 wchar_t, then upgraded those to UTF-16 because they couldn't just upgrade wchar_t because too much code assumed that wchar_t was two bytes. Same pattern. Time to introduce wwchar_t_t. :-) > I don't clearly see how you want to use a 8-bit NULL to terminate a > (wider) wchar_t array... ? This was a backreference to a situation mentioned earlier in the discussion, where wchar_t buffers couldn't be "tunneled through" a layer that used char*, as the wider wchar_t characters may contain NUL bytes. Cheers, Bart
Hiroshi Inoue schrieb: > Johann Zuschlag wrote: >> Johann Zuschlag schrieb: >> >>> >> No, again wrong. Or is it more like this: >> >> 1. >> a) locale = ISO8859-1 >> backend-1 = LATIN1 >> >> b) locale = UTF-8 >> backend-2 = Unicode > > What do you mean by the Unicode and are you really setting b) > as above ? > Oh, typo! On my system (Debian Sarge) it is in fact: b) locale = de_DE.UTF-8 backend-2 = Unicode Regards, Johann
Johann Zuschlag wrote: > Hiroshi Inoue schrieb: > >> Johann Zuschlag wrote: >> >>> Johann Zuschlag schrieb: >>> >>>> >>> No, again wrong. Or is it more like this: >>> >>> 1. >>> a) locale = ISO8859-1 >>> backend-1 = LATIN1 >>> >>> b) locale = UTF-8 >>> backend-2 = Unicode >> >> >> What do you mean by the Unicode and are you really setting b) >> as above ? >> > Oh, typo! On my system (Debian Sarge) it is in fact: > > b) locale = de_DE.UTF-8 > backend-2 = Unicode What's the result of show server_encoding ? regards, Hiroshi Inoue
Hiroshi Inoue schrieb: > b) locale = de_DE.UTF-8 >> backend-2 = Unicode > > > What's the result of > show server_encoding Hi Hiroshi, when I use an unicode database it is UNICODE, for LATIN1 it's of course LATIN1. (and for LATIN9 it's LATIN9) Thank you for your explanations. Now I'm understanding what is happening with a windows client on one end and PostgreSQL on the other end. I am just experimenting with the three above settings (initdb with ISO8859-15, locale ISO8859-15 only). So far it seems to work pretty well. No problems with an application using psqlodbc driver. All German characters including euro (LATIN9, UNICODE) seem to work. Sorting is ok. pgAdminIII is producing the same output. psql works as well, except psql in conjunction with unicode behaves still a little bit strange. Maybe my locale setting is not appropriate, but de_DE.UTF-8 doesn't change it. Will check my console later. Unfortunately I couldn't test 7.03.262 (psqlodbc35W.dll) since the zip file has some kind of error. Regards, Johann
Johann Zuschlag wrote: > Hiroshi Inoue schrieb: > >> b) locale = de_DE.UTF-8 >> >>> backend-2 = Unicode >> >> >> >> What's the result of >> show server_encoding > > Hi Hiroshi, Hi Johann, > Unfortunately I couldn't test 7.03.262 (psqlodbc35W.dll) since the zip > file has some kind of error. Sorry. Fixed. regards, Hiroshi Inoue