Thread: BUG #3638: UTF8 Character encoding does NOT work
The following bug has been logged online: Bug reference: 3638 Logged by: Fil Matthews Email address: fil@internetmediapro.com PostgreSQL version: 8-1 , 8-2 Operating system: Linux Debian - Windows XP Description: UTF8 Character encoding does NOT work Details: Judging from the amount of Google page hits with the exact same problem I am surprised and mystified by this obvious flaw in Postgres Technology.. Just how is one expected to work with UTF8 character sets when all and every attempt at using even Postgres clients produces the SAME problem every time ??? "invalid byte sequence for encoding "UTF8": 0x92" In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8 Tables test.text -> (Chararcter varying 10) In any Postgres Client ie psql , dbadmin III Insert into test values ( chr(146));; Query returned successfully: 1 rows affected, 32 ms execution time. copy test to '/tmp/testfile.txt'; Query returned successfully: 1 rows affected, 15 ms execution time. copy test from '/tmp/testfile.txt'; Come on are you serious?? .. Just how does one work with completly valid data that has an ascii 128 + value ?? Currently this flaw make Postgres an un-useable database technology .. Or can some-one please explain this and a possible work around .. ?? Thank You
Why do you think that an UTF-8 encoded string starting with 0x92 is valid? 0x92 can appear in the second, third or fourth octet, but should never appear in the first octet. -- Tatsuo Ishii SRA OSS, Inc. Japan > The following bug has been logged online: > > Bug reference: 3638 > Logged by: Fil Matthews > Email address: fil@internetmediapro.com > PostgreSQL version: 8-1 , 8-2 > Operating system: Linux Debian - Windows XP > Description: UTF8 Character encoding does NOT work > Details: > > Judging from the amount of Google page hits with the exact same problem I am > surprised and mystified by this obvious flaw in Postgres Technology.. > > Just how is one expected to work with UTF8 character sets when all and > every attempt at using even Postgres clients produces the SAME problem > every time ??? > > "invalid byte sequence for encoding "UTF8": 0x92" > > In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8 > > Tables test.text -> (Chararcter varying 10) > > In any Postgres Client ie psql , dbadmin III > > Insert into test values ( chr(146));; > > > Query returned successfully: 1 rows affected, 32 ms execution time. > > copy test to '/tmp/testfile.txt'; > > > Query returned successfully: 1 rows affected, 15 ms execution time. > > copy test from '/tmp/testfile.txt'; > > > Come on are you serious?? .. Just how does one work with completly valid > data that has an ascii 128 + value ?? > > Currently this flaw make Postgres an un-useable database technology .. Or > can some-one please explain this and a possible work around .. ?? > > Thank You > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
Fil Matthews wrote: > Come on are you serious?? .. Just how does one work with completly valid > data that has an ascii 128 + value ?? A string with a single byte that has high-bit set (characters 128-255) is *not* a valid UTF-8 sequence. That's why you get the error. One could argue that chr(146) should throw an error right away, since 146 doesn't represent a valid ASCII character either. In fact, this has been changed in the CVS; in the upcoming 8.3 version, chr(146) in UTF-8 encoding will return a two byte character corresponding the unicode code point 146, which seems to be a control character, marked for "private use only". > Currently this flaw make Postgres an un-useable database technology .. Or > can some-one please explain this and a possible work around .. ?? That's quite an exaggeration, don't you think? As a work around, don't put invalid data in your database. Don't use chr-function, it's really only there for compatibility with other DBMSs that has it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
"Fil Matthews" <fil@internetmediapro.com> writes: > In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8 ... > Insert into test values ( chr(146));; Can you explain what you expected to happen here? Did you, for example, expect the character with Unicode code point 146 to be inserted? Because the single byte 146 isn't a valid UTF8 character. In PostgreSQL 8.2 I don't think there's any function to generate an arbitrary Unicode code point. You'll have to do that on the client end and encode it in UTF8 before sending. In PostgreSQL 8.3 chr() will in fact be modified to do this. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
> Tatsuo Ishii wrote: > > Why do you think that an UTF-8 encoded string starting with 0x92 is > > valid? > > > > 0x92 can appear in the second, third or fourth octet, but should never > > appear in the first octet. > > -- > > Tatsuo Ishii > > SRA OSS, Inc. Japan > > > > > >> The following bug has been logged online: > >> > >> Bug reference: 3638 > >> Logged by: Fil Matthews > >> Email address: fil@internetmediapro.com > >> PostgreSQL version: 8-1 , 8-2 > >> Operating system: Linux Debian - Windows XP > >> Description: UTF8 Character encoding does NOT work > >> Details: > >> > >> Judging from the amount of Google page hits with the exact same problem I am > >> surprised and mystified by this obvious flaw in Postgres Technology.. > >> > >> Just how is one expected to work with UTF8 character sets when all and > >> every attempt at using even Postgres clients produces the SAME problem > >> every time ??? > >> > >> "invalid byte sequence for encoding "UTF8": 0x92" > >> > >> In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8 > >> > >> Tables test.text -> (Chararcter varying 10) > >> > >> In any Postgres Client ie psql , dbadmin III > >> > >> Insert into test values ( chr(146));; > >> > >> > >> Query returned successfully: 1 rows affected, 32 ms execution time. > >> > >> copy test to '/tmp/testfile.txt'; > >> > >> > >> Query returned successfully: 1 rows affected, 15 ms execution time. > >> > >> copy test from '/tmp/testfile.txt'; > >> > >> > >> Come on are you serious?? .. Just how does one work with completly valid > >> data that has an ascii 128 + value ?? > >> > >> Currently this flaw make Postgres an un-useable database technology .. Or > >> can some-one please explain this and a possible work around .. ?? > >> > >> Thank You > >> > >> ---------------------------(end of broadcast)--------------------------- > >> TIP 1: if posting/reading through Usenet, please send an appropriate > >> subscribe-nomail command to majordomo@postgresql.org so that your > >> message can get through to the mailing list cleanly > >> > > > > > > Sorry But I don't agree.. Why can't Postgres store a legitimate 8 bit > byte value that is below 255?? and treat it as text .. > Not being able to do this this makes Postgres unusable.. for storing > TEXT values.. > > I do not know ANY other database technology that doesn't allow some form > of storing a legitimate 8 bit byte ... > > Even the most simplest open -source database in the world (and most > popular) can do this.. > > The biggest and best (Thank you Larry) can do this ... > > Postgres can't. > > In other words You are claiming that UTF8 is actually UTF7 .... No. > There are 8 bits in a byte.. not 7 .. If UTF8 can't by definition > store 8 bits then what standard can?? UTF-8 does not accept arbitary 8 bit characters. The byte ranges UTF-8 accepts are precisely defined in the standard. If our implementation is different from it, please let us know. > The technology is wrong and it is incorrect... If one looks at the > output of the copy file > od -c then QUITE correctly the 8 bit value is stored as the value > given.. > > What then is the problem in putting this value back in the text field it > came from ?? PostgreSQL needs to follow the standard. I suggest you double check the standard. -- Tatsuo Ishii SRA OSS, Inc. Japan
> Then how then does Postgres expect to be able to store text that goes > beyond SQL_ASCII.. ??? Because what you are saying in effect is that If you need to store an arbitrary byte sequences, BYTEA is your friend. -- Tatsuo Ishii SRA OSS, Inc. Japan
> Tatsuo Ishii wrote: > > Why do you think that an UTF-8 encoded string starting with 0x92 is > > valid? > > > > 0x92 can appear in the second, third or fourth octet, but should never > > appear in the first octet. > > -- > > Tatsuo Ishii > > SRA OSS, Inc. Japan > > > > > >> The following bug has been logged online: > >> > >> Bug reference: 3638 > >> Logged by: Fil Matthews > >> Email address: fil@internetmediapro.com > >> PostgreSQL version: 8-1 , 8-2 > >> Operating system: Linux Debian - Windows XP > >> Description: UTF8 Character encoding does NOT work > >> Details: > >> > >> Judging from the amount of Google page hits with the exact same problem I am > >> surprised and mystified by this obvious flaw in Postgres Technology.. > >> > >> Just how is one expected to work with UTF8 character sets when all and > >> every attempt at using even Postgres clients produces the SAME problem > >> every time ??? > >> > >> "invalid byte sequence for encoding "UTF8": 0x92" > >> > >> In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8 > >> > >> Tables test.text -> (Chararcter varying 10) > >> > >> In any Postgres Client ie psql , dbadmin III > >> > >> Insert into test values ( chr(146));; > >> > >> > >> Query returned successfully: 1 rows affected, 32 ms execution time. > >> > >> copy test to '/tmp/testfile.txt'; > >> > >> > >> Query returned successfully: 1 rows affected, 15 ms execution time. > >> > >> copy test from '/tmp/testfile.txt'; > >> > >> > >> Come on are you serious?? .. Just how does one work with completly valid > >> data that has an ascii 128 + value ?? > >> > >> Currently this flaw make Postgres an un-useable database technology .. Or > >> can some-one please explain this and a possible work around .. ?? > >> > >> Thank You > >> > >> ---------------------------(end of broadcast)--------------------------- > >> TIP 1: if posting/reading through Usenet, please send an appropriate > >> subscribe-nomail command to majordomo@postgresql.org so that your > >> message can get through to the mailing list cleanly > >> > > > > > > Sorry But I don't agree.. Why can't Postgres store a legitimate 8 bit > byte value that is below 255?? and treat it as text .. > Not being able to do this this makes Postgres unusable.. for storing > TEXT values.. > > I do not know ANY other database technology that doesn't allow some form > of storing a legitimate 8 bit byte ... > > Even the most simplest open -source database in the world (and most > popular) can do this.. > > The biggest and best (Thank you Larry) can do this ... > > Postgres can't. > > In other words You are claiming that UTF8 is actually UTF7 .... No. > There are 8 bits in a byte.. not 7 .. If UTF8 can't by definition > store 8 bits then what standard can?? UTF-8 does not accept arbitary 8 bit characters. The byte ranges UTF-8 accepts are precisely defined in the standard. If our implementation is different from it, please let us know. > The technology is wrong and it is incorrect... If one looks at the > output of the copy file > od -c then QUITE correctly the 8 bit value is stored as the value > given.. > > What then is the problem in putting this value back in the text field it > came from ?? PostgreSQL needs to follow the standard. That's it. -- Tatsuo Ishii SRA OSS, Inc. Japan