Thread: BUG #3638: UTF8 Character encoding does NOT work

BUG #3638: UTF8 Character encoding does NOT work

From

"Fil Matthews"

Date:

27 September 2007, 04:32:31

The following bug has been logged online:

Bug reference:      3638
Logged by:          Fil Matthews
Email address:      fil@internetmediapro.com
PostgreSQL version: 8-1    , 8-2
Operating system:   Linux  Debian  - Windows XP
Description:        UTF8 Character encoding does NOT work
Details:

Judging from the amount of Google page hits with the exact same problem I am
surprised and mystified by this obvious flaw in Postgres Technology..

Just how is one expected to work with  UTF8 character sets when all and
every attempt at using even Postgres clients produces the SAME problem
every time ???

 "invalid byte sequence for encoding "UTF8": 0x92"

In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8

Tables test.text ->   (Chararcter varying 10)

In any  Postgres Client ie  psql , dbadmin III

Insert into test values (  chr(146));;


Query returned successfully: 1 rows affected, 32 ms execution time.

copy test to '/tmp/testfile.txt';


Query returned successfully: 1 rows affected, 15 ms execution time.

copy test from '/tmp/testfile.txt';


Come on are you serious?? ..  Just how does one work with completly valid
data that has an ascii 128 +  value ??

Currently this flaw make Postgres an un-useable database technology ..  Or
can some-one please explain this and a possible work around ..  ??

Thank You

Re: BUG #3638: UTF8 Character encoding does NOT work

From

Tatsuo Ishii

Date:

27 September 2007, 04:58:37

Why do you think that an UTF-8 encoded string starting with 0x92 is
valid?

0x92 can appear in the second, third or fourth octet, but should never
appear in the first octet.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> The following bug has been logged online:
>
> Bug reference:      3638
> Logged by:          Fil Matthews
> Email address:      fil@internetmediapro.com
> PostgreSQL version: 8-1    , 8-2
> Operating system:   Linux  Debian  - Windows XP
> Description:        UTF8 Character encoding does NOT work
> Details:
>
> Judging from the amount of Google page hits with the exact same problem I am
> surprised and mystified by this obvious flaw in Postgres Technology..
>
> Just how is one expected to work with  UTF8 character sets when all and
> every attempt at using even Postgres clients produces the SAME problem
> every time ???
>
>  "invalid byte sequence for encoding "UTF8": 0x92"
>
> In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8
>
> Tables test.text ->   (Chararcter varying 10)
>
> In any  Postgres Client ie  psql , dbadmin III
>
> Insert into test values (  chr(146));;
>
>
> Query returned successfully: 1 rows affected, 32 ms execution time.
>
> copy test to '/tmp/testfile.txt';
>
>
> Query returned successfully: 1 rows affected, 15 ms execution time.
>
> copy test from '/tmp/testfile.txt';
>
>
> Come on are you serious?? ..  Just how does one work with completly valid
> data that has an ascii 128 +  value ??
>
> Currently this flaw make Postgres an un-useable database technology ..  Or
> can some-one please explain this and a possible work around ..  ??
>
> Thank You
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly

Re: BUG #3638: UTF8 Character encoding does NOT work

From

Heikki Linnakangas

Date:

27 September 2007, 05:00:43

Fil Matthews wrote:
> Come on are you serious?? ..  Just how does one work with completly valid
> data that has an ascii 128 +  value ??

A string with a single byte that has high-bit set (characters 128-255)
is *not* a valid UTF-8 sequence. That's why you get the error. One could
argue that chr(146) should throw an error right away, since 146 doesn't
represent a valid ASCII character either.

In fact, this has been changed in the CVS; in the upcoming 8.3 version,
chr(146) in UTF-8 encoding will return a two byte character
corresponding the unicode code point 146, which seems to be a control
character, marked for "private use only".

> Currently this flaw make Postgres an un-useable database technology ..  Or
> can some-one please explain this and a possible work around ..  ??

That's quite an exaggeration, don't you think? As a work around, don't
put invalid data in your database. Don't use chr-function, it's really
only there for compatibility with other DBMSs that has it.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #3638: UTF8 Character encoding does NOT work

From

Gregory Stark

Date:

27 September 2007, 05:46:35

"Fil Matthews" <fil@internetmediapro.com> writes:

> In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8
...
> Insert into test values (  chr(146));;

Can you explain what you expected to happen here? Did you, for example, expect
the character with Unicode code point 146 to be inserted? Because the single
byte 146 isn't a valid UTF8 character.

In PostgreSQL 8.2 I don't think there's any function to generate an arbitrary
Unicode code point. You'll have to do that on the client end and encode it in
UTF8 before sending. In PostgreSQL 8.3 chr() will in fact be modified to do
this.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: BUG #3638: UTF8 Character encoding does NOT work

From

Tatsuo Ishii

Date:

27 September 2007, 22:27:03

> Tatsuo Ishii wrote:
> > Why do you think that an UTF-8 encoded string starting with 0x92 is
> > valid?
> >
> > 0x92 can appear in the second, third or fourth octet, but should never
> > appear in the first octet.
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
> >
> >
> >> The following bug has been logged online:
> >>
> >> Bug reference:      3638
> >> Logged by:          Fil Matthews
> >> Email address:      fil@internetmediapro.com
> >> PostgreSQL version: 8-1    , 8-2
> >> Operating system:   Linux  Debian  - Windows XP
> >> Description:        UTF8 Character encoding does NOT work
> >> Details:
> >>
> >> Judging from the amount of Google page hits with the exact same problem I am
> >> surprised and mystified by this obvious flaw in Postgres Technology..
> >>
> >> Just how is one expected to work with  UTF8 character sets when all and
> >> every attempt at using even Postgres clients produces the SAME problem
> >> every time ???
> >>
> >>  "invalid byte sequence for encoding "UTF8": 0x92"
> >>
> >> In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8
> >>
> >> Tables test.text ->   (Chararcter varying 10)
> >>
> >> In any  Postgres Client ie  psql , dbadmin III
> >>
> >> Insert into test values (  chr(146));;
> >>
> >>
> >> Query returned successfully: 1 rows affected, 32 ms execution time.
> >>
> >> copy test to '/tmp/testfile.txt';
> >>
> >>
> >> Query returned successfully: 1 rows affected, 15 ms execution time.
> >>
> >> copy test from '/tmp/testfile.txt';
> >>
> >>
> >> Come on are you serious?? ..  Just how does one work with completly valid
> >> data that has an ascii 128 +  value ??
> >>
> >> Currently this flaw make Postgres an un-useable database technology ..  Or
> >> can some-one please explain this and a possible work around ..  ??
> >>
> >> Thank You
> >>
> >> ---------------------------(end of broadcast)---------------------------
> >> TIP 1: if posting/reading through Usenet, please send an appropriate
> >>        subscribe-nomail command to majordomo@postgresql.org so that your
> >>        message can get through to the mailing list cleanly
> >>
> >
> >
>
> Sorry But I don't agree.. Why can't Postgres store a legitimate 8 bit
> byte value that is below 255??  and treat it as text ..
> Not being able to do this this makes Postgres unusable.. for storing
> TEXT values..
>
> I do not know ANY other database technology that doesn't allow some form
> of storing a legitimate 8 bit byte ...
>
> Even the most simplest  open -source database in the world  (and most
> popular)  can do this..
>
> The biggest and best  (Thank you Larry) can do this ...
>
> Postgres can't.
>
> In other words  You are claiming that UTF8  is  actually UTF7 ....

No.

> There are 8 bits in a byte.. not 7 ..  If UTF8  can't by definition
> store 8 bits  then what standard can??

UTF-8 does not accept arbitary 8 bit characters. The byte ranges UTF-8
accepts are precisely defined in the standard. If our implementation
is different from it, please let us know.

> The technology is wrong  and it is incorrect...  If one looks at the
> output  of the copy file
> od -c       then QUITE correctly the 8 bit value  is stored as the value
> given..
>
> What then is the problem in putting this value back in the text field it
> came from ??

PostgreSQL needs to follow the standard. I suggest you double check
the standard.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Re: BUG #3638: UTF8 Character encoding does NOT work

From

Tatsuo Ishii

Date:

28 September 2007, 00:40:52

> Then how then does Postgres  expect to be able to store text that goes
> beyond SQL_ASCII..  ???  Because what you are saying in effect is that

If you need to store an arbitrary byte sequences, BYTEA is your
friend.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Re: BUG #3638: UTF8 Character encoding does NOT work

From

Tatsuo Ishii

Date:

28 September 2007, 05:09:38

> Tatsuo Ishii wrote:
> > Why do you think that an UTF-8 encoded string starting with 0x92 is
> > valid?
> >
> > 0x92 can appear in the second, third or fourth octet, but should never
> > appear in the first octet.
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
> >
> >
> >> The following bug has been logged online:
> >>
> >> Bug reference:      3638
> >> Logged by:          Fil Matthews
> >> Email address:      fil@internetmediapro.com
> >> PostgreSQL version: 8-1    , 8-2
> >> Operating system:   Linux  Debian  - Windows XP
> >> Description:        UTF8 Character encoding does NOT work
> >> Details:
> >>
> >> Judging from the amount of Google page hits with the exact same problem I am
> >> surprised and mystified by this obvious flaw in Postgres Technology..
> >>
> >> Just how is one expected to work with  UTF8 character sets when all and
> >> every attempt at using even Postgres clients produces the SAME problem
> >> every time ???
> >>
> >>  "invalid byte sequence for encoding "UTF8": 0x92"
> >>
> >> In Short A Postgres UTF8 database .. PGCLIENENCODING=UTF8
> >>
> >> Tables test.text ->   (Chararcter varying 10)
> >>
> >> In any  Postgres Client ie  psql , dbadmin III
> >>
> >> Insert into test values (  chr(146));;
> >>
> >>
> >> Query returned successfully: 1 rows affected, 32 ms execution time.
> >>
> >> copy test to '/tmp/testfile.txt';
> >>
> >>
> >> Query returned successfully: 1 rows affected, 15 ms execution time.
> >>
> >> copy test from '/tmp/testfile.txt';
> >>
> >>
> >> Come on are you serious?? ..  Just how does one work with completly valid
> >> data that has an ascii 128 +  value ??
> >>
> >> Currently this flaw make Postgres an un-useable database technology ..  Or
> >> can some-one please explain this and a possible work around ..  ??
> >>
> >> Thank You
> >>
> >> ---------------------------(end of broadcast)---------------------------
> >> TIP 1: if posting/reading through Usenet, please send an appropriate
> >>        subscribe-nomail command to majordomo@postgresql.org so that your
> >>        message can get through to the mailing list cleanly
> >>
> >
> >
>
> Sorry But I don't agree.. Why can't Postgres store a legitimate 8 bit
> byte value that is below 255??  and treat it as text ..
> Not being able to do this this makes Postgres unusable.. for storing
> TEXT values..
>
> I do not know ANY other database technology that doesn't allow some form
> of storing a legitimate 8 bit byte ...
>
> Even the most simplest  open -source database in the world  (and most
> popular)  can do this..
>
> The biggest and best  (Thank you Larry) can do this ...
>
> Postgres can't.
>
> In other words  You are claiming that UTF8  is  actually UTF7 ....

No.

> There are 8 bits in a byte.. not 7 ..  If UTF8  can't by definition
> store 8 bits  then what standard can??

UTF-8 does not accept arbitary 8 bit characters. The byte ranges UTF-8
accepts are precisely defined in the standard. If our implementation
is different from it, please let us know.

> The technology is wrong  and it is incorrect...  If one looks at the
> output  of the copy file
> od -c       then QUITE correctly the 8 bit value  is stored as the value
> given..
>
> What then is the problem in putting this value back in the text field it
> came from ??

PostgreSQL needs to follow the standard. That's it.
--
Tatsuo Ishii
SRA OSS, Inc. Japan