Thread: BUG #3819: UTF8 can't handle \000

BUG #3819: UTF8 can't handle \000

From: "Franklin Schmidt"
The following bug has been logged online:

Bug reference:      3819
Logged by:          Franklin Schmidt
Email address:      fschmidt@gmail.com
PostgreSQL version: 8.2
Operating system:   XP & Linux
Description:        UTF8 can't handle \000
Details:

Trying to store \000 in a text field with UTF8 encoding causes an error. I
assume this is because Postgres is written in C, but it's still wrong.  A
solution was suggested here:

http://www.nabble.com/invalid-byte-sequence-for-encoding-%22UTF8%22%3A-0x00-tp9058998p9096326.html

"I can think of some ways the server could support it without extensive
changes .. e.g. use a modified UTF8 representation which stores \u0000 as
0xc0 0x80 internally"
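
The quoted idea can be sketched in a few lines of Python. This is only an illustration of the byte-level mapping, not anything the server does: the function names are made up for this example, and it covers only the U+0000 substitution (Java's "modified UTF-8" also treats supplementary characters differently, which is ignored here). The substitution is unambiguous because the byte 0x00 appears in standard UTF-8 only as U+0000, and the byte 0xC0 never appears in standard UTF-8 at all.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the two bytes 0xC0 0x80."""
    # In standard UTF-8 the byte 0x00 can only be U+0000, so a plain
    # byte-level replace cannot corrupt any other character.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_modified_utf8(data: bytes) -> str:
    """Reverse encode_modified_utf8: map 0xC0 0x80 back to U+0000."""
    # 0xC0 is never a legal byte in standard UTF-8, so this pair can
    # only be the substituted NUL.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


if __name__ == "__main__":
    stored = encode_modified_utf8("a\x00b")
    print(stored)                               # contains no zero byte
    print(decode_modified_utf8(stored) == "a\x00b")
```

The encoded form contains no zero byte, so it survives being passed around as a C string, which is the point of the suggestion.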

Re: BUG #3819: UTF8 can't handle \000

From: Bruce Momjian
Franklin Schmidt wrote:
>
> The following bug has been logged online:
>
> Bug reference:      3819
> Logged by:          Franklin Schmidt
> Email address:      fschmidt@gmail.com
> PostgreSQL version: 8.2
> Operating system:   XP & Linux
> Description:        UTF8 can't handle \000
> Details:
>
> Trying to store \000 in a text field with UTF8 encoding causes an error. I
> assume this is because Postgres is written in C, but it's still wrong.  A
> solution was suggested here:
>
> http://www.nabble.com/invalid-byte-sequence-for-encoding-%22UTF8%22%3A-0x00-tp9058998p9096326.html
>
> "I can think of some ways the server could support it without extensive
> changes .. e.g. use a modified UTF8 representation which stores \u0000 as
> 0xc0 0x80 internally"

Uh, as far as I know 0x00 is not a valid UTF8 byte value. I suggest you
use bytea to store 0x00.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
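
For readers who want to try the bytea route suggested above, here is a minimal sketch of building and parsing a bytea value in hex format. It rests on an assumption that does not hold for the 8.2 server in this report: the `\x...` hex format for bytea exists only in PostgreSQL 9.0 and later, and the literal as written assumes `standard_conforming_strings` is on. The function names are hypothetical, and in real code a driver with bound parameters is usually a better choice than hand-built literals.

```python
def bytea_hex_literal(data: bytes) -> str:
    """Build a SQL literal for a bytea value in hex format.

    Assumes PostgreSQL 9.0+ and standard_conforming_strings = on.
    """
    return "'\\x" + data.hex() + "'::bytea"


def parse_bytea_hex(text: str) -> bytes:
    """Parse bytea hex-format output (a leading \\x, then hex digits)."""
    if not text.startswith("\\x"):
        raise ValueError("not a hex-format bytea value")
    return bytes.fromhex(text[2:])


if __name__ == "__main__":
    print(bytea_hex_literal(b"a\x00b"))
    print(parse_bytea_hex("\\x610062"))
```

Zero bytes pass through unchanged because bytea carries an explicit length instead of relying on a terminator.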

Re: BUG #3819: UTF8 can't handle \000

From: Bruce Momjian
Franklin Schmidt wrote:
> On Dec 17, 2007 12:54 AM, Bruce Momjian <bruce@momjian.us> wrote:
> >
> > Uh, as far as I know 0x00 is not a valid UTF8 byte value.
>
>
> I think it is a valid value.  RFC 3629 says:
>
> "Character numbers from U+0000 to U+007F (US-ASCII repertoire)
> correspond to octets 00 to 7F (7 bit US-ASCII values)."
>
> http://www.faqs.org/rfcs/rfc3629.html

Well, I realize 0x00 is a valid ASCII value and therefore a valid UTF8
value, but we have never had anyone complain that they can't store the
0x00 character, because it doesn't mean anything in ASCII.  They use
bytea to store binary data like 0x00.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: BUG #3819: UTF8 can't handle \000

From: "Franklin Schmidt"
On Dec 17, 2007 12:54 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> Uh, as far as I know 0x00 is not a valid UTF8 byte value.


I think it is a valid value.  RFC 3629 says:

"Character numbers from U+0000 to U+007F (US-ASCII repertoire)
correspond to octets 00 to 7F (7 bit US-ASCII values)."

http://www.faqs.org/rfcs/rfc3629.html
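
The RFC text quoted above is easy to check with any conforming UTF-8 codec. As a quick sketch, Python's built-in codec (used here only as a convenient reference implementation, with variable names chosen for this example) encodes U+0000 as the single octet 00 and accepts that octet when decoding:

```python
# U+0000 encodes to the single octet 0x00.
encoded = "\u0000".encode("utf-8")

# Decoding raises UnicodeDecodeError on invalid input; 0x00 is accepted.
decoded = b"\x00".decode("utf-8")

# Every code point in U+0000..U+007F maps to the identical single octet.
ascii_round_trip = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80)
)

if __name__ == "__main__":
    print(encoded, repr(decoded), ascii_round_trip)
```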

Re: BUG #3819: UTF8 can't handle \000

From: "Franklin Schmidt"
On Dec 17, 2007 1:28 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> Well, I realize 0x00 is a valid ASCII value and therefore a valid UTF8
> value, but we have never had anyone complain that they can't store the
> 0x00 character, because it doesn't mean anything in ASCII.  They use
> bytea to store binary data like 0x00.


Here are a few complaints:

http://www.nabble.com/-tp9058998.html
http://www.nabble.com/-tp11750041.html
http://www.nabble.com/-tp8414157.html

I agree that storing 0x00 in a UTF8 string is weird, but I am
converting a huge database to postgres, and in a huge database, weird
things happen.  Using bytea for a text field just because one in a
million records has a 0x00 doesn't make sense to me.  I did hack
around it in my conversion code to remove the 0x00 but I expect that
anyone else who tries converting a big database to postgres will also
confront this issue.
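
The conversion code itself is not shown in this thread, so the following is only a sketch of what such a workaround might look like, with hypothetical function names and rows assumed to arrive as tuples of Python values. It removes U+0000 from string fields and reports which rows were changed, so the affected records can be reviewed instead of being altered silently.

```python
NUL = "\x00"


def strip_nul(value):
    """Remove U+0000 from str values; pass every other type through."""
    if isinstance(value, str):
        return value.replace(NUL, "")
    return value


def clean_rows(rows):
    """Yield (cleaned_row, changed) for each input row."""
    for row in rows:
        original = tuple(row)
        cleaned = tuple(strip_nul(v) for v in original)
        yield cleaned, cleaned != original


if __name__ == "__main__":
    sample = [("ok", 1), ("ba\x00d", None)]
    for cleaned, changed in clean_rows(sample):
        print(cleaned, "changed" if changed else "unchanged")
```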

Re: BUG #3819: UTF8 can't handle \000

From: Heikki Linnakangas
Franklin Schmidt wrote:
> I agree that storing 0x00 in a UTF8 string is weird, but I am
> converting a huge database to postgres, and in a huge database, weird
> things happen.  Using bytea for a text field just because one in a
> million records has a 0x00 doesn't make sense to me.  I did hack
> around it in my conversion code to remove the 0x00 but I expect that
> anyone else who tries converting a big database to postgres will also
> confront this issue.

That's the right solution. If you have 0x00 bytes in your text fields,
you're much better off cleaning them out than trying to work around
them.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com