Thread: BUG #3819: UTF8 can't handle \000
The following bug has been logged online:

Bug reference: 3819
Logged by: Franklin Schmidt
Email address: fschmidt@gmail.com
PostgreSQL version: 8.2
Operating system: XP & Linux
Description: UTF8 can't handle \000
Details:

Trying to store \000 in a text field with UTF8 encoding causes an error. I assume this is because Postgres is written in C, but it's still wrong. A solution was suggested here:

http://www.nabble.com/invalid-byte-sequence-for-encoding-%22UTF8%22%3A-0x00-tp9058998p9096326.html

"I can think of some ways the server could support it without extensive changes .. e.g. use a modified UTF8 representation which stores \u0000 as 0xc0 0x80 internally"
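For readers unfamiliar with the "modified UTF8" idea quoted in the report, here is a minimal Python sketch of the byte-level trick. It only illustrates the encoding itself; it is not how PostgreSQL stores text, and the function names `encode_modified_utf8` and `decode_modified_utf8` are made up for this example. It covers only the U+0000 substitution and none of the other differences found in variants such as Java's modified UTF-8.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the two bytes 0xC0 0x80."""
    # In standard UTF-8 the byte 0x00 can only come from U+0000, because
    # every byte of a multi-byte sequence has its high bit set.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_modified_utf8(data: bytes) -> str:
    """Reverse encode_modified_utf8: map 0xC0 0x80 back to U+0000."""
    # 0xC0 never occurs in valid standard UTF-8, so the pair is unambiguous.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


stored = encode_modified_utf8("a\x00b")
restored = decode_modified_utf8(stored)
```

The stored form contains no zero byte, so C code that treats 0x00 as a string terminator would not truncate it. The cost is that the stored bytes are no longer valid standard UTF-8, so anything that reads them must know about the substitution.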
Franklin Schmidt wrote:
>
> The following bug has been logged online:
>
> Bug reference: 3819
> Logged by: Franklin Schmidt
> Email address: fschmidt@gmail.com
> PostgreSQL version: 8.2
> Operating system: XP & Linux
> Description: UTF8 can't handle \000
> Details:
>
> Trying to store \000 in a text field with UTF8 encoding causes an error. I
> assume this is because Postgres is written in C, but it's still wrong. A
> solution was suggested here:
>
> http://www.nabble.com/invalid-byte-sequence-for-encoding-%22UTF8%22%3A-0x00-tp9058998p9096326.html
>
> "I can think of some ways the server could support it without extensive
> changes .. e.g. use a modified UTF8 representation which stores \u0000 as
> 0xc0 0x80 internally"

Uh, as far as I know 0x00 is not a valid UTF8 byte value. I suggest you use bytea to store 0x00.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Franklin Schmidt wrote:
> On Dec 17, 2007 12:54 AM, Bruce Momjian <bruce@momjian.us> wrote:
> >
> > Uh, as far as I know 0x00 is not a valid UTF8 byte value.
>
> I think it is a valid value. RFC 3629 says:
>
> "Character numbers from U+0000 to U+007F (US-ASCII repertoire)
> correspond to octets 00 to 7F (7 bit US-ASCII values)."
>
> http://www.faqs.org/rfcs/rfc3629.html

Well, I realize 0x00 is a valid ASCII value and therefore a valid UTF8 value but we have never had anyone complain they can't store the 0x00 character because it doesn't mean anything in ASCII. They use bytea to store binary data like 0x00.

--
Bruce Momjian <bruce@momjian.us>
On Dec 17, 2007 12:54 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> Uh, as far as I know 0x00 is not a valid UTF8 byte value.

I think it is a valid value. RFC 3629 says:

"Character numbers from U+0000 to U+007F (US-ASCII repertoire) correspond to octets 00 to 7F (7 bit US-ASCII values)."

http://www.faqs.org/rfcs/rfc3629.html
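The RFC 3629 passage quoted above can be checked quickly with any conforming UTF-8 codec. A small sketch using Python's built-in codec (chosen here only for convenience; it says nothing about what the PostgreSQL server accepts):

```python
# U+0000 through U+007F each encode to a single octet with the same value.
for code_point in range(0x80):
    assert chr(code_point).encode("utf-8") == bytes([code_point])

# In particular, U+0000 round-trips through the single octet 0x00.
nul_encoded = "\u0000".encode("utf-8")
nul_decoded = b"\x00".decode("utf-8")
```

So the octet 0x00 is well-formed UTF-8 as far as the encoding goes. The error in the report comes from the server rejecting it in text values, not from the UTF-8 definition.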
On Dec 17, 2007 1:28 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> Well, I realize 0x00 is a valid ASCII value and therefore a valid UTF8
> value but we have never had anyone complain they can't store the 0x00
> character because it doesn't mean anything in ASCII. They use bytea to
> store binary data like 0x00.

Here are a few complaints:

http://www.nabble.com/-tp9058998.html
http://www.nabble.com/-tp11750041.html
http://www.nabble.com/-tp8414157.html

I agree that storing 0x00 in a UTF8 string is weird, but I am converting a huge database to postgres, and in a huge database, weird things happen. Using bytea for a text field just because one in a million records has a 0x00 doesn't make sense to me. I did hack around it in my conversion code to remove the 0x00 but I expect that anyone else who tries converting a big database to postgres will also confront this issue.
Franklin Schmidt wrote:
> I agree that storing 0x00 in a UTF8 string is weird, but I am
> converting a huge database to postgres, and in a huge database, weird
> things happen. Using bytea for a text field just because one in a
> million records has a 0x00 doesn't make sense to me. I did hack
> around it in my conversion code to remove the 0x00 but I expect that
> anyone else who tries converting a big database to postgres will also
> confront this issue.

That's the right solution. If you have 0x00 bytes in your text fields, you're much better off cleaning them away anyway, than trying to work around them.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
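For anyone hitting the same problem during a migration, the cleanup discussed above might look roughly like the following. The thread does not show the actual conversion code, so this is only a sketch: the helper names `strip_nul` and `clean_row` are hypothetical, and it assumes rows arrive as tuples of already-decoded Python values.

```python
def strip_nul(value):
    """Return value with U+0000 removed if it is a str; otherwise unchanged."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    return value


def clean_row(row):
    """Apply strip_nul to every column and report whether anything changed."""
    cleaned = tuple(strip_nul(col) for col in row)
    return cleaned, cleaned != tuple(row)


cleaned, changed = clean_row(("ab\x00c", 42, None))
```

Returning the `changed` flag lets the conversion job log which source rows were altered, which is worth having when only about one record in a million is affected and someone later asks why the data differs from the original.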