Thread: Making TEXT NUL-transparent
Occasionally, we get bitten by embedded NUL bytes in TEXT values. We take care of generating proper UTF-8, but this additional restriction sometimes slips by. It would be really helpful if PostgreSQL could store such TEXT fields as-is (at least if they are computed internally, or come from query parameters). I haven't even looked how difficult it would be to implement this. Do you think it's worth the trouble? -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Hello 2011/11/24 Florian Weimer <fweimer@bfk.de>: > Occasionally, we get bitten by embedded NUL bytes in TEXT values. We > take care of generating proper UTF-8, but this additional restriction > sometimes slips by. It would be really helpful if PostgreSQL could > store such TEXT fields as-is (at least if they are computed internally, > or come from query parameters). > and why you don't use bytea ? Text should be correct literal. Regards Pavel Stehule > I haven't even looked how difficult it would be to implement this. Do > you think it's worth the trouble? > > -- > Florian Weimer <fweimer@bfk.de> > BFK edv-consulting GmbH http://www.bfk.de/ > Kriegsstraße 100 tel: +49-721-96201-1 > D-76133 Karlsruhe fax: +49-721-96201-99 > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
* Pavel Stehule: > Hello > > 2011/11/24 Florian Weimer <fweimer@bfk.de>: >> Occasionally, we get bitten by embedded NUL bytes in TEXT values. We >> take care of generating proper UTF-8, but this additional restriction >> sometimes slips by. It would be really helpful if PostgreSQL could >> store such TEXT fields as-is (at least if they are computed internally, >> or come from query parameters). >> > > and why you don't use bytea ? Text should be correct literal. It's actually UTF-8 text, and some PostgreSQL functions are only available for TEXT, but not BYTEA, e.g.: bfk_int=> SELECT '\x006500'::bytea ~ 'A'; ERROR: operator does not exist: bytea ~ unknown LINE 1: SELECT '\x006500'::bytea ~ 'A'; ^ HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Excerpts from Florian Weimer's message of Thu Nov 24 11:27:51 +0200 2011: > > > and why you don't use bytea ? Text should be correct literal. > > It's actually UTF-8 text, and some PostgreSQL functions are only > available for TEXT, but not BYTEA, e.g.: > > bfk_int=> SELECT '\x006500'::bytea ~ 'A'; > ERROR: operator does not exist: bytea ~ unknown And how will those TEXT functions behave on a value with an embedded NUL? Or is it not only about being able to *store*NULs in a text field? -- Alex
2011/11/24 Florian Weimer <fweimer@bfk.de>: > * Pavel Stehule: > >> Hello >> >> 2011/11/24 Florian Weimer <fweimer@bfk.de>: >>> Occasionally, we get bitten by embedded NUL bytes in TEXT values. We >>> take care of generating proper UTF-8, but this additional restriction >>> sometimes slips by. It would be really helpful if PostgreSQL could >>> store such TEXT fields as-is (at least if they are computed internally, >>> or come from query parameters). >>> >> >> and why you don't use bytea ? Text should be correct literal. > > It's actually UTF-8 text, and some PostgreSQL functions are only > available for TEXT, but not BYTEA, e.g.: > > bfk_int=> SELECT '\x006500'::bytea ~ 'A'; > ERROR: operator does not exist: bytea ~ unknown > LINE 1: SELECT '\x006500'::bytea ~ 'A'; > ^ > HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts. > then you can prepare some custom function that can filter zeroes and prepare valid text functions - some postgresql's functions (based on external libs) cannot to work with string containing zeroes too. Probably you can do it in plperl. I don't think so this functionality is generally requested. If you have broken strings, then you have to clean it before storing to database. Regards Pavel > -- > Florian Weimer <fweimer@bfk.de> > BFK edv-consulting GmbH http://www.bfk.de/ > Kriegsstraße 100 tel: +49-721-96201-1 > D-76133 Karlsruhe fax: +49-721-96201-99 >
* Alexander Shulgin: >> It's actually UTF-8 text, and some PostgreSQL functions are only >> available for TEXT, but not BYTEA, e.g.: >> >> bfk_int=> SELECT '\x006500'::bytea ~ 'A'; >> ERROR: operator does not exist: bytea ~ unknown > > And how will those TEXT functions behave on a value with an embedded > NUL? They need to be audited and fixed if necessary. I'm not saying that this would be a trivial change. > Or is it not only about being able to *store* NULs in a text field? No, the entire core should be NUL-transparent. By the way, I refuse the notion that UTF-8 strings with embedded NULs are "broken". I can't recall any other system which enforces UTF-8 well-formedness, but does not permit embedded NULs. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
2011/11/24 Florian Weimer <fweimer@bfk.de>: > * Alexander Shulgin: > >>> It's actually UTF-8 text, and some PostgreSQL functions are only >>> available for TEXT, but not BYTEA, e.g.: >>> >>> bfk_int=> SELECT '\x006500'::bytea ~ 'A'; >>> ERROR: operator does not exist: bytea ~ unknown >> >> And how will those TEXT functions behave on a value with an embedded >> NUL? > > They need to be audited and fixed if necessary. I'm not saying that > this would be a trivial change. > >> Or is it not only about being able to *store* NULs in a text field? > > No, the entire core should be NUL-transparent. > > By the way, I refuse the notion that UTF-8 strings with embedded NULs > are "broken". I can't recall any other system which enforces UTF-8 > well-formedness, but does not permit embedded NULs. > I have a different question. What is reason for embedded NULs inside strings? Regards Pavel Stehule > -- > Florian Weimer <fweimer@bfk.de> > BFK edv-consulting GmbH http://www.bfk.de/ > Kriegsstraße 100 tel: +49-721-96201-1 > D-76133 Karlsruhe fax: +49-721-96201-99 >
* Pavel Stehule: >> By the way, I refuse the notion that UTF-8 strings with embedded NULs >> are "broken". I can't recall any other system which enforces UTF-8 >> well-formedness, but does not permit embedded NULs. > I have a different question. What is reason for embedded NULs inside > strings? The source system does not enforce that constraint, so from time to time, such data slips through. I don't know why it's there in the first place, and I have no control over the original data source. Usually, it's okay to silently strip NUL bytes; it's just a bit awkward. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Excerpts from Florian Weimer's message of Thu Nov 24 12:59:09 +0200 2011: > > > I have a different question. What is reason for embedded NULs inside > > strings? > > The source system does not enforce that constraint, so from time to > time, such data slips through. I don't know why it's there in the first > place, and I have no control over the original data source. Usually, > it's okay to silently strip NUL bytes; it's just a bit awkward. Strip? Like this: "loss\0less" => "lossless"? That would be awkward for sure.
2011/11/24 Alexander Shulgin <ash@commandprompt.com>: > > Excerpts from Florian Weimer's message of Thu Nov 24 12:59:09 +0200 2011: >> >> > I have a different question. What is reason for embedded NULs inside >> > strings? >> >> The source system does not enforce that constraint, so from time to >> time, such data slips through. I don't know why it's there in the first >> place, and I have no control over the original data source. Usually, >> it's okay to silently strip NUL bytes; it's just a bit awkward. > the source should be buggy application that has broken replace or concat of strings. Pavel > Strip? Like this: "loss\0less" => "lossless"? That would be awkward for sure. >
On Thu, Nov 24, 2011 at 4:54 AM, Florian Weimer <fweimer@bfk.de> wrote: > By the way, I refuse the notion that UTF-8 strings with embedded NULs > are "broken". I can't recall any other system which enforces UTF-8 > well-formedness, but does not permit embedded NULs. This seems like a key point. If UTF-8 allows embedded NULs, then a text field supposedly encoded as UTF-8 ought to allow them, too. However, our propensity to translate between text and cstring might make this difficult to implement in practice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Nov24, 2011, at 10:54 , Florian Weimer wrote: >> Or is it not only about being able to *store* NULs in a text field? > > No, the entire core should be NUL-transparent. That's unlikely to happen. A more realistic approach would be to solve this only for UTF-8 encoded strings by encoding the NUL character not as a single 0 byte, but as sequence of non-0 bytes. Such a thing is possible in UTF-8 because there are multiple ways to encode the same character once you drop the requirement that characters be encoded in the *shortest* possible way. Since we very probably won't loosen up UTF-8's integrity checks to allow that, it'd have to be done as a new encoding, say 'utf8-loose'. That new encoding could, for example, use 0xC0 0x80 to represent NUL characters. This byte sequence is invalid in standard-conforming UTF-8 because it's a non-normalized (i.e. overly long) representation a code point (the code point NUL, incidentally). A bit of googling suggests that quite a few piece of software use this kind of modified UTF-8 encoding. Java, for example, seems to use it to serialize Strings (which may contain NUL characters) to UTF-8. Should you try to add a new encoding which supports that, you might also want to allow CESU-8-style encoding of UTF-16 surrogate pairs. This means that code points representable by UTF-16 surrogate pairs may be encoded by separately encoding the two surrogate characters in UTF-8. best regards, Florian Pflug
<br /><br /><div class="gmail_quote">On Thu, Nov 24, 2011 at 4:54 AM, Florian Weimer <span dir="ltr"><<a href="mailto:fweimer@bfk.de">fweimer@bfk.de</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin: 0pt0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><br /> By the way, I refuse the notion thatUTF-8 strings with embedded NULs<br /> are "broken". I can't recall any other system which enforces UTF-8<br /> well-formedness,but does not permit embedded NULs.<br /><div class="im HOEnZb"><br /></div></blockquote></div><br /><br />Refuseaway, but I don't think you know what you're asking for. Given the number of places that treat text values as beingnul-terminated in the code, we'd be many years fixing bugs from a change like this, IMNSHO.<br /><br />cheers<br /><br/>andrew<br />
Andrew Dunstan <andrew@dunslane.net> writes: > On Thu, Nov 24, 2011 at 4:54 AM, Florian Weimer <fweimer@bfk.de> wrote: >> By the way, I refuse the notion that UTF-8 strings with embedded NULs >> are "broken". I can't recall any other system which enforces UTF-8 >> well-formedness, but does not permit embedded NULs. > Refuse away, but I don't think you know what you're asking for. Given the > number of places that treat text values as being nul-terminated in the > code, we'd be many years fixing bugs from a change like this, IMNSHO. Yeah. This has been discussed before (see the archives) and every time we've concluded that the amount of pain and developer effort would outweigh the benefit by many orders of magnitude. Just to start with, we'd have to change the convention that datatype I/O functions take and return cstring, thus breaking every core and add-on datatype. If you really need to store nulls, use bytea. If there are specific operations you need that aren't available on bytea, we'd probably entertain proposals to add them. regards, tom lane
* Florian Pflug: > On Nov24, 2011, at 10:54 , Florian Weimer wrote: >>> Or is it not only about being able to *store* NULs in a text field? >> >> No, the entire core should be NUL-transparent. > > That's unlikely to happen. Yes, with the type input/output functions tied to NUL-terminated strings, that seems indeed unlikely to happen. > A more realistic approach would be to solve this only for UTF-8 > encoded strings by encoding the NUL character not as a single 0 byte, > but as sequence of non-0 bytes. 0xFF cannot occur in valid UTF-8, so that's one possibility. > Java, for example, seems to use it to serialize Strings (which may contain > NUL characters) to UTF-8. Only internally in the VM. UTF-8 produced by the I/O encoder/decoders produces and consumes NUL bytes. > Should you try to add a new encoding which supports that, you might also > want to allow CESU-8-style encoding of UTF-16 surrogate pairs. This means > that code points representable by UTF-16 surrogate pairs may be encoded by > separately encoding the two surrogate characters in UTF-8. I'm not sure if this is a good idea. The motivation behind CESU-8 is that it sorts byte-encoded strings in the same order as UTF-16, which is a completely separate concern. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99