Thread: Making TEXT NUL-transparent

Making TEXT NUL-transparent

From
Florian Weimer
Date:
Occasionally, we get bitten by embedded NUL bytes in TEXT values.  We
take care of generating proper UTF-8, but this additional restriction
sometimes slips by.  It would be really helpful if PostgreSQL could
store such TEXT fields as-is (at least if they are computed internally,
or come from query parameters).

I haven't even looked how difficult it would be to implement this.  Do
you think it's worth the trouble?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Making TEXT NUL-transparent

From
Pavel Stehule
Date:
Hello

2011/11/24 Florian Weimer <fweimer@bfk.de>:
> Occasionally, we get bitten by embedded NUL bytes in TEXT values.  We
> take care of generating proper UTF-8, but this additional restriction
> sometimes slips by.  It would be really helpful if PostgreSQL could
> store such TEXT fields as-is (at least if they are computed internally,
> or come from query parameters).
>

and why you don't use bytea ? Text should be correct literal.

Regards

Pavel Stehule

> I haven't even looked how difficult it would be to implement this.  Do
> you think it's worth the trouble?
>
> --
> Florian Weimer                <fweimer@bfk.de>
> BFK edv-consulting GmbH       http://www.bfk.de/
> Kriegsstraße 100              tel: +49-721-96201-1
> D-76133 Karlsruhe             fax: +49-721-96201-99
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: Making TEXT NUL-transparent

From
Florian Weimer
Date:
* Pavel Stehule:

> Hello
>
> 2011/11/24 Florian Weimer <fweimer@bfk.de>:
>> Occasionally, we get bitten by embedded NUL bytes in TEXT values.  We
>> take care of generating proper UTF-8, but this additional restriction
>> sometimes slips by.  It would be really helpful if PostgreSQL could
>> store such TEXT fields as-is (at least if they are computed internally,
>> or come from query parameters).
>>
>
> and why you don't use bytea ? Text should be correct literal.

It's actually UTF-8 text, and some PostgreSQL functions are only
available for TEXT, but not BYTEA, e.g.:

bfk_int=> SELECT '\x006500'::bytea ~ 'A';
ERROR:  operator does not exist: bytea ~ unknown
LINE 1: SELECT '\x006500'::bytea ~ 'A';                                ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Making TEXT NUL-transparent

From
Alexander Shulgin
Date:
Excerpts from Florian Weimer's message of Thu Nov 24 11:27:51 +0200 2011:
>
> > and why you don't use bytea ? Text should be correct literal.
>
> It's actually UTF-8 text, and some PostgreSQL functions are only
> available for TEXT, but not BYTEA, e.g.:
>
> bfk_int=> SELECT '\x006500'::bytea ~ 'A';
> ERROR:  operator does not exist: bytea ~ unknown

And how will those TEXT functions behave on a value with an embedded NUL?  Or is it not only about being able to
*store*NULs in a text field? 

--
Alex


Re: Making TEXT NUL-transparent

From
Pavel Stehule
Date:
2011/11/24 Florian Weimer <fweimer@bfk.de>:
> * Pavel Stehule:
>
>> Hello
>>
>> 2011/11/24 Florian Weimer <fweimer@bfk.de>:
>>> Occasionally, we get bitten by embedded NUL bytes in TEXT values.  We
>>> take care of generating proper UTF-8, but this additional restriction
>>> sometimes slips by.  It would be really helpful if PostgreSQL could
>>> store such TEXT fields as-is (at least if they are computed internally,
>>> or come from query parameters).
>>>
>>
>> and why you don't use bytea ? Text should be correct literal.
>
> It's actually UTF-8 text, and some PostgreSQL functions are only
> available for TEXT, but not BYTEA, e.g.:
>
> bfk_int=> SELECT '\x006500'::bytea ~ 'A';
> ERROR:  operator does not exist: bytea ~ unknown
> LINE 1: SELECT '\x006500'::bytea ~ 'A';
>                                 ^
> HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
>

then you can prepare some custom function that can filter zeroes and
prepare valid text functions - some postgresql's functions (based on
external libs) cannot to work with string containing zeroes too.
Probably you can do it in plperl.

I don't think so this functionality is generally requested. If you
have broken strings, then you have to clean it before storing to
database.

Regards

Pavel

> --
> Florian Weimer                <fweimer@bfk.de>
> BFK edv-consulting GmbH       http://www.bfk.de/
> Kriegsstraße 100              tel: +49-721-96201-1
> D-76133 Karlsruhe             fax: +49-721-96201-99
>


Re: Making TEXT NUL-transparent

From
Florian Weimer
Date:
* Alexander Shulgin:

>> It's actually UTF-8 text, and some PostgreSQL functions are only
>> available for TEXT, but not BYTEA, e.g.:
>>
>> bfk_int=> SELECT '\x006500'::bytea ~ 'A';
>> ERROR:  operator does not exist: bytea ~ unknown
>
> And how will those TEXT functions behave on a value with an embedded
> NUL?

They need to be audited and fixed if necessary.  I'm not saying that
this would be a trivial change.

> Or is it not only about being able to *store* NULs in a text field?

No, the entire core should be NUL-transparent.

By the way, I refuse the notion that UTF-8 strings with embedded NULs
are "broken".  I can't recall any other system which enforces UTF-8
well-formedness, but does not permit embedded NULs.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Making TEXT NUL-transparent

From
Pavel Stehule
Date:
2011/11/24 Florian Weimer <fweimer@bfk.de>:
> * Alexander Shulgin:
>
>>> It's actually UTF-8 text, and some PostgreSQL functions are only
>>> available for TEXT, but not BYTEA, e.g.:
>>>
>>> bfk_int=> SELECT '\x006500'::bytea ~ 'A';
>>> ERROR:  operator does not exist: bytea ~ unknown
>>
>> And how will those TEXT functions behave on a value with an embedded
>> NUL?
>
> They need to be audited and fixed if necessary.  I'm not saying that
> this would be a trivial change.
>
>> Or is it not only about being able to *store* NULs in a text field?
>
> No, the entire core should be NUL-transparent.
>
> By the way, I refuse the notion that UTF-8 strings with embedded NULs
> are "broken".  I can't recall any other system which enforces UTF-8
> well-formedness, but does not permit embedded NULs.
>

I have a different question. What is reason for embedded NULs inside strings?

Regards

Pavel Stehule

> --
> Florian Weimer                <fweimer@bfk.de>
> BFK edv-consulting GmbH       http://www.bfk.de/
> Kriegsstraße 100              tel: +49-721-96201-1
> D-76133 Karlsruhe             fax: +49-721-96201-99
>


Re: Making TEXT NUL-transparent

From
Florian Weimer
Date:
* Pavel Stehule:

>> By the way, I refuse the notion that UTF-8 strings with embedded NULs
>> are "broken".  I can't recall any other system which enforces UTF-8
>> well-formedness, but does not permit embedded NULs.

> I have a different question. What is reason for embedded NULs inside
> strings?

The source system does not enforce that constraint, so from time to
time, such data slips through.  I don't know why it's there in the first
place, and I have no control over the original data source.  Usually,
it's okay to silently strip NUL bytes; it's just a bit awkward.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Making TEXT NUL-transparent

From
Alexander Shulgin
Date:
Excerpts from Florian Weimer's message of Thu Nov 24 12:59:09 +0200 2011:
>
> > I have a different question. What is reason for embedded NULs inside
> > strings?
>
> The source system does not enforce that constraint, so from time to
> time, such data slips through.  I don't know why it's there in the first
> place, and I have no control over the original data source.  Usually,
> it's okay to silently strip NUL bytes; it's just a bit awkward.

Strip?  Like this: "loss\0less" => "lossless"?  That would be awkward for sure.


Re: Making TEXT NUL-transparent

From
Pavel Stehule
Date:
2011/11/24 Alexander Shulgin <ash@commandprompt.com>:
>
> Excerpts from Florian Weimer's message of Thu Nov 24 12:59:09 +0200 2011:
>>
>> > I have a different question. What is reason for embedded NULs inside
>> > strings?
>>
>> The source system does not enforce that constraint, so from time to
>> time, such data slips through.  I don't know why it's there in the first
>> place, and I have no control over the original data source.  Usually,
>> it's okay to silently strip NUL bytes; it's just a bit awkward.
>

the source should be buggy application that has broken replace or
concat of strings.

Pavel

> Strip?  Like this: "loss\0less" => "lossless"?  That would be awkward for sure.
>


Re: Making TEXT NUL-transparent

From
Robert Haas
Date:
On Thu, Nov 24, 2011 at 4:54 AM, Florian Weimer <fweimer@bfk.de> wrote:
> By the way, I refuse the notion that UTF-8 strings with embedded NULs
> are "broken".  I can't recall any other system which enforces UTF-8
> well-formedness, but does not permit embedded NULs.

This seems like a key point.  If UTF-8 allows embedded NULs, then a
text field supposedly encoded as UTF-8 ought to allow them, too.

However, our propensity to translate between text and cstring might
make this difficult to implement in practice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Making TEXT NUL-transparent

From
Florian Pflug
Date:
On Nov24, 2011, at 10:54 , Florian Weimer wrote:
>> Or is it not only about being able to *store* NULs in a text field?
> 
> No, the entire core should be NUL-transparent.

That's unlikely to happen. A more realistic approach would be to solve
this only for UTF-8 encoded strings by encoding the NUL character not as
a single 0 byte, but as sequence of non-0 bytes.

Such a thing is possible in UTF-8 because there are multiple ways to
encode the same character once you drop the requirement that characters
be encoded in the *shortest* possible way.

Since we very probably won't loosen up UTF-8's integrity checks to allow
that, it'd have to be done as a new encoding, say 'utf8-loose'.

That new encoding could, for example, use 0xC0 0x80 to represent NUL
characters. This byte sequence is invalid in standard-conforming UTF-8
because it's a non-normalized (i.e. overly long) representation a code
point (the code point NUL, incidentally). A bit of googling suggests that
quite a few piece of software use this kind of modified UTF-8 encoding.

Java, for example, seems to use it to serialize Strings (which may contain
NUL characters) to UTF-8.

Should you try to add a new encoding which supports that, you might also
want to allow CESU-8-style encoding of UTF-16 surrogate pairs. This means
that code points representable by UTF-16 surrogate pairs may be encoded by
separately encoding the two surrogate characters in UTF-8.

best regards,
Florian Pflug



Re: Making TEXT NUL-transparent

From
Andrew Dunstan
Date:
<br /><br /><div class="gmail_quote">On Thu, Nov 24, 2011 at 4:54 AM, Florian Weimer <span dir="ltr"><<a
href="mailto:fweimer@bfk.de">fweimer@bfk.de</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:
0pt0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><br /> By the way, I refuse the notion
thatUTF-8 strings with embedded NULs<br /> are "broken".  I can't recall any other system which enforces UTF-8<br />
well-formedness,but does not permit embedded NULs.<br /><div class="im HOEnZb"><br /></div></blockquote></div><br /><br
/>Refuseaway, but I don't think you know what you're asking for. Given the number of places that treat text values as
beingnul-terminated in the code, we'd be many years fixing bugs from a change like this, IMNSHO.<br /><br />cheers<br
/><br/>andrew<br /> 

Re: Making TEXT NUL-transparent

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On Thu, Nov 24, 2011 at 4:54 AM, Florian Weimer <fweimer@bfk.de> wrote:
>> By the way, I refuse the notion that UTF-8 strings with embedded NULs
>> are "broken".  I can't recall any other system which enforces UTF-8
>> well-formedness, but does not permit embedded NULs.

> Refuse away, but I don't think you know what you're asking for. Given the
> number of places that treat text values as being nul-terminated in the
> code, we'd be many years fixing bugs from a change like this, IMNSHO.

Yeah.  This has been discussed before (see the archives) and every time
we've concluded that the amount of pain and developer effort would
outweigh the benefit by many orders of magnitude.  Just to start with,
we'd have to change the convention that datatype I/O functions take and
return cstring, thus breaking every core and add-on datatype.

If you really need to store nulls, use bytea.  If there are specific
operations you need that aren't available on bytea, we'd probably
entertain proposals to add them.
        regards, tom lane


Re: Making TEXT NUL-transparent

From
Florian Weimer
Date:
* Florian Pflug:

> On Nov24, 2011, at 10:54 , Florian Weimer wrote:
>>> Or is it not only about being able to *store* NULs in a text field?
>>
>> No, the entire core should be NUL-transparent.
>
> That's unlikely to happen.

Yes, with the type input/output functions tied to NUL-terminated
strings, that seems indeed unlikely to happen.

> A more realistic approach would be to solve this only for UTF-8
> encoded strings by encoding the NUL character not as a single 0 byte,
> but as sequence of non-0 bytes.

0xFF cannot occur in valid UTF-8, so that's one possibility.

> Java, for example, seems to use it to serialize Strings (which may contain
> NUL characters) to UTF-8.

Only internally in the VM.  UTF-8 produced by the I/O encoder/decoders
produces and consumes NUL bytes.

> Should you try to add a new encoding which supports that, you might also
> want to allow CESU-8-style encoding of UTF-16 surrogate pairs. This means
> that code points representable by UTF-16 surrogate pairs may be encoded by
> separately encoding the two surrogate characters in UTF-8.

I'm not sure if this is a good idea.  The motivation behind CESU-8 is
that it sorts byte-encoded strings in the same order as UTF-16, which is
a completely separate concern.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99