Re: Making TEXT NUL-transparent - Mailing list pgsql-hackers

From Florian Pflug
Subject Re: Making TEXT NUL-transparent
Date
Msg-id 21D9E9C6-552A-4CE1-BF9A-178D4C2DC272@phlo.org
In response to Re: Making TEXT NUL-transparent  (Florian Weimer <fweimer@bfk.de>)
List pgsql-hackers
On Nov24, 2011, at 10:54 , Florian Weimer wrote:
>> Or is it not only about being able to *store* NULs in a text field?
> 
> No, the entire core should be NUL-transparent.

That's unlikely to happen. A more realistic approach would be to solve
this only for UTF-8 encoded strings by encoding the NUL character not as
a single 0 byte, but as a sequence of non-0 bytes.

Such a thing is possible in UTF-8 because there are multiple ways to
encode the same character once you drop the requirement that characters
be encoded in the *shortest* possible way.
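
To make the bit arithmetic concrete, here is a small Python sketch. It
builds the two-byte form (110xxxxx 10xxxxxx) for a code point that would
normally fit in one byte, and checks that a strict decoder refuses it.
The function names are made up for illustration only.

```python
def overlong_two_byte(cp: int) -> bytes:
    """Encode a code point below 0x80 in the two-byte UTF-8 form.

    Standard UTF-8 requires the one-byte form for these code points,
    so the result is deliberately NOT valid UTF-8.
    """
    if not 0 <= cp < 0x80:
        raise ValueError("only code points below 0x80 are handled here")
    # Lead byte carries the upper bits, continuation byte the low six bits.
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])


def strict_utf8_accepts(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for cp in (0x00, 0x41):
        encoded = overlong_two_byte(cp)
        print(hex(cp), encoded.hex(), strict_utf8_accepts(encoded))
```

Both the shortest form and the overlong form describe the same code
point; only the shortest one passes a conforming validity check.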

Since we very probably won't loosen up UTF-8's integrity checks to allow
that, it'd have to be done as a new encoding, say 'utf8-loose'.

That new encoding could, for example, use 0xC0 0x80 to represent NUL
characters. This byte sequence is invalid in standard-conforming UTF-8
because it's an overlong (i.e. non-shortest-form) representation of a code
point (the code point NUL, incidentally). A bit of googling suggests that
quite a few pieces of software use this kind of modified UTF-8 encoding.
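
As a rough illustration of what such a 'utf8-loose' codec would do,
here is a Python sketch. It is not PostgreSQL code, and the function
names are hypothetical. It relies on two facts: a 0 byte in valid UTF-8
can only be a NUL character, and the byte 0xC0 never occurs in valid
UTF-8, so a plain byte substitution is unambiguous in both directions.

```python
NUL_LOOSE = b"\xc0\x80"  # overlong two-byte form of U+0000


def encode_loose(text: str) -> bytes:
    """Encode to UTF-8, but write NUL as 0xC0 0x80 instead of 0x00."""
    return text.encode("utf-8").replace(b"\x00", NUL_LOOSE)


def decode_loose(data: bytes) -> str:
    """Decode bytes produced by encode_loose().

    Raw 0 bytes are rejected (an assumption of this sketch), and every
    other overlong sequence still fails the strict UTF-8 check.
    """
    if b"\x00" in data:
        raise ValueError("raw NUL byte is not allowed in this encoding")
    return data.replace(NUL_LOOSE, b"\x00").decode("utf-8")


if __name__ == "__main__":
    original = "a\x00b"
    stored = encode_loose(original)
    print(stored)                        # contains no 0 byte
    print(decode_loose(stored) == original)
```

The encoded form can pass through C code that treats 0 as a string
terminator, which is the whole point of the exercise.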

Java, for example, seems to use it to serialize Strings (which may contain
NUL characters) to UTF-8.

Should you try to add a new encoding which supports that, you might also
want to allow CESU-8-style encoding of UTF-16 surrogate pairs. This means
that code points outside the BMP, which UTF-16 represents as surrogate
pairs, may be encoded by separately encoding each of the two surrogates
as a three-byte UTF-8 sequence.
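
For illustration, a Python sketch of the CESU-8-style scheme follows.
The function names are hypothetical, and the decoder leans on Python's
"surrogatepass" error handler, so it is more permissive than a real
codec would want to be.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text so that code points above U+FFFF become two
    three-byte sequences, one per UTF-16 surrogate."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)    # high (lead) surrogate
            low = 0xDC00 + (v & 0x3FF)   # low (trail) surrogate
            # "surrogatepass" lets the UTF-8 encoder emit lone surrogates.
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8-style bytes back to a normal Python string."""
    # Step 1: accept the three-byte surrogate sequences as-is.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: a round trip through UTF-16 joins each surrogate pair.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")


if __name__ == "__main__":
    s = "x\U00010000y"
    print(s.encode("utf-8").hex())   # standard four-byte form
    print(cesu8_encode(s).hex())     # two three-byte surrogate forms
    print(cesu8_decode(cesu8_encode(s)) == s)
```

A strict UTF-8 decoder rejects these surrogate sequences, which is why
they would need the new, looser encoding as well.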

best regards,
Florian Pflug