Re: Implementing full UTF-8 support (aka supporting 0x00) - Mailing list pgsql-hackers

From Álvaro Hernández Tortosa
Subject Re: Implementing full UTF-8 support (aka supporting 0x00)
Date
Msg-id b2f6204e-a4a1-06e9-f333-1b18477d3504@8kdata.com
In response to Re: Implementing full UTF-8 support (aka supporting 0x00)  (Álvaro Hernández Tortosa <aht@8kdata.com>)
Responses Re: Implementing full UTF-8 support (aka supporting 0x00)  (Geoff Winkless <pgsqladmin@geoff.dj>)
Re: Implementing full UTF-8 support (aka supporting 0x00)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers

On 03/08/16 20:14, Álvaro Hernández Tortosa wrote:
>
>
> On 03/08/16 17:47, Kevin Grittner wrote:
>> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa 
>> <aht@8kdata.com> wrote:
>>
>>>      What would it take to support it?
>> Would it be of any value to support "Modified UTF-8"?
>>
>> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
>>
>
>     That's nice, but I don't think so.
>
>     The problem is that you cannot predict how people would send you 
> data, like when importing from other databases. I guess it may work if 
> Postgres would implement such UTF-8 variant and also the drivers, but 
> that would still require an encoding conversion (i.e., parsing every 
> string) to change the 0x00, which seems like a serious performance hit.
>
>     It could be better than nothing, though!
>
>     Thanks,
>
>     Álvaro
>
    It may indeed work.
    According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout 
the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 byte 
sequence. In short, the NUL character (\u0000) is represented, via an 
overlong encoding, by the two-byte, one-character sequence \uc080. These 
two bytes are invalid UTF-8, so they should not appear in an otherwise 
valid UTF-8 string. Yet they are accepted by Postgres (as if Postgres 
supported Modified UTF-8 intentionally). The character does not render 
as a NUL in psql but as this symbol: "삀".
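
    To make the overlong-encoding point concrete, here is a small Python 
sketch. It uses Python's own strict UTF-8 decoder as a stand-in, so it 
says nothing about what Postgres itself accepts; the variable and 
function names are made up for illustration. It also shows that, at 
least in Python, the escape "\uc080" names the code point U+C080, which 
is not the same thing as the byte pair 0xC0 0x80.

```python
# Bytes 0xC0 0x80 follow the two-byte UTF-8 pattern 110xxxxx 10xxxxxx.
overlong = bytes([0xC0, 0x80])

# Extract the payload bits by hand: 5 bits from the lead byte and 6 bits
# from the continuation byte. The result is code point 0, i.e. NUL.
code_point = ((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F)


def strict_decode_fails(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder rejects the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# A strict decoder rejects the pair, because NUL has a shorter valid
# encoding (the single byte 0x00).
rejected = strict_decode_fails(overlong)

# By contrast, the escape "\uc080" in Python is code point U+C080, whose
# UTF-8 form is three bytes, not the two-byte pair above.
code_point_bytes = "\uc080".encode("utf-8")

print(code_point, rejected, code_point_bytes.hex())
```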
    Given that this works, the process would look like this:

- Scan all input data for bytes with hex value 0x00. Any such byte in 
the string is the NUL character.
- Replace that byte with the two bytes 0xC0 0x80.
- Reverse the operation when reading.
    This works, but it is of course a performance hit (searching for 
0x00 and then growing the byte[], or whatever data structure is used, to 
account for the extra byte). A little bit of a PITA, but I guess better 
than fixing it all :)
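
    The steps above could be sketched on the client side roughly like 
this. It is only a minimal illustration in Python, not driver code: the 
names escape_nul and unescape_nul are hypothetical, and it assumes the 
input is otherwise valid UTF-8, so the byte 0xC0 can only come from the 
escaping step.

```python
NUL = b"\x00"
MODIFIED_NUL = b"\xc0\x80"  # overlong two-byte form used by Modified UTF-8


def escape_nul(data: bytes) -> bytes:
    """Replace each 0x00 byte with 0xC0 0x80 before sending (hypothetical helper)."""
    if NUL not in data:
        # Fast path: most strings contain no NUL, so no copy is needed.
        return data
    return data.replace(NUL, MODIFIED_NUL)


def unescape_nul(data: bytes) -> bytes:
    """Reverse the operation when reading (hypothetical helper).

    Assumes the data was otherwise valid UTF-8: 0xC0 never occurs in
    valid UTF-8, so every 0xC0 0x80 pair must be an escaped NUL.
    """
    if MODIFIED_NUL not in data:
        return data
    return data.replace(MODIFIED_NUL, NUL)


raw = "a\x00b".encode("utf-8")
escaped = escape_nul(raw)
restored = unescape_nul(escaped)
print(escaped, restored == raw)
```

    The early "not in" check is there because the scan is unavoidable but 
the copy is not: strings without a NUL are returned unchanged, and only 
strings that contain one grow by one byte per NUL.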

    Álvaro


-- 

Álvaro Hernández Tortosa


-----------
8Kdata



