Re: Implementing full UTF-8 support (aka supporting 0x00) - Mailing list pgsql-hackers

From	Craig Ringer
Subject	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date	August 3, 2016 17:16:46
Msg-id	CAMsr+YHa3AjtBmXc2hOZaKVLbuyjWPipNknSPa_JRmdZh4X4Ww@mail.gmail.com
In response to	Implementing full UTF-8 support (aka supporting 0x00) (Álvaro Hernández Tortosa <aht@8kdata.com>)
Responses	Re: Implementing full UTF-8 support (aka supporting 0x00)
List	pgsql-hackers

On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:

Hi list.

As has been previously discussed (see https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl for instance), varlena fields cannot accept the literal 0x00 value. Sure, you can use bytea, but this is hardly a good solution. The problem seems to be hitting some use cases, like:

- People migrating data from other databases (apart from PostgreSQL, I don't know of any other database which suffers the same problem).
- People using drivers which use UTF-8 or equivalent encodings by default (Java for example)

Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're strictly non-compliant. And given the general Postgres policy regarding standards compliance and the people being hit by this, I think it should be addressed. Especially since all the usual fixes are a real PITA (re-parsing, re-generating strings, which is very expensive, or dropping data).

What would it take to support it? Isn't the varlena header propagated everywhere, which could help infer the real length of the string? Any pointers or suggestions would be welcome.

One of the bigger pain points is that our interaction with C library collation routines for sorting (strcoll, strxfrm, etc.) uses NUL-terminated C strings.
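
To illustrate the point, here is a minimal sketch in plain C, not PostgreSQL code. The `counted_str` type and both function names are hypothetical. They stand in for a value that carries its own length, loosely in the spirit of a varlena. Because strcoll() only receives char pointers, it stops at the first 0x00 byte and ignores everything after it, while a length-aware comparison sees the full value.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying string (an assumption for illustration,
 * not the real varlena layout). The length is stored separately, so the
 * data may legally contain 0x00 bytes. The buffer is assumed to also be
 * NUL-terminated so that passing it to strcoll() is safe. */
typedef struct {
    size_t len;
    const char *data;
} counted_str;

/* Locale-aware comparison through the C library. strcoll() has no length
 * argument: it reads each input up to the first 0x00 byte only. */
int compare_with_strcoll(counted_str a, counted_str b)
{
    return strcoll(a.data, b.data);
}

/* Bytewise comparison that honours the stored lengths, so bytes after an
 * embedded 0x00 still take part. This is not locale-aware. */
int compare_with_length(counted_str a, counted_str b)
{
    size_t n = a.len < b.len ? a.len : b.len;
    int c = memcmp(a.data, b.data, n);
    if (c != 0)
        return c;
    /* Equal common prefix: the shorter value sorts first. */
    return (a.len > b.len) - (a.len < b.len);
}
```

For two five-byte values such as "ab\0cd" and "ab\0xy", compare_with_strcoll() reports them as equal because both look like "ab" to the C library, whereas compare_with_length() tells them apart. The length-aware version gives up locale-specific ordering, which is the trade-off the collation routines impose.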

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
