Re: Re : Re: UTF-32 support in PostgreSQL ? - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: Re : Re: UTF-32 support in PostgreSQL ?
Msg-id CAMsr+YEZ57vcqEMFLBxjDxmz5O+h69-QfPYVxjAz2RPT_mycbg@mail.gmail.com
In response to Re : Re: UTF-32 support in PostgreSQL ?  (fortin.christian@videotron.ca)
Responses Re: Re : Re: UTF-32 support in PostgreSQL ?  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On 27 October 2015 at 05:39,  <fortin.christian@videotron.ca> wrote:

> I mean for ALL, data stored, source code, and translation files.
> For source code, I think then GCC must support UTF-32 before.

Why?

UTF-32 is an incredibly inefficient way to store text that's
predominantly or entirely within the 7-bit ASCII space. UTF-8 is a
much better way to handle it.
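
A small Python sketch puts rough numbers on that. The sample strings and the helper name `encoded_sizes` are made up for illustration; nothing here comes from the PostgreSQL tree.

```python
# Compare the encoded size of the same text in UTF-8 and UTF-32.
# The sample strings are arbitrary illustrations.
samples = {
    "ascii": "SELECT * FROM pg_class;",
    "mixed": "na\u00efve caf\u00e9",  # "naive cafe" with two accented letters
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf32_bytes) for text.

    "utf-32-le" is used so that no byte-order mark inflates the count.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


for name, text in samples.items():
    utf8_len, utf32_len = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, "
          f"UTF-8 {utf8_len} bytes, UTF-32 {utf32_len} bytes")
# ascii: 23 chars, UTF-8 23 bytes, UTF-32 92 bytes
# mixed: 10 chars, UTF-8 12 bytes, UTF-32 40 bytes
```

For pure ASCII the UTF-32 form is exactly four times larger, and text with a few accented letters is still far smaller in UTF-8.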

Anyway, while gcc supports sources encoded in UTF-8 just fine, it's
more typical to represent non-ASCII chars using byte escapes so that
people with misconfigured text editors don't mangle them. It does not
support UTF-8 identifiers (variable names, function names, etc)
containing characters outside the 7-bit ASCII space, but you can work
around it with UCNs (universal character names) if you need to; see
the FAQ:

https://gcc.gnu.org/wiki/FAQ#What_is_the_status_of_adding_the_UTF-8_support_for_identifier_names_in_GCC.3F

I don't think the PostgreSQL project is likely to accept patches using
characters outside the 7-bit ASCII space in the near future, as
compiler and text editor support is unfortunately still too primitive.
We support a variety of legacy platforms and toolchains, many of which
won't cope at all. There isn't a pressing reason to change that, since
at the user level the support for a wide variety of charsets (including
every character UTF-32 can represent) is already present.

I am aware this is a form of English-language privilege. Of course
it's easy for me as an English first-language speaker to say "oh, we
don't need support for your language in the code". It's also practical
though - code in a variety of languages, so that no one person can
read or understand all of it, is not maintainable in the long term.
Especially when people join and leave the project. It's the same
reason the project is picky about introducing new programming
languages, even though it might be nice to be able to write parts of
the system in Python, parts in Haskell, etc.

So I don't think we need UTF-32 source code support, or even full
UTF-8 source code support, because even if we had it we probably
wouldn't use it.


> I sent an e-mail to Oracle to see what they think about this huge idea.

I don't understand how this is a huge idea. The representation of the
characters doesn't matter, so long as the DB can represent the full
character suite. Right?

> Well, I know it's not efficient space wise, but this is the only way that we
> can deploy worldwide.

UTF-8 is widely used worldwide and covers the full Unicode code space
(U+0000 through U+10FFFF).

I wonder if you are misunderstanding UTF-8 vs UCS-2 vs UTF-16 vs UTF-32.

UTF-8 is an encoding that can represent the full Unicode code space
using multi-byte sequences. It is endianness-independent. One character
is a variable number of bytes (one to four), so lookups to find the n'th
character, substring operations, etc are a bit ugly. UTF-8 is the
encoding used by most UNIX APIs.
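
To make the variable width and the cost of an n'th character lookup concrete, here is a minimal Python sketch. The function names `utf8_width` and `nth_char_utf8` are hypothetical, and the scan assumes the input is valid UTF-8.

```python
def utf8_width(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def nth_char_utf8(data, n):
    """Return the n'th code point (0-based) of UTF-8 bytes.

    There is no way to jump straight to the right offset: every byte
    before the target has to be inspected, so this is O(len(data)).
    """
    if n < 0:
        raise IndexError(n)
    index = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else
        # begins a new character.
        if byte & 0xC0 != 0x80:
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is not None:
        return data[start:].decode("utf-8")
    raise IndexError(n)


text = "a\u00e9\u20ac\U0001F600"  # a, e-acute, euro sign, an emoji
print([utf8_width(ch) for ch in text])  # [1, 2, 3, 4]
print(nth_char_utf8(text.encode("utf-8"), 2) == "\u20ac")  # True
```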

UCS-2 is a legacy encoding that can represent only the Basic
Multilingual Plane, i.e. code points up to U+FFFF. It cannot represent
the full Unicode code space. It has two different forms, little-endian
and big-endian, so you have to include a marker (a byte-order mark) to
say which is which, or be careful about handling it in your code. It's
easy to do n'th character lookups, substrings, etc.
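
The byte-order issue can be sketched in Python. Python ships no UCS-2 codec, so this uses the UTF-16 codecs on the assumption that the text stays within U+0000..U+FFFF, where the two produce identical bytes. `detect_byte_order` is a hypothetical helper.

```python
import codecs

text = "A\u00e9"  # two code points, both below U+FFFF

big = text.encode("utf-16-be")     # b'\x00A\x00\xe9'
little = text.encode("utf-16-le")  # b'A\x00\xe9\x00'


def detect_byte_order(data):
    """Report the byte order announced by a leading byte-order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return "unknown"  # no marker: the reader has to know out of band


print(detect_byte_order(codecs.BOM_UTF16_BE + big))  # big
print(detect_byte_order(little))                     # unknown
# Guessing the wrong order yields different characters, not an error:
print(big.decode("utf-16-le") == text)               # False
```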

UTF-16 is like UCS-2, but adds surrogate pairs (two 16-bit units used
together, somewhat like UTF-8's multi-byte sequences) to handle code
points above U+FFFF. It combines the worst features of UTF-8 and UCS-2.
UTF-16 is the encoding used by Windows APIs and the ICU library.
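
The two-unit sequences UTF-16 uses for code points above U+FFFF are called surrogate pairs. A short Python sketch of the arithmetic follows; `to_surrogates` and `utf16_code_units` are hypothetical names, and the second function only exists to cross-check the first against Python's own codec.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units Python's UTF-16 codec produces."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


print([hex(u) for u in to_surrogates(0x1F600)])   # ['0xd83d', '0xde00']
print(utf16_code_units("\u20ac") == [0x20AC])     # True: one unit suffices
print(utf16_code_units("\U0001F600") == list(to_surrogates(0x1F600)))  # True
```

So a UTF-16 string is variable-width after all: counting 16-bit units is not the same as counting characters.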

UTF-32 (UCS-4) is much like UCS-2, but uses 4 bytes per character to
represent the full Unicode character set. The downside is that it uses
a full 4 bytes for every character, even when only one byte would be
needed if you were using UTF-8. It's easy to do substrings and n'th
character lookups. UTF-32 is horrible on CPU cache and memory. Few APIs
use native UTF-32.
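
The one thing the fixed width buys is constant-time indexing, which can be sketched in Python like this. `nth_char_utf32` is a hypothetical helper, and it assumes little-endian data without a byte-order mark.

```python
def nth_char_utf32(data, n):
    """Return the n'th code point (0-based) of UTF-32-LE bytes.

    Code point n always lives at bytes [4n, 4n + 4), so no scan is
    needed regardless of which characters precede it.
    """
    if n < 0:
        raise IndexError(n)
    chunk = data[4 * n:4 * n + 4]
    if len(chunk) != 4:
        raise IndexError(n)
    return chunk.decode("utf-32-le")


data = "a\u20ac\U0001F600".encode("utf-32-le")
print(len(data))                                  # 12
print(nth_char_utf32(data, 2) == "\U0001F600")    # True
```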

So we already support one of the best text encodings available.

We could add support for using UTF-16 and UTF-32 as the
client_encoding on the wire. But really, the client application can
convert between the protocol's UTF-8 and whatever it wants to use
internally; there's no benefit to using UTF-16 or UTF-32 on the wire,
and it'd be a lot slower. Especially without protocol compression.
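
A client-side conversion along those lines might look like the following Python sketch. `from_wire` and `to_wire` are hypothetical helpers rather than part of any PostgreSQL driver, and UTF-16 little-endian stands in for "whatever the application uses internally".

```python
def from_wire(payload, internal="utf-16-le"):
    """Convert UTF-8 bytes received from the server to the app's encoding."""
    return payload.decode("utf-8").encode(internal)


def to_wire(data, internal="utf-16-le"):
    """Convert the app's internal bytes back to UTF-8 before sending."""
    return data.decode(internal).encode("utf-8")


wire = "caf\u00e9 \U0001F600".encode("utf-8")  # as it would arrive in UTF-8
internal = from_wire(wire)

print(to_wire(internal) == wire)   # True: the round trip is lossless
print(len(wire), len(internal))    # 10 14
```

The round trip loses nothing, and the UTF-8 form is the smaller one to ship over the network for text like this.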

So can you explain why you believe UTF-32 support is necessary?

 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

