Thread: UTF-32 support in PostgreSQL ?

UTF-32 support in PostgreSQL ?

From

fortin.christian@videotron.ca

Date:

26 October 2015, 18:02:24

Does PostgreSQL support Unicode UTF-32 characters?

If not, I think it's a must in order to be international.

To help you in this task, you could use this UTF-32 editor:

https://wxmedit.github.io/downloads.html

thanks.

Re: UTF-32 support in PostgreSQL ?

From

Andres Freund

Date:

26 October 2015, 18:09:23

On 2015-10-23 23:29:53 -0400, fortin.christian@videotron.ca wrote:
> Does PostgreSQL support Unicode UTF-32 characters?

No.

> If not, I think it's a must in order to be international.

Why? I think Unicode support is a must, but I don't see why utf-32
support is. Postgres supports UTF-8, and I haven't heard any convincing
arguments why that's not sufficient.

Andres

Re: UTF-32 support in PostgreSQL ?

From

Tom Lane

Date:

26 October 2015, 18:12:01

fortin.christian@videotron.ca writes:
> Does PostgreSQL support Unicode UTF-32 characters?

There's no particular intention of supporting the UTF32 representation
inside the database.  We do support UTF8 representation of the entire
Unicode character set.  You can transcode to and from UTF32 easily enough,
if you have a client application that prefers to work in that
representation.

            regards, tom lane
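
The transcoding mentioned above can be sketched on the client side. This is a minimal illustration, assuming a Python client and the standard-library codecs; the helper names `utf8_to_utf32` and `utf32_to_utf8` are hypothetical and not part of PostgreSQL or any driver.

```python
# Round-trip text between UTF-8 (the representation PostgreSQL supports)
# and UTF-32 (what a client application might prefer internally).
# Only standard-library codecs are used; the helper names are hypothetical.

def utf8_to_utf32(data: bytes) -> bytes:
    """Decode UTF-8 bytes and re-encode them as big-endian UTF-32 (no BOM)."""
    return data.decode("utf-8").encode("utf-32-be")

def utf32_to_utf8(data: bytes) -> bytes:
    """Decode big-endian UTF-32 bytes and re-encode them as UTF-8."""
    return data.decode("utf-32-be").encode("utf-8")

# Latin text with accents, Japanese, and one code point beyond U+FFFF,
# written with escapes so the example does not depend on file encoding.
sample = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e \U0001F600"
as_utf8 = sample.encode("utf-8")
as_utf32 = utf8_to_utf32(as_utf8)

# The round trip is lossless: both encodings cover the same code points.
assert utf32_to_utf8(as_utf32) == as_utf8
print(len(as_utf8), "bytes as UTF-8;", len(as_utf32), "bytes as UTF-32")
```

Because the conversion is lossless in both directions, the choice of encoding inside the client does not constrain what the server stores.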

Re: UTF-32 support in PostgreSQL ?

From

Andrew Dunstan

Date:

26 October 2015, 20:51:24

On 10/23/2015 11:29 PM, fortin.christian@videotron.ca wrote:
> Does PostgreSQL support Unicode UTF-32 characters?
>
> If not, I think it's a must in order to be international.
>
> To help you in this task, you could use this UTF-32 editor:
>
> https://wxmedit.github.io/downloads.html
>
> thanks.

Do you mean data stored as UTF-32, or source code in UTF-32, or
translation files as UTF-32?

For data, UTF-32 does not meet our requirements for server side encoding
(as well as being horribly inefficient space-wise).

cheers

andrew

Re : Re: UTF-32 support in PostgreSQL ?

From

fortin.christian@videotron.ca

Date:

26 October 2015, 22:24:51

> Do you mean data stored as UTF-32, or source code in UTF-32, or translation files as UTF-32?

I mean ALL of them: data stored, source code, and translation files.
For source code, I think GCC must support UTF-32 first.

I sent an e-mail to Oracle to see what they think about this huge idea.
Well, I know it's not efficient space-wise, but this is the only way that we can deploy worldwide.
I think it must be added to the next version of the SQL language.


Re: Re : Re: UTF-32 support in PostgreSQL ?

From

Craig Ringer

Date:

27 October 2015, 01:20:28

On 27 October 2015 at 05:39, <fortin.christian@videotron.ca> wrote:

> I mean ALL of them: data stored, source code, and translation files.
> For source code, I think GCC must support UTF-32 first.

Why?

UTF-32 is an incredibly inefficient way to store text that's
predominantly or entirely within the 7-bit ASCII space. UTF-8 is a
much better way to handle it.

Anyway, while gcc supports sources encoded in utf-8 just fine, it's
more typical to represent chars using byte escapes so that people with
misconfigured text editors don't mangle them. It does not support
utf-8 identifiers (variable names, function names, etc) containing
characters outside the 7-bit ASCII space, but you can work around it
with UCN if you need to; see the FAQ:

https://gcc.gnu.org/wiki/FAQ#What_is_the_status_of_adding_the_UTF-8_support_for_identifier_names_in_GCC.3F

I don't think the PostgreSQL project is likely to accept patches using
characters outside the 7-bit ascii space in the near future, as
compiler and text editor support is unfortunately still too primitive.
We support a variety of legacy platforms and toolchains, many of which
won't cope at all. There isn't a pressing reason, since at the user
level the support for a wide variety of charsets (including all
characters in the UTF-32 space) is already present.

I am aware this is a form of English-language privilege. Of course
it's easy for me as an English first-language speaker to say "oh, we
don't need support for your language in the code". It's also practical
though - code in a variety of languages, so that no one person can
read or understand all of it, is not maintainable in the long term.
Especially when people join and leave the project. It's the same
reason the project is picky about introducing new programming
languages, even though it might be nice to be able to write parts of
the system in Python, parts in Haskell, etc.

So I don't think we need UTF-32 source code support, or even full
UTF-8 source code support, because even if we had it we probably
wouldn't use it.

> I sent an e-mail to Oracle to see what they think about this huge idea.

I don't understand how this is a huge idea. The representation of the
characters doesn't matter, so long as the DB can represent the full
character suite. Right?

> Well, I know it's not efficient space-wise, but this is the only way that we
> can deploy worldwide.

UTF-8 is widely used worldwide and covers the full Unicode code space.

I wonder if you are misunderstanding UTF-8 vs UCS-2 vs UTF-16 vs UTF-32.

UTF-8 is an encoding that can represent the full Unicode code space
(U+0000 to U+10FFFF) using multi-byte sequences of one to four bytes.
It is endianness-independent. One character is a variable number of
bytes, so lookups to find the n'th character, substring operations,
etc are a bit ugly. UTF-8 is the encoding used by most UNIX APIs.

UCS-2 is a legacy encoding that can represent only the first 65,536
code points of Unicode (the Basic Multilingual Plane). It cannot
represent the full Unicode code space. It has two different forms,
little-endian and big-endian, so you have to include a marker to say
which is which, or be careful about handling it in your code. It's
easy to do n'th character lookups, substrings, etc.

UTF-16 is like UCS-2, but adds surrogate pairs (two 16-bit units) to
handle the code points beyond the first 65,536. It combines the worst
features of UTF-8 and UCS-2. UTF-16 is the encoding used by
Windows APIs and the ICU library.

UTF-32 (UCS-4) is much like UCS-2, but uses 4 bytes per character to
represent the full Unicode character set. The downside is that it uses
a full 4 bytes for every character, even when only one byte would be
needed if you were using utf-8. It's easy to do substrings and n'th
character lookups. UCS-4 is horrible on CPU cache and memory. Few APIs
use native UTF-32.
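
The trade-offs described above can be checked with a short sketch. It assumes Python's standard codecs; the names `encoded_sizes` and `nth_code_point_utf32` are hypothetical helpers introduced only for illustration, and the sample query text is made up.

```python
# Compare how many bytes the same text occupies in each encoding.
# The "-le" codec variants are used so that no byte-order mark is added.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(text: str) -> dict:
    """Return {encoding name: byte length} for the given text."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}

def nth_code_point_utf32(data: bytes, n: int) -> str:
    """Fixed-width lookup: code point n always starts at byte offset 4 * n."""
    return data[4 * n:4 * n + 4].decode("utf-32-le")

# Pure ASCII text: one byte per character in UTF-8, four in UTF-32.
ascii_text = "SELECT * FROM accounts WHERE id = 42;"
print(encoded_sizes(ascii_text))

# A code point outside the Basic Multilingual Plane (U+1F600) needs
# four bytes in UTF-8 and a two-unit surrogate pair in UTF-16.
emoji = "\U0001F600"
print(encoded_sizes(emoji))

# Constant-time indexing is the one thing UTF-32 makes easy.
print(nth_code_point_utf32("h\u00e9llo".encode("utf-32-le"), 1))
```

The fixed-width lookup is what UTF-32 buys; the size comparison for ASCII text is what it costs.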

So we already support one of the best text encodings available.

We could add support for using UTF-16 and UTF-32 as the
client_encoding on the wire. But really, the client application can
convert between the protocol's UTF-8 and whatever it wants to use
internally; there's no benefit to using UTF-16 or UTF-32 on the wire,
and it'd be a lot slower. Especially without protocol compression.

So can you explain why you believe UTF-32 support is necessary?

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Re : Re: UTF-32 support in PostgreSQL ?

From

Peter Geoghegan

Date:

27 October 2015, 01:28:00

On Mon, Oct 26, 2015 at 6:20 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> UTF-16 is like UCS-2, but adds surrogate pairs (two 16-bit units) to
> handle the code points beyond the first 65,536. It combines the worst
> features of UTF-8 and UCS-2. UTF-16 is the encoding used by
> Windows APIs and the ICU library.

ICU can be built to support UTF-8 natively. UTF-8 support has been at
the same level as UTF-16 support for some time now.

"English language privilege" on your part (as you put it) could be
argued if the OP was arguing for UTF-16, but since he argued for
UTF-32, I don't see how that could possibly apply. UTF-16 is slightly
preferable for storing East Asian text, but UTF-32 is a niche encoding
worldwide.

--
Peter Geoghegan
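
The point about East Asian text can be illustrated with a small measurement. This is a sketch assuming Python's standard codecs; `bytes_per_code_point` is a hypothetical helper, and the sample word is an arbitrary choice.

```python
# Average storage cost per code point for a piece of text in each encoding.
# The "-le" codec variants avoid counting a byte-order mark.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def bytes_per_code_point(text: str) -> dict:
    """Return {encoding name: average bytes per code point} for the text."""
    return {enc: len(text.encode(enc)) / len(text) for enc in ENCODINGS}

# The Japanese katakana word for "database", written with escapes so the
# example does not depend on the source file's encoding.
japanese = "\u30c7\u30fc\u30bf\u30d9\u30fc\u30b9"

for encoding, cost in bytes_per_code_point(japanese).items():
    print(f"{encoding}: {cost} bytes per code point")
```

For text like this, UTF-16 is the most compact of the three and UTF-32 the least, which matches the observation that UTF-32 is not the natural choice even outside the ASCII range.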

Re : Re: Re : Re: UTF-32 support in PostgreSQL ?

From

fortin.christian@videotron.ca

Date:

30 October 2015, 17:29:03

Now I have received authorization to give you an answer to the WHY question!
Because basically, this project is classified TOP SECRET.

Well, we know that we have no real advantage in using UTF-32 compared to UTF-8.
But we need to establish a gateway between two huge networks.

One network is the Internet, the other is ... let's name it the extra-Internet.
The extra-Internet is older than the Internet that you already know.
It doesn't use the IP protocol, but uses a 32-bit-per-character encoding.
This 32-bit encoding is not UTF-32, but supports 40 languages. Languages which are not included in UTF-32.
The language which has the fewest characters uses 100 characters.
The biggest alphabet has 10000 characters.
The most used language has 500 characters.

This extra-Internet is as big as the Internet that you know.
This extra-Internet was not built by the USA, but by another country.
Well, I am trying to convince people to use UTF-32.
I will need to ask Unicode to integrate the foreign 32-bit encoding in a future release of UTF-32.
And ask the extra-Internet authority to integrate UTF-32 in their standard 32-bit encoding.
I asked the IETF to support UTF-32 in IPv6. I asked w3.org to support UTF-32 in the future HTML format.
I plan to propose to the extra-Internet authority that they upgrade to IPv6.
They currently have problems with the availability of addresses, like we have with IPv4. The protocol they use is very basic, more basic than IPv4. And it fails very often.

Well, I hope this answers the question.