Re : Re: Re : Re: UTF-32 support in PostgreSQL ? - Mailing list pgsql-hackers

From fortin.christian@videotron.ca
Subject Re : Re: Re : Re: UTF-32 support in PostgreSQL ?
Date
Msg-id 747095644b99f.563141a9@videotron.ca
In response to UTF-32 support in PostgreSQL ?  (fortin.christian@videotron.ca)
List pgsql-hackers
I have now received authorization to answer the WHY question!<br />
Because, basically, this project is classified TOP SECRET.<br /><br />
Well, we know that there is no real advantage to using UTF-32 compared to UTF-8.<br />
But we need to establish a gateway between two huge networks.<br /><br />
One network is the Internet; the other is ... let's name it the extra-Internet.<br />
The extra-Internet is older than the Internet that you already know.<br />
It doesn't use the IP protocol, but it uses a 32-bit-per-character encoding.<br />
This 32-bit encoding is not UTF-32, but it supports 40 languages that are not included in UTF-32.<br />
The language with the fewest characters uses 100 characters.<br />
The biggest alphabet has 10000 characters.<br />
The most used language has 500 characters.<br /><br />
This extra-Internet is as big as the current Internet that you know.<br />
It was not built by the USA, but by another country.<br />
Well, I am trying to convince people to use UTF-32.<br />
I will need to ask Unicode to integrate the foreign 32-bit encoding into a future release of UTF-32,<br />
and ask the extra-Internet authority to integrate UTF-32 into their standard 32-bit encoding.<br />
I am asking the IETF to support UTF-32 in IPv6. I have asked w3.org to support UTF-32 in the future HTML format.<br />
I plan to propose to the extra-Internet authority that they upgrade to IPv6.<br />
They currently have problems with address availability, like we have with IPv4. The protocol they use is very
basic, more basic than IPv4, and it fails very often.<br /><br />Well, I hope this answers your question.<br /><br
/><span>On 26/10/15, <b class="name">Craig Ringer </b> <craig@2ndquadrant.com> wrote:</span><blockquote
cite="mid:CAMsr+YEZ57vcqEMFLBxjDxmz5O+h69-QfPYVxjAz2RPT_mycbg@mail.gmail.com" class="iwcQuote" style="border-left: 1px
solid #00F; padding-left: 13px; margin-left: 0;" type="cite"><div class="mimepart text plain">On 27 October 2015 at
05:39, <fortin.christian@videotron.ca> wrote:<br /><br />> I mean for ALL: stored data, source code, and
translation files.<br />> For source code, I think that GCC must support UTF-32 first.<br /><br />Why?<br /><br
/>UTF-32 is an incredibly inefficient way to store text that's<br />predominantly or entirely within the 7-bit ASCII
space. UTF-8 is a<br />much better way to handle it.<br /><br />Anyway, while gcc supports sources encoded in UTF-8 just
fine, it's<br />more typical to represent chars using byte escapes so that people with<br />misconfigured text editors
don't mangle them. It does not support<br />UTF-8 identifiers (variable names, function names, etc.) containing<br
/>characters outside the 7-bit ASCII space, but you can work around it<br />with UCN if you need to; see the FAQ:<br
/><br/><a
href="https://gcc.gnu.org/wiki/FAQ#What_is_the_status_of_adding_the_UTF-8_support_for_identifier_names_in_GCC.3F"
target="l">https://gcc.gnu.org/wiki/FAQ#What_is_the_status_of_adding_the_UTF-8_support_for_identifier_names_in_GCC.3F</a><br
/><br/>I don't think the PostgreSQL project is likely to accept patches using<br />characters outside the 7-bit ASCII
space in the near future, as<br />compiler and text editor support is unfortunately still too primitive.<br />We support
a variety of legacy platforms and toolchains, many of which<br />won't cope at all. There isn't a pressing reason, since
at the user<br />level the support for a wide variety of charsets (including all<br />characters in the UTF-32 space) is
already present.<br /><br />I am aware this is a form of English-language privilege. Of course<br />it's easy for me as
an English first-language speaker to say "oh, we<br />don't need support for your language in the code". It's also
practical<br/>though - code in a variety of languages, so that no one person can<br />read or understand all of it, is
not maintainable in the long term.<br />Especially when people join and leave the project. It's the same<br />reason the
project is picky about introducing new programming<br />languages, even though it might be nice to be able to write
parts of<br />the system in Python, parts in Haskell, etc.<br /><br />So I don't think we need UTF-32 source code
support, or even full<br />UTF-8 source code support, because even if we had it we probably<br />wouldn't use it.<br
/><br/><br />> I sent an e-mail to Oracle to see what they think about this huge idea.<br /><br />I don't understand
how this is a huge idea. The representation of the<br />characters doesn't matter, so long as the DB can represent the
full<br/>character suite. Right?<br /><br />> Well, I know it's not efficient space-wise, but this is the only way
that we<br />> can deploy worldwide.<br /><br />UTF-8 is widely used worldwide and covers the full Unicode 32-bit
codespace.<br /><br />I wonder if you are misunderstanding UTF-8 vs UCS-2 vs UTF-16 vs UTF-32.<br /><br />UTF-8 is an
encoding that can represent the full 32-bit Unicode space<br />using escape sequences. It is endianness-independent. One
character is<br />a variable number of bytes, so lookups to find the n'th character,<br />substring operations, etc. are
a bit ugly. UTF-8 is the encoding<br />used by most UNIX APIs.<br /><br />UCS-2 is a legacy encoding that can
represent the lower 16 bits of the<br />Unicode space. It cannot represent the full 32-bit Unicode space. It<br />has
two different forms, little-endian and big-endian, so you have to<br />include a marker to say which is which, or be
careful about handling<br />it in your code. It's easy to do n'th character lookups, substrings,<br />etc.<br /><br
/>UTF-16 is like UCS-2, but adds UTF-8-like escape sequences to handle<br />the high 16 bits of the 32-bit Unicode
space. It combines the worst<br />features of UTF-8 and UCS-2. UTF-16 is the encoding used by<br />Windows APIs and
the ICU library.<br /><br />UTF-32 (UCS-4) is much like UCS-2, but uses 4 bytes per character to<br />represent the full
Unicode character set. The downside is that it uses<br />a full 4 bytes for every character, even when only one byte
would be<br />needed if you were using UTF-8. It's easy to do substrings and n'th<br />character lookups. UCS-4 is
horrible on CPU cache and memory. Few APIs<br />use native UTF-32.<br /><br />So we already support one of the best text
encodings available.<br /><br />We could add support for using UTF-16 and UTF-32 as the<br />client_encoding on the
wire. But really, the client application can<br />convert between the protocol's UTF-8 and whatever it wants to use<br
/>internally; there's no benefit to using UTF-16 or UTF-32 on the wire,<br />and it'd be a lot slower. Especially
without protocol compression.<br /><br />So can you explain why you believe UTF-32 support is necessary?<br /><br
/> Craig Ringer                   <a href="http://www.2ndQuadrant.com/" target="l">http://www.2ndQuadrant.com/</a><br
/> PostgreSQL Development, 24x7 Support, Training & Services<br /></div></blockquote> 
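
For readers following the size argument in the quoted message, here is a minimal Python sketch (not part of the original thread) that compares how many bytes the same text takes in UTF-8, UTF-16 and UTF-32. The sample strings and the helper name `encoded_sizes` are assumptions made up for illustration; only the standard `str.encode` method is used, with the little-endian codec variants so that no byte-order mark is counted.

```python
# Illustrative only: compare encoded sizes of the same text in three encodings.
# The "-le" codec variants are used so that no byte-order mark is added.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

# Hypothetical sample strings chosen to cover different code point ranges.
SAMPLES = {
    "ascii": "PostgreSQL",
    "latin": "\u00e9crit",      # "écrit", one non-ASCII Latin letter
    "cjk": "\u6587\u5b57",      # two CJK characters
    "emoji": "\U0001F600",      # one code point outside the 16-bit range
}


def encoded_sizes(text):
    """Return a dict mapping each encoding name to the byte length of text."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(label, encoded_sizes(text))
```

With mostly ASCII text, UTF-32 comes out at four times the size of UTF-8, which is the inefficiency the quoted reply describes. For the code point outside the 16-bit range, all three encodings need four bytes, and UTF-16 gets there by using two 16-bit units.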
