Re : Re: Re : Re: UTF-32 support in PostgreSQL ? - Mailing list pgsql-hackers
From | fortin.christian@videotron.ca |
---|---|
Subject | Re : Re: Re : Re: UTF-32 support in PostgreSQL ? |
Date | |
Msg-id | 747095644b99f.563141a9@videotron.ca |
In response to | UTF-32 support in PostgreSQL ? (fortin.christian@videotron.ca) |
List | pgsql-hackers |
Now I received the authorization to give you an answer to the WHY question!<br />Because basically, this project is classified TOP SECRET.<br /><br />Well, we know that we have no real advantage in using UTF-32 in comparison to UTF-8.<br />But we need to establish a gateway between two huge networks.<br /><br />One network is the Internet, the other is ... let's name it the extra-Internet.<br />The extra-Internet is older than the Internet that you already know.<br />It doesn't use the IP protocol, but uses a 32-bit-per-character encoding.<br />This 32-bit encoding is not UTF-32, but supports 40 languages. Languages which are not included in UTF-32.<br />The language which has the fewest characters uses 100 characters.<br />The biggest alphabet has 10000 characters.<br />The most used language has 500 characters.<br /><br />This extra-Internet is as big as the current Internet that you know.<br />This extra-Internet has not been built by the USA, but by another country.<br />Well, I try to convince people to use UTF-32.<br />I will need to ask UNICODE to integrate the foreign 32-bit encoding in a future release of UTF-32.<br />And ask the extra-Internet authority to integrate UTF-32 in their standard 32-bit encoding.<br />I request that the IETF support UTF-32 in IPv6. I asked w3.org to support UTF-32 in the future HTML format.<br />I plan to propose to the extra-Internet authority to upgrade to IPv6.<br />They currently have problems with the availability of addresses, like we have with IPv4. The protocol they use is very basic, more basic than IPv4. 
And it fails very often.<br /><br />Well, I hope this answers your question.<br /><br /><span>On 26/10/15, <b class="name">Craig Ringer </b> <craig@2ndquadrant.com> wrote:</span><blockquote cite="mid:CAMsr+YEZ57vcqEMFLBxjDxmz5O+h69-QfPYVxjAz2RPT_mycbg@mail.gmail.com"class="iwcQuote" style="border-left: 1px solid#00F; padding-left: 13px; margin-left: 0;" type="cite"><div class="mimepart text plain">On 27 October 2015 at 05:39, <fortin.christian@videotron.ca> wrote:<br /><br />> I mean for ALL, data stored, source code, and translation files.<br />> For source code, I think that GCC must support UTF-32 first.<br /><br />Why?<br /><br />UTF-32 is an incredibly inefficient way to store text that's<br />predominantly or entirely within the 7-bit ASCII space. UTF-8 is a<br />much better way to handle it.<br /><br />Anyway, while gcc supports sources encoded in utf-8 just fine, it's<br />more typical to represent chars using byte escapes so that people with<br />misconfigured text editors don't mangle them. It does not support<br />utf-8 identifiers (variable names, function names, etc) containing<br />characters outside the 7-bit ASCII space, but you can work around it<br />with UCN if you need to; see the FAQ:<br /><br/><a href="https://gcc.gnu.org/wiki/FAQ#What_is_the_status_of_adding_the_UTF-8_support_for_identifier_names_in_GCC.3F" target="l">https://gcc.gnu.org/wiki/FAQ#What_is_the_status_of_adding_the_UTF-8_support_for_identifier_names_in_GCC.3F</a><br /><br/>I don't think the PostgreSQL project is likely to accept patches using<br />characters outside the 7-bit ascii space in the near future, as<br />compiler and text editor support is unfortunately still too primitive.<br />We support a variety of legacy platforms and toolchains, many of which<br />won't cope at all. 
There isn't a pressing reason, since at the user<br />level the support for a wide variety of charsets (including all<br />characters in the UTF-32 space) is already present.<br /><br />I am aware this is a form of English-language privilege. Of course<br />it's easy for me as an English first-language speaker to say "oh, we<br />don't need support for your language in the code". It's also practical<br/>though - code in a variety of languages, so that no one person can<br />read or understand all of it, is not maintainable in the long term.<br />Especially when people join and leave the project. It's the same<br />reason the project is picky about introducing new programming<br />languages, even though it might be nice to be able to write parts of<br />the system in Python, parts in Haskell, etc.<br /><br />So I don't think we need UTF-32 source code support, or even full<br />UTF-8 source code support, because even if we had it we probably<br />wouldn't use it.<br /><br/><br />> I sent an e-mail to Oracle to see what they think about this huge idea.<br /><br />I don't understand how this is a huge idea. The representation of the<br />characters doesn't matter, so long as the DB can represent the full<br/>character suite. Right?<br /><br />> Well, I know it's not efficient space wise, but this is the only way that we<br />> can deploy worldwide.<br /><br />UTF-8 is widely used worldwide and covers the full Unicode 32-bit codespace.<br /><br />I wonder if you are misunderstanding UTF-8 vs UCS-2 vs UTF-16 vs UTF-32.<br /><br />UTF-8 is an encoding that can represent the full 32-bit Unicode space<br />using escape sequences. It is endianness-independent. One character is<br />a variable number of bytes, so lookups to find the n'th character,<br />substring operations, etc are a bit ugly. UTF-8 is the character set<br />used by most UNIX APIs.<br /><br />UCS-2 is a legacy encoding that can represent the lower 16 bits of the<br />Unicode space. 
It cannot represent the full 32-bit Unicode space. It<br />has two different forms, little-endian and big-endian, so you have to<br />include a marker to say which is which, or be careful about handling<br />it in your code. It's easy to do n'th character lookups, substrings,<br />etc.<br /><br />UTF-16 is like UCS-2, but adds UTF-8-like escape sequences to handle<br />the high 16 bits of the 32-bit Unicode space. It combines the worst<br />features of UTF-8 and UCS-2. UTF-16 is the character set used by<br />Windows APIs and the ICU library.<br /><br />UTF-32 (UCS-4) is much like UCS-2, but uses 4 bytes per character to<br />represent the full Unicode character set. The downside is that it uses<br />a full 4 bytes for every character, even when only one byte would be<br />needed if you were using utf-8. It's easy to do substrings and n'th<br />character lookups. UCS-4 is horrible on CPU cache and memory. Few APIs<br />use native UTF-32.<br /><br />So we already support one of the best text encodings available.<br /><br />We could add support for using UTF-16 and UTF-32 as the<br />client_encoding on the wire. But really, the client application can<br />convert between the protocol's UTF-8 and whatever it wants to use<br />internally; there's no benefit to using UTF-16 or UTF-32 on the wire,<br />and it'd be a lot slower. Especially without protocol compression.<br /><br />So can you explain why you believe UTF-32 support is necessary?<br /><br /> Craig Ringer <a href="http://www.2ndQuadrant.com/" target="l">http://www.2ndQuadrant.com/</a><br /> PostgreSQL Development, 24x7 Support, Training & Services<br /></div></blockquote>
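The size trade-off described in the quoted message can be checked with a rough Python sketch. The sample strings and the helper name `encoded_sizes` are illustrative assumptions, not something from the thread; the byte counts use the little-endian codecs so that no byte-order mark is added.

```python
# Sample strings are illustrative assumptions, written with escapes so the
# exact code points are unambiguous.
SAMPLES = {
    "ascii": "PostgreSQL",
    "latin": "\u00e9crit",           # "écrit", one accented letter
    "cjk": "\u65e5\u672c\u8a9e",     # three CJK characters
    "emoji": "\U0001F600",           # one code point outside the 16-bit range
}


def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        # len(text) counts code points; the dict shows bytes per encoding.
        print(name, len(text), encoded_sizes(text))
```

For pure ASCII text UTF-32 takes four times the space of UTF-8, while for a code point above U+FFFF all three encodings need four bytes (UTF-16 uses a surrogate pair).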