Re: Reducing the overhead of NUMERIC data - Mailing list pgsql-hackers
From | Martijn van Oosterhout |
---|---|
Subject | Re: Reducing the overhead of NUMERIC data |
Date | |
Msg-id | 20051104234026.GF13966@svana.org |
In response to | Re: Reducing the overhead of NUMERIC data (Gregory Maxwell <gmaxwell@gmail.com>) |
Responses | Re: Reducing the overhead of NUMERIC data |
List | pgsql-hackers |
On Fri, Nov 04, 2005 at 02:58:05PM -0500, Gregory Maxwell wrote:
> The correct question to ask is something like "Does it support non-bmp
> characters?" or "Does it really support UTF-16 or just UCS2?"
>
> UTF-16 is (now) a variable width encoding which is a strict superset
> of UCS2 which allows the representation of all Unicode characters.
> UCS2 is fixed width and only supports characters from the basic
> multilingual plane. UTF-32 and UCS4 are (now) effectively the same
> thing and can represent all unicode characters with a 4 byte fixed
> length word.

It's all on their website:

: How is a Unicode string represented in ICU?
:
: A Unicode string is currently represented as UTF-16 by default. The
: endianess of UTF-16 is platform dependent. You can guarantee the
: endianess of UTF-16 by using a converter. UTF-16 strings can be
: converted to other Unicode forms by using a converter or with the UTF
: conversion macros.
:
: ICU does not use UCS-2. UCS-2 is a subset of UTF-16. UCS-2 does not
: support surrogates, and UTF-16 does support surrogates. This means
: that UCS-2 only supports UTF-16's Base Multilingual Plane (BMP). The
: notion of UCS-2 is deprecated and dead. Unicode 2.0 in 1996 changed
: its default encoding to UTF-16.
<snip>
: What is the performance difference between UTF-8 and UTF-16?
:
: Most of the time, the memory throughput of the hard drive and RAM is
: the main performance constraint. UTF-8 is 50% smaller than UTF-16 for
: US-ASCII, but UTF-8 is 50% larger than UTF-16 for East and South
: Asian scripts. There is no memory difference for Latin extensions,
: Greek, Cyrillic, Hebrew, and Arabic.
<snip>

http://icu.sourceforge.net/userguide/icufaq.html

: Using UTF-8 strings with ICU
:
: As mentioned in the overview of this chapter, ICU and most other
: Unicode-supporting software uses 16-bit Unicode for internal
: processing. However, there are circumstances where UTF-8 is used
: instead. This is usually the case for software that does little or no
: processing of non-ASCII characters, and/or for APIs that predate
: Unicode, use byte-based strings, and cannot be changed or replaced
: for various reasons.
<snip>
: While ICU does not natively use UTF-8 strings, there are many ways to
: work with UTF-8 strings and ICU. The following list is probably
: incomplete.

http://icu.sourceforge.net/userguide/strings.html#strings

Basically you use a "converter" to process the UTF-8 strings,
presumably converting them to UTF-16 (which is not UCS-2, as noted
above). UTF-32 needs a converter too, so there is no point using that
either.

> The code can demand UTF-16 but still be fine for non-BMP characters.
> However, many things which claim to support UTF-16 really only support
> UCS2 or at least have bugs in their handling of non-bmp characters.
> Software that supports UTF-8 is somewhat more likely to support
> non-bmp characters correctly since the variable length code paths get
> more of a workout in many environments. :)

I think ICU deals with that, but feel free to peruse the website
yourself...

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
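
For anyone who wants to check the size figures and the UCS-2 versus
UTF-16 distinction quoted above, here is a minimal sketch. It uses
Python's standard-library codecs rather than ICU, and the helper names
(encoded_sizes, utf16_code_units, fits_in_ucs2) are made up for this
illustration, so treat it as a rough demonstration of the encodings and
not of how ICU behaves internally.

```python
# Illustration only: Python's built-in codecs, not ICU.
# All function names here are hypothetical helpers for this example.

def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf16_code_units(text: str) -> list:
    """Split `text` into its 16-bit UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def fits_in_ucs2(text: str) -> bool:
    """True if every character is a BMP code point that is not a surrogate."""
    return all(
        ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF
        for ch in text
    )


if __name__ == "__main__":
    ascii_text = "hello"
    cjk_text = "\u65e5\u672c\u8a9e"   # three BMP characters (Japanese)
    non_bmp = "\U0001D11E"            # MUSICAL SYMBOL G CLEF, outside the BMP

    # US-ASCII: UTF-8 needs half the bytes of UTF-16.
    print(encoded_sizes(ascii_text))  # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}

    # East Asian text: UTF-8 needs 50% more bytes than UTF-16.
    print(encoded_sizes(cjk_text))    # {'utf-8': 9, 'utf-16': 6, 'utf-32': 12}

    # A non-BMP character takes two UTF-16 code units (a surrogate pair),
    # which is exactly what UCS-2 cannot represent.
    print([hex(u) for u in utf16_code_units(non_bmp)])  # ['0xd834', '0xdd1e']
    print(fits_in_ucs2(cjk_text), fits_in_ucs2(non_bmp))  # True False
```

The surrogate-pair case is the one worth testing in any code that claims
UTF-16 support: a string with a single non-BMP character has two 16-bit
code units, so anything that assumes one unit per character will
miscount it.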