Re: Reducing the overhead of NUMERIC data - Mailing list pgsql-hackers

From Martijn van Oosterhout
Subject Re: Reducing the overhead of NUMERIC data
Msg-id 20051104234026.GF13966@svana.org
In response to Re: Reducing the overhead of NUMERIC data  (Gregory Maxwell <gmaxwell@gmail.com>)
Responses Re: Reducing the overhead of NUMERIC data
List pgsql-hackers
On Fri, Nov 04, 2005 at 02:58:05PM -0500, Gregory Maxwell wrote:
> The correct question to ask is something like "Does it support non-bmp
> characters?" or "Does it really support UTF-16 or just UCS2?"
>
> UTF-16 is (now) a variable width encoding which is a strict superset
> of UCS2 which allows the representation of all Unicode characters.
> UCS2 is fixed width and only supports characters from the basic
> multilingual plane.  UTF-32 and UCS4 are (now) effectively the same
> thing and can represent all unicode characters with a 4 byte fixed
> length word.

It's all on their website:

: How is a Unicode string represented in ICU?
:
: A Unicode string is currently represented as UTF-16 by default. The
: endianess of UTF-16 is platform dependent. You can guarantee the
: endianess of UTF-16 by using a converter. UTF-16 strings can be
: converted to other Unicode forms by using a converter or with the UTF
: conversion macros.
:
: ICU does not use UCS-2. UCS-2 is a subset of UTF-16. UCS-2 does not
: support surrogates, and UTF-16 does support surrogates. This means
: that UCS-2 only supports UTF-16's Base Multilingual Plane (BMP). The
: notion of UCS-2 is deprecated and dead. Unicode 2.0 in 1996 changed
: its default encoding to UTF-16.
<snip>
: What is the performance difference between UTF-8 and UTF-16?
:
: Most of the time, the memory throughput of the hard drive and RAM is
: the main performance constraint. UTF-8 is 50% smaller than UTF-16 for
: US-ASCII, but UTF-8 is 50% larger than UTF-16 for East and South
: Asian scripts. There is no memory difference for Latin extensions,
: Greek, Cyrillic, Hebrew, and Arabic.
<snip>
http://icu.sourceforge.net/userguide/icufaq.html
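
The size figures in that FAQ answer are easy to check by hand. The following is only a rough sketch using Python's built-in codecs, not ICU; the sample strings and the helper name encoded_sizes are my own choices for illustration, and the encodings are taken without a byte order mark.

```python
# Compare encoded sizes of short sample strings in UTF-8 and UTF-16.
# The samples and the helper name are illustrative assumptions, not from ICU.
samples = {
    "US-ASCII": "hello",
    "Greek": "αβγδε",
    "CJK": "日本語の文",
}

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for name, text in samples.items():
    u8, u16 = encoded_sizes(text)
    print(f"{name}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

For these samples the ASCII string is half the size in UTF-8, the Greek string is the same size in both, and the CJK string is 50% larger in UTF-8, which matches the FAQ's claims.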

: Using UTF-8 strings with ICU
:
: As mentioned in the overview of this chapter, ICU and most other
: Unicode-supporting software uses 16-bit Unicode for internal
: processing. However, there are circumstances where UTF-8 is used
: instead. This is usually the case for software that does little or no
: processing of non-ASCII characters, and/or for APIs that predate
: Unicode, use byte-based strings, and cannot be changed or replaced
: for various reasons.
<snip>
: While ICU does not natively use UTF-8 strings, there are many ways to
: work with UTF-8 strings and ICU. The following list is probably
: incomplete.
http://icu.sourceforge.net/userguide/strings.html#strings

Basically you use a "converter" to process the UTF-8 strings,
presumably converting them to UTF-16 (which is not UCS-2, as noted
above). UTF-32 needs a converter too, so there's no point using that
either.
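
To make the UTF-16 versus UCS-2 distinction concrete, here is a minimal sketch of such a conversion. It uses Python's standard codecs as a stand-in, not ICU's converter API, and the function name utf8_to_utf16_units is hypothetical. It shows that a non-BMP character becomes a surrogate pair (two 16-bit code units) in UTF-16, which UCS-2 cannot represent.

```python
import struct

def utf8_to_utf16_units(data):
    """Decode UTF-8 bytes and return the resulting UTF-16 code units as ints.

    Little-endian is chosen explicitly so the result does not depend on the
    platform's byte order.
    """
    raw = data.decode("utf-8").encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

bmp = "\u00e9".encode("utf-8")          # U+00E9, inside the BMP
non_bmp = "\U0001D11E".encode("utf-8")  # U+1D11E, outside the BMP

print([hex(u) for u in utf8_to_utf16_units(bmp)])      # one code unit
print([hex(u) for u in utf8_to_utf16_units(non_bmp)])  # high + low surrogate
```

Code that assumes one 16-bit unit per character handles the first case and mishandles the second, which is the kind of bug Gregory describes below.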

> The code can demand UTF-16 but still be fine for non-BMP characters.
> However, many things which claim to support UTF-16 really only support
> UCS2 or at least have bugs in their handling of non-bmp characters.
> Software that supports UTF-8 is somewhat more likely to support
> non-bmp characters correctly since the variable length code paths get
> more of a workout in many environments. :)

I think ICU deals with that, but feel free to peruse the website
yourself...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
