On 11/3/05, Martijn van Oosterhout <kleptog@svana.org> wrote:
> That's called UTF-16 and is currently not supported by PostgreSQL at
> all. That may change, since the locale library ICU requires UTF-16 for
> everything.
UTF-16 doesn't get us out of the variable-length character game; for
that we need UTF-32... Unless we were to support only UCS-2, which is
what some databases do for their Unicode support. I think that would
be a huge step back, and as you pointed out below, it is not efficient.
:)
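To make the variable-width point concrete, here is a quick sketch
(illustration only, not anything from the tree): any code point above
U+FFFF has to be split into a UTF-16 surrogate pair, which UCS-2
cannot represent at all.

    /* Encoding a supplementary-plane code point (U+1D11E, MUSICAL
     * SYMBOL G CLEF) as a UTF-16 surrogate pair.  Anything above
     * U+FFFF needs two 16-bit units, so UTF-16 is variable width;
     * UCS-2 simply cannot represent these characters. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t cp = 0x1D11E;              /* any code point > U+FFFF */
        uint32_t v  = cp - 0x10000;         /* the 20 bits to split    */
        uint16_t hi = 0xD800 | (v >> 10);   /* high surrogate          */
        uint16_t lo = 0xDC00 | (v & 0x3FF); /* low surrogate           */
        printf("U+%04X -> %04X %04X (two UTF-16 units)\n",
               (unsigned) cp, hi, lo);
        return 0;
    }

This prints "U+1D11E -> D834 DD1E": two units for one character.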
> The question is, if someone declares a field CHAR(20), do they really
> mean to fix 40 bytes of storage for each and every row? I doubt it,
> that's even more wasteful of space than a varlena header.
>
> Which puts you right back to variable length fields.
Another way to look at this is in the context of compression: with
Unicode, characters are really 32-bit values... But only a small range
of those values is common. So we store and work with them in a
compressed format, UTF-8.
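A rough illustration of that framing (again just a sketch): a UTF-8
encoder only spends extra bytes on the rare values, versus a flat four
bytes each in UTF-32.

    /* UTF-8 as a "compression" of code points: common (ASCII)
     * values take one byte, rarer ones up to four. */
    #include <stdio.h>
    #include <stdint.h>

    static int utf8_len(uint32_t cp)
    {
        if (cp < 0x80)    return 1;   /* ASCII              */
        if (cp < 0x800)   return 2;   /* Latin-1, Greek ... */
        if (cp < 0x10000) return 3;   /* rest of the BMP    */
        return 4;                     /* supplementary      */
    }

    int main(void)
    {
        uint32_t samples[] = { 0x41, 0xE9, 0x4E2D, 0x1D11E };
        for (int i = 0; i < 4; i++)
            printf("U+%05X: %d byte(s) in UTF-8, 4 in UTF-32\n",
                   (unsigned) samples[i], utf8_len(samples[i]));
        return 0;
    }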
The cost of this compression is that fixed-width fields cannot really
be fixed width, and some operations are much more expensive than they
would otherwise be.
As such, it might be more interesting to ask some other questions:
are we using the best compression algorithm for the application, and
why do we sometimes stack two compression algorithms (UTF-8 text that
then gets LZ-compressed again by TOAST)? For longer fields, would we
be better off working with UTF-32 and being more aggressive about
where we LZ-compress the fields?
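To get a feel for how much of the UTF-32 bloat a generic LZ pass wins
back, here is a toy comparison using zlib as a stand-in for the
LZ-family TOAST compressor (illustration only, sizes will vary with
the input):

    /* Compress the same ASCII text once as UTF-8 bytes and once
     * widened to UTF-32.  Build with: cc lzdemo.c -lz */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const char *text =
            "the quick brown fox jumps over the lazy dog ";
        size_t n = strlen(text);
        unsigned char utf32[1024], out[1024];
        uLongf outlen;

        /* widen the ASCII text to little-endian UTF-32 */
        for (size_t i = 0; i < n; i++)
        {
            utf32[4 * i] = (unsigned char) text[i];
            utf32[4 * i + 1] = utf32[4 * i + 2] = utf32[4 * i + 3] = 0;
        }

        outlen = sizeof(out);
        compress2(out, &outlen, (const Bytef *) text, n, 9);
        printf("UTF-8:  %zu -> %lu bytes\n", n, (unsigned long) outlen);

        outlen = sizeof(out);
        compress2(out, &outlen, utf32, 4 * n, 9);
        printf("UTF-32: %zu -> %lu bytes\n", 4 * n, (unsigned long) outlen);
        return 0;
    }

The compressed sizes land much closer together than the 4x raw
difference, since LZ eats most of the zero padding.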
> > I dunno... no opinion on the matter here, but I did want to point out
> > that the field can be fixed length without a header. Those proposing such
> > a change, however, should accept that this may result in an overall
> > expense.
>
> The only time this may be useful is for *very* short fields, in the
> order of 4 characters or less. Else the overhead swamps the varlena
> header...
Not even 4 characters, if we are to support all of Unicode... A
length header + UTF-8 is a win vs. UTF-32 in most cases for fields
with more than one character.
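Rough arithmetic, assuming the current 4-byte varlena header and one
byte per character (pure ASCII; worst-case UTF-8 is four bytes per
character):

    /* 4-byte length header + UTF-8 payload vs. raw UTF-32 for an
     * n-character ASCII string: 4 + n vs. 4 * n bytes, so UTF-8
     * plus a header wins as soon as n >= 2. */
    #include <stdio.h>

    int main(void)
    {
        for (int n = 1; n <= 8; n++)
            printf("%d chars: header+UTF-8 = %2d bytes, UTF-32 = %2d bytes\n",
                   n, 4 + n, 4 * n);
        return 0;
    }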