On Tue, Jul 12, 2005 at 05:37:32PM -0400, Joe wrote:
> Tom Lane wrote:
> >Because the length specification is in *characters*, which is not by any
> >means the same as *bytes*.
> >
> >We could possibly put enough intelligence into the low-level tuple
> >manipulation routines to count characters in whatever encoding we happen
> >to be using, but it's a lot faster and more robust to insist on a count
> >word for every variable-width field.
>
> I guess what you're saying is that PostgreSQL stores characters in
> varying-length encodings.
It _may_ store characters in variable-length encodings. It can use
fixed-length encodings too, such as latin1 or plain ASCII (actually
unchecked 8 bits, which accepts just about anything); you define that at
initdb time or database creation time, I forget which. Teaching the code
to distinguish fixed-length from variable-length encodings at runtime
would be painful; that optimization would let us drop the otherwise-required
length word, but so far nobody has cared enough about it to do the job.
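To make the character-vs-byte distinction concrete, here is a small sketch
(just an illustration, not from the original mail; the database name is made
up and it assumes a reasonably recent server with UTF-8 support):

    -- the encoding is fixed per cluster (initdb -E UTF8) or per database:
    CREATE DATABASE demo ENCODING 'UTF8';

    -- varchar(n) limits the number of *characters*, not bytes;
    -- 'é' is one character but two bytes in UTF-8, so it fits varchar(1):
    SELECT 'é'::varchar(1);
    SELECT char_length('é');   -- 1 character
    SELECT octet_length('é');  -- 2 bytes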
> If it stored character data in Unicode (UCS-16) it would always take
> up two-bytes per character.
Really? We don't support two-byte Unicode encodings (UCS-2/UTF-16, which I
take is what you mean by UCS-16), for good reasons: we'd have to rewrite
several parts of the code to cope with '\0' (NUL) bytes embedded in strings,
since we use regular NUL-terminated C strings extensively.
We do support Unicode as UTF-8, but as has come up a couple of times before,
a character can take more than 2 or even 3 bytes in some cases. So I don't
see how a fixed two-bytes-per-character encoding could cover all of Unicode.
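A quick illustration (again just a sketch, assuming a UTF-8 database and a
server that accepts 4-byte UTF-8 sequences):

    -- 'é' (U+00E9) is 2 bytes in UTF-8, '€' (U+20AC) is 3, and anything
    -- outside the Basic Multilingual Plane, e.g. '𝄞' (U+1D11E), needs 4:
    SELECT octet_length('é'), octet_length('€'), octet_length('𝄞');
    -- => 2, 3, 4   (char_length() reports 1 for each of them)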
> Have you considered supporting NCHAR/NVARCHAR, aka NATIONAL character
> data?
There have been noises about it, but so far nobody has stepped up to the
plate to do the work.
--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"Those who use electric razors are infidels destined to burn in hell while
we drink from rivers of beer, download free vids and mingle with naked
well shaved babes." (http://slashdot.org/comments.pl?sid=44793&cid=4647152)