Re: [PATCHES] UNICODE characters above 0x10000 - Mailing list pgsql-hackers

From John Hansen
Subject Re: [PATCHES] UNICODE characters above 0x10000
Date
Msg-id 5066E5A966339E42AA04BA10BA706AE56174@rodrick.geeknet.com.au
List pgsql-hackers
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Sunday, August 08, 2004 2:43 AM
> To: Dennis Bjorklund
> Cc: Tatsuo Ishii; John Hansen; pgsql-hackers@postgresql.org;
> pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
>
> Dennis Bjorklund <db@zigo.dhs.org> writes:
> > On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
> >> Anyway my point is: if the current Unicode specification only allows
> >> a 24-bit range, why do we need to allow usage beyond the specification?
>
> > Is there a specific reason you want to restrict it to 24 bits?
>
> I see several places that have to allocate space on the basis
> of the maximum encoded character length possible in the
> current encoding (look for uses of
> pg_database_encoding_max_length).  Probably the only one
> that's really significant for performance is text_substr(),
> but that's enough to be an argument against setting maxmblen
> higher than we have to.
>
> It looks to me like supporting 4-byte UTF-8 characters would
> be enough to handle the existing range of Unicode codepoints,
> and that is probably as much as we want to do.
>
> If I understood what I was reading, this would take several things:
> * Remove the "special UTF-8 check" in pg_verifymbstr;

I strongly disagree; this would mean one could store any sequence of
bytes in the db, as long as the bytes are above 0x80. That would not be
valid UTF-8, and it would leave the data in an inconsistent state.
Setting the client encoding to UNICODE implies that this is what we're
going to feed the database, and it should guarantee that what comes out
of a SELECT is valid UTF-8. We can make sure of that by doing the check
before the data is inserted.
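
To make concrete what "doing the check" means here, below is a minimal
standalone sketch (my illustration, not the actual pg_verifymbstr code;
the helper name utf8_sequence_is_valid is hypothetical) of validating a
single UTF-8 sequence: correct lead byte, 10xxxxxx continuation bytes,
no overlong forms, no surrogates, nothing past U+10FFFF.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Illustrative standalone check, not PostgreSQL source: verify that the
 * 'len' bytes at 's' form one well-formed UTF-8 sequence.
 */
static bool
utf8_sequence_is_valid(const unsigned char *s, size_t len)
{
    unsigned int cp;
    size_t      i;

    if (len == 1)
        return s[0] < 0x80;
    else if (len == 2 && (s[0] & 0xE0) == 0xC0)
        cp = s[0] & 0x1F;
    else if (len == 3 && (s[0] & 0xF0) == 0xE0)
        cp = s[0] & 0x0F;
    else if (len == 4 && (s[0] & 0xF8) == 0xF0)
        cp = s[0] & 0x07;
    else
        return false;           /* bad lead byte, or a 5-/6-byte form */

    for (i = 1; i < len; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return false;       /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* reject overlong encodings, UTF-16 surrogates, and > U+10FFFF */
    if (len == 2 && cp < 0x80)
        return false;
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF)))
        return false;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF))
        return false;

    return true;
}

int
main(void)
{
    /* U+10000, the first codepoint above the BMP, as 4-byte UTF-8 */
    const unsigned char seq[] = {0xF0, 0x90, 0x80, 0x80};

    printf("valid: %d\n", utf8_sequence_is_valid(seq, 4));
    return 0;
}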

> * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte
> case;

pg_utf_mblen should handle every case allowed by the specification.
Currently it returns 3 even for 4-, 5-, and 6-byte sequences. Wherever
pg_utf_mblen is called, we should check that the returned length is
between 1 and 4 inclusive, and that the sequence itself is valid. That
is what I made the patch for.
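
For illustration only (this is not the actual pg_utf_mblen source, and
utf8_declared_len is a hypothetical name), here is the lead-byte to
declared-length rule referred to above, plus the 1..4 range check a
caller would apply:

#include <stdio.h>

/*
 * Lead byte -> declared sequence length. RFC 2279 allowed up to 6 bytes;
 * Unicode itself stops at 4, so the caller rejects anything outside 1..4.
 */
static int
utf8_declared_len(unsigned char c)
{
    if ((c & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    if ((c & 0xFC) == 0xF8) return 5;   /* 111110xx: RFC 2279 only */
    if ((c & 0xFE) == 0xFC) return 6;   /* 1111110x: RFC 2279 only */
    return -1;                          /* continuation byte or 0xFE/0xFF */
}

int
main(void)
{
    unsigned char lead = 0xF0;          /* start of a 4-byte sequence */
    int len = utf8_declared_len(lead);

    /* the range check suggested above: accept only lengths 1 through 4 */
    printf("declared length: %d, accepted: %s\n",
           len, (len >= 1 && len <= 4) ? "yes" : "no");
    return 0;
}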

> * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

That I have no problem with.
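
As a quick sanity check on why 4 bytes are enough (my sketch, not
PostgreSQL code): a 4-byte UTF-8 sequence carries 3 + 6 + 6 + 6 = 21
payload bits, which covers everything up to U+10FFFF:

#include <stdio.h>

int
main(void)
{
    unsigned int  cp = 0x10FFFF;        /* highest legal Unicode codepoint */
    unsigned char buf[4];

    /* pack the 21 codepoint bits into a 4-byte UTF-8 sequence */
    buf[0] = 0xF0 | (cp >> 18);
    buf[1] = 0x80 | ((cp >> 12) & 0x3F);
    buf[2] = 0x80 | ((cp >> 6) & 0x3F);
    buf[3] = 0x80 | (cp & 0x3F);

    printf("U+%06X -> %02X %02X %02X %02X\n",
           cp, buf[0], buf[1], buf[2], buf[3]);
    return 0;
}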

> Are there any other places that would have to change?  Would
> this break anything?  The testing aspect is what's bothering
> me at the moment.
>
>             regards, tom lane
>
>

Just my $0.02 worth,

Kind Regards,

John Hansen
