Re: [PATCHES] UNICODE characters above 0x10000

From: Tom Lane
Subject: Re: [PATCHES] UNICODE characters above 0x10000
Msg-id: 350.1091897000@sss.pgh.pa.us
In response to: Re: [PATCHES] UNICODE characters above 0x10000 (Dennis Bjorklund <db@zigo.dhs.org>)
Responses: Re: [PATCHES] UNICODE characters above 0x10000 (Oliver Jowett <oliver@opencloud.com>)
List: pgsql-hackers
Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
>> Anyway, my point is: if the current specification of Unicode only
>> allows a 24-bit range, why do we need to allow usage that goes
>> against the specification?

> Is there a specific reason you want to restrict it to 24 bits?

I see several places that have to allocate space on the basis of the
maximum encoded character length possible in the current encoding
(look for uses of pg_database_encoding_max_length).  Probably the only
one that's really significant for performance is text_substr(), but
that's enough to be an argument against setting maxmblen higher than
we have to.
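
For concreteness, the sizing pattern at issue looks roughly like this
(a schematic of the worst-case allocation, not the actual text_substr()
source; n_chars here is a hypothetical character count):

    /* With no per-character length information, every character in an
     * n_chars-long result must be assumed to occupy the encoding's
     * maximum width (maxmblen). */
    int     max_len = pg_database_encoding_max_length();
    char   *buf = palloc(n_chars * max_len + 1);

Any increase in maxmblen inflates that worst case proportionally, which
is the performance concern.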

It looks to me like supporting 4-byte UTF-8 characters would be enough
to handle the existing range of Unicode codepoints, and that is probably
as much as we want to do.
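
For reference, the 4-byte form covers the remaining codepoints, U+10000
through U+10FFFF.  A standalone sketch of the encoding for that range
(illustrative only, not backend code):

    #include <stdint.h>

    /* Encode a codepoint in U+10000..U+10FFFF as four UTF-8 bytes.
     * Returns the number of bytes written (always 4 for this range). */
    static int
    utf8_encode_4byte(uint32_t cp, unsigned char *out)
    {
        out[0] = 0xF0 | (cp >> 18);             /* 11110xxx */
        out[1] = 0x80 | ((cp >> 12) & 0x3F);    /* 10xxxxxx */
        out[2] = 0x80 | ((cp >> 6) & 0x3F);     /* 10xxxxxx */
        out[3] = 0x80 | (cp & 0x3F);            /* 10xxxxxx */
        return 4;
    }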

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case
  (see the pg_utf_mblen sketch after this list);
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
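
The pg_utf_mblen change could follow the existing lead-byte dispatch;
a sketch (the bit masks are standard UTF-8, but treat the details as
illustrative rather than a finished patch):

    /* Byte length of the UTF-8 character starting at s, with the
     * 4-byte lead-byte pattern (11110xxx) now recognized. */
    int
    pg_utf_mblen(const unsigned char *s)
    {
        if ((*s & 0x80) == 0)
            return 1;           /* 0xxxxxxx: ASCII */
        else if ((*s & 0xE0) == 0xC0)
            return 2;           /* 110xxxxx */
        else if ((*s & 0xF0) == 0xE0)
            return 3;           /* 1110xxxx */
        else if ((*s & 0xF8) == 0xF0)
            return 4;           /* 11110xxx: the new 4-byte case */
        return 1;               /* invalid lead byte; leave it to the verifier */
    }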

Are there any other places that would have to change?  Would this break
anything?  The testing aspect is what's bothering me at the moment.

            regards, tom lane
