Thread: Patch for bug #12845 (GB18030 encoding)

Patch for bug #12845 (GB18030 encoding)

From: Arjen Nienhuis
Hi,

Can someone look at this patch? It should fix bug #12845.

The current tests for conversions are very minimal. I expanded them a
bit for this bug.

I think the binary search in the .map files should be removed but I
leave that for another patch.

Attachment

Re: Patch for bug #12845 (GB18030 encoding)

From: Robert Haas
On Tue, May 5, 2015 at 9:04 AM, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
> Can someone look at this patch. It should fix bug #12845.
>
> The current tests for conversions are very minimal. I expanded them a
> bit for this bug.
>
> I think the binary search in the .map files should be removed but I
> leave that for another patch.

Please add this patch to
https://commitfest.postgresql.org/action/commitfest_view/open so we
don't forget about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Patch for bug #12845 (GB18030 encoding)

From: Alvaro Herrera
Robert Haas wrote:
> On Tue, May 5, 2015 at 9:04 AM, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
> > Can someone look at this patch. It should fix bug #12845.
> >
> > The current tests for conversions are very minimal. I expanded them a
> > bit for this bug.
> >
> > I think the binary search in the .map files should be removed but I
> > leave that for another patch.
> 
> Please add this patch to
> https://commitfest.postgresql.org/action/commitfest_view/open so we
> don't forget about it.

If we think this is a bug fix, we should add it to the open items list,
https://wiki.postgresql.org/wiki/Open_Items

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Patch for bug #12845 (GB18030 encoding)

From: Robert Haas
On Wed, May 6, 2015 at 10:55 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> On Tue, May 5, 2015 at 9:04 AM, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
>> > Can someone look at this patch. It should fix bug #12845.
>> >
>> > The current tests for conversions are very minimal. I expanded them a
>> > bit for this bug.
>> >
>> > I think the binary search in the .map files should be removed but I
>> > leave that for another patch.
>>
>> Please add this patch to
>> https://commitfest.postgresql.org/action/commitfest_view/open so we
>> don't forget about it.
>
> If we think this is a bug fix, we should add it to the open items list,
> https://wiki.postgresql.org/wiki/Open_Items

It's a behavior change, so I don't think we would consider a back-patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Patch for bug #12845 (GB18030 encoding)

From: Alvaro Herrera
Robert Haas wrote:
> On Wed, May 6, 2015 at 10:55 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Robert Haas wrote:
> >> On Tue, May 5, 2015 at 9:04 AM, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
> >> > Can someone look at this patch. It should fix bug #12845.
> >> >
> >> > The current tests for conversions are very minimal. I expanded them a
> >> > bit for this bug.
> >> >
> >> > I think the binary search in the .map files should be removed but I
> >> > leave that for another patch.
> >>
> >> Please add this patch to
> >> https://commitfest.postgresql.org/action/commitfest_view/open so we
> >> don't forget about it.
> >
> > If we think this is a bug fix, we should add it to the open items list,
> > https://wiki.postgresql.org/wiki/Open_Items
> 
> It's a behavior change, so I don't think we would consider a back-patch.

Maybe not, but at the very least we should consider getting it fixed in
9.5 rather than waiting a full development cycle.  Same as in
https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us
I'm not saying we MUST include it in 9.5, but we should at least
consider it.  If we simply stash it in the open CF we guarantee that it
will linger there for a year.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Patch for bug #12845 (GB18030 encoding)

From: Robert Haas
On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> It's a behavior change, so I don't think we would consider a back-patch.
>
> Maybe not, but at the very least we should consider getting it fixed in
> 9.5 rather than waiting a full development cycle.  Same as in
> https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us
> I'm not saying we MUST include it in 9.5, but we should at least
> consider it.  If we simply stash it in the open CF we guarantee that it
> will linger there for a year.

Sure, if somebody has the time to put into it now, I'm fine with that.
I'm afraid it won't be me, though: even if I had the time, I don't
know enough about encodings.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Patch for bug #12845 (GB18030 encoding)

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Maybe not, but at the very least we should consider getting it fixed in
>> 9.5 rather than waiting a full development cycle.  Same as in
>> https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us
>> I'm not saying we MUST include it in 9.5, but we should at least
>> consider it.  If we simply stash it in the open CF we guarantee that it
>> will linger there for a year.

> Sure, if somebody has the time to put into it now, I'm fine with that.
> I'm afraid it won't be me, though: even if I had the time, I don't
> know enough about encodings.

I concur that we should at least consider this patch for 9.5.  I've
added it to
https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

I'm willing to look at it myself, whenever my non-copious spare time
permits; but that won't be in the immediate future.
        regards, tom lane



Re: Patch for bug #12845 (GB18030 encoding)

From: Tom Lane
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>> Maybe not, but at the very least we should consider getting it fixed in
>>> 9.5 rather than waiting a full development cycle.  Same as in
>>> https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us
>>> I'm not saying we MUST include it in 9.5, but we should at least
>>> consider it.  If we simply stash it in the open CF we guarantee that it
>>> will linger there for a year.

>> Sure, if somebody has the time to put into it now, I'm fine with that.
>> I'm afraid it won't be me, though: even if I had the time, I don't
>> know enough about encodings.

> I concur that we should at least consider this patch for 9.5.  I've
> added it to
> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

I looked at this patch a bit, and read up on GB18030 (thank you
wikipedia).  I concur we have a problem to fix.  I do not like the way
this patch went about it though, ie copying-and-pasting LocalToUtf and
UtfToLocal and their supporting routines into utf8_and_gb18030.c.
Aside from being duplicative, this means the improved mapping capability
isn't available to use with anything except GB18030.  (I do not know
whether there are any linear mapping ranges in other encodings, but
seeing that the Unicode crowd went to the trouble of defining a notation
for it in http://www.unicode.org/reports/tr22/, I'm betting there are.)

What I think would be a better solution, if slightly more invasive,
is to extend LocalToUtf and UtfToLocal to add a callback function
argument for a function of signature "uint32 translate(uint32)".
This function, if provided, would be called after failing to find a
mapping in the mapping table(s), and it could implement any translation
that would be better handled by code than as a boatload of mapping-table
entries.  If it returns zero then it doesn't know a translation either,
so throw error as before.
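
A minimal sketch of that flow, with hypothetical names and a deliberately
simplified one-character interface (this is not the actual UtfToLocal code):

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t uint32;

/* mapping-table entry: UTF-8 bytes and local bytes each packed into a uint32 */
typedef struct
{
    uint32      utf;
    uint32      code;
} pg_utf_to_local;

/* the proposed callback: algorithmic fallback, returns 0 for "don't know" */
typedef uint32 (*utf_translate_func) (uint32 code);

static int
compare_utf(const void *key, const void *entry)
{
    uint32      v1 = *(const uint32 *) key;
    uint32      v2 = ((const pg_utf_to_local *) entry)->utf;

    return (v1 > v2) - (v1 < v2);
}

/* returns 0 if neither the table nor the callback knows the character */
static uint32
utf_to_local_one(uint32 utf, const pg_utf_to_local *map, size_t mapsize,
                 utf_translate_func translate)
{
    const pg_utf_to_local *found;

    /* 1. binary-search the mapping table, exactly as the code does today */
    found = bsearch(&utf, map, mapsize, sizeof(pg_utf_to_local), compare_utf);
    if (found)
        return found->code;

    /* 2. only then consult the per-encoding algorithmic translation */
    if (translate)
        return translate(utf);

    return 0;                    /* caller raises the usual conversion error */
}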

An alternative definition that could be proposed would be to call the
function before consulting the mapping tables, not after, on the grounds
that the function can probably exit cheaply if the input's not in a range
that it cares about.  However, consulting the mapping table first wins
if you have ranges that mostly work but contain a few exceptions: put
the exceptions in the mapping table and then the function need not worry
about handling them.

Another alternative approach would be to try to define linear mapping
ranges in a tabular fashion, for more consistency with what's there now.
But that probably wouldn't work terribly well because the bytewise
character representations used in this logic have to be converted into
code points before you can do any sort of linear mapping.  We could
hard-wire that conversion for UTF8, but the conversion in the other code
space would be encoding-specific.  So we might as well just treat the
whole linear mapping behavior as a black box function for each encoding.

I'm also discounting the possibility that someone would want an
algorithmic mapping for cases involving "combined" codes (ie pairs of
UTF8 characters).  Of the encodings we support, only EUC_JIS_2004 and
SHIFT_JIS_2004 need such cases at all, and those have only a handful of
cases; so it doesn't seem popular enough to justify the extra complexity.

I also notice that pg_gb18030_verifier isn't even close to strict enough;
it basically relies on pg_gb18030_mblen which contains no checks
whatsoever on the third and fourth bytes.  So that needs to be fixed.
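
For reference, these are the byte ranges a stricter check would have to
enforce, per the GB18030 definition (a sketch with a hypothetical helper,
not the existing pg_gb18030_verifier/pg_gb18030_mblen code):

/*
 * Length of one valid GB18030 character starting at s, or -1 if the
 * sequence is invalid or truncated.
 */
static int
gb18030_char_len(const unsigned char *s, int remaining)
{
    if (remaining >= 1 && s[0] <= 0x7F)
        return 1;                           /* single-byte (ASCII) */
    if (remaining >= 2 && s[0] >= 0x81 && s[0] <= 0xFE)
    {
        if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F)
            return 2;                       /* two-byte character */
        if (remaining >= 4 &&
            s[1] >= 0x30 && s[1] <= 0x39 &&
            s[2] >= 0x81 && s[2] <= 0xFE &&
            s[3] >= 0x30 && s[3] <= 0x39)
            return 4;                       /* four-byte character */
    }
    return -1;                              /* invalid sequence */
}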

The verification tightening would definitely not be something to
back-patch, and I'm inclined to think that the additional mapping
capability shouldn't be either, in view of the facts that (a) we've
had few if any field complaints yet, and (b) changing the signatures
of LocalToUtf/UtfToLocal might possibly break third-party code.
So I'm seeing this as a HEAD-only patch, but I do want to try to
squeeze it into 9.5 rather than wait another year.

Barring objections, I'll go make this happen.
        regards, tom lane



Re: Patch for bug #12845 (GB18030 encoding)

From: Arjen Nienhuis
On Thu, May 14, 2015 at 11:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera
>>> <alvherre@2ndquadrant.com> wrote:
>>>> Maybe not, but at the very least we should consider getting it fixed in
>>>> 9.5 rather than waiting a full development cycle.  Same as in
>>>> https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us
>>>> I'm not saying we MUST include it in 9.5, but we should at least
>>>> consider it.  If we simply stash it in the open CF we guarantee that it
>>>> will linger there for a year.
>
>>> Sure, if somebody has the time to put into it now, I'm fine with that.
>>> I'm afraid it won't be me, though: even if I had the time, I don't
>>> know enough about encodings.
>
>> I concur that we should at least consider this patch for 9.5.  I've
>> added it to
>> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items
>
> I looked at this patch a bit, and read up on GB18030 (thank you
> wikipedia).  I concur we have a problem to fix.  I do not like the way
> this patch went about it though, ie copying-and-pasting LocalToUtf and
> UtfToLocal and their supporting routines into utf8_and_gb18030.c.
> Aside from being duplicative, this means the improved mapping capability
> isn't available to use with anything except GB18030.  (I do not know
> whether there are any linear mapping ranges in other encodings, but
> seeing that the Unicode crowd went to the trouble of defining a notation
> for it in http://www.unicode.org/reports/tr22/, I'm betting there are.)
>
> What I think would be a better solution, if slightly more invasive,
> is to extend LocalToUtf and UtfToLocal to add a callback function
> argument for a function of signature "uint32 translate(uint32)".
> This function, if provided, would be called after failing to find a
> mapping in the mapping table(s), and it could implement any translation
> that would be better handled by code than as a boatload of mapping-table
> entries.  If it returns zero then it doesn't know a translation either,
> so throw error as before.
>
> An alternative definition that could be proposed would be to call the
> function before consulting the mapping tables, not after, on the grounds
> that the function can probably exit cheaply if the input's not in a range
> that it cares about.  However, consulting the mapping table first wins
> if you have ranges that mostly work but contain a few exceptions: put
> the exceptions in the mapping table and then the function need not worry
> about handling them.
>
> Another alternative approach would be to try to define linear mapping
> ranges in a tabular fashion, for more consistency with what's there now.
> But that probably wouldn't work terribly well because the bytewise
> character representations used in this logic have to be converted into
> code points before you can do any sort of linear mapping.  We could
> hard-wire that conversion for UTF8, but the conversion in the other code
> space would be encoding-specific.  So we might as well just treat the
> whole linear mapping behavior as a black box function for each encoding.
>
> I'm also discounting the possibility that someone would want an
> algorithmic mapping for cases involving "combined" codes (ie pairs of
> UTF8 characters).  Of the encodings we support, only EUC_JIS_2004 and
> SHIFT_JIS_2004 need such cases at all, and those have only a handful of
> cases; so it doesn't seem popular enough to justify the extra complexity.
>
> I also notice that pg_gb18030_verifier isn't even close to strict enough;
> it basically relies on pg_gb18030_mblen which contains no checks
> whatsoever on the third and fourth bytes.  So that needs to be fixed.
>
> The verification tightening would definitely not be something to
> back-patch, and I'm inclined to think that the additional mapping
> capability shouldn't be either, in view of the facts that (a) we've
> had few if any field complaints yet, and (b) changing the signatures
> of LocalToUtf/UtfToLocal might possibly break third-party code.
> So I'm seeing this as a HEAD-only patch, but I do want to try to
> squeeze it into 9.5 rather than wait another year.
>
> Barring objections, I'll go make this happen.

GB18030 is a special case, because it's a full mapping of all unicode
characters, and most of it is algorithmically defined. This makes
UtfToLocal a bad choice to implement it. UtfToLocal assumes a sparse
array with only the defined characters. It uses binary search to find
a character. The 2 tables it uses now are huge (the .so file is 1MB).
Adding the rest of the valid characters to this scheme is possible,
but would make the problem worse.

I think fixing UtfToLocal only for the new characters is not optimal.

I think the best solution is to get rid of UtfToLocal for GB18030. Use
a specialized algorithm:
- For characters > U+FFFF use the algorithm from my patch (a sketch of
that arithmetic follows the example tables below)
- For characters <= U+FFFF use special mapping tables to map from/to
UTF32. Those tables would be smaller, and the code would be faster (I
assume).

For example (256 KB):

unsigned int utf32_to_gb18030[65536] = {
        /* 0x0 */ 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
        /* 0x8 */ 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf,
--
        /* 0xdb08 */ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
--
        /* 0xfff0 */ 0x8431a334, 0x8431a335, 0x8431a336, 0x8431a337,
0x8431a338, 0x8431a339, 0x8431a430, 0x8431a431,
        /* 0xfff8 */ 0x8431a432, 0x8431a433, 0x8431a434, 0x8431a435,
0x8431a436, 0x8431a437, 0x8431a438, 0x8431a439
};

Instead of (500KB):

static pg_utf_to_local ULmapGB18030[ 63360 ] = {
  {0xc280, 0x81308130},
  {0xc281, 0x81308131},
--
  {0xefbfbe, 0x8431a438},
  {0xefbfbf, 0x8431a439}
};
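
For the first case (characters above U+FFFF), no table is needed at all:
the four-byte GB18030 codes b1 b2 b3 b4 (b1, b3 in 0x81-0xFE; b2, b4 in
0x30-0x39) form a linear space, and U+10000 maps to 0x90308130 with
everything above it following linearly. A sketch of that arithmetic (not
necessarily the exact code in the attached patch):

static unsigned int
high_codepoint_to_gb18030(unsigned int ucs)     /* U+10000 .. U+10FFFF */
{
    /* linearized value of 0x90308130 is (0x90 - 0x81) * 10 * 126 * 10 = 189000 */
    unsigned int linear = 189000 + (ucs - 0x10000);
    unsigned int b4 = 0x30 + linear % 10;
    unsigned int b3, b2, b1;

    linear /= 10;
    b3 = 0x81 + linear % 126;
    linear /= 126;
    b2 = 0x30 + linear % 10;
    linear /= 10;
    b1 = 0x81 + linear;
    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

static unsigned int
gb18030_to_high_codepoint(unsigned int gb)      /* 0x90308130 .. 0xE3329A35 */
{
    unsigned int b1 = (gb >> 24) & 0xFF;
    unsigned int b2 = (gb >> 16) & 0xFF;
    unsigned int b3 = (gb >> 8) & 0xFF;
    unsigned int b4 = gb & 0xFF;
    unsigned int linear = (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126
                           + (b3 - 0x81)) * 10 + (b4 - 0x30);

    return 0x10000 + (linear - 189000);
}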


See the attachment for a python script to generate those mappings.

Gr. Arjen

Attachment

Re: Patch for bug #12845 (GB18030 encoding)

From: Tom Lane
Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
> GB18030 is a special case, because it's a full mapping of all unicode
> characters, and most of it is algorithmically defined.

True.

> This makes UtfToLocal a bad choice to implement it.

I disagree with that conclusion.  There are still 30000+ characters
that need to be translated via lookup table, so we still need either
UtfToLocal or a clone of it; and as I said previously, I'm not on board
with cloning it.

> I think the best solution is to get rid of UtfToLocal for GB18030. Use
> a specialized algorithm:
> - For characters > U+FFFF use the algorithm from my patch
> - For characters <= U+FFFF use special mapping tables to map from/to
> UTF32. Those tables would be smaller, and the code would be faster (I
> assume).

I looked at what Wikipedia claims is the authoritative conversion table:

http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

According to that, about half of the characters below U+FFFF can be
processed via linear conversions, so I think we ought to save table
space by doing that.  However, the remaining stuff that has to be
processed by lookup still contains a pretty substantial number of
characters that map to 4-byte GB18030 characters, so I don't think
we can get any table size savings by adopting a bespoke table format.
We might as well use UtfToLocal.  (Worth noting in this connection
is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
table entries for other encodings, even though most of the others
are not concerned with characters outside the BMP.)
        regards, tom lane



Re: Patch for bug #12845 (GB18030 encoding)

From: Arjen Nienhuis
On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
>> GB18030 is a special case, because it's a full mapping of all unicode
>> characters, and most of it is algorithmically defined.
>
> True.
>
>> This makes UtfToLocal a bad choice to implement it.
>
> I disagree with that conclusion.  There are still 30000+ characters
> that need to be translated via lookup table, so we still need either
> UtfToLocal or a clone of it; and as I said previously, I'm not on board
> with cloning it.
>
>> I think the best solution is to get rid of UtfToLocal for GB18030. Use
>> a specialized algorithm:
>> - For characters > U+FFFF use the algorithm from my patch
>> - For characters <= U+FFFF use special mapping tables to map from/to
>> UTF32. Those tables would be smaller, and the code would be faster (I
>> assume).
>
> I looked at what Wikipedia claims is the authoritative conversion table:
>
> http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
>
> According to that, about half of the characters below U+FFFF can be
> processed via linear conversions, so I think we ought to save table
> space by doing that.  However, the remaining stuff that has to be
> processed by lookup still contains a pretty substantial number of
> characters that map to 4-byte GB18030 characters, so I don't think
> we can get any table size savings by adopting a bespoke table format.
> We might as well use UtfToLocal.  (Worth noting in this connection
> is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
> table entries for other encodings, even though most of the others
> are not concerned with characters outside the BMP.)
>

It's not about 4 vs 2 bytes, it's about using 8 bytes vs 4. UtfToLocal
uses a sparse array:

map = {{0, x}, {1, y}, {2, z}, ...}

v.s.

map = {x, y, z, ...}

That's fine when not every code point is used, but it's different for
GB18030 where almost all code points are used. Using a plain array
saves space and saves a binary search.

Gr. Arjen



Re: Patch for bug #12845 (GB18030 encoding)

From: Tom Lane
Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
> On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> According to that, about half of the characters below U+FFFF can be
>> processed via linear conversions, so I think we ought to save table
>> space by doing that.  However, the remaining stuff that has to be
>> processed by lookup still contains a pretty substantial number of
>> characters that map to 4-byte GB18030 characters, so I don't think
>> we can get any table size savings by adopting a bespoke table format.
>> We might as well use UtfToLocal.  (Worth noting in this connection
>> is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
>> table entries for other encodings, even though most of the others
>> are not concerned with characters outside the BMP.)

> It's not about 4 vs 2 bytes, it's about using 8 bytes vs 4. UtfToLocal
> uses a sparse array:

> map = {{0, x}, {1, y}, {2, z}, ...}

> v.s.

> map = {x, y, z, ...}

> That's fine when not every code point is used, but it's different for
> GB18030 where almost all code points are used. Using a plain array
> saves space and saves a binary search.

Well, it doesn't save any space: if we get rid of the additional linear
ranges in the lookup table, what remains is 30733 entries requiring about
256K, same as (or a bit less than) what you suggest.

The point about possibly being able to do this with a simple lookup table
instead of binary search is valid, but I still say it's a mistake to
suppose that we should consider that only for GB18030.  With the reduced
table size, the GB18030 conversion tables are not all that far out of line
with the other Far Eastern conversions:

$ size utf8*.so | sort -n
   text    data     bss     dec     hex filename
   1880     512      16    2408     968 utf8_and_ascii.so
   2394     528      16    2938     b7a utf8_and_iso8859_1.so
   6674     512      16    7202    1c22 utf8_and_cyrillic.so
  24318     904      16   25238    6296 utf8_and_win.so
  28750     968      16   29734    7426 utf8_and_iso8859.so
 121110     512      16  121638   1db26 utf8_and_euc_cn.so
 123458     512      16  123986   1e452 utf8_and_sjis.so
 133606     512      16  134134   20bf6 utf8_and_euc_kr.so
 185014     512      16  185542   2d4c6 utf8_and_sjis2004.so
 185522     512      16  186050   2d6c2 utf8_and_euc2004.so
 212950     512      16  213478   341e6 utf8_and_euc_jp.so
 221394     512      16  221922   362e2 utf8_and_big5.so
 274772     512      16  275300   43364 utf8_and_johab.so
 277776     512      16  278304   43f20 utf8_and_uhc.so
 332262     512      16  332790   513f6 utf8_and_euc_tw.so
 350640     512      16  351168   55bc0 utf8_and_gbk.so
 496680     512      16  497208   79638 utf8_and_gb18030.so

If we were to get excited about reducing the conversion time for GB18030,
it would clearly make sense to use similar infrastructure for GBK, and
perhaps the EUC encodings too.

However, I'm not that excited about changing it.  We have not heard field
complaints about these converters being too slow.  What's more, there
doesn't seem to be any practical way to apply the same idea to the other
conversion direction, which means if you do feel there's a speed problem
this would only halfway fix it.

So my feeling is that the most practical and maintainable answer is to
keep GB18030 using code that is mostly shared with the other encodings.
I've committed a fix that does it that way for 9.5.  If you want to
pursue the idea of a faster conversion using direct lookup tables,
I think that would be 9.6 material at this point.
        regards, tom lane



Re: Patch for bug #12845 (GB18030 encoding)

From: Robert Haas
On Fri, May 15, 2015 at 3:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, I'm not that excited about changing it.  We have not heard field
> complaints about these converters being too slow.  What's more, there
> doesn't seem to be any practical way to apply the same idea to the other
> conversion direction, which means if you do feel there's a speed problem
> this would only halfway fix it.

Half a loaf is better than none.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Patch for bug #12845 (GB18030 encoding)

From: Arjen Nienhuis
>> That's fine when not every code point is used, but it's different for
>> GB18030 where almost all code points are used. Using a plain array
>> saves space and saves a binary search.
>
> Well, it doesn't save any space: if we get rid of the additional linear
> ranges in the lookup table, what remains is 30733 entries requiring about
> 256K, same as (or a bit less than) what you suggest.

We could do both. What about something like this:

static unsigned int utf32_to_gb18030_from_0x0001[1105] = {
/* 0x0 */ 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
...
static unsigned int utf32_to_gb18030_from_0x2010[1587] = {
/* 0x0 */ 0xa95c, 0x8136a532, 0x8136a533, 0xa843, 0xa1aa, 0xa844,
0xa1ac, 0x8136a534,
...
static unsigned int utf32_to_gb18030_from_0x2E81[28965] = {
/* 0x0 */ 0xfe50, 0x8138fd39, 0x8138fe30, 0xfe54, 0x8138fe31,
0x8138fe32, 0x8138fe33, 0xfe57,
...
static unsigned int utf32_to_gb18030_from_0xE000[2149] = {
/* 0x0 */ 0xaaa1, 0xaaa2, 0xaaa3, 0xaaa4, 0xaaa5, 0xaaa6, 0xaaa7, 0xaaa8,
...
static unsigned int utf32_to_gb18030_from_0xF92C[254] = {
/* 0x0 */ 0xfd9c, 0x84308535, 0x84308536, 0x84308537, 0x84308538,
0x84308539, 0x84308630, 0x84308631,
...
static unsigned int utf32_to_gb18030_from_0xFE30[464] = {
/* 0x0 */ 0xa955, 0xa6f2, 0x84318538, 0xa6f4, 0xa6f5, 0xa6e0, 0xa6e1, 0xa6f0,
...

static uint32
conv_utf8_to_18030(uint32 code)
{
    uint32      ucs = utf8word_to_unicode(code);

#define conv_lin(minunicode, maxunicode, mincode) \
    if (ucs >= minunicode && ucs <= maxunicode) \
        return gb_unlinear(ucs - minunicode + gb_linear(mincode))

#define conv_array(minunicode, maxunicode) \
    if (ucs >= minunicode && ucs <= maxunicode) \
        return utf32_to_gb18030_from_##minunicode[ucs - minunicode];

    conv_array(0x0001, 0x0452);
    conv_lin(0x0452, 0x200F, 0x8130D330);
    conv_array(0x2010, 0x2643);
    conv_lin(0x2643, 0x2E80, 0x8137A839);
    conv_array(0x2E81, 0x9FA6);
    conv_lin(0x9FA6, 0xD7FF, 0x82358F33);
    conv_array(0xE000, 0xE865);
    conv_lin(0xE865, 0xF92B, 0x8336D030);
    conv_array(0xF92C, 0xFA2A);
    conv_lin(0xFA2A, 0xFE2F, 0x84309C38);
    conv_array(0xFE30, 0x10000);
    conv_lin(0x10000, 0x10FFFF, 0x90308130);
    /* No mapping exists */
    return 0;
}

>
> The point about possibly being able to do this with a simple lookup table
> instead of binary search is valid, but I still say it's a mistake to
> suppose that we should consider that only for GB18030.  With the reduced
> table size, the GB18030 conversion tables are not all that far out of line
> with the other Far Eastern conversions:
>
> $ size utf8*.so | sort -n
>    text    data     bss     dec     hex filename
>    1880     512      16    2408     968 utf8_and_ascii.so
>    2394     528      16    2938     b7a utf8_and_iso8859_1.so
>    6674     512      16    7202    1c22 utf8_and_cyrillic.so
>   24318     904      16   25238    6296 utf8_and_win.so
>   28750     968      16   29734    7426 utf8_and_iso8859.so
>  121110     512      16  121638   1db26 utf8_and_euc_cn.so
>  123458     512      16  123986   1e452 utf8_and_sjis.so
>  133606     512      16  134134   20bf6 utf8_and_euc_kr.so
>  185014     512      16  185542   2d4c6 utf8_and_sjis2004.so
>  185522     512      16  186050   2d6c2 utf8_and_euc2004.so
>  212950     512      16  213478   341e6 utf8_and_euc_jp.so
>  221394     512      16  221922   362e2 utf8_and_big5.so
>  274772     512      16  275300   43364 utf8_and_johab.so
>  277776     512      16  278304   43f20 utf8_and_uhc.so
>  332262     512      16  332790   513f6 utf8_and_euc_tw.so
>  350640     512      16  351168   55bc0 utf8_and_gbk.so
>  496680     512      16  497208   79638 utf8_and_gb18030.so
>
> If we were to get excited about reducing the conversion time for GB18030,
> it would clearly make sense to use similar infrastructure for GBK, and
> perhaps the EUC encodings too.

I'll check them as well. If they have linear ranges it should work.

>
> However, I'm not that excited about changing it.  We have not heard field
> complaints about these converters being too slow.  What's more, there
> doesn't seem to be any practical way to apply the same idea to the other
> conversion direction, which means if you do feel there's a speed problem
> this would only halfway fix it.

It does work if you linearize it first. That's why we need to convert
to utf32 first as well. That's a form of linearization.
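
"Linearizing" here would mean turning a GB18030 code into a dense index,
the same way converting UTF-8 to UTF-32 does for the other direction. A
hypothetical sketch (two-byte and four-byte codes each get a contiguous
range; this is not code from the patch):

static unsigned int
gb_linear(unsigned int gb)
{
    unsigned int b1 = (gb >> 24) & 0xFF;
    unsigned int b2 = (gb >> 16) & 0xFF;
    unsigned int b3 = (gb >> 8) & 0xFF;
    unsigned int b4 = gb & 0xFF;

    if (b1 == 0 && b2 == 0)
        /* two-byte code: second byte 0x40..0xFE gives 126 * 191 slots
         * (0x7F is unused, leaving a small hole; harmless for indexing) */
        return (b3 - 0x81) * 191 + (b4 - 0x40);

    /* four-byte code: 126 * 10 * 126 * 10 slots, placed after the two-byte range */
    return 126 * 191 +
        (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30);
}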