Re: Patch for bug #12845 (GB18030 encoding) - Mailing list pgsql-hackers

From Arjen Nienhuis
Subject Re: Patch for bug #12845 (GB18030 encoding)
Msg-id CAG6W84JJ5jmgPFSqgSufO1XbRjScH4kBrzmj50xHSd_ZaCMh4A@mail.gmail.com
In response to Re: Patch for bug #12845 (GB18030 encoding)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
>> That's fine when not every code point is used, but it's different for
>> GB18030 where almost all code points are used. Using a plain array
>> saves space and saves a binary search.
>
> Well, it doesn't save any space: if we get rid of the additional linear
> ranges in the lookup table, what remains is 30733 entries requiring about
> 256K, same as (or a bit less than) what you suggest.

We could do both: the six arrays below hold 1105 + 1587 + 28965 + 2149 + 254 + 464 = 34524 entries, about 135kB at four bytes each, and every lookup becomes a direct index instead of a binary search. What about something like this:

static unsigned int utf32_to_gb18030_from_0x0001[1105] = {
/* 0x0 */ 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
...
static unsigned int utf32_to_gb18030_from_0x2010[1587] = {
/* 0x0 */ 0xa95c, 0x8136a532, 0x8136a533, 0xa843, 0xa1aa, 0xa844,
0xa1ac, 0x8136a534,
...
static unsigned int utf32_to_gb18030_from_0x2E81[28965] = {
/* 0x0 */ 0xfe50, 0x8138fd39, 0x8138fe30, 0xfe54, 0x8138fe31,
0x8138fe32, 0x8138fe33, 0xfe57,
...
static unsigned int utf32_to_gb18030_from_0xE000[2149] = {
/* 0x0 */ 0xaaa1, 0xaaa2, 0xaaa3, 0xaaa4, 0xaaa5, 0xaaa6, 0xaaa7, 0xaaa8,
...
static unsigned int utf32_to_gb18030_from_0xF92C[254] = {
/* 0x0 */ 0xfd9c, 0x84308535, 0x84308536, 0x84308537, 0x84308538,
0x84308539, 0x84308630, 0x84308631,
...
static unsigned int utf32_to_gb18030_from_0xFE30[464] = {
/* 0x0 */ 0xa955, 0xa6f2, 0x84318538, 0xa6f4, 0xa6f5, 0xa6e0, 0xa6e1, 0xa6f0,
...

static uint32
conv_utf8_to_18030(uint32 code)
{
    uint32      ucs = utf8word_to_unicode(code);

/* Arithmetic mapping for a linear range; maxunicode is inclusive */
#define conv_lin(minunicode, maxunicode, mincode) \
    if (ucs >= minunicode && ucs <= maxunicode) \
        return gb_unlinear(ucs - minunicode + gb_linear(mincode))

/*
 * Direct lookup for a dense range; maxunicode is exclusive, matching
 * the array sizes above (e.g. 0x0452 - 0x0001 = 1105 entries)
 */
#define conv_array(minunicode, maxunicode) \
    if (ucs >= minunicode && ucs < maxunicode) \
        return utf32_to_gb18030_from_##minunicode[ucs - minunicode]

    conv_array(0x0001, 0x0452);
    conv_lin(0x0452, 0x200F, 0x8130D330);
    conv_array(0x2010, 0x2643);
    conv_lin(0x2643, 0x2E80, 0x8137A839);
    conv_array(0x2E81, 0x9FA6);
    conv_lin(0x9FA6, 0xD7FF, 0x82358F33);
    conv_array(0xE000, 0xE865);
    conv_lin(0xE865, 0xF92B, 0x8336D030);
    conv_array(0xF92C, 0xFA2A);
    conv_lin(0xFA2A, 0xFE2F, 0x84309C38);
    conv_array(0xFE30, 0x10000);
    conv_lin(0x10000, 0x10FFFF, 0x90308130);

    /* No mapping exists */
    return 0;
}
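
For reference, the gb_linear/gb_unlinear and utf8word_to_unicode helpers
are assumed above but not spelled out. A minimal standalone sketch of what
they could look like (names and details are illustrative, not from the
patch, and the pair only handles the four-byte sequences that the linear
ranges use):

/* Standalone sketch; PostgreSQL's c.h normally provides uint32 */
typedef unsigned int uint32;

/*
 * Treat a four-byte GB18030 sequence b1 b2 b3 b4 as a mixed-radix
 * number: b1 and b3 run over 0x81..0xFE (126 values), b2 and b4 over
 * 0x30..0x39 (10 values).  Consecutive code points map to consecutive
 * integers, which is what makes conv_lin's arithmetic work.
 */
static uint32
gb_linear(uint32 gb)
{
    uint32      b1 = (gb >> 24) & 0xFF;
    uint32      b2 = (gb >> 16) & 0xFF;
    uint32      b3 = (gb >> 8) & 0xFF;
    uint32      b4 = gb & 0xFF;
    uint32      lin;

    lin = b1 - 0x81;
    lin = lin * 10 + (b2 - 0x30);
    lin = lin * 126 + (b3 - 0x81);
    lin = lin * 10 + (b4 - 0x30);
    return lin;
}

/* Inverse of gb_linear: rebuild the four bytes from the linear value */
static uint32
gb_unlinear(uint32 lin)
{
    uint32      b4 = 0x30 + lin % 10;
    uint32      b3 = 0x81 + (lin / 10) % 126;
    uint32      b2 = 0x30 + (lin / 1260) % 10;
    uint32      b1 = 0x81 + lin / 12600;

    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

/*
 * Unpack a UTF-8 sequence stored one octet per byte in a uint32
 * (e.g. 0xE4B880 for U+4E00) into a Unicode code point.
 */
static uint32
utf8word_to_unicode(uint32 c)
{
    if (c <= 0x7F)
        return c;
    else if (c <= 0xDFBF)
        return ((c >> 8) & 0x1F) << 6 | (c & 0x3F);
    else if (c <= 0xEFBFBF)
        return ((c >> 16) & 0x0F) << 12 | ((c >> 8) & 0x3F) << 6 |
               (c & 0x3F);
    else
        return ((c >> 24) & 0x07) << 18 | ((c >> 16) & 0x3F) << 12 |
               ((c >> 8) & 0x3F) << 6 | (c & 0x3F);
}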

>
> The point about possibly being able to do this with a simple lookup table
> instead of binary search is valid, but I still say it's a mistake to
> suppose that we should consider that only for GB18030.  With the reduced
> table size, the GB18030 conversion tables are not all that far out of line
> with the other Far Eastern conversions:
>
> $ size utf8*.so | sort -n
>    text    data     bss     dec     hex filename
>    1880     512      16    2408     968 utf8_and_ascii.so
>    2394     528      16    2938     b7a utf8_and_iso8859_1.so
>    6674     512      16    7202    1c22 utf8_and_cyrillic.so
>   24318     904      16   25238    6296 utf8_and_win.so
>   28750     968      16   29734    7426 utf8_and_iso8859.so
>  121110     512      16  121638   1db26 utf8_and_euc_cn.so
>  123458     512      16  123986   1e452 utf8_and_sjis.so
>  133606     512      16  134134   20bf6 utf8_and_euc_kr.so
>  185014     512      16  185542   2d4c6 utf8_and_sjis2004.so
>  185522     512      16  186050   2d6c2 utf8_and_euc2004.so
>  212950     512      16  213478   341e6 utf8_and_euc_jp.so
>  221394     512      16  221922   362e2 utf8_and_big5.so
>  274772     512      16  275300   43364 utf8_and_johab.so
>  277776     512      16  278304   43f20 utf8_and_uhc.so
>  332262     512      16  332790   513f6 utf8_and_euc_tw.so
>  350640     512      16  351168   55bc0 utf8_and_gbk.so
>  496680     512      16  497208   79638 utf8_and_gb18030.so
>
> If we were to get excited about reducing the conversion time for GB18030,
> it would clearly make sense to use similar infrastructure for GBK, and
> perhaps the EUC encodings too.

I'll check them as well. If they have linear ranges, it should work.

>
> However, I'm not that excited about changing it.  We have not heard field
> complaints about these converters being too slow.  What's more, there
> doesn't seem to be any practical way to apply the same idea to the other
> conversion direction, which means if you do feel there's a speed problem
> this would only halfway fix it.

It does work if you linearize it first. That's why we need to convert
to UTF-32 first as well: that's a form of linearization.
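
That is, a range that is linear in Unicode order is also linear in
GB18030 order once linearized. Reusing the conv_lin(0x9FA6, 0xD7FF,
0x82358F33) range from the forward code above, a sketch of one
reverse-direction range check (function name and structure are
illustrative):

/*
 * Sketch of the reverse direction for one linear range: linearize the
 * GB18030 code first, then the same mix of arithmetic ranges and dense
 * lookup arrays applies as in the forward direction.
 */
static uint32
conv_18030_to_utf32(uint32 code)
{
    uint32      lin = gb_linear(code);
    uint32      gmin = gb_linear(0x82358F33);

    if (lin >= gmin && lin <= gmin + (0xD7FF - 0x9FA6))
        return 0x9FA6 + (lin - gmin);
    /* ... the other linear ranges and dense lookup arrays would go
     * here, mirroring the forward direction ... */

    /* No mapping exists */
    return 0;
}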


