Thread: Patch for bug #12845 (GB18030 encoding)
Hi,

Can someone look at this patch? It should fix bug #12845.

The current tests for conversions are very minimal. I expanded them a bit for this bug.

I think the binary search in the .map files should be removed, but I leave that for another patch.
Attachment
On Tue, May 5, 2015 at 9:04 AM, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
> Can someone look at this patch? It should fix bug #12845.
>
> The current tests for conversions are very minimal. I expanded them a bit for this bug.
>
> I think the binary search in the .map files should be removed, but I leave that for another patch.

Please add this patch to https://commitfest.postgresql.org/action/commitfest_view/open so we don't forget about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Tue, May 5, 2015 at 9:04 AM, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
> > Can someone look at this patch? It should fix bug #12845.
> >
> > The current tests for conversions are very minimal. I expanded them a bit for this bug.
> >
> > I think the binary search in the .map files should be removed, but I leave that for another patch.
>
> Please add this patch to https://commitfest.postgresql.org/action/commitfest_view/open so we don't forget about it.

If we think this is a bug fix, we should add it to the open items list, https://wiki.postgresql.org/wiki/Open_Items

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 6, 2015 at 10:55 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> On Tue, May 5, 2015 at 9:04 AM, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
>> > Can someone look at this patch? It should fix bug #12845.
>> >
>> > The current tests for conversions are very minimal. I expanded them a bit for this bug.
>> >
>> > I think the binary search in the .map files should be removed, but I leave that for another patch.
>>
>> Please add this patch to https://commitfest.postgresql.org/action/commitfest_view/open so we don't forget about it.
>
> If we think this is a bug fix, we should add it to the open items list, https://wiki.postgresql.org/wiki/Open_Items

It's a behavior change, so I don't think we would consider a back-patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Wed, May 6, 2015 at 10:55 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Robert Haas wrote:
> >> On Tue, May 5, 2015 at 9:04 AM, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
> >> > Can someone look at this patch? It should fix bug #12845.
> >> >
> >> > The current tests for conversions are very minimal. I expanded them a bit for this bug.
> >> >
> >> > I think the binary search in the .map files should be removed, but I leave that for another patch.
> >>
> >> Please add this patch to https://commitfest.postgresql.org/action/commitfest_view/open so we don't forget about it.
> >
> > If we think this is a bug fix, we should add it to the open items list, https://wiki.postgresql.org/wiki/Open_Items
>
> It's a behavior change, so I don't think we would consider a back-patch.

Maybe not, but at the very least we should consider getting it fixed in 9.5 rather than waiting a full development cycle. Same as in https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us I'm not saying we MUST include it in 9.5, but we should at least consider it. If we simply stash it in the open CF we guarantee that it will linger there for a year.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> It's a behavior change, so I don't think we would consider a back-patch.
>
> Maybe not, but at the very least we should consider getting it fixed in 9.5 rather than waiting a full development cycle. Same as in https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us I'm not saying we MUST include it in 9.5, but we should at least consider it. If we simply stash it in the open CF we guarantee that it will linger there for a year.

Sure, if somebody has the time to put into it now, I'm fine with that. I'm afraid it won't be me, though: even if I had the time, I don't know enough about encodings.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Maybe not, but at the very least we should consider getting it fixed in 9.5 rather than waiting a full development cycle. Same as in https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us I'm not saying we MUST include it in 9.5, but we should at least consider it. If we simply stash it in the open CF we guarantee that it will linger there for a year.

> Sure, if somebody has the time to put into it now, I'm fine with that. I'm afraid it won't be me, though: even if I had the time, I don't know enough about encodings.

I concur that we should at least consider this patch for 9.5. I've added it to https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

I'm willing to look at it myself, whenever my non-copious spare time permits; but that won't be in the immediate future.

			regards, tom lane
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> Maybe not, but at the very least we should consider getting it fixed in 9.5 rather than waiting a full development cycle. Same as in https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us I'm not saying we MUST include it in 9.5, but we should at least consider it. If we simply stash it in the open CF we guarantee that it will linger there for a year.

>> Sure, if somebody has the time to put into it now, I'm fine with that. I'm afraid it won't be me, though: even if I had the time, I don't know enough about encodings.

> I concur that we should at least consider this patch for 9.5. I've added it to https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

I looked at this patch a bit, and read up on GB18030 (thank you Wikipedia). I concur we have a problem to fix. I do not like the way this patch went about it though, ie copying-and-pasting LocalToUtf and UtfToLocal and their supporting routines into utf8_and_gb18030.c. Aside from being duplicative, this means the improved mapping capability isn't available to use with anything except GB18030. (I do not know whether there are any linear mapping ranges in other encodings, but seeing that the Unicode crowd went to the trouble of defining a notation for it in http://www.unicode.org/reports/tr22/, I'm betting there are.)

What I think would be a better solution, if slightly more invasive, is to extend LocalToUtf and UtfToLocal to add a callback function argument for a function of signature "uint32 translate(uint32)". This function, if provided, would be called after failing to find a mapping in the mapping table(s), and it could implement any translation that would be better handled by code than as a boatload of mapping-table entries. If it returns zero then it doesn't know a translation either, so throw error as before.

An alternative definition that could be proposed would be to call the function before consulting the mapping tables, not after, on the grounds that the function can probably exit cheaply if the input's not in a range that it cares about. However, consulting the mapping table first wins if you have ranges that mostly work but contain a few exceptions: put the exceptions in the mapping table and then the function need not worry about handling them.

Another alternative approach would be to try to define linear mapping ranges in a tabular fashion, for more consistency with what's there now. But that probably wouldn't work terribly well because the bytewise character representations used in this logic have to be converted into code points before you can do any sort of linear mapping. We could hard-wire that conversion for UTF8, but the conversion in the other code space would be encoding-specific. So we might as well just treat the whole linear mapping behavior as a black box function for each encoding.

I'm also discounting the possibility that someone would want an algorithmic mapping for cases involving "combined" codes (ie pairs of UTF8 characters). Of the encodings we support, only EUC_JIS_2004 and SHIFT_JIS_2004 need such cases at all, and those have only a handful of cases; so it doesn't seem popular enough to justify the extra complexity.

I also notice that pg_gb18030_verifier isn't even close to strict enough; it basically relies on pg_gb18030_mblen, which contains no checks whatsoever on the third and fourth bytes. So that needs to be fixed.

The verification tightening would definitely not be something to back-patch, and I'm inclined to think that the additional mapping capability shouldn't be either, in view of the facts that (a) we've had few if any field complaints yet, and (b) changing the signatures of LocalToUtf/UtfToLocal might possibly break third-party code. So I'm seeing this as a HEAD-only patch, but I do want to try to squeeze it into 9.5 rather than wait another year.

Barring objections, I'll go make this happen.

			regards, tom lane
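[For illustration, here is a rough sketch of the kind of callback being proposed above, specialized for GB18030's algorithmic range above the BMP. The typedef and function names are hypothetical and the committed signature may well differ; whether the hook receives packed UTF-8 bytes or a decoded code point is also left open here, and a decoded code point is assumed. Only the U+10000..U+10FFFF range is shown.]

#include <stdint.h>

/*
 * Hypothetical shape of the proposed hook: UtfToLocal would consult its
 * mapping table(s) first and, only on a miss, call a per-conversion
 * function with this signature.  A zero return means "no algorithmic
 * mapping either", so the caller raises the usual untranslatable-character
 * error, as described in the message above.
 */
typedef uint32_t (*utf_local_translate_func) (uint32_t code);

/*
 * Sketch of such a callback for GB18030, covering only the supplementary
 * plane: U+10000..U+10FFFF map linearly onto the 4-byte sequences
 * 0x90308130..0xE3329A35, returned here packed big-endian into a uint32_t.
 */
static uint32_t
conv_utf8_to_gb18030_special(uint32_t ucs)
{
	if (ucs >= 0x10000 && ucs <= 0x10FFFF)
	{
		uint32_t	lin = ucs - 0x10000;
		uint32_t	b0 = 0x90 + lin / 12600;		/* 0x90..0xE3 */
		uint32_t	b1 = 0x30 + (lin / 1260) % 10;	/* 0x30..0x39 */
		uint32_t	b2 = 0x81 + (lin / 10) % 126;	/* 0x81..0xFE */
		uint32_t	b3 = 0x30 + lin % 10;			/* 0x30..0x39 */

		return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
	}
	return 0;					/* not in an algorithmic range */
}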
On Thu, May 14, 2015 at 11:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>>> Maybe not, but at the very least we should consider getting it fixed in 9.5 rather than waiting a full development cycle. Same as in https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us I'm not saying we MUST include it in 9.5, but we should at least consider it. If we simply stash it in the open CF we guarantee that it will linger there for a year.
>
>>> Sure, if somebody has the time to put into it now, I'm fine with that. I'm afraid it won't be me, though: even if I had the time, I don't know enough about encodings.
>
>> I concur that we should at least consider this patch for 9.5. I've added it to https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items
>
> I looked at this patch a bit, and read up on GB18030 (thank you Wikipedia). I concur we have a problem to fix. I do not like the way this patch went about it though, ie copying-and-pasting LocalToUtf and UtfToLocal and their supporting routines into utf8_and_gb18030.c. Aside from being duplicative, this means the improved mapping capability isn't available to use with anything except GB18030. (I do not know whether there are any linear mapping ranges in other encodings, but seeing that the Unicode crowd went to the trouble of defining a notation for it in http://www.unicode.org/reports/tr22/, I'm betting there are.)
>
> What I think would be a better solution, if slightly more invasive, is to extend LocalToUtf and UtfToLocal to add a callback function argument for a function of signature "uint32 translate(uint32)". This function, if provided, would be called after failing to find a mapping in the mapping table(s), and it could implement any translation that would be better handled by code than as a boatload of mapping-table entries. If it returns zero then it doesn't know a translation either, so throw error as before.
>
> An alternative definition that could be proposed would be to call the function before consulting the mapping tables, not after, on the grounds that the function can probably exit cheaply if the input's not in a range that it cares about. However, consulting the mapping table first wins if you have ranges that mostly work but contain a few exceptions: put the exceptions in the mapping table and then the function need not worry about handling them.
>
> Another alternative approach would be to try to define linear mapping ranges in a tabular fashion, for more consistency with what's there now. But that probably wouldn't work terribly well because the bytewise character representations used in this logic have to be converted into code points before you can do any sort of linear mapping. We could hard-wire that conversion for UTF8, but the conversion in the other code space would be encoding-specific. So we might as well just treat the whole linear mapping behavior as a black box function for each encoding.
>
> I'm also discounting the possibility that someone would want an algorithmic mapping for cases involving "combined" codes (ie pairs of UTF8 characters). Of the encodings we support, only EUC_JIS_2004 and SHIFT_JIS_2004 need such cases at all, and those have only a handful of cases; so it doesn't seem popular enough to justify the extra complexity.
> I also notice that pg_gb18030_verifier isn't even close to strict enough; it basically relies on pg_gb18030_mblen, which contains no checks whatsoever on the third and fourth bytes. So that needs to be fixed.
>
> The verification tightening would definitely not be something to back-patch, and I'm inclined to think that the additional mapping capability shouldn't be either, in view of the facts that (a) we've had few if any field complaints yet, and (b) changing the signatures of LocalToUtf/UtfToLocal might possibly break third-party code. So I'm seeing this as a HEAD-only patch, but I do want to try to squeeze it into 9.5 rather than wait another year.
>
> Barring objections, I'll go make this happen.

GB18030 is a special case, because it's a full mapping of all Unicode characters, and most of it is algorithmically defined. This makes UtfToLocal a bad choice to implement it. UtfToLocal assumes a sparse array with only the defined characters, and it uses binary search to find a character. The two tables it uses now are huge (the .so file is 1 MB). Adding the rest of the valid characters to this scheme is possible, but would make the problem worse. I think fixing UtfToLocal only for the new characters is not optimal.

I think the best solution is to get rid of UtfToLocal for GB18030. Use a specialized algorithm:

- For characters > U+FFFF, use the algorithm from my patch.
- For characters <= U+FFFF, use special mapping tables to map from/to UTF32. Those tables would be smaller, and the code would be faster (I assume).

For example (256 KB):

unsigned int utf32_to_gb18030[65536] = {
	/* 0x0 */ 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
	/* 0x8 */ 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf,
	--
	/* 0xdb08 */ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
	--
	/* 0xfff0 */ 0x8431a334, 0x8431a335, 0x8431a336, 0x8431a337, 0x8431a338, 0x8431a339, 0x8431a430, 0x8431a431,
	/* 0xfff8 */ 0x8431a432, 0x8431a433, 0x8431a434, 0x8431a435, 0x8431a436, 0x8431a437, 0x8431a438, 0x8431a439
};

Instead of (500 KB):

static pg_utf_to_local ULmapGB18030[63360] = {
	{0xc280, 0x81308130},
	{0xc281, 0x81308131},
	--
	{0xefbfbe, 0x8431a438},
	{0xefbfbf, 0x8431a439}
};

See the attachment for a Python script to generate those mappings.

Gr. Arjen
Attachment
Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
> GB18030 is a special case, because it's a full mapping of all Unicode characters, and most of it is algorithmically defined.

True.

> This makes UtfToLocal a bad choice to implement it.

I disagree with that conclusion. There are still 30000+ characters that need to be translated via lookup table, so we still need either UtfToLocal or a clone of it; and as I said previously, I'm not on board with cloning it.

> I think the best solution is to get rid of UtfToLocal for GB18030. Use a specialized algorithm:
> - For characters > U+FFFF, use the algorithm from my patch.
> - For characters <= U+FFFF, use special mapping tables to map from/to UTF32. Those tables would be smaller, and the code would be faster (I assume).

I looked at what Wikipedia claims is the authoritative conversion table:

http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

According to that, about half of the characters below U+FFFF can be processed via linear conversions, so I think we ought to save table space by doing that. However, the remaining stuff that has to be processed by lookup still contains a pretty substantial number of characters that map to 4-byte GB18030 characters, so I don't think we can get any table size savings by adopting a bespoke table format. We might as well use UtfToLocal. (Worth noting in this connection is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte table entries for other encodings, even though most of the others are not concerned with characters outside the BMP.)

			regards, tom lane
On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
>> GB18030 is a special case, because it's a full mapping of all Unicode characters, and most of it is algorithmically defined.
>
> True.
>
>> This makes UtfToLocal a bad choice to implement it.
>
> I disagree with that conclusion. There are still 30000+ characters that need to be translated via lookup table, so we still need either UtfToLocal or a clone of it; and as I said previously, I'm not on board with cloning it.
>
>> I think the best solution is to get rid of UtfToLocal for GB18030. Use a specialized algorithm:
>> - For characters > U+FFFF, use the algorithm from my patch.
>> - For characters <= U+FFFF, use special mapping tables to map from/to UTF32. Those tables would be smaller, and the code would be faster (I assume).
>
> I looked at what Wikipedia claims is the authoritative conversion table:
>
> http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
>
> According to that, about half of the characters below U+FFFF can be processed via linear conversions, so I think we ought to save table space by doing that. However, the remaining stuff that has to be processed by lookup still contains a pretty substantial number of characters that map to 4-byte GB18030 characters, so I don't think we can get any table size savings by adopting a bespoke table format. We might as well use UtfToLocal. (Worth noting in this connection is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte table entries for other encodings, even though most of the others are not concerned with characters outside the BMP.)

It's not about 4 vs 2 bytes, it's about using 8 bytes vs 4. UtfToLocal uses a sparse array:

map = {{0, x}, {1, y}, {2, z}, ...}

vs.

map = {x, y, z, ...}

That's fine when not every code point is used, but it's different for GB18030, where almost all code points are used. Using a plain array saves space and saves a binary search.

Gr. Arjen
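[To make the contrast concrete, a minimal sketch of the two lookup strategies being compared. Names and the struct field layout are illustrative approximations, not the actual PostgreSQL definitions.]

#include <stddef.h>
#include <stdint.h>

/*
 * Sparse layout used by UtfToLocal: one {utf, code} pair per mapped
 * character (8 bytes each), kept sorted by utf and searched with a
 * binary search.
 */
typedef struct
{
	uint32_t	utf;			/* UTF-8 bytes packed into one word */
	uint32_t	code;			/* GB18030 bytes packed into one word */
} pg_utf_to_local;

static uint32_t
sparse_lookup(const pg_utf_to_local *map, size_t n, uint32_t utf)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (map[mid].utf == utf)
			return map[mid].code;
		if (map[mid].utf < utf)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;					/* no mapping found */
}

/*
 * Dense layout proposed above: one 4-byte entry per BMP code point,
 * indexed directly -- O(1), and 4 bytes per character instead of 8,
 * at the cost of storing zeroes for unused code points.
 */
static uint32_t
dense_lookup(const uint32_t table[65536], uint32_t ucs)
{
	return (ucs <= 0xFFFF) ? table[ucs] : 0;	/* 0 = no mapping */
}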
Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
> On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> According to that, about half of the characters below U+FFFF can be processed via linear conversions, so I think we ought to save table space by doing that. However, the remaining stuff that has to be processed by lookup still contains a pretty substantial number of characters that map to 4-byte GB18030 characters, so I don't think we can get any table size savings by adopting a bespoke table format. We might as well use UtfToLocal. (Worth noting in this connection is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte table entries for other encodings, even though most of the others are not concerned with characters outside the BMP.)

> It's not about 4 vs 2 bytes, it's about using 8 bytes vs 4. UtfToLocal uses a sparse array:
> map = {{0, x}, {1, y}, {2, z}, ...}
> vs.
> map = {x, y, z, ...}
> That's fine when not every code point is used, but it's different for GB18030, where almost all code points are used. Using a plain array saves space and saves a binary search.

Well, it doesn't save any space: if we get rid of the additional linear ranges in the lookup table, what remains is 30733 entries requiring about 256K, same as (or a bit less than) what you suggest.

The point about possibly being able to do this with a simple lookup table instead of binary search is valid, but I still say it's a mistake to suppose that we should consider that only for GB18030. With the reduced table size, the GB18030 conversion tables are not all that far out of line with the other Far Eastern conversions:

$ size utf8*.so | sort -n
   text    data     bss     dec     hex filename
   1880     512      16    2408     968 utf8_and_ascii.so
   2394     528      16    2938     b7a utf8_and_iso8859_1.so
   6674     512      16    7202    1c22 utf8_and_cyrillic.so
  24318     904      16   25238    6296 utf8_and_win.so
  28750     968      16   29734    7426 utf8_and_iso8859.so
 121110     512      16  121638   1db26 utf8_and_euc_cn.so
 123458     512      16  123986   1e452 utf8_and_sjis.so
 133606     512      16  134134   20bf6 utf8_and_euc_kr.so
 185014     512      16  185542   2d4c6 utf8_and_sjis2004.so
 185522     512      16  186050   2d6c2 utf8_and_euc2004.so
 212950     512      16  213478   341e6 utf8_and_euc_jp.so
 221394     512      16  221922   362e2 utf8_and_big5.so
 274772     512      16  275300   43364 utf8_and_johab.so
 277776     512      16  278304   43f20 utf8_and_uhc.so
 332262     512      16  332790   513f6 utf8_and_euc_tw.so
 350640     512      16  351168   55bc0 utf8_and_gbk.so
 496680     512      16  497208   79638 utf8_and_gb18030.so

If we were to get excited about reducing the conversion time for GB18030, it would clearly make sense to use similar infrastructure for GBK, and perhaps the EUC encodings too.

However, I'm not that excited about changing it. We have not heard field complaints about these converters being too slow. What's more, there doesn't seem to be any practical way to apply the same idea to the other conversion direction, which means if you do feel there's a speed problem this would only halfway fix it.

So my feeling is that the most practical and maintainable answer is to keep GB18030 using code that is mostly shared with the other encodings. I've committed a fix that does it that way for 9.5. If you want to pursue the idea of a faster conversion using direct lookup tables, I think that would be 9.6 material at this point.

			regards, tom lane
On Fri, May 15, 2015 at 3:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, I'm not that excited about changing it. We have not heard field complaints about these converters being too slow. What's more, there doesn't seem to be any practical way to apply the same idea to the other conversion direction, which means if you do feel there's a speed problem this would only halfway fix it.

Half a loaf is better than none.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>> That's fine when not every code point is used, but it's different for GB18030, where almost all code points are used. Using a plain array saves space and saves a binary search.
>
> Well, it doesn't save any space: if we get rid of the additional linear ranges in the lookup table, what remains is 30733 entries requiring about 256K, same as (or a bit less than) what you suggest.

We could do both. What about something like this:

static unsigned int utf32_to_gb18030_from_0x0001[1105] = {
	/* 0x0 */ 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
	...

static unsigned int utf32_to_gb18030_from_0x2010[1587] = {
	/* 0x0 */ 0xa95c, 0x8136a532, 0x8136a533, 0xa843, 0xa1aa, 0xa844, 0xa1ac, 0x8136a534,
	...

static unsigned int utf32_to_gb18030_from_0x2E81[28965] = {
	/* 0x0 */ 0xfe50, 0x8138fd39, 0x8138fe30, 0xfe54, 0x8138fe31, 0x8138fe32, 0x8138fe33, 0xfe57,
	...

static unsigned int utf32_to_gb18030_from_0xE000[2149] = {
	/* 0x0 */ 0xaaa1, 0xaaa2, 0xaaa3, 0xaaa4, 0xaaa5, 0xaaa6, 0xaaa7, 0xaaa8,
	...

static unsigned int utf32_to_gb18030_from_0xF92C[254] = {
	/* 0x0 */ 0xfd9c, 0x84308535, 0x84308536, 0x84308537, 0x84308538, 0x84308539, 0x84308630, 0x84308631,
	...

static unsigned int utf32_to_gb18030_from_0xFE30[464] = {
	/* 0x0 */ 0xa955, 0xa6f2, 0x84318538, 0xa6f4, 0xa6f5, 0xa6e0, 0xa6e1, 0xa6f0,
	...

static uint32
conv_utf8_to_18030(uint32 code)
{
	uint32		ucs = utf8word_to_unicode(code);

#define conv_lin(minunicode, maxunicode, mincode) \
	if (ucs >= minunicode && ucs <= maxunicode) \
		return gb_unlinear(ucs - minunicode + gb_linear(mincode))

	/* note: maxunicode is exclusive here, since the next range starts there */
#define conv_array(minunicode, maxunicode) \
	if (ucs >= minunicode && ucs < maxunicode) \
		return utf32_to_gb18030_from_##minunicode[ucs - minunicode];

	conv_array(0x0001, 0x0452);
	conv_lin(0x0452, 0x200F, 0x8130D330);
	conv_array(0x2010, 0x2643);
	conv_lin(0x2643, 0x2E80, 0x8137A839);
	conv_array(0x2E81, 0x9FA6);
	conv_lin(0x9FA6, 0xD7FF, 0x82358F33);
	conv_array(0xE000, 0xE865);
	conv_lin(0xE865, 0xF92B, 0x8336D030);
	conv_array(0xF92C, 0xFA2A);
	conv_lin(0xFA2A, 0xFE2F, 0x84309C38);
	conv_array(0xFE30, 0x10000);
	conv_lin(0x10000, 0x10FFFF, 0x90308130);

	/* No mapping exists */
	return 0;
}

> The point about possibly being able to do this with a simple lookup table instead of binary search is valid, but I still say it's a mistake to suppose that we should consider that only for GB18030. With the reduced table size, the GB18030 conversion tables are not all that far out of line with the other Far Eastern conversions:
>
> $ size utf8*.so | sort -n
>    text    data     bss     dec     hex filename
>    1880     512      16    2408     968 utf8_and_ascii.so
>    2394     528      16    2938     b7a utf8_and_iso8859_1.so
>    6674     512      16    7202    1c22 utf8_and_cyrillic.so
>   24318     904      16   25238    6296 utf8_and_win.so
>   28750     968      16   29734    7426 utf8_and_iso8859.so
>  121110     512      16  121638   1db26 utf8_and_euc_cn.so
>  123458     512      16  123986   1e452 utf8_and_sjis.so
>  133606     512      16  134134   20bf6 utf8_and_euc_kr.so
>  185014     512      16  185542   2d4c6 utf8_and_sjis2004.so
>  185522     512      16  186050   2d6c2 utf8_and_euc2004.so
>  212950     512      16  213478   341e6 utf8_and_euc_jp.so
>  221394     512      16  221922   362e2 utf8_and_big5.so
>  274772     512      16  275300   43364 utf8_and_johab.so
>  277776     512      16  278304   43f20 utf8_and_uhc.so
>  332262     512      16  332790   513f6 utf8_and_euc_tw.so
>  350640     512      16  351168   55bc0 utf8_and_gbk.so
>  496680     512      16  497208   79638 utf8_and_gb18030.so
>
> If we were to get excited about reducing the conversion time for GB18030, it would clearly make sense to use similar infrastructure for GBK, and perhaps the EUC encodings too.

I'll check them as well.
If they have linear ranges it should work.

> However, I'm not that excited about changing it. We have not heard field complaints about these converters being too slow. What's more, there doesn't seem to be any practical way to apply the same idea to the other conversion direction, which means if you do feel there's a speed problem this would only halfway fix it.

It does work if you linearize it first. That's why we need to convert to UTF32 first as well. That's a form of linearization.
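[For reference, the gb_linear()/gb_unlinear() helpers assumed by the conv_lin() macro earlier in the thread could be sketched roughly as below. This is an assumption about their behavior, not code from the patch: a 4-byte GB18030 sequence packed big-endian into a uint32_t is mapped to a linear index using the byte ranges 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39.]

#include <stdint.h>

/*
 * Map a 4-byte GB18030 code, packed big-endian into a uint32_t
 * (e.g. 0x90308130), onto a linear index.  The four bytes range over
 * 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39, giving place values of
 * 12600, 1260, 10 and 1.
 */
static uint32_t
gb_linear(uint32_t gb)
{
	uint32_t	b0 = (gb >> 24) & 0xFF;
	uint32_t	b1 = (gb >> 16) & 0xFF;
	uint32_t	b2 = (gb >> 8) & 0xFF;
	uint32_t	b3 = gb & 0xFF;

	return (b0 - 0x81) * 12600 + (b1 - 0x30) * 1260 +
		(b2 - 0x81) * 10 + (b3 - 0x30);
}

/* Inverse: turn a linear index back into a packed 4-byte GB18030 code. */
static uint32_t
gb_unlinear(uint32_t lin)
{
	uint32_t	b0 = 0x81 + lin / 12600;
	uint32_t	b1 = 0x30 + (lin / 1260) % 10;
	uint32_t	b2 = 0x81 + (lin / 10) % 126;
	uint32_t	b3 = 0x30 + lin % 10;

	return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
}

[With linearization like this, conv_lin() becomes a pure offset computation in either direction, and reverse-direction (GB18030-to-UTF-8) tables could likewise be indexed by gb_linear(code) minus the start of each range, which is the point being made above.]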