Re: Patch for bug #12845 (GB18030 encoding) - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Patch for bug #12845 (GB18030 encoding) |
Date | |
Msg-id | 11269.1431637452@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Patch for bug #12845 (GB18030 encoding) (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Patch for bug #12845 (GB18030 encoding)
|
List | pgsql-hackers |
I wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >>> Maybe not, but at the very least we should consider getting it fixed in >>> 9.5 rather than waiting a full development cycle. Same as in >>> https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us >>> I'm not saying we MUST include it in 9.5, but we should at least >>> consider it. If we simply stash it in the open CF we guarantee that it >>> will linger there for a year. >> Sure, if somebody has the time to put into it now, I'm fine with that. >> I'm afraid it won't be me, though: even if I had the time, I don't >> know enough about encodings. > I concur that we should at least consider this patch for 9.5. I've > added it to > https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items I looked at this patch a bit, and read up on GB18030 (thank you wikipedia). I concur we have a problem to fix. I do not like the way this patch went about it though, ie copying-and-pasting LocalToUtf and UtfToLocal and their supporting routines into utf8_and_gb18030.c. Aside from being duplicative, this means the improved mapping capability isn't available to use with anything except GB18030. (I do not know whether there are any linear mapping ranges in other encodings, but seeing that the Unicode crowd went to the trouble of defining a notation for it in http://www.unicode.org/reports/tr22/, I'm betting there are.) What I think would be a better solution, if slightly more invasive, is to extend LocalToUtf and UtfToLocal to add a callback function argument for a function of signature "uint32 translate(uint32)". This function, if provided, would be called after failing to find a mapping in the mapping table(s), and it could implement any translation that would be better handled by code than as a boatload of mapping-table entries. If it returns zero then it doesn't know a translation either, so throw error as before. An alternative definition that could be proposed would be to call the function before consulting the mapping tables, not after, on the grounds that the function can probably exit cheaply if the input's not in a range that it cares about. However, consulting the mapping table first wins if you have ranges that mostly work but contain a few exceptions: put the exceptions in the mapping table and then the function need not worry about handling them. Another alternative approach would be to try to define linear mapping ranges in a tabular fashion, for more consistency with what's there now. But that probably wouldn't work terribly well because the bytewise character representations used in this logic have to be converted into code points before you can do any sort of linear mapping. We could hard-wire that conversion for UTF8, but the conversion in the other code space would be encoding-specific. So we might as well just treat the whole linear mapping behavior as a black box function for each encoding. I'm also discounting the possibility that someone would want an algorithmic mapping for cases involving "combined" codes (ie pairs of UTF8 characters). Of the encodings we support, only EUC_JIS_2004 and SHIFT_JIS_2004 need such cases at all, and those have only a handful of cases; so it doesn't seem popular enough to justify the extra complexity. I also notice that pg_gb18030_verifier isn't even close to strict enough; it basically relies on pg_gb18030_mblen which contains no checks whatsoever on the third and fourth bytes. So that needs to be fixed. The verification tightening would definitely not be something to back-patch, and I'm inclined to think that the additional mapping capability shouldn't be either, in view of the facts that (a) we've had few if any field complaints yet, and (b) changing the signatures of LocalToUtf/UtfToLocal might possibly break third-party code. So I'm seeing this as a HEAD-only patch, but I do want to try to squeeze it into 9.5 rather than wait another year. Barring objections, I'll go make this happen. regards, tom lane
pgsql-hackers by date: