Re: Patch for bug #12845 (GB18030 encoding) - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Patch for bug #12845 (GB18030 encoding)
Date
Msg-id 11269.1431637452@sss.pgh.pa.us
Whole thread Raw
In response to Re: Patch for bug #12845 (GB18030 encoding)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Patch for bug #12845 (GB18030 encoding)
List pgsql-hackers
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>> Maybe not, but at the very least we should consider getting it fixed in
>>> 9.5 rather than waiting a full development cycle.  Same as in
>>> https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us
>>> I'm not saying we MUST include it in 9.5, but we should at least
>>> consider it.  If we simply stash it in the open CF we guarantee that it
>>> will linger there for a year.

>> Sure, if somebody has the time to put into it now, I'm fine with that.
>> I'm afraid it won't be me, though: even if I had the time, I don't
>> know enough about encodings.

> I concur that we should at least consider this patch for 9.5.  I've
> added it to
> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

I looked at this patch a bit, and read up on GB18030 (thank you
wikipedia).  I concur we have a problem to fix.  I do not like the way
this patch went about it though, ie copying-and-pasting LocalToUtf and
UtfToLocal and their supporting routines into utf8_and_gb18030.c.
Aside from being duplicative, this means the improved mapping capability
isn't available to use with anything except GB18030.  (I do not know
whether there are any linear mapping ranges in other encodings, but
seeing that the Unicode crowd went to the trouble of defining a notation
for it in http://www.unicode.org/reports/tr22/, I'm betting there are.)

What I think would be a better solution, if slightly more invasive,
is to extend LocalToUtf and UtfToLocal to add a callback function
argument for a function of signature "uint32 translate(uint32)".
This function, if provided, would be called after failing to find a
mapping in the mapping table(s), and it could implement any translation
that would be better handled by code than as a boatload of mapping-table
entries.  If it returns zero then it doesn't know a translation either,
so throw error as before.

An alternative definition that could be proposed would be to call the
function before consulting the mapping tables, not after, on the grounds
that the function can probably exit cheaply if the input's not in a range
that it cares about.  However, consulting the mapping table first wins
if you have ranges that mostly work but contain a few exceptions: put
the exceptions in the mapping table and then the function need not worry
about handling them.

Another alternative approach would be to try to define linear mapping
ranges in a tabular fashion, for more consistency with what's there now.
But that probably wouldn't work terribly well because the bytewise
character representations used in this logic have to be converted into
code points before you can do any sort of linear mapping.  We could
hard-wire that conversion for UTF8, but the conversion in the other code
space would be encoding-specific.  So we might as well just treat the
whole linear mapping behavior as a black box function for each encoding.

I'm also discounting the possibility that someone would want an
algorithmic mapping for cases involving "combined" codes (ie pairs of
UTF8 characters).  Of the encodings we support, only EUC_JIS_2004 and
SHIFT_JIS_2004 need such cases at all, and those have only a handful of
cases; so it doesn't seem popular enough to justify the extra complexity.

I also notice that pg_gb18030_verifier isn't even close to strict enough;
it basically relies on pg_gb18030_mblen which contains no checks
whatsoever on the third and fourth bytes.  So that needs to be fixed.

The verification tightening would definitely not be something to
back-patch, and I'm inclined to think that the additional mapping
capability shouldn't be either, in view of the facts that (a) we've
had few if any field complaints yet, and (b) changing the signatures
of LocalToUtf/UtfToLocal might possibly break third-party code.
So I'm seeing this as a HEAD-only patch, but I do want to try to
squeeze it into 9.5 rather than wait another year.

Barring objections, I'll go make this happen.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Jason Petersen
Date:
Subject: Query Deparsing Support
Next
From: Tom Lane
Date:
Subject: Re: Triaging the remaining open commitfest items