Re: patch: utf8_to_unicode (trivial) - Mailing list pgsql-hackers

From Joseph Adams
Subject Re: patch: utf8_to_unicode (trivial)
Date
Msg-id AANLkTin2x3OaKFZXNpMR+Z3WBDA_3d5QNp_dRYF4JzOJ@mail.gmail.com
In response to patch: utf8_to_unicode (trivial)  (Joseph Adams <joeyadams3.14159@gmail.com>)
List pgsql-hackers
On Tue, Jul 27, 2010 at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jul 24, 2010 at 10:34 PM, Joseph Adams
> <joeyadams3.14159@gmail.com> wrote:
>> In src/include/mb/pg_wchar.h, there is a function unicode_to_utf8,
>> but no corresponding utf8_to_unicode.  However, there is a static
>> function called utf2ucs that does what utf8_to_unicode would do.
>>
>> I'd like this function to be available because the JSON code needs to
>> convert UTF-8 to and from Unicode codepoints, and I'm currently using
>> a separate UTF-8 to codepoint function for that.
>>
>> This patch renames utf2ucs to utf8_to_unicode and makes it public.  It
>> also fixes the version of utf2ucs in src/bin/psql/mbprint.c so that
>> it's equivalent to the one in wchar.c.
>>
>> This is a patch against CVS HEAD for application.  It compiles and
>> tests successfully.
>>
>> Comments?  Thanks,
>
> I feel obliged to respond to this since I'm supposed to be covering your
> GSoC project while Magnus is on vacation, but I actually know very
> little about this topic.  What's undeniable, however, is that the
> two versions of utf2ucs() in the tree right now don't
> match.  src/backend/utils/mb/wchar.c has:
>
>        else if ((*c & 0xf8) == 0xf0)
>
> while src/bin/psql/mbprint.c, which is otherwise identical, has:
>
>        else if ((*c & 0xf0) == 0xf0)
>
> I'm inclined to believe that your patch is right to think that the
> former version is correct, because it used to match the latter version
> until Tom Lane changed it in 2007, and I suspect he simply failed to
> update both copies.  But I'd like someone who actually understands
> what this code is doing to confirm that.
>
> http://archives.postgresql.org/pgsql-committers/2007-01/msg00293.php
>
> I suspect we need to not only fix this, but back-patch it at least to
> 8.2, which is the first release where there are two copies of this
> function.  I am not sure whether earlier releases need to be changed,
> or not.  But again, someone who understands the issues better than I
> do needs to weigh in here.
>
> In terms of making this function non-static, I'm inclined to think
> that a better approach would be to move it to src/port.  That gets rid
> of the need to have two copies in the first place.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>
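
To make the difference between the two masks concrete, here is a small
sketch.  The helper names are made up for this illustration and do not
appear in the tree; only the two mask expressions come from the quoted
code.  A 4-byte UTF-8 lead byte has the bit pattern 11110xxx, so the
0xf8 mask accepts 0xf0 through 0xf7, while the 0xf0 mask also accepts
0xf8 through 0xff, which are not 4-byte lead bytes.

```c
/*
 * Illustration only: hypothetical helpers wrapping the two tests
 * quoted above.  Neither function exists in PostgreSQL.
 */

/* Test as written in src/backend/utils/mb/wchar.c: matches 11110xxx. */
static int
is_4byte_lead_wchar(unsigned char c)
{
    return (c & 0xf8) == 0xf0;
}

/* Test as written in src/bin/psql/mbprint.c: matches 1111xxxx. */
static int
is_4byte_lead_mbprint(unsigned char c)
{
    return (c & 0xf0) == 0xf0;
}
```

For example, the byte 0xf8 is rejected by the first test but accepted
by the second, so the mbprint.c copy would decode it as if it started a
4-byte sequence.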

I've attached another patch that moves utf8_to_unicode to src/port per
Robert Haas's suggestion.

This patch itself is not quite as elegant as the first one because it
puts platform-independent code that "belongs" in wchar.c into src/port.
It also uses unsigned int instead of pg_wchar because the typedef
of pg_wchar isn't available to the frontend, if I'm not mistaken.
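
For readers without the attachment, here is a rough sketch of what a
decoder returning unsigned int could look like.  This is not the
attached patch: the name utf8_to_unicode_sketch and the 0xffffffff
"invalid" return value are assumptions made for this illustration, and
the function assumes the caller has already verified that the sequence
is complete and well formed (continuation bytes are not checked).

```c
/*
 * Illustration only, not the attached patch.  Decodes the first UTF-8
 * sequence at c into a Unicode code point.  Assumes the input has
 * already been validated; continuation bytes are not checked.
 */
static unsigned int
utf8_to_unicode_sketch(const unsigned char *c)
{
    if ((*c & 0x80) == 0)
        return (unsigned int) c[0];                 /* 0xxxxxxx */
    else if ((*c & 0xe0) == 0xc0)
        return ((unsigned int) (c[0] & 0x1f) << 6) |
               (unsigned int) (c[1] & 0x3f);        /* 110xxxxx */
    else if ((*c & 0xf0) == 0xe0)
        return ((unsigned int) (c[0] & 0x0f) << 12) |
               ((unsigned int) (c[1] & 0x3f) << 6) |
               (unsigned int) (c[2] & 0x3f);        /* 1110xxxx */
    else if ((*c & 0xf8) == 0xf0)
        return ((unsigned int) (c[0] & 0x07) << 18) |
               ((unsigned int) (c[1] & 0x3f) << 12) |
               ((unsigned int) (c[2] & 0x3f) << 6) |
               (unsigned int) (c[3] & 0x3f);        /* 11110xxx */
    else
        return 0xffffffff;      /* assumed marker for an invalid lead byte */
}
```

Using a plain unsigned int keeps the sketch free of any backend-only
typedef, at the cost of not saying "this is a wide character" in the
signature.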

I'm not sure whether I like the old patch better or the new one.


Joey Adams

Attachment
