Re: patch: utf8_to_unicode (trivial) - Mailing list pgsql-hackers
From | Joseph Adams |
---|---|
Subject | Re: patch: utf8_to_unicode (trivial) |
Date | |
Msg-id | AANLkTin2x3OaKFZXNpMR+Z3WBDA_3d5QNp_dRYF4JzOJ@mail.gmail.com |
In response to | patch: utf8_to_unicode (trivial) (Joseph Adams <joeyadams3.14159@gmail.com>) |
Responses |
Re: patch: utf8_to_unicode (trivial)
Re: patch: utf8_to_unicode (trivial) |
List | pgsql-hackers |
On Tue, Jul 27, 2010 at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jul 24, 2010 at 10:34 PM, Joseph Adams
> <joeyadams3.14159@gmail.com> wrote:
>> In src/include/mb/pg_wchar.h, there is a function unicode_to_utf8,
>> but no corresponding utf8_to_unicode. However, there is a static
>> function called utf2ucs that does what utf8_to_unicode would do.
>>
>> I'd like this function to be available because the JSON code needs to
>> convert UTF-8 to and from Unicode codepoints, and I'm currently using
>> a separate UTF-8 to codepoint function for that.
>>
>> This patch renames utf2ucs to utf8_to_unicode and makes it public. It
>> also fixes the version of utf2ucs in src/bin/psql/mbprint.c so that
>> it's equivalent to the one in wchar.c.
>>
>> This is a patch against CVS HEAD for application. It compiles and
>> tests successfully.
>>
>> Comments? Thanks,
>
> I feel obliged to respond to this since I'm supposed to be covering your
> GSoC project while Magnus is on vacation, but I actually know very
> little about this topic. What's undeniable, however, is that the
> coding in the two versions of utf2ucs() in the tree right now doesn't
> match. src/backend/utils/mb/wchar.c has:
>
>     else if ((*c & 0xf8) == 0xf0)
>
> while src/bin/psql/mbprint.c, which is otherwise identical, has:
>
>     else if ((*c & 0xf0) == 0xf0)
>
> I'm inclined to believe that your patch is right to think that the
> former version is correct, because it used to match the latter version
> until Tom Lane changed it in 2007, and I suspect he simply failed to
> update both copies. But I'd like someone who actually understands
> what this code is doing to confirm that.
>
> http://archives.postgresql.org/pgsql-committers/2007-01/msg00293.php
>
> I suspect we need to not only fix this, but back-patch it at least to
> 8.2, which is the first release where there are two copies of this
> function. I am not sure whether earlier releases need to be changed,
> or not. But again, someone who understands the issues better than I
> do needs to weigh in here.
>
> In terms of making this function non-static, I'm inclined to think
> that a better approach would be to move it to src/port. That gets rid
> of the need to have two copies in the first place.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>

I've attached another patch that moves utf8_to_unicode to src/port per
Robert Haas's suggestion.

This patch itself is not quite as elegant as the first one because it
puts platform-independent code that "belongs" in wchar.c into src/port.
It also uses unsigned int instead of pg_wchar because the typedef of
pg_wchar isn't available to the frontend, if I'm not mistaken.

I'm not sure whether I like the old patch better or the new one.

Joey Adams
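For readers following the mask question above, here is a minimal sketch of a UTF-8 to codepoint decoder of the kind being discussed. It is not the attached patch: the names utf8_to_unicode_sketch, INVALID_CODEPOINT, lead_matches_strict and lead_matches_loose are made up for illustration, the 0xffffffff sentinel is an assumption of this sketch, and, like the two branch conditions quoted above, it only inspects the lead byte and does not validate continuation bytes. It returns unsigned int, as the second patch is described as doing.

```c
/* Hypothetical sentinel for an unrecognized lead byte (an assumption of this sketch). */
#define INVALID_CODEPOINT 0xffffffffu

/*
 * Decode one UTF-8 sequence starting at c into a Unicode codepoint.
 * Only the lead byte is checked; continuation bytes are trusted.
 */
static unsigned int
utf8_to_unicode_sketch(const unsigned char *c)
{
	if ((*c & 0x80) == 0)			/* 0xxxxxxx: 1 byte */
		return (unsigned int) c[0];
	else if ((*c & 0xe0) == 0xc0)	/* 110xxxxx: 2 bytes */
		return (unsigned int) (((c[0] & 0x1f) << 6) |
							   (c[1] & 0x3f));
	else if ((*c & 0xf0) == 0xe0)	/* 1110xxxx: 3 bytes */
		return (unsigned int) (((c[0] & 0x0f) << 12) |
							   ((c[1] & 0x3f) << 6) |
							   (c[2] & 0x3f));
	else if ((*c & 0xf8) == 0xf0)	/* 11110xxx: 4 bytes */
		return (unsigned int) (((c[0] & 0x07) << 18) |
							   ((c[1] & 0x3f) << 12) |
							   ((c[2] & 0x3f) << 6) |
							   (c[3] & 0x3f));
	else
		return INVALID_CODEPOINT;
}

/* The 4-byte test as quoted from wchar.c: requires the bit pattern 11110xxx. */
static int
lead_matches_strict(unsigned char b)
{
	return (b & 0xf8) == 0xf0;
}

/* The 4-byte test as quoted from mbprint.c: accepts any 1111xxxx byte. */
static int
lead_matches_loose(unsigned char b)
{
	return (b & 0xf0) == 0xf0;
}
```

The two masks agree on every valid 4-byte lead (0xF0 through 0xF7) and differ only for bytes 0xF8 through 0xFF, which are not valid UTF-8 lead bytes: the 0xf0 mask would treat those as the start of a 4-byte sequence, while the 0xf8 mask falls through to the invalid case.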
Attachment