Re: [HACKERS] Wrong charset mappings - Mailing list pgsql-jdbc
From | Barry Lind |
---|---|
Subject | Re: [HACKERS] Wrong charset mappings |
Date | |
Msg-id | 3E4A8A44.7040600@xythos.com |
In response to | Re: [HACKERS] Wrong charset mappings (Thomas O'Dowd <tom@nooper.com>) |
List | pgsql-jdbc |
I don't see any JDBC-specific requirements here, other than the fact that
JDBC assumes that the following conversions are done correctly:

dbcharset <-> utf8 <-> java/utf16

where the dbcharset to/from utf8 conversion is done by the backend and
the utf8 to/from java/utf16 conversion is done in the JDBC driver.

Prior to 7.3 the JDBC driver did the entire conversion itself. However,
versions of the JDK prior to 1.4 do a terrible job when it comes to the
performance of the conversion. So for a significant speedup in 7.3 we
moved most of the work to the backend.

thanks,
--Barry

Thomas O'Dowd wrote:
> Hi Ishii-san,
>
> Thanks for the reply. Why was the particular change made between 7.2 and
> 7.3? It seems to have moved away from the standard. I found the
> following file...
>
> src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl
>
> Which generates the mappings. I found it references 3 files from unicode
> organisation, namely:
>
> http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
> http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
> http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT
>
> The JIS0208.TXT has the line...
>
> 0x8160 0x2141 0x301C # WAVE DASH
>
> 1st col is sjis, 2nd is EUC - 0x8080, 3rd is utf16.
>
> Incidentally those mapping files are marked obsolete but I guess the old
> mappings still hold.
>
> I guess if I run the perl script it will generate a mapping file
> different to what postgresql is currently using. It might be interesting
> to pull out the diffs and see what's right/wrong. I guess it's not run
> anymore?
>
> I can't see how the change will affect the JDBC driver. It should only
> improve the situation. Right now it's not possible to go from sjis ->
> database (utf8) -> java (jdbc/utf16) -> sjis for the WAVE DASH character
> because the mapping is wrong in postgresql. I'll cc the JDBC list and
> maybe we'll find out if it's a real problem to change the mapping.
>
> Changing the mapping I think is the correct thing to do from what I can
> see all around me in different tools like iconv, java 1.4.1, utf-8
> terminal and any unicode reference on the web.
>
> What do you think?
>
> Tom.
>
> On Wed, 2003-02-12 at 22:30, Tatsuo Ishii wrote:
>
>>I think the problem you see is due to the mapping table changes
>>between 7.2 and 7.3. It seems there are more changes other than
>>u301c. Moreover according to the recent discussion in Japanese local
>>mailing list, 7.3's JDBC driver now relies on the encoding conversion
>>performed by the backend. ie. The driver issues "set client_encoding =
>>'UNICODE'". This problem is very complex and I need time to find a good
>>solution. I don't think simply backing out the changes to the mapping
>>table solves the problem.
>>
>>
>>>Hi all,
>>>
>>>One Japanese character has been causing my head to swim lately. I've
>>>finally tracked down the problem to both Java 1.3 and Postgresql.
>>>
>>>The problem character is namely:
>>>utf-16: 0x301C
>>>utf-8: 0xE3809C
>>>SJIS: 0x8160
>>>EUC_JP: 0xA1C1
>>>Otherwise known as the WAVE DASH character.
>>>
>>>The confusion stems from a very similar character 0xFF5E (utf-16) or
>>>0xEFBD9E (utf-8) the FULLWIDTH TILDE.
>>>
>>>Java has just lately (1.4.1) finally fixed their mappings so that 0x301C
>>>maps correctly to both the correct SJIS and EUC-JP character. Previously
>>>(at least in 1.3.1) they mapped SJIS to 0xFF5E and EUC to 0x301C,
>>>causing all sorts of trouble.
>>>
>>>Postgresql at least picked one of the two characters namely 0xFF5E, so
>>>conversions in and out of the database to/from sjis/euc seemed to be
>>>working. Problem is when you try to view utf-8 from the database or if
>>>you read the data into java (utf-16) and try converting to euc or sjis
>>>from there.
>>>
>>>Anyway, I think postgresql needs to be fixed for this character. In my
>>>opinion what needs to be done is to change the mappings...
>>>
>>>euc-jp -> utf-8    -> euc-jp
>>>======    ========    ======
>>>0xA1C1 -> 0xE3809C    0xA1C1
>>>
>>>sjis   -> utf-8    -> sjis
>>>======    ========    ======
>>>0x8160 -> 0xE3809C    0x8160
>>>
>>>As to what to do with the current mapping of 0xEFBD9E (utf-8)? It
>>>probably should be removed. Maybe you could keep the mapping back to the
>>>sjis/euc characters to help backward compatibility though. I'm not sure
>>>what is the correct approach there.
>>>
>>>If anyone can tell me how to edit the mappings under:
>>> src/backend/utils/mb/Unicode/
>>>
>>>and rebuild postgres to use them, then I can test this out locally.
>>
>>Just edit src/backend/utils/mb/Unicode/*.map and rebuild
>>PostgreSQL. Probably you might want to modify utf8_to_euc_jp.map and
>>euc_jp_to_utf8.map.
>>--
>>Tatsuo Ishii
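
The byte values quoted in the thread can be checked with a short script. This is a minimal sketch in Python, not anything taken from PostgreSQL or the JDBC driver: the function names are hypothetical, Python's `shift_jis` and `euc_jp` codecs are assumed to follow the JIS0208.TXT mapping (0x8160 <-> U+301C), and Python's `cp932` codec is used only as a stand-in for a table that maps 0x8160 to U+FF5E, as the thread says the 7.3 backend does. It derives the EUC-JP value from the JIS0208.TXT line by adding 0x8080, and shows why the round trip breaks when the two ends disagree about the mapping.

```python
# Illustrative sketch only. Python's built-in codecs stand in for the
# mapping tables discussed in the thread; none of this is PostgreSQL code.

# The JIS0208.TXT line quoted in the thread: sjis, JIS X 0208, Unicode.
JIS0208_LINE = "0x8160 0x2141 0x301C # WAVE DASH"


def parse_jis0208_line(line):
    """Split a JIS0208.TXT data line into (sjis, jis, ucs) integers."""
    fields = line.split("#", 1)[0].split()
    sjis, jis, ucs = (int(field, 16) for field in fields[:3])
    return sjis, jis, ucs


def jis_to_euc(jis):
    """EUC-JP sets the high bit of both JIS X 0208 bytes, i.e. adds 0x8080."""
    return jis + 0x8080


def round_trips(raw, decode_codec, encode_codec):
    """True if decoding with one table and re-encoding with another restores raw."""
    try:
        return raw.decode(decode_codec).encode(encode_codec) == raw
    except UnicodeError:
        # The second table has no entry for the character the first one produced.
        return False


sjis, jis, ucs = parse_jis0208_line(JIS0208_LINE)
wave_dash = chr(ucs)

print("EUC-JP from JIS + 0x8080:", hex(jis_to_euc(jis)))
print("UTF-8 of U+301C:", wave_dash.encode("utf-8").hex())
print("UTF-8 of U+FF5E:", "\uff5e".encode("utf-8").hex())

# Both ends agree on 0x8160 <-> U+301C: the round trip works.
print("shift_jis -> shift_jis:", round_trips(b"\x81\x60", "shift_jis", "shift_jis"))

# One end produces U+FF5E, the other only knows U+301C: the round trip fails.
print("cp932 -> shift_jis:", round_trips(b"\x81\x60", "cp932", "shift_jis"))
```

The second `round_trips` call mirrors the sjis -> database (utf8) -> java (utf16) -> sjis path described above: the bytes survive only when every conversion step uses the same code point for 0x8160.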