Re: [HACKERS] Wrong charset mappings - Mailing list pgsql-jdbc
From | Thomas O'Dowd |
---|---|
Subject | Re: [HACKERS] Wrong charset mappings |
Date | |
Msg-id | 1045062831.13002.5.camel@beast.uwillsee.com |
List | pgsql-jdbc |
Hi Ishii-san,

Thanks for the reply. Why was the particular change made between 7.2 and 7.3? It seems to have moved away from the standard. I found the following file...

src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl

which generates the mappings. I found it references 3 files from the Unicode organisation, namely:

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT

The JIS0208.TXT has the line...

0x8160 0x2141 0x301C # WAVE DASH

1st col is sjis, 2nd is EUC - 0x8080, 3rd is utf16.

Incidentally, those mapping files are marked obsolete, but I guess the old mappings still hold.

I guess if I run the perl script it will generate a mapping file different to what postgresql is currently using. It might be interesting to pull out the diffs and see what's right/wrong. I guess it's not run anymore?

I can't see how the change will affect the JDBC driver. It should only improve the situation. Right now it's not possible to go from sjis -> database (utf8) -> java (jdbc/utf16) -> sjis for the WAVE DASH character because the mapping is wrong in postgresql. I'll cc the JDBC list and maybe we'll find out if it's a real problem to change the mapping.

Changing the mapping is, I think, the correct thing to do, judging from what I can see all around me in different tools like iconv, java 1.4.1, a utf-8 terminal and any unicode reference on the web.

What do you think?

Tom.

On Wed, 2003-02-12 at 22:30, Tatsuo Ishii wrote:
> I think the problem you see is due to the mapping table changes
> between 7.2 and 7.3. It seems there are more changes other than
> u301c. Moreover, according to the recent discussion in the Japanese local
> mailing list, 7.3's JDBC driver now relies on the encoding conversion
> performed by the backend, i.e. the driver issues "set client_encoding =
> 'UNICODE'". This problem is very complex and I need time to find a good
> solution. I don't think simply backing out the changes to the mapping
> table solves the problem.
>
> > Hi all,
> >
> > One Japanese character has been causing my head to swim lately. I've
> > finally tracked down the problem to both Java 1.3 and Postgresql.
> >
> > The problem character is namely:
> > utf-16: 0x301C
> > utf-8: 0xE3809C
> > SJIS: 0x8160
> > EUC_JP: 0xA1C1
> > Otherwise known as the WAVE DASH character.
> >
> > The confusion stems from a very similar character 0xFF5E (utf-16) or
> > 0xEFBD9E (utf-8) the FULLWIDTH TILDE.
> >
> > Java has just lately (1.4.1) finally fixed their mappings so that 0x301C
> > maps correctly to both the correct SJIS and EUC-JP character. Previously
> > (at least in 1.3.1) they mapped SJIS to 0xFF5E and EUC to 0x301C,
> > causing all sorts of trouble.
> >
> > Postgresql at least picked one of the two characters namely 0xFF5E, so
> > conversions in and out of the database to/from sjis/euc seemed to be
> > working. Problem is when you try to view utf-8 from the database or if
> > you read the data into java (utf-16) and try converting to euc or sjis
> > from there.
> >
> > Anyway, I think postgresql needs to be fixed for this character. In my
> > opinion what needs to be done is to change the mappings...
> >
> > euc-jp -> utf-8    -> euc-jp
> > ======    ========    ======
> > 0xA1C1 -> 0xE3809C    0xA1C1
> >
> > sjis   -> utf-8    -> sjis
> > ======    ========    ======
> > 0x8160 -> 0xE3809C    0x8160
> >
> > As to what to do with the current mapping of 0xEFBD9E (utf-8)? It
> > probably should be removed. Maybe you could keep the mapping back to the
> > sjis/euc characters to help backward compatibility though. I'm not sure
> > what is the correct approach there.
> >
> > If anyone can tell me how to edit the mappings under:
> > src/backend/utils/mb/Unicode/
> >
> > and rebuild postgres to use them, then I can test this out locally.
>
> Just edit src/backend/utils/mb/Unicode/*.map and rebuild
> PostgreSQL. Probably you might want to modify utf8_to_euc_jp.map and
> euc_jp_to_utf8.map.
> --
> Tatsuo Ishii

--
Thomas O'Dowd <tom@nooper.com>
Nooper.com Mobile Services Inc
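
The byte values quoted in this message can be cross-checked with a short script. The following is only a minimal sketch in Python, not part of the PostgreSQL sources or of the original post: the helper names `parse_jis0208_line` and `jis_to_euc` are hypothetical, and the only input is the single JIS0208.TXT line quoted above. It derives the EUC_JP value by adding 0x8080 to the second column and the UTF-8 bytes from the third column.

```python
# Sketch: derive the EUC_JP and UTF-8 values for WAVE DASH from the
# JIS0208.TXT line quoted in the message. Helper names are hypothetical.

def parse_jis0208_line(line):
    """Split one JIS0208.TXT data line into (sjis, jis, ucs, name)."""
    data, _, comment = line.partition("#")
    sjis, jis, ucs = (int(field, 16) for field in data.split())
    return sjis, jis, ucs, comment.strip()


def jis_to_euc(jis):
    """EUC-JP sets the high bit of both JIS X 0208 bytes, i.e. adds 0x8080."""
    return jis + 0x8080


line = "0x8160 0x2141 0x301C # WAVE DASH"
sjis, jis, ucs, name = parse_jis0208_line(line)
euc = jis_to_euc(jis)

# UTF-8 bytes of the Unicode code point in the third column.
utf8 = chr(ucs).encode("utf-8")

# UTF-8 bytes of the look-alike character, FULLWIDTH TILDE (U+FF5E).
fullwidth_tilde = chr(0xFF5E).encode("utf-8")

print(f"{name}: SJIS=0x{sjis:04X} EUC_JP=0x{euc:04X} "
      f"UTF-16=0x{ucs:04X} UTF-8=0x{utf8.hex().upper()}")
print(f"FULLWIDTH TILDE: UTF-8=0x{fullwidth_tilde.hex().upper()}")
```

The results agree with the values listed in the quoted message: 0xA1C1 for EUC_JP and 0xE3809C for UTF-8, with 0xEFBD9E for the FULLWIDTH TILDE.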