Thread: UTF-8 encoding question regarding PhpPgAdmin development
Dear all,

We are working on PhpPgAdmin UTF-8 support. I would like to be able to view UTF-8, ASCII and Latin1 databases in PhpPgAdmin without changing HTML header encodings. I guess this can be done using SET CLIENT_ENCODING = 'Unicode' for all PhpPgAdmin connections.

My questions are:

- Are some database encodings not translatable into UTF-8 using SET CLIENT_ENCODING = 'Unicode'? It used to be the case for Latin1, but it has been fixed now.

- Some letters, like the euro sign, do not belong to Latin1. Example: let's say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I input a euro sign, does it get rejected by PostgreSQL?

- More generally, is it safe to convert an encoding (e.g. Latin1 or Chinese multi-byte) into UTF-8 using SET CLIENT_ENCODING? Can all multi-byte encodings be converted into/from UTF-8 safely?

Best regards,
Jean-Michel
Jean-Michel POURE writes:

> - Are some database encodings not translatable into UTF-8 using SET
> CLIENT_ENCODING = 'Unicode'? It used to be the case for Latin1, but it has
> been fixed now.

It should be possible. If not, it's a bug.

> - Some letters, like the euro sign, do not belong to Latin1. Example: let's
> say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
> input a euro sign, does it get rejected by PostgreSQL?

Currently, it gives you a warning and ignores the character. Not sure that is ideal.

> - More generally, is it safe to convert an encoding (e.g. Latin1 or Chinese
> multi-byte) into UTF-8 using SET CLIENT_ENCODING? Can all multi-byte
> encodings be converted into/from UTF-8 safely?

Some points to keep in mind:

Some character sets contain characters that are not in Unicode, although you might choose to ignore that fact because it is of relatively minor importance.

Round-trip conversion is not safely possible, so if your tool provides read/edit/write functionality then you will have problems.

Finally, when you display East Asian characters you will have a font problem because the Chinese, Japanese, and Korean characters are mapped to the same range in Unicode but you are supposed to use country-specific glyphs.

In short, I don't think what you are trying to do is easily achievable.

--
Peter Eisentraut peter_e@gmx.net
Dear Peter,

Thank you very much for your answers. It rings a bell.

> Finally, when you display East Asian characters you will
> have a font problem because the Chinese, Japanese, and Korean characters
> are mapped to the same range in Unicode but you are supposed to use
> country-specific glyphs.

Do you mean that glyph hexaX will display differently in UTF-8 and EUC_JP? If that is really the case, we cannot use UTF-8.

> Round-trip conversion is not safely possible, so if your tool provides a
> read/edit/write tool then you will have problems.

Maybe we could use "getdatabaseencoding()" to determine the database encoding and generate HTML pages with the corresponding headers. Example: Latin1 database <-> ISO-8859-1 headers.

The problem is that the PhpPgAdmin interface needs to be localized in several languages, unrelated to the database encoding. Example: EUC_JP interface and Latin1 databases.

Maybe a solution would be to use the ISO 10646 notation for PhpPgAdmin interface localization: "&#xH;", where H is a hexadecimal number.

Cheers,
Jean-Michel POURE
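To make the last suggestion concrete, here is a minimal sketch in Python, assuming the notation meant is the HTML hexadecimal numeric character reference (`&#xH;`, where H is the ISO 10646 code point in hex). The helper name `to_hex_char_refs` is hypothetical and not part of PhpPgAdmin. Because the result is pure ASCII, a localized interface string written this way would be valid in a page served with Latin1, EUC_JP or UTF-8 headers alike.

```python
import html


def to_hex_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a reference of the form &#xH;.

    Hypothetical helper for illustration only. Note that it does not escape
    '&', '<' or '>' themselves; a real page generator would have to do that
    separately.
    """
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )


label = "Prix: 10 \u20ac"            # contains the euro sign, U+20AC
encoded = to_hex_char_refs(label)
print(encoded)                        # Prix: 10 &#x20AC;

# The output is plain ASCII, so it survives any ASCII-compatible page encoding.
encoded.encode("ascii")

# A browser (simulated here with html.unescape) recovers the original text.
print(html.unescape(encoded) == label)  # True
```

This only covers text that PhpPgAdmin itself emits, such as interface labels. Data typed into form fields by the user still arrives in whatever encoding the page declares, so it does not by itself solve the read/edit/write problem raised above.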
On Tue, 2003-01-07 at 21:59, Peter Eisentraut wrote:

> > - Some letters, like the euro sign, do not belong to Latin1. Example: let's
> > say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
> > input a euro sign, does it get rejected by PostgreSQL?
>
> Currently, it gives you a warning and ignores the character. Not sure
> that is ideal.

(Yes, I should try this myself...)

Ignored as in 'passed through unchanged'; or ignored as in 'removed from the string'?

cheers
-- vbi

--
this email is protected by a digital signature: http://fortytwo.ch/gpg
Jean-Michel POURE writes:

> > Finally, when you display East Asian characters you will
> > have a font problem because the Chinese, Japanese, and Korean characters
> > are mapped to the same range in Unicode but you are supposed to use
> > country-specific glyphs.
>
> Do you mean that glyph hexaX will display differently in UTF-8 and EUC_JP? If
> it is really the case, we cannot use UTF-8.

Well, it's not completely different, but customized to the language. The Chinese, Japanese, and Korean ideographs are really the same historically but are displayed slightly differently. If you use a country-specific character set you probably also get a country-specific font with it, but if you map it to Unicode then you will get whatever the default look is on your computer.

This is actually not so bad because, as I understand it, a Japanese book that quotes Chinese text, for example, uses the Japanese-look ideographs for the Chinese portions as well. But a database administration tool is not a Japanese book, so you need to judge it.

--
Peter Eisentraut peter_e@gmx.net
> > > - Some letters, like the euro sign, do not belong to Latin1. Example: let's
> > > say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
> > > input a euro sign, does it get rejected by PostgreSQL?
> >
> > Currently, it gives you a warning and ignores the character. Not sure
> > that is ideal.
>
> (Yes, I should try this myself...)
>
> Ignored as in 'passed through unchanged'; or ignored as in 'removed from
> the string'?

"removed from the string".

BTW, if I remember correctly, the euro sign is supported in ISO-8859-16, not in ISO-8859-1.

--
Tatsuo Ishii
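For readers who want to check the character-set point without a database at hand, the following is a rough sketch using Python's standard codecs. It is only an analogy: it exercises Python's conversion tables, not PostgreSQL's, and the `errors="ignore"` mode is used merely to mimic the "removed from the string" behaviour described above. The helper name `encodable` is made up for this example.

```python
EURO = "\u20ac"  # the euro sign, U+20AC


def encodable(ch: str, encoding: str) -> bool:
    """Return True if the character has a mapping in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ISO-8859-1 (Latin1) has no euro sign; ISO-8859-15 and ISO-8859-16 do.
for enc in ("iso-8859-1", "iso-8859-15", "iso-8859-16", "utf-8"):
    print(enc, encodable(EURO, enc))

# Silently dropping the unmappable character, analogous to
# "removed from the string": the euro sign is simply gone.
lossy = ("10 " + EURO).encode("iso-8859-1", errors="ignore")
print(lossy)  # b'10 '
```

The second part also shows why a read/edit/write tool is at risk: once the character has been dropped on the way into a Latin1 database, nothing on the way back out can restore it.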