Thread: Concerning about Unicode-aware string handling
Hello, We have problems (currently using 8.4, but also in latest 9.1.3) in our application with Unicode word symbols in Lithuanian ('ąčęėįšųūž'), Russian and of course potentially other languages. For example, regex_replace('acząčž', E'\\W', '', 'g') removes ąčž. lower() and ~* comparison works only with locale that is set (no internationalization). Could we expect Unciode support in near future? Or should we do quick hacks by reimplementing regexp_replace(), lower(), upper() and other string SQL functions using, for example, Qt libraries..? Or maybe are there some kind simpler workarounds? Thank you, Vincas.
On 05/21/12 2:09 AM, Vincas Dargis wrote: > We have problems (currently using 8.4, but also in latest 9.1.3) in > our application with Unicode word symbols in Lithuanian ('ąčęėįšųūž'), > Russian and of course potentially other languages. > > For example, regex_replace('acząčž', E'\\W', '', 'g') removes ąčž. > > lower() and ~* comparison works only with locale that is set (no > internationalization). > > Could we expect Unciode support in near future? Or should we do quick > hacks by reimplementing regexp_replace(), lower(), upper() and other > string SQL functions using, for example, Qt libraries..? Or maybe are > there some kind simpler workarounds? your database encoding is UTF8 ? the language or environment you're using to generate those strings such as 'acząčž' is also UTF8 ? postgresql supports UTF-8 unicode just fine. It does not directly support the bastardized UTF-16 'unicode' implemented by Windows NT and derivatives (2000, XP, 2003, Vista, 2008, 7), but on those platforms it generally behaves fairly sanely as long as you realize UTF8 is its native tongue. of course, the database has to be created as a UTF8 database, its possible to initialize the server cluster in "C"/"POSIX"/"SQLASCII" which says bytes-are-bytes and encodings are unknown, or in various 8 bit encodings like LATIN-1. -- john r pierce N 37, W 122 santa cruz ca mid-left coast
On Mon, May 21, 2012 at 02:44:45AM -0700, John R Pierce wrote: > support the bastardized UTF-16 'unicode' implemented by Windows NT To be fair to Microsoft, while the BOM might be an irritant, they do use a perfectly legitimate encoding of Unicode. There is no Unicode requirement that code points be stored as UTF-8, and there is a strong argument to be made that, for some languages, UTF-8 is extremely inefficient and therefore the least preferred encoding. (Microsoft's dependence on the BOM with UTF-16 -- really UCS2 -- is problematic, of course, and appears to be adjusted in funny ways in Win 7.) Because all wire protocols from the IETF use UTF-8 for Unicode encoding, your best bet is still UTF-8 for maximal portability, so your point about needing to make the database encoding and client locale UTF-8 is correct. Best, A -- Andrew Sullivan ajs@crankycanuck.ca
Sorry I have to write "manual" replay since I've messed up mailing list settings (got "Partial Digest"...). John R Pierce wrote: > your database encoding is UTF8 ? the language or environment you're using to generate those strings such as 'acząčž' isalso UTF8 ? Database created using: initdb -D ../data -E utf-8 -U postgres I have to take back statements about "lower()" and "~*" - they do not work only for (strange) clusters created with "--no-locale". But regexp_replace issue is still there. Regexp "\w" understands only as "ascii word character" ?
Vincas Dargis wrote: > We have problems (currently using 8.4, but also in latest 9.1.3) in > our application with Unicode word symbols in Lithuanian ('ąčęėįšųūž'), > Russian and of course potentially other languages. > > For example, regex_replace('acząčž', E'\\W', '', 'g') removes ąčž. > > lower() and ~* comparison works only with locale that is set (no > internationalization). > > Could we expect Unciode support in near future? Or should we do quick > hacks by reimplementing regexp_replace(), lower(), upper() and other > string SQL functions using, for example, Qt libraries..? Or maybe are > there some kind simpler workarounds? I tried it with 9.1.3 on Linux: upper() and lower() works fine, no matter what the database encoding is: test=> SELECT upper('acząčž'); upper -------- ACZĄČŽ (1 row) And this seems OK with LATIN7: lt2=> SHOW server_encoding; server_encoding ----------------- LATIN7 (1 row) lt2=> SHOW lc_ctype; lc_ctype ---------- lt_LT (1 row) lt2=> SHOW lc_collate; lc_collate ------------ lt_LT (1 row) lt2=> SELECT 'ą' ~* '\w'; ?column? ---------- t (1 row) But it looks wrong with UTF8: lt=> SHOW server_encoding; server_encoding ----------------- UTF8 (1 row) lt=> SHOW lc_ctype; lc_ctype ------------ lt_LT.utf8 (1 row) lt=> SHOW lc_collate; lc_collate ------------ lt_LT.utf8 (1 row) lt=> SELECT 'ą' ~* '\w'; ?column? ---------- f (1 row) Is that what you are complaining about? Yours, Laurenz Albe
I've forgot to mention I'm working on Windows XP SP3 Yes, we are using UTF8 encoding and regexp works wrong. It looks like you replicated that. 2012/5/21 Albe Laurenz <laurenz.albe@wien.gv.at>: > > I tried it with 9.1.3 on Linux: > > upper() and lower() works fine, no matter what the > database encoding is: > > test=> SELECT upper('acząčž'); > upper > -------- > ACZĄČŽ > (1 row) > > And this seems OK with LATIN7: > > lt2=> SHOW server_encoding; > server_encoding > ----------------- > LATIN7 > (1 row) > > lt2=> SHOW lc_ctype; > lc_ctype > ---------- > lt_LT > (1 row) > > lt2=> SHOW lc_collate; > lc_collate > ------------ > lt_LT > (1 row) > > lt2=> SELECT 'ą' ~* '\w'; > ?column? > ---------- > t > (1 row) > > But it looks wrong with UTF8: > > lt=> SHOW server_encoding; > server_encoding > ----------------- > UTF8 > (1 row) > > lt=> SHOW lc_ctype; > lc_ctype > ------------ > lt_LT.utf8 > (1 row) > > lt=> SHOW lc_collate; > lc_collate > ------------ > lt_LT.utf8 > (1 row) > > lt=> SELECT 'ą' ~* '\w'; > ?column? > ---------- > f > (1 row) > > > Is that what you are complaining about? > > Yours, > Laurenz Albe
Vincas Dargis <vindrg@gmail.com> writes: > Database created using: > initdb -D ../data -E utf-8 -U postgres That looks fairly dangerous, as it will absorb the database's locale settings (particularly LC_CTYPE, which is what you care about for these operations) from your shell environment. If the environment locale is not for UTF8 encoding then it won't work at all. Best to specify a --locale switch as well. See http://www.postgresql.org/docs/9.1/static/charset.html > But regexp_replace issue is still there. Regexp "\w" understands only > as "ascii word character" ? Locale-specific character classes in regexps are not terribly bright about UTF8, because historically that code has not considered any character codes above 255 :-(. So in UTF8 you only got correct behavior for ASCII and LATIN1 characters. 9.2 will be better though not perfect: http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=e00f68e49 regards, tom lane
On 05/21/2012 06:59 PM, Andrew Sullivan wrote: > On Mon, May 21, 2012 at 02:44:45AM -0700, John R Pierce wrote: >> support the bastardized UTF-16 'unicode' implemented by Windows NT > To be fair to Microsoft, while the BOM might be an irritant, they do > use a perfectly legitimate encoding of Unicode. There is no Unicode > requirement that code points be stored as UTF-8, and there is a strong > argument to be made that, for some languages, UTF-8 is extremely > inefficient and therefore the least preferred encoding. (Microsoft's > dependence on the BOM with UTF-16 -- really UCS2 -- is problematic, of > course, and appears to be adjusted in funny ways in Win 7.) In fact, until it became clear that UCS-2 (now UTF-16) wasn't enough and we'd need 4 bytes to represent characters, Microsoft's choice of UCS-2 with BOM looked really good. They just didn't realise that UCS-2 would turn into UTF-16 when UCS-4 came on the scene, so they'd be left holding a bastardised half-way mess that's usually-but-not-always 2 bytes per character. MS's choice allowed programs to work with the safe (at the time) assumption that each char was 2 bytes, which made a lot of things way simpler than they are in UTF-8 and was well and truly worth the storage bloat IMO. Pity Unicode had to grow again and break the assumption. -- Craig Ringer