Thread: Unicode + LC_COLLATE
Priem, Alexander said:
> I recreated my entire database (luckily I keep scripts for
> table/index/view creation) and initdb-ed it using --lc-collate=C
> --encoding=UNICODE. In my psqlODBC DSN settings I added "set
> client_encoding='LATIN9';" to the Connect Settings and that solved all
> my problems regarding the special characters.

Does anyone know what the effect of --lc-collate=C --encoding=UNICODE
will be for sorts (and indexes?) when a multibyte Unicode character is
encountered?

Is --lc-collate=C --encoding=UNICODE even valid? And if it is valid,
what unexpected nasties could it cause?

Is it also true that if LC_COLLATE != 'C' then indexes cannot be used
for LIKE comparisons (and is this also true for en_US.iso885915)?

Our database is UNICODE with LC_COLLATE=en_US.iso885915. Does anyone
know what the effect of someone storing a Cyrillic, Chinese or Korean
character is? (We are using JDBC with a webapp, so all the Unicode
concerns are handled transparently, apparently.) When the data is
extracted from the DB, will it render correctly in the browser provided
we send all responses encoded in UTF-8?

Although http://www.postgresql.org/docs/7.4/interactive/charset.html
describes the Postgres-specific implementation and "how to configure
for" a given locale, the subtle nuances of combinations of encoding and
LC_COLLATE, and the tradeoffs involved, are not entirely clear (to me at
least). For example, are the performance penalties of using UNICODE over
ASCII significant?

Maybe it's just my inexperience, but this topic seems to cause lots of
questions. A good/simple technote would be really useful... I'd do one
but I really don't know my ass from my elbow around this topic (and
probably many others too!).

Thanks for any answers/feedback/more info.

John Sidney-Woollett
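As a rough sketch, the setup Alexander describes amounts to the
following (the data directory path is a placeholder; the SET line is
what the psqlODBC "Connect Settings" box sends, or can be issued
manually per session):

  # re-initialise the cluster with a C collation and UNICODE encoding
  initdb -D /usr/local/pgsql/data --lc-collate=C --encoding=UNICODE

  -- then, per connection:
  SET client_encoding = 'LATIN9';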
"John Sidney-Woollett" <johnsw@wardbrook.com> writes: > Does anyone know what the effect of --lc-collate=C --encoding=UNICODE will > be for sorts (and indexes?) when a multibyte unicode character is > encountered? C locale basically means "sort by the byte sequence values". It'll do something self-consistent, but maybe not what you'd like for UTF8 characters. > Our database is UNICODE with LC_COLLATE=en_US.iso885915. Does that sort rationally at all? I should think you'd need to specify an LC_COLLATE setting that's designed for UTF8 encoding, not 8859-15. If you only ever store characters that are in 7-bit ASCII then none of this will affect you, and you can get away with broken combinations of encoding and locale. But if you'd like to sort characters outside the minimal ASCII set then you need to get it right ... regards, tom lane
On Thursday, 22 April 2004 13:17, John Sidney-Woollett wrote:
> Does anyone know what the effect of --lc-collate=C --encoding=UNICODE
> will be for sorts (and indexes?) when a multibyte Unicode character is
> encountered?

You get your strings sorted in binary order of the UTF-8 encoding, which
is probably not very interesting, but it's possible.

> Is it also true that if LC_COLLATE != 'C' then indexes cannot be used
> for LIKE comparisons (and is this also true for en_US.iso885915)?

No, see <http://www.postgresql.org/docs/7.4/static/indexes-opclass.html>.

> Our database is UNICODE with LC_COLLATE=en_US.iso885915. Does anyone
> know what the effect of someone storing a Cyrillic, Chinese or Korean
> character is?

This setup will result in UTF-8 characters being sorted by the system
thinking they are actually ISO-8859-15 characters. So the result will be
random at best.

> (We are using JDBC with a webapp, so all the Unicode concerns are
> handled transparently, apparently.) When the data is extracted from the
> DB, will it render correctly in the browser provided we send all
> responses encoded in UTF-8?

If your database is in UNICODE and you're using JDBC then you should be
all set as far as PostgreSQL is concerned. Of course, your HTML pages
need to declare the encoding correctly as well.
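In short, the page Peter links to says you can add an index using one of
the pattern operator classes so that anchored LIKE searches remain
indexable even under a non-C locale. A sketch with hypothetical
table/column names:

  -- hypothetical table; the extra operator class lets an anchored LIKE
  -- use an index even when LC_COLLATE is not C
  CREATE TABLE customers (surname varchar(80));
  CREATE INDEX customers_surname_like_idx
      ON customers (surname varchar_pattern_ops);

  -- this anchored pattern can now be answered with an index scan:
  SELECT * FROM customers WHERE surname LIKE 'Smi%';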
Tom Lane said:
> C locale basically means "sort by the byte sequence values". It'll do
> something self-consistent, but maybe not what you'd like for UTF8
> characters.

OK, that explains that. I guess I will need to try it out to see what
the effect is on extended character sets.

>> Our database is UNICODE with LC_COLLATE=en_US.iso885915.
>
> Does that sort rationally at all? I should think you'd need to specify
> an LC_COLLATE setting that's designed for UTF8 encoding, not 8859-15.

Er... actually the LC_COLLATE for the DB in question is C - I was
looking at the wrong database (wrong telnet session)! So your comments
above apply in this case.

> If you only ever store characters that are in 7-bit ASCII then none of
> this will affect you, and you can get away with broken combinations of
> encoding and locale. But if you'd like to sort characters outside the
> minimal ASCII set then you need to get it right ...

Tom, thanks for the answers above.

I guess if I have some time I should build some different DBs with
different combinations of encodings and collations and summarise my
findings using different types of data and sort/search commands, in case
anyone else has the same level of confusion that I do...

John Sidney-Woollett
John,

> I guess if I have some time I should build some different DBs with
> different combinations of encodings and collations and summarise my
> findings using different types of data and sort/search commands, in
> case anyone else has the same level of confusion that I do...

That'd be excellent. Be sure to offer the writeup for inclusion into the
techdocs site.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Peter Eisentraut said:
> On Thursday, 22 April 2004 13:17, John Sidney-Woollett wrote:
> You get your strings sorted in binary order of the UTF-8 encoding,
> which is probably not very interesting, but it's possible.

Agreed.

>> Is it also true that if LC_COLLATE != 'C' then indexes cannot be used
>> for LIKE comparisons (and is this also true for en_US.iso885915)?
>
> No, see <http://www.postgresql.org/docs/7.4/static/indexes-opclass.html>.

I wish I understood what this page actually was trying to say. Is it
saying that varchar_pattern_ops sorts according to the 'C' locale
regardless of LC_COLLATE, and that varchar_ops sorts according to the
current value of LC_COLLATE?

> This setup will result in UTF-8 characters being sorted by the system
> thinking they are actually ISO-8859-15 characters. So the result will
> be random at best.

Actually the LC_COLLATE is currently 'C', not ISO-8859-15 as I reported.

What would be a correct LC_COLLATE value for my database if we want to
primarily serve ISO-8859-1 characters, but allow for
Cyrillic/Chinese/Japanese/Korean characters too and have them sort and
index correctly? We are building a multilanguage website...

ls /usr/share/locale produces:

  ca  de  en@boldquot  en_SE  fi  hr  ko            no     sk  zh_TW
  cs  el  en_GB        en_US  fr  it  locale.alias  pl     sv
  da  en  en@quot      es     gl  ja  nl            pt_BR  tr

Thanks for any more info.

John Sidney-Woollett
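For what it's worth, that is roughly the idea: the *_pattern_ops classes
compare strings character by character regardless of LC_COLLATE, while
the default classes follow the locale. So a column that needs both
anchored LIKE lookups and locale-aware ordering can carry one index of
each; a sketch reusing the made-up names from the earlier example:

  CREATE INDEX customers_surname_idx
      ON customers (surname);                     -- locale-aware <, >, ORDER BY
  CREATE INDEX customers_surname_like_idx
      ON customers (surname varchar_pattern_ops); -- anchored LIKE patterns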
> C locale basically means "sort by the byte sequence values". It'll do
> something self-consistent, but maybe not what you'd like for UTF8
> characters.
>
> Does that sort rationally at all? I should think you'd need to specify
> an LC_COLLATE setting that's designed for UTF8 encoding, not 8859-15.
>
> If you only ever store characters that are in 7-bit ASCII then none of
> this will affect you, and you can get away with broken combinations of
> encoding and locale. But if you'd like to sort characters outside the
> minimal ASCII set then you need to get it right ...

But if you use anything other than C, you can't use indexes in LIKE
clauses, right?

Would lc-collate=C be bad in combination with UNICODE encoding? What
lc-collate setting would you recommend for UNICODE encoding which will
provide good sorting for all (most) common languages (Dutch, English,
French, German)?

Alexander Priem
Priem, Alexander said:
> Would lc-collate=C be bad in combination with UNICODE encoding? What
> lc-collate setting would you recommend for UNICODE encoding which will
> provide good sorting for all (most) common languages (Dutch, English,
> French, German)?

It seems that LC_COLLATE=C is not a good idea when using UTF-8...

On my db server, /etc/sysconfig/i18n contains:

  LANG="en_GB.UTF-8"
  SUPPORTED="en_GB.UTF-8:en_GB:en:en_US.UTF-8:en_US:en"
  SYSFONT="latarcyrheb-sun16"

locale -a produces:

  C
  [..snip..]
  en_GB
  en_GB.iso885915
  en_GB.utf8
  [..snip..]
  en_US
  en_US.iso885915
  en_US.utf8
  [..snip..]
  POSIX

and locale produces:

  LANG=en_GB.UTF-8
  LC_CTYPE="en_GB.UTF-8"
  LC_NUMERIC="en_GB.UTF-8"
  LC_TIME="en_GB.UTF-8"
  LC_COLLATE="en_GB.UTF-8"
  LC_MONETARY="en_GB.UTF-8"
  LC_MESSAGES="en_GB.UTF-8"
  LC_PAPER="en_GB.UTF-8"
  LC_NAME="en_GB.UTF-8"
  LC_ADDRESS="en_GB.UTF-8"
  LC_TELEPHONE="en_GB.UTF-8"
  LC_MEASUREMENT="en_GB.UTF-8"
  LC_IDENTIFICATION="en_GB.UTF-8"
  LC_ALL=

QUESTION: Because I want UNICODE encoding/support in postgres, does that
mean that when I init my db I should specify (for en_GB support)

  initdb -E UNICODE                      <-- (no locale, default to en_GB)

or

  initdb --locale=en_GB -E UNICODE

or should I use

  initdb --locale=en_GB.utf8 -E UNICODE

Basically I want the database to be able to support all character
encodings, and to sort according to the en_GB locale. The monetary and
date formats determined by the locale are irrelevant for us, as we
return correctly localised versions of this data from our web app (which
is connected to postgres) based on the user's selected language for
their current web browser session.

Thanks to anyone who can answer the initdb question above, hopefully the
answer will help others too.

John Sidney-Woollett
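As an aside, one way to double-check what an existing cluster was
actually initialised with (the data directory path below is a
placeholder; pg_controldata reports the cluster-wide LC_COLLATE and
LC_CTYPE recorded at initdb time, and psql -l lists each database's
encoding):

  pg_controldata /usr/local/pgsql/data | grep 'LC_'
  psql -l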
On Thursday, 22 April 2004 16:37, Priem, Alexander wrote:
> But if you use anything other than C, you can't use indexes in LIKE
> clauses, right?

No, see <http://www.postgresql.org/docs/7.4/static/indexes-opclass.html>.

> Would lc-collate=C be bad in combination with UNICODE encoding? What
> lc-collate setting would you recommend for UNICODE encoding which will
> provide good sorting for all (most) common languages (Dutch, English,
> French, German)?

That is currently not supported at all by PostgreSQL. You would need to
use a single-byte encoding.
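Following that advice, a single-byte setup for the Western European
languages Alexander lists might look like the sketch below. It assumes a
suitable ISO-8859-15 locale is installed (for example the
en_GB.iso885915 from John's locale -a listing, or a Dutch/German
equivalent), and the data directory path is a placeholder:

  initdb -D /usr/local/pgsql/data --locale=en_GB.iso885915 --encoding=LATIN9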