Thread: Yet another problem with ILIKE and UTF-8
Hello all, I have a nasty-looking problem case. Shortly described as follows: INSERT INTO mytable (id, value) VALUES (4242, '=FAabcd=FA'); SELECT id FROM mytable WHERE value ILIKE '%abc%'; In environment A, the row of the ID just inserted is returned correctly, but in environment B no rows are found. Uh! (Sadly environment B is the productive environment... :/) Notice the UTF-8 chars in the inserted sting and the _lack_ of UTF-8 chars in the searched string. Environment A: Win2000, psql 8.2.4, lc_* is C, all encondings (client, server, DB) are UTF-8. Environment B: Debian lenny/sid ^[1], kernel version 2.6.20.1, glibc 2.6.1-5, psql 8.2.5, lc_* is hu_HU, all encondings (client, server, DB) are UTF-8. ^[1]: I know :( but I'm not the sysadmin. Notes: - The same query works if ILIKE is replaced with LIKE. - In environment B, ILIKE '%bcd=FA%' also returns 0 rows, but ILIKE '%bcd%' works well. Quite strange. - I could not find the exact condition of the problem. A good lot of queries work (provided the UTF-8 characters in the query exactly match the DB value), but others like the above one do not. - From my prev. googling, I've learnt that ILIKE does not handle UTF-8 strings searched _for_, and I've already implemented a workaround for that, but obviously I can't do much for this case, when the operand is a simple ASCII string. - From the first glance, this problem seems different from the ones I've found during my searches (eg. links from http://www.postgresql.org/docs/faqs.TODO.html, http://svr5.postgresql.org/pgsql-bugs/2005-07/msg00157.php and friends), but I'm not sure. If you think it's a new bug, please tell it to me and I'll report it. Well, any help is welcome, as our system is live and productive... :/ Best regards, Gergely BOR
"Gergely Bor" <borg42@gmail.com> writes: > I have a nasty-looking problem case. Shortly described as follows: > > INSERT INTO mytable (id, value) VALUES (4242, '=FAabcd=FA'); > SELECT id FROM mytable WHERE value ILIKE '%abc%'; > > In environment A, the row of the ID just inserted is returned > correctly, but in environment B no rows are found. Uh! (Sadly > environment B is the productive environment... :/) > > Notice the UTF-8 chars in the inserted sting and the _lack_ of UTF-8 > chars in the searched string. > > Environment A: Win2000, psql 8.2.4, lc_* is C, all encondings (client, > server, DB) are UTF-8. > Environment B: Debian lenny/sid ^[1], kernel version 2.6.20.1, glibc > 2.6.1-5, psql 8.2.5, lc_* is hu_HU, all encondings (client, server, > DB) are UTF-8. I'm not sure this is the right answer but what happens if you initdb a database on the Debian box with lc_* set to hu_HU.UTF-8 ?=20 (You may have to add it to /etc/locale.gen and rerun locale-gen) Also, what does lower('=FAabcd=FA') return in that locale?=20 --=20 Gregory Stark EnterpriseDB http://www.enterprisedb.com
Hello Gregory, We'll google the initdb stuff and try it ASAP. What I've tried is LOWER and UPPER, and they seem to return trash for Hungarian UTF-8 characters, but they handle ASCII well. (Hmmmm... maybe ILIKE requires LOWER and UPPER to work? Would not be illogical...) Best regards, Gergely BOR On 10/25/07, Gregory Stark <stark@enterprisedb.com> wrote: > "Gergely Bor" <borg42@gmail.com> writes: > > > I have a nasty-looking problem case. Shortly described as follows: > > > > INSERT INTO mytable (id, value) VALUES (4242, '=FAabcd=FA'); > > SELECT id FROM mytable WHERE value ILIKE '%abc%'; > > > > In environment A, the row of the ID just inserted is returned > > correctly, but in environment B no rows are found. Uh! (Sadly > > environment B is the productive environment... :/) > > > > Notice the UTF-8 chars in the inserted sting and the _lack_ of UTF-8 > > chars in the searched string. > > > > Environment A: Win2000, psql 8.2.4, lc_* is C, all encondings (client, > > server, DB) are UTF-8. > > Environment B: Debian lenny/sid ^[1], kernel version 2.6.20.1, glibc > > 2.6.1-5, psql 8.2.5, lc_* is hu_HU, all encondings (client, server, > > DB) are UTF-8. > > I'm not sure this is the right answer but what happens if you initdb a > database on the Debian box with lc_* set to hu_HU.UTF-8 ? > (You may have to add it to /etc/locale.gen and rerun locale-gen) > > Also, what does lower('=FAabcd=FA') return in that locale? > > -- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com >
"Gergely Bor" <borg42@gmail.com> writes: > We'll google the initdb stuff and try it ASAP. > > What I've tried is LOWER and UPPER, and they seem to return trash for > Hungarian UTF-8 characters, but they handle ASCII well. (Hmmmm... > maybe ILIKE requires LOWER and UPPER to work? Would not be > illogical...) It does. I think it works by just downcasing both strings. It's possible to do better but tricky. I think 8.3 has an optimization for that for single-byte encodings but it had to be disabled for utf-8 in the end. If it's returning trash for those characters then it's not prepared to handle UTF-8 data. You have to use an encoding compatible with your locale and vice-versa. If you want to store UTF-8 data I suggest you . add hu_HU.UTF-8 to /etc/locale.gen, . rerun /usr/sbin/locale-gen . pg_dump your database . re-initdb with the locale set to hu_HU.UTF-8 . pg_restore your data. Unfortunately that'll take quite a while and involve down-time. You should probably do this in a second directory aside from your existing database just in case you've created any invalidly encoded utf-8 strings. You'll have to fix them before restoring. (Actually I don't recall which version got strict about that.) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Gergely Bor" <borg42@gmail.com> writes: >> Environment B: Debian lenny/sid ^[1], kernel version 2.6.20.1, glibc >> 2.6.1-5, psql 8.2.5, lc_* is hu_HU, all encondings (client, server, >> DB) are UTF-8. > I'm not sure this is the right answer but what happens if you initdb a > database on the Debian box with lc_* set to hu_HU.UTF-8 ? On my Fedora Core 6 machine, the encoding implied by LANG=hu_HU seems to be LATIN2, not UTF8. It's possible that Debian's glibc does this differently than Fedora's, but not real likely. So I think Greg has probably identified the problem correctly: you have a locale-vs-encoding mismatch on the Debian setup. FWIW, 8.3 will reject this sort of misconfiguration: $ LANG=hu_HU initdb -E utf8 The files belonging to this database system will be owned by user "tgl". This user must also own the server process. The database cluster will be initialized with locale hu_HU. initdb: encoding mismatch The encoding you selected (UTF8) and the encoding that the selected locale uses (LATIN2) do not match. This would lead to misbehavior in various character string processing functions. Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. regards, tom lane
Hello Gregory, hello all, I've tested the DB created with hu_HU.UTF8 lc_* settings and it works like a charm! We'll re-create the database and move the contents. It's a nice improvement of 8.3 to detect and disallow such misconfiguration. ------ Summing up the issue & fix (for googling): - Version: postgresql 8.2.x (and possibly older ones as well) - Symptom: UTF-8 values are stored correctly but some ILIKE queries don't return results as expected, even when searching for ASCII substrings. - Reason: database misconfiguration -> UPPER()/LOWER() cannot handle UTF-8 chars -> ILIKE cannot handle UTF-8 chars. The error above occurs if the DB is created with UTF-8 internal encoding, but lc_* settings are not referring UTF-8 locales. (Eg. bad locale reference: "hu_HU", good locale reference: "hu_HU.UTF8".) - Fix: a new database (or DB cluster?) should be created with the corrected lc_* settings. If there's data already, it should be dumped and restored (see steps below). - Additional info: lc_* settings (and a whole lot of others) can be displayed with the psql command "SHOW ALL". - Additional info: in case the DB is configured incorrectly, psql 8.2 just fails to match rows correctly (without any errors), but 8.3+ will reject this bad configuration in time. ------ Thanks a lot, guys! Best regards, Gergely BOR On 10/25/07, Gregory Stark <stark@enterprisedb.com> wrote: > > "Gergely Bor" <borg42@gmail.com> writes: > > > We'll google the initdb stuff and try it ASAP. > > > > What I've tried is LOWER and UPPER, and they seem to return trash for > > Hungarian UTF-8 characters, but they handle ASCII well. (Hmmmm... > > maybe ILIKE requires LOWER and UPPER to work? Would not be > > illogical...) > > It does. I think it works by just downcasing both strings. It's possible to do > better but tricky. I think 8.3 has an optimization for that for single-byte > encodings but it had to be disabled for utf-8 in the end. > > If it's returning trash for those characters then it's not prepared to handle > UTF-8 data. You have to use an encoding compatible with your locale and > vice-versa. > > If you want to store UTF-8 data I suggest you > > . add hu_HU.UTF-8 to /etc/locale.gen, > . rerun /usr/sbin/locale-gen > . pg_dump your database > . re-initdb with the locale set to hu_HU.UTF-8 > . pg_restore your data. > > Unfortunately that'll take quite a while and involve down-time. > > You should probably do this in a second directory aside from your existing > database just in case you've created any invalidly encoded utf-8 strings. > You'll have to fix them before restoring. (Actually I don't recall which > version got strict about that.) > > -- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com >