Thread: Yet another problem with ILIKE and UTF-8

Yet another problem with ILIKE and UTF-8

From
"Gergely Bor"
Date:
Hello all,


I have a nasty-looking problem case. Shortly described as follows:

INSERT INTO mytable (id, value) VALUES (4242, '=FAabcd=FA');
SELECT id FROM mytable WHERE value ILIKE '%abc%';

In environment A, the row of the ID just inserted is returned
correctly, but in environment B no rows are found. Uh! (Sadly
environment B is the productive environment... :/)

Notice the UTF-8 chars in the inserted sting and the _lack_ of UTF-8
chars in the searched string.

Environment A: Win2000, psql 8.2.4, lc_* is C, all encondings (client,
server, DB) are UTF-8.
Environment B: Debian lenny/sid ^[1], kernel version 2.6.20.1, glibc
2.6.1-5, psql 8.2.5, lc_* is hu_HU, all encondings (client, server,
DB) are UTF-8.

^[1]: I know :( but I'm not the sysadmin.

Notes:
 - The same query works if ILIKE is replaced with LIKE.
 - In environment B, ILIKE '%bcd=FA%' also returns 0 rows, but ILIKE
'%bcd%' works well. Quite strange.
 - I could not find the exact condition of the problem. A good lot of
queries work (provided the UTF-8 characters in the query exactly match
the DB value), but others like the above one do not.
 - From my prev. googling, I've learnt that ILIKE does not handle
UTF-8 strings searched _for_, and I've already implemented a
workaround for that, but obviously I can't do much for this case, when
the operand is a simple ASCII string.
 - From the first glance, this problem seems different from the ones
I've found during my searches (eg. links from
http://www.postgresql.org/docs/faqs.TODO.html,
http://svr5.postgresql.org/pgsql-bugs/2005-07/msg00157.php and
friends), but I'm not sure. If you think it's a new bug, please tell
it to me and I'll report it.

Well, any help is welcome, as our system is live and productive... :/


Best regards,
  Gergely BOR

Re: Yet another problem with ILIKE and UTF-8

From
Gregory Stark
Date:
"Gergely Bor" <borg42@gmail.com> writes:

> I have a nasty-looking problem case. Shortly described as follows:
>
> INSERT INTO mytable (id, value) VALUES (4242, '=FAabcd=FA');
> SELECT id FROM mytable WHERE value ILIKE '%abc%';
>
> In environment A, the row of the ID just inserted is returned
> correctly, but in environment B no rows are found. Uh! (Sadly
> environment B is the productive environment... :/)
>
> Notice the UTF-8 chars in the inserted sting and the _lack_ of UTF-8
> chars in the searched string.
>
> Environment A: Win2000, psql 8.2.4, lc_* is C, all encondings (client,
> server, DB) are UTF-8.
> Environment B: Debian lenny/sid ^[1], kernel version 2.6.20.1, glibc
> 2.6.1-5, psql 8.2.5, lc_* is hu_HU, all encondings (client, server,
> DB) are UTF-8.

I'm not sure this is the right answer but what happens if you initdb a
database on the Debian box with lc_* set to hu_HU.UTF-8 ?=20
(You may have to add it to /etc/locale.gen and rerun locale-gen)

Also, what does lower('=FAabcd=FA') return in that locale?=20

--=20
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: Yet another problem with ILIKE and UTF-8

From
"Gergely Bor"
Date:
Hello Gregory,


We'll google the initdb stuff and try it ASAP.

What I've tried is LOWER and UPPER, and they seem to return trash for
Hungarian UTF-8 characters, but they handle ASCII well. (Hmmmm...
maybe ILIKE requires LOWER and UPPER to work? Would not be
illogical...)


Best regards,
  Gergely BOR

On 10/25/07, Gregory Stark <stark@enterprisedb.com> wrote:
> "Gergely Bor" <borg42@gmail.com> writes:
>
> > I have a nasty-looking problem case. Shortly described as follows:
> >
> > INSERT INTO mytable (id, value) VALUES (4242, '=FAabcd=FA');
> > SELECT id FROM mytable WHERE value ILIKE '%abc%';
> >
> > In environment A, the row of the ID just inserted is returned
> > correctly, but in environment B no rows are found. Uh! (Sadly
> > environment B is the productive environment... :/)
> >
> > Notice the UTF-8 chars in the inserted sting and the _lack_ of UTF-8
> > chars in the searched string.
> >
> > Environment A: Win2000, psql 8.2.4, lc_* is C, all encondings (client,
> > server, DB) are UTF-8.
> > Environment B: Debian lenny/sid ^[1], kernel version 2.6.20.1, glibc
> > 2.6.1-5, psql 8.2.5, lc_* is hu_HU, all encondings (client, server,
> > DB) are UTF-8.
>
> I'm not sure this is the right answer but what happens if you initdb a
> database on the Debian box with lc_* set to hu_HU.UTF-8 ?
> (You may have to add it to /etc/locale.gen and rerun locale-gen)
>
> Also, what does lower('=FAabcd=FA') return in that locale?
>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>

Re: Yet another problem with ILIKE and UTF-8

From
Gregory Stark
Date:
"Gergely Bor" <borg42@gmail.com> writes:

> We'll google the initdb stuff and try it ASAP.
>
> What I've tried is LOWER and UPPER, and they seem to return trash for
> Hungarian UTF-8 characters, but they handle ASCII well. (Hmmmm...
> maybe ILIKE requires LOWER and UPPER to work? Would not be
> illogical...)

It does. I think it works by just downcasing both strings. It's possible to do
better but tricky. I think 8.3 has an optimization for that for single-byte
encodings but it had to be disabled for utf-8 in the end.

If it's returning trash for those characters then it's not prepared to handle
UTF-8 data. You have to use an encoding compatible with your locale and
vice-versa.

If you want to store UTF-8 data I suggest you

. add hu_HU.UTF-8 to /etc/locale.gen,
. rerun /usr/sbin/locale-gen
. pg_dump your database
. re-initdb with the locale set to hu_HU.UTF-8
. pg_restore your data.

Unfortunately that'll take quite a while and involve down-time.

You should probably do this in a second directory aside from your existing
database just in case you've created any invalidly encoded utf-8 strings.
You'll have to fix them before restoring. (Actually I don't recall which
version got strict about that.)

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: Yet another problem with ILIKE and UTF-8

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Gergely Bor" <borg42@gmail.com> writes:
>> Environment B: Debian lenny/sid ^[1], kernel version 2.6.20.1, glibc
>> 2.6.1-5, psql 8.2.5, lc_* is hu_HU, all encondings (client, server,
>> DB) are UTF-8.

> I'm not sure this is the right answer but what happens if you initdb a
> database on the Debian box with lc_* set to hu_HU.UTF-8 ?

On my Fedora Core 6 machine, the encoding implied by LANG=hu_HU
seems to be LATIN2, not UTF8.  It's possible that Debian's glibc
does this differently than Fedora's, but not real likely.
So I think Greg has probably identified the problem correctly:
you have a locale-vs-encoding mismatch on the Debian setup.

FWIW, 8.3 will reject this sort of misconfiguration:

$ LANG=hu_HU initdb -E utf8
The files belonging to this database system will be owned by user "tgl".
This user must also own the server process.

The database cluster will be initialized with locale hu_HU.
initdb: encoding mismatch
The encoding you selected (UTF8) and the encoding that the
selected locale uses (LATIN2) do not match.  This would lead to
misbehavior in various character string processing functions.
Rerun initdb and either do not specify an encoding explicitly,
or choose a matching combination.

            regards, tom lane

Re: Yet another problem with ILIKE and UTF-8

From
"Gergely Bor"
Date:
Hello Gregory, hello all,


I've tested the DB created with hu_HU.UTF8 lc_* settings and it works
like a charm! We'll re-create the database and move the contents. It's
a nice improvement of 8.3 to detect and disallow such
misconfiguration.

------
Summing up the issue & fix (for googling):
 - Version: postgresql 8.2.x (and possibly older ones as well)
 - Symptom: UTF-8 values are stored correctly but some ILIKE queries
don't return results as expected, even when searching for ASCII
substrings.
 - Reason: database misconfiguration -> UPPER()/LOWER() cannot handle
UTF-8 chars -> ILIKE cannot handle UTF-8 chars. The error above occurs
if the DB is created with UTF-8 internal encoding, but lc_* settings
are not referring UTF-8 locales. (Eg. bad locale reference: "hu_HU",
good locale reference: "hu_HU.UTF8".)
 - Fix: a new database (or DB cluster?) should be created with the
corrected lc_* settings. If there's data already, it should be dumped
and restored (see steps below).
 - Additional info: lc_* settings (and a whole lot of others) can be
displayed with the psql command "SHOW ALL".
 - Additional info: in case the DB is configured incorrectly, psql 8.2
just fails to match rows correctly (without any errors), but 8.3+ will
reject this bad configuration in time.
------

Thanks a lot, guys!


Best regards,
  Gergely BOR

On 10/25/07, Gregory Stark <stark@enterprisedb.com> wrote:
>
> "Gergely Bor" <borg42@gmail.com> writes:
>
> > We'll google the initdb stuff and try it ASAP.
> >
> > What I've tried is LOWER and UPPER, and they seem to return trash for
> > Hungarian UTF-8 characters, but they handle ASCII well. (Hmmmm...
> > maybe ILIKE requires LOWER and UPPER to work? Would not be
> > illogical...)
>
> It does. I think it works by just downcasing both strings. It's possible to do
> better but tricky. I think 8.3 has an optimization for that for single-byte
> encodings but it had to be disabled for utf-8 in the end.
>
> If it's returning trash for those characters then it's not prepared to handle
> UTF-8 data. You have to use an encoding compatible with your locale and
> vice-versa.
>
> If you want to store UTF-8 data I suggest you
>
> . add hu_HU.UTF-8 to /etc/locale.gen,
> . rerun /usr/sbin/locale-gen
> . pg_dump your database
> . re-initdb with the locale set to hu_HU.UTF-8
> . pg_restore your data.
>
> Unfortunately that'll take quite a while and involve down-time.
>
> You should probably do this in a second directory aside from your existing
> database just in case you've created any invalidly encoded utf-8 strings.
> You'll have to fix them before restoring. (Actually I don't recall which
> version got strict about that.)
>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>