Re: Yet another problem with ILIKE and UTF-8 - Mailing list pgsql-bugs

From Gergely Bor
Subject Re: Yet another problem with ILIKE and UTF-8
Date
Msg-id 396c97310710260717w38a1736y7d05d0bf15b9359e@mail.gmail.com
Whole thread Raw
In response to Re: Yet another problem with ILIKE and UTF-8  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-bugs
Hello Gregory, hello all,


I've tested the DB created with hu_HU.UTF8 lc_* settings and it works
like a charm! We'll re-create the database and move the contents. It's
a nice improvement of 8.3 to detect and disallow such
misconfiguration.

------
Summing up the issue & fix (for googling):
 - Version: postgresql 8.2.x (and possibly older ones as well)
 - Symptom: UTF-8 values are stored correctly but some ILIKE queries
don't return results as expected, even when searching for ASCII
substrings.
 - Reason: database misconfiguration -> UPPER()/LOWER() cannot handle
UTF-8 chars -> ILIKE cannot handle UTF-8 chars. The error above occurs
if the DB is created with UTF-8 internal encoding, but lc_* settings
are not referring UTF-8 locales. (Eg. bad locale reference: "hu_HU",
good locale reference: "hu_HU.UTF8".)
 - Fix: a new database (or DB cluster?) should be created with the
corrected lc_* settings. If there's data already, it should be dumped
and restored (see steps below).
 - Additional info: lc_* settings (and a whole lot of others) can be
displayed with the psql command "SHOW ALL".
 - Additional info: in case the DB is configured incorrectly, psql 8.2
just fails to match rows correctly (without any errors), but 8.3+ will
reject this bad configuration in time.
------

Thanks a lot, guys!


Best regards,
  Gergely BOR

On 10/25/07, Gregory Stark <stark@enterprisedb.com> wrote:
>
> "Gergely Bor" <borg42@gmail.com> writes:
>
> > We'll google the initdb stuff and try it ASAP.
> >
> > What I've tried is LOWER and UPPER, and they seem to return trash for
> > Hungarian UTF-8 characters, but they handle ASCII well. (Hmmmm...
> > maybe ILIKE requires LOWER and UPPER to work? Would not be
> > illogical...)
>
> It does. I think it works by just downcasing both strings. It's possible to do
> better but tricky. I think 8.3 has an optimization for that for single-byte
> encodings but it had to be disabled for utf-8 in the end.
>
> If it's returning trash for those characters then it's not prepared to handle
> UTF-8 data. You have to use an encoding compatible with your locale and
> vice-versa.
>
> If you want to store UTF-8 data I suggest you
>
> . add hu_HU.UTF-8 to /etc/locale.gen,
> . rerun /usr/sbin/locale-gen
> . pg_dump your database
> . re-initdb with the locale set to hu_HU.UTF-8
> . pg_restore your data.
>
> Unfortunately that'll take quite a while and involve down-time.
>
> You should probably do this in a second directory aside from your existing
> database just in case you've created any invalidly encoded utf-8 strings.
> You'll have to fix them before restoring. (Actually I don't recall which
> version got strict about that.)
>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>

pgsql-bugs by date:

Previous
From: "Lee Packham"
Date:
Subject: Re: BUG #3699: Fails to compile DTrace Support
Next
From: Tom Lane
Date:
Subject: Re: Possible planner bug/regression introduced in 8.2.5