Thread: Controlling locale and impact on LIKE statements

Controlling locale and impact on LIKE statements

From

"Martin Langhoff"

Date:

04 September 2007, 22:43:15

Hi!

Background:

Using Pg8.1/8.2 on a utf-8 database, I found out that my left-anchored
LIKE clauses were forcing a full table scan instead of using the
index. After a bit of digging, I found that Pg can only use the
"normal" index for left-anchored LIKE queries if locale is 'C'.

"The optimizer can also use a B-tree index for queries involving the
pattern matching operators LIKE and ~ if the pattern is a constant and
is anchored to the beginning of the string — for example, col LIKE
'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your
server does not use the C locale you will need to create the index
with a special operator class to support indexing of pattern-matching
queries."
http://www.postgresql.org/docs/8.1/static/indexes-types.html

What I think I need to do:

As I have a Pg install where the locale is already en_US.UTF-8, and
the database already exists, is there a DB-scoped way of controlling
the locale? I think the index usage noted above is affected by
lc_ctype but I could be wrong.

I really don't want to go down the "rebuild your pgcluster" path as
outlined here http://archives.postgresql.org/pgsql-hackers/2007-02/msg00992.php
;-)

Is there a better way? In this specific install I can create the
additional index. However, this needs a general fix for Moodle, which
has an abstract  DB schema handling (we support MySQL, Pg, MSSQL,
Oracle) and the whole thing of figuring out what the locale is and
whether to add magical additional indexes just for Pg makes me look
like a loony.

See the discussion with Eloy (maintainer of the schema abstraction
layer) at http://moodle.org/mod/forum/discuss.php?d=78738#p350512
login as "guest" to avoid registration.

cheers,


martin

Re: Controlling locale and impact on LIKE statements

From

Alvaro Herrera

Date:

04 September 2007, 23:23:52

Martin Langhoff escribió:

> As I have a Pg install where the locale is already en_US.UTF-8, and
> the database already exists, is there a DB-scoped way of controlling
> the locale?

Not really.

> Is there a better way? In this specific install I can create the
> additional index. However, this needs a general fix for Moodle, which
> has an abstract  DB schema handling (we support MySQL, Pg, MSSQL,
> Oracle) and the whole thing of figuring out what the locale is and
> whether to add magical additional indexes just for Pg makes me look
> like a loony.

You are right and Eloy is wrong on that discussion.  There is not
anything the DB can do to use the regular index if the locale is not C
for LIKE queries.  There are good reasons for this.  There's not much
option beyond creating the pattern_ops index.

--
Alvaro Herrera       Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."                               (Nathaniel Smith)

Re: Controlling locale and impact on LIKE statements

From

Gregory Stark

Date:

05 September 2007, 00:11:40

"Alvaro Herrera" <alvherre@commandprompt.com> writes:

> Martin Langhoff escribió:
>
>> the whole thing of figuring out what the locale is and whether to add
>> magical additional indexes just for Pg makes me look like a loony.
>
> You are right and Eloy is wrong on that discussion.  There is not
> anything the DB can do to use the regular index if the locale is not C
> for LIKE queries.  There are good reasons for this.  There's not much
> option beyond creating the pattern_ops index.

Indeed *all* indexes are magical additional things added just for the one
database. There's not any standard definition of what indexes you'll need for
all databases out there. Indexes aren't even in the SQL standard because
they're part of performance tuning for each individual database engine.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: Controlling locale and impact on LIKE statements

From

Alvaro Herrera

Date:

06 September 2007, 09:57:51

Martin Langhoff escribió:
> On 9/5/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> > Martin Langhoff escribió:
> >
> > > As I have a Pg install where the locale is already en_US.UTF-8, and
> > > the database already exists, is there a DB-scoped way of controlling
> > > the locale?
> >
> > Not really.
>
> Ah well. But I do have to wonder why... if each database can have its
> own encoding, that is likely to be matched with a locale. Isn't that
> the main usage scenario? In fact, with unicode encodings, it's likely
> that all your DBs are utf-8 encoded, but each may have its own locale.

The problem is twofold:

1. index ordering is dependent on locale, and
2. there are some indexes over text columns on shared tables, that is,
tables to are in all databases (pg_database, pg_authid, etc).

So you cannot really change the locale without making those indexes
invalid.  It has been said in the past that it is possible to work
around this, which would allow us to change locale per database, but it
hasn't gotten done yet.

> And yet, right now it's all affected by the locale the cluster was
> init'd under. In my case, old Pg installations have been upgraded a
> few times from a Debian Sarge (C locale). Newer DB servers based on
> ubuntu are getting utf-8-ish locales. And all this variation is
> impacting something that should be per DB...
>
> Is this too crazy to ask? ;-)

Well, you are not the only one to have asked this, so it's probably not
crazy.  It just hasn't gotten any hacker motivated enough yet, though.

> > You are right and Eloy is wrong on that discussion.  There is not
> > anything the DB can do to use the regular index if the locale is not C
> > for LIKE queries.  There are good reasons for this.  There's not much
> > option beyond creating the pattern_ops index.
>
> Are the reasons *really* good? ;-)

Well, I can't remember them ATM :-)  But this was given deep
consideration and the pattern_ops were the best solution to be found.

--
Alvaro Herrera                  http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."                                      (E. Dijkstra)

Re: Controlling locale and impact on LIKE statements

From

"Martin Langhoff"

Date:

06 September 2007, 15:38:24

On 9/5/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Martin Langhoff escribió:
>
> > As I have a Pg install where the locale is already en_US.UTF-8, and
> > the database already exists, is there a DB-scoped way of controlling
> > the locale?
>
> Not really.

Ah well. But I do have to wonder why... if each database can have its
own encoding, that is likely to be matched with a locale. Isn't that
the main usage scenario? In fact, with unicode encodings, it's likely
that all your DBs are utf-8 encoded, but each may have its own locale.

And yet, right now it's all affected by the locale the cluster was
init'd under. In my case, old Pg installations have been upgraded a
few times from a Debian Sarge (C locale). Newer DB servers based on
ubuntu are getting utf-8-ish locales. And all this variation is
impacting something that should be per DB...

Is this too crazy to ask? ;-)

> You are right and Eloy is wrong on that discussion.  There is not
> anything the DB can do to use the regular index if the locale is not C
> for LIKE queries.  There are good reasons for this.  There's not much
> option beyond creating the pattern_ops index.

Are the reasons *really* good? ;-)

I can see that LIKE 'foo%' is implemented as a combined
greater-than/less-than clause, which is collation dependent. But why
can't I say "for this query, assume C collation, even if you've been
init'd under a utf-8 locale"? That'd save us a whole lot of trouble...

cheers,

martin