Thread: Controlling locale and impact on LIKE statements
Hi! Background: Using Pg8.1/8.2 on a utf-8 database, I found out that my left-anchored LIKE clauses were forcing a full table scan instead of using the index. After a bit of digging, I found that Pg can only use the "normal" index for left-anchored LIKE queries if locale is 'C'. "The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your server does not use the C locale you will need to create the index with a special operator class to support indexing of pattern-matching queries." http://www.postgresql.org/docs/8.1/static/indexes-types.html What I think I need to do: As I have a Pg install where the locale is already en_US.UTF-8, and the database already exists, is there a DB-scoped way of controlling the locale? I think the index usage noted above is affected by lc_ctype but I could be wrong. I really don't want to go down the "rebuild your pgcluster" path as outlined here http://archives.postgresql.org/pgsql-hackers/2007-02/msg00992.php ;-) Is there a better way? In this specific install I can create the additional index. However, this needs a general fix for Moodle, which has an abstract DB schema handling (we support MySQL, Pg, MSSQL, Oracle) and the whole thing of figuring out what the locale is and whether to add magical additional indexes just for Pg makes me look like a loony. See the discussion with Eloy (maintainer of the schema abstraction layer) at http://moodle.org/mod/forum/discuss.php?d=78738#p350512 login as "guest" to avoid registration. cheers, martin
Martin Langhoff escribió: > As I have a Pg install where the locale is already en_US.UTF-8, and > the database already exists, is there a DB-scoped way of controlling > the locale? Not really. > Is there a better way? In this specific install I can create the > additional index. However, this needs a general fix for Moodle, which > has an abstract DB schema handling (we support MySQL, Pg, MSSQL, > Oracle) and the whole thing of figuring out what the locale is and > whether to add magical additional indexes just for Pg makes me look > like a loony. You are right and Eloy is wrong on that discussion. There is not anything the DB can do to use the regular index if the locale is not C for LIKE queries. There are good reasons for this. There's not much option beyond creating the pattern_ops index. -- Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 18.1", W 73º 13' 56.4" "Most hackers will be perfectly comfortable conceptualizing users as entropy sources, so let's move on." (Nathaniel Smith)
"Alvaro Herrera" <alvherre@commandprompt.com> writes: > Martin Langhoff escribió: > >> the whole thing of figuring out what the locale is and whether to add >> magical additional indexes just for Pg makes me look like a loony. > > You are right and Eloy is wrong on that discussion. There is not > anything the DB can do to use the regular index if the locale is not C > for LIKE queries. There are good reasons for this. There's not much > option beyond creating the pattern_ops index. Indeed *all* indexes are magical additional things added just for the one database. There's not any standard definition of what indexes you'll need for all databases out there. Indexes aren't even in the SQL standard because they're part of performance tuning for each individual database engine. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Martin Langhoff escribió: > On 9/5/07, Alvaro Herrera <alvherre@commandprompt.com> wrote: > > Martin Langhoff escribió: > > > > > As I have a Pg install where the locale is already en_US.UTF-8, and > > > the database already exists, is there a DB-scoped way of controlling > > > the locale? > > > > Not really. > > Ah well. But I do have to wonder why... if each database can have its > own encoding, that is likely to be matched with a locale. Isn't that > the main usage scenario? In fact, with unicode encodings, it's likely > that all your DBs are utf-8 encoded, but each may have its own locale. The problem is twofold: 1. index ordering is dependent on locale, and 2. there are some indexes over text columns on shared tables, that is, tables to are in all databases (pg_database, pg_authid, etc). So you cannot really change the locale without making those indexes invalid. It has been said in the past that it is possible to work around this, which would allow us to change locale per database, but it hasn't gotten done yet. > And yet, right now it's all affected by the locale the cluster was > init'd under. In my case, old Pg installations have been upgraded a > few times from a Debian Sarge (C locale). Newer DB servers based on > ubuntu are getting utf-8-ish locales. And all this variation is > impacting something that should be per DB... > > Is this too crazy to ask? ;-) Well, you are not the only one to have asked this, so it's probably not crazy. It just hasn't gotten any hacker motivated enough yet, though. > > You are right and Eloy is wrong on that discussion. There is not > > anything the DB can do to use the regular index if the locale is not C > > for LIKE queries. There are good reasons for this. There's not much > > option beyond creating the pattern_ops index. > > Are the reasons *really* good? ;-) Well, I can't remember them ATM :-) But this was given deep consideration and the pattern_ops were the best solution to be found. -- Alvaro Herrera http://www.amazon.com/gp/registry/5ZYLFMCVHXC "Industry suffers from the managerial dogma that for the sake of stability and continuity, the company should be independent of the competence of individual employees." (E. Dijkstra)
On 9/5/07, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Martin Langhoff escribió: > > > As I have a Pg install where the locale is already en_US.UTF-8, and > > the database already exists, is there a DB-scoped way of controlling > > the locale? > > Not really. Ah well. But I do have to wonder why... if each database can have its own encoding, that is likely to be matched with a locale. Isn't that the main usage scenario? In fact, with unicode encodings, it's likely that all your DBs are utf-8 encoded, but each may have its own locale. And yet, right now it's all affected by the locale the cluster was init'd under. In my case, old Pg installations have been upgraded a few times from a Debian Sarge (C locale). Newer DB servers based on ubuntu are getting utf-8-ish locales. And all this variation is impacting something that should be per DB... Is this too crazy to ask? ;-) > You are right and Eloy is wrong on that discussion. There is not > anything the DB can do to use the regular index if the locale is not C > for LIKE queries. There are good reasons for this. There's not much > option beyond creating the pattern_ops index. Are the reasons *really* good? ;-) I can see that LIKE 'foo%' is implemented as a combined greater-than/less-than clause, which is collation dependent. But why can't I say "for this query, assume C collation, even if you've been init'd under a utf-8 locale"? That'd save us a whole lot of trouble... cheers, martin