Thread: text column indexing in UTF-8 database
Do I really need 4 indexes per column to handle the 4 combinations of {equality,like/regexp} x {unfolded,casefolded} in a UTF-8 encoded database? I have a column that I'd like to be able to search with equality and regexp (or like), optionally casefolded. The database is UTF-8 encoded. The table and index defs are below. Jeff Davis gave me a tip to use text_pattern_ops on indexes to speed up regexp and like; that worked beautiful. But I discovered a caveat that t_p_o apparently doesn't handle equality. Thus, I think I need distinct indexes for the 4 cases above. Right? Thanks, Reece rkh@csb-dev=> \d pannotation Table "unison.pannotation" Column | Type | Modifiers ----------------+--------------------------+------------------------ pannotation_id | integer | not null default origin_id | integer | not null alias | text | not null descr | text | tax_id | integer | added | timestamp with time zone | not null default timenow() Indexes: ... "pannotation_alias" btree (alias) "pannotation_alias_cf" btree (lower(alias)) "pannotation_alias_cf_tpo" btree (lower(alias) text_pattern_ops) "pannotation_alias_tpo" btree (alias text_pattern_ops) ... where those indexes are defined as: rkh@csb-dev=> \x rkh@csb-dev=> select indexname,indexdef from pg_indexes where indexname~'^pannotation_alias'; -[ RECORD 1 ]-------------------------------------------------------- indexname | pannotation_alias_cf_tpo indexdef | CREATE INDEX pannotation_alias_cf_tpo ON pannotation USING btree (lower(alias) text_pattern_ops) -[ RECORD 2 ]--------------------------------------------------------- indexname | pannotation_alias_tpo indexdef | CREATE INDEX pannotation_alias_tpo ON pannotation USING btree (alias text_pattern_ops) -[ RECORD 3 ]--------------------------------------------------------- indexname | pannotation_alias indexdef | CREATE INDEX pannotation_alias ON pannotation USING btree (alias) -[ RECORD 4 ]--------------------------------------------------------- indexname | pannotation_alias_cf indexdef | CREATE INDEX pannotation_alias_cf ON pannotation USING btree (lower(alias)) -- Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
On Mar 12, 2009, at 5:15 PM, Reece Hart wrote: > Do I really need 4 indexes per column to handle the 4 combinations of > {equality,like/regexp} x {unfolded,casefolded} in a UTF-8 encoded > database? > > I have a column that I'd like to be able to search with equality and > regexp (or like), optionally casefolded. The database is UTF-8 > encoded. > The table and index defs are below. > > Jeff Davis gave me a tip to use text_pattern_ops on indexes to speed > up > regexp and like; that worked beautiful. But I discovered a caveat that > t_p_o apparently doesn't handle equality. Thus, I think I need > distinct > indexes for the 4 cases above. Right? If A=B then lower(A) = lower(B), and if A like B then lower(A) like lower(B). So, if nothing else, you could rewrite "where alias = 'Foo'" as "where lower(alias) = lower('Foo') and alias='Foo'" and take advantage of the lower() functional index. I've read that 8.4 will be able to use a text_pattern_ops index for equality. Cheers, Steve
On Thu, 2009-03-12 at 17:15 -0700, Reece Hart wrote: > Jeff Davis gave me a tip to use text_pattern_ops on indexes to speed up > regexp and like; that worked beautiful. But I discovered a caveat that > t_p_o apparently doesn't handle equality. Thus, I think I need distinct > indexes for the 4 cases above. Right? It looks like an index using text_pattern_ops can be used for equality (see my test case below). This works apparently because texteq() is defined as bitwise-equality. Is that really correct? I was under the impression that some locales do not obey that rule, and may consider two slightly different strings to be equal. Regards, Jeff Davis create table a(t text); create index a_idx on a (t text_pattern_ops); insert into a values('foo'); set enable_seqscan='f'; analyze a; explain analyze select * from a where t = 'foo'; QUERY PLAN ----------------------------------------------- Index Scan using a_idx on a (cost=0.00..8.27 rows=1 width=4) (actual time=0.009..0.010 rows=1 loops=1) Index Cond: (t = 'foo'::text) Total runtime: 0.036 ms (3 rows)
On Thu, 2009-03-12 at 18:02 -0700, Jeff Davis wrote: > It looks like an index using text_pattern_ops can be used for equality > (see my test case below). Odd. I can't reproduce your test case. I noticed that I edited out the version and platform from my OP. (A: 8.3.6, x86_64 linux). You're on 8.3.6, or do you happen to be testing on the 8.4 branch? I see this: rkh@rkh=> \i tpo-test.sql version -------------------------------------------------------------------------------------------- PostgreSQL 8.3.6 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.0 (SUSE Linux) select name,setting from pg_settings where name~'locale|encoding'; name | setting -----------------+--------- client_encoding | UTF8 server_encoding | UTF8 (2 rows) \!echo $LANG en_US.UTF-8 create temp table a(t text); CREATE TABLE insert into a values('foo'); INSERT 0 1 set enable_seqscan='f'; SET create index a_t_tpo on a (t text_pattern_ops); CREATE INDEX analyze a; ANALYZE explain analyze select * from a where t = 'foo'; QUERY PLAN ----------------------------------------------------------------------------------------------------------- Seq Scan on a (cost=100000000.00..100000001.01 rows=1 width=4) (actual time=0.014..0.016 rows=1 loops=1) Filter: (t = 'foo'::text) Total runtime: 0.047 ms (3 rows) create index a_t on a (t); CREATE INDEX analyze a; ANALYZE explain analyze select * from a where t = 'foo'; QUERY PLAN ------------------------------------------------------------------------------------------------------- Index Scan using a_t on a (cost=0.00..8.27 rows=1 width=4) (actual time=0.061..0.062 rows=1 loops=1) Index Cond: (t = 'foo'::text) Total runtime: 0.099 ms (3 rows) script at http://harts.net/reece/tpo-test.sql -- Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
On Thu, 2009-03-12 at 17:32 -0700, Steve Atkins wrote:
Good idea. Thanks. The niggling remaining problem is that the DB is open to a SQL-savvy audience and it'd be nice to avoid telling them to casefold their predicates.
For regexps, lower(alias) ~* lower(regexp) won't work because extended regexps might contain character classes (e.g., \S != \s). And, I guess that alias ~* regexp requires a seqscan because the index isn't ordered over ~* (right?). How about lower(alias) ~* regexp ? Is PG smart enough to know that that ordering is well defined? Is my head on straight there?
Thanks again,
Reece
If A=B then lower(A) = lower(B), and if A like B then lower(A) like
lower(B).
So, if nothing else, you could rewrite "where alias = 'Foo'" as
"where lower(alias) = lower('Foo') and alias='Foo'" and take advantage
of the lower() functional index.
Good idea. Thanks. The niggling remaining problem is that the DB is open to a SQL-savvy audience and it'd be nice to avoid telling them to casefold their predicates.
For regexps, lower(alias) ~* lower(regexp) won't work because extended regexps might contain character classes (e.g., \S != \s). And, I guess that alias ~* regexp requires a seqscan because the index isn't ordered over ~* (right?). How about lower(alias) ~* regexp ? Is PG smart enough to know that that ordering is well defined? Is my head on straight there?
Thanks again,
Reece
-- Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0 |
Jeff Davis <pgsql@j-davis.com> writes: > It looks like an index using text_pattern_ops can be used for equality > (see my test case below). This is true as of 8.4; prior versions make a distinction between = and ~=~. > This works apparently because texteq() is defined as bitwise-equality. > Is that really correct? I was under the impression that some locales do > not obey that rule, and may consider two slightly different strings to > be equal. The locale might, but Postgres doesn't --- look at the implementation of texteq(). There is actually some history here; the former distinction in the equality operators arose from exactly your concern. But after we put in the second-pass check to insist on bitwise equality, we realized that the equality operators really were equivalent. regards, tom lane
On Fri, 2009-03-13 at 13:13 -0400, Tom Lane wrote: > There is actually some history here; the former distinction in the > equality operators arose from exactly your concern. But after we > put in the second-pass check to insist on bitwise equality, we > realized that the equality operators really were equivalent. > Interesting. It's certainly convenient when "=" means "values are exactly the same" ;) One thing that still doesn't make sense to me is that texteq() is bitwise-equality even in 8.3. It sounds like Reece Hart can avoid the extra index by making a new opclass that's like text_pattern_ops except "~=~" should be named "=". Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > One thing that still doesn't make sense to me is that texteq() is > bitwise-equality even in 8.3. Historical artifact ... we made the semantics change some time ago, but the ensuing change to remove ~=~ didn't happen until 8.4. > It sounds like Reece Hart can avoid the extra index by making a new > opclass that's like text_pattern_ops except "~=~" should be named "=". Unfortunately not, because the LIKE index optimizations are keyed off the built-in opclasses. regards, tom lane