Thread: text column indexing in UTF-8 database

text column indexing in UTF-8 database

From
Reece Hart
Date:
Do I really need 4 indexes per column to handle the 4 combinations of
{equality,like/regexp} x {unfolded,casefolded} in a UTF-8 encoded
database?

I have a column that I'd like to be able to search with equality and
regexp (or like), optionally casefolded. The database is UTF-8 encoded.
The table and index defs are below.

Jeff Davis gave me a tip to use text_pattern_ops on indexes to speed up
regexp and like; that worked beautiful. But I discovered a caveat that
t_p_o apparently doesn't handle equality. Thus, I think I need distinct
indexes for the 4 cases above. Right?

Thanks,
Reece


rkh@csb-dev=> \d pannotation
                       Table "unison.pannotation"
     Column     |           Type           | Modifiers
----------------+--------------------------+------------------------
 pannotation_id | integer                  | not null default
 origin_id      | integer                  | not null
 alias          | text                     | not null
 descr          | text                     |
 tax_id         | integer                  |
 added          | timestamp with time zone | not null default timenow()
Indexes:
...
    "pannotation_alias" btree (alias)
    "pannotation_alias_cf" btree (lower(alias))
    "pannotation_alias_cf_tpo" btree (lower(alias) text_pattern_ops)
    "pannotation_alias_tpo" btree (alias text_pattern_ops)
...

where those indexes are defined as:
rkh@csb-dev=> \x
rkh@csb-dev=> select indexname,indexdef from pg_indexes
where indexname~'^pannotation_alias';
-[ RECORD 1 ]--------------------------------------------------------
indexname | pannotation_alias_cf_tpo
indexdef  | CREATE INDEX pannotation_alias_cf_tpo ON pannotation USING btree (lower(alias) text_pattern_ops)
-[ RECORD 2 ]---------------------------------------------------------
indexname | pannotation_alias_tpo
indexdef  | CREATE INDEX pannotation_alias_tpo ON pannotation USING btree (alias text_pattern_ops)
-[ RECORD 3 ]---------------------------------------------------------
indexname | pannotation_alias
indexdef  | CREATE INDEX pannotation_alias ON pannotation USING btree (alias)
-[ RECORD 4 ]---------------------------------------------------------
indexname | pannotation_alias_cf
indexdef  | CREATE INDEX pannotation_alias_cf ON pannotation USING btree
(lower(alias))


--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0


Re: text column indexing in UTF-8 database

From
Steve Atkins
Date:
On Mar 12, 2009, at 5:15 PM, Reece Hart wrote:

> Do I really need 4 indexes per column to handle the 4 combinations of
> {equality,like/regexp} x {unfolded,casefolded} in a UTF-8 encoded
> database?
>
> I have a column that I'd like to be able to search with equality and
> regexp (or like), optionally casefolded. The database is UTF-8
> encoded.
> The table and index defs are below.
>
> Jeff Davis gave me a tip to use text_pattern_ops on indexes to speed
> up
> regexp and like; that worked beautiful. But I discovered a caveat that
> t_p_o apparently doesn't handle equality. Thus, I think I need
> distinct
> indexes for the 4 cases above. Right?

If A=B then lower(A) = lower(B), and if A like B then lower(A) like
lower(B).

So, if nothing else, you could rewrite "where alias = 'Foo'" as
"where lower(alias) = lower('Foo') and alias='Foo'" and take advantage
of the lower() functional index.

I've read that 8.4 will be able to use a text_pattern_ops index for
equality.

Cheers,
   Steve


Re: text column indexing in UTF-8 database

From
Jeff Davis
Date:
On Thu, 2009-03-12 at 17:15 -0700, Reece Hart wrote:
> Jeff Davis gave me a tip to use text_pattern_ops on indexes to speed up
> regexp and like; that worked beautiful. But I discovered a caveat that
> t_p_o apparently doesn't handle equality. Thus, I think I need distinct
> indexes for the 4 cases above. Right?

It looks like an index using text_pattern_ops can be used for equality
(see my test case below).

This works apparently because texteq() is defined as bitwise-equality.
Is that really correct? I was under the impression that some locales do
not obey that rule, and may consider two slightly different strings to
be equal.

Regards,
    Jeff Davis

create table a(t text);
create index a_idx on a (t text_pattern_ops);
insert into a values('foo');
set enable_seqscan='f';
analyze a;
explain analyze select * from a where t = 'foo';
               QUERY PLAN
-----------------------------------------------
 Index Scan using a_idx on a  (cost=0.00..8.27 rows=1 width=4)
   (actual time=0.009..0.010 rows=1 loops=1)
   Index Cond: (t = 'foo'::text)
 Total runtime: 0.036 ms
(3 rows)



Re: text column indexing in UTF-8 database

From
Reece Hart
Date:
On Thu, 2009-03-12 at 18:02 -0700, Jeff Davis wrote:
> It looks like an index using text_pattern_ops can be used for equality
> (see my test case below).

Odd. I can't reproduce your test case. I noticed that I edited out the
version and platform from my OP. (A: 8.3.6, x86_64 linux). You're on
8.3.6, or do you happen to be testing on the 8.4 branch?

I see this:

rkh@rkh=> \i tpo-test.sql
version
--------------------------------------------------------------------------------------------
 PostgreSQL 8.3.6 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC)
4.1.0 (SUSE Linux)

select name,setting from pg_settings where name~'locale|encoding';
      name       | setting
-----------------+---------
 client_encoding | UTF8
 server_encoding | UTF8
(2 rows)

\!echo $LANG
en_US.UTF-8

create temp table a(t text);
CREATE TABLE

insert into a values('foo');
INSERT 0 1

set enable_seqscan='f';
SET

create index a_t_tpo on a (t text_pattern_ops);
CREATE INDEX

analyze a;
ANALYZE

explain analyze select * from a where t = 'foo';
                                                QUERY
PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=100000000.00..100000001.01 rows=1 width=4) (actual
time=0.014..0.016 rows=1 loops=1)
   Filter: (t = 'foo'::text)
 Total runtime: 0.047 ms
(3 rows)

create index a_t on a (t);
CREATE INDEX

analyze a;
ANALYZE

explain analyze select * from a where t = 'foo';
                                              QUERY
PLAN
-------------------------------------------------------------------------------------------------------
 Index Scan using a_t on a  (cost=0.00..8.27 rows=1 width=4) (actual
time=0.061..0.062 rows=1 loops=1)
   Index Cond: (t = 'foo'::text)
 Total runtime: 0.099 ms
(3 rows)


script at http://harts.net/reece/tpo-test.sql


--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0


Re: text column indexing in UTF-8 database

From
Reece Hart
Date:
On Thu, 2009-03-12 at 17:32 -0700, Steve Atkins wrote:
If A=B then lower(A) = lower(B), and if A like B then lower(A) like 
lower(B).

So, if nothing else, you could rewrite "where alias = 'Foo'" as
"where lower(alias) = lower('Foo') and alias='Foo'" and take advantage
of the lower() functional index.

Good idea. Thanks. The niggling remaining problem is that the DB is open to a SQL-savvy audience and it'd be nice to avoid telling them to casefold their predicates.

For regexps, lower(alias) ~* lower(regexp) won't work because extended regexps might contain character classes (e.g., \S != \s). And, I guess that alias ~* regexp requires a seqscan because the index isn't ordered over ~* (right?). How about lower(alias) ~* regexp ? Is PG smart enough to know that that ordering is well defined? Is my head on straight there?

Thanks again,
Reece

-- 
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

Re: text column indexing in UTF-8 database

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> It looks like an index using text_pattern_ops can be used for equality
> (see my test case below).

This is true as of 8.4; prior versions make a distinction between =
and ~=~.

> This works apparently because texteq() is defined as bitwise-equality.
> Is that really correct? I was under the impression that some locales do
> not obey that rule, and may consider two slightly different strings to
> be equal.

The locale might, but Postgres doesn't --- look at the implementation
of texteq().

There is actually some history here; the former distinction in the
equality operators arose from exactly your concern.  But after we
put in the second-pass check to insist on bitwise equality, we
realized that the equality operators really were equivalent.

            regards, tom lane

Re: text column indexing in UTF-8 database

From
Jeff Davis
Date:
On Fri, 2009-03-13 at 13:13 -0400, Tom Lane wrote:
> There is actually some history here; the former distinction in the
> equality operators arose from exactly your concern.  But after we
> put in the second-pass check to insist on bitwise equality, we
> realized that the equality operators really were equivalent.
>

Interesting. It's certainly convenient when "=" means "values are
exactly the same" ;)

One thing that still doesn't make sense to me is that texteq() is
bitwise-equality even in 8.3.

It sounds like Reece Hart can avoid the extra index by making a new
opclass that's like text_pattern_ops except "~=~" should be named "=".

Regards,
    Jeff Davis


Re: text column indexing in UTF-8 database

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> One thing that still doesn't make sense to me is that texteq() is
> bitwise-equality even in 8.3.

Historical artifact ... we made the semantics change some time ago, but
the ensuing change to remove ~=~ didn't happen until 8.4.

> It sounds like Reece Hart can avoid the extra index by making a new
> opclass that's like text_pattern_ops except "~=~" should be named "=".

Unfortunately not, because the LIKE index optimizations are keyed off
the built-in opclasses.

            regards, tom lane