Re: [PERFORM] substring index what is better way to query - Mailing list pgsql-performance

From Tom Lane
Subject Re: [PERFORM] substring index what is better way to query
Date
Msg-id 4596.1497968349@sss.pgh.pa.us
Whole thread Raw
In response to [PERFORM] substring index what is better way to query  (Tieson Molly <tvmaly@gmail.com>)
Responses Re: [PERFORM] substring index what is better way to query
List pgsql-performance
Tieson Molly <tvmaly@gmail.com> writes:
> I have an index I created on the prefix of a column:

> create index location_geo_idx ON locations( substring(geohash, 0, 5));

> I currently use a similar to query, but I wanted to know if there is a
> faster way to query multiple value using this index than this?

> select l.geohash from locations l where l.geohash similar to '(dr7g|dr7e)%';

Well, you've got a couple of problems there.  The most basic one is that
that index doesn't match that query at all.  You need to arrange things
so that the lefthand side of the SIMILAR TO operator is exactly the
indexed value, not something that's related to it.  (Yes, in principle
that index could be used to answer this query, but it would require a
great deal more intimate knowledge than the planner has about the
semantics of both substring() and SIMILAR TO.)  IOW, you need to write

select l.geohash from locations l
  where substring(l.geohash, 0, 5) similar to '(dr7g|dr7e)%';

The other possible solution would be to just index the geohash strings
verbatim; unless they are quite long, that's what I'd recommend, usually.

Secondly, if you're using a non-C locale, you're likely not getting an
indexscan plan anyway; check it with EXPLAIN.  To get an indexed prefix
search out of a pattern match, the index has to use C sorting rules,
which you can force with a COLLATE or text_pattern_ops option if the
database's prevailing locale isn't C.

Thirdly, if you experiment with EXPLAIN a little bit, you'll soon realize
that the planner is not great at extracting common prefix strings out of
OR'd pattern branches:

regression=# create table loc (f1 text unique);
CREATE TABLE
regression=# explain select * from loc where f1 similar to '(dr7g|dr7e)%';
                               QUERY PLAN
-------------------------------------------------------------------------
 Bitmap Heap Scan on loc  (cost=4.22..14.37 rows=1 width=32)
   Filter: (f1 ~ '^(?:(?:dr7g|dr7e).*)$'::text)
   ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)
         Index Cond: ((f1 >= 'd'::text) AND (f1 < 'e'::text))
(4 rows)

The useful part of this for speed purposes is the "Index Cond", and
you can see that it's only enforcing that the first character be "d".
I don't remember that code very well at the moment, but I'm a bit
surprised that it's even figured out that the "d" is common to both
branches.  You can get a lot more traction if you factor the common
prefix manually:

regression=# explain select * from loc where f1 similar to 'dr7(g|e)%';
                               QUERY PLAN
-------------------------------------------------------------------------
 Bitmap Heap Scan on loc  (cost=4.22..14.37 rows=1 width=32)
   Filter: (f1 ~ '^(?:dr7(?:g|e).*)$'::text)
   ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)
         Index Cond: ((f1 >= 'dr7'::text) AND (f1 < 'dr8'::text))
(4 rows)

or maybe even

regression=# explain select * from loc where f1 similar to 'dr7g%' or f1 similar to 'dr7e%';
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Bitmap Heap Scan on loc  (cost=8.45..19.04 rows=2 width=32)
   Recheck Cond: ((f1 ~ '^(?:dr7g.*)$'::text) OR (f1 ~ '^(?:dr7e.*)$'::text))
   Filter: ((f1 ~ '^(?:dr7g.*)$'::text) OR (f1 ~ '^(?:dr7e.*)$'::text))
   ->  BitmapOr  (cost=8.45..8.45 rows=14 width=0)
         ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)
               Index Cond: ((f1 >= 'dr7g'::text) AND (f1 < 'dr7h'::text))
         ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)
               Index Cond: ((f1 >= 'dr7e'::text) AND (f1 < 'dr7f'::text))
(8 rows)

Whether this is worth the trouble depends a lot on your data distribution,
but any of them are probably better than the seqscan you're no doubt
getting right now.

            regards, tom lane


pgsql-performance by date:

Previous
From: Tieson Molly
Date:
Subject: [PERFORM] substring index what is better way to query
Next
From: Tieson Molly
Date:
Subject: Re: [PERFORM] substring index what is better way to query