Thread: like & optimization

like & optimization

From

Scott Ribe

Date:

12 October 2013, 18:08:40

PG 9.3, consider a table test like:

tz timestamp not null,
cola varchar not null,
colb varchar not null

2 compound indexes:

tz_cola on (tz, cola)
tz_colb on (tz, colb varchar_pattern_ops)

now a query, for some start & end timestamps:

select * from test where tz >= start and tz < end and colb like '%foobar%'

Assume that the tz restriction is somewhat selective, say 1% of the table, and the colb restriction is extremely
selective,say less than 0.00001%. 

It seems to me that the fastest way to resolve this query is to use the tz_colb index directly, scanning the range
betweentz >= start and tz < end for the colb condition. 

But pg wants to use the pg_cola index to find all rows in the time range, then filter those rows for the colb
condition.(FYI, cola contains only very small values, while colb's values are typically several times longer.) 

Now if I tweak the time range, I can get it to seq scan the table for all conditions, or bitmap heap scan + re-check
condtz + filter colb + bitmap index scan tz_cola, but never use the tz_colb index... 

Am I right about the fastest way to perform the search? Is there some way to get pg to do this, or would this require
anenhancement? 

Here's a sample query plan:

 Index Scan using tz_cola on test  (cost=0.56..355622.52 rows=23 width=106) (actual time=61.403..230.649 rows=4
loops=1)
   Index Cond: ((tz >= '2013-04-01 06:00:00-05'::timestamp with time zone) AND (tz <= '2013-04-30
06:00:00-05'::timestampwith time zone)) 
   Filter: ((colb)::text ~~ '%foobar%'::text)
   Rows Removed by Filter: 261725
 Total runtime: 230.689 ms

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice

Re: like & optimization

From

Torsten Förtsch

Date:

12 October 2013, 21:30:29

On 12/10/13 20:08, Scott Ribe wrote:
> select * from test where tz >= start and tz < end and colb like '%foobar%'

I think you can use an index only for wildcard expressions that are
anchored at the beginning. So,

  select * from test where tz >= start and tz < end
     and colb like 'foobar%'

can use an index on colb.

You could perhaps

  select * from test where tz >= start and tz < end
     and colb like 'foobar%'
  union all
  select * from test where tz >= start and tz < end
     and reverse(colb) like 'raboof%'

Then you need 2 indexes, one on colb the other on reverse(colb).

You can have duplicates in the result set if the table contains rows
where colb='foobar'. If that's a problem, use union distinct.

Alternatively, if foobar is kind of a word (with boundaries), you could
consider full-text search.

Just my 2¢,
Torsten

Re: like & optimization

From

Sergey Konoplev

Date:

12 October 2013, 21:59:52

On Sat, Oct 12, 2013 at 11:08 AM, Scott Ribe
<scott_ribe@elevated-dev.com> wrote:
[skipped]

> select * from test where tz >= start and tz < end and colb like '%foobar%'
>
> Assume that the tz restriction is somewhat selective, say 1% of the table, and the colb restriction is extremely
selective,say less than 0.00001%. 

[skipped]

> Here's a sample query plan:
>
>  Index Scan using tz_cola on test  (cost=0.56..355622.52 rows=23 width=106) (actual time=61.403..230.649 rows=4
loops=1)
>    Index Cond: ((tz >= '2013-04-01 06:00:00-05'::timestamp with time zone) AND (tz <= '2013-04-30
06:00:00-05'::timestampwith time zone)) 
>    Filter: ((colb)::text ~~ '%foobar%'::text)
>    Rows Removed by Filter: 261725
>  Total runtime: 230.689 ms

You need to create the pg_trgm extension
(http://www.postgresql.org/docs/9.3/static/pgtrgm.html) if you need
things like LIKE '%foobar%' to use indexes.

First create 2 separate indexes for your columns ... USING btree (tz)
and ... USING gin (colb gin_trgm_ops). And to EXPLAIN ANALYZE your
query. It could probably be the best option because BitmapAnd might be
very effective.

You could also try to create a multi-column index, but you will need
to create a btree_gin extension first
(http://www.postgresql.org/docs/9.3/static/btree-gin.html) to be able
to combine GIN and btree in one index ... gin (tz, colb gin_trgm_ops).
Then EXPLAIN ANALYZE the query and compare with the above.

Note, if you have intensive writes on the table you would probably
want to set FASTUPDATE to off on the GIN index, because it might lead
to unpredictable stalls
(http://www.postgresql.org/docs/9.3/static/gin-implementation.html#GIN-FAST-UPDATE).

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

Re: like & optimization

From

Tom Lane

Date:

12 October 2013, 22:21:41

Scott Ribe <scott_ribe@elevated-dev.com> writes:
> PG 9.3, consider a table test like:
> tz timestamp not null,
> cola varchar not null,
> colb varchar not null

> 2 compound indexes:

> tz_cola on (tz, cola)
> tz_colb on (tz, colb varchar_pattern_ops)

> now a query, for some start & end timestamps:

> select * from test where tz >= start and tz < end and colb like '%foobar%'

> Assume that the tz restriction is somewhat selective, say 1% of the table, and the colb restriction is extremely
selective,say less than 0.00001%. 

> It seems to me that the fastest way to resolve this query is to use the tz_colb index directly, scanning the range
betweentz >= start and tz < end for the colb condition. 

> But pg wants to use the pg_cola index to find all rows in the time range, then filter those rows for the colb
condition.(FYI, cola contains only very small values, while colb's values are typically several times longer.) 

The reason you're losing on this is that the "select *" command eliminates
the possibility of an index-only scan (I'm assuming that that selects some
columns that aren't in the index).  Given that a plain indexscan will
always involve fetching each heap row that satisfies the indexable
condition (the one on tz), the planner figures it might as well use the
physically-smaller index.

It's true that in principle we could use the index-only-scan index AM
machinery to retrieve colb from the index, and then check the LIKE
predicate on that value before we go to the heap to get the other values;
but the code isn't factored that way at the moment.  I'm not entirely sure
that such cases arise often enough to be worth making it happen.  I think
there was discussion of this point back when the index-only-scan patch was
being written, and we decided it didn't seem worth pursuing at the time.

            regards, tom lane

Re: like & optimization

From

Scott Ribe

Date:

12 October 2013, 23:21:09

On Oct 12, 2013, at 4:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The reason you're losing on this is that the "select *" command eliminates
> the possibility of an index-only scan (I'm assuming that that selects some
> columns that aren't in the index).  Given that a plain indexscan will
> always involve fetching each heap row that satisfies the indexable
> condition (the one on tz), the planner figures it might as well use the
> physically-smaller index.

OK, that logic makes sense. In the particular case I'm looking at, the comparison to colb will match such a tiny
fractionthat I think it should be faster to use the index first before fetching heap rows. (It most certainly would be
fasterif the rows to be evaluated for the colb match were randomly dispersed, but because they tend to be naturally
clusteredon tz anyway, and the rows are pretty small, there's some chance an index scan might not save enough heap row
I/Oto offset it's own I/O.) 

> It's true that in principle we could use the index-only-scan index AM
> machinery to retrieve colb from the index, and then check the LIKE
> predicate on that value before we go to the heap to get the other values;
> but the code isn't factored that way at the moment.  I'm not entirely sure
> that such cases arise often enough to be worth making it happen.  I think
> there was discussion of this point back when the index-only-scan patch was
> being written, and we decided it didn't seem worth pursuing at the time.

It's not a common-enough case for me to worry about. This is a very rare query in this application--I just wanted to
knowif I was missing something wrt indexes or whatever. It took me a long time to even find varchar_pattern_ops. (This
isone particular question where the top results from google searches are dominated by incorrect assertions. Yes,
Virginia,it *IS* possible to use an index in evaluating a like '%whatever' condition--whether or not it helps in a
particularquery is an open question, but it most certainly is possible.) 

Besides, you've given me the hint, if I really care about this I can try a covering index ;-)

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice

Re: like & optimization

From

Merlin Moncure

Date:

14 October 2013, 21:40:44

On Sat, Oct 12, 2013 at 4:28 PM, Torsten Förtsch
<torsten.foertsch@gmx.net> wrote:
> On 12/10/13 20:08, Scott Ribe wrote:
>> select * from test where tz >= start and tz < end and colb like '%foobar%'
>
> I think you can use an index only for wildcard expressions that are
> anchored at the beginning. So,
>
>   select * from test where tz >= start and tz < end
>      and colb like 'foobar%'
>
> can use an index on colb.
>
> You could perhaps
>
>   select * from test where tz >= start and tz < end
>      and colb like 'foobar%'
>   union all
>   select * from test where tz >= start and tz < end
>      and reverse(colb) like 'raboof%'
>
> Then you need 2 indexes, one on colb the other on reverse(colb).
>
> You can have duplicates in the result set if the table contains rows
> where colb='foobar'. If that's a problem, use union distinct.
>
> Alternatively, if foobar is kind of a word (with boundaries), you could
> consider full-text search.

pg_trgm module optimizes 'like with wildcards' without those
restrictions.  It's very fast for what it does.  Because of the
GIST/GIN dependency index only scans are not going to be used through
pg_tgrm though.

merlin

Re: like & optimization

From

Scott Ribe

Date:

19 October 2013, 22:26:26

Thank you all. Both the double index & pg_trgm would be good solutions.

On Oct 14, 2013, at 3:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

> On Sat, Oct 12, 2013 at 4:28 PM, Torsten Förtsch
> <torsten.foertsch@gmx.net> wrote:
>> On 12/10/13 20:08, Scott Ribe wrote:
>>> select * from test where tz >= start and tz < end and colb like '%foobar%'
>>
>> I think you can use an index only for wildcard expressions that are
>> anchored at the beginning. So,
>>
>>  select * from test where tz >= start and tz < end
>>     and colb like 'foobar%'
>>
>> can use an index on colb.
>>
>> You could perhaps
>>
>>  select * from test where tz >= start and tz < end
>>     and colb like 'foobar%'
>>  union all
>>  select * from test where tz >= start and tz < end
>>     and reverse(colb) like 'raboof%'
>>
>> Then you need 2 indexes, one on colb the other on reverse(colb).
>>
>> You can have duplicates in the result set if the table contains rows
>> where colb='foobar'. If that's a problem, use union distinct.
>>
>> Alternatively, if foobar is kind of a word (with boundaries), you could
>> consider full-text search.
>
> pg_trgm module optimizes 'like with wildcards' without those
> restrictions.  It's very fast for what it does.  Because of the
> GIST/GIN dependency index only scans are not going to be used through
> pg_tgrm though.
>
> merlin
>


--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice