Thread: like & optimization
PG 9.3, consider a table test like: tz timestamp not null, cola varchar not null, colb varchar not null 2 compound indexes: tz_cola on (tz, cola) tz_colb on (tz, colb varchar_pattern_ops) now a query, for some start & end timestamps: select * from test where tz >= start and tz < end and colb like '%foobar%' Assume that the tz restriction is somewhat selective, say 1% of the table, and the colb restriction is extremely selective,say less than 0.00001%. It seems to me that the fastest way to resolve this query is to use the tz_colb index directly, scanning the range betweentz >= start and tz < end for the colb condition. But pg wants to use the pg_cola index to find all rows in the time range, then filter those rows for the colb condition.(FYI, cola contains only very small values, while colb's values are typically several times longer.) Now if I tweak the time range, I can get it to seq scan the table for all conditions, or bitmap heap scan + re-check condtz + filter colb + bitmap index scan tz_cola, but never use the tz_colb index... Am I right about the fastest way to perform the search? Is there some way to get pg to do this, or would this require anenhancement? Here's a sample query plan: Index Scan using tz_cola on test (cost=0.56..355622.52 rows=23 width=106) (actual time=61.403..230.649 rows=4 loops=1) Index Cond: ((tz >= '2013-04-01 06:00:00-05'::timestamp with time zone) AND (tz <= '2013-04-30 06:00:00-05'::timestampwith time zone)) Filter: ((colb)::text ~~ '%foobar%'::text) Rows Removed by Filter: 261725 Total runtime: 230.689 ms -- Scott Ribe scott_ribe@elevated-dev.com http://www.elevated-dev.com/ (303) 722-0567 voice
On 12/10/13 20:08, Scott Ribe wrote: > select * from test where tz >= start and tz < end and colb like '%foobar%' I think you can use an index only for wildcard expressions that are anchored at the beginning. So, select * from test where tz >= start and tz < end and colb like 'foobar%' can use an index on colb. You could perhaps select * from test where tz >= start and tz < end and colb like 'foobar%' union all select * from test where tz >= start and tz < end and reverse(colb) like 'raboof%' Then you need 2 indexes, one on colb the other on reverse(colb). You can have duplicates in the result set if the table contains rows where colb='foobar'. If that's a problem, use union distinct. Alternatively, if foobar is kind of a word (with boundaries), you could consider full-text search. Just my 2¢, Torsten
On Sat, Oct 12, 2013 at 11:08 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote: [skipped] > select * from test where tz >= start and tz < end and colb like '%foobar%' > > Assume that the tz restriction is somewhat selective, say 1% of the table, and the colb restriction is extremely selective,say less than 0.00001%. [skipped] > Here's a sample query plan: > > Index Scan using tz_cola on test (cost=0.56..355622.52 rows=23 width=106) (actual time=61.403..230.649 rows=4 loops=1) > Index Cond: ((tz >= '2013-04-01 06:00:00-05'::timestamp with time zone) AND (tz <= '2013-04-30 06:00:00-05'::timestampwith time zone)) > Filter: ((colb)::text ~~ '%foobar%'::text) > Rows Removed by Filter: 261725 > Total runtime: 230.689 ms You need to create the pg_trgm extension (http://www.postgresql.org/docs/9.3/static/pgtrgm.html) if you need things like LIKE '%foobar%' to use indexes. First create 2 separate indexes for your columns ... USING btree (tz) and ... USING gin (colb gin_trgm_ops). And to EXPLAIN ANALYZE your query. It could probably be the best option because BitmapAnd might be very effective. You could also try to create a multi-column index, but you will need to create a btree_gin extension first (http://www.postgresql.org/docs/9.3/static/btree-gin.html) to be able to combine GIN and btree in one index ... gin (tz, colb gin_trgm_ops). Then EXPLAIN ANALYZE the query and compare with the above. Note, if you have intensive writes on the table you would probably want to set FASTUPDATE to off on the GIN index, because it might lead to unpredictable stalls (http://www.postgresql.org/docs/9.3/static/gin-implementation.html#GIN-FAST-UPDATE). -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979 gray.ru@gmail.com
Scott Ribe <scott_ribe@elevated-dev.com> writes: > PG 9.3, consider a table test like: > tz timestamp not null, > cola varchar not null, > colb varchar not null > 2 compound indexes: > tz_cola on (tz, cola) > tz_colb on (tz, colb varchar_pattern_ops) > now a query, for some start & end timestamps: > select * from test where tz >= start and tz < end and colb like '%foobar%' > Assume that the tz restriction is somewhat selective, say 1% of the table, and the colb restriction is extremely selective,say less than 0.00001%. > It seems to me that the fastest way to resolve this query is to use the tz_colb index directly, scanning the range betweentz >= start and tz < end for the colb condition. > But pg wants to use the pg_cola index to find all rows in the time range, then filter those rows for the colb condition.(FYI, cola contains only very small values, while colb's values are typically several times longer.) The reason you're losing on this is that the "select *" command eliminates the possibility of an index-only scan (I'm assuming that that selects some columns that aren't in the index). Given that a plain indexscan will always involve fetching each heap row that satisfies the indexable condition (the one on tz), the planner figures it might as well use the physically-smaller index. It's true that in principle we could use the index-only-scan index AM machinery to retrieve colb from the index, and then check the LIKE predicate on that value before we go to the heap to get the other values; but the code isn't factored that way at the moment. I'm not entirely sure that such cases arise often enough to be worth making it happen. I think there was discussion of this point back when the index-only-scan patch was being written, and we decided it didn't seem worth pursuing at the time. regards, tom lane
On Oct 12, 2013, at 4:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The reason you're losing on this is that the "select *" command eliminates > the possibility of an index-only scan (I'm assuming that that selects some > columns that aren't in the index). Given that a plain indexscan will > always involve fetching each heap row that satisfies the indexable > condition (the one on tz), the planner figures it might as well use the > physically-smaller index. OK, that logic makes sense. In the particular case I'm looking at, the comparison to colb will match such a tiny fractionthat I think it should be faster to use the index first before fetching heap rows. (It most certainly would be fasterif the rows to be evaluated for the colb match were randomly dispersed, but because they tend to be naturally clusteredon tz anyway, and the rows are pretty small, there's some chance an index scan might not save enough heap row I/Oto offset it's own I/O.) > It's true that in principle we could use the index-only-scan index AM > machinery to retrieve colb from the index, and then check the LIKE > predicate on that value before we go to the heap to get the other values; > but the code isn't factored that way at the moment. I'm not entirely sure > that such cases arise often enough to be worth making it happen. I think > there was discussion of this point back when the index-only-scan patch was > being written, and we decided it didn't seem worth pursuing at the time. It's not a common-enough case for me to worry about. This is a very rare query in this application--I just wanted to knowif I was missing something wrt indexes or whatever. It took me a long time to even find varchar_pattern_ops. (This isone particular question where the top results from google searches are dominated by incorrect assertions. Yes, Virginia,it *IS* possible to use an index in evaluating a like '%whatever' condition--whether or not it helps in a particularquery is an open question, but it most certainly is possible.) Besides, you've given me the hint, if I really care about this I can try a covering index ;-) -- Scott Ribe scott_ribe@elevated-dev.com http://www.elevated-dev.com/ (303) 722-0567 voice
On Sat, Oct 12, 2013 at 4:28 PM, Torsten Förtsch <torsten.foertsch@gmx.net> wrote: > On 12/10/13 20:08, Scott Ribe wrote: >> select * from test where tz >= start and tz < end and colb like '%foobar%' > > I think you can use an index only for wildcard expressions that are > anchored at the beginning. So, > > select * from test where tz >= start and tz < end > and colb like 'foobar%' > > can use an index on colb. > > You could perhaps > > select * from test where tz >= start and tz < end > and colb like 'foobar%' > union all > select * from test where tz >= start and tz < end > and reverse(colb) like 'raboof%' > > Then you need 2 indexes, one on colb the other on reverse(colb). > > You can have duplicates in the result set if the table contains rows > where colb='foobar'. If that's a problem, use union distinct. > > Alternatively, if foobar is kind of a word (with boundaries), you could > consider full-text search. pg_trgm module optimizes 'like with wildcards' without those restrictions. It's very fast for what it does. Because of the GIST/GIN dependency index only scans are not going to be used through pg_tgrm though. merlin
Thank you all. Both the double index & pg_trgm would be good solutions. On Oct 14, 2013, at 3:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Sat, Oct 12, 2013 at 4:28 PM, Torsten Förtsch > <torsten.foertsch@gmx.net> wrote: >> On 12/10/13 20:08, Scott Ribe wrote: >>> select * from test where tz >= start and tz < end and colb like '%foobar%' >> >> I think you can use an index only for wildcard expressions that are >> anchored at the beginning. So, >> >> select * from test where tz >= start and tz < end >> and colb like 'foobar%' >> >> can use an index on colb. >> >> You could perhaps >> >> select * from test where tz >= start and tz < end >> and colb like 'foobar%' >> union all >> select * from test where tz >= start and tz < end >> and reverse(colb) like 'raboof%' >> >> Then you need 2 indexes, one on colb the other on reverse(colb). >> >> You can have duplicates in the result set if the table contains rows >> where colb='foobar'. If that's a problem, use union distinct. >> >> Alternatively, if foobar is kind of a word (with boundaries), you could >> consider full-text search. > > pg_trgm module optimizes 'like with wildcards' without those > restrictions. It's very fast for what it does. Because of the > GIST/GIN dependency index only scans are not going to be used through > pg_tgrm though. > > merlin > -- Scott Ribe scott_ribe@elevated-dev.com http://www.elevated-dev.com/ (303) 722-0567 voice