Thread: Wildcard searches & performance question
Hello, I have a database with the following layout: searchterms=# \d+ searches_2002 Table "public.searches_2002" Column | Type | Modifiers | Description -----------+------------------------+-----------+------------- srchdate | date | not null | srchtime | time without time zone | not null | client_ip | inet | not null | srchquery | character varying(50) | not null | fhvalue | smallint | | Indexes: searches_2002_client_ip btree (client_ip), searches_2002_srchdate btree (srchdate), searches_2002_srchdatetime btree (srchdate, srchtime), searches_2002_srchquery btree (srchquery), searches_2002_srchquery_lcase btree (lower(srchquery)), searches_2002_srchquery_withfh btree (srchquery, fhvalue), searches_2002_srchtime btree (srchtime) There are no uniqueness properties that would make it possible for this table to have a primary key, as it is a list of searches performed on a search engine and the users' behaviour is, well... umm, odd, to be mild. :) Also, do note that this is a test table, so nevermind the excessive amount of indexes - performance is not an issue here, I am still evaluating the need and benefits of having various indexes on those columns. The particular case I am interested is this: when executing queries involving pattern searches using various operators on srchquery, none of the indexes are used in certain cases, namely those LIKE and regexp filters that start with a wildcard. This makes perfect sense, because wildcard pattern searches that start with a wildcard, can not really benefit from an index scan, because a sequential scan is probably going to be faster: we are only going to benefit from scanning an index in those special cases where the wildcard evaluates to a zero-length string. One example of a query plan: searchterms=# explain select count(*) from searches_2002 where srchquery like '%curriculum%'; QUERY PLAN -------------------------------------------------------------------------- Aggregate (cost=4583.26..4583.26 rows=1 width=0) -> Seq Scan on searches_2002 (cost=0.00..4583.26 rows=1 width=0) Filter: (srchquery ~~ '%curriculum%'::text) There is 211061 records in this test table, but the real-life tables would contain a much much larger amount of data, more like 50+ million rows. This promise of a hell on earth trying to optimize performance makes me wonder: would there be a sensible way/reason for avoiding sequential scans on queries that start with a wildcard, and would avoiding sequential scans even be feasible in such cases? Or in other words, can I somehow optimize LIKE and regexp queries that start with a wildcard? TIA, -- Grega Bremec System Administration & Development Support grega.bremec-at-noviforum.si http://najdi.si/ http://www.noviforum.si/
On Tue, May 27, 2003 at 09:09:08PM +0200, Grega Bremec wrote: > that start with a wildcard, and would avoiding sequential scans even be > feasible in such cases? Or in other words, can I somehow optimize LIKE and > regexp queries that start with a wildcard? Not really. But it sounds like you might be a candidate for full text indexing. See contrib/tsearch for one implementation. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Grega, See www.openfts.org for a tool to do what you need. There's also a simpler one in /contrib, as Andrew mentioned. -- -Josh Berkus Aglio Database Solutions San Francisco
What you want is full text searching. To see it in action go here: fts.postgresql.org To download it go here: http://openfts.sourceforge.net/ There is also the older, and slightly slower full text indexing engine, included in the /contrib/fulltextindex directory. It's a little more wrung out, but also not as likely to get maintenance in the future. Basically, full text indexing does exactly what you're asking for by indexing each row inserted by each word in it (that isn't a noise word like "the" or "a") and then uses the indexes created for its searches. On Tue, 27 May 2003, Grega Bremec wrote: > Hello, > > I have a database with the following layout: > > searchterms=# \d+ searches_2002 > Table "public.searches_2002" > Column | Type | Modifiers | Description > -----------+------------------------+-----------+------------- > srchdate | date | not null | > srchtime | time without time zone | not null | > client_ip | inet | not null | > srchquery | character varying(50) | not null | > fhvalue | smallint | | > Indexes: searches_2002_client_ip btree (client_ip), > searches_2002_srchdate btree (srchdate), > searches_2002_srchdatetime btree (srchdate, srchtime), > searches_2002_srchquery btree (srchquery), > searches_2002_srchquery_lcase btree (lower(srchquery)), > searches_2002_srchquery_withfh btree (srchquery, fhvalue), > searches_2002_srchtime btree (srchtime) > > There are no uniqueness properties that would make it possible for this table > to have a primary key, as it is a list of searches performed on a search > engine and the users' behaviour is, well... umm, odd, to be mild. :) > > Also, do note that this is a test table, so nevermind the excessive amount of > indexes - performance is not an issue here, I am still evaluating the need and > benefits of having various indexes on those columns. > > The particular case I am interested is this: when executing queries involving > pattern searches using various operators on srchquery, none of the indexes are > used in certain cases, namely those LIKE and regexp filters that start with > a wildcard. > > This makes perfect sense, because wildcard pattern searches that start with a > wildcard, can not really benefit from an index scan, because a sequential scan > is probably going to be faster: we are only going to benefit from scanning an > index in those special cases where the wildcard evaluates to a zero-length > string. > > One example of a query plan: > > searchterms=# explain select count(*) > from searches_2002 > where srchquery like '%curriculum%'; > QUERY PLAN > -------------------------------------------------------------------------- > Aggregate (cost=4583.26..4583.26 rows=1 width=0) > -> Seq Scan on searches_2002 (cost=0.00..4583.26 rows=1 width=0) > Filter: (srchquery ~~ '%curriculum%'::text) > > There is 211061 records in this test table, but the real-life tables would > contain a much much larger amount of data, more like 50+ million rows. > > This promise of a hell on earth trying to optimize performance makes me wonder: > would there be a sensible way/reason for avoiding sequential scans on queries > that start with a wildcard, and would avoiding sequential scans even be > feasible in such cases? Or in other words, can I somehow optimize LIKE and > regexp queries that start with a wildcard? > > TIA, >
> feasible in such cases? Or in other words, can I somehow optimize LIKE and > regexp queries that start with a wildcard? If they start with a wildcard, but end in character data you could reverse the string and index that... If they start and end with a wildcard, your best bet is a full text indexing solution (various contrib modules). -- Rod Taylor <rbt@rbt.ca> PGP Key: http://www.rbt.ca/rbtpub.asc
Attachment
Thank you very much for all your suggestions. I am planning on investing some time into trying out a couple FT indexes. Not minding the fact most of the queries are going to be exact phrase substring searches, performance will most probably benefit from it, so at least some of what I'm after is achieved that way. I shall be getting back with reports, in case anyone is interested. Cheers, -- Grega Bremec System Administration & Development Support grega.bremec-at-noviforum.si http://najdi.si/ http://www.noviforum.si/
On Wed, 28 May 2003, Grega Bremec wrote: > Thank you very much for all your suggestions. > > I am planning on investing some time into trying out a couple FT indexes. Not > minding the fact most of the queries are going to be exact phrase substring > searches, performance will most probably benefit from it, so at least some of > what I'm after is achieved that way. > > I shall be getting back with reports, in case anyone is interested. Be sure to look at OpenFTS http://sourceforge.net/projects/openfts/