Thread: How to boost performance of queries containing pattern matching characters
Hi, How can we boost performance of queries containing pattern matching characters? In my case, we're using a percent sign (%) that matches any string of zero or more characters. QUERY: DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%' EMAIL column is VARCHAR(256). As it is clear from the above query, email is matched "partially and case-insensitively", which my application requirement demands. In case, if it were a full match, I could easily define a functional INDEX on EMAIL column (lower(EMAIL)) and I could rewrite my DELETE where criteria like lower(EMAIL) = 'someemail@domain.com'. MYTABLE currently contains 2 million records and grows consistently. Regards, Gnanam
Re: How to boost performance of queries containing pattern matching characters
From
Richard Huxton
Date:
On 14/02/11 06:59, Gnanakumar wrote: > > How can we boost performance of queries containing pattern matching > characters? > QUERY: DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%' > As it is clear from the above query, email is matched "partially and > case-insensitively", which my application requirement demands. Well, for that exact pattern you're not going to find an index that's much help. Do you really need something so wide-ranging though? The above will match all of the following: user1@domain.com user2@sub.domain.com user3@domain.com.au user4@unrelated-domain.com user5@unrelated-domain.com.au user3@sub.domain.com.au user4@sub.unrelated-domain.com user5@sub.unrelated-domain.com.au user6@sub.unrelated-domain.completely-wrong.com Is that really what you are after? Or, did you just want to match: user1@domain.com user2@sub.domain.com -- Richard Huxton Archonet Ltd
Re: How to boost performance of queries containing pattern matching characters
From
"Gnanakumar"
Date:
> Is that really what you are after? Or, did you just want to match: > user1@domain.com > user2@sub.domain.com I understand that because I've (%) at the beginning and end, it's going to match unrelated domains, etc., which as you said rightly, it is wide-ranging. But my point here is that how can I improve performance of the queries containing pattern matching characters.
Re: How to boost performance of queries containing pattern matching characters
From
Richard Huxton
Date:
On 14/02/11 07:28, Gnanakumar wrote: >> Is that really what you are after? Or, did you just want to match: >> user1@domain.com >> user2@sub.domain.com > > I understand that because I've (%) at the beginning and end, it's going to > match unrelated domains, etc., which as you said rightly, it is > wide-ranging. But my point here is that how can I improve performance of > the queries containing pattern matching characters. If you really need to match all those options, you can't use an index. A substring-matching index would need to have multiple entries per character per value (since it doesn't know what you will search for). The index-size becomes unmanageable very quickly. That's why I asked what you really wanted to match. So, I'll ask again: do you really want to match all of those options? -- Richard Huxton Archonet Ltd
Re: How to boost performance of queries containing pattern matching characters
From
"Gnanakumar"
Date:
> If you really need to match all those options, you can't use an index. A > substring-matching index would need to have multiple entries per > character per value (since it doesn't know what you will search for). > The index-size becomes unmanageable very quickly. > That's why I asked what you really wanted to match. To be more specific, in fact, our current application allows to delete email(s) with a minimum of 3 characters. There is a note/warning also given for application Users' before deleting, explaining the implication of this delete action (partial & case-insensitive, and it could be wide-ranging too). > So, I'll ask again: do you really want to match all of those options? Yes, as explained above, I want to match all those.
Re: How to boost performance of queries containing pattern matching characters
From
Richard Huxton
Date:
On 14/02/11 07:38, Artur Zając wrote: > I had almost the same problem. > To resolve it, I created my own text search parser (myftscfg) which divides > text in column into three letters parts, for example: > > someemail@domain.com is divided to som, ome,mee,eem,ema,mai,ail,il@, > l@d,@do,dom,oma,mai,ain,in.,n.c,.co,com > > There should be also index on email column: > > CREATE INDEX "email _fts" on mytable using gin > (to_tsvector('myftscfg'::regconfig, email)) > > Every query like email ilike '%domain.com%' should be rewrited to: > > WHERE > to_tsvector('myftscfg',email) @@ to_tsquery('dom') AND > to_tsvector('myftscfg',email) @@ to_tsquery('oma') AND > to_tsvector('myftscfg',email) @@ to_tsquery('mai') AND ... Looks like you've almost re-invented the trigram module: http://www.postgresql.org/docs/9.0/static/pgtrgm.html -- Richard Huxton Archonet Ltd
Re: How to boost performance of queries containing pattern matching characters
From
Richard Huxton
Date:
On 14/02/11 07:46, Gnanakumar wrote: >> If you really need to match all those options, you can't use an index. A >> substring-matching index would need to have multiple entries per >> character per value (since it doesn't know what you will search for). >> The index-size becomes unmanageable very quickly. > >> That's why I asked what you really wanted to match. > To be more specific, in fact, our current application allows to delete > email(s) with a minimum of 3 characters. There is a note/warning also given > for application Users' before deleting, explaining the implication of this > delete action (partial& case-insensitive, and it could be wide-ranging > too). > >> So, I'll ask again: do you really want to match all of those options? > Yes, as explained above, I want to match all those. Then you can't use a simple index. If you did use an index it would probably be much slower for "com" or "yah" or "gma" and so on. The closest you can do is something like Artur's option (or the pg_trgm module - handy since you are looking at 3-chars and up) to select likely matches combined with a separate search on '%domain.com%' to confirm that fact. P.S. - I'd be inclined to just match the central domain parts, so for "user1@europe.megacorp.com" you would index "europe" and "megacorp" and only allow matching on the start of each string. Of course if your application spec says you need to match on "p.c" too then that's what you have to do. -- Richard Huxton Archonet Ltd
Re: How to boost performance of queries containing pattern matching characters
From
Artur Zając
Date:
>How can we boost performance of queries containing pattern matching >characters? In my case, we're using a percent sign (%) that matches any string of zero or more characters. > > QUERY: DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%' > > EMAIL column is VARCHAR(256). > > As it is clear from the above query, email is matched "partially and case-insensitively", which my application requirement demands. > > In case, if it were a full match, I could easily define a functional > INDEX on EMAIL column (lower(EMAIL)) and I could rewrite my DELETE where criteria like lower(EMAIL) = 'someemail@domain.com'. > > MYTABLE currently contains 2 million records and grows consistently. I had almost the same problem. To resolve it, I created my own text search parser (myftscfg) which divides text in column into three letters parts, for example: someemail@domain.com is divided to som, ome,mee,eem,ema,mai,ail,il@, l@d,@do,dom,oma,mai,ain,in.,n.c,.co,com There should be also index on email column: CREATE INDEX "email _fts" on mytable using gin (to_tsvector('myftscfg'::regconfig, email)) Every query like email ilike '%domain.com%' should be rewrited to: WHERE to_tsvector('myftscfg',email) @@ to_tsquery('dom') AND to_tsvector('myftscfg',email) @@ to_tsquery('oma') AND to_tsvector('myftscfg',email) @@ to_tsquery('mai') AND to_tsvector('myftscfg',email) @@ to_tsquery('ain') AND to_tsvector('myftscfg',email) @@ to_tsquery('in.') AND to_tsvector('myftscfg',email) @@ to_tsquery('n.c') AND to_tsvector('myftscfg',email) @@ to_tsquery('.co') AND to_tsvector('myftscfg',email) @@ to_tsquery('com') AND email ILIKE '%domain.com%'; Index is reducing number of records and clause email ILIKE '%domain.com%' is selecting only valid records. I didn't found better solution. ------------------------------------------- Artur Zajac
Re: How to boost performance of queries containing pattern matching characters
From
"Gnanakumar"
Date:
> The closest you can do is something like Artur's option (or the pg_trgm > module - handy since you are looking at 3-chars and up) to select likely > matches combined with a separate search on '%domain.com%' to confirm > that fact. Thanks for your suggestion. Our production server is currently running PostgreSQL v8.2.3. I think pg_trgm contrib module is not available for 8.2 series. Also, I read about WildSpeed - fast wildcard search for LIKE operator. What is your opinion on that? http://www.sai.msu.su/~megera/wiki/wildspeed
Re: How to boost performance of queries containing pattern matching characters
From
Artur Zając
Date:
> Looks like you've almost re-invented the trigram module: > http://www.postgresql.org/docs/9.0/static/pgtrgm.html I didn't know about this module. Idea to use three letters strings and use Full Text Search is the same, but the rest is not. Is the idea to use similarity for this problem is really correct? How should be query constructed to return really all records matching ILIKE criteria? ------------------------------------------- Artur Zajac
Re: How to boost performance of queries containing pattern matching characters
From
Shaun Thomas
Date:
On 02/14/2011 12:59 AM, Gnanakumar wrote: > QUERY: DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%' > EMAIL column is VARCHAR(256). Honestly? You'd be better off normalizing this column and maybe hiding that fact in a view if your app requires email as a single column. Split it like this: So user@gmail.com becomes: email_acct (user) email_domain (gmail) email_tld (com) This would let you drop the first % on your like match and then traditional indexes would work just fine. You could also differentiate between domains with different TLDs without using wildcards, which is always faster. I might ask why you are checking email for wildcards after the TLD in the first place. Is it really so common you are trying to match .com, .com.au, .com.bar.baz.edu, or whatever? At the very least, splitting the account from the domain+tld would be beneficial, as it would remove the necessity of the first wildcard, which is really what's hurting you. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@peak6.com ______________________________________________ See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
Re: How to boost performance of queries containing pattern matching characters
From
Greg Smith
Date:
Gnanakumar wrote: > Thanks for your suggestion. Our production server is currently running > PostgreSQL v8.2.3. I think pg_trgm contrib module is not available for 8.2 > series. > You're going to find that most of the useful answers here will not work on 8.2. Full-text search was not fully integrated into the database until 8.3. Trying to run an app using it on 8.2 is going to be a constant headache for you. Also, moving from 8.2 to 8.3 is just a general performance boost in many ways. > Also, I read about WildSpeed - fast wildcard search for LIKE operator. What > is your opinion on that? > http://www.sai.msu.su/~megera/wiki/wildspeed > WildSpeed works fine if you can handle the massive disk space and maintenance overhead it introduces. I consider it useful only for data sets that are almost static, where you can afford to build its large index structure once and then use it to accelerate reads continuously, with minimal updates. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: How to boost performance of queries containing pattern matching characters
From
Steve Atkins
Date:
On Feb 14, 2011, at 12:09 AM, Artur Zając wrote: >> Looks like you've almost re-invented the trigram module: >> http://www.postgresql.org/docs/9.0/static/pgtrgm.html > > I didn't know about this module. > Idea to use three letters strings and use Full Text Search is the same, but > the rest is not. > > Is the idea to use similarity for this problem is really correct? How should > be query constructed to return really all records matching ILIKE criteria? If what you really want is the ability to select email addresses based on subdomain, you might want to do this instead: create email_domain_idx on mytable (reverse(lower(split_part(email, '@', 2)))); Then you can do things like this ... delete from mytable where reverse(lower(split_part(email, '@', 2))) = reverse('aol.com'); ... to delete all aol.com users or like this ... delete from mytable where reverse(lower(split_part(email, '@', 2))) like reverse('%.aol.com'); ... to delete all email addresses that are in a subdomain of "aol.com". You need a reverse() function to do that. Here's one in plpgsql: CREATE OR REPLACE FUNCTION reverse(text) RETURNS text AS ' DECLARE original alias for $1; reverse_str text; i int4; BEGIN reverse_str = ''''; FOR i IN REVERSE LENGTH(original)..1 LOOP reverse_str = reverse_str || substr(original,i,1); END LOOP; return reverse_str; END;' LANGUAGE 'plpgsql' IMMUTABLE; (Normalizing the email address so that you store local part and domain part separately is even better, but an index on thereverse of the domain is still useful for working with subdomains). Cheers, Steve