Thread: How to boost performance of queries containing pattern matching characters

How to boost performance of queries containing pattern matching characters

From
"Gnanakumar"
Date:
Hi,

How can we boost performance of queries containing pattern matching
characters?  In my case, we're using a percent sign (%) that matches any
string of zero or more characters.

QUERY:  DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%'

EMAIL column is VARCHAR(256).

As it is clear from the above query, email is matched "partially and
case-insensitively", which my application requirement demands.

In case, if it were a full match, I could easily define a functional INDEX
on EMAIL column (lower(EMAIL)) and I could rewrite my DELETE where criteria
like lower(EMAIL) = 'someemail@domain.com'.

MYTABLE currently contains 2 million records and grows consistently.

Regards,
Gnanam


Re: How to boost performance of queries containing pattern matching characters

From
Richard Huxton
Date:
On 14/02/11 06:59, Gnanakumar wrote:
>
> How can we boost performance of queries containing pattern matching
> characters?

> QUERY:  DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%'

> As it is clear from the above query, email is matched "partially and
> case-insensitively", which my application requirement demands.

Well, for that exact pattern you're not going to find an index that's
much help. Do you really need something so wide-ranging though? The
above will match all of the following:

user1@domain.com
user2@sub.domain.com
user3@domain.com.au
user4@unrelated-domain.com
user5@unrelated-domain.com.au
user3@sub.domain.com.au
user4@sub.unrelated-domain.com
user5@sub.unrelated-domain.com.au
user6@sub.unrelated-domain.completely-wrong.com

Is that really what you are after? Or, did you just want to match:
   user1@domain.com
   user2@sub.domain.com

--
   Richard Huxton
   Archonet Ltd

Re: How to boost performance of queries containing pattern matching characters

From
"Gnanakumar"
Date:
> Is that really what you are after? Or, did you just want to match:
>    user1@domain.com
>    user2@sub.domain.com

I understand that because I've (%) at the beginning and end, it's going to
match unrelated domains, etc., which as you said rightly, it is
wide-ranging.  But my point here is that how can I improve performance of
the queries containing pattern matching characters.


Re: How to boost performance of queries containing pattern matching characters

From
Richard Huxton
Date:
On 14/02/11 07:28, Gnanakumar wrote:
>> Is that really what you are after? Or, did you just want to match:
>>     user1@domain.com
>>     user2@sub.domain.com
>
> I understand that because I've (%) at the beginning and end, it's going to
> match unrelated domains, etc., which as you said rightly, it is
> wide-ranging.  But my point here is that how can I improve performance of
> the queries containing pattern matching characters.

If you really need to match all those options, you can't use an index. A
substring-matching index would need to have multiple entries per
character per value (since it doesn't know what you will search for).
The index-size becomes unmanageable very quickly.

That's why I asked what you really wanted to match.

So, I'll ask again: do you really want to match all of those options?

--
   Richard Huxton
   Archonet Ltd

Re: How to boost performance of queries containing pattern matching characters

From
"Gnanakumar"
Date:
> If you really need to match all those options, you can't use an index. A
> substring-matching index would need to have multiple entries per
> character per value (since it doesn't know what you will search for).
> The index-size becomes unmanageable very quickly.

> That's why I asked what you really wanted to match.
To be more specific, in fact, our current application allows to delete
email(s) with a minimum of 3 characters.  There is a note/warning also given
for application Users' before deleting, explaining the implication of this
delete action (partial & case-insensitive, and it could be wide-ranging
too).

> So, I'll ask again: do you really want to match all of those options?
Yes, as explained above, I want to match all those.


Re: How to boost performance of queries containing pattern matching characters

From
Richard Huxton
Date:
On 14/02/11 07:38, Artur Zając wrote:
> I had almost the same problem.
> To resolve it, I created my own text search parser (myftscfg) which divides
> text in column into three letters parts, for example:
>
> someemail@domain.com is divided to som, ome,mee,eem,ema,mai,ail,il@,
> l@d,@do,dom,oma,mai,ain,in.,n.c,.co,com
>
> There should be also index on email column:
>
> CREATE INDEX "email _fts" on mytable using gin
> (to_tsvector('myftscfg'::regconfig, email))
>
> Every query like email ilike '%domain.com%' should be rewrited to:
>
> WHERE
> to_tsvector('myftscfg',email) @@ to_tsquery('dom') AND
> to_tsvector('myftscfg',email) @@ to_tsquery('oma') AND
> to_tsvector('myftscfg',email) @@ to_tsquery('mai') AND
...

Looks like you've almost re-invented the trigram module:
   http://www.postgresql.org/docs/9.0/static/pgtrgm.html

--
   Richard Huxton
   Archonet Ltd

Re: How to boost performance of queries containing pattern matching characters

From
Richard Huxton
Date:
On 14/02/11 07:46, Gnanakumar wrote:
>> If you really need to match all those options, you can't use an index. A
>> substring-matching index would need to have multiple entries per
>> character per value (since it doesn't know what you will search for).
>> The index-size becomes unmanageable very quickly.
>
>> That's why I asked what you really wanted to match.
> To be more specific, in fact, our current application allows to delete
> email(s) with a minimum of 3 characters.  There is a note/warning also given
> for application Users' before deleting, explaining the implication of this
> delete action (partial&  case-insensitive, and it could be wide-ranging
> too).
>
>> So, I'll ask again: do you really want to match all of those options?
> Yes, as explained above, I want to match all those.

Then you can't use a simple index. If you did use an index it would
probably be much slower for "com" or "yah" or "gma" and so on.

The closest you can do is something like Artur's option (or the pg_trgm
module - handy since you are looking at 3-chars and up) to select likely
matches combined with a separate search on '%domain.com%' to confirm
that fact.

P.S. - I'd be inclined to just match the central domain parts, so for
"user1@europe.megacorp.com" you would index "europe" and "megacorp" and
only allow matching on the start of each string. Of course if your
application spec says you need to match on "p.c" too then that's what
you have to do.

--
   Richard Huxton
   Archonet Ltd

Re: How to boost performance of queries containing pattern matching characters

From
Artur Zając
Date:
>How can we boost performance of queries containing pattern matching
>characters?  In my case, we're using a percent sign (%) that matches any
string of  zero or more characters.
>
> QUERY:  DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%'
>
> EMAIL column is VARCHAR(256).
>
> As it is clear from the above query, email is matched "partially and
case-insensitively", which my application requirement demands.
>
> In case, if it were a full match, I could easily define a functional
> INDEX on EMAIL column (lower(EMAIL)) and I could rewrite my DELETE where
criteria like lower(EMAIL) = 'someemail@domain.com'.
>
> MYTABLE currently contains 2 million records and grows consistently.

I had almost the same problem.
To resolve it, I created my own text search parser (myftscfg) which divides
text in column into three letters parts, for example:

someemail@domain.com is divided to som, ome,mee,eem,ema,mai,ail,il@,
l@d,@do,dom,oma,mai,ain,in.,n.c,.co,com

There should be also index on email column:

CREATE INDEX "email _fts" on mytable using gin
(to_tsvector('myftscfg'::regconfig, email))

Every query like email ilike '%domain.com%' should be rewrited to:

WHERE
to_tsvector('myftscfg',email) @@ to_tsquery('dom') AND
to_tsvector('myftscfg',email) @@ to_tsquery('oma') AND
to_tsvector('myftscfg',email) @@ to_tsquery('mai') AND
to_tsvector('myftscfg',email) @@ to_tsquery('ain') AND
to_tsvector('myftscfg',email) @@ to_tsquery('in.') AND
to_tsvector('myftscfg',email) @@ to_tsquery('n.c') AND
to_tsvector('myftscfg',email) @@ to_tsquery('.co') AND
to_tsvector('myftscfg',email) @@ to_tsquery('com') AND email ILIKE
'%domain.com%';

Index is reducing number of records and clause email ILIKE '%domain.com%' is
selecting only valid records.

I didn't found better solution.

-------------------------------------------
Artur Zajac


Re: How to boost performance of queries containing pattern matching characters

From
"Gnanakumar"
Date:
> The closest you can do is something like Artur's option (or the pg_trgm
> module - handy since you are looking at 3-chars and up) to select likely
> matches combined with a separate search on '%domain.com%' to confirm
> that fact.

Thanks for your suggestion.  Our production server is currently running
PostgreSQL v8.2.3.  I think pg_trgm contrib module is not available for 8.2
series.

Also, I read about WildSpeed - fast wildcard search for LIKE operator.  What
is your opinion on that?
http://www.sai.msu.su/~megera/wiki/wildspeed


Re: How to boost performance of queries containing pattern matching characters

From
Artur Zając
Date:
> Looks like you've almost re-invented the trigram module:
>   http://www.postgresql.org/docs/9.0/static/pgtrgm.html

I didn't know about this module.
Idea to use three letters strings and use Full Text Search is the same, but
the rest is not.

Is the idea to use similarity for this problem is really correct? How should
be query constructed to return really all records matching ILIKE criteria?


-------------------------------------------
Artur Zajac


Re: How to boost performance of queries containing pattern matching characters

From
Shaun Thomas
Date:
On 02/14/2011 12:59 AM, Gnanakumar wrote:

> QUERY:  DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%'
> EMAIL column is VARCHAR(256).

Honestly? You'd be better off normalizing this column and maybe hiding
that fact in a view if your app requires email as a single column. Split
it like this:

So user@gmail.com becomes:

email_acct (user)
email_domain (gmail)
email_tld (com)

This would let you drop the first % on your like match and then
traditional indexes would work just fine. You could also differentiate
between domains with different TLDs without using wildcards, which is
always faster.

I might ask why you are checking email for wildcards after the TLD in
the first place. Is it really so common you are trying to match .com,
.com.au, .com.bar.baz.edu, or whatever? At the very least, splitting the
account from the domain+tld would be beneficial, as it would remove the
necessity of the first wildcard, which is really what's hurting you.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: How to boost performance of queries containing pattern matching characters

From
Greg Smith
Date:
Gnanakumar wrote:
> Thanks for your suggestion.  Our production server is currently running
> PostgreSQL v8.2.3.  I think pg_trgm contrib module is not available for 8.2
> series.
>

You're going to find that most of the useful answers here will not work
on 8.2.  Full-text search was not fully integrated into the database
until 8.3.  Trying to run an app using it on 8.2 is going to be a
constant headache for you.  Also, moving from 8.2 to 8.3 is just a
general performance boost in many ways.


> Also, I read about WildSpeed - fast wildcard search for LIKE operator.  What
> is your opinion on that?
> http://www.sai.msu.su/~megera/wiki/wildspeed
>

WildSpeed works fine if you can handle the massive disk space and
maintenance overhead it introduces.  I consider it useful only for data
sets that are almost static, where you can afford to build its large
index structure once and then use it to accelerate reads continuously,
with minimal updates.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: How to boost performance of queries containing pattern matching characters

From
Steve Atkins
Date:
On Feb 14, 2011, at 12:09 AM, Artur Zając wrote:

>> Looks like you've almost re-invented the trigram module:
>>  http://www.postgresql.org/docs/9.0/static/pgtrgm.html
>
> I didn't know about this module.
> Idea to use three letters strings and use Full Text Search is the same, but
> the rest is not.
>
> Is the idea to use similarity for this problem is really correct? How should
> be query constructed to return really all records matching ILIKE criteria?

If what you really want is the ability to select email addresses based on
subdomain, you might want to do this instead:

create email_domain_idx on mytable (reverse(lower(split_part(email, '@', 2))));

Then you can do things like this ...

delete from mytable where reverse(lower(split_part(email, '@', 2))) = reverse('aol.com');

... to delete all aol.com users or like this ...

delete from mytable where reverse(lower(split_part(email, '@', 2))) like reverse('%.aol.com');

... to delete all email addresses that are in a subdomain of "aol.com".

You need a reverse() function to do that. Here's one in plpgsql:

CREATE OR REPLACE FUNCTION reverse(text) RETURNS text AS '
DECLARE
       original alias for $1;
       reverse_str text;
       i int4;
BEGIN
 reverse_str = '''';
 FOR i IN REVERSE LENGTH(original)..1 LOOP
  reverse_str = reverse_str || substr(original,i,1);
 END LOOP;
 return reverse_str;
END;'
LANGUAGE 'plpgsql' IMMUTABLE;

(Normalizing the email address so that you store local part and domain part separately is even better, but an index on
thereverse of the domain is still useful for working with subdomains). 

Cheers,
  Steve