Thread: Re: LIKE and REGEX optimization

Re: LIKE and REGEX optimization

From
Chris Browne
Date:
Kico Zaninetti <kicozaninetti@gmail.com> writes:
> Hi all.
>
> I have a database with 62 million registers and I have to make a
> SELECT using LIKE.
>
> This is my select:
> SELECT * FROM phone WHERE name LIKE = '%ZANINETTI%' AND city = 'SAO
> PAULO' AND state = 'SP'
>
> I have an index created like this:
> CREATE INDEX "telefones_idx2" ON "public"."phone"
>   USING btree ("name" varchar_pattern_ops, "city", "state");
>
> When I explain the query I have this:
> QUERY PLAN
> Bitmap Heap Scan on telefones  (cost=1031528.27..2726942.75 rows=4
> width=145)
>   Recheck Cond: (((city)::text = 'SAO PAULO'::text) AND ((state)::text
> = 'SP'::text))
>   Filter: ((name)::text ~~ '%ZANINETTI%'::text)
>   ->  Bitmap Index Scan on telefones_idx2  (cost=0.00..1031528.27
> rows=1712760 width=0)
>         Index Cond: (((city)::text = 'SAO PAULO'::text) AND
> ((state)::text = 'SP'::text))
>
>
> The cost is over than 1 million! It's to high and I have to reduce it.
> Does someone know how can I make it?

You may be misreading what the cost is...

It looks to me like the planner is estimating that there are 1712760
rows where city = 'SAO PAULO' and state = 'SP'; is that estimate way
off?

If it is, then maybe you need to ANALYZE the table, perhaps combined
with altering the histogram size for city/state.

(e.g. - alter table phone alter column city set statistics 100;
        alter table phone alter column state set statistics 100;)

If altering statistics, then re-ANALYZEing the table helps, that may
resolve things.

Unfortunately, the LIKE clause isn't going to readily take advantage
of regular indices, because you have put in '%' at beginning and end,
which means that regular indices will not be usable.

I wouldn't be too surprised if that is the case; Sao Paulo is about
the most populous South American city, with ~11M people, so having a
database with 1.7M phone numbers in that city does not seem "out
there."

The only idea that comes to mind to follow that is to look into
tsearch.  On PostgreSQL 8.3, it's built in; on 8.2 and earlier, it's a
"contrib" module.  It allows indexing on words inside columns, which
would seem to fit your requirement.
--
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://cbbrowne.com/info/sap.html
Signs  of  a  Klingon  Programmer #3:  "By   filing this  TPR you have
challenged the honor of my family. Prepare to die!"

Re: LIKE and REGEX optimization

From
"Scott Marlowe"
Date:
On Jan 15, 2008 2:29 PM, Chris Browne <cbbrowne@acm.org> wrote:
> Kico Zaninetti <kicozaninetti@gmail.com> writes:
> > Hi all.
> >
> > I have a database with 62 million registers and I have to make a
> > SELECT using LIKE.
> >
> > This is my select:
> > SELECT * FROM phone WHERE name LIKE = '%ZANINETTI%' AND city = 'SAO
> > PAULO' AND state = 'SP'

This query is not capable of using an index on name, since you can't
use an index with a like beginning with a %... So

> > I have an index created like this:
> > CREATE INDEX "telefones_idx2" ON "public"."phone"
> >   USING btree ("name" varchar_pattern_ops, "city", "state");

This index serves no purpose, because they get used left to right.
Since you can't match name, the other two parts aren't used.

Does the OP have an index on city,state?  That might help.

Re: LIKE and REGEX optimization

From
Richard Huxton
Date:
Chris Browne wrote:
> The only idea that comes to mind to follow that is to look into
> tsearch.  On PostgreSQL 8.3, it's built in; on 8.2 and earlier, it's a
> "contrib" module.  It allows indexing on words inside columns, which
> would seem to fit your requirement.

You might want to google for "postgresql trigram" too. Some notes at the
location below.
http://www.sai.msu.su/~megera/oddmuse/index.cgi/ReadmeTrgm

It's more of a substring search than tsearch2 is, so might meet your
needs better. It's in the "contrib" package / source directory.

--
   Richard Huxton
   Archonet Ltd

Re: LIKE and REGEX optimization

From
hubert depesz lubaczewski
Date:
On Tue, Jan 15, 2008 at 04:49:41PM -0600, Scott Marlowe wrote:
> This query is not capable of using an index on name, since you can't
> use an index with a like beginning with a %... So

actually you can. you just can't use index for like %something%, but it
can be solved using trigrams or another approaches. for example:
http://www.depesz.com/index.php/2007/09/15/speeding-up-like-xxx/

depesz

--
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA.  here's my CV!" :)
http://www.depesz.com/ - blog dla ciebie (i moje CV)