Thread: How to get around LIKE inefficiencies?

How to get around LIKE inefficiencies?

From
The Hermit Hacker
Date:
I'm trying to figure out how to speed up udmsearch when run under
postgresql, and am being hit by atrocious performance when using a LIKE
query ... the query looks like:

SELECT ndict.url_id,ndict.intag
  FROM ndict,url
 WHERE ndict.word_id=1971739852
   AND url.rec_id=ndict.url_id
   AND (url.url LIKE 'http://www.postgresql.org/%');
 

Take off the AND ( LIKE ) part of the query and it finishes almost as soon as
you hit return.  Put it back in, and you can go for coffee before it
finishes ...

If I do 'SELECT url_id FROM ndict WHERE word_id=1971739852', there
are 153 records returned ... is there some way, that I'm not thinking of, to
re-write the above so that it 'resolves' the equality before the LIKE, in
order to reduce the number of tuples it has to do the LIKE on?  Is
there some way of writing the above so that it doesn't take forever to
execute?
I'm running this on a Dual-PIII 450 Server, 512Meg of RAM, zero
swap space being used ... the database has its indices on one hard drive,
the tables themselves are on a second one ... it's PgSQL 7.0.2 (Tom,
anything in v7.0.3 that might improve this?) and startup is as:

#!/bin/tcsh
setenv PORT 5432
setenv POSTMASTER /pgsql/bin/postmaster
unlimit
${POSTMASTER} -B 384 -N 192 \
              -o "-F -S 32768" \
              -i -p ${PORT} -D/pgsql/data >& /pgsql/logs/postmaster.${PORT}.$$ &
So it's not like I'm not throwing a lot of resources at this ...
Is there anything that we can do to improve this?  I was trying to
think of some way to use a subselect to narrow the search results, or
something ...
Oh, the above explains down to:

NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..1195.14 rows=1 width=10)
  ->  Index Scan using url_url on url  (cost=0.00..2.73 rows=1 width=4)
  ->  Index Scan using n_word on ndict  (cost=0.00..1187.99 rows=353 width=6)
 

EXPLAIN
ndict: 663018 tuples  url:  29276 tuples

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: How to get around LIKE inefficiencies?

From
Tom Lane
Date:
A brute-force answer would be to remove the url_url index ;-)
dunno if that would slow down other queries, however.
        regards, tom lane


Re: How to get around LIKE inefficiencies?

From
Bruce Momjian
Date:
Sorry to be getting in here late.  Have you tried CLUSTER?  If it is
using an index scan, and it is slow, cluster often helps, especially
when there are several duplicate matches, as there is with LIKE.  Let me
know how that works.
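For reference, a minimal sketch of that suggestion, assuming the index being
scanned is n_word (the one in the EXPLAIN output earlier in the thread); old
CLUSTER implementations rewrite the table, so check the caveats in the docs
(other indexes on ndict may need to be recreated afterwards):

    -- cluster ndict physically on the n_word index so that rows sharing a
    -- word_id sit next to each other in the heap, then refresh statistics
    CLUSTER n_word ON ndict;
    VACUUM ANALYZE ndict;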

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.
> 
>             regards, tom lane
> 




Re: How to get around LIKE inefficiencies?

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Sorry to be getting in here late.  Have you tried CLUSTER?

Prolly won't help much.  I think what he's getting burnt by
is that the planner thinks that an indexscan based on the
LIKE 'http://www.postgresql.org/%' condition will be extremely
selective --- it has no idea that most of the URLs in his table
will match that prefix.  It's ye same olde nonuniform-distribution
problem; until we have better statistics, there's not much hope
for a non-kluge solution.
        regards, tom lane
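For reference, one quick way to see the mismatch Tom describes, between the
planner's guess (rows=1 in the EXPLAIN output above) and reality, is to count
the matches directly; a minimal sketch:

    -- compare the total number of URLs with the number matching the prefix;
    -- if most rows match, the LIKE condition filters out almost nothing
    SELECT count(*) FROM url;
    SELECT count(*) FROM url WHERE url LIKE 'http://www.postgresql.org/%';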


Re: How to get around LIKE inefficiencies?

From
Philip Warner
Date:
At 21:28 5/11/00 -0500, Tom Lane wrote:
>A brute-force answer would be to remove the url_url index ;-)
>dunno if that would slow down other queries, however.

Could you trick it into not using the index (AND using the other strategy?)
by using a calculation:

SELECT ndict.url_id,ndict.intag
  FROM ndict,url
 WHERE ndict.word_id=1971739852
   AND url.rec_id=ndict.url_id
   AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');
 

it's a bit nasty.

If you had 7.1, the following might work:

SELECT url_id,intag
  FROM (SELECT ndict.url_id,ndict.intag,url
          FROM ndict,url
         WHERE ndict.word_id=1971739852
           AND url.rec_id=ndict.url_id) AS zzz
 WHERE zzz.url LIKE 'http://www.postgresql.org/%'

But I don't know how subselect-in-from handles this sort of query - it
might be 'clever' enough to fold it somehow.






Re: How to get around LIKE inefficiencies?

From
Philip Warner
Date:
At 21:47 5/11/00 -0500, Tom Lane wrote:
>It's ye same olde nonuniform-distribution
>problem; until we have better statistics, there's not much hope
>for a non-kluge solution.

Wasn't somebody trying to do something with that a few weeks back?




Re: How to get around LIKE inefficiencies?

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Sorry to be getting in here late.  Have you tried CLUSTER?
> 
> Prolly won't help much.  I think what he's getting burnt by
> is that the planner thinks that an indexscan based on the
> LIKE 'http://www.postgresql.org/%' condition will be extremely
> selective --- it has no idea that most of the URLs in his table
> will match that prefix.  It's ye same olde nonuniform-distribution
> problem; until we have better statistics, there's not much hope
> for a non-kluge solution.

But I think it will help.  There will be lots of index lookups, but they
will be sequential in the heap, not random over the heap.



Re: How to get around LIKE inefficiencies?

From
Bruce Momjian
Date:
Yes, I am waiting to hear back on that.

> At 21:47 5/11/00 -0500, Tom Lane wrote:
> >It's ye same olde nonuniform-distribution
> >problem; until we have better statistics, there's not much hope
> >for a non-kluge solution.
> 
> Wasn't somebody trying to do something with that a few weeks back?
> 




Re: How to get around LIKE inefficiencies?

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
> Could you trick it into not using the index (AND using the other strategy?)
> by using a calculation:

>    AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');

> it's a bit nasty.

Looks like a great kluge to me ;-)
        regards, tom lane


Re: How to get around LIKE inefficiencies?

From
The Hermit Hacker
Date:
yowch ... removing that one index makes my 'test' search (mvcc) come back
as:

[97366] SQL 0.05s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE ndict.word_id=572517542 AND url.rec_id=ndict.url_id AND (url.url LIKE 'http://www.postgresql.org/%')
 

vs what we were doing before ... now, let's see how increasing the number
of docs indexed changes this ...

thanks ... 


On Sun, 5 Nov 2000, Tom Lane wrote:

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.
> 
>             regards, tom lane
> 
> 




Re: How to get around LIKE inefficiencies?

From
Philip Warner
Date:
At 21:59 5/11/00 -0500, Tom Lane wrote:
>
>Looks like a great kluge to me ;-)
>

Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}




Re: How to get around LIKE inefficiencies?

From
The Hermit Hacker
Date:
On Mon, 6 Nov 2000, Philip Warner wrote:

> At 21:59 5/11/00 -0500, Tom Lane wrote:
> >
> >Looks like a great kluge to me ;-)
> >
> 
> Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}

Except, if we are telling it to get rid of using the index, may as well
get rid of it altogether, as updates/inserts would be slowed down by
having to update that too ...

I can now do 3-word searches in what looks like no time ... not sure how
it will look after I have ~100k documents indexed, but at least now it's
not dead at 22k ...





Re: How to get around LIKE inefficiencies?

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> On Mon, 6 Nov 2000, Philip Warner wrote:
>> At 21:59 5/11/00 -0500, Tom Lane wrote:
>>>> Looks like a great kluge to me ;-)
>> 
>> Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}

> Except, if we are telling it to get rid of using the index, may as well
> get rid of it altogether, as updates/inserts would be slowed down by
> having to update that too ...

Sure --- but do you have any other query types where the index *is*
useful?  If so, Philip's idea will let you suppress use of the index
for just this one kind of query.
        regards, tom lane


Re: How to get around LIKE inefficiencies?

From
Philip Warner
Date:
At 23:12 5/11/00 -0400, The Hermit Hacker wrote:
>
>Except, if we are telling it to get rid of using the index, may as well
>get rid of it altogether, as updates/inserts would be slowed down by
>having to update that too ...
>

So long as you don't ever need the index for anything else, getting
rid of it is by far the best solution. But, e.g., if you want to check if a
page is already indexed, you will probably end up with a sequential search.
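For example, an "is this page already indexed?" check would presumably be an
exact match on the URL text, which is exactly the kind of lookup that loses
its index if url_url is dropped (the URL value below is illustrative only):

    -- without the url_url index this becomes a sequential scan over url
    SELECT rec_id FROM url WHERE url = 'http://www.postgresql.org/docs/';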




Re: How to get around LIKE inefficiencies?

From
Ron Chmara
Date:
The Hermit Hacker wrote:
> I'm tryin to figure out how to speed up udmsearch when run under
> postgresql, and am being hit by atrocious performance when using a LIKE
> query ... the query looks like:
> SELECT ndict.url_id,ndict.intag
>   FROM ndict,url
>  WHERE ndict.word_id=1971739852
>    AND url.rec_id=ndict.url_id
>    AND (url.url LIKE 'http://www.postgresql.org/%');
> Take off the AND ( LIKE ) part of the query, finishes almost as soon as
> you hit return.  Put it back in, and you can go for coffee before it
> finishes ...

The entire *approach* is wrong. I'm currently in the process of optimizing
a db which is used for logfile mining, and it was originally built with the same
kludge ... it seems to make sense when there are only a few thousand records,
but at 20 million records, yikes!

The problem is that there's a "like" operation for something that is
fundamentally static (http://www.postgresql.org/) with some varying
data *after it* that you're not using, in any form, for this operation.
This can be solved one of two ways:

1. Preprocess your files to strip out the paths and arguments into
a new field for the domain call. You are only setting up that data once,
so you shouldn't be using a "like" operator for every query. It's not
like on Monday the server is "http://www.postgresql.org/1221" and on
Tuesday the server is "http://www.postgresql.org/12111". It's always
the *same server*, so split that data out into its own column, with its own
index.

This turns your query into:
SELECT ndict.url_id,ndict.intag
  FROM ndict,url
 WHERE ndict.word_id=1971739852
   AND url.rec_id=ndict.url_id
   AND url.server_url='http://www.postgresql.org/';

2. Use a trigger to do the above, if you're doing on-the-fly inserts into
your db (so you can't pre-process); a sketch of such a trigger follows.
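A minimal sketch of option 2, written in present-day PL/pgSQL syntax (which
post-dates the 7.0 series being discussed); the extraction pattern and the
function/trigger names are illustrative, not udmsearch's own:

    -- keep a server_url column populated automatically on insert/update,
    -- and index it so lookups can use plain equality instead of LIKE
    ALTER TABLE url ADD COLUMN server_url text;

    CREATE FUNCTION url_set_server() RETURNS trigger AS $$
    BEGIN
        -- everything up to and including the first '/' after '://'
        NEW.server_url := substring(NEW.url from '^[a-z]+://[^/]*/');
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER url_set_server_trg
        BEFORE INSERT OR UPDATE ON url
        FOR EACH ROW EXECUTE PROCEDURE url_set_server();

    CREATE INDEX url_server_url_idx ON url (server_url);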

-Ronabop

--
Brought to you from iBop the iMac, a MacOS, Win95, Win98, LinuxPPC machine,
which is currently in MacOS land.  Your bopping may vary.


Re: How to get around LIKE inefficiencies?

From
The Hermit Hacker
Date:
On Sun, 5 Nov 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > On Mon, 6 Nov 2000, Philip Warner wrote:
> >> At 21:59 5/11/00 -0500, Tom Lane wrote:
> >>>> Looks like a great kluge to me ;-)
> >> 
> >> Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}
> 
> > Except, if we are telling it to get rid of using the index, may as well
> > get rid of it altogether, as updates/inserts would be slowed down by
> > having to update that too ...
> 
> Sure --- but do you have any other query types where the index *is*
> useful?  If so, Philip's idea will let you suppress use of the index
> for just this one kind of query.

Actually, it looks like they got a bit smart, and they search for the URL
in the url table based on the CRC32 value instead of text ...
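For reference, that turns the lookup into a plain indexed integer equality,
something like this sketch (the crc32 column name and the value are
illustrative, not udmsearch's actual schema):

    -- an equality match on an integer hash of the URL sidesteps the
    -- LIKE prefix-selectivity problem entirely
    SELECT rec_id FROM url WHERE crc32 = 123456789;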




Re: How to get around LIKE inefficiencies?

From
The Hermit Hacker
Date:
On Mon, 6 Nov 2000, Philip Warner wrote:

> At 23:12 5/11/00 -0400, The Hermit Hacker wrote:
> >
> >Except, if we are telling it to get rid of using the index, may as well
> >get rid of it altogether, as updates/inserts would be slowed down by
> >having to update that too ...
> >
> 
> So long as you don't ever need the index for anything else, then getting
> rid of it is by far the best solution. But, eg, if you want to check if a
> page is already indexed you will probably end up with a sequential search.

except, it appears that they both store the text of the URL *and* the
CRC32 value of it, and do other queries based on the CRC32 value of it ...



Re: How to get around LIKE inefficiencies?

From
Bruce Momjian
Date:
I am adding a new TODO item:
    * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
      ANALYZE, and CLUSTER

Seems we should be able to emit NOTICE messages suggesting performance
improvements.



> Philip Warner <pjw@rhyme.com.au> writes:
> > Could you trick it into not using the index (AND using the other strategy?)
> > by using a calculation:
> 
> >    AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');
> 
> > it's a bit nasty.
> 
> Looks like a great kluge to me ;-)
> 
>             regards, tom lane
> 




Re: How to get around LIKE inefficiencies?

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am adding a new TODO item:
>     * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
>       ANALYZE, and CLUSTER
> Seems we should be able to emit NOTICE messages suggesting performance
> improvements.

This would be targeted to help those who refuse to read documentation?

I'm not following the point here...
        regards, tom lane


Re: How to get around LIKE inefficiencies?

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am adding a new TODO item:
> >     * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
> >       ANALYZE, and CLUSTER
> > Seems we should be able to emit NOTICE messages suggesting performance
> > improvements.
> 
> This would be targeted to help those who refuse to read documentation?
> 
> I'm not following the point here...

Well, I think there is some confusion about when CLUSTER is good, and I
can see people turning it on and running their application to look for
things they forgot, like indexes or vacuum analyze.



Re: How to get around LIKE inefficiencies?

From
The Hermit Hacker
Date:
On Sun, 5 Nov 2000, Bruce Momjian wrote:

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > I am adding a new TODO item:
> > >     * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
> > >       ANALYZE, and CLUSTER
> > > Seems we should be able to emit NOTICE messages suggesting performance
> > > improvements.
> > 
> > This would be targeted to help those who refuse to read documentation?
> > 
> > I'm not following the point here...
> 
> Well, I think there is some confusion about when CLUSTER is good, and I
> can see people turning it on and running their application to look for
> things they forgot, like indexes or vacuum analyze.

So, we are gonna have an AI built into the database now too?  My
experience to date is that each scenario is different for what can be done
to fix something ... as my last problem shows.  I could remove the index,
since it isn't used anywhere else that I'm aware of, or, as Philip pointed
out, I could change my query ...

Now, this 'PERFORMANCE_TIPS', would it have to be smart enough to think
about Philip's solution, or only Tom's?  How is such a knowledge base
maintained?  Who gets turned off of PgSQL when they enable it and it
doesn't help their performance, even though they followed the
recommendations?





Re: How to get around LIKE inefficiencies?

From
Bruce Momjian
Date:
> so, we are gonna have an AI built into the database now too?  my
> experience to date is that each scenario is different for what can be done
> to fix something ... as my last problem shows.  I could remove the index,
> since it isn't used anywhere else that I'm aware of, or, as philip pointed
> out, I could change my query ...
> 
> now, this 'PERFORMANCE_TIPS', would it have to be smart enough to think
> about Philips solution, or only Tom's?  How is such a knowledge base
> maintained?  Who is turned off of PgSQL when they enable that, and it
> doesn't help their performance even though they follow the
> recommendations?

Well, I think it would be helpful to catch the most obvious things
people forget, but if no one thinks it's a good idea, I will yank it.



Re: How to get around LIKE inefficiencies?

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Well, I think it would be helpful to catch the most obvious things
> people forget, but if no one thinks its a good idea, I will yank it.

If you've got an idea *how* to do it in any sort of reliable fashion,
I'm all ears.  But it sounds more like pie-in-the-sky to me.
        regards, tom lane


Re: How to get around LIKE inefficiencies?

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Well, I think it would be helpful to catch the most obvious things
> > people forget, but if no one thinks its a good idea, I will yank it.
> 
> If you've got an idea *how* to do it in any sort of reliable fashion,
> I'm all ears.  But it sounds more like pie-in-the-sky to me.

But I like pie.  :-)

Well, we could throw a message when the optimizer tries to get
statistics on a column with no analyze stats, or table stats on a table
that has never been vacuumed, or does a sequential scan on a table that
has >50% expired rows.

We could throw a message when a query does an index scan that bounces
all over the heap looking for a single value.  We could throw a message
when a constant is compared to a column, and there is no index on the
column.

Not perfect, but would help catch some obvious things people forget.




Re: How to get around LIKE inefficiencies?

From
Tatsuo Ishii
Date:
>     I'm running this on a Dual-PIII 450 Server, 512Meg of RAM, zero
> swap space being used ... the database has its indices on one hard drive,
> the tables themselves are on a second one ... its PgSQL 7.0.2 (Tom,
> anything in v7.0.3 that might improve this?) and startup is as:
> 
> #!/bin/tcsh
> setenv PORT 5432
> setenv POSTMASTER /pgsql/bin/postmaster
> unlimit
> ${POSTMASTER} -B 384 -N 192 \
>               -o "-F -S 32768" \
>               -i -p ${PORT} -D/pgsql/data >& /pgsql/logs/postmaster.${PORT}.$$ &

BTW, you have a 32MB sort space for each backend, and you allow up to
192 concurrent backends. So the whole postgres installation requires at least
192*32MB = 6144MB of memory if all 192 users try to connect to your server
at the same time!  I would suggest adding enough swap space or decreasing
the -S setting...
--
Tatsuo Ishii