Thread: How to get around LIKE inefficiencies?
I'm trying to figure out how to speed up udmsearch when run under
PostgreSQL, and am being hit by atrocious performance when using a LIKE
query ... the query looks like:

SELECT ndict.url_id, ndict.intag
  FROM ndict, url
 WHERE ndict.word_id=1971739852
   AND url.rec_id=ndict.url_id
   AND (url.url LIKE 'http://www.postgresql.org/%');

Take off the AND ( LIKE ) part of the query and it finishes almost as soon
as you hit return.  Put it back in, and you can go for coffee before it
finishes ...

If I do 'SELECT url_id FROM ndict WHERE word_id=1971739852', there are 153
records returned ... is there some way, that I'm not thinking of, of
re-writing the above so that it 'resolves' the equality before the LIKE in
order to reduce the number of tuples that it has to do the LIKE on?  Is
there some way of writing the above so that it doesn't take forever to
execute?

I'm running this on a Dual-PIII 450 server, 512MB of RAM, zero swap space
being used ... the database has its indices on one hard drive, the tables
themselves are on a second one ... it's PgSQL 7.0.2 (Tom, anything in
v7.0.3 that might improve this?) and startup is as:

#!/bin/tcsh
setenv PORT 5432
setenv POSTMASTER /pgsql/bin/postmaster
unlimit
${POSTMASTER} -B 384 -N 192 \
        -o "-F -S 32768" \
        -i -p ${PORT} -D/pgsql/data >& /pgsql/logs/postmaster.${PORT}.$$ &

So it's not like I'm not throwing a lot of resources at this ... Is there
anything that we can do to improve this?  I was trying to think of some
way to use a subselect to narrow the search results, or something ...

Oh, the above EXPLAINs down to:

NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..1195.14 rows=1 width=10)
  ->  Index Scan using url_url on url  (cost=0.00..2.73 rows=1 width=4)
  ->  Index Scan using n_word on ndict  (cost=0.00..1187.99 rows=353 width=6)

EXPLAIN

ndict: 663018 tuples
url:    29276 tuples

Marc G. Fournier                  ICQ#7615664              IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org          secondary: scrappy@{freebsd|postgresql}.org
A brute-force answer would be to remove the url_url index ;-)
Dunno if that would slow down other queries, however.

			regards, tom lane
Sorry to be getting in here late.  Have you tried CLUSTER?  If it is
using an index scan, and it is slow, CLUSTER often helps, especially when
there are several duplicate matches, as there are with LIKE.  Let me know
how that works.

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.
>
> 			regards, tom lane

--
  Bruce Momjian                      |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us             |  (610) 853-3000
  + If your life is a hard drive,    |  830 Blythe Avenue
  + Christ can be your backup.       |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Sorry to be getting in here late.  Have you tried CLUSTER?

Prolly won't help much.  I think what he's getting burnt by is that the
planner thinks that an indexscan based on the
LIKE 'http://www.postgresql.org/%' condition will be extremely selective
--- it has no idea that most of the URLs in his table will match that
prefix.  It's ye same olde nonuniform-distribution problem; until we have
better statistics, there's not much hope for a non-kluge solution.

			regards, tom lane
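[Editorial sketch: Tom's misestimate can be illustrated in a few lines.
The data set and the 0.01 default selectivity below are hypothetical,
chosen only to show the shape of the problem, not the planner's actual
numbers.]

```python
# A planner without per-value statistics assumes a prefix LIKE is very
# selective; on a crawler's table, most URLs share one site prefix.
urls = (
    ["http://www.postgresql.org/docs/%d" % i for i in range(900)]
    + ["http://www.example.com/page/%d" % i for i in range(100)]
)

prefix = "http://www.postgresql.org/"
actual = sum(u.startswith(prefix) for u in urls) / len(urls)

# Illustrative hard-wired guess for a pattern match with no histogram:
planner_guess = 0.01

print(actual, planner_guess)  # 0.9 vs 0.01: off by nearly two orders of magnitude
```

With an estimate that far off, an index scan on url_url looks cheap on
paper but touches most of the table in practice.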
At 21:28 5/11/00 -0500, Tom Lane wrote:
>A brute-force answer would be to remove the url_url index ;-)
>dunno if that would slow down other queries, however.

Could you trick it into not using the index (AND using the other
strategy?) by using a calculation:

SELECT ndict.url_id, ndict.intag
  FROM ndict, url
 WHERE ndict.word_id=1971739852
   AND url.rec_id=ndict.url_id
   AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');

It's a bit nasty.  If you had 7.1, the following might work:

SELECT url_id, intag
  FROM (SELECT ndict.url_id, ndict.intag, url
          FROM ndict, url
         WHERE ndict.word_id=1971739852
           AND url.rec_id=ndict.url_id) AS zzz
 WHERE zzz.url LIKE 'http://www.postgresql.org/%'

But I don't know how subselect-in-FROM handles this sort of query - it
might be 'clever' enough to fold it somehow.

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd. (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81    Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371
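[Editorial sketch: the kluge works because concatenating a space to both
the column and the pattern leaves the match semantics unchanged while the
expression `url || ' '` is no longer a plain column, so the url_url index
cannot be used.  The crude LIKE emulation below (via fnmatch, with `%`
translated to `*`) is only to show the equivalence; it ignores `_`,
escapes, and other LIKE details.]

```python
import fnmatch

def sql_like(value, pattern):
    # Crude stand-in for SQL LIKE: translate the % wildcard to fnmatch's *.
    return fnmatch.fnmatchcase(value, pattern.replace('%', '*'))

url = "http://www.postgresql.org/docs/"

plain = sql_like(url, "http://www.postgresql.org/%")
# Philip's rewrite: space appended to both the value and the pattern.
kluged = sql_like(url + " ", "http://www.postgresql.org/% ")

print(plain, kluged)  # True True: same rows match either way
```

Same answer, but the planner can no longer pick the misestimated index
scan for the rewritten form.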
At 21:47 5/11/00 -0500, Tom Lane wrote:
>It's ye same olde nonuniform-distribution
>problem; until we have better statistics, there's not much hope
>for a non-kluge solution.

Wasn't somebody trying to do something with that a few weeks back?
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Sorry to be getting in here late.  Have you tried CLUSTER?
>
> Prolly won't help much.  I think what he's getting burnt by
> is that the planner thinks that an indexscan based on the
> LIKE 'http://www.postgresql.org/%' condition will be extremely
> selective --- it has no idea that most of the URLs in his table
> will match that prefix.  It's ye same olde nonuniform-distribution
> problem; until we have better statistics, there's not much hope
> for a non-kluge solution.

But I think it will help.  There will be lots of index lookups, but they
will be sequential in the heap, not random over the heap.
Yes, I am waiting to hear back on that.

> At 21:47 5/11/00 -0500, Tom Lane wrote:
> >It's ye same olde nonuniform-distribution
> >problem; until we have better statistics, there's not much hope
> >for a non-kluge solution.
>
> Wasn't somebody trying to do something with that a few weeks back?
Philip Warner <pjw@rhyme.com.au> writes:
> Could you trick it into not using the index (AND using the other
> strategy?) by using a calculation:

> AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');

> it's a bit nasty.

Looks like a great kluge to me ;-)

			regards, tom lane
Yowch ... removing that one index makes my 'test' search (mvcc) come back
as:

[97366] SQL 0.05s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE
ndict.word_id=572517542 AND url.rec_id=ndict.url_id AND (url.url LIKE
'http://www.postgresql.org/%')

vs what we were doing before ... now, let's see how increasing the number
of docs indexed changes this ... thanks ...

On Sun, 5 Nov 2000, Tom Lane wrote:

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.
>
> 			regards, tom lane
At 21:59 5/11/00 -0500, Tom Lane wrote:
>
>Looks like a great kluge to me ;-)

Hmph.  I prefer to think of it as a 'user-defined optimizer hint'. ;-}
On Mon, 6 Nov 2000, Philip Warner wrote:

> At 21:59 5/11/00 -0500, Tom Lane wrote:
> >
> >Looks like a great kluge to me ;-)
>
> Hmph.  I prefer to think of it as a 'user-defined optimizer hint'. ;-}

Except, if we are telling it to get rid of using the index, we may as
well get rid of it altogether, as updates/inserts would be slowed down by
having to update that too ... I can now do 3-word searches in what looks
like no time ... not sure how it will look after I have ~100k documents
indexed, but at least now it's not dead at 22k ...
The Hermit Hacker <scrappy@hub.org> writes:
> On Mon, 6 Nov 2000, Philip Warner wrote:
>> At 21:59 5/11/00 -0500, Tom Lane wrote:
>>>> Looks like a great kluge to me ;-)
>>
>> Hmph.  I prefer to think of it as a 'user-defined optimizer hint'. ;-}

> Except, if we are telling it to get rid of using the index, may as well
> get rid of it altogether, as updates/inserts would be slowed down by
> having to update that too ...

Sure --- but do you have any other query types where the index *is*
useful?  If so, Philip's idea will let you suppress use of the index for
just this one kind of query.

			regards, tom lane
At 23:12 5/11/00 -0400, The Hermit Hacker wrote:
>
>Except, if we are telling it to get rid of using the index, may as well
>get rid of it altogether, as updates/inserts would be slowed down by
>having to update that too ...

So long as you don't ever need the index for anything else, then getting
rid of it is by far the best solution.  But, e.g., if you want to check
if a page is already indexed you will probably end up with a sequential
search.
The Hermit Hacker wrote:
> I'm tryin to figure out how to speed up udmsearch when run under
> postgresql, and am being hit by atrocious performance when using a LIKE
> query ... the query looks like:
>
> SELECT ndict.url_id,ndict.intag
>   FROM ndict,url
>  WHERE ndict.word_id=1971739852
>    AND url.rec_id=ndict.url_id
>    AND (url.url LIKE 'http://www.postgresql.org/%');
>
> Take off the AND ( LIKE ) part of the query, finishes almost as soon as
> you hit return.  Put it back in, and you can go for coffee before it
> finishes ...

The entire *approach* is wrong.  I'm currently in the process of
optimizing a db which is used for logfile mining, and it was originally
built with the same kludge ... it seems to make sense when there's only a
few thousand records, but at 20 million records, yikes!

The problem is that there's a "like" operation for something that is
fundamentally static (http://www.postgresql.org/) with some varying data
*after it*, that you're not using, in any form, for this operation.

This can be solved one of two ways:

1. Preprocess your files to strip out the paths and arguments into a new
field for the domain call.  You are only setting up that data once, so
you shouldn't be using a "like" operator for every query.  It's not like
on Monday the server is "http://www.postgresql.org/1221" and on Tuesday
the server is "http://www.postgresql.org/12111".  It's always the *same
server*, so split that data out into its own column, with its own index.
This turns your query into:

SELECT ndict.url_id, ndict.intag
  FROM ndict, url
 WHERE ndict.word_id=1971739852
   AND url.rec_id=ndict.url_id
   AND url.server_url='http://www.postgresql.org/';

2. Use a trigger to do the above, if you're doing on-the-fly inserts into
your db (so you can't pre-process).

-Ronabop

--
Brought to you from iBop the iMac, a MacOS, Win95, Win98, LinuxPPC
machine, which is currently in MacOS land.  Your bopping may vary.
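[Editorial sketch of Ron's option 1: the one-time preprocessing step that
would populate a server_url column.  The `server_url` helper name is
hypothetical; udmsearch has no such function.]

```python
# Split each URL once at load time so queries can use '=' on an indexed
# server_url column instead of LIKE 'prefix%' on the full URL text.
from urllib.parse import urlsplit

def server_url(url):
    """Return just the scheme://host/ portion of a URL."""
    parts = urlsplit(url)
    return "%s://%s/" % (parts.scheme, parts.netloc)

print(server_url("http://www.postgresql.org/docs/7.0/index.html"))
# http://www.postgresql.org/
```

The equality test on the short, highly duplicated server_url value is
exactly the kind of condition the planner's statistics can estimate well.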
On Sun, 5 Nov 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > On Mon, 6 Nov 2000, Philip Warner wrote:
> >> At 21:59 5/11/00 -0500, Tom Lane wrote:
> >>>> Looks like a great kluge to me ;-)
> >>
> >> Hmph.  I prefer to think of it as a 'user-defined optimizer hint'. ;-}
>
> > Except, if we are telling it to get rid of using the index, may as well
> > get rid of it altogether, as updates/inserts would be slowed down by
> > having to update that too ...
>
> Sure --- but do you have any other query types where the index *is*
> useful?  If so, Philip's idea will let you suppress use of the index
> for just this one kind of query.

Actually, it looks like they got a bit smart, and they search for the URL
in the url table based on the CRC32 value instead of the text ...
On Mon, 6 Nov 2000, Philip Warner wrote:

> At 23:12 5/11/00 -0400, The Hermit Hacker wrote:
> >
> >Except, if we are telling it to get rid of using the index, may as well
> >get rid of it altogether, as updates/inserts would be slowed down by
> >having to update that too ...
>
> So long as you don't ever need the index for anything else, then getting
> rid of it is by far the best solution.  But, eg, if you want to check if
> a page is already indexed you will probably end up with a sequential
> search.

Except, it appears that they both store the text of the URL *and* the
CRC32 value of it, and do other queries based on the CRC32 value of it ...
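[Editorial sketch of the CRC32 trick Marc describes: store an integer
checksum of the URL alongside the text and index that instead.  Whether
udmsearch computes CRC32 with exactly this polynomial and seed is an
assumption; the point is the lookup shape.]

```python
# Equality on a 4-byte integer is cheap to index and compare, unlike
# matching long URL strings.
import zlib

def url_crc(url):
    # Mask to 32 bits so the value is a stable unsigned integer.
    return zlib.crc32(url.encode()) & 0xFFFFFFFF

# A toy "url table" keyed by the checksum:
table = {url_crc(u): u for u in [
    "http://www.postgresql.org/",
    "http://www.pgsql.com/",
]}

probe = url_crc("http://www.postgresql.org/")
# Collisions are possible, so a real lookup re-checks the stored text.
print(table[probe] == "http://www.postgresql.org/")
```

This is why the url_url text index can be dropped without losing the
"is this page already indexed?" lookup Philip worried about.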
I am adding a new TODO item:

	* Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM,
	  VACUUM ANALYZE, and CLUSTER

Seems we should be able to emit NOTICE messages suggesting performance
improvements.

> Philip Warner <pjw@rhyme.com.au> writes:
> > Could you trick it into not using the index (AND using the other
> > strategy?) by using a calculation:
>
> > AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');
>
> > it's a bit nasty.
>
> Looks like a great kluge to me ;-)
>
> 			regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am adding a new TODO item:
> 	* Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM,
> 	  VACUUM ANALYZE, and CLUSTER
> Seems we should be able to emit NOTICE messages suggesting performance
> improvements.

This would be targeted to help those who refuse to read documentation?

I'm not following the point here ...

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am adding a new TODO item:
> > 	* Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM,
> > 	  VACUUM ANALYZE, and CLUSTER
> > Seems we should be able to emit NOTICE messages suggesting performance
> > improvements.
>
> This would be targeted to help those who refuse to read documentation?
>
> I'm not following the point here...

Well, I think there is some confusion about when CLUSTER is good, and I
can see people turning it on and running their application to look for
things they forgot, like indexes or vacuum analyze.
On Sun, 5 Nov 2000, Bruce Momjian wrote:

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > I am adding a new TODO item:
> > > 	* Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM,
> > > 	  VACUUM ANALYZE, and CLUSTER
> > > Seems we should be able to emit NOTICE messages suggesting
> > > performance improvements.
> >
> > This would be targeted to help those who refuse to read documentation?
> >
> > I'm not following the point here...
>
> Well, I think there is some confusion about when CLUSTER is good, and I
> can see people turning it on and running their application to look for
> things they forgot, like indexes or vacuum analyze.

So, we are gonna have an AI built into the database now too?  My
experience to date is that each scenario is different for what can be
done to fix something ... as my last problem shows: I could remove the
index, since it isn't used anywhere else that I'm aware of, or, as Philip
pointed out, I could change my query ...

Now, this 'PERFORMANCE_TIPS', would it have to be smart enough to think
about Philip's solution, or only Tom's?  How is such a knowledge base
maintained?  Who is turned off of PgSQL when they enable that, and it
doesn't help their performance even though they follow the
recommendations?
> so, we are gonna have an AI built into the database now too? my
> experience to date is that each scenario is different for what can be
> done to fix something ... as my last problem shows. I could remove the
> index, since it isn't used anywhere else that I'm aware of, or, as
> philip pointed out, I could change my query ...
>
> now, this 'PERFORMANCE_TIPS', would it have to be smart enough to think
> about Philips solution, or only Tom's?  How is such a knowledge base
> maintained?  Who is turned off of PgSQL when they enable that, and it
> doesn't help their performance even though they follow the
> recommendations?

Well, I think it would be helpful to catch the most obvious things people
forget, but if no one thinks it's a good idea, I will yank it.
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Well, I think it would be helpful to catch the most obvious things
> people forget, but if no one thinks its a good idea, I will yank it.

If you've got an idea *how* to do it in any sort of reliable fashion,
I'm all ears.  But it sounds more like pie-in-the-sky to me.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Well, I think it would be helpful to catch the most obvious things
> > people forget, but if no one thinks its a good idea, I will yank it.
>
> If you've got an idea *how* to do it in any sort of reliable fashion,
> I'm all ears.  But it sounds more like pie-in-the-sky to me.

But I like pie. :-)

Well, we could throw a message when the optimizer tries to get statistics
on a column with no analyze stats, or table stats on a table that has
never been vacuumed, or does a sequential scan on a table that has >50%
expired rows.  We could throw a message when a query does an index scan
that bounces all over the heap looking for a single value.  We could
throw a message when a constant is compared to a column, and there is no
index on the column.

Not perfect, but it would help catch some obvious things people forget.
> I'm running this on a Dual-PIII 450 Server, 512Meg of RAM, zero
> swap space being used ... the database has its indices on one hard
> drive, the tables themselves are on a second one ... its PgSQL 7.0.2
> (Tom, anything in v7.0.3 that might improve this?) and startup is as:
>
> #!/bin/tcsh
> setenv PORT 5432
> setenv POSTMASTER /pgsql/bin/postmaster
> unlimit
> ${POSTMASTER} -B 384 -N 192 \
> 	-o "-F -S 32768" \
> 	-i -p ${PORT} -D/pgsql/data >& /pgsql/logs/postmaster.${PORT}.$$ &

BTW, you have 32MB of sort space for each backend, and you allow up to
192 concurrent backends.  So the whole postgres installation requires at
least 192*32MB = 6144MB of memory if all 192 users try to connect to your
server at the same time!  I would suggest adding enough swap space or
decreasing the -S setting ...
--
Tatsuo Ishii
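[Editorial note: Tatsuo's worst-case arithmetic, spelled out.  -S is in
kilobytes per sort, so:]

```python
# Worst-case sort memory the -S setting can demand across all backends.
sort_kb = 32768        # -S 32768: 32768 KB = 32 MB per sort
max_backends = 192     # -N 192: maximum concurrent backends

worst_case_mb = sort_kb * max_backends // 1024
print(worst_case_mb)   # 6144 MB, against only 512 MB of physical RAM
```

And that floor is per *sort*, not per backend: a query with several sort
steps can use the -S allowance more than once, so the real worst case is
even higher.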