Thread: LIKE search and performance
Hi,
I have a table with varchar and text columns, and I have to search through this text across the whole table.
An example would be:
SELECT * FROM table
WHERE name like '%john%' or street like '%srt%'
Anyway, the query planner always does a seq scan on the whole table, and that takes some time. How can this be optimized, or done another way, to be faster?
I tried creating indexes on the columns, but with no success.
PG 8.2
Regards,
Andy.
Andy wrote:
> SELECT * FROM table
> WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does a seq scan on the whole table,
> and that takes some time. How can this be optimized, or done another
> way, to be faster?
>
> I tried creating indexes on the columns, but with no success.

None of the normal indexes will work for finding text in the middle of a
string. If you do think of a simple way of solving this, drop a short
letter explaining your idea to your local patent office, followed by the
Nobel prize committee.

However, one of the contrib packages is "tsearch2", which is designed to
do keyword searches on text for you. It'll also handle stemming (e.g.
"search" will match "searching" etc.) with the right choice of
dictionary. Loads of other clever stuff in it too.

It's one of the optional packages with most Linux packaging systems, and
on the Windows one too. If you install from source, see the contrib/
directory for details.

--
Richard Huxton
Archonet Ltd
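A minimal sketch of the tsearch2 setup Richard describes, assuming
contrib/tsearch2 is already installed in the database; the table and
column names (persons, name, street) are hypothetical stand-ins for the
OP's table:

-- Keep a tsvector column holding the searchable text.
ALTER TABLE persons ADD COLUMN searchable tsvector;

UPDATE persons
SET searchable = to_tsvector('default',
                             coalesce(name, '') || ' ' || coalesce(street, ''));

-- GiST index so @@ searches don't seq scan.
CREATE INDEX persons_search_idx ON persons USING gist (searchable);

-- Contrib's tsearch2() trigger function keeps the column up to date.
CREATE TRIGGER persons_search_update
BEFORE INSERT OR UPDATE ON persons
FOR EACH ROW EXECUTE PROCEDURE tsearch2(searchable, name, street);

-- Word-level search; note this matches whole lexemes ("john"),
-- not arbitrary substrings like '%srt%'.
SELECT * FROM persons WHERE searchable @@ to_tsquery('default', 'john');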
On 23.05.2007, at 09:08, Andy wrote:
> I have a table with varchar and text columns, and I have to search
> through this text across the whole table.
>
> An example would be:
> SELECT * FROM table
> WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does a seq scan on the whole table,
> and that takes some time. How can this be optimized, or done another
> way, to be faster?

The problem is that normal indexes cannot be used for "contains" queries.
If you need full-text search capabilities you have to take a look at
tsearch2 or an external search engine like Lucene.

cug
On 5/23/07, Andy <frum@ar-sd.net> wrote:
> An example would be:
> SELECT * FROM table
> WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does a seq scan on the whole table,
> and that takes some time. How can this be optimized, or done another
> way, to be faster?

There's no algorithm in existence that can "index" arbitrary substrings
the way you think. The only rational way to accomplish this is to first
break the text into substrings using some algorithm (e.g., words
delimited by whitespace and punctuation), and index the substrings
individually. You can do this using vanilla PostgreSQL, and you can use
tsearch2 and/or its GIN indexes to help speed up the searches.

The simplest solution would be to put all the substrings in a table, one
row per substring, along with an attribute referencing the source
table/row. At this point you have effectively reduced your search space:
you can use a query to isolate the words in your "dictionary" that
contain the substrings. So a query might be:

select ... from persons where id in (
  select person_id from person_words where word like '%john%'
)

The "like" search, even though it will use a sequential scan, is bound to
be much faster on these small words than searching for substrings through
large gobs of text in the persons table.

Note that PostgreSQL *can* exploit the index for *prefix* matching, if
you tell it to use the right operator class:

create index persons_name_index on persons (name text_pattern_ops);

or, if you're using varchars (though there is rarely any reason to):

create index persons_name_index on persons (name varchar_pattern_ops);

(These two may be identical internally. Check the docs.)

Now you can do:

select ... from persons where name like 'john%';

which will yield a query plan such as this:

Index Scan using persons_name_index on persons
  (cost=0.00..8.27 rows=1 width=29)
  (actual time=0.184..0.373 rows=51 loops=1)
  Index Cond: ((name ~>=~ 'john'::text) AND (name ~<~ 'joho'::text))
  Filter: (name ~~ 'john%'::text)

Alexander.
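As a rough sketch of how such a person_words table could be populated on
8.2, which has no unnest() or regexp_split_to_table() yet; the space-only
tokenizer and the 10-words-per-name cap are simplifying assumptions:

create table person_words (
    person_id integer not null references persons (id),
    word      text not null
);

-- generate_series plus array subscripting stands in for unnest();
-- out-of-range subscripts yield NULL and get filtered out.
insert into person_words (person_id, word)
select p.id, lower((string_to_array(p.name, ' '))[i])
from persons p, generate_series(1, 10) as i
where (string_to_array(p.name, ' '))[i] is not null;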
Andy wrote:
> Hi,
>
> I have a table with varchar and text columns, and I have to search
> through this text across the whole table.
>
> An example would be:
> SELECT * FROM table
> WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does a seq scan on the whole table,
> and that takes some time. How can this be optimized, or done another
> way, to be faster?
>
> I tried creating indexes on the columns, but with no success.

Use tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/)
for full text indexing.

Rigmor
Thank you all for the answers. I will try your suggestions and see what
that brings in terms of performance.

Andy.
Alexander Staubo wrote:
> There's no algorithm in existence that can "index" arbitrary substrings
> the way you think. The only rational way to accomplish this is to first
> break the text into substrings using some algorithm (e.g., words
> delimited by whitespace and punctuation), and index the substrings
> individually.

That seems rather harsh. If I'd put an index on each of these columns I'd
certainly expect it to use the indices - and I'm pretty sure that Sybase
would. I'd expect it to scan the index leaf pages instead of the table
itself - they should be much more compact and also likely to be hot in
cache.

Why *wouldn't* the planner do this?

James
James Mansion wrote:
> That seems rather harsh. If I'd put an index on each of these columns
> I'd certainly expect it to use the indices - and I'm pretty sure that
> Sybase would. I'd expect it to scan the index leaf pages instead of the
> table itself - they should be much more compact and also likely to be
> hot in cache.

If Sybase is still like SQL Server (or the other way around), it *may*
end up scanning the index *IFF* the index is a clustered index. If it's a
normal index, it will do a sequential scan on the table. Yes, the leaf
page of the index is more compact, but you also have to scan the
intermediate pages to get to the leaf pages.

But again, it can be a win. On such a system. It's not a win on
PostgreSQL, because of our MVCC implementation. We need to scan *both*
index *and* data pages if we go down that route, in which case it's a lot
faster to just scan the data pages alone.

I don't really know how MSSQL deals with this now that they have MVCC-ish
behavior, and I have no idea at all if Sybase has anything like MVCC.

> Why *wouldn't* the planner do this?

The question should be why the optimizer doesn't consider it, and the
executor uses it. The planner really doesn't decide that part :-)
Hopefully, the answer can be found above.

//Magnus
> If Sybase is still like SQL Server (or the other way around), it *may*
> end up scanning the index *IFF* the index is a clustered index. If it's
> a normal index, it will do a sequential scan on the table.

Are you sure it's not covered? I'll have to check at work - but I'm off
next week, so it'll have to wait.

> It's not a win on PostgreSQL, because of our MVCC implementation. We
> need to scan *both* index *and* data pages if we go down that route, in
> which case it's a lot faster to just scan the data pages alone.

Why do you need to go to all the data pages - doesn't the index structure
contain all the keys, so you prefilter and then check to see if the
*matched* items are still in view? I'll be the first to admit I know zip
about Postgres, but it seems odd - doesn't the index contain copies of
the key values?

I suspect that I mis-spoke with 'leaf'. I really just mean 'all index
pages with data', since the scan does not even need to be in index order,
just a good way to get at the data in a compact way.
On Thu, 2007-05-24 at 21:54 +0100, James Mansion wrote:
> Why do you need to go to all the data pages - doesn't the index
> structure contain all the keys, so you prefilter and then check to see
> if the *matched* items are still in view?

PG could scan the index looking for matches first and only load the
actual rows if it found a match, but that could only be a possible win if
there were very few matches, because the difference in cost between a
full index scan and a sequential scan would need to be greater than the
cost of randomly fetching all of the matching data rows from the table to
look up the visibility information.

So yes, it would be possible, but the odds of it being faster than a
sequential scan are small enough to make it not very useful.

And since it's basically impossible to know the selectivity of this kind
of where condition, I doubt the planner would ever realistically want to
choose that plan anyway because of its poor worst-case behavior.

-- Mark
Mark Lewis wrote:
> PG could scan the index looking for matches first and only load the
> actual rows if it found a match, but that could only be a possible win
> if there were very few matches, because the difference in cost between
> a full index scan and a sequential scan would need to be greater than
> the cost of randomly fetching all of the matching data rows from the
> table to look up the visibility information.

Just out of curiosity: Does Postgres store a duplicate of the data in the
index, even for long strings? I thought indexes only had to store the
string up to the point where there was no ambiguity; for example, if I
have "missing", "mississippi" and "misty", the index only needs "missin",
"missis" and "mist" in the actual index. This would make it impossible to
use a full index scan for a LIKE query.

Craig
Craig James wrote:
> Just out of curiosity: Does Postgres store a duplicate of the data in
> the index, even for long strings? I thought indexes only had to store
> the string up to the point where there was no ambiguity; for example,
> if I have "missing", "mississippi" and "misty", the index only needs
> "missin", "missis" and "mist" in the actual index.

What would happen when you inserted a new tuple with just "miss"? You
would need to expand all the other tuples in the index.

--
Alvaro Herrera                      http://www.flickr.com/photos/alvherre/
"You can live only once, but if you do it right, once is enough"
On Thu, May 24, 2007 at 02:02:40PM -0700, Mark Lewis wrote:
> So yes, it would be possible, but the odds of it being faster than a
> sequential scan are small enough to make it not very useful.
> And since it's basically impossible to know the selectivity of this
> kind of where condition, I doubt the planner would ever realistically
> want to choose that plan anyway because of its poor worst-case
> behavior.

What is a real life example where an intelligent and researched database
application would issue a like or ilike query as their primary condition
in a situation where they expected very high selectivity?

Avoiding a poor worst-case behaviour for a worst-case behaviour that
won't happen doesn't seem practical. What real life examples, that could
not be implemented a better way, would behave poorly if like/ilike looked
at the index first to filter?

I don't understand... :-)

Cheers,
mark
Alvaro Herrera wrote:
>> Just out of curiosity: Does Postgres store a duplicate of the data in
>> the index, even for long strings? I thought indexes only had to store
>> the string up to the point where there was no ambiguity; for example,
>> if I have "missing", "mississippi" and "misty", the index only needs
>> "missin", "missis" and "mist" in the actual index.
>
> What would happen when you inserted a new tuple with just "miss"? You
> would need to expand all the other tuples in the index.

That's right. This technique, used by some index implementations, is a
tradeoff between size and update speed. Most words in most natural
languages can be distinguished by the first few characters. The chances
of having to modify more than a few surrounding nodes when you insert
"miss" are small, so some implementations choose this method. Other
implementations choose to store the full string. I was just curious which
method Postgres uses.

Craig
> PG could scan the index looking for matches first and only load the
> actual rows if it found a match, but that could only be a possible win
> if there were very few matches, because the difference in cost between
> a full index scan and a sequential scan would need to be greater than
> the cost of randomly fetching all of the matching data rows from the
> table to look up the visibility information.

If you need to do that kind of thing, i.e. seq scanning a table checking
only one column among a large table of many columns, then don't use an
index. An index, being a btree, needs to be traversed in order (or else a
lot of locking problems come up), which means some random accesses.

So, you could make a table with 2 columns, updated via triggers: your
text field, and the primary key of your main table. Scanning that would
be faster; a sketch follows below.

Still, a better solution for searching in text is:
- tsearch2 if you need whole words
- trigrams for any substring match
- xapian for full text search with wildcards (i.e. John* = Johnny)

Speed-wise those three will beat any seq scan on a large table by a huge
margin.
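A minimal sketch of that trigger-maintained side table. The names are
hypothetical; it assumes a main table persons(id, name) and that plpgsql
is installed:

create table persons_name_search (
    person_id integer primary key,
    name      text
);

create or replace function persons_name_sync() returns trigger as $$
begin
    if tg_op = 'INSERT' then
        insert into persons_name_search values (new.id, new.name);
    elsif tg_op = 'UPDATE' then
        update persons_name_search set name = new.name
         where person_id = new.id;
    elsif tg_op = 'DELETE' then
        delete from persons_name_search where person_id = old.id;
        return old;
    end if;
    return new;
end;
$$ language plpgsql;

create trigger persons_name_sync_trg
after insert or update or delete on persons
for each row execute procedure persons_name_sync();

-- The narrow table is far smaller on disk, so even a seq scan of it
-- touches many fewer pages than scanning the wide main table:
select person_id from persons_name_search where name like '%john%';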
mark@mark.mielke.cc wrote:
>> And since it's basically impossible to know the selectivity of this
>> kind of where condition, I doubt the planner would ever realistically
>> want to choose that plan anyway because of its poor worst-case
>> behavior.
>
> What is a real life example where an intelligent and researched
> database application would issue a like or ilike query as their
> primary condition in a situation where they expected very high
> selectivity?
>
> Avoiding a poor worst-case behaviour for a worst-case behaviour that
> won't happen doesn't seem practical.

But if you are also filtering on e.g. date, and that has an index with
good selectivity, you're never going to use the text index anyway, are
you? If you've only got a dozen rows to check against, you might as well
just read them in.

The only time it's worth considering the behaviour at all is *if* the
worst-case is possible.

--
Richard Huxton
Archonet Ltd
On Fri, May 25, 2007 at 09:13:25AM +0100, Richard Huxton wrote:
> But if you are also filtering on e.g. date, and that has an index with
> good selectivity, you're never going to use the text index anyway, are
> you? If you've only got a dozen rows to check against, you might as
> well just read them in.
> The only time it's worth considering the behaviour at all is *if* the
> worst-case is possible.

I notice you did not provide a real life example as requested. :-)

This seems like an ivory tower restriction. Not allowing best performance
in a common situation vs not allowing worst performance in a
not-so-common situation.

Cheers,
mark
mark@mark.mielke.cc wrote:
> I notice you did not provide a real life example as requested. :-)

OK - any application that allows user-built queries: <choose column: foo>
<choose filter: contains> <choose target: "bar">

Want another? Any application that has a "search by name" box - users can
(and do) put one letter in and hit enter.

Unfortunately you don't always have control over the selectivity of
queries issued.

> This seems like an ivory tower restriction. Not allowing best
> performance in a common situation vs not allowing worst performance in
> a not-so-common situation.

What best performance plan are you thinking of? I'm assuming we're
talking about trailing-wildcard matches here, rather than "contains"
style matches.

--
Richard Huxton
Archonet Ltd
> OK - any application that allows user-built queries: <choose column:
> foo> <choose filter: contains> <choose target: "bar">
>
> Want another? Any application that has a "search by name" box - users
> can (and do) put one letter in and hit enter.
>
> Unfortunately you don't always have control over the selectivity of
> queries issued.

-*- HOW TO MAKE A SEARCH FORM -*-

Imagine you have to code the search on IMDB. This is what a smart
developer would do.

First, he uses AJAX autocompletion, so the thing is reactive. Then, he
does not bother the user with a many-fields form. Instead of forcing the
user to think (users HATE that), he writes smart code. Does Google Maps
have separate fields for country, city, street, zipcode? No. Because
Google is about as smart as it gets.

So, you parse the user query. If the user types, for instance, less than
3 letters (say, "spi"), he probably wants stuff that *begins* with those
letters. There is no point in searching for the letter "a" in a database
of a million movie titles. So, if the user types "spi", you query with
name LIKE 'spi%', which is indexed and very fast. And since you're smart,
you use AJAX. And you display only the most popular results (i.e. the
most clicked on).

http://imdb.com/find?s=all&q=spi

Since 99% of the time the user wanted "spiderman" or "spielberg", you're
done and he's happy. Users like being happy. If the user just types "a",
you display the first 10 things that start with "a"; this is useless, but
the user will marvel at your AJAX skillz. Then he will probably type in a
few other letters.

Then, if the user uses his space bar and types "spi 1980", you'll
recognize a year and display Spielberg's movies from 1980. Converting
your strings to phonetics is also a good idea, since about 0.7% of the
l33t teenagers can spell stuff like "spiElberg" correctly.

Only the guy who wants to know who had sex with Marilyn Monroe on the
17th day of the shooting of Basic Instinct will need to use the Advanced
search.

If you detect several words, then switch to a prefix-based fulltext
search like Xapian, which utterly rocks. Example: the user types "savin
priv", you search for "savin*" NEAR "priv*", and you display "Saving
Private Ryan" before he has even finished typing the second word of his
query. Users love that; they feel understood; they will click on your ads
and buy your products.

In all cases, search results should be limited to fewer than 100 rows to
be easy on the database. The user doesn't care about a search returning
more than 10-20 results; he will just rephrase the query, and the time
taken to fetch those thousands of records with name LIKE '%a%' will have
been utterly lost. Who goes to page 2 in Google results?

BOTTOM LINE: databases don't think, you do.
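A concrete sketch of the autocomplete prefix query described above. The
movies table, its columns, and the popularity ordering are hypothetical;
text_pattern_ops is only needed if the database does not use the C
locale:

create index movies_title_idx on movies (title text_pattern_ops);

-- Indexed prefix match, most popular first, capped so the database
-- never hands back thousands of rows.
select title
from movies
where title like 'spi%'
order by click_count desc
limit 10;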
On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote:
> OK - any application that allows user-built queries: <choose column:
> foo> <choose filter: contains> <choose target: "bar">
> Want another? Any application that has a "search by name" box - users
> can (and do) put one letter in and hit enter.
> Unfortunately you don't always have control over the selectivity of
> queries issued.

The database has 10 million records. The user enters "bar" and it
translates to "%bar%". You are suggesting that we expect bar to match 1
million+ records? :-) I hope not. I would define this as bad process. I
would also use "LIMIT" set to something like "100".

> What best performance plan are you thinking of? I'm assuming we're
> talking about trailing-wildcard matches here, rather than "contains"
> style matches.

"Trailing-wildcard" already uses a B-Tree index, does it not?

I am speaking of contains, as contains is the one that was said to
require a seqscan. I am questioning why it requires a seqscan. The claim
was made that with MVCC, the index is insufficient to check for
visibility and that the table would need to be accessed anyway, therefore
a seqscan is required. I question whether a like '%bar%' should be
considered a high-selectivity query in the general case. I question
whether a worst case should be assumed.

Perhaps I question too much? :-)

Cheers,
mark
mark@mark.mielke.cc wrote:
> I am speaking of contains, as contains is the one that was said to
> require a seqscan. I am questioning why it requires a seqscan. The
> claim was made that with MVCC, the index is insufficient to check for
> visibility and that the table would need to be accessed anyway,
> therefore a seqscan is required. I question whether a like '%bar%'
> should be considered a high-selectivity query in the general case. I
> question whether a worst case should be assumed.

If you are doing %bar% you should be using pg_trgm or tsearch2.

J
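For reference, a minimal contrib/pg_trgm setup as it existed around 8.2.
The table and column names are hypothetical; note that in this era the
trigram index accelerates the % similarity operator, not LIKE itself
(LIKE support for trigram indexes came much later):

-- requires contrib/pg_trgm to be installed
create index persons_name_trgm_idx
    on persons using gist (name gist_trgm_ops);

-- fuzzy match; can use the trigram index:
select * from persons where name % 'john';

-- rank candidates by similarity:
select name, similarity(name, 'john') as sml
from persons
where name % 'john'
order by sml desc;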
PFC wrote:
> -*- HOW TO MAKE A SEARCH FORM -*-
>
> Imagine you have to code the search on IMDB. This is what a smart
> developer would do.

All good domain-specific tips to provide users with a satisfying search
experience. None of which address the question of what plan PG should
produce for:

SELECT * FROM bigtable WHERE foo LIKE 's%'

--
Richard Huxton
Archonet Ltd
> None of which address the question of what plan PG should produce for:
> SELECT * FROM bigtable WHERE foo LIKE 's%'

Ah, this one already uses the btree, since the '%' is at the end.

My point is that a search like this will yield too many results to be
useful to the user anyway, so optimizing its performance is a kind of red
herring.
mark@mark.mielke.cc wrote:
> The database has 10 million records. The user enters "bar" and it
> translates to "%bar%". You are suggesting that we expect bar to match
> 1 million+ records? :-)

I was saying that you don't know. At least, I don't know of any cheap way
of gathering full substring stats or doing full substring indexing. Even
tsearch2 can't do that.

> I hope not. I would define this as bad process. I would also use
> "LIMIT" set to something like "100".

Yes, but that's not the query we're talking about, is it? If possible you
don't do '%bar%' searches at all. If you do, you try to restrict it
further or LIMIT the results. There's nothing to discuss in these cases.

> "Trailing-wildcard" already uses a B-Tree index, does it not?

Yes, it searches the btree and then checks the data for visibility. I
thought that was what you felt could be worked around. It appears I was
wrong.

> I am speaking of contains, as contains is the one that was said to
> require a seqscan. I am questioning why it requires a seqscan.

Well, you seemed to be suggesting you had something better in mind. At
least, that was my reading of your original post.

> The claim was made that with MVCC, the index is insufficient to check
> for visibility

True, for PG's implementation of MVCC. You *could* have visibility
information in each index, but that obviously would take more space. For
a table with many indexes, that could be a *lot* more space. You also
have to update all that visibility information too.

> and that the table would need to be accessed anyway, therefore a
> seqscan is required. I question whether a like '%bar%' should be
> considered a high-selectivity query in the general case. I question
> whether a worst case should be assumed.

Well, the general rule-of-thumb is only about 10% for the changeover
between index scan and seq-scan. That is, once you are reading 10% of the
rows on disk (to check visibility) you might as well read them all (since
you'll be reading most of the blocks anyway if the rows are randomly
distributed). If you are doing SELECT * from that table then you'll want
all that data you read. If you are doing SELECT count(*) then you only
wanted the visibility :-(

Now you and I can look at a substring and probably make a good guess how
common it is (assuming we know the targets are British surnames or
Japanese towns). PG needs one number - or rather, it picks one number for
each length of search-string (afaik).

> Perhaps I question too much? :-)

Not sure it's possible to question too much :-) However, you need to
provide answers occasionally too - what numbers would you pick? :-)

--
Richard Huxton
Archonet Ltd
PFC wrote:
>> None of which address the question of what plan PG should produce
>> for:
>> SELECT * FROM bigtable WHERE foo LIKE 's%'
>
> Ah, this one already uses the btree, since the '%' is at the end.
> My point is that a search like this will yield too many results to be
> useful to the user anyway, so optimizing its performance is a kind of
> red herring.

At the *application level*, yes. At the *query planner* level, no. At the
query planner level I just want it to come up with the best plan it can.

The original argument was that PG's estimate of the number of matching
rows was too optimistic (or pessimistic) in the case where we are doing a
"contains" substring-search.

--
Richard Huxton
Archonet Ltd
"Richard Huxton" <dev@archonet.com> writes: > Now you and I can look at a substring and probably make a good guess how common > it is (assuming we know the targets are British surnames or Japanese towns). PG > needs one number - or rather, it picks one number for each length of > search-string (afaik). I don't think that's true. Postgres calculates the lower and upper bound implied by the search pattern and then uses the histogram to estimate how selective that range is. It's sometimes surprisingly good but obviously it's not perfect. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote:
> I don't think that's true. Postgres calculates the lower and upper
> bound implied by the search pattern and then uses the histogram to
> estimate how selective that range is. It's sometimes surprisingly good,
> but obviously it's not perfect.

Sorry - I'm obviously picking my words badly today. I meant for the
"contains" substring match. It gives different predictions based on
string length (goes away and checks... yes). So it guesses that LIKE
'%aaa%' will match more rows than LIKE '%aaaa%'. Of course, if we were
matching surnames you and I could say that this is very unlikely, but
without some big statistics table I guess there's not much more PG can
do.

For a trailing wildcard, LIKE 'aaa%' can and does, as you say, get
converted into something along the lines of (>= 'aaa' AND < 'aab').
Although IIRC that depends on whether your locale allows it (not sure, I
don't really use non-C/non-English locales enough).

--
Richard Huxton
Archonet Ltd
mark@mark.mielke.cc wrote:
> What is a real life example where an intelligent and researched
> database application would issue a like or ilike query as their
> primary condition in a situation where they expected very high
> selectivity?

In my case the canonical example is to search against textual keys where
the search is performed automatically if the user has typed enough data
and paused. In almost all cases the '%' trails, and I'm looking for
'starts with' in effect. Usually the search will have a specified upper
limit on the number of returned rows, if that's an available facility. I
realise in this case that matching against the index does not allow the
match count unless we check MVCC as we go, but I don't see why another
thread can't be doing that.

James
On Wed, Jun 06, 2007 at 11:23:13PM +0100, James Mansion wrote:
> In my case the canonical example is to search against textual keys
> where the search is performed automatically if the user has typed
> enough data and paused. In almost all cases the '%' trails, and I'm
> looking for 'starts with' in effect.

I believe PostgreSQL already considers using the index for "starts with",
so this wasn't part of the discussion for me. Sorry that this wasn't
clear.

Cheers,
mark