Re: LIKE search and performance - Mailing list pgsql-performance
From | Richard Huxton |
---|---|
Subject | Re: LIKE search and performance |
Date | |
Msg-id | 46571CEC.7070202@archonet.com Whole thread Raw |
In response to | Re: LIKE search and performance (mark@mark.mielke.cc) |
Responses |
Re: LIKE search and performance
|
List | pgsql-performance |
mark@mark.mielke.cc wrote: > On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote: >>> I notice you did not provide a real life example as requested. :-) >> OK - any application that allows user-built queries: <choose column: >> foo> <choose filter: contains> <choose target: "bar"> >> Want another? Any application that has a "search by name" box - users >> can (and do) put one letter in and hit enter. >> Unfortunately you don't always have control over the selectivity of >> queries issued. > > The database has 10 million records. The user enters "bar" and it > translates to "%bar%". You are suggesting that we expect bar to match > 1 million+ records? :-) I was saying that you don't know. At least, I don't know of any cheap way of gathering full substring stats or doing a full substring indexing. Even tsearch2 can't do that. > I hope not. I would define this as bad process. I would also use "LIMIT" > to something like "100". Yes, but that's not the query we're talking about is it? If possible you don't do '%bar%' searches at all. If you do, you try to restrict it further or LIMIT the results. There's nothing to discuss in these cases. >>> This seems like an ivory tower restriction. Not allowing best performance >>> in a common situation vs not allowing worst performance in a not-so-common >>> situation. >> What best performance plan are you thinking of? I'm assuming we're >> talking about trailing-wildcard matches here, rather than "contains" >> style matches. > > "Trailing-wildcard" already uses B-Tree index, does it not? Yes, it searches the btree and then checks the data for visibility. I thought that was what you felt could be worked around. It appears I was wrong. > I am speaking of contains, as contains is the one that was said to > require a seqscan. I am questioning why it requires a seqscan. Well, you seemed to be suggesting you had something better in mind. At least, that was my reading of your original post. > The > claim was made that with MVCC, the index is insufficient to check > for visibility True, for PG's implementation of MVCC. You *could* have visibility in each index, but that obviously would take more space. For a table with many indexes, that could be a *lot* more space. You also have to update all that visibilty information too. > and that the table would need to be accessed anyways, > therefore a seqscan is required. I question whether a like '%bar%' > should be considered a high selectivity query in the general case. > I question whether a worst case should be assumed. Well, the general rule-of-thumb is only about 10% for the changeover between index & seq-scan. That is, once you are reading 10% of the rows on disk (to check visibility) you might as well read them all (since you'll be reading most of the blocks anyway if the rows are randomly distributed). If you are doing SELECT * from that table then you'll want all that data you read. If you are doing SELECT count(*) then you only wanted the visibility :-( Now you and I can look at a substring and probably make a good guess how common it is (assuming we know the targets are British surnames or Japanese towns). PG needs one number - or rather, it picks one number for each length of search-string (afaik). > Perhaps I question too much? :-) Not sure it's possible to question too much :-) However, you need to provide answers occasionally too - what numbers would you pick? :-) -- Richard Huxton Archonet Ltd
pgsql-performance by date: