Thread: LIKE search and performance

LIKE search and performance

From
"Andy"
Date:
Hi,
 
I have a table with varchar and text columns, and I have to search through this text across the whole table.
 
An example would be:
SELECT * FROM table
                             WHERE name like '%john%' or street like '%srt%'
 
Anyway, the query planner always does a seq scan on the whole table, and that takes some time. How can this be optimized, or done another way, to be faster?
 
I tried creating indexes on the columns, but with no success.
 
PG 8.2
 
Regards,
Andy.

Re: LIKE search and performance

From
Richard Huxton
Date:
Andy wrote:
> SELECT * FROM table
>                              WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does seq scan on the whole table and that
> takes some time. How can this be optimized or made in another way to be
> faster?
>
> I tried to make indexes on the columns but no success.

None of the normal indexes will work for finding text in the middle of a
string. If you do think of a simple way of solving this, drop a short
letter explaining your idea to your local patent office followed by the
Nobel prize committee.

However, one of the contrib packages is "tsearch2" which is designed to
do keyword searches on text for you. It'll also handle stemming (e.g.
"search" will match "searching" etc.) with the right choice of
dictionary. Loads of other clever stuff in it too.
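
A hedged sketch of what that can look like on 8.2 with contrib/tsearch2 loaded (the table and column names here are invented for illustration):

```sql
-- Hypothetical tsearch2 usage (PG 8.2-era contrib API).
-- Index the tsvector form of the column with GiST...
CREATE INDEX persons_name_fts ON persons
    USING gist (to_tsvector('default', name));

-- ...then query with the @@ match operator; with a stemming
-- dictionary, 'john' will also match stemmed variants.
SELECT * FROM persons
WHERE to_tsvector('default', name) @@ to_tsquery('default', 'john');
```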

It's one of the optional packages with most Linux packaging systems and
on the Windows one too. If you install from source see the contrib/
directory for details.

--
   Richard Huxton
   Archonet Ltd

Re: LIKE search and performance

From
Guido Neitzer
Date:
On 23.05.2007, at 09:08, Andy wrote:

> I have a table with varchar and text columns, and I have to search
> through these text in the whole table.
>
> An example would be:
> SELECT * FROM table
>                              WHERE name like '%john%' or street
> like '%srt%'
>
> Anyway, the query planner always does seq scan on the whole table
> and that takes some time. How can this be optimized or made in
> another way to be faster?

The problem is that normal indexes cannot be used for "contains"
queries.

If you need fulltext search capabilities you have to take a look at
tsearch2 or an external search engine like Lucene.

cug

Re: LIKE search and performance

From
"Alexander Staubo"
Date:
On 5/23/07, Andy <frum@ar-sd.net> wrote:
> An example would be:
> SELECT * FROM table
>                              WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does seq scan on the whole table and that
> takes some time. How can this be optimized or made in another way to be
> faster?

There's no algorithm in existence that can "index" arbitrary
substrings the way you think. The only rational way to accomplish this
is to first break the text into substrings using some algorithm (eg.,
words delimited by whitespace and punctuation), and index the
substrings individually.

You can do this using vanilla PostgreSQL, and you can use Tsearch2
and/or its GIN indexes to help speed up the searches. The simplest
solution would be to put all the substrings in a table, one row per
substring, along with an attribute referencing the source table/row.
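
A minimal sketch of that dictionary table (all names here are invented):

```sql
-- One row per word per source row, maintained by the application
-- or by a trigger on the source table.
CREATE TABLE person_words (
    person_id integer NOT NULL REFERENCES persons(id),
    word      text    NOT NULL
);

-- A plain btree keeps lookups on the short words fast.
CREATE INDEX person_words_word_idx ON person_words (word);
```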

At this point you have effectively reduced your search space: you can
use a query to isolate the words in your "dictionary" that contain the
substrings. So a query might be:

  select ... from persons where id in (
    select person_id from person_words
    where word like '%john%'
  );

The "like" search, even though it will use a sequential scan, is bound
to be much faster on these small words than searching for substrings
through large gobs of text in the persons table.

Note that PostgreSQL *can* exploit the index for *prefix* matching, if
you tell it to use the right operator class:

  create index persons_name_index on persons (name text_pattern_ops);

or, if you're using varchars (though there is rarely any reason to):

  create index persons_name_index on persons (name varchar_pattern_ops);

(These two may be identical internally. Check the docs.)

Now you can do:

  select ... from persons where name like 'john%';

which will yield a query plan such as this:

 Index Scan using persons_name_index on persons
   (cost=0.00..8.27 rows=1 width=29) (actual time=0.184..0.373 rows=51 loops=1)
   Index Cond: ((name ~>=~ 'john'::text) AND (name ~<~ 'joho'::text))
   Filter: (name ~~ 'john%'::text)

Alexander.

Re: LIKE search and performance

From
Rigmor Ukuhe
Date:
Andy wrote:
> Hi,
>
> I have a table with varchar and text columns, and I have to search
> through these text in the whole table.
>
> An example would be:
> SELECT * FROM table
>                              WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does seq scan on the whole table and
> that takes some time. How can this be optimized or made in another way
> to be faster?

Use tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) for full text indexing.

Rigmor

>
> I tried to make indexes on the columns but no success.
>
> PG 8.2
>
> Regards,
> Andy.




Re: LIKE search and performance

From
"Andy"
Date:
Thank you all for the answers.
I will try your suggestions and see what that brings in terms of
performance.

Andy.

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> Rigmor Ukuhe
> Sent: Wednesday, May 23, 2007 6:52 PM
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] LIKE search and performance
>
> Andy wrote:
> > Hi,
> >
> > I have a table with varchar and text columns, and I have to search
> > through these text in the whole table.
> >
> > An example would be:
> > SELECT * FROM table
> >                              WHERE name like '%john%' or street like '%srt%'
> >
> > Anyway, the query planner always does seq scan on the whole table and
> > that takes some time. How can this be optimized or made in another way
> > to be faster?
>
> Use tsearch2
> (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) for
> full text indexing.
>
> Rigmor
>
> >
> > I tried to make indexes on the columns but no success.
> >
> > PG 8.2
> >
> > Regards,
> > Andy.


Re: LIKE search and performance

From
James Mansion
Date:
Alexander Staubo wrote:
> On 5/23/07, Andy <frum@ar-sd.net> wrote:
>> An example would be:
>> SELECT * FROM table
>>                              WHERE name like '%john%' or street like
>> '%srt%'
>>
>> Anyway, the query planner always does seq scan on the whole table and
>> that
>> takes some time. How can this be optimized or made in another way to be
>> faster?
>
> There's no algorithm in existence that can "index" arbitrary
> substrings the way you think. The only rational way to accomplish this
> is to first break the text into substrings using some algorithm (eg.,
> words delimited by whitespace and punctuation), and index the
> substrings individually.
That seems rather harsh.  If I'd put an index on each of these columns
I'd certainly expect it to use the indices - and I'm pretty sure that
Sybase would. I'd expect it to scan the index leaf pages instead of the
table itself - they should be much more compact and also likely to be
hot in cache.

Why *wouldn't* the planner do this?

James


Re: LIKE search and performance

From
Magnus Hagander
Date:
James Mansion wrote:
> Alexander Staubo wrote:
>> On 5/23/07, Andy <frum@ar-sd.net> wrote:
>>> An example would be:
>>> SELECT * FROM table
>>>                              WHERE name like '%john%' or street like
>>> '%srt%'
>>>
>>> Anyway, the query planner always does seq scan on the whole table and
>>> that
>>> takes some time. How can this be optimized or made in another way to be
>>> faster?
>>
>> There's no algorithm in existence that can "index" arbitrary
>> substrings the way you think. The only rational way to accomplish this
>> is to first break the text into substrings using some algorithm (eg.,
>> words delimited by whitespace and punctuation), and index the
>> substrings individually.
> That seems rather harsh.  If I'd put an index on each of these columns
> I'd certainly expect it to use the indices - and I'm pretty sure that
> Sybase would. I'd expect it to scan the index leaf pages instead of the
> table itself - they should be much more compact and also likely to be
> hot in cache.

If Sybase is still like SQL Server (or the other way around), it *may*
end up scanning the index *IFF* the index is a clustered index. If it's
a normal index, it will do a sequential scan on the table.

Yes, the leaf page of the index is more compact, but you also have to
scan the intermediate pages to get to the leaf pages. But again, it can
be a win on such a system.

It's not a win on PostgreSQL, because of our MVCC implementation. We
need to scan *both* index *and* data pages if we go down that route, in
which case it's a lot faster to just scan the data pages alone.

I don't really know how MSSQL deals with this now that they have
MVCC-ish behavior, and I have no idea at all if Sybase has anything like
MVCC.


> Why *wouldn't* the planner do this?

The question should be why the optimizer doesn't consider it, and the
executor uses it. The planner really doesn't decide that part :-)
Hopefully, the answer can be found above.

//Magnus

Re: LIKE search and performance

From
James Mansion
Date:
> If Sybase is still like SQL Server (or the other way around), it *may*
> end up scanning the index *IFF* the index is a clustered index. If it's
> a normal index, it will do a sequential scan on the table.
>
>
Are you sure it's not covered?  Have to check at work - but I'm off next
week so it'll have to wait.

> It's not a win on PostgreSQL, because of our MVCC implementation. We
> need to scan *both* index *and* data pages if we go down that route, in
> which case it's a lot faster to just scan the data pages alone.
>
>
Why do you need to go to all the data pages - doesn't the index
structure contain all the keys, so you can prefilter and then check to
see if the *matched* items are still in view?  I'll be the first to
admit I know zip about Postgres, but it seems odd - doesn't the index
contain copies of the key values?

I suspect that I mis-spoke with 'leaf'.  I really just mean 'all index
pages with data', since the scan does not even need to be in index
order, just a good way to get at the data in a compact way.



Re: LIKE search and performance

From
Mark Lewis
Date:
On Thu, 2007-05-24 at 21:54 +0100, James Mansion wrote:
> > If Sybase is still like SQL Server (or the other way around), it *may*
> > end up scanning the index *IFF* the index is a clustered index. If it's
> > a normal index, it will do a sequential scan on the table.
> >
> >
> Are you sure its not covered?  Have to check at work - but I'm off next
> week so it'll have to wait.
>
> > It's not a win on PostgreSQL, because of our MVCC implementation. We
> > need to scan *both* index *and* data pages if we go down that route, in
> > which case it's a lot faster to just scan the data pages alone.
> >
> >
> Why do you need to go to all the data pages - doesn't the index
> structure contain all the keys so
> you prefilter and then check to see if the *matched* items are still in
> view?  I'll be first to admit I
> know zip about Postgres, but it seems odd - doesn't the index contain
> copies of the key values?.
>
> I suspect that I mis-spoke with 'leaf'.  I really just mean 'all index
> pages with data', since the scan
> does not even need to be in index order, just a good way to get at the
> data in a compact way.

PG could scan the index looking for matches first and only load the
actual rows if it found a match, but that could only be a possible win
if there were very few matches, because the difference in cost between a
full index scan and a sequential scan would need to be greater than the
cost of randomly fetching all of the matching data rows from the table
to look up the visibility information.

So yes it would be possible, but the odds of it being faster than a
sequential scan are small enough to make it not very useful.

And since it's basically impossible to know the selectivity of this kind
of where condition, I doubt the planner would ever realistically want to
choose that plan anyway because of its poor worst-case behavior.

-- Mark

Re: LIKE search and performance

From
Craig James
Date:
Mark Lewis wrote:

> PG could scan the index looking for matches first and only load the
> actual rows if it found a match, but that could only be a possible win
> if there were very few matches, because the difference in cost between a
> full index scan and a sequential scan would need to be greater than the
> cost of randomly fetching all of the matching data rows from the table
> to look up the visibility information.

Just out of curiosity: Does Postgres store a duplicate of the data in the
index, even for long strings?  I thought indexes only had to store the
string up to the point where there was no ambiguity; for example, if I have
"missing", "mississippi" and "misty", the index only needs "missin",
"missis" and "mist" in the actual index.  This would make it impossible
to use a full index scan for a LIKE query.

Craig

Re: LIKE search and performance

From
Alvaro Herrera
Date:
Craig James wrote:
> Mark Lewis wrote:
>
> >PG could scan the index looking for matches first and only load the
> >actual rows if it found a match, but that could only be a possible win
> >if there were very few matches, because the difference in cost between a
> >full index scan and a sequential scan would need to be greater than the
> >cost of randomly fetching all of the matching data rows from the table
> >to look up the visibility information.
>
> Just out of curiosity: Does Postgress store a duplicate of the data in the
> index, even for long strings?  I thought indexes only had to store the
> string up to the point where there was no ambiguity, for example, if I have
> "missing", "mississippi" and "misty", the index only needs "missin",
> "missis" and "mist" in the actual index.

What would happen when you inserted a new tuple with just "miss"?  You
would need to expand all the other tuples in the index.

--
Alvaro Herrera                         http://www.flickr.com/photos/alvherre/
"Puedes vivir solo una vez, pero si lo haces bien, una vez es suficiente"

Re: LIKE search and performance

From
mark@mark.mielke.cc
Date:
On Thu, May 24, 2007 at 02:02:40PM -0700, Mark Lewis wrote:
> PG could scan the index looking for matches first and only load the
> actual rows if it found a match, but that could only be a possible win
> if there were very few matches, because the difference in cost between a
> full index scan and a sequential scan would need to be greater than the
> cost of randomly fetching all of the matching data rows from the table
> to look up the visibility information.

> So yes it would be possible, but the odds of it being faster than a
> sequential scan are small enough to make it not very useful.

> And since it's basically impossible to know the selectivity of this kind
> of where condition, I doubt the planner would ever realistically want to
> choose that plan anyway because of its poor worst-case behavior.

What is a real life example where an intelligent and researched
database application would issue a like or ilike query as their
primary condition in a situation where they expected very high
selectivity?

Avoiding a poor worst-case behaviour for a worst-case behaviour that
won't happen doesn't seem practical.

What real life examples, that could not be implemented a better way,
would behave poorly if like/ilike looked at the index first to filter?

I don't understand... :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/


Re: LIKE search and performance

From
Craig James
Date:
Alvaro Herrera wrote:
>> Just out of curiosity: Does Postgress store a duplicate of the data in the
>> index, even for long strings?  I thought indexes only had to store the
>> string up to the point where there was no ambiguity, for example, if I have
>> "missing", "mississippi" and "misty", the index only needs "missin",
>> "missis" and "mist" in the actual index.
>
> What would happen when you inserted a new tuple with just "miss"?  You
> would need to expand all the other tuples in the index.

That's right.  This technique used by some index implementations is a
tradeoff between size and update speed.  Most words in most natural
languages can be distinguished by the first few characters.  The chance of
having to modify more than a few surrounding nodes when you insert "miss"
is small, so some implementations choose this method.  Other
implementations choose to store the full string.  I was just curious which
method Postgres uses.

Craig


Re: LIKE search and performance

From
PFC
Date:
> PG could scan the index looking for matches first and only load the
> actual rows if it found a match, but that could only be a possible win
> if there were very few matches, because the difference in cost between a
> full index scan and a sequential scan would need to be greater than the
> cost of randomly fetching all of the matching data rows from the table
> to look up the visibility information.

    If you need to do that kind of thing, ie. seq scanning a table checking
only one column among a large table of many columns, then don't use an
index. An index, being a btree, needs to be traversed in order (or else, a
lot of locking problems come up) which means some random accesses.

    So, you could make a table with 2 columns, updated via triggers: your
text field, and the primary key of your main table. Scanning that would be
faster.
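
A rough, untested sketch of such a trigger-maintained side table (all names invented):

```sql
-- Narrow table: just the searched text plus the main table's key.
CREATE TABLE main_search (
    id  integer PRIMARY KEY,
    txt text
);

CREATE OR REPLACE FUNCTION main_search_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO main_search (id, txt) VALUES (NEW.id, NEW.name);
    ELSIF TG_OP = 'UPDATE' THEN
        UPDATE main_search SET txt = NEW.name WHERE id = NEW.id;
    ELSIF TG_OP = 'DELETE' THEN
        DELETE FROM main_search WHERE id = OLD.id;
    END IF;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER main_search_sync_trg
    AFTER INSERT OR UPDATE OR DELETE ON main_table
    FOR EACH ROW EXECUTE PROCEDURE main_search_sync();
```

Seq scanning main_search and joining back on id touches far fewer pages than scanning the wide main table.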

    Still, a better solution for searching in text is :

    - tsearch2 if you need whole words
    - trigrams for any substring match
    - xapian for full text search with wildcards (ie. John* = Johnny)

    Speed-wise those three will beat any seq scan on a large table by a huge
margin.
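
For the trigram option, a hedged sketch using contrib/pg_trgm (names invented; on 8.2-era Postgres the opclass is GiST-based and you search by similarity with the % operator, rather than LIKE being accelerated directly):

```sql
-- Hypothetical pg_trgm usage (requires contrib/pg_trgm).
CREATE INDEX persons_name_trgm ON persons
    USING gist (name gist_trgm_ops);

-- Finds names whose trigrams resemble 'john', which covers
-- substrings and near-misspellings.
SELECT * FROM persons WHERE name % 'john';
```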

Re: LIKE search and performance

From
Richard Huxton
Date:
mark@mark.mielke.cc wrote:
>> And since it's basically impossible to know the selectivity of this kind
>> of where condition, I doubt the planner would ever realistically want to
>> choose that plan anyway because of its poor worst-case behavior.
>
> What is a real life example where an intelligent and researched
> database application would issue a like or ilike query as their
> primary condition in a situation where they expected very high
> selectivity?
>
> Avoiding a poor worst-case behaviour for a worst-case behaviour that
> won't happen doesn't seem practical.

But if you are also filtering on e.g. date, and that has an index with
good selectivity, you're never going to use the text index anyway, are
you? If you've only got a dozen rows to check against, might as well
just read them in.

The only time it's worth considering the behaviour at all is *if* the
worst-case is possible.

--
   Richard Huxton
   Archonet Ltd

Re: LIKE search and performance

From
mark@mark.mielke.cc
Date:
On Fri, May 25, 2007 at 09:13:25AM +0100, Richard Huxton wrote:
> mark@mark.mielke.cc wrote:
> >>And since it's basically impossible to know the selectivity of this kind
> >>of where condition, I doubt the planner would ever realistically want to
> >>choose that plan anyway because of its poor worst-case behavior.
> >What is a real life example where an intelligent and researched
> >database application would issue a like or ilike query as their
> >primary condition in a situation where they expected very high
> >selectivity?
> >Avoiding a poor worst-case behaviour for a worst-case behaviour that
> >won't happen doesn't seem practical.
> But if you are also filtering on e.g. date, and that has an index with
> good selectivity, you're never going to use the text index anyway are
> you? If you've only got a dozen rows to check against, might as well
> just read them in.
> The only time it's worth considering the behaviour at all is *if* the
> worst-case is possible.

I notice you did not provide a real life example as requested. :-)

This seems like an ivory tower restriction. Not allowing best performance
in a common situation vs not allowing worst performance in a not-so-common
situation.

Cheers,
mark



Re: LIKE search and performance

From
Richard Huxton
Date:
mark@mark.mielke.cc wrote:
> On Fri, May 25, 2007 at 09:13:25AM +0100, Richard Huxton wrote:
>> mark@mark.mielke.cc wrote:
>>>> And since it's basically impossible to know the selectivity of this kind
>>>> of where condition, I doubt the planner would ever realistically want to
>>>> choose that plan anyway because of its poor worst-case behavior.
>>> What is a real life example where an intelligent and researched
>>> database application would issue a like or ilike query as their
>>> primary condition in a situation where they expected very high
>>> selectivity?
>>> Avoiding a poor worst-case behaviour for a worst-case behaviour that
>>> won't happen doesn't seem practical.
>> But if you are also filtering on e.g. date, and that has an index with
>> good selectivity, you're never going to use the text index anyway are
>> you? If you've only got a dozen rows to check against, might as well
>> just read them in.
>> The only time it's worth considering the behaviour at all is *if* the
>> worst-case is possible.
>
> I notice you did not provide a real life example as requested. :-)

OK - any application that allows user-built queries: <choose column:
foo> <choose filter: contains> <choose target: "bar">

Want another? Any application that has a "search by name" box - users
can (and do) put one letter in and hit enter.

Unfortunately you don't always have control over the selectivity of
queries issued.

> This seems like an ivory tower restriction. Not allowing best performance
> in a common situation vs not allowing worst performance in a not-so-common
> situation.

What best performance plan are you thinking of? I'm assuming we're
talking about trailing-wildcard matches here, rather than "contains"
style matches.

--
   Richard Huxton
   Archonet Ltd

Re: LIKE search and performance

From
PFC
Date:

> OK - any application that allows user-built queries: <choose column:
> foo> <choose filter: contains> <choose target: "bar">
>
> Want another? Any application that has a "search by name" box - users
> can (and do) put one letter in and hit enter.
>
> Unfortunately you don't always have control over the selectivity of
> queries issued.

    -*- HOW TO MAKE A SEARCH FORM -*-

    Imagine you have to code the search on IMDB.

    This is what a smart developer would do

    First, he uses AJAX autocompletion, so the thing is reactive.
    Then, he does not bother the user with a many-fields form. Instead of
forcing the user to think (users HATE that), he writes smart code.
    Does Google Maps have separate fields for country, city, street,
zipcode? No. Because Google is about as smart as it gets.

    So, you parse the user query.

    If the user types, for instance, less than 3 letters (say, spi), he
probably wants stuff that *begins* with those letters. There is no point
in searching for the letter "a" in a million movie titles database.
    So, if the user types "spi", you query "name LIKE 'spi%'", which is
indexed, very fast. And since you're smart, you use AJAX. And you display
only the most popular results (ie. most clicked on).

http://imdb.com/find?s=all&q=spi

    Since 99% of the time the user wanted "spiderman" or "spielberg", you're
done and he's happy. Users like being happy.
    If the user just types "a", you display the first 10 things that start
with "a", this is useless but the user will marvel at your AJAX skillz.
Then he will probably type in a few other letters.

    Then, if the user uses his space bar and types "spi 1980" you'll
recognize a year and display spielberg's movies in 1980.
    Converting your strings to phonetics is also a good idea since about 0.7%
of the l33T teenagers can spell stuff especially spiElberg.

    Only the guy who wants to know who had sex with marilyn monroe on the
17th day of the shooting of Basic Instinct will need to use the Advanced
search.

    If you detect several words, then switch to a prefix-based fulltext
search like Xapian which utterly rocks.
    Example : the user types "savin priv", you search for "savin*" NEAR
"priv*" and you display "saving private ryan" before he has even finished
typing the second word of his query. Users love that, they feel
understood, they will click on your ads and buy your products.

    In all cases, search results should be limited to less than 100 to be
easy on the database. The user doesn't care about a search returning more
than 10-20 results, he will just rephrase the query, and the time taken to
fetch those thousands of records with name LIKE '%a%' will have been
utterly lost. Who goes to page 2 in google results ?

    BOTTOM LINE : databases don't think, you do.

Re: LIKE search and performance

From
mark@mark.mielke.cc
Date:
On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote:
> >I notice you did not provide a real life example as requested. :-)
> OK - any application that allows user-built queries: <choose column:
> foo> <choose filter: contains> <choose target: "bar">
> Want another? Any application that has a "search by name" box - users
> can (and do) put one letter in and hit enter.
> Unfortunately you don't always have control over the selectivity of
> queries issued.

The database has 10 million records. The user enters "bar" and it
translates to "%bar%". You are suggesting that we expect bar to match
1 million+ records? :-)

I hope not. I would define this as bad process. I would also use "LIMIT"
to something like "100".

> >This seems like an ivory tower restriction. Not allowing best performance
> >in a common situation vs not allowing worst performance in a not-so-common
> >situation.
> What best performance plan are you thinking of? I'm assuming we're
> talking about trailing-wildcard matches here, rather than "contains"
> style matches.

"Trailing-wildcard" already uses B-Tree index, does it not?

I am speaking of contains, as contains is the one that was said to
require a seqscan. I am questioning why it requires a seqscan. The
claim was made that with MVCC, the index is insufficient to check
for visibility and that the table would need to be accessed anyways,
therefore a seqscan is required. I question whether a like '%bar%'
should be considered a high selectivity query in the general case.
I question whether a worst case should be assumed.

Perhaps I question too much? :-)

Cheers,
mark



Re: LIKE search and performance

From
"Joshua D. Drake"
Date:
mark@mark.mielke.cc wrote:

> I am speaking of contains, as contains is the one that was said to
> require a seqscan. I am questioning why it requires a seqscan. The
> claim was made that with MVCC, the index is insufficient to check
> for visibility and that the table would need to be accessed anyways,
> therefore a seqscan is required. I question whether a like '%bar%'
> should be considered a high selectivity query in the general case.
> I question whether a worst case should be assumed.

If you are doing %bar% you should be using pg_trgm or tsearch2.

J


>
> Perhaps I question too much? :-)
>
> Cheers,
> mark
>


Re: LIKE search and performance

From
Richard Huxton
Date:
PFC wrote:
>
>> OK - any application that allows user-built queries: <choose column:
>> foo> <choose filter: contains> <choose target: "bar">
>>
>> Want another? Any application that has a "search by name" box - users
>> can (and do) put one letter in and hit enter.
>>
>> Unfortunately you don't always have control over the selectivity of
>> queries issued.
>
>     -*- HOW TO MAKE A SEARCH FORM -*-
>
>     Imagine you have to code the search on IMDB.
>
>     This is what a smart developer would do

All good domain-specific tips to provide users with a satisfying
search-experience.

None of which address the question of what plan PG should produce for:
SELECT * FROM bigtable WHERE foo LIKE 's%'

--
   Richard Huxton
   Archonet Ltd

Re: LIKE search and performance

From
PFC
Date:
> None of which address the question of what plan PG should produce for:
> SELECT * FROM bigtable WHERE foo LIKE 's%'

    Ah, this one already uses the btree since the '%' is at the end.
    My point is that a search like this will yield too many results to be
useful to the user anyway, so optimizing its performance is a kind of red
herring.


Re: LIKE search and performance

From
Richard Huxton
Date:
mark@mark.mielke.cc wrote:
> On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote:
>>> I notice you did not provide a real life example as requested. :-)
>> OK - any application that allows user-built queries: <choose column:
>> foo> <choose filter: contains> <choose target: "bar">
>> Want another? Any application that has a "search by name" box - users
>> can (and do) put one letter in and hit enter.
>> Unfortunately you don't always have control over the selectivity of
>> queries issued.
>
> The database has 10 million records. The user enters "bar" and it
> translates to "%bar%". You are suggesting that we expect bar to match
> 1 million+ records? :-)

I was saying that you don't know. At least, I don't know of any cheap
way of gathering full substring stats or doing a full substring
indexing. Even tsearch2 can't do that.

> I hope not. I would define this as bad process. I would also use "LIMIT"
> to something like "100".

Yes, but that's not the query we're talking about is it? If possible you
don't do '%bar%' searches at all. If you do, you try to restrict it
further or LIMIT the results. There's nothing to discuss in these cases.

>>> This seems like an ivory tower restriction. Not allowing best performance
>>> in a common situation vs not allowing worst performance in a not-so-common
>>> situation.
>> What best performance plan are you thinking of? I'm assuming we're
>> talking about trailing-wildcard matches here, rather than "contains"
>> style matches.
>
> "Trailing-wildcard" already uses B-Tree index, does it not?

Yes, it searches the btree and then checks the data for visibility. I
thought that was what you felt could be worked around. It appears I was
wrong.

> I am speaking of contains, as contains is the one that was said to
> require a seqscan. I am questioning why it requires a seqscan.

Well, you seemed to be suggesting you had something better in mind. At
least, that was my reading of your original post.

> The
> claim was made that with MVCC, the index is insufficient to check
> for visibility

True, for PG's implementation of MVCC. You *could* have visibility in
each index, but that obviously would take more space. For a table with
many indexes, that could be a *lot* more space, and you would also have
to update all that visibility information on every write.

> and that the table would need to be accessed anyways,
> therefore a seqscan is required. I question whether a like '%bar%'
> should be considered a high selectivity query in the general case.
> I question whether a worst case should be assumed.

Well, the general rule-of-thumb is only about 10% for the changeover
between index & seq-scan. That is, once you are reading 10% of the rows
on disk (to check visibility) you might as well read them all (since
you'll be reading most of the blocks anyway if the rows are randomly
distributed). If you are doing SELECT * from that table then you'll want
all that data you read. If you are doing SELECT count(*) then you only
wanted the visibility :-(
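A rough way to watch that crossover (names hypothetical; the 10% figure
is a rule of thumb, not a hard constant in the planner):

-- The plan flips as the estimated fraction of matching rows grows:
EXPLAIN SELECT * FROM bigtable WHERE foo LIKE 'smith%'; -- selective: index scan likely
EXPLAIN SELECT * FROM bigtable WHERE foo LIKE 's%';     -- broad: seqscan likely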

Now you and I can look at a substring and probably make a good guess how
common it is (assuming we know the targets are British surnames or
Japanese towns). PG needs one number - or rather, it picks one number
for each length of search-string (afaik).

> Perhaps I question too much? :-)

Not sure it's possible to question too much :-)
However, you need to provide answers occasionally too - what numbers
would you pick? :-)

--
   Richard Huxton
   Archonet Ltd

Re: LIKE search and performance

From
Richard Huxton
Date:
PFC wrote:
>> None of which address the question of what plan PG should produce for:
>> SELECT * FROM bigtable WHERE foo LIKE 's%'
>
>     Ah, this one already uses the btree since the '%' is at the end.
>     My point is that a search like this will yield too many results to
> be useful to the user anyway, so optimizing its performance is a kind of
> red herring.

At the *application level* yes.
At the *query planner* level no.

At the query planner level I just want it to come up with the best plan
it can. The original argument was that PG's estimate of the number of
matching rows was too optimistic (or pessimistic) in the case where we
are doing a contains substring-search.

--
   Richard Huxton
   Archonet Ltd

Re: LIKE search and performance

From
Gregory Stark
Date:
"Richard Huxton" <dev@archonet.com> writes:

> Now you and I can look at a substring and probably make a good guess how common
> it is (assuming we know the targets are British surnames or Japanese towns). PG
> needs one number - or rather, it picks one number for each length of
> search-string (afaik).

I don't think that's true. Postgres calculates the lower and upper bound
implied by the search pattern and then uses the histogram to estimate how
selective that range is. It's sometimes surprisingly good but obviously it's
not perfect.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: LIKE search and performance

From
Richard Huxton
Date:
Gregory Stark wrote:
> "Richard Huxton" <dev@archonet.com> writes:
>
>> Now you and I can look at a substring and probably make a good guess how common
>> it is (assuming we know the targets are British surnames or Japanese towns). PG
>> needs one number - or rather, it picks one number for each length of
>> search-string (afaik).
>
> I don't think that's true. Postgres calculates the lower and upper bound
> implied by the search pattern and then uses the histogram to estimate how
> selective that range is. It's sometimes surprisingly good but obviously it's
> not perfect.

Sorry - I'm obviously picking my words badly today.

I meant for the "contains" substring match. It gives different (goes
away and checks...yes) predictions based on string length. So it guesses
that LIKE '%aaa%' will match more than LIKE '%aaaa%'. Of course, if we
were matching surnames you and I could say that this is very unlikely,
but without some big statistics table I guess there's not much more PG
can do.
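You can see that guess directly in the planner's row estimates
(illustrative only; the actual numbers depend on the column statistics):

-- The estimated row count in the plan output typically shrinks as the
-- literal part of the pattern grows, whatever the real data contains:
EXPLAIN SELECT * FROM bigtable WHERE foo LIKE '%aaa%';
EXPLAIN SELECT * FROM bigtable WHERE foo LIKE '%aaaa%';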

For a trailing wildcard LIKE 'aaa%' it can and does, as you say, convert
this into something along the lines of (>= 'aaa' AND < 'aab'). Although
IIRC that depends on whether your locale allows it (not sure, I don't
really use non-C/non-English locales enough).
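The rewrite is roughly this (a sketch; the exact upper bound PG derives
depends on the collation):

-- In a C locale the planner can treat the prefix as a btree range:
SELECT * FROM bigtable WHERE foo LIKE 'aaa%';
-- behaves approximately like:
SELECT * FROM bigtable WHERE foo >= 'aaa' AND foo < 'aab';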

--
   Richard Huxton
   Archonet Ltd

Re: LIKE search and performance

From
James Mansion
Date:
mark@mark.mielke.cc wrote:
> What is a real life example where an intelligent and researched
> database application would issue a like or ilike query as their
> primary condition in a situation where they expected very high
> selectivity?
>
In my case the canonical example is to search against textual keys where
the search is performed automatically once the user has typed enough
data and paused.  In almost all cases the '%' trails, so I'm looking for
'starts with' in effect.  Usually the search will have a specified upper
limit on the number of returned rows, if that facility is available.  I
realise in this case that matching against the index does not allow the
match count unless we check MVCC as we go, but I don't see why another
thread can't be doing that.
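That 'starts with' pattern can use a plain btree even in a non-C-locale
database if the index is built with the text_pattern_ops operator class
(a sketch; table and column names are invented):

-- Type-ahead sketch: trailing wildcard plus a row cap.
-- text_pattern_ops lets the btree serve LIKE prefix searches
-- even when the database locale is not C.
CREATE INDEX items_name_like ON items (name text_pattern_ops);

SELECT * FROM items WHERE name LIKE 'joh%' LIMIT 20;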

James


Re: LIKE search and performance

From
mark@mark.mielke.cc
Date:
On Wed, Jun 06, 2007 at 11:23:13PM +0100, James Mansion wrote:
> mark@mark.mielke.cc wrote:
> >What is a real life example where an intelligent and researched
> >database application would issue a like or ilike query as their
> >primary condition in a situation where they expected very high
> >selectivity?
> In my case the canonical example is to search against textual keys
> where the search is performed automatically if the user has typed
> enough data and paused.  In almost all cases the '%' trails, and I'm
> looking for 'starts with' in effect.  Usually the search will have a
> specified upper number of returned rows, if that's an available
> facility.  I realise in this case that matching against the index
> does not allow the match count unless we check MVCC as we go, but I
> don't see why another thread can't be doing that.

I believe PostgreSQL already considers using the index for "starts
with", so this wasn't part of the discussion for me. Sorry that this
wasn't clear.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/