Thread: Wildcard searches & performance question

Wildcard searches & performance question

From
Grega Bremec
Date:
Hello,

I have a database with the following layout:

    searchterms=# \d+ searches_2002
        Table "public.searches_2002"
      Column   |          Type          | Modifiers | Description
    -----------+------------------------+-----------+-------------
     srchdate  | date                   | not null  |
     srchtime  | time without time zone | not null  |
     client_ip | inet                   | not null  |
     srchquery | character varying(50)  | not null  |
     fhvalue   | smallint               |           |
    Indexes: searches_2002_client_ip btree (client_ip),
         searches_2002_srchdate btree (srchdate),
         searches_2002_srchdatetime btree (srchdate, srchtime),
         searches_2002_srchquery btree (srchquery),
         searches_2002_srchquery_lcase btree (lower(srchquery)),
         searches_2002_srchquery_withfh btree (srchquery, fhvalue),
         searches_2002_srchtime btree (srchtime)

There are no uniqueness properties that would make it possible for this table
to have a primary key, as it is a list of searches performed on a search
engine and the users' behaviour is, well... umm, odd, to be mild. :)

Also, do note that this is a test table, so nevermind the excessive amount of
indexes - performance is not an issue here, I am still evaluating the need and
benefits of having various indexes on those columns.

The particular case I am interested is this: when executing queries involving
pattern searches using various operators on srchquery, none of the indexes are
used in certain cases, namely those LIKE and regexp filters that start with
a wildcard.

This makes perfect sense, because wildcard pattern searches that start with a
wildcard, can not really benefit from an index scan, because a sequential scan
is probably going to be faster: we are only going to benefit from scanning an
index in those special cases where the wildcard evaluates to a zero-length
string.

One example of a query plan:

    searchterms=# explain select count(*)
            from searches_2002
            where srchquery like '%curriculum%';
                    QUERY PLAN
    --------------------------------------------------------------------------
     Aggregate  (cost=4583.26..4583.26 rows=1 width=0)
       ->  Seq Scan on searches_2002  (cost=0.00..4583.26 rows=1 width=0)
         Filter: (srchquery ~~ '%curriculum%'::text)

There is 211061 records in this test table, but the real-life tables would
contain a much much larger amount of data, more like 50+ million rows.

This promise of a hell on earth trying to optimize performance makes me wonder:
would there be a sensible way/reason for avoiding sequential scans on queries
that start with a wildcard, and would avoiding sequential scans even be
feasible in such cases? Or in other words, can I somehow optimize LIKE and
regexp queries that start with a wildcard?

TIA,
--
    Grega Bremec
    System Administration & Development Support
    grega.bremec-at-noviforum.si
    http://najdi.si/
    http://www.noviforum.si/

Re: Wildcard searches & performance question

From
Andrew Sullivan
Date:
On Tue, May 27, 2003 at 09:09:08PM +0200, Grega Bremec wrote:
> that start with a wildcard, and would avoiding sequential scans even be
> feasible in such cases? Or in other words, can I somehow optimize LIKE and
> regexp queries that start with a wildcard?

Not really.  But it sounds like you might be a candidate for full
text indexing.  See contrib/tsearch for one implementation.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Wildcard searches & performance question

From
Josh Berkus
Date:
Grega,

See www.openfts.org for a tool to do what you need.  There's also a simpler
one in /contrib, as Andrew mentioned.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Wildcard searches & performance question

From
"scott.marlowe"
Date:
What you want is full text searching.  To see it in action go here:

fts.postgresql.org

To download it go here:

http://openfts.sourceforge.net/

There is also the older, and slightly slower full text indexing engine,
included in the /contrib/fulltextindex directory.  It's a little more
wrung out, but also not as likely to get maintenance in the future.

Basically, full text indexing does exactly what you're asking for by
indexing each row inserted by each word in it (that isn't a noise word
like "the" or "a") and then uses the indexes created for its searches.

On Tue, 27 May 2003, Grega Bremec wrote:

> Hello,
>
> I have a database with the following layout:
>
>     searchterms=# \d+ searches_2002
>         Table "public.searches_2002"
>       Column   |          Type          | Modifiers | Description
>     -----------+------------------------+-----------+-------------
>      srchdate  | date                   | not null  |
>      srchtime  | time without time zone | not null  |
>      client_ip | inet                   | not null  |
>      srchquery | character varying(50)  | not null  |
>      fhvalue   | smallint               |           |
>     Indexes: searches_2002_client_ip btree (client_ip),
>          searches_2002_srchdate btree (srchdate),
>          searches_2002_srchdatetime btree (srchdate, srchtime),
>          searches_2002_srchquery btree (srchquery),
>          searches_2002_srchquery_lcase btree (lower(srchquery)),
>          searches_2002_srchquery_withfh btree (srchquery, fhvalue),
>          searches_2002_srchtime btree (srchtime)
>
> There are no uniqueness properties that would make it possible for this table
> to have a primary key, as it is a list of searches performed on a search
> engine and the users' behaviour is, well... umm, odd, to be mild. :)
>
> Also, do note that this is a test table, so nevermind the excessive amount of
> indexes - performance is not an issue here, I am still evaluating the need and
> benefits of having various indexes on those columns.
>
> The particular case I am interested is this: when executing queries involving
> pattern searches using various operators on srchquery, none of the indexes are
> used in certain cases, namely those LIKE and regexp filters that start with
> a wildcard.
>
> This makes perfect sense, because wildcard pattern searches that start with a
> wildcard, can not really benefit from an index scan, because a sequential scan
> is probably going to be faster: we are only going to benefit from scanning an
> index in those special cases where the wildcard evaluates to a zero-length
> string.
>
> One example of a query plan:
>
>     searchterms=# explain select count(*)
>             from searches_2002
>             where srchquery like '%curriculum%';
>                     QUERY PLAN
>     --------------------------------------------------------------------------
>      Aggregate  (cost=4583.26..4583.26 rows=1 width=0)
>        ->  Seq Scan on searches_2002  (cost=0.00..4583.26 rows=1 width=0)
>          Filter: (srchquery ~~ '%curriculum%'::text)
>
> There is 211061 records in this test table, but the real-life tables would
> contain a much much larger amount of data, more like 50+ million rows.
>
> This promise of a hell on earth trying to optimize performance makes me wonder:
> would there be a sensible way/reason for avoiding sequential scans on queries
> that start with a wildcard, and would avoiding sequential scans even be
> feasible in such cases? Or in other words, can I somehow optimize LIKE and
> regexp queries that start with a wildcard?
>
> TIA,
>


Re: Wildcard searches & performance question

From
Rod Taylor
Date:
> feasible in such cases? Or in other words, can I somehow optimize LIKE and
> regexp queries that start with a wildcard?

If they start with a wildcard, but end in character data you could
reverse the string and index that...

If they start and end with a wildcard, your best bet is a full text
indexing solution (various contrib modules).

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Attachment

Re: Wildcard searches & performance question

From
Grega Bremec
Date:
Thank you very much for all your suggestions.

I am planning on investing some time into trying out a couple FT indexes. Not
minding the fact most of the queries are going to be exact phrase substring
searches, performance will most probably benefit from it, so at least some of
what I'm after is achieved that way.

I shall be getting back with reports, in case anyone is interested.

Cheers,
--
    Grega Bremec
    System Administration & Development Support
    grega.bremec-at-noviforum.si
    http://najdi.si/
    http://www.noviforum.si/

Re: Wildcard searches & performance question

From
"scott.marlowe"
Date:
On Wed, 28 May 2003, Grega Bremec wrote:

> Thank you very much for all your suggestions.
>
> I am planning on investing some time into trying out a couple FT indexes. Not
> minding the fact most of the queries are going to be exact phrase substring
> searches, performance will most probably benefit from it, so at least some of
> what I'm after is achieved that way.
>
> I shall be getting back with reports, in case anyone is interested.

Be sure to look at OpenFTS http://sourceforge.net/projects/openfts/