Thread: Performance for seq. scans
Hi all,

I've had a look over the docs and the FAQ and I can't see anything answering this, so here goes:

I'm in the (slightly unusual, in a relational world) situation that the dominant query on my database is a wildcard search, so that no indexes can be used. (E.g. select * from table_a where foo like '%bar%'.) Without some /very/ clever (and disk-space intensive) subword indexing, this query is doomed to be a sequential scan, which I'm resigned to. It's a question of making that as fast as possible.

My dataset is around 500M as a text file on disk, and around 1500M as postgres data. The machine I'm working on at the moment does the search in around 90 seconds. (For comparison, MS SQL 7, the other solution being considered here, takes around 75 seconds on identical hardware.)

Interestingly, 'vmstat' shows that the CPU is maxed out at 50% (this being a dual-CPU machine), while the disk access is a mere 4M/sec --- bonnie claims this machine is capable of around 25M/sec to this particular disk. So it would seem that the bottleneck is the CPU. [I understand why both CPUs aren't used.]

My previous feeling had been that the bottleneck was going to be the disk, in which case I was going to recommend installing enough memory in the machine that the kernel disk cache could hold the whole file, thus speeding up the search. In the current situation, it seems the only improvement would be a faster CPU (and since we're currently using a PIII 600, I couldn't expect much more than a 60% improvement or so that way).

It seems slightly surprising that postgres can only "service" a 4M/sec stream of data from the disk with a LIKE query -- not such a complex query. Is there some unnecessary data copying in the critical path for the search?

I almost forgot -- this is debian package 7.0.2-2.

Any pointers to whether or not this performance can be improved upon are welcome. Currently I'm feeling like the right solution may be to dump the 500M text file periodically and run 'grep' on a machine with enough memory to cache the text file ;-)

Jules Bean
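A quick way to confirm the behaviour Jules describes is EXPLAIN; a minimal sketch, with an illustrative index name not taken from the thread:

    CREATE INDEX table_a_foo_idx ON table_a (foo);

    -- The pattern has a leading wildcard, so there is no prefix with
    -- which to bound an index scan; the plan comes back as a Seq Scan
    -- on table_a regardless of the index.
    EXPLAIN SELECT * FROM table_a WHERE foo LIKE '%bar%';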
At 11:51 26/07/00 +0100, you wrote:
>Hi all,
>
>I've had a look over the docs and the FAQ and I can't see anything
>answering this, so here goes:
>
>I'm in the (slightly unusual, in a relational world) situation that
>the dominant query on my database is a wildcard search, so that no
>indexes can be used. (E.g. select * from table_a where foo like
>'%bar%'.)

We were in a similar position and went for the 'Full Text Indexing' extra. You'll find it in contrib/fulltextindex.

It creates a function which you call from a trigger to produce an index of words for specified fields. These indexes do get _very_ large (one of ours is ~800 MB), but it does work very well and speeds searches up enormously.

Steve
--
thorNET - Internet Consultancy, Services & Training
Phone: 01454 854413
Fax: 01454 854412
http://www.thornet.co.uk
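As a rough sketch of what the contrib/fulltextindex setup looks like (the shape follows its README; the table, index, and trigger names here are illustrative, and the exact arguments should be checked against the README itself):

    -- Auxiliary table holding the generated substrings; 'id' refers
    -- back to the indexed row in the main table by oid.
    CREATE TABLE table_a_fti (string varchar(25), id oid);
    CREATE INDEX table_a_fti_idx ON table_a_fti (string);

    -- The fti() trigger function keeps the auxiliary table in sync:
    -- its first argument names the fti table, the rest name the
    -- columns to index.
    CREATE TRIGGER table_a_fti_trig
        AFTER UPDATE OR INSERT OR DELETE ON table_a
        FOR EACH ROW EXECUTE PROCEDURE fti(table_a_fti, foo);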
On Wed, Jul 26, 2000 at 12:14:13PM +0100, Steve Heaven wrote:
> We were in a similar position and went for the 'Full Text Indexing' extra.
> You'll find it in contrib/fulltextindex.
>
> It creates a function which you call from a trigger to produce an index of
> words for specified fields. These indexes do get _very_ large (one of ours
> is ~800 MB), but it does work very well and speeds searches up enormously.

If I understand you correctly, that's word-based? It's just splitting on whitespace and punctuation? Unfortunately, that's not quite what we need --- our wildcard searches needn't have their '%' on word boundaries.

Jules
At 12:28 26/07/00 +0100, Jules Bean wrote:
>If I understand you correctly, that's word-based? It's just splitting
>on whitespace and punctuation? Unfortunately, that's not quite what
>we need --- our wildcard searches needn't have their '%' on word
>boundaries.

There is a function in the source called breakup(). This can be customised to create the index entries on sub-word strings.

Steve
--
thorNET - Internet Consultancy, Services & Training
Phone: 01454 854413
Fax: 01454 854412
http://www.thornet.co.uk
The FTI trigger code that's distributed with PostgreSQL now actually breaks the words up into two-character substrings.

I re-wrote it to eliminate duplicates and only split up the words based on whitespace and delimiters -- if you did this you could still use LIKE to match based on substrings, and you would have the added speed of an index scan.

Jules:

    select * from table_a where foo like '%bar%'

Depending on the type of foo, that doesn't have to do a seq scan. If it's anything but text, you can create an index on it -- LIKE can use indexes. If it is type text, then I would look into using the FTI stuff in contrib. If you want mine, let me know; however, it sounds like the distributed version would be more suited to what you'd like to do.

Good luck!

-Mitch
On Wed, Jul 26, 2000 at 10:34:39AM -0400, Mitch Vincent wrote:
> The FTI trigger code that's distributed with PostgreSQL now actually
> breaks the words up into two-character substrings.
>
> I re-wrote it to eliminate duplicates and only split up the words based
> on whitespace and delimiters -- if you did this you could still use LIKE
> to match based on substrings, and you would have the added speed of an
> index scan.
>
> Depending on the type of foo, that doesn't have to do a seq scan. If
> it's anything but text, you can create an index on it -- LIKE can use
> indexes. If it is type text, then I would look into using the FTI stuff
> in contrib. If you want mine, let me know; however, it sounds like the
> distributed version would be more suited to what you'd like to do.

Hmm. Colour me confused. You're suggesting I can use a conventional index on 'foo'? Well, certainly, there is such an index on the table anyhow (since some queries do use it), but I don't understand. Surely LIKE only uses indexes when the string match is anchored on the left? Certainly that's my understanding of indexes, and it's what the docs say...

OTOH, are you saying that the FTI code in 7.0.2 contrib does in fact do the 'index all occurring 2-char substrings' trick? I must go and investigate that straight away. Many thanks.

Jules
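Jules's reading matches the documented behaviour: a B-tree index only helps LIKE when the pattern is anchored on the left, because only then is there a prefix the planner can turn into a range scan (roughly foo >= 'bar' AND foo < 'bas', modulo locale caveats in these versions). Illustrative queries:

    -- Left-anchored: the prefix bounds an index range scan.
    SELECT * FROM table_a WHERE foo LIKE 'bar%';

    -- Unanchored: no usable prefix, so this stays a sequential scan.
    SELECT * FROM table_a WHERE foo LIKE '%bar%';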
On Wed, Jul 26, 2000 at 03:44:47PM +0100, Jules Bean wrote:
> OTOH, are you saying that the FTI code in 7.0.2 contrib does in fact
> do the 'index all occurring 2-char substrings' trick? I must go and
> investigate that straight away. Many thanks.

OK, I apologise for my skepticism. I have just read the FTI code, and as far as a quick glance can tell me, it appears to index, for each 'word' in my main table, all the final segments of that word. This enables me to match any substring, since if %bar% matches, then there must be /some/ final segment which begins with bar, so I can search for bar% in the FTI table. Rather cleverer than the algorithm I'd thought of ;-)

Thanks to everyone for all your help. Now it just remains to make up my mind whether I have the disk space for this ;-)

Jules
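To make the trick concrete: on Jules's reading, for a word like 'barrel' the FTI table would hold the final segments 'barrel', 'arrel', 'rrel', 'rel', 'el' (down to some minimum length), so every substring of the word is a prefix of some stored segment. A query can then reach the main table through a left-anchored search on the FTI table -- a sketch, reusing the illustrative names from the setup above:

    SELECT a.*
      FROM table_a a, table_a_fti f
     WHERE f.string LIKE 'bar%'   -- left-anchored, so the index on string applies
       AND f.id = a.oid;          -- fti rows point back at the main table by oid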
On Wed, 26 Jul 2000, Jules Bean wrote:
> If I understand you correctly, that's word-based? It's just splitting
> on whitespace and punctuation? Unfortunately, that's not quite what
> we need --- our wildcard searches needn't have their '%' on word
> boundaries.

You can have it search on less than word boundaries. It uses a regexp in the query statement, and actually will match against sub-strings. Check the documentation; it talks about how this works.

Brett W. McCoy
http://www.chapelperilous.net/~bmccoy/
-------------------------------------------------------------------------------
If everything is coming your way then you're in the wrong lane.
> From: Jules Bean <jules@jellybean.co.uk>
> To: pgsql-general@postgresql.org
> Subject: Performance for seq. scans
>
> I've had a look over the docs and the FAQ and I can't see anything
> answering this, so here goes:
>
> I'm in the (slightly unusual, in a relational world) situation that
> the dominant query on my database is a wildcard search, so that no
> indexes can be used. (E.g. select * from table_a where foo like
> '%bar%'.)
>
> Interestingly, 'vmstat' shows that the CPU is maxed out at 50%
> (this being a dual-CPU machine), while the disk access is a mere
> 4M/sec --- bonnie claims this machine is capable of around 25M/sec to
> this particular disk. So it would seem that the bottleneck is the
> CPU. [I understand why both CPUs aren't used.]

I'd make sure that you're using the latest compilers to build postgres, and maxing out the optimization. If your binaries are unoptimized, that ought to be good for 15% (for your app, possibly more).

Also, you haven't said what postmaster options you're using - I've seen big changes by tweaking sort memory, buffer pools, etc. Details of the parameters are in the man pages and docs.

-- Mike
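For reference, in the 7.0 era those knobs were command-line options rather than postgresql.conf settings. Something along these lines (values and data directory illustrative, not a recommendation): -B sets the number of shared buffer pages, and -o passes options through to each backend, here -S, the per-sort memory budget in kilobytes.

    postmaster -D /usr/local/pgsql/data -B 2048 -o '-S 4096'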
> > We were in a similar position and went for the 'Full Text Indexing' extra.
> > You'll find it in contrib/fulltextindex.
> > It creates a function which you call from a trigger to produce an index of
> > words for specified fields. These indexes do get _very_ large (one of ours
> > is ~800 MB), but it does work very well and speeds searches up enormously.
>
> If I understand you correctly, that's word-based? It's just splitting
> on whitespace and punctuation? Unfortunately, that's not quite what
> we need --- our wildcard searches needn't have their '%' on word
> boundaries.

It is not word-based. It slices each word into pieces and indexes it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026