Performance for seq. scans - Mailing list pgsql-general

From Jules Bean
Subject Performance for seq. scans
Msg-id 20000726115132.C29809@grommit.office.vi.net
List pgsql-general
Hi all,

I've had a look over the docs and the FAQ and I can't see anything
answering this, so here goes:

I'm in the (slightly unusual, in a relational world) situation that
the dominant query on my database is a wildcard search, so that no
indexes can be used.  (E.g. select * from table_a where foo like
'%bar%').
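
For completeness, a minimal sketch of confirming this with the planner (table and column names follow the example above): a pattern with a leading '%' cannot use a B-tree index, so the planner falls back to a sequential scan.

```sql
-- Hypothetical check: a leading-wildcard LIKE forces a full table scan.
EXPLAIN SELECT * FROM table_a WHERE foo LIKE '%bar%';
-- The plan reports a Seq Scan on table_a, regardless of any index on foo.
```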

Without some /very/ clever (and disk-space intensive) subword
indexing, this query is doomed to be a sequential scan, which I'm
resigned to.  It's a question of making that as fast as possible.

My dataset is around 500M as a text file on disk, and around 1500M as
postgres data. The machine I'm working on at the moment does the
search in around 90 seconds. (For comparison, MS SQL 7, the other
solution being considered here, takes around 75 seconds on identical
hardware).

Interestingly, using 'vmstat' shows that the CPU is maxed out at 50%
(this being a dual CPU machine), while the disk access is a mere
4M/sec --- bonnie claims this machine is capable of around 25M/sec to
this particular disk. So it would seem that the bottleneck is the
CPU. [I understand why both CPUs aren't used]

My previous feeling had been that the bottleneck was going to be the
disk, in which case I was going to recommend installing enough memory
in the machine that the kernel disk cache could cache the whole file,
and thus speed up the search. In the current situation, it seems
like the only improvement would be to install a faster CPU (and since
we're currently using a PIII 600, I couldn't expect much more than a
60% improvement or so that way).

It seems slightly surprising that postgres can only "service" a 4M/sec
stream of data from the disk with a LIKE query -- not such a complex
query. Is there some unnecessary data copying in the critical path for
the search?

I almost forgot -- this is debian package 7.0.2-2.

Any pointers to whether or not this performance can be improved upon,
welcomed.  Currently I'm feeling like the right solution may be to
dump the 500M text file periodically and run 'grep' on a machine with
enough memory to cache the text file ;-)
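That fallback could look something like the sketch below (the file path and the sample data are assumptions; in practice the flat file would come from a periodic COPY of the table):

```shell
# Hypothetical sketch of the grep fallback. The dump would really be
# produced with something like:
#   psql -d mydb -c "COPY table_a TO '/tmp/table_a.txt'"
# A tiny stand-in file is created here so the search step can be shown.
printf 'foobar\nbaz\nrebar\n' > /tmp/table_a.txt

# Equivalent of WHERE foo LIKE '%bar%': count the matching lines.
# After the first run the kernel page cache holds the file, so repeat
# searches become CPU-bound rather than disk-bound.
grep -c 'bar' /tmp/table_a.txt
```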

Jules Bean


