Re: Slow performance when querying millions of rows - Mailing list pgsql-performance

From: Greg Smith
Subject: Re: Slow performance when querying millions of rows
Date:
Msg-id: 4E0A7981.7070000@2ndQuadrant.com
In response to: Re: Slow performance when querying millions of rows  ("Craig McIlwee" <craig.mcilwee@openroadsconsulting.com>)
List: pgsql-performance
On 06/28/2011 07:50 PM, Craig McIlwee wrote:
> I was thinking that shared buffers controlled the amount of data,
> primarily table and index pages, that the database could store in
> memory at once.  Based on that assumption, I thought that a larger
> value would enable an entire table + index to be in memory together
> and speed up the query.  Am I wrong?

It does to some extent.  But:

a) This amount doesn't impact query planning as much if you've set a
large effective_cache_size (a quick way to check both settings is
sketched below)

b) The operating system is going to cache things outside of PostgreSQL, too

c) Data read via a sequential scan sometimes skips going into
shared_buffers, to keep that cache from being swamped with any single scan

d) Until the data has actually made its way into memory, you may be
pulling it in with inefficient random I/O at first.  By the time the
cache is populated, the query you wanted a populated cache to
accelerate may already have finished.

You can get some insight into all of this by using pg_buffercache to
see what's actually in the cache; I've put up some talks and scripts
to help with that at http://projects.2ndquadrant.com/talks that you
might find useful.
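
As a rough sketch of the kind of thing those scripts do (assuming the
pg_buffercache contrib module is installed in your database and you
can query it, which normally means connecting as a superuser), a query
along these lines shows which relations are taking up the most
buffers:

    SELECT c.relname,
           count(*) AS buffers,
           pg_size_pretty(count(*) * 8192) AS buffered  -- assumes the default 8kB block size
      FROM pg_buffercache b
      JOIN pg_class c
        ON b.relfilenode = pg_relation_filenode(c.oid)  -- pg_relation_filenode() needs 9.0+
     WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                                  WHERE datname = current_database()))
     GROUP BY c.relname
     ORDER BY buffers DESC
     LIMIT 10;

Keep in mind this only covers shared_buffers; whatever the OS is
caching on top of that doesn't show up here.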

> Looks like they are set to adaptive read-ahead now.  If the database
> is executing many concurrent queries, is it reasonable to suspect that
> the IO requests will compete with each other in such a way that the
> controller would rarely see many sequential requests since it is
> serving many processes?  The controller does have an 'on' option also
> that forces read-ahead, maybe that would solve the issue if we can
> rely on the data to survive in the cache until the actual read request
> takes place.

I've never been able to find good documentation on just what the
difference between the adaptive and on modes of that controller really
is, which is why I suggested you try both and see.  Linux has a
uniquely good read-ahead model that was tuned with PostgreSQL
specifically in mind, and even there you sometimes still have to tweak
it upwards from the defaults before the database can fetch things as
fast as the drives are capable of.  So your idea that you will meet or
exceed the drives' capabilities for bulk sequential scans is less
likely than you might think.  RAID5 should in theory give you 2X or
more of the speed of any single disk when reading a clustered table,
but the way PostgreSQL does its reads may make that hard to realize on
Windows.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
Comprehensive and Customized PostgreSQL Training Classes:
http://www.2ndquadrant.us/postgresql-training/

