Re: Slow performance when querying millions of rows - Mailing list pgsql-performance

From Craig McIlwee
Subject Re: Slow performance when querying millions of rows
Date
Msg-id 20110628235050.77a9baf7@mail.openroadsconsulting.com
In response to Re: Slow performance when querying millions of rows  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: Slow performance when querying millions of rows
Re: Slow performance when querying millions of rows
List pgsql-performance
> On 06/28/2011 05:28 PM, Craig McIlwee wrote:
> > Autovacuum is disabled for these tables since the data is never
> > updated.  The tables that we are testing with at the moment will not
> > grow any larger and have been both clustered and analyzed.
>
> Note that any such prep to keep from ever needing to maintain these
> tables in the future should include the FREEZE option, possibly with
> some parameters tweaked first to make it more aggressive.  Autovacuum
> will eventually revisit them in order to prevent transaction ID
> wrap-around, even if it's disabled.  If you're going to the trouble of
> prepping them so they are never touched again, you should do a freeze
> with the right parameters to keep this from happening again.
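
For reference, a one-time freeze pass along the lines Greg describes might look like the following; the table name is just a placeholder for one of the static daily tables:

  -- freeze all rows in a partition that will never be written again
  VACUUM FREEZE VERBOSE ANALYZE daily_20110628;

  -- confirm the frozen horizon advanced for that table
  SELECT relname, relfrozenxid FROM pg_class WHERE relname = 'daily_20110628';
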
>
> > work_mem: 512MB
> > shared_buffers: 64MB, 512MB, and 1024MB, each yielded the same query
> > plan and took the same amount of time to execute give or take a few
> > seconds
>
> shared_buffers doesn't normally impact the query plan; it impacts how
> much churn there is between the database and the operating system cache,
> mainly important for making write-heavy work efficient.  On Windows,
> you'll probably be safe to set this to 512MB and forget about it.  It
> doesn't benefit from large values anyway.

I was thinking that shared_buffers controlled the amount of data, primarily table and index pages, that the database could store in memory at once.  Based on that assumption, I thought that a larger value would enable an entire table + index to be in memory together and speed up the query.  Am I wrong?
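
If the pg_buffercache contrib module is available, one way to check what is actually resident in shared_buffers after running a query is a rough query like the one below (it ignores mapped catalogs and only looks at the current database, but that's close enough here):

  SELECT c.relname,
         count(*) AS buffers,
         pg_size_pretty(count(*) * current_setting('block_size')::bigint) AS cached
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
  WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                              WHERE datname = current_database()))
  GROUP BY c.relname
  ORDER BY buffers DESC
  LIMIT 10;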

>
> This is a very large work_mem setting however, so be careful that you
> won't have many users connecting at once if you're going to use it. 
> Each connection can use a multiple of work_mem, making it quite possible
> you could run out of memory with this configuration.  If that low user
> count is true, you may want to make sure you're enforcing it by lowering
> max_connections, as a safety measure to prevent problems.

I plan on lowering this quite a bit since I haven't seen much of a boost by increasing it.
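
One option along those lines is to keep the global default small and raise work_mem only for the sessions that run the big reports, something like this (the numbers are only illustrative):

  -- postgresql.conf: modest defaults for everyone
  --   work_mem = 32MB
  --   max_connections = 20

  -- then, only in the session that runs the large report:
  SET work_mem = '512MB';
  -- ... run the report query here ...
  RESET work_mem;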

>
> > Since the daily tables are only 360MB, I would hope that the entire
> > table could be pulled into memory with one large sequential read.  Of
> > course this assumes that the file pieces are stored contiguously, but
> > auto defrag is enabled and shows low fragmentation so I'm trusting (as
> > much as one can) Windows to do the right thing here.  My drives have a
> > 150MB/s sustained max throughput, and considering that data is spread
> > across 5 drives I would hope to at least be able to reach the single
> > disk theoretical limit and read an entire table plus the index into
> > memory in about 4 to 5 seconds.  Based on the analyze output, each daily
> > table averages 6 to 7 seconds, so I'm pretty close there and maybe just
> > limited by disk speed?
>
> One thing to note is that your drive speed varies based on what part of
> the disk things are located at; the slower parts of the drive will be
> much less than 150MB/s.
>
> On Linux servers it's impossible to reach something close to the disk's
> raw speed without making the operating system read-ahead feature much
> more aggressive than it is by default.  Because PostgreSQL fetches a
> single block at a time, to keep the drive completely busy something has
> to notice the pattern of access and be reading data ahead of when the
> database even asks for it.  You may find a parameter you can tune in the
> properties for the drives somewhere in the Windows Control Panel.  And
> there's a read-ahead setting on your PERC card that you may not have
> turned on; it's not as good as the Linux one, but it's better than
> nothing.  There are two useful settings there ("on" and "adaptive" if I
> recall correctly) that you can try, to see which works better.

Looks like they are set to adaptive read-ahead now.  If the database is executing many concurrent queries, is it reasonable to suspect that the IO requests will compete with each other such that the controller rarely sees many sequential requests, since it is serving many processes?  The controller also has an 'on' option that forces read-ahead; maybe that would solve the issue, if we can rely on the data to survive in the cache until the actual read request takes place.
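
One crude way to see whether read-ahead is paying off is to time a cold sequential scan of a single daily table from psql and compare the elapsed time against the heap size (table name is a placeholder, and the OS and controller caches need to be cold for the numbers to mean anything):

  \timing on
  -- forces a full sequential scan of the heap
  SELECT count(*) FROM daily_20110628;
  -- heap size, excluding indexes, to turn the elapsed time into MB/s
  SELECT pg_size_pretty(pg_relation_size('daily_20110628'));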

>
> > I've read documentation that says I should be able to set statistics
> > values for an entire table as opposed to per column, but haven't found
> > how to do that.  I guess I was either too lazy to update statistics on
> > each column or just didn't think it would help much.
>
> You can adjust the statistics target across the entire database using
> the default_statistics_target setting, or you can tweak them per column
> using ALTER TABLE.  There is no table-level control.  I find it
> difficult to answer questions about whether there is enough stats or not
> without actually looking at pg_stats to see how the database is
> interpreting the data, and comparing it against the real distribution. 
> This is an area where flailing about trying things doesn't work very
> well; you need to be very systematic about the analysis and testing
> strategy if you're going to get anywhere useful.  It's not easy to do.
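
For what it's worth, the per-column version looks something like the following; the column name is just an example of one that might show up in the report filters:

  -- raise the sample size for a column used heavily in WHERE clauses
  ALTER TABLE daily_20110628 ALTER COLUMN event_time SET STATISTICS 500;
  ANALYZE daily_20110628;

  -- then compare what the planner believes against the real distribution
  SELECT attname, n_distinct, null_frac, most_common_vals
  FROM pg_stats
  WHERE tablename = 'daily_20110628';
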
>
> As a larger commentary on what you're trying to do, applications like
> this often find themselves at a point one day where you just can't allow
> arbitrary user queries to run against them anymore.  What normally
> happens then is that the most common things that people really need end
> up being run once and stored in some summary form, using techniques such
> as materialized views:  http://wiki.postgresql.org/wiki/Materialized_Views
>
> In your case, I would start now on trying to find the common patterns in
> the long-running reports that people generate, and see if it's possible
> to pre-compute some portion of them and save that summary.  And you may
> find yourself in a continuous battle with business requests regardless. 
> It's often key decision makers who feel they should be able to query any
> way they want, regardless of its impact on the database.  Finding a
> middle ground there is usually challenging.
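
Since materialized views aren't built in, the technique on that wiki page boils down to a summary table that gets rebuilt on a schedule; a minimal sketch, with made-up table and column names:

  -- build the summary once
  CREATE TABLE daily_summary AS
    SELECT date_trunc('hour', event_time) AS hour, count(*) AS hits
    FROM daily_20110628
    GROUP BY 1;

  -- refresh from a scheduled job whenever a new daily table is loaded
  TRUNCATE daily_summary;
  INSERT INTO daily_summary
    SELECT date_trunc('hour', event_time), count(*)
    FROM daily_20110628
    GROUP BY 1;
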
>
> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> Comprehensive and Customized PostgreSQL Training Classes:
> http://www.2ndquadrant.us/postgresql-training/

Thanks,
Craig

