Thread: random_page_cost vs seq_page_cost

random_page_cost vs seq_page_cost

From
Benedikt Grundmann
Date:
Hello list,

I have a question about how to benchmark hardware to determine
the appropriate ratio of seq_page_cost vs random_page_cost.

Emails in this mailing list's archive seem to indicate that
1.0 vs 3.0 - 4.0 are appropriate values on modern hardware.

Which surprised me a bit, as I had thought that on actual
hard drives (ignoring SSDs) random_page_cost would be higher.
I guess that the number tries to reflect caching of the
relevant pages in memory, and on modern hardware you have
more of that?

Anyhow it would be great to have a scientific way of
setting those numbers.
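
A rough sketch of one way to probe that ratio empirically (assuming a
hypothetical table big_table that is several times larger than RAM, an
index on a customer_id column whose order is uncorrelated with the
physical row order, and a cold cache before each run; none of these
names come from this thread):

    -- Time a forced sequential scan and a forced index (random-order) scan
    -- of the same table, then compare the per-page timings to the configured
    -- seq_page_cost / random_page_cost ratio.  Drop the OS cache between runs.
    SET enable_indexscan = off;
    SET enable_bitmapscan = off;
    EXPLAIN ANALYZE SELECT count(*) FROM big_table;                        -- mostly sequential reads

    SET enable_seqscan = off;
    SET enable_indexscan = on;
    EXPLAIN ANALYZE SELECT count(*) FROM big_table WHERE customer_id > 0;  -- mostly random heap reads via the index
    RESET ALL;

This only gives a starting point; as discussed further down the thread,
the setting also folds in assumptions about caching, so the raw hardware
ratio is not the whole story.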

Background:

We have recently upgraded two of our biggest postgres databases
to new hardware, with a minor version number bump (8.4.5 -> 8.4.9).

We are experiencing a big performance regression in some queries.
In those cases the planner seems to choose a nested loop index
scan instead of hashing the index once and then joining.

The new hardware was optimized for seq scans and does those
very fast.  

We are not sure whether the database used to choose differently
before the move and the new hardware is simply performing worse
for random seeks, or whether the planner is now making different
choices.

As a countermeasure we are experimenting with
enable_nestloop = off
random_page_cost = 20 (instead of the previous 4).
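
Both can be tried per session before touching postgresql.conf; a minimal
sketch (the join is a placeholder, not one of the real queries from this
thread):

    -- Nothing here is persisted; RESET puts the session back afterwards.
    SET enable_nestloop = off;
    SET random_page_cost = 20;
    EXPLAIN ANALYZE
    SELECT count(*)
    FROM big_table b
    JOIN other_table o ON o.big_id = b.id;
    RESET enable_nestloop;
    RESET random_page_cost;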

It is worth noting that for many small tables the nestloop
is indeed marginally faster (in the "doesn't really matter
because both cases are fast enough" sense).  But the regression
for the big tables (big in the sense that the index just fits into
memory, but in practice might not stay there because there are other
frequently accessed big tables) is a show stopper.

For some of those tables we have also recently (as part
of the move) clustered them for the first time in ages, and it was
speculated that this might have changed statistics (such
as correlation) and increased the attractiveness of the
index scan to the planner.
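
The statistic in question is easy to inspect after an ANALYZE (a sketch;
table and column names are placeholders):

    -- pg_stats.correlation approaches 1.0 (or -1.0) when the physical row
    -- order matches the column's sort order, which is exactly what CLUSTER
    -- produces and what makes index scans look cheaper to the planner.
    ANALYZE big_table;
    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename = 'big_table'
      AND attname = 'id';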

Another thing that I have thought might provide some
enlightenment would be a
  explain log select ...
that would show all the sub-plans considered and why they
were discarded, or something approximating this.
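
No such option exists, but the closest approximation today is to disable
the chosen plan type for one session and compare the estimated costs the
planner reports (a sketch; table names are placeholders):

    -- First the plan the planner picks on its own, then, with nested loops
    -- disabled, the estimated cost of the alternative it passed over.
    EXPLAIN SELECT count(*)
    FROM big_table b JOIN other_table o ON o.big_id = b.id;

    SET enable_nestloop = off;
    EXPLAIN SELECT count(*)
    FROM big_table b JOIN other_table o ON o.big_id = b.id;
    RESET enable_nestloop;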

Thanks in advance for any reply and sorry that this email
turned out to be rather long stream of consciousness dump.

Bene

------ relevant parts of our postgresql.conf -----------

shared_buffers = 12GB           # min 128kB
                                # (change requires restart)
temp_buffers = 512MB            # min 800kB
#max_prepared_transactions = 0  # zero disables the feature
                                # (change requires restart)
# Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
# It is not advisable to set max_prepared_transactions nonzero unless you
# actively intend to use prepared transactions.
work_mem = 192MB                # min 64kB
maintenance_work_mem = 1GB      # min 1MB
#max_stack_depth = 2MB          # min 100kB

# - Kernel Resource Usage -

#max_files_per_process = 1000   # min 25
                                # (change requires restart)
#shared_preload_libraries = ''          # (change requires restart)

# - Cost-Based Vacuum Delay -

vacuum_cost_delay = 100ms       # 0-100 milliseconds
vacuum_cost_page_hit = 1        # 0-10000 credits
vacuum_cost_page_miss = 10      # 0-10000 credits
vacuum_cost_page_dirty = 20     # 0-10000 credits
vacuum_cost_limit = 7500        # 1-10000 credits

# - Background Writer -

#bgwriter_delay = 200ms                 # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100    # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0          # 0-10.0 multiplier on buffers scanned/round

# - Asynchronous Behavior -

effective_io_concurrency = 40    # 1-1000. 0 disables prefetching

# WRITE AHEAD LOG

fsync = on                      # turns forced synchronization on or off
synchronous_commit = off        # immediate fsync at commit
#wal_sync_method = fsync        # the default is the first option
                                # supported by the operating system:
                                #   open_datasync
                                #   fdatasync
                                #   fsync
                                #   fsync_writethrough
                                #   open_sync
full_page_writes = on           # recover from partial page writes
wal_buffers = 16MB              # min 32kB
                                # (change requires restart)
#wal_writer_delay = 200ms       # 1-10000 milliseconds

commit_delay = 1000             # range 0-100000, in microseconds
commit_siblings = 5             # range 1-1000

# - Checkpoints -

checkpoint_segments = 128           # in logfile segments, min 1, 16MB each
checkpoint_timeout = 10min          # range 30s-1h
checkpoint_completion_target = 0.9  # checkpoint target duration, 0.0 - 1.0
checkpoint_warning = 30s            # 0 disables

# - Archiving -

archive_mode = off      # allows archiving to be done

# QUERY TUNING

# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

seq_page_cost = 1.0             # measured on an arbitrary scale
random_page_cost = 20.0         # same scale as above
#cpu_tuple_cost = 0.01          # same scale as above
#cpu_index_tuple_cost = 0.005   # same scale as above
#cpu_operator_cost = 0.0025     # same scale as above
effective_cache_size = 32GB     # dpowers: set to 2/3 of available ram

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5                # range 1-10
#geqo_pool_size = 0             # selects default based on effort
#geqo_generations = 0           # selects default based on effort
#geqo_selection_bias = 2.0      # range 1.5-2.0

# - Other Planner Options -

default_statistics_target = 1000   # range 1-10000
#constraint_exclusion = partition  # on, off, or partition
cursor_tuple_fraction = 1.0    # range 0.0-1.0
#from_collapse_limit = 8
#join_collapse_limit = 8        # 1 disables collapsing of explicit
                                # JOIN clauses

Machine:
OS: linux 2.6.32-71.29.1
CPU 12 x Intel(R) Xeon(R) CPU X5680 @ 3.33GHz Cache size: 12288 KB
RAM: 47.12 GB



Re: random_page_cost vs seq_page_cost

From
Robert Haas
Date:
On Thu, Jan 5, 2012 at 5:04 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:
> We are experiencing a big performance regression in some queries.
> In those cases the planner seems to choose a nested loop index
> scan instead of hashing the index once and then joining.

I think you probably need to post EXPLAIN ANALYZE output from the
actual queries to get useful advice, probably to pgsql-performance,
rather than here.
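
A sketch of capturing that in psql (the query is a placeholder):

    -- Send the plan, with real row counts and timings, to a file that can
    -- be attached to a pgsql-performance post.
    \o regressed_query_plan.txt
    EXPLAIN ANALYZE
    SELECT count(*)
    FROM big_table b JOIN other_table o ON o.big_id = b.id;
    \o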

It's hard to believe that enable_nestloop=off is doing anything other
than masking whatever the real problem is, but it's hard to tell what
that problem is based on the information provided.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: random_page_cost vs seq_page_cost

From
Benedikt Grundmann
Date:
On 05/01/12 10:04, Benedikt Grundmann wrote:
> 
> As a counter measure we are experimenting with 
> enable_nestloop = off
> random_page_cost = 20 (instead of the previous 4).
> 
For what it is worth we had to revert the enable_nestloop = off 
change.  It just moved the pain around by making other queries
perform much worse than before. 


Re: random_page_cost vs seq_page_cost

From
Jeremy Harris
Date:
On 2012-01-05 10:04, Benedikt Grundmann wrote:
> I have a question of how to benchmark hardware to determine
> the appropriate ratio of seq_page_cost vs random_page_cost.

It'd be really nice if the DBMS measured actual experienced values......

-- 
Jeremy


Re: random_page_cost vs seq_page_cost

From
Josh Berkus
Date:
On 1/5/12 3:00 PM, Jeremy Harris wrote:
> On 2012-01-05 10:04, Benedikt Grundmann wrote:
>> I have a question of how to benchmark hardware to determine
>> the appropriate ratio of seq_page_cost vs random_page_cost.
> 
> It'd be really nice if the DBMS measured actual experienced values......

Certainly it would.  It would also require a whole lot of
instrumentation.  Feel free to write some ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: random_page_cost vs seq_page_cost

From
Peter Eisentraut
Date:
On tor, 2012-01-05 at 10:04 +0000, Benedikt Grundmann wrote:
> We have recently upgraded two of our biggest postgres databases 
> to new hardware and minor version number bump (8.4.5 -> 8.4.9).
> 
> We are experiencing a big performance regression in some queries.
> In those cases the planner seems to choose a nested loop index
> scan instead of hashing the index once and then joining.

There was a planner regression introduced in version 8.4.8, which was
thought to be fixed in 8.4.9.  Maybe you got caught by that.  See

Message-Id: <760C0206-B5F4-4DC6-9296-B7A730B7F403@silentmedia.com>

for some information.  Check if your queries match that pattern.



Re: random_page_cost vs seq_page_cost

From
Benedikt Grundmann
Date:
On 07/01/12 23:01, Peter Eisentraut wrote:
> On tor, 2012-01-05 at 10:04 +0000, Benedikt Grundmann wrote:
> > We have recently upgraded two of our biggest postgres databases 
> > to new hardware and minor version number bump (8.4.5 -> 8.4.9).
> > 
> > We are experiencing a big performance regression in some queries.
> > In those cases the planner seems to choose a nested loop index
> > scan instead of hashing the index once and then joining.
> 
> There was a planner regression introduced in version 8.4.8, which was
> thought to be fixed in 8.4.9.  Maybe you got caught by that.  See
> 
> Message-Id: <760C0206-B5F4-4DC6-9296-B7A730B7F403@silentmedia.com>
> 
> for some information.  Check if your queries match that pattern.

Good idea.  But that is not it.  We checked by using 8.4.5 on
the new hardware (and the new database) which produced the same
(bad) plans as 8.4.10 (with both the old and the new postgres config).

We are again speculating that it might be:
> > For some of those tables we have also recently (as part
> > of the move) clustered for the first time in ages and it was
> > speculated that that might have changed statistics (such
> > as correlation) and increased the attractiveness of the
> > index scan to the planner.

Is that possible?  If so what is the best way to prove / disprove
this theory? And ideally if true what knobs are available to tune
this?
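
One way to test the theory without touching the production tables is a
controlled comparison on scratch copies (a sketch; every name here is a
placeholder):

    -- Build one copy in index order (as CLUSTER leaves the real table) and
    -- one in effectively random order, gather statistics, and compare plans.
    CREATE TABLE big_clustered AS SELECT * FROM big_table ORDER BY id;
    CREATE TABLE big_shuffled  AS SELECT * FROM big_table ORDER BY random();
    CREATE INDEX big_clustered_id_idx ON big_clustered (id);
    CREATE INDEX big_shuffled_id_idx  ON big_shuffled (id);
    ANALYZE big_clustered;
    ANALYZE big_shuffled;

    EXPLAIN SELECT count(*) FROM big_clustered c JOIN other_table o ON o.big_id = c.id;
    EXPLAIN SELECT count(*) FROM big_shuffled  s JOIN other_table o ON o.big_id = s.id;

If the clustered copy gets the nested-loop-with-index-scan plan and the
shuffled copy does not, the changed correlation statistic is a likely
culprit.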

Thanks,

Bene


Re: random_page_cost vs seq_page_cost

From
Greg Smith
Date:
On 1/5/12 5:04 AM, Benedikt Grundmann wrote:
> I have a question of how to benchmark hardware to determine
> the appropriate ratio of seq_page_cost vs random_page_cost.
>
> Emails in this mailing lists archive seem to indicate that
> 1.0 vs 3.0 - 4.0 are appropriate values on modern hardware.
>
> Which surprised me a bit as I had thought that on actual
> harddrives (ignoring SSDs) random_page_cost is higher.
> I guess that the number tries to reflect caching of the
> relevant pages in memory and modern hardware you have
> more of that?

That sort of thing is one reason why all attempts so far to set 
random_page_cost based on physical characteristics haven't gone anywhere 
useful.  The setting is sort of overloaded right now, it's a fuzzy mix 
of true random seek cost blended with some notion of cache percentage. 
Trying to bring some measurements to bear on it is a less effective 
approach than what people actually do here.  Monitor the profile of 
query execution, change the value, see what happens.  Use that as 
feedback for what direction to keep going; repeat until you're just 
spinning with no improvements.
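
In practice that loop is just session-level SETs around the problem
queries, watching where the plan flips and what that does to runtimes (a
sketch; values and table names are illustrative):

    SET random_page_cost = 2;
    EXPLAIN ANALYZE SELECT count(*) FROM big_table b JOIN other_table o ON o.big_id = b.id;
    SET random_page_cost = 4;
    EXPLAIN ANALYZE SELECT count(*) FROM big_table b JOIN other_table o ON o.big_id = b.id;
    SET random_page_cost = 8;
    EXPLAIN ANALYZE SELECT count(*) FROM big_table b JOIN other_table o ON o.big_id = b.id;
    RESET random_page_cost;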

It's easy to measure the actual read times and set the value based on 
that instead.  But that doesn't actually work out so well.  There are at 
least three problems in that area:

-Timing information is sometimes very expensive to collect.  This I 
expect to at least document and usefully quantify as a 9.2 feature.

-Basing query execution decisions on what is already in the cache leads 
to all sorts of nasty feedback situations where you optimize for the 
short term, for example using an index already in cache, while never 
reading in what would be a superior long term choice because it seems 
too expensive.

-Making a major adjustment to the query planning model like this would 
require a large performance regression testing framework to evaluate the 
results in.

> We are not sure if the database used to choose differently
> before the move to the new hardware and the hardware is
> performing worse for random seeks.  Or if the planner is
> now making different choices.

I don't recommend ever deploying new hardware without first doing some 
low-level benchmarks to validate its performance.  Once stuff goes into 
production, you can't do that anymore.  See 
http://www.2ndquadrant.com/en/talks/ for my hardware benchmarking talks 
if you'd like some ideas on what to collect.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: random_page_cost vs seq_page_cost

From
Benedikt Grundmann
Date:
(replying just to you)
On 10/01/12 15:22, Greg Smith wrote:
> On 1/5/12 5:04 AM, Benedikt Grundmann wrote:
> That sort of thing is one reason why all attempts so far to set
> random_page_cost based on physical characteristics haven't gone
> anywhere useful.  The setting is sort of overloaded right now, it's a
> fuzzy mix of true random seek cost blended with some notion of cache
> percentage. Trying to bring some measurements to bear on it is a less
> effective approach than what people actually do here.  Monitor the
> profile of query execution, change the value, see what happens.  Use
> that as feedback for what direction to keep going; repeat until
> you're just spinning with no improvements.
> 
Thank you very much for the reply, it is very interesting.  I'm
excited to hear that documentation in that area will improve in
9.2.  It's interesting: postgres has remarkably good documentation,
but it is a sufficiently complex system that to sensibly tune
the knobs provided you have to understand quite a lot about
what is going on.  A colleague of mine likes to say
"all abstractions leak", which seems very appropriate in this case.

> >We are not sure if the database used to choose differently
> >before the move to the new hardware and the hardware is
> >performing worse for random seeks.  Or if the planner is
> >now making different choices.
> 
> I don't recommend ever deploying new hardware without first doing
> some low-level benchmarks to validate its performance.  Once stuff
> goes into production, you can't do that anymore.  See
> http://www.2ndquadrant.com/en/talks/ for my hardware benchmarking
> talks if you'd like some ideas on what to collect.
> 
We had actually done lots of tests on the sequential read performance.
But you are right we could have done better (and I'll definitely read
through your talks).

Did you see my follow-up?  Based on the feedback we did further tests
and it is now clear that neither the hardware nor the database version
is at fault.  A different plan is chosen by both the new and the old
database version if spun up on the database as it is right now.

Our best guess is that the clusters we ran after we had moved to the
new hardware (it having more disk space and faster sequential I/O,
making that possible) changed the planner's perception of how the joins
will perform in relation to each other.

Cheers,

Bene


Re: random_page_cost vs seq_page_cost

From
Benedikt Grundmann
Date:
On 11/01/12 08:26, Benedikt Grundmann wrote:
> (replying just to you)
Clearly I didn't.  Sigh. Getting myself a coffee now.


Re: random_page_cost vs seq_page_cost

From
Bruce Momjian
Date:
On Wed, Jan 11, 2012 at 08:26:52AM +0000, Benedikt Grundmann wrote:
> (replying just to you)
> On 10/01/12 15:22, Greg Smith wrote:
> > On 1/5/12 5:04 AM, Benedikt Grundmann wrote:
> > That sort of thing is one reason why all attempts so far to set
> > random_page_cost based on physical characteristics haven't gone
> > anywhere useful.  The setting is sort of overloaded right now, it's a
> > fuzzy mix of true random seek cost blended with some notion of cache
> > percentage. Trying to bring some measurements to bear on it is a less
> > effective approach than what people actually do here.  Monitor the
> > profile of query execution, change the value, see what happens.  Use
> > that as feedback for what direction to keep going; repeat until
> > you're just spinning with no improvements.
> >
> Thank you very much for the reply it is very interesting.  I'm
> excited to hear that documentation in that area will improve in
> 9.2.  It's interesting: postgres has remarkably good documentation,
> but it is a sufficiently complex system that to sensibly tune
> the knobs provided you have to understand quite a lot about
> what is going on.  A colleague of mine likes to say
> "all abstractions leak", which seems very appropriate in this case.

Where did you see that there will be an improvement in the 9.2
documentation?  I don't see an improvement.

I looked over the random_page_cost documentation and remembered I was
always concerned about how vague it was about caching effects, so I
wrote the attached doc patch to explicitly state it.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: random_page_cost vs seq_page_cost

From
Greg Smith
Date:
On 02/07/2012 03:23 PM, Bruce Momjian wrote:
> Where did you see that there will be an improvement in the 9.2
> documentation?  I don't see an improvement.

I commented that I'm hoping for an improvement in the documentation of 
how much timing overhead impacts attempts to measure this area better.  
That's from the "add timing of buffer I/O requests" feature submission.  
I'm not sure if Bene read too much into that or not; I didn't mean to 
imply that the docs around random_page_cost have gotten better.

This particular complaint is extremely common though, seems to pop up on 
one of the lists a few times each year.  Your suggested doc fix is fine 
as a quick one, but I think it might be worth expanding further on this 
topic.  Something discussing SSDs seems due here too.  Here's a first 
draft of a longer discussion, to be inserted just after where it states 
the default value is 4.0:

True random access to mechanical disk storage will normally be more 
expensive than this default suggests.  The value used is lower to 
reflect caching effects.  Some common random accesses to disk, such as 
indexed reads, are considered likely to be in cache.  The default value 
can be thought of as modeling random access as 40 times as expensive as 
sequential, while expecting that 90% of random reads will actually be 
cached.

If you believe a high cache rate is an incorrect assumption for your 
workload, you might increase random_page_cost to closer reflect the true 
cost of random reads against your storage.  Correspondingly, if your 
data is likely to be completely cached, such as when the database is 
smaller than the total memory in the server, decreasing random_page_cost 
can be appropriate.  Storage where the true cost of random reads is low, 
such as solid-state drives and similar memory-based devices, might also 
find lower values of random_page_cost better reflect the real-world cost 
of that operation.

===

I think of the value as being more like 80 times as expensive and a 95% 
hit rate, but the above seems more likely to turn into understandable 
math to a first-time reader of this section.  I stopped just short of 
recommending a value for the completely cached case.  I normally use 
1.01 there; I know others prefer going fully to 1.0 instead.  That 
argument seems like it could rage on for some time.
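
The arithmetic behind both readings works out the same (a quick check,
not part of the proposed doc text):

    -- Expected cost per random page ~= raw random/sequential ratio times the
    -- fraction of random reads that actually hit disk (cached reads treated
    -- as roughly free).  Both models land on the default of 4.0.
    SELECT 40 * (1 - 0.90) AS forty_x_90pct_cached,
           80 * (1 - 0.95) AS eighty_x_95pct_cached;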

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: random_page_cost vs seq_page_cost

From
Bruce Momjian
Date:
On Tue, Feb 07, 2012 at 05:06:18PM -0500, Greg Smith wrote:
> On 02/07/2012 03:23 PM, Bruce Momjian wrote:
> >Where did you see that there will be an improvement in the 9.2
> >documentation?  I don't see an improvement.
> 
> I commented that I'm hoping for an improvement in the documentation
> of how much timing overhead impacts attempts to measure this area
> better.  That's from the "add timing of buffer I/O requests" feature
> submission.  I'm not sure if Bene read too much into that or not; I
> didn't mean to imply that the docs around random_page_cost have
> gotten better.
> 
> This particular complaint is extremely common though, seems to pop
> up on one of the lists a few times each year.  Your suggested doc
> fix is fine as a quick one, but I think it might be worth expanding
> further on this topic.  Something discussing SSDs seems due here
> too.  Here's a first draft of a longer discussion, to be inserted
> just after where it states the default value is 4.0:

I was initially concerned that tuning advice in this part of the docs
would look out of place, but now see the 25% shared_buffers
recommendation, and it looks fine, so we are OK.  (Should we caution
against more than 8GB of shared buffers?  I don't see that in the docs.)

I agree we are overdue for a better explanation of random page cost, so
I agree with your direction.  I did a little word-smithing to tighten up
your text;  feel free to discard what you don't like:
    Random access to mechanical disk storage is normally much more expensive
    than four-times sequential access.  However, a lower default is used
    (4.0) because the majority of random accesses to disk, such as indexed
    reads, are assumed to be in cache.  The default value can be thought of
    as modeling random access as 40 times slower than sequential, while
    expecting 90% of random reads to be cached.

    If you believe a 90% cache rate is an incorrect assumption
    for your workload, you can increase random_page_cost to better
    reflect the true cost of random storage reads. Correspondingly,
    if your data is likely to be completely in cache, such as when
    the database is smaller than the total server memory, decreasing
    random_page_cost can be appropriate.  Storage that has a low random
    read cost relative to sequential, e.g. solid-state drives, might
    also be better modeled with a lower value for random_page_cost.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: random_page_cost vs seq_page_cost

From
Benedikt Grundmann
Date:
On 07/02/12 19:58, Bruce Momjian wrote:
> On Tue, Feb 07, 2012 at 05:06:18PM -0500, Greg Smith wrote:
> > On 02/07/2012 03:23 PM, Bruce Momjian wrote:
> > >Where did you see that there will be an improvement in the 9.2
> > >documentation?  I don't see an improvement.
> > 
> > I commented that I'm hoping for an improvement in the documentation
> > of how much timing overhead impacts attempts to measure this area
> > better.  That's from the "add timing of buffer I/O requests" feature
> > submission.  I'm not sure if Bene read too much into that or not; I
> > didn't mean to imply that the docs around random_page_cost have
> > gotten better.

I guess I did.  But I'm very glad that as a side effect Bruce and Greg 
have improved it ;-)


Re: random_page_cost vs seq_page_cost

From
Jeff Janes
Date:
On Tue, Feb 7, 2012 at 2:06 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 02/07/2012 03:23 PM, Bruce Momjian wrote:
>>
>> Where did you see that there will be an improvement in the 9.2
>> documentation?  I don't see an improvement.
>
>
> I commented that I'm hoping for an improvement in the documentation of how
> much timing overhead impacts attempts to measure this area better.  That's
> from the "add timing of buffer I/O requests" feature submission.  I'm not
> sure if Bene read too much into that or not; I didn't mean to imply that the
> docs around random_page_cost have gotten better.
>
> This particular complaint is extremely common though, seems to pop up on one
> of the lists a few times each year.  Your suggested doc fix is fine as a
> quick one, but I think it might be worth expanding further on this topic.
>  Something discussing SSDs seems due here too.  Here's a first draft of a
> longer discussion, to be inserted just after where it states the default
> value is 4.0:
>
> True random access to mechanical disk storage will normally be more
> expensive than this default suggests.  The value used is lower to reflect
> caching effects.  Some common random accesses to disk, such as indexed
> reads, are considered likely to be in cache.  The default value can be
> thought of as modeling random access as 40 times as expensive as sequential,
> while expecting that 90% of random reads will actually be cached.

For these numbers to work out to 4, we must also be assuming that
virtually zero of the sequentially read pages are cached.  Is that a
realistic assumption?  If the table is accessed only by seq scans, the
ring buffer may prevent it from getting cached (although even then it
could very well be the OS cache as that doesn't respect the ring
buffer), but it would be pretty common for other parts of the
application to access the same table via index scan, and so cause
substantial parts of it to be cached.

But I can see that that would rapidly get too complicated to discuss
in the documentation.
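
One rough way to check that assumption on a live system is the per-table
buffer hit counters (a sketch; note this only sees PostgreSQL's own cache,
not the OS cache):

    -- Tables that are mostly seq-scanned but still show a high hit ratio
    -- would undermine the "sequential reads are uncached" assumption.
    SELECT relname,
           heap_blks_read,
           heap_blks_hit,
           round(heap_blks_hit::numeric
                 / NULLIF(heap_blks_hit + heap_blks_read, 0), 2) AS hit_ratio
    FROM pg_statio_user_tables
    ORDER BY heap_blks_read + heap_blks_hit DESC
    LIMIT 10;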

Cheers,

Jeff


Re: random_page_cost vs seq_page_cost

From
Jeff Janes
Date:
On Tue, Feb 7, 2012 at 4:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
> I was initially concerned that tuning advice in this part of the docs
> would look out of place, but now see the 25% shared_buffers
> recommendation, and it looks fine, so we are OK.  (Should we caution
> against more than 8GB of shared buffers?  I don't see that in the docs.)

Has it ever been well-characterized what the problem is with >8GB?
I've used shared buffers above that size for testing purposes and
could never provoke a problem with it.

Cheers,

Jeff


Re: random_page_cost vs seq_page_cost

From
Greg Smith
Date:
On 02/11/2012 07:53 PM, Jeff Janes wrote:
> Has it ever been well-characterized what the problem is with >8GB?
> I've used shared buffers above that size for testing purposes and
> could never provoke a problem with it.

If anyone ever manages to characterize it well, we might actually make 
progress on isolating and fixing it.  All we have so far are a couple of 
application-level test results suggesting a higher value caused 
performance to drop.  The first public one I remember was from Jignesh; 
http://archives.postgresql.org/pgsql-performance/2008-02/msg00184.php 
has him describing where he found the Solaris roll-off to be.  What 
we really need to stomp this one down is for someone to find the same 
thing, then show profiler output in each case.  Note that Jignesh's report 
included a significant amount of filesystem-level tuning, using things 
like more direct I/O, and that might be a necessary requirement to run 
into the exact variant of this limitation he mentioned.

I haven't spent a lot of time looking for this problem myself.  What 
I've heard second-hand from more than one person now is a) larger 
settings than 8GB can be an improvement for some people still, and b) 
simple benchmarks don't always have this problem.  I have noted that the 
few public and private reports I've gotten all suggest problems show up 
on benchmarks of more complicated workloads.  I think Jignesh mentioned 
this being obvious in the more complicated TPC-derived benchmarks, not 
in simple things like pgbench.  I may be misquoting him though.  And 
given that one of the possible causes for this was an excess of some 
lock contention, it's quite possible this one is already gone from 9.2, 
given the large number of lock related issues that have been squashed so 
far in this release.

All of those disclaimers are why I think no one has pushed to put a note 
about this in the official docs.  Right now the only suggested limit is 
this one:

"The useful range for shared_buffers on Windows systems is generally 
from 64MB to 512MB."

The most common practical limit I've run into with large shared_buffers 
settings hits earlier than 8GB:  running into checkpoint spike issues.  
I have installs that started with shared_buffers in the 4 to 8GB range, 
where we saw badly spiking I/O at checkpoint sync time.  Lowering the 
database's cache can result in smarter writing decisions within the OS, 
improving latency--even though total writes are actually higher if you 
measure what flows from the database to OS.  That side of the latency 
vs. throughput trade-off existing is one of the main reasons I haven't 
gone chasing after problems with really large shared_buffers settings.
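
A cheap way to watch for that while experimenting with shared_buffers is
the background writer statistics view (a sketch; snapshot it before and
after a test run and diff the counters):

    -- A big jump in buffers_checkpoint that lines up with latency spikes
    -- points at checkpoint sync pain rather than cache sizing as such.
    SELECT checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,
           buffers_clean,
           buffers_backend
    FROM pg_stat_bgwriter;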

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: random_page_cost vs seq_page_cost

From
Bruce Momjian
Date:
On Tue, Feb 07, 2012 at 07:58:28PM -0500, Bruce Momjian wrote:
> I was initially concerned that tuning advice in this part of the docs
> would look out of place, but now see the 25% shared_buffers
> recommendation, and it looks fine, so we are OK.  (Should we caution
> against more than 8GB of shared buffers?  I don't see that in the docs.)
> 
> I agree we are overdue for a better explanation of random page cost, so
> I agree with your direction.  I did a little word-smithing to tighten up
> your text;  feel free to discard what you don't like:
> 
>     Random access to mechanical disk storage is normally much more expensive
>     than four-times sequential access.  However, a lower default is used
>     (4.0) because the majority of random accesses to disk, such as indexed
>     reads, are assumed to be in cache.  The default value can be thought of
>     as modeling random access as 40 times slower than sequential, while
>     expecting 90% of random reads to be cached.
>     
>     If you believe a 90% cache rate is an incorrect assumption
>     for your workload, you can increase random_page_cost to better
>     reflect the true cost of random storage reads. Correspondingly,
>     if your data is likely to be completely in cache, such as when
>     the database is smaller than the total server memory, decreasing
>     random_page_cost can be appropriate.  Storage that has a low random
>     read cost relative to sequential, e.g. solid-state drives, might
>     also be better modeled with a lower value for random_page_cost.

Patch applied for random_page_cost docs.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +