Re: shared_buffers documentation - Mailing list pgsql-hackers

From Greg Smith
Subject Re: shared_buffers documentation
Date
Msg-id 4BC91332.6060702@2ndquadrant.com
In response to Re: shared_buffers documentation  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: shared_buffers documentation  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert Haas wrote:
> Well, why can't they just hang out as dirty buffers in the OS cache,
> which is also designed to solve this problem?
>   

If the OS were guaranteed to be as suitable for this purpose as the 
approach taken in the database, this might work.  But just as the 
clock-sweep approach should outperform a simpler OS caching 
implementation on many common workloads, there are a couple of spots 
where making dirty writes the OS's problem can fall down:

1) It presumes that OS write coalescing will solve the problem for you 
by merging repeated writes, which, depending on the implementation, it 
might not.

2) On some filesystems, such as ext3, any write with an fsync behind it 
will flush the whole write cache out and defeat this optimization.  
Since the spread checkpoint design has some such writes going to the 
data disk in the middle of the currently processing checkpoint, in those 
situations the first write of a block is likely to be pushed to disk 
before it can be combined with a second.  Had you kept it in the 
database's buffer cache, it might have survived as long as a full 
checkpoint cycle longer.

3) The "timeout", as it were, for shared buffers is driven by the 
distance between checkpoints, typically as long as 5 minutes.  The 
longest a filesystem will hold onto a write is probably less.  On Linux 
it's typically at most 30 seconds before the OS considers a write 
important to get out to disk, and if you've already filled a lot of RAM 
with writes it can be substantially less.
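Point 1 can be made concrete with a toy write-back cache (a hypothetical 
sketch, not PostgreSQL or kernel code): repeated writes to the same block 
are absorbed in memory and only hit the device once at flush time, unless 
something forces an early flush between them.

```python
# Toy write-back cache: a hypothetical sketch of write coalescing,
# not PostgreSQL or kernel code.
class WriteCache:
    def __init__(self):
        self.dirty = {}            # block number -> latest contents
        self.physical_writes = 0   # device writes actually issued

    def write(self, block, data):
        # Overwrite in place: a repeated write to the same block
        # coalesces with the earlier one and costs no extra I/O.
        self.dirty[block] = data

    def flush(self):
        # One device write per dirty block, regardless of how many
        # logical writes each block absorbed.
        self.physical_writes += len(self.dirty)
        self.dirty.clear()
```

Two logical writes to block 7 plus one to block 8 cost two physical 
writes after a single flush; an fsync landing between the writes to 
block 7 (point 2) would turn that into three.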
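The 30-second figure in point 3 comes from the kernel's dirty-page 
expiry tunable, which is in centiseconds; here's a quick comparison 
against the checkpoint interval (a sketch assuming Linux defaults, which 
vary by distribution):

```python
# Compare how long each layer will sit on a dirty block, assuming the
# usual Linux default (dirty_expire_centisecs = 3000); values vary by
# system, so fall back to that default if the file isn't readable.
def read_centisecs(path, default):
    try:
        with open(path) as f:
            return int(f.read())
    except (OSError, ValueError):
        return default

expire_s = read_centisecs("/proc/sys/vm/dirty_expire_centisecs", 3000) // 100
checkpoint_s = 5 * 60    # PostgreSQL's default checkpoint_timeout of 5min

print(f"OS writeback deadline: ~{expire_s}s")
print(f"Checkpoint interval:   up to {checkpoint_s}s")
```

With the defaults that's 30 seconds versus 300: a dirty block can live 
roughly ten times longer in shared_buffers than in the OS cache, and 
memory pressure shortens the OS side further.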

> I guess the obvious question is whether Windows "doesn't need" more
> shared memory than that, or whether it "can't effectively use" more
> memory than that.
>   

It's probably "can't effectively use".  We know for a fact that 
applications where blocks regularly accumulate high usage counts and 
see repeated reads and writes to them, which includes pgbench, benefit 
in several easy-to-measure ways from a larger database buffer cache; 
there's simply less churn of buffers in and out of it.  The alternate 
explanation, "Windows is just so much better at read/write caching that 
you should give it most of the RAM anyway", doesn't sound as probable 
as the more commonly proposed theory, "Windows doesn't handle large 
blocks of shared memory well".
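The usage-count behavior above is the clock-sweep algorithm; here's a 
minimal sketch of the idea (illustrative only: the real implementation 
caps usage_count at 5, handles pinned buffers, and looks nothing like 
this Python):

```python
# Minimal clock-sweep sketch; illustrative only, not PostgreSQL code.
class ClockSweep:
    MAX_USAGE = 5                           # PostgreSQL caps usage_count at 5

    def __init__(self, nbuffers):
        self.pages = [None] * nbuffers      # page tag held by each buffer
        self.usage = [0] * nbuffers         # per-buffer usage count
        self.index = {}                     # page tag -> buffer slot
        self.hand = 0                       # clock hand position

    def access(self, page):
        slot = self.index.get(page)
        if slot is not None:                # hit: bump the usage count
            self.usage[slot] = min(self.usage[slot] + 1, self.MAX_USAGE)
            return slot
        # Miss: sweep the clock hand, decrementing counts until a
        # buffer with usage 0 is found; that buffer is the victim.
        while self.usage[self.hand] > 0:
            self.usage[self.hand] -= 1
            self.hand = (self.hand + 1) % len(self.pages)
        slot = self.hand
        if self.pages[slot] is not None:
            del self.index[self.pages[slot]]
        self.pages[slot], self.usage[slot] = page, 1
        self.index[page] = slot
        self.hand = (self.hand + 1) % len(self.pages)
        return slot
```

A frequently accessed page builds up a usage count that the sweep must 
decrement to zero before the page can be evicted, so a hot pgbench block 
survives churn that would push it out of a plain LRU.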

Note that there's no discussion of the why behind this in the commit 
you just did, just a description of what happens.  The reasons why are 
left undefined, which I feel is appropriate given that we really don't 
know for sure.  I'm still waiting for somebody to let loose the Visual 
Studio profiler and measure what's causing the degradation at larger 
sizes.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


