Thread: PG 8.3 and large shared buffer settings

PG 8.3 and large shared buffer settings

From: Dan Sugalski
Is there any practical limit to the number of shared buffers PG 8.3.7
can handle before more becomes counter-productive? I remember that the
buffer management algorithm used to get unhappy with too many buffers,
and that past a certain point performance dropped as extra memory was
pitched at Postgres.

My production DB's around 200G, and the box hosting it has 192G of
memory on it, running a 64 bit AIX build of 8.3.7. I'm currently
running with 15G of shared buffers, and while performance is pretty
good, things still hit the disk more than I'd like. I can easily bump
the shared buffer setting up to 40G or so without affecting anything
else that matters.

The box runs other things as well as the database, so the OS buffer
cache tends to get effectively flushed -- permanently pinning more of
the database in memory would be an overall win for DB performance,
assuming bad things don't happen because of buffer management.
(Unfortunately I've only got a limited window to bounce the server,
so I can't do too much in the way of experimentation with buffer
sizing.)
--
                Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                       teddy bears get drunk

Re: PG 8.3 and large shared buffer settings

From: Tom Lane
Dan Sugalski <dan@sidhe.org> writes:
> Is there any practical limit to the number of shared buffers PG 8.3.7
> can handle before more becomes counter-productive?

Probably, but I've not heard any definitive measurements showing an
upper limit.  The traditional wisdom of limiting it to 1G or so dates
from before the last rounds of revisions to the bufmgr logic.

> My production DB's around 200G, and the box hosting it has 192G of
> memory on it, running a 64 bit AIX build of 8.3.7.

Yowza.  You might be able to do measurements that no one has done
before.  Let us know what you find out.

BTW, does AIX have any provision for locking shared memory into RAM?
One of the gotchas for large shared memory has always been the risk
that the kernel would decide to swap some of it out.
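
For illustration, here is a minimal sketch of the kind of call involved,
assuming a SysV shared memory segment.  SHM_LOCK is the Linux spelling of
the request; on AIX the equivalent is reportedly the SHM_PIN flag to
shmget() together with the "vmo -o v_pinshm=1" tunable -- treat the AIX
details as an assumption to verify against the local man pages.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int
main(void)
{
    /* Create a throwaway 16MB segment just to demonstrate the call. */
    int shmid = shmget(IPC_PRIVATE, 16 * 1024 * 1024, IPC_CREAT | 0600);

    if (shmid < 0)
    {
        perror("shmget");
        return 1;
    }

    /* Ask the kernel never to swap this segment out (needs privilege). */
    if (shmctl(shmid, SHM_LOCK, NULL) < 0)
        perror("shmctl(SHM_LOCK)");

    shmctl(shmid, IPC_RMID, NULL);      /* clean up the demo segment */
    return 0;
}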

            regards, tom lane

Re: PG 8.3 and large shared buffer settings

From: Dan Sugalski
At 12:36 AM -0400 9/25/09, Tom Lane wrote:
>Dan Sugalski <dan@sidhe.org> writes:
>>  Is there any practical limit to the number of shared buffers PG 8.3.7
>>  can handle before more becomes counter-productive?
>
>Probably, but I've not heard any definitive measurements showing an
>upper limit.  The traditional wisdom of limiting it to 1G or so dates
>from before the last rounds of revisions to the bufmgr logic.

Excellent.

>  > My production DB's around 200G, and the box hosting it has 192G of
>>  memory on it, running a 64 bit AIX build of 8.3.7.
>
>Yowza.  You might be able to do measurements that no one has done
>before.  Let us know what you find out.

:) It's a machine of non-trivial size, to be sure. I'll give the
buffer setting a good bump and see how it goes. I may be able to take
one of the slony replicas off-line the next holiday and run some
performance tests, but that won't be for a while.

>BTW, does AIX have any provision for locking shared memory into RAM?
>One of the gotchas for large shared memory has always been the risk
>that the kernel would decide to swap some of it out.

I'll have to go check, but I think it does. This box hasn't actually
hit swap since it started -- a good chunk of that RAM is used as
semi-permanent disk cache, but unfortunately, with the regular
day-to-day use of this box for other stuff (they won't let me have it
as a dedicated DB-only machine. Go figure :) the cache tends to turn
over pretty quickly.
--
                Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                       teddy bears get drunk

Re: PG 8.3 and large shared buffer settings

From: Aidan Van Dyk
* Dan Sugalski <dan@sidhe.org> [090925 06:06]:

> I'll have to go check, but I think it does. This box hasn't actually hit
> swap since it started -- a good chunk of that RAM is used as semi-permanent
> disk cache, but unfortunately, with the regular day-to-day use of this box
> for other stuff (they won't let me have it as a dedicated DB-only machine.
> Go figure :) the cache tends to turn over pretty quickly.

All the more reason to find a way to use it all as shared buffers and
lock it into ram...

Oh, sorry, you expect the DB to play nice with everything else?

;-)

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: PG 8.3 and large shared buffer settings

From: Scott Carey
That won't work well anyway, because the postgres shared_buffers does not
cache things that are sequentially scanned (it uses a ring buffer for each
scan).  So, for any data that is only accessed by sequential scan, you're
relying on the OS and the disks.  If you access a table via index scan,
though, all its pages will go through shared_buffers.

Size shared_buffers to no more than the 'hot' set of index and randomly
accessed data.
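
To illustrate the ring buffer mentioned above, here is a hypothetical
sketch (not the real code from src/backend/storage/buffer/freelist.c; if I
recall correctly the 8.3-era seqscan ring is 256kB, i.e. 32 8kB buffers):

#include <stdio.h>

#define RING_SIZE 32            /* 32 x 8kB pages = 256kB scan ring */

typedef struct ScanRing
{
    int     buffers[RING_SIZE]; /* buffer ids owned by this scan */
    int     next;               /* next slot to reuse */
} ScanRing;

/* Pick the buffer the next page gets read into: just cycle the ring. */
static int
ring_next_buffer(ScanRing *ring)
{
    int     buf = ring->buffers[ring->next];

    ring->next = (ring->next + 1) % RING_SIZE;
    return buf;
}

int
main(void)
{
    ScanRing    ring = {{0}, 0};

    for (int i = 0; i < RING_SIZE; i++)
        ring.buffers[i] = 1000 + i;     /* pretend these were allocated */

    /*
     * "Scan" 1000 pages: only 32 distinct buffers ever get touched, so
     * the rest of shared_buffers is left alone.
     */
    for (int page = 0; page < 1000; page++)
        ring_next_buffer(&ring);

    printf("scanned 1000 pages using %d buffers\n", RING_SIZE);
    return 0;
}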


Re: PG 8.3 and large shared buffer settings

From: Aidan Van Dyk
* Scott Carey <scott@richrelevance.com> [090925 11:57]:
> That won't work well anyway, because the postgres shared_buffers does not
> cache things that are sequentially scanned (it uses a ring buffer for each
> scan).  So, for any data that is only accessed by sequential scan, you're
> relying on the OS and the disks.  If you access a table via index scan,
> though, all its pages will go through shared_buffers.

In older versions too, or only since synchronized scans got in?

a.


--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: PG 8.3 and large shared buffer settings

From: Jeff Janes
On Fri, Sep 25, 2009 at 8:53 AM, Scott Carey <scott@richrelevance.com> wrote:
> That won't work well anyway, because the postgres shared_buffers does not cache
> things that are sequentially scanned (it uses a ring buffer for each scan).  So, for
> any data that is only accessed by sequential scan, you're relying on the OS and
> the disks.  If you access a table via index scan though, all its pages will go through
> shared_buffers.

Does it do this even if the block was already in shared_buffers?
Reading the same block into two different buffers seems like a serious
no-no to me.  I thought that the sequential scan would have to break
stride when it encountered a block already in the buffer cache.  But I
haven't looked at the code; maybe I am over-analogizing to other
software I'm familiar with.

Jeff

Re: PG 8.3 and large shared buffer settings

From: Greg Smith
On Fri, 25 Sep 2009, Jeff Janes wrote:

> Does it do this even if the block was already in shared_buffers?

Usually not.  The buffer ring algorithm is used to manage pages that are
read in specifically to satisfy a sequential scan (there's a slightly
different ring method used for VACUUM too).  If the buffer you need is
already available and not "pinned" (locked by someone else), it's not read
from disk again.  Instead, its usage count is incremented only if it's at
zero (this doesn't count as a use unless it's about to be evicted as
unused), and it's returned without being added to the ring.
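
A simplified sketch of that decision, with hypothetical names (the real
logic lives in bufmgr.c and freelist.c; this is only the shape of it):

#include <stdio.h>

typedef struct BufferDesc
{
    int     usage_count;        /* clock-sweep "recency" counter */
    int     pin_count;          /* backends using the buffer right now */
} BufferDesc;

/* Called when a ring-based scan finds its target page already cached. */
static BufferDesc *
use_cached_buffer(BufferDesc *buf)
{
    /*
     * Bump the usage count only if it has fallen to zero, i.e. only if
     * the buffer was about to be evicted as unused.  A bulk scan should
     * not inflate the popularity of pages it merely streams past.
     */
    if (buf->usage_count == 0)
        buf->usage_count = 1;

    buf->pin_count++;           /* pin it while we use it */
    return buf;                 /* note: NOT added to the scan's ring */
}

int
main(void)
{
    BufferDesc  buf = {0, 0};

    use_cached_buffer(&buf);
    printf("usage_count=%d pin_count=%d\n", buf.usage_count, buf.pin_count);
    return 0;
}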

There's a section about this ("Buffer Ring Replacement Strategy") in the
source code:
http://git.postgresql.org/gitweb?p=postgresql.git;a=blob_plain;f=src/backend/storage/buffer/README;hb=HEAD

The commit that added the feature is at
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=ebf3d5b66360823edbdf5ac4f9a119506fccd4c0

The basic flow of this code is that backends ask for buffers using
BufferAlloc, which then calls StrategyGetBuffer (where the ring list is
managed) only if it doesn't first find the page in the buffer cache.  You
get what you'd hope for here:  a sequential scan will use blocks when
they're already available in the cache, while reading in less popular
blocks that weren't cached into the temporary ring area.  There's always
the OS cache backing the PostgreSQL one to handle cases where the working
set you're using is just a bit larger than shared_buffers.  The ring read
requests may very well be satisfied by that too if there was a recent
sequential scan the OS is still caching.

You can read a high-level summary of the algorithm used for ring
management (with an intro to buffer management in general) in my "Inside
the PostgreSQL Buffer Cache" presentation at
http://www.westnet.com/~gsmith/content/postgresql/ on P10 "Optimizations
for problem areas".  That doesn't specifically cover the "what if it's in
the cache already?" case though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: PG 8.3 and large shared buffer settings

From: Greg Smith
On Thu, 24 Sep 2009, Dan Sugalski wrote:

> Is there any practical limit to the number of shared buffers PG 8.3.7 can
> handle before more becomes counter-productive?

There are actually two distinct questions here you should consider,
because the popular wisdom here and what makes sense for your case might
be different.

The biggest shared_buffers tests I've seen come from Sun, where Jignesh
saw that around 10GB was the largest amount of RAM you could give to the
database before it stopped improving performance.  As you guessed, there
is a certain amount of overhead to managing the buffers involved, and as
the size grows the chance you'll run into locking issues or similar
resource contention grows too.

Another problem spot is checkpoints.  If you dirty a very large buffer
cache, that whole thing will have to get dumped to disk eventually, and on
some workloads people have found they have to reduce shared_buffers
specifically to keep this from being too painful.

That's not answering your question though; what it answers is "how large
can shared_buffers get before it's counterproductive compared with giving
the memory to OS to manage?"

The basic design of PostgreSQL presumes that the OS buffer cache exists as
a second-chance source for cached buffers.  The OS cache tends to be
optimized to handle large numbers of buffers well, but without much
memory of what's been used recently to optimize allocations and
evictions.  That split is one reason why shared_buffers shouldn't be most
of the RAM on your system; giving PG a cut and leaving the OS at least as
large a space of its own lets the two cache management schemes complement
each other.

> The box runs other things as well as the database, so the OS buffer cache
> tends to get effectively flushed -- permanently pinning more of the database
> in memory would be an overall win for DB performance, assuming bad things
> don't happen because of buffer management.

This means that the question you want an answer to is "if the OS cache
isn't really available, where does giving memory to shared_buffers become
less efficient than not caching things at all?"  My guess is that this
number is much larger than 10GB, but I don't think anyone has done any
tests to try to quantify exactly where it is.  Typically when people are
talking about systems as large as yours, they're dedicated database
servers at that point, so the OS cache gets considered at the same time.
If it's effectively out of the picture, the spot where caching still helps
even when it's somewhat inefficient due to buffer contention isn't well
explored.

It would depend on the app too.  If you're heavily balanced toward reads
that don't need locks, you can certainly support a larger shared_buffers
than someone who is writing a lot (just the checkpoint impact alone makes
this true, and there are other sources of locking contention).

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: PG 8.3 and large shared buffer settings

From: Gerhard Wiesinger
On Sat, 26 Sep 2009, Greg Smith wrote:

> On Fri, 25 Sep 2009, Jeff Janes wrote:
>
>> Does it do this even if the block was already in shared_buffers?
>
> Usually not.  The buffer ring algorithm is used to manage pages that are read
> in specifically to satisfy a sequential scan (there's a slightly different
> ring method used for VACUUM too).  If the buffer you need is already
> available and not "pinned" (locked by someone else), it's not read from disk
> again.  Instead, its usage count is incremented only if it's at zero (this
> doesn't count as a use unless it's about to be evicted as unused), and it's
> returned without being added to the ring.
>

Hello Greg,

What happens when a postmaster dies (e.g. core dump, kill -9, etc.)?  How
is reference counting cleaned up and the lock removed?

Thnx.

Ciao,
Gerhard

--
http://www.wiesinger.com/

Re: PG 8.3 and large shared buffer settings

From: Pierre Frédéric Caillaud
>> Is there any practical limit to the number of shared buffers PG 8.3.7
>> can handle before more becomes counter-productive?

It is more efficient to have the page in shared buffers, rather than doing
a context switch to the OS, copying the entire page from the OS's cache
into shared buffers, and coming back to postgres. Shared buffers use less
CPU. However, this is totally negligible versus the disk wait time of an
uncached IO.

The same page may be cached once in shared_buffers, and once in the OS
cache, so if your shared_buffers is half your RAM, and the other half is
disk cache, perhaps it won't be optimal: if stuff is cached twice, you can
only cache half as much stuff.
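
As rough math (the 96GB split below is hypothetical for a 192GB box):

\[
\text{distinct cached data} \;=\; S + O - D
\]

where S is shared_buffers, O is the OS cache, and D is the data duplicated
in both.  With S = O = 96GB and full duplication (D = 96GB), only 96GB of
the 192GB of RAM holds distinct data.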

If your entire database can fit in shared buffers, good for you though.
But then a checkpoint comes, and postgres will write all dirty buffers to
disk in the order it finds them in Shared Buffers, which might be totally
different from the on-disk order. If you have enough OS cache left to
absorb these writes, the OS will reorder them. If not, lots of random
writes are going to occur. On a RAID5 this can be a lot of fun.

Re: PG 8.3 and large shared buffer settings

From: Jeff Janes
On Sat, Sep 26, 2009 at 9:57 AM, Gerhard Wiesinger <lists@wiesinger.com> wrote:
> On Sat, 26 Sep 2009, Greg Smith wrote:
>
>> On Fri, 25 Sep 2009, Jeff Janes wrote:
>>
>>> Does it do this even if the block was already in shared_buffers?
>>
>> Usually not.  The buffer ring algorithm is used to manage pages that are
>> read in specifically to satisfy a sequential scan (there's a slightly
>> different ring method used for VACUUM too).  If the buffer you need is
>> already available and not "pinned" (locked by someone else), it's not read
>> from disk again.  Instead, its usage count is incremented only if it's at
>> zero (this doesn't count as a use unless it's about to be evicted as
>> unused), and it's returned without being added to the ring.
>>
>
> Hello Greg,
>
> What happens when a postmaster dies (e.g. core dump, kill -9, etc.)?  How is
> reference counting cleaned up and the lock removed?

If a backend dies in disgrace, the master detects this and the whole
cluster is taken down and brought back up.

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.

(The DETAIL is technically accurate, but somewhat misleading.  If the
crash of another backend happens while your backend is waiting on the
commit record WAL fsync to return, then while the postmaster may have
commanded your session to roll back, it is too late to actually do so,
and when the server comes back up and finishes recovery, you will
probably find that your transaction has indeed committed, assuming you
have some way to accurately deduce this.)


Jeff

Re: PG 8.3 and large shared buffer settings

From: Jeff Janes
On Sat, Sep 26, 2009 at 8:19 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>
> Another problem spot is checkpoints.  If you dirty a very large buffer
> cache, that whole thing will have to get dumped to disk eventually, and on
> some workloads people have found they have to reduce shared_buffers
> specifically to keep this from being too painful.

Hi Greg,

Is this the case even if checkpoint_completion_target is set close to 1.0?

If you dirty buffers fast enough to dirty most of a huge
shared_buffers area between checkpoints, then it seems like lowering
the shared_buffers wouldn't reduce the amount of I/O needed, it would
just shift the I/O from checkpoints to the backends themselves.

It looks like checkpoint_completion_target was introduced in 8.3.0

Cheers,

Jeff

Re: PG 8.3 and large shared buffer settings

From: Greg Smith
On Sat, 26 Sep 2009, Jeff Janes wrote:

> On Sat, Sep 26, 2009 at 8:19 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>>
>> Another problem spot is checkpoints.  If you dirty a very large buffer
>> cache, that whole thing will have to get dumped to disk eventually, and on
>> some workloads people have found they have to reduce shared_buffers
>> specifically to keep this from being too painful.
>
> Is this the case even if checkpoint_completion_target is set close to 1.0?

Sure.  checkpoint_completion_target aims to utilize more of the time
between checkpoints by spreading the writes out over more of that window,
but it alone doesn't change the fact that the window is only so long.  By
default, you're going to get a checkpoint every five minutes, and on active
systems they can come every few seconds if you're not aggressive with
increasing checkpoint_segments.

Some quick math gives an idea of the scale of the problem.  A single cheap
disk can write random I/O (which checkpoint writes often are) at 1-2MB/s;
let's call it 100MB/minute.  That means that in 5 minutes, a single disk
system might be hard pressed to write even 500MB of data out.  But you can
easily dirty 500MB in seconds nowadays.  Now imagine shared_buffers is
40GB and you've dirtied a bunch of it; how long will that take to clear
even on a fast RAID system?  It won't be quick, and the whole system will
grind to a halt at the end of the checkpoint as all the buffered writes
queued up are forced out.
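
Putting those numbers into one worked expression (assuming, for the sake
of the worst case, that the whole 40GB were dirty):

\[
t_{\text{flush}} \;\approx\; \frac{40960\ \text{MB}}{100\ \text{MB/min}}
\;\approx\; 410\ \text{minutes} \;\approx\; 6.8\ \text{hours}
\]

on a single cheap disk; even a RAID array ten times faster would still
need the better part of an hour.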

> If you dirty buffers fast enough to dirty most of a huge shared_buffers
> area between checkpoints, then it seems like lowering the shared_buffers
> wouldn't reduce the amount of I/O needed, it would just shift the I/O
> from checkpoints to the backends themselves.

What's even worse is that backends can be writing data and filling the OS
buffer cache in between checkpoints as well, and all of that is forced to
complete before the checkpoint can finish.  You can easily start the
checkpoint with the whole OS cache filled with backend writes that will
slow the checkpoint's own writes if you're not careful.

Because disks are slow, you need to get writes to the OS as soon as
feasible, so it has more time to work on them, reorder them for efficient
writing, etc.

Ultimately, the sooner you get I/O to the OS cache to write, the better,
*unless* you're going to write that same block over again before it must
go to disk.  Normally you want buffers that aren't accessed often to get
written out to disk early rather than lingering until checkpoint time;
there's nothing wrong with a backend doing a write if that block wasn't
going to be used again soon. The ideal setup from a latency perspective is
that you size shared_buffers just large enough to hold the things you
write to regularly, but not so big that it caches every write.

> It looks like checkpoint_completion_target was introduced in 8.3.0

Correct.  Before then, you had no hope of reducing checkpoint overhead
other than to use very small settings for shared_buffers, particularly if
you cranked the old background writer up so that it wrote lots of redundant
information too (that was the main result of "tuning" it on versions
before 8.3 as well).

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: PG 8.3 and large shared buffer settings

From: Josh Berkus
On 9/26/09 8:19 AM, Greg Smith wrote:
> This means that the question you want an answer to is "if the OS cache
> isn't really available, where does giving memory to shared_buffers
> become less efficient than not caching things at all?"  My guess is
> that this number is much larger than 10GB, but I don't think anyone has
> done any tests to try to quantify exactly where it is.  Typically when
> people are talking about systems as large as yours, they're dedicated
> database servers at that point, so the OS cache gets considered at the
> same time. If it's effectively out of the picture, the spot where
> caching still helps even when it's somewhat inefficient due to buffer
> contention isn't well explored.

It also depends on the filesystem.  In testing at Sun and on this list,
people have found that very large s_b (60% of RAM) plus directIO was
actually a win on Solaris UFS, partly because UFS isn't very aggressive
or smart about readahead and caching.  On Linux/Ext3, however, it was
never a win.

I don't know what AIX's filesystems are like.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

Re: PG 8.3 and large shared buffer settings

From: Hannu Krosing
On Mon, 2009-09-28 at 10:36 -0700, Josh Berkus wrote:
> On 9/26/09 8:19 AM, Greg Smith wrote:
> > This means that the question you want an answer to is "if the OS cache
> > isn't really available, where does giving memory to shared_buffers
> > become less efficient than not caching things at all?"  My guess is
> > that this number is much larger than 10GB, but I don't think anyone has
> > done any tests to try to quantify exactly where it is.  Typically when
> > people are talking about systems as large as yours, they're dedicated
> > database servers at that point, so the OS cache gets considered at the
> > same time. If it's effectively out of the picture, the spot where
> > caching still helps even when it's somewhat inefficient due to buffer
> > contention isn't well explored.
>
> It also depends on the filesystem.  In testing at Sun and on this list,
> people have found that very large s_b (60% of RAM) plus directIO was
> actually a win on Solaris UFS, partly because UFS isn't very agressive
> or smart about readahead and caching.  On Linux/Ext3, however, it was
> never a win.

Again, it depends.

In my recent testing of a simple seqscan on a one-int-column table that
fit entirely in cache (either syscache or shared buffers), the
shared-buffers-only scan was 6% to 10% faster than when the relation
was entirely in the system cache and each page had to be switched in
via a syscall / context switch.

This was on Linux/Ext3, but I suspect this to be mostly independent of
the file system.

Also, in ancient times, when I used an early version of Slony (one
which did not drop and recreate indexes around the initial copy), the
copy time could be 2 to 3 _times_ slower for large tables with lots of
indexes when the indexes were in system cache vs. when they were in
shared buffers (if I remember correctly, it was 1G shared buffers vs. 3G
on a 4G machine).  It was probably due to all kinds of index page splits
etc. which shuffled index pages back and forth a lot between userspace
and syscache.  So this is not an entirely read-only thing either.

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training