lru_multiplier and backend page write-outs

From: Peter Schuller
Hello,

I've had the feeling for a while that the pg_stat_bgwriter statistics
don't work quite the way I have understood them (based on the
excellent [1] and the pg docs).

I am now monitoring a database that has an lru_multiplier of 4.0, a
delay of 200ms and a maxpages of 1000. Current stats:

postgres=# select * from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
               241 |              17 |              72803 |             0 |                0 |           81015 |         81708
(1 row)

This is while the database is undergoing continuous activity (almost
exclusively writing), but at a rate that does not saturate the underlying
storage (caching RAID controller, all write ops are fast, the cache is
never filled).

In addition, PostgreSQL is not even close to filling its buffer
cache. The buffer cache is configured at 1 GB, and the resident size
of the PostgreSQL process is only 80-90 MB so far. So even
independently of any lru_multiplier setting, delays and whatever else,
I don't see why any backend would ever have to do its own write-outs in
order to allocate a page from the buffer cache.
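
(As a sanity check on how much of shared_buffers is actually populated,
assuming the contrib pg_buffercache module is installed, something like
this should show it:

postgres=# select count(*) as buffers_used from pg_buffercache where relfilenode is not null;
)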

One theory: is it the autovacuum process? Stracing those, I've seen
that they very often issue writes directly to disk.

In any case, the reason I am fixating on buffers_backend is that I am
after a clear indication of whether any "normal" backend (non-autovacuum
or anything like that) is ever having to block on disk writes, other
than WAL fsyncs.

Is a non-zero buffers_backend consistent with expected behavior?

[1] http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org



Re: lru_multiplier and backend page write-outs

From: Greg Smith
On Wed, 5 Nov 2008, Peter Schuller wrote:

> In addition, PostgreSQL is not even close to filling its buffer
> cache. The buffer cache is configured at 1 GB, and the resident size
> of the PostgreSQL process is only 80-90 MB so far. So even
> independently of any lru_multiplier setting, delays and whatever else,
> I don't see why any backend would ever have to do its own write-outs in
> order to allocate a page from the buffer cache.

Any buffer that you've accessed recently gets its recent usage count
incremented such that the background writer won't touch it--the current
one only writes things where that count is 0.  The only mechanism which
drops that usage count back down again only kicks in once you've used all
the buffers in the cache.  You need some pressure to evict buffers that
can't fit anymore before the background writer has any useful role to play
in PostgreSQL 8.3.
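
(If the contrib pg_buffercache module happens to be installed, you can see
this directly; buffers with a non-zero usagecount are the ones the 8.3
background writer will skip over:

postgres=# select usagecount, count(*) from pg_buffercache group by usagecount order by usagecount;
)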

At one point I envisioned making it smart enough to try and handle the
scenario you describe--on an idle system, you may very well want to write
out dirty and recently accessed buffers if there's nothing else going on.
But such behavior is counter-productive on a busy system, which is why a
similar mechanism that existed before 8.3 was removed.  Making that only
happen when idle requires a metric for what "busy" means, which is tricky
to do given the information available to this particular process.

Short version:  if you never fill the buffer cache, buffers_clean will
always be zero, and you'll only see writes by checkpoints and things not
operating with the standard client buffer allocation mechanism.  Which
brings us to...

> One theory: is it the autovacuum process? Stracing those, I've seen
> that they very often issue writes directly to disk.

In order to keep it from using up the whole cache with maintenance
overhead, vacuum allocates a 256K ring of buffers and re-uses ones
from there whenever possible.  That will generate buffers_backend writes
when that ring fills but it has more left to scan.  Your theory that all
the backend writes are coming from vacuum seems consistent with what
you've described.

You might even want to drop the two background writer parameters you've
tweaked upwards back down closer to their original values.  I get the
impression you might have increased those hoping for more background
writer work because you weren't seeing any.  If you ever do get to where
your buffer cache is full and the background writer starts doing
something, those could jump from ineffective to wastefully heavy at that
point.
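
(If you want to double-check where those stand, something like this shows
the current values; the 8.3 defaults are 200ms, 100 and 2.0 respectively:

postgres=# select name, setting from pg_settings where name like 'bgwriter%';
)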

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: lru_multiplier and backend page write-outs

From: Peter Schuller
Hello,

> At one point I envisioned making it smart enough to try and handle the
> scenario you describe--on an idle system, you may very well want to write
> out dirty and recently accessed buffers if there's nothing else going on.
> But such behavior is counter-productive on a busy system, which is why a
> similar mechanism that existed before 8.3 was removed.  Making that only
> happen when idle requires a metric for what "busy" means, which is tricky
> to do given the information available to this particular process.
>
> Short version:  if you never fill the buffer cache, buffers_clean will
> always be zero, and you'll only see writes by checkpoints and things not
> operating with the standard client buffer allocation mechanism.  Which
> brings us to...

Sure. I am not really out to get the background writer to
pre-emptively do "idle trickling". Though I can see cases where one
might care about this (such as lessening the impact of OS buffer cache
delays on checkpoints), it's not what I am after now.

> > One theory: is it the autovacuum process? Stracing those, I've seen
> > that they very often issue writes directly to disk.
>
> In order to keep it from using up the whole cache with maintenance
> overhead, vacuum allocates a 256K ring of buffers and re-uses ones
> from there whenever possible.  That will generate buffers_backend writes
> when that ring fills but it has more left to scan.  Your theory that all
> the backend writes are coming from vacuum seems consistent with what
> you've described.

The bit that is inconsistent with this theory, given the above ring
buffer description, is that I saw the backend write-out count
increasing constantly during the write activity I was generating
against the database. However (because in this particular case it was
a small database used for some latency-related testing), no table was
ever large enough that 256k buffers would ever be filled by the process
of vacuuming a single table. Most tables would likely have been a
handful to a couple of hundred pages in size.

In addition, when I say "constantly" above I mean that the count
increases even between successive SELECTs (of the stat table) with
only a second or two in between. In the absence of long-running
vacuums, that discounts vacuuming because the naptime is 1 minute.

In fact this already discounted vacuuming even without the added
information you provided above, but I didn't realize it when originally
posting.

The reason I mentioned vacuuming was that the use case is such that we
do have a lot of tables constantly getting writes and updates, but
they are all small.

Is anything else known that might be generating the writes, if it is
not vacuuming?

> You might even want to drop the two background writer parameters you've
> tweaked upwards back down closer to their original values.  I get the
> impression you might have increased those hoping for more background
> writer work because you weren't seeing any.  If you ever do get to where
> your buffer cache is full and the background writer starts doing
> something, those could jump from ineffective to wastefully heavy at that
> point.

I tweaked them in order to eliminate backends having to do
"synchronous" (with respect to the operating system, even if not with
respect to the underlying device) writes.

The idea is that writes to the operating system are less
understood/controlled, in terms of any latency they may cause. It would
be very nice if the backend writes were always zero under normal
circumstances (or at least grew very rarely, in edge cases where the
JIT policy did not succeed), in order to make it a more relevant and
rare observation that the backend write-outs are systematically
increasing.

On this topic btw, was it considered to allow the administrator to
specify a fixed-size margin to use when applying the JIT policy? (The
JIT policy and logic itself being exactly the same still.)

Especially with larger buffer caches, that would perhaps allow the
administrator to make a call to truly eliminate synchronous writes
during normal operation, while not adversely affecting anything (if
the buffer cache is 1 GB, having a margin of, say, 50 MB does not really
matter much in terms of wasted memory, yet could have a significant
impact on eliminating synchronous write-outs).

On a system where you really want to keep backend writes to exactly 0
under normal circumstances (discounting vacuuming), and having a large
buffer cache (say the one gig), it might be nice to be able to say "ok
- I have 1 GB of buffer cache. for the purpose of the JIT algorithm,
please pretend it's only 900 MB". The result is 100 MB of constantly
sized "margin", with respect to ensuring writes are asynchronous.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org



Re: lru_multiplier and backend page write-outs

From: Greg Smith
On Thu, 6 Nov 2008, Peter Schuller wrote:

>> In order to keep it from using up the whole cache with maintenance
>> overhead, vacuum allocates a 256K ring of buffers and re-uses ones
>> from there whenever possible.
>
> no table was ever large enough that 256k buffers would ever be filled by
> the process of vacuuming a single table.

Not 256K buffers--256K, 32 buffers.
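
(That is, the ring is 256 kilobytes of buffer space, which at the default
8 KB block size works out to 32 buffers:

postgres=# select (256 * 1024) / current_setting('block_size')::int as ring_buffers;
 ring_buffers
--------------
           32
(1 row)
)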

> In addition, when I say "constantly" above I mean that the count
> increases even between successive SELECTs (of the stat table) with
> only a second or two in between.

Writes to the database when only doing read operations are usually related
to hint bits:  http://wiki.postgresql.org/wiki/Hint_Bits

> On this topic btw, was it considered to allow the administrator to
> specify a fixed-size margin to use when applying the JIT policy?

Right now, there's no way to know exactly what's in the buffer cache
without scanning the individual buffers, which requires locking their
headers so you can see them consistently.  No one process can get the big
picture without doing something intrusive like that, and on a busy system
the overhead of collecting more data to know how exactly far ahead the
cleaning is can drag down overall performance.  A lot can happen while the
background writer is sleeping.

One next-generation design which has been sketched out but not even
prototyped would take cleaned buffers and add them to the internal list of
buffers that are free, which right now is usually empty on the theory that
cached data is always more useful than a reserved buffer.  If you
developed a reasonable model for how many buffers you needed and padded
that appropriately, that's the easiest way (given the rest of the buffer
manager code) to get close to ensuring there aren't any backend writes.
Because you've got the OS buffering writes anyway in most cases, it's hard
to pin down whether that actually improved worst-case latency though. And
moving in that direction always seems to reduce average throughput even in
write-heavy benchmarks.

The important thing to remember is that the underlying OS has its own read
and write caching mechanisms here, and unless the PostgreSQL ones are
measurably better than those you might as well let the OS manage the
problem instead.  It's easy to demonstrate that's happening when you give
a decent amount of memory to shared_buffers, it's much harder to prove
that's the case for an improved write scheduling algorithm.  Stepping back
a bit, you might even consider that one reason PostgreSQL has grown as
well as it has in scalability is exactly because it's been riding
improvements in the underlying OS in many of these cases, rather than trying
to do all the I/O scheduling itself.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: lru_multiplier and backend page write-outs

From: Peter Schuller
> > no table was ever large enough that 256k buffers would ever be filled by
> > the process of vacuuming a single table.
>
> Not 256K buffers--256K, 32 buffers.

Ok.

> > In addition, when I say "constantly" above I mean that the count
> > increases even between successive SELECTs (of the stat table) with
> > only a second or two in between.
>
> Writes to the database when only doing read operations are usually related
> to hint bits:  http://wiki.postgresql.org/wiki/Hint_Bits

Sorry, I didn't mean to imply read-only operations (I did read the
hint bits information a while back though). What I meant was that
while I was constantly generating the insert/delete/update activity, I
was selecting the bgwriter stats with only a second or two in
between. The intent was to convey that the count of backend-written
pages was systematically and constantly (as in a few hundred per
handful of seconds) increasing, in spite of no long-running vacuum and
the buffer cache not being close to full.
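
(Roughly what I was doing, as a sketch; the interval is arbitrary:

postgres=# select buffers_backend from pg_stat_bgwriter;
postgres=# select pg_sleep(5);
postgres=# select buffers_backend from pg_stat_bgwriter;

The difference between the two buffers_backend readings is the number of
backend write-outs over the interval.)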

> > On this topic btw, was it considered to allow the administrator to
> > specify a fixed-size margin to use when applying the JIT policy?
>
> Right now, there's no way to know exactly what's in the buffer cache
> without scanning the individual buffers, which requires locking their
> headers so you can see them consistently.  No one process can get the big
> picture without doing something intrusive like that, and on a busy system
> the overhead of collecting more data to know how exactly far ahead the
> cleaning is can drag down overall performance.  A lot can happen while the
> background writer is sleeping.

Understood.

> One next-generation design which has been sketched out but not even
> prototyped would take cleaned buffers and add them to the internal list of
> buffers that are free, which right now is usually empty on the theory that
> cached data is always more useful than a reserved buffer.  If you
> developed a reasonable model for how many buffers you needed and padded
> that appropriately, that's the easiest way (given the rest of the buffer
> manager code) to get close to ensuring there aren't any backend writes.
> Because you've got the OS buffering writes anyway in most cases, it's hard
> to pin down whether that actually improved worst-case latency though. And
> moving in that direction always seems to reduce average throughput even in
> write-heavy benchmarks.

Ok.

> The important thing to remember is that the underlying OS has its own read
> and write caching mechanisms here, and unless the PostgreSQL ones are
> measurably better than those you might as well let the OS manage the
> problem instead.

The problem, though, is that while the OS may be good at the common
cases it is designed for, it can have specific behaviors that are
directly counter-productive if your goals do not line up with the
commonly designed-for use case (in particular, if you care a lot about
latency and not necessarily about absolute maximum throughput).

For example, in Linux, up until recently if not still, there is the
1024 per-inode limit on the number of buffers written as a result of
expiry, which means that when PostgreSQL does its fsync(), you may end
up having a lot more to write out than would have been the case if the
centisecs_expiry had been enforced, regardless of whether PostgreSQL
was tuned to write dirty pages out sufficiently aggressively. If the
amount built up exceeds the capacity of the RAID controller cache...

I had a case where I suspect this was exacerbating the
situation. Manually doing a 'sync' on the system every few seconds
noticeably helped (the theory being that it forced page write-outs
to happen earlier and in smaller storms).

>  It's easy to demonstrate that's happening when you give
> a decent amount of memory to shared_buffers, it's much harder to prove
> that's the case for an improved write scheduling algorithm.  Stepping back
> a bit, you might even consider that one reason PostgreSQL has grown as
> well as it has in scalability is exactly because it's been riding
> improvements in the underlying OS in many of these cases, rather than trying
> to do all the I/O scheduling itself.

Sure. In this case, with the backend writes, I am more interested in
understanding better what is happening and having better indications
of when backends block on I/O than in necessarily having a proven
improvement in throughput or latency. It makes it easier to reason
about what is happening when you *do* have a measured performance
problem.

Thanks for all the insightful information.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org

