Thread: Limit of bgwriter_lru_maxpages of max. 1000?

Limit of bgwriter_lru_maxpages of max. 1000?

From
Gerhard Wiesinger
Date:
Hello,

I think the limit of 1000 on bgwriter_lru_maxpages doesn't make much sense,
because in practice it limits the performance of the database (version 8.3).

Under heavy write load, dirty buffers in the buffer cache must be written out
and freed. With the default config this is effectively limited to:
bgwriter_delay=200ms
bgwriter_lru_maxpages=100
8k * bgwriter_lru_maxpages * 1000/bgwriter_delay = 8k * 100 * 1000/200 = 4000k = 4MB/s
Isn't that a major performance bottleneck in the default config?

bgwriter_delay=200ms
bgwriter_lru_maxpages=1000
8k * bgwriter_lru_maxpages * 1000/bgwriter_delay = 8k * 1000 * 1000/200 = 40000k = 40MB/s
Still not a very high number for current I/O loads.

Lowering bgwriter_delay is possible, but I think the overhead is too high, and
there is still a limit of 800MB/s involved:
bgwriter_delay=10ms
bgwriter_lru_maxpages=1000
8k * bgwriter_lru_maxpages * 1000/bgwriter_delay = 8k * 1000 * 1000/10 = 800000k = 800MB/s

So I think it would be better to allow a configuration such as:
bgwriter_delay=50ms
bgwriter_lru_maxpages=100000
8k * bgwriter_lru_maxpages * 1000/bgwriter_delay = 8k * 100000 * 1000/50 = 16000000k = 16000MB/s

So in fact I think bgwriter_lru_maxpages should be limited to 100000, if it is
limited at all.
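
A quick way to redo that arithmetic for other settings (this is nothing more
than the formula above typed into psql; substitute your own values):

SELECT 8 * 100000 * (1000.0 / 50) / 1000 AS max_mb_per_second;
-- => 16000 with the settings proposed above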

Are my arguments correct?
Any comments?

Thnx.

Ciao,
Gerhard

--
http://www.wiesinger.com/

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Greg Smith
Date:
On Sun, 27 Sep 2009, Gerhard Wiesinger wrote:

> Lowering bgwriter_delay is possible, but I think overhead is too much and
> still there is a limit of 800MB/s involved:

Stuff written by the background writer turns into largely random I/O.
800MB/s of random writes is such a large number that it has only recently
become remotely possible; a RAID0 of SSD devices might manage it.  No hardware
available until very recently had any hope of reaching that sort of
performance.

In any case, I would wager you'll run into one of many other bottlenecks
in PostgreSQL and/or currently available system/disk hardware long before
the background writer limit gets important.

> So in fact I think bgwriter_lru_maxpages should be limited to 100000 if
> limited at all.

The current limit is based on the assumption that people will set it to
values way too high if allowed, to the point where it's counterproductive.
That's exactly what people used to do with early background writer
designs.  I think you're wandering down the same road, where what it
actually does and what you think it does are not the same thing at all.
Much of the important disk I/O coming out of the database should be
related to checkpoints, not the background writer, and there is no limit
on that I/O.

If you think you've got a situation where the current limits are not
sufficient, the path to argue that would start with showing what you're
seeing in pg_stat_bgwriter.  I can imagine some results from there on a
system with a very high rate of I/O available that would suggest the
current limits are too small.  I've never come close to actually seeing
such results in the real world though, and if you're not already
monitoring those numbers on a real system I'd suggest you start there
rather than presuming there's a design limitation here.
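
(For reference, the counters in question can be read with a plain query;
the column names below are the 8.3 ones:)

SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean,
       maxwritten_clean, buffers_backend, buffers_alloc
FROM pg_stat_bgwriter;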

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Gerhard Wiesinger
Date:
On Fri, 2 Oct 2009, Greg Smith wrote:

> On Sun, 27 Sep 2009, Gerhard Wiesinger wrote:
>
>> Lowering bgwriter_delay is possible, but I think overhead is too much and
>> still there is a limit of 800MB/s involved:
>
> Stuff written by the background writer turns into largely random I/O. 800MB/s
> of random writes is so large of a number it's only recently become remotely
> possible; a RAID0 of SSD devices might manage it.  No hardware available
> until very recently had any hope of getting that sort of performance.
>
> In any case, I would wager you'll run into one of many other bottlenecks in
> PostgreSQL and/or currently available system/disk hardware long before the
> background writer limit gets important.
>

Of course, 800MB/s is a theoretical maximum that I came up with. But with SSDs
this might become possible.

>> So in fact I think bgwriter_lru_maxpages should be limited to 100000 if
>> limited at all.
>
> The current limit is based on the assumption that people will set it to
> values way too high if allowed, to the point where it's counterproductive.
> That's exactly what people used to do with early background writer designs.
> I think you're wandering down the same road, where what it actually does and
> what you think it does are not the same thing at all. Much of the important
> disk I/O coming out of the database should be related to checkpoints, not the
> background writer, and there is no limit on that I/O.
>

In my experience flushing I/O as soon as possible is the best solution.
Think of the following scenario: you currently limit the bgwriter to 4MB/s, but
you have about 10MB/s of random I/O capacity (a normal low-cost system). So
utilization would be only 40% and you could write even more. At checkpoint time
you then get a spike which the I/O system can't handle at all, and performance
drops to nearly zero because of the I/O spike (e.g. 500% of the available I/O
would be needed). IMHO such scenarios should be avoided.

> If you think you've got a situation where the current limits are not
> sufficient, the path to argue that would start with showing what you're
> seeing in pg_stat_bgwriter.  I can imagine some results from there on a
> system with a very high rate of I/O available that would suggest the current
> limits are too small.  I've never come close to actually seeing such results
> in the real world though, and if you're not already monitoring those numbers
> on a real system I'd suggest you start there rather than presuming there's a
> design limitation here.
>

On a nearly idle database, with occasional performance tests:

SELECT
   buffers_checkpoint/buffers_clean AS checkpoint_spike,
   ROUND(100.0*buffers_checkpoint/(buffers_checkpoint + buffers_clean +
buffers_backend),2) AS checkpoint_percentage,
   ROUND(100.0*buffers_clean/(buffers_checkpoint + buffers_clean +
buffers_backend),2) AS pg_writer_percentage,
   ROUND(100.0*buffers_backend/(buffers_checkpoint + buffers_clean +
buffers_backend),2) AS backend_percentage
FROM
   pg_stat_bgwriter
;

  checkpoint_spike | checkpoint_percentage | pg_writer_percentage | backend_percentage
------------------+-----------------------+----------------------+--------------------
                31 |                 90.58 |                 2.92 |               6.50

So flushing typically happens at checkpoint time. 6.5% of all blocks were
written to disk by the backends, which tells me: the bgwriter is too slow, so
the backends have to do the work themselves.

So I'd like to do some tests with fresh statistics. Is there a fast way to
reset the pg_stat_bgwriter statistics for all databases?

Thnx.

Ciao,
Gerhard

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Greg Smith
Date:
On Fri, 2 Oct 2009, Gerhard Wiesinger wrote:

> In my experience flushing I/O as soon as possible is the best solution.

That's what everyone assumes, but detailed benchmarks of PostgreSQL don't
actually support that view, given how the database operates.  We went
through a lot of work in 8.3 on optimizing the database as a whole system,
which disproved some of the theories about what would work well here.

What happens if you're really aggressive about writing blocks out as soon
as they're dirty is that you waste a lot of I/O on things that just get
dirty again later.  Since checkpoint time is the only period where blocks
*must* get written, the approach that worked best for reducing checkpoint
spikes was to spread the checkpoint writes out over a very wide period.
The only remaining work that made sense for the background writer was to
tightly focus its I/O on blocks that are about to be evicted due to low
usage no matter what.

In most cases where people think they need more I/O from the background
writer, what you actually want is to increase checkpoint_segments,
checkpoint_completion_target, and checkpoint_timeout in order to spread
the checkpoint I/O out over a longer period.  The stats you provided
suggest this is working exactly as intended.
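
(As a sketch of what that tuning looks like in postgresql.conf; the values
below are only illustrative, not ones recommended in this thread:)

checkpoint_segments = 32              # default 3
checkpoint_timeout = 15min            # default 5min
checkpoint_completion_target = 0.9    # default 0.5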

As far as work to improve the status quo, IMHO the next thing to improve
is getting the fsync calls made at checkpoint time more intelligently
spread over the whole period.  That's got a better payback than trying to
make the background writer more aggressive, which is basically a doomed
cause.

> So I'd like to do some tests with fresh statistics. Is there a fast way to
> reset the pg_stat_bgwriter statistics for all databases?

No, that's an open TODO item I keep meaning to fix; we lost that
capability at one point.  What I do is create a table that looks just like
it, but with a time stamp, and save snapshots to that table.  Then a view
on top can generate just the deltas between two samples to show activity
during that time.  It's handy to have such a history anyway.
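
(A minimal sketch of that snapshot approach; the table name and the delta
query are just an illustration, not something shipped with PostgreSQL:)

CREATE TABLE bgwriter_snapshot AS
  SELECT now() AS snap_time, * FROM pg_stat_bgwriter;

-- take another sample whenever you want a new data point
INSERT INTO bgwriter_snapshot
  SELECT now(), * FROM pg_stat_bgwriter;

-- delta between the two most recent samples
SELECT newer.snap_time - older.snap_time                   AS elapsed,
       newer.buffers_checkpoint - older.buffers_checkpoint AS buffers_checkpoint,
       newer.buffers_clean      - older.buffers_clean      AS buffers_clean,
       newer.buffers_backend    - older.buffers_backend    AS buffers_backend
FROM (SELECT * FROM bgwriter_snapshot ORDER BY snap_time DESC LIMIT 1) AS newer,
     (SELECT * FROM bgwriter_snapshot ORDER BY snap_time DESC LIMIT 2) AS older
WHERE older.snap_time < newer.snap_time;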

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Scott Marlowe
Date:
On Fri, Oct 2, 2009 at 2:19 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Fri, 2 Oct 2009, Gerhard Wiesinger wrote:
>
>> In my experience flushing I/O as soon as possible is the best solution.
>
> That what everyone assumes, but detailed benchmarks of PostgreSQL don't
> actually support that view given how the database operates.  We went through
> a lot of work in 8.3 related to how to optimize the database as a system
> that disproved some of the theories about what would work well here.
>
> What happens if you're really aggressive about writing blocks out as soon as
> they're dirty is that you waste a lot of I/O on things that just get dirty
> again later.  Since checkpoint time is the only period where blocks *must*
> get written, the approach that worked the best for reducing checkpoint
> spikes was to spread the checkpoint writes out over a very wide period.

The session servers we have at work are a perfect match for this.  By
increasing checkpoint segments to 100 (or more), timeout to 60
minutes, and setting completion target lower (currently 0.25) we have
reduced our I/O wait from 10-15% to nearly nothing.  These are
databases that update the same rows over and over with session data as
the user navigates the system, so writing things out as early as
possible is a REAL bad idea.
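
(Reading those settings as postgresql.conf entries -- an interpretation, not a
copy of the actual config:)

checkpoint_segments = 100
checkpoint_timeout = 60min
checkpoint_completion_target = 0.25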

> In most cases where people think they need more I/O from the background
> writer, what you actually want is to increase checkpoint_segments,
> checkpoint_completion_target, and checkpoint_timeout in order to spread the
> checkpoint I/O out over a longer period.  The stats you provided suggest
> this is working exactly as intended.

I found that lowering checkpoint completion target was what helped.
Does that seem counter-intuitive to you?

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Tom Lane
Date:
Scott Marlowe <scott.marlowe@gmail.com> writes:
> The session servers we have at work are a perfect match for this.  By
> increasing checkpoint segments to 100 (or more), timeout to 60
> minutes, and setting completion target lower (currently 0.25) we have
> reduced our IO wait from 10 to 15% to nearly nothing.  These are
> databases that update the same rows over and over with session data as
> the user navigates the system, so writing things out as early as
> possible is a REAL bad idea.

> I found that lowering checkpoint completion target was what helped.
> Does that seem counter-intuitive to you?

Once the checkpoint completion target time is high enough that the
checkpoint-induced I/O is just background noise for you, increasing the
target further won't make for any noticeable further improvement.  I'm
not sure I see how it would make things *worse* though.  Maybe, even
though the I/O wait is "nearly nothing", the I/O is still forcing enough
extra seeks to slow normal disk operations?  If so, getting the
checkpoint out of the way sooner so that you can get back to full speed
operation sooner might be better than reducing the rate of checkpoint
I/Os below the nearly-noise level.  I'm just guessing about that though.
What were you measuring --- the performance within checkpoint, the
performance outside it, or the whole-cycle average?

            regards, tom lane

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Bruce Momjian
Date:
Tom Lane wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
> > The session servers we have at work are a perfect match for this.  By
> > increasing checkpoint segments to 100 (or more), timeout to 60
> > minutes, and setting completion target lower (currently 0.25) we have
> > reduced our IO wait from 10 to 15% to nearly nothing.  These are
> > databases that update the same rows over and over with session data as
> > the user navigates the system, so writing things out as early as
> > possible is a REAL bad idea.
>
> > I found that lowering checkpoint completion target was what helped.
> > Does that seem counter-intuitive to you?
>
> Once the checkpoint completion target time is high enough that the
> checkpoint-induced I/O is just background noise for you, increasing the
> target further won't make for any noticeable further improvement.  I'm
> not sure I see how it would make things *worse* though.  Maybe, even
> though the I/O wait is "nearly nothing", the I/O is still forcing enough
> extra seeks to slow normal disk operations?  If so, getting the
> checkpoint out of the way sooner so that you can get back to full speed
> operation sooner might be better than reducing the rate of checkpoint
> I/Os below the nearly-noise level.  I'm just guessing about that though.
> What were you measuring --- the performance within checkpoint, the
> performance outside it, or the whole-cycle average?

My guess is that having a very long fuzzy checkpoint time means that
when you fsync you are fsync'ing lots of data, both your checkpoint data
and other writes performed by backends.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> My guess is that having a very long fuzzy checkpoint time means that
> when you fsync you are fsync'ing lots of data, both your checkpoint data
> and other writes performed by backends.

Hmm, could be ... although that would imply that shared_buffers should
be kicked up some more, so the backends aren't doing so many writes for
themselves.

            regards, tom lane

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Scott Marlowe
Date:
On Fri, Oct 2, 2009 at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
>> The session servers we have at work are a perfect match for this.  By
>> increasing checkpoint segments to 100 (or more), timeout to 60
>> minutes, and setting completion target lower (currently 0.25) we have
>> reduced our IO wait from 10 to 15% to nearly nothing.  These are
>> databases that update the same rows over and over with session data as
>> the user navigates the system, so writing things out as early as
>> possible is a REAL bad idea.
>
>> I found that lowering checkpoint completion target was what helped.
>> Does that seem counter-intuitive to you?
>
> Once the checkpoint completion target time is high enough that the
> checkpoint-induced I/O is just background noise for you, increasing the
> target further won't make for any noticeable further improvement.  I'm
> not sure I see how it would make things *worse* though.  Maybe, even
> though the I/O wait is "nearly nothing", the I/O is still forcing enough
> extra seeks to slow normal disk operations?  If so, getting the
> checkpoint out of the way sooner so that you can get back to full speed
> operation sooner might be better than reducing the rate of checkpoint
> I/Os below the nearly-noise level.  I'm just guessing about that though.
> What were you measuring --- the performance within checkpoint, the
> performance outside it, or the whole-cycle average?

I was measuring it over an extended period, say a few hours.  This db
is small enough to fit in memory easily (1.5Gig on a machine with 6Gig
ram doing nothing else), so all the I/O is basically blocks going out, with
none coming in.  Since the checkpoints should see a LOT of the same records
updated over and over, I'm guessing that a very low completion target
lets it collect a lot of those together and just write out the last
version.  The nice thing is I can benchmark one of these machines against
the other, since they're basically identical twins doing the same job,
and see how changes like this affect them.  At first I was at
something like 0.5 completion target, and increasing the checkpoint
segments did have an effect, but lowering the completion target was as
much of a gain as the increased checkpoint segments.

So, I think the tuning of a small db like this that can fit in memory
is a whole different ball game than one that's several times larger
than memory.

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Greg Smith
Date:
On Fri, 2 Oct 2009, Scott Marlowe wrote:

> I found that lowering checkpoint completion target was what helped.
> Does that seem counter-intuitive to you?

Generally, but there are plenty of ways you can get into a state where a
short but not immediate checkpoint is better.  For example, consider a
case where your buffer cache is filled with really random stuff.  There's
a sorting horizon in effect, where your OS and/or controller makes
decisions about what order to write things based on the data it already
has around, not really knowing what's coming in the near future.

Let's say you've got 256MB of cache in the disk controller, you have 1GB
of buffer cache to write out, and there's 8GB of RAM in the server so it
can cache the whole write.  If you wrote it out in a big burst, the OS
would elevator sort things and feed them to the controller in disk order.
Very efficient, one pass over the disk to write everything out.

But if you broke that up into 256MB write pieces instead on the database
side, pausing after each chunk was written, the OS would only be sorting
across 256MB at a time, and would basically fill the controller cache up
with that before it saw the larger picture.  The disk controller can end
up making seek decisions with that small of a planning window now that are
not really optimal, making more passes over the disk to write the same
data out.  If the timing between the DB write cache and the OS is
pathologically out of sync here, the result can end up being slower than
had you just written out in bigger chunks instead.  This is one reason I'd
like to see fsync calls happen earlier and more evenly than they do now,
to reduce these edge cases.

The usual approach I take in this situation is to reduce the amount of
write caching the OS does, so at least things get more predictable.  A
giant write cache always gives the best average performance, but the
worst-case behavior increases at the same time.
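
(One way to do that on Linux, as an illustration only -- the right knobs and
values depend on the kernel and the workload:)

# /etc/sysctl.conf -- shrink the OS write cache so flushing starts earlier
vm.dirty_background_ratio = 1
vm.dirty_ratio = 10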

There was a patch floating around at one point that sorted all the
checkpoint writes by block order, which would reduce how likely it is
you'll end up in one of these odd cases.  That turned out to be hard to
nail down the benefit of though, because in a typical case the OS caching
here trumps any I/O scheduling you try to do in user land, and it's hard
to repeatably generate scattered data in a benchmark situation.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Gerhard Wiesinger
Date:
On Fri, 2 Oct 2009, Greg Smith wrote:

> On Fri, 2 Oct 2009, Scott Marlowe wrote:
>
>> I found that lowering checkpoint completion target was what helped.
>> Does that seem counter-intuitive to you?
>

I set it to 0.0 now.

> Generally, but there are plenty of ways you can get into a state where a
> short but not immediate checkpoint is better.  For example, consider a case
> where your buffer cache is filled with really random stuff.  There's a
> sorting horizon in effect, where your OS and/or controller makes decisions
> about what order to write things based on the data it already has around, not
> really knowing what's coming in the near future.
>

OK, if the checkpoint doesn't block anything during normal operation, then its
duration doesn't really matter.

> Let's say you've got 256MB of cache in the disk controller, you have 1GB of
> buffer cache to write out, and there's 8GB of RAM in the server so it can
> cache the whole write.  If you wrote it out in a big burst, the OS would
> elevator sort things and feed them to the controller in disk order. Very
> efficient, one pass over the disk to write everything out.
>
> But if you broke that up into 256MB write pieces instead on the database
> side, pausing after each chunk was written, the OS would only be sorting
> across 256MB at a time, and would basically fill the controller cache up with
> that before it saw the larger picture.  The disk controller can end up making
> seek decisions with that small of a planning window now that are not really
> optimal, making more passes over the disk to write the same data out.  If the
> timing between the DB write cache and the OS is pathologically out of sync
> here, the result can end up being slower than had you just written out in
> bigger chunks instead.  This is one reason I'd like to see fsync calls happen
> earlier and more evenly than they do now, to reduce these edge cases.
>
> The usual approach I take in this situation is to reduce the amount of write
> caching the OS does, so at least things get more predictable.  A giant write
> cache always gives the best average performance, but the worst-case behavior
> increases at the same time.
>
> There was a patch floating around at one point that sorted all the checkpoint
> writes by block order, which would reduce how likely it is you'll end up in
> one of these odd cases.  That turned out to be hard to nail down the benefit
> of though, because in a typical case the OS caching here trumps any I/O
> scheduling you try to do in user land, and it's hard to repeatibly generate
> scattered data in a benchmark situation.
>

OK, with a basic insert test and a systemtap script
(http://www.wiesinger.com/opensource/systemtap/postgresql-checkpoint.stp)
the checkpoint still shows up as a major I/O spike.

################################################################################
Buffers between : Sun Oct  4 18:29:50 2009, synced 55855 buffer(s), flushed 744 buffer(s) between checkpoint
Checkpoint start: Sun Oct  4 18:29:50 2009
Checkpoint end  : Sun Oct  4 18:29:56 2009, synced 12031 buffer(s), flushed 12031 buffer(s)
################################################################################
Buffers between : Sun Oct  4 18:30:20 2009, synced 79000 buffer(s), flushed 0 buffer(s) between checkpoint
Checkpoint start: Sun Oct  4 18:30:20 2009
Checkpoint end  : Sun Oct  4 18:30:26 2009, synced 10753 buffer(s), flushed 10753 buffer(s)
################################################################################
Buffers between : Sun Oct  4 18:30:50 2009, synced 51120 buffer(s), flushed 1007 buffer(s) between checkpoint
Checkpoint start: Sun Oct  4 18:30:50 2009
Checkpoint end  : Sun Oct  4 18:30:56 2009, synced 11899 buffer(s), flushed 11912 buffer(s)
################################################################################

OK, I had a further look at the code to understand the behavior of the
buffer cache and the background writer, since the observed behavior didn't
seem logical to me.

As far as I can see, the basic algorithm is:
1.) Normally (outside checkpoints) only dirty and not recently used pages
(usage_count == 0) are written to disk. I think that's basically fine as a
strategy, since indexes might update blocks more than once. It's also OK that
blocks are written but not fsynced (that will be done at checkpoint time).
2.) At checkpoints, write out all dirty buffers and fsync everything written
previously as well as the newly written blocks. Spreading that I/O also seems
OK to me now.

BUT: I think I've found 2 major bugs in the implementation (or I didn't
understand something correctly). The codebase analyzed is 8.3.8, since that's
what I currently run.

##############################################
Bug1: usage_count is IMHO not consistent
##############################################
I think this has been introduced with:

http://git.postgresql.org/gitweb?p=postgresql.git;a=blobdiff;f=src/backend/storage/buffer/bufmgr.c;h=6e6b862273afea40241e410e18fd5d740c2b1643;hp=97f7822077de683989a064cdc624a025f85e54ab;hb=ebf3d5b66360823edbdf5ac4f9a119506fccd4c0;hpb=98ffa4e9bd75c8124378c712933bb13d2697b694

So either the usage_count = 1 initialization in BufferAlloc() is not correct, or
SyncOneBuffer() skipping buffers with usage_count = 1 when skip_recently_used
is set is not correct:
         if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
                 result |= BUF_REUSABLE;
         else if (skip_recently_used)
         {
                 /* Caller told us not to write recently-used buffers */
                 UnlockBufHdr(bufHdr);
                 return result;
         }

##############################################
Bug2: Double iteration of buffers
##############################################
As you can see in the calling tree below, the buffers are iterated over in a
nested loop. This might be a major performance bottleneck.

// Checkpoint buffer sync
BufferSync()
   loop buffers:
     SyncOneBuffer() // skip_recently_used=false
     CheckpointWriteDelay() // Bug here?: Since BgBufferSync() is called, the buffers are iterated over again!!

CheckpointWriteDelay()
   if (IsCheckpointOnSchedule())
   { BgBufferSync()
     CheckArchiveTimeout()
     BgWriterNap()
   }

BgBufferSync()
   loop buffers:
     SyncOneBuffer() // skip_recently_used=true, OK here since we don't want to flush recently used blocks (e.g.
indexes). But an improvement (e.g. aging) is IMHO necessary
##############################################

BTW: Are there any tests available of how fast a buffer cache hit is versus a
disk cache hit (i.e. not in the buffer cache but in the OS page cache)? I ask
because a lot of locking is involved in the code.

BTW2: Oracle's buffer cache and background writer strategy is also
interesting.
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/process.htm#i7259
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/memory.htm#i10221

Thnx for feedback.

Ciao,
Gerhard

--
http://www.wiesinger.com/

-----------------------------------
src/backend/postmaster/bgwriter.c
-----------------------------------
BackgroundWriterMain()
   loop forever:
     timeout:
       CreateCheckPoint() // NON_IMMEDIATE
       smgrcloseall()
     nontimeout:
       BgBufferSync()
     sleep
// Rest is done in XLogWrite()

RequestCheckpoint()
   CreateCheckPoint() or signal through shared memory segment
   smgrcloseall()

CheckpointWriteDelay()
   if (IsCheckpointOnSchedule())
   { BgBufferSync()
     CheckArchiveTimeout()
     BgWriterNap()
   }

-----------------------------------
src/backend/commands/dbcommands.c
-----------------------------------
createdb()
   RequestCheckpoint() // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT

dropdb()
   RequestCheckpoint() // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-----------------------------------
src/backend/commands/tablespace.c
-----------------------------------
DropTableSpace()
   RequestCheckpoint()

-----------------------------------
src/backend/tcop/utility.c
-----------------------------------
ProcessUtility()
   // Command CHECKPOINT;
   RequestCheckpoint() // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT

-----------------------------------
src/backend/access/transam/xlog.c
-----------------------------------
CreateCheckPoint()
   CheckPointGuts()
     CheckPointCLOG()
     CheckPointSUBTRANS()
     CheckPointMultiXact()
     CheckPointBuffers());       /* performs all required fsyncs */
     CheckPointTwoPhase()

XLogWrite()
   too_much_transaction_log_consumed:
     RequestCheckpoint() // NON_IMMEDIATE

pg_start_backup()
   RequestCheckpoint() // CHECKPOINT_FORCE | CHECKPOINT_WAIT

XLogFlush()
   // Flush transaction log

-----------------------------------
src/backend/storage/buffer/bufmgr.c
-----------------------------------
CheckPointBuffers()
   BufferSync()
   smgrsync()

// Checkpoint buffer sync
BufferSync()
   loop buffers:
     SyncOneBuffer() // skip_recently_used=false
     CheckpointWriteDelay() // Bug here?: Since BgBufferSync() is called, the buffers are iterated over again!!

// Backgroundwriter buffer sync
BgBufferSync()
   loop buffers:
     SyncOneBuffer() // skip_recently_used=true, OK here since we don't want to flush recently used blocks (e.g.
indexes). But an improvement (e.g. aging) is IMHO necessary

SyncOneBuffer() // Problem with skip_recently_used and usage_count=1 (not flushed!)
    FlushBuffer()

FlushBuffer()
   XLogFlush()
   smgrwrite()

CheckPointBuffers()
   BufferSync()
   smgrsync()

BufferAlloc() // Init with usage_count=1 is not logical => will never be flushed by the bgwriter!
   PinBuffer()

PinBuffer()
   usage_count++;

-----------------------------------
src/backend/storage/buffer/localbuf.c
-----------------------------------
LocalBufferAlloc()
   usage_count++;

-----------------------------------
src/backend/storage/smgr/md.c
-----------------------------------
smgrwrite() = mdwrite()
   => writes the file (not flushed immediately) but registers it for later fsync via register_dirty_segment() at
checkpoint time

smgrsync() = mdsync()
   => fsyncs the registered, not-yet-flushed files.

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Greg Smith
Date:
On Sun, 4 Oct 2009, Gerhard Wiesinger wrote:

>> On Fri, 2 Oct 2009, Scott Marlowe wrote:
>>
>>> I found that lowering checkpoint completion target was what helped.
>>> Does that seem counter-intuitive to you?
>>
>
> I set it to 0.0 now.

If you set that to 0.0, the whole checkpoint spreading logic doesn't apply
like it's supposed to.  I'm not sure what the results you posted mean now.
If you had it set to 0 and saw a bad spike (which is how I read your
message), I'd say "yes, that's what happens when you reduce that
parameter, so don't do that".  If you meant something else please clarify.

Thanks for the dtrace example, I suggested we add those checkpoint probes
in there and someone did, but I hadn't seen anybody use them for anything
yet.

> Bug1: usage_count is IHMO not consistent

It's a bit hack-ish, but the changes made to support multiple buffer use
strategies introduced by the "Make large sequential scans and VACUUMs work
in a limited-size ring" commit are reasonable even if they're not as
consistent as we'd like.  Those changes were supported by benchmarks
proving their utility, which always trump theoretical "that shouldn't work
better!" claims when profiling performance.

Also, they make sense to me, but I've spent a lot of time staring at
pg_buffercache output to get a feel for what shows up in there under
various circumstances.  That's where I'd suggest you go if this doesn't
seem right to you; run some real database tests and use pg_buffercache to
see what's inside the cache when you're done.  What's in there and what I
expected to be in there weren't always the same thing, and it's
interesting to note how that changes as shared_buffers increases.  I
consider some time studying that a pre-requisite to analyzing performance
of this code.

> Bug2: Double iteration of buffers
> As you can seen in the calling tree below there is double iteration with
> buffers involved. This might be a major performance bottleneck.

Hmmm, this might be a real bug causing scans through the buffer cache to
go twice as fast as intended.  Since the part you suggest is doubled isn't
very intensive or called all that often, there's no way it can be a major
issue though.  That's based on knowing what the code does and how much it
was called, as well as some confidence that if it were really a *major*
problem, it would have shown up on the extensive benchmarks done on all
the code paths you're investigating.

> BTW: Are there some tests available how fast a buffer cache hit is and a disk
> cache hit is (not in the buffer cache but in the disk cache)? I'll asked,
> because a lot of locking is involved in the code.

I did some once but didn't find anything particularly interesting about
the results.  Since you seem to be on a research tear here, it would be
helpful to have a script available to test that out; I wasn't able to
release mine, and something dtrace-based would probably be better than the
approach I used (I threw a bunch of gettimeofday calls into the logs and
post-processed them with a script).

> BTW2: Oracle buffercache and background writer strategy is also interesting.

As a rule, we don't post links to other database implementation details
here, as those can have patented design details we'd prefer not to
intentionally re-implement.  Much of Oracle's design doesn't apply
here anyway, as it was done in the era when all of their writes were
synchronous.  That required them to worry about doing a good job on some
things in their background writer that we shrug off and let OS write
caching combined with fsync handle instead.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Gerhard Wiesinger
Date:
On Mon, 5 Oct 2009, Greg Smith wrote:

> On Sun, 4 Oct 2009, Gerhard Wiesinger wrote:
>
>>> On Fri, 2 Oct 2009, Scott Marlowe wrote:
>>>
>>>> I found that lowering checkpoint completion target was what helped.
>>>> Does that seem counter-intuitive to you?
>>>
>>
>> I set it to 0.0 now.
>
> If you set that to 0.0, the whole checkpoing spreading logic doesn't apply
> like it's supposed to.  I'm not sure what the results you posted mean now. If
> you had it set to 0 and saw a bad spike (which is how I read your message),
> I'd say "yes, that's what happens when you do reduce that parameter, so don't
> do that".  If you meant something else please clarify.

I think the problem is that the write-out is done at checkpoint time (whether
spread or not). It should already have been done by the bgwriter.

>
> Thanks for the dtrace example, I suggested we add those checkpoint probes in
> there and someone did, but I hadn't seen anybody use them for anything yet.
>

I think more probes (e.g. for the different write paths, like writes from the
bgwriter versus writes at checkpoint time) would be interesting here.

>> Bug1: usage_count is IHMO not consistent
>
> It's a bit hack-ish, but the changes made to support multiple buffer use
> strategies introduced by the "Make large sequential scans and VACUUMs work in
> a limited-size ring" commit are reasonable even if they're not as consistent
> as we'd like.  Those changes were supported by benchmarks proving their
> utility, which always trump theoretical "that shouldn't work better!" claims
> when profiling performance.
>
> Also, they make sense to me, but I've spent a lot of time staring at
> pg_buffercache output to get a feel for what shows up in there under various
> circumstances.  That's where I'd suggest you go if this doesn't seem right to
> you; run some real database tests and use pg_buffercache to see what's inside
> the cache when you're done.  What's in there and what I expected to be in
> there weren't always the same thing, and it's interesting to note how that
> changes as shared_buffers increases.  I consider some time studying that a
> pre-requisite to analyzing performance of this code.
>

I analyzed pg_buffercache in parallel (querying it every second, see below) to
see what happens, but the usage_count values didn't match what I expected in
some ways. Therefore I analyzed the code and found what is IMHO the problem
with usage_count and buffer reallocation. Since the code change is fairly new
(I think it was 05/2009), it might be that you did your testing before it went
in ...

BTW: Is it possible, as an admin, to see the pg_class entries of all databases
at once?

>> Bug2: Double iteration of buffers
>> As you can seen in the calling tree below there is double iteration with
>> buffers involved. This might be a major performance bottleneck.
>
> Hmmm, this might be a real bug causing scans through the buffer cache to go
> twice as fast as intended.

That's not just twice, i.e. O(2*n) = O(n); it's a factor of n*n (outer and
inner loop iteration), which makes it O(n^2) overall, and that is IMHO too much.

>  Since the part you suggest is doubled isn't very
> intensive or called all that often, there's no way it can be a major issue
> though.

It is a major issue since it is O(n^2) and not O(n). E.g. with 2GB of shared
buffers we have 262144 blocks and therefore 68719476736 calls, which is far too
much.

>  That's based on knowing what the code does and how much it was
> called, as well as some confidence that if it were really a *major* problem,
> it would have shown up on the extensive benchmarks done on all the code paths
> you're investigating.
>

The problem might be hidden for the following reasons:
1.) Buffer counts are low enough that even n^2 is small for today's machines
2.) The code is not often called in that way
3.) The backends write out the pages themselves, so this code path is rarely
executed
4.) ...

>> BTW: Are there some tests available how fast a buffer cache hit is and a
>> disk cache hit is (not in the buffer cache but in the disk cache)? I'll
>> asked, because a lot of locking is involved in the code.
>
> I did some once but didn't find anything particularly interesting about the
> results.  Since you seem to be on a research tear here, it would be helpful
> to have a script to test that out available, I wasn't able to release mine
> and something dtrace based would probably be better than the approach I used
> (I threw a bunch of gettimeofdata calls into the logs and post-processed them
> with a script).
>

Do you have an idea where one should set tracepoints inside and outside
PostgreSQL?

>> BTW2: Oracle buffercache and background writer strategy is also
>> interesting.
>
> As a rule, we don't post links to other database implementation details here,
> as those can have patented design details we'd prefer not to intentionally
> re-implement.  Much of Oracle's design here doesn't apply here anyway, as it
> was done in the era when all of their writes were synchronous.  That required
> them to worry about doing a good job on some things in their background
> writer that we shrug off and let os writes combined with fsync handle
> instead.
>

Ok, no problem.

Ciao,
Gerhard

--
http://www.wiesinger.com/


SELECT
   CASE
     WHEN datname IS NULL THEN pg_buffercache.reldatabase::text
     ELSE datname
   END AS database,
   CASE
     WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text
     ELSE spcname
   END AS tablespace,
   CASE
     WHEN relname IS NULL THEN pg_buffercache.relfilenode::text
     ELSE relname
   END AS relation,
   CASE
     WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text
     ELSE relkind
   END AS relkind,
   usagecount > 0 AS usagecount_gt_0,
   isdirty,
   MIN(relblocknumber) AS min_blocknumber,
   MAX(relblocknumber) AS max_blocknumber,
   ROUND(AVG(relblocknumber),2) AS avg_blocknumber,
   ROUND(STDDEV(relblocknumber),2) AS stddev_blocknumber,
   COUNT(*) AS count
FROM
   pg_buffercache
LEFT JOIN pg_class      ON pg_buffercache.relfilenode = pg_class.oid
LEFT JOIN pg_tablespace ON pg_buffercache.reltablespace =
pg_tablespace.oid
LEFT JOIN pg_database   ON pg_buffercache.reldatabase = pg_database.oid
WHERE
   isdirty = true
GROUP BY
   CASE
     WHEN datname IS NULL THEN pg_buffercache.reldatabase::text
     ELSE datname
   END,
   CASE
     WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text
     ELSE spcname
   END,
   CASE
     WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text
     ELSE relkind
   END,
   CASE
     WHEN relname IS NULL THEN pg_buffercache.relfilenode::text
     ELSE relname
   END,
   usagecount > 0,
   isdirty
--HAVING relkind = 'r'
ORDER BY
   database,
   tablespace,
   relkind DESC,
   relation,
   usagecount > 0,
   isdirty
;

Re: Limit of bgwriter_lru_maxpages of max. 1000?

From
Greg Smith
Date:
On Mon, 5 Oct 2009, Gerhard Wiesinger wrote:

> I think the problem is that the write-out is done at checkpoint time (whether
> spread or not). It should already have been done by the bgwriter.

This is pretty simple:  if you write things before checkpoint time, you'll
end up re-writing a percentage of the blocks if they're re-dirtied before
the checkpoint actually happens.  The checkpoint itself is always the most
efficient time to write something out.  People think that the background
writer should do more, but it can't without generating more writes than if
you instead focused on spreading the checkpoints out instead.  This is why
the only work the BGW does try to do is writing out blocks that it's
pretty sure are going to be evicted very soon (in the next 200ms, or
whatever its cycle time is set to), to minimize the potential for
mistakes.  The design errs a bit on the side of doing too little because
it is paranoid about wasted work, and that implementation always
beat one where the background writer was more aggressive in benchmarks.

This is hard for people to accept, but there were three of us running
independent tests to improve things here by the end of 8.3 development and
everybody saw similar results as far as the checkpoint spreading approach
being the right one.  At the time the patch was labeled "load distributed
checkpoint" and if I had more time today I'd try and find the more
interesting parts of that discussion to highlight them.

> BTW: Is it possible to get everything in pg_class over all databases as
> admin?

Scott's message at
http://archives.postgresql.org/pgsql-general/2009-09/msg00986.php
summarizes the problem nicely, and I suggested my workaround for it at
http://archives.postgresql.org/pgsql-general/2009-09/msg00984.php

>>> Bug2: Double iteration of buffers
>>> As you can seen in the calling tree below there is double iteration with
>>> buffers involved. This might be a major performance bottleneck.
>>
>> Hmmm, this might be a real bug causing scans through the buffer cache to go
>> twice as fast as intended.
>
> That's not twice O(2*n)=O(n) that's a factor n*n (outer and inner loop
> iteration) which means overall is O(n^2) which is IHMO too much.

I follow what you mean, I hadn't noticed that.  SyncOneBuffer isn't an O(n)
operation though; it's O(1).  So I'd think the potential bug here turns into
an O(n) issue, given it's the routine being called n times.

This seems like a job for "dump things to the log file" style debugging.
If I can reproduce an actual bug here it sounds like a topic for the
hackers list outside of this discussion.

> The problem might be hidden for the following reasons:
> 1.) Buffers values are too low that even n^2 is low for today's machines
> 2.) Code is not often called in that way
> 3.) backend writes out pages so that the code is never executed

(2) was the reason I figured it might have escaped notice.  It's really
not called that often in a way that would run into the problem you think
is there.

> Do you have an idea where one should set tracepoints inside and outside
> PostgreSQL?

I think you'd want to instrument BufferAlloc inside bufmgr.c to measure
what you're after.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD