Thread: BufferSync and bgwriter
The idea that the bgwriter smooths out the response time of transactions is only true if the buffer lists T1 and T2 have *some* clean buffers available for use when performing I/O. The alternative is that transactions unlucky enough to encounter the no-clean-buffers situation have to clean a space for themselves, effectively making the bgwriter redundant.

In BufferSync, we start off by calling StrategyDirtyBufferList to make a list of all the dirty buffers. Even though we know we are limited to maxpages, we still scan the whole of shared_buffers (...making it a very expensive call and thereby causing us to increase bgwriter_delay, which then negates the cleaning effect as described above). Once we've got the list, we limit ourselves to only using maxpages of the list that we just built. We do it that way round to allow bgwriter_percent to calculate how many of the dirty buffers it should flush, on the assumption that percent < 100.

If bgwriter_percent = 100, then we should actually do the sensible thing and prepare only the list that we need, i.e. limit StrategyDirtyBufferList to finding at most bgwriter_maxpages. Thus if you have a large shared_buffers, you can still have a relatively short bgwriter_delay, so that the bgwriter can keep the LRUs of the T1 and T2 lists free for use...and so let backends get on with useful work.

Patch which implements this attached, for discussion.

Mark, any chance we could run this patch on STP to test whether it has a beneficial performance effect? Re-run test 207 to compare?

I'll be asking for this in 8.0, if it works, for all the same performance reasons discussed previously, as well as its coming under the header of "bgwriter default changes", since it affects the default behaviour when bgwriter_percent = 100. There are some other ideas for 8.1, but those can wait.

-- 
Best Regards, Simon Riggs
Attachment
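A minimal, standalone sketch of the two behaviours Simon describes above -- a toy model only, not PostgreSQL source; the buffer pool, names, and sizes are invented for illustration. The point is that the unpatched path pays an O(shared_buffers) scan every bgwriter round regardless of maxpages, while the patched path bounds the scan (and hence the time the lock would be held) by a cap decided up front.

/*
 * Toy model of the two BufferSync() strategies discussed above -- not
 * PostgreSQL source.
 */
#include <stdio.h>
#include <stdbool.h>

#define NBUFFERS 10000                  /* stand-in for shared_buffers */

static bool dirty[NBUFFERS];            /* toy stand-in for the buffer pool */

/*
 * Current behaviour: build the full list of dirty buffers (an O(NBUFFERS)
 * walk, however small maxpages is) and only then truncate it to maxpages.
 */
static int
scan_then_truncate(int maxpages, int *out)
{
    int found = 0;

    for (int i = 0; i < NBUFFERS; i++)  /* always walks the whole pool */
        if (dirty[i])
            out[found++] = i;
    return (found < maxpages) ? found : maxpages;   /* truncate afterwards */
}

/*
 * Patched behaviour: decide the cap first, then stop scanning as soon as
 * that many dirty buffers have been collected, so the work per round is
 * bounded by maxpages rather than by the size of shared_buffers.
 */
static int
scan_capped(int maxpages, int *out)
{
    int found = 0;

    for (int i = 0; i < NBUFFERS && found < maxpages; i++)
        if (dirty[i])
            out[found++] = i;
    return found;
}

int
main(void)
{
    int list[NBUFFERS];

    for (int i = 0; i < NBUFFERS; i += 3)
        dirty[i] = true;                /* every third buffer is dirty */

    printf("old: %d buffers to write\n", scan_then_truncate(100, list));
    printf("new: %d buffers to write\n", scan_capped(100, list));
    return 0;
}

Both calls report 100 buffers to write; the difference is only in how much of the pool had to be scanned to get there.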
I wonder if we even need to retain the bgwriter_percent GUC var. Is there actually a situation in which the combination of bgwriter_maxpages and bgwriter_delay does not give the DBA sufficient flexibility in tuning bgwriter behavior?

Simon Riggs wrote:
> If the bgwriter_percent = 100, then we should actually do the sensible
> thing and prepare the list that we need, i.e. limit
> StrategyDirtyBufferList to finding at most bgwriter_maxpages.

Is the plan to make bgwriter_percent = 100 the default setting?

-Neil
On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
> Simon Riggs wrote:
> > If the bgwriter_percent = 100, then we should actually do the sensible
> > thing and prepare the list that we need, i.e. limit
> > StrategyDirtyBufferList to finding at most bgwriter_maxpages.
>
> Is the plan to make bgwriter_percent = 100 the default setting?

Hmm...must confess that my only plan is:
i) discover dynamic behaviour of bgwriter
ii) fix any bugs or weirdness as quickly as possible
iii) try to find a way to set the bgwriter defaults

I'm worried that we're late in the day for changes, but I'm equally worried that a) the bgwriter is very tuning sensitive, b) we don't really have much info on how to set the defaults in a meaningful way for the majority of cases, and c) there are some issues that greatly reduce the effectiveness of the bgwriter in many circumstances.

The 100pct.patch was my first attempt at getting something acceptable in the next few days that gives sufficient room for the DBA to perform tuning.

On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
> I wonder if we even need to retain the bgwriter_percent GUC var. Is
> there actually a situation in which the combination of bgwriter_maxpages
> and bgwriter_delay does not give the DBA sufficient flexibility in
> tuning bgwriter behavior?

Yes, I do now think that only two GUCs are required to tune the behaviour; but you make me think - which two? Right now, bgwriter_delay is useless because the O(N) behaviour makes it impossible to set it any lower when you have a large shared_buffers. (I see that as a bug.)

Your question has made me rethink the exact objective of the bgwriter's actions: the way it is coded now, the bgwriter looks for dirty blocks, no matter where they are in the list. What we are bothered about is the number of clean buffers at the LRU, which has a direct influence on the probability that BufferAlloc() will need to call FlushBuffer(), since StrategyGetBuffer() returns the first unpinned buffer, dirty or not. After further thought, I would prefer a subtle change in behaviour so that the bgwriter checks that clean blocks are available at the LRUs for when buffer replacement occurs. With that slight change, I'd keep the bgwriter_percent GUC but make it mean something different.

bgwriter_percent would be the % of shared_buffers that are searched (from the LRU end) to see if they contain dirty buffers, which are then written to disk. That means the number of dirty blocks written to disk is between 0 and the number of buffers searched, but we're not hugely bothered what that number is... [This change to StrategyDirtyBufferList resolves the unusability of the bgwriter with large shared_buffers.]

Writing away dirty blocks towards the MRU end is more likely to be wasted effort. If a block stays near the MRU then it will be dirty again in the wink of an eye, so you gain nothing at checkpoint time by cleaning it. Also, since it isn't near the LRU, cleaning it has no effect on buffer replacement I/O. If a block is at the LRU, then it is by definition the least likely to be reused, and is a candidate for replacement anyway. So concentrating on the LRU, not the number of dirty buffers, seems to be the better thing to do.

That would then be a much simpler way of setting the defaults. With that definition, we would set the defaults:

bgwriter_percent = 2 (according to my new suggestion here)
bgwriter_delay = 200
bgwriter_maxpages = -1 (i.e. mostly ignore it, but keep it for fine tuning)

Thus, for the default shared_buffers=1000 the bgwriter would clear a space of up to 20 blocks each cycle. For a config with shared_buffers=60000, the bgwriter default would clear space for 1200 blocks (max) each cycle - a reasonable setting.

Overall that would need very little specific tuning, because it would scale upwards as you changed the shared_buffers higher.

So, that interpretation of bgwriter_percent gives these advantages:
- we bound the StrategyDirtyBufferList scan to a small % of the whole list, rather than the whole list...so we could realistically set the bgwriter_delay lower if required
- we can set a default that scales, so we would not often need to change it
- the parameter is defined in terms of the thing we really care about: sufficient clean blocks at the LRU of the buffer lists
- these changes are very isolated and actually minor - just a different way of specifying which buffers the bgwriter will clean

Patch attached...again for discussion and to help understanding of this proposal. Will submit to patches if we agree it seems like the best way to allow the bgwriter defaults to be sensibly set.

[...and yes, everybody, I do know where we are in the release cycle]

-- 
Best Regards, Simon Riggs
Attachment
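A small standalone sketch of the reinterpreted bgwriter_percent described in the message above. The rounding formula matches the one in the patch quoted later in the thread (maxdirty = (NBuffers * percent + 99) / 100); the surrounding program and names are illustrative only, not PostgreSQL source.

/*
 * Illustrative only.  Shows how the proposed defaults
 * (bgwriter_percent = 2, bgwriter_maxpages = -1) scale the per-round
 * search with shared_buffers.
 */
#include <stdio.h>

static int
buffers_to_search(int nbuffers, int percent, int maxpages)
{
    if (maxpages > 0)
        return maxpages;                        /* explicit page cap wins */
    if (percent > 0)
        return (nbuffers * percent + 99) / 100; /* percent of pool, rounded up */
    return nbuffers;                            /* no limit set: whole pool */
}

int
main(void)
{
    printf("shared_buffers=1000  -> search up to %d buffers per cycle\n",
           buffers_to_search(1000, 2, -1));     /* 20 */
    printf("shared_buffers=60000 -> search up to %d buffers per cycle\n",
           buffers_to_search(60000, 2, -1));    /* 1200 */
    return 0;
}

Because the search bound grows with shared_buffers, the default would need little per-installation tuning, which is exactly the property argued for above.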
On Sun, 2004-12-12 at 22:08 +0000, Simon Riggs wrote:
> > On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
> > Is the plan to make bgwriter_percent = 100 the default setting?
>
> Hmm...must confess that my only plan is:
> i) discover dynamic behaviour of bgwriter
> ii) fix any bugs or weirdness as quickly as possible
> iii) try to find a way to set the bgwriter defaults

I was just curious why you were bothering to special-case bgwriter_percent = 100 if it's not going to be the default setting (in which case I would be surprised if more than 1 in 10 users would take advantage of the patch).

> Right now, bgwriter_delay
> is useless because the O(N) behaviour makes it impossible to set any
> lower when you have a large shared_buffers.

BTW, I wouldn't be _too_ worried about O(N) behavior, except that we do this scan while holding the BufMgrLock, which is a well-known source of contention. So reducing the time we hold that lock would be good.

> Your question has made me rethink the exact objective of the bgwriter's
> actions: The way it is coded now the bgwriter looks for dirty blocks, no
> matter where they are in the list.

Not sure what you mean. StrategyDirtyBufferList() returns the specified number of dirty buffers in order, starting with the T1/T2 LRUs and going back to the MRUs of both lists. bgwriter_percent effectively ignores some portion of the tail of that list, so we end up just flushing the buffers closest to the T1/T2 LRUs. How is this different from what you're describing?

> bgwriter_percent would be the % of shared_buffers that are searched
> (from the LRU end) to see if they contain dirty buffers, which are
> then written to disk.

By definition, buffers closest to the LRU end of the lists are not frequently accessed. If we only search the N% of the lists closest to the LRU, we will probably end up flushing just those pages to disk -- and then not flushing anything else to disk in the subsequent bgwriter calls, because all the buffers close to the LRU will be non-dirty. That's okay if all we're concerned about is avoiding write() by a real backend, but we also want to smooth out checkpoint load, which I don't think this approach would do well.

I suggest just getting rid of bgwriter_percent: AFAICS bgwriter_maxpages is all the tuning we need, and I think "max # of pages to write" is a simpler and more logical tuning knob than "% of the buffer pool to scan looking for dirty buffers." So at each bufmgr invocation, we pick at most bgwriter_maxpages dirty pages from the pool, using the pages closest to the LRUs of T1 and T2. I'd be happy to supply a patch to implement that if you think it sounds okay.

-Neil
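A toy sketch of the bgwriter control loop under Neil's two-knob proposal (bgwriter_delay plus bgwriter_maxpages only). This is not the actual bgwriter code; the helper and its behaviour are stand-ins invented for the example.

/* Illustrative only -- not PostgreSQL source. */
#include <stdio.h>
#include <unistd.h>

static int bgwriter_delay_ms = 200;     /* sleep between rounds, in ms */
static int bgwriter_maxpages = 100;     /* upper bound on writes per round */

/*
 * Stand-in for "collect up to n dirty pages nearest the T1/T2 LRUs and
 * write them"; returns how many were actually written.
 */
static int
flush_dirty_pages_near_lru(int n)
{
    return n;                           /* placeholder */
}

int
main(void)
{
    for (int round = 0; round < 3; round++)     /* forever, in reality */
    {
        int written = flush_dirty_pages_near_lru(bgwriter_maxpages);

        printf("round %d: wrote %d pages\n", round, written);
        usleep(bgwriter_delay_ms * 1000);       /* bgwriter_delay */
    }
    return 0;
}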
Simon,

I am seeing a reasonably reproducible performance boost after applying your patch (I'm not sure if that was one of the main objectives, but it certainly is nice).

I *was* seeing a noticeable decrease between 7.4.6 and 8.0.0RC1 running pgbench. However, after applying your patch, 8.0 is pretty much back to being the same. Now I know pgbench is ..err... not always the most reliable for this sort of thing, so I am interested in whether this seems like a reasonable sort of thing to be noticing (and also whether anyone else has noticed the decrement)?

(The attached brief results are for Linux x86, but I can see a similar performance decrement 7.4.6 -> 8.0.0RC1 on FreeBSD 5.3 x86.)

regards

Mark

Simon Riggs wrote:
> Hmm...must confess that my only plan is:
> i) discover dynamic behaviour of bgwriter
> ii) fix any bugs or weirdness as quickly as possible
> iii) try to find a way to set the bgwriter defaults

System
------
P4 2.8Ghz 1G 1xSeagate Barracuda 40G
Linux 2.6.9 glibc 2.3.3 gcc 3.4.2
Postgresql 7.4.6 | 8.0.0RC1

Test
----
Pgbench with scale factor = 200

Pg 7.4.6
--------

clients  transactions  tps
1        1000          65.1
2        1000          72.5
4        1000          69.2
8        1000          48.3

Pg 8.0.0RC1
-----------

clients  transactions  tps    tps (new buff patch + settings)
1        1000          55.8   70.9
2        1000          68.3   77.9
4        1000          38.4   62.8
8        1000          29.4   38.1

(averages over 3 runs, database dropped and recreated after each set, with a checkpoint performed after each individual run)

Parameters
----------

Non-default postgresql.conf parameters:

tcpip_socket = true [listen_addresses = "*"]
max_connections = 100
shared_buffers = 10000
wal_buffers = 1024
checkpoint_segments = 10
effective_cache_size = 40000
random_page_cost = 0.8

bgwriter settings (used with patch only):

bgwriter_delay = 200
bgwriter_percent = 2
bgwriter_maxpages = 100
On Mon, 2004-12-13 at 04:39, Mark Kirkwood wrote:
> I am seeing a reasonably reproducible performance boost after applying
> your patch (I'm not sure if that was one of the main objectives, but it
> certainly is nice).
>
> I *was* seeing a noticeable decrease between 7.4.6 and 8.0.0RC1 running
> pgbench. However, after applying your patch, 8.0 is pretty much back to
> being the same.

Thanks Mark - brilliant to have some confirming test results back so quickly.

The tests indicate that we're on the right track here and that we should test this on the OSDL platform also, on a long run, to check out the effects of both normal running and checkpointing.

Given these test settings:

bgwriter_delay = 200
bgwriter_percent = 2
bgwriter_maxpages = 100

this shows the importance of reducing the length of time the BufMgrLock is held in StrategyDirtyBufferList() -- which I think Neil also agrees is the main problem here.

> ______________________________________________________________________
> System
> ------
> P4 2.8Ghz 1G 1xSeagate Barracuda 40G
> Linux 2.6.9 glibc 2.3.3 gcc 3.4.2
> Postgresql 7.4.6 | 8.0.0RC1
>
> Test
> ----
> Pgbench with scale factor = 200
>
> Pg 7.4.6
> --------
>
> clients  transactions  tps
> 1        1000          65.1
> 2        1000          72.5
> 4        1000          69.2
> 8        1000          48.3
>
> Pg 8.0.0RC1
> -----------
>
> clients  transactions  tps    tps (new buff patch + settings)
> 1        1000          55.8   70.9
> 2        1000          68.3   77.9
> 4        1000          38.4   62.8
> 8        1000          29.4   38.1
>
> (averages over 3 runs, database dropped and recreated after each set,
> with a checkpoint performed after each individual run)
>
> Parameters
> ----------
>
> Non-default postgresql.conf parameters:
>
> tcpip_socket = true [listen_addresses = "*"]
> max_connections = 100
> shared_buffers = 10000
> wal_buffers = 1024
> checkpoint_segments = 10
> effective_cache_size = 40000
> random_page_cost = 0.8
>
> bgwriter settings (used with patch only):
>
> bgwriter_delay = 200
> bgwriter_percent = 2
> bgwriter_maxpages = 100

-- 
Best Regards, Simon Riggs
On Mon, 2004-12-13 at 02:43, Neil Conway wrote:
> On Sun, 2004-12-12 at 22:08 +0000, Simon Riggs wrote:
> > > On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
> > > Is the plan to make bgwriter_percent = 100 the default setting?
> >
> > Hmm...must confess that my only plan is:
> > i) discover dynamic behaviour of bgwriter
> > ii) fix any bugs or weirdness as quickly as possible
> > iii) try to find a way to set the bgwriter defaults
>
> I was just curious why you were bothering to special-case
> bgwriter_percent = 100 if it's not going to be the default setting (in
> which case I would be surprised if more than 1 in 10 users would take
> advantage of the patch).
>
> > Right now, bgwriter_delay
> > is useless because the O(N) behaviour makes it impossible to set any
> > lower when you have a large shared_buffers.
>
> BTW, I wouldn't be _too_ worried about O(N) behavior, except that we do
> this scan while holding the BufMgrLock, which is a well-known source of
> contention. So reducing the time we hold that lock would be good.

Yes, the duration for which the BufMgrLock is held during StrategyDirtyBufferList, and its effect on system performance, is my concern. Reducing that is one of the primary objectives here (point (ii)).

> > bgwriter_percent would be the % of shared_buffers that are searched
> > (from the LRU end) to see if they contain dirty buffers, which are
> > then written to disk.
>
> By definition, buffers closest to the LRU end of the lists are not
> frequently accessed. If we only search the N% of the lists closest to
> LRU, we will probably end up flushing just those pages to disk -- and
> then not flushing anything else to disk in the subsequent bgwriter calls
> because all the buffers close to the LRU will be non-dirty. That's okay
> if all we're concerned about is avoiding write() by a real backend, but
> we also want to smooth out checkpoint load, which I don't think this
> approach would do well.

My argument for that was: the N%-of-lists-closest-to-LRU approach gives
- constant search time (searching for N dirty buffers causes a variable number of buffers to be searched, so lock time varies...)
- if blocks are no longer used, they eventually migrate to the LRU, so they then get written away by the bgwriter rather than at checkpoint time
- the blocks near the MRU get dirtied again fairly quickly, so they still need to be flushed again at checkpoint

So, overall, I think this would smooth out the checkpoint load.

We've little time left: if we do not manage to perform a performance test that shows that this argument is valid, then I'd agree that we drop that idea (for now) because of the risk that it does have the side-effect you mention.

Longer term, I think possibly having two types of bgwriter activity would be worthwhile:
1) short and frequent LRU cleaning
2) longer but less frequent mini-checkpoints that reach up towards the MRU

> I suggest just getting rid of bgwriter_percent: AFAICS bgwriter_maxpages
> is all the tuning we need, and I think "max # of pages to write" is a
> simpler and more logical tuning knob than "% of the buffer pool to scan
> looking for dirty buffers." So at each bufmgr invocation, we pick at
> most bgwriter_maxpages dirty pages from the pool, using the pages
> closest to the LRUs of T1 and T2.

Whichever way we do it, we agree that bgwriter_maxpages is all the tuning that you and I need. My suggestion was to provide both the tuning knob AND a way to remove the need for the knob completely for the (as you say) 9 out of 10 people that never will perform any tuning, by using bgwriter_percent to set a value that is approximately correct all of the time.

Anyway, thanks for taking the time to read all of these postings. We're clearly agreed on the main aspect of this, AFAICS.

> I'd be happy to supply a patch to
> implement that if you think it sounds okay.

...my understanding is that you'd only be touching BufferSync() to simplify it, and to remove all of the bgwriter_percent GUC stuff and its call path to BufferSync()? I've hacked my patch down to show what I think you mean for the BufferSync() changes... to allow perf comparisons if time allows. Clearly your own patch will more accurately portray those...

-- 
Best Regards, Simon Riggs
Attachment
Sorry for the delay; here are results with the bg3.patch, with database parameters that should match run 207. I haven't been able to take the time to look over the results myself, but I tried to make sure this run was the same as 207:

http://www.osdl.org/projects/dbt2dev/results/dev4-010/207

Mark
Sorry, wrong link, right one here:

http://www.osdl.org/projects/dbt2dev/results/dev4-010/211

Mark
On Wed, 2004-12-15 at 00:00, Mark Wong wrote:
> http://www.osdl.org/projects/dbt2dev/results/dev4-010/211

Thanks Mark for turning that around so quickly. Looks good...

Results performed to compare:

test 207
http://www.osdl.org/projects/dbt2dev/results/dev4-010/207
test 211 with bg3.patch, which matches Neil's/my option (3)
http://www.osdl.org/projects/dbt2dev/results/dev4-010/211

The overall results show a 3% throughput gain. The negative effects of checkpointing are significantly reduced, and this shows up in the New Order transaction response time max dropping from 37s to 25s, which looks like a significant user-visible performance gain. A similar reduction in max response times is shown for all transaction types: consistent removal of the longest wait times.

The gains come from greater effectiveness of the bgwriter, which reduces I/O wait time spikes to almost zero once the shared_buffers are completely full (see Processor Utilization graph: wait).

It looks to me that reducing the bgwriter_delay slightly might yield additional gains, say to 180 or 160. That should now be possible since the cost of doing so has been greatly reduced. StrategyDirtyBufferList has now dropped way down the list in oprofile results.

Neil very kindly points out privately that the patch has a missing sanity check bug in it, which has shown up in Neil's testing. That wouldn't affect these performance results, however. I leave it to Neil to post a corrected version as a result of his efforts.

I leave it to the consensus to decide whether these results represent significant gains and whether to add this to 8.0, or defer. Neil's suggestion (2) also needs to be considered - test results could still show that as the better option, so I keep an open mind.

-- 
Best Regards, Simon Riggs
On 12/12/2004 5:08 PM, Simon Riggs wrote:
>> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> Simon Riggs wrote:
>> > If the bgwriter_percent = 100, then we should actually do the sensible
>> > thing and prepare the list that we need, i.e. limit
>> > StrategyDirtyBufferList to finding at most bgwriter_maxpages.
>>
>> Is the plan to make bgwriter_percent = 100 the default setting?
>
> Hmm...must confess that my only plan is:
> i) discover dynamic behaviour of bgwriter
> ii) fix any bugs or weirdness as quickly as possible
> iii) try to find a way to set the bgwriter defaults
>
> I'm worried that we're late in the day for changes, but I'm equally
> worried that a) the bgwriter is very tuning sensitive b) we don't really
> have much info on how to set the defaults in a meaningful way for the
> majority of cases c) there are some issues that greatly reduce the
> effectiveness of the bgwriter in many circumstances.
>
> The 100pct.patch was my first attempt at getting something acceptable in
> the next few days that gives sufficient room for the DBA to perform
> tuning.

Doesn't cranking up bgwriter_percent to 100 effectively make the entire shared memory a write-through cache? In other words, with 100% the bgwriter will always write all dirty blocks out, and it becomes unlikely to avoid an I/O for subsequent modifications to the same data block.

Jan

> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> I wonder if we even need to retain the bgwriter_percent GUC var. Is
>> there actually a situation in which the combination of bgwriter_maxpages
>> and bgwriter_delay does not give the DBA sufficient flexibility in
>> tuning bgwriter behavior?
>
> Yes, I do now think that only two GUCs are required to tune the
> behaviour; but you make me think - which two? Right now, bgwriter_delay
> is useless because the O(N) behaviour makes it impossible to set any
> lower when you have a large shared_buffers. (I see that as a bug)
>
> Your question has made me rethink the exact objective of the bgwriter's
> actions: The way it is coded now the bgwriter looks for dirty blocks, no
> matter where they are in the list. What we are bothered about is the
> number of clean buffers at the LRU, which has a direct influence on the
> probability that BufferAlloc() will need to call FlushBuffer(), since
> StrategyGetBuffer() returns the first unpinned buffer, dirty or not.
> After further thought, I would prefer a subtle change in behaviour so
> that the bgwriter checks that clean blocks are available at the LRUs for
> when buffer replacement occurs. With that slight change, I'd keep the
> bgwriter_percent GUC but make it mean something different.
>
> bgwriter_percent would be the % of shared_buffers that are searched
> (from the LRU end) to see if they contain dirty buffers, which are then
> written to disk. That means the number of dirty blocks written to disk
> is between 0 and the number of buffers searched, but we're not hugely
> bothered what that number is... [This change to StrategyDirtyBufferList
> resolves the unusability of the bgwriter with large shared_buffers]
>
> Writing away dirty blocks towards the MRU end is more likely to be
> wasted effort. If a block stays near the MRU then it will be dirty again
> in the wink of an eye, so you gain nothing at checkpoint time by
> cleaning it. Also, since it isn't near the LRU, cleaning it has no
> effect on buffer replacement I/O. If a block is at the LRU, then it is
> by definition the least likely to be reused, and is a candidate for
> replacement anyway. So concentrating on the LRU, not the number of dirty
> buffers, seems to be the better thing to do.
>
> That would then be a much simpler way of setting the defaults. With that
> definition, we would set the defaults:
>
> bgwriter_percent = 2 (according to my new suggestion here)
> bgwriter_delay = 200
> bgwriter_maxpages = -1 (i.e. mostly ignore it, but keep it for fine
> tuning)
>
> Thus, for the default shared_buffers=1000 the bgwriter would clear a
> space of up to 20 blocks each cycle.
> For a config with shared_buffers=60000, the bgwriter default would clear
> space for 1200 blocks (max) each cycle - a reasonable setting.
>
> Overall that would need very little specific tuning, because it would
> scale upwards as you changed the shared_buffers higher.
>
> So, that interpretation of bgwriter_percent gives these advantages:
> - we bound the StrategyDirtyBufferList scan to a small % of the whole
> list, rather than the whole list...so we could realistically set the
> bgwriter_delay lower if required
> - we can set a default that scales, so would not often need to change it
> - the parameter is defined in terms of the thing we really care about:
> sufficient clean blocks at the LRU of the buffer lists
> - these changes are very isolated and actually minor - just a different
> way of specifying which buffers the bgwriter will clean
>
> Patch attached...again for discussion and to help understanding of this
> proposal. Will submit to patches if we agree it seems like the best way
> to allow the bgwriter defaults to be sensibly set.
>
> [...and yes, everybody, I do know where we are in the release cycle]
>
>
> ------------------------------------------------------------------------
>
> Index: buffer/bufmgr.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
> retrieving revision 1.182
> diff -d -c -r1.182 bufmgr.c
> *** buffer/bufmgr.c    24 Nov 2004 02:56:17 -0000    1.182
> --- buffer/bufmgr.c    12 Dec 2004 21:53:10 -0000
> ***************
> *** 681,686 ****
> --- 681,687 ----
>   {
>       BufferDesc **dirty_buffers;
>       BufferTag  *buftags;
> +     int         maxdirty;
>       int         num_buffer_dirty;
>       int         i;
>
> ***************
> *** 688,717 ****
>       if (percent == 0 || maxpages == 0)
>           return 0;
>
>       /*
>        * Get a list of all currently dirty buffers and how many there are.
>        * We do not flush buffers that get dirtied after we started. They
>        * have to wait until the next checkpoint.
>        */
> !     dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
> !     buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
>
>       LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
> -     num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> -                                                NBuffers);
>
> !     /*
> !      * If called by the background writer, we are usually asked to only
> !      * write out some portion of dirty buffers now, to prevent the IO
> !      * storm at checkpoint time.
> !      */
> !     if (percent > 0)
> !     {
> !         Assert(percent <= 100);
> !         num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
> !     }
> !     if (maxpages > 0 && num_buffer_dirty > maxpages)
> !         num_buffer_dirty = maxpages;
>
>       /* Make sure we can handle the pin inside the loop */
>       ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> --- 689,720 ----
>       if (percent == 0 || maxpages == 0)
>           return 0;
>
> +     /* Set number of buffers we will clean at LRUs of buffer lists
> +      * If no limits set, then clean the whole of shared_buffers
> +      */
> +     if (maxpages > 0)
> +         maxdirty = maxpages;
> +     else {
> +         if (percent > 0) {
> +             Assert(percent <= 100);
> +             maxdirty = (NBuffers * percent + 99) / 100;
> +         }
> +         else
> +             maxdirty = NBuffers;
> +     }
> +
>       /*
>        * Get a list of all currently dirty buffers and how many there are.
>        * We do not flush buffers that get dirtied after we started. They
>        * have to wait until the next checkpoint.
>        */
> !     dirty_buffers = (BufferDesc **) palloc(maxdirty * sizeof(BufferDesc *));
> !     buftags = (BufferTag *) palloc(maxdirty * sizeof(BufferTag));
>
>       LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
>
> !     num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> !                                                maxdirty);
>
>       /* Make sure we can handle the pin inside the loop */
>       ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> Index: buffer/freelist.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v
> retrieving revision 1.48
> diff -d -c -r1.48 freelist.c
> *** buffer/freelist.c    16 Sep 2004 16:58:31 -0000    1.48
> --- buffer/freelist.c    12 Dec 2004 21:53:11 -0000
> ***************
> *** 735,741 ****
>    * StrategyDirtyBufferList
>    *
>    * Returns a list of dirty buffers, in priority order for writing.
> -  * Note that the caller may choose not to write them all.
>    *
>    * The caller must beware of the possibility that a buffer is no longer dirty,
>    * or even contains a different page, by the time he reaches it. If it no
> --- 735,740 ----
> ***************
> *** 755,760 ****
> --- 754,760 ----
>       int         cdb_id_t2;
>       int         buf_id;
>       BufferDesc *buf;
> +     int         i;
>
>       /*
>        * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
> ***************
> *** 765,771 ****
>       cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
>       cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>
> !     while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
>       {
>           if (cdb_id_t1 >= 0)
>           {
> --- 765,771 ----
>       cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
>       cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>
> !     for (i = 0; i < max_buffers; i++)
>       {
>           if (cdb_id_t1 >= 0)
>           {
> ***************
> *** 779,786 ****
>                   buffers[num_buffer_dirty] = buf;
>                   buftags[num_buffer_dirty] = buf->tag;
>                   num_buffer_dirty++;
> -                 if (num_buffer_dirty >= max_buffers)
> -                     break;
>               }
>           }
>
> --- 779,784 ----
> ***************
> *** 799,806 ****
>                   buffers[num_buffer_dirty] = buf;
>                   buftags[num_buffer_dirty] = buf->tag;
>                   num_buffer_dirty++;
> -                 if (num_buffer_dirty >= max_buffers)
> -                     break;
>               }
>           }
>
> --- 797,802 ----

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On 12/12/2004 9:43 PM, Neil Conway wrote:
> On Sun, 2004-12-12 at 22:08 +0000, Simon Riggs wrote:
>> > On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> > Is the plan to make bgwriter_percent = 100 the default setting?
>>
>> Hmm...must confess that my only plan is:
>> i) discover dynamic behaviour of bgwriter
>> ii) fix any bugs or weirdness as quickly as possible
>> iii) try to find a way to set the bgwriter defaults
>
> I was just curious why you were bothering to special-case
> bgwriter_percent = 100 if it's not going to be the default setting (in
> which case I would be surprised if more than 1 in 10 users would take
> advantage of the patch).
>
>> Right now, bgwriter_delay
>> is useless because the O(N) behaviour makes it impossible to set any
>> lower when you have a large shared_buffers.
>
> BTW, I wouldn't be _too_ worried about O(N) behavior, except that we do
> this scan while holding the BufMgrLock, which is a well-known source of
> contention. So reducing the time we hold that lock would be good.
>
>> Your question has made me rethink the exact objective of the bgwriter's
>> actions: The way it is coded now the bgwriter looks for dirty blocks, no
>> matter where they are in the list.
>
> Not sure what you mean. StrategyDirtyBufferList() returns the specified
> number of dirty buffers in order, starting with the T1/T2 LRUs and going
> back to the MRUs of both lists. bgwriter_percent effectively ignores
> some portion of the tail of that list, so we end up just flushing the
> buffers closest to the T1/T2 LRUs. How is this different from what
> you're describing?
>
>> bgwriter_percent would be the % of shared_buffers that are searched
>> (from the LRU end) to see if they contain dirty buffers, which are
>> then written to disk.
>
> By definition, buffers closest to the LRU end of the lists are not
> frequently accessed. If we only search the N% of the lists closest to
> LRU, we will probably end up flushing just those pages to disk -- and
> then not flushing anything else to disk in the subsequent bgwriter calls
> because all the buffers close to the LRU will be non-dirty. That's okay
> if all we're concerned about is avoiding write() by a real backend, but
> we also want to smooth out checkpoint load, which I don't think this
> approach would do well.
>
> I suggest just getting rid of bgwriter_percent: AFAICS bgwriter_maxpages
> is all the tuning we need, and I think "max # of pages to write" is a
> simpler and more logical tuning knob than "% of the buffer pool to scan
> looking for dirty buffers." So at each bufmgr invocation, we pick at
> most bgwriter_maxpages dirty pages from the pool, using the pages
> closest to the LRUs of T1 and T2. I'd be happy to supply a patch to
> implement that if you think it sounds okay.

I too don't think that this approach will retain the checkpoint-smoothing effect the current implementation has.

The real problem is that the "cleaner" the buffer pool is, the longer the scan for dirty buffers will take, because the dirty blocks tend to be at the very end of the scan order. The real solution for this would be not to scan the whole pool, but to maintain a separate chain of only the dirty buffers, in LRU order.

Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
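A minimal sketch of the separate dirty-buffer chain Jan suggests above, kept in (approximate) dirtying order so the writer can take flush candidates from its head without ever scanning clean buffers. This is not PostgreSQL source; all types and names are invented for the example.

/* Illustrative only. */
#include <stdio.h>
#include <stddef.h>

typedef struct ToyBuffer
{
    int               buf_id;
    struct ToyBuffer *dirty_prev;   /* links valid only while dirty */
    struct ToyBuffer *dirty_next;
} ToyBuffer;

static ToyBuffer *dirty_head;   /* oldest dirtied: flush candidates here */
static ToyBuffer *dirty_tail;   /* most recently dirtied buffers go here */

/* Called when a buffer is first dirtied: append at the tail, O(1). */
static void
mark_dirty(ToyBuffer *buf)
{
    buf->dirty_next = NULL;
    buf->dirty_prev = dirty_tail;
    if (dirty_tail)
        dirty_tail->dirty_next = buf;
    else
        dirty_head = buf;
    dirty_tail = buf;
}

/* Called after a buffer has been written out: unlink it, O(1). */
static void
mark_clean(ToyBuffer *buf)
{
    if (buf->dirty_prev)
        buf->dirty_prev->dirty_next = buf->dirty_next;
    else
        dirty_head = buf->dirty_next;
    if (buf->dirty_next)
        buf->dirty_next->dirty_prev = buf->dirty_prev;
    else
        dirty_tail = buf->dirty_prev;
    buf->dirty_prev = buf->dirty_next = NULL;
}

int
main(void)
{
    ToyBuffer bufs[3] = {{0}, {1}, {2}};

    for (int i = 0; i < 3; i++)
        mark_dirty(&bufs[i]);
    mark_clean(&bufs[0]);                       /* writer flushed buffer 0 */
    for (ToyBuffer *b = dirty_head; b; b = b->dirty_next)
        printf("still dirty: %d\n", b->buf_id); /* prints 1, then 2 */
    return 0;
}

With such a chain, the cost of each bgwriter round would be proportional to the number of buffers it actually writes, not to the size of shared_buffers, which is the property the "cleaner pool means longer scans" complaint is about.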
Jan,

> I too don't think that this approach will retain the checkpoint-smoothing
> effect the current implementation has.
>
> The real problem is that the "cleaner" the buffer pool is, the longer
> the scan for dirty buffers will take, because the dirty blocks tend to be
> at the very end of the scan order. The real solution for this would be
> not to scan the whole pool, but to maintain a separate chain of only
> the dirty buffers, in LRU order.

Hmmm, I've not seen this. For example, with people who are having trouble with checkpoint spikes on Linux, I've taken to recommending that they call sync() (via cron) every 5-10 seconds (thanks, Bruce, for the suggestion!). Believe it or not, this does help smooth out the spikes and give better overall performance in a many-small-writes situation.

Simon, one of the problems with the OSDL-DBT2 test is that it's too steady. DBT2 gives a constant stream of small writes at a regular, predictable rate. This does not, in fact, match any real-world application I know. To allow DBT2 to be used for real bgwriter benchmarking, Mark would need to change the following:

1) Randomize the timing of the commits, so that sometimes there are only 30 writes/minute, and other times there are 300. A timing pattern that would produce a "sine wave" with occasional random spikes would be best; in my experience, OLTP applications tend to have wave-like spikes and lulls.

2) Include a sprinkling of random or regular "large writes" which affect several tables and 1000's of rows. For example, once per hour, change 10,000 pending orders to "shipped", and archive 10,000 "old orders" to an archive table.

However, this would require "splitting" DBT2; there's the DBT2 which simulates the TPC-C test, and the DBT2 which will help us tune for real-world applications. The two tests will not be the same.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Folks,

> To allow DBT2 to be used for real bgwriter benchmarking, Mark would need to
> change the following:
>
> 1) Randomize the timing of the commits, so that sometimes there are only 30
> writes/minute, and other times there are 300. A timing pattern that would
> produce a "sine wave" with occasional random spikes would be best; in my
> experience, OLTP applications tend to have wave-like spikes and lulls.
>
> 2) Include a sprinkling of random or regular "large writes" which affect
> several tables and 1000's of rows. For example, once per hour, change
> 10,000 pending orders to "shipped", and archive 10,000 "old orders" to an
> archive table.

Oh, also we need to:

3) Run the test for 3+ hours after scaling up, and turn on autovacuum.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
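A small standalone sketch of the randomized commit timing Josh describes in the preceding messages (a sine wave of writes/minute with occasional random spikes). The constants, names, and spike probability are invented for illustration; this is not part of DBT2.

/* Illustrative only -- compile with -lm. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int
main(void)
{
    const double pi     = 3.14159265358979323846;
    const double base   = 165.0;   /* midpoint: roughly (30 + 300) / 2 writes/min */
    const double swing  = 135.0;   /* amplitude: roughly 30 .. 300 writes/min */
    const double period = 60.0;    /* one full wave every 60 minutes */

    srand(42);
    for (int minute = 0; minute < 120; minute++)
    {
        double rate = base + swing * sin(2.0 * pi * minute / period);

        if (rand() % 100 < 5)          /* occasional random spike */
            rate *= 3.0;
        printf("minute %3d: target %6.1f writes/min\n", minute, rate);
    }
    return 0;
}

A load generator driven by a schedule like this would exercise the bgwriter across lulls and bursts rather than at a single steady rate, which is the property argued for above.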
Jan Wieck <JanWieck@yahoo.com> writes:
> Doesn't cranking up bgwriter_percent to 100 effectively make the entire
> shared memory a write-through cache? In other words, with 100% the bgwriter
> will always write all dirty blocks out, and it becomes unlikely to avoid an
> I/O for subsequent modifications to the same data block.

If the goal is to not write out hot pages, why look in T1 at all? Why not just flush 100% of the dirty pages from T2 and ignore T1 entirely?

-- 
greg