Re: [Testperf-general] BufferSync and bgwriter - Mailing list pgsql-hackers

From Jan Wieck
Subject Re: [Testperf-general] BufferSync and bgwriter
Date
Msg-id 41C061EE.5030508@Yahoo.com
In response to Re: [Testperf-general] BufferSync and bgwriter  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: [Testperf-general] BufferSync and bgwriter
List pgsql-hackers
On 12/12/2004 5:08 PM, Simon Riggs wrote:

> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> Simon Riggs wrote:
>> > If the bgwriter_percent = 100, then we should actually do the sensible
>> > thing and prepare the list that we need, i.e. limit
>> > StrategyDirtyBufferList to finding at most bgwriter_maxpages.
>> 
>> Is the plan to make bgwriter_percent = 100 the default setting?
> 
> Hmm...must confess that my only plan is:
> i) discover dynamic behaviour of bgwriter
> ii) fix any bugs or weirdness as quickly as possible
> iii) try to find a way to set the bgwriter defaults
> 
> I'm worried that we're late in the day for changes, but I'm equally
> worried that a) the bgwriter is very tuning sensitive b) we don't really
> have much info on how to set the defaults in a meaningful way for the
> majority of cases c) there are some issues that greatly reduce the
> effectiveness of the bgwriter in many circumstances.
> 
> The 100pct.patch was my first attempt at getting something acceptable in
> the next few days that gives sufficient room for the DBA to perform
> tuning.

Doesn't cranking bgwriter_percent up to 100 effectively turn the 
entire shared memory into a write-through cache? In other words, with 100% 
the bgwriter will always write out all dirty blocks, and it becomes 
unlikely that we avoid an I/O for subsequent modifications to the same data block.


Jan

> 
> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> I wonder if we even need to retain the bgwriter_percent GUC var. Is 
>> there actually a situation in which the combination of bgwriter_maxpages 
>> and bgwriter_delay does not give the DBA sufficient flexibility in 
>> tuning bgwriter behavior?
> 
> Yes, I do now think that only two GUCs are required to tune the
> behaviour; but you make me think - which two? Right now, bgwriter_delay
> is useless because the O(N) behaviour makes it impossible to set it any
> lower when shared_buffers is large. (I see that as a bug)
> 
> Your question has made me rethink the exact objective of the bgwriter's
> actions: the way it is coded now, the bgwriter looks for dirty blocks, no
> matter where they are in the list. What we are bothered about is the
> number of clean buffers at the LRU, which has a direct influence on the
> probability that BufferAlloc() will need to call FlushBuffer(), since
> StrategyGetBuffer() returns the first unpinned buffer, dirty or not.
> After further thought, I would prefer a subtle change in behaviour so
> that the bgwriter checks that clean blocks are available at the LRUs for
> when buffer replacement occurs. With that slight change, I'd keep the
> bgwriter_percent GUC but make it mean something different.
> 
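A minimal, self-contained toy model of the replacement path being described
above. This is illustrative only, not the actual PostgreSQL code; the
ToyBuffer type and toy_get_buffer() function are invented for the example,
standing in for the real StrategyGetBuffer() behaviour of handing back the
first unpinned buffer whether or not it is dirty:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model, not PostgreSQL code: shows why a dirty buffer at the LRU
 * end forces a write inside the backend's own allocation path.
 */
typedef struct
{
    bool    pinned;
    bool    dirty;
} ToyBuffer;

/* return the first unpinned buffer, scanning from the LRU end */
static ToyBuffer *
toy_get_buffer(ToyBuffer *lru_order[], int n)
{
    int     i;

    for (i = 0; i < n; i++)
        if (!lru_order[i]->pinned)
            return lru_order[i];
    return NULL;
}

int
main(void)
{
    ToyBuffer   a = {false, true};      /* at the LRU: unpinned but dirty */
    ToyBuffer   b = {false, false};
    ToyBuffer  *lru_order[] = {&a, &b};
    ToyBuffer  *victim = toy_get_buffer(lru_order, 2);

    if (victim && victim->dirty)
        printf("victim is dirty: the backend must flush it before reuse\n");
    return 0;
}
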
> bgwriter_percent would be the % of shared_buffers that are searched
> (from the LRU end) to see if they contain dirty buffers, which are then
> written to disk.  That means the number of dirty blocks written to disk
> is between 0 and the number of buffers searched, but we're not hugely
> bothered what that number is... [This change to StrategyDirtyBufferList
> resolves the unusability of the bgwriter with large shared_buffers]
> 
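A second toy sketch of this proposed meaning of bgwriter_percent: search a
bounded number of buffers counted from the LRU end and collect only the
dirty ones as write candidates. Again this is illustrative only and the
names are invented; the real change is in the attached patch to
StrategyDirtyBufferList():

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy sketch, not PostgreSQL code: bound the search to a percentage of
 * the buffer pool, counted from the LRU end, and remember the dirty
 * buffers found there as candidates for writing.
 */
typedef struct
{
    bool    dirty;
} ToyBuf;

static int
collect_dirty_from_lru(ToyBuf bufs[], int nbuffers, int percent,
                       int dirty_idx[])
{
    int     max_search = (nbuffers * percent + 99) / 100;   /* round up */
    int     i;
    int     ndirty = 0;

    for (i = 0; i < nbuffers && i < max_search; i++)
        if (bufs[i].dirty)
            dirty_idx[ndirty++] = i;    /* write candidate near the LRU */

    return ndirty;
}

int
main(void)
{
    ToyBuf  bufs[1000] = {{false}};
    int     dirty_idx[1000];

    bufs[3].dirty = true;       /* near the LRU: found and written */
    bufs[500].dirty = true;     /* near the MRU: not even looked at */

    printf("dirty buffers found in the 2%% nearest the LRU: %d\n",
           collect_dirty_from_lru(bufs, 1000, 2, dirty_idx));
    return 0;
}
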
> Writing away dirty blocks towards the MRU end is more likely to be
> wasted effort. If a block stays near the MRU then it will be dirty again
> in the wink of an eye, so you gain nothing at checkpoint time by
> cleaning it. Also, since it isn't near the LRU, cleaning it has no
> effect on buffer replacement I/O. If a block is at the LRU, then it is
> by definition the least likely to be reused, and is a candidate for
> replacement anyway. So concentrating on the LRU, not the number of dirty
> buffers, seems to be the better thing to do.
> 
> That would then be a much simpler way of setting the defaults. With that
> definition, we would set the defaults:
> 
> bgwriter_percent = 2 (according to my new suggestion here)
> bgwriter_delay = 200
> bgwriter_maxpages = -1 (i.e. mostly ignore it, but keep it for fine
> tuning)
> 
> Thus, for the default shared_buffers=1000 the bgwriter would clear
> space for up to 20 blocks each cycle.
> For a config with shared_buffers=60000, the bgwriter default would clear
> space for up to 1200 blocks (max) each cycle - a reasonable setting.
> 
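As a standalone check of those figures, a small program using the same
round-up formula as the attached patch, (NBuffers * percent + 99) / 100;
illustrative only:

#include <stdio.h>

/* same rounding as the patch: (NBuffers * percent + 99) / 100 */
static int
maxdirty(int nbuffers, int percent)
{
    return (nbuffers * percent + 99) / 100;
}

int
main(void)
{
    printf("shared_buffers=1000,  2%%: up to %d buffers per cycle\n",
           maxdirty(1000, 2));
    printf("shared_buffers=60000, 2%%: up to %d buffers per cycle\n",
           maxdirty(60000, 2));
    return 0;
}
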
> Overall that would need very little specific tuning, because it scales
> upwards as you increase shared_buffers.
> 
> So, that interpretation of bgwriter_percent gives these advantages:
> - we bound the StrategyDirtyBufferList scan to a small % of the whole
> list, rather than the whole list...so we could realistically set the
> bgwriter_delay lower if required
> - we can set a default that scales, so would not often need to change it
> - the parameter is defined in terms of the thing we really care about:
> sufficient clean blocks at the LRU of the buffer lists
> - these changes are very isolated and actually minor - just a different
> way of specifying which buffers the bgwriter will clean
> 
> Patch attached...again for discussion and to help understanding of this
> proposal. Will submit to patches if we agree it seems like the best way
> to allow the bgwriter defaults to be sensibly set.
> 
> [...and yes, everybody, I do know where we are in the release cycle]
> 
> 
> 
> ------------------------------------------------------------------------
> 
> Index: buffer/bufmgr.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
> retrieving revision 1.182
> diff -d -c -r1.182 bufmgr.c
> *** buffer/bufmgr.c    24 Nov 2004 02:56:17 -0000    1.182
> --- buffer/bufmgr.c    12 Dec 2004 21:53:10 -0000
> ***************
> *** 681,686 ****
> --- 681,687 ----
>   {
>       BufferDesc **dirty_buffers;
>       BufferTag  *buftags;
> +     int         maxdirty;
>       int            num_buffer_dirty;
>       int            i;
>   
> ***************
> *** 688,717 ****
>       if (percent == 0 || maxpages == 0)
>           return 0;
>   
>       /*
>        * Get a list of all currently dirty buffers and how many there are.
>        * We do not flush buffers that get dirtied after we started. They
>        * have to wait until the next checkpoint.
>        */
> !     dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
> !     buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
>   
>       LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
> -     num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> -                                                NBuffers);
>   
> !     /*
> !      * If called by the background writer, we are usually asked to only
> !      * write out some portion of dirty buffers now, to prevent the IO
> !      * storm at checkpoint time.
> !      */
> !     if (percent > 0)
> !     {
> !         Assert(percent <= 100);
> !         num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
> !     }
> !     if (maxpages > 0 && num_buffer_dirty > maxpages)
> !         num_buffer_dirty = maxpages;
>   
>       /* Make sure we can handle the pin inside the loop */
>       ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> --- 689,720 ----
>       if (percent == 0 || maxpages == 0)
>           return 0;
>   
> +     /* Set number of buffers we will clean at LRUs of buffer lists 
> +      * If no limits set, then clean the whole of shared_buffers
> +      */
> +     if (maxpages > 0)
> +         maxdirty = maxpages;
> +     else {
> +         if (percent > 0) {
> +                Assert(percent <= 100);
> +             maxdirty = (NBuffers * percent + 99) / 100;
> +         }
> +         else
> +             maxdirty = NBuffers;
> +     }
> + 
>       /*
>        * Get a list of all currently dirty buffers and how many there are.
>        * We do not flush buffers that get dirtied after we started. They
>        * have to wait until the next checkpoint.
>        */
> !     dirty_buffers = (BufferDesc **) palloc(maxdirty * sizeof(BufferDesc *));
> !     buftags = (BufferTag *) palloc(maxdirty * sizeof(BufferTag));
>   
>       LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
>   
> !        num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> !                                                maxdirty);
>   
>       /* Make sure we can handle the pin inside the loop */
>       ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> Index: buffer/freelist.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v
> retrieving revision 1.48
> diff -d -c -r1.48 freelist.c
> *** buffer/freelist.c    16 Sep 2004 16:58:31 -0000    1.48
> --- buffer/freelist.c    12 Dec 2004 21:53:11 -0000
> ***************
> *** 735,741 ****
>    * StrategyDirtyBufferList
>    *
>    * Returns a list of dirty buffers, in priority order for writing.
> -  * Note that the caller may choose not to write them all.
>    *
>    * The caller must beware of the possibility that a buffer is no longer dirty,
>    * or even contains a different page, by the time he reaches it.  If it no
> --- 735,740 ----
> ***************
> *** 755,760 ****
> --- 754,760 ----
>       int            cdb_id_t2;
>       int            buf_id;
>       BufferDesc *buf;
> +     int            i;
>   
>       /*
>        * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
> ***************
> *** 765,771 ****
>       cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
>       cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>   
> !     while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
>       {
>           if (cdb_id_t1 >= 0)
>           {
> --- 765,771 ----
>       cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
>       cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>   
> !     for (i = 0; i < max_buffers; i++)
>       {
>           if (cdb_id_t1 >= 0)
>           {
> ***************
> *** 779,786 ****
>                       buffers[num_buffer_dirty] = buf;
>                       buftags[num_buffer_dirty] = buf->tag;
>                       num_buffer_dirty++;
> -                     if (num_buffer_dirty >= max_buffers)
> -                         break;
>                   }
>               }
>   
> --- 779,784 ----
> ***************
> *** 799,806 ****
>                       buffers[num_buffer_dirty] = buf;
>                       buftags[num_buffer_dirty] = buf->tag;
>                       num_buffer_dirty++;
> -                     if (num_buffer_dirty >= max_buffers)
> -                         break;
>                   }
>               }
>   
> --- 797,802 ----
> 
> 
> ------------------------------------------------------------------------
> 
> 


-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

