Re: [Testperf-general] BufferSync and bgwriter - Mailing list pgsql-hackers
| From | Jan Wieck |
|---|---|
| Subject | Re: [Testperf-general] BufferSync and bgwriter |
| Date | |
| Msg-id | 41C061EE.5030508@Yahoo.com |
| In response to | Re: [Testperf-general] BufferSync and bgwriter (Simon Riggs <simon@2ndquadrant.com>) |
| Responses | Re: [Testperf-general] BufferSync and bgwriter |
| List | pgsql-hackers |
On 12/12/2004 5:08 PM, Simon Riggs wrote:

> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> Simon Riggs wrote:
>> > If the bgwriter_percent = 100, then we should actually do the sensible
>> > thing and prepare the list that we need, i.e. limit
>> > StrategyDirtyBufferList to finding at most bgwriter_maxpages.
>>
>> Is the plan to make bgwriter_percent = 100 the default setting?
>
> Hmm... must confess that my only plan is:
> i) discover dynamic behaviour of bgwriter
> ii) fix any bugs or weirdness as quickly as possible
> iii) try to find a way to set the bgwriter defaults
>
> I'm worried that we're late in the day for changes, but I'm equally
> worried that a) the bgwriter is very tuning sensitive, b) we don't
> really have much info on how to set the defaults in a meaningful way
> for the majority of cases, and c) there are some issues that greatly
> reduce the effectiveness of the bgwriter in many circumstances.
>
> The 100pct.patch was my first attempt at getting something acceptable
> in the next few days that gives sufficient room for the DBA to perform
> tuning.

Doesn't cranking bgwriter_percent up to 100 effectively make the entire
shared memory a write-through cache? In other words, with 100% the
bgwriter will always write all dirty blocks out, and it becomes unlikely
that we avoid an IO for subsequent modifications to the same data block.


Jan

> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> I wonder if we even need to retain the bgwriter_percent GUC var. Is
>> there actually a situation in which the combination of
>> bgwriter_maxpages and bgwriter_delay does not give the DBA sufficient
>> flexibility in tuning bgwriter behavior?
>
> Yes, I do now think that only two GUCs are required to tune the
> behaviour; but you make me think - which two? Right now, bgwriter_delay
> is useless because the O(N) behaviour makes it impossible to set it any
> lower when you have a large shared_buffers. (I see that as a bug.)
>
> Your question has made me rethink the exact objective of the bgwriter's
> actions: the way it is coded now, the bgwriter looks for dirty blocks
> no matter where they are in the list. What we are actually bothered
> about is the number of clean buffers at the LRU, which has a direct
> influence on the probability that BufferAlloc() will need to call
> FlushBuffer(), since StrategyGetBuffer() returns the first unpinned
> buffer, dirty or not. After further thought, I would prefer a subtle
> change in behaviour so that the bgwriter checks that clean blocks are
> available at the LRUs for when buffer replacement occurs. With that
> slight change, I'd keep the bgwriter_percent GUC but make it mean
> something different.
>
> bgwriter_percent would be the % of shared_buffers that are searched
> (from the LRU end) to see if they contain dirty buffers, which are then
> written to disk. That means the number of dirty blocks written to disk
> is between 0 and the number of buffers searched, but we're not hugely
> bothered what that number is... [This change to StrategyDirtyBufferList
> resolves the unusability of the bgwriter with large shared_buffers.]
>
> Writing away dirty blocks towards the MRU end is more likely to be
> wasted effort. If a block stays near the MRU then it will be dirty
> again in the wink of an eye, so you gain nothing at checkpoint time by
> cleaning it. Also, since it isn't near the LRU, cleaning it has no
> effect on buffer replacement I/O. If a block is at the LRU, then it is
> by definition the least likely to be reused, and is a candidate for
> replacement anyway.
> So concentrating on the LRU, not the number of dirty buffers, seems to
> be the better thing to do.
>
> That would then be a much simpler way of setting the defaults. With
> that definition, we would set the defaults:
>
> bgwriter_percent = 2 (according to my new suggestion here)
> bgwriter_delay = 200
> bgwriter_maxpages = -1 (i.e. mostly ignore it, but keep it for fine
> tuning)
>
> Thus, for the default shared_buffers=1000 the bgwriter would clear a
> space of up to 20 blocks each cycle.
> For a config with shared_buffers=60000, the bgwriter default would
> clear space for 600 blocks (max) each cycle - a reasonable setting.
>
> Overall that would need very little specific tuning, because it would
> scale upwards as you set shared_buffers higher.
>
> So, that interpretation of bgwriter_percent gives these advantages:
> - we bound the StrategyDirtyBufferList scan to a small % of the whole
>   list, rather than the whole list... so we could realistically set
>   bgwriter_delay lower if required
> - we can set a default that scales, so it would not often need changing
> - the parameter is defined in terms of the thing we really care about:
>   sufficient clean blocks at the LRU of the buffer lists
> - these changes are very isolated and actually minor - just a different
>   way of specifying which buffers the bgwriter will clean
>
> Patch attached... again for discussion and to help understanding of
> this proposal. Will submit to patches if we agree it seems like the
> best way to allow the bgwriter defaults to be sensibly set.
>
> [...and yes, everybody, I do know where we are in the release cycle]
>
>
> ------------------------------------------------------------------------
>
> Index: buffer/bufmgr.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
> retrieving revision 1.182
> diff -d -c -r1.182 bufmgr.c
> *** buffer/bufmgr.c	24 Nov 2004 02:56:17 -0000	1.182
> --- buffer/bufmgr.c	12 Dec 2004 21:53:10 -0000
> ***************
> *** 681,686 ****
> --- 681,687 ----
>   {
>   	BufferDesc **dirty_buffers;
>   	BufferTag  *buftags;
> + 	int			maxdirty;
>   	int			num_buffer_dirty;
>   	int			i;
>
> ***************
> *** 688,717 ****
>   	if (percent == 0 || maxpages == 0)
>   		return 0;
>
>   	/*
>   	 * Get a list of all currently dirty buffers and how many there are.
>   	 * We do not flush buffers that get dirtied after we started. They
>   	 * have to wait until the next checkpoint.
>   	 */
> ! 	dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
> ! 	buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
>
>   	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
> - 	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> - 											   NBuffers);
>
> ! 	/*
> ! 	 * If called by the background writer, we are usually asked to only
> ! 	 * write out some portion of dirty buffers now, to prevent the IO
> ! 	 * storm at checkpoint time.
> ! 	 */
> ! 	if (percent > 0)
> ! 	{
> ! 		Assert(percent <= 100);
> ! 		num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
> ! 	}
> ! 	if (maxpages > 0 && num_buffer_dirty > maxpages)
> ! 		num_buffer_dirty = maxpages;
>
>   	/* Make sure we can handle the pin inside the loop */
>   	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> --- 689,720 ----
>   	if (percent == 0 || maxpages == 0)
>   		return 0;
>
> + 	/* Set number of buffers we will clean at LRUs of buffer lists
> + 	 * If no limits set, then clean the whole of shared_buffers
> + 	 */
> + 	if (maxpages > 0)
> + 		maxdirty = maxpages;
> + 	else {
> + 		if (percent > 0) {
> + 			Assert(percent <= 100);
> + 			maxdirty = (NBuffers * percent + 99) / 100;
> + 		}
> + 		else
> + 			maxdirty = NBuffers;
> + 	}
> +
>   	/*
>   	 * Get a list of all currently dirty buffers and how many there are.
>   	 * We do not flush buffers that get dirtied after we started. They
>   	 * have to wait until the next checkpoint.
>   	 */
> ! 	dirty_buffers = (BufferDesc **) palloc(maxdirty * sizeof(BufferDesc *));
> ! 	buftags = (BufferTag *) palloc(maxdirty * sizeof(BufferTag));
>
>   	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
>
> ! 	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> ! 											   maxdirty);
>
>   	/* Make sure we can handle the pin inside the loop */
>   	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> Index: buffer/freelist.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v
> retrieving revision 1.48
> diff -d -c -r1.48 freelist.c
> *** buffer/freelist.c	16 Sep 2004 16:58:31 -0000	1.48
> --- buffer/freelist.c	12 Dec 2004 21:53:11 -0000
> ***************
> *** 735,741 ****
>    * StrategyDirtyBufferList
>    *
>    * Returns a list of dirty buffers, in priority order for writing.
> - * Note that the caller may choose not to write them all.
>    *
>    * The caller must beware of the possibility that a buffer is no longer dirty,
>    * or even contains a different page, by the time he reaches it. If it no
> --- 735,740 ----
> ***************
> *** 755,760 ****
> --- 754,760 ----
>   	int			cdb_id_t2;
>   	int			buf_id;
>   	BufferDesc *buf;
> + 	int			i;
>
>   	/*
>   	 * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
> ***************
> *** 765,771 ****
>   	cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
>   	cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>
> ! 	while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
>   	{
>   		if (cdb_id_t1 >= 0)
>   		{
> --- 765,771 ----
>   	cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
>   	cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>
> ! 	for (i = 0; i < max_buffers; i++)
>   	{
>   		if (cdb_id_t1 >= 0)
>   		{
> ***************
> *** 779,786 ****
>   				buffers[num_buffer_dirty] = buf;
>   				buftags[num_buffer_dirty] = buf->tag;
>   				num_buffer_dirty++;
> - 				if (num_buffer_dirty >= max_buffers)
> - 					break;
>   			}
>   		}
>
> --- 779,784 ----
> ***************
> *** 799,806 ****
>   				buffers[num_buffer_dirty] = buf;
>   				buftags[num_buffer_dirty] = buf->tag;
>   				num_buffer_dirty++;
> - 				if (num_buffer_dirty >= max_buffers)
> - 					break;
>   			}
>   		}
>
> --- 797,802 ----
>
>
> ------------------------------------------------------------------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
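For readers working through the proposal above, here is a minimal,
self-contained sketch (not part of the attached patch) of the maxdirty
arithmetic it introduces. The function name bgwriter_maxdirty and the
main() driver are invented here purely for illustration; the formula and
the proposed default settings are taken from the discussion and the
bufmgr.c hunk above.

    #include <assert.h>
    #include <stdio.h>

    /*
     * Illustrative only: how many buffers the bgwriter would consider per
     * cycle under the proposed semantics, mirroring the maxdirty
     * computation in the bufmgr.c hunk above.
     */
    static int
    bgwriter_maxdirty(int nbuffers, int percent, int maxpages)
    {
        if (maxpages > 0)           /* explicit page limit takes precedence */
            return maxpages;
        if (percent > 0)            /* percentage of shared_buffers, rounded up */
        {
            assert(percent <= 100);
            return (nbuffers * percent + 99) / 100;
        }
        return nbuffers;            /* no limit set: scan the whole pool */
    }

    int
    main(void)
    {
        /* Proposed defaults: bgwriter_percent = 2, bgwriter_maxpages = -1 */
        printf("%d\n", bgwriter_maxdirty(1000, 2, -1)); /* prints 20 */
        /* If bgwriter_maxpages is set, the page cap wins over the percent */
        printf("%d\n", bgwriter_maxdirty(1000, 2, 5));  /* prints 5 */
        return 0;
    }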