From: Robert Haas
Subject: Re: Scaling shared buffer eviction
Msg-id: CA+TgmobCe7nyZ_3uA8WvmJbZvLtFiveT3EupMj4M=wYocbApew@mail.gmail.com
In response to: Re: Scaling shared buffer eviction (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have improved the patch by making the following changes:
>
> a. Improved the bgwriter logic to log for xl_running_xacts info and
>    removed the hibernate logic as bgwriter will now work only when
>    there is scarcity of buffers in free list. Basic idea is when the
>    number of buffers on freelist drops below the low threshold, the
>    allocating backend sets the latch and bgwriter wakes up and begins
>    adding buffers to freelist until it reaches high threshold and then
>    again goes back to sleep.

This essentially removes BgWriterDelay, but it's still mentioned in BgBufferSync().  Looking further, I see that with the patch applied, BgBufferSync() is still present in the source code but is no longer called from anywhere.  Please don't submit patches that render things unused without actually removing them; it makes it much harder to see what you've changed.  I realize you probably left it that way for testing purposes, but you need to clean such things up before submitting.  Likewise, if you've rendered GUCs or statistics counters obsolete, you need to rip them out, so that the scope of the changes you've made is clear to reviewers.

A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least one behavior that some people (at least, me) will care about, which is the guarantee that the background writer will scan the entire buffer pool at least every couple of minutes.  This is important because it guarantees that dirty data doesn't sit in memory forever.  When the system becomes busy again after a long idle period, users will expect the system to have used the idle time to flush dirty buffers to disk.  This also improves data recovery prospects if, for example, somebody loses their pg_xlog directory - there may be dirty buffers whose contents are lost, of course, but they won't be months old.

> b. New stats for number of buffers on freelist has been added, some
>    old ones like maxwritten_clean can be removed as new logic for
>    syncing buffers and moving them to free list doesn't use them.
>    However I think it's better to remove them once the new logic is
>    accepted. Added some new logs for info related to free list under
>    BGW_DEBUG.

If I'm reading this right, the new statistic is an incrementing counter where, every time you update it, you add the number of buffers currently on the freelist.  That makes no sense.  I think what you should be counting is the number of allocations that are being satisfied from the free-list.  Then, by comparing the rate at which that value is incrementing to the rate at which buffers_alloc is incrementing, somebody can figure out what percentage of allocations are requiring a clock-sweep run.  Actually, I think it's better to flip it around: count the number of allocations that require an individual backend to run the clock sweep (vs. being satisfied from the free-list); call it, say, buffers_backend_clocksweep.  We can then try to tune the patch to make that number as small as possible under varying workloads.
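
To make the intended arithmetic concrete (the counter names here are the ones proposed above, nothing that exists today), the interesting figure is just the ratio of the two counters over a sampling interval:

    /*
     * Illustrative only: the share of buffer allocations that had to fall
     * back to a backend-run clock sweep, given snapshots of the counters.
     */
    static double
    clocksweep_fraction(uint64 buffers_backend_clocksweep, uint64 buffers_alloc)
    {
        if (buffers_alloc == 0)
            return 0.0;
        return (double) buffers_backend_clocksweep / (double) buffers_alloc;
    }

If buffers_alloc advances by 10,000 in an interval while buffers_backend_clocksweep advances by 2,500, a quarter of all allocations missed the freelist; that's the number we'd be trying to drive toward zero.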

> c. Used the already existing bgwriterLatch in BufferStrategyControl to
>    wake bgwriter when number of buffers in freelist drops below
>    threshold.

Seems like a good idea.
 
> d. Autotune the low and high threshold for freelist for various
>    configurations. Generally if we keep a small number (200~2000) of
>    buffers always available on freelist, then even for high shared_buffers
>    settings like 15GB, it appears to be sufficient. However when the value
>    of shared_buffers is less, then we need a much smaller number. I
>    think we can provide these as config knobs for the user as well, but for
>    now, based on LWLOCK_STATS results, I have chosen some hard-coded
>    values for low and high threshold values for freelist.
>    Values for low and high threshold have been decided based on total
>    number of shared buffers; basically I have divided them into 5
>    categories (16~100, 100~1000, 1000~10000, 10000~100000,
>    100000 and above) and then ran tests (read-only pgbench) for various
>    configurations falling under these categories. The reason for keeping
>    fewer categories for larger shared buffers is that if there is a small
>    number (200~2000) of buffers available on the free list, then it seems
>    to be sufficient for quite high loads; however, as the total number of
>    shared buffers decreases we need to be more careful, since if we keep
>    the number too low it will lead to more clock sweeps by backends (which
>    means freelist lock contention) and if we keep the number higher
>    bgwriter will evict many useful buffers. Results based on LWLOCK_STATS
>    are at end of mail.

I think we need to come up with some kind of formula here rather than just a list of hard-coded constants.  And it definitely needs some comments explaining the logic behind the choices.
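
Just to illustrate the sort of thing I mean by a formula (the divisor and the clamp values below are numbers I made up, not anything derived from your LWLOCK_STATS data), something along these lines would at least make the scaling rule explicit and give us a place to hang a comment:

    /*
     * Hypothetical sketch: derive the freelist watermarks from NBuffers
     * instead of a table of hard-coded ranges.  The divisor and the clamps
     * are illustrative guesses and would need to be justified by testing.
     */
    static void
    freelist_thresholds(int nbuffers, int *low, int *high)
    {
        int     hi = nbuffers / 64;     /* grow with shared_buffers... */

        hi = Max(hi, 5);                /* ...but stay within sane bounds */
        hi = Min(hi, 2000);

        *high = hi;
        *low = Max(hi / 4, 1);          /* wake the refiller well before empty */
    }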
 
Aside from those specific remarks, I think the elephant in the room is the question of whether it really makes sense to have one process which is responsible both for populating the free list and for writing buffers to disk.  One problem, which I alluded to above under point (1), is that we might sometimes want to ensure that dirty buffers are written out to disk without decrementing usage counts or adding anything to the free list.  This is a potentially solvable problem, though, because we can figure out the number of buffers that we need to scan for freelist population and the number that we need to scan for minimum buffer pool cleaning (one cycle every 2 minutes).  Once we've met the first goal, any further buffers we run into under the second goal get cleaned if appropriate but their usage counts don't get pushed down nor do they get added to the freelist.  Once we meet the second goal, we can go back to sleep.
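
In pseudocode, the scan I have in mind is roughly the following; every helper name here is invented for illustration, and none of them exist in the patch or in bufmgr.c:

    static void
    two_goal_scan(int freelist_target, int buffers_to_scan)
    {
        int     scanned;

        for (scanned = 0; scanned < buffers_to_scan; scanned++)
        {
            int     buf = advance_clock_hand();

            if (freelist_length() < freelist_target)
            {
                /* Goal 1: refill the freelist, pushing usage counts down. */
                if (buffer_usage_count(buf) == 0 && !buffer_is_pinned(buf))
                {
                    if (buffer_is_dirty(buf))
                        write_buffer(buf);
                    push_on_freelist(buf);
                }
                else
                    decrement_usage_count(buf);
            }
            else
            {
                /*
                 * Goal 2: keep visiting buffers so the whole pool is covered
                 * every couple of minutes, writing dirty ones but leaving
                 * usage counts and the freelist untouched.
                 */
                if (buffer_is_dirty(buf) && !buffer_is_pinned(buf))
                    write_buffer(buf);
            }
        }
    }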

But the other problem, which I think is likely unsolvable, is that writing a dirty page can take a long time on a busy system (multiple seconds) and the freelist can be emptied much, much quicker than that (milliseconds).  Although your benchmark results show great speed-ups on read-only workloads, we're not really going to get the benefit consistently on read-write workloads -- unless of course the background writer fails to actually write anything, which should be viewed as a bug, not a feature -- because the freelist will often be empty while the background writer is blocked on I/O.

I'm wondering if it would be a whole lot simpler and better to introduce a new background process, maybe with a name like bgreclaim.  That process wouldn't write dirty buffers.  Instead, it would just run the clock sweep (i.e. the last loop inside StrategyGetBuffer) and put the buffers onto the free list.  Then, we could leave the bgwriter logic more or less intact.  It certainly needs improvement, but that could be another patch.
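
The main loop of such a process could be very small.  A rough sketch, where freelist_length() and reclaim_one_buffer() are placeholders and the latter is essentially the clock-sweep body from StrategyGetBuffer:

    for (;;)
    {
        ResetLatch(&MyProc->procLatch);

        /*
         * Run the clock sweep and add the victims it finds to the freelist
         * until the high watermark is reached.  This process never writes
         * dirty buffers; that remains the background writer's job.
         */
        while (freelist_length() < freelist_high_watermark)
            reclaim_one_buffer();

        /* Sleep until an allocating backend sets our latch again. */
        WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
    }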

Incidentally, while I generally think your changes to the locking regimen in StrategyGetBuffer() are going in the right direction, they need significant cleanup.  Your patch adds two new spinlocks, freelist_lck and victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and you've now got StrategyGetBuffer() running with no lock at all when accessing some things that used to be protected by BufFreelistLock; specifically, you're doing StrategyControl->numBufferAllocs++ and SetLatch(StrategyControl->bgwriterLatch) without any locking.  That's not OK.  I think you should get rid of BufFreelistLock completely and just decide that freelist_lck will protect the freeNext links, plus everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck will protect nextVictimBuffer and nothing else.

Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the LWLock is acquired today.  Increment StrategyControl->numBufferAllocs; save the value of StrategyControl->bgwriterLatch; pop a buffer off the freelist if there is one, saving its identity.  Release the spinlock.  Then, set the bgwriterLatch if needed.  In the first loop, first check whether the buffer we previously popped from the freelist is pinned or has a non-zero usage count and return it if not, holding the buffer header lock.  Otherwise, reacquire the spinlock just long enough to pop a new potential victim and then loop around.
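
Spelled out as a fragment, that's roughly the following, where pop_freelist_locked() is a stand-in for a few lines that follow the freeNext links, and the eventual fall-through to the clock sweep and the buffer-ring case are omitted:

    SpinLockAcquire(&StrategyControl->freelist_lck);
    StrategyControl->numBufferAllocs++;
    bgwriterLatch = StrategyControl->bgwriterLatch;
    buf = pop_freelist_locked();        /* NULL if the freelist is empty */
    SpinLockRelease(&StrategyControl->freelist_lck);

    if (bgwriterLatch)
        SetLatch(bgwriterLatch);        /* only after the spinlock is released */

    while (buf != NULL)
    {
        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;                 /* caller keeps the buffer header lock */
        UnlockBufHdr(buf);

        /* Unusable; hold freelist_lck just long enough to pop another. */
        SpinLockAcquire(&StrategyControl->freelist_lck);
        buf = pop_freelist_locked();
        SpinLockRelease(&StrategyControl->freelist_lck);
    }

    /* Freelist exhausted: run the clock sweep under victimbuf_lck. */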

Under this locking strategy, StrategyNotifyBgWriter would use freelist_lck.  Right now, the patch removes the only caller, and should therefore remove the function as well, but if we go with the new-process idea listed above that part would get reverted, and then you'd need to make it use the correct spinlock.  You should also go through this patch and remove all the commented-out bits and pieces that you haven't cleaned up; those are distracting and unhelpful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
