Re: Move unused buffers to freelist - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Re: Move unused buffers to freelist
Date |
Msg-id | 20130627130146.GH1254@alap2.anarazel.de
In response to | Re: Move unused buffers to freelist (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: Move unused buffers to freelist; Re: Move unused buffers to freelist
List | pgsql-hackers
On 2013-06-27 08:23:31 -0400, Robert Haas wrote:
> I'd like to just back up a minute here and talk about the broader
> picture here.

Sounds like a very good plan.

> So in other words, there's no huge *performance* problem for a working
> set larger than shared_buffers, but there is a huge *scalability*
> problem. Now why is that?
> As far as I can tell, the answer is that we've got a scalability
> problem around BufFreelistLock.

Part of the problem is its name ;)

> Contention on the buffer mapping locks may also be a problem, but all
> of my previous benchmarking (with LWLOCK_STATS) suggests that
> BufFreelistLock is, by far, the elephant in the room.

Contention-wise I agree. What I have seen is that we have a huge amount
of cacheline bouncing around the buffer header spinlocks.

> My interest in having the background writer add buffers to the free
> list is basically around solving that problem. It's a pretty dramatic
> problem, as the graph above shows, and this patch doesn't solve it.
> One thing that occurred to me while writing this note is that the
> background writer doesn't have any compelling reason to run on a
> read-only workload. It will still run at a certain minimum rate, so
> that it cycles the buffer pool every 2 minutes, if I remember
> correctly.

I have previously added some ad-hoc instrumentation that printed the
number of buffers that were required (by other backends) during a
bgwriter cycle and the number of buffers that the buffer manager could
actually write out. I don't think I found any workload where the
bgwriter wrote out a relevant percentage of the necessary pages, which
would explain why the patch doesn't show a big benefit: the freelist is
empty most of the time, so we don't benefit from the reduced work done
under the lock.

I think the whole algorithm that guides how much the background writer
actually does, including its pacing/sleeping logic, needs to be
rewritten from scratch before we are actually able to measure the
benefit from this patch. I personally don't think there's much to
salvage from the current code.

Problems with the current code:

* It doesn't manipulate the usage_count and never does anything to used
  pages, which means it will just about never find a victim buffer in a
  busy database.
* It is not nearly aggressive enough; it only touches a few buffers
  ahead of the clock sweep.
* It does not advance the clock sweep, so the individual backends will
  touch the same buffers again and transfer all the buffer spinlock
  cachelines around.
* Its adaptation logic reacts so slowly that it takes several minutes to
  adjust.
* ...

There's another thing we could do to noticeably improve the scalability
of buffer acquisition. Currently we do a huge amount of work under the
freelist lock. In StrategyGetBuffer:

    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
    ...
    /* check the freelist; it will usually be empty */
    ...
    for (;;)
    {
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;

        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                buf->usage_count--;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        UnlockBufHdr(buf);
    }

So we perform the entire clock sweep, until we find a single usable
buffer, inside a *global* lock. At times we need to iterate over all of
shared_buffers BM_MAX_USAGE_COUNT (5) times until we have pushed the
usage counts down far enough (if the database is busy it can take even
longer...).
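To put rough, purely illustrative numbers on that (a back-of-the-envelope
example assuming the default 8kB block size, not a measurement): with
shared_buffers = 8GB there are 1,048,576 buffer headers, so in the worst
case a single victim search has to take and release on the order of
5 * 1,048,576, i.e. roughly 5 million, buffer header spinlocks while
holding BufFreelistLock exclusively - and every other backend that needs
a victim buffer is queued behind that one lock.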
In a busy database, where usually all the usage counts are high, the
next backend will touch a lot of those buffers again, which causes
massive cache eviction & bouncing.

It seems far more sensible to only protect the clock sweep's
nextVictimBuffer with a spinlock. With some care all the rest can happen
without any global interlock.

I think even after fixing this - which we definitely should do - having
a sensible/more aggressive bgwriter moving pages onto the freelist makes
sense, because the backends then don't need to deal with dirty pages.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
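As an illustration of the nextVictimBuffer idea above, here is a minimal
sketch in which the only global interlock is a small spinlock around the
clock hand; the per-buffer refcount/usage_count work then happens under
the individual buffer header spinlock only. The names victim_lock and
ClockSweepNextVictim are hypothetical (not from any actual patch), and
the freelist check, ring handling and dirty-buffer handling are omitted:

    /*
     * Sketch only: the global critical section shrinks to advancing the
     * clock hand; everything else runs without a global lock held.
     */
    static slock_t victim_lock;    /* hypothetical; protects nextVictimBuffer only */

    static BufferDesc *
    ClockSweepNextVictim(void)
    {
        for (;;)
        {
            BufferDesc *buf;
            int         victim;

            /* grab and advance the clock hand under the small spinlock */
            SpinLockAcquire(&victim_lock);
            victim = StrategyControl->nextVictimBuffer;
            if (++StrategyControl->nextVictimBuffer >= NBuffers)
                StrategyControl->nextVictimBuffer = 0;
            SpinLockRelease(&victim_lock);

            buf = &BufferDescriptors[victim];

            /* per-buffer work: no global interlock held here */
            LockBufHdr(buf);
            if (buf->refcount == 0)
            {
                if (buf->usage_count > 0)
                    buf->usage_count--;
                else
                    return buf;    /* found a victim; caller releases header lock */
            }
            UnlockBufHdr(buf);
        }
    }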