Re: Moving more work outside WALInsertLock - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Moving more work outside WALInsertLock |
Date | |
Msg-id | 4F0AF9C7.2080703@enterprisedb.com Whole thread Raw |
In response to | Re: Moving more work outside WALInsertLock (Simon Riggs <simon@2ndQuadrant.com>) |
Responses |
Re: Moving more work outside WALInsertLock
Scaling XLog insertion (was Re: Moving more work outside WALInsertLock) |
List | pgsql-hackers |
On 09.01.2012 15:44, Simon Riggs wrote: > On Sat, Jan 7, 2012 at 9:31 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > >> Anyway, here's a new version of the patch. It no longer busy-waits for >> in-progress insertions to finish, and handles xlog-switches. This is now >> feature-complete. It's a pretty complicated patch, so I would appreciate >> more eyeballs on it. And benchmarking again. > > Took me awhile to understand why the data structure for the insertion > slots is so complex. Why not have slots per buffer? That would be > easier to understand and slots are very small. Hmm, how would that work? > Can we avoid having spinlocks on the slots altogether? If we have a > page number (int) and an LSN, inserters would set LSN and then set > page number. Anybody waiting for slots would stop if page number is > zero since that means its not complete yet. So readers look at page > number first and aren't allowed to look at LSN without valid page > number. The LSN on a slot is set in ReserveXLogInsertLocation(), while holding the insertpos_lck spinlock. The inserter doesn't acquire the per-slot spinlock at that point, it relies on the fact that no-one will look at the slot until the shared nextslot variable has been incremented. The spinlock is only acquired when updating the pointer, which only happens when crossing a WAL page, which isn't that performance-criticial, and when the insertion is finished. It would be nice to get rid of the spinlock acquisition when the insertion is finished, but I don't see any easy way around that. The spinlock is needed to make sure that when the inserter clears its slot, it can atomically check the waiter field. The theory is that contention on those per-slot spinlocks is very rare. Profiling this patch with "perf", it looks like the bottleneck is the insertpos_lck spinlock. > Page number would be useful in working out where to stop when doing > background flushes, which we need for Group Commit, which is arriving > soon for this release. Ok, I'll have to just take your word on that :-). I don't see why Group Commit needs to care about page boundaries, but the slot data structure I used already allows you to check fairly how far you write out the WAL without having to wait for any in-progress insertions to complete. > Can we also try aligning the actual insertions onto cache lines rather > than just MAXALIGNing them? The WAL header fills half a cache line as > it is, so many other records will fit nicely. I'd like to see what > that does to space consumption, but it might be a useful option at > least. Hmm, that's an interesting thought. That would mean having gaps in the in-memory WAL cache, so that when it's written out, you'd need to stitch together the pieces to form the WAL that's actually written to disk. Or just leave the gaps in the on-disk format, if we're willing to change the WAL format for this, but I don't think we want make our WAL any larger than it already is. I've written this patch avoiding WAL format changes, but if we're willing to do that, there's a few things we could do that would help. For one, the logic in ReserveXLogInsertLocation() that figures out where in the WAL stream the record begins and where it ends could be made a lot simpler. At the moment, we refuse to split a WAL record header across WAL pages, and because of that, the number of bytes occupied by a WAL record depends on where in the WAL it's written. If we changed that, reserving space from the WAL for a record that's N bytes long could be done essentially as "CurrPos += N". There's some complications, like having to keep track of the prev-link too, but I believe it would be possible to get rid of the spinlock and implement ReserveXLogInsertLocation() as a single atomic fetch-and-add instruction. > GetInsertRecPtr() should return the XLogRecPtr of the latest > allocation. IMHO that is what we need for checkpoints and the > walsender doesn't really matter. Ok. Thanks looking at the patch! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
pgsql-hackers by date: