Re: Moving more work outside WALInsertLock - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Moving more work outside WALInsertLock
Date
Msg-id 4F0AF9C7.2080703@enterprisedb.com
Whole thread Raw
In response to Re: Moving more work outside WALInsertLock  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Moving more work outside WALInsertLock  (Simon Riggs <simon@2ndQuadrant.com>)
Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
On 09.01.2012 15:44, Simon Riggs wrote:
> On Sat, Jan 7, 2012 at 9:31 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>
>> Anyway, here's a new version of the patch. It no longer busy-waits for
>> in-progress insertions to finish, and handles xlog-switches. This is now
>> feature-complete. It's a pretty complicated patch, so I would appreciate
>> more eyeballs on it. And benchmarking again.
>
> Took me awhile to understand why the data structure for the insertion
> slots is so complex. Why not have slots per buffer? That would be
> easier to understand and slots are very small.

Hmm, how would that work?

> Can we avoid having spinlocks on the slots altogether? If we have a
> page number (int) and an LSN, inserters would set LSN and then set
> page number. Anybody waiting for slots would stop if page number is
> zero since that means its not complete yet. So readers look at page
> number first and aren't allowed to look at LSN without valid page
> number.

The LSN on a slot is set in ReserveXLogInsertLocation(), while holding 
the insertpos_lck spinlock. The inserter doesn't acquire the per-slot 
spinlock at that point, it relies on the fact that no-one will look at 
the slot until the shared nextslot variable has been incremented. The 
spinlock is only acquired when updating the pointer, which only happens 
when crossing a WAL page, which isn't that performance-criticial, and 
when the insertion is finished. It would be nice to get rid of the 
spinlock acquisition when the insertion is finished, but I don't see any 
easy way around that. The spinlock is needed to make sure that when the 
inserter clears its slot, it can atomically check the waiter field.

The theory is that contention on those per-slot spinlocks is very rare. 
Profiling this patch with "perf", it looks like the bottleneck is the 
insertpos_lck spinlock.

> Page number would be useful in working out where to stop when doing
> background flushes, which we need for Group Commit, which is arriving
> soon for this release.

Ok, I'll have to just take your word on that :-). I don't see why Group 
Commit needs to care about page boundaries, but the slot data structure 
I used already allows you to check fairly how far you write out the WAL 
without having to wait for any in-progress insertions to complete.

> Can we also try aligning the actual insertions onto cache lines rather
> than just MAXALIGNing them? The WAL header fills half a cache line as
> it is, so many other records will fit nicely. I'd like to see what
> that does to space consumption, but it might be a useful option at
> least.

Hmm, that's an interesting thought. That would mean having gaps in the 
in-memory WAL cache, so that when it's written out, you'd need to stitch 
together the pieces to form the WAL that's actually written to disk. Or 
just leave the gaps in the on-disk format, if we're willing to change 
the WAL format for this, but I don't think we want make our WAL any 
larger than it already is.

I've written this patch avoiding WAL format changes, but if we're 
willing to do that, there's a few things we could do that would help. 
For one, the logic in ReserveXLogInsertLocation() that figures out where 
in the WAL stream the record begins and where it ends could be made a 
lot simpler. At the moment, we refuse to split a WAL record header 
across WAL pages, and because of that, the number of bytes occupied by a 
WAL record depends on where in the WAL it's written. If we changed that, 
reserving space from the WAL for a record that's N bytes long could be 
done essentially as "CurrPos += N". There's some complications, like 
having to keep track of the prev-link too, but I believe it would be 
possible to get rid of the spinlock and implement 
ReserveXLogInsertLocation() as a single atomic fetch-and-add instruction.

> GetInsertRecPtr() should return the XLogRecPtr of the latest
> allocation. IMHO that is what we need for checkpoints and the
> walsender doesn't really matter.

Ok. Thanks looking at the patch!

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Moving more work outside WALInsertLock
Next
From: Benedikt Grundmann
Date:
Subject: Re: random_page_cost vs seq_page_cost