literature on write-ahead logging - Mailing list pgsql-hackers

From Robert Haas
Subject literature on write-ahead logging
Date
Msg-id BANLkTik-PFhy+chdY_v4z=oHufSOvvkh-g@mail.gmail.com
Whole thread Raw
Responses Re: literature on write-ahead logging
Re: literature on write-ahead logging
Re: literature on write-ahead logging
List pgsql-hackers
I did a brief literature search for papers on breaking the
WAL-serialization bottleneck today and hit upon this:

Aether: A Scalable Approach to Logging, Ryan Johnson, Ippokratis Pandis, et al.
http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf

Section 5 appears to be the most relevant to our problems with WALInsertLock.

They reject the approach that I proposed, wherein WAL is generated by
individual backends in their own queues and serialized later: "While a
total ordering is not technically required for correctness, valid
partial orders tend to be too complex and interdependent to be worth
pursuing as a performance optimization"; see also appendix A.5, which
may be succinctly summarized as "no one does that".

Instead, they propose two other optimizations:

1. Subdivide XLOG insertion into three operations: (1) allocate space
in the log buffer, (2) copy the log records into the allocated space,
and (3) release the space to the buffer manager for eventual write to
disk.  AIUI, WALInsertLock currently covers all three phases of this
operation, but phase 2 can proceed in parallel.  It's pretty easy to
imagine maintain one pointer that references the next available byte
of log space (let's call this the "next insert" pointer), and a second
pointer that references the byte following the last byte known to be
written (let's call this the "insert done" pointer).  When a backend
wants to insert data, it locks the "next insert" pointer, advances it,
releases the lock, and starts copying data into the buffer.  That part
is simple enough.  Then we need to update the "insert done" pointer,
which is fine if it happens to point to the beginning of our record -
but it may be that we were busy inserting a short record while the guy
ahead of us was inserting a long one (an FPW, for example), in which
case it gets complicated.  We now need to sleep until everyone ahead
of us is done, then wake up and finish the operation.

You can end up with a chain of waiters that looks like this: A -> B ->
C -> D -> E.  To avoid the scenario where A finishes its operation and
then wakes up B, which must finish its operation and then wake up C,
which must finish its operation and then wake up D, and so on, the
paper proposes a further optimization: have A finish up all the
pending work for B, C, D, and E and then wake them all up at once.  I
guess the operations are so short here that it's more efficient for A
to just do all the work (within some reasonable bounds), so that C, D,
and E don't have to wait for multiple context switches.

2. If a process attempts to allocate space in the log buffer and finds
the lock contended, it instead attempts to lock a so-called
consolidation buffer, where it buddies up with some other processes
having the same problem.  One of them calculated the space requirement
for the whole group, allocates that amount of space, and the parcels
it out to the group members.  The details are complex, but the basic
idea makes a lot of sense, because if incrementing "next insert"
pointer is a hotspot, it clearly makes sense to do one big increment
rather than a series of smaller ones.

In the interest of full disclosure, I'm rather badly misrepresenting
the paper in the above discussion.  First, as they present it, these
two optimizations are independent of each other, whereas I have not
described them so.  This is probably a lack of adequate understanding
on my part.  Second, they aren't really using locks, unless you count
bus locks - they appear to have implemented most or all of it via
CAS-based lock-free algorithms, which is probably well-justified
optimization effort.

Anyway, I think this bears more looking into, and will try to find
time to do so.  The performance results they cite are quite
impressive.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Merlin Moncure
Date:
Subject: Re: WALInsertLock contention
Next
From: Robert Haas
Date:
Subject: Re: WALInsertLock contention