Thread: literature on write-ahead logging

literature on write-ahead logging

From
Robert Haas
Date:
I did a brief literature search for papers on breaking the
WAL-serialization bottleneck today and hit upon this:

Aether: A Scalable Approach to Logging, Ryan Johnson, Ippokratis Pandis, et al.
http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf

Section 5 appears to be the most relevant to our problems with WALInsertLock.

They reject the approach that I proposed, wherein WAL is generated by
individual backends in their own queues and serialized later: "While a
total ordering is not technically required for correctness, valid
partial orders tend to be too complex and interdependent to be worth
pursuing as a performance optimization"; see also appendix A.5, which
may be succinctly summarized as "no one does that".

Instead, they propose two other optimizations:

1. Subdivide XLOG insertion into three operations: (1) allocate space
in the log buffer, (2) copy the log records into the allocated space,
and (3) release the space to the buffer manager for eventual write to
disk.  AIUI, WALInsertLock currently covers all three phases of this
operation, but phase 2 can proceed in parallel.  It's pretty easy to
imagine maintaining one pointer that references the next available byte
of log space (let's call this the "next insert" pointer), and a second
pointer that references the byte following the last byte known to be
written (let's call this the "insert done" pointer).  When a backend
wants to insert data, it locks the "next insert" pointer, advances it,
releases the lock, and starts copying data into the buffer.  That part
is simple enough.  Then we need to update the "insert done" pointer,
which is fine if it happens to point to the beginning of our record -
but it may be that we were busy inserting a short record while the guy
ahead of us was inserting a long one (an FPW, for example), in which
case it gets complicated.  We now need to sleep until everyone ahead
of us is done, then wake up and finish the operation.

You can end up with a chain of waiters that looks like this: A -> B ->
C -> D -> E.  To avoid the scenario where A finishes its operation and
then wakes up B, which must finish its operation and then wake up C,
which must finish its operation and then wake up D, and so on, the
paper proposes a further optimization: have A finish up all the
pending work for B, C, D, and E and then wake them all up at once.  I
guess the operations are so short here that it's more efficient for A
to just do all the work (within some reasonable bounds), so that C, D,
and E don't have to wait for multiple context switches.  (A rough
sketch of this scheme appears after point 2 below.)

2. If a process attempts to allocate space in the log buffer and finds
the lock contended, it instead attempts to lock a so-called
consolidation buffer, where it buddies up with some other processes
having the same problem.  One of them calculates the space requirement
for the whole group, allocates that amount of space, and then parcels
it out to the group members.  The details are complex, but the basic
idea makes a lot of sense, because if incrementing the "next insert"
pointer is a hotspot, it clearly makes sense to do one big increment
rather than a series of smaller ones.
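
To make point 1 a bit more concrete, here is a very rough sketch of the
allocate/copy/release split.  This is purely illustrative C using C11
atomics - the names, the fixed-size non-wrapping buffer, and the
busy-wait in the release step are all made up for the example, and have
nothing to do with what's actually in xlog.c - but it shows the shape of
the idea; the group-release trick described above would replace the
naive wait loop at the end.

/*
 * Illustrative sketch only, not PostgreSQL code.  Assumes a log buffer
 * that never wraps and ignores record headers, CRCs, flushing, etc.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define LOG_BUFFER_SIZE  (16 * 1024 * 1024)

static char             log_buffer[LOG_BUFFER_SIZE];
static _Atomic uint64_t next_insert = 0;    /* next free byte */
static _Atomic uint64_t insert_done = 0;    /* everything below this is copied */

/* Phase 1: claim space.  A single fetch-and-add stands in for "lock the
 * next-insert pointer, advance it, release the lock". */
static uint64_t
log_reserve(uint64_t len)
{
    return atomic_fetch_add(&next_insert, len);
}

/* Phase 2: copy the record into our private slice; fully parallel. */
static void
log_copy(uint64_t start, const void *rec, uint64_t len)
{
    memcpy(log_buffer + start, rec, len);
}

/* Phase 3: make the space available for writing.  We may only advance
 * insert_done once everyone ahead of us is finished; this is where the
 * sleeping happens, and where the paper has one process advance the
 * pointer on behalf of a whole group of finished inserters. */
static void
log_release(uint64_t start, uint64_t len)
{
    uint64_t expected = start;

    /* naive version: spin until our record is the oldest unfinished one */
    while (!atomic_compare_exchange_weak(&insert_done, &expected, start + len))
        expected = start;   /* someone ahead of us isn't done yet */
}

static void
log_insert(const void *rec, uint64_t len)
{
    uint64_t start = log_reserve(len);

    log_copy(start, rec, len);
    log_release(start, len);
}

Obviously the real thing also has to handle buffer wraparound, a full
buffer, and putting waiters to sleep rather than spinning, but the
allocate and release steps are the only serial parts; the memcpy in the
middle runs concurrently in as many backends as you like.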

In the interest of full disclosure, I'm rather badly misrepresenting
the paper in the above discussion.  First, as they present it, these
two optimizations are independent of each other, whereas I have not
described them so.  This is probably a lack of adequate understanding
on my part.  Second, they aren't really using locks, unless you count
bus locks - they appear to have implemented most or all of it via
CAS-based lock-free algorithms, which is probably well-justified
optimization effort.
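
For whatever it's worth, the consolidation idea from point 2 can be
caricatured in the same style.  The following is again just an
illustrative, single-slot sketch in C11 atomics - the paper's
consolidation array has multiple slots and a considerably more careful
protocol, and none of these names exist anywhere real - but it shows the
basic shape: a whole group of contending inserters ends up doing one
fetch-and-add on the shared pointer instead of one apiece.

/*
 * Illustrative sketch of request consolidation; everything here is made
 * up.  One slot, no overflow checks, and the waits are busy loops.
 */
#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint64_t next_insert;        /* the shared "next insert" pointer */

#define GROUP_CLOSED   UINT64_MAX
/* member count in the high 32 bits, requested bytes in the low 32 bits */
#define ONE_MEMBER     (UINT64_C(1) << 32)

static _Atomic uint64_t group_state = 0;    /* (members << 32) | bytes, or GROUP_CLOSED */
static _Atomic uint64_t group_base;         /* group's start offset, set by the leader */
static _Atomic uint32_t group_acks;         /* followers that have read group_base */
static _Atomic int      group_published;    /* leader has filled in group_base */

/* Returns the start of this caller's slice of log space. */
static uint64_t
consolidated_reserve(uint32_t len)
{
    for (;;)
    {
        uint64_t cur = atomic_load(&group_state);

        if (cur == GROUP_CLOSED)
            continue;           /* previous group is wrapping up; retry */

        if (!atomic_compare_exchange_weak(&group_state, &cur,
                                          cur + ONE_MEMBER + len))
            continue;           /* lost a race; retry */

        uint64_t my_offset = cur & UINT32_MAX;  /* bytes requested before me */

        if (cur == 0)
        {
            /* I joined an empty slot, so I am the leader: close the group
             * and do ONE fetch-and-add on the shared pointer for everybody
             * who managed to join before the close. */
            uint64_t closed  = atomic_exchange(&group_state, GROUP_CLOSED);
            uint32_t members = (uint32_t) (closed >> 32);
            uint64_t total   = closed & UINT32_MAX;
            uint64_t base    = atomic_fetch_add(&next_insert, total);

            atomic_store(&group_base, base);
            atomic_store(&group_published, 1);

            /* wait for the followers to pick up their offsets, then reopen */
            while (atomic_load(&group_acks) != members - 1)
                ;
            atomic_store(&group_published, 0);
            atomic_store(&group_acks, 0);
            atomic_store(&group_state, 0);

            return base;        /* the leader's slice starts at the base */
        }

        /* Follower: wait for the leader to publish the base, take my slice. */
        while (!atomic_load(&group_published))
            ;
        uint64_t base = atomic_load(&group_base);

        atomic_fetch_add(&group_acks, 1);
        return base + my_offset;
    }
}

The leader here spins waiting for its followers to pick up their offsets
before reopening the slot, and the joining window is unrealistically
narrow (in the paper the group stays open while the leader waits its
turn on the log, which is where the batching win actually comes from),
so treat this strictly as a shape-of-the-idea sketch.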

Anyway, I think this bears more looking into, and will try to find
time to do so.  The performance results they cite are quite
impressive.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: literature on write-ahead logging

From
Merlin Moncure
Date:
On Wed, Jun 8, 2011 at 11:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I did a brief literature search for papers on breaking the
> WAL-serialization bottleneck today and hit upon this:
>
> Aether: A Scalable Approach to Logging, Ryan Johnson, Ippokratis Pandis, et al.
> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
>
> Section 5 appears to be the most relevant to our problems with WALInsertLock.
>
> They reject the approach that I proposed, wherein WAL is generated by
> individual backends in their own queues and serialized later: "While a
> total ordering is not technically required for correctness, valid
> partial orders tend to be too complex and interdependent to be worth
> pursuing as a performance optimization"; see also appendix A.5, which
> may be succinctly summarized as "no one does that".

heh -- makes total sense.  great stuff.

merlin


Re: literature on write-ahead logging

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of Wed Jun 08 23:24:34 -0400 2011:
> I did a brief literature search for papers on breaking the
> WAL-serialization bottleneck today and hit upon this:
> 
> Aether: A Scalable Approach to Logging, Ryan Johnson, Ippokratis Pandis, et al.
> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf

Great.

> 1. Subdivide XLOG insertion into three operations: (1) allocate space
> in the log buffer, (2) copy the log records into the allocated space,
> and (3) release the space to the buffer manager for eventual write to
> disk.  AIUI, WALInsertLock currently covers all three phases of this
> operation, but phase 2 can proceed in parallel.  It's pretty easy to
> imagine maintain one pointer that references the next available byte
> of log space (let's call this the "next insert" pointer), and a second
> pointer that references the byte following the last byte known to be
> written (let's call this the "insert done" pointer).

I think this can be done more simply if, instead of a single "insert
done" pointer, you have an array of them, one per backend; there's also a
global pointer that can be advanced to the minimum of the bunch, which
you can calculate with some quick locking of the array.  You don't need
to sleep at all, except to update the array and calculate the global
pointer, so this is probably also faster.
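
Something along these lines is what I'm picturing - illustrative C only,
with a plain mutex standing in for the quick locking and every name made
up; each backend records where its in-progress insertion starts, and an
idle backend parks its slot at "infinity" so it never holds the global
pointer back:

#include <pthread.h>
#include <stdint.h>

#define MAX_BACKENDS 100                    /* pretend MaxBackends */

static uint64_t        insert_start[MAX_BACKENDS];  /* UINT64_MAX = idle */
static uint64_t        global_done;                 /* everything below this is copied */
static pthread_mutex_t slots_lock = PTHREAD_MUTEX_INITIALIZER;

static void
init_slots(void)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
        insert_start[i] = UINT64_MAX;
}

static void
start_insert(int backend_id, uint64_t start)
{
    pthread_mutex_lock(&slots_lock);
    insert_start[backend_id] = start;       /* my copy begins here */
    pthread_mutex_unlock(&slots_lock);
}

static void
finish_insert(int backend_id)
{
    pthread_mutex_lock(&slots_lock);
    insert_start[backend_id] = UINT64_MAX;  /* done copying */
    pthread_mutex_unlock(&slots_lock);
}

/*
 * Advance the global pointer to just below the oldest in-progress
 * insertion.  Only needed when we actually want to write or flush the
 * log, not on every record insertion.
 */
static uint64_t
advance_global_done(uint64_t next_insert_ptr)
{
    uint64_t min = next_insert_ptr;         /* nothing in progress => all copied */

    pthread_mutex_lock(&slots_lock);
    for (int i = 0; i < MAX_BACKENDS; i++)
        if (insert_start[i] < min)
            min = insert_start[i];
    if (min > global_done)
        global_done = min;
    pthread_mutex_unlock(&slots_lock);

    return global_done;
}

The point being that a backend never waits for the backends ahead of it:
it records its own position, does its copy, and clears the slot; only
whoever wants to write the log out has to look at the whole array.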

> Second, they aren't really using locks, unless you count
> bus locks - they appear to have implemented most or all of it via
> CAS-based lock-free algorithms, which is probably well-justified
> optimization effort.

Oh, hmm ...

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: literature on write-ahead logging

From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 10:22 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
>> 1. Subdivide XLOG insertion into three operations: (1) allocate space
>> in the log buffer, (2) copy the log records into the allocated space,
>> and (3) release the space to the buffer manager for eventual write to
>> disk.  AIUI, WALInsertLock currently covers all three phases of this
>> operation, but phase 2 can proceed in parallel.  It's pretty easy to
>> imagine maintain one pointer that references the next available byte
>> of log space (let's call this the "next insert" pointer), and a second
>> pointer that references the byte following the last byte known to be
>> written (let's call this the "insert done" pointer).
>
> I think this can be done more simply if instead of a single "insert
> done" pointer you have an array of them, one per backend; there's also a
> global pointer that can be advanced per the minimum of the bunch, which
> you can calculate with some quick locking of the array.  You don't need
> to sleep at all, except to update the array and calculate the global
> ptr, so this is probably also faster.

I think looping over an array with one entry per backend is going to
be intolerably slow...  but it's possible I'm wrong.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: literature on write-ahead logging

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of Thu Jun 09 10:28:39 -0400 2011:
> On Thu, Jun 9, 2011 at 10:22 AM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> >> 1. Subdivide XLOG insertion into three operations: (1) allocate space
> >> in the log buffer, (2) copy the log records into the allocated space,
> >> and (3) release the space to the buffer manager for eventual write to
> >> disk.  AIUI, WALInsertLock currently covers all three phases of this
> >> operation, but phase 2 can proceed in parallel.  It's pretty easy to
> >> imagine maintain one pointer that references the next available byte
> >> of log space (let's call this the "next insert" pointer), and a second
> >> pointer that references the byte following the last byte known to be
> >> written (let's call this the "insert done" pointer).
> >
> > I think this can be done more simply if instead of a single "insert
> > done" pointer you have an array of them, one per backend; there's also a
> > global pointer that can be advanced per the minimum of the bunch, which
> > you can calculate with some quick locking of the array.  You don't need
> > to sleep at all, except to update the array and calculate the global
> > ptr, so this is probably also faster.
> 
> I think looping over an array with one entry per backend is going to
> be intolerably slow...  but it's possible I'm wrong.

Slower than sleeping?  Consider that this doesn't need to be done for
each record insertion, only when you need to flush (maybe more than
that, but I think that's the lower limit).

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: literature on write-ahead logging

From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 10:34 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Robert Haas's message of Thu Jun 09 10:28:39 -0400 2011:
>> On Thu, Jun 9, 2011 at 10:22 AM, Alvaro Herrera
>> <alvherre@commandprompt.com> wrote:
>> >> 1. Subdivide XLOG insertion into three operations: (1) allocate space
>> >> in the log buffer, (2) copy the log records into the allocated space,
>> >> and (3) release the space to the buffer manager for eventual write to
>> >> disk.  AIUI, WALInsertLock currently covers all three phases of this
>> >> operation, but phase 2 can proceed in parallel.  It's pretty easy to
>> >> imagine maintain one pointer that references the next available byte
>> >> of log space (let's call this the "next insert" pointer), and a second
>> >> pointer that references the byte following the last byte known to be
>> >> written (let's call this the "insert done" pointer).
>> >
>> > I think this can be done more simply if instead of a single "insert
>> > done" pointer you have an array of them, one per backend; there's also a
>> > global pointer that can be advanced per the minimum of the bunch, which
>> > you can calculate with some quick locking of the array.  You don't need
>> > to sleep at all, except to update the array and calculate the global
>> > ptr, so this is probably also faster.
>>
>> I think looping over an array with one entry per backend is going to
>> be intolerably slow...  but it's possible I'm wrong.
>
> Slower than sleeping?  Consider that this doesn't need to be done for
> each record insertion, only when you need to flush (maybe more than
> that, but I think that's the lower limit).

Maybe.  I'm worried that if someone jacks up max_connections to 1000
or 5000 or somesuch it could get pretty slow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: literature on write-ahead logging

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of Thu Jun 09 10:55:45 -0400 2011:
> On Thu, Jun 9, 2011 at 10:34 AM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:

> > Slower than sleeping?  Consider that this doesn't need to be done for
> > each record insertion, only when you need to flush (maybe more than
> > that, but I think that's the lower limit).
> 
> Maybe.  I'm worried that if someone jacks up max_connections to 1000
> or 5000 or somesuch it could get pretty slow.

Well, other things are going to get pretty slow as well, not just this
one, which is why we suggest using a connection pooler with a reasonable
limit.

On the other hand, maybe those are things we ought to address sometime,
so perhaps we don't want to be designing the old limitation into a new
feature.

A possibly crazy idea: instead of having a MaxBackends-sized array, how
about some smaller array of insert-done-pointer-updating backends (a
couple dozen or so), and if it's full, the next one has to sleep a bit
until one of them becomes available.  We could protect this with a
PGSemaphore having as many counts as there are items in the array.
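
Roughly like this, I mean - again just illustrative C, with a POSIX
semaphore standing in for the PGSemaphore and all of the names invented:

#include <semaphore.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_SLOTS 24                        /* "a couple dozen or so" */

static _Atomic uint64_t slot_pos[NUM_SLOTS];    /* UINT64_MAX = slot free */
static sem_t            free_slots;             /* counts free slots */

static void
init_slots(void)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        atomic_store(&slot_pos[i], UINT64_MAX);
    sem_init(&free_slots, 0, NUM_SLOTS);    /* 0 = single process, demo only */
}

/* Claim a slot before starting to copy a record that begins at 'start'. */
static int
acquire_slot(uint64_t start)
{
    sem_wait(&free_slots);                  /* sleeps only when every slot is taken */

    for (;;)
    {
        for (int i = 0; i < NUM_SLOTS; i++)
        {
            uint64_t expected = UINT64_MAX;

            if (atomic_compare_exchange_strong(&slot_pos[i], &expected, start))
                return i;                   /* slot i is now mine */
        }
        /* the semaphore guarantees some slot is free; keep scanning until
         * we manage to grab it */
    }
}

static void
release_slot(int slot)
{
    atomic_store(&slot_pos[slot], UINT64_MAX);
    sem_post(&free_slots);                  /* wake one sleeper, if any */
}

That caps the array you have to scan at NUM_SLOTS regardless of
max_connections, at the price of the (N+1)-th concurrent inserter
sleeping on the semaphore.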

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: literature on write-ahead logging

From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 11:13 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Robert Haas's message of Thu Jun 09 10:55:45 -0400 2011:
>> On Thu, Jun 9, 2011 at 10:34 AM, Alvaro Herrera
>> <alvherre@commandprompt.com> wrote:
>
>> > Slower than sleeping?  Consider that this doesn't need to be done for
>> > each record insertion, only when you need to flush (maybe more than
>> > that, but I think that's the lower limit).
>>
>> Maybe.  I'm worried that if someone jacks up max_connections to 1000
>> or 5000 or somesuch it could get pretty slow.
>
> Well, other things are going to get pretty slow as well, not just this
> one, which is why we suggest using a connection pooler with a reasonable
> limit.
>
> On the other hand, maybe those are things we ought to address sometime,
> so perhaps we don't want to be designing the old limitation into a new
> feature.
>
> A possibly crazy idea: instead of having a MaxBackends-sized array, how
> about some smaller array of insert-done-pointer-updating backends (a
> couple dozen or so), and if it's full, the next one has to sleep a bit
> until one of them becomes available.  We could protect this with a
> PGSemaphore having as many counts as items are in the array.

Maybe.  It would have to be structured in such a way that you didn't
perform a system call in the common case, I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: literature on write-ahead logging

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> [ lots of interesting stuff about WAL optimization snipped ]

> ... Second, they aren't really using locks, unless you count
> bus locks - they appear to have implemented most or all of it via
> CAS-based lock-free algorithms, which is probably well-justified
> optimization effort.

FWIW, I'm pretty suspicious of claims that lock-free data structures
will be some kind of magic bullet.  As far as I can tell, a very large
part of our contention problems on many-core machines stem from the CPUs
fighting over cache line ownership.  Avoiding an explicit lock in favor
of hardware primitive test-and-modify instructions isn't going to do a
darn thing to improve that.  And contorting the algorithms until they
fit into what's portably available in that line could easily be a net
loss.
        regards, tom lane


Re: literature on write-ahead logging

From
Robert Haas
Date:
On Thu, Jun 9, 2011 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> [ lots of interesting stuff about WAL optimization snipped ]
>
>> ... Second, they aren't really using locks, unless you count
>> bus locks - they appear to have implemented most or all of it via
>> CAS-based lock-free algorithms, which is probably well-justified
>> optimization effort.
>
> FWIW, I'm pretty suspicious of claims that lock-free data structures
> will be some kind of magic bullet.  As far as I can tell, a very large
> part of our contention problems on many-core machines stem from the CPUs
> fighting over cache line ownership.  Avoiding an explicit lock in favor
> of hardware primitive test-and-modify instructions isn't going to do a
> darn thing to improve that.  And contorting the algorithms until they
> fit into what's portably available in that line could easily be a net
> loss.

That's possible.  It would definitely be possible to get slap-happy
with CAS, and I'm not eager to go there just because we can.  On the
other hand, these lock-free algorithms seem to be springing up like
kudzu, so I doubt we'll be able to avoid them forever.  People
wouldn't keep doing it if it didn't solve some problem that they have.
My suspicion is that in many cases there are better ways to optimize
that avoid the need to serialize things altogether (as the fastlock
patch does).  But WAL insertion might be an exception: nobody seems to
have any good ideas about how to do that without serial locking;
unless someone does, I don't see any alternative other than to
compress the critical section down to as few machine language
instructions as possible.

Anyway it's all speculation at this point: when I or someone else has
time to write some actual code, then we'll benchmark it and see what
happens.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: literature on write-ahead logging

From
Jim Nasby
Date:
On Jun 9, 2011, at 9:28 AM, Robert Haas wrote:
> On Thu, Jun 9, 2011 at 10:22 AM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>>> 1. Subdivide XLOG insertion into three operations: (1) allocate space
>>> in the log buffer, (2) copy the log records into the allocated space,
>>> and (3) release the space to the buffer manager for eventual write to
>>> disk.  AIUI, WALInsertLock currently covers all three phases of this
>>> operation, but phase 2 can proceed in parallel.  It's pretty easy to
>>> imagine maintain one pointer that references the next available byte
>>> of log space (let's call this the "next insert" pointer), and a second
>>> pointer that references the byte following the last byte known to be
>>> written (let's call this the "insert done" pointer).
>>
>> I think this can be done more simply if instead of a single "insert
>> done" pointer you have an array of them, one per backend; there's also a
>> global pointer that can be advanced per the minimum of the bunch, which
>> you can calculate with some quick locking of the array.  You don't need
>> to sleep at all, except to update the array and calculate the global
>> ptr, so this is probably also faster.
>
> I think looping over an array with one entry per backend is going to
> be intolerably slow...  but it's possible I'm wrong.

If the global pointer is just tracking the minimum of the array entries, do you actually have to lock the entire array?
The global pointer is going to be "behind" soon enough anyway, so why does it need a static view of the entire array?
ISTM you only need to ensure that updating individual elements in the array is atomic.
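
i.e. something like this (just illustrative C11 again; note it does
assume each backend stores its position into its slot before the space
it reserved can otherwise look finished, since otherwise the scan could
overshoot):

#include <stdatomic.h>
#include <stdint.h>

#define MAX_BACKENDS 100

static _Atomic uint64_t insert_start[MAX_BACKENDS]; /* UINT64_MAX = idle */
static _Atomic uint64_t global_done;

/* Each backend updates only its own slot; a plain atomic store, no lock. */
static void
set_my_position(int backend_id, uint64_t pos)
{
    atomic_store(&insert_start[backend_id], pos);
}

/*
 * Unlocked scan.  Provided each backend publishes its slot before its
 * reservation becomes visible, a stale entry can only make the computed
 * minimum smaller than necessary, so the global pointer just lags a bit -
 * it is still a valid lower bound.
 */
static uint64_t
advance_global_done(uint64_t next_insert_ptr)
{
    uint64_t min = next_insert_ptr;

    for (int i = 0; i < MAX_BACKENDS; i++)
    {
        uint64_t pos = atomic_load(&insert_start[i]);

        if (pos < min)
            min = pos;
    }

    /* only ever move the global pointer forward */
    uint64_t cur = atomic_load(&global_done);

    while (min > cur &&
           !atomic_compare_exchange_weak(&global_done, &cur, min))
        ;   /* cur is reloaded by the failed CAS; retry while we can still advance */

    return atomic_load(&global_done);
}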
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net