literature on write-ahead logging - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | literature on write-ahead logging |
Date | |
Msg-id | BANLkTik-PFhy+chdY_v4z=oHufSOvvkh-g@mail.gmail.com Whole thread Raw |
Responses |
Re: literature on write-ahead logging
Re: literature on write-ahead logging Re: literature on write-ahead logging |
List | pgsql-hackers |
I did a brief literature search for papers on breaking the WAL-serialization bottleneck today and hit upon this: Aether: A Scalable Approach to Logging, Ryan Johnson, Ippokratis Pandis, et al. http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf Section 5 appears to be the most relevant to our problems with WALInsertLock. They reject the approach that I proposed, wherein WAL is generated by individual backends in their own queues and serialized later: "While a total ordering is not technically required for correctness, valid partial orders tend to be too complex and interdependent to be worth pursuing as a performance optimization"; see also appendix A.5, which may be succinctly summarized as "no one does that". Instead, they propose two other optimizations: 1. Subdivide XLOG insertion into three operations: (1) allocate space in the log buffer, (2) copy the log records into the allocated space, and (3) release the space to the buffer manager for eventual write to disk. AIUI, WALInsertLock currently covers all three phases of this operation, but phase 2 can proceed in parallel. It's pretty easy to imagine maintain one pointer that references the next available byte of log space (let's call this the "next insert" pointer), and a second pointer that references the byte following the last byte known to be written (let's call this the "insert done" pointer). When a backend wants to insert data, it locks the "next insert" pointer, advances it, releases the lock, and starts copying data into the buffer. That part is simple enough. Then we need to update the "insert done" pointer, which is fine if it happens to point to the beginning of our record - but it may be that we were busy inserting a short record while the guy ahead of us was inserting a long one (an FPW, for example), in which case it gets complicated. We now need to sleep until everyone ahead of us is done, then wake up and finish the operation. You can end up with a chain of waiters that looks like this: A -> B -> C -> D -> E. To avoid the scenario where A finishes its operation and then wakes up B, which must finish its operation and then wake up C, which must finish its operation and then wake up D, and so on, the paper proposes a further optimization: have A finish up all the pending work for B, C, D, and E and then wake them all up at once. I guess the operations are so short here that it's more efficient for A to just do all the work (within some reasonable bounds), so that C, D, and E don't have to wait for multiple context switches. 2. If a process attempts to allocate space in the log buffer and finds the lock contended, it instead attempts to lock a so-called consolidation buffer, where it buddies up with some other processes having the same problem. One of them calculated the space requirement for the whole group, allocates that amount of space, and the parcels it out to the group members. The details are complex, but the basic idea makes a lot of sense, because if incrementing "next insert" pointer is a hotspot, it clearly makes sense to do one big increment rather than a series of smaller ones. In the interest of full disclosure, I'm rather badly misrepresenting the paper in the above discussion. First, as they present it, these two optimizations are independent of each other, whereas I have not described them so. This is probably a lack of adequate understanding on my part. Second, they aren't really using locks, unless you count bus locks - they appear to have implemented most or all of it via CAS-based lock-free algorithms, which is probably well-justified optimization effort. Anyway, I think this bears more looking into, and will try to find time to do so. The performance results they cite are quite impressive. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: