Thread: Race condition between hot standby and restoring a FPW
There's a race condition between a backend running queries in hot standby mode, and restoring a full-page image from a WAL record. It's present in all supported versions. RestoreBackupBlockContents does this: > buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block, > RBM_ZERO); > Assert(BufferIsValid(buffer)); > if (get_cleanup_lock) > LockBufferForCleanup(buffer); > else > LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); If the page is not in buffer cache yet, and a backend reads and locks the buffer after the above XLogReadBufferExtended call has zeroed it, but before it has locked it, the backend sees an empty page. The principle of fixing that is straightforward: the zeroed page should not be visible to others, even momentarily. Unfortunately, I think that's going to require an API change to ReadBufferExtended(RBM_ZERO) :-(. I can think of two ways to fix this: 1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before returning it. That makes the API inconsistent, as the function would sometimes lock the page, and sometimes not. 2. When ReadBufferExtended doesn't find the page in cache, it returns the buffer in !BM_VALID state (i.e. still in I/O in-progress state). Require the caller to call a second function, after locking the page, to finish the I/O. Anyone have a better idea? - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > There's a race condition between a backend running queries in hot > standby mode, and restoring a full-page image from a WAL record. It's > present in all supported versions. > I can think of two ways to fix this: > 1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before > returning it. That makes the API inconsistent, as the function would > sometimes lock the page, and sometimes not. Ugh. > 2. When ReadBufferExtended doesn't find the page in cache, it returns > the buffer in !BM_VALID state (i.e. still in I/O in-progress state). > Require the caller to call a second function, after locking the page, to > finish the I/O. Not great either. What about an RBM_NOERROR mode that is like RBM_ZERO in terms of handling error conditions, but does not forcibly zero the page if it's already valid? regards, tom lane
On Wed, Nov 12, 2014 at 7:39 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > 2. When ReadBufferExtended doesn't find the page in cache, it returns the > buffer in !BM_VALID state (i.e. still in I/O in-progress state). Require the > caller to call a second function, after locking the page, to finish the I/O. This seems like a reasonable approach. If you tilt your head the right way, zeroing a page and restoring a backup block are the same thing: either way, you want to "read" the block into shared buffers without actually reading it, so that you can overwrite the prior contents with something else. So, you could fix this by adding a new mode, RBM_OVERWRITE, and passing the new page contents as an additional argument to ReadBufferExtended, which would then memcpy() that data into place where RBM_ZERO calls MemSet() to zero it. I'm not sure whether we want to complicate the API that way, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 11/12/2014 04:56 PM, Tom Lane wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> writes: >> There's a race condition between a backend running queries in hot >> standby mode, and restoring a full-page image from a WAL record. It's >> present in all supported versions. > >> I can think of two ways to fix this: > >> 1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before >> returning it. That makes the API inconsistent, as the function would >> sometimes lock the page, and sometimes not. > > Ugh. > >> 2. When ReadBufferExtended doesn't find the page in cache, it returns >> the buffer in !BM_VALID state (i.e. still in I/O in-progress state). >> Require the caller to call a second function, after locking the page, to >> finish the I/O. > > Not great either. What about an RBM_NOERROR mode that is like RBM_ZERO > in terms of handling error conditions, but does not forcibly zero the page > if it's already valid? Isn't that exactly what RBM_ZERO_ONERROR does? Anyway, you don't want to read the page from disk, just to check if it's already valid. We stopped doing that in 8.2 (commit 8c3cc86e7b688b0efe5ec6ce4f4342c2883b1db5), and it gave a big speedup to recovery. (Note that when the page is already in the buffer-cache, RBM_ZERO already doesn't zero the page. So this race condition only happens when the page isn't in the buffer cache yet). - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 11/12/2014 04:56 PM, Tom Lane wrote: >> Not great either. What about an RBM_NOERROR mode that is like RBM_ZERO >> in terms of handling error conditions, but does not forcibly zero the page >> if it's already valid? > Anyway, you don't want to read the page from disk, just to check if it's > already valid. Oh, good point. > (Note that when the page is already in the buffer-cache, RBM_ZERO > already doesn't zero the page. So this race condition only happens when > the page isn't in the buffer cache yet). Right. On reconsideration I think the "RBM_ZERO returns page already locked" alternative may be the less ugly. That has the advantage that any code that doesn't get updated will fail clearly and reliably. With the other thing, you need some additional back channel for the caller to know whether it must complete the I/O, and it's not obvious what will happen if it fails to do that (except that it will be bad). regards, tom lane
On 11/12/2014 05:08 PM, Robert Haas wrote: > On Wed, Nov 12, 2014 at 7:39 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> 2. When ReadBufferExtended doesn't find the page in cache, it returns the >> buffer in !BM_VALID state (i.e. still in I/O in-progress state). Require the >> caller to call a second function, after locking the page, to finish the I/O. > > This seems like a reasonable approach. > > If you tilt your head the right way, zeroing a page and restoring a > backup block are the same thing: either way, you want to "read" the > block into shared buffers without actually reading it, so that you can > overwrite the prior contents with something else. So, you could fix > this by adding a new mode, RBM_OVERWRITE, and passing the new page > contents as an additional argument to ReadBufferExtended, which would > then memcpy() that data into place where RBM_ZERO calls MemSet() to > zero it. Yes, that would be quite a clean API. However, there's a problem with locking, when the redo routine modifies multiple pages. Currently, you lock the page first, and replace the page with the new contents while holding the lock. With RBM_OVERWRITE, the new page contents would sneak into the buffer before RestoreBackupBlock has acquired the lock on the page, and another backend might pin and lock the page before RestoreBackupBlock does. The page contents would be valid, but they might not be consistent with other buffers yet. The redo routine might be doing an atomic operation that spans multiple pages, by holding the locks on all the pages until it's finished with all the changes, but the backend would see a partial result. - Heikki
On 11/12/2014 05:20 PM, Tom Lane wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> writes: >> On 11/12/2014 04:56 PM, Tom Lane wrote: >>> Not great either. What about an RBM_NOERROR mode that is like RBM_ZERO >>> in terms of handling error conditions, but does not forcibly zero the page >>> if it's already valid? > >> Anyway, you don't want to read the page from disk, just to check if it's >> already valid. > > Oh, good point. > >> (Note that when the page is already in the buffer-cache, RBM_ZERO >> already doesn't zero the page. So this race condition only happens when >> the page isn't in the buffer cache yet). > > Right. > > On reconsideration I think the "RBM_ZERO returns page already locked" > alternative may be the less ugly. That has the advantage that any code > that doesn't get updated will fail clearly and reliably. Yeah, I'm leaning to that approach as well. It's made more ugly by the fact that you sometimes need a cleanup lock on the buffer, so the caller will somehow need to specify whether to get a cleanup lock or a normal exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK. Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code that's not updated to break even more obviously, at compile-time. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 11/12/2014 05:20 PM, Tom Lane wrote: >> On reconsideration I think the "RBM_ZERO returns page already locked" >> alternative may be the less ugly. That has the advantage that any code >> that doesn't get updated will fail clearly and reliably. > Yeah, I'm leaning to that approach as well. It's made more ugly by the > fact that you sometimes need a cleanup lock on the buffer, so the caller > will somehow need to specify whether to get a cleanup lock or a normal > exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK. > Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code > that's not updated to break even more obviously, at compile-time. Yeah, I was considering suggesting changing the name of the mode too. +1 for solving the lock-type problem with 2 modes. (You could also consider leaving RBM_ZERO in place with its current semantics, but I think what you've shown here is that there is no safe way to use it, so probably that's not what we should do.) regards, tom lane
On 11/12/14, 9:47 AM, Tom Lane wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> writes: >> On 11/12/2014 05:20 PM, Tom Lane wrote: >>> On reconsideration I think the "RBM_ZERO returns page already locked" >>> alternative may be the less ugly. That has the advantage that any code >>> that doesn't get updated will fail clearly and reliably. > >> Yeah, I'm leaning to that approach as well. It's made more ugly by the >> fact that you sometimes need a cleanup lock on the buffer, so the caller >> will somehow need to specify whether to get a cleanup lock or a normal >> exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK. >> Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code >> that's not updated to break even more obviously, at compile-time. > > Yeah, I was considering suggesting changing the name of the mode too. > +1 for solving the lock-type problem with 2 modes. > > (You could also consider leaving RBM_ZERO in place with its current > semantics, but I think what you've shown here is that there is no > safe way to use it, so probably that's not what we should do.) If we're tweaking modes, can we avoid zeroing the buffer if we're about to dump a full page image into it? Presumably thatmeans we'd have to keep the page locked until the image is written... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 11/13/2014 01:06 AM, Jim Nasby wrote: > On 11/12/14, 9:47 AM, Tom Lane wrote: >> Heikki Linnakangas <hlinnakangas@vmware.com> writes: >>> On 11/12/2014 05:20 PM, Tom Lane wrote: >>>> On reconsideration I think the "RBM_ZERO returns page already locked" >>>> alternative may be the less ugly. That has the advantage that any code >>>> that doesn't get updated will fail clearly and reliably. >> >>> Yeah, I'm leaning to that approach as well. It's made more ugly by the >>> fact that you sometimes need a cleanup lock on the buffer, so the caller >>> will somehow need to specify whether to get a cleanup lock or a normal >>> exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK. >>> Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code >>> that's not updated to break even more obviously, at compile-time. >> >> Yeah, I was considering suggesting changing the name of the mode too. >> +1 for solving the lock-type problem with 2 modes. >> >> (You could also consider leaving RBM_ZERO in place with its current >> semantics, but I think what you've shown here is that there is no >> safe way to use it, so probably that's not what we should do.) Committed that way. I left a placeholder for RBM_ZERO in back-branches, so 3rd party extensions using it continue to work - to the extent that they did before - without recompiling. In theory, it's possible that someone is using RBM_ZERO while holding an AccessExclusiveLock on the relation, or there's something else that ensures that no-one else is looking at the relation. I doubt there actually are any extensions out there using it, but might as well. > If we're tweaking modes, can we avoid zeroing the buffer if we're about to dump a full page image into it? Presumably thatmeans we'd have to keep the page locked until the image is written... Hmm. I suppose we could, as long as the caller knows that is has to re-initialize the page. That would probably be ok for all current callers. I didn't go ahead with that, however; the zeroing isn't that expensive and we would need to rename the mode if it didn't zero the page anymore, which would cause some code churn. - Heikki