Thread: Race condition between hot standby and restoring a FPW

Race condition between hot standby and restoring a FPW

From
Heikki Linnakangas
Date:
There's a race condition between a backend running queries in hot 
standby mode, and restoring a full-page image from a WAL record. It's 
present in all supported versions.

RestoreBackupBlockContents does this:

>     buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
>                                     RBM_ZERO);
>     Assert(BufferIsValid(buffer));
>     if (get_cleanup_lock)
>         LockBufferForCleanup(buffer);
>     else
>         LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

If the page is not in buffer cache yet, and a backend reads and locks 
the buffer after the above XLogReadBufferExtended call has zeroed it, 
but before it has locked it, the backend sees an empty page.

The principle of fixing that is straightforward: the zeroed page should 
not be visible to others, even momentarily. Unfortunately, I think 
that's going to require an API change to ReadBufferExtended(RBM_ZERO) :-(.

I can think of two ways to fix this:

1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before 
returning it. That makes the API inconsistent, as the function would 
sometimes lock the page, and sometimes not.

2. When ReadBufferExtended doesn't find the page in cache, it returns 
the buffer in !BM_VALID state (i.e. still in I/O in-progress state). 
Require the caller to call a second function, after locking the page, to 
finish the I/O.

Anyone have a better idea?

- Heikki



Re: Race condition between hot standby and restoring a FPW

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> There's a race condition between a backend running queries in hot 
> standby mode, and restoring a full-page image from a WAL record. It's 
> present in all supported versions.

> I can think of two ways to fix this:

> 1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before 
> returning it. That makes the API inconsistent, as the function would 
> sometimes lock the page, and sometimes not.

Ugh.

> 2. When ReadBufferExtended doesn't find the page in cache, it returns 
> the buffer in !BM_VALID state (i.e. still in I/O in-progress state). 
> Require the caller to call a second function, after locking the page, to 
> finish the I/O.

Not great either.  What about an RBM_NOERROR mode that is like RBM_ZERO
in terms of handling error conditions, but does not forcibly zero the page
if it's already valid?
        regards, tom lane



Re: Race condition between hot standby and restoring a FPW

From
Robert Haas
Date:
On Wed, Nov 12, 2014 at 7:39 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> 2. When ReadBufferExtended doesn't find the page in cache, it returns the
> buffer in !BM_VALID state (i.e. still in I/O in-progress state). Require the
> caller to call a second function, after locking the page, to finish the I/O.

This seems like a reasonable approach.

If you tilt your head the right way, zeroing a page and restoring a
backup block are the same thing: either way, you want to "read" the
block into shared buffers without actually reading it, so that you can
overwrite the prior contents with something else.  So, you could fix
this by adding a new mode, RBM_OVERWRITE, and passing the new page
contents as an additional argument to ReadBufferExtended, which would
then memcpy() that data into place where RBM_ZERO calls MemSet() to
zero it.

I'm not sure whether we want to complicate the API that way, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Race condition between hot standby and restoring a FPW

From
Heikki Linnakangas
Date:
On 11/12/2014 04:56 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> There's a race condition between a backend running queries in hot
>> standby mode, and restoring a full-page image from a WAL record. It's
>> present in all supported versions.
>
>> I can think of two ways to fix this:
>
>> 1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before
>> returning it. That makes the API inconsistent, as the function would
>> sometimes lock the page, and sometimes not.
>
> Ugh.
>
>> 2. When ReadBufferExtended doesn't find the page in cache, it returns
>> the buffer in !BM_VALID state (i.e. still in I/O in-progress state).
>> Require the caller to call a second function, after locking the page, to
>> finish the I/O.
>
> Not great either.  What about an RBM_NOERROR mode that is like RBM_ZERO
> in terms of handling error conditions, but does not forcibly zero the page
> if it's already valid?

Isn't that exactly what RBM_ZERO_ONERROR does?

Anyway, you don't want to read the page from disk, just to check if it's 
already valid. We stopped doing that in 8.2 (commit 
8c3cc86e7b688b0efe5ec6ce4f4342c2883b1db5), and it gave a big speedup to 
recovery.

(Note that when the page is already in the buffer-cache, RBM_ZERO 
already doesn't zero the page. So this race condition only happens when 
the page isn't in the buffer cache yet).

- Heikki




Re: Race condition between hot standby and restoring a FPW

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 11/12/2014 04:56 PM, Tom Lane wrote:
>> Not great either.  What about an RBM_NOERROR mode that is like RBM_ZERO
>> in terms of handling error conditions, but does not forcibly zero the page
>> if it's already valid?

> Anyway, you don't want to read the page from disk, just to check if it's 
> already valid.

Oh, good point.

> (Note that when the page is already in the buffer-cache, RBM_ZERO 
> already doesn't zero the page. So this race condition only happens when 
> the page isn't in the buffer cache yet).

Right.

On reconsideration I think the "RBM_ZERO returns page already locked"
alternative may be the less ugly.  That has the advantage that any code
that doesn't get updated will fail clearly and reliably.  With the other
thing, you need some additional back channel for the caller to know
whether it must complete the I/O, and it's not obvious what will happen
if it fails to do that (except that it will be bad).
        regards, tom lane



Re: Race condition between hot standby and restoring a FPW

From
Heikki Linnakangas
Date:
On 11/12/2014 05:08 PM, Robert Haas wrote:
> On Wed, Nov 12, 2014 at 7:39 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> 2. When ReadBufferExtended doesn't find the page in cache, it returns the
>> buffer in !BM_VALID state (i.e. still in I/O in-progress state). Require the
>> caller to call a second function, after locking the page, to finish the I/O.
>
> This seems like a reasonable approach.
>
> If you tilt your head the right way, zeroing a page and restoring a
> backup block are the same thing: either way, you want to "read" the
> block into shared buffers without actually reading it, so that you can
> overwrite the prior contents with something else.  So, you could fix
> this by adding a new mode, RBM_OVERWRITE, and passing the new page
> contents as an additional argument to ReadBufferExtended, which would
> then memcpy() that data into place where RBM_ZERO calls MemSet() to
> zero it.

Yes, that would be quite a clean API. However, there's a problem with 
locking, when the redo routine modifies multiple pages. Currently, you 
lock the page first, and replace the page with the new contents while 
holding the lock. With RBM_OVERWRITE, the new page contents would sneak 
into the buffer before RestoreBackupBlock has acquired the lock on the 
page, and another backend might pin and lock the page before 
RestoreBackupBlock does. The page contents would be valid, but they 
might not be consistent with other buffers yet. The redo routine might 
be doing an atomic operation that spans multiple pages, by holding the 
locks on all the pages until it's finished with all the changes, but the 
backend would see a partial result.

- Heikki




Re: Race condition between hot standby and restoring a FPW

From
Heikki Linnakangas
Date:
On 11/12/2014 05:20 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 11/12/2014 04:56 PM, Tom Lane wrote:
>>> Not great either.  What about an RBM_NOERROR mode that is like RBM_ZERO
>>> in terms of handling error conditions, but does not forcibly zero the page
>>> if it's already valid?
>
>> Anyway, you don't want to read the page from disk, just to check if it's
>> already valid.
>
> Oh, good point.
>
>> (Note that when the page is already in the buffer-cache, RBM_ZERO
>> already doesn't zero the page. So this race condition only happens when
>> the page isn't in the buffer cache yet).
>
> Right.
>
> On reconsideration I think the "RBM_ZERO returns page already locked"
> alternative may be the less ugly.   That has the advantage that any code
> that doesn't get updated will fail clearly and reliably.

Yeah, I'm leaning to that approach as well. It's made more ugly by the 
fact that you sometimes need a cleanup lock on the buffer, so the caller 
will somehow need to specify whether to get a cleanup lock or a normal 
exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK. 
Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code 
that's not updated to break even more obviously, at compile-time.

- Heikki



Re: Race condition between hot standby and restoring a FPW

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 11/12/2014 05:20 PM, Tom Lane wrote:
>> On reconsideration I think the "RBM_ZERO returns page already locked"
>> alternative may be the less ugly.   That has the advantage that any code
>> that doesn't get updated will fail clearly and reliably.

> Yeah, I'm leaning to that approach as well. It's made more ugly by the 
> fact that you sometimes need a cleanup lock on the buffer, so the caller 
> will somehow need to specify whether to get a cleanup lock or a normal 
> exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK. 
> Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code 
> that's not updated to break even more obviously, at compile-time.

Yeah, I was considering suggesting changing the name of the mode too.
+1 for solving the lock-type problem with 2 modes.

(You could also consider leaving RBM_ZERO in place with its current
semantics, but I think what you've shown here is that there is no
safe way to use it, so probably that's not what we should do.)
        regards, tom lane



Re: Race condition between hot standby and restoring a FPW

From
Jim Nasby
Date:
On 11/12/14, 9:47 AM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 11/12/2014 05:20 PM, Tom Lane wrote:
>>> On reconsideration I think the "RBM_ZERO returns page already locked"
>>> alternative may be the less ugly.   That has the advantage that any code
>>> that doesn't get updated will fail clearly and reliably.
>
>> Yeah, I'm leaning to that approach as well. It's made more ugly by the
>> fact that you sometimes need a cleanup lock on the buffer, so the caller
>> will somehow need to specify whether to get a cleanup lock or a normal
>> exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK.
>> Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code
>> that's not updated to break even more obviously, at compile-time.
>
> Yeah, I was considering suggesting changing the name of the mode too.
> +1 for solving the lock-type problem with 2 modes.
>
> (You could also consider leaving RBM_ZERO in place with its current
> semantics, but I think what you've shown here is that there is no
> safe way to use it, so probably that's not what we should do.)

If we're tweaking modes, can we avoid zeroing the buffer if we're about to dump a full page image into it? Presumably
thatmeans we'd have to keep the page locked until the image is written...
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Race condition between hot standby and restoring a FPW

From
Heikki Linnakangas
Date:
On 11/13/2014 01:06 AM, Jim Nasby wrote:
> On 11/12/14, 9:47 AM, Tom Lane wrote:
>> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>>> On 11/12/2014 05:20 PM, Tom Lane wrote:
>>>> On reconsideration I think the "RBM_ZERO returns page already locked"
>>>> alternative may be the less ugly.   That has the advantage that any code
>>>> that doesn't get updated will fail clearly and reliably.
>>
>>> Yeah, I'm leaning to that approach as well. It's made more ugly by the
>>> fact that you sometimes need a cleanup lock on the buffer, so the caller
>>> will somehow need to specify whether to get a cleanup lock or a normal
>>> exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK.
>>> Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code
>>> that's not updated to break even more obviously, at compile-time.
>>
>> Yeah, I was considering suggesting changing the name of the mode too.
>> +1 for solving the lock-type problem with 2 modes.
>>
>> (You could also consider leaving RBM_ZERO in place with its current
>> semantics, but I think what you've shown here is that there is no
>> safe way to use it, so probably that's not what we should do.)

Committed that way. I left a placeholder for RBM_ZERO in back-branches, 
so 3rd party extensions using it continue to work - to the extent that 
they did before - without recompiling. In theory, it's possible that 
someone is using RBM_ZERO while holding an AccessExclusiveLock on the 
relation, or there's something else that ensures that no-one else is 
looking at the relation. I doubt there actually are any extensions out 
there using it, but might as well.

> If we're tweaking modes, can we avoid zeroing the buffer if we're about to dump a full page image into it? Presumably
thatmeans we'd have to keep the page locked until the image is written...
 

Hmm. I suppose we could, as long as the caller knows that is has to 
re-initialize the page. That would probably be ok for all current 
callers. I didn't go ahead with that, however; the zeroing isn't that 
expensive and we would need to rename the mode if it didn't zero the 
page anymore, which would cause some code churn.

- Heikki