Thread: Checkpoint not retrying failed fsync?

Checkpoint not retrying failed fsync?

From
Andrew Gierth
Date:
This is only a preliminary report, I'm still trying to analyze what's
going on, but:

In doing testing on FreeBSD with a filesystem set up to induce errors
controllably (using gconcat+gnop), I can get this to happen (on 11devel):

(note that "mytable" is on a tablespace on the erroring filesystem,
while "x" is on a clean filesystem)

postgres=# insert into mytable values (-1);
INSERT 0 1
postgres=# checkpoint;
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
postgres=# insert into x values (3);
INSERT 0 1
postgres=# checkpoint;
CHECKPOINT

(the message in the server log is the expected one about fsync failing)

Checking the WAL shows that there is indeed a checkpoint record for the
second checkpoint and pg_control points to it, so a crash restart at
this point would not try and replay the "mytable" write.

Furthermore, checking the trace output from the checkpointer process, it
is not even attempting an fsync of the failing file; this isn't like the
Linux fsync issue, I've confirmed that fsync will repeatedly fail on the
file until the underlying errors stop.

As far as I can tell from reading the code, if a checkpoint fails the
checkpointer is supposed to keep all the outstanding fsync requests for
next time. Am I wrong, or is there some failure in the logic to do this?

-- 
Andrew (irc:RhodiumToad)


Re: Checkpoint not retrying failed fsync?

From
Thomas Munro
Date:
On Fri, Apr 6, 2018 at 10:16 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Furthermore, checking the trace output from the checkpointer process, it
> is not even attempting an fsync of the failing file; this isn't like the
> Linux fsync issue, I've confirmed that fsync will repeatedly fail on the
> file until the underlying errors stop.

Thank you for confirming that!  Now, how does one go about buying
shares in FreeBSD?

> As far as I can tell from reading the code, if a checkpoint fails the
> checkpointer is supposed to keep all the outstanding fsync requests for
> next time. Am I wrong, or is there some failure in the logic to do this?

Yikes.  I think this is suspicious:

                 * The bitmap manipulations are slightly tricky,
because we can call
                 * AbsorbFsyncRequests() inside the loop and that
could result in
                 * bms_add_member() modifying and even re-palloc'ing
the bitmapsets.
                 * This is okay because we unlink each bitmapset from
the hashtable
                 * entry before scanning it.  That means that any incoming fsync
                 * requests will be processed now if they reach the
table before we
                 * begin to scan their fork.

Why is it OK to unlink the bitmapset?  We still need its contents, in
the case that the fsync fails!

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Checkpoint not retrying failed fsync?

From
Andrew Gierth
Date:
>>>>> "Thomas" == Thomas Munro <thomas.munro@enterprisedb.com> writes:

 >> As far as I can tell from reading the code, if a checkpoint fails the
 >> checkpointer is supposed to keep all the outstanding fsync requests for
 >> next time. Am I wrong, or is there some failure in the logic to do this?

 Thomas> Yikes.  I think this is suspicious:

Yes, tracing through a checkpoint shows that this is clearly wrong.

 Thomas> Why is it OK to unlink the bitmapset? We still need its
 Thomas> contents, in the case that the fsync fails!

Right.

But I don't think just copying the value is sufficient; if a new bit was
set while we were processing the old ones, how would we know which to
clear? We couldn't just clear all the bits afterwards because then we
might lose a request.

-- 
Andrew (irc:RhodiumToad)


Re: Checkpoint not retrying failed fsync?

From
Thomas Munro
Date:
On Fri, Apr 6, 2018 at 11:34 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Thomas" == Thomas Munro <thomas.munro@enterprisedb.com> writes:
>
>  >> As far as I can tell from reading the code, if a checkpoint fails the
>  >> checkpointer is supposed to keep all the outstanding fsync requests for
>  >> next time. Am I wrong, or is there some failure in the logic to do this?
>
>  Thomas> Yikes.  I think this is suspicious:
>
> Yes, tracing through a checkpoint shows that this is clearly wrong.
>
>  Thomas> Why is it OK to unlink the bitmapset? We still need its
>  Thomas> contents, in the case that the fsync fails!
>
> Right.
>
> But I don't think just copying the value is sufficient; if a new bit was
> set while we were processing the old ones, how would we know which to
> clear? We couldn't just clear all the bits afterwards because then we
> might lose a request.

Agreed.  The attached draft patch handles that correctly, I think.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Checkpoint not retrying failed fsync?

From
Thomas Munro
Date:
On Fri, Apr 6, 2018 at 11:36 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Apr 6, 2018 at 11:34 AM, Andrew Gierth
> <andrew@tao11.riddles.org.uk> wrote:
>> Right.
>>
>> But I don't think just copying the value is sufficient; if a new bit was
>> set while we were processing the old ones, how would we know which to
>> clear? We couldn't just clear all the bits afterwards because then we
>> might lose a request.
>
> Agreed.  The attached draft patch handles that correctly, I think.

After some testing, here is a better one for review.  Changes:

1.  The copy wasn't really needed.
2.  errno needed to be restored (in case bms_union stomped on it),
thanks to Andrew for an off-list complaint about that.
3.  Memory context was wrong.
4.  bms_first_member() eats its input.  Need to use bms_next_member() instead.
5.  Mustn't leak in error path (that was a pre-existing bug).

Also here's a patch to test it.  If the file /tmp/broken_fsync exists,
you'll see a fake EIO when you CHECKPOINT.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Checkpoint not retrying failed fsync?

From
Thomas Munro
Date:
On Fri, Apr 6, 2018 at 12:56 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> After some testing, here is a better one for review.

One problem I thought of about 8 milliseconds after clicking send is
that bms_union() may fail to allocate memory and then you're hosed.
Here is a new version that uses bms_join() instead, because that can't
fail.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Checkpoint not retrying failed fsync?

From
Amit Kapila
Date:
On Fri, Apr 6, 2018 at 6:26 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Apr 6, 2018 at 11:36 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Fri, Apr 6, 2018 at 11:34 AM, Andrew Gierth
>> <andrew@tao11.riddles.org.uk> wrote:
>>> Right.
>>>
>>> But I don't think just copying the value is sufficient; if a new bit was
>>> set while we were processing the old ones, how would we know which to
>>> clear? We couldn't just clear all the bits afterwards because then we
>>> might lose a request.
>>
>> Agreed.  The attached draft patch handles that correctly, I think.
>
> After some testing, here is a better one for review.  Changes:
>
> 1.  The copy wasn't really needed.
> 2.  errno needed to be restored (in case bms_union stomped on it),
> thanks to Andrew for an off-list complaint about that.
> 3.  Memory context was wrong.
> 4.  bms_first_member() eats its input.  Need to use bms_next_member() instead.
>

Won't in the success case, you need to delete each member (by
something like bms_del_member) rather than just using bms_free?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Checkpoint not retrying failed fsync?

From
Thomas Munro
Date:
On Sun, Apr 8, 2018 at 5:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Won't in the success case, you need to delete each member (by
> something like bms_del_member) rather than just using bms_free?

Thanks for looking at this.  Yeah, if requests for segment numbers 0
and 1 were in "requests", and 0 succeeded but then 1 fails, my
previous patch would leave both in there to be retried next time
around.  I thought that was pretty harmless so I didn't worry about it
before, but of course you're right that it's not necessary to retry
the ones that succeeded, so we could remove them as we go.  New patch
attached.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Checkpoint not retrying failed fsync?

From
Thomas Munro
Date:
On Sun, Apr 8, 2018 at 9:17 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> New patch attached.

For Linux users (read: almost all users) this patch on its own is a
bit like rearranging the deck chairs on the Titanic.  Because retrying
on failure is useless, among other problems, Craig and Andres are
working on converting IO errors into PANICs and overhauling the
request queue[1].

I was about to mark this patch "rejected" and forget about it, since
Craig's patch makes it redundant.  But then I noticed that Craig's
patch doesn't actually remove the retry behaviour completely: it
promotes only EIO and ENOSPC to PANIC.  If you somehow manage to get
any other errno from fsync(), the checkpoint will fail and you'll be
exposed to this bug: PostgreSQL forgets that the segment was dirty, so
the next checkpoint might fsync nothing at all and then bogusly report
success.  So, I think either Craig's patch should PANIC on *any* error
from fsync(), or we should commit this patch too so that we remember
which segments are dirty next time around.

[1]
https://www.postgresql.org/message-id/flat/CAMsr%2BYFPeKVaQ57PwHqmRNjPCPABsdbV%3DL85he2dVBcr6yS1mA%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Checkpoint not retrying failed fsync?

From
Thomas Munro
Date:
On Tue, Jun 12, 2018 at 3:31 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I was about to mark this patch "rejected" and forget about it, since
> Craig's patch makes it redundant.  But then I noticed that Craig's
> patch doesn't actually remove the retry behaviour completely: it
> promotes only EIO and ENOSPC to PANIC.

Rejected, since this will have to be dealt with one way or another in
that other thread.

-- 
Thomas Munro
http://www.enterprisedb.com