Thread: Checkpoint not retrying failed fsync?
This is only a preliminary report, I'm still trying to analyze what's
going on, but: in doing testing on FreeBSD with a filesystem set up to
induce errors controllably (using gconcat+gnop), I can get this to
happen (on 11devel):

(note that "mytable" is on a tablespace on the erroring filesystem,
while "x" is on a clean filesystem)

postgres=# insert into mytable values (-1);
INSERT 0 1
postgres=# checkpoint;
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
postgres=# insert into x values (3);
INSERT 0 1
postgres=# checkpoint;
CHECKPOINT

(the message in the server log is the expected one about fsync failing)

Checking the WAL shows that there is indeed a checkpoint record for the
second checkpoint, and pg_control points to it, so a crash restart at
this point would not try to replay the "mytable" write.

Furthermore, checking the trace output from the checkpointer process, it
is not even attempting an fsync of the failing file; this isn't like the
Linux fsync issue, I've confirmed that fsync will repeatedly fail on the
file until the underlying errors stop.

As far as I can tell from reading the code, if a checkpoint fails, the
checkpointer is supposed to keep all the outstanding fsync requests for
next time. Am I wrong, or is there some failure in the logic to do this?

-- 
Andrew (irc:RhodiumToad)
On Fri, Apr 6, 2018 at 10:16 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Furthermore, checking the trace output from the checkpointer process, it
> is not even attempting an fsync of the failing file; this isn't like the
> Linux fsync issue, I've confirmed that fsync will repeatedly fail on the
> file until the underlying errors stop.

Thank you for confirming that!  Now, how does one go about buying
shares in FreeBSD?

> As far as I can tell from reading the code, if a checkpoint fails the
> checkpointer is supposed to keep all the outstanding fsync requests for
> next time. Am I wrong, or is there some failure in the logic to do this?

Yikes.  I think this is suspicious:

 * The bitmap manipulations are slightly tricky, because we can call
 * AbsorbFsyncRequests() inside the loop and that could result in
 * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
 * This is okay because we unlink each bitmapset from the hashtable
 * entry before scanning it.  That means that any incoming fsync
 * requests will be processed now if they reach the table before we
 * begin to scan their fork.

Why is it OK to unlink the bitmapset?  We still need its contents, in
the case that the fsync fails!

-- 
Thomas Munro
http://www.enterprisedb.com
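The hazard Thomas is pointing at can be sketched in miniature. This is not the real mdsync() code: a single uint64 stands in for the per-fork Bitmapset, the struct and function names are made up, and the error-injecting fsync callback plays the role of Andrew's gnop setup. The point is only to show why unlinking the request set before scanning loses requests when an fsync fails:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one pendingOpsTable entry: a uint64 plays
 * the role of the per-fork Bitmapset of pending segment numbers. */
typedef struct PendingEntry
{
    uint64_t requests;
} PendingEntry;

/* Model of the suspect flow: the request set is unlinked from the entry
 * before scanning, so a failed fsync forgets every unprocessed request. */
static bool
process_requests_buggy(PendingEntry *entry, bool (*do_fsync)(int seg))
{
    uint64_t scan = entry->requests;

    entry->requests = 0;        /* "unlink the bitmapset" */
    for (int seg = 0; seg < 64; seg++)
    {
        if (scan & (UINT64_C(1) << seg))
        {
            if (!do_fsync(seg))
                return false;   /* requests for this segment and all
                                 * later ones are already gone */
        }
    }
    return true;
}

/* Error-injecting fsync: segment 3 always fails. */
static bool
fsync_fails_on_seg3(int seg)
{
    return seg != 3;
}
```

After a failed run, the entry is empty even though segments 3 and 5 were never durably synced, which matches the observed behaviour: the next checkpoint has nothing to fsync and reports success.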
>>>>> "Thomas" == Thomas Munro <thomas.munro@enterprisedb.com> writes:

 >> As far as I can tell from reading the code, if a checkpoint fails the
 >> checkpointer is supposed to keep all the outstanding fsync requests for
 >> next time. Am I wrong, or is there some failure in the logic to do this?

 Thomas> Yikes.  I think this is suspicious:

Yes, tracing through a checkpoint shows that this is clearly wrong.

 Thomas> Why is it OK to unlink the bitmapset?  We still need its
 Thomas> contents, in the case that the fsync fails!

Right.

But I don't think just copying the value is sufficient; if a new bit was
set while we were processing the old ones, how would we know which to
clear?  We couldn't just clear all the bits afterwards, because then we
might lose a request.

-- 
Andrew (irc:RhodiumToad)
On Fri, Apr 6, 2018 at 11:34 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> But I don't think just copying the value is sufficient; if a new bit was
> set while we were processing the old ones, how would we know which to
> clear? We couldn't just clear all the bits afterwards because then we
> might lose a request.

Agreed.  The attached draft patch handles that correctly, I think.

-- 
Thomas Munro
http://www.enterprisedb.com
On Fri, Apr 6, 2018 at 11:36 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Agreed.  The attached draft patch handles that correctly, I think.

After some testing, here is a better one for review.  Changes:

1. The copy wasn't really needed.
2. errno needed to be restored (in case bms_union stomped on it),
   thanks to Andrew for an off-list complaint about that.
3. Memory context was wrong.
4. bms_first_member() eats its input.  Need to use bms_next_member()
   instead.
5. Mustn't leak in the error path (that was a pre-existing bug).

Also here's a patch to test it.  If the file /tmp/broken_fsync exists,
you'll see a fake EIO when you CHECKPOINT.

-- 
Thomas Munro
http://www.enterprisedb.com
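Point 4 turns on the difference between the two bitmapset iteration styles: bms_first_member() pops the smallest member, destroying its input as it goes, while bms_next_member() scans without modifying the set. A toy uint64 model of the two behaviours (the names here are made up; the real functions live in src/backend/nodes/bitmapset.c):

```c
#include <assert.h>
#include <stdint.h>

/* Like bms_first_member(): returns the smallest member and DELETES it
 * from the set, or -1 if the set is empty. */
static int
toy_first_member(uint64_t *set)
{
    for (int i = 0; i < 64; i++)
    {
        uint64_t bit = UINT64_C(1) << i;

        if (*set & bit)
        {
            *set &= ~bit;       /* destructive: the input is consumed */
            return i;
        }
    }
    return -1;
}

/* Like bms_next_member(): returns the smallest member greater than
 * prevbit without modifying the set, or -1 if there is none.  Start a
 * scan by passing prevbit = -1. */
static int
toy_next_member(uint64_t set, int prevbit)
{
    for (int i = prevbit + 1; i < 64; i++)
        if (set & (UINT64_C(1) << i))
            return i;
    return -1;
}
```

If the scan uses the destructive style, the set is already empty by the time an fsync fails, so there is nothing left to carry over to the next checkpoint; the non-destructive style keeps the full set available for the error path.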
On Fri, Apr 6, 2018 at 12:56 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> After some testing, here is a better one for review.

One problem I thought of about 8 milliseconds after clicking send is
that bms_union() may fail to allocate memory, and then you're hosed.
Here is a new version that uses bms_join() instead, because that can't
fail.

-- 
Thomas Munro
http://www.enterprisedb.com
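The reason bms_join() cannot fail, as described here, is that it allocates nothing: it merges the shorter input into the longer one in place and frees the other, while bms_union() builds a brand-new set and can therefore hit out-of-memory at exactly the wrong moment, inside the error path. A rough sketch of the two shapes with a toy word-array set (toy types and names, not the real Bitmapset API):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy word-array set, loosely shaped like a Bitmapset. */
typedef struct ToySet
{
    int      nwords;
    uint64_t words[];           /* flexible array member */
} ToySet;

static ToySet *
toy_make(int nwords)
{
    ToySet *s = calloc(1, sizeof(ToySet) + nwords * sizeof(uint64_t));

    if (s)
        s->nwords = nwords;
    return s;
}

/* union-style merge: allocates a new set, so it can fail (NULL on OOM). */
static ToySet *
toy_union(const ToySet *a, const ToySet *b)
{
    int     n = (a->nwords > b->nwords) ? a->nwords : b->nwords;
    ToySet *r = toy_make(n);

    if (!r)
        return NULL;            /* OOM here would lose the requests */
    for (int i = 0; i < a->nwords; i++)
        r->words[i] |= a->words[i];
    for (int i = 0; i < b->nwords; i++)
        r->words[i] |= b->words[i];
    return r;
}

/* join-style merge: recycles its inputs, merging into the longer one
 * and freeing the other.  No allocation, hence no failure path. */
static ToySet *
toy_join(ToySet *a, ToySet *b)
{
    ToySet *longer, *shorter;

    if (a == NULL)
        return b;
    if (b == NULL)
        return a;
    longer = (a->nwords >= b->nwords) ? a : b;
    shorter = (longer == a) ? b : a;
    for (int i = 0; i < shorter->nwords; i++)
        longer->words[i] |= shorter->words[i];
    free(shorter);
    return longer;
}
```

The design point: code that runs while recovering from an fsync error must not itself have a failure mode, or the recovery can silently drop the very state it is trying to preserve.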
On Fri, Apr 6, 2018 at 6:26 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> After some testing, here is a better one for review.  Changes:
>
> 1. The copy wasn't really needed.
> 2. errno needed to be restored (in case bms_union stomped on it),
>    thanks to Andrew for an off-list complaint about that.
> 3. Memory context was wrong.
> 4. bms_first_member() eats its input.  Need to use bms_next_member()
>    instead.

In the success case, won't you need to delete each member (with
something like bms_del_member) rather than just using bms_free?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Apr 8, 2018 at 5:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> In the success case, won't you need to delete each member (with
> something like bms_del_member) rather than just using bms_free?

Thanks for looking at this.  Yeah, if requests for segment numbers 0
and 1 were in "requests", and 0 succeeded but then 1 failed, my
previous patch would leave both in there to be retried next time
around.  I thought that was pretty harmless so I didn't worry about it
before, but of course you're right that it's not necessary to retry
the ones that succeeded, so we could remove them as we go.  New patch
attached.

-- 
Thomas Munro
http://www.enterprisedb.com
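Putting the thread's pieces together, the repaired flow, as I read it (again a simplified sketch with a uint64 in place of the Bitmapset, hypothetical names), unlinks the set so that new requests can accumulate in the entry during the scan, iterates non-destructively, drops each segment from the local set once its fsync succeeds, and on failure merges whatever remains back into the entry:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one pendingOpsTable entry. */
typedef struct PendingEntry
{
    uint64_t requests;          /* pending segment numbers, one bit each */
} PendingEntry;

static bool
process_requests_fixed(PendingEntry *entry, bool (*do_fsync)(int seg))
{
    uint64_t requests = entry->requests;

    /* Unlink, so requests arriving during the scan land in the entry. */
    entry->requests = 0;

    for (int seg = 0; seg < 64; seg++)
    {
        uint64_t bit = UINT64_C(1) << seg;

        if (requests & bit)
        {
            if (!do_fsync(seg))
            {
                /* Merge the not-yet-done requests back into the entry.
                 * The real patch uses bms_join() here so this step
                 * cannot fail, and saves/restores errno around it. */
                entry->requests |= requests;
                return false;
            }
            requests &= ~bit;   /* this segment is done; don't retry it */
        }
    }
    return true;
}

/* Error-injecting fsync: segment 3 always fails. */
static bool
fsync_fails_on_seg3(int seg)
{
    return seg != 3;
}
```

With segments 1, 3 and 5 pending and segment 3 failing, the entry ends up remembering exactly 3 and 5: the successful segment is not retried, and the failed and unreached ones survive for the next checkpoint.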
On Sun, Apr 8, 2018 at 9:17 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> New patch attached.

For Linux users (read: almost all users) this patch on its own is a
bit like rearranging the deck chairs on the Titanic.  Because retrying
on failure is useless there, among other problems, Craig and Andres are
working on converting IO errors into PANICs and overhauling the
request queue[1].

I was about to mark this patch "rejected" and forget about it, since
Craig's patch makes it redundant.  But then I noticed that Craig's
patch doesn't actually remove the retry behaviour completely: it
promotes only EIO and ENOSPC to PANIC.  If you somehow manage to get
any other errno from fsync(), the checkpoint will fail and you'll be
exposed to this bug: PostgreSQL forgets that the segment was dirty, so
the next checkpoint might fsync nothing at all and then bogusly report
success.

So, I think either Craig's patch should PANIC on *any* error from
fsync(), or we should commit this patch too so that we remember which
segments are dirty next time around.

[1] https://www.postgresql.org/message-id/flat/CAMsr%2BYFPeKVaQ57PwHqmRNjPCPABsdbV%3DL85he2dVBcr6yS1mA%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com
On Tue, Jun 12, 2018 at 3:31 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I was about to mark this patch "rejected" and forget about it, since
> Craig's patch makes it redundant.  But then I noticed that Craig's
> patch doesn't actually remove the retry behaviour completely: it
> promotes only EIO and ENOSPC to PANIC.

Rejected, since this will have to be dealt with one way or another in
that other thread.

-- 
Thomas Munro
http://www.enterprisedb.com