Thread: fsync error handling in pg_receivewal, pg_recvlogical

fsync error handling in pg_receivewal, pg_recvlogical

From
Peter Eisentraut
Date:
Do we need to review the fsync error handling in pg_receivewal and
pg_recvlogical, following the recent backend changes?  The current
default behavior is that these tools will log fsync errors and then
reconnect and proceed with the next data streaming in.  As a result, you
might then have some files in the accumulated WAL that have not been
fsynced.  Perhaps a hard exit would be more appropriate?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: fsync error handling in pg_receivewal, pg_recvlogical

From
Michael Paquier
Date:
On Fri, Mar 29, 2019 at 12:48:09PM +0100, Peter Eisentraut wrote:
> Do we need to review the fsync error handling in pg_receivewal and
> pg_recvlogical, following the recent backend changes?  The current
> default behavior is that these tools will log fsync errors and then
> reconnect and proceed with the next data streaming in.  As a result, you
> might then have some files in the accumulated WAL that have not been
> fsynced.  Perhaps a hard exit would be more appropriate?

Yes, I think that we are going to need an equivalent of that for all
frontend tools.  At various degrees, making sure that a fsync happens
is also important for pg_dump, pg_basebackup, pg_rewind and
pg_checksums so it is not only a problem of the two tools you mention.
It seems to me that the most correct way to have those failures would
be to use directly exit(EXIT_FAILURE) in file_utils.c where
appropriate.
--
Michael

Attachment

Re: fsync error handling in pg_receivewal, pg_recvlogical

From
Peter Eisentraut
Date:
On 2019-03-29 14:05, Michael Paquier wrote:
> Yes, I think that we are going to need an equivalent of that for all
> frontend tools.  At various degrees, making sure that a fsync happens
> is also important for pg_dump, pg_basebackup, pg_rewind and
> pg_checksums so it is not only a problem of the two tools you mention.
> It seems to me that the most correct way to have those failures would
> be to use directly exit(EXIT_FAILURE) in file_utils.c where
> appropriate.

Yeah, there is more to do.  The reason I'm focusing on these two right
now is that they would typically run as a background service, and a
clean exit is most important there.  In the other cases, the program
runs more often in the foreground and you can see error messages.  There
are also some cases where fsync() failures are intentionally ignored
((void) casts), so some of that would need to be investigated further.

Here is a patch to get started.  Note that these calls don't go through
file_utils.c, so it's a separate issue anyway.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: fsync error handling in pg_receivewal, pg_recvlogical

From
Michael Paquier
Date:
On Tue, Jun 25, 2019 at 02:23:05PM +0200, Peter Eisentraut wrote:
> Yeah, there is more to do.  The reason I'm focusing on these two right
> now is that they would typically run as a background service, and a
> clean exit is most important there.  In the other cases, the program
> runs more often in the foreground and you can see error messages.  There
> are also some cases where fsync() failures are intentionally ignored
> ((void) casts), so some of that would need to be investigated further.

The remaining three calls all go through file_utils.c.

> Here is a patch to get started.  Note that these calls don't go through
> file_utils.c, so it's a separate issue anyway.

Why using a different error code.  Using EXIT_FAILURE is a more common
practice in the in-core binaries.  The patch looks fine to me except
that, that's a good first cut.
--
Michael

Attachment

Re: fsync error handling in pg_receivewal, pg_recvlogical

From
Sehrope Sarkuni
Date:
Hi,

Tried out this patch and it applies, compiles, and passes check-world. Also flipped things around in pg_recvlogical.c to exit-on-success to ensure it's actually being called and that worked too. Outside of a more complicated harness that simulates fsync errors not sure how else to test this further.

I did some searching and found a FUSE based on that looks interesting: CharybdeFS[1]. Rather than being fixed at mount time, it has a client/server interface so you can change the handling of syscalls on the fly[2]. For example you can error out fsync calls halfway through a test rather than always or randomly. Haven't tried it out but leaving it here as it seems relevant.

[1]: https://github.com/scylladb/charybdefs
[2]: https://www.scylladb.com/2016/05/02/fault-injection-filesystem-cookbook/

On Wed, Jun 26, 2019 at 12:11 AM Michael Paquier <michael@paquier.xyz> wrote:
Why using a different error code.  Using EXIT_FAILURE is a more common
practice in the in-core binaries.  The patch looks fine to me except
that, that's a good first cut.

An error code specific to fsync issues could help with tests as the harness could check it to ensure things died for the right reasons. With a generic "messed up fsync" harness you might even be able to run some existing tests that would otherwise pass and check for the fsync-specific exit code.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/

 

Re: fsync error handling in pg_receivewal, pg_recvlogical

From
Sehrope Sarkuni
Date:
While reviewing this patch I read through some of the other fsync
callsites and noticed this typo (walkdir is in file_utils.c, not
initdb.c) too:

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 315c74c745..9b79df2d7f 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3208,7 +3208,7 @@ SyncDataDirectory(void)
  *
  * Errors are reported at level elevel, which might be ERROR or less.
  *
- * See also walkdir in initdb.c, which is a frontend version of this logic.
+ * See also walkdir in file_utils.c, which is a frontend version of this logic.
  */
 static void
 walkdir(const char *path,

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/



Re: fsync error handling in pg_receivewal, pg_recvlogical

From
Michael Paquier
Date:
On Sat, Jul 27, 2019 at 01:06:06PM -0400, Sehrope Sarkuni wrote:
> While reviewing this patch I read through some of the other fsync
> callsites and noticed this typo (walkdir is in file_utils.c, not
> initdb.c) too:

Thanks, Sehrope.  Applied.
--
Michael

Attachment

Re: fsync error handling in pg_receivewal, pg_recvlogical

From
Peter Eisentraut
Date:
On 2019-07-27 19:02, Sehrope Sarkuni wrote:
> Tried out this patch and it applies, compiles, and passes check-world.
> Also flipped things around in pg_recvlogical.c to exit-on-success to
> ensure it's actually being called and that worked too. Outside of a more
> complicated harness that simulates fsync errors not sure how else to
> test this further.

I have committed this, with the exit code changed back, as requested by
Michael.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services