Thread: Small race in pg_xlogdump --follow

Small race in pg_xlogdump --follow

From
Magnus Hagander
Date:
When using pg_xlogdump in --follow mode, there is what I believe is a small race condition when files are switched.

If pg_xlogdump detects then end of one file (by looking at the record) it will immediately try to open the following file. If that file has not yet been created, it will fail with an error message saying it cannot open the file. But there's a race condition where the server has not had time to put the file in place.

Attached patch puts a retry loop around opening the file that retries for 5 seconds (which is excessive, but should be safe) in case the file is missing (and still fails out for other error messages of course).

Comments?

--
Attachment

Re: Small race in pg_xlogdump --follow

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Attached patch puts a retry loop around opening the file that retries for 5
> seconds (which is excessive, but should be safe) in case the file is
> missing (and still fails out for other error messages of course).

> Comments?

The patch assumes that pg_usleep won't change errno, an assumption
I have little faith in.
        regards, tom lane



Re: Small race in pg_xlogdump --follow

From
Magnus Hagander
Date:
On Mon, Sep 26, 2016 at 3:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
> Attached patch puts a retry loop around opening the file that retries for 5
> seconds (which is excessive, but should be safe) in case the file is
> missing (and still fails out for other error messages of course).

> Comments?

The patch assumes that pg_usleep won't change errno, an assumption
I have little faith in.


Oh, right, at the very last loop. I've never seen it need more than 1 loop so I didn't manage to hit that codepath :) But yeah, saving errno and restoring it on the other side of pg_usleep is probably a good idea.

--
Attachment

Re: Small race in pg_xlogdump --follow

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Oh, right, at the very last loop. I've never seen it need more than 1 loop
> so I didn't manage to hit that codepath :) But yeah, saving errno and
> restoring it on the other side of pg_usleep is probably a good idea.

Right.  On a larger scale --- I'm too uncaffeinated to figure out whether
this is the code path taken for the initially user-supplied file name.
But it would be unfriendly if when you fat-finger the command line
arguments, it takes 5 seconds for pg_xlogdump to tell you so.  Maybe
the looping behavior should only happen on non-first file access.
        regards, tom lane



Re: Small race in pg_xlogdump --follow

From
Magnus Hagander
Date:
On Mon, Sep 26, 2016 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
> Oh, right, at the very last loop. I've never seen it need more than 1 loop
> so I didn't manage to hit that codepath :) But yeah, saving errno and
> restoring it on the other side of pg_usleep is probably a good idea.

Right.  On a larger scale --- I'm too uncaffeinated to figure out whether
this is the code path taken for the initially user-supplied file name.
But it would be unfriendly if when you fat-finger the command line
arguments, it takes 5 seconds for pg_xlogdump to tell you so.  Maybe
the looping behavior should only happen on non-first file access.

It's a different codepath, so the retry does not happen on that. Different error message, even. ("could not open file" vs "could not find file"). 

I'll go commit this including a source code comment about why the loop is there that was missing.

--