Thread: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery
"could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery
From
Thomas Crayford
Date: Mon, Sep 24, 2018 12:58:59PM +0100
Hi there,
We've found what seems to be a race condition on a replica server that is receiving WAL from the archive. We are not 100% sure whether it happens on streaming replicas as well.
Specifically, we occasionally see an error like this:
May 20 09:56:14 redacted[141]: [191806-1] sql_error_code = 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such file or directory
Upon looking at logs around this, we can see that the recovery process had just restored that file:
May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG: restored log file "00000002000072B50000003A" from archive
May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code = 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such file or directory
(on this particular server, pid 9 is the recovery process, and pid 141 is the checkpointer)
We've seen this across a range of server versions, including (in the last 30 days):
9.5.12
9.5.14
9.6.6
9.6.7
9.6.8
9.6.9
9.6.10
10.2
10.3
10.4
10.5
Upon digging into the code, we think we found the reason behind this race: the recovery process can call unlink() on the file whilst the restartpoint attempts to recycle the same segment. To quote from the docs:
In archive recovery or standby mode, the server periodically performs restartpoints, which are similar to checkpoints in normal operation: the server forces all its state to disk, updates the pg_control file to indicate that the already-processed WAL data need not be scanned again, and then recycles any old log segment files in the pg_wal directory.
Looking at the code, I think that the two racing functions are RestoreArchivedFile and CreateRestartPoint.
The former calls unlink() on the WAL segment, while CreateRestartPoint attempts to recycle segments.
From our perspective this seems like a race condition - the recovery process unlinks the file just as the checkpointer tries to recycle the WAL segment. At that point the checkpointer panics and shuts down, causing a hit to availability.
As far as we have seen, restarting the server after the crash allows it to continue making progress.
This seems like an extremely rare race - our data show it occurs roughly once every few hundred thousand WAL restores from archive across Heroku Postgres.
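To illustrate the interleaving we suspect, here is a minimal toy sketch of the two processes involved (our own illustration, not PostgreSQL source; the temporary file name is made up, and the segment name is taken from our logs):

    /*
     * Toy sketch of the suspected interleaving, not PostgreSQL source.
     * The child stands in for the recovery process swapping in a freshly
     * restored segment; the parent stands in for the checkpointer trying
     * to recycle a segment with the same name.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *seg = "pg_wal/00000002000072B50000003A";
        int         fd;

        if (fork() == 0)
        {
            /* "recovery": installing the restored copy first unlinks any
             * existing file with the segment's name ... */
            unlink(seg);
            /* ... leaving a window in which no file with that name exists */
            rename("pg_wal/restored.tmp", seg);   /* made-up temp name */
            _exit(0);
        }

        /* "checkpointer": opens the same segment to recycle it, and can
         * land inside the window above */
        fd = open(seg, O_RDWR);
        if (fd < 0)
            fprintf(stderr, "could not open file \"%s\": %s\n",
                    seg, strerror(errno));
        else
            close(fd);
        return 0;
    }

Run in a directory seeded with that segment file and pg_wal/restored.tmp, the parent's open() can intermittently fail with ENOENT and print exactly the error we see.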
Thanks
Tom Crayford
Heroku Postgres
Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition betweenrestartpoint and recovery
From
Michael Paquier
Date: Mon, Sep 24, 2018 4:37 PM
On Mon, Sep 24, 2018 at 12:58:59PM +0100, Thomas Crayford wrote:
> May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG:
> restored log file "00000002000072B50000003A" from archive
> May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code =
> 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such
> file or directory

What kind of restore_command is used here?

> Looking at the code, I think that the two racing functions are
> RestoreArchivedFile and CreateRestartPoint.
>
> The former calls unlink() on the WAL segment, while CreateRestartPoint
> attempts to recycle segments.

Don't you mean KeepFileRestoredFromArchive()? RestoreArchivedFile would
call unlink() on pg_wal/RECOVERYXLOG, so that does not match.
--
Michael
Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery
From
Thomas Crayford
Date: Fri, Sep 28, 2018 01:02:42PM +0100
Hi there,
Ok, thanks for the pointer. It seems like the race condition I talked about is still accurate - does that seem right?
Thanks
Tom
On Mon, Sep 24, 2018 at 4:37 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Sep 24, 2018 at 12:58:59PM +0100, Thomas Crayford wrote:
> May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG:
> restored log file "00000002000072B50000003A" from archive
> May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code =
> 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such
> file or directory
What kind of restore_command is used here?
> Looking at the code, I think that the two racing functions are
> RestoreArchivedFile and CreateRestartPoint.
>
> The former calls unlink() on the WAL segment, while CreateRestartPoint
> attempts to recycle segments.
Don't you mean KeepFileRestoredFromArchive()? RestoreArchivedFile would
call unlink() on pg_wal/RECOVERYXLOG so that does not match.
--
Michael
Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition betweenrestartpoint and recovery
From
Michael Paquier
Date: Fri, Sep 28, 2018 11:59 PM
On Fri, Sep 28, 2018 at 01:02:42PM +0100, Thomas Crayford wrote:
> Ok, thanks for the pointer. It seems like the race condition I talked about
> is still accurate - does that seem right?

KeepFileRestoredFromArchive() looks like a good candidate on the matter,
as it removes a WAL segment before replacing it by another with the same
name. I have a hard time understanding why the checkpointer would try to
recycle a segment just recovered, though, as the startup process would
immediately try to use it. I have not spent more than one hour looking at
potential spots, though, which is not much for this kind of race
condition.

It is also why I am curious about what kind of restore_command you are
using.
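Roughly, the pattern at play would be the following (a heavily simplified sketch, not the actual code; "path" is the freshly restored file and "xlogfpath" the segment's final name):

    /*
     * Heavily simplified sketch of the remove-then-replace pattern in
     * KeepFileRestoredFromArchive(), not the actual code.
     */
    #include <stdio.h>
    #include <unistd.h>

    static void
    keep_file_restored_sketch(const char *path, const char *xlogfpath)
    {
        /* remove any existing segment with the same name ... */
        unlink(xlogfpath);
        /* ... window: anything opening xlogfpath here gets ENOENT ... */
        rename(path, xlogfpath);    /* install the freshly restored copy */
    }

--
Michael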
Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery
From
Thomas Crayford
Date: Mon, Oct 01, 2018 12:43:02PM +0100
Hi there Michael,
Sorry for the slow response on this - I was on call last week, which was quite busy and distracting.
With respect to the restore_command, we use wal-e: https://github.com/wal-e/wal-e, specifically:
envdir DIRECTORY wal-e wal-fetch "%f" "%p"
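For reference, that is wired into recovery.conf as follows (with DIRECTORY standing in for our envdir configuration directory):

    restore_command = 'envdir DIRECTORY wal-e wal-fetch "%f" "%p"'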
Thanks
Tom
On Fri, Sep 28, 2018 at 11:59 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Sep 28, 2018 at 01:02:42PM +0100, Thomas Crayford wrote:
> Ok, thanks for the pointer. It seems like the race condition I talked about
> is still accurate - does that seem right?
KeepFileRestoredFromArchive() looks like a good candidate on the matter,
as it removes a WAL segment before replacing it by another with the same
name. I have a hard time understanding why the checkpointer would try to
recycle a segment just recovered, though, as the startup process would
immediately try to use it. I have not spent more than one hour looking at
potential spots, though, which is not much for this kind of race
condition.
It is also why I am curious about what kind of restore_command you are
using.
--
Michael
Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition betweenrestartpoint and recovery
From
Michael Paquier
Date:
On Mon, Oct 01, 2018 at 12:43:02PM +0100, Thomas Crayford wrote:
> Sorry for the slow response on this - I was on call last week, which
> was quite busy and distracting.
>
> With respect to the restore_command, we use wal-e:
> https://github.com/wal-e/wal-e, specifically:
>
> envdir DIRECTORY wal-e wal-fetch "%f" "%p"

Okay, so that should be fine. When it comes to such race conditions,
putting calls to pg_usleep() directly into the suspicious code paths
would most likely help in making the bug show up.
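For example, something like the following in the window suspected above (a throwaway debugging hack with approximate names, certainly not something to commit):

    /*
     * Throwaway hack to widen the suspected window: sleep between
     * removing the old segment and installing the restored copy, so a
     * concurrent restartpoint becomes much more likely to trip over the
     * missing file.
     */
    unlink(xlogfpath);                 /* old segment with the same name */
    pg_usleep(5 * 1000000L);           /* 5 seconds for the checkpointer to race */
    durable_rename(path, xlogfpath, ERROR);  /* install the restored copy */

--
Michael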