Thread: Concurrency issue in pg_rewind

Concurrency issue in pg_rewind

From
Alexander Kukushkin
Date:
Hello,

Today I bumped into an issue with pg_rewind which is not 100% clear
where should be better fixed.

The first call of pg_rewind failed with the following message:
servers diverged at WAL location A76/39E55338 on timeline 132
could not open file
"/home/postgres/pgdata/pgroot/data/pg_wal/0000008400000A760000001E":
No such file or directory

could not find previous WAL record at A76/1EFFE620
Failure, exiting

In order to avoid rebuilding the replica from scratch, we restored the
missing file by calling restore_command (wal-g) and repeated the call
of pg_rewind.

The second time pg_rewind also failed, but the error looked differently:
servers diverged at WAL location A76/39E55338 on timeline 132
rewinding from last common checkpoint at A76/1EF254B8 on timeline 132

could not remove file
"/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024":
No such file or directory
Failure, exiting

The second call left PGDATA in an inconsistent state (empty pg_control).

A few words about where the pg_wal/.wal-g/prefetch/running/ is coming
from: wal-g by default when fetching the WAL file is also trying to do
a prefetch of a few next WAL files. For that it forks and the child
process doing prefetch while the parent process exits. In order to
avoid multiple parallel prefetches of the same file, wal-g keeps its
state in the pg_wal/.wal-g directory. It also keeps prefetched files
there.

What in fact happened: pg_rewind is building a list of files in the
target directory which don't match with the source directory and
therefore must be changed (copied/removed/truncated/etc). When the
list was built, the wal-g prefetch was still running, but when
pg_rewind tried to remove files that should not be there because they
don't exist in the source directory it failed with the fatal error.

The issue occurred on 10.14, but I believe very similar might happen
with postgres 13 when pg_rewind is called with --restore-target-wal
option.

One might argue that the issue should be fixed in wal-g (it should not
mess up with pg_wal), and I personally 99% agree with that, but so far
this behavior was safe, because postgres itself never looks inside
unexpected directories in pg_wal. Also from the usability perspective
it is very convenient to keep prefetched files in the pg_wal/.wal-g,
because it guarantees 100% that they will be located on the same
filesystem as pg_wal and therefore the next time when the
restore_command is called it is enough just to rename the file.

That made me think about how it could be improved in the pg_rewind.
The thing is, that we want to have a specific file to be removed, and
it is already not there. Should it be a fatal error?
traverse_datadir()/recurse_dir() already ignoring all failed lstat()
calls with errno == ENOENT.

Basically I have to options:
1. Remove modify remove_target_file(const char *path, bool missing_ok)
function and remove the missing_ok option, that would be consistent
with recurse_dir()
2. Change the logic of remove_target_file(), so it doesn't exit with
the fatal error if the file is missing, but shows only a warning.

In addition to the aforementioned options the remove_target_dir() also
should be improved, i.e. it should check errno and behave similarly to
the remove_target_file() if the errno == ENOENT

What do you think?

Regards,
--
Alexander Kukushkin



Re: Concurrency issue in pg_rewind

From
Oleksandr Shulgin
Date:
On Wed, Sep 16, 2020 at 2:55 PM Alexander Kukushkin <cyberdemn@gmail.com> wrote:

The second time pg_rewind also failed, but the error looked differently:
servers diverged at WAL location A76/39E55338 on timeline 132
rewinding from last common checkpoint at A76/1EF254B8 on timeline 132

could not remove file
"/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024":

Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal that it doesn't know about.
What if the administrator made a backup of some WAL segments there?

--
Alex

Re: Concurrency issue in pg_rewind

From
Alexey Kondratov
Date:
On 2020-09-16 15:55, Alexander Kukushkin wrote:
> Hello,
> 
> Today I bumped into an issue with pg_rewind which is not 100% clear
> where should be better fixed.
> 
> The first call of pg_rewind failed with the following message:
> servers diverged at WAL location A76/39E55338 on timeline 132
> could not open file
> "/home/postgres/pgdata/pgroot/data/pg_wal/0000008400000A760000001E":
> No such file or directory
> 
> could not find previous WAL record at A76/1EFFE620
> Failure, exiting
> 
> In order to avoid rebuilding the replica from scratch, we restored the
> missing file by calling restore_command (wal-g) and repeated the call
> of pg_rewind.
> 
> The second time pg_rewind also failed, but the error looked 
> differently:
> servers diverged at WAL location A76/39E55338 on timeline 132
> rewinding from last common checkpoint at A76/1EF254B8 on timeline 132
> 
> could not remove file
> "/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024":
> No such file or directory
> Failure, exiting
> 
> The second call left PGDATA in an inconsistent state (empty 
> pg_control).
> 
> A few words about where the pg_wal/.wal-g/prefetch/running/ is coming
> from: wal-g by default when fetching the WAL file is also trying to do
> a prefetch of a few next WAL files. For that it forks and the child
> process doing prefetch while the parent process exits. In order to
> avoid multiple parallel prefetches of the same file, wal-g keeps its
> state in the pg_wal/.wal-g directory. It also keeps prefetched files
> there.
> 

Hm, I cannot understand why wal-g (or any other tool) is trying to run 
pg_rewind, while WAL copying (and prefetch) is still in progress? Why do 
not just wait until it is finished?

It is also not clear for me why it does not put prefetched WAL files 
directly into the pg_wal?

> 
> The issue occurred on 10.14, but I believe very similar might happen
> with postgres 13 when pg_rewind is called with --restore-target-wal
> option.
> 

With --restore-target-wal pg_rewind is trying to call restore_command on 
its own and it can happen at two stages:

1) When pg_rewind is trying to find the last checkpoint preceding a 
divergence point. In that case file map is not even yet initialized. 
Thus, all fetched WAL segments at this stage will be present in the file 
map created later.

2) When it creates a data pages map. It should traverse WAL from the 
last common checkpoint till the final shutdown point in order to find 
all modified pages on the target. At this stage pg_rewind only updates 
info about data segments in the file map. That way, I see a minor 
problem that WAL segments fetched at this stage would not be deleted, 
since they are absent in the file map.

Anyway, pg_rewind does not delete neither WAL segments, not any other 
files in the middle of the file map creation, so I cannot imagine, how 
it can get into the same trouble on its own.

> 
> That made me think about how it could be improved in the pg_rewind.
> The thing is, that we want to have a specific file to be removed, and
> it is already not there. Should it be a fatal error?
> traverse_datadir()/recurse_dir() already ignoring all failed lstat()
> calls with errno == ENOENT.
> 

recurse_dir() is also called on the source, which should be running when 
--source-server is used, so it can remove files. There is actually a 
TODO there:

/*
  * File doesn't exist anymore. This is ok, if the new master
  * is running and the file was just removed. If it was a data
  * file, there should be a WAL record of the removal. If it
  * was something else, it couldn't have been anyway.
  *
  * TODO: But complain if we're processing the target dir!
  */

> 
> Basically I have to options:
> 1. Remove modify remove_target_file(const char *path, bool missing_ok)
> function and remove the missing_ok option, that would be consistent
> with recurse_dir()
> 2. Change the logic of remove_target_file(), so it doesn't exit with
> the fatal error if the file is missing, but shows only a warning.
> 
> In addition to the aforementioned options the remove_target_dir() also
> should be improved, i.e. it should check errno and behave similarly to
> the remove_target_file() if the errno == ENOENT
> 
> What do you think?
> 

Although keeping arbitrary files inside PGDATA does not look like a good 
idea for me, I do not see anything criminal in skipping non-existing 
file, when executing a file map by pg_rewind.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: Concurrency issue in pg_rewind

From
Alexander Kukushkin
Date:
On Thu, 17 Sep 2020 at 14:04, Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:
>
> Hm, I cannot understand why wal-g (or any other tool) is trying to run
> pg_rewind, while WAL copying (and prefetch) is still in progress? Why do
> not just wait until it is finished?

wal-g doesn't try to call pg_rewind.
First, we called wal-g, it fetched the file we requested and exited.
But, before exiting, wal-g forks, and the child process does prefetch
of a few next WALs.
We don't really know when the child process exits and can't wait for it.

>
> It is also not clear for me why it does not put prefetched WAL files
> directly into the pg_wal?

Because this is how postgres works. It doesn't matter whether the
specific WAL segment is there, postgres will call the restore_command
anyway.
The restore command also doesn't know if the file in pg_wal is OK,
therefore keeping the prefetched file in some other place and moving
it seems to be a good approach.

> With --restore-target-wal pg_rewind is trying to call restore_command on
> its own and it can happen at two stages:
>
> 1) When pg_rewind is trying to find the last checkpoint preceding a
> divergence point. In that case file map is not even yet initialized.
> Thus, all fetched WAL segments at this stage will be present in the file
> map created later.

Nope, it will fetch files you requested, and in addition to that it
will leave a child process running in the background which is doing
the prefetch (manipulating with pg_wal/.wal-g/...)

>
> 2) When it creates a data pages map. It should traverse WAL from the
> last common checkpoint till the final shutdown point in order to find
> all modified pages on the target. At this stage pg_rewind only updates
> info about data segments in the file map. That way, I see a minor
> problem that WAL segments fetched at this stage would not be deleted,
> since they are absent in the file map.
>
> Anyway, pg_rewind does not delete neither WAL segments, not any other
> files in the middle of the file map creation, so I cannot imagine, how
> it can get into the same trouble on its own.

When pg_rewind was creating the map, some temporary files where there,
because the forked child process of wal-g was still running.
When the wal-g child process exits, it removes some of these files.
Specifically, it was trying to prefetch 0000008400000A7600000024 into
the pg_wal/.wal-g/prefetch/running/0000008400000A7600000024, but
apparently the file wasn't available on S3 and prefetch failed,
therefore the empty file was removed.


> Although keeping arbitrary files inside PGDATA does not look like a good
> idea for me, I do not see anything criminal in skipping non-existing
> file, when executing a file map by pg_rewind.

Good, I will prepare a patch then.

Regards,
--
Alexander Kukushkin



Re: Concurrency issue in pg_rewind

From
Alexey Kondratov
Date:
On 2020-09-17 15:27, Alexander Kukushkin wrote:
> On Thu, 17 Sep 2020 at 14:04, Alexey Kondratov
> <a.kondratov@postgrespro.ru> wrote:
> 
>> With --restore-target-wal pg_rewind is trying to call restore_command 
>> on
>> its own and it can happen at two stages:
>> 
>> 1) When pg_rewind is trying to find the last checkpoint preceding a
>> divergence point. In that case file map is not even yet initialized.
>> Thus, all fetched WAL segments at this stage will be present in the 
>> file
>> map created later.
> 
> Nope, it will fetch files you requested, and in addition to that it
> will leave a child process running in the background which is doing
> the prefetch (manipulating with pg_wal/.wal-g/...)
> 
>> 
>> 2) When it creates a data pages map. It should traverse WAL from the
>> last common checkpoint till the final shutdown point in order to find
>> all modified pages on the target. At this stage pg_rewind only updates
>> info about data segments in the file map. That way, I see a minor
>> problem that WAL segments fetched at this stage would not be deleted,
>> since they are absent in the file map.
>> 
>> Anyway, pg_rewind does not delete neither WAL segments, not any other
>> files in the middle of the file map creation, so I cannot imagine, how
>> it can get into the same trouble on its own.
> 
> When pg_rewind was creating the map, some temporary files where there,
> because the forked child process of wal-g was still running.
> When the wal-g child process exits, it removes some of these files.
> Specifically, it was trying to prefetch 0000008400000A7600000024 into
> the pg_wal/.wal-g/prefetch/running/0000008400000A7600000024, but
> apparently the file wasn't available on S3 and prefetch failed,
> therefore the empty file was removed.
> 

I do understand how you got into this problem with wal-g. This part of 
my answer was about bare postgres and pg_rewind. And my point was that 
from my perspective pg_rewind with --restore-target-wal cannot get into 
the same trouble on its own, without 'help' of some side tools like 
wal-g.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: Concurrency issue in pg_rewind

From
Michael Paquier
Date:
On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:
> Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal
> that it doesn't know about.
> What if the administrator made a backup of some WAL segments there?

IMO, this would be a rather bad strategy anyway, so just don't do
that, because that could also mean that this is on the same partition
as pg_wal/ which would crash the server if the partition has the idea
to get full even if max_wal_size is set correctly.  If you think about
that, this is rather similar to putting tablespaces in your root data
folder: it may look fancy, but you just make the task of the backend
more complicated.  Note that this practice is not wise when Postgres
decides to apply actions that loop across the entire tree, just to
name two of them: the full fsync of PGDATA at the beginning of crash
recovery or just a base backup.
--
Michael

Attachment

Re: Concurrency issue in pg_rewind

From
"Andrey M. Borodin"
Date:

> 18 сент. 2020 г., в 11:10, Michael Paquier <michael@paquier.xyz> написал(а):
>
> On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:
>> Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal
>> that it doesn't know about.
>> What if the administrator made a backup of some WAL segments there?
>
> IMO, this would be a rather bad strategy anyway, so just don't do
> that, because that could also mean that this is on the same partition
> as pg_wal/

This is whole point of having prefetch. restore_command just links file from the same partition.
In WAL-G you strictly control number of cached WALs, so if you configured max_wal_size - you can configure
WALG_DOWNLOAD_CONCURRENCYtoo. 

Best regards, Andrey Borodin.


Re: Concurrency issue in pg_rewind

From
Michael Paquier
Date:
On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:
> This is whole point of having prefetch. restore_command just links
> file from the same partition.

If this stuff is willing to do so, you may have your reasons, but even
if you wish to locate both pg_wal/ and the prefetch path in the same
partition, I don't get why it is necessary to have the prefetch path
included directly in pg_wal?  You could just use different paths for
both.  Say, with a base partition at /my/path/, you can just have
/my/path/pg_wal/ that the Postgres backend links to, and
/my/path/wal-g/prefetch/ for the secondary path.
--
Michael

Attachment

Re: Concurrency issue in pg_rewind

From
Alexander Kukushkin
Date:
Hi,

On Fri, 18 Sep 2020 at 08:59, Michael Paquier <michael@paquier.xyz> wrote:

> If this stuff is willing to do so, you may have your reasons, but even
> if you wish to locate both pg_wal/ and the prefetch path in the same
> partition, I don't get why it is necessary to have the prefetch path
> included directly in pg_wal?  You could just use different paths for
> both.  Say, with a base partition at /my/path/, you can just have
> /my/path/pg_wal/ that the Postgres backend links to, and
> /my/path/wal-g/prefetch/ for the secondary path.

Well, I agree and disagree at the same time.
Yes, it would be nice not to write anything unexpected to PGDATA and
pg_wal, but this is also a usability issue.
Creating a separate directory and configuring wal-e/wal-g to use it
(now it is not even possible to configure it), requires some effort
from the administrator, while using something inside pg_wal just
works.

At the same time, pg_rewind due to such "fatal" error leaves PGDATA in
an inconsistent state with empty pg_control file, this is totally bad
and easily fixable. We want the specific file to be absent and it is
already absent, why should it be a fatal error and not warning?

Regards,
--
Alexander Kukushkin



Re: Concurrency issue in pg_rewind

From
Oleksandr Shulgin
Date:
On Fri, Sep 18, 2020 at 8:10 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:
> Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal
> that it doesn't know about.
> What if the administrator made a backup of some WAL segments there?

IMO, this would be a rather bad strategy anyway, so just don't do
that, because that could also mean that this is on the same partition
as pg_wal/ which would crash the server if the partition has the idea
to get full even if max_wal_size is set correctly.

To clarify my point, I don't mean to backup WAL segments in the background when the server is running, but precisely when the server is down and you need to intervene, such as running pg_rewind.  You might want to "stash" some of the latest segments in case you need to start over (name it pg_wal/0000008400000A760000001E.backup, or pg_wal/backup/0000008400000A760000001E).  It is surprising that pg_rewind might want to decide to remove those.

--
Alex

Re: Concurrency issue in pg_rewind

From
"Andrey M. Borodin"
Date:

> 18 сент. 2020 г., в 11:59, Michael Paquier <michael@paquier.xyz> написал(а):
>
> On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:
>> This is whole point of having prefetch. restore_command just links
>> file from the same partition.
>
> If this stuff is willing to do so, you may have your reasons, but even
> if you wish to locate both pg_wal/ and the prefetch path in the same
> partition, I don't get why it is necessary to have the prefetch path
> included directly in pg_wal?  You could just use different paths for
> both.  Say, with a base partition at /my/path/, you can just have
> /my/path/pg_wal/ that the Postgres backend links to, and
> /my/path/wal-g/prefetch/ for the secondary path.

This complexity doesn't seem necessary to me. What we gain? Prefetched WAL is WAL per se. Makes sense to keep it in
pg_waltree by default. 

I will implement possibility to move cache out of pg_wal (similar functionality is implemented in pgBackRest). But it
seemsuseless to me: user can configure WAL prefetch to be less performant, without any benefits. 

Best regards, Andrey Borodin.


Re: Concurrency issue in pg_rewind

From
Heikki Linnakangas
Date:
On 18/09/2020 10:17, Alexander Kukushkin wrote:
> At the same time, pg_rewind due to such "fatal" error leaves PGDATA in
> an inconsistent state with empty pg_control file, this is totally bad
> and easily fixable. We want the specific file to be absent and it is
> already absent, why should it be a fatal error and not warning?

Whenever pg_rewind runs into something unexpected, it fails loudly, so 
that the administrator can re-initialize from a base backup. That's the 
general rule. If a file goes missing while pg_rewind is running, that is 
unexpected. It could be a sign that the server was started concurrently, 
or another pg_rewind was started against it, for example.

I feel that we could make an exception of some sort here, but I'm not 
sure what exactly. I don't feel comfortable just downgrading the 
unexpected ENOENT on unlink() to warning in all cases. Besides, scary 
warnings that you routinely ignore is not good either.

I have a hard time coming up with a general rule and justification 
that's not just "do X because WAL-G does Y". pg_rewind failing because 
WAL-G removed a file unexpectedly is one problem, but another is that 
the restore_command might get confused if a pg_rewind removes a file 
that restore_command needs. This is hard when restore_command does 
things in the background, and there's no communication between the 
background process and pg_rewind.

The general principle is that pg_rewind is equivalent to overwriting the 
target with the source, only faster. Perhaps pg_wal should be an 
exception, and pg_rewind should leave alone any files under pg_wal that 
it doesn't recognize.

- Heikki



Re: Concurrency issue in pg_rewind

From
Stephen Frost
Date:
Greetings,

* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
> On 18/09/2020 10:17, Alexander Kukushkin wrote:
> >At the same time, pg_rewind due to such "fatal" error leaves PGDATA in
> >an inconsistent state with empty pg_control file, this is totally bad
> >and easily fixable. We want the specific file to be absent and it is
> >already absent, why should it be a fatal error and not warning?
>
> Whenever pg_rewind runs into something unexpected, it fails loudly, so that
> the administrator can re-initialize from a base backup. That's the general
> rule. If a file goes missing while pg_rewind is running, that is unexpected.
> It could be a sign that the server was started concurrently, or another
> pg_rewind was started against it, for example.

Agreed.

> I feel that we could make an exception of some sort here, but I'm not sure
> what exactly. I don't feel comfortable just downgrading the unexpected
> ENOENT on unlink() to warning in all cases. Besides, scary warnings that you
> routinely ignore is not good either.

I also dislike the idea of downgrading this.

> I have a hard time coming up with a general rule and justification that's
> not just "do X because WAL-G does Y". pg_rewind failing because WAL-G
> removed a file unexpectedly is one problem, but another is that the
> restore_command might get confused if a pg_rewind removes a file that
> restore_command needs. This is hard when restore_command does things in the
> background, and there's no communication between the background process and
> pg_rewind.

I would also point out that wal-g isn't the only backup/restore tool
that does pre-fetching: so does pgbackrest, but we pre-fetch into an
independent spool directory, because these tools really should *not* be
modifying the PGDATA directory during restore_command.

I'm really disinclined to make concessions for external tools to start
writing into directories that they shouldn't be- and this goes for
removing .ready files too, imv.  Yes, you can do such things and maybe
things will work, but if you run into issues with that, that's on you
for making changes to the PGDATA directory, not on PG to try and guess
at what you, or any other external tool, did and magically work around
it or keep it working.

Thanks,

Stephen

Attachment

Re: Concurrency issue in pg_rewind

From
Stephen Frost
Date:
Greetings,

* Andrey M. Borodin (x4mmm@yandex-team.ru) wrote:
> > 18 сент. 2020 г., в 11:59, Michael Paquier <michael@paquier.xyz> написал(а):
> > On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:
> >> This is whole point of having prefetch. restore_command just links
> >> file from the same partition.
> >
> > If this stuff is willing to do so, you may have your reasons, but even
> > if you wish to locate both pg_wal/ and the prefetch path in the same
> > partition, I don't get why it is necessary to have the prefetch path
> > included directly in pg_wal?  You could just use different paths for
> > both.  Say, with a base partition at /my/path/, you can just have
> > /my/path/pg_wal/ that the Postgres backend links to, and
> > /my/path/wal-g/prefetch/ for the secondary path.
>
> This complexity doesn't seem necessary to me. What we gain? Prefetched WAL is WAL per se. Makes sense to keep it in
pg_waltree by default. 
>
> I will implement possibility to move cache out of pg_wal (similar functionality is implemented in pgBackRest). But it
seemsuseless to me: user can configure WAL prefetch to be less performant, without any benefits. 

In this case there's certainly one very clear benefit: pg_rewind will be
more robust at detecting serious issues and complaining loudly,
hopefully avoiding having users end up with corrupted clusters.  That's
certainly not nothing, from my perspective.

Thanks,

Stephen

Attachment