Thread: Re: pg_rewind WAL segments deletion pitfall

Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
(Moved to -hackers)

At Thu, 25 Aug 2022 10:34:40 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in 
> > # killall -9 postgres
> > # rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
> > mkdir newarch oldarch
> > initdb -k -D oldprim
> > echo "archive_mode = 'always'">> oldprim/postgresql.conf
> >
> 
> With archive_mode = always you can't reproduce it.
> It is very rarely people set it to always in production due to the overhead.
...
> The archive_mode has to be set to on and the archive_command should be
> failing when you do pg_ctl -D oldprim stop

Ah, I see.

What I don't still understand is why pg_rewind doesn't work for the
old primary in that case. When archive_mode=on, the old primary has
the complete set of WAL files counting both pg_wal and its archive. So
as the same to the privious repro, pg_rewind -c ought to work (but it
uses its own archive this time).  In that sense the proposed solution
is still not needed in this case.

A bit harder situation comes after the server successfully rewound; if
the new primary goes so far that the old primary cannot connect. Even
in that case, you can copy-in the requried WAL files or configure
restore_command of the old pimary so that it finds required WAL files
there.

As the result the system in total doesn't lose a WAL file.

So.. I might still be missing something..


###############################
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start

# the last common checkpoint
psql -p 5432 -c 'checkpoint'

# record approx. diverging WAL segment
start_wal=`psql -p 5433 -Atc 'select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where
name= 'wal_segment_size')::int);
 
`
psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'
pg_ctl -D newprim promote
psql -p 5433 -c 'checkpoint'

# old rprimary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done

# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
pg_ctl -D oldprim stop

# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c

# advance WAL on the old primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal

postgres -D oldprim  # fails with "WAL file has been removed"

# The alternative of copying-in
# echo "restore_command = 'cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf

# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
  if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)

postgres -D oldprim

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hello Kyotaro,


On Fri, 26 Aug 2022 at 10:04, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> With archive_mode = always you can't reproduce it.
> It is very rarely people set it to always in production due to the overhead.
...
> The archive_mode has to be set to on and the archive_command should be
> failing when you do pg_ctl -D oldprim stop

Ah, I see.

What I don't still understand is why pg_rewind doesn't work for the
old primary in that case. When archive_mode=on, the old primary has
the complete set of WAL files counting both pg_wal and its archive. So
as the same to the privious repro, pg_rewind -c ought to work (but it
uses its own archive this time).  In that sense the proposed solution
is still not needed in this case.

The pg_rewind finishes successfully. But as a result it removes some files from pg_wal that are required to perform recovery because they are missing on the new primary.

 

A bit harder situation comes after the server successfully rewound; if
the new primary goes so far that the old primary cannot connect. Even
in that case, you can copy-in the requried WAL files or configure
restore_command of the old pimary so that it finds required WAL files
there.

Yes, we can do the backup of pg_wal before running pg_rewind, but it feels very ugly, because we will also have to clean this "backup" after a successful recovery.
It would be much better if pg_rewind didn't remove WAL files between the last common checkpoint and diverged LSN in the first place.

Regards,
--
Alexander Kukushkin

Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
Hello, Alex.

At Fri, 26 Aug 2022 10:57:25 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in 
> On Fri, 26 Aug 2022 at 10:04, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
> > What I don't still understand is why pg_rewind doesn't work for the
> > old primary in that case. When archive_mode=on, the old primary has
> > the complete set of WAL files counting both pg_wal and its archive. So
> > as the same to the privious repro, pg_rewind -c ought to work (but it
> > uses its own archive this time).  In that sense the proposed solution
> > is still not needed in this case.
> >
> 
> The pg_rewind finishes successfully. But as a result it removes some files
> from pg_wal that are required to perform recovery because they are missing
> on the new primary.

IFAIS pg_rewind doesn't. -c option contrarily restores the all
segments after the last (common) checkpoint and all of them are left
alone after pg_rewind finishes.  postgres itself removes the WAL files
after recovery. After-promotion cleanup and checkpoint revmoes the
files on the previous timeline.

Before pg_rewind runs in the repro below, the old primary has the
following segments.

TLI1:  2 8 9 A B C D

Just after pg_rewind finishes, the old primary has the following
segments.

TLI1:  2 3   5 6 7
TLI2:      4        (and 00000002.history)

pg_rewind copied 1-2 to 1-3 and 2-4 and history file from the new
primary, 1-4 to 1-7 from archive. After rewind finished, 1-4,1-8 to
1-D have been removed since the new primary didn't have them.

Recovery starts from 1-3 and promotes at 0/4_000000. postgres removes
1-5 to 1-7 by post-promotion cleanup and removes 1-2 to 1-4 by a
restartpoint. All of the segments are useless after the old primary
promotes.

When the old primary starts, it uses 1-3 and 2-4 for recovery and
fails to fetch 2-5 from the new primary.  But it is not an issue of
pg_rewind at all.

> > A bit harder situation comes after the server successfully rewound; if
> > the new primary goes so far that the old primary cannot connect. Even
> > in that case, you can copy-in the requried WAL files or configure
> > restore_command of the old pimary so that it finds required WAL files
> > there.
> >
> 
> Yes, we can do the backup of pg_wal before running pg_rewind, but it feels

So, if I understand you correctly, the issue you are complaining is
not about the WAL segments on the old timeline but about those on the
new timeline, which don't have a business with what pg_rewind does. As
the same with the case of pg_basebackup, the missing segments need to
be somehow copied from the new primary since the old primary never had
the chance to have them before.

> very ugly, because we will also have to clean this "backup" after a
> successful recovery.

What do you mean by the "backup" here? Concretely what WAL segments do
you feel need to remove, for example, in the repro case?  Or, could
you show your issue by something like the repro below?

> It would be much better if pg_rewind didn't remove WAL files between the
> last common checkpoint and diverged LSN in the first place.

Thus I don't follow this..

regards.


(Fixed a bug and slightly modified)
====
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start

# the last common checkpoint
psql -p 5432 -c 'checkpoint'

# record approx. diverging WAL segment
start_wal=`psql -p 5433 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where
name= 'wal_segment_size')::int);"`
 
psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'
pg_ctl -D newprim promote

# old rprimary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'

# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
pg_ctl -D oldprim stop

# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c

# advance WAL on the old primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal

postgres -D oldprim  # fails with "WAL file has been removed"

# The alternative of copying-in
# echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf

# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
  if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)

postgres -D oldprim
====

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hello Kyotaro,

On Tue, 30 Aug 2022 at 07:50, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

 
So, if I understand you correctly, the issue you are complaining is
not about the WAL segments on the old timeline but about those on the
new timeline, which don't have a business with what pg_rewind does. As
the same with the case of pg_basebackup, the missing segments need to
be somehow copied from the new primary since the old primary never had
the chance to have them before.

No, we are complaining exactly about WAL segments from the old timeline that are removed by pg_rewind.
Those segments haven't been archived by the old primary and the new primary already recycled them.


 

Thus I don't follow this..

I did a slight modification of your script that reproduces a problem.
 

====
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start

# the last common checkpoint
psql -p 5432 -c 'checkpoint'

# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
# advance WAL on the old primary; four WAL segments will never make it to the archive
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done

# record approx. diverging WAL segment
start_wal=`psql -p 5432 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"`
pg_ctl -D newprim promote

# old rprimary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'

pg_ctl -D oldprim stop

# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c

# advance WAL on the old primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal

postgres -D oldprim  # fails with "WAL file has been removed"

# The alternative of copying-in
# echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf

# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
  if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)

postgres -D oldprim  # also fails with "requested WAL segment XXX has already been removed"
===

Regards,
--
Alexander Kukushkin

Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:

I did a slight modification of your script that reproduces a problem.
 

====

It seems that formatting damaged the script, so I better attach it as a file.
 
Regards,
--
Alexander Kukushkin
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
At Tue, 30 Aug 2022 14:50:26 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> IFAIS pg_rewind doesn't. -c option contrarily restores the all
> segments after the last (common) checkpoint and all of them are left
> alone after pg_rewind finishes.  postgres itself removes the WAL files
> after recovery. After-promotion cleanup and checkpoint revmoes the
> files on the previous timeline.
> 
> Before pg_rewind runs in the repro below, the old primary has the
> following segments.
> 
> TLI1:  2 8 9 A B C D
> 
> Just after pg_rewind finishes, the old primary has the following
> segments.
> 
> TLI1:  2 3   5 6 7
> TLI2:      4        (and 00000002.history)
> 
> pg_rewind copied 1-2 to 1-3 and 2-4 and history file from the new
1> primary, 1-4 to 1-7 from archive. After rewind finished, 1-4,1-8 to
> 1-D have been removed since the new primary didn't have them.
> 
> Recovery starts from 1-3 and promotes at 0/4_000000. postgres removes
> 1-5 to 1-7 by post-promotion cleanup and removes 1-2 to 1-4 by a
> restartpoint. All of the segments are useless after the old primary
> promotes.
> 
> When the old primary starts, it uses 1-3 and 2-4 for recovery and
> fails to fetch 2-5 from the new primary.  But it is not an issue of
> pg_rewind at all.

Ah. I think I understand what you are mentioning. If the new primary
didn't have the segment 1-3 to 1-6, pg_rewind removes it.  The new
primary doesn't have it in pg_wal nor in archive. The old primary has
it in its archive.  So get out from the situation, we need to the
following *two* things before the old primary can start:

1. copy 1-3 to 1-6 from the archive of the *old* primary
2. copy 2-7 and later from the archive of the *new* primary

Since pg_rewind have copied in to the old primary's pg_wal, removing them just have users to perform the task
duplicatedly,as you stated.
 

Okay, I completely understand the problem and convinced that it is
worth changing the behavior.

However, the proposed patch looks too complex to me.  It can be done
by just comparing xlog file name and the last checkpoint location and
TLI in decide_file_actions().

regards.

=====
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start

# the last common checkpoint
psql -p 5432 -c 'checkpoint'

# record approx. diverging WAL segment
start_wal=`psql -p 5433 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where
name= 'wal_segment_size')::int);"`
 
for i in $(seq 1 5); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint'
pg_ctl -D newprim promote

# old rprimary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'

# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
pg_ctl -D oldprim stop

# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c

# advance WAL on the old primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal

#### copy the missing file of the old timeline
## cp oldarch/00000001000000000000000[3456] oldprim/pg_wal
## cp newarch/00000002000000000000000* oldprim/pg_wal

postgres -D oldprim  # fails with "WAL file has been removed"


# The alternative of copying-in
# echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf

# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
  if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)

postgres -D oldprim
=====

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
At Tue, 30 Aug 2022 08:49:27 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in 
> No, we are complaining exactly about WAL segments from the old timeline
> that are removed by pg_rewind.
> Those segments haven't been archived by the old primary and the new primary
> already recycled them.

Yeah, sorry for my thick skull but I finally got your point.

And as I said in a mail I sent just before, the patch looks too
complex.  How about just comparing WAL file name aginst the last
common checkpoint's tli and lsn? We can tell filemap.c about the last
checkpoint and decide_file_action can compare the file name with it.

It is sufficient to preserve WAL files if tli matches and the segment
number of the WAL file is equal to or later than the checkpoint
location.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hello Kyotaro,


On Tue, 30 Aug 2022 at 09:51, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:


And as I said in a mail I sent just before, the patch looks too
complex.  How about just comparing WAL file name aginst the last
common checkpoint's tli and lsn? We can tell filemap.c about the last
checkpoint and decide_file_action can compare the file name with it.

It is sufficient to preserve WAL files if tli matches and the segment
number of the WAL file is equal to or later than the checkpoint
location.

What if the last common checkpoint was on a previous timeline?
I.e., standby was promoted to primary, the timeline changed from 1 to 2, and after that the node crashed _before_ the CHECKPOINT after promote has finished.
The next node will advance the timeline from 2 to 3.
In this case, the last common checkpoint will be on timeline 1, and the check becomes more complex because we will have to consider both timelines, 1 and 2.

Also, we need to take into account the divergency LSN. Files after it are not required.

Regards,
--
Alexander Kukushkin

Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
At Tue, 30 Aug 2022 10:03:07 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in 
> On Tue, 30 Aug 2022 at 09:51, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
> > And as I said in a mail I sent just before, the patch looks too
> > complex.  How about just comparing WAL file name aginst the last
> > common checkpoint's tli and lsn? We can tell filemap.c about the last
> > checkpoint and decide_file_action can compare the file name with it.
> >
> > It is sufficient to preserve WAL files if tli matches and the segment
> > number of the WAL file is equal to or later than the checkpoint
> > location.
> >
> 
> What if the last common checkpoint was on a previous timeline?
> I.e., standby was promoted to primary, the timeline changed from 1 to 2,
> and after that the node crashed _before_ the CHECKPOINT after promote has
> finished.
> The next node will advance the timeline from 2 to 3.
> In this case, the last common checkpoint will be on timeline 1, and the
> check becomes more complex because we will have to consider both timelines,
> 1 and 2.

Hmm. Doesn't it work to ignoring tli then? All segments that their
segment number is equal to or larger than the checkpoint locaiton are
preserved regardless of TLI?

> Also, we need to take into account the divergency LSN. Files after it are
> not required.

They are removed at the later checkpoints. But also we can remove
segments that are out of the range between the last common checkpoint
and divergence point ignoring TLI.  the divergence point is also
compared?

> if (file_segno >= last_common_checkpoint_seg &&
>     file_segno <= divergence_seg)
>     <PRESERVE IT>;


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:


On Tue, 30 Aug 2022 at 10:27, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Hmm. Doesn't it work to ignoring tli then? All segments that their
segment number is equal to or larger than the checkpoint locaiton are
preserved regardless of TLI?

If we ignore TLI there is a chance that we may retain some unnecessary (or just wrong) files.
 

> Also, we need to take into account the divergency LSN. Files after it are
> not required.

They are removed at the later checkpoints. But also we can remove
segments that are out of the range between the last common checkpoint
and divergence point ignoring TLI. 

Everything that is newer last_common_checkpoint_seg could be removed (but it already happens automatically, because these files are missing on the new primary).
WAL files that are older than last_common_checkpoint_seg could be either removed or at least not copied from the new primary.

 
the divergence point is also
compared?

> if (file_segno >= last_common_checkpoint_seg &&
>     file_segno <= divergence_seg)
>     <PRESERVE IT>;

The current implementation relies on tracking WAL files being open while searching for the last common checkpoint. It automatically starts from the divergence_seg, automatically finishes at last_common_checkpoint_seg, and last but not least, automatically handles timeline changes. I don't think that manually written code that decides what to do from the WAL file name (and also takes into account TLI) could be much simpler than the current approach.


Actually, since we start doing some additional "manipulations" with files in pg_wal, we probably should do a symmetric action with files inside pg_wal/archive_status

Regards,
--
Alexander Kukushkin

Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
At Tue, 30 Aug 2022 11:01:58 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in 
> On Tue, 30 Aug 2022 at 10:27, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
> 
> >
> > Hmm. Doesn't it work to ignoring tli then? All segments that their
> > segment number is equal to or larger than the checkpoint locaiton are
> > preserved regardless of TLI?
> >
> 
> If we ignore TLI there is a chance that we may retain some unnecessary (or
> just wrong) files.

Right. I mean I don't think thats a problem and we can rely on
postgres itself for later cleanup. Theoretically some out-of-range tli
or segno files are left alone but they surely will be gone soon after
the server starts.

> > > Also, we need to take into account the divergency LSN. Files after it are
> > > not required.
> >
> > They are removed at the later checkpoints. But also we can remove
> > segments that are out of the range between the last common checkpoint
> > and divergence point ignoring TLI.
> 
> 
> Everything that is newer last_common_checkpoint_seg could be removed (but
> it already happens automatically, because these files are missing on the
> new primary).
> WAL files that are older than last_common_checkpoint_seg could be either
> removed or at least not copied from the new primary.
..
> The current implementation relies on tracking WAL files being open while
> searching for the last common checkpoint. It automatically starts from the
> divergence_seg, automatically finishes at last_common_checkpoint_seg, and
> last but not least, automatically handles timeline changes. I don't think
> that manually written code that decides what to do from the WAL file name
> (and also takes into account TLI) could be much simpler than the current
> approach.

Yeah, I know.  My expectation is taking the simplest way for the same
effect. My concern was the additional hash. On second thought, I
concluded that we should that on the existing filehash.

We can just add a FILE_ACTION_NONE entry to the file hash from
SimpleXLogPageRead.  Since this happens before decide_file_action()
call, decide_file_action() should ignore the entries with
FILE_ACTION_NONE. Also we need to call filehash_init() earlier.

> Actually, since we start doing some additional "manipulations" with files
> in pg_wal, we probably should do a symmetric action with files inside
> pg_wal/archive_status

In that sense, pg_rewind rather should place missing
archive_status/*.done for segments including restored ones seen while
finding checkpoint.  This is analogous of the behavior with
pg_basebackup and pg_receivewal. Also we should add FILE_ACTION_NONE
entries for .done files for segments read while finding checkpoint.

What do you think about that?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
At Wed, 31 Aug 2022 14:30:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> What do you think about that?

By the way don't you add an CF entry for this?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
Polina Bungina
Date:

Hello Kayotaro,


Here is the new version of the patch that includes the changes you suggested. It is smaller now but I doubt if it is as easy to understand as it used to be.


The need of manipulations with the target’s pg_wal/archive_status directory is a question to discuss…

At first glance it seems to be useless for .ready files: checkpointer process will anyway recreate them if archiving is enabled on the rewound old primary and we will finally have them in the archive. As for the .done files, it seems reasonable to follow the pg_basebackup logic and keep .done files together with the corresponding segments (those between the last common checkpoint and the point of divergence) to protect them from being archived once again.

But on the other hand it seems to be not that straightforward: imaging we have WAL segment X on the target along with X.done file and we decide to preserve them both (or we download it from archive and force .done file creation), while archive_mode was set to ‘always’ and the source (promoted replica) also still has WAL segment X and X.ready file. After pg_rewind we will end up with both X.ready and X.done, which seems to be not a good situation (but most likely not critical either).


Regards,

Polina Bungina



On Wed, Aug 31, 2022 at 7:30 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Tue, 30 Aug 2022 11:01:58 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
> On Tue, 30 Aug 2022 at 10:27, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
>
> >
> > Hmm. Doesn't it work to ignoring tli then? All segments that their
> > segment number is equal to or larger than the checkpoint locaiton are
> > preserved regardless of TLI?
> >
>
> If we ignore TLI there is a chance that we may retain some unnecessary (or
> just wrong) files.

Right. I mean I don't think thats a problem and we can rely on
postgres itself for later cleanup. Theoretically some out-of-range tli
or segno files are left alone but they surely will be gone soon after
the server starts.

> > > Also, we need to take into account the divergency LSN. Files after it are
> > > not required.
> >
> > They are removed at the later checkpoints. But also we can remove
> > segments that are out of the range between the last common checkpoint
> > and divergence point ignoring TLI.
>
>
> Everything that is newer last_common_checkpoint_seg could be removed (but
> it already happens automatically, because these files are missing on the
> new primary).
> WAL files that are older than last_common_checkpoint_seg could be either
> removed or at least not copied from the new primary.
..
> The current implementation relies on tracking WAL files being open while
> searching for the last common checkpoint. It automatically starts from the
> divergence_seg, automatically finishes at last_common_checkpoint_seg, and
> last but not least, automatically handles timeline changes. I don't think
> that manually written code that decides what to do from the WAL file name
> (and also takes into account TLI) could be much simpler than the current
> approach.

Yeah, I know.  My expectation is taking the simplest way for the same
effect. My concern was the additional hash. On second thought, I
concluded that we should that on the existing filehash.

We can just add a FILE_ACTION_NONE entry to the file hash from
SimpleXLogPageRead.  Since this happens before decide_file_action()
call, decide_file_action() should ignore the entries with
FILE_ACTION_NONE. Also we need to call filehash_init() earlier.

> Actually, since we start doing some additional "manipulations" with files
> in pg_wal, we probably should do a symmetric action with files inside
> pg_wal/archive_status

In that sense, pg_rewind rather should place missing
archive_status/*.done for segments including restored ones seen while
finding checkpoint.  This is analogous of the behavior with
pg_basebackup and pg_receivewal. Also we should add FILE_ACTION_NONE
entries for .done files for segments read while finding checkpoint.

What do you think about that?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
Polina Bungina
Date:
Terribly sorry for misspelling your name and for the topposting!

Regards,
Polina Bungina

Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hello Kyotaro,

any further thoughts on it?

Regards,
--
Alexander Kukushkin

Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
At Thu, 1 Sep 2022 13:33:09 +0200, Polina Bungina <bungina@gmail.com> wrote in 
> Here is the new version of the patch that includes the changes you
> suggested. It is smaller now but I doubt if it is as easy to understand as
> it used to be.

pg_rewind works in two steps. First it constructs file map which
decides the action for each file, then second, it performs file
operations according to the file map. So, if we are going to do
something on some files, that action should be record that in the file
map, I think.

Regarding the the patch, pg_rewind starts reading segments from the
divergence point back to the nearest checkpoint, then moves foward
during rewinding. So, the fact that SimpleXLogPageRead have read a
segment suggests that the segment is required during the next startup.
So I don't think we need to move around the keepWalSeg flag.  All
files that are wanted while rewinding should be preserved
unconditionally.

It's annoying that the file path for file map and open(2) have
different top directory. But sharing the same path string between the
two seems rather ugly..

I feel uncomfortable to directly touch the internal of file_entry_t
outside filemap.c. I'd like to hide the internals in filemap.c, but
pg_rewind already does that..

+        /*
+         * Some entries (WAL segments) already have an action assigned
+         * (see SimpleXLogPageRead()).
+         */
+        if (entry->action == FILE_ACTION_NONE)
+            continue;
         entry->action = decide_file_action(entry);

It might be more reasonable to call decide_file_action() when action
is UNDECIDED.

> The need of manipulations with the target’s pg_wal/archive_status directory
> is a question to discuss…
>
> At first glance it seems to be useless for .ready files: checkpointer
> process will anyway recreate them if archiving is enabled on the rewound
> old primary and we will finally have them in the archive. As for the .done
> files, it seems reasonable to follow the pg_basebackup logic and keep .done
> files together with the corresponding segments (those between the last
> common checkpoint and the point of divergence) to protect them from being
> archived once again.
> 
> But on the other hand it seems to be not that straightforward: imaging we
> have WAL segment X on the target along with X.done file and we decide to
> preserve them both (or we download it from archive and force .done file
> creation), while archive_mode was set to ‘always’ and the source (promoted
> replica) also still has WAL segment X and X.ready file. After pg_rewind we
> will end up with both X.ready and X.done, which seems to be not a good
> situation (but most likely not critical either).

Thanks for the thought. Yes, it's not so straight-forward. And, as you
mentioned, the worst result comes from not doing that is that some
already-archived segments are archived at next run, which is generally
harmless. So I think we're ok to ignore that in this patdh then create
other patch if we still want to do that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
Polina Bungina
Date:
On Tue, Sep 27, 2022 at 9:50 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Regarding the the patch, pg_rewind starts reading segments from the
divergence point back to the nearest checkpoint, then moves foward
during rewinding. So, the fact that SimpleXLogPageRead have read a
segment suggests that the segment is required during the next startup.
So I don't think we need to move around the keepWalSeg flag.  All
files that are wanted while rewinding should be preserved
unconditionally.

I am probably not getting this right but as far as I see SimpleXLogPageRead is called at most 3 times during pg_rewind run:
1. From readOneRecord to determine the end-of-WAL on the target by reading the last shutdown checkpoint record/minRecoveryPoint on it
2. From findLastCheckpoint to find last common checkpoint (here it indeed reads all the segments that are required during the startup, hence the keepWalSeg flag set to true)
3. From extractPageMap to extract all the pages modified after the fork (here we also read all the segments that should be kept but also the ones further, until the target's end record. Doesn't seem we should unconditionally preserve them all).
Am I missing something?

 
+               /*
+                * Some entries (WAL segments) already have an action assigned
+                * (see SimpleXLogPageRead()).
+                */
+               if (entry->action == FILE_ACTION_NONE)
+                       continue;
                entry->action = decide_file_action(entry);
It might be more reasonable to call decide_file_action() when action
is UNDECIDED.
 
Agree, will change this part. 

Regards,
Polina Bungina

Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
At Wed, 28 Sep 2022 10:09:05 +0200, Polina Bungina <bungina@gmail.com> wrote in 
> On Tue, Sep 27, 2022 at 9:50 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
> 
> > Regarding the the patch, pg_rewind starts reading segments from the
> > divergence point back to the nearest checkpoint, then moves foward
> > during rewinding. So, the fact that SimpleXLogPageRead have read a
> > segment suggests that the segment is required during the next startup.
> > So I don't think we need to move around the keepWalSeg flag.  All
> > files that are wanted while rewinding should be preserved
> > unconditionally.
> >
> 
> I am probably not getting this right but as far as I see SimpleXLogPageRead
> is called at most 3 times during pg_rewind run:
> 1. From readOneRecord to determine the end-of-WAL on the target by reading
> the last shutdown checkpoint record/minRecoveryPoint on it
> 2. From findLastCheckpoint to find last common checkpoint (here it
> indeed reads all the segments that are required during the startup, hence
> the keepWalSeg flag set to true)
> 3. From extractPageMap to extract all the pages modified after the fork
> (here we also read all the segments that should be kept but also the ones
> further, until the target's end record. Doesn't seem we should
> unconditionally preserve them all).
> Am I missing something?

No. You're right. I have to admit that I was confused at the time X(,
sorry for that.  Those extra files are I believe harmless but of
course it's preferable to avoid them. So the keepWalSeg is useful.

So the latest version become very similar to v1 in that the both have
keepWalSeg flag. The difference is the need of the file name hash.  I
still think that it's better if we don't need the additional file
hash.  If we move out the bare code in v2 added to
SimpleXLogPageRead(), then name it "preserve_file(char *filepath)",
the code become more easy to read.

+        if (private->keepWalSeg)
+        {
+            /* the caller told us to preserve this file for future use */
+            snprintf(xlogfpath, MAXPGPATH, XLOGDIR "/%s", xlogfname);
+           preserve_file(xlogfpath);
+        }

Instead, I think we should add a comment here:

@@ -192,6 +195,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 
 >     private.tliIndex = tliIndex;
 >     private.restoreCommand = restoreCommand;
++    /*
++   * WAL files read during searching for the last checkpoint are required
++   * by the next startup recovery of the target cluster.
++  */
 >+    private.keepWalSeg = true;

What do you think about the above?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
Polina Bungina
Date:
I agree with your suggestions, so here is the updated version of patch. Hope I haven't missed anything.

Regards,
Polina Bungina
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
torikoshia
Date:
On 2022-09-29 17:18, Polina Bungina wrote:
> I agree with your suggestions, so here is the updated version of
> patch. Hope I haven't missed anything.
> 
> Regards,
> Polina Bungina

Thanks for working on this!
It seems like we are also facing the same issue.

I tested the v3 patch under our condition, old primary has succeeded to 
become new standby.


BTW when I used pg_rewind-removes-wal-segments-reproduce.sh attached in 
[1], old primary also failed to become standby:

   FATAL:  could not receive data from WAL stream: ERROR:  requested WAL 
segment 000000020000000000000007 has already been removed

However, I think this is not a problem:  just adding restore_command 
like below fixed the situation.

   echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >> 
oldprim/postgresql.conf

Attached modified reproduction script for reference.

[1]https://www.postgresql.org/message-id/CAFh8B%3DnNiFZOAPsv49gffxHBPzwmZ%3D6Msd4miMis87K%3Dd9rcRA%40mail.gmail.com


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
At Wed, 28 Jun 2023 22:28:13 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in 
> 
> On 2022-09-29 17:18, Polina Bungina wrote:
> > I agree with your suggestions, so here is the updated version of
> > patch. Hope I haven't missed anything.
> > Regards,
> > Polina Bungina
> 
> Thanks for working on this!
> It seems like we are also facing the same issue.

Thanks for looking this.

> I tested the v3 patch under our condition, old primary has succeeded
> to become new standby.
> 
> 
> BTW when I used pg_rewind-removes-wal-segments-reproduce.sh attached
> in [1], old primary also failed to become standby:
> 
>   FATAL: could not receive data from WAL stream: ERROR: requested WAL
>   segment 000000020000000000000007 has already been removed
> 
> However, I think this is not a problem: just adding restore_command
> like below fixed the situation.
> 
>   echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >>
>   oldprim/postgresql.conf

I thought on the same line at first, but that's not the point
here. The problem we want ot address is that pg_rewind ultimately
removes certain crucial WAL files required for the new primary to
start, despite them being present previously. In other words, that
restore_command works, but it only undoes what pg_rewind wrongly did,
resulting in unnecessary consupmtion of I/O and/or network bandwidth
that essentially serves no purpose.

pg_rewind already has a feature that determines how each file should
be handled, but it is currently making wrong dicisions for WAL
files. The goal here is to rectify this behavior and ensure that
pg_rewind makes the right decisions.

> Attached modified reproduction script for reference.
> 
> [1]https://www.postgresql.org/message-id/CAFh8B%3DnNiFZOAPsv49gffxHBPzwmZ%3D6Msd4miMis87K%3Dd9rcRA%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
torikoshia
Date:
On 2023-06-29 10:25, Kyotaro Horiguchi wrote:
Thanks for the comment!

> At Wed, 28 Jun 2023 22:28:13 +0900, torikoshia
> <torikoshia@oss.nttdata.com> wrote in
>> 
>> On 2022-09-29 17:18, Polina Bungina wrote:
>> > I agree with your suggestions, so here is the updated version of
>> > patch. Hope I haven't missed anything.
>> > Regards,
>> > Polina Bungina
>> 
>> Thanks for working on this!
>> It seems like we are also facing the same issue.
> 
> Thanks for looking this.
> 
>> I tested the v3 patch under our condition, old primary has succeeded
>> to become new standby.
>> 
>> 
>> BTW when I used pg_rewind-removes-wal-segments-reproduce.sh attached
>> in [1], old primary also failed to become standby:
>> 
>>   FATAL: could not receive data from WAL stream: ERROR: requested WAL
>>   segment 000000020000000000000007 has already been removed
>> 
>> However, I think this is not a problem: just adding restore_command
>> like below fixed the situation.
>> 
>>   echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >>
>>   oldprim/postgresql.conf
> 
> I thought on the same line at first, but that's not the point
> here.

Yes. I don't think adding restore_command solves the problem and
modification to prevent deleting necessary WAL like proposed
patch is necessary.

I added restore_command since
pg_rewind-removes-wal-segments-reproduce.sh failed to catch up
even after applying v3 patch and prevent pg_rewind from delete
WALs(*), because some necessary WALs were archived.

It's not a problem we are discussing here, but I wanted to get
the script to work to the point where old primary could
successfully catch up to new primary.

(*)Specifically, running the script without apply the patch,
recovery failed because 000000010000000000000003 which has
already been removed. This file was deleted by pg_rewind as
we know.
OTHO without the restore_command, recovery failed because
000000020000000000000007 has already been removed even after
applying the patch.

> The problem we want ot address is that pg_rewind ultimately
> removes certain crucial WAL files required for the new primary to
> start, despite them being present previously.

I thought it's not "new primary", but "old primary".

> In other words, that
> restore_command works, but it only undoes what pg_rewind wrongly did,
> resulting in unnecessary consupmtion of I/O and/or network bandwidth
> that essentially serves no purpose.

As far as I tested using the script and the situation we are facing,
after promoting newprim necessary WAL(000000010000000000000003..) were
not available and just adding restore_command did not solve the problem.

> pg_rewind already has a feature that determines how each file should
> be handled, but it is currently making wrong dicisions for WAL
> files. The goal here is to rectify this behavior and ensure that
> pg_rewind makes the right decisions.

+1


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: pg_rewind WAL segments deletion pitfall

From
torikoshia
Date:
On 2022-09-29 17:18, Polina Bungina wrote:
> I agree with your suggestions, so here is the updated version of
> patch. Hope I haven't missed anything.

Thanks for the patch, I've marked this as ready-for-committer.

BTW, this issue can be considered a bug, right?
I think it would be appropriate to provide backpatch.

On 2023-06-29 18:42, torikoshia wrote:
> On 2023-06-29 10:25, Kyotaro Horiguchi wrote:
> Thanks for the comment!
> 
>> At Wed, 28 Jun 2023 22:28:13 +0900, torikoshia
>> <torikoshia@oss.nttdata.com> wrote in
>>> 
>>> On 2022-09-29 17:18, Polina Bungina wrote:
>>> > I agree with your suggestions, so here is the updated version of
>>> > patch. Hope I haven't missed anything.
>>> > Regards,
>>> > Polina Bungina
>>> 
>>> Thanks for working on this!
>>> It seems like we are also facing the same issue.
>> 
>> Thanks for looking this.
>> 
>>> I tested the v3 patch under our condition, old primary has succeeded
>>> to become new standby.
>>> 
>>> 
>>> BTW when I used pg_rewind-removes-wal-segments-reproduce.sh attached
>>> in [1], old primary also failed to become standby:
>>> 
>>>   FATAL: could not receive data from WAL stream: ERROR: requested WAL
>>>   segment 000000020000000000000007 has already been removed
>>> 
>>> However, I think this is not a problem: just adding restore_command
>>> like below fixed the situation.
>>> 
>>>   echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >>
>>>   oldprim/postgresql.conf
>> 
>> I thought on the same line at first, but that's not the point
>> here.
> 
> Yes. I don't think adding restore_command solves the problem and
> modification to prevent deleting necessary WAL like proposed
> patch is necessary.
> 
> I added restore_command since
> pg_rewind-removes-wal-segments-reproduce.sh failed to catch up
> even after applying v3 patch and prevent pg_rewind from delete
> WALs(*), because some necessary WALs were archived.
> 
> It's not a problem we are discussing here, but I wanted to get
> the script to work to the point where old primary could
> successfully catch up to new primary.
> 
> (*)Specifically, running the script without apply the patch,
> recovery failed because 000000010000000000000003 which has
> already been removed. This file was deleted by pg_rewind as
> we know.
> OTHO without the restore_command, recovery failed because
> 000000020000000000000007 has already been removed even after
> applying the patch.
> 
>> The problem we want ot address is that pg_rewind ultimately
>> removes certain crucial WAL files required for the new primary to
>> start, despite them being present previously.
> 
> I thought it's not "new primary", but "old primary".
> 
>> In other words, that
>> restore_command works, but it only undoes what pg_rewind wrongly did,
>> resulting in unnecessary consupmtion of I/O and/or network bandwidth
>> that essentially serves no purpose.
> 
> As far as I tested using the script and the situation we are facing,
> after promoting newprim necessary WAL(000000010000000000000003..) were
> not available and just adding restore_command did not solve the 
> problem.
> 
>> pg_rewind already has a feature that determines how each file should
>> be handled, but it is currently making wrong dicisions for WAL
>> files. The goal here is to rectify this behavior and ensure that
>> pg_rewind makes the right decisions.
> 
> +1

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION



Re: pg_rewind WAL segments deletion pitfall

From
Michael Paquier
Date:
On Fri, Aug 18, 2023 at 03:40:57PM +0900, torikoshia wrote:
> Thanks for the patch, I've marked this as ready-for-committer.
>
> BTW, this issue can be considered a bug, right?
> I think it would be appropriate to provide backpatch.

Hmm, I agree that there is a good argument in back-patching as we have
the WAL files between the redo LSN and the divergence LSN, but
pg_rewind is not smart enough to keep them around.  If the archives of
the primary were not able to catch up, the old primary is as good as
kaput, and restore_command won't help here.

I don't like much this patch.  While it takes correctly advantage of
the backward record read logic from SimpleXLogPageRead() able to
handle correctly timeline jumps, it creates a hidden dependency in the
code between the hash table from filemap.c and the page callback.
Wouldn't it be simpler to build a list of the segment names using the
information from WALOpenSegment and build this list in
findLastCheckpoint()?  Also, I am wondering if we should be smarter
with any potential conflict handling between the source and the
target, rather than just enforcing a FILE_ACTION_NONE for all these
files.  In short, could it be better to copy the WAL file from the
source if it exists there?

+       /*
+        * Some entries (WAL segments) already have an action assigned
+        * (see SimpleXLogPageRead()).
+        */
+       if (entry->action == FILE_ACTION_UNDECIDED)
+           entry->action = decide_file_action(entry);

This change makes me a bit uneasy, per se my previous comment with the
additional code dependencies.

I think that this scenario deserves a test case.  If one wants to
emulate a delay in WAL archiving, it is possible to set
archive_command to a command that we know will fail, for instance.
--
Michael

Attachment

Re: pg_rewind WAL segments deletion pitfall

From
torikoshia
Date:
On 2023-08-22 14:32, Michael Paquier wrote:
Thanks for your review!

> On Fri, Aug 18, 2023 at 03:40:57PM +0900, torikoshia wrote:
>> Thanks for the patch, I've marked this as ready-for-committer.
>> 
>> BTW, this issue can be considered a bug, right?
>> I think it would be appropriate to provide backpatch.
> 
> Hmm, I agree that there is a good argument in back-patching as we have
> the WAL files between the redo LSN and the divergence LSN, but
> pg_rewind is not smart enough to keep them around.  If the archives of
> the primary were not able to catch up, the old primary is as good as
> kaput, and restore_command won't help here.

True.
I also imagine that in the typical failover scenario where the target 
cluster was shut down soon after the divergence and pg_rewind was 
executed without much time, we can avoid this kind of 'requested WAL 
segment has already removed' error by preventing pg_rewind from deleting 
necessary WALs.


> I don't like much this patch.  While it takes correctly advantage of
> the backward record read logic from SimpleXLogPageRead() able to
> handle correctly timeline jumps, it creates a hidden dependency in the
> code between the hash table from filemap.c and the page callback.
> Wouldn't it be simpler to build a list of the segment names using the
> information from WALOpenSegment and build this list in
> findLastCheckpoint()?  Also, I am wondering if we should be smarter
> with any potential conflict handling between the source and the
> target, rather than just enforcing a FILE_ACTION_NONE for all these
> files.  In short, could it be better to copy the WAL file from the
> source if it exists there?
> 
> +       /*
> +        * Some entries (WAL segments) already have an action assigned
> +        * (see SimpleXLogPageRead()).
> +        */
> +       if (entry->action == FILE_ACTION_UNDECIDED)
> +           entry->action = decide_file_action(entry);
> 
> This change makes me a bit uneasy, per se my previous comment with the
> additional code dependencies.
> 
> I think that this scenario deserves a test case.  If one wants to
> emulate a delay in WAL archiving, it is possible to set
> archive_command to a command that we know will fail, for instance.
> --
> Michael

Bungina, are you going to respond to these comments?

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hi,



On Tue, 22 Aug 2023 at 07:32, Michael Paquier <michael@paquier.xyz> wrote:


I don't like much this patch.  While it takes correctly advantage of
the backward record read logic from SimpleXLogPageRead() able to
handle correctly timeline jumps, it creates a hidden dependency in the
code between the hash table from filemap.c and the page callback.
Wouldn't it be simpler to build a list of the segment names using the
information from WALOpenSegment and build this list in
findLastCheckpoint()?

I think the first version of the patch more or less did that. Not necessarily a list, but a hash table of WAL file names that we want to keep. But Kyotaro Horiguchi didn't like it and suggested creating entries in the filemap.c hash table instead.
But, I agree, doing it directly from the findLastCheckpoint() makes the code easier to understand.

 
  Also, I am wondering if we should be smarter
with any potential conflict handling between the source and the
target, rather than just enforcing a FILE_ACTION_NONE for all these
files.  In short, could it be better to copy the WAL file from the
source if it exists there?

Before the switchpoint these files are supposed to be the same on the old primary, new primary, and also in the archive. Also, if there is a restore_command postgres will fetch the same file from the archive even if it already exists in pg_wal, which effectively discards all pg_rewind efforts on copying WAL files.
 

+       /*
+        * Some entries (WAL segments) already have an action assigned
+        * (see SimpleXLogPageRead()).
+        */
+       if (entry->action == FILE_ACTION_UNDECIDED)
+           entry->action = decide_file_action(entry);

This change makes me a bit uneasy, per se my previous comment with the
additional code dependencies.

We can revert to the original approach (see v1-0001-pg_rewind-wal-deletion.patch from the very first email) if you like.
 
I think that this scenario deserves a test case.  If one wants to
emulate a delay in WAL archiving, it is possible to set
archive_command to a command that we know will fail, for instance.

Yes, I totally agree, it is on our radar, but meanwhile please see the new version, just to check if I correctly understood your idea.
 
Regards,
--
Alexander Kukushkin
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
Kyotaro Horiguchi
Date:
At Wed, 23 Aug 2023 13:44:52 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in 
> On Tue, 22 Aug 2023 at 07:32, Michael Paquier <michael@paquier.xyz> wrote:
> > I don't like much this patch.  While it takes correctly advantage of
> > the backward record read logic from SimpleXLogPageRead() able to
> > handle correctly timeline jumps, it creates a hidden dependency in the
> > code between the hash table from filemap.c and the page callback.
> > Wouldn't it be simpler to build a list of the segment names using the
> > information from WALOpenSegment and build this list in
> > findLastCheckpoint()?
> 
> I think the first version of the patch more or less did that. Not
> necessarily a list, but a hash table of WAL file names that we want to
> keep. But Kyotaro Horiguchi didn't like it and suggested creating entries
> in the filemap.c hash table instead.
> But, I agree, doing it directly from the findLastCheckpoint() makes the
> code easier to understand.
...
> > +       /*
> > +        * Some entries (WAL segments) already have an action assigned
> > +        * (see SimpleXLogPageRead()).
> > +        */
> > +       if (entry->action == FILE_ACTION_UNDECIDED)
> > +           entry->action = decide_file_action(entry);
> >
> > This change makes me a bit uneasy, per se my previous comment with the
> > additional code dependencies.
> >
> 
> We can revert to the original approach (see
> v1-0001-pg_rewind-wal-deletion.patch from the very first email) if you like.

On the other hand, that approach brings in another source that
suggests the way that file should be handled. I still think that
entry->action should be the only source. However, it seems I'm in the
minority here. So I'm not tied to that approach.

> > I think that this scenario deserves a test case.  If one wants to
> > emulate a delay in WAL archiving, it is possible to set
> > archive_command to a command that we know will fail, for instance.
> >
> 
> Yes, I totally agree, it is on our radar, but meanwhile please see the new
> version, just to check if I correctly understood your idea.

Agreed.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg_rewind WAL segments deletion pitfall

From
torikoshia
Date:
On 2023-08-24 09:45, Kyotaro Horiguchi wrote:
> At Wed, 23 Aug 2023 13:44:52 +0200, Alexander Kukushkin
> <cyberdemn@gmail.com> wrote in
>> On Tue, 22 Aug 2023 at 07:32, Michael Paquier <michael@paquier.xyz> 
>> wrote:
>> > I don't like much this patch.  While it takes correctly advantage of
>> > the backward record read logic from SimpleXLogPageRead() able to
>> > handle correctly timeline jumps, it creates a hidden dependency in the
>> > code between the hash table from filemap.c and the page callback.
>> > Wouldn't it be simpler to build a list of the segment names using the
>> > information from WALOpenSegment and build this list in
>> > findLastCheckpoint()?
>> 
>> I think the first version of the patch more or less did that. Not
>> necessarily a list, but a hash table of WAL file names that we want to
>> keep. But Kyotaro Horiguchi didn't like it and suggested creating 
>> entries
>> in the filemap.c hash table instead.
>> But, I agree, doing it directly from the findLastCheckpoint() makes 
>> the
>> code easier to understand.
> ...
>> > +       /*
>> > +        * Some entries (WAL segments) already have an action assigned
>> > +        * (see SimpleXLogPageRead()).
>> > +        */
>> > +       if (entry->action == FILE_ACTION_UNDECIDED)
>> > +           entry->action = decide_file_action(entry);
>> >
>> > This change makes me a bit uneasy, per se my previous comment with the
>> > additional code dependencies.
>> >
>> 
>> We can revert to the original approach (see
>> v1-0001-pg_rewind-wal-deletion.patch from the very first email) if you 
>> like.
> 
> On the other hand, that approach brings in another source that
> suggests the way that file should be handled. I still think that
> entry->action should be the only source.

+1.
Imaging a case when we come to need decide how to treat files based on 
yet another factor, I feel that a single source of truth is better than 
creating a list or hash for each factor.

> However, it seems I'm in the
> minority here. So I'm not tied to that approach.
> 
>> > I think that this scenario deserves a test case.  If one wants to
>> > emulate a delay in WAL archiving, it is possible to set
>> > archive_command to a command that we know will fail, for instance.
>> >
>> 
>> Yes, I totally agree, it is on our radar, but meanwhile please see the 
>> new
>> version, just to check if I correctly understood your idea.

Thanks for the patch.
I tested v4 patch using the script attached below thread and it has 
successfully finished.

https://www.postgresql.org/message-id/2e75ae22dce9a227c3d47fa6d0ed094a%40oss.nttdata.com


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hi,

Please find attached v5.
What changed:
1. Now we collect which files should be kept  in a separate hash table.
2. Decision whether to keep the file is made only when the file is actually missing on the source. That is, remaining WAL files will be copied over as it currently is, although it could be extremely inefficient and unnecessary.
3. Added TAP test that actually at least one file isn't removed.

Regards,
--
Alexander Kukushkin
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hi,

Please find attached v6.
Changes compared to v5:
1. use "perl -e 'exit(1)'" instead of "false" as archive_command, so it also works on Windows
2. fixed the test name

Regards,
--
Alexander Kukushkin
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
torikoshia
Date:
Thanks for the patch.

I tested the v6 patch using the test script attached on [1], old primary 
has succeeded to become new standby.

I have very minor questions on the regression tests mainly regarding the 
consistency with other tests for pg_rewind:


> +setup_cluster;
> +create_standby;

Would it be better to add parentheses?
Also should we add "RewindTest::" for these function?


> +primary_psql("create table t(a int)");
> +primary_psql("insert into t values(0)");
> +primary_psql("select pg_switch_wal()");
..

Should 'select', 'create', etc be capitalized?


> my $false = "$^X -e 'exit(1)'";
I feel it's hard to understand what does this mean.
Isn't it better to add comments and describe this is for windows 
environments?


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hi,

On Wed, 18 Oct 2023 at 08:50, torikoshia <torikoshia@oss.nttdata.com> wrote:

I have very minor questions on the regression tests mainly regarding the
consistency with other tests for pg_rewind:

Please find attached a new version of the patch. It addresses all your comments.

Regards,
--
Alexander Kukushkin
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
torikoshia
Date:
On 2023-10-31 00:26, Alexander Kukushkin wrote:
> Hi,
> 
> On Wed, 18 Oct 2023 at 08:50, torikoshia <torikoshia@oss.nttdata.com>
> wrote:
> 
>> I have very minor questions on the regression tests mainly regarding
>> the
>> consistency with other tests for pg_rewind:
> 
> Please find attached a new version of the patch. It addresses all your
> comments.

Thanks for updating the patch!

> +extern void preserve_file(char *filepath);

Is this necessary?
This function was defined in older version patch, but no longer seems to 
exist.

+# We use "perl -e 'exit(1)'" as a alternative to "false", because the 
last one
'a' should be 'an'?


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hi Torikoshia,


On Thu, 2 Nov 2023 at 04:24, torikoshia <torikoshia@oss.nttdata.com> wrote:


> +extern void preserve_file(char *filepath);

Is this necessary?
This function was defined in older version patch, but no longer seems to
exist.

+# We use "perl -e 'exit(1)'" as a alternative to "false", because the
last one
'a' should be 'an'?


Thanks for the feedback

Please find the new version attached.

Regards,
--
Alexander Kukushkin

On Thu, 2 Nov 2023 at 04:24, torikoshia <torikoshia@oss.nttdata.com> wrote:
On 2023-10-31 00:26, Alexander Kukushkin wrote:
> Hi,
>
> On Wed, 18 Oct 2023 at 08:50, torikoshia <torikoshia@oss.nttdata.com>
> wrote:
>
>> I have very minor questions on the regression tests mainly regarding
>> the
>> consistency with other tests for pg_rewind:
>
> Please find attached a new version of the patch. It addresses all your
> comments.

Thanks for updating the patch!

> +extern void preserve_file(char *filepath);

Is this necessary?
This function was defined in older version patch, but no longer seems to
exist.

+# We use "perl -e 'exit(1)'" as a alternative to "false", because the
last one
'a' should be 'an'?


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation


--
Regards,
--
Alexander Kukushkin
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
torikoshia
Date:
On 2023-11-06 23:58, Alexander Kukushkin wrote:
> Hi Torikoshia,
> 
> On Thu, 2 Nov 2023 at 04:24, torikoshia <torikoshia@oss.nttdata.com>
> wrote:
> 
>>> +extern void preserve_file(char *filepath);
>> 
>> Is this necessary?
>> This function was defined in older version patch, but no longer
>> seems to
>> exist.
>> 
>> +# We use "perl -e 'exit(1)'" as a alternative to "false", because
>> the
>> last one
>> 'a' should be 'an'?
> 
> Thanks for the feedback
> 
> Please find the new version attached.
Thanks for the update!

I've set the CF entry to "Ready for Committer".

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: pg_rewind WAL segments deletion pitfall

From
Peter Smith
Date:
2024-01 Commitfest.

Hi, This patch has a CF status of "Ready for Committer", but it is
currently failing some CFbot tests [1]. Please have a look and post an
updated version..

======
[1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/3874

Kind Regards,
Peter Smith.



Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hi Peter,

On Mon, 22 Jan 2024 at 00:38, Peter Smith <smithpb2250@gmail.com> wrote:
2024-01 Commitfest.

Hi, This patch has a CF status of "Ready for Committer", but it is
currently failing some CFbot tests [1]. Please have a look and post an
updated version..

======
[1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/3874


From what I can see all failures are not related to this patch:
1. Windows build failed with 
[10:52:49.679] 126/281 postgresql:recovery / recovery/019_replslot_limit ERROR 185.84s (exit status 255 or signal 127 SIGinvalid)
2. FreeBSD build failed with
[09:11:57.656] 190/285 postgresql:psql / psql/010_tab_completion ERROR 0.46s exit status 2
[09:11:57.656] 220/285 postgresql:authentication / authentication/001_password ERROR 0.57s exit status 2

In fact, I don't even see this patch being applied for these builds and the introduced TAP test being executed.

Regards,
--
Alexander Kukushkin

Re: pg_rewind WAL segments deletion pitfall

From
Sutou Kouhei
Date:
Hi,

I'm reviewing patches in Commitfest 2024-07 from top to bottom:
https://commitfest.postgresql.org/48/

This is the 1st patch:
https://commitfest.postgresql.org/48/3874/

The latest patch can't be applied on master:
https://www.postgresql.org/message-id/CAFh8B=nNJtm9ke4_1mhpwGz2PV9yoyF6hMnYh5XACt0AA4VG-A@mail.gmail.com

I've rebased on master. See the attached patch.
Here are changes for it:

* Resolve conflict
* Update copyright year to 2024 from 2023
* Add an added test to meson.build
* Run pgindent

Here are my review comments:

@@ -217,6 +221,26 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
+            char        xlogfname[MAXFNAMELEN];
+
+            tli = xlogreader->seg.ws_tli;
+            segno = xlogreader->seg.ws_segno;
+
+            snprintf(xlogfname, MAXPGPATH, XLOGDIR "/");
+            XLogFileName(xlogfname + strlen(xlogfname),
+                         xlogreader->seg.ws_tli,
+                         xlogreader->seg.ws_segno, WalSegSz);
+
+            /*
+             * Make sure pg_rewind doesn't remove this file, because it is
+             * required for postgres to start after rewind.
+             */
+            insert_keepwalhash_entry(xlogfname);

MAXFNAMELEN is 64 and MAXPGPATH is 1024. strlen(XLOGDIR "/")
is 7 because XLOGDIR is "pg_wal". So xlogfname has enough
size but snprintf(xlogfname, MAXPGPATH) is wrong usage.
(And XLogFileName() uses snprintf(xlogfname, MAXFNAMELEN)
internally.)

How about using one more buffer?

----
char        xlogpath[MAXPGPATH];
char        xlogfname[MAXFNAMELEN];

tli = xlogreader->seg.ws_tli;
segno = xlogreader->seg.ws_segno;

XLogFileName(xlogfname,
             xlogreader->seg.ws_tli,
             xlogreader->seg.ws_segno, WalSegSz);
snprintf(xlogpath, MAXPGPATH, "%s/%s", XLOGDIR, xlogfname);

/*
 * Make sure pg_rewind doesn't remove this file, because it is
 * required for postgres to start after rewind.
 */
insert_keepwalhash_entry(xlogpath);
----


Thanks,
-- 
kou

In <CAFh8B=mDDZEsK0jDMfvP3MmxkWaeTCxW4yN42OH33JY6sQWS5Q@mail.gmail.com>
  "Re: pg_rewind WAL segments deletion pitfall" on Tue, 23 Jan 2024 09:23:29 +0100,
  Alexander Kukushkin <cyberdemn@gmail.com> wrote:

> Hi Peter,
> 
> On Mon, 22 Jan 2024 at 00:38, Peter Smith <smithpb2250@gmail.com> wrote:
> 
>> 2024-01 Commitfest.
>>
>> Hi, This patch has a CF status of "Ready for Committer", but it is
>> currently failing some CFbot tests [1]. Please have a look and post an
>> updated version..
>>
>> ======
>> [1]
>> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/3874
>>
>>
> From what I can see all failures are not related to this patch:
> 1. Windows build failed with
> [10:52:49.679] 126/281 postgresql:recovery / recovery/019_replslot_limit
> ERROR 185.84s (exit status 255 or signal 127 SIGinvalid)
> 2. FreeBSD build failed with
> [09:11:57.656] 190/285 postgresql:psql / psql/010_tab_completion ERROR
> 0.46s exit status 2
> [09:11:57.656] 220/285 postgresql:authentication /
> authentication/001_password ERROR 0.57s exit status 2
> 
> In fact, I don't even see this patch being applied for these builds and the
> introduced TAP test being executed.
> 
> Regards,
> --
> Alexander Kukushkin
From 38edc07c49970ad0ac1818284f2e59e26591b3a4 Mon Sep 17 00:00:00 2001
From: Alexander Kukushkin <cyberdemn@gmail.com>
Date: Sun, 6 Aug 2023 15:56:39 +0100
Subject: [PATCH v9] Be more picky with WAL segment deletion in pg_rewind

Make pg_rewind to be a bit wiser in terms of creating filemap:
preserve on the target all WAL segments that contain records between the
last common checkpoint and the point of divergence.

Co-authored-by: Polina Bungina <bungina@gmail.com>
---
 src/bin/pg_rewind/filemap.c                   | 62 +++++++++++++++++-
 src/bin/pg_rewind/filemap.h                   |  3 +
 src/bin/pg_rewind/meson.build                 |  1 +
 src/bin/pg_rewind/parsexlog.c                 | 24 +++++++
 src/bin/pg_rewind/pg_rewind.c                 |  3 +
 src/bin/pg_rewind/t/010_keep_recycled_wals.pl | 65 +++++++++++++++++++
 6 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/pg_rewind/t/010_keep_recycled_wals.pl

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 4458324c9d8..b357c28338a 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -63,6 +63,28 @@ static file_entry_t *lookup_filehash_entry(const char *path);
 static int    final_filemap_cmp(const void *a, const void *b);
 static bool check_file_excluded(const char *path, bool is_source);
 
+typedef struct skipwal_t
+{
+    const char *path;
+    uint32        status;
+}            skipwal_t;
+
+#define SH_PREFIX        keepwalhash
+#define SH_ELEMENT_TYPE    skipwal_t
+#define SH_KEY_TYPE        const char *
+#define    SH_KEY            path
+#define SH_HASH_KEY(tb, key)    hash_string(key)
+#define SH_EQUAL(tb, a, b)        (strcmp(a, b) == 0)
+#define    SH_SCOPE        static inline
+#define SH_RAW_ALLOCATOR    pg_malloc0
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static keepwalhash_hash * keepwalhash = NULL;
+
+static bool keepwalhash_entry_exists(const char *path);
+
 /*
  * Definition of one element part of an exclusion list, used to exclude
  * contents when rewinding.  "name" is the name of the file or path to
@@ -206,6 +228,35 @@ lookup_filehash_entry(const char *path)
     return filehash_lookup(filehash, path);
 }
 
+/* Initialize a hash table to store WAL file names that must be kept */
+void
+keepwalhash_init(void)
+{
+    keepwalhash = keepwalhash_create(FILEHASH_INITIAL_SIZE, NULL);
+}
+
+/* Prevent a given file deletion during rewind */
+void
+insert_keepwalhash_entry(const char *path)
+{
+    skipwal_t  *entry;
+    bool        found;
+
+    /* Should only be called with keepwalhash initialized */
+    Assert(keepwalhash);
+
+    entry = keepwalhash_insert(keepwalhash, path, &found);
+
+    if (!found)
+        entry->path = pg_strdup(path);
+}
+
+static bool
+keepwalhash_entry_exists(const char *path)
+{
+    return keepwalhash_lookup(keepwalhash, path) != NULL;
+}
+
 /*
  * Callback for processing source file list.
  *
@@ -685,7 +736,16 @@ decide_file_action(file_entry_t *entry)
     }
     else if (entry->target_exists && !entry->source_exists)
     {
-        /* File exists in target, but not source. Remove it. */
+        /* File exists in target, but not source. */
+
+        if (keepwalhash_entry_exists(path))
+        {
+            /* This is a WAL file that should be kept. */
+            pg_log_debug("Not removing %s because it is required for recovery", path);
+            return FILE_ACTION_NONE;
+        }
+
+        /* Otherwise remove an unexpected file. */
         return FILE_ACTION_REMOVE;
     }
     else if (!entry->target_exists && !entry->source_exists)
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 007e0f17cf4..0cb6fcae00c 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -110,4 +110,7 @@ extern filemap_t *decide_file_actions(void);
 extern void calculate_totals(filemap_t *filemap);
 extern void print_filemap(filemap_t *filemap);
 
+extern void keepwalhash_init(void);
+extern void insert_keepwalhash_entry(const char *path);
+
 #endif                            /* FILEMAP_H */
diff --git a/src/bin/pg_rewind/meson.build b/src/bin/pg_rewind/meson.build
index e0f88bde221..200ebf84eb9 100644
--- a/src/bin/pg_rewind/meson.build
+++ b/src/bin/pg_rewind/meson.build
@@ -43,6 +43,7 @@ tests += {
       't/007_standby_source.pl',
       't/008_min_recovery_point.pl',
       't/009_growing_files.pl',
+      't/010_keep_recycled_wals.pl',
     ],
   },
 }
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 22f7351fdcd..7329c06d8fa 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -176,6 +176,10 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
     char       *errormsg;
     XLogPageReadPrivate private;
 
+    /* Track WAL segments opened while searching a checkpoint */
+    XLogSegNo    segno = 0;
+    TimeLineID    tli = 0;
+
     /*
      * The given fork pointer points to the end of the last common record,
      * which is not necessarily the beginning of the next record, if the
@@ -217,6 +221,26 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
                          LSN_FORMAT_ARGS(searchptr));
         }
 
+        /* We are trying to detect if the new WAL file was opened */
+        if (xlogreader->seg.ws_tli != tli || xlogreader->seg.ws_segno != segno)
+        {
+            char        xlogfname[MAXFNAMELEN];
+
+            tli = xlogreader->seg.ws_tli;
+            segno = xlogreader->seg.ws_segno;
+
+            snprintf(xlogfname, MAXPGPATH, XLOGDIR "/");
+            XLogFileName(xlogfname + strlen(xlogfname),
+                         xlogreader->seg.ws_tli,
+                         xlogreader->seg.ws_segno, WalSegSz);
+
+            /*
+             * Make sure pg_rewind doesn't remove this file, because it is
+             * required for postgres to start after rewind.
+             */
+            insert_keepwalhash_entry(xlogfname);
+        }
+
         /*
          * Check if it is a checkpoint record. This checkpoint record needs to
          * be the latest checkpoint before WAL forked and not the checkpoint
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0841ab4135b..48c11417b23 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -455,6 +455,9 @@ main(int argc, char **argv)
         exit(0);
     }
 
+    /* Hash to memorize WAL files that should be kept */
+    keepwalhash_init();
+
     findLastCheckpoint(datadir_target, divergerec, lastcommontliIndex,
                        &chkptrec, &chkpttli, &chkptredo, restore_command);
     pg_log_info("rewinding from last common checkpoint at %X/%X on timeline %u",
diff --git a/src/bin/pg_rewind/t/010_keep_recycled_wals.pl b/src/bin/pg_rewind/t/010_keep_recycled_wals.pl
new file mode 100644
index 00000000000..65caaf2faa2
--- /dev/null
+++ b/src/bin/pg_rewind/t/010_keep_recycled_wals.pl
@@ -0,0 +1,65 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+#
+# Test situation where a target data directory contains
+# WAL files that were already recycled by the new primary.
+#
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use RewindTest;
+
+RewindTest::setup_cluster();
+$node_primary->enable_archiving();
+RewindTest::start_primary();
+
+RewindTest::create_standby();
+$node_standby->enable_restoring($node_primary, 0);
+$node_standby->reload();
+
+RewindTest::primary_psql("CHECKPOINT");  # last common checkpoint
+
+# We use "perl -e 'exit(1)'" as an alternative to "false", because the last one
+# might not be available on Windows, but we want to run tests cross-platform.
+my $false = "$^X -e 'exit(1)'";
+$node_primary->append_conf(
+    'postgresql.conf', qq(
+archive_command = '$false'
+));
+$node_primary->reload();
+
+# advance WAL on the primary; WAL segment will never make it to the archive
+RewindTest::primary_psql("CREATE TABLE t(a int)");
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+RewindTest::promote_standby;
+
+# new primary loses diverging WAL segment
+RewindTest::standby_psql("INSERT INTO t values(0)");
+RewindTest::standby_psql("SELECT pg_switch_wal()");
+
+$node_standby->stop();
+$node_primary->stop();
+
+my ($stdout, $stderr) = run_command(
+    [
+        'pg_rewind', '--debug',
+        '--source-pgdata', $node_standby->data_dir,
+        '--target-pgdata', $node_primary->data_dir,
+        '--no-sync',
+    ]);
+
+like(
+    $stderr,
+    qr/Not removing pg_wal.* because it is required for recovery/,
+    "some WAL files were skipped");
+
+done_testing();
-- 
2.45.2


Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hi Sutou,

Thank you for picking it up!

On Fri, 12 Jul 2024 at 09:24, Sutou Kouhei <kou@clear-code.com> wrote:

Here are my review comments:

@@ -217,6 +221,26 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
+                       char            xlogfname[MAXFNAMELEN];
+
+                       tli = xlogreader->seg.ws_tli;
+                       segno = xlogreader->seg.ws_segno;
+
+                       snprintf(xlogfname, MAXPGPATH, XLOGDIR "/");
+                       XLogFileName(xlogfname + strlen(xlogfname),
+                                                xlogreader->seg.ws_tli,
+                                                xlogreader->seg.ws_segno, WalSegSz);
+
+                       /*
+                        * Make sure pg_rewind doesn't remove this file, because it is
+                        * required for postgres to start after rewind.
+                        */
+                       insert_keepwalhash_entry(xlogfname);

MAXFNAMELEN is 64 and MAXPGPATH is 1024. strlen(XLOGDIR "/")
is 7 because XLOGDIR is "pg_wal". So xlogfname has enough
size but snprintf(xlogfname, MAXPGPATH) is wrong usage.
(And XLogFileName() uses snprintf(xlogfname, MAXFNAMELEN)
internally.)

Nice catch!

I don't think we need another buffer here, just need to use MAXFNAMELEN, because strlen("pg_wal/$wal_filename") + 1 = 32 perfectly fits into 64 bytes.

The new version is attached.

Regards,
--
Alexander Kukushkin
Attachment

Re: pg_rewind WAL segments deletion pitfall

From
Alexander Kukushkin
Date:
Hi Alvaro,

The commit message looks good to me, except maybe using a "master" word, which I would suggest to replace with "primary".

And a small nitpick:
+/*
+ * Initialize a hash table to store WAL file names that must be kept.
+ */
+void
+keepwal_init(void)
+{
+   /*
+    * This hash table is empty in the vast majority of cases, so set an
+    * initial size of 0.
+    */
+   keepwal = keepwal_create(0, NULL);
+}

I don't think that the hash table will be empty. It needs to hold all WAL filenames starting from the last checkpoint and up to divergent point.
On loaded clusters it could be hundreds and thousands of files.

On Tue, 12 Nov 2024 at 20:18, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Hello

After reading the whole thread a couple of times to make sure I
understood the problem correctly, I think the approach in the v10 patch
is a reasonable one.  I agree that it's better for maintainability to
keep a separate hash table.  I made some cosmetic adjustments -- didn't
find any fault in the code.  I also verified that the test is hitting
the problematic case.

So here's v11, which I'll try to get out tomorrow.

I also assessed back-patching this.  There's some conflicts, but nothing
serious, back to 14.  Unfortunately, 13 lacks requisite infrastructure,
so I think it'll have to stay as it is.  (12 is dead.)

Can you please verify that my explanation in the commit message is
sound?

Thanks

--
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)


Regards,
--
Alexander Kukushkin

Re: pg_rewind WAL segments deletion pitfall

From
Alvaro Herrera
Date:
Hello Alexander,

On 2024-Nov-14, Alexander Kukushkin wrote:

> The commit message looks good to me, except maybe using a "master" word,
> which I would suggest to replace with "primary".

Oh wow, thanks for noticing that.  I had already rewritten the commit
message to some extent, but "master" had remained.  Now I pushed the
patch to branches 14+, having replaced it as you suggested.

(This reminds me that I used to have a notification set in the
2ndQuadrant Mattermost instance so that I could LART anybody who used
the words 'master' or 'slave' in the chats there).

> +   /*
> +    * This hash table is empty in the vast majority of cases, so set an
> +    * initial size of 0.
> +    */
> +   keepwal = keepwal_create(0, NULL);
> +}
> 
> I don't think that the hash table will be empty. It needs to hold all WAL
> filenames starting from the last checkpoint and up to divergent point.
> On loaded clusters it could be hundreds and thousands of files.

Oh, okay.  The initial size is just there to avoid having to grow the
hash table, but using the same constant that we use for the filemap hash
table seems good enough ... it shouldn't make much of a difference in
practice.

Regards

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.



Re: pg_rewind WAL segments deletion pitfall

From
Antonin Houska
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Oh wow, thanks for noticing that.  I had already rewritten the commit
> message to some extent, but "master" had remained.  Now I pushed the
> patch to branches 14+, having replaced it as you suggested.

When doing some unrelated work I noticed that in the new test
010_keep_recycled_wals.pl the server fails to reload the configuration
file. The line it complains about is

archive_command = '/usr/bin/perl -e 'exit(1)''

The test still succeeds for some reason.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com