Re: pg_rewind WAL segments deletion pitfall - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Re: pg_rewind WAL segments deletion pitfall |
Date | |
Msg-id | 20220826.170432.669257260859851674.horikyota.ntt@gmail.com Whole thread Raw |
Responses |
Re: pg_rewind WAL segments deletion pitfall
|
List | pgsql-hackers |
(Moved to -hackers) At Thu, 25 Aug 2022 10:34:40 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in > > # killall -9 postgres > > # rm -r oldprim newprim oldarch newarch oldprim.log newprim.log > > mkdir newarch oldarch > > initdb -k -D oldprim > > echo "archive_mode = 'always'">> oldprim/postgresql.conf > > > > With archive_mode = always you can't reproduce it. > It is very rarely people set it to always in production due to the overhead. ... > The archive_mode has to be set to on and the archive_command should be > failing when you do pg_ctl -D oldprim stop Ah, I see. What I don't still understand is why pg_rewind doesn't work for the old primary in that case. When archive_mode=on, the old primary has the complete set of WAL files counting both pg_wal and its archive. So as the same to the privious repro, pg_rewind -c ought to work (but it uses its own archive this time). In that sense the proposed solution is still not needed in this case. A bit harder situation comes after the server successfully rewound; if the new primary goes so far that the old primary cannot connect. Even in that case, you can copy-in the requried WAL files or configure restore_command of the old pimary so that it finds required WAL files there. As the result the system in total doesn't lose a WAL file. So.. I might still be missing something.. ############################### # killall -9 postgres # rm -r oldprim newprim oldarch newarch oldprim.log newprim.log mkdir newarch oldarch initdb -k -D oldprim echo "archive_mode = 'on'">> oldprim/postgresql.conf echo "archive_command = 'cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start psql -p 5432 -c 'create table t(a int)' pg_basebackup -D newprim -p 5432 echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf echo "archive_command = 'cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf touch newprim/standby.signal pg_ctl -D newprim -o '-p 5433' -l newprim.log start # the last common checkpoint psql -p 5432 -c 'checkpoint' # record approx. diverging WAL segment start_wal=`psql -p 5433 -Atc 'select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name= 'wal_segment_size')::int); ` psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();' pg_ctl -D newprim promote psql -p 5433 -c 'checkpoint' # old rprimary loses diverging WAL segment for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done # old primary cannot archive any more echo "archive_command = 'false'">> oldprim/postgresql.conf pg_ctl -D oldprim reload pg_ctl -D oldprim stop # rewind the old primary, using its own archive # pg_rewind -D oldprim --source-server='port=5433' # should fail echo "restore_command = 'cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf pg_rewind -D oldprim --source-server='port=5433' -c # advance WAL on the old primary; new primary loses the launching WAL seg for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done psql -p 5433 -c 'checkpoint' echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf touch oldprim/standby.signal postgres -D oldprim # fails with "WAL file has been removed" # The alternative of copying-in # echo "restore_command = 'cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf # copy-in WAL files from new primary's archive to old primary (cd newarch; for f in `ls`; do if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi done) postgres -D oldprim -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: