Re: pg_rewind WAL segments deletion pitfall - Mailing list pgsql-hackers
From | torikoshia |
---|---|
Subject | Re: pg_rewind WAL segments deletion pitfall |
Date | |
Msg-id | a06c11b58c462f314a061ad64b3f6353@oss.nttdata.com |
In response to | Re: pg_rewind WAL segments deletion pitfall (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Responses | Re: pg_rewind WAL segments deletion pitfall |
List | pgsql-hackers |
On 2023-06-29 10:25, Kyotaro Horiguchi wrote:

Thanks for the comment!

> At Wed, 28 Jun 2023 22:28:13 +0900, torikoshia
> <torikoshia@oss.nttdata.com> wrote in
>>
>> On 2022-09-29 17:18, Polina Bungina wrote:
>> > I agree with your suggestions, so here is the updated version of
>> > patch. Hope I haven't missed anything.
>> > Regards,
>> > Polina Bungina
>>
>> Thanks for working on this!
>> It seems like we are also facing the same issue.
>
> Thanks for looking this.
>
>> I tested the v3 patch under our condition, old primary has succeeded
>> to become new standby.
>>
>>
>> BTW when I used pg_rewind-removes-wal-segments-reproduce.sh attached
>> in [1], old primary also failed to become standby:
>>
>> FATAL: could not receive data from WAL stream: ERROR: requested WAL
>> segment 000000020000000000000007 has already been removed
>>
>> However, I think this is not a problem: just adding restore_command
>> like below fixed the situation.
>>
>> echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'"
>> >> oldprim/postgresql.conf
>
> I thought on the same line at first, but that's not the point
> here.

Yes. I don't think adding restore_command solves the problem; a
modification that prevents pg_rewind from deleting necessary WAL,
like the proposed patch, is needed.

I added restore_command because
pg_rewind-removes-wal-segments-reproduce.sh failed to catch up even
after applying the v3 patch and preventing pg_rewind from deleting
WAL files(*), since some necessary WAL files had been archived.
That is not the problem we are discussing here, but I wanted to get
the script to work to the point where the old primary could
successfully catch up to the new primary.

(*) Specifically, when running the script without applying the patch,
recovery failed because 000000010000000000000003 had already been
removed. This file was deleted by pg_rewind, as we know.
OTOH, without the restore_command, recovery failed because
000000020000000000000007 had already been removed, even after
applying the patch.

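For readers who want to see what the restore_command in the quoted echo amounts to, here is a minimal sketch. It only simulates the substitution of %f (the requested file name) and %p (the destination path) with plain cp; it does not start a server or run pg_rewind. The temporary directories, the `restore` helper, and the "absent" segment name are assumptions made up for illustration, and the segment file holds placeholder bytes rather than real WAL.

```shell
# Sketch only: mimic restore_command = 'cp <archive>/%f %p' outside PostgreSQL.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

archive="$workdir/newarch"         # stand-in for the new primary's archive
pg_wal="$workdir/oldprim/pg_wal"   # stand-in for the old primary's pg_wal
mkdir -p "$archive" "$pg_wal"

segment=000000020000000000000007
printf 'placeholder, not real WAL\n' > "$archive/$segment"

# Hypothetical helper: $1 plays the role of %f, $2 the role of %p.
restore() {
    cp "$archive/$1" "$2" 2>/dev/null
}

# A segment that exists in the archive is copied and the command returns 0.
restore "$segment" "$pg_wal/$segment" && status_present=0 || status_present=$?

# A segment that is absent from the archive (name assumed for illustration)
# makes the command return nonzero, which tells recovery that the file is
# not available there.
absent=0000000200000000000000FF
restore "$absent" "$pg_wal/$absent" && status_missing=0 || status_missing=$?

echo "present: $status_present, missing: $status_missing"
```

This also shows why restore_command is only a workaround in this thread: it can fetch back a segment that is in the archive, but it does nothing for a segment that was deleted locally and is not archived.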
> The problem we want to address is that pg_rewind ultimately
> removes certain crucial WAL files required for the new primary to
> start, despite them being present previously.

I think this should be "old primary", not "new primary".

> In other words, that
> restore_command works, but it only undoes what pg_rewind wrongly did,
> resulting in unnecessary consumption of I/O and/or network bandwidth
> that essentially serves no purpose.

As far as I tested, both with the script and in the situation we are
facing, the necessary WAL files (000000010000000000000003..) were not
available after promoting newprim, and just adding restore_command
did not solve the problem.

> pg_rewind already has a feature that determines how each file should
> be handled, but it is currently making wrong decisions for WAL
> files. The goal here is to rectify this behavior and ensure that
> pg_rewind makes the right decisions.

+1

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION