Thread: Recovery of .partial WAL segments
Dear hackers,
Generating a ".partial" WAL segment is pretty common nowadays (using pg_receivewal or during standby promotion).
However, we currently don't do anything with it unless the user manually removes that ".partial" extension.
The 028_pitr_timelines tests are highlighting that fact: with test data being in 000000020000000000000003 and 000000010000000000000003.partial, a recovery following the latest timeline (2) will succeed but fail if we follow the current timeline (1).
By simply trying to fetch the ".partial" file in XLogFileRead, we can easily recover more data and also cover that (current timeline) recovery case.
So, this proposed patch makes XLogFileRead try to restore ".partial" WAL archives and adds a test to 028_pitr_timelines using the current recovery_target_timeline.
As far as I've seen, the current pg_receivewal tests only seem to cover generating the archives, not actually recovering from them. I wasn't sure it was interesting to add such tests right now, so I didn't consider it for this patch.
Many thanks in advance for your feedback and thoughts about this,
Kind Regards,
--
Stefan FERCOT
Data Egret (https://dataegret.com)
Attachment
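To make the lookup order described above concrete, here is a minimal C sketch. It is not the patch itself. It builds the 24-character segment name from a timeline ID and a segment number, then checks an archive for the complete segment before falling back to the same name with ".partial" appended. The names wal_file_name, find_segment, archive_has_fn and demo_archive_has are hypothetical, and the archive lookup is only a stand-in for whatever restore_command would do; the real change is in XLogFileRead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_FNAME_LEN 24            /* three 8-digit hex fields */
#define PARTIAL_SUFFIX ".partial"

/* Hypothetical stand-in for an archive lookup (what restore_command does). */
typedef bool (*archive_has_fn) (const char *fname);

/* Build the segment file name from timeline ID and segment number. */
static void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno,
              uint64_t wal_segment_size)
{
    uint64_t    segs_per_xlogid = UINT64_C(0x100000000) / wal_segment_size;

    assert(buflen > WAL_FNAME_LEN);
    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_xlogid),
             (unsigned int) (segno % segs_per_xlogid));
}

/*
 * Look for the complete segment first, then for "<segment>.partial".
 * On success, "out" holds the name that was found.
 */
static bool
find_segment(char *out, size_t outlen, uint32_t tli, uint64_t segno,
             uint64_t wal_segment_size, archive_has_fn archive_has)
{
    char        base[WAL_FNAME_LEN + 1];

    assert(outlen >= WAL_FNAME_LEN + sizeof(PARTIAL_SUFFIX));
    wal_file_name(base, sizeof(base), tli, segno, wal_segment_size);

    snprintf(out, outlen, "%s", base);
    if (archive_has(out))
        return true;

    snprintf(out, outlen, "%s%s", base, PARTIAL_SUFFIX);
    return archive_has(out);
}

/* Demo archive holding the two files mentioned for 028_pitr_timelines. */
static bool
demo_archive_has(const char *fname)
{
    static const char *const files[] = {
        "000000020000000000000003",
        "000000010000000000000003.partial",
    };

    for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++)
    {
        if (strcmp(files[i], fname) == 0)
            return true;
    }
    return false;
}
```

With that demo archive and 16MB segments, asking for segment 3 on timeline 1 resolves to 000000010000000000000003.partial, while timeline 2 resolves to the complete segment 000000020000000000000003.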
Hi,
I've added a CF entry for this patch: https://commitfest.postgresql.org/49/5148/
Not sure why CFbot CI fails on macOS/Windows while it works with the Github CI on my fork (https://cirrus-ci.com/github/pgstef/postgres/partial-walseg-recovery).
Many thanks in advance for your feedback and thoughts about this patch,
Kind Regards,
--
Stefan FERCOT
Data Egret (https://dataegret.com)
On Thu, Aug 1, 2024 at 10:23 PM Stefan Fercot <stefan.fercot@protonmail.com> wrote:
Dear hackers,
Generating a ".partial" WAL segment is pretty common nowadays (using pg_receivewal or during standby promotion).
However, we currently don't do anything with it unless the user manually removes that ".partial" extension.
The 028_pitr_timelines tests are highlighting that fact: with test data being in 000000020000000000000003 and 000000010000000000000003.partial, a recovery following the latest timeline (2) will succeed but fail if we follow the current timeline (1).
By simply trying to fetch the ".partial" file in XLogFileRead, we can easily recover more data and also cover that (current timeline) recovery case.
So, this proposed patch makes XLogFileRead try to restore ".partial" WAL archives and adds a test to 028_pitr_timelines using current recovery_target_timeline.
As far as I've seen, the current pg_receivewal tests only seem to cover generating the archives, not actually recovering from them. I wasn't sure it was interesting to add such tests right now, so I didn't consider it for this patch.
Many thanks in advance for your feedback and thoughts about this,
Kind Regards,
--
Stefan FERCOT
Data Egret (https://dataegret.com)
> On Fri, Aug 02, 2024 at 08:47:02AM GMT, Stefan Fercot wrote:
>
> Not sure why CFbot CI fails on macOS/Windows while it works with the Github
> CI on my fork (https://cirrus-ci.com/github/pgstef/postgres/partial-walseg-recovery).
I guess it's because the test has to wait a bit after the node has been started until the log lines appear. One can see it in the node_pitr3 logs: first it was hit by
SELECT pg_is_in_recovery() = 'f'
and only some moments later produced
restored log file "000000010000000000000003.partial" from archive
whereas the test performs those operations in the reverse order. Seems like the retry loop from 019_replslot_limit might help.
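The timing issue described above is the usual case for a bounded retry loop: poll the log until the expected line shows up, and give up after a fixed number of attempts. The actual test is written in Perl, so the following C sketch only illustrates the idea. The names log_contains, wait_for_log and the pause_fn callback are hypothetical and are not taken from 019_replslot_limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if any line of the file at "path" contains "needle".
 * A missing file counts as "not found yet". Lines longer than the
 * buffer are read in pieces, so a needle straddling a piece boundary
 * would be missed; good enough for a sketch.
 */
static bool
log_contains(const char *path, const char *needle)
{
    char        line[1024];
    bool        found = false;
    FILE       *fp = fopen(path, "r");

    if (fp == NULL)
        return false;
    while (!found && fgets(line, sizeof(line), fp) != NULL)
        found = (strstr(line, needle) != NULL);
    fclose(fp);
    return found;
}

/*
 * Poll the log up to max_attempts times. pause_fn, if not NULL, is
 * called after each unsuccessful attempt (for example, to sleep).
 */
static bool
wait_for_log(const char *path, const char *needle, int max_attempts,
             void (*pause_fn) (void))
{
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++)
    {
        if (log_contains(path, needle))
            return true;
        if (pause_fn != NULL)
            pause_fn();
    }
    return false;
}
```

The point of the bounded loop is that the check no longer depends on the log line already being present when the test first looks for it, while a genuinely missing line still fails after max_attempts tries.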
Hi,
On Fri, Aug 9, 2024 at 4:29 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Seems like the retry loop from 019_replslot_limit might help.
Thanks for the tip. Attached v2 adds the retry loop in the test which would hopefully fix the cfbot.
Kind Regards,
Stefan
Attachment
Hi there! It looks good to me! But I have a question. How is the partialxlogfname freed in case of an error?
Best regards, Stepan Neretin.
Hi,
On Fri, Oct 18, 2024 at 11:07 AM Stepan Neretin <sndcppg@gmail.com> wrote:
Hi there! It looks good to me!
Thanks!
But I have a question. How is the partialxlogfname freed in case of an error?
I'm not sure I understand your question. What kind of error would you expect exactly?
I mostly added a pfree there because we don't need it further. But why would it be more of a problem compared to other variables/pointers?
Kind regards,
Stefan
On Fri, 5 Apr 2024 at 11:45, Stefan Fercot <stefan.fercot@protonmail.com> wrote:
>
> Dear hackers,
>
> Generating a ".partial" WAL segment is pretty common nowadays (using pg_receivewal or during standby promotion).
> However, we currently don't do anything with it unless the user manually removes that ".partial" extension.
>
> The 028_pitr_timelines tests are highlighting that fact: with test data being in 000000020000000000000003 and 000000010000000000000003.partial, a recovery following the latest timeline (2) will succeed but fail if we follow the current timeline (1).
>
> By simply trying to fetch the ".partial" file in XLogFileRead, we can easily recover more data and also cover that (current timeline) recovery case.
>
> So, this proposed patch makes XLogFileRead try to restore ".partial" WAL archives and adds a test to 028_pitr_timelines using current recovery_target_timeline.
Does this path only get hit when we don't already have any WAL segments (or partial segments) left for that timeline? I'm a bit worried about overwriting existing (partial) segments that may have more WAL than what we can get from archives.
(patch v2)
> + restoredArchivedFile = !RestoreArchivedFile(path, xlogfname,
> + "RECOVERYXLOG",
> + wal_segment_size,
> + InRedo) &&
> + !RestoreArchivedFile(path, partialxlogfname,
> "RECOVERYXLOG",
> wal_segment_size,
> - InRedo))
> + InRedo);
The value of restoredArchivedFile is inverted relative to what its name indicates: it is true when we failed to restore an archived xlog segment, and false if we did succeed.
I'm also not a fan of the additional allocation of partialxlogfname in this code. It could well do without, by "just" reusing the xlogfname scratch space when we fail to recover the full segment.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
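The two review points above can be sketched together in C. This is an assumption-laden illustration, not the revised patch. The flag is true only when a restore succeeded, and the ".partial" suffix is appended in place to the existing name buffer instead of allocating a second string. The names restore_fn, restore_segment_or_partial, demo_restore and SCRATCH_LEN are hypothetical. The callback stands in for RestoreArchivedFile and is assumed here to return true on success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SCRATCH_LEN 64              /* hypothetical scratch buffer size */
#define PARTIAL_SUFFIX ".partial"

/* Hypothetical stand-in for the restore call; assumed true on success. */
typedef bool (*restore_fn) (const char *fname);

/*
 * Try to restore "xlogfname"; if that fails, retry with ".partial"
 * appended in the same buffer. Returns true when either attempt
 * succeeded. If both fail, the buffer is put back to the original name.
 */
static bool
restore_segment_or_partial(char *xlogfname, size_t buflen, restore_fn restore)
{
    bool        restored = restore(xlogfname);

    if (!restored)
    {
        size_t      len = strlen(xlogfname);

        /* sizeof() counts the terminating NUL, so this checks the fit. */
        assert(len + sizeof(PARTIAL_SUFFIX) <= buflen);
        memcpy(xlogfname + len, PARTIAL_SUFFIX, sizeof(PARTIAL_SUFFIX));

        restored = restore(xlogfname);
        if (!restored)
            xlogfname[len] = '\0';  /* undo the suffix for error reporting */
    }
    return restored;
}

/* Demo archive that only holds the partial segment of timeline 1. */
static bool
demo_restore(const char *fname)
{
    return strcmp(fname, "000000010000000000000003.partial") == 0;
}
```

Keeping the flag positive ("restored") means callers can test it directly, and reusing the buffer avoids the extra allocation, so the question of when that allocation gets freed no longer arises.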