On Thu, Mar 14, 2024 at 09:12:52AM +1300, David Steele wrote:
> I think you are right that the start message is better since it can only
> appear once when the backup_label is found. The completed message could in
> theory appear after a restart, though the backup_label must have been found
> at some point.
So, I've given a try to this patch with 99b4a63bef94, to note that
sidewinder failed because of a timing issue on Windows: the recovery
of the node without backup_label, expected to fail, would try to
backup the last segment it has replayed, because, as it has no
backup_label, it behaves like the primary. It would try to use the
same archive location as the primary, leading to a conflict failure on
Windows. This one was easy to fix, by overwritting postgresql.conf on
the node to not do archiving.
Following that, I've noticed a second race condition: we don't wait
for the segment after pg_switch_wal() to be archived. This one can be
easily avoided with a poll on pg_stat_archiver.
After that, also because I've initially managed to, cough, forget an
update of meson.build to list the new test, I've noticed a third
failure on Windows for the case of the node that has a backup_label.
Here is one of the failures:
https://cirrus-ci.com/task/5245341683941376
regress_log_042_low_level_backup and
042_low_level_backup_replica_success.log have all the information
needed, that can be summarized like that:
The system cannot find the file specified.
2024-03-14 06:02:37.670 GMT [560][startup] FATAL: invalid data in file "backup_label"
The first message is something new to me, that seems to point to a
corruption failure of the file system. Why don't we see something
similar in other tests, then? Leaving that aside..
The second LOG is something that can be acted on. I've added some
debugging to the parsing of the backup_label file in the backend, and
noticed that the first fscanf() for START WAL LOCATION is failing
because the last %c is detected as \r rather than \n. Tweaking the
contents stored from pg_backend_stop() with a sed won't help, because
the issue is that we write the CRLFs with append_to_file, and the
startup process cannot cope with that. The simplest method I can
think of is to use binmode, as of the attached.
It is the first time that we'd take the contents received from a
BackgroundPsql and write them to a file parsed by the backend, so
perhaps we should try to do that in a more general way, but I'm not
sure how, tbh, and the case of this test is special while adding
handling for \r when reading the backup_label got discussed in the
past but we were OK with what we are doing now on HEAD.
On top of all that, note that I have removed remove_tree as I am not
sure if this would be OK in all the buildfarm animals, added a quit()
for BackgroundPsql, moved queries to use less BackgroundPsql, as well
as a few other things like avoiding the hardcoded segment names.
meson.build is.. Cough.. Updated now.
I am attaching an updated patch with all that fixed, which is stable
in the CI and any tests I've run. Do you have any comments about
that?
--
Michael