Re: backup manifests and contemporaneous buildfarm failures - Mailing list pgsql-hackers

From Tom Lane
Subject Re: backup manifests and contemporaneous buildfarm failures
Msg-id 26044.1585954081@sss.pgh.pa.us
In response to Re: backup manifests and contemporaneous buildfarm failures  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: backup manifests and contemporaneous buildfarm failures  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: backup manifests and contemporaneous buildfarm failures  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> Same here, on elver.  I see pg_subtrans has been chmod(0)'d,
> presumably by the perl subroutine mutilate_open_directory_fails.  I
> see this in my inbox (the build farm wrote it to stderr or stdout
> rather than the log file):

> cannot chdir to child for
> pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans:
> Permission denied at ./run_build.pl line 1013.
> cannot remove directory for
> pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails:
> Directory not empty at ./run_build.pl line 1013.

I'm guessing that we're looking at a platform-specific difference in
whether "rm -rf" fails outright on an unreadable subdirectory, or
just tries to carry on by unlinking it anyway.

A partial fix would be to have the test script put back normal
permissions on that directory before it exits ... but any failure
partway through the script would leave a time bomb requiring manual
cleanup.
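
To illustrate the partial fix, here is a minimal sketch in Python (the real test script is Perl, so this is an analogy, not the actual test code). The helper name temporarily_unreadable is hypothetical. It restores the saved mode when the block exits, whether normally or through an exception, but a hard kill of the process would still leave the directory at mode 0, which is the time-bomb case described above.

```python
import contextlib
import os
import stat


@contextlib.contextmanager
def temporarily_unreadable(path):
    """Strip all permissions from path, restoring the old mode on exit.

    Hypothetical helper for illustration; not part of any PostgreSQL test.
    """
    old_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, 0)
    try:
        yield path
    finally:
        # Runs on normal exit and on exceptions, but not if the
        # process is killed outright (e.g. SIGKILL or a power loss).
        os.chmod(path, old_mode)
```

A test would wrap only the step that needs the unreadable directory in "with temporarily_unreadable(dir):", keeping the window in which a crash can strand the directory as short as possible.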

On the whole, I'd argue that testing that behavior is not valuable
enough to risk periodically breaking buildfarm members in a way
that requires manual recovery, to say nothing of annoying developers
who trip over it.  So my vote is to remove that part of the test and
be satisfied with checking the behavior for an unreadable file.

This doesn't directly explain the failure-at-next-configure behavior
that we're seeing in the buildfarm, but it wouldn't be too surprising
if the buildfarm client script turns out not to fully recover from
the situation.
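
For comparison, a cleanup step that tolerates a chmod(0)'d subdirectory could look roughly like the following Python sketch. This is not what run_build.pl does; force_rmtree is a hypothetical name, and the sketch assumes the caller owns every directory in the tree and that the root is not a symlink. It restores owner permissions on each directory before descending into it, then removes the tree.

```python
import os
import shutil
import stat


def force_rmtree(root):
    """Remove a directory tree even if some subdirectories are mode 0.

    Hypothetical helper; assumes the caller owns all directories under root.
    """
    os.chmod(root, stat.S_IRWXU)
    # Top-down walk: each child directory is made accessible here,
    # before os.walk tries to scan it.
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            child = os.path.join(dirpath, name)
            if not os.path.islink(child):
                os.chmod(child, stat.S_IRWXU)
    shutil.rmtree(root)
```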

            regards, tom lane


