Re: Back-patch of: avoid multiple hard links to same WAL file after a crash - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
Date
Msg-id 20250420221559.ea.nmisch@google.com
In response to Re: Back-patch of: avoid multiple hard links to same WAL file after a crash  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On Sun, Apr 20, 2025 at 02:53:39PM -0700, Noah Misch wrote:
> On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> > On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > > Noah Misch <noah@leadboat.com> writes:
> > > > Tom and Michael, do you still object to the test addition, or not?  If there
> > > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.
> 
> Pushed as commit 714bd9e.  The failure so far is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
> with these highlights:
> 
> pg_ctl: server does not shut down
> 
> 2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG:  received immediate shutdown request
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL:  archive command was terminated by signal 3: Quit
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL:  The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"
> 
> The checkpoints and WAL creation took 30s, but archiving was only 20% done
> (based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT.  I can
> reproduce this if I test with valgrind --trace-children=yes.  With my normal
> valgrind settings, the whole test file takes only 18s.  I recommend one of
> these changes to skink:
> 
> - Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
>   "sh" and "cp" commands.
> - Remove --trace-children=yes

I gave that more thought.  One can be more surgical than that, via
--trace-children-skip-by-arg='*cp "*' or similar.  My previous message's two
options stop valgrind instrumentation at boundaries like pg_dumpall calling
system(pg_dump ...), since that execs /bin/sh to run pg_dump.  If we wanted to
make it even more explicit and surgical, skink could use
--trace-children-skip-by-arg='*valgrind-ignore-child*' combined with:

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1404 +1404 @@ sub enable_restoring
-      : qq{cp "$path/%f" "%p"};
+      : qq{cp "$path/%f" "%p"  # valgrind-ignore-child};
@@ -1474 +1474 @@ sub enable_archiving
-      : qq{cp "%p" "$path/%f"};
+      : qq{cp "%p" "$path/%f"  # valgrind-ignore-child};
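
To illustrate how the two skip-by-arg patterns would classify child processes, here is a rough model, not valgrind's actual code. It assumes the archive command runs through system(), so the child is sh -c '<command>' and the trailing shell comment is part of that argument. It also approximates valgrind's "*" and "?" matching with Python's fnmatchcase. The function name and the sample command lines are hypothetical.

```python
from fnmatch import fnmatchcase


def skipped_by_arg(argv, patterns):
    """Return True if any argument matches any skip pattern.

    Approximates --trace-children-skip-by-arg; fnmatchcase also accepts
    [...] classes, which the patterns used here do not rely on.
    """
    return any(fnmatchcase(arg, pat) for arg in argv for pat in patterns)


# Hypothetical child command lines as sh would receive them via system().
archive_argv = [
    "sh",
    "-c",
    'cp "pg_wal/00000001000000000000006D" '
    '"/archives/00000001000000000000006D"  # valgrind-ignore-child',
]
pg_dump_argv = ["sh", "-c", '"pg_dump" --schema-only postgres']

marker_patterns = ["*valgrind-ignore-child*"]  # explicit marker in the command
cp_patterns = ['*cp "*']                       # matches any quoted cp command

for name, argv in (("archive", archive_argv), ("pg_dump", pg_dump_argv)):
    print(name,
          "marker:", skipped_by_arg(argv, marker_patterns),
          "cp:", skipped_by_arg(argv, cp_patterns))
```

Under these assumptions, either pattern skips the archive command and leaves the pg_dump child instrumented. The marker variant differs in that only commands explicitly tagged in Cluster.pm are skipped.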

What's your preference?

> Andres, what do you think about making one of those skink configuration
> changes?  Alternatively, I could make the test poll until archiving catches
> up.  However, that would take skink about 30min, and I expect little value
> from 30min of valgrind instrumenting the "cp" command.


