Re: Back-patch of: avoid multiple hard links to same WAL file after a crash - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Back-patch of: avoid multiple hard links to same WAL file after a crash |
Date | |
Msg-id | f7ekxpwertlg2k4ux6dexi23k6n63fq5f7w5v3k5r556sw7dh7@ukyye6rmw6uv |
In response to | Re: Back-patch of: avoid multiple hard links to same WAL file after a crash (Noah Misch <noah@leadboat.com>) |
Responses |
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
|
List | pgsql-hackers |
Hi,

On 2025-04-20 14:53:39 -0700, Noah Misch wrote:
> On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> > On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > > Noah Misch <noah@leadboat.com> writes:
> > > > Tom and Michael, do you still object to the test addition, or not? If there
> > > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.
>
> Pushed as commit 714bd9e. The failure so far is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
> with these highlights:
>
> pg_ctl: server does not shut down
>
> 2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG: received immediate shutdown request
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL: archive command was terminated by signal 3: Quit
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL: The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"
>
> The checkpoints and WAL creation took 30s, but archiving was only 20% done
> (based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT.

Huh. That seems surprisingly slow, even for valgrind. I guess it's one more
example for why the single-threaded archiving approach sucks so badly :)

> I can reproduce this if I test with valgrind --trace-children=yes. With my
> normal valgrind settings, the whole test file takes only 18s. I recommend
> one of these changes to skink:
>
> - Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
>   "sh" and "cp" commands.
> - Remove --trace-children=yes

Hm. I think I used --trace-children=yes because I was thinking it was required
to track forks. But a newer version of valgrind's man page has an important
clarification:

    --trace-children=<yes|no> [default: no]
        When enabled, Valgrind will trace into sub-processes initiated via
        the exec system call. This is necessary for multi-process programs.

        Note that Valgrind does trace into the child of a fork (it would be
        difficult not to, since fork makes an identical copy of a process),
        so this option is arguably badly named. However, most children of
        fork calls immediately call exec anyway.

So there doesn't seem to be much point in using --trace-children=yes.

> Andres, what do you think about making one of those skink configuration
> changes? Alternatively, I could make the test poll until archiving catches
> up. However, that would take skink about 30min, and I expect little value
> from 30min of valgrind instrumenting the "cp" command.

I just changed the config to --trace-children=no. There already is a valgrind
run in progress, so it won't be in effect for the next run.

Greetings,

Andres Freund
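
For reference, the progress estimate quoted in the message is read off the WAL file name 00000001000000000000006D. A minimal sketch of that decoding follows, assuming the default 16MB wal_segment_size (so 256 segments per high-order value); the helper name and constants are hypothetical and are not taken from the thread or from PostgreSQL's own code.

```python
# Hypothetical helper, not part of PostgreSQL: decode a WAL segment file name.
# Assumes the default 16MB wal_segment_size; other sizes change the multiplier.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE  # 256 at 16MB


def parse_wal_filename(name):
    """Return (timeline, segment_number) for a 24-hex-digit WAL file name."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL file name")
    timeline = int(name[0:8], 16)   # first 8 hex digits: timeline ID
    log = int(name[8:16], 16)       # middle 8 hex digits: high-order part
    seg = int(name[16:24], 16)      # last 8 hex digits: low-order part
    return timeline, log * SEGMENTS_PER_XLOGID + seg


timeline, segno = parse_wal_filename("00000001000000000000006D")
print(timeline, segno)  # prints: 1 109
```

Dividing that segment number by the number of segments the test generates gives the fraction archived; the total is not stated in the thread, so it is left out here.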