Thread: Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Michael Paquier <michael@paquier.xyz> writes:
> Add TAP test for archive_cleanup_command and recovery_end_command

grassquit just showed a non-reproducible failure in this test [1]:

# Postmaster PID for node "standby" is 291160
ok 1 - check content from archives
not ok 2 - archive_cleanup_command executed on checkpoint

#   Failed test 'archive_cleanup_command executed on checkpoint'
#   at t/002_archiving.pl line 74.

This test is sending a CHECKPOINT command to the standby and
expecting it to run the archive_cleanup_command, but it looks
like the standby did not actually run any checkpoint:

2022-04-07 16:11:33.060 UTC [291806][not initialized][:0] LOG:  connection received: host=[local]
2022-04-07 16:11:33.078 UTC [291806][client backend][2/15:0] LOG:  connection authorized: user=bf database=postgres application_name=002_archiving.pl
2022-04-07 16:11:33.084 UTC [291806][client backend][2/16:0] LOG:  statement: CHECKPOINT
2022-04-07 16:11:33.092 UTC [291806][client backend][:0] LOG:  disconnection: session time: 0:00:00.032 user=bf database=postgres host=[local]

I am suspicious that the reason is that ProcessUtility does not
ask for a forced checkpoint when in recovery:

            RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
                              (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));

The trouble with this theory is that this test has been there for
nearly six months and this is the first such failure (I scraped the
buildfarm logs to be sure).  Seems like failures should be a lot
more common than that.

I wondered if the recent pg_stats changes could have affected this,
but I don't really see how.

Thoughts?

			regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2022-04-07%2015%3A45%3A48
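For readers without the test open, the failing check boils down to
something like this (a simplified sketch, not the test's exact code;
the marker-file name and cleanup command are illustrative):

    # archive_cleanup_command drops a marker file when a restartpoint runs it.
    my $data_dir = $node_standby->data_dir;
    $node_standby->append_conf('postgresql.conf',
    	"archive_cleanup_command = 'touch $data_dir/archive_cleanup_ran'");
    $node_standby->reload;

    # A CHECKPOINT on the standby is expected to trigger a restartpoint,
    # which in turn runs the archive_cleanup_command.
    $node_standby->safe_psql('postgres', 'CHECKPOINT');
    ok( -f "$data_dir/archive_cleanup_ran",
    	'archive_cleanup_command executed on checkpoint');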
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Andres Freund
Hi,

On 2022-04-07 13:40:30 -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > Add TAP test for archive_cleanup_command and recovery_end_command
>
> grassquit just showed a non-reproducible failure in this test [1]:

I was just staring at that as well.

> # Postmaster PID for node "standby" is 291160
> ok 1 - check content from archives
> not ok 2 - archive_cleanup_command executed on checkpoint
>
> #   Failed test 'archive_cleanup_command executed on checkpoint'
> #   at t/002_archiving.pl line 74.
>
> This test is sending a CHECKPOINT command to the standby and
> expecting it to run the archive_cleanup_command, but it looks
> like the standby did not actually run any checkpoint:
>
> 2022-04-07 16:11:33.060 UTC [291806][not initialized][:0] LOG:  connection received: host=[local]
> 2022-04-07 16:11:33.078 UTC [291806][client backend][2/15:0] LOG:  connection authorized: user=bf database=postgres application_name=002_archiving.pl
> 2022-04-07 16:11:33.084 UTC [291806][client backend][2/16:0] LOG:  statement: CHECKPOINT
> 2022-04-07 16:11:33.092 UTC [291806][client backend][:0] LOG:  disconnection: session time: 0:00:00.032 user=bf database=postgres host=[local]
>
> I am suspicious that the reason is that ProcessUtility does not
> ask for a forced checkpoint when in recovery:
>
>             RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
>                               (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));
>
> The trouble with this theory is that this test has been there for
> nearly six months and this is the first such failure (I scraped the
> buildfarm logs to be sure).  Seems like failures should be a lot
> more common than that.

> I wondered if the recent pg_stats changes could have affected this, but I
> don't really see how.

I don't really see either. It's a bit more conceivable that the recovery
prefetching changes could affect the timing sufficiently?

It's also possible that it requires an animal of a certain speed to happen -
we didn't have an -fsanitize=address animal until recently.

I guess we'll have to wait and see what the frequency of the problem is?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-04-07 13:40:30 -0400, Tom Lane wrote:
>> This test is sending a CHECKPOINT command to the standby and
>> expecting it to run the archive_cleanup_command, but it looks
>> like the standby did not actually run any checkpoint:
>> ...
>> I wondered if the recent pg_stats changes could have affected this, but I
>> don't really see how.

> I don't really see either. It's a bit more conceivable that the recovery
> prefetching changes could affect the timing sufficiently?

Oh, that's at least a little plausible.

> I guess we'll have to wait and see what the frequency of the problem is?

Yeah, with only one instance it could just be cosmic rays or something.
However, assuming it is real, I guess I wonder why we don't say
CHECKPOINT_FORCE in standby mode too.

			regards, tom lane
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Andres Freund
Hi,

On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
> Yeah, with only one instance it could just be cosmic rays or something.
> However, assuming it is real, I guess I wonder why we don't say
> CHECKPOINT_FORCE in standby mode too.

I guess it might partially be that restartpoints require a checkpoint to have
happened on the primary. If we used FORCE, we'd have to wait till the next
checkpoint on the primary, which'd be a problem if it's e.g. a manually issued
CHECKPOINT before shutting the standby down.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
>> Yeah, with only one instance it could just be cosmic rays or something.

Not cosmic rays: skink has shown the same symptom three times running.
Looks like maybe the archive_cleanup_command itself is doing something
it shouldn't?

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
>> Yeah, with only one instance it could just be cosmic rays or something.
>> However, assuming it is real, I guess I wonder why we don't say
>> CHECKPOINT_FORCE in standby mode too.

> I guess it might partially be that restartpoints require a checkpoint to have
> happened on the primary. If we used FORCE, we'd have to wait till the next
> checkpoint on the primary, which'd be a problem if it's e.g. a manually issued
> CHECKPOINT before shutting the standby down.

After seeing skink's results, I tried running that test under valgrind
here, and it fails just like that every time.  skink's history allows
us to bound the failure introduction between 79b716cfb7 and
d7ab2a9a3c, which I think makes it just about certain that it was
5dc0418fab (Prefetch data referenced by the WAL, take II), though I've
not bisected to be 100% sure.

Adding some debug printouts to ExecuteRecoveryCommand convinces me that
indeed the archive_cleanup_command is NOT getting called by the
problematic CHECKPOINT command.  I surmise based on Andres' comment
above that the standby isn't making a restartpoint for lack of an
available primary checkpoint, which looks to me like it could be a
pre-existing bug in the test case: it's sure not doing anything to
guarantee that the primary's checkpoint record has reached the standby.

I tried adjusting the patch so it does guarantee that (as attached),
and in two out of two tries it got past the archive_cleanup_command
failure but then hung up waiting for standby2 to promote.

On the whole, I'm not sure that the WAL prefetch logic is noticeably
more stable than when we booted it out last year :-(.  However, I also
wonder why it is that this test case wasn't occasionally failing
already.

			regards, tom lane

diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index c8f5ffbaf0..1032d8a388 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -44,13 +44,14 @@ $node_standby->start;
 
 # Create some content on primary
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $current_lsn =
-  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
 
 # Note the presence of this checkpoint for the archive_cleanup_command
 # check done below, before switching to a new segment.
 $node_primary->safe_psql('postgres', "CHECKPOINT");
 
+my $current_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
 # Force archiving of WAL file to make it present on primary
 $node_primary->safe_psql('postgres', "SELECT pg_switch_wal()");
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Andres Freund
Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> After seeing skink's results, I tried running that test under valgrind
> here, and it fails just like that every time.  skink's history allows
> us to bound the failure introduction between 79b716cfb7 and
> d7ab2a9a3c, which I think makes it just about certain that it was
> 5dc0418fab (Prefetch data referenced by the WAL, take II), though I've
> not bisected to be 100% sure.

I've tested it, it's 5dc0418fab that makes the difference. I reduced the
cycle time by making initdb not go through valgrind, but have normal
postgres instances go through it.

> On the whole, I'm not sure that the WAL prefetch logic is noticeably
> more stable than when we booted it out last year :-(.

IDK. Last year's issues seem to have largely been caused by a flaky
machine. And a bug, if it's that, in some archiving corner case that's not
normally reached during tests...

Greetings,

Andres Freund
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Andres Freund
Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> I tried adjusting the patch so it does guarantee that (as attached),
> and in two out of two tries it got past the archive_cleanup_command
> failure but then hung up waiting for standby2 to promote.

Adding
    $node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
just after
    $node_standby2->start;
makes the tests pass here.

What is that second test really testing?

# Check the presence of temporary files specifically generated during
# archive recovery.  To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive.  As this requires at least two timeline
# switches, promote the existing standby first.  Then create a second
# standby based on the promoted one.  Finally, the second standby is
# promoted.

Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:

$node_standby2->init_from_backup($node_primary, $backup_name,
	has_restoring => 1);

It's created from the original primary, not the first standby, as the
description says...

Both nodes get promoted independently, in a run without valgrind:

standby:
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  INSERT @ 0/4000058:  - XLOG/END_OF_RECOVERY: tli 2; prev tli 1; time 2022-04-08 17:23:42.96686-07
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

standby2:
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  INSERT @ 0/4000058:  - XLOG/END_OF_RECOVERY: tli 3; prev tli 1; time 2022-04-08 17:23:43.307443-07
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

except that standby2 can't choose tli 2 because it finds it used.

Sure looks like something is funky with that test.

But I think there's also something funky in the prefetching logic. I think it
may attempt restoring during prefetching somehow, even though there's code
that appears to try to prevent that?

On standby2 I can see replay progress like the following:

2022-04-08 17:02:12.310 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024488; LSN 0/30244C8: prev 0/3024448; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 60 flags 0x00
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/3024488 for Heap/INSERT: off 60 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:12.312 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/30244C8; LSN 0/3024508: prev 0/3024488; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 61 flags 0x00
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/30244C8 for Heap/INSERT: off 61 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:13.857 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024508; LSN 0/3024548: prev 0/30244C8; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 62 flags 0x00
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/3024508 for Heap/INSERT: off 62 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:15.415 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""

note that we appear to wait between replaying of records, even though we
apparently have WAL for the next record!

And interestingly I'm not seeing the
"switched WAL source from stream to archive after failure"
lines I'd expect.

Greetings,

Andres Freund
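In context, the pg_switch_wal() addition mentioned at the top of this
message would land roughly like this (a sketch against 002_archiving.pl;
the surrounding lines are paraphrased rather than quoted exactly):

    # Start the second standby, then make the just-promoted standby switch
    # WAL segments so that its end-of-recovery record and the new timeline's
    # history file reach the archives promptly.
    $node_standby2->start;
    $node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");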
On Sat, Apr 9, 2022 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> > I tried adjusting the patch so it does guarantee that (as attached),
> > and in two out of two tries it got past the archive_cleanup_command
> > failure but then hung up waiting for standby2 to promote.
>
> Adding
>     $node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
> just after
>     $node_standby2->start;
> makes the tests pass here.

Sorry for the delay... I got a bit confused about the different things
going on in this thread but I hope I've got it now:

1.  This test had some pre-existing bugs/races, which hadn't failed
before due to scheduling, even under Valgrind.  The above changes
appear to fix those problems.  To Michael for comment.

> What is that second test really testing?
>
> # Check the presence of temporary files specifically generated during
> # archive recovery.  To ensure the presence of the temporary history
> # file, switch to a timeline large enough to allow a standby to recover
> # a history file from an archive.  As this requires at least two timeline
> # switches, promote the existing standby first.  Then create a second
> # standby based on the promoted one.  Finally, the second standby is
> # promoted.
>
> Note "Then create a second standby based on the promoted one." - but that's
> not actually what's happening:

2.  There may also be other problems with the test but those aren't
relevant to skink's failure, which starts on the 5th test.  To Michael
for comment.

> But I think there's also something funky in the prefetching logic. I think it
> may attempt restoring during prefetching somehow, even though there's code
> that appears to try to prevent that?

3.  Urghl.  Yeah.  There is indeed code to report XLREAD_WOULDBLOCK for
the lastSourceFailed case, but we don't really want to do that more
often than every 5 seconds.  So I think I need to make that "sticky", so
that we don't attempt to prefetch again (ie to read from the next
segment, which invokes restore_command) until we've replayed all the
records we already have, then hit the end and go to sleep.  The attached
patch does that, and makes the offending test pass under Valgrind for
me, even without the other changes already mentioned.  If I understand
correctly, this is due to a timing race in the tests (though I didn't
check where exactly), because all those extra fork/exec calls are
extremely slow under Valgrind.

> And interestingly I'm not seeing the
> "switched WAL source from stream to archive after failure"
> lines I'd expect.

I see them now.  It's because it gives up when it's reading ahead
(nonblocking), which may not be strictly necessary but I found it
simpler to think about.  Then when it tries again in 5 seconds it's in
blocking mode so it doesn't give up so easily.

2022-04-11 18:15:08.220 NZST [524796] DEBUG:  switched WAL source from stream to archive after failure
cp: cannot stat '/tmp/archive/000000010000000000000017': No such file or directory
2022-04-11 18:15:08.226 NZST [524796] DEBUG:  could not restore file "000000010000000000000017" from archive: child process exited with exit code 1
2022-04-11 18:15:08.226 NZST [524796] DEBUG:  could not open file "pg_wal/000000010000000000000017": No such file or directory
2022-04-11 18:15:08.226 NZST [524796] DEBUG:  switched WAL source from archive to stream after failure
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Michael Paquier
On Mon, Apr 11, 2022 at 06:48:58PM +1200, Thomas Munro wrote:
> Sorry for the delay... I got a bit confused about the different things
> going on in this thread but I hope I've got it now:
>
> 1.  This test had some pre-existing bugs/races, which hadn't failed
> before due to scheduling, even under Valgrind.  The above changes
> appear to fix those problems.  To Michael for comment.

I have seen the thread, and there is a lot in it.  I will try to look
tomorrow at the parts I got involved in.
--
Michael
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Michael Paquier
On Mon, Apr 11, 2022 at 06:48:58PM +1200, Thomas Munro wrote:
> 1.  This test had some pre-existing bugs/races, which hadn't failed
> before due to scheduling, even under Valgrind.  The above changes
> appear to fix those problems.  To Michael for comment.

Yeah, there are two problems here.  From what I can see, ensuring the
execution of archive_cleanup_command on the standby needs the
checkpoint on the primary and the restart point on the standby.  So
pg_current_wal_lsn() should be located after the primary's checkpoint
and not before it, so that we are sure that the checkpoint record
finds its way to the standby.  That's what Tom mentioned upthread.

The second problem is to make sure that $standby2 sees the promotion
of $standby and its history file, but we also want to recover
00000002.history from some archives to create a RECOVERYHISTORY at
recovery for the purpose of the test.  Switching to a new segment as
proposed by Andres does not seem completely right to me because we
are not 100% sure of the ordering in which archiving is going to
happen, no?  I think that the logic to create $standby2 from the
initial backup of the primary is right, because there is no
00000002.history in it, but we also need to be sure that
00000002.history has been archived once the promotion of $standby is
done.  This can be validated thanks to the logs, actually.

>> What is that second test really testing?
>>
>> # Check the presence of temporary files specifically generated during
>> # archive recovery.  To ensure the presence of the temporary history
>> # file, switch to a timeline large enough to allow a standby to recover
>> # a history file from an archive.  As this requires at least two timeline
>> # switches, promote the existing standby first.  Then create a second
>> # standby based on the promoted one.  Finally, the second standby is
>> # promoted.
>>
>> Note "Then create a second standby based on the promoted one." - but that's
>> not actually what's happening:

> 2.  There may also be other problems with the test but those aren't
> relevant to skink's failure, which starts on the 5th test.  To Michael
> for comment.

This comes from df86e52, where we want to recover a history file that
would be created as RECOVERYHISTORY and make sure that the file gets
removed at the end of recovery.  So $standby2 should choose a new
timeline different from the one chosen by $standby.  Looking back at
what has been done, it seems to me that the comment is the incorrect
part:
https://www.postgresql.org/message-id/20190930080340.GO2888@paquier.xyz

All that stuff leads me to the attached.  Thoughts?
--
Michael
On Tue, Apr 12, 2022 at 3:49 PM Michael Paquier <michael@paquier.xyz> wrote:
> All that stuff leads me to the attached.  Thoughts?

Under valgrind I got "Undefined subroutine &main::usleep called at
t/002_archiving.pl line 103" so I added "use Time::HiRes qw(usleep);",
and now I get past the first 4 tests with your patch, but then
promotion times out, not sure why:

+++ tap check in src/test/recovery +++
t/002_archiving.pl ..
ok 1 - check content from archives
ok 2 - archive_cleanup_command executed on checkpoint
ok 3 - recovery_end_command not executed yet
# found 00000002.history after 14 attempts
ok 4 - recovery_end_command executed after promotion
Bailout called.  Further testing stopped:  command "pg_ctl -D /home/tmunro/projects/postgresql/src/test/recovery/tmp_check/t_002_archiving_standby2_data/pgdata -l /home/tmunro/projects/postgresql/src/test/recovery/tmp_check/log/002_archiving_standby2.log promote" exited with value 1

Since it's quite painful to run TAP tests under valgrind, I found a
place to stick a plain old sleep to repro these problems:

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1035,7 +1035,7 @@ sub enable_restoring
 	my $copy_command =
 	  $PostgreSQL::Test::Utils::windows_os
 	  ? qq{copy "$path\\\\%f" "%p"}
-	  : qq{cp "$path/%f" "%p"};
+	  : qq{sleep 1 && cp "$path/%f" "%p"};

Soon I'll push the fix to the slowness that xlogprefetcher.c
accidentally introduced to continuous archive recovery, ie the problem
of calling a failing restore_command repeatedly as we approach the end
of a WAL segment instead of just once every 5 seconds after we run out
of data, and after that you'll probably need to revert that fix locally
to repro this.
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Michael Paquier
On Sun, Apr 17, 2022 at 08:56:33AM +1200, Thomas Munro wrote:
> Under valgrind I got "Undefined subroutine &main::usleep called at
> t/002_archiving.pl line 103" so I added "use Time::HiRes qw(usleep);",
> and now I get past the first 4 tests with your patch, but then
> promotion times out, not sure why:
>
> +++ tap check in src/test/recovery +++
> t/002_archiving.pl ..
> ok 1 - check content from archives
> ok 2 - archive_cleanup_command executed on checkpoint
> ok 3 - recovery_end_command not executed yet
> # found 00000002.history after 14 attempts
> ok 4 - recovery_end_command executed after promotion
> Bailout called.  Further testing stopped:  command "pg_ctl -D
> /home/tmunro/projects/postgresql/src/test/recovery/tmp_check/t_002_archiving_standby2_data/pgdata
> -l /home/tmunro/projects/postgresql/src/test/recovery/tmp_check/log/002_archiving_standby2.log
> promote" exited with value 1

Hmm.  As far as I can see, aren't you just hitting the 60s timeout of
pg_ctl here due to the slowness of valgrind?

> Since it's quite painful to run TAP tests under valgrind, I found a
> place to stick a plain old sleep to repro these problems:

Actually, I am wondering how you are patching Cluster.pm to do that.

> Soon I'll push the fix to the slowness that xlogprefetcher.c
> accidentally introduced to continuous archive recovery, ie the problem
> of calling a failing restore_command repeatedly as we approach the end
> of a WAL segment instead of just once every 5 seconds after we run out
> of data, and after that you'll probably need to revert that fix locally
> to repro this.

Okay.  Thanks.

Anyway, I'll do something about that tomorrow (no room to look at the
buildfarm today), and I was thinking about replacing the while loop I
had in the last version of the patch with a poll_query_until() that
does a pg_stat_file() with an absolute path to the history file to
avoid the dependency on usleep() in the test, splitting the fix into
two commits as there is more than one problem, each applying to
different branches.
--
Michael
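A minimal sketch of that polling approach, assuming $archive_dir points
at the archive directory the test uses (the variable name and exact
query are illustrative, not necessarily what the final patch does):

    # Wait until the standby's promotion has archived the new timeline's
    # history file.  pg_stat_file() with missing_ok => true returns NULLs
    # while the file is absent, so the query flips to true once it shows up.
    $node_standby->poll_query_until('postgres',
    	"SELECT size IS NOT NULL FROM pg_stat_file('$archive_dir/00000002.history', true)")
      or die "timed out waiting for 00000002.history to be archived";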
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Andrew Dunstan
On 2022-04-17 Su 00:17, Michael Paquier wrote:
>> Since it's quite painful to run TAP tests under valgrind, I found a
>> place to stick a plain old sleep to repro these problems:

> Actually, I am wondering how you are patching Cluster.pm to do that.

I don't really think it's Cluster.pm's business to deal with that.  It
takes an install path as given either explicitly or implicitly.

It shouldn't be too hard to get Makefile.global to install valgrind
wrappers into the tmp_install/bin directory.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Michael Paquier
On Sun, Apr 17, 2022 at 10:56:08AM -0400, Andrew Dunstan wrote:
> I don't really think it's Cluster.pm's business to deal with that.  It
> takes an install path as given either explicitly or implicitly.
>
> It shouldn't be too hard to get Makefile.global to install valgrind
> wrappers into the tmp_install/bin directory.

Or what gets used is just a wrapper of the contents of bin/ that gets
enforced to be first in PATH?
--
Michael
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Michael Paquier
On Tue, Apr 12, 2022 at 12:49:48PM +0900, Michael Paquier wrote:
> This comes from df86e52, where we want to recover a history file that
> would be created as RECOVERYHISTORY and make sure that the file gets
> removed at the end of recovery.  So $standby2 should choose a new
> timeline different from the one chosen by $standby.  Looking back
> at what has been done, it seems to me that the comment is the
> incorrect part:
> https://www.postgresql.org/message-id/20190930080340.GO2888@paquier.xyz

acf1dd42 has taken care of the failures of this test with skink, and I
have just taken care of the two races in the tests with e61efaf and
1a8b110.  I have left e61efaf out of REL_10_STABLE as the idea of
relying on a poll_query_until() with pg_stat_file() and an absolute
path would not work there, and the branch will be EOL'd soon while
there were no complaints about this test for two years.
--
Michael
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Andrew Dunstan
On 2022-04-17 Su 19:49, Michael Paquier wrote:
> On Sun, Apr 17, 2022 at 10:56:08AM -0400, Andrew Dunstan wrote:
>> I don't really think it's Cluster.pm's business to deal with that.  It
>> takes an install path as given either explicitly or implicitly.
>>
>> It shouldn't be too hard to get Makefile.global to install valgrind
>> wrappers into the tmp_install/bin directory.

> Or what gets used is just a wrapper of the contents of bin/ that gets
> enforced to be first in PATH?

That seems likely to be difficult.  For example pg_ctl might find its
colocated postgres rather than the valgrind wrapper.  Ditto initdb.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, Apr 18, 2022 at 11:49 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Apr 17, 2022 at 10:56:08AM -0400, Andrew Dunstan wrote:
> > I don't really think it's Cluster.pm's business to deal with that.  It
> > takes an install path as given either explicitly or implicitly.
> >
> > It shouldn't be too hard to get Makefile.global to install valgrind
> > wrappers into the tmp_install/bin directory.
>
> Or what gets used is just a wrapper of the contents of bin/ that gets
> enforced to be first in PATH?

Delayed response to the question on how I did that, because it was a 4
day weekend down here and I got distracted by sunshine...

A horribly slow way to do it is to build with -DUSE_VALGRIND and then
just run the whole process tree (eg make, perl, psql, ... and all)
under valgrind with --trace-children=yes:

tmunro@x1:~/projects/postgresql$ valgrind --quiet --suppressions=`pwd`/src/tools/valgrind.supp --trace-children=yes --track-origins=yes --run-libc-freeres=no --vgdb=no --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END make -C src/test/recovery/ check PROVE_TESTS=t/002_* PROVE_FLAGS=-v

I think that sort of thing actually worked when I tried it on a beefier
workstation, but it sent my Thinkpad that "only" has 16GB of RAM into
some kind of death spiral.  The way I succeeded was indeed using a
wrapper script, based on a suggestion from Andres, my
kludgy-hardcoded-path-assuming implementation of which looked like:

=== install-postgres-valgrind, to be run once after building ===
#!/bin/sh
SRC=$HOME/projects/postgresql
# move the real binary out of the way
mv $SRC/src/backend/postgres $SRC/src/backend/postgres.real
# install the wrapper, in the location it'll be copied from for tmp_install
cp postgres.valgrind $SRC/src/backend/postgres
===

=== postgres.valgrind wrapper script ===
#!/bin/sh
exec /usr/bin/valgrind \
    --quiet \
    --error-exitcode=128 \
    --suppressions=$HOME/projects/postgresql/src/tools/valgrind.supp \
    --trace-children=yes --track-origins=yes --read-var-info=no \
    --leak-check=no \
    --run-libc-freeres=no \
    --vgdb=no \
    --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
    $HOME/projects/postgresql/src/backend/postgres.real \
    "$@"
===

Then just:

make -C src/test/recovery/ check PROVE_TESTS=t/002_* PROVE_FLAGS=-v

Yeah, it might be quite neat to find a tool-supported way to do that.

Tangentially, I'd also like to look into making
PostgreSQL-under-Valgrind work on FreeBSD and macOS, which didn't work
last time I tried it for reasons that might, I hope, have been fixed on
the Valgrind side by now.
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
From: Michael Paquier
On Tue, Apr 19, 2022 at 09:45:11AM +1200, Thomas Munro wrote:
> Delayed response to the question on how I did that, because it was a 4
> day weekend down here and I got distracted by sunshine...

Happy Easter.

> I think that sort of thing actually worked when I tried it on a beefier
> workstation, but it sent my Thinkpad that "only" has 16GB of RAM into
> some kind of death spiral.  The way I succeeded was indeed using a
> wrapper script, based on a suggestion from Andres, my
> kludgy-hardcoded-path-assuming implementation of which looked like:
>
> Yeah, it might be quite neat to find a tool-supported way to do that.

Thanks for the details.  I feared that it was something like that for
the backend.  At least that's better than having valgrind spawn all the
processes kicked by the make command. :/

> Tangentially, I'd also like to look into making
> PostgreSQL-under-Valgrind work on FreeBSD and macOS, which didn't work
> last time I tried it for reasons that might, I hope, have been fixed on
> the Valgrind side by now.

Okay.  As a side note, skink has cooled down since acf1dd4, and did not
complain either after the additions of e61efaf and 1a8b110.
--
Michael