Re: Timeout failure in 019_replslot_limit.pl - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: Timeout failure in 019_replslot_limit.pl |
Date | |
Msg-id | 20210918034100.GA2913772@rfd.leadboat.com |
In response to | Re: Timeout failure in 019_replslot_limit.pl (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
List | pgsql-hackers |
On Fri, Sep 17, 2021 at 06:59:24PM -0300, Alvaro Herrera wrote:
> On 2021-Sep-07, Kyotaro Horiguchi wrote:
> > It seems like the "kill 'STOP'" in the script didn't suspend the
> > processes before advancing WAL. The attached uses 'ps' command to
> > check that since I didn't come up with the way to do the same in Perl.
>
> Ah! so we tell the kernel to send the signal, but there's no guarantee
> about the timing for the reaction from the other process. Makes sense.

Agreed.

> Your proposal is to examine the other process' state until we see that
> it gets the T flag. I wonder how portable this is; I suspect not very.
> `ps` is pretty annoying, meaning not consistently implemented -- GNU's
> manpage says there are "UNIX options", "BSD options" and "GNU long
> options", so it seems hard to believe that there is one set of options
> that will work everywhere.

I like this, and it's the most-robust way. I agree there's no portable way,
so I'd modify it to be fail-open. Run a "ps" command that works on the OP's
system. If the output shows the process in a state matching [DRS], we can
confidently sleep a bit for signal delivery to finish. If the command fails
or prints something else (including state T, which we need not check
explicitly), assume SIGSTOP delivery is complete. If some other platform
shows this race in the future, we can add an additional "ps" command.

If we ever get the "stop events" system
(https://postgr.es/m/flat/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com),
it would be useful for crafting this kind of test without the problem seen
here.

> I found a Perl module (Proc::ProcessTable) that can be used to get the
> list of processes and their metadata, but it isn't in core Perl and it
> doesn't look very well maintained either, so that one's out.

Agreed, that one's out.

> Another option might be to wait on the kernel -- do something that would
> involve the kernel taking action on the other process, acting like a
> barrier of sorts. I don't know if this actually works, but we could
> try. Something like sending SIGSTOP first, then "kill 0" -- or just
> send SIGSTOP twice:
>
> diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
> index e065c5c008..e8f323066a 100644
> --- a/src/test/recovery/t/019_replslot_limit.pl
> +++ b/src/test/recovery/t/019_replslot_limit.pl
> @@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
>  # freeze walsender and walreceiver. Slot will still be active, but walreceiver
>  # won't get anything anymore.
>  kill 'STOP', $senderpid, $receiverpid;
> +kill 'STOP', $senderpid, $receiverpid;
> +
>  advance_wal($node_primary3, 2);
>
>  my $max_attempts = 180;

If this fixes things for the OP, I'd like it slightly better than the "ps"
approach. It's less robust, but I like the brevity.

Another alternative might be to have walreceiver reach walsender via a proxy
Perl script. Then, make that proxy able to accept an instruction to pause
passing data until further notice. However, I like two of your options better
than this one.
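
For illustration only, the fail-open "ps" check described above might look
roughly like the sketch below. The helper names (ps_says_running,
still_running_per_ps, wait_for_stop) are hypothetical and not part of the
test suite, and "ps -o state= -p PID" is an assumption about what works on
the affected system; anywhere that command fails or prints something
unexpected, the code simply assumes SIGSTOP delivery is complete.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical helper: true only if the ps output positively shows a
# state of D, R or S.  Anything else (T, empty, undef, garbage) is false.
sub ps_says_running
{
	my ($out) = @_;
	return (defined $out && $out =~ /^\s*[DRS]/) ? 1 : 0;
}

# Hypothetical helper: ask ps about one PID.  Fail-open: if ps is
# missing, exits nonzero, or prints something unrecognized, report
# "not running" so the caller stops waiting.
sub still_running_per_ps
{
	my ($pid) = @_;
	$pid = int($pid);
	# Assumption: this ps invocation works on the platform showing the race.
	my $out = `ps -o state= -p $pid 2>/dev/null`;
	return 0 if $? != 0;
	return ps_says_running($out);
}

# Sleep briefly while ps still shows the process as running or sleeping,
# giving up after $max_attempts polls.
sub wait_for_stop
{
	my ($pid, $max_attempts) = @_;
	for (my $i = 0; $i < $max_attempts; $i++)
	{
		return unless still_running_per_ps($pid);
		usleep(100_000);    # 100ms
	}
	return;
}
```

In the test, this would presumably be called once per PID right after the
existing "kill 'STOP', $senderpid, $receiverpid;" and before advance_wal().
State T is never matched explicitly; it just falls into the "anything else"
case.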