Thread: Timeout failure in 019_replslot_limit.pl
Hi all,

Running the recovery tests in a parallel run heavy enough to saturate
the machine's resources sometimes leads me to the following failure:

ok 19 - walsender termination logged
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

This corresponds to the following part of the test, where a WAL sender
is SIGSTOP'd and SIGCONT'd:

$node_primary3->poll_query_until('postgres',
	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'",
	"lost")
  or die "timed out waiting for slot to be lost";

There is already a default timeout of 180s applied by default in
PostgresNode::poll_query_until(), so it seems to me that there could be
a different issue hiding here.

Thanks,
--
Michael
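As an aside, here is a minimal standalone sketch of what that poll
amounts to; it is not the PostgresNode::poll_query_until implementation,
and the psql invocation and connection defaults here are assumptions:

use strict;
use warnings;

# Retry the query once per second for up to 180 seconds, and give up
# if the slot never reports "lost".
my $query =
  "SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'";
my $deadline = time() + 180;

while (time() < $deadline)
{
	# -A (unaligned) and -t (tuples only) make the output just the value
	my $status = `psql -At -d postgres -c "$query"`;
	chomp $status;
	exit 0 if $status eq 'lost';
	sleep 1;
}
die "timed out waiting for slot to be lost";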
Hello

On 2021-Sep-06, Michael Paquier wrote:

> Running the recovery tests in a parallel run heavy enough to saturate
> the machine's resources sometimes leads me to the following failure:
> ok 19 - walsender termination logged
> # poll_query_until timed out executing this query:
> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

Hmm, I've never seen that, and I do run tests in parallel quite often.
Would you please attach the log files for that test in a failed run?

--
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-Sep-06, Michael Paquier wrote:
>> # poll_query_until timed out executing this query:
>> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

> Hmm, I've never seen that, and I do run tests in parallel quite often.

I scraped the buildfarm logs looking for similar failures, and didn't
find any.  (019_replslot_limit.pl hasn't failed at all in the farm
since the last fix it received, in late July.)

I wonder if Michael's setup had any unusual settings.

			regards, tom lane
On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> I scraped the buildfarm logs looking for similar failures, and didn't
> find any.  (019_replslot_limit.pl hasn't failed at all in the farm
> since the last fix it received, in late July.)

The interesting bits are in 019_replslot_limit_primary3.log.  In a
failed run, I can see that the process gets terminated immediately, as
follows:
2021-09-07 07:52:53.402 JST [22890] LOG:  terminating process 23082 to release replication slot "rep3"
2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL:  terminating connection due to administrator command
2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

In a successful run, the pattern is different:
2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL:  terminating connection due to administrator command
2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 09:27:39.832 JST [57092] LOG:  invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoint complete: wrote 19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoints are occurring too frequently (0 seconds apart)
2021-09-07 09:27:39.833 JST [57092] HINT:  Consider increasing the configuration parameter "max_wal_size".
2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

In the failed run the slot invalidation never happens because the
checkpoint that is supposed to do that work does not complete, no?  In
the successful run there is a completed checkpoint before we query
pg_replication_slots, and the buildfarm shows the same thing.

> I wonder if Michael's setup had any unusual settings.

The way I use configure and build options has caught bugs with code
ordering in the past, but this one looks like just a timing issue with
the test itself.  I can only see that with Big Sur 11.5.2, and I just
got fresh logs this morning with a new failure, as of the attached.
--
Michael
At Tue, 7 Sep 2021 09:37:10 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> > I scraped the buildfarm logs looking for similar failures, and didn't
> > find any.  (019_replslot_limit.pl hasn't failed at all in the farm
> > since the last fix it received, in late July.)
>
> The interesting bits are in 019_replslot_limit_primary3.log.  In a
> failed run, I can see that the process gets terminated immediately, as
> follows:
> 2021-09-07 07:52:53.402 JST [22890] LOG:  terminating process 23082 to release replication slot "rep3"
> 2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL:  terminating connection due to administrator command
> 2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
>
> In a successful run, the pattern is different:
> 2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL:  terminating connection due to administrator command
> 2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 09:27:39.832 JST [57092] LOG:  invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
> 2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoint complete: wrote 19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
> 2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoints are occurring too frequently (0 seconds apart)
> 2021-09-07 09:27:39.833 JST [57092] HINT:  Consider increasing the configuration parameter "max_wal_size".
> 2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
>
> In the failed run the slot invalidation never happens because the
> checkpoint that is supposed to do that work does not complete, no?  In
> the successful run there is a completed checkpoint before we query
> pg_replication_slots, and the buildfarm shows the same thing.
>
> > I wonder if Michael's setup had any unusual settings.
>
> The way I use configure and build options has caught bugs with code
> ordering in the past, but this one looks like just a timing issue with
> the test itself.  I can only see that with Big Sur 11.5.2, and I just
> got fresh logs this morning with a new failure, as of the attached.

It seems like the "kill 'STOP'" in the script didn't suspend the
processes before advancing WAL.  The attached uses the 'ps' command to
check that, since I didn't come up with a way to do the same in Perl.
I'm still not sure it works as expected, though.  (Imagining the case
where the state changes before the process actually stops..)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index e065c5c008..7388900008 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
 # freeze walsender and walreceiver. Slot will still be active, but walreceiver
 # won't get anything anymore.
 kill 'STOP', $senderpid, $receiverpid;
+wait_to_stop($senderpid, $receiverpid);
+
 advance_wal($node_primary3, 2);
 
 my $max_attempts = 180;
@@ -426,3 +428,30 @@ sub find_in_log
 
 	return $log =~ m/$pat/;
 }
+
+sub wait_to_stop
+{
+	my (@pids) = @_;
+	my $max_attempts = 180;
+
+	# Haven't found the means to do the same on Windows
+	return if $TestLib::windows_os;
+
+	while ($max_attempts-- >= 0)
+	{
+		my ($all_stopped) = 1;
+
+		foreach my $pid (@pids)
+		{
+			if (`ps -p $pid -o pid,state | tail -1` !~ /^ *\d+ +T/)
+			{
+				$all_stopped = 0;
+				last;
+			}
+		}
+
+		last if ($all_stopped);
+
+		sleep 1;
+	}
+}
On Fri, Sep 17, 2021 at 06:59:24PM -0300, Alvaro Herrera wrote:
> On 2021-Sep-07, Kyotaro Horiguchi wrote:
> > It seems like the "kill 'STOP'" in the script didn't suspend the
> > processes before advancing WAL.  The attached uses the 'ps' command to
> > check that, since I didn't come up with a way to do the same in Perl.
>
> Ah! so we tell the kernel to send the signal, but there's no guarantee
> about the timing for the reaction from the other process.  Makes sense.

Agreed.

> Your proposal is to examine the other process' state until we see that
> it gets the T flag.  I wonder how portable this is; I suspect not very.
> `ps` is pretty annoying, meaning not consistently implemented -- GNU's
> manpage says there are "UNIX options", "BSD options" and "GNU long
> options", so it seems hard to believe that there is one set of options
> that will work everywhere.

I like this, and it's the most robust way.  I agree there's no portable
way, so I'd modify it to be fail-open.  Run a "ps" command that works on
the OP's system.  If the output shows the process in a state matching
[DRS], we can confidently sleep a bit for signal delivery to finish.  If
the command fails or prints something else (including state T, which we
need not check explicitly), assume SIGSTOP delivery is complete.  If
some other platform shows this race in the future, we can add an
additional "ps" command.

If we ever get the "stop events" system
(https://postgr.es/m/flat/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com),
it would be useful for crafting this kind of test without the problem
seen here.

> I found a Perl module (Proc::ProcessTable) that can be used to get the
> list of processes and their metadata, but it isn't in core Perl and it
> doesn't look very well maintained either, so that one's out.

Agreed, that one's out.

> Another option might be to wait on the kernel -- do something that would
> involve the kernel taking action on the other process, acting like a
> barrier of sorts.  I don't know if this actually works, but we could
> try.  Something like sending SIGSTOP first, then "kill 0" -- or just
> send SIGSTOP twice:
>
> diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
> index e065c5c008..e8f323066a 100644
> --- a/src/test/recovery/t/019_replslot_limit.pl
> +++ b/src/test/recovery/t/019_replslot_limit.pl
> @@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
>  # freeze walsender and walreceiver. Slot will still be active, but walreceiver
>  # won't get anything anymore.
>  kill 'STOP', $senderpid, $receiverpid;
> +kill 'STOP', $senderpid, $receiverpid;
> +
>  advance_wal($node_primary3, 2);
>
>  my $max_attempts = 180;

If this fixes things for the OP, I'd like it slightly better than the
"ps" approach.  It's less robust, but I like the brevity.

Another alternative might be to have walreceiver reach walsender via a
proxy Perl script.  Then, make that proxy able to accept an instruction
to pause passing data until further notice.  However, I like two of your
options better than this one.
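To make the fail-open idea concrete, here is a rough sketch; the helper
name, the exact "ps" invocation, and the retry policy are assumptions,
not a proposed patch:

sub wait_for_sigstop_fail_open
{
	my (@pids) = @_;

	foreach my $pid (@pids)
	{
		# Poll each process for up to 180 seconds.
		for my $attempt (1 .. 180)
		{
			# Ask ps only for the process state letter, without a header.
			my $state = `ps -p $pid -o state= 2>/dev/null`;

			# ps failed or is unsupported here: fail open and move on.
			last if $? != 0;

			$state =~ s/\s+//g if defined $state;

			# Anything other than a runnable/sleeping state (D, R, S),
			# including the stopped state T, means we can proceed.
			last unless defined $state && $state =~ /^[DRS]/;

			# Still running: give the kernel time to deliver SIGSTOP.
			sleep 1;
		}
	}
	return;
}

Such a helper would be called right after the kill 'STOP' and before
advance_wal(), in the same spot as the wait_to_stop() from the earlier
patch.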
On Wed, Sep 22, 2021 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> > On 2021-Sep-20, Michael Paquier wrote:
> >> The test gets the right PIDs, as the logs showed:
> >> ok 17 - have walsender pid 12663
> >> ok 18 - have walreceiver pid 12662
> >
> > As I understood, Horiguchi-san's point isn't that the PIDs might be
> > wrong -- the point is to make sure that the process is in state T before
> > moving on to the next step in the test.
>
> I have spent more time on this issue, as it looks that I am the only
> one with an environment able to reproduce it (Big Sur 11.6).
>
> As far as I can see, the states of the WAL sender and receiver are
> fine, after adding some extra debugging with ps called from the test
> itself, and I have checked that they are SIGSTOP'd or SIGCONT'd when a
> failure shows up.
>
> In a sequence that passes, we have the following sequence:
> - Start checkpoint.
> - SIGSTOP sent to WAL sender and receiver.
> - Advance WAL (CREATE TABLE, DROP TABLE and pg_switch_wal)
> - Check that WAL sender is stopped
> - SIGCONT on WAL sender.
>

Am I understanding correctly that after sending SIGCONT to the WAL
sender, the checkpoint's SIGTERM signal for the WAL sender is received
and it releases the slot and terminates itself?

> - Invalidate the slot, checkpoint completes.

After which the checkpoint invalidates the slot and completes.

Now, in the failed run, it appears that for some reason the WAL sender
has not released the slot.  Is it possible to see if the WAL sender is
still alive when a checkpoint is stuck at ConditionVariableSleep?  And
if it is active, what is its call stack?

--
With Regards,
Amit Kapila.
On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> Now, in the failed run, it appears that for some reason the WAL sender
> has not released the slot.  Is it possible to see if the WAL sender is
> still alive when a checkpoint is stuck at ConditionVariableSleep?  And
> if it is active, what is its call stack?

I got another failure today, so I used the occasion to check that when
the checkpoint gets stuck, the WAL sender that received SIGCONT is
still around, waiting for a write to happen:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
    frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
    frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
    frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
    frame #4: 0x000000010fc66d81 postgres`internal_flush at pqcomm.c:1352:7
    frame #5: 0x000000010fc665b9 postgres`internal_putbytes(s="E, len=1) at pqcomm.c:1298:8
    frame #6: 0x000000010fc66be3 postgres`socket_putmessage(msgtype='E', s="SFATAL", len=112) at pqcomm.c:1479:6
    frame #7: 0x000000010fc67318 postgres`pq_endmessage(buf=0x00007ffee0396118) at pqformat.c:301:9
    frame #8: 0x00000001100a469f postgres`send_message_to_frontend(edata=0x000000011030d640) at elog.c:3431:3
    frame #9: 0x00000001100a066d postgres`EmitErrorReport at elog.c:1546:3
    frame #10: 0x000000011009ff99 postgres`errfinish(filename="postgres.c", lineno=3193, funcname="ProcessInterrupts") at elog.c:597:2
  * frame #11: 0x000000010fe8e2f5 postgres`ProcessInterrupts at postgres.c:3191:4
    frame #12: 0x000000010fe0bbe5 postgres`WalSndLoop(send_data=(postgres`XLogSendPhysical at walsender.c:2550)) at walsender.c:2285:3
    frame #13: 0x000000010fe0754f postgres`StartReplication(cmd=0x00007f881d808790) at walsender.c:738:3
    frame #14: 0x000000010fe06149 postgres`exec_replication_command(cmd_string="START_REPLICATION SLOT \"rep3\" 0/700000 TIMELINE 1") at walsender.c:1652:6
    frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12

It logs its FATAL "terminating connection due to administrator
command" coming from ProcessInterrupts(), and then it sits idle on
ClientWrite.
--
Michael
On Mon, Sep 27, 2021 at 11:32 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> > Now, in the failed run, it appears that for some reason the WAL sender
> > has not released the slot.  Is it possible to see if the WAL sender is
> > still alive when a checkpoint is stuck at ConditionVariableSleep?  And
> > if it is active, what is its call stack?
>
> I got another failure today, so I used the occasion to check that when
> the checkpoint gets stuck, the WAL sender that received SIGCONT is
> still around, waiting for a write to happen:
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>     frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
>     frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
>     frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
>     frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
..
..
>     frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12
>
> It logs its FATAL "terminating connection due to administrator
> command" coming from ProcessInterrupts(), and then it sits idle on
> ClientWrite.
>

So, it seems on your machine it has passed the following condition in
secure_write:
if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))

If so, this indicates write failure which seems odd to me and probably
something machine-specific or maybe some different settings in your
build or machine. BTW, if SSL or GSS is enabled that might have caused
it in some way. I think the best way is to debug the secure_write
during this occurrence.

--
With Regards,
Amit Kapila.
On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:
> So, it seems on your machine it has passed the following condition in
> secure_write:
> if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))

Yep.

> If so, this indicates write failure which seems odd to me and probably
> something machine-specific or maybe some different settings in your
> build or machine. BTW, if SSL or GSS is enabled that might have caused
> it in some way. I think the best way is to debug the secure_write
> during this occurrence.

Yeah, but we don't use any of them in the context of this test, so
this is something on a simple send(), no?  Hmm.  That would not be the
first issue we see with macOS these days with interrupted syscalls...
And actually in this stack I can see that errno gets set to EINTR.
--
Michael
On Mon, Sep 27, 2021 at 12:13 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:
> > So, it seems on your machine it has passed the following condition in
> > secure_write:
> > if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
>
> Yep.
>
> > If so, this indicates write failure which seems odd to me and probably
> > something machine-specific or maybe some different settings in your
> > build or machine. BTW, if SSL or GSS is enabled that might have caused
> > it in some way. I think the best way is to debug the secure_write
> > during this occurrence.
>
> Yeah, but we don't use any of them in the context of this test, so
> this is something on a simple send(), no?  Hmm.  That would not be the
> first issue we see with macOS these days with interrupted syscalls...
> And actually in this stack I can see that errno gets set to EINTR.
>

If errno is EINTR, then how would the code pass the above if check, as
its condition is (errno == EWOULDBLOCK || errno == EAGAIN)?

--
With Regards,
Amit Kapila.
On 2021-Sep-27, Michael Paquier wrote:

> I got another failure today, so I used the occasion to check that when
> the checkpoint gets stuck, the WAL sender that received SIGCONT is
> still around, waiting for a write to happen:
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>     frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
>     frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
>     frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
>     frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
>     frame #4: 0x000000010fc66d81 postgres`internal_flush at pqcomm.c:1352:7
>     frame #5: 0x000000010fc665b9 postgres`internal_putbytes(s="E, len=1) at pqcomm.c:1298:8
>     frame #6: 0x000000010fc66be3 postgres`socket_putmessage(msgtype='E', s="SFATAL", len=112) at pqcomm.c:1479:6
>     frame #7: 0x000000010fc67318 postgres`pq_endmessage(buf=0x00007ffee0396118) at pqformat.c:301:9
>     frame #8: 0x00000001100a469f postgres`send_message_to_frontend(edata=0x000000011030d640) at elog.c:3431:3
>     frame #9: 0x00000001100a066d postgres`EmitErrorReport at elog.c:1546:3
>     frame #10: 0x000000011009ff99 postgres`errfinish(filename="postgres.c", lineno=3193, funcname="ProcessInterrupts") at elog.c:597:2
>   * frame #11: 0x000000010fe8e2f5 postgres`ProcessInterrupts at postgres.c:3191:4
>     frame #12: 0x000000010fe0bbe5 postgres`WalSndLoop(send_data=(postgres`XLogSendPhysical at walsender.c:2550)) at walsender.c:2285:3
>     frame #13: 0x000000010fe0754f postgres`StartReplication(cmd=0x00007f881d808790) at walsender.c:738:3
>     frame #14: 0x000000010fe06149 postgres`exec_replication_command(cmd_string="START_REPLICATION SLOT \"rep3\" 0/700000 TIMELINE 1") at walsender.c:1652:6
>     frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12

Ah, so the problem here is that the walsender is not exiting.  That
also causes the checkpointer to hang waiting for it.

I wonder if this is related to the problem reported in
https://www.postgresql.org/message-id/adce2c09-3bfc-4666-997a-c21991cb1eb1.mengjuan.cmj%40alibaba-inc.com

A patch was proposed on that thread on September 22nd, can you try with
that and see if this problem still reproduces?

--
Álvaro Herrera              39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"The wise man speaks because he has something to say;
the fool, because he has to say something." (Plato)
On Sat, Oct 02, 2021 at 07:00:01PM -0300, Alvaro Herrera wrote:
> A patch was proposed on that thread on September 22nd, can you try with
> that and see if this problem still reproduces?

Yes, the failure still shows up, even with a timeout set at 30s, which
is the default of the patch.
--
Michael