Thread: Timeout failure in 019_replslot_limit.pl
Hi all,

Running the recovery tests in a parallel run heavy enough to saturate
the machine's resources sometimes leads me to the following failure:

ok 19 - walsender termination logged
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

This corresponds to the following part of the test, where a WAL sender
is SIGSTOP'd and SIGCONT'd:

$node_primary3->poll_query_until('postgres',
	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'",
	"lost")
  or die "timed out waiting for slot to be lost";

There is already a default timeout of 180s applied by default in
PostgresNode::poll_query_until(), so it seems to me that there could be
a different issue hiding here.

Thanks,
--
Michael
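As an aside, here is a minimal standalone sketch of what that poll
amounts to; it is not the PostgresNode::poll_query_until implementation,
and the psql invocation and connection defaults here are assumptions:

use strict;
use warnings;

# Retry the query once per second for up to 180 seconds, and give up
# if the slot never reports "lost".
my $query =
  "SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'";
my $deadline = time() + 180;

while (time() < $deadline)
{
	# -A (unaligned) and -t (tuples only) make the output just the value
	my $status = `psql -At -d postgres -c "$query"`;
	chomp $status;
	exit 0 if $status eq 'lost';
	sleep 1;
}
die "timed out waiting for slot to be lost";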
Hello

On 2021-Sep-06, Michael Paquier wrote:

> Running the recovery tests in a parallel run heavy enough to saturate
> the machine's resources sometimes leads me to the following failure:
> ok 19 - walsender termination logged
> # poll_query_until timed out executing this query:
> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

Hmm, I've never seen that, and I do run tests in parallel quite often.
Would you please attach the log files for that test in a failed run?

--
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-Sep-06, Michael Paquier wrote:
>> # poll_query_until timed out executing this query:
>> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

> Hmm, I've never seen that, and I do run tests in parallel quite often.

I scraped the buildfarm logs looking for similar failures, and didn't
find any.  (019_replslot_limit.pl hasn't failed at all in the farm
since the last fix it received, in late July.)

I wonder if Michael's setup had any unusual settings.

			regards, tom lane
On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> I scraped the buildfarm logs looking for similar failures, and didn't
> find any.  (019_replslot_limit.pl hasn't failed at all in the farm
> since the last fix it received, in late July.)

The interesting bits are in 019_replslot_limit_primary3.log.  In a
failed run, I can see that the process gets terminated immediately, as
follows:
2021-09-07 07:52:53.402 JST [22890] LOG:  terminating process 23082 to release replication slot "rep3"
2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL:  terminating connection due to administrator command
2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

In a successful run, the pattern is different:
2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL:  terminating connection due to administrator command
2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 09:27:39.832 JST [57092] LOG:  invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoint complete: wrote 19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoints are occurring too frequently (0 seconds apart)
2021-09-07 09:27:39.833 JST [57092] HINT:  Consider increasing the configuration parameter "max_wal_size".
2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

In the failed run the slot invalidation never happens because the
checkpoint that is supposed to do that work does not complete, no?  In
the successful run there is a completed checkpoint before we query
pg_replication_slots, and the buildfarm shows the same thing.

> I wonder if Michael's setup had any unusual settings.

The way I use configure and build options has caught bugs with code
ordering in the past, but this one looks like just a timing issue with
the test itself.  I can only see that with Big Sur 11.5.2, and I just
got fresh logs this morning with a new failure, as of the attached.
--
Michael
At Tue, 7 Sep 2021 09:37:10 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> > I scraped the buildfarm logs looking for similar failures, and didn't
> > find any.  (019_replslot_limit.pl hasn't failed at all in the farm
> > since the last fix it received, in late July.)
>
> The interesting bits are in 019_replslot_limit_primary3.log.  In a
> failed run, I can see that the process gets terminated immediately, as
> follows:
> 2021-09-07 07:52:53.402 JST [22890] LOG:  terminating process 23082 to release replication slot "rep3"
> 2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL:  terminating connection due to administrator command
> 2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
>
> In a successful run, the pattern is different:
> 2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL:  terminating connection due to administrator command
> 2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 09:27:39.832 JST [57092] LOG:  invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
> 2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoint complete: wrote 19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
> 2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoints are occurring too frequently (0 seconds apart)
> 2021-09-07 09:27:39.833 JST [57092] HINT:  Consider increasing the configuration parameter "max_wal_size".
> 2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
>
> In the failed run the slot invalidation never happens because the
> checkpoint that is supposed to do that work does not complete, no?  In
> the successful run there is a completed checkpoint before we query
> pg_replication_slots, and the buildfarm shows the same thing.
>
> > I wonder if Michael's setup had any unusual settings.
>
> The way I use configure and build options has caught bugs with code
> ordering in the past, but this one looks like just a timing issue with
> the test itself.  I can only see that with Big Sur 11.5.2, and I just
> got fresh logs this morning with a new failure, as of the attached.

It seems like the "kill 'STOP'" in the script didn't suspend the
processes before advancing WAL.  The attached uses the 'ps' command to
check that, since I didn't come up with a way to do the same in Perl.
I'm still not sure it works as expected, though.  (Imagining the case
where the state changes before the process actually stops..)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index e065c5c008..7388900008 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
 # freeze walsender and walreceiver. Slot will still be active, but walreceiver
 # won't get anything anymore.
 kill 'STOP', $senderpid, $receiverpid;
+wait_to_stop($senderpid, $receiverpid);
+
 advance_wal($node_primary3, 2);
 
 my $max_attempts = 180;
@@ -426,3 +428,30 @@ sub find_in_log
 
 	return $log =~ m/$pat/;
 }
+
+sub wait_to_stop
+{
+	my (@pids) = @_;
+	my $max_attempts = 180;
+
+	# Haven't found the means to do the same on Windows
+	return if $TestLib::windows_os;
+
+	while ($max_attempts-- >= 0)
+	{
+		my ($all_stopped) = 1;
+
+		foreach my $pid (@pids)
+		{
+			if (`ps -p $pid -o pid,state | tail -1` !~ /^ *\d+ +T/)
+			{
+				$all_stopped = 0;
+				last;
+			}
+		}
+
+		last if ($all_stopped);
+
+		sleep 1;
+	}
+}
On Fri, Sep 17, 2021 at 06:59:24PM -0300, Alvaro Herrera wrote:
> On 2021-Sep-07, Kyotaro Horiguchi wrote:
> > It seems like the "kill 'STOP'" in the script didn't suspend the
> > processes before advancing WAL.  The attached uses the 'ps' command to
> > check that, since I didn't come up with a way to do the same in Perl.
>
> Ah! so we tell the kernel to send the signal, but there's no guarantee
> about the timing for the reaction from the other process.  Makes sense.

Agreed.

> Your proposal is to examine the other process' state until we see that
> it gets the T flag.  I wonder how portable this is; I suspect not very.
> `ps` is pretty annoying, meaning not consistently implemented -- GNU's
> manpage says there are "UNIX options", "BSD options" and "GNU long
> options", so it seems hard to believe that there is one set of options
> that will work everywhere.

I like this, and it's the most robust way.  I agree there's no portable
way, so I'd modify it to be fail-open.  Run a "ps" command that works on
the OP's system.  If the output shows the process in a state matching
[DRS], we can confidently sleep a bit for signal delivery to finish.  If
the command fails or prints something else (including state T, which we
need not check explicitly), assume SIGSTOP delivery is complete.  If
some other platform shows this race in the future, we can add an
additional "ps" command.

If we ever get the "stop events" system
(https://postgr.es/m/flat/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com),
it would be useful for crafting this kind of test without the problem
seen here.

> I found a Perl module (Proc::ProcessTable) that can be used to get the
> list of processes and their metadata, but it isn't in core Perl and it
> doesn't look very well maintained either, so that one's out.

Agreed, that one's out.

> Another option might be to wait on the kernel -- do something that would
> involve the kernel taking action on the other process, acting like a
> barrier of sorts.  I don't know if this actually works, but we could
> try.  Something like sending SIGSTOP first, then "kill 0" -- or just
> send SIGSTOP twice:
>
> diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
> index e065c5c008..e8f323066a 100644
> --- a/src/test/recovery/t/019_replslot_limit.pl
> +++ b/src/test/recovery/t/019_replslot_limit.pl
> @@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
>  # freeze walsender and walreceiver. Slot will still be active, but walreceiver
>  # won't get anything anymore.
>  kill 'STOP', $senderpid, $receiverpid;
> +kill 'STOP', $senderpid, $receiverpid;
> +
>  advance_wal($node_primary3, 2);
>
>  my $max_attempts = 180;

If this fixes things for the OP, I'd like it slightly better than the
"ps" approach.  It's less robust, but I like the brevity.

Another alternative might be to have walreceiver reach walsender via a
proxy Perl script.  Then, make that proxy able to accept an instruction
to pause passing data until further notice.  However, I like two of your
options better than this one.
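To make the fail-open idea concrete, here is a rough sketch; the helper
name, the exact "ps" invocation, and the retry policy are assumptions,
not a proposed patch:

sub wait_for_sigstop_fail_open
{
	my (@pids) = @_;

	foreach my $pid (@pids)
	{
		# Poll each process for up to 180 seconds.
		for my $attempt (1 .. 180)
		{
			# Ask ps only for the process state letter, without a header.
			my $state = `ps -p $pid -o state= 2>/dev/null`;

			# ps failed or is unsupported here: fail open and move on.
			last if $? != 0;

			$state =~ s/\s+//g if defined $state;

			# Anything other than a runnable/sleeping state (D, R, S),
			# including the stopped state T, means we can proceed.
			last unless defined $state && $state =~ /^[DRS]/;

			# Still running: give the kernel time to deliver SIGSTOP.
			sleep 1;
		}
	}
	return;
}

Such a helper would be called right after the kill 'STOP' and before
advance_wal(), in the same spot as the wait_to_stop() from the earlier
patch.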
On Wed, Sep 22, 2021 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> > On 2021-Sep-20, Michael Paquier wrote:
> >> The test gets the right PIDs, as the logs showed:
> >> ok 17 - have walsender pid 12663
> >> ok 18 - have walreceiver pid 12662
> >
> > As I understood, Horiguchi-san's point isn't that the PIDs might be
> > wrong -- the point is to make sure that the process is in state T before
> > moving on to the next step in the test.
>
> I have spent more time on this issue, as it looks that I am the only
> one with an environment able to reproduce it (Big Sur 11.6).
>
> As far as I can see, the states of the WAL sender and receiver are
> fine, after adding some extra debugging with ps called from the test
> itself, and I have checked that they are SIGSTOP'd or SIGCONT'd when a
> failure shows up.
>
> In a sequence that passes, we have the following sequence:
> - Start checkpoint.
> - SIGSTOP sent to WAL sender and receiver.
> - Advance WAL (CREATE TABLE, DROP TABLE and pg_switch_wal)
> - Check that WAL sender is stopped
> - SIGCONT on WAL sender.
>

Am I understanding correctly that after sending SIGCONT to the WAL
sender, the checkpoint's SIGTERM signal for the WAL sender is received
and it releases the slot and terminates itself?

> - Invalidate the slot, checkpoint completes.

After which the checkpoint invalidates the slot and completes.

Now, in the failed run, it appears that for some reason the WAL sender
has not released the slot.  Is it possible to see if the WAL sender is
still alive when a checkpoint is stuck at ConditionVariableSleep?  And
if it is active, what is its call stack?

--
With Regards,
Amit Kapila.
On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> Now, in the failed run, it appears that for some reason the WAL sender
> has not released the slot.  Is it possible to see if the WAL sender is
> still alive when a checkpoint is stuck at ConditionVariableSleep?  And
> if it is active, what is its call stack?

I got another failure today, so I used the occasion to check that when
the checkpoint gets stuck, the WAL sender that received SIGCONT is
still around, waiting for a write to happen:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
    frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
    frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
    frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
    frame #4: 0x000000010fc66d81 postgres`internal_flush at pqcomm.c:1352:7
    frame #5: 0x000000010fc665b9 postgres`internal_putbytes(s="E, len=1) at pqcomm.c:1298:8
    frame #6: 0x000000010fc66be3 postgres`socket_putmessage(msgtype='E', s="SFATAL", len=112) at pqcomm.c:1479:6
    frame #7: 0x000000010fc67318 postgres`pq_endmessage(buf=0x00007ffee0396118) at pqformat.c:301:9
    frame #8: 0x00000001100a469f postgres`send_message_to_frontend(edata=0x000000011030d640) at elog.c:3431:3
    frame #9: 0x00000001100a066d postgres`EmitErrorReport at elog.c:1546:3
    frame #10: 0x000000011009ff99 postgres`errfinish(filename="postgres.c", lineno=3193, funcname="ProcessInterrupts") at elog.c:597:2
  * frame #11: 0x000000010fe8e2f5 postgres`ProcessInterrupts at postgres.c:3191:4
    frame #12: 0x000000010fe0bbe5 postgres`WalSndLoop(send_data=(postgres`XLogSendPhysical at walsender.c:2550)) at walsender.c:2285:3
    frame #13: 0x000000010fe0754f postgres`StartReplication(cmd=0x00007f881d808790) at walsender.c:738:3
    frame #14: 0x000000010fe06149 postgres`exec_replication_command(cmd_string="START_REPLICATION SLOT \"rep3\" 0/700000 TIMELINE 1") at walsender.c:1652:6
    frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12

It logs its FATAL "terminating connection due to administrator
command" coming from ProcessInterrupts(), and then it sits idle on
ClientWrite.
--
Michael
On Mon, Sep 27, 2021 at 11:32 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> > Now, in the failed run, it appears that for some reason the WAL sender
> > has not released the slot.  Is it possible to see if the WAL sender is
> > still alive when a checkpoint is stuck at ConditionVariableSleep?  And
> > if it is active, what is its call stack?
>
> I got another failure today, so I used the occasion to check that when
> the checkpoint gets stuck, the WAL sender that received SIGCONT is
> still around, waiting for a write to happen:
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>     frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
>     frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
>     frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
>     frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
..
..
>     frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12
>
> It logs its FATAL "terminating connection due to administrator
> command" coming from ProcessInterrupts(), and then it sits idle on
> ClientWrite.
>

So, it seems on your machine it has passed the following condition in
secure_write:
if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))

If so, this indicates write failure which seems odd to me and probably
something machine-specific or maybe some different settings in your
build or machine. BTW, if SSL or GSS is enabled that might have caused
it in some way. I think the best way is to debug the secure_write
during this occurrence.

--
With Regards,
Amit Kapila.
On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:
> So, it seems on your machine it has passed the following condition in
> secure_write:
> if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))

Yep.

> If so, this indicates write failure which seems odd to me and probably
> something machine-specific or maybe some different settings in your
> build or machine. BTW, if SSL or GSS is enabled that might have caused
> it in some way. I think the best way is to debug the secure_write
> during this occurrence.

Yeah, but we don't use any of them in the context of this test, so
this is something on a simple send(), no?  Hmm.  That would not be the
first issue we see with macOS these days with interrupted syscalls...
And actually in this stack I can see that errno gets set to EINTR.
--
Michael
On Mon, Sep 27, 2021 at 12:13 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:
> > So, it seems on your machine it has passed the following condition in
> > secure_write:
> > if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
>
> Yep.
>
> > If so, this indicates write failure which seems odd to me and probably
> > something machine-specific or maybe some different settings in your
> > build or machine. BTW, if SSL or GSS is enabled that might have caused
> > it in some way. I think the best way is to debug the secure_write
> > during this occurrence.
>
> Yeah, but we don't use any of them in the context of this test, so
> this is something on a simple send(), no?  Hmm.  That would not be the
> first issue we see with macOS these days with interrupted syscalls...
> And actually in this stack I can see that errno gets set to EINTR.
>

If errno is EINTR, then how would the code pass the above if check, as
its condition is (errno == EWOULDBLOCK || errno == EAGAIN)?

--
With Regards,
Amit Kapila.
On 2021-Sep-27, Michael Paquier wrote:

> I got another failure today, so I used the occasion to check that when
> the checkpoint gets stuck, the WAL sender that received SIGCONT is
> still around, waiting for a write to happen:
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>     frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
>     frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
>     frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
>     frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
>     frame #4: 0x000000010fc66d81 postgres`internal_flush at pqcomm.c:1352:7
>     frame #5: 0x000000010fc665b9 postgres`internal_putbytes(s="E, len=1) at pqcomm.c:1298:8
>     frame #6: 0x000000010fc66be3 postgres`socket_putmessage(msgtype='E', s="SFATAL", len=112) at pqcomm.c:1479:6
>     frame #7: 0x000000010fc67318 postgres`pq_endmessage(buf=0x00007ffee0396118) at pqformat.c:301:9
>     frame #8: 0x00000001100a469f postgres`send_message_to_frontend(edata=0x000000011030d640) at elog.c:3431:3
>     frame #9: 0x00000001100a066d postgres`EmitErrorReport at elog.c:1546:3
>     frame #10: 0x000000011009ff99 postgres`errfinish(filename="postgres.c", lineno=3193, funcname="ProcessInterrupts") at elog.c:597:2
>   * frame #11: 0x000000010fe8e2f5 postgres`ProcessInterrupts at postgres.c:3191:4
>     frame #12: 0x000000010fe0bbe5 postgres`WalSndLoop(send_data=(postgres`XLogSendPhysical at walsender.c:2550)) at walsender.c:2285:3
>     frame #13: 0x000000010fe0754f postgres`StartReplication(cmd=0x00007f881d808790) at walsender.c:738:3
>     frame #14: 0x000000010fe06149 postgres`exec_replication_command(cmd_string="START_REPLICATION SLOT \"rep3\" 0/700000 TIMELINE 1") at walsender.c:1652:6
>     frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12

Ah, so the problem here is that the walsender is not exiting.  That
also causes the checkpointer to hang waiting for it.

I wonder if this is related to the problem reported in
https://www.postgresql.org/message-id/adce2c09-3bfc-4666-997a-c21991cb1eb1.mengjuan.cmj%40alibaba-inc.com

A patch was proposed on that thread on September 22nd, can you try with
that and see if this problem still reproduces?

--
Álvaro Herrera              39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"The wise man speaks because he has something to say;
the fool, because he has to say something." (Plato)
On Sat, Oct 02, 2021 at 07:00:01PM -0300, Alvaro Herrera wrote:
> A patch was proposed on that thread on September 22nd, can you try with
> that and see if this problem still reproduces?

Yes, the failure still shows up, even with a timeout set at 30s, which
is the default of the patch.
--
Michael