Thread: A proposal to provide a timeout option for CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot

Hi,

Currently CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot waits unboundedly if there are any in-progress write transactions [1]. The wait is there for a reason, i.e. to build an initial snapshot, but waiting unboundedly isn't good for the usability of the command/function, and when it is stuck, the callers have no information as to why.

How about we provide a timeout for the command/function instead of letting them wait unboundedly? The behavior will be something like this - if the logical replication slot isn't created within this timeout, the command/function will fail.

We could've asked callers to set statement_timeout before calling CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot, but that impacts the queries running in all other sessions, and it may not always be possible to set this parameter just for the session that runs the CREATE_REPLICATION_SLOT command.

Thoughts?

[1]
(gdb) bt
#0  0x00007fc21509a45a in epoll_wait (epfd=9, events=0x561874204e88, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000056187350e9cc in WaitEventSetWaitBlock (set=0x561874204e28, cur_timeout=-1, occurred_events=0x7fff72b3a4a0, nevents=1) at latch.c:1467
#2  0x000056187350e847 in WaitEventSetWait (set=0x561874204e28, timeout=-1, occurred_events=0x7fff72b3a4a0, nevents=1, wait_event_info=50331653) at latch.c:1413
#3  0x000056187350db64 in WaitLatch (latch=0x7fc21292f324, wakeEvents=33, timeout=0, wait_event_info=50331653) at latch.c:475
#4  0x000056187353b5b2 in ProcSleep (locallock=0x56187422aa58, lockMethodTable=0x561873a61a20 <default_lockmethod>) at proc.c:1337
#5  0x0000561873527e49 in WaitOnLock (locallock=0x56187422aa58, owner=0x5618742888b0) at lock.c:1859
#6  0x0000561873526730 in LockAcquireExtended (locktag=0x7fff72b3a8a0, lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true, locallockp=0x0) at lock.c:1101
#7  0x0000561873525b9d in LockAcquire (locktag=0x7fff72b3a8a0, lockmode=5, sessionLock=false, dontWait=false) at lock.c:752
#8  0x0000561873524099 in XactLockTableWait (xid=734, rel=0x0, ctid=0x0, oper=XLTW_None) at lmgr.c:702
#9  0x00005618734a69c4 in SnapBuildWaitSnapshot (running=0x561874315a18, cutoff=735) at snapbuild.c:1416
#10 0x00005618734a67a2 in SnapBuildFindSnapshot (builder=0x561874311a80, lsn=21941704, running=0x561874315a18) at snapbuild.c:1328
#11 0x00005618734a62c4 in SnapBuildProcessRunningXacts (builder=0x561874311a80, lsn=21941704, running=0x561874315a18) at snapbuild.c:1117
#12 0x000056187348cab0 in standby_decode (ctx=0x5618742fb9e0, buf=0x7fff72b3aa00) at decode.c:346
#13 0x000056187348c34e in LogicalDecodingProcessRecord (ctx=0x5618742fb9e0, record=0x5618742fbda0) at decode.c:119
#14 0x000056187349124e in DecodingContextFindStartpoint (ctx=0x5618742fb9e0) at logical.c:613
#15 0x00005618734c2ab3 in create_logical_replication_slot (name=0x56187420d848 "slot1", plugin=0x56187420d8f8 "test_decoding", temporary=false, two_phase=false, restart_lsn=0, find_startpoint=true) at slotfuncs.c:158
#16 0x00005618734c2bb8 in pg_create_logical_replication_slot (fcinfo=0x5618742efdd0) at slotfuncs.c:187
#17 0x00005618732def6b in ExecMakeTableFunctionResult (setexpr=0x5618742dc318, econtext=0x5618742dc1d0, argContext=0x5618742efcb0, expectedDesc=0x5618742ec098, randomAccess=false) at execSRF.c:234
#18 0x00005618732fbc27 in FunctionNext (node=0x5618742dbfb8) at nodeFunctionscan.c:95
#19 0x00005618732e0987 in ExecScanFetch (node=0x5618742dbfb8, accessMtd=0x5618732fbb72 <FunctionNext>, recheckMtd=0x5618732fbf6e <FunctionRecheck>) at execScan.c:133
#20 0x00005618732e0a00 in ExecScan (node=0x5618742dbfb8, accessMtd=0x5618732fbb72 <FunctionNext>, recheckMtd=0x5618732fbf6e <FunctionRecheck>) at execScan.c:182
#21 0x00005618732fbfc4 in ExecFunctionScan (pstate=0x5618742dbfb8) at nodeFunctionscan.c:270
#22 0x00005618732dc693 in ExecProcNodeFirst (node=0x5618742dbfb8) at execProcnode.c:463
#23 0x00005618732cfe80 in ExecProcNode (node=0x5618742dbfb8) at ../../../src/include/executor/executor.h:259

Regards,
Bharath Rupireddy.



At Thu, 9 Jun 2022 10:25:06 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> Hi,
> 
> Currently CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot waits
> unboundedly if there are any in-progress write transactions [1]. The wait
> is for a reason actually i.e. for building an initial snapshot, but waiting
> unboundedly isn't good for usability of the command/function and when
> stuck, the callers will not have any information as to why.
> 
> How about we provide a timeout for the command/function instead of letting
> them wait unboundedly? The behavior will be something like this - if the
> logical replication slot isn't created within this timeout, the
> command/function will fail.
> 
> We could've asked callers to set statement_timeout before calling
> CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot but that impacts
> the queries running in all other sessions and it may not be always possible
> to set this parameter just for the session that runs command
> CREATE_REPLICATION_SLOT.
>
> Thoughts?

How can the other sessions get affected by setting statement_timeout in
a session?  And "SET LOCAL" narrows the effect down to within a
transaction. I think that is sufficient. On the other hand,
CREATE_REPLICATION_SLOT doesn't honor statement_timeout, but honors
lock_timeout. (It's a bit strange but I would hardly go so far as to
say we should "fix" it..) If a program issues CREATE_REPLICATION_SLOT,
it's hard to believe that the same program cannot issue SET (for
lock_timeout) command as well.
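
For illustration, a minimal sketch of the session-scoped approach for
the SQL function. The slot name and plugin are the ones from the
backtrace upthread; the 10s value is an arbitrary assumption.

```sql
-- SET LOCAL lasts only until the end of this transaction, so no other
-- session (and no later statement in this session) is affected.
BEGIN;
SET LOCAL lock_timeout = '10s';   -- arbitrary value, for illustration

-- If an in-progress write transaction blocks the initial snapshot
-- build, this is expected to fail with a lock timeout error instead
-- of waiting forever.
SELECT * FROM pg_create_logical_replication_slot('slot1', 'test_decoding');
COMMIT;
```

Note that lock_timeout applies separately to each lock acquisition, so
with several in-progress transactions the total wait can exceed the
configured value.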

When CREATE_REPLICATION_SLOT is called from a CREATE SUBSCRIPTION
command, the latter command itself honors statement_timeout and
disconnects the peer walsender. Thus, client_connection_check_interval
set on publisher side kills the walsender shortly after the
disconnection.

In short, I don't see much point in the timeout of the function/command.

As a general discussion on the timeout of functions/commands by a
parameter, I can only come up with pg_terminate_backend() for now, but
its timeout parameter not only determines how long to wait but also
specifies whether the function waits for the process termination at
all. That functionality cannot be achieved by statement timeout.  In
that sense it is a bit apart from pg_create_logical_replication_slot().
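
To make the two roles of that parameter concrete, a small sketch (the
pid 12345 is made up; the timeout is given in milliseconds):

```sql
-- Timeout omitted (defaults to 0): send the signal and return
-- immediately, without waiting for the backend to exit.
SELECT pg_terminate_backend(12345);

-- Timeout > 0: send the signal, then wait up to 5000 ms for the
-- backend to actually exit; returns false (with a warning) if it is
-- still running after that.
SELECT pg_terminate_backend(12345, 5000);
```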

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Thu, Jun 9, 2022 at 1:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> > Currently CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot waits
> > unboundedly if there are any in-progress write transactions [1]. [....]
> >
> > How about we provide a timeout for the command/function instead of letting
> > them wait unboundedly?
>
> How can the other sessions get affected by setting statement_timeout in
> a session?  And "SET LOCAL" narrows the effect down to within a
> transaction. I think that is sufficient.

SET LOCAL needs to be run within an explicit transaction block, whereas
CREATE SUBSCRIPTION can't be run inside one when it creates the
replication slot.

> On the other hand,
> CREATE_REPLICATION_SLOT doesn't honor statement_timeout, but honors
> lock_timeout. (It's a bit strange but I would hardly go so far as to
> say we should "fix" it..) If a program issues CREATE_REPLICATION_SLOT,
> it's hard to believe that the same program cannot issue SET (for
> lock_timeout) command as well.

Yes it can issue lock_timeout.

> When CREATE_REPLICATION_SLOT is called from a CREATE SUBSCRIPTION
> command, the latter command itself honors statement_timeout and
> disconnects the peer walsender. Thus, client_connection_check_interval
> set on publisher side kills the walsender shortly after the
> disconnection.

Right.

> In short, I don't see much point in the timeout of the function/command.

I played with it a bit today. There are a couple of ways to get around
the CREATE SUBSCRIPTION blocking issue - set transaction_timeout [1] or
statement_timeout [2] on the subscriber at the session level before
creating the subscription, or set lock_timeout [3] on the publisher.

Since we have a bunch of timeouts already (transaction_timeout being
the latest addition), I don't think we need another one here. So I
withdraw my initial idea on this thread to have a separate timeout to
create a logical replication slot.

[1]
postgres=# SET transaction_timeout = '10s';
SET
postgres=# CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres port=5432' PUBLICATION mypub;
FATAL:  terminating connection due to transaction timeout
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

[2]
postgres=# SET statement_timeout = '10s';
SET
postgres=# CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres port=5432' PUBLICATION mypub;

ERROR:  canceling statement due to statement timeout

[3]
postgres=# CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres port=5432' PUBLICATION mypub;

ERROR:  could not create replication slot "mysub": ERROR:  canceling statement due to lock timeout
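
The transcript in [3] does not show how lock_timeout was set on the
publisher. Two possible ways are sketched below; both are assumptions
and not necessarily the setup used for the test above. The role name
repl_user and the 10s value are made up for illustration.

```sql
-- Option A, run on the publisher: make lock_timeout the default for
-- new sessions of the role the subscriber connects as.
ALTER ROLE repl_user SET lock_timeout = '10s';

-- Option B, run on the subscriber: pass the setting to the publisher
-- through the libpq "options" connection parameter. The inner single
-- quotes are doubled because they sit inside an SQL string literal.
CREATE SUBSCRIPTION mysub
    CONNECTION 'dbname=postgres port=5432 options=''-c lock_timeout=10s'''
    PUBLICATION mypub;
```

Option B limits the setting to the connections made for this
subscription, whereas option A affects every new session of that role.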

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com