Fix LOCK_TIMEOUT handling in slotsync worker - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject Fix LOCK_TIMEOUT handling in slotsync worker
Date
Msg-id TY4PR01MB169078F33846E9568412D878C94A2A@TY4PR01MB16907.jpnprd01.prod.outlook.com
List pgsql-hackers
Hi,

Previously, the slotsync worker used SIGINT to receive a graceful shutdown
signal from the startup process on promotion. However, SIGINT is also used by
the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the
slotsync worker can access and lock catalog tables while parsing libpq tuples,
this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT
signals and consequently waiting indefinitely on locks.

I can reproduce the issue by:

1) Create a failover replication slot on the primary for slotsync.
2) Start the slotsync worker on the standby and use gdb to make it block just
before it accesses the pg_type catalog via walrcv_exec -> libpqrcv_exec ->
libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
3) Take an ACCESS EXCLUSIVE lock on pg_type on the primary.
4) Log a standby snapshot so that the lock is replicated to the standby.
5) Release the slotsync worker. It will start waiting for the lock on pg_type to
   be released, and on HEAD the wait is not canceled by the lock_timeout
   setting.

Here is a patch to resolve this by replacing the current signal handler with the
appropriate StatementCancelHandler for SIGINT within the slotsync worker.
Furthermore, it updates the startup process to send a SIGUSR1 signal to notify
slotsync of the need to stop during promotion. The slotsync worker now stops
upon detecting that the shared memory flag (stopSignaled) is set to true.

I have not added a TAP test to the patch for now. Although feasible, it would
require taking a strong lock on a catalog and using an injection point to
control the process.

Best Regards,
Hou zj

Attachment
