Home > mailing lists

Re: Logical replication timeout problem - Mailing list pgsql-hackers

From	Amit Kapila
Subject	Re: Logical replication timeout problem
Date	January 7, 2022 10:26:14
Msg-id	CAA4eK1JfPDev7BkUc0qSFcn33Q-fy5ro2Pk23LFJ2vF9m=HGQw@mail.gmail.com Whole thread Raw
In response to	Re: Logical replication timeout problem (Fabrice Chapuis <fabrice636861@gmail.com>)
Responses	Re: Logical replication timeout problem
List	pgsql-hackers

Tree view

On Wed, Dec 29, 2021 at 5:02 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:

I put the instance with high level debug mode.
I try to do some log interpretation: After having finished writing the modifications generated by the insert in the snap files,
then these files are read (restored). One minute after this work starts, the worker process exit with an error code = 1.
I see that keepalive messages were sent before the work process work leave.

2021-12-28 10:50:01.894 CET [55792] LOCATION: WalSndKeepalive, walsender.c:3365
...
2021-12-28 10:50:31.854 CET [55792] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG: 00000: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2021-12-28 10:50:31.907 CET [55792] LOCATION: ShowTransactionStateRec, xact.c:5075
2021-12-28 10:50:31.907 CET [55792] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG: 00000: spill 2271 changes in XID 14312 to disk
2021-12-28 10:50:31.907 CET [55792] LOCATION: ReorderBufferSerializeTXN, reorderbuffer.c:2245
2021-12-28 10:50:31.907 CET [55792] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.110 CET [55792] DEBUG: 00000: restored 4096/22603999 changes from disk
2021-12-28 10:50:32.110 CET [55792] LOCATION: ReorderBufferIterTXNNext, reorderbuffer.c:1156
2021-12-28 10:50:32.110 CET [55792] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.138 CET [55792] DEBUG: 00000: restored 4096/22603999 changes from disk
...
2021-12-28 10:50:35.341 CET [55794] DEBUG: 00000: sending replication keepalive
2021-12-28 10:50:35.341 CET [55794] LOCATION: WalSndKeepalive, walsender.c:3365
...
2021-12-28 10:51:31.995 CET [55791] ERROR: XX000: terminating logical replication worker due to timeout
2021-12-28 10:51:31.995 CET [55791] LOCATION: LogicalRepApplyLoop, worker.c:1267

It is still not clear to me why the problem happened? IIUC, after restoring 4096 changes from snap files, we send them to the subscriber, and then apply worker should apply those one by one. Now, is it taking one minute to restore 4096 changes due to which apply worker is timed out?

Could this function in Apply main loop in worker.c help to find a solution?

rc = WaitLatchOrSocket(MyLatch,
WL_SOCKET_READABLE | WL_LATCH_SET |
WL_TIMEOUT | WL_POSTMASTER_DEATH,
fd, wait_time,
WAIT_EVENT_LOGICAL_APPLY_MAIN);

Can you explain why you think this will help in solving your current problem?

With Regards,

Amit Kapila.

pgsql-hackers by date:

From: Tatsuro Yamada
Date: 07 January 2022, 09:30:30
Subject: Re: \dP and \dX use ::regclass without "pg_catalog."

From: Ronan Dunklau
Date: 07 January 2022, 11:03:55
Subject: Re: Use generation context to speed up tuplesorts

rc = WaitLatchOrSocket(MyLatch,
	WL_SOCKET_READABLE \| WL_LATCH_SET \|
	WL_TIMEOUT \| WL_POSTMASTER_DEATH,
	fd, wait_time,
	WAIT_EVENT_LOGICAL_APPLY_MAIN);

Re: Logical replication timeout problem - Mailing list pgsql-hackers

Previous

Next