Very long loop breaking logical replication walsender / walreceiver connection - Mailing list pgsql-bugs

From	RECHTÉ Marc
Subject	Very long loop breaking logical replication walsender / walreceiver connection
Date	November 13, 2024 11:00:46
Msg-id	1430556325.185731745.1731484846410.JavaMail.zimbra@meteo.fr
List	pgsql-bugs

For some unknown reason (probably a very big transaction at the source), we experienced a logical decoding breakdown
caused by a timeout on the subscriber side (either wal_receiver_timeout or a connection drop by network equipment due to
inactivity).

When those timeouts occurred (more than 12 hours), the sender was still busy deleting files from
data/pg_replslot/bdcpb21_sene,
which had accumulated more than *6 million* small ".spill" files.

strace on the walsender showed a huge number of calls like:

unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-0.spill") = -1 ENOENT (No such file or directory)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-1000000.spill") = -1 ENOENT (No such file or directory)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-2000000.spill") = -1 ENOENT (No such file or directory)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-3000000.spill") = -1 ENOENT (No such file or directory)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-4000000.spill") = -1 ENOENT (No such file or directory)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-5000000.spill") = -1 ENOENT (No such file or directory)

This occurs in ReorderBufferRestoreCleanup (backend/replication/logical/reorderbuffer.c).
The call stack suggests that this happens during DecodeCommit or DecodeAbort (backend/replication/logical/decode.c):

unlink("pg_replslot/bdcpb21_sene/xid-2730444214-lsn-43A6-88000000.spill") = -1 ENOENT (No such file or directory)
 > /usr/lib64/libc-2.17.so(unlink+0x7) [0xf12e7]
 > /usr/pgsql-15/bin/postgres(ReorderBufferRestoreCleanup.isra.17+0x5d) [0x769e3d]
 > /usr/pgsql-15/bin/postgres(ReorderBufferCleanupTXN+0x166) [0x76aec6] <=== replication/logical/reorderbuffer.c:1480 (but this static function is only used within this module ...)
 > /usr/pgsql-15/bin/postgres(xact_decode+0x1e7) [0x75f217] <=== replication/logical/decode.c:175
 > /usr/pgsql-15/bin/postgres(LogicalDecodingProcessRecord+0x73) [0x75eee3] <=== replication/logical/decode.c:90, calls rmgr.rm_decode(ctx, &buf), one of the 6 resource manager methods
 > /usr/pgsql-15/bin/postgres(XLogSendLogical+0x4e) [0x78294e]
 > /usr/pgsql-15/bin/postgres(WalSndLoop+0x151) [0x785121]
 > /usr/pgsql-15/bin/postgres(exec_replication_command+0xcba) [0x785f4a]
 > /usr/pgsql-15/bin/postgres(PostgresMain+0xfa8) [0x7d0588]
 > /usr/pgsql-15/bin/postgres(ServerLoop+0xa8a) [0x493b97]
 > /usr/pgsql-15/bin/postgres(PostmasterMain+0xe6c) [0x74d66c]
 > /usr/pgsql-15/bin/postgres(main+0x1c5) [0x494a05]
 > /usr/lib64/libc-2.17.so(__libc_start_main+0xf4) [0x22554]
 > /usr/pgsql-15/bin/postgres(_start+0x28) [0x494fb8]
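
To illustrate the pattern in the strace output: the file names step by 0x1000000 (16 MB), which is consistent with one unlink attempt per WAL segment between the first and last LSN of the transaction, whether or not a spill file exists for that segment. Below is a minimal sketch of that arithmetic. It assumes 16 MB segments and a file name format inferred from the trace above; spill_path and candidate_count are hypothetical helpers written for illustration, not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 16 MB WAL segments, inferred from the 0x1000000 step in the trace. */
#define SEG_SIZE UINT64_C(0x1000000)

/*
 * Hypothetical helper: build the spill file name for the segment starting at
 * lsn, using the "xid-<xid>-lsn-<hi>-<lo>.spill" shape seen in the trace.
 * Returns the snprintf result (length of the full name).
 */
static int
spill_path(char *buf, size_t len, const char *slot, uint32_t xid, uint64_t lsn)
{
    assert(buf != NULL && slot != NULL);
    return snprintf(buf, len,
                    "pg_replslot/%s/xid-%" PRIu32 "-lsn-%" PRIX32 "-%" PRIX32 ".spill",
                    slot, xid, (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/*
 * Hypothetical helper: number of names a blind per-segment loop would try
 * for a transaction spanning first_lsn..last_lsn (both inclusive).
 */
static uint64_t
candidate_count(uint64_t first_lsn, uint64_t last_lsn)
{
    assert(first_lsn <= last_lsn);
    return last_lsn / SEG_SIZE - first_lsn / SEG_SIZE + 1;
}
```

Under these assumptions, a transaction whose first and last LSN are 1 TiB apart yields about 65,536 candidate names, each costing one system call even when nothing was spilled for that segment.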

There are 2 problems to address in ReorderBufferRestoreCleanup (backend/replication/logical/reorderbuffer.c:4562):

1) rb->update_progress_txn should be called periodically so that the connection does not break
2) a more efficient way is needed than deleting files blindly (for example, using scandir with a filter on the TXID base name)
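
The two points above can be sketched together as follows. This is a standalone illustration, not a patch: it uses plain POSIX opendir/readdir instead of PostgreSQL's own directory helpers, and remove_spill_files, progress_cb and count_progress are hypothetical names. The callback only stands in for whatever keepalive or progress hook the walsender would really use. The idea is to read the slot directory once, unlink only the entries that actually exist for the given TXID prefix, and invoke the callback every progress_every removals.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical callback type standing in for a periodic progress/keepalive hook. */
typedef void (*progress_cb) (void *arg);

/* Example callback for illustration: counts how often it was invoked. */
static void
count_progress(void *arg)
{
    (*(long *) arg)++;
}

/*
 * Remove every "<prefix>*.spill" entry in dirpath with a single directory
 * pass. Calls cb(cb_arg) after every progress_every successful removals.
 * Returns the number of files removed, or -1 on error.
 */
static long
remove_spill_files(const char *dirpath, const char *prefix,
                   long progress_every, progress_cb cb, void *cb_arg)
{
    DIR        *dir;
    struct dirent *de;
    size_t      plen;
    long        removed = 0;

    assert(dirpath != NULL && prefix != NULL);
    plen = strlen(prefix);

    dir = opendir(dirpath);
    if (dir == NULL)
        return -1;

    while ((de = readdir(dir)) != NULL)
    {
        char        path[4096];
        size_t      nlen = strlen(de->d_name);
        int         n;

        /* Keep only names that start with the prefix and end in ".spill". */
        if (strncmp(de->d_name, prefix, plen) != 0)
            continue;
        if (nlen < 6 || strcmp(de->d_name + nlen - 6, ".spill") != 0)
            continue;

        n = snprintf(path, sizeof path, "%s/%s", dirpath, de->d_name);
        if (n < 0 || (size_t) n >= sizeof path)
        {
            closedir(dir);
            return -1;
        }

        if (unlink(path) != 0)
        {
            if (errno == ENOENT)
                continue;       /* vanished concurrently, nothing to do */
            closedir(dir);
            return -1;
        }

        removed++;
        if (cb != NULL && progress_every > 0 && removed % progress_every == 0)
            cb(cb_arg);
    }

    closedir(dir);
    return removed;
}
```

A call such as remove_spill_files("pg_replslot/bdcpb21_sene", "xid-2721821917-lsn-", 10000, count_progress, &calls) would then cost one directory scan plus one unlink per existing file. The prefix includes the trailing "-lsn-" so that a shorter TXID cannot match the files of a longer one.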

See also https://www.postgresql.org/message-id/638764862.181008636.1730878611279.JavaMail.zimbra%40meteo.fr
