Thread: Logical replication timeout

Logical replication timeout

From: RECHTÉ Marc
Hello,

For some unknown reason (probably a very big transaction at the source), we experienced a logical decoding breakdown,
due to a timeout on the subscriber side (either wal_receiver_timeout or the connection being dropped by network equipment due to inactivity).

The problem is that, because of that failure, the wal_receiver process stops. When the wal_sender is ready to send some data, it finds the connection broken and exits.
A new wal_sender process is created that restarts from the beginning (restart LSN). This is an endless loop.

Checking the network connection between wal_sender and wal_receiver, we found that no traffic occurs for hours.

We first increased wal_receiver_timeout up to 12h and still got a disconnection on the receiver side:

2024-10-17 16:31:58.645 GMT [1356203:2] user=,db=,app=,client= ERROR:  terminating logical replication worker due to timeout
2024-10-17 16:31:58.648 GMT [849296:212] user=,db=,app=,client= LOG:  background worker "logical replication worker" (PID 1356203) exited with exit code 1

We then set this parameter to 0, but then got disconnected by the network (note the slight difference in the message):

2024-10-21 11:45:42.867 GMT [1697787:2] user=,db=,app=,client= ERROR:  could not receive data from WAL stream: could not receive data from server: Connection timed out
2024-10-21 11:45:42.869 GMT [849296:40860] user=,db=,app=,client= LOG:  background worker "logical replication worker" (PID 1697787) exited with exit code 1

The message is generated in the libpqrcv_receive function (replication/libpqwalreceiver/libpqwalreceiver.c), which calls pqsecure_raw_read (interfaces/libpq/fe-secure.c).

The last part of the message, "Connection timed out", is the errno translation from the recv system call:

ETIMEDOUT       Connection timed out (POSIX.1-2001)

When those timeouts occurred, the sender was still busy deleting files from data/pg_replslot/bdcpb21_sene, which had accumulated more than 6 million small ".spill" files.
It seems this very long pause happens at the cleanup stage, where PG is blindly trying to delete those files.

strace on the wal sender shows tons of calls like:

unlink("pg_replslot/bdcpb21_sene/xid-2 721 821 917-lsn-439C-0.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-1000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-2000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-3000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-4000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-5000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)

This occurs in ReorderBufferRestoreCleanup (backend/replication/logical/reorderbuffer.c).
The call stack suggests this most likely happens in DecodeCommit or DecodeAbort (backend/replication/logical/decode.c):

unlink("pg_replslot/bdcpb21_sene/xid-2730444214-lsn-43A6-88000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
 > /usr/lib64/libc-2.17.so(unlink+0x7) [0xf12e7]
 > /usr/pgsql-15/bin/postgres(ReorderBufferRestoreCleanup.isra.17+0x5d) [0x769e3d]
 > /usr/pgsql-15/bin/postgres(ReorderBufferCleanupTXN+0x166) [0x76aec6] <=== replication/logical/reorderbuffer.c:1480 (but this (static) function is only used within this module ...)
 > /usr/pgsql-15/bin/postgres(xact_decode+0x1e7) [0x75f217] <=== replication/logical/decode.c:175 
 > /usr/pgsql-15/bin/postgres(LogicalDecodingProcessRecord+0x73) [0x75eee3] <=== replication/logical/decode.c:90, calls rmgr.rm_decode(ctx, &buf), one of the 6 resource manager methods
 > /usr/pgsql-15/bin/postgres(XLogSendLogical+0x4e) [0x78294e]
 > /usr/pgsql-15/bin/postgres(WalSndLoop+0x151) [0x785121]
 > /usr/pgsql-15/bin/postgres(exec_replication_command+0xcba) [0x785f4a]
 > /usr/pgsql-15/bin/postgres(PostgresMain+0xfa8) [0x7d0588]
 > /usr/pgsql-15/bin/postgres(ServerLoop+0xa8a) [0x493b97]
 > /usr/pgsql-15/bin/postgres(PostmasterMain+0xe6c) [0x74d66c]
 > /usr/pgsql-15/bin/postgres(main+0x1c5) [0x494a05]
 > /usr/lib64/libc-2.17.so(__libc_start_main+0xf4) [0x22554]
 > /usr/pgsql-15/bin/postgres(_start+0x28) [0x494fb8]

We did not find any other option than dropping the subscription to stop that loop and creating a new one (thus losing transactions).

The publisher is PostgreSQL 15.6
The subscriber is PostgreSQL 14.5

Thanks

Re: Logical replication timeout

From: Ashutosh Bapat
On Wed, Nov 6, 2024 at 1:07 PM RECHTÉ Marc <marc.rechte@meteo.fr> wrote:
> [...]
>
> When those timeouts occurred, the sender was still busy deleting files from data/pg_replslot/bdcpb21_sene,
> which had accumulated more than 6 million small ".spill" files.
> It seems this very long pause happens at the cleanup stage, where PG is blindly trying to delete those files.
>
> [...]
>
> This occurs in ReorderBufferRestoreCleanup (backend/replication/logical/reorderbuffer.c).
> The call stack suggests this most likely happens in DecodeCommit or DecodeAbort (backend/replication/logical/decode.c):
>
> [...]

I think we need a call to rb->update_progress_txn(rb, txn, change->lsn)
at regular intervals in ReorderBufferRestoreCleanup(), similar to
ReorderBufferProcessTXN(), and maybe at more places where we have
potentially long-running loops.
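
Roughly something like the following in the unlink loop of ReorderBufferRestoreCleanup() (an untested sketch only; the interval of 1024 and the use of txn->final_lsn as the progress LSN are just placeholders):

```
	/* iterate over all possible filenames, and delete them */
	for (cur = first; cur <= last; cur++)
	{
		char		path[MAXPGPATH];

		ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, cur);
		if (unlink(path) != 0 && errno != ENOENT)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not remove file \"%s\": %m", path)));

		/*
		 * Periodically report progress so the walsender keeps talking to
		 * the subscriber (and wal_receiver_timeout does not expire) while
		 * we unlink millions of spill files.
		 */
		if (rb->update_progress_txn && (cur - first) % 1024 == 0)
			rb->update_progress_txn(rb, txn, txn->final_lsn);
	}
```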

--
Best Wishes,
Ashutosh Bapat



Re: Logical replication timeout

From: Shlok Kyal
On Wed, 6 Nov 2024 at 13:07, RECHTÉ Marc <marc.rechte@meteo.fr> wrote:
>
> Hello,
>
> For some unknown reason (probably a very big transaction at the source), we experienced a logical decoding breakdown,
> due to a timeout on the subscriber side (either wal_receiver_timeout or the connection being dropped by network
> equipment due to inactivity).
>
> [...]
>
> The publisher is PostgreSQL 15.6
> The subscriber is PostgreSQL 14.5
>
> Thanks

Hi,

Do you have a reproducible test case for the above scenario? Please
share it.
I am also trying to reproduce the issue by generating a large number
of spill files.

Thanks and Regards,
Shlok Kyal



Re: Logical replication timeout

From: RECHTÉ Marc
This is how to reproduce the problem.

Session 1:

psql -c "CREATE TABLE test (i int)" -c "INSERT INTO test SELECT generate_series(1, 2_000_000)"

Session 2:

pg_recvlogical  -d postgres --slot=test --create-slot
pg_recvlogical -d postgres --slot=test --start -f -


Session 3:

cd data/pg_replslot
watch 'ls test | wc -l'


Session 1:

date
time psql -c "BEGIN" -c "
DO LANGUAGE plpgsql
\$\$
DECLARE
  cur CURSOR FOR SELECT * FROM test FOR UPDATE;
  rec record;
BEGIN
  FOR rec IN cur LOOP
    BEGIN
       UPDATE test SET i = i + 1 WHERE CURRENT OF cur;
    EXCEPTION
      WHEN no_data_found THEN
        RAISE NOTICE 'no data found exception';
    END;
  END LOOP;
END;
\$\$
" -c "ROLLBACK"

date
Wed 11 Dec 2024 08:59:03 CET
BEGIN
DO
ROLLBACK

real    0m17,071s
user    0m0,003s
sys    0m0,000s
Wed 11 Dec 2024 08:59:21 CET


Session 3: Watch session

Count increases up to

Wed Dec 11 09:00:02 2024
1434930

Then decreases down to 1

Wed Dec 11 09:03:17 2024
1

Session 2:

The output appears last (after the spill files have been deleted):

BEGIN 12874409
COMMIT 12874409


Conclusion:

- The exception block is responsible for generating subtransactions.
- Although the transaction lasted 17s, the decoding was only a bit behind (about 40 seconds), but
- it then spent an extra 200s deleting the spill files!


On Wed, 6 Nov 2024 at 13:07, RECHTÉ Marc <marc.rechte@meteo.fr> wrote:
> [...]

Hi,

Do you have a reproducible test case for the above scenario? Please
share it.
I am also trying to reproduce the issue by generating a large number
of spill files.

Thanks and Regards,
Shlok Kyal



RE: Logical replication timeout

From: "Hayato Kuroda (Fujitsu)"
Dear Marc,

> For some unknown reason (probably a very big transaction at the source), we
> experienced a logical decoding breakdown,
...
> When those timeouts occurred, the sender was still busy deleting files from
> data/pg_replslot/bdcpb21_sene, which had accumulated more than 6 million small
> ".spill" files. It seems this very long pause happens at the cleanup stage, where
> PG is blindly trying to delete those files.

Thanks for reporting the issue! We will discuss it and provide a fix if possible.
Apart from the code fix, I have some comments from another perspective.

> The publisher is PostgreSQL 15.6
> The subscriber is PostgreSQL 14.5

Can you set the "streaming" parameter to on for your subscription [1]? It allows
in-progress transactions to be streamed to the subscriber side. I feel this can avoid
the case where many .spill files accumulate on the publisher side.

Another approach is to tune the logical_decoding_work_mem parameter [2].
This specifies the maximum amount of memory used by logical decoding; changes are
spilled to disk when the limit is exceeded. Naively, raising this setting
can reduce the number of files.

I hope both settings can optimize your system.

[1]: https://www.postgresql.org/docs/14/sql-createsubscription.html
[2]: https://www.postgresql.org/docs/14/runtime-config-resource.html#GUC-LOGICAL-DECODING-WORK-MEM

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Logical replication timeout

From: RECHTÉ Marc
> Can you set the "streaming" parameter to on for your subscription [1]? It allows
> in-progress transactions to be streamed to the subscriber side. I feel this can avoid
> the case where many .spill files accumulate on the publisher side.

> Another approach is to tune the logical_decoding_work_mem parameter [2].
> This specifies the maximum amount of memory used by logical decoding; changes are
> spilled to disk when the limit is exceeded. Naively, raising this setting
> can reduce the number of files.


> [1]: https://www.postgresql.org/docs/14/sql-createsubscription.html
> [2]: https://www.postgresql.org/docs/14/runtime-config-resource.html#GUC-LOGICAL-DECODING-WORK-MEM


Dear Hayato,

Thanks for your suggestions; both were already tested. In our (real) case (a single transaction with 12 million
subtransactions):

1) Setting the subscription to streaming just delays the spill file surge a bit. It does not prevent the creation of
spill files.

2) Setting logical_decoding_work_mem to 20 GB probably also delayed the problem, but did not solve it.

The real problem is the spill file deletion, which can take days in this particular case!

Marc





RE: Logical replication timeout

From: "Hayato Kuroda (Fujitsu)"
Dear Marc,

Thanks for the reply!

> Thanks for your suggestions; both were already tested. In our (real) case (a
> single transaction with 12 million subtransactions):
>
> 1) Setting the subscription to streaming just delays the spill file surge a bit. It does
> not prevent the creation of spill files.

That is a bit surprising to me, because I am not aware of a code path by which transactions
can be serialized even in the streaming=on case. Let me think it over...

> 2) Setting logical_decoding_work_mem to 20 GB probably also delayed the
> problem, but did not solve it.

Oh, I understand that you've already increased the parameter to an appropriate
value in your environment. Is that right?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Logical replication timeout

From: "Hayato Kuroda (Fujitsu)"
Dear Shlok,

> 
> Thanks for sharing the analysis.
>
> I tested the patch on my machine as well, and it performs worse for me too.
> I came up with an alternate approach. In this approach we keep track
> of the WAL segments the transaction is part of. This helps to iterate
> through only the required files during cleanup.
>
> On my machine, I am running the test case provided by you in [1]. It
> generates ~1.9 million spill files. For me the transaction completed
> in 56 sec.
> Cleanup (deletion of spill files) took around the following time:
> With HEAD : ~5 min
> With the latest patch (attached here) : ~2 min
>
> Can you test if this improves performance for you?

I'm also not sure about the performance, but I can post some comments.
I'm not sure your patch handles the list operations properly.

```
+                oldcontext = MemoryContextSwitchTo(rb->context);
+                txn->walsgmts = lappend(txn->walsgmts, curOpenSegNo);
+                MemoryContextSwitchTo(oldcontext);
+
```

IIUC lappend() expects a pointer, but here a plain integer value is passed.
Should we define a new struct which represents a list node and append it
after it is palloc()'d?
Or is your code sufficient for some reason?

```
     /* iterate over all possible filenames, and delete them */
-    for (cur = first; cur <= last; cur++)
+    foreach(cell, txn->walsgmts)
     {
+        XLogSegNo curr_segno = (XLogSegNo) lfirst(cell);
         char        path[MAXPGPATH];
 
-        ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, cur);
+        ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, curr_segno);
         if (unlink(path) != 0 && errno != ENOENT)
             ereport(ERROR,
                     (errcode_for_file_access(),
                      errmsg("could not remove file \"%s\": %m", path)));
     }
+
+    if(txn->walsgmts != NIL)
+    {
+        pfree(txn->walsgmts);
+        txn->walsgmts = NIL;
+    }
```

If the above comment is accepted, I feel you can use foreach_delete_current().
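
For example, something like the following untested sketch (the struct name ReorderBufferWalSegment and its field are placeholders; list_free_deep() is used here instead of foreach_delete_current() only for brevity):

```
typedef struct ReorderBufferWalSegment
{
	XLogSegNo	segno;
} ReorderBufferWalSegment;

	/* while serializing: remember each segment we actually spilled into */
	oldcontext = MemoryContextSwitchTo(rb->context);
	{
		ReorderBufferWalSegment *seg = palloc(sizeof(ReorderBufferWalSegment));

		seg->segno = curOpenSegNo;
		txn->walsgmts = lappend(txn->walsgmts, seg);
	}
	MemoryContextSwitchTo(oldcontext);

	/* during cleanup: unlink only the files we know were created */
	foreach(cell, txn->walsgmts)
	{
		ReorderBufferWalSegment *seg = (ReorderBufferWalSegment *) lfirst(cell);
		char		path[MAXPGPATH];

		ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, seg->segno);
		if (unlink(path) != 0 && errno != ENOENT)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not remove file \"%s\": %m", path)));
	}
	list_free_deep(txn->walsgmts);
	txn->walsgmts = NIL;
```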

=======

Also, even if we optimize the removal of the files, there is still a possibility that
replication times out. Can you also create a patch which implements [1]?

[1]: https://www.postgresql.org/message-id/CAExHW5s2_T9mULDQRKsdV72wpnA%2BNLT63cX51b51QQVEV4sG5g%40mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Logical replication timeout

From: RECHTÉ Marc
> I came up with an alternate approach. In this approach we keep track
> of the WAL segments the transaction is part of. This helps to iterate
> through only the required files during cleanup.
>
> On my machine, I am running the test case provided by you in [1]. It
> generates ~1.9 million spill files. For me the transaction completed
> in 56 sec.
> Cleanup (deletion of spill files) took around the following time:
> With HEAD : ~5 min
> With the latest patch (attached here) : ~2 min
>
> Can you test if this improves performance for you?
>
> The patch applies on HEAD.

Thanks again for this new patch.

Unfortunately it does not compile (17.2 source):

reorderbuffer.c: In function 'ReorderBufferSerializeTXN':
reorderbuffer.c:3771:72: error: passing argument 2 of 'lappend' makes pointer from integer without a cast [-Wint-conversion]
 3771 |                                 txn->walsgmts = lappend(txn->walsgmts, curOpenSegNo);
      |                                                                        ^~~~~~~~~~~~
      |                                                                        |
      |                                                                        XLogSegNo {aka long unsigned int}


and

reorderbuffer.c: In function 'ReorderBufferRestoreChanges':
reorderbuffer.c:4304:31: error: assignment to 'XLogSegNo' {aka 'long unsigned int'} from 'void *' makes integer from pointer without a cast [-Wint-conversion]
 4304 |                         segno = lfirst(lc);
      |                               ^