Re: Logical replication timeout
From: RECHTÉ Marc
Subject: Re: Logical replication timeout
Date:
Msg-id: 2142579462.209796913.1733907595547.JavaMail.zimbra@meteo.fr
In response to: Re: Logical replication timeout (Shlok Kyal <shlok.kyal.oss@gmail.com>)
List: pgsql-hackers
Here is how to reproduce the problem.

Session 1:

    psql -c "CREATE TABLE test (i int)" -c "INSERT INTO test SELECT generate_series(1, 2_000_000)"

Session 2:

    pg_recvlogical -d postgres --slot=test --create-slot
    pg_recvlogical -d postgres --slot=test --start -f -

Session 3:

    cd data/pg_replslot
    watch 'ls test | wc -l'

Session 1:

    date
    time psql -c "BEGIN" -c "
    DO LANGUAGE plpgsql \$\$
    DECLARE
        cur CURSOR FOR SELECT * FROM test FOR UPDATE;
        rec record;
    BEGIN
        FOR rec IN cur LOOP
            BEGIN
                UPDATE test SET i = i + 1 WHERE CURRENT OF cur;
            EXCEPTION WHEN no_data_found THEN
                RAISE NOTICE 'no data found exception';
            END;
        END LOOP;
    END;
    \$\$
    " -c "ROLLBACK"
    date

    Wed 11 Dec 2024 08:59:03 CET
    BEGIN
    DO
    ROLLBACK

    real    0m17,071s
    user    0m0,003s
    sys     0m0,000s
    Wed 11 Dec 2024 08:59:21 CET

Session 3 (watch output): the spill file count increases up to

    Wed Dec 11 09:00:02 2024        1434930

then decreases down to 1:

    Wed Dec 11 09:03:17 2024        1

Session 2: the transaction appears last (after the spill files are deleted):

    BEGIN 12874409
    COMMIT 12874409

Conclusion:

- The exception block is responsible for generating the subtransactions (see the sketch below).
- Although the transaction lasted only 17 s, one can see that decoding ran somewhat behind (about 40 s), but then spent an extra 200 s deleting the spill files!
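A PL/pgSQL block that has an EXCEPTION clause runs its body inside an internal subtransaction, which is why each iteration of the loop above creates one, and each UPDATE forces that subtransaction to acquire its own XID. The following is a condensed paraphrase, not the verbatim source, of the relevant logic in exec_stmt_block() (src/pl/plpgsql/src/pl_exec.c); memory-context handling and exception-handler matching are omitted:

    /*
     * Condensed paraphrase of exec_stmt_block() in
     * src/pl/plpgsql/src/pl_exec.c (simplified, not verbatim).
     * A block with an EXCEPTION clause is wrapped in an internal
     * subtransaction, so each pass through the LOOP above creates one.
     */
    static int
    exec_stmt_block(PLpgSQL_execstate *estate, PLpgSQL_stmt_block *block)
    {
        int     rc = PLPGSQL_RC_OK;

        if (block->exceptions)
        {
            BeginInternalSubTransaction(NULL);  /* one subxact per block entry */

            PG_TRY();
            {
                rc = exec_stmts(estate, block->body);   /* runs the UPDATE */
                ReleaseCurrentSubTransaction();         /* subcommit on success */
            }
            PG_CATCH();
            {
                RollbackAndReleaseCurrentSubTransaction();
                /* ... look for a matching EXCEPTION handler ... */
            }
            PG_END_TRY();
        }
        /* ... blocks without EXCEPTION run exec_stmts() directly ... */
        return rc;
    }

With 2,000,000 rows this yields on the order of 2,000,000 subtransactions, each tracked (and potentially spilled to its own set of .spill files) by the walsender's reorderbuffer, which is consistent with the ~1.4 million files counted above.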
> On Wed, 6 Nov 2024 at 13:07, RECHTÉ Marc <marc.rechte@meteo.fr> wrote:
> >
> > Hello,
> >
> > For some unknown reason (probably a very big transaction at the source), we experienced a logical decoding breakdown,
> > due to a timeout on the subscriber side (either wal_receiver_timeout or a connection drop by network equipment due to inactivity).
> >
> > The problem is that, due to that failure, the wal_receiver process stops. When the wal_sender is ready to send some data, it finds the connection broken and exits.
> > A new wal_sender process is created that restarts from the beginning (restart LSN). This is an endless loop.
> >
> > Checking the network connection between wal_sender and wal_receiver, we found that no traffic occurs for hours.
> >
> > We first increased wal_receiver_timeout up to 12h and still got a disconnection on the receiver side:
> >
> > 2024-10-17 16:31:58.645 GMT [1356203:2] user=,db=,app=,client= ERROR: terminating logical replication worker due to timeout
> > 2024-10-17 16:31:58.648 GMT [849296:212] user=,db=,app=,client= LOG: background worker "logical replication worker" (PID 1356203) exited with exit code 1
> >
> > We then set this parameter to 0, but got disconnected by the network instead (note the slight difference in the message):
> >
> > 2024-10-21 11:45:42.867 GMT [1697787:2] user=,db=,app=,client= ERROR: could not receive data from WAL stream: could not receive data from server: Connection timed out
> > 2024-10-21 11:45:42.869 GMT [849296:40860] user=,db=,app=,client= LOG: background worker "logical replication worker" (PID 1697787) exited with exit code 1
> >
> > The message is generated in the libpqrcv_receive function (replication/libpqwalreceiver/libpqwalreceiver.c), which calls pqsecure_raw_read (interfaces/libpq/fe-secure.c).
> >
> > The last message, "Connection timed out", is the errno translation from the recv system call:
> >
> > ETIMEDOUT Connection timed out (POSIX.1-2001)
> >
> > When those timeouts occurred, the sender was still busy deleting files from data/pg_replslot/bdcpb21_sene, having accumulated more than 6 million small ".spill" files.
> > It seems this very long pause happens at the cleanup stage, where PG is blindly trying to delete those files.
> >
> > strace on the wal_sender shows tons of calls like:
> >
> > unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-0.spill") = -1 ENOENT (No such file or directory)
> > unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-1000000.spill") = -1 ENOENT (No such file or directory)
> > unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-2000000.spill") = -1 ENOENT (No such file or directory)
> > unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-3000000.spill") = -1 ENOENT (No such file or directory)
> > unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-4000000.spill") = -1 ENOENT (No such file or directory)
> > unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-5000000.spill") = -1 ENOENT (No such file or directory)
> >
> > This occurs in ReorderBufferRestoreCleanup (backend/replication/logical/reorderbuffer.c).
> > The call stack suggests this probably happens in DecodeCommit or DecodeAbort (backend/replication/logical/decode.c):
> >
> > unlink("pg_replslot/bdcpb21_sene/xid-2730444214-lsn-43A6-88000000.spill") = -1 ENOENT (No such file or directory)
> >
> > /usr/lib64/libc-2.17.so(unlink+0x7) [0xf12e7]
> > /usr/pgsql-15/bin/postgres(ReorderBufferRestoreCleanup.isra.17+0x5d) [0x769e3d]
> > /usr/pgsql-15/bin/postgres(ReorderBufferCleanupTXN+0x166) [0x76aec6] <=== replication/logical/reorderbuffer.c:1480 (but this (static) function is only used within this module ...)
> > /usr/pgsql-15/bin/postgres(xact_decode+0x1e7) [0x75f217] <=== replication/logical/decode.c:175
> > /usr/pgsql-15/bin/postgres(LogicalDecodingProcessRecord+0x73) [0x75eee3] <=== replication/logical/decode.c:90, calls rmgr.rm_decode(ctx, &buf), one of the 6 resource manager methods
> > /usr/pgsql-15/bin/postgres(XLogSendLogical+0x4e) [0x78294e]
> > /usr/pgsql-15/bin/postgres(WalSndLoop+0x151) [0x785121]
> > /usr/pgsql-15/bin/postgres(exec_replication_command+0xcba) [0x785f4a]
> > /usr/pgsql-15/bin/postgres(PostgresMain+0xfa8) [0x7d0588]
> > /usr/pgsql-15/bin/postgres(ServerLoop+0xa8a) [0x493b97]
> > /usr/pgsql-15/bin/postgres(PostmasterMain+0xe6c) [0x74d66c]
> > /usr/pgsql-15/bin/postgres(main+0x1c5) [0x494a05]
> > /usr/lib64/libc-2.17.so(__libc_start_main+0xf4) [0x22554]
> > /usr/pgsql-15/bin/postgres(_start+0x28) [0x494fb8]
> >
> > We did not find any other option than dropping the subscription and creating a new one to stop that loop (thus losing transactions).
> >
> > The publisher is PostgreSQL 15.6.
> > The subscriber is PostgreSQL 14.5.
> >
> > Thanks
>
> Hi,
>
> Do you have a reproducible test case for the above scenario? Please share the same.
> I am also trying to reproduce the above issue by generating a large no. of spill files.
>
> Thanks and Regards,
> Shlok Kyal
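For context on the unlink() storm shown in the quoted strace: ReorderBufferRestoreCleanup() does not keep track of which spill files actually exist. It iterates over every WAL segment between the (sub)transaction's first and final LSN, builds the candidate filename for each, and unlinks it, silently ignoring ENOENT. The sketch below is a simplified rendering of the PostgreSQL 15 code in src/backend/replication/logical/reorderbuffer.c, not the verbatim source:

    /*
     * Simplified sketch of ReorderBufferRestoreCleanup() from
     * src/backend/replication/logical/reorderbuffer.c (not verbatim).
     * One unlink() attempt is made per WAL segment spanned by the
     * transaction, whether or not a spill file was ever written for
     * that segment -- hence the millions of unlink() = -1 ENOENT
     * calls visible in the strace output.
     */
    static void
    ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        XLogSegNo   first, cur, last;

        XLByteToSeg(txn->first_lsn, first, wal_segment_size);
        XLByteToSeg(txn->final_lsn, last, wal_segment_size);

        /* iterate over all possible filenames and delete them */
        for (cur = first; cur <= last; cur++)
        {
            char    path[MAXPGPATH];

            /* produces "pg_replslot/<slot>/xid-%u-lsn-%X-%X.spill" */
            ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, cur);

            if (unlink(path) != 0 && errno != ENOENT)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not remove file \"%s\": %m", path)));
        }
    }

Since ReorderBufferCleanupTXN() runs this for the top-level transaction and for each of its subtransactions, a transaction with millions of subtransactions turns cleanup into millions of unlink() system calls, which would account for both the ~200 s deletion phase in the repro above and the hours of silence on the replication connection in the original report.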