RE: Exit walsender before confirming remote flush in logical replication - Mailing list pgsql-hackers
From | Hayato Kuroda (Fujitsu) |
---|---|
Subject | RE: Exit walsender before confirming remote flush in logical replication |
Date | |
Msg-id | TYCPR01MB58701A47F35FED0A2B399662F5C49@TYCPR01MB5870.jpnprd01.prod.outlook.com |
In response to | Re: Exit walsender before confirming remote flush in logical replication (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
Dear Amit, hackers,

> Let me try to summarize the discussion till now. The problem we are
> trying to solve here is to allow a shutdown to complete when walsender
> is not able to send the entire WAL. Currently, in such cases, the
> shutdown fails. As per our current understanding, this can happen when
> (a) walreceiver/walapply process is stuck (not able to receive more
> WAL) due to locks or some other reason; (b) a long time delay has been
> configured to apply the WAL (we don't yet have such a feature for
> logical replication but the discussion for same is in progress).

Thanks for summarizing. While analyzing the stuck cases, I noticed that there are two types of shutdown failure. They can be distinguished by their backtraces, which are shown at the bottom.

Type i)
The walsender executes WalSndDone() but cannot satisfy its exit condition. All WAL has been sent to the subscriber but has not been flushed there; sentPtr is not the same as replicatedPtr. This can happen when the delayed transaction is small or streamed.

Type ii)
The walsender cannot reach WalSndDone() and is stuck in ProcessPendingWrites(). The send buffer became full while replicating a transaction, so pq_is_send_pending() returns true and the walsender cannot break out of the loop. This can happen when the delayed transaction is large but not streamed.

If we choose modification (1), we can only fix type (i), because in type (ii) the pending WAL still causes the failure. IIUC, if we want to shut down walsender processes even in case (ii), we must choose (2), and additional fixes are needed.

Based on the above, I prefer modification (2) because it can rescue more cases. Thoughts?

PSA the patch for it. It is almost the same as the previous version, but the comments are updated.
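To make the distinction between the two types concrete, here is a minimal, self-contained sketch. It is not the code in walsender.c and not part of the patch. SenderState, in_pending_writes, can_finish_shutdown() and classify_shutdown() are hypothetical names used only for illustration; the exit test is a simplified model of the condition described above (sentPtr equal to replicatedPtr and nothing pending in the send buffer).

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a WAL position; an assumption for this sketch. */
typedef uint64_t XLogRecPtr;

/* Hypothetical snapshot of walsender state, not the real data structure. */
typedef struct SenderState
{
    XLogRecPtr sentPtr;           /* last WAL position sent to the subscriber */
    XLogRecPtr replicatedPtr;     /* last position the subscriber confirmed */
    bool       send_pending;      /* models pq_is_send_pending() */
    bool       in_pending_writes; /* true while looping in ProcessPendingWrites() */
} SenderState;

typedef enum StuckType
{
    NOT_STUCK,
    STUCK_TYPE_I,   /* reached WalSndDone(), exit condition not met */
    STUCK_TYPE_II   /* never reaches WalSndDone(), send buffer stays full */
} StuckType;

/* Simplified model of the exit condition checked at shutdown. */
bool
can_finish_shutdown(const SenderState *s)
{
    return s->sentPtr == s->replicatedPtr && !s->send_pending;
}

/* Classify a shutdown attempt according to the two types described above. */
StuckType
classify_shutdown(const SenderState *s)
{
    /* Type ii: the loop over pending writes cannot end, so the exit check is never reached. */
    if (s->in_pending_writes && s->send_pending)
        return STUCK_TYPE_II;

    /* Type i: the exit check runs but keeps failing. */
    if (!can_finish_shutdown(s))
        return STUCK_TYPE_I;

    return NOT_STUCK;
}
```

In this model, a state with sentPtr ahead of replicatedPtr and an empty send buffer is classified as type i, while a state that is still inside the pending-writes loop with data left to send is classified as type ii regardless of the pointers.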
Appendix:

The backtrace for type i)

```
#0  WalSndDone (send_data=0x87f825 <XLogSendLogical>)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:3111
#1  0x000000000087ed1d in WalSndLoop (send_data=0x87f825 <XLogSendLogical>)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:2525
#2  0x000000000087d40a in StartLogicalReplication (cmd=0x1f49030)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1320
#3  0x000000000087df29 in exec_replication_command (
    cmd_string=0x1f15498 "START_REPLICATION SLOT \"sub\" LOGICAL 0/0 (proto_version '4', streaming 'on', origin 'none', publication_names'\"pub\"')")
    at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1830
#4  0x000000000091b032 in PostgresMain (dbname=0x1f4c938 "postgres", username=0x1f4c918 "postgres")
    at ../../PostgreSQL-Source-Dev/src/backend/tcop/postgres.c:4561
#5  0x000000000085390b in BackendRun (port=0x1f3d0b0)
    at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:4437
#6  0x000000000085322c in BackendStartup (port=0x1f3d0b0)
    at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:4165
#7  0x000000000084f7a2 in ServerLoop ()
    at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:1762
#8  0x000000000084f0a2 in PostmasterMain (argc=3, argv=0x1f0ff30)
    at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:1452
#9  0x000000000074a4d6 in main (argc=3, argv=0x1f0ff30)
    at ../../PostgreSQL-Source-Dev/src/backend/main/main.c:200
```

The backtrace for type ii)

```
#0  ProcessPendingWrites ()
    at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1438
#1  0x000000000087d635 in WalSndWriteData (ctx=0x1429ce8, lsn=22406440, xid=731, last_write=true)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1405
#2  0x0000000000888420 in OutputPluginWrite (ctx=0x1429ce8, last_write=true)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/logical.c:669
#3  0x00007f022dfe43a7 in pgoutput_change (ctx=0x1429ce8, txn=0x1457d40, relation=0x7f0245075268, change=0x1460ef8)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/pgoutput/pgoutput.c:1491
#4  0x0000000000889125 in change_cb_wrapper (cache=0x142bcf8, txn=0x1457d40, relation=0x7f0245075268, change=0x1460ef8)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/logical.c:1077
#5  0x000000000089507c in ReorderBufferApplyChange (rb=0x142bcf8, txn=0x1457d40, relation=0x7f0245075268, change=0x1460ef8,streaming=false)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:1969
#6  0x0000000000895866 in ReorderBufferProcessTXN (rb=0x142bcf8, txn=0x1457d40, commit_lsn=23060624, snapshot_now=0x1440150,command_id=0, streaming=false)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:2245
#7  0x0000000000896348 in ReorderBufferReplay (txn=0x1457d40, rb=0x142bcf8, xid=731, commit_lsn=23060624, end_lsn=23060672,commit_time=727353664342177, origin_id=0, origin_lsn=0)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:2675
#8  0x00000000008963d0 in ReorderBufferCommit (rb=0x142bcf8, xid=731, commit_lsn=23060624, end_lsn=23060672, commit_time=727353664342177,origin_id=0, origin_lsn=0)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:2699
#9  0x00000000008842c7 in DecodeCommit (ctx=0x1429ce8, buf=0x7ffcf03731a0, parsed=0x7ffcf0372fa0, xid=731, two_phase=false)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/decode.c:682
#10 0x0000000000883667 in xact_decode (ctx=0x1429ce8, buf=0x7ffcf03731a0)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/decode.c:216
#11 0x000000000088338b in LogicalDecodingProcessRecord (ctx=0x1429ce8, record=0x142a080)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/decode.c:119
#12 0x000000000087f8c7 in XLogSendLogical ()
    at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:3060
#13 0x000000000087ec5a in WalSndLoop (send_data=0x87f825 <XLogSendLogical>)
    at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:2490
...
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment