001_rep_changes.pl stalls - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | 001_rep_changes.pl stalls |
Date | |
Msg-id | 20200406063649.GA3738151@rfd.leadboat.com |
Responses | Re: 001_rep_changes.pl stalls |
List | pgsql-hackers |
Executive summary: the "MyWalSnd->write < sentPtr" test in WalSndWaitForWal() is important for promptly updating pg_stat_replication. When caught up, we should impose that logic before every sleep. The one-line fix is to sleep in WalSndLoop() only when pq_is_send_pending(), not when caught up.

On my regular development machine, src/test/subscription/t/001_rep_changes.pl stalls for ~10s at this wait_for_catchup:

$node_publisher->safe_psql('postgres', "DELETE FROM tab_rep");

# Restart the publisher and check the state of the subscriber which
# should be in a streaming state after catching up.
$node_publisher->stop('fast');
$node_publisher->start;

$node_publisher->wait_for_catchup('tap_sub');

That snippet emits three notable physical WAL records: a Transaction/COMMIT at the end of the DELETE, an XLOG/CHECKPOINT_SHUTDOWN, and an XLOG/FPI_FOR_HINT. The buildfarm has stalled there as well, though probably less than half the time. Examples[1] showing the stall:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mandrill&dt=2020-03-20%2017%3A09%3A53&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=thorntail&dt=2020-03-22%2019%3A51%3A38&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-19%2003%3A35%3A01&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-20%2015%3A15%3A01&stg=subscription-check

Here's the most-relevant walsender call tree:

WalSndLoop
  XLogSendLogical (caller invokes once per loop iteration, via send_data callback)
    XLogReadRecord (caller invokes once)
      ReadPageInternal (caller invokes twice in this test; more calls are possible)
        logical_read_xlog_page (caller skips when page is same as last call, else invokes 1-2 times via state->read_page() callback, registered in StartLogicalReplication)
          WalSndWaitForWal (caller invokes once; has fast path)

The cause is a race involving the flow of reply messages (send_feedback() messages) from the logical apply worker to the walsender. Here are two sequencing patterns; the more-indented parts are what differ.

Stalling pattern:

sender reads Transaction/COMMIT and sends the changes
    receiver applies the changes
    receiver send_feedback() reports progress up to Transaction/COMMIT
    sender accepts the report
sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep
sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
    receiver wal_receiver_status_interval elapses; receiver reports progress up to Transaction/COMMIT
sender wakes up, accepts the report
sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr"
receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT

Non-stalling pattern (more prevalent with lower machine performance):

sender reads Transaction/COMMIT and sends the changes
sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep
sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
    receiver applies the changes
    receiver send_feedback() reports progress up to Transaction/COMMIT
sender wakes up, accepts the report
sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr"
receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT

The fix is to test "MyWalSnd->write < sentPtr" before more sleeps.
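For reference, the logic in question lives in WalSndWaitForWal() (src/backend/replication/walsender.c). The following is a paraphrased sketch, not a verbatim quote; the identifiers are real, but the exact guard is approximated:

/*
 * Paraphrase of the pre-sleep check in WalSndWaitForWal().  If the client
 * has not yet confirmed everything already sent, ping it before sleeping.
 * An otherwise-idle receiver replies to the keepalive right away instead
 * of waiting out wal_receiver_status_interval, and processing that reply
 * updates MyWalSnd->write/flush and hence pg_stat_replication.
 */
if (MyWalSnd->write < sentPtr &&
    !waiting_for_ping_response)
    WalSndKeepalive(false);    /* prompts an idle receiver to report progress */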
The test is unnecessary when sleeping due to pq_is_send_pending(); in that case, the receiver is not idle and will reply before idling. I changed WalSndLoop() to sleep only for pq_is_send_pending(). For all other sleep reasons, the sleep will happen in WalSndWaitForWal(). Attached; a sketch of the shape of the change follows the footnote below.

I don't know whether this is important outside of testing scenarios. I lean against back-patching, but I will back-patch if someone thinks this qualifies as a performance bug.

Thanks,
nm

[1] I spot-checked only my animals, since I wanted to experiment on an affected animal.
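For illustration of the shape of the change (a sketch only: it assumes the sleep guard in WalSndLoop() currently reads as shown in the comment, and the wait call merely paraphrases the existing sleep; the attached patch is the authoritative version):

/*
 * In WalSndLoop() (src/backend/replication/walsender.c), the guard around
 * the sleep is assumed to read something like
 *
 *     if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
 *
 * With the proposed change, WalSndLoop() sleeps only while the output buffer
 * is still draining; every caught-up sleep instead happens inside
 * WalSndWaitForWal(), which applies the "MyWalSnd->write < sentPtr"
 * keepalive test first.
 */
if (pq_is_send_pending())
{
    long        sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());

    (void) WaitLatchOrSocket(MyLatch,
                             WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT |
                             WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE,
                             MyProcPort->sock, sleeptime,
                             WAIT_EVENT_WAL_SENDER_MAIN);
}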
Attachment