Re: [HACKERS] Walsender timeouts and large transactions - Mailing list pgsql-hackers
From | Sokolov Yura |
---|---|
Subject | Re: [HACKERS] Walsender timeouts and large transactions |
Date | |
Msg-id | 20170904142052.02203dda@falcon-work Whole thread Raw |
In response to | Re: [HACKERS] Walsender timeouts and large transactions (Sokolov Yura <funny.falcon@postgrespro.ru>) |
Responses |
Re: [HACKERS] Walsender timeouts and large transactions
|
List | pgsql-hackers |
On 2017-05-25 17:52 Petr Jelinek wrote: > Hi, > > We have had issue with walsender timeout when used with logical > decoding and the transaction is taking long time to be decoded > (because it contains many changes) > > I was looking today at the walsender code and realized that it's > because if the network and downstream are fast enough, we'll always > take fast path in WalSndWriteData which does not do reply or > keepalive processing and is only reached once the transaction has > finished by other code. So paradoxically we die of timeout because > everything was fast enough to never fall back to slow code path. > > I propose we only use fast path if the last processed reply is not > older than half of walsender timeout, if it is then we'll force the > slow code path to process the replies again. This is similar logic > that we use to determine if to send keepalive message. I also added > CHECK_INTERRUPRS call to fast code path because otherwise walsender > might ignore them for too long on large transactions. > > Thoughts? > On 2017-08-10 14:20 Sokolov Yura wrote: > On 2017-08-09 16:23, Petr Jelinek wrote: > > On 02/08/17 19:35, Yura Sokolov wrote: > >> The following review has been posted through the commitfest > >> application: > >> make installcheck-world: tested, passed > >> Implements feature: not tested > >> Spec compliant: not tested > >> Documentation: not tested > >> > >> There is no check for (last_reply_timestamp <= 0 || > >> wal_sender_timeout <= 0) as in other places > >> (in WalSndKeepaliveIfNecessary for example). > >> > >> I don't think moving update of 'now' down to end of loop body is > >> correct: > >> there are calls to ProcessConfigFile with SyncRepInitConfig, > >> ProcessRepliesIfAny that can > >> last non-negligible time. It could lead to over sleeping due to > >> larger computed sleeptime. > >> Though I could be mistaken. > >> > >> I'm not sure about moving `if (!pg_is_send_pending())` in a body > >> loop after WalSndKeepaliveIfNecessary. > >> Is it necessary? But it looks harmless at least. > >> > > > > We also need to do actual timeout handing so that the timeout is not > > deferred to the end of the transaction (Which is why I moved `if > > (!pg_is_send_pending())` under WalSndCheckTimeOut() and > > WalSndKeepaliveIfNecessary() calls). > > > > If standby really stalled, then it will not read from socket, and then > `pg_is_sendpending` eventually will return false, and timeout will be > checked. > If standby doesn't stall, then `last_reply_timestamp` will be updated > in `ProcessRepliesIfAny`, and so timeout will not be triggered. > Do I understand correctly, or I missed something? > > >> Could patch be reduced to check after first `if > >> (!pg_is_sendpending())` ? like: > >> > >> if (!pq_is_send_pending()) > >> - return; > >> + { > >> + if (last_reply_timestamp <= 0 || > >> wal_sender_timeout <= 0) > >> + { > >> + CHECK_FOR_INTERRUPTS(); > >> + return; > >> + } > >> + if (now <= > >> TimestampTzPlusMilliseconds(last_reply_timestamp, > >> wal_sender_timeout / 2)) > >> + return; > >> + } > >> > >> If not, what problem prevents? > > > > We should do CHECK_FOR_INTERRUPTS() independently of > > pq_is_send_pending so that it's possible to stop walsender while > > it's processing large transaction. > > In this case CHECK_FOR_INTERRUPTS will be called after > wal_sender_timeout/2. > (This diff is for first appearance of `pq_is_send_pending` in a > function). > > If it should be called more often, then patch could be simplier: > > if (!pq_is_send_pending()) > - return; > + { > + CHECK_FOR_INTERRUPTS(); > + if (last_reply_timestamp <= 0 || wal_sender_timeout > <= 0 || > + now <= > TimestampTzPlusMilliseconds(last_reply_timestamp, > wal_sender_timeout / 2)) > + return; > + } > > (Still, I could be mistaken, it is just suggestion). > > Is it hard to add test for case this patch fixes? > > With regards, Tom, Robert, I believe this bug have to be fixed in Pg10, and I don't wonna be that guy who prevents it from being fixed. What should/could I do? ( https://commitfest.postgresql.org/14/1151/ ) -- With regards, Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres Company
pgsql-hackers by date: