Thread: walsender timeout on logical replication set
Hello.

As reported in [1], it seems that walsender can suffer a timeout in certain cases. It is not clearly confirmed, but I suspect there is a case where LogicalRepApplyLoop keeps running its innermost loop without receiving a keepalive packet for longer than wal_sender_timeout (not wal_receiver_timeout).

Of course, if the subscriber is simply underpowered, that can be resolved by giving it sufficient processing power. But if it happens between servers with equal processing power, it seems reasonable to "fix" it. Theoretically, I think it can happen even with equally powered servers if the connecting network is sufficiently fast, because sending the reordered changes is simpler and faster than applying those changes on the subscriber.

I think we don't want to call GetCurrentTimestamp on every iteration of the innermost loop. Even if we call it every N iterations, I cannot come up with a proper N that fits every workload (a rough illustration of that variant is at the end of this mail). So one possible solution would be using SIGALRM. Is it worth doing? Or is there any other way? Even if we don't fix this, we might need to describe this restriction in the documentation.

Any thoughts?

[1] https://www.postgresql.org/message-id/CAEDsCzhBtkNDLM46_fo_HirFYE2Mb3ucbZrYqG59ocWqWy7-xA%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
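To make the "every N iterations" variant concrete, here is a rough sketch (illustration only, not a patch; the helper name, the counter, and N = 1000 are all invented for this mail, and send_feedback() is worker.c's existing static helper, assuming I remember its signature correctly):

    #include "postgres.h"
    #include "access/xlogdefs.h"            /* XLogRecPtr */
    #include "replication/walreceiver.h"    /* wal_receiver_timeout */
    #include "utils/timestamp.h"

    /*
     * Hypothetical helper for the innermost loop of LogicalRepApplyLoop():
     * read the clock only every Nth message, and send a status update when
     * more than half of wal_receiver_timeout has passed since the last one.
     */
    #define STATUS_CHECK_INTERVAL   1000    /* "N"; arbitrary */

    static void
    maybe_send_status(XLogRecPtr recvpos)
    {
        static int          msgs_since_check = 0;
        static TimestampTz  last_status_time = 0;
        TimestampTz         now;

        if (++msgs_since_check < STATUS_CHECK_INTERVAL)
            return;
        msgs_since_check = 0;

        now = GetCurrentTimestamp();
        if (TimestampDifferenceExceeds(last_status_time, now,
                                       wal_receiver_timeout / 2))
        {
            /* send_feedback() is worker.c's existing (static) helper */
            send_feedback(recvpos, true, false);
            last_status_time = now;
        }
    }

The problem, as said above, is choosing N: too large and a slow apply still starves the reply path, too small and we burn cycles reading the clock.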
On Mon, Sep 13, 2021 at 7:01 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> As reported in [1], it seems that walsender can suffer a timeout in
> certain cases. It is not clearly confirmed, but I suspect there is a
> case where LogicalRepApplyLoop keeps running its innermost loop
> without receiving a keepalive packet for longer than
> wal_sender_timeout (not wal_receiver_timeout).
>

Why is that happening? In the previous investigation in this area [1], your tests revealed that after reading a WAL page we always send a keepalive, so even if the transaction is large, we should send some keepalives in between. (My understanding of the sender's periodic keepalive is sketched at the end of this mail.)

The other thing that I am not able to understand from Abhishek's reply [2] is why increasing wal_sender_timeout/wal_receiver_timeout leads to the removal of required WAL segments. As per my understanding, we shouldn't remove WAL unless we get confirmation that the subscriber has processed it.

[1] - https://www.postgresql.org/message-id/20210610.150016.1709823354377067679.horikyota.ntt%40gmail.com
[2] - https://www.postgresql.org/message-id/CAEDsCzjEHLxgqa4d563CKFwSbgBvvnM91Cqfq_qoZDXCkyOsiw%40mail.gmail.com

Note - I have added Abhishek to see if he has answers to any of these questions.

-- 
With Regards,
Amit Kapila.
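For reference, the sender's timeout-driven keepalive is roughly the following shape (paraphrased from memory of walsender.c, not quoted verbatim; the function name here is made up, and last_reply_timestamp / waiting_for_ping_response stand in for the file's static state, passed as parameters so the sketch stands alone):

    #include "postgres.h"
    #include "replication/walsender.h"    /* wal_sender_timeout */
    #include "utils/timestamp.h"

    /*
     * Declared here only for the sketch; the real WalSndKeepalive() is
     * walsender.c's internal helper that sends a keepalive message,
     * optionally asking the receiver to reply.
     */
    extern void WalSndKeepalive(bool requestReply);

    static void
    keepalive_if_half_timeout_elapsed(TimestampTz now,
                                      TimestampTz last_reply_timestamp,
                                      bool waiting_for_ping_response)
    {
        TimestampTz ping_time;

        if (wal_sender_timeout <= 0 || waiting_for_ping_response)
            return;

        /* Ask for a reply once half of wal_sender_timeout has elapsed. */
        ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
                                                wal_sender_timeout / 2);
        if (now >= ping_time)
            WalSndKeepalive(true);    /* true = request a reply */
    }

So the sender does ping the receiver well before its own timeout expires; the question is whether the receiver gets a chance to act on that ping.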
Thank you very much for coming in!

At Fri, 17 Sep 2021 10:18:11 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Mon, Sep 13, 2021 at 7:01 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > Hello.
> >
> > As reported in [1], it seems that walsender can suffer a timeout in
> > certain cases. It is not clearly confirmed, but I suspect there is a
> > case where LogicalRepApplyLoop keeps running its innermost loop
> > without receiving a keepalive packet for longer than
> > wal_sender_timeout (not wal_receiver_timeout).
>
> Why is that happening? In the previous investigation in this area [1],
> your tests revealed that after reading a WAL page we always send a
> keepalive, so even if the transaction is large, we should send some
> keepalives in between.

We fixed the too-many-keepalives problem (aka keepalive flood) in that thread, but this is an issue of a long absence of any subscriber response. What I am suspecting, or assuming, here is:

- The publisher is working fine. It does not send many extra keepalives, and it does send regular keepalives every wal_sender_timeout/2 by the sender's clock.

- The network conveys all the data in time.

- The subscriber consumes the received data at less than half the speed at which the publisher sends it. In this case, while the burst traffic is coming, the publisher keeps sending for wal_sender_timeout/2 seconds and may not send a keepalive for that duration. This is the correct behavior. On the other hand, the subscriber is kept busy without receiving a keepalive for wal_sender_timeout seconds. AFAICS LogicalRepApplyLoop does not send a response unless a keepalive arrives while it is in the innermost loop.

If wal_sender_timeout is relatively short (5 seconds, in the report), a burst (or gap-less) stretch of logical replication traffic can easily continue for more than 2.5 seconds. If wal_sender_timeout is longer (1 min, ditto), burst replication traffic lasting for more than wal_sender_timeout/2 becomes relatively infrequent. However, I am not sure how increasing it further, to 5 min, could make things worse again.

Is my diagnosis correct, that while the innermost loop in LogicalRepApplyLoop [A] is busy, the worker has no chance to send a reply until a keepalive arrives? If so, a walsender timeout due to slowness of the subscriber can happen, and we might need to break out of the innermost loop to give the subscriber a chance to send a response at appropriate intervals. This is what I wanted to propose (a rough sketch of what I mean is at the end of this mail).

[A]
backend/replication/logical/worker.c:2565@today's master
>		/* Loop to process all available data (without blocking). */
>		for (;;)

> The other thing that I am not able to understand from Abhishek's reply
> [2] is why increasing wal_sender_timeout/wal_receiver_timeout leads to
> the removal of required WAL segments. As per my understanding, we
> shouldn't remove WAL unless we get confirmation that the subscriber
> has processed it.
>
> [1] - https://www.postgresql.org/message-id/20210610.150016.1709823354377067679.horikyota.ntt%40gmail.com
> [2] - https://www.postgresql.org/message-id/CAEDsCzjEHLxgqa4d563CKFwSbgBvvnM91Cqfq_qoZDXCkyOsiw%40mail.gmail.com
>
> Note - I have added Abhishek to see if he has answers to any of these questions.

Ouch! max_slot_wal_keep_size was only introduced in 13, so for now I have no idea how the required segments could have been removed on the publisher.

== From the first report [1]
sourcedb=# select * from pg_replication_slots;
...
restart_lsn            116D0/C36886F8
confirmed_flush_lsn    116D0/C3E5D370

targetdb=# show wal_receiver_status_interval;
wal_receiver_status_interval    2s

targetdb=# select * from pg_stat_subscription;
..
received_lsn           116D1/2BA8F170
last_msg_send_time     2021-08-20 09:05:15.398423+09
last_msg_receipt_time  2021-08-20 09:05:15.398471+09
latest_end_lsn         116D1/2BA8F170
latest_end_time        2021-08-20 09:05:15.398423+09
==

There is a gap of about 105 segments (1.7GB) between how far the subscriber has actually advanced and the publisher's idea of how far the subscriber has advanced. But that alone cannot cause WAL removal..

[1] https://www.postgresql.org/message-id/flat/CAEDsCzjEHLxgqa4d563CKFwSbgBvvnM91Cqfq_qoZDXCkyOsiw%40mail.gmail.com#72da631f3af885b06669ddc1636a0a63

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
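To be a bit more concrete about "break the innermost loop", something like the following is what I have in mind (a sketch only, not a patch; the helper name, the 100-message cap between clock reads, and the wal_receiver_timeout/2 budget are all invented for this mail):

    #include "postgres.h"
    #include "replication/walreceiver.h"    /* wal_receiver_timeout */
    #include "utils/timestamp.h"

    /*
     * Hypothetical helper the innermost loop of LogicalRepApplyLoop() could
     * call after each message: returns true once half of
     * wal_receiver_timeout has passed since the loop was entered, so the
     * caller can break out and fall through to the existing send_feedback()
     * call in the outer loop.
     */
    static bool
    reply_deadline_reached(TimestampTz loop_start)
    {
        static int  msgs_since_check = 0;

        /* Avoid reading the clock for every single message. */
        if (++msgs_since_check < 100)
            return false;
        msgs_since_check = 0;

        return TimestampDifferenceExceeds(loop_start,
                                          GetCurrentTimestamp(),
                                          wal_receiver_timeout / 2);
    }

With that, the innermost loop would record GetCurrentTimestamp() once before entering, and break out when reply_deadline_reached(loop_start) returns true after handling a message, instead of relying on a keepalive from the publisher to get back to the reply path.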
On Fri, Sep 17, 2021 at 12:48 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Thank you very much for coming in!
>
> At Fri, 17 Sep 2021 10:18:11 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Mon, Sep 13, 2021 at 7:01 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > Hello.
> > >
> > > As reported in [1], it seems that walsender can suffer a timeout in
> > > certain cases. It is not clearly confirmed, but I suspect there is a
> > > case where LogicalRepApplyLoop keeps running its innermost loop
> > > without receiving a keepalive packet for longer than
> > > wal_sender_timeout (not wal_receiver_timeout).
> >
> > Why is that happening? In the previous investigation in this area [1],
> > your tests revealed that after reading a WAL page we always send a
> > keepalive, so even if the transaction is large, we should send some
> > keepalives in between.
>
> We fixed the too-many-keepalives problem (aka keepalive flood) in that
> thread, but this is an issue of a long absence of any subscriber
> response. What I am suspecting, or assuming, here is:
>
> - The publisher is working fine. It does not send many extra
>   keepalives, and it does send regular keepalives every
>   wal_sender_timeout/2 by the sender's clock.
>

I think the publisher should also send it even after a certain amount of WAL is consumed, via the below code:

WalSndWaitForWal()
{
...
	if (MyWalSnd->flush < sentPtr &&
		MyWalSnd->write < sentPtr &&
		!waiting_for_ping_response)
		WalSndKeepalive(false);
...
}

> - The network conveys all the data in time.
>
> - The subscriber consumes the received data at less than half the
>   speed at which the publisher sends it. In this case, while the burst
>   traffic is coming, the publisher keeps sending for
>   wal_sender_timeout/2 seconds and may not send a keepalive for that
>   duration. This is the correct behavior. On the other hand, the
>   subscriber is kept busy without receiving a keepalive for
>   wal_sender_timeout seconds. AFAICS LogicalRepApplyLoop does not send
>   a response unless a keepalive arrives while it is in the innermost
>   loop.
>

One way this could happen is that the apply is taking a long time because of contention on the subscriber; say, there are a lot of other operations going on on the subscriber, or it is stuck for some reason.

> If wal_sender_timeout is relatively short (5 seconds, in the report),
> a burst (or gap-less) stretch of logical replication traffic can
> easily continue for more than 2.5 seconds. If wal_sender_timeout is
> longer (1 min, ditto), burst replication traffic lasting for more than
> wal_sender_timeout/2 becomes relatively infrequent.
>
> However, I am not sure how increasing it further, to 5 min, could make
> things worse again.
>

There might be a possibility that the subscriber is stuck or is extremely slow due to other operations.

> Is my diagnosis correct, that while the innermost loop in
> LogicalRepApplyLoop [A] is busy, the worker has no chance to send a
> reply until a keepalive arrives? If so, a walsender timeout due to
> slowness of the subscriber can happen, and we might need to break out
> of the innermost loop to give the subscriber a chance to send a
> response at appropriate intervals. This is what I wanted to propose.
>

I was thinking that increasing wal_sender_timeout/wal_receiver_timeout should solve this problem. I am not sure why it leads to loss of WAL in the OP's case. (A sketch of the sender-side timeout check is at the end of this mail.)

> [A]
> backend/replication/logical/worker.c:2565@today's master
> >		/* Loop to process all available data (without blocking). */
> >		for (;;)
>
> > The other thing that I am not able to understand from Abhishek's reply
> > [2] is why increasing wal_sender_timeout/wal_receiver_timeout leads to
> > the removal of required WAL segments. As per my understanding, we
> > shouldn't remove WAL unless we get confirmation that the subscriber
> > has processed it.
> >
> > [1] - https://www.postgresql.org/message-id/20210610.150016.1709823354377067679.horikyota.ntt%40gmail.com
> > [2] - https://www.postgresql.org/message-id/CAEDsCzjEHLxgqa4d563CKFwSbgBvvnM91Cqfq_qoZDXCkyOsiw%40mail.gmail.com
> >
> > Note - I have added Abhishek to see if he has answers to any of these questions.
>
> Ouch! max_slot_wal_keep_size was only introduced in 13, so for now I
> have no idea how the required segments could have been removed on the
> publisher.
>
> == From the first report [1]
> sourcedb=# select * from pg_replication_slots;
> ...
> restart_lsn            116D0/C36886F8
> confirmed_flush_lsn    116D0/C3E5D370
>
> targetdb=# show wal_receiver_status_interval;
> wal_receiver_status_interval    2s
>
> targetdb=# select * from pg_stat_subscription;
> ..
> received_lsn           116D1/2BA8F170
> last_msg_send_time     2021-08-20 09:05:15.398423+09
> last_msg_receipt_time  2021-08-20 09:05:15.398471+09
> latest_end_lsn         116D1/2BA8F170
> latest_end_time        2021-08-20 09:05:15.398423+09
> ==
>
> There is a gap of about 105 segments (1.7GB) between how far the
> subscriber has actually advanced and the publisher's idea of how far
> the subscriber has advanced. But that alone cannot cause WAL removal..
>

Yeah, that is quite strange.

-- 
With Regards,
Amit Kapila.
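For completeness, the check that actually kills the connection is roughly the following (again paraphrased from memory of walsender.c rather than quoted; the function name is made up, and last_reply_timestamp stands in for the walsender's static state):

    #include "postgres.h"
    #include "replication/walsender.h"    /* wal_sender_timeout */
    #include "utils/timestamp.h"

    /*
     * If no reply has arrived within wal_sender_timeout since the last one,
     * the walsender gives up -- this is the "terminating walsender process
     * due to replication timeout" failure discussed in this thread.
     */
    static void
    check_replication_timeout(TimestampTz now, TimestampTz last_reply_timestamp)
    {
        TimestampTz timeout;

        if (wal_sender_timeout <= 0)
            return;

        timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
                                              wal_sender_timeout);
        if (now >= timeout)
        {
            ereport(COMMERROR,
                    (errmsg("terminating walsender process due to replication timeout")));
            /* the real code then shuts the walsender down */
        }
    }

So raising wal_sender_timeout only widens the window; a reply from the subscriber still has to arrive within it.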