Thread: pg_basebackup may fail to send feedbacks.
Hello, I received an report that pg_basebackup with "-X stream" always exits with the following error. > pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. The walsender had been terminated by replication timeout a bit before the above message. > LOG: terminating walsender process due to replication timeout ==== I digged into this and found that a accumulation of delay in receiving xlog stream can cause such an error. This hardly occurs on an ordinary environment but this would be caused by temporary (or permanent) heavy load on the receiver-side machine. A virtual machine environment could raises the chance, I suppose. In HandleCopyStream(), the feedbacks are sent only after breaks of the xlog stream, so continuous flow of xlog stream prevents it from being sent with expected intervals. walsender sends keepalive message for such a case, but it arrives with a long delay being caught by the congestion of the stream, but this is not a problem because the keepalive is intended to be sent while idle. I think that the status feedback should be sent whenever standby_message_timeout has elapsed just after (or before) an incoming message is processed. The seemingly most straightforward way to fix this is breaking the innner-loop (while(r != 0)) if the time to feedback comes as the attached patch #1 (0001-Make-sure-to-send-...). What do you thing about this? regards, ===== - How to reproduce the situation. As mentioned before, this hardly occurs on ordinary environment. But simulating the heavy load by inserting a delay in HandleCopyStream() effectively let the error occur. With the attached patch #2(insert_interval.diff), the following operation let us see the situation. The walsender reports timeout although the stream processing is delayed but running steadily. postgresql.conf: > wal_level = hot_standby > max_wal_senders = 2 > wal_sender_timeout = 20s # * 2 of default of standby_message_timeout Terminal1$ psql postgres postgres=# CREATE TABLE t1 (a text); postgres=# ALTER TABLE t1 ALTER COLUMN a SET STORAGE EXTERNAL; postgres=# INSERT INTO t1 (SELECT repeat('x', 10000) FROM generate_series(0, 99999)); -- about 1GB postgres=# ^D Terminal1$ pgbench -i postgres Terminal1$ pgbench -T 600 -h localhost postgres Terminal2$ pg_basebackupo -r 32k -X stream -D data -h localhost ... in about a couple of minutes on my environment.. Terminal1: > LOG: terminating walsender process due to replication timeout ... after another couple of minutes. Terminal2: > pg_basebackup: could not receive data from WAL stream: server closed... -- Kyotaro Horiguchi NTT Open Source Software Center
I'm very sorry for confused report. The problem found in 9.4.0 and the diagnosis was mistakenly done on master. 9.4.0 has no problem of feedback delay caused by slow xlog receiving on pg_basebackup mentioned in the previous mail. But the current master still has this problem. regards, At Mon, 02 Feb 2015 13:48:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20150202.134834.100672846.horiguchi.kyotaro@lab.ntt.co.jp> > Hello, I received an report that pg_basebackup with "-X stream" > always exits with the following error. > > > pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > The walsender had been terminated by replication timeout a bit > before the above message. > > > LOG: terminating walsender process due to replication timeout > > ==== > > I digged into this and found that a accumulation of delay in > receiving xlog stream can cause such an error. This hardly occurs > on an ordinary environment but this would be caused by temporary > (or permanent) heavy load on the receiver-side machine. A virtual > machine environment could raises the chance, I suppose. > > In HandleCopyStream(), the feedbacks are sent only after breaks > of the xlog stream, so continuous flow of xlog stream prevents it > from being sent with expected intervals. walsender sends > keepalive message for such a case, but it arrives with a long > delay being caught by the congestion of the stream, but this is > not a problem because the keepalive is intended to be sent while > idle. > > I think that the status feedback should be sent whenever > standby_message_timeout has elapsed just after (or before) an > incoming message is processed. The seemingly most straightforward > way to fix this is breaking the innner-loop (while(r != 0)) if > the time to feedback comes as the attached patch #1 > (0001-Make-sure-to-send-...). > > What do you thing about this? > > regards, > > > ===== > - How to reproduce the situation. > > As mentioned before, this hardly occurs on ordinary > environment. But simulating the heavy load by inserting a delay > in HandleCopyStream() effectively let the error occur. > > With the attached patch #2(insert_interval.diff), the following > operation let us see the situation. The walsender reports timeout > although the stream processing is delayed but running steadily. > > postgresql.conf: > > wal_level = hot_standby > > max_wal_senders = 2 > > wal_sender_timeout = 20s # * 2 of default of standby_message_timeout > > Terminal1$ psql postgres > postgres=# CREATE TABLE t1 (a text); > postgres=# ALTER TABLE t1 ALTER COLUMN a SET STORAGE EXTERNAL; > postgres=# INSERT INTO t1 (SELECT repeat('x', 10000) FROM generate_series(0, 99999)); -- about 1GB > postgres=# ^D > > Terminal1$ pgbench -i postgres > Terminal1$ pgbench -T 600 -h localhost postgres > > Terminal2$ pg_basebackupo -r 32k -X stream -D data -h localhost > > ... in about a couple of minutes on my environment.. > > Terminal1: > > LOG: terminating walsender process due to replication timeout > > ... after another couple of minutes. > > Terminal2: > > pg_basebackup: could not receive data from WAL stream: server closed... -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I'm very sorry for confused report. The problem found in 9.4.0 > and the diagnosis was mistakenly done on master. > > 9.4.0 has no problem of feedback delay caused by slow xlog > receiving on pg_basebackup mentioned in the previous mail. But > the current master still has this problem. Seems walreceiver has the same problem. No? Regards, -- Fujii Masao
Hello, At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGudGCMnHZinkd37i+JijDkruEcrea1NCRs1MMtE3rOFQ@mail.gmail.com> > On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > I'm very sorry for confused report. The problem found in 9.4.0 > > and the diagnosis was mistakenly done on master. > > > > 9.4.0 has no problem of feedback delay caused by slow xlog > > receiving on pg_basebackup mentioned in the previous mail. But > > the current master still has this problem. > > Seems walreceiver has the same problem. No? pg_receivexlog.c would have the same problem since it uses the same function with pg_basebackup.c. The correspondent of HandleCopyStream in wansender is WalReceiverMain, and it doesn't seem to have the same kind of loop shown below. It seems to surely send feedback per one record. | r = stream_reader(); | while (r > 0) | { | ... wal record processing stuff without sending feedback.. | r = stream_reader(); | } regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Feb 5, 2015 at 10:20 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGudGCMnHZinkd37i+JijDkruEcrea1NCRs1MMtE3rOFQ@mail.gmail.com> >> On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > I'm very sorry for confused report. The problem found in 9.4.0 >> > and the diagnosis was mistakenly done on master. >> > >> > 9.4.0 has no problem of feedback delay caused by slow xlog >> > receiving on pg_basebackup mentioned in the previous mail. But >> > the current master still has this problem. >> >> Seems walreceiver has the same problem. No? > > pg_receivexlog.c would have the same problem since it uses the > same function with pg_basebackup.c. > > The correspondent of HandleCopyStream in wansender is > WalReceiverMain, and it doesn't seem to have the same kind of > loop shown below. It seems to surely send feedback per one > record. > > | r = stream_reader(); > | while (r > 0) > | { > | ... wal record processing stuff without sending feedback.. > | r = stream_reader(); > | } WalReceiverMain() has the similar code as follows. len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); if (len != 0) { for (;;) { if (len > 0) { .... len = walrcv_receive(0, &buf); } } Regards, -- Fujii Masao
Sorry, I misunderstood that. > > At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGudGCMnHZinkd37i+JijDkruEcrea1NCRs1MMtE3rOFQ@mail.gmail.com> > >> On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI > >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >> > I'm very sorry for confused report. The problem found in 9.4.0 > >> > and the diagnosis was mistakenly done on master. > >> > > >> > 9.4.0 has no problem of feedback delay caused by slow xlog > >> > receiving on pg_basebackup mentioned in the previous mail. But > >> > the current master still has this problem. > >> > >> Seems walreceiver has the same problem. No? > > > > pg_receivexlog.c would have the same problem since it uses the > > same function with pg_basebackup.c. > > > > The correspondent of HandleCopyStream in wansender is > > WalReceiverMain, and it doesn't seem to have the same kind of > > loop shown below. It seems to surely send feedback per one > > record. > > > > | r = stream_reader(); > > | while (r > 0) > > | { > > | ... wal record processing stuff without sending feedback.. > > | r = stream_reader(); > > | } > > WalReceiverMain() has the similar code as follows. > > len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); > if (len != 0) > { > for (;;) > { > if (len > 0) > { > .... > len = walrcv_receive(0, &buf); > } > } The loop seems a bit different but eventually the same about this discussion. 408> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); 409> if (len != 0) 410> { 415> for (;;) 416> { 417> if (len > 0) 418> { 425> XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1); 426> } 427-438> else {break;} 439> len = walrcv_receive(0, &buf); 440> } 441> } XLogWalRcvProcessMsg doesn't send feedback when processing 'w'=WAL record packet. So the same thing as pg_basebackup and pg_receivexlog will occur on walsender. Exiting the for(;;) loop by TimestampDifferenceExceeds just before line 439 is an equivalent measure to I poposed for receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there is seemingly simpler (but I feel a bit uncomfortable for the latter) Or other measures? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Feb 6, 2015 at 3:22 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Sorry, I misunderstood that. > >> > At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGudGCMnHZinkd37i+JijDkruEcrea1NCRs1MMtE3rOFQ@mail.gmail.com> >> >> On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI >> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> >> > I'm very sorry for confused report. The problem found in 9.4.0 >> >> > and the diagnosis was mistakenly done on master. >> >> > >> >> > 9.4.0 has no problem of feedback delay caused by slow xlog >> >> > receiving on pg_basebackup mentioned in the previous mail. But >> >> > the current master still has this problem. >> >> >> >> Seems walreceiver has the same problem. No? >> > >> > pg_receivexlog.c would have the same problem since it uses the >> > same function with pg_basebackup.c. >> > >> > The correspondent of HandleCopyStream in wansender is >> > WalReceiverMain, and it doesn't seem to have the same kind of >> > loop shown below. It seems to surely send feedback per one >> > record. >> > >> > | r = stream_reader(); >> > | while (r > 0) >> > | { >> > | ... wal record processing stuff without sending feedback.. >> > | r = stream_reader(); >> > | } >> >> WalReceiverMain() has the similar code as follows. >> >> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); >> if (len != 0) >> { >> for (;;) >> { >> if (len > 0) >> { >> .... >> len = walrcv_receive(0, &buf); >> } >> } > > The loop seems a bit different but eventually the same about this > discussion. > > 408> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); > 409> if (len != 0) > 410> { > 415> for (;;) > 416> { > 417> if (len > 0) > 418> { > 425> XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1); > 426> } > 427-438> else {break;} > 439> len = walrcv_receive(0, &buf); > 440> } > 441> } > > XLogWalRcvProcessMsg doesn't send feedback when processing > 'w'=WAL record packet. So the same thing as pg_basebackup and > pg_receivexlog will occur on walsender. > > Exiting the for(;;) loop by TimestampDifferenceExceeds just > before line 439 is an equivalent measure to I poposed for > receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there > is seemingly simpler (but I feel a bit uncomfortable for the > latter) I'm concerned about the performance degradation by calling getimeofday() per a record. > Or other measures? Introduce the maximum number of records that we can receive and process between feedbacks? For example, we can change XLogWalRcvSendHSFeedback() so that it's called per at least 8 records. I'm not sure what number is good, though... Regards, -- Fujii Masao
Hello, The attached patch is fix for walreciever not using gettimeofday, and fix for receivelog using it. > > XLogWalRcvProcessMsg doesn't send feedback when processing > > 'w'=WAL record packet. So the same thing as pg_basebackup and > > pg_receivexlog will occur on walsender. > > > > Exiting the for(;;) loop by TimestampDifferenceExceeds just > > before line 439 is an equivalent measure to I poposed for > > receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there > > is seemingly simpler (but I feel a bit uncomfortable for the > > latter) > > I'm concerned about the performance degradation by calling > getimeofday() per a record. > > > Or other measures? > > Introduce the maximum number of records that we can receive and > process between feedbacks? For example, we can change > XLogWalRcvSendHSFeedback() so that it's called per at least > 8 records. I'm not sure what number is good, though... At the beginning of the "while(len > 0){if (len > 0){" block, last_recv_timestamp is always kept up to date (by using gettimeotda():). So we can use the variable instead of gettimeofday() in my previous proposal. The start time of the timeout could be last_recv_timestamp at the beginning of the while loop, since it is a bit earlier than the actual time at the point. The another solution would be using RegisterTimeout/enable_timeout_after and it seemed to be work for me. It can throw out the time counting stuff in the loop we are talking about and that of XLogWalRcvSendHSFeedback and XLogWalRcvSendReply, but it might be a bit too large for the gain. Considering pg_basebackup/receivexlog, the loop in receivelog.c does not maintain the time value within it, so I think we are forced to use feGetCurrentTimeStamp if not using SIGALRM. The wal reading function simply gets the data from the buffer in memory for most calls so the gettimeofday for each iteration could slow the process significantly. SIGALRM seems to be valuable for the case. Thoughts? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, 15 19:48:23 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20150210.194823.219136034.horiguchi.kyotaro@lab.ntt.co.jp> > Considering pg_basebackup/receivexlog, the loop in receivelog.c > does not maintain the time value within it, so I think we are > forced to use feGetCurrentTimeStamp if not using SIGALRM. The wal > reading function simply gets the data from the buffer in memory > for most calls so the gettimeofday for each iteration could slow > the process significantly. SIGALRM seems to be valuable for the > case. As a very fancy and ugly sample, the attached patch does this, although, I'm a bit at a loss how to make this signal things to be more sober.. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Feb 10, 2015 at 7:48 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > The attached patch is fix for walreciever not using gettimeofday, > and fix for receivelog using it. > >> > XLogWalRcvProcessMsg doesn't send feedback when processing >> > 'w'=WAL record packet. So the same thing as pg_basebackup and >> > pg_receivexlog will occur on walsender. >> > >> > Exiting the for(;;) loop by TimestampDifferenceExceeds just >> > before line 439 is an equivalent measure to I poposed for >> > receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there >> > is seemingly simpler (but I feel a bit uncomfortable for the >> > latter) >> >> I'm concerned about the performance degradation by calling >> getimeofday() per a record. >> >> > Or other measures? >> >> Introduce the maximum number of records that we can receive and >> process between feedbacks? For example, we can change >> XLogWalRcvSendHSFeedback() so that it's called per at least >> 8 records. I'm not sure what number is good, though... > > At the beginning of the "while(len > 0){if (len > 0){" block, > last_recv_timestamp is always kept up to date (by using > gettimeotda():). So we can use the variable instead of > gettimeofday() in my previous proposal. Yes, but something like last_recv_timestamp doesn't exist in REL9_1_STABLE. So you don't think that your proposed fix should be back-patched to all supported versions. Right? > The start time of the timeout could be last_recv_timestamp at the > beginning of the while loop, since it is a bit earlier than the > actual time at the point. Sounds strange to me. I think that last_recv_timestamp should be compared with the time when the last feedback message was sent, e.g., maybe you can expose sendTime in XLogWalRcvSendReplay() as global variable, and use it to compare with last_recv_timestamp. When we get out of the WAL receive loop due to the timeout check which your patch added, XLogWalRcvFlush() is always executed. I don't think that current WAL file needs to be flushed at that time. > The another solution would be using > RegisterTimeout/enable_timeout_after and it seemed to be work for > me. It can throw out the time counting stuff in the loop we are > talking about and that of XLogWalRcvSendHSFeedback and > XLogWalRcvSendReply, but it might be a bit too large for the > gain. Yes, sounds overkill. Regards, -- Fujii Masao
Thank you for the comment. Three new patches are attached. I forgot to give a revision number on the previous patch, but I think this is the 2nd version. 1. walrcv_reply_fix_94_v2.patch Walreceiver patch applyable on master/REL9_4_STBLE/REL9_3_STABLE 2. walrcv_reply_fix_92_v2.patch Walreceiver patch applyable on REL9_2_STABLE 3. walrcv_reply_fix_91_v2.patch Walreceiver patch applyable on REL9_1_STABLE At Sat, 14 Feb 2015 03:01:22 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHZ_4yWyokA+5ks9s_698adjH8+0haMCSC9WYFh37GY7g@mail.gmail.com> > On Tue, Feb 10, 2015 at 7:48 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >> Introduce the maximum number of records that we can receive and > >> process between feedbacks? For example, we can change > >> XLogWalRcvSendHSFeedback() so that it's called per at least > >> 8 records. I'm not sure what number is good, though... > > > > At the beginning of the "while(len > 0){if (len > 0){" block, > > last_recv_timestamp is always kept up to date (by using > > gettimeotda():). So we can use the variable instead of > > gettimeofday() in my previous proposal. > > Yes, but something like last_recv_timestamp doesn't exist in > REL9_1_STABLE. So you don't think that your proposed fix > should be back-patched to all supported versions. Right? Back to 9.3 has the loop with the same structure. 9.2 is in a bit differenct shape but looks to have the same issue. 9.1 and 9.0 has the same staps with 9.2 but 9.0 doesn't has wal sender timeout and 9.1 does not have a variable having current time. 9.3 and later: the previous patch works, but revised as your comment. 9.2: The time of the last reply is stored in the file-scope variable reply_message, and the current time is stored in walrcv->lastMsgReceiptTime.The timeout is determined using theses variables. 9.1: walrcv doesn't have lastMsgReceiptTime. The current time should be acquired explicitly in the innermost loop. I triedcalling gettimeofday() once per several loops. The skip count is determined by anticipated worst duration to processa wal record and wal_receiver_status_interval. However, I doubt the necessity to do so.. The value of the worst durationis simply a random guess. 9.0: No point to do this kind of fix. > > The start time of the timeout could be last_recv_timestamp at the > > beginning of the while loop, since it is a bit earlier than the > > actual time at the point. > > Sounds strange to me. I think that last_recv_timestamp should be > compared with the time when the last feedback message was sent, > e.g., maybe you can expose sendTime in XLogWalRcvSendReplay() > as global variable, and use it to compare with last_recv_timestamp. You're right to do exactly right thing, but, as you mentioned, exposing sendTime is requied to do so. The solution I proposed is the second-best assuming that wal_sender_timeout is recommended to be longer enough (several times longer) than wal_receiver_status_interval. If no one minds to expose sendTime, it is the best solution. The attached patch does it. # The added comment in the patch was wrong, though. > When we get out of the WAL receive loop due to the timeout check > which your patch added, XLogWalRcvFlush() is always executed. > I don't think that current WAL file needs to be flushed at that time. Thank you for pointing it out. Run XLogWalRcvSendReply(force = true) immediately instead of breaking from the loop. > > The another solution would be using > > RegisterTimeout/enable_timeout_after and it seemed to be work for > > me. It can throw out the time counting stuff in the loop we are > > talking about and that of XLogWalRcvSendHSFeedback and > > XLogWalRcvSendReply, but it might be a bit too large for the > > gain. > > Yes, sounds overkill. However, gettimeofday() for each recv is also a overkill for most cases. I'll post the patches for receivelog.c based on the patch for 9.1 wal receiver. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, this is the last patch for pg_basebackup/pg_receivexlog on master (9.5). Preor versions don't have this issue. 4. basebackup_reply_fix_mst_v2.patch receivelog.c patch applyable on master. This is based on the same design with walrcv_reply_fix_91_v2.patch in the aspect of gettimeofday(). regards, At Tue, 17 Feb 2015 19:45:19 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20150217.194519.58137941.horiguchi.kyotaro@lab.ntt.co.jp> > Thank you for the comment. Three new patches are attached. I > forgot to give a revision number on the previous patch, but I > think this is the 2nd version. > > 1. walrcv_reply_fix_94_v2.patch > Walreceiver patch applyable on master/REL9_4_STBLE/REL9_3_STABLE > > 2. walrcv_reply_fix_92_v2.patch > Walreceiver patch applyable on REL9_2_STABLE > > 3. walrcv_reply_fix_91_v2.patch > Walreceiver patch applyable on REL9_1_STABLE > > > At Sat, 14 Feb 2015 03:01:22 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHZ_4yWyokA+5ks9s_698adjH8+0haMCSC9WYFh37GY7g@mail.gmail.com> > > On Tue, Feb 10, 2015 at 7:48 PM, Kyotaro HORIGUCHI > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > >> Introduce the maximum number of records that we can receive and > > >> process between feedbacks? For example, we can change > > >> XLogWalRcvSendHSFeedback() so that it's called per at least > > >> 8 records. I'm not sure what number is good, though... > > > > > > At the beginning of the "while(len > 0){if (len > 0){" block, > > > last_recv_timestamp is always kept up to date (by using > > > gettimeotda():). So we can use the variable instead of > > > gettimeofday() in my previous proposal. > > > > Yes, but something like last_recv_timestamp doesn't exist in > > REL9_1_STABLE. So you don't think that your proposed fix > > should be back-patched to all supported versions. Right? > > Back to 9.3 has the loop with the same structure. 9.2 is in a bit > differenct shape but looks to have the same issue. 9.1 and 9.0 > has the same staps with 9.2 but 9.0 doesn't has wal sender > timeout and 9.1 does not have a variable having current time. > > 9.3 and later: the previous patch works, but revised as your > comment. > > 9.2: The time of the last reply is stored in the file-scope > variable reply_message, and the current time is stored in > walrcv->lastMsgReceiptTime. The timeout is determined using > theses variables. > > 9.1: walrcv doesn't have lastMsgReceiptTime. The current time > should be acquired explicitly in the innermost loop. I tried > calling gettimeofday() once per several loops. The skip count > is determined by anticipated worst duration to process a wal > record and wal_receiver_status_interval. However, I doubt the > necessity to do so.. The value of the worst duration is simply > a random guess. > > 9.0: No point to do this kind of fix. > > > > > The start time of the timeout could be last_recv_timestamp at the > > > beginning of the while loop, since it is a bit earlier than the > > > actual time at the point. > > > > Sounds strange to me. I think that last_recv_timestamp should be > > compared with the time when the last feedback message was sent, > > e.g., maybe you can expose sendTime in XLogWalRcvSendReplay() > > as global variable, and use it to compare with last_recv_timestamp. > > You're right to do exactly right thing, but, as you mentioned, > exposing sendTime is requied to do so. The solution I proposed is > the second-best assuming that wal_sender_timeout is recommended > to be longer enough (several times longer) than > wal_receiver_status_interval. If no one minds to expose sendTime, > it is the best solution. The attached patch does it. > > # The added comment in the patch was wrong, though. > > > When we get out of the WAL receive loop due to the timeout check > > which your patch added, XLogWalRcvFlush() is always executed. > > I don't think that current WAL file needs to be flushed at that time. > > Thank you for pointing it out. Run XLogWalRcvSendReply(force = > true) immediately instead of breaking from the loop. > > > > The another solution would be using > > > RegisterTimeout/enable_timeout_after and it seemed to be work for > > > me. It can throw out the time counting stuff in the loop we are > > > talking about and that of XLogWalRcvSendHSFeedback and > > > XLogWalRcvSendReply, but it might be a bit too large for the > > > gain. > > > > Yes, sounds overkill. > > However, gettimeofday() for each recv is also a overkill for most > cases. I'll post the patches for receivelog.c based on the patch > for 9.1 wal receiver. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Feb 18, 2015 at 5:34 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, this is the last patch for pg_basebackup/pg_receivexlog on > master (9.5). Preor versions don't have this issue. > > 4. basebackup_reply_fix_mst_v2.patch > receivelog.c patch applyable on master. > > This is based on the same design with > walrcv_reply_fix_91_v2.patch in the aspect of gettimeofday(). Thanks for updating the patches! But I'm still not sure if the idea depending on the frequent calls of gettimeofday() for each WAL receive is good or not. Some users may complain about the performance impact by such frequent calls and we may want to get rid of them from walreceiver loop in the future. If we adopt your idea now, I'm afraid that it would tie our hands in that case. How much impact can such frequent calls of gettimeofday() have on replication performance? If it's not negligible, probably we should remove them at first and find out another idea to fix the problem you pointed. ISTM that it's not so difficult to remove them. Thought? Do you have any numbers which can prove that such frequent gettimeofday() has only ignorable impact on the performance? Regards, -- Fujii Masao
Hello, At Thu, 19 Feb 2015 19:22:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGLFLaFrCYcuikkVefNaoEL448TLSJ9oPsvb17v3foZHA@mail.gmail.com> > On Wed, Feb 18, 2015 at 5:34 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello, this is the last patch for pg_basebackup/pg_receivexlog on > > master (9.5). Preor versions don't have this issue. > > > > 4. basebackup_reply_fix_mst_v2.patch > > receivelog.c patch applyable on master. > > > > This is based on the same design with > > walrcv_reply_fix_91_v2.patch in the aspect of gettimeofday(). > > Thanks for updating the patches! But I'm still not sure if the idea depending > on the frequent calls of gettimeofday() for each WAL receive is good or not. Neither do I. Nowadays, linux on AMD64/x64 environment has no problem even if gettimeofday() called frequently, but Windows seems to have a problem and I don't know about other platforms. One possible timing source is LSN. > if ((blockpos - last_blockpos) / BLKSZ > 0) > { > now = feGetCurrentTimestamp(); > if (feTimestampDifferenceExceeds(last_status, now, .. > if (!sendFeedback(conn, blockpos, now, false)) > } > } > > last_blockpos = blockpos; But once per PAGESZ can easily be more frequent than once per 10 records and XLOG_SEG_SIZE seems too big. However I don't see any bases to determine the frequency between them nor other than the time itself. SIGALRM seems to me to be more preferable to keep the main jobe as fast as possible than introducing a code with no reasonable basis. > Some users may complain about the performance impact by such frequent calls > and we may want to get rid of them from walreceiver loop in the future. > If we adopt your idea now, I'm afraid that it would tie our hands in that case. > > How much impact can such frequent calls of gettimeofday() have on replication > performance? If it's not negligible, probably we should remove them at first > and find out another idea to fix the problem you pointed. ISTM that it's not so > difficult to remove them. Thought? Do you have any numbers which can prove > that such frequent gettimeofday() has only ignorable impact on the performance? The attached patch is 'the more sober' version of SIGLARM patch. I'll search for the another way after this. regards, diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c index 8caedff..c55af83 100644 --- a/src/bin/pg_basebackup/receivelog.c +++ b/src/bin/pg_basebackup/receivelog.c @@ -16,6 +16,8 @@#include <sys/stat.h>#include <unistd.h> +#include <signal.h> +#include <sys/time.h>/* local includes */#include "receivelog.h" @@ -33,6 +35,12 @@ static XLogRecPtr lastFlushPosition = InvalidXLogRecPtr;static bool still_sending = true; /* feedbackstill needs to be sent? */ +static bool standby_message_timeout_active = false; /* timeout is active? */ +static bool standby_message_timeout_expired = false; /* timeout expired? */ + +static void SigAlrmHandler(int s); +static int set_standby_message_timeout(long duration); +static int cancel_standby_message_timeout(void);static PGresult *HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, char *basedir, stream_stop_callback stream_stop, int standby_message_timeout, @@ -60,6 +68,14 @@ static long CalculateCopyStreamSleeptime(int64 now, int standby_message_timeout,static bool ReadEndOfStreamingResult(PGresult*res, XLogRecPtr *startpos, uint32 *timeline); + +static void +SigAlrmHandler(int s) +{ + standby_message_timeout_expired = true; +} + +static boolmark_file_as_archived(const char *basedir, const char *fname){ @@ -802,6 +818,65 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)}/* + * set_standby_message_timeout: Set standby message timeout. + * + * duration is the timeout duration in milliseconds. + * + * Returns 0 on success and -1 on error. + */ +static int +set_standby_message_timeout(long duration) +{ + struct itimerval itv; + + /* negative duration means no standby_message_timeout. */ + if (duration < 0) return 0; + + MemSet(&itv, 0, sizeof(itv)); + itv.it_value.tv_sec = duration / 1000; + itv.it_value.tv_usec = (duration % 1000) * 1000; + standby_message_timeout_active = true; + standby_message_timeout_expired = false; + if (setitimer(ITIMER_REAL, &itv, NULL) != 0) + { + fprintf(stderr, "could not set timer: %m"); + return -1; + } + + return 0; +} + +/* + * cancel_standby_message_timeout: cancel standby message timeout if active. + * + * Returns 0 on success and -1 on error. + */ +static int +cancel_standby_message_timeout(void) +{ + struct itimerval itv; + + if (!standby_message_timeout_active) + return; + + standby_message_timeout_active = false; + + if (standby_message_timeout_expired) + { + standby_message_timeout_expired = false; + return 0; + } + MemSet(&itv, 0, sizeof(itv)); + if (setitimer(ITIMER_REAL, &itv, NULL) != 0) + { + fprintf(stderr, "could not set timer: %m"); + return -1; + } + + return 0; +} + +/* * The main loop of ReceiveXlogStream. Handles the COPY stream after * initiating streaming with the START_STREAMING command.* @@ -818,6 +893,15 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, char *copybuf = NULL; int64 last_status = -1; XLogRecPtr blockpos = startpos; + sighandler_t oldhandler; + + oldhandler = pqsignal(SIGALRM, SigAlrmHandler); + + if (oldhandler != SigAlrmHandler) + { + /* SIGALRM should not be used here */ + Assert(oldhandler == SIG_DFL); + } still_sending = true; @@ -878,23 +962,9 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, */ sleeptime = CalculateCopyStreamSleeptime(now,standby_message_timeout, last_status); - r = CopyStreamReceive(conn, sleeptime, ©buf); - while (r != 0) + while (r > 0) { - if (r == -1) - goto error; - if (r == -2) - { - PGresult *res = HandleEndOfCopyStream(conn, copybuf, blockpos, - basedir, partial_suffix, - stoppos, mark_done); - if (res == NULL) - goto error; - else - return res; - } - /* Check the message type. */ if (copybuf[0] == 'k') { @@ -925,14 +995,62 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, } /* + If standby_message_timeout has expired, send feed back and then + * set timeout again. + */ + if (standby_message_timeout_expired) + { + if (!sendFeedback(conn, blockpos, now, false)) + goto error; + last_status = feGetCurrentTimestamp(); + + /* Set next timeout */ + sleeptime = CalculateCopyStreamSleeptime(now, standby_message_timeout, + last_status); + if (set_standby_message_timeout(sleeptime) < 0) + goto error; + } + + /* * Process the received data, and any subsequent data we * can read without blocking. */ r = CopyStreamReceive(conn, 0, ©buf); + + /* If continuous input has come, set timeout for standby + * message. This fires a bit later than exact desired time but it + * would be earlier enough because standby_message_timeout is + * ususally set enough smaller than wal_sender_timeout. + */ + if (r > 0 && !standby_message_timeout_active) + { + if (set_standby_message_timeout(sleeptime) < 0) + goto error; + + } } - } + if (cancel_standby_message_timeout() < 0) + goto error; + if (r == -1) + goto error; + if (r == -2) + { + PGresult *res = HandleEndOfCopyStream(conn, copybuf, blockpos, + basedir, partial_suffix, + stoppos, mark_done); + if (res == NULL) + goto error; + else + return res; + } + + } + error: + cancel_standby_message_timeout(); + pqsignal(SIGALRM, oldhandler); + if (copybuf != NULL) PQfreemem(copybuf); return NULL;
Hello, the attached is the v4 patch that checks feedback timing every WAL segments boundary. At Fri, 20 Feb 2015 17:29:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20150220.172914.241732690.horiguchi.kyotaro@lab.ntt.co.jp> > > Some users may complain about the performance impact by such > > frequent calls and we may want to get rid of them from > > walreceiver loop in the future. If we adopt your idea now, > > I'm afraid that it would tie our hands in that case. > > > > How much impact can such frequent calls of gettimeofday() > > have on replication performance? If it's not negligible, > > probably we should remove them at first and find out another > > idea to fix the problem you pointed. ISTM that it's not so > > difficult to remove them. Thought? Do you have any numbers > > which can prove that such frequent gettimeofday() has only > > ignorable impact on the performance? > > The attached patch is 'the more sober' version of SIGLARM patch. I said that checking whether to send feedback every boundary between WAL segments seemed too long but after some rethinking, I changed my mind. - The most large possible delay source in the busy-receive loop is fsyncing at closing WAL segment file just written, thiscan take several seconds. Freezing longer than the timeout interval could not be saved and is not worth saving anyway. - 16M bytes-disk-writes intervals between gettimeofday() seems to be gentle enough even on platforms where gettimeofday()is rather heavy. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, the attached is the v4 patch that checks feedback timing > every WAL segments boundary. > > At Fri, 20 Feb 2015 17:29:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20150220.172914.241732690.horiguchi.kyotaro@lab.ntt.co.jp> >> > Some users may complain about the performance impact by such >> > frequent calls and we may want to get rid of them from >> > walreceiver loop in the future. If we adopt your idea now, >> > I'm afraid that it would tie our hands in that case. >> > >> > How much impact can such frequent calls of gettimeofday() >> > have on replication performance? If it's not negligible, >> > probably we should remove them at first and find out another >> > idea to fix the problem you pointed. ISTM that it's not so >> > difficult to remove them. Thought? Do you have any numbers >> > which can prove that such frequent gettimeofday() has only >> > ignorable impact on the performance? >> >> The attached patch is 'the more sober' version of SIGLARM patch. > > I said that checking whether to send feedback every boundary > between WAL segments seemed too long but after some rethinking, I > changed my mind. > > - The most large possible delay source in the busy-receive loop > is fsyncing at closing WAL segment file just written, this can > take several seconds. Freezing longer than the timeout > interval could not be saved and is not worth saving anyway. > > - 16M bytes-disk-writes intervals between gettimeofday() seems > to be gentle enough even on platforms where gettimeofday() is > rather heavy. Sounds reasonable to me. So we don't need to address the problem in walreceiver side so proactively because it tries to send the feedback every after flushing the WAL records. IOW, the problem you observed is less likely to happen. Right? + now = feGetCurrentTimestamp(); + if (standby_message_timeout > 0 && Minor comment: should feGetCurrentTimestamp() be called after the check of standby_message_timeout > 0, to avoid unnecessary calls of that? ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len, XLogRecPtr *blockpos, uint32 timeline, char *basedir, stream_stop_callback stream_stop, - char *partial_suffix, bool mark_done) + char *partial_suffix, bool mark_done, + int standby_message_timeout, int64 *last_status) Maybe it's time to refactor this ugly coding (i.e., currently many arguments need to be given to each functions. Looks ugly)... Regards, -- Fujii Masao
Hi, the attached is the v5 patch. - Do feGetCurrentTimestamp() only when necessary. - Rebased to current master At Mon, 2 Mar 2015 20:21:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG1tJHpG03oZgwoKxt5wYD5v4S3HuTgSx7RotBhHnjwJw@mail.gmail.com> > On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello, the attached is the v4 patch that checks feedback timing > > every WAL segments boundary. .. > > I said that checking whether to send feedback every boundary > > between WAL segments seemed too long but after some rethinking, I > > changed my mind. > > > > - The most large possible delay source in the busy-receive loop > > is fsyncing at closing WAL segment file just written, this can > > take several seconds. Freezing longer than the timeout > > interval could not be saved and is not worth saving anyway. > > > > - 16M bytes-disk-writes intervals between gettimeofday() seems > > to be gentle enough even on platforms where gettimeofday() is > > rather heavy. > > Sounds reasonable to me. > > So we don't need to address the problem in walreceiver side so proactively > because it tries to send the feedback every after flushing the WAL records. > IOW, the problem you observed is less likely to happen. Right? > > + now = feGetCurrentTimestamp(); > + if (standby_message_timeout > 0 && Surely it would hardly happen, especially on ordinary configuration. However, the continuous receiving of the replication stream is a quite normal behavior even if hardly happens. So the receiver should guarantee the feedbacks to be sent by logic as long as it is working normally, as long as the code for the special case won't be too large and won't take too long time:). The current walreceiver doesn't look trying to send feedbacks after flushing every wal records. HandleCopyStream will restlessly process the records in a gapless replication stream, sending feedback only when keepalive packet arrives. It is the fact that the response to the keepalive would help greatly but it is not what the keepalives are for. It is intended to be used to confirm if a silent receiver is still alive. Even with this fix, the case that one write or flush takes longer time than the feedback interval cannot be saved, but it would be ok since it should be regarded as disorder. > Minor comment: should feGetCurrentTimestamp() be called after the check of > standby_message_timeout > 0, to avoid unnecessary calls of that? Ah, you're right. I'll fixed it. > ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len, > XLogRecPtr *blockpos, uint32 timeline, > char *basedir, stream_stop_callback stream_stop, > - char *partial_suffix, bool mark_done) > + char *partial_suffix, bool mark_done, > + int standby_message_timeout, int64 *last_status) > > Maybe it's time to refactor this ugly coding (i.e., currently many arguments > need to be given to each functions. Looks ugly)... I'm increasing the ugliness:( XLog stuff seems to need to share many states widely to work. But the parameter list of the function looks to be bearable to this extent, to me:). regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Mar 10, 2015 at 5:29 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hi, the attached is the v5 patch. > > - Do feGetCurrentTimestamp() only when necessary. > - Rebased to current master > > > At Mon, 2 Mar 2015 20:21:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG1tJHpG03oZgwoKxt5wYD5v4S3HuTgSx7RotBhHnjwJw@mail.gmail.com> >> On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Hello, the attached is the v4 patch that checks feedback timing >> > every WAL segments boundary. > .. >> > I said that checking whether to send feedback every boundary >> > between WAL segments seemed too long but after some rethinking, I >> > changed my mind. >> > >> > - The most large possible delay source in the busy-receive loop >> > is fsyncing at closing WAL segment file just written, this can >> > take several seconds. Freezing longer than the timeout >> > interval could not be saved and is not worth saving anyway. >> > >> > - 16M bytes-disk-writes intervals between gettimeofday() seems >> > to be gentle enough even on platforms where gettimeofday() is >> > rather heavy. >> >> Sounds reasonable to me. >> >> So we don't need to address the problem in walreceiver side so proactively >> because it tries to send the feedback every after flushing the WAL records. >> IOW, the problem you observed is less likely to happen. Right? >> >> + now = feGetCurrentTimestamp(); >> + if (standby_message_timeout > 0 && > > Surely it would hardly happen, especially on ordinary > configuration. > > However, the continuous receiving of the replication stream is a > quite normal behavior even if hardly happens. So the receiver > should guarantee the feedbacks to be sent by logic as long as it > is working normally, as long as the code for the special case > won't be too large and won't take too long time:). > > The current walreceiver doesn't look trying to send feedbacks > after flushing every wal records. HandleCopyStream will > restlessly process the records in a gapless replication stream, > sending feedback only when keepalive packet arrives. It is the > fact that the response to the keepalive would help greatly but it > is not what the keepalives are for. It is intended to be used to > confirm if a silent receiver is still alive. > > Even with this fix, the case that one write or flush takes longer > time than the feedback interval cannot be saved, but it would be > ok since it should be regarded as disorder. > > >> Minor comment: should feGetCurrentTimestamp() be called after the check of >> standby_message_timeout > 0, to avoid unnecessary calls of that? > > Ah, you're right. I'll fixed it. Even if the timeout has not elapsed yet, if synchronous mode is true, we should send feedback here? Regards, -- Fujii Masao