Thread: Suppressing useless wakeups in walreceiver
Hi, While working on WaitEventSet-ifying various codepaths, I found it strange that walreceiver wakes up 10 times per second while idle. Here's a draft patch to compute the correct sleep time.
Attachment
Hello. At Thu, 27 Jan 2022 23:50:04 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in > While working on WaitEventSet-ifying various codepaths, I found it > strange that walreceiver wakes up 10 times per second while idle. > Here's a draft patch to compute the correct sleep time. Agree to the objective. However I feel the patch makes the code somewhat less readable maybe because WalRcvComputeNextWakeup hides the timeout deatils. Of course other might thing differently. - ping_sent = false; - XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1, - startpointTLI); + WalRcvComputeNextWakeup(&state, + WALRCV_WAKEUP_TIMEOUT, + last_recv_timestamp); + WalRcvComputeNextWakeup(&state, + WALRCV_WAKEUP_PING, + last_recv_timestamp); The calculated target times are not used within the closest loop and the loop is quite busy. WALRCV_WAKEUP_PING is the replacement of ping_sent, but I think the computation of both two target times would be better done after the loop only when the "if (len > 0)" block was passed. - XLogWalRcvSendReply(false, false); + XLogWalRcvSendReply(&state, GetCurrentTimestamp(), + false, false); The GetCurrentTimestamp() is same with last_recv_timestamp when the recent recv() had any bytes received. So we can avoid the call to GetCurrentTimestamp in that case. If we do the change above, the same flag notifies the necessity of separete GetCurrentTimestamp(). I understand the reason for startpointTLI being stored in WalRcvInfo but don't understand about primary_has_standby_xmin. It is just moved from a static variable of XLogWalRcvSendHSFeedback to the struct member that is modifed and read only by the same function. The enum item WALRCV_WAKEUP_TIMEOUT looks somewhat uninformative. How about naming it WALRCV_WAKEUP_TERMIATE (or WALRCV_WAKEUP_TO_DIE)? WALRCV_WAKEUP_TIMEOUT and WALRCV_WAKEUP_PING doesn't fire when wal_receiver_timeout is zero. In that case we should not set the timeouts at all to avoid spurious wakeups? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Jan 28, 2022 at 8:26 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Thu, 27 Jan 2022 23:50:04 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in > > While working on WaitEventSet-ifying various codepaths, I found it > > strange that walreceiver wakes up 10 times per second while idle. > > Here's a draft patch to compute the correct sleep time. > > Agree to the objective. However I feel the patch makes the code > somewhat less readable maybe because WalRcvComputeNextWakeup hides the > timeout deatils. Of course other might thing differently. Thanks for looking! The reason why I put the timeout computation into a function is because there are about 3 places you need to do it: at the beginning, when rescheduling for next time, and when the configuration file changes and you might want to recompute them. The logic to decode the GUCs and compute the times would be duplicated. I have added a comment about that, and tried to make the code clearer. Do you have another idea? > - ping_sent = false; > - XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1, > - startpointTLI); > + WalRcvComputeNextWakeup(&state, > + WALRCV_WAKEUP_TIMEOUT, > + last_recv_timestamp); > + WalRcvComputeNextWakeup(&state, > + WALRCV_WAKEUP_PING, > + last_recv_timestamp); > > The calculated target times are not used within the closest loop and > the loop is quite busy. WALRCV_WAKEUP_PING is the replacement of > ping_sent, but I think the computation of both two target times would > be better done after the loop only when the "if (len > 0)" block was > passed. Those two wakeup times should only be adjusted when data is received. The calls should be exactly where last_recv_timestamp is set in master, no? > - XLogWalRcvSendReply(false, false); > + XLogWalRcvSendReply(&state, GetCurrentTimestamp(), > + false, false); > > The GetCurrentTimestamp() is same with last_recv_timestamp when the > recent recv() had any bytes received. So we can avoid the call to > GetCurrentTimestamp in that case. If we do the change above, the same > flag notifies the necessity of separete GetCurrentTimestamp(). Yeah, I agree that we should remove more GetCurrentTimestamp() calls. Here's another idea: let's remove last_recv_timestamp, and just use 'now', being careful to update it after sleeping and in the receive loop (in case it gets busy for a long time), so that it's always fresh enough. > I understand the reason for startpointTLI being stored in WalRcvInfo > but don't understand about primary_has_standby_xmin. It is just moved > from a static variable of XLogWalRcvSendHSFeedback to the struct > member that is modifed and read only by the same function. Yeah, this was unnecessary refactoring. I removed that change (we could look into moving more state into the new state object instead of using static locals and globals, but that can be for another thread, I got carried away...) > The enum item WALRCV_WAKEUP_TIMEOUT looks somewhat uninformative. How > about naming it WALRCV_WAKEUP_TERMIATE (or WALRCV_WAKEUP_TO_DIE)? Ok, let's try TERMINATE. > WALRCV_WAKEUP_TIMEOUT and WALRCV_WAKEUP_PING doesn't fire when > wal_receiver_timeout is zero. In that case we should not set the > timeouts at all to avoid spurious wakeups? Oh, yeah, I forgot to handle wal_receiver_timeout = 0. Fixed.
Attachment
At Fri, 28 Jan 2022 22:41:32 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in > On Fri, Jan 28, 2022 at 8:26 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > At Thu, 27 Jan 2022 23:50:04 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in > The reason why I put the timeout computation into a function is > because there are about 3 places you need to do it: at the beginning, > when rescheduling for next time, and when the configuration file > changes and you might want to recompute them. The logic to decode the > GUCs and compute the times would be duplicated. I have added a > comment about that, and tried to make the code clearer. > > Do you have another idea? Yeah, I understand that and am having no better ideas so I'm fine with that. > > - ping_sent = false; > > - XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1, > > - startpointTLI); > > + WalRcvComputeNextWakeup(&state, > > + WALRCV_WAKEUP_TIMEOUT, > > + last_recv_timestamp); > > + WalRcvComputeNextWakeup(&state, > > + WALRCV_WAKEUP_PING, > > + last_recv_timestamp); > > > > The calculated target times are not used within the closest loop and > > the loop is quite busy. WALRCV_WAKEUP_PING is the replacement of > > ping_sent, but I think the computation of both two target times would > > be better done after the loop only when the "if (len > 0)" block was > > passed. > > Those two wakeup times should only be adjusted when data is received. Yes. > The calls should be exactly where last_recv_timestamp is set in > master, no? The calcuations at other than the last one iteration are waste of cycles and I thought the loop as a kind of tight-loop. So I don't mind we consider the additional cycles are acceptable. (Just I'm not sure.) > > - XLogWalRcvSendReply(false, false); > > + XLogWalRcvSendReply(&state, GetCurrentTimestamp(), > > + false, false); > > > > The GetCurrentTimestamp() is same with last_recv_timestamp when the > > recent recv() had any bytes received. So we can avoid the call to > > GetCurrentTimestamp in that case. If we do the change above, the same > > flag notifies the necessity of separete GetCurrentTimestamp(). > > Yeah, I agree that we should remove more GetCurrentTimestamp() calls. > Here's another idea: let's remove last_recv_timestamp, and just use > 'now', being careful to update it after sleeping and in the receive > loop (in case it gets busy for a long time), so that it's always fresh > enough. I like it. It could be slightly off but it doesn't matter at all. > > I understand the reason for startpointTLI being stored in WalRcvInfo > > but don't understand about primary_has_standby_xmin. It is just moved > > from a static variable of XLogWalRcvSendHSFeedback to the struct > > member that is modifed and read only by the same function. > > Yeah, this was unnecessary refactoring. I removed that change (we > could look into moving more state into the new state object instead of > using static locals and globals, but that can be for another thread, I > got carried away...) > > > The enum item WALRCV_WAKEUP_TIMEOUT looks somewhat uninformative. How > > about naming it WALRCV_WAKEUP_TERMIATE (or WALRCV_WAKEUP_TO_DIE)? > > Ok, let's try TERMINATE. Thanks:) > > WALRCV_WAKEUP_TIMEOUT and WALRCV_WAKEUP_PING doesn't fire when > > wal_receiver_timeout is zero. In that case we should not set the > > timeouts at all to avoid spurious wakeups? > > Oh, yeah, I forgot to handle wal_receiver_timeout = 0. Fixed. Yeah, looks fine. +static void +WalRcvComputeNextWakeup(WalRcvInfo *state, + WalRcvWakeupReason reason, + TimestampTz now) +{ + switch (reason) + { ... + default: + break; + } +} Mmm.. there's NUM_WALRCV_WAKEUPS.. But I think we don't need to allow the case. Isn't it better to replace the break with Assert()? It might be suitable for another patch, but we can do that together. If we passed "now" to the function XLogWalRcvProcessMsg, we would be able to remove further several calls to GetCurrentTimestamp(), in the function itself and ProcessWalSndrMessage (the name is improperly abbreviated..). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
I was surprised to see that this patch was never committed! On Mon, Jan 31, 2022 at 04:28:11PM +0900, Kyotaro Horiguchi wrote: > +static void > +WalRcvComputeNextWakeup(WalRcvInfo *state, > + WalRcvWakeupReason reason, > + TimestampTz now) > +{ > + switch (reason) > + { > ... > + default: > + break; > + } > +} > > Mmm.. there's NUM_WALRCV_WAKEUPS.. But I think we don't need to allow > the case. Isn't it better to replace the break with Assert()? +1 > It might be suitable for another patch, but we can do that > together. If we passed "now" to the function XLogWalRcvProcessMsg, we > would be able to remove further several calls to > GetCurrentTimestamp(), in the function itself and > ProcessWalSndrMessage (the name is improperly abbreviated..). While reading the patch, I did find myself wondering if there were some additional opportunities for reducing the calls to GetCurrentTimestamp(). That was really the only thing that stood out to me. Thomas, are you planning to continue with this patch? If not, I don't mind picking it up. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Sep 30, 2022 at 4:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > Thomas, are you planning to continue with this patch? If not, I don't mind > picking it up. Hi Nathan, Please go ahead! One thing I had in my notes to check with this patch is whether there might be a bug due to computing all times in microseconds, but sleeping for a number of milliseconds without rounding up (what I mean is that if it's trying to sleep for 900 microseconds, it might finish up sleeping for 0 milliseconds in a tight loop a few times).
On Fri, Sep 30, 2022 at 05:04:43PM +1300, Thomas Munro wrote: > Please go ahead! One thing I had in my notes to check with this patch > is whether there might be a bug due to computing all times in > microseconds, but sleeping for a number of milliseconds without > rounding up (what I mean is that if it's trying to sleep for 900 > microseconds, it might finish up sleeping for 0 milliseconds in a > tight loop a few times). I think that's right. If 'next_wakeup - now' is less than 1000, 'nap' will be 0. My first instinct is to simply round to the nearest millisecond, but this might still result in wakeups before the calculated wakeup time, and I think we ultimately want to nap until >= the wakeup time whenever possible to avoid tight loops. So, I agree that it would be better to round up to the next millisecond whenever nextWakeup > now. Given the current behavior is to wake up every 100 milliseconds, an extra millisecond here and there doesn't seem like it should cause any problems. And I suppose we should assume we might nap longer already, anyway. I do see that the wakeup time for WALRCV_WAKEUP_PING might be set to 'now' whenever wal_receiver_timeout is 1 millisecond, but I think that's existing behavior, and it feels like an extreme case that we probably needn't worry about too much. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Here is an updated patch set with the following changes: * The creation of the struct for non-shared WAL receiver state is moved to a prerequisite 0001 patch. This should help ease review of 0002 a bit. * I updated the nap time calculation to round up to the next millisecond, as discussed upthread. * I attempted to minimize the calls to GetCurrentTimestamp(). The WAL receiver code already calls this function pretty liberally, so I don't know if this is important, but perhaps it could make a difference for systems that don't have something like the vDSO to avoid real system calls. * I removed the 'tli' argument from functions that now have an argument for the non-shared state struct. The 'tli' is stored within the struct, so the extra argument is unnecessary. One thing that still bugs me a little bit about 0002 is that the calls to GetCurrentTimestamp() feel a bit scattered and more difficult to reason about. AFAICT 0002 keeps 'now' relatively up-to-date, but it seems possible that a future change could easily disrupt that. I don't have any other ideas at the moment, though. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On 2022-Oct-04, Nathan Bossart wrote: > Here is an updated patch set with the following changes: > > * The creation of the struct for non-shared WAL receiver state is moved to > a prerequisite 0001 patch. This should help ease review of 0002 a bit. I think that would be even better if you moved the API adjustments of some functions for the new struct to 0001 as well (XLogWalRcvSendHSFeedback etc). > * I updated the nap time calculation to round up to the next millisecond, > as discussed upthread. I didn't look at this part very carefully, but IIRC walreceiver's responsivity has a direct impact on latency for sync replicas. Maybe it'd be sensible to the definition that the sleep time is rounded down rather than up? (At least, for the cases where we have something to do and not merely continue sleeping.) > * I attempted to minimize the calls to GetCurrentTimestamp(). The WAL > receiver code already calls this function pretty liberally, so I don't know > if this is important, but perhaps it could make a difference for systems > that don't have something like the vDSO to avoid real system calls. Yeah, we are indeed careful about this in most places. Maybe it's not particularly important in 2022 and also not particularly important for walreceiver (again, except in the code paths that affect sync replicas), but there's no reason not to continue to be careful until we discover more widely that it doesn't matter. > * I removed the 'tli' argument from functions that now have an argument for > the non-shared state struct. The 'tli' is stored within the struct, so the > extra argument is unnecessary. +1 (This could also be in 0001, naturally.) > One thing that still bugs me a little bit about 0002 is that the calls to > GetCurrentTimestamp() feel a bit scattered and more difficult to reason > about. AFAICT 0002 keeps 'now' relatively up-to-date, but it seems > possible that a future change could easily disrupt that. I don't have any > other ideas at the moment, though. Hmm. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Thanks for taking a look. On Wed, Oct 05, 2022 at 09:57:00AM +0200, Alvaro Herrera wrote: > On 2022-Oct-04, Nathan Bossart wrote: >> * The creation of the struct for non-shared WAL receiver state is moved to >> a prerequisite 0001 patch. This should help ease review of 0002 a bit. > > I think that would be even better if you moved the API adjustments of > some functions for the new struct to 0001 as well > (XLogWalRcvSendHSFeedback etc). I moved as much as I could to 0001 in v4. >> * I updated the nap time calculation to round up to the next millisecond, >> as discussed upthread. > > I didn't look at this part very carefully, but IIRC walreceiver's > responsivity has a direct impact on latency for sync replicas. Maybe > it'd be sensible to the definition that the sleep time is rounded down > rather than up? (At least, for the cases where we have something to do > and not merely continue sleeping.) The concern is that if we wake up early, we'll end up spinning until the wake-up time is reached. Given the current behavior is to sleep for 100ms at a time, and the tasks in question are governed by wal_receiver_timeout and wal_receiver_status_interval (which are typically set to several seconds) it seems reasonably safe to sleep up to an extra ~1ms here and there without sacrificing responsiveness. In fact, I imagine this patch results in a net improvement in responsiveness for these tasks. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Oct 5, 2022 at 11:38 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > I moved as much as I could to 0001 in v4. Some comments on v4-0002: 1. You might want to use PG_INT64_MAX instead of INT64_MAX for portability? 2. With the below change, the time walreceiver spends in XLogWalRcvSendReply() is also included for XLogWalRcvSendHSFeedback right? I think it's a problem given that XLogWalRcvSendReply() can take a while. Earlier, this wasn't the case, each function calculating 'now' separately. Any reason for changing this behaviour? I know that GetCurrentTimestamp(); isn't cheaper all the time, but here it might result in a change in the behaviour. - XLogWalRcvSendReply(false, false); - XLogWalRcvSendHSFeedback(false); + TimestampTz now = GetCurrentTimestamp(); + + XLogWalRcvSendReply(state, now, false, false); + XLogWalRcvSendHSFeedback(state, now, false); 3. I understand that TimestampTz type is treated as microseconds. Would you mind having a comment in below places to say why we're multiplying with 1000 or 1000000 here? Also, do we need 1000L or 1000000L or type cast to TimestampTz? + state->wakeup[reason] = now + (wal_receiver_timeout / 2) * 1000; + state->wakeup[reason] = now + wal_receiver_timeout * 1000; + state->wakeup[reason] = now + wal_receiver_status_interval * 1000000; 4. How about simplifying WalRcvComputeNextWakeup() something like below? static void WalRcvComputeNextWakeup(WalRcvInfo *state, WalRcvWakeupReason reason, TimestampTz now) { TimestampTz ts = INT64_MAX; switch (reason) { case WALRCV_WAKEUP_TERMINATE: if (wal_receiver_timeout > 0) ts = now + (TimestampTz) (wal_receiver_timeout * 1000L); break; case WALRCV_WAKEUP_PING: if (wal_receiver_timeout > 0) ts = now + (TimestampTz) ((wal_receiver_timeout / 2) * 1000L); break; case WALRCV_WAKEUP_HSFEEDBACK: if (hot_standby_feedback && wal_receiver_status_interval > 0) ts = now + (TimestampTz) (wal_receiver_status_interval * 1000000L); break; case WALRCV_WAKEUP_REPLY: if (wal_receiver_status_interval > 0) ts = now + (TimestampTz) (wal_receiver_status_interval * 1000000); break; default: /* An error is better here. */ } state->wakeup[reason] = ts; } 5. Can we move below code snippets to respective static functions for better readability and code reuse? This: + /* Find the soonest wakeup time, to limit our nap. */ + nextWakeup = INT64_MAX; + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) + nextWakeup = Min(state.wakeup[i], nextWakeup); + nap = Max(0, (nextWakeup - now + 999) / 1000); And this: + now = GetCurrentTimestamp(); + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) + WalRcvComputeNextWakeup(&state, i, now); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Oct 10, 2022 at 03:22:15PM +0530, Bharath Rupireddy wrote: > Some comments on v4-0002: Thanks for taking a look. > 1. You might want to use PG_INT64_MAX instead of INT64_MAX for portability? Yes, I used PG_INT64_MAX in v5. > 2. With the below change, the time walreceiver spends in > XLogWalRcvSendReply() is also included for XLogWalRcvSendHSFeedback > right? I think it's a problem given that XLogWalRcvSendReply() can > take a while. Earlier, this wasn't the case, each function calculating > 'now' separately. Any reason for changing this behaviour? I know that > GetCurrentTimestamp(); isn't cheaper all the time, but here it might > result in a change in the behaviour. Yes, if XLogWalRcvSendReply() takes a long time, we might defer sending the hot_standby_feedback message until a later time. The only reason I changed this was to avoid extra calls to GetCurrentTimestamp(), which might be expensive on some platforms. Outside of the code snippet you pointed out, I think WalReceiverMain() has a similar problem. That being said, I'm generally skeptical that this sort of thing is detrimental given the current behavior (i.e., wake up every 100ms), the usual values of these GUCs (i.e., tens of seconds), and the fact that any tasks that are inappropriately skipped will typically be retried in the next iteration of the loop. > 3. I understand that TimestampTz type is treated as microseconds. > Would you mind having a comment in below places to say why we're > multiplying with 1000 or 1000000 here? Also, do we need 1000L or > 1000000L or type cast to > TimestampTz? I used INT64CONST() in v5, as that seemed the most portable, but I stopped short of adding comments to explain all the conversions. IMO the conversions are relatively obvious, and the units of the GUCs can be easily seen in guc_tables.c. > 4. How about simplifying WalRcvComputeNextWakeup() something like below? Other than saving a couple lines of code, IMO this doesn't meaningfully simplify the function or improve readability. > 5. Can we move below code snippets to respective static functions for > better readability and code reuse? > This: > + /* Find the soonest wakeup time, to limit our nap. */ > + nextWakeup = INT64_MAX; > + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) > + nextWakeup = Min(state.wakeup[i], nextWakeup); > + nap = Max(0, (nextWakeup - now + 999) / 1000); > > And this: > > + now = GetCurrentTimestamp(); > + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) > + WalRcvComputeNextWakeup(&state, i, now); This did cross my mind, but I opted to avoid creating new functions because 1) they aren't likely to be reused very much, and 2) I actually think it might hurt readability by forcing developers to traipse around the file to figure out what these functions are actually doing. It's not like it would save many lines of code, either. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Oct 10, 2022 at 10:51:14AM -0700, Nathan Bossart wrote: >> + /* Find the soonest wakeup time, to limit our nap. */ >> + nextWakeup = INT64_MAX; >> + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) >> + nextWakeup = Min(state.wakeup[i], nextWakeup); >> + nap = Max(0, (nextWakeup - now + 999) / 1000); Hm. We should probably be more cautious here since nextWakeup is an int64 and nap is an int. My guess is that we should just set nap to -1 if nextWakeup > INT_MAX. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, Oct 10, 2022 at 11:10:14AM -0700, Nathan Bossart wrote: > On Mon, Oct 10, 2022 at 10:51:14AM -0700, Nathan Bossart wrote: >>> + /* Find the soonest wakeup time, to limit our nap. */ >>> + nextWakeup = INT64_MAX; >>> + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) >>> + nextWakeup = Min(state.wakeup[i], nextWakeup); >>> + nap = Max(0, (nextWakeup - now + 999) / 1000); > > Hm. We should probably be more cautious here since nextWakeup is an int64 > and nap is an int. My guess is that we should just set nap to -1 if > nextWakeup > INT_MAX. Here's an attempt at fixing that. I ended up just changing "nap" to an int64 and then ensuring it's no larger than INT_MAX in the call to WaitLatchOrSocket(). IIUC we can't use -1 here because WL_TIMEOUT is set. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Oct 10, 2022 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > Thanks for the updated patches. > > 2. With the below change, the time walreceiver spends in > > XLogWalRcvSendReply() is also included for XLogWalRcvSendHSFeedback > > right? I think it's a problem given that XLogWalRcvSendReply() can > > take a while. Earlier, this wasn't the case, each function calculating > > 'now' separately. Any reason for changing this behaviour? I know that > > GetCurrentTimestamp(); isn't cheaper all the time, but here it might > > result in a change in the behaviour. > > Yes, if XLogWalRcvSendReply() takes a long time, we might defer sending the > hot_standby_feedback message until a later time. The only reason I changed > this was to avoid extra calls to GetCurrentTimestamp(), which might be > expensive on some platforms. Agree. However... > Outside of the code snippet you pointed out, > I think WalReceiverMain() has a similar problem. That being said, I'm > generally skeptical that this sort of thing is detrimental given the > current behavior (i.e., wake up every 100ms), the usual values of these > GUCs (i.e., tens of seconds), and the fact that any tasks that are > inappropriately skipped will typically be retried in the next iteration of > the loop. AFICS, the aim of the patch isn't optimizing around GetCurrentTimestamp() calls and the patch does seem to change quite a bit of them which may cause a change of the behaviour. I think that the GetCurrentTimestamp() optimizations can go to 0003 patch in this thread itself or we can discuss it as a separate thread to seek more thoughts. The 'now' in many instances in the patch may not actually represent the true current time and it includes time taken by other operations as well. I think this will have problems. + XLogWalRcvSendReply(state, now, false, false); + XLogWalRcvSendHSFeedback(state, now, false); + now = GetCurrentTimestamp(); + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) + WalRcvComputeNextWakeup(&state, i, now); + XLogWalRcvSendHSFeedback(&state, now, true); + now = GetCurrentTimestamp(); + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) + WalRcvComputeNextWakeup(&state, i, now); + XLogWalRcvSendHSFeedback(&state, now, true); > > 4. How about simplifying WalRcvComputeNextWakeup() something like below? > > Other than saving a couple lines of code, IMO this doesn't meaningfully > simplify the function or improve readability. IMO, the duplicate lines of code aren't necessary in WalRcvComputeNextWakeup() and that function's code isn't that complex by removing them. > > 5. Can we move below code snippets to respective static functions for > > better readability and code reuse? > > This: > > + /* Find the soonest wakeup time, to limit our nap. */ > > + nextWakeup = INT64_MAX; > > + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) > > + nextWakeup = Min(state.wakeup[i], nextWakeup); > > + nap = Max(0, (nextWakeup - now + 999) / 1000); > > > > And this: > > > > + now = GetCurrentTimestamp(); > > + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) > > + WalRcvComputeNextWakeup(&state, i, now); > > This did cross my mind, but I opted to avoid creating new functions because > 1) they aren't likely to be reused very much, and 2) I actually think it > might hurt readability by forcing developers to traipse around the file to > figure out what these functions are actually doing. It's not like it would > save many lines of code, either. The second snippet is being used twice already. IMO having static functions (perhaps inline) is much better here. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Oct 11, 2022 at 07:01:26AM +0530, Bharath Rupireddy wrote: > On Mon, Oct 10, 2022 at 11:21 PM Nathan Bossart >> Outside of the code snippet you pointed out, >> I think WalReceiverMain() has a similar problem. That being said, I'm >> generally skeptical that this sort of thing is detrimental given the >> current behavior (i.e., wake up every 100ms), the usual values of these >> GUCs (i.e., tens of seconds), and the fact that any tasks that are >> inappropriately skipped will typically be retried in the next iteration of >> the loop. > > AFICS, the aim of the patch isn't optimizing around > GetCurrentTimestamp() calls and the patch does seem to change quite a > bit of them which may cause a change of the behaviour. I think that > the GetCurrentTimestamp() optimizations can go to 0003 patch in this > thread itself or we can discuss it as a separate thread to seek more > thoughts. > > The 'now' in many instances in the patch may not actually represent > the true current time and it includes time taken by other operations > as well. I think this will have problems. What problems do you think this will cause? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Oct 11, 2022 at 8:20 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > AFICS, the aim of the patch isn't optimizing around > > GetCurrentTimestamp() calls and the patch does seem to change quite a > > bit of them which may cause a change of the behaviour. I think that > > the GetCurrentTimestamp() optimizations can go to 0003 patch in this > > thread itself or we can discuss it as a separate thread to seek more > > thoughts. > > > > The 'now' in many instances in the patch may not actually represent > > the true current time and it includes time taken by other operations > > as well. I think this will have problems. > > What problems do you think this will cause? now = t1; XLogWalRcvSendReply() /* say it ran for a minute or so for whatever reasons */ XLogWalRcvSendHSFeedback() /* with patch walrecevier sends hot standby feedback more often without properly honouring wal_receiver_status_interval because the 'now' isn't actually the current time as far as that function is concerned, it is t1 + XLogWalRcvSendReply()'s time. */ Well, is this really a problem? I'm not sure about that. Let's hear from others. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Oct 11, 2022 at 09:34:25AM +0530, Bharath Rupireddy wrote: > now = t1; > XLogWalRcvSendReply() /* say it ran for a minute or so for whatever reasons */ > XLogWalRcvSendHSFeedback() /* with patch walrecevier sends hot standby > feedback more often without properly honouring > wal_receiver_status_interval because the 'now' isn't actually the > current time as far as that function is concerned, it is > t1 + XLogWalRcvSendReply()'s time. */ > > Well, is this really a problem? I'm not sure about that. Let's hear from others. For this example, the feedback message would just be sent in the next loop iteration instead. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Oct 11, 2022 at 11:22 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Oct 11, 2022 at 09:34:25AM +0530, Bharath Rupireddy wrote: > > now = t1; > > XLogWalRcvSendReply() /* say it ran for a minute or so for whatever reasons */ > > XLogWalRcvSendHSFeedback() /* with patch walrecevier sends hot standby > > feedback more often without properly honouring > > wal_receiver_status_interval because the 'now' isn't actually the > > current time as far as that function is concerned, it is > > t1 + XLogWalRcvSendReply()'s time. */ > > > > Well, is this really a problem? I'm not sure about that. Let's hear from others. > > For this example, the feedback message would just be sent in the next loop > iteration instead. I think the hot standby feedback message gets sent too frequently without honouring the wal_receiver_status_interval because the 'now' is actually not the current time with your patch but 'now + XLogWalRcvSendReply()'s time'. However, it's possible that I may be wrong here. /* * Send feedback at most once per wal_receiver_status_interval. */ if (!TimestampDifferenceExceeds(sendTime, now, wal_receiver_status_interval * 1000)) return; As said upthread [1], I think the best way to move forward is to separate the GetCurrentTimestamp() calls optimizations into 0003. Do you have any further thoughts on review comments posted upthread [1]? [1] https://www.postgresql.org/message-id/CALj2ACV5q2dbVRKwu2PL2s_YY0owZFTRxLdX%3Dt%2BdZ1iag15khA%40mail.gmail.com -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
I think in 0001 we should put more stuff in the state struct -- specifically these globals: static int recvFile = -1; static TimeLineID recvFileTLI = 0; static XLogSegNo recvSegNo = 0; The main reason is that it seems odd to have startpointTLI in the struct used in some places together with a file-global recvFileTLI which isn't. The way one is passed as argument and the other as part of a struct seemed too distracting. This should reduce the number of moving parts, ISTM. One thing that confused me for a moment is that we have some state in walrcv and some more state in 'state'. The difference is pretty obvious once you look at the other, but it suggest to me that a better variable name for the latter is 'localstate' to more obviously distinguish them. I was tempted to suggest that LogstreamResult would also be good to have in the new struct, but that might be going a bit too far for a first cut. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Thu, Oct 13, 2022 at 03:35:14PM +0530, Bharath Rupireddy wrote: > I think the hot standby feedback message gets sent too frequently > without honouring the wal_receiver_status_interval because the 'now' > is actually not the current time with your patch but 'now + > XLogWalRcvSendReply()'s time'. However, it's possible that I may be > wrong here. Is your concern that the next wakeup time will be miscalculated because of a stale value in 'now'? I think that's existing behavior, as we save the current time before sending the message in XLogWalRcvSendReply and XLogWalRcvSendHSFeedback. The only way to completely avoid this would be to call GetCurrentTimestamp() after the message has been sent and to use that value to calculate when the next message should be sent. However, I am skeptical there's really any practical impact. Is it really a problem if a message that should be sent every 30 seconds is sent at an interval of 29.9 seconds sometimes? I think it could go the other way, too. If 'now' is stale, we might decide not to send a feedback message until the next loop iteration, so the interval might be 30.1 seconds. I don't think it's realistic to expect complete accuracy here. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Oct 13, 2022 at 12:37:39PM +0200, Alvaro Herrera wrote: > I think in 0001 we should put more stuff in the state struct -- > specifically these globals: > > static int recvFile = -1; > static TimeLineID recvFileTLI = 0; > static XLogSegNo recvSegNo = 0; > > The main reason is that it seems odd to have startpointTLI in the struct > used in some places together with a file-global recvFileTLI which isn't. > The way one is passed as argument and the other as part of a struct > seemed too distracting. This should reduce the number of moving parts, > ISTM. Makes sense. Do you think the struct should be file-global so that it doesn't need to be provided as an argument to most of the static functions in this file? > One thing that confused me for a moment is that we have some state in > walrcv and some more state in 'state'. The difference is pretty obvious > once you look at the other, but it suggest to me that a better variable > name for the latter is 'localstate' to more obviously distinguish them. Sure, I'll change it to 'localstate'. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Oct 13, 2022 at 12:09:54PM -0700, Nathan Bossart wrote: > On Thu, Oct 13, 2022 at 12:37:39PM +0200, Alvaro Herrera wrote: >> The main reason is that it seems odd to have startpointTLI in the struct >> used in some places together with a file-global recvFileTLI which isn't. >> The way one is passed as argument and the other as part of a struct >> seemed too distracting. This should reduce the number of moving parts, >> ISTM. > > Makes sense. Do you think the struct should be file-global so that it > doesn't need to be provided as an argument to most of the static functions > in this file? Here's a different take. Instead of creating structs and altering function signatures, we can just make the wake-up times file-global, and we can skip the changes related to reducing the number of calls to GetCurrentTimestamp() for now. This results in a far less invasive patch. (I still think reducing the number of calls to GetCurrentTimestamp() is worthwhile, but I'm beginning to agree with Bharath that it should be handled separately.) Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Thanks for taking this up. At Sat, 15 Oct 2022 20:59:00 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in > On Thu, Oct 13, 2022 at 12:09:54PM -0700, Nathan Bossart wrote: > > On Thu, Oct 13, 2022 at 12:37:39PM +0200, Alvaro Herrera wrote: > >> The main reason is that it seems odd to have startpointTLI in the struct > >> used in some places together with a file-global recvFileTLI which isn't. > >> The way one is passed as argument and the other as part of a struct > >> seemed too distracting. This should reduce the number of moving parts, > >> ISTM. > > > > Makes sense. Do you think the struct should be file-global so that it > > doesn't need to be provided as an argument to most of the static functions > > in this file? > > Here's a different take. Instead of creating structs and altering function > signatures, we can just make the wake-up times file-global, and we can skip I'm fine with that. Rather it seems to me really simple. > the changes related to reducing the number of calls to > GetCurrentTimestamp() for now. This results in a far less invasive patch. Sure. > (I still think reducing the number of calls to GetCurrentTimestamp() is > worthwhile, but I'm beginning to agree with Bharath that it should be > handled separately.) +1. We would gain lesser for the required effort. > Thoughts? Now that I see the fix for the implicit conversion: L527: + nap = Max(0, (nextWakeup - now + 999) / 1000); .. L545: + (int) Min(INT_MAX, nap), I think limiting the naptime at use is confusing. Don't we place these adjacently? Then we could change the nap to an integer. Or we can just assert out for the nap time longer than INT_MAX (but this would require another int64 variable. I belive we won't need such a long nap, (or that is no longer a nap?) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Sun, Oct 16, 2022 at 9:29 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > Here's a different take. Instead of creating structs and altering function > signatures, we can just make the wake-up times file-global, and we can skip > the changes related to reducing the number of calls to > GetCurrentTimestamp() for now. This results in a far less invasive patch. > (I still think reducing the number of calls to GetCurrentTimestamp() is > worthwhile, but I'm beginning to agree with Bharath that it should be > handled separately.) > > Thoughts? Thanks. I'm wondering if we need to extend the similar approach for logical replication workers in logical/worker.c LogicalRepApplyLoop()? A comment on the patch: Isn't it better we track the soonest wakeup time or min of wakeup[] array (update the min whenever the array element is updated in WalRcvComputeNextWakeup()) instead of recomputing everytime by looping for NUM_WALRCV_WAKEUPS times? I think this will save us some CPU cycles, because the for loop, in which the below code is placed, runs till the walreceiver end of life cycle. We may wrap wakeup[] and min inside a structure for better code organization or just add min as another static variable alongside wakeup[]. + /* Find the soonest wakeup time, to limit our nap. */ + nextWakeup = PG_INT64_MAX; + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) + nextWakeup = Min(wakeup[i], nextWakeup); + nap = Max(0, (nextWakeup - now + 999) / 1000); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Oct 17, 2022 at 03:21:18PM +0900, Kyotaro Horiguchi wrote: > Now that I see the fix for the implicit conversion: > > L527: + nap = Max(0, (nextWakeup - now + 999) / 1000); > .. > L545: + (int) Min(INT_MAX, nap), > > > I think limiting the naptime at use is confusing. Don't we place these > adjacently? Then we could change the nap to an integer. Or we can > just assert out for the nap time longer than INT_MAX (but this would > require another int64 variable. I belive we won't need such a long > nap, (or that is no longer a nap?) Yeah, I guess this deserves a comment. I could also combine it easily: nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000)); We could probably just remove the WL_TIMEOUT flag and set timeout to -1 whenever "nap" is calculated to be > INT_MAX, but I didn't think it was worth complicating the code in order to save an extra wakeup every ~25 days. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Oct 26, 2022 at 03:05:20PM +0530, Bharath Rupireddy wrote: > A comment on the patch: > Isn't it better we track the soonest wakeup time or min of wakeup[] > array (update the min whenever the array element is updated in > WalRcvComputeNextWakeup()) instead of recomputing everytime by looping > for NUM_WALRCV_WAKEUPS times? I think this will save us some CPU > cycles, because the for loop, in which the below code is placed, runs > till the walreceiver end of life cycle. We may wrap wakeup[] and min > inside a structure for better code organization or just add min as > another static variable alongside wakeup[]. > > + /* Find the soonest wakeup time, to limit our nap. */ > + nextWakeup = PG_INT64_MAX; > + for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) > + nextWakeup = Min(wakeup[i], nextWakeup); > + nap = Max(0, (nextWakeup - now + 999) / 1000); While that might save a few CPU cycles when computing the nap time, I don't think it's worth the added complexity and CPU cycles in WalRcvComputeNextWakeup(). I suspect it'd be difficult to demonstrate any meaningful difference between the two approaches. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sat, Nov 05, 2022 at 03:00:55PM -0700, Nathan Bossart wrote: > On Mon, Oct 17, 2022 at 03:21:18PM +0900, Kyotaro Horiguchi wrote: >> Now that I see the fix for the implicit conversion: >> >> L527: + nap = Max(0, (nextWakeup - now + 999) / 1000); >> .. >> L545: + (int) Min(INT_MAX, nap), >> >> >> I think limiting the naptime at use is confusing. Don't we place these >> adjacently? Then we could change the nap to an integer. Or we can >> just assert out for the nap time longer than INT_MAX (but this would >> require another int64 variable. I belive we won't need such a long >> nap, (or that is no longer a nap?) > > Yeah, I guess this deserves a comment. I could also combine it easily: > > nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000)); Here is a new version of the patch that addresses this feedback. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Sun, Nov 6, 2022 at 12:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > Here is a new version of the patch that addresses this feedback. This looks pretty good to me. Thanks for picking it up! I can live with the change to use a global variable; it seems we couldn't quite decide what the scope was for moving stuff into a struct (ie various pre-existing stuff representing the walreceiver's state), but I'm OK with considering that as a pure refactoring project for a separate thread. That array should be "static", though.
On Tue, Nov 8, 2022 at 3:20 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sun, Nov 6, 2022 at 12:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > Here is a new version of the patch that addresses this feedback. > > This looks pretty good to me. Thanks for picking it up! I can live > with the change to use a global variable; it seems we couldn't quite > decide what the scope was for moving stuff into a struct (ie various > pre-existing stuff representing the walreceiver's state), but I'm OK > with considering that as a pure refactoring project for a separate > thread. That array should be "static", though. And with that change and a pgindent, pushed.
On Tue, Nov 8, 2022 at 1:17 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > And with that change and a pgindent, pushed. Thanks. Do we need a similar wakeup approach for logical replication workers in worker.c? Or is it okay that the nap time is 1sec there? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Nov 8, 2022 at 9:20 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > Thanks. Do we need a similar wakeup approach for logical replication > workers in worker.c? Or is it okay that the nap time is 1sec there? Yeah, I think so. At least for its idle/nap case (waiting for flush is also a technically interesting case, but harder, and applies to non-idle systems so the polling is a little less offensive).
On Tue, Nov 08, 2022 at 08:46:26PM +1300, Thomas Munro wrote: > On Tue, Nov 8, 2022 at 3:20 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> This looks pretty good to me. Thanks for picking it up! I can live >> with the change to use a global variable; it seems we couldn't quite >> decide what the scope was for moving stuff into a struct (ie various >> pre-existing stuff representing the walreceiver's state), but I'm OK >> with considering that as a pure refactoring project for a separate >> thread. That array should be "static", though. > > And with that change and a pgindent, pushed. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Nov 08, 2022 at 09:45:40PM +1300, Thomas Munro wrote: > On Tue, Nov 8, 2022 at 9:20 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: >> Thanks. Do we need a similar wakeup approach for logical replication >> workers in worker.c? Or is it okay that the nap time is 1sec there? > > Yeah, I think so. At least for its idle/nap case (waiting for flush > is also a technically interesting case, but harder, and applies to > non-idle systems so the polling is a little less offensive). Bharath, are you planning to pick this up? If not, I can probably spend some time on it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Nov 9, 2022 at 2:38 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Nov 08, 2022 at 09:45:40PM +1300, Thomas Munro wrote: > > On Tue, Nov 8, 2022 at 9:20 PM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > >> Thanks. Do we need a similar wakeup approach for logical replication > >> workers in worker.c? Or is it okay that the nap time is 1sec there? > > > > Yeah, I think so. At least for its idle/nap case (waiting for flush > > is also a technically interesting case, but harder, and applies to > > non-idle systems so the polling is a little less offensive). > > Bharath, are you planning to pick this up? If not, I can probably spend > some time on it. Please go ahead! -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Thomas Munro <thomas.munro@gmail.com> writes: > And with that change and a pgindent, pushed. There is something very seriously wrong with this patch. On my machine, running "make -j10 check-world" (with compilation already done) has been taking right about 2 minutes for some time. Since this patch, it's taking around 2:45 --- I did a bisect run to confirm that this patch is where it changed. The buildfarm is showing a hit, too. Comparing the step runtimes at https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2022-11-08%2005%3A29%3A28 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2022-11-08%2007%3A49%3A31 it'd seem that most tests involving walreceivers got much slower: pg_basebackup-check from 00:29 to 00:39, pg_rewind-check went from 00:56 to 01:26, and recovery-check went from 03:56 to 04:45. Curiously, subscription-check only went from 03:26 to 03:29. I've not dug into it further than that, but my bet is that some required wakeup condition was not accounted for. regards, tom lane
On Mon, Nov 14, 2022 at 11:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > There is something very seriously wrong with this patch. > > On my machine, running "make -j10 check-world" (with compilation > already done) has been taking right about 2 minutes for some time. > Since this patch, it's taking around 2:45 --- I did a bisect run > to confirm that this patch is where it changed. > > The buildfarm is showing a hit, too. Comparing the step runtimes at > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2022-11-08%2005%3A29%3A28 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2022-11-08%2007%3A49%3A31 > > it'd seem that most tests involving walreceivers got much slower: > pg_basebackup-check from 00:29 to 00:39, > pg_rewind-check went from 00:56 to 01:26, > and recovery-check went from 03:56 to 04:45. > Curiously, subscription-check only went from 03:26 to 03:29. > > I've not dug into it further than that, but my bet is that some > required wakeup condition was not accounted for. Looking...
On Sun, Nov 13, 2022 at 05:08:04PM -0500, Tom Lane wrote: > There is something very seriously wrong with this patch. > > On my machine, running "make -j10 check-world" (with compilation > already done) has been taking right about 2 minutes for some time. > Since this patch, it's taking around 2:45 --- I did a bisect run > to confirm that this patch is where it changed. I've been looking into this. I wrote a similar patch for logical/worker.c before noticing that check-world was taking much longer. The problem in that case seems to be that process_syncing_tables() isn't called as often. It wouldn't surprise me if there's also something in walreceiver.c that depends upon the frequent wakeups. I suspect this will require a revert. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, Nov 14, 2022 at 11:26 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Sun, Nov 13, 2022 at 05:08:04PM -0500, Tom Lane wrote: > > There is something very seriously wrong with this patch. > > > > On my machine, running "make -j10 check-world" (with compilation > > already done) has been taking right about 2 minutes for some time. > > Since this patch, it's taking around 2:45 --- I did a bisect run > > to confirm that this patch is where it changed. > > I've been looking into this. I wrote a similar patch for logical/worker.c > before noticing that check-world was taking much longer. The problem in > that case seems to be that process_syncing_tables() isn't called as often. > It wouldn't surprise me if there's also something in walreceiver.c that > depends upon the frequent wakeups. I suspect this will require a revert. In the case of "meson test pg_basebackup/020_pg_receivewal" it looks like wait_for_catchup hangs around for 10 seconds waiting for HS feedback. I'm wondering if we need to go back to high frequency wakeups until it's caught up, or (probably better) arrange for a proper event for progress. Digging...
On Mon, Nov 14, 2022 at 12:11 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Mon, Nov 14, 2022 at 11:26 AM Nathan Bossart > <nathandbossart@gmail.com> wrote: > > On Sun, Nov 13, 2022 at 05:08:04PM -0500, Tom Lane wrote: > > > There is something very seriously wrong with this patch. > > > > > > On my machine, running "make -j10 check-world" (with compilation > > > already done) has been taking right about 2 minutes for some time. > > > Since this patch, it's taking around 2:45 --- I did a bisect run > > > to confirm that this patch is where it changed. > > > > I've been looking into this. I wrote a similar patch for logical/worker.c > > before noticing that check-world was taking much longer. The problem in > > that case seems to be that process_syncing_tables() isn't called as often. > > It wouldn't surprise me if there's also something in walreceiver.c that > > depends upon the frequent wakeups. I suspect this will require a revert. > > In the case of "meson test pg_basebackup/020_pg_receivewal" it looks > like wait_for_catchup hangs around for 10 seconds waiting for HS > feedback. I'm wondering if we need to go back to high frequency > wakeups until it's caught up, or (probably better) arrange for a > proper event for progress. Digging... Maybe there is a better way to code this (I mean, who likes global variables?) and I need to test some more, but I suspect the attached is approximately what we missed.
Attachment
On Mon, Nov 14, 2022 at 12:35 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Maybe there is a better way to code this (I mean, who likes global > variables?) and I need to test some more, but I suspect the attached > is approximately what we missed. Erm, ENOCOFFEE, sorry that's not quite right, it needs the apply LSN, but it demonstrates what the problem is...
On Mon, Nov 14, 2022 at 12:51:14PM +1300, Thomas Munro wrote: > On Mon, Nov 14, 2022 at 12:35 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> Maybe there is a better way to code this (I mean, who likes global >> variables?) and I need to test some more, but I suspect the attached >> is approximately what we missed. > > Erm, ENOCOFFEE, sorry that's not quite right, it needs the apply LSN, > but it demonstrates what the problem is... Yeah, I'm able to sort-of reproduce the problem by sending fewer replies, but it's not clear to me why that's the problem. AFAICT all of the code paths that write/flush are careful to send a reply shortly afterward, which should keep writePtr/flushPtr updated. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, Nov 14, 2022 at 1:05 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > Yeah, I'm able to sort-of reproduce the problem by sending fewer replies, > but it's not clear to me why that's the problem. AFAICT all of the code > paths that write/flush are careful to send a reply shortly afterward, which > should keep writePtr/flushPtr updated. Ahh, I think I might have it. In the old coding, sendTime starts out as 0, which I think caused it to send its first reply message after the first 100ms sleep, and only after that fall into a cadence of wal_receiver_status_interval (10s) while idle. Our new coding won't send its first reply until start time + wal_receiver_status_interval. If I have that right, think we can get back to the previous behaviour by explicitly setting the first message time, like: @@ -433,6 +433,9 @@ WalReceiverMain(void) for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) WalRcvComputeNextWakeup(i, now); + /* XXX start with a reply after 100ms */ + wakeup[WALRCV_WAKEUP_REPLY] = now + 100000; + /* Loop until end-of-streaming or error */ Obviously that's bogus and racy (it races with wait_for_catchup, and it's slow, actually both sides are not great and really should be event-driven, in later work)...
On Mon, Nov 14, 2022 at 02:47:14PM +1300, Thomas Munro wrote: > Ahh, I think I might have it. In the old coding, sendTime starts out > as 0, which I think caused it to send its first reply message after > the first 100ms sleep, and only after that fall into a cadence of > wal_receiver_status_interval (10s) while idle. Our new coding won't > send its first reply until start time + wal_receiver_status_interval. > If I have that right, think we can get back to the previous behaviour > by explicitly setting the first message time, like: Good find! > @@ -433,6 +433,9 @@ WalReceiverMain(void) > for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) > WalRcvComputeNextWakeup(i, now); > > + /* XXX start with a reply after 100ms */ > + wakeup[WALRCV_WAKEUP_REPLY] = now + 100000; > + > /* Loop until end-of-streaming or error */ > > Obviously that's bogus and racy (it races with wait_for_catchup, and > it's slow, actually both sides are not great and really should be > event-driven, in later work)... Is there any reason we should wait for 100ms before sending the initial reply? ISTM the previous behavior essentially caused the first reply (and feedback message) to be sent at the first opportunity after streaming begins. My instinct is to do something like the attached. I wonder if we ought to do something similar in the ConfigReloadPending path in case hot_standby_feedback is being turned on. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Nov 15, 2022 at 8:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > Is there any reason we should wait for 100ms before sending the initial > reply? ISTM the previous behavior essentially caused the first reply (and > feedback message) to be sent at the first opportunity after streaming > begins. My instinct is to do something like the attached. I wonder if we > ought to do something similar in the ConfigReloadPending path in case > hot_standby_feedback is being turned on. That works for 020_pg_receivewal. I wonder if there are also tests that stream a bit of WAL first and then do wait_for_catchup that were previously benefiting from the 100ms-after-startup message by scheduling luck (as in, that was usually enough for replay)? I might go and teach Cluster.pm to log how much time is wasted in wait_for_catchup to get some observability, and then try to figure out how to optimise it properly. We could perhaps put the 100ms duct tape back temporarily though, if necessary.
On Tue, Nov 15, 2022 at 09:42:26AM +1300, Thomas Munro wrote: > That works for 020_pg_receivewal. I wonder if there are also tests > that stream a bit of WAL first and then do wait_for_catchup that were > previously benefiting from the 100ms-after-startup message by > scheduling luck (as in, that was usually enough for replay)? I might > go and teach Cluster.pm to log how much time is wasted in > wait_for_catchup to get some observability, and then try to figure out > how to optimise it properly. We could perhaps put the 100ms duct tape > back temporarily though, if necessary. Oh, I see. Since we don't check the apply position when determining whether to send a reply, tests may need to wait a full wal_receiver_status_interval. FWIW with my patch, the runtime of the src/test/recovery tests seems to be back to what it was on my machine, but I certainly wouldn't rule out scheduling luck. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Nov 15, 2022 at 12:14 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Tue, Nov 15, 2022 at 09:42:26AM +1300, Thomas Munro wrote: > > That works for 020_pg_receivewal. I wonder if there are also tests > > that stream a bit of WAL first and then do wait_for_catchup that were > > previously benefiting from the 100ms-after-startup message by > > scheduling luck (as in, that was usually enough for replay)? I might > > go and teach Cluster.pm to log how much time is wasted in > > wait_for_catchup to get some observability, and then try to figure out > > how to optimise it properly. We could perhaps put the 100ms duct tape > > back temporarily though, if necessary. > > Oh, I see. Since we don't check the apply position when determining > whether to send a reply, tests may need to wait a full > wal_receiver_status_interval. FWIW with my patch, the runtime of the > src/test/recovery tests seems to be back to what it was on my machine, but > I certainly wouldn't rule out scheduling luck. Yeah, what you have looks good in my experiments. Here are some statistics on waits on my system. Times in seconds, first row is before the new time logic went in, your patch is the last row. name | count | sum | avg | stddev | min | max ---------------+-------+---------+-------+--------+-------+-------- before | 718 | 37.375 | 0.052 | 0.135 | 0.006 | 1.110 master | 718 | 173.100 | 0.241 | 1.374 | 0.006 | 10.004 initial-100ms | 718 | 37.169 | 0.052 | 0.126 | 0.006 | 0.676 initial-0ms | 718 | 35.276 | 0.049 | 0.123 | 0.006 | 0.661 The difference on master is explained by these 14 outliers: name | time ------------------------------------------+-------- testrun/recovery/019_replslot_limit | 10.004 testrun/recovery/033_replay_tsp_drops | 9.917 testrun/recovery/033_replay_tsp_drops | 9.957 testrun/pg_basebackup/020_pg_receivewal | 9.899 testrun/pg_rewind/003_extrafiles | 9.961 testrun/pg_rewind/003_extrafiles | 9.916 testrun/pg_rewind/008_min_recovery_point | 9.929 recovery/019_replslot_limit | 10.004 recovery/033_replay_tsp_drops | 9.917 recovery/033_replay_tsp_drops | 9.957 pg_basebackup/020_pg_receivewal | 9.899 pg_rewind/003_extrafiles | 9.961 pg_rewind/003_extrafiles | 9.916 pg_rewind/008_min_recovery_point | 9.929 (14 rows) Now that I can see what is going on I'm going to try to see how much I can squeeze out of this...
Another option might be to just force initial reply/feedback messages when streaming starts. The attached patch improves src/test/recovery test runtime just like the previous one I posted. On Mon, Nov 14, 2022 at 11:01:27AM -0800, Nathan Bossart wrote: > I wonder if we > ought to do something similar in the ConfigReloadPending path in case > hot_standby_feedback is being turned on. Looking closer, I see that a feedback message is forced in this case already. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Nov 15, 2022 at 5:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > Another option might be to just force initial reply/feedback messages when > streaming starts. The attached patch improves src/test/recovery test > runtime just like the previous one I posted. Yeah, looks good in my tests. I think we just need to make it conditional so we don't force it if someone has wal_receiver_status_interval disabled.
Attachment
On Wed, Nov 16, 2022 at 04:57:08PM +1300, Thomas Munro wrote: > On Tue, Nov 15, 2022 at 5:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> Another option might be to just force initial reply/feedback messages when >> streaming starts. The attached patch improves src/test/recovery test >> runtime just like the previous one I posted. > > Yeah, looks good in my tests. I think we just need to make it > conditional so we don't force it if someone has > wal_receiver_status_interval disabled. Yeah, that makes sense. IIUC setting "force" to false would have the same effect for the initial round of streaming, but since writePtr/flushPtr will be set in later rounds, no reply would be guaranteed. That might be good enough for the tests, but it seems a bit inconsistent. So, your patch is probably the way to go. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Nov 16, 2022 at 5:24 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Wed, Nov 16, 2022 at 04:57:08PM +1300, Thomas Munro wrote: > > On Tue, Nov 15, 2022 at 5:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > >> Another option might be to just force initial reply/feedback messages when > >> streaming starts. The attached patch improves src/test/recovery test > >> runtime just like the previous one I posted. > > > > Yeah, looks good in my tests. I think we just need to make it > > conditional so we don't force it if someone has > > wal_receiver_status_interval disabled. > > Yeah, that makes sense. IIUC setting "force" to false would have the same > effect for the initial round of streaming, but since writePtr/flushPtr will > be set in later rounds, no reply would be guaranteed. That might be good > enough for the tests, but it seems a bit inconsistent. So, your patch is > probably the way to go. On second thoughts, there's not much point in that, since we always force replies under various other circumstances anyway. There isn't really a 'never send replies' mode. Committed the way you had it before. Thanks! I can't unsee that we're spending ~35 seconds waiting for catchup in 718 separate wait_for_catchup calls. The problem is entirely on the waiting side (we already do WalRcvForceReply() in xlogrecovery.c when going idle), and there's probably a nice condition variable-based improvement here but I decided to hop over that rabbit hole today.