Thread: Restart pg_usleep when interrupted
- Attachment protected by Amazon:
- 0001-Handle-Sleep-interrupts.patch
Hi,
In the proposal by Bertrand [1] to implement vacuum cost delay tracking
in pg_stat_progress_vacuum, it was discovered that the vacuum cost delay
ends early on the leader process of a parallel vacuum due to parallel workers
reporting progress on index vacuuming, which was introduced in 17
with commit [2]. With that commit, every time a parallel worker
completes vacuuming an index, it sends a completion message to the leader.
The facility that allows a parallel worker to report progress to the leader was
introduced in commit [3].
In the case of the patch being proposed by Bertrand, the interrupts
will be much more frequent, as parallel workers would send a message to the leader
to update the vacuum delay counters on every vacuum_delay_point call.
Looking into this, the ideal solution appears to be to give a pg_usleep
call the ability to restart after being interrupted. Thanks to the use of nanosleep as
of commit [4], this should be trivial to do, as nanosleep
reports the remaining time, which can be used to restart the sleep. This
idea was also mentioned in [5].
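(For illustration, the core of the idea is just this retry loop; a minimal
sketch, not the attached POC patch:)

#include <errno.h>
#include <time.h>

/* Sleep for the requested number of microseconds, restarting nanosleep()
 * with the unslept remainder whenever a signal interrupts it. */
static void
sleep_restarting_on_interrupt(long microsec)
{
    struct timespec delay;
    struct timespec remain;

    delay.tv_sec = microsec / 1000000L;
    delay.tv_nsec = (microsec % 1000000L) * 1000;

    while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
        delay = remain;
}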
I am attaching a POC patch that does the $SUBJECT. Rather than changing the
existing pg_usleep, the patch introduces a function, pg_usleep_handle_interrupt,
that takes the sleep duration and a boolean flag indicating whether or not to
force handling of interrupts.
This function can replace pg_usleep inside vacuum_delay_point, and could be
useful in other cases in which it’s important to handle interrupts.
For Windows, the new function's “force” = “true” mode uses SleepEx, which,
from what I can tell based on the code comments, is a non-interruptible sleep.
Thoughts?
[1] https://www.postgresql.org/message-id/ZnbIIUQpFJKAJx2W%40ip-10-97-1-34.eu-west-3.compute.internal
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=46ebdfe164c61fbac961d1eb7f40e9a684289ae6
[4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a948e49e2ef11815be0b211723bfc5b53b7f75a8
[5] https://www.postgresql.org/message-id/24848.1473607575%40sss.pgh.pa.us
Regards,
Sami Imseih
Amazon Web Services (AWS)
Hi, I think you need to find a way to disable this "Attachment protected by Amazon" thing: http://postgr.es/m/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000@email.amazonses.com We want patches to be in the pgsql-hackers archives, not temporarily accessible via some link. ...Robert
> I think you need to find a way to disable this "Attachment protected > by Amazon" thing: Yes, I am looking into that. I only noticed after sending the email. Sorry about that. Sami
> We want patches to be in the pgsql-hackers archives, not temporarily > accessible via some link. > > …Robert > Moving to another email going forward. Reattaching the patch. Regards, Sami Imseih Amazon Web Services
Attachment
Sami Imseih <samimseih@gmail.com> writes: > Reattaching the patch. I feel like this is fundamentally a wrong solution, for the reasons cited in the comment for pg_usleep: long sleeps are a bad idea because of the resulting uncertainty about whether we'll respond to interrupts and such promptly. An example here is that if we get a query cancel interrupt, we should probably not insist on finishing out the current sleep before responding. Therefore, rather than "improving" pg_usleep (and uglifying its API), the right answer is to fix parallel vacuum leaders to not depend on pg_usleep in the first place. A better idea might be to use pg_sleep() or equivalent code. regards, tom lane
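(For context, "pg_sleep() or equivalent code" means a WaitLatch-style wait; the
following is a loose paraphrase of the pg_sleep() loop in
src/backend/utils/adt/misc.c, with the variable handling simplified:)

/* Recompute the remaining time after every wakeup and check for
 * interrupts on each iteration, so cancels are handled promptly. */
endtime = GetCurrentTimestamp() + (TimestampTz) (secs * USECS_PER_SEC);

for (;;)
{
    long        delay_ms;

    CHECK_FOR_INTERRUPTS();

    delay_ms = (long) ((endtime - GetCurrentTimestamp()) / 1000);
    if (delay_ms <= 0)
        break;

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     delay_ms,
                     WAIT_EVENT_PG_SLEEP);
    ResetLatch(MyLatch);
}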
On Sat, Jun 29, 2024 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Therefore, rather than "improving" pg_usleep (and uglifying its API), > the right answer is to fix parallel vacuum leaders to not depend on > pg_usleep in the first place. A better idea might be to use > pg_sleep() or equivalent code. In case it's useful for someone looking into that, in earlier discussions we figured out that it is possible to have high resolution timeouts AND support latch multiplexing on Linux, FreeBSD, macOS: https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BhC9mFx8tEcBsyo7-cAfWgtbRy1eDizeFuff2K7T%3D4bA%40mail.gmail.com
> On Jun 28, 2024, at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Sami Imseih <samimseih@gmail.com> writes:
>> Reattaching the patch.
>
> I feel like this is fundamentally a wrong solution, for the reasons
> cited in the comment for pg_usleep: long sleeps are a bad idea
> because of the resulting uncertainty about whether we'll respond to
> interrupts and such promptly. An example here is that if we get
> a query cancel interrupt, we should probably not insist on finishing
> out the current sleep before responding.
>
> Therefore, rather than "improving" pg_usleep (and uglifying its API),
> the right answer is to fix parallel vacuum leaders to not depend on
> pg_usleep in the first place. A better idea might be to use
> pg_sleep() or equivalent code.

Yes, that is a good idea to explore and it will not require introducing
an awkward new API. I will look into using something similar to
pg_sleep.
Hi,

On Fri, Jun 28, 2024 at 05:50:20PM -0500, Sami Imseih wrote:
> Thanks for the feedback!
>
> > On Jun 28, 2024, at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I feel like this is fundamentally a wrong solution, for the reasons
> > cited in the comment for pg_usleep: long sleeps are a bad idea [...]
>
> The case which brought up this discussion is the pg_usleep that
> is called within the vacuum_delay_point being interrupted.
>
> When I read the same code comment you cited, it sounded to me
> that “long sleeps” are those that are in seconds or minutes. The longest
> vacuum delay allowed is 100ms.

I think that with the proposed patch the actual wait time can be "long".
Indeed, the time between the interruptions and restarts of the nanosleep()
call will lead to drift (as mentioned in the nanosleep() man page). So, with
a large number of interruptions, the actual wait time could be way longer
than the expected wait time.

To put numbers on it, I did a few tests with the patch (and with v2 shared in
[1]): the cost delay is 1ms and the cost limit is 10.

With 50 indexes and 10 parallel workers I can see things like:

2024-07-02 08:22:23.789 UTC [2189616] LOG: expected 1.000000, actual 239.378368
2024-07-02 08:22:24.575 UTC [2189616] LOG: expected 0.100000, actual 224.331737
2024-07-02 08:22:25.363 UTC [2189616] LOG: expected 1.300000, actual 230.462793
2024-07-02 08:22:26.154 UTC [2189616] LOG: expected 1.000000, actual 225.980803

That means we waited more than the max allowed cost delay (100ms).

With 49 parallel workers, it's worse, as I can see things like:

2024-07-02 08:26:36.069 UTC [2189807] LOG: expected 1.000000, actual 1106.790136
2024-07-02 08:26:36.298 UTC [2189807] LOG: expected 1.000000, actual 218.148985

The first actual wait time is about 1 second (it was interrupted about
16300 times during that second).

To avoid this drift, the nanosleep() man page suggests using clock_nanosleep()
with an absolute time value; that might be another idea to explore.

[1]: https://www.postgresql.org/message-id/flat/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment
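(For concreteness, the man page's suggestion amounts to something like the
sketch below, assuming CLOCK_MONOTONIC. clock_nanosleep() is POSIX but not
universally available (macOS lacks it, for instance), which is part of the
portability concern discussed next.)

#include <errno.h>
#include <time.h>

/* Sleep until an absolute deadline; interrupted-and-restarted sleeps
 * then cannot accumulate drift past the deadline. */
static void
sleep_until_deadline(long microsec)
{
    struct timespec deadline;

    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += microsec / 1000000L;
    deadline.tv_nsec += (microsec % 1000000L) * 1000;
    if (deadline.tv_nsec >= 1000000000L)
    {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    /* With TIMER_ABSTIME, simply retrying on EINTR cannot oversleep. */
    while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                           &deadline, NULL) == EINTR)
        ;
}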
> A more portable approach could be to continue using nanosleep and
> add checks to ensure that nanosleep exits whenever
> it goes past an absolute time. This was suggested by Bertrand in an offline
> conversation. I am not yet fully convinced of this idea, but posting the
> patch that implements this idea for anyone interested in looking.

oops, forgot to attach the patch. Here it is.
Attachment
Hi,

On Fri, Jul 05, 2024 at 11:49:45AM -0500, Sami Imseih wrote:
> > With 50 indexes and 10 parallel workers I can see things like: [...]
> > Means we waited more than the max allowed cost delay (100ms). [...]
> >
> > To avoid this drift, the nanosleep() man page suggests using clock_nanosleep()
> > with an absolute time value; that might be another idea to explore.
>
> A more portable approach could be to continue using nanosleep and
> add checks to ensure that nanosleep exits whenever
> it goes past an absolute time. This was suggested by Bertrand in an offline
> conversation. I am not yet fully convinced of this idea, but posting the
> patch that implements this idea for anyone interested in looking.

Thanks!

I did a few tests with the patch and did not see any "large" drifts like the
ones observed above.

As for the patch (not thoroughly reviewed, as it's still one option among
others being discussed):

+ struct timespec current;
+ float time_diff;
+
+ clock_gettime(PG_INSTR_CLOCK, &current);
+
+ time_diff = (absolute.tv_sec - current.tv_sec) + (absolute.tv_nsec - current.tv_nsec) / 1000000000.0;

I think it could be "simplified" by making use of instr_time instead of
timespec for current and absolute. Then I think it would be enough to compare
their ticks.

> Since sub-millisecond sleep times are not guaranteed as suggested by
> the vacuum_cost_delay docs ( see below ), an alternative idea
> is to use clock_nanosleep for vacuum delay when it’s available, else
> fallback to WaitLatch.

Wouldn't that increase even more the cases where sub-millisecond won't be
guaranteed?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment
> Correct. I attached a v2 of this patch that uses instr_time to check the
> elapsed time and break out of the loop. It needs some more benchmarking.

My email client has an issue sending attachments, it seems. Reattaching.

Regards,

Sami
Attachment
+ /* + * We allow nanosleep to handle interrupts and retry with the remaining time. + * However, since nanosleep is susceptible to time drift when interrupted + * frequently, we add a safeguard to break out of the nanosleep whenever the + * total time of the sleep exceeds the requested sleep time. Using nanosleep + * is a more portable approach than clock_nanosleep. + */ I'm curious why we wouldn't just subtract "elapsed_time" from "delay" at the bottom of the while loop to avoid needing this extra check. Also, I think we need some commentary about why we want to retry after an interrupt in this case. -- nathan
> I'm curious why we wouldn't just subtract "elapsed_time" from "delay" at
> the bottom of the while loop to avoid needing this extra check.

Can you elaborate further? I am not sure how this will work, since delay is
a timespec and elapsed time is an instr_time.

Also, in every iteration of the loop, the delay must be set to the remaining
time. The purpose of the elapsed_time is to make sure that we don’t surpass
the requested time delay, as an additional safeguard.

> Also, I
> think we need some commentary about why we want to retry after an interrupt
> in this case.

I will elaborate further in the comments for the next revision.

Regards,

Sami
On Thu, Jul 11, 2024 at 01:10:25PM -0500, Sami Imseih wrote:
>> I'm curious why we wouldn't just subtract "elapsed_time" from "delay" at
>> the bottom of the while loop to avoid needing this extra check.
>
> Can you elaborate further? I am not sure how this will work, since delay is
> a timespec and elapsed time is an instr_time.
>
> Also, in every iteration of the loop, the delay must be set to the remaining
> time. The purpose of the elapsed_time is to make sure that we don't surpass
> the requested time delay, as an additional safeguard.

I'm imagining something like this:

    struct timespec delay;
    TimestampTz end_time;

    end_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), msec);

    do
    {
        long    secs;
        int     microsecs;

        TimestampDifference(GetCurrentTimestamp(), end_time,
                            &secs, &microsecs);

        delay.tv_sec = secs;
        delay.tv_nsec = microsecs * 1000;

    } while (nanosleep(&delay, NULL) == -1 && errno == EINTR);

--
nathan
Hi, On Thu, Jul 11, 2024 at 10:15:41AM -0500, Sami Imseih wrote: > > > I did a few tests with the patch and did not see any "large" drifts like the > > ones observed above. > > Thanks for testing. > > > I think it could be "simplified" by making use of instr_time instead of timespec > > for current and absolute. Then I think it would be enough to compare their > > ticks. > > Correct I attached a v2 of this patch that uses instr_time to check the elapsed > time and break out of the loop. It needs some more benchmarking. Thanks! Outside of Nathan's comment: 1 === + * However, since nanosleep is susceptible to time drift when interrupted + * frequently, we add a safeguard to break out of the nanosleep whenever the I'm not sure that "nanosleep is susceptible to time drift when interrupted frequently" is a correct wording. What about? " However, since the time between frequent interruptions and restarts of the nanosleep calls can substantially lead to drift in the time when the sleep finally completes, we add...." 2 === +static void vacuum_sleep(double msec); What about a more generic name that could be used outside of vacuum? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 2024-Jul-11, Nathan Bossart wrote:
> I'm imagining something like this:
>
>     struct timespec delay;
>     TimestampTz end_time;
>
>     end_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), msec);
>
>     do
>     {
>         long    secs;
>         int     microsecs;
>
>         TimestampDifference(GetCurrentTimestamp(), end_time,
>                             &secs, &microsecs);
>
>         delay.tv_sec = secs;
>         delay.tv_nsec = microsecs * 1000;
>
>     } while (nanosleep(&delay, NULL) == -1 && errno == EINTR);

This looks nicer. We could deal with clock drift easily (in case the
sysadmin winds the clock back) by testing that tv_sec+tv_nsec is not higher
than the initial time to sleep. I don't know how common this situation is
nowadays, but I remember debugging a system years ago where autovacuum was
sleeping for a very long time because of that. I can't remember now if we
did anything in the code to cope, or just told sysadmins not to do that
anymore :-)

FWIW my (Linux's) nanosleep() manpage contains this note:

    If the interval specified in req is not an exact multiple of the
    granularity underlying clock (see time(7)), then the interval will be
    rounded up to the next multiple. Furthermore, after the sleep completes,
    there may still be a delay before the CPU becomes free to once again
    execute the calling thread.

It's not clear to me what happens if the time to sleep is zero, so maybe
there should be a "if tv_sec == 0 && tv_nsec == 0 then break" statement at
the bottom of the loop, to quit without sleeping one more time than needed.

For Windows, this [1] looks like an interesting and possibly relevant read
(though maybe SleepEx already does what we want to do here.)

[1] https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-great-rule-change/

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)
With the original proposal (restarting with the remaining time), the
requested and actual sleep times (in ms) match closely:
LOG: 10.000000,10.013420
LOG: 10.000000,10.011188
LOG: 10.000000,10.010860
LOG: 10.000000,10.014839
LOG: 10.000000,10.004542
LOG: 10.000000,10.006035
LOG: 10.000000,10.012230
LOG: 10.000000,10.014535
LOG: 10.000000,10.009645
LOG: 10.000000,10.000817
LOG: 10.000000,10.002162
LOG: 10.000000,10.011721
LOG: 10.000000,10.011655
Using the approach mentioned by Nathan, there
are large differences between requested and actual time.
LOG: 10.000000,17.801778
LOG: 10.000000,12.795450
LOG: 10.000000,11.793723
LOG: 10.000000,11.796317
LOG: 10.000000,13.785993
LOG: 10.000000,11.803775
LOG: 10.000000,15.782767
LOG: 10.000000,31.783901
LOG: 10.000000,19.792440
LOG: 10.000000,21.795795
LOG: 10.000000,18.800412
LOG: 10.000000,16.782886
LOG: 10.000000,10.795197
LOG: 10.000000,14.793333
LOG: 10.000000,29.806556
LOG: 10.000000,18.810784
LOG: 10.000000,11.804956
LOG: 10.000000,24.809812
LOG: 10.000000,25.815600
LOG: 10.000000,22.809493
LOG: 10.000000,22.790908
LOG: 10.000000,19.699097
LOG: 10.000000,23.795613
LOG: 10.000000,24.797078
On Fri, Jul 12, 2024 at 12:14:56PM -0500, Sami Imseih wrote:
> 1/ TimestampDifference has a dependency on gettimeofday,
> while my proposal utilizes clock_gettime. There are old discussions
> that did not reach a conclusion comparing both mechanisms.
> My main conclusion from these hacker discussions [1], [2] and other
> online discussions on the topic is clock_gettime should replace
> gettimeofday when possible. Precision is the main reason.
>
> 2/ It no longer uses the remain time. I think the remain time
> is still required here. I did an unrealistic stress test which shows
> the original proposal can handle frequent interruptions much better.

My comment was mostly about coding style and not about gettimeofday()
versus clock_gettime(). What does your testing show when you don't have
the extra check, i.e.,

    struct timespec delay;
    struct timespec remain;

    delay.tv_sec = microsec / 1000000L;
    delay.tv_nsec = (microsec % 1000000L) * 1000;

    while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
        delay = remain;

--
nathan
On Fri, Jul 12, 2024 at 03:39:57PM -0500, Sami Imseih wrote:
> but Bertrand found long drifts [2], which I could not reproduce.
> To safeguard against the long drifts, continue to use the &remain time with
> an additional safeguard to make sure the actual sleep does not exceed the
> requested sleep time.

Bertrand, does this safeguard fix the long drifts you saw?

--
nathan
Hi, On Mon, Jul 15, 2024 at 12:20:29PM -0500, Nathan Bossart wrote: > On Fri, Jul 12, 2024 at 03:39:57PM -0500, Sami Imseih wrote: > > but Bertrand found long drifts [2[ which I could not reproduce. > > To safeguard the long drifts, continue to use the &remain time with an > > additional safeguard to make sure the actual sleep does not exceed the > > requested sleep time. > > Bertrand, does this safeguard fix the long drifts you saw? Yeah, it was the case with the first version using the safeguard (see [1]) and it's also the case with the last one shared in [2]. [1]: https://www.postgresql.org/message-id/Zo0UdeE3i9d0Wt5E%40ip-10-97-1-34.eu-west-3.compute.internal [2]: https://www.postgresql.org/message-id/C18017A1-EDFD-4F2F-BCA1-0572D4CCC92B%40gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
attaching the patch again. Something is strange with my email client. Regards, Sami
Attachment
Hi,

On Thu, Jul 25, 2024 at 05:27:15PM -0500, Sami Imseih wrote:
> I am attaching v3 of the patch which addresses the comments made
> earlier by Bertrand about the comment in the patch [1].

Thanks!

Looking at it:

1 ===

+ struct instr_time start_time;

I think we can get rid of the "struct" keyword here.

2 ===

+ struct instr_time current_time;
+ struct instr_time elapsed_time;

Same as above.

3 ===

I gave it more thought and I think it can be simplified a bit to reduce the
number of operations in the while loop.

What about relying on an "absolute" time that way:

    instr_time absolute;
    absolute.ticks = start_time.ticks + msec * 1000000;

and then in the while loop:

    while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
    {
        instr_time current_time;
        INSTR_TIME_SET_CURRENT(current_time);

        if (current_time.ticks > absolute.ticks)
        {
            break;

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
> On Jul 26, 2024, at 3:27 AM, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
>
> 1 ===
>
> + struct instr_time start_time;
>
> I think we can get rid of the "struct" keyword here.
>
> 2 ===
>
> + struct instr_time current_time;
> + struct instr_time elapsed_time;
>
> Same as above.

Will fix those 2.

> 3 ===
>
> I gave it more thought and I think it can be simplified a bit to reduce the
> number of operations in the while loop.
>
> What about relying on an "absolute" time that way: [...]

While I agree this code is cleaner, my hesitation there is we don't
have any other place in which we access .ticks directly, and the
common practice is to use the instr_time.h APIs. What do you think?

Regards,

Sami
Hi,

On Mon, Jul 29, 2024 at 06:15:49PM -0500, Sami Imseih wrote:
> While I agree this code is cleaner, my hesitation there is we don't
> have any other place in which we access .ticks directly, and the
> common practice is to use the instr_time.h APIs.

yeah, we already have a few macros that access the .ticks, so maybe we could
add 2 new ones, say:

1. INSTR_TIME_ADD_MS(t1, msec)
2. INSTR_TIME_IS_GREATER(t1, t2)

I think the fewer operations done in the while loop, the better.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
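(For illustration, a rough sketch of what such macros could look like with
instr_time's int64 nanosecond "ticks" representation; the later patch
revisions use INSTR_TIME_ADD_MICROSEC rather than INSTR_TIME_ADD_MS, and note
the parenthesization fix discussed further down.)

/* Sketch only; assumes the existing NS_PER_US constant in instr_time.h. */

/* add t (in microseconds) to x */
#define INSTR_TIME_ADD_MICROSEC(x,t) \
    ((x).ticks += (t) * NS_PER_US)

/* is x greater than y? */
#define INSTR_TIME_IS_GREATER(x,y) \
    ((x).ticks > (y).ticks)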
> yeah, we already have a few macros that access the .ticks, so maybe we could
> add 2 new ones, say:
>
> 1. INSTR_TIME_ADD_MS(t1, msec)
> 2. INSTR_TIME_IS_GREATER(t1, t2)
>
> I think the fewer operations done in the while loop, the better.

See v4. It includes 2 new instr_time.h macros to simplify the code inside
the while loop.

Regards,

Sami
Attachment
Hi,

On Mon, Aug 05, 2024 at 03:07:34PM -0500, Sami Imseih wrote:
> See v4. It includes 2 new instr_time.h macros to simplify the
> code inside the while loop.

Thanks!

1 ===

+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((bool) (x).ticks > (y).ticks)

Parentheses are missing around the comparison; that should be
((bool) ((x).ticks > (y).ticks)). (The cast binds more tightly than ">",
so as written the ticks value is first collapsed to 0 or 1 and then
compared.) I did not pay attention to it initially, but found it was the
culprit of breaks not occurring (while my test case produces some).

That said, I don't think the cast is necessary here and that we could get
rid of it.

2 ===

What about providing a quick comment about the 2 new macros in the header
of instr_time.h? (like it is done for the other macros)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
v5 takes care of the latest comments by Bertrand. Regards, Sami
Attachment
Hi, On Tue, Aug 06, 2024 at 12:36:44PM -0500, Sami Imseih wrote: > > v5 takes care of the latest comments by Bertrand. > Thanks! Please find attached a rebased version (due to 39a138fbef), and in passing I changed one comment: "add t in microseconds to a instr_time" -> "add t (in microseconds) to x" Does that make sense to you? That said, it looks good to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
> On Aug 7, 2024, at 1:00 AM, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
>
> "add t (in microseconds) to x"

I was attempting to be more verbose in the comment, but what you have above
matches the format of the other comments. I am ok with your revision.

Regards,

Sami
Hi, On Wed, Aug 07, 2024 at 09:11:19AM -0500, Sami Imseih wrote: > > > > On Aug 7, 2024, at 1:00 AM, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > add t (in microseconds) to x” > > > I was attempting to be more verbose in the comment, > but what you have above matches the format of > the other comments. I am ok with your revision. > Thanks! I'm marking the CF entry [1], as RFC. [1]: https://commitfest.postgresql.org/49/5161/ Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Aug 07, 2024 at 06:00:53AM +0000, Bertrand Drouvot wrote: > + SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE); I think this deserves a comment. > +#define INSTR_TIME_ADD_MICROSEC(x,t) \ > + ((x).ticks += t * NS_PER_US) I'd add parentheses around "t" to ensure any expressions given as "t" are evaluated first. Also, do we need to worry about overflow here? It looks like the rest of instr_time.h is oblivious about overflow, so maybe this is better discussed in a separate thread... -- nathan
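(To illustrate the parenthesization hazard with a hypothetical compound
argument, not taken from the patch:)

/* Without parentheses around t, a call such as */
INSTR_TIME_ADD_MICROSEC(end_time, base_us + extra_us);

/* expands to the following, where only extra_us gets scaled: */
((end_time).ticks += base_us + extra_us * NS_PER_US);

/* With (t) in the macro body, the whole sum is scaled, as intended. */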
> > On Wed, Aug 07, 2024 at 06:00:53AM +0000, Bertrand Drouvot wrote: >> + SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE); > > I think this deserves a comment. > Done >> +#define INSTR_TIME_ADD_MICROSEC(x,t) \ >> + ((x).ticks += t * NS_PER_US) > > I'd add parentheses around "t" to ensure any expressions given as "t" are > evaluated first. > Done > Also, do we need to worry about overflow here? It looks like the rest of > instr_time.h is oblivious about overflow, so maybe this is better discussed > in a separate thread... > I agree, this needs to be handled in a different thread. Please see v7. Regards, Sami
Attachment
Hi,

On Wed, Aug 07, 2024 at 09:36:59AM -0500, Nathan Bossart wrote:
> Also, do we need to worry about overflow here? It looks like the rest of
> instr_time.h is oblivious about overflow, so maybe this is better discussed
> in a separate thread...

Yeah, a separate thread would be better. FWIW, and just out of curiosity:

1. it seems to me that most of the time (always?) we are manipulating
instr_time(s) as interval(s), which (with int64) gives “space” for about 292
years of interval time measurement (if my maths are correct).

2. for "absolute" manipulation (if any) it would depend on the PG_INSTR_CLOCK.
A "man clock_gettime" mentions:

2.1 CLOCK_MONOTONIC: on Linux, time since the system was booted. Not sure
what the longest Linux uptime record is, but it can't reach back further than
the 1990s.

2.2 CLOCK_REALTIME: its time represents seconds and nanoseconds since the
Epoch. It means that we’re currently about 237 years away from the limit. So
even if we were to, say, add 2 "recent" ones, we are still about 183 years
away from the limit.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
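(For reference, the arithmetic behind the 292-year figure, assuming signed
64-bit nanosecond ticks: 2^63 ns ≈ 9.22 × 10^18 ns ≈ 9.22 × 10^9 s ≈ 292
years.)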
On Thu, Aug 08, 2024 at 06:49:27AM +0000, Bertrand Drouvot wrote: > 2.2 CLOCK_REALTIME: Its time represents seconds and nanoseconds since the Epoch. > It means that we´re currently about 237 years away from the limit. So even, > if we were to say add 2 "recents" of them we are still about 183 years away from > the limit. I hope to be retired before then. -- nathan
On Wed, Aug 07, 2024 at 04:22:22PM -0500, Sami Imseih wrote: > Please see v7. Thanks. This one looks pretty good to me, and so I plan to commit it in the near future unless anyone voices concerns about the approach. -- nathan
On Thu, Aug 08, 2024 at 03:01:20PM -0500, Nathan Bossart wrote:
> Thanks. This one looks pretty good to me, and so I plan to commit it in
> the near future unless anyone voices concerns about the approach.

As I am preparing this for commit, I'm wondering whether it makes sense to
keep the new function named vacuum_sleep() and private to vacuum.c. Nothing
about this function is terribly specific to vacuum, and it's not
inconceivable that it might be useful elsewhere. Perhaps we should move it
to pgsleep.c and rename it to something to the effect of
pg_usleep_non_interruptible().

--
nathan
Hi, On Fri, Aug 09, 2024 at 02:03:55PM -0500, Nathan Bossart wrote: > On Thu, Aug 08, 2024 at 03:01:20PM -0500, Nathan Bossart wrote: > > Thanks. This one looks pretty good to me, and so I plan to commit it in > > the near future unless anyone voices concerns about the approach. > > As I am preparing this for commit, I'm wondering whether it makes sense to > name the new function vacuum_sleep() and keep it private to vacuum.c. > Nothing about this function is terribly specific to vacuum, and it's not > inconceivable that it might be useful elsewhere. Perhaps we should move it > to pgsleep.c and rename it to something to the effect of > pg_usleep_non_interruptable(). Yeah, I had the same thought in [1], so +1. [1]: https://www.postgresql.org/message-id/ZpDhS4nFX66ItAze%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> Yeah, I had the same thought in [1], so +1.
>
> [1]: https://www.postgresql.org/message-id/ZpDhS4nFX66ItAze%40ip-10-97-1-34.eu-west-3.compute.internal

The intention (see the start of the thread) was to make this a general API,
but I was not sure if there are use cases outside of vacuum.c.

In v8, I moved the function to pgsleep.c/signals.c and called it
pg_usleep_non_interruptible. The function, unlike vacuum_sleep, takes
microseconds as an argument, to match pg_usleep.

Regards,

Sami
Attachment
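(Piecing together the fragments reviewed upthread, i.e. the nanosleep retry
loop, the instr_time safeguard, and the SleepEx call, the function presumably
looks roughly like the sketch below; this is an illustration under those
assumptions, not the actual v8 patch.)

/*
 * Sketch of pg_usleep_non_interruptible(); assumes instr_time.h plus the
 * two new macros discussed above.
 */
void
pg_usleep_non_interruptible(long microsec)
{
    if (microsec <= 0)
        return;

#ifndef WIN32
    {
        struct timespec delay;
        struct timespec remain;
        instr_time  end_time;

        /* Absolute point past which we refuse to keep sleeping. */
        INSTR_TIME_SET_CURRENT(end_time);
        INSTR_TIME_ADD_MICROSEC(end_time, microsec);

        delay.tv_sec = microsec / 1000000L;
        delay.tv_nsec = (microsec % 1000000L) * 1000;

        /*
         * Unlike pg_usleep(), restart nanosleep() with the remaining time
         * after an interrupt, but never sleep past end_time: with frequent
         * interruptions, the restarts would otherwise drift well past the
         * requested duration.
         */
        while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
        {
            instr_time  current_time;

            INSTR_TIME_SET_CURRENT(current_time);
            if (INSTR_TIME_IS_GREATER(current_time, end_time))
                break;

            delay = remain;
        }
    }
#else
    /*
     * On Windows, round to milliseconds; SleepEx(..., FALSE) does not run
     * APCs, so the sleep is effectively non-interruptible already.
     */
    SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
#endif
}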
v9 has some minor corrections to the comments. Regards, Sami
Attachment
Hi, On Sat, Aug 10, 2024 at 08:27:56AM -0500, Sami Imseih wrote: > > v9 has some minor corrections to the comments. > Thanks! 1 === + * Unlike pg_usleep, This function continues s/This/this/? Apart from the above, LGTM. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> > + * Unlike pg_usleep, This function continues > > s/This/this/? > > Apart from the above, LGTM. > Thanks for this catch. Uploaded v10. Regards, Sami
Attachment
My email client attached the last response for some reason :( v10, attached in the previous message, addresses Bertrand's last comment and replaces “This” with “this". Regards, Sami
Hi, On Mon, Aug 12, 2024 at 10:19:56AM -0500, Sami Imseih wrote: > v10 attached in the previous message addresses > Bertrands last comment and replaces “This” with “this" > Thanks, v10 LGTM. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Aug 12, 2024 at 03:56:18PM +0000, Bertrand Drouvot wrote: > Thanks, v10 LGTM. As cfbot points out [0], this is missing a Windows/frontend implementation. [0] https://cirrus-ci.com/task/6393555868975104 -- nathan
> As cfbot points out [0], this is missing a Windows/frontend implementation.
>
> [0] https://cirrus-ci.com/task/6393555868975104

Looks like the pgsleep implementation is missing the #ifndef WIN32. I
followed what is done in pg_usleep. v11 should now build on Windows,
hopefully. Strangely, v10 built on a Windows machine I have locally.

Regards,

Sami
Attachment
I'm trying to understand what the point of this patch is, so I went to read
this thread from the beginning:

> In the proposal by Bertrand [1] to implement vacuum cost delay tracking
> in pg_stat_progress_vacuum, it was discovered that the vacuum cost delay
> ends early on the leader process of a parallel vacuum due to parallel workers
> reporting progress on index vacuuming, which was introduced in 17
> with commit [2]. With that commit, every time a parallel worker
> completes vacuuming an index, it sends a completion message to the leader.

Ok, so we might sometimes skip the sleep, if an interrupt is received. I
agree that's a bit sloppy, but probably won't make any difference in practice.

> The facility that allows a parallel worker to report progress to the leader
> was introduced in commit [3].
>
> In the case of the patch being proposed by Bertrand, the interrupts
> will be much more frequent, as parallel workers would send a message to the
> leader to update the vacuum delay counters on every vacuum_delay_point call.

Hmm, I wonder if that's a good design, if it results in a lot of interrupts.

On the patch itself: Making the sleeps in vacuum uninterruptible means that
vacuum will be slower to respond to interrupts. If you SIGTERM a vacuum
process, or hit CTRL-C, you *would* want to exit the sleep ASAP.

Tom raised that concern earlier in this thread
(https://www.postgresql.org/message-id/2100439.1719610468%40sss.pgh.pa.us),
but it seems the discussion wandered off to the details of how to do
the sleep, and left that unaddressed.

--
Heikki Linnakangas
Neon (https://neon.tech)
On Tue, Aug 13, 2024 at 12:04:28AM +0300, Heikki Linnakangas wrote: >> In the case of the patch being proposed by Bertrand, the number of >> interrupts > will be much more frequent as parallel workers would send a >> message > to the leader >> to update the vacuum delay counters every vacuum_delay_point call. > > Hmm, I wonder if that's a good design, if it results in a lot of interrupts. Skimming the last few messages of that thread [0], it looks like Bertrand is exploring ways to avoid so many interrupts. I guess the unavoidable question is whether this work is still worthwhile given that improvement. > On the patch itself: Making the sleeps in vacuum uninterruptible means that > vacuum will be more slow to respond to interrupts. If you SIGTERM a vacuum > process, or hit CTRL-C, you *would* want to exit the sleep ASAP. Since the delay will typically be pretty small (2 milliseconds by default for autovacuum), I'm assuming this won't ordinarily be noticeable. But I do think it is an important consideration. [0] https://commitfest.postgresql.org/49/5027/ -- nathan
> Skimming the last few messages of that thread [0], it looks like Bertrand
> is exploring ways to avoid so many interrupts. I guess the unavoidable
> question is whether this work is still worthwhile given that improvement.

The way the instrumentation in [0] dealt with interrupts was too complex,
which is why it seemed better to handle restarting the remainder of the
sleep in the sleep function itself.

>> On the patch itself: Making the sleeps in vacuum uninterruptible means that
>> vacuum will be slower to respond to interrupts. If you SIGTERM a vacuum
>> process, or hit CTRL-C, you *would* want to exit the sleep ASAP.
>
> Since the delay will typically be pretty small (2 milliseconds by default
> for autovacuum), I'm assuming this won't ordinarily be noticeable. But I
> do think it is an important consideration.

At most, the sleep will be 100ms for vacuum.

> Tom raised that concern earlier in this thread
> (https://www.postgresql.org/message-id/2100439.1719610468%40sss.pgh.pa.us),
> but it seems the discussion wandered off to the details of how to do
> the sleep, and left that unaddressed.

Doing something like pg_sleep, using WaitLatch [1], was explored. However,
this does not support the microsecond sleeps that 720de00af49 allowed.
Thomas shared WaitLatchUs [2], which supports higher-precision sleeps, but
it requires epoll_pwait(2) on the system, thus it's not very portable.

Using nanosleep with the remain time and checking for drift was the most
portable approach.

Regards,

Sami

[0] https://commitfest.postgresql.org/49/5027/
[1] https://www.postgresql.org/message-id/67072E39-3B4E-4240-8373-AC45E23721E7%40gmail.com
[2] https://www.postgresql.org/message-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com
Hi,

On Mon, Aug 12, 2024 at 05:04:02PM -0500, Nathan Bossart wrote:
> On Tue, Aug 13, 2024 at 12:04:28AM +0300, Heikki Linnakangas wrote:
> > Hmm, I wonder if that's a good design, if it results in a lot of interrupts.
>
> Skimming the last few messages of that thread [0], it looks like Bertrand
> is exploring ways to avoid so many interrupts.

Yeah, that was mainly to avoid the side effect of the interrupts making the
vacuum faster as compared to the master branch (as in [0], the leader is not
honouring the delays when the parallel workers report their delayed time):
that could be noticeable depending on the amount of work and the number of
parallel workers involved in the vacuum. That could be solved thanks to this
thread. Once this thread is over (and whatever the outcome is), I'll resume
my testing as far as the cost delay report is concerned.

> I guess the unavoidable
> question is whether this work is still worthwhile given that improvement.

The delay not being honored already affects the vacuum, since we allow a
parallel worker to report progress to the leader (see [1]). The interrupts
are far less frequent (as compared to [1]) though.

[0]: https://commitfest.postgresql.org/49/5027/
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f1889729dd

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Aug 12, 2024 at 05:35:08PM -0500, Imseih (AWS), Sami wrote: >> Skimming the last few messages of that thread [0], it looks like Bertrand >> is exploring ways to avoid so many interrupts. I guess the unavoidable >> question is whether this work is still worthwhile given that improvement. > > The way the instrumentation in [0] dealt with interrupts was too complex, > which is why it seemed better to handle the restart the remainder of the > sleep in the sleep function Can you elaborate on how it is too complex? -- nathan
On 8/13/24 10:09 AM, Nathan Bossart wrote:
> On Mon, Aug 12, 2024 at 05:35:08PM -0500, Imseih (AWS), Sami wrote:
> > The way the instrumentation in [0] dealt with interrupts was too complex,
> > which is why it seemed better to handle restarting the remainder of the
> > sleep in the sleep function itself.
>
> Can you elaborate on how it is too complex?

[0] made vacuum_delay_point more complex, as it has to instrument cost_delay
at an interval to reduce the number of interrupts to the leader. On the other
hand, with the sleep itself dealing with interrupts, no additional logic to
space out instrumentation is required.

Regards,

Sami

[0] https://commitfest.postgresql.org/49/5027/
On Tue, Aug 13, 2024 at 10:47:51AM -0500, Imseih (AWS), Sami wrote: > On 8/13/24 10:09 AM, Nathan Bossart wrote: >> On Mon, Aug 12, 2024 at 05:35:08PM -0500, Imseih (AWS), Sami wrote: >> > > Skimming the last few messages of that thread [0], it looks like Bertrand >> > > is exploring ways to avoid so many interrupts. I guess the unavoidable >> > > question is whether this work is still worthwhile given that improvement. >> > The way the instrumentation in [0] dealt with interrupts was too complex, >> > which is why it seemed better to handle the restart the remainder of the >> > sleep in the sleep function >> Can you elaborate on how it is too complex? >> > [0] made vacuum_delay_point more complex as it has to > instrument cost_delay at an interval to reduce the number > of interrupts to the leader. Sure, but looking at the patch [0], it adds maybe an extra 10 lines of code to limit the reports to 1 Hz. That doesn't strike me as too complex... [0] https://postgr.es/m/ZnlPZZZJCRu/8fka%40ip-10-97-1-34.eu-west-3.compute.internal -- nathan
On 8/13/24 10:57 AM, Nathan Bossart wrote: > On Tue, Aug 13, 2024 at 10:47:51AM -0500, Imseih (AWS), Sami wrote: >> On 8/13/24 10:09 AM, Nathan Bossart wrote: >>> On Mon, Aug 12, 2024 at 05:35:08PM -0500, Imseih (AWS), Sami wrote: >>>>> Skimming the last few messages of that thread [0], it looks like Bertrand >>>>> is exploring ways to avoid so many interrupts. I guess the unavoidable >>>>> question is whether this work is still worthwhile given that improvement. >>>> The way the instrumentation in [0] dealt with interrupts was too complex, >>>> which is why it seemed better to handle the restart the remainder of the >>>> sleep in the sleep function >>> Can you elaborate on how it is too complex? >>> >> [0] made vacuum_delay_point more complex as it has to >> instrument cost_delay at an interval to reduce the number >> of interrupts to the leader. > Sure, but looking at the patch [0], it adds maybe an extra 10 lines of code > to limit the reports to 1 Hz. That doesn't strike me as too complex... > > [0] https://postgr.es/m/ZnlPZZZJCRu/8fka%40ip-10-97-1-34.eu-west-3.compute.internal Perhaps "complex" may not be the correct way to describe it. Having to add special handling to space out instrumentation directly in vacuum_delay_point seems very odd to me. I don't think vacuum_delay_point should have to worry about this. Also, 1/ what is an appropriate interval to collect these stats? 2/ What if there are other callers in the future that wish to instrument parallel vacuum workers? they will need to implement similar logic. Regards, Sami
Please disregard this point from the last reply:

"""
2/ What if there are other callers in the future that wish
to instrument parallel vacuum workers? they will need to implement
similar logic.
"""

I misspoke about this, and this point does not matter since only the vacuum
sleep matters for this current discussion.

Regards,

Sami
On Tue, Aug 13, 2024 at 11:07:46AM -0500, Imseih (AWS), Sami wrote: > Having to add special handling to space out instrumentation > directly in vacuum_delay_point seems very odd to me. I don't > think vacuum_delay_point should have to worry about this. > > Also, > 1/ what is an appropriate interval to collect these stats? > 2/ What if there are other callers in the future that wish > to instrument parallel vacuum workers? they will need to implement > similar logic. None of this seems intractable to me. 1 Hz seems like an entirely reasonable place to start, and it is very easy to change (or to even make configurable). pg_stat_progress_vacuum might show slightly old values in this column, but that should be easy enough to explain in the docs if we are really concerned about it. If other callers want to do something similar, maybe we should add a more generic implementation in backend_progress.c. -- nathan
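(For reference, the 1 Hz throttling being discussed can be sketched roughly
as below. The helper name, the static state, and the progress index are
hypothetical; pgstat_progress_parallel_incr_param() is the existing
worker-to-leader reporting call, and the real patch lives in the linked
thread.)

static instr_time  last_report_time;
static int64       unreported_delay_ns;  /* delay not yet sent to leader */

static void
report_delay_throttled(int64 delay_ns)
{
    instr_time  now;

    unreported_delay_ns += delay_ns;

    INSTR_TIME_SET_CURRENT(now);
    if (INSTR_TIME_GET_MILLISEC(now) -
        INSTR_TIME_GET_MILLISEC(last_report_time) < 1000.0)
        return;                 /* at most one report per second */

    pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_NS,
                                        unreported_delay_ns);
    unreported_delay_ns = 0;
    last_report_time = now;
}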
> None of this seems intractable to me. 1 Hz seems like an entirely > reasonable place to start, and it is very easy to change (or to even make > configurable). pg_stat_progress_vacuum might show slightly old values in > this column, but that should be easy enough to explain in the docs if we > are really concerned about it. If other callers want to do something > similar, maybe we should add a more generic implementation in > backend_progress.c. > I don't know if I agree. Making vacuum sleep more robust to handle interrupts seems like a cleaner general solution than to add even more code to handle this case or have to explain the behavior of cost delay instrumentation in docs. Regards, Sami
On Tue, Aug 13, 2024 at 01:12:30PM -0500, Imseih (AWS), Sami wrote:
>> None of this seems intractable to me. 1 Hz seems like an entirely
>> reasonable place to start, and it is very easy to change (or to even make
>> configurable). pg_stat_progress_vacuum might show slightly old values in
>> this column, but that should be easy enough to explain in the docs if we
>> are really concerned about it. If other callers want to do something
>> similar, maybe we should add a more generic implementation in
>> backend_progress.c.
>>
> I don't know if I agree. Making vacuum sleep more robust to handle
> interrupts seems like a cleaner general solution than to add
> even more code to handle this case or have to explain the behavior of
> cost delay instrumentation in docs.
Another concern is the huge number of PqMsg_Progress messages sent by
parallel workers with that approach. In Bertrand's tests, he was seeing
nearly 350K interrupts for a ~19 minute vacuum (~300 interrupts per
second). That seems a bit extreme to me. I don't see how anyone could
possibly need stats about vacuum delays with that level of accuracy.
--
nathan
Hi,

On Tue, Aug 13, 2024 at 04:30:46PM -0500, Nathan Bossart wrote:
> On Tue, Aug 13, 2024 at 01:12:30PM -0500, Imseih (AWS), Sami wrote:
> > I don't know if I agree. Making vacuum sleep more robust to handle
> > interrupts seems like a cleaner general solution than to add
> > even more code to handle this case or have to explain the behavior of
> > cost delay instrumentation in docs.
>
> Another concern is the huge number of PqMsg_Progress messages sent by
> parallel workers with that approach. In Bertrand's tests, he was seeing
> nearly 350K interrupts for a ~19 minute vacuum (~300 interrupts per
> second). That seems a bit extreme to me. I don't see how anyone could
> possibly need stats about vacuum delays with that level of accuracy.

I gave it more thought and I don't think we have to choose between the two.
The 1 Hz approach reduces the number of interrupts and Sami's patch provides
a way to get "accurate" delay in case of interrupts. I think both have their
own benefit.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 14, 2024 at 06:00:06AM +0000, Bertrand Drouvot wrote: > I gave it more thoughts and I don't think we have to choose between the two. > The 1 Hz approach reduces the number of interrupts and Sami's patch provides a > way to get "accurate" delay in case of interrupts. I think both have their own > benefit. Is it really that important to delay with that level of accuracy? In most cases, the chances of actually interrupting a given vacuum delay point are pretty small. Even in the extreme scenario you tested with ~350K interrupts in a 19 minute vacuum, you only saw a 10-15% difference in total time. I wouldn't say I'm diametrically opposed to this patch, but I do think we need to carefully consider whether it's worth the extra code. Separately, I've been wondering whether it's worth allowing the sleep to be interrupted in certain cases, such as SIGINT and SIGTERM. That should address one of Heikki's points. -- nathan
> time. I wouldn't say I'm diametrically opposed to this patch, but I do
> think we need to carefully consider whether it's worth the extra code.

FWIW, besides the patch that Bertrand is proposing [1], there is another
parallel vacuum case being discussed, to allow for parallel heap scan [2].
Being able to support both instrumentation of sleep time by parallel workers
and ensuring that actual sleep times are as close as possible to the
requested times is a good thing, IMO.

> Separately, I've been wondering whether it's worth allowing the sleep to be
> interrupted in certain cases, such as SIGINT and SIGTERM. That should
> address one of Heikki's points.

An idea may be to check for pending interrupts inside the
pg_usleep_non_interruptible nanosleep loop. If there is a pending interrupt
and the interrupt is QueryCancelPending or ClientConnectionLost, we can break
out immediately. I am not sure yet how this can work for Windows, since for
this patch we are using a simple SleepEx call, which is non-interruptible
anyhow. Is it worth the effort and even more code to deal with specific
interrupts for such short sleeps (at most 100ms for vacuum)?

I am also thinking that the pg_usleep_non_interruptible routine should have
a cap on the sleep time allowed. That cap can be 100ms, to match the max
vacuum_cost_delay. This will prevent anyone from trying to use this API for
much longer sleeps.

What do you think?

[1] https://www.postgresql.org/message-id/flat/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
[2] https://www.postgresql.org/message-id/CAD21AoAEfCNv-GgaDheDJ%2Bs-p_Lv1H24AiJeNoPGCmZNSwL1YA%40mail.gmail.com

Regards,

Sami
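(A rough sketch of that idea, POSIX side only; treating these particular
miscadmin.h flags as safe to test here is an assumption of the sketch, not
something from a posted patch:)

while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
{
    /* Respond promptly to cancel/termination-type interrupts, but keep
     * sleeping through progress-report wakeups. */
    if (QueryCancelPending || ProcDiePending || ClientConnectionLost)
        break;          /* let the caller's CHECK_FOR_INTERRUPTS() react */

    delay = remain;
}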
On Wed, Aug 14, 2024 at 9:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > Another concern is the huge number of PqMsg_Progress messages sent by > parallel workers with that approach. In Bertrand's tests, he was seeing > nearly 350K interrupts for a ~19 minute vacuum (~300 interrupts per > second). That seems a bit extreme to me. I don't see how anyone could > possibly need stats about vacuum delays with that level of accuracy. I suspect CF #5118 would fix lots of cases of ProcSignal() senders going berserk, because it deletes SendProcSignal(), and introduces SendInterrupt(), which calls SetLatch(), which doesn't send a signal if the latch is already set. Even if the latch is not already set, it only sends a signal if the latch is currently being waited on ("maybe_sleeping" flag). Even when it sends a signal, it goes to a signalfd, kqueue or NT event flag on common platforms. Of course that is only talking about the receiving side. I'm sure we can improve the senders too. There's nothing we can do about NOTIFY, because that's under user control, but that PqMsg_Progress case sounds pretty bad, and the recovery conflict system could probably be made more precise in its logic about who to wake up and when, etc. Other backends going bananas with SendProcSignal() is the reason dsm_impl_posix_resize() has to block signals while calling posix_fallocate(). Unlike nanosleep(), which you can fix by tracking remaining time, posix_fallocate() is all-or-nothing, it has no way to report partial progress, so it must therefore undo its work if interrupted, so its EINTR retry loop could get stuck forever when other backends are trigger-happy with signals, which was a real production issue. I guess both of these issues go away in practice if CF #5118 goes in.
On Sun, Aug 18, 2024 at 11:12 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I guess both of these issues go away in practice if > CF #5118 goes in. To be more precise, if you just keep doing pg_usleep() the issue goes away, and likewise for posix_fallocate() it goes away... But if you switch to WaitLatchUs() so you can handle latch wakeups in vacuum delays, which really you should because the latch might be an urgent request for you to CHECK_FOR_INTERRUPTS(), because another backend is waiting for all backends to service a ProcSignalBarrier (we need a new name for that), well now you'll get wakeups, so you're back to square one if someone is sending them very fast.
Hi,

On Thu, Aug 15, 2024 at 04:13:29PM -0500, Nathan Bossart wrote:
> Is it really that important to delay with that level of accuracy? In most
> cases, the chances of actually interrupting a given vacuum delay point are
> pretty small. Even in the extreme scenario you tested with ~350K
> interrupts in a 19 minute vacuum, you only saw a 10-15% difference in total
> time. I wouldn't say I'm diametrically opposed to this patch, but I do
> think we need to carefully consider whether it's worth the extra code.

I'm not 100% sure that it is worth it, but OTOH I'm thinking it could be for
someone who cares enough to change the cost delay from, say, 2ms to 3ms. I
mean, given the granularity we expose in the cost delay, one could expect to
get an "accurate" delay. The doc is cautious enough to mention that "such
delays may not be measured accurately on older platforms", which makes me
think it could be worth implementing.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Tue, Aug 13, 2024 at 11:40:27AM -0500, Nathan Bossart wrote:
> On Tue, Aug 13, 2024 at 11:07:46AM -0500, Imseih (AWS), Sami wrote:
> > Having to add special handling to space out instrumentation directly
> > in vacuum_delay_point seems very odd to me. I don't think
> > vacuum_delay_point should have to worry about this.
> >
> > Also,
> > 1/ what is an appropriate interval to collect these stats?
> > 2/ what if there are other callers in the future that wish to
> > instrument parallel vacuum workers? They would need to implement
> > similar logic.
>
> None of this seems intractable to me. 1 Hz seems like an entirely
> reasonable place to start, and it is very easy to change (or even to make
> configurable). pg_stat_progress_vacuum might show slightly old values in
> this column, but that should be easy enough to explain in the docs if we
> are really concerned about it. If other callers want to do something
> similar, maybe we should add a more generic implementation in
> backend_progress.c.

As it looks like we have a consensus that reducing the number of
interrupts also makes sense, I just provided a rebased version of the 1 Hz
patch (see [0]) that also makes clear in the docs that the new field might
show slightly old values.

[0]: https://www.postgresql.org/message-id/ZsSQnS9OW9EWSOk4%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
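[Editor's note: for readers following along, the "1 Hz" approach amounts
to rate-limiting the worker-to-leader progress messages, roughly as in the
hedged sketch below. The function name is illustrative, and the progress
index PROGRESS_VACUUM_DELAY_TIME is assumed from the proposal; neither is
the patch's actual code.]

    #include "postgres.h"
    #include "utils/backend_progress.h"
    #include "utils/timestamp.h"

    static TimestampTz last_report_time = 0;
    static int64 pending_delay_us = 0;  /* delay accrued since last report */

    static void
    report_delay_throttled(int64 delay_us)
    {
        TimestampTz now = GetCurrentTimestamp();

        pending_delay_us += delay_us;

        /*
         * Ship a PqMsg_Progress message to the leader at most once per
         * second, so the leader's sleeps aren't constantly interrupted.
         */
        if (!TimestampDifferenceExceeds(last_report_time, now, 1000))
            return;

        pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
                                            pending_delay_us);
        pending_delay_us = 0;
        last_report_time = now;
    }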
> As it looks like we have a consensus that reducing the number of
> interrupts also makes sense, I just provided a rebased version of the
> 1 Hz patch (see [0]) that also makes clear in the docs that the new field
> might show slightly old values.

That makes sense. However, I suspect the "1 Hz" code will no longer be
needed if CF entry 5118 [1], mentioned by Thomas [2] a few days back, goes
through, so maybe that extra work can be removed if [1] is committed. What
do you think?

With regards to CF 5118 and what it means for the current discussion,
below are my thoughts.

I tested with both CF 5118 [1] and the cost-delay tracking patch. With
that in place, pg_usleep is able to sleep for the full requested duration,
as mentioned by Thomas [3]. This is because certain interrupts, like
Parallel Message and others, are no longer signaled by SIGUSR1 but by
latches.

From this discussion, there is a desire for a sleep function that:
1/ sleeps for the full duration of the requested time, and
2/ continues to handle important interrupts during the sleep.

While something like CF 5118 will take care of point #1, it will not deal
with #2, and the v11 [4] patch does not do #2 either. So I think that in
the sleep loop we need a C_F_I call. The same type of loop can also be
used to call WaitForSingleObject on Windows.

If CF 5118 gets committed, we will still need a similar loop that calls
C_F_I, but the function will need to call WaitLatchUs [5].

Thoughts?

--
Sami

[1] https://commitfest.postgresql.org/49/5118/
[2] https://www.postgresql.org/message-id/CA%2BhUKG%2Bf-nEc_SowDLW1JMUa6Of5sCK-JZ%3Dv-KhL1xgXk83fiw%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CA%2BhUKGKpo3fM%3DrnfdxHqt%2BjNGh_zUNcL1ob4hMsb%3DjFfKn9Aag%40mail.gmail.com
[4] https://www.postgresql.org/message-id/e3297b5e-0b33-4f45-afcd-4b00ba0b3547%40gmail.com
[5] https://www.postgresql.org/message-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com
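[Editor's note: a hedged sketch of the loop being described, restarting
nanosleep() with the remaining time but running CHECK_FOR_INTERRUPTS() at
each wakeup; the function name is illustrative and the POC patch may
differ in detail.]

    #include "postgres.h"
    #include "miscadmin.h"

    #include <errno.h>
    #include <time.h>

    /* Illustrative only: sleep the full duration, yet stay responsive. */
    static void
    sleep_full_duration(long microsec)
    {
        struct timespec delay;
        struct timespec remain;

        delay.tv_sec = microsec / 1000000L;
        delay.tv_nsec = (microsec % 1000000L) * 1000;

        while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
        {
            /* Handle query cancel, ProcSignalBarrier requests, etc. */
            CHECK_FOR_INTERRUPTS();

            /* Then go back to sleep for whatever time is left. */
            delay = remain;
        }
    }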
Hi,

On Tue, Aug 20, 2024 at 02:25:19PM -0500, Sami Imseih wrote:
> > As it looks like we have a consensus that reducing the number of
> > interrupts also makes sense, I just provided a rebased version of the
> > 1 Hz patch (see [0]) that also makes clear in the docs that the new
> > field might show slightly old values.
>
> That makes sense. However, I suspect the "1 Hz" code will no longer be
> needed if CF entry 5118 [1], mentioned by Thomas [2] a few days back,
> goes through, so maybe that extra work can be removed if [1] is
> committed. What do you think?

Yeah, agreed that the "1 Hz" work wouldn't be needed anymore if CF entry
5118 goes through *and* vacuum delays keep doing pg_usleep() (as the
leader won't receive SIGUSR1 from the parallel workers while executing
nanosleep() anymore). It could still receive SIGHUP or such, but that's
outside of the PqMsg_Progress case.

> I tested with both CF 5118 [1] and the cost-delay tracking patch. With
> that in place, pg_usleep is able to sleep for the full requested
> duration, as mentioned by Thomas [3]. This is because certain
> interrupts, like Parallel Message and others, are no longer signaled by
> SIGUSR1 but by latches.

Yeah.

> From this discussion, there is a desire for a sleep function that:
> 1/ sleeps for the full duration of the requested time, and
> 2/ continues to handle important interrupts during the sleep.
>
> While something like CF 5118 will take care of point #1,

I'm not sure: even with CF entry 5118, nanosleep() could still be
interrupted. But I agree that the leader won't be interrupted by
PqMsg_Progress anymore.

> it will not deal with #2, and the v11 [4] patch does not do #2 either.
> So I think that in the sleep loop we need a C_F_I call. The same type of
> loop can also be used to call WaitForSingleObject on Windows.
>
> If CF 5118 gets committed, we will still need a similar loop that calls
> C_F_I, but the function will need to call WaitLatchUs [5].
>
> Thoughts?

If CF 5118 gets committed, then I think we would need to move to using
WaitLatchUs() to react to urgent requests. We'd also need to find a way to
ensure that we wait for the full requested duration, depending on the
reason the process woke up (well, only if we agree that 1/ is needed, and
I'm not sure we have a consensus on that).

So I think that:

1. We should still implement the "1 Hz" stuff, as 1.1/ it could be useful
if CF 5118 gets committed and we move to WaitLatchUs(), and 1.2/ it won't
break anything if CF 5118 gets committed and we don't move to
WaitLatchUs(). For 1.1, it would still prevent the leader from being woken
up too frequently by the parallel workers.

2. We should still discuss the "need" for, and get a consensus on, a sleep
that could guarantee the wait duration (in some cases, and depending on
the reason the process is woken up).

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
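[Editor's note: if it helps the discussion, a full-duration wait on top of
the proposed API might look like the sketch below. This is an
assumption-heavy sketch: WaitLatchUs() exists only in the CF 5118 patch,
its signature is guessed to mirror WaitLatch() with a microsecond timeout,
and the function name is illustrative.]

    /* Sketch only; WaitLatchUs() and its signature are assumed. */
    static void
    latch_sleep_full_duration(long microsec)
    {
        /* TimestampTz counts microseconds, so plain arithmetic works. */
        TimestampTz end_time = GetCurrentTimestamp() + microsec;

        for (;;)
        {
            long    remaining_us;

            /* Service whatever woke us up (barrier, cancel, ...). */
            CHECK_FOR_INTERRUPTS();

            remaining_us = (long) (end_time - GetCurrentTimestamp());
            if (remaining_us <= 0)
                break;              /* full duration has elapsed */

            (void) WaitLatchUs(MyLatch,
                               WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                               remaining_us,
                               WAIT_EVENT_VACUUM_DELAY);
            ResetLatch(MyLatch);    /* re-arm, then loop and re-wait */
        }
    }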
> > From this discussion, there is a desire for a sleep function that:
> > 1/ sleeps for the full duration of the requested time, and
> > 2/ continues to handle important interrupts during the sleep.
> >
> > While something like CF 5118 will take care of point #1,
>
> I'm not sure: even with CF entry 5118, nanosleep() could still be
> interrupted. But I agree that the leader won't be interrupted by
> PqMsg_Progress anymore.

Correct.

> 1. We should still implement the "1 Hz" stuff, as 1.1/ it could be useful
> if CF 5118 gets committed and we move to WaitLatchUs(), and 1.2/ it won't
> break anything if CF 5118 gets committed and we don't move to
> WaitLatchUs(). For 1.1, it would still prevent the leader from being
> woken up too frequently by the parallel workers.

Yes. Regardless of any future changes to the behavior of pg_usleep,
preventing a leader from being woken up too frequently is good to have, so
the "1 Hz" work is still worthwhile.

> 2. We should still discuss the "need" for, and get a consensus on, a
> sleep that could guarantee the wait duration (in some cases, and
> depending on the reason the process is woken up).

Unless there is an objection, I will withdraw the CF entry [1] for this
patch next week. This discussion, however, should be one of the points
that CF 5118 must deal with; that is, 5118 will change the behavior of
pg_usleep when dealing with interrupts previously signaled by SIGUSR1.

[1] https://commitfest.postgresql.org/49/5161/

Regards,

Sami