Thread: Track the amount of time waiting due to cost_delay
Hi hackers, During the last pgconf.dev I attended Robert’s presentation about autovacuum [1] and it reminded me of an idea I had some time ago: $SUBJECT Please find attached a patch doing so by adding a new field (aka "time_delayed") to the pg_stat_progress_vacuum view. Currently one can change [autovacuum_]vacuum_cost_delay and [autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of the changes on the vacuum duration: one could observe the vacuum duration variation but the correlation to the changes is not accurate, as many other factors could impact the vacuum duration (load on the system, I/O latency, ...). This new field reports the time that the vacuum has to sleep due to cost delay: it could be useful to 1) measure the impact of the current cost_delay and cost_limit settings and 2) experiment with new values (and then help with decision making for those parameters). The patch is relatively small thanks to the work that has been done in f1889729dd (to allow parallel workers to report to the leader). [1]: https://www.pgevents.ca/events/pgconfdev2024/schedule/session/29-how-autovacuum-goes-wrong-and-can-we-please-make-it-stop-doing-that/ Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote: > During the last pgconf.dev I attended Robert´s presentation about autovacuum and > it made me remember of an idea I had some time ago: $SUBJECT This sounds like useful information to me. I wonder if we should also surface the effective cost limit for each autovacuum worker. > Currently one can change [autovacuum_]vacuum_cost_delay and > [auto vacuum]vacuum_cost_limit but has no reliable way to measure the impact of > the changes on the vacuum duration: one could observe the vacuum duration > variation but the correlation to the changes is not accurate (as many others > factors could impact the vacuum duration (load on the system, i/o latency,...)). IIUC you'd need to get information from both pg_stat_progress_vacuum and pg_stat_activity in order to know what percentage of time was being spent in cost delay. Is that how you'd expect for this to be used in practice? > pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); > pg_usleep(msec * 1000); > pgstat_report_wait_end(); > + /* Report the amount of time we slept */ > + if (VacuumSharedCostBalance != NULL) > + pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); > + else > + pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); Hm. Should we measure the actual time spent sleeping, or is a rough estimate good enough? I believe pg_usleep() might return early (e.g., if the process is signaled) or late, so this field could end up being inaccurate, although probably not by much. If we're okay with millisecond granularity, my first instinct is that what you've proposed is fine, but I figured I'd bring it up anyway. -- nathan
Hi, On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote: > On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote: > > During the last pgconf.dev I attended Robert´s presentation about autovacuum and > > it made me remember of an idea I had some time ago: $SUBJECT > > This sounds like useful information to me. Thanks for looking at it! > I wonder if we should also > surface the effective cost limit for each autovacuum worker. I'm not sure about it as I think that it could be misleading: one could query pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is due to _this_ cost_limit. But that's not necessarily true as the cost_limit could have changed multiple times since the vacuum started. So, unless there is frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed and the cost_limit could be misleading IMHO. > > Currently one can change [autovacuum_]vacuum_cost_delay and > > [auto vacuum]vacuum_cost_limit but has no reliable way to measure the impact of > > the changes on the vacuum duration: one could observe the vacuum duration > > variation but the correlation to the changes is not accurate (as many others > > factors could impact the vacuum duration (load on the system, i/o latency,...)). > > IIUC you'd need to get information from both pg_stat_progress_vacuum and > pg_stat_activity in order to know what percentage of time was being spent > in cost delay. Is that how you'd expect for this to be used in practice? Yeah, one could use a query such as: select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid) for example. Worth providing an example somewhere in the docs? > > pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); > > pg_usleep(msec * 1000); > > pgstat_report_wait_end(); > > + /* Report the amount of time we slept */ > > + if (VacuumSharedCostBalance != NULL) > > + pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); > > + else > > + pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); > > Hm. Should we measure the actual time spent sleeping, or is a rough > estimate good enough? I believe pg_usleep() might return early (e.g., if > the process is signaled) or late, so this field could end up being > inaccurate, although probably not by much. If we're okay with millisecond > granularity, my first instinct is that what you've proposed is fine, but I > figured I'd bring it up anyway. Thanks for bringing that up! I had the same thought when writing the code and came to the same conclusion. I think that's a good enough estimation, especially during a long-running vacuum (which is probably the case where users care the most). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote: > On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote: >> I wonder if we should also >> surface the effective cost limit for each autovacuum worker. > > I'm not sure about it as I think that it could be misleading: one could query > pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is > due to _this_ cost_limit. But that's not necessary true as the cost_limit could > have changed multiple times since the vacuum started. So, unless there is > frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed and > the cost_limit could be misleadind IMHO. Well, that's true for the delay, too, right (at least as of commit 7d71d3d)? -- nathan
>> This sounds like useful information to me. > Thanks for looking at it! The VacuumDelay wait event is the only visibility currently available to gauge the cost_delay. Having this information advertised by pg_stat_progress_vacuum as is being proposed is much better. However, I also think that the "number of times" the vacuum went into delay will be needed as well. Both values will be useful to tune cost_delay and cost_limit. It may also make sense to accumulate the total_time in delay and the number of times delayed in a cumulative statistics [0] view to allow a user to trend this information over time. I don't think this info fits in any of the existing views, i.e. pg_stat_database, so maybe a new view for cumulative vacuum stats may be needed. This is likely a separate discussion, but calling it out here. >> IIUC you'd need to get information from both pg_stat_progress_vacuum and >> pg_stat_activity in order to know what percentage of time was being spent >> in cost delay. Is that how you'd expect for this to be used in practice? > Yeah, one could use a query such as: > select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid) Maybe all progress views should just expose the "beentry->st_activity_start_timestamp" to let the user know when the current operation began. Regards, Sami Imseih Amazon Web Services (AWS) [0] https://www.postgresql.org/docs/current/monitoring-stats.html
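As a rough sketch of what Sami's "number of times" suggestion could look like on top of the v1 hunk quoted earlier in the thread, a per-sleep counter could be incremented next to the time report. Note that PROGRESS_VACUUM_NUM_DELAYS is an invented column index used for illustration only; it is not part of the posted patch.

    /* Report the amount of time we slept and (hypothetically) how many times */
    if (VacuumSharedCostBalance != NULL)
    {
        pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
        pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_NUM_DELAYS, 1);
    }
    else
    {
        pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
        pgstat_progress_incr_param(PROGRESS_VACUUM_NUM_DELAYS, 1);
    }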
On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > Hm. Should we measure the actual time spent sleeping, or is a rough > estimate good enough? I believe pg_usleep() might return early (e.g., if > the process is signaled) or late, so this field could end up being > inaccurate, although probably not by much. If we're okay with millisecond > granularity, my first instinct is that what you've proposed is fine, but I > figured I'd bring it up anyway. I bet you could also sleep for longer than planned, throwing the numbers off in the other direction. I'm always suspicious of this sort of thing. I tend to find nothing gives me the right answer unless I assume that the actual sleep times are randomly and systematically different from the intended sleep times by arbitrarily large amounts. I think we should at least do some testing: if we measure both the intended sleep time and the actual sleep time, how close are they? Does it change if the system is under crushing load (which might elongate sleeps) or if we spam SIGUSR1 against the vacuum process (which might shorten them)? -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jun 10, 2024 at 02:20:16PM -0500, Nathan Bossart wrote: > On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote: > > On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote: > >> I wonder if we should also > >> surface the effective cost limit for each autovacuum worker. > > > > I'm not sure about it as I think that it could be misleading: one could query > > pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is > > due to _this_ cost_limit. But that's not necessary true as the cost_limit could > > have changed multiple times since the vacuum started. So, unless there is > > frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed and > > the cost_limit could be misleadind IMHO. > > Well, that's true for the delay, too, right (at least as of commit > 7d71d3d)? Yeah right, but the patch exposes the total amount of time the vacuum has been delayed (not the cost_delay per se), which does not sound misleading to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Jun 10, 2024 at 08:12:46PM +0000, Imseih (AWS), Sami wrote: > >> This sounds like useful information to me. > > > Thanks for looking at it! > > The VacuumDelay is the only visibility available to > gauge the cost_delay. Having this information > advertised by pg_stat_progress_vacuum as is being proposed > is much better. Thanks for looking at it! > However, I also think that the > "number of times" the vacuum went into delay will be needed > as well. Both values will be useful to tune cost_delay and cost_limit. Yeah, I think that's a good idea. With v1 one could figure out how many times the delay has been triggered but that does not work anymore if: 1) cost_delay changed during the vacuum duration or 2) the patch changes the way time_delayed is measured/reported (meaning: report the actual wait time and not the theoretical time as v1 does). > > It may also make sense to accumulate the total_time in delay > and the number of times delayed in a cumulative statistics [0] > view to allow a user to trend this information overtime. > I don't think this info fits in any of the existing views, i.e. > pg_stat_database, so maybe a new view for cumulative > vacuum stats may be needed. This is likely a separate > discussion, but calling it out here. +1 > >> IIUC you'd need to get information from both pg_stat_progress_vacuum and > >> pg_stat_activity in order to know what percentage of time was being spent > >> in cost delay. Is that how you'd expect for this to be used in practice? > > > Yeah, one could use a query such as: > > > select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid) > > Maybe all progress views should just expose the "beentry->st_activity_start_timestamp " > to let the user know when the current operation began. Yeah, maybe. I think this is likely a separate discussion too; thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Jun 10, 2024 at 3:05 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi hackers, > > During the last pgconf.dev I attended Robert’s presentation about autovacuum and > it made me remember of an idea I had some time ago: $SUBJECT > > Please find attached a patch doing so by adding a new field (aka "time_delayed") > to the pg_stat_progress_vacuum view. > > Currently one can change [autovacuum_]vacuum_cost_delay and > [auto vacuum]vacuum_cost_limit but has no reliable way to measure the impact of > the changes on the vacuum duration: one could observe the vacuum duration > variation but the correlation to the changes is not accurate (as many others > factors could impact the vacuum duration (load on the system, i/o latency,...)). > > This new field reports the time that the vacuum has to sleep due to cost delay: > it could be useful to 1) measure the impact of the current cost_delay and > cost_limit settings and 2) when experimenting new values (and then help for > decision making for those parameters). > > The patch is relatively small thanks to the work that has been done in > f1889729dd (to allow parallel worker to report to the leader). Thank you for the proposal and the patch. I understand the motivation of this patch. Besides the point Nathan mentioned, I'm slightly worried that a massive number of parallel worker messages could be sent to the leader process when the cost_limit value is low. FWIW when I want to confirm the vacuum delay effect, I often use the information from the DEBUG2 log message in the VacuumUpdateCosts() function. Exposing these data (per-worker dobalance, cost_limit, cost_delay, active, and failsafe) somewhere in a view might also be helpful for users for checking vacuum delay effects. It wouldn't measure the impact of the changes on the vacuum duration, though. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote: > On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart > <nathandbossart@gmail.com> wrote: > > Hm. Should we measure the actual time spent sleeping, or is a rough > > estimate good enough? I believe pg_usleep() might return early (e.g., if > > the process is signaled) or late, so this field could end up being > > inaccurate, although probably not by much. If we're okay with millisecond > > granularity, my first instinct is that what you've proposed is fine, but I > > figured I'd bring it up anyway. > > I bet you could also sleep for longer than planned, throwing the > numbers off in the other direction. Thanks for looking at it! Agree, that's how I read "or late" from Nathan's comment above. > I'm always suspicious of this sort of thing. I tend to find nothing > gives me the right answer unless I assume that the actual sleep times > are randomly and systematically different from the intended sleep > times but arbitrarily large amounts. I think we should at least do > some testing: if we measure both the intended sleep time and the > actual sleep time, how close are they? Does it change if the system is > under crushing load (which might elongate sleeps) or if we spam > SIGUSR1 against the vacuum process (which might shorten them)? OTOH Sami proposed in [1] to count the number of times the vacuum went into delay. I think that's a good idea. His idea makes me think that, in addition to the number of waits, it would make sense to measure the "actual" sleep time (and not the intended one), so that one could measure the difference between the intended wait time (number of waits * cost_delay, if it does not change during the vacuum duration) and the actual measured wait time. So I think that in v2 we could: 1) measure the actual wait time instead, 2) count the number of times the vacuum slept. We could also 3) report the effective cost limit (as proposed by Nathan up-thread) (I think that 3) could be misleading but I'll yield to majority opinion if people think it's not). Thoughts? [1]: https://www.postgresql.org/message-id/A0935130-7C4B-4094-B6E4-C7D5086D50EF%40amazon.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
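To make the "actual wait time" idea concrete, here is a minimal sketch of what the measurement around pg_usleep() could look like, reusing the progress-report calls from the v1 hunk quoted up-thread. This is an illustration only, not the actual v2 patch: the helper name is invented, and PROGRESS_VACUUM_TIME_DELAYED is the column index proposed by the patch (not yet in core).

    #include "postgres.h"

    #include "commands/progress.h"
    #include "commands/vacuum.h"
    #include "portability/instr_time.h"
    #include "utils/backend_progress.h"
    #include "utils/wait_event.h"

    /* Hypothetical helper: sleep for msec and report the measured sleep time */
    static void
    vacuum_sleep_and_report(double msec)
    {
        instr_time  start_time,
                    elapsed;

        INSTR_TIME_SET_CURRENT(start_time);
        pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
        pg_usleep(msec * 1000);
        pgstat_report_wait_end();
        INSTR_TIME_SET_CURRENT(elapsed);
        INSTR_TIME_SUBTRACT(elapsed, start_time);

        /* Report what was actually slept, not what was requested */
        if (VacuumSharedCostBalance != NULL)
            pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                                (int64) INSTR_TIME_GET_MILLISEC(elapsed));
        else
            pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                       (int64) INSTR_TIME_GET_MILLISEC(elapsed));
    }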
Hi, On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote: > Thank you for the proposal and the patch. I understand the motivation > of this patch. Thanks for looking at it! > Beside the point Nathan mentioned, I'm slightly worried > that massive parallel messages could be sent to the leader process > when the cost_limit value is low. I see, I can/will do some testing in this area and share the numbers. > > FWIW when I want to confirm the vacuum delay effect, I often use the > information from the DEBUG2 log message in VacuumUpdateCosts() > function. Exposing these data (per-worker dobalance, cost_lmit, > cost_delay, active, and failsafe) somewhere in a view might also be > helpful for users for checking vacuum delay effects. Do you mean add time_delayed in pg_stat_progress_vacuum and cost_limit + the other data you mentioned above in another dedicated view? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote: > I'm always suspicious of this sort of thing. I tend to find nothing > gives me the right answer unless I assume that the actual sleep times > are randomly and systematically different from the intended sleep > times but arbitrarily large amounts. I think we should at least do > some testing: if we measure both the intended sleep time and the > actual sleep time, how close are they? Does it change if the system is > under crushing load (which might elongate sleeps) or if we spam > SIGUSR1 against the vacuum process (which might shorten them)? Though I (now) think that it would make sense to record the actual delay time instead (see [1]), I think it's interesting to do some testing as you suggested. With record_actual_time.txt (attached) applied on top of v1, we can see the intended and actual wait time. On my system, "no load at all" except the vacuum running, I see no diff: Tue Jun 11 09:22:06 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 41107 | 41107 | 00:00:42.301851 (1 row) Tue Jun 11 09:22:07 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 42076 | 42076 | 00:00:43.301848 (1 row) Tue Jun 11 09:22:08 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 43045 | 43045 | 00:00:44.301854 (1 row) But if I launch pg_reload_conf() 10 times in a row, I can see: Tue Jun 11 09:22:09 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 44064 | 44034 | 00:00:45.302965 (1 row) Tue Jun 11 09:22:10 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 45033 | 45003 | 00:00:46.301858 As we can see the actual wait time is 30ms less than the intended wait time with this simple test. So I still think we should go with 1) actual wait time and 2) report the number of waits (as mentioned in [1]). Does that make sense to you? [1]: https://www.postgresql.org/message-id/Zmf712A5xcOM9Hlg%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote: > So I think that in v2 we could: 1) measure the actual wait time instead, 2) > count the number of times the vacuum slept. We could also 3) reports the > effective cost limit (as proposed by Nathan up-thread) (I think that 3) could > be misleading but I'll yield to majority opinion if people think it's not). I still think the effective cost limit would be useful, if for no other reason than to help reinforce that it is distributed among the autovacuum workers. We could document that this value may change over the lifetime of a worker to help avoid misleading folks. -- nathan
On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > As we can see the actual wait time is 30ms less than the intended wait time with > this simple test. So I still think we should go with 1) actual wait time and 2) > report the number of waits (as mentioned in [1]). Does that make sense to you? I like the idea of reporting the actual wait time better, provided that we verify that doing so isn't too expensive. I think it probably isn't, because in a long-running VACUUM there is likely to be disk I/O, so the CPU overhead of a few extra gettimeofday() calls should be fairly low by comparison. I wonder if there's a noticeable hit when everything is in-memory. I guess probably not, because with any sort of normal configuration, we shouldn't be delaying after every block we process, so the cost of those gettimeofday() calls should still be getting spread across quite a bit of real work. That said, I'm not sure this experiment shows a real problem with the idea of showing intended wait time. It does establish the concept that repeated signals can throw our numbers off, but 30ms isn't much of a discrepancy. I'm worried about being off by a factor of two, or an order of magnitude. I think we still don't know if that can happen, but if we're going to show actual wait time anyway, then we don't need to explore the problems with other hypothetical systems too much. I'm not convinced that reporting the number of waits is useful. If we were going to report a possibly-inaccurate amount of actual waiting, then also reporting the number of waits might make it easier to figure out when the possibly-inaccurate number was in fact inaccurate. But I think it's way better to report an accurate amount of actual waiting, and then I'm not sure what we gain by also reporting the number of waits. -- Robert Haas EDB: http://www.enterprisedb.com
On 6/11/24 13:13, Robert Haas wrote: > On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: >> As we can see the actual wait time is 30ms less than the intended wait time with >> this simple test. So I still think we should go with 1) actual wait time and 2) >> report the number of waits (as mentioned in [1]). Does that make sense to you? > > I like the idea of reporting the actual wait time better, provided > that we verify that doing so isn't too expensive. I think it probably > isn't, because in a long-running VACUUM there is likely to be disk > I/O, so the CPU overhead of a few extra gettimeofday() calls should be > fairly low by comparison. I wonder if there's a noticeable hit when > everything is in-memory. I guess probably not, because with any sort > of normal configuration, we shouldn't be delaying after every block we > process, so the cost of those gettimeofday() calls should still be > getting spread across quite a bit of real work. Does it even require a call to gettimeofday()? The code in vacuum calculates an msec value and calls pg_usleep(msec * 1000). I don't think it is necessary to measure how long that nap was. Regards, Jan
> I'm not convinced that reporting the number of waits is useful. If we > were going to report a possibly-inaccurate amount of actual waiting, > then also reporting the number of waits might make it easier to figure > out when the possibly-inaccurate number was in fact inaccurate. But I > think it's way better to report an accurate amount of actual waiting, > and then I'm not sure what we gain by also reporting the number of > waits. I think including the number of times vacuum went into sleep will help paint a full picture of the effect of tuning the vacuum_cost_delay and vacuum_cost_limit for the user, even if we are reporting accurate amounts of actual sleeping. This is particularly true for autovacuum in which the cost limit is spread across all autovacuum workers, and knowing how many times autovacuum went to sleep will be useful along with the total time spent sleeping. Regards, Sami
On Tue, Jun 11, 2024 at 06:19:23PM +0000, Imseih (AWS), Sami wrote: >> I'm not convinced that reporting the number of waits is useful. If we >> were going to report a possibly-inaccurate amount of actual waiting, >> then also reporting the number of waits might make it easier to figure >> out when the possibly-inaccurate number was in fact inaccurate. But I >> think it's way better to report an accurate amount of actual waiting, >> and then I'm not sure what we gain by also reporting the number of >> waits. > > I think including the number of times vacuum went into sleep > will help paint a full picture of the effect of tuning the vacuum_cost_delay > and vacuum_cost_limit for the user, even if we are reporting accurate > amounts of actual sleeping. > > This is particularly true for autovacuum in which the cost limit is spread > across all autovacuum workers, and knowing how many times autovacuum > went to sleep will be useful along with the total time spent sleeping. I'm struggling to think of a scenario in which the number of waits would be useful, assuming you already know the amount of time spent waiting. Even if the number of waits is huge, it doesn't tell you much else AFAICT. I'd be much more likely to adjust the cost settings based on the percentage of time spent sleeping. -- nathan
On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > I'm struggling to think of a scenario in which the number of waits would be > useful, assuming you already know the amount of time spent waiting. Even > if the number of waits is huge, it doesn't tell you much else AFAICT. I'd > be much more likely to adjust the cost settings based on the percentage of > time spent sleeping. This is also how I see it. -- Robert Haas EDB: http://www.enterprisedb.com
>> I'm struggling to think of a scenario in which the number of waits would be >> useful, assuming you already know the amount of time spent waiting. Even >> if the number of waits is huge, it doesn't tell you much else AFAICT. I'd >> be much more likely to adjust the cost settings based on the percentage of >> time spent sleeping. > This is also how I see it. I think it may be useful for a user to be able to determine the "average sleep time" for a vacuum, especially because the vacuum cost limit and delay can be adjusted on the fly for a running vacuum. If we only show the total sleep time, the user could make wrong assumptions about how long each sleep took and might assume that all sleep delays for a particular vacuum run have been uniform in duration, when in fact they may not have been. Regards, Sami
Hi, On Tue, Jun 11, 2024 at 11:40:36AM -0500, Nathan Bossart wrote: > On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote: > > So I think that in v2 we could: 1) measure the actual wait time instead, 2) > > count the number of times the vacuum slept. We could also 3) reports the > > effective cost limit (as proposed by Nathan up-thread) (I think that 3) could > > be misleading but I'll yield to majority opinion if people think it's not). > > I still think the effective cost limit would be useful, if for no other > reason than to help reinforce that it is distributed among the autovacuum > workers. I also think it can be useful; my concern is more about putting this information in pg_stat_progress_vacuum. What about Sawada-san's proposal in [1]? (We could create a new view that would contain those data: per-worker dobalance, cost_limit, cost_delay, active, and failsafe.) [1]: https://www.postgresql.org/message-id/CAD21AoDOu%3DDZcC%2BPemYmCNGSwbgL1s-5OZkZ1Spd5pSxofWNCw%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Tue, Jun 11, 2024 at 02:48:30PM -0400, Robert Haas wrote: > On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > I'm struggling to think of a scenario in which the number of waits would be > > useful, assuming you already know the amount of time spent waiting. If we provide the actual time spent waiting, providing the number of waits would allow one to see whether there is a difference between the actual time and the intended time (i.e., number of waits * cost_delay, assuming the cost_delay stays the same for the whole vacuum duration). That should trigger some thoughts if the difference is large enough. I think that what we are doing here is to somehow add instrumentation around the "WAIT_EVENT_VACUUM_DELAY" wait event. If we were to add instrumentation for wait events (generally speaking) we'd probably also expose the number of waits per wait event (in addition to the time waited). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Tue, Jun 11, 2024 at 01:13:48PM -0400, Robert Haas wrote: > On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > As we can see the actual wait time is 30ms less than the intended wait time with > > this simple test. So I still think we should go with 1) actual wait time and 2) > > report the number of waits (as mentioned in [1]). Does that make sense to you? > > I like the idea of reporting the actual wait time better, +1 > provided > that we verify that doing so isn't too expensive. I think it probably > isn't, because in a long-running VACUUM there is likely to be disk > I/O, so the CPU overhead of a few extra gettimeofday() calls should be > fairly low by comparison. Agree. > I wonder if there's a noticeable hit when > everything is in-memory. I guess probably not, because with any sort > of normal configuration, we shouldn't be delaying after every block we > process, so the cost of those gettimeofday() calls should still be > getting spread across quite a bit of real work. I did some testing, with: shared_buffers = 12GB vacuum_cost_delay = 1 autovacuum_vacuum_cost_delay = 1 max_parallel_maintenance_workers = 0 max_parallel_workers = 0 added to a default config file. A table and all its indexes were fully in memory, the numbers are: postgres=# SELECT n.nspname, c.relname, count(*) AS buffers FROM pg_buffercache b JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database())) JOIN pg_namespace n ON n.oid = c.relnamespace GROUP BY n.nspname, c.relname ORDER BY 3 DESC LIMIT 11; nspname | relname | buffers ---------+-------------------+--------- public | large_tbl | 222280 public | large_tbl_pkey | 5486 public | large_tbl_filler7 | 1859 public | large_tbl_filler4 | 1859 public | large_tbl_filler1 | 1859 public | large_tbl_filler6 | 1859 public | large_tbl_filler3 | 1859 public | large_tbl_filler2 | 1859 public | large_tbl_filler5 | 1859 public | large_tbl_filler8 | 1859 public | large_tbl_version | 1576 (11 rows) The observed timings when vacuuming this table are: On master: vacuum phase: cumulative duration --------------------------------- scanning heap: 00:00:37.808184 vacuuming indexes: 00:00:41.808176 vacuuming heap: 00:00:54.808156 On master patched with actual time delayed: vacuum phase: cumulative duration --------------------------------- scanning heap: 00:00:36.502104 (time_delayed: 22202) vacuuming indexes: 00:00:41.002103 (time_delayed: 23769) vacuuming heap: 00:00:54.302096 (time_delayed: 34886) As we can see there is no noticeable degradation while the vacuum entered about 34886 times in this instrumentation code path (cost_delay was set to 1). > That said, I'm not sure this experiment shows a real problem with the > idea of showing intended wait time. It does establish the concept that > repeated signals can throw our numbers off, but 30ms isn't much of a > discrepancy. Yeah, the idea was just to show how easy it is to create a 30ms discrepancy. > I'm worried about being off by a factor of two, or an > order of magnitude. I think we still don't know if that can happen, > but if we're going to show actual wait time anyway, then we don't need > to explore the problems with other hypothetical systems too much. Agree. > I'm not convinced that reporting the number of waits is useful. 
If we > were going to report a possibly-inaccurate amount of actual waiting, > then also reporting the number of waits might make it easier to figure > out when the possibly-inaccurate number was in fact inaccurate. But I > think it's way better to report an accurate amount of actual waiting, > and then I'm not sure what we gain by also reporting the number of > waits. Sami shared his thoughts in [1] and [2] and so did I in [3]. If some of us still don't think that reporting the number of waits is useful then we can probably start without it. [1]: https://www.postgresql.org/message-id/0EA474B6-BF88-49AE-82CA-C1A9A3C17727%40amazon.com [2]: https://www.postgresql.org/message-id/E12435E2-5FCA-49B0-9ADB-0E7153F95E2D%40amazon.com [3]: https://www.postgresql.org/message-id/ZmmGG4e%2BqTBD2kfn%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Tue, Jun 11, 2024 at 08:26:23AM +0000, Bertrand Drouvot wrote: > Hi, > > On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote: > > > Thank you for the proposal and the patch. I understand the motivation > > of this patch. > > Thanks for looking at it! > > > Beside the point Nathan mentioned, I'm slightly worried > > that massive parallel messages could be sent to the leader process > > when the cost_limit value is low. > > I see, I can/will do some testing in this area and share the numbers. Here is the result of the test. It has been launched several times and it produced the same (surprising result) each time. ====================== Context ================================================ The testing has been done with this relation (large_tbl) and its indexes: postgres=# SELECT n.nspname, c.relname, count(*) AS buffers FROM pg_buffercache b JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database())) JOIN pg_namespace n ON n.oid = c.relnamespace GROUP BY n.nspname, c.relname ORDER BY 3 DESC LIMIT 22; nspname | relname | buffers ---------+--------------------+--------- public | large_tbl | 222280 public | large_tbl_filler13 | 125000 public | large_tbl_filler6 | 125000 public | large_tbl_filler5 | 125000 public | large_tbl_filler3 | 125000 public | large_tbl_filler15 | 125000 public | large_tbl_filler4 | 125000 public | large_tbl_filler20 | 125000 public | large_tbl_filler18 | 125000 public | large_tbl_filler14 | 125000 public | large_tbl_filler8 | 125000 public | large_tbl_filler11 | 125000 public | large_tbl_filler19 | 125000 public | large_tbl_filler7 | 125000 public | large_tbl_filler1 | 125000 public | large_tbl_filler12 | 125000 public | large_tbl_filler9 | 125000 public | large_tbl_filler17 | 125000 public | large_tbl_filler16 | 125000 public | large_tbl_filler10 | 125000 public | large_tbl_filler2 | 125000 public | large_tbl_pkey | 5486 (22 rows) All of them completly fit in memory (to avoid I/O read latency during the vacuum). The config, outside of default is: max_wal_size = 4GB shared_buffers = 30GB vacuum_cost_delay = 1 autovacuum_vacuum_cost_delay = 1 max_parallel_maintenance_workers = 8 max_parallel_workers = 10 vacuum_cost_limit = 10 autovacuum_vacuum_cost_limit = 10 My system is not overloaded, has enough resources to run this test and only this test is running. 
====================== Results ================================================ ========== With v2 (attached) applied on master postgres=# VACUUM (PARALLEL 8) large_tbl; VACUUM Time: 1146873.016 ms (19:06.873) The duration is splitted that way: Vacuum phase: cumulative time (cumulative time delayed) ======================================================= scanning heap: 00:08:16.414628 (time_delayed: 444370) vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293) vacuuming heap: 00:19:06.814617 (time_delayed: 2767540) I sampled active sessions from pg_stat_activity (one second interval), here is the summary during the vacuuming indexes phase (ordered by count): leader_pid | pid | wait_event | count ------------+--------+----------------+------- 452996 | 453225 | VacuumDelay | 366 452996 | 453223 | VacuumDelay | 363 452996 | 453226 | VacuumDelay | 362 452996 | 453224 | VacuumDelay | 361 452996 | 453222 | VacuumDelay | 359 452996 | 453221 | VacuumDelay | 359 | 452996 | VacuumDelay | 331 | 452996 | CPU | 30 452996 | 453224 | WALWriteLock | 23 452996 | 453222 | WALWriteLock | 20 452996 | 453226 | WALWriteLock | 20 452996 | 453221 | WALWriteLock | 19 | 452996 | WalSync | 18 452996 | 453225 | WALWriteLock | 18 452996 | 453223 | WALWriteLock | 16 | 452996 | WALWriteLock | 15 452996 | 453221 | CPU | 14 452996 | 453222 | CPU | 14 452996 | 453223 | CPU | 12 452996 | 453224 | CPU | 10 452996 | 453226 | CPU | 10 452996 | 453225 | CPU | 8 452996 | 453223 | WalSync | 4 452996 | 453221 | WalSync | 2 452996 | 453226 | WalWrite | 2 452996 | 453221 | WalWrite | 1 | 452996 | ParallelFinish | 1 452996 | 453224 | WalSync | 1 452996 | 453225 | WalSync | 1 452996 | 453222 | WalWrite | 1 452996 | 453225 | WalWrite | 1 452996 | 453222 | WalSync | 1 452996 | 453226 | WalSync | 1 ========== On master (v2 not applied) postgres=# VACUUM (PARALLEL 8) large_tbl; VACUUM Time: 1322598.087 ms (22:02.598) Surprisingly it has been longer on master by about 3 minutes. 
Let's see how the time is splitted: Vacuum phase: cumulative time ============================= scanning heap: 00:08:07.061196 vacuuming indexes: 00:17:50.961228 vacuuming heap: 00:22:02.561199 I sampled active sessions from pg_stat_activity (one second interval), here is the summary during the vacuuming indexes phase (ordered by count): leader_pid | pid | wait_event | count ------------+--------+-------------------+------- 468682 | 468858 | VacuumDelay | 548 468682 | 468862 | VacuumDelay | 547 468682 | 468859 | VacuumDelay | 547 468682 | 468860 | VacuumDelay | 545 468682 | 468857 | VacuumDelay | 543 468682 | 468861 | VacuumDelay | 542 | 468682 | VacuumDelay | 378 | 468682 | ParallelFinish | 182 468682 | 468861 | WALWriteLock | 19 468682 | 468857 | WALWriteLock | 19 468682 | 468859 | WALWriteLock | 18 468682 | 468858 | WALWriteLock | 16 468682 | 468860 | WALWriteLock | 15 468682 | 468862 | WALWriteLock | 15 468682 | 468862 | CPU | 12 468682 | 468857 | CPU | 10 468682 | 468859 | CPU | 10 468682 | 468861 | CPU | 10 | 468682 | CPU | 9 468682 | 468860 | CPU | 9 468682 | 468860 | WalSync | 8 | 468682 | WALWriteLock | 7 468682 | 468858 | WalSync | 6 468682 | 468858 | CPU | 6 468682 | 468862 | WalSync | 3 468682 | 468857 | WalSync | 3 468682 | 468861 | WalSync | 3 468682 | 468859 | WalSync | 2 468682 | 468861 | WalWrite | 2 468682 | 468857 | WalWrite | 1 468682 | 468858 | WalWrite | 1 468682 | 468861 | WALBufMappingLock | 1 468682 | 468857 | WALBufMappingLock | 1 | 468682 | WALBufMappingLock | 1 ====================== Observations =========================================== As compare to v2: 1. scanning heap time is about the same 2. vacuuming indexes time is about 3 minutes longer on master 3. vacuuming heap time is about the same One difference we can see in the sampling, is that on master the "ParallelFinish" has been sampled about 182 times (so could be _the_ 3 minutes of interest) for the leader. On master the vacuum indexes phase has been running between 2024-06-13 10:11:34 and 2024-06-13 10:21:15. If I extract the exact minutes and the counts for the "ParallelFinish" wait event I get: minute | wait_event | count --------+----------------+------- 18 | ParallelFinish | 48 19 | ParallelFinish | 60 20 | ParallelFinish | 60 21 | ParallelFinish | 14 So it's likely that the leader waited on ParallelFinish during about 3 minutes at the end of the vacuuming indexes phase (as this wait appeared during consecutives samples). ====================== Conclusion ============================================= 1. During the scanning heap and vacuuming heap phases no noticeable performance degradation has been observed with v2 applied (as compare to master) (cc'ing Robert as it's also related to his question about noticeable hit when everything is in-memory in [1]). 2. During the vacuuming indexes phase, v2 has been faster (as compare to master). The reason is that on master the leader has been waiting during about 3 minutes on "ParallelFinish" at the end. ====================== Remarks ================================================ As v2 is attached, please find below a summary about the current state of this thread: 1. v2 implements delay_time as the actual wait time (and not the intended wait time as proposed in v1). 2. some measurements have been done to check the impact of this new instrumentation (see this email and [2]): no noticeable performance degradation has been observed (and surprisingly that's the opposite as mentioned above). 3. 
there is an ongoing discussion about exposing the number of waits [2]. 4. there is an ongoing discussion about exposing the effective cost limit [3]. 5. it could be interesting to have a closer look at why the leader waits for about 3 minutes on "ParallelFinish" on master and not with v2 applied (but that's probably out of scope for this thread). [1]: https://www.postgresql.org/message-id/CA%2BTgmoZiC%3DzeCDYuMpB%2BGb2yK%3DrTQCGMu0VoxehocKyHxr9Erg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/ZmmOOPwMFIltkdsN%40ip-10-97-1-34.eu-west-3.compute.internal [3]: https://www.postgresql.org/message-id/Zml9%2Bu37iS7DFkJL%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Thu, Jun 13, 2024 at 11:56:26AM +0000, Bertrand Drouvot wrote: > ====================== Observations =========================================== > > As compare to v2: > > 1. scanning heap time is about the same > 2. vacuuming indexes time is about 3 minutes longer on master > 3. vacuuming heap time is about the same I had a closer look to understand why the vacuuming indexes time has been about 3 minutes longer on master. During the vacuuming indexes phase, the leader is helping vacuum the indexes until it reaches WaitForParallelWorkersToFinish() (meaning that once all the remaining indexes are being handled by the parallel workers, the leader has nothing more to do, so it waits for the parallel workers to finish). While the leader process is involved in index vacuuming, it is also subject to waits due to cost delay. But with v2 applied, the leader may be interrupted by the parallel workers while it is waiting (due to the new pgstat_progress_parallel_incr_param() calls that the patch is adding). So, with v2 applied, the leader waits less (as it is interrupted while waiting) when involved in index vacuuming, and that's why v2 is "faster" than master. To put some numbers on it, I counted the number of times the leader's pg_usleep() has been interrupted (by counting the number of times the nanosleep() did return a value < 0 in pg_usleep()). Here they are: v2: the leader has been interrupted about 342605 times master: the leader has been interrupted about 36 times The ones on master are mainly coming from the pgstat_progress_parallel_incr_param() calls in parallel_vacuum_process_one_index(). The additional ones on v2 are coming from the pgstat_progress_parallel_incr_param() calls added in vacuum_delay_point(). ======== Conclusion ====== 1. vacuuming indexes time has been longer on master because with v2, the leader has been interrupted 342605 times while waiting, making v2 "faster". 2. the leader being interrupted while waiting is also already happening on master due to the pgstat_progress_parallel_incr_param() calls in parallel_vacuum_process_one_index() (that have been added in 46ebdfe164). It has been the case "only" 36 times during my test case. I think that 2. is less of a concern but I think that 1. is something that needs to be addressed, because the leader process is not honouring its cost delay wait time in a noticeable way (at least during my test case). I have not come up with a proposal yet; I'm just sharing my investigation as to why v2 has been faster than master during the vacuuming indexes phase. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
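For context on why an interrupt shortens the leader's nap: on non-Windows platforms pg_usleep() boils down to roughly the following (a simplified sketch, not a verbatim copy of src/port/pgsleep.c), and the nanosleep() call simply returns early on a signal without being restarted:

    #include <time.h>

    /* Simplified sketch of the non-Windows path of pg_usleep() */
    static void
    pg_usleep_sketch(long microsec)
    {
        if (microsec > 0)
        {
            struct timespec delay;

            delay.tv_sec = microsec / 1000000L;
            delay.tv_nsec = (microsec % 1000000L) * 1000;
            (void) nanosleep(&delay, NULL);     /* not retried if interrupted */
        }
    }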
Hi, On Sat, Jun 22, 2024 at 12:48:33PM +0000, Bertrand Drouvot wrote: > 1. vacuuming indexes time has been longer on master because with v2, the leader > has been interrupted 342605 times while waiting, then making v2 "faster". > > 2. the leader being interrupted while waiting is also already happening on master > due to the pgstat_progress_parallel_incr_param() calls in > parallel_vacuum_process_one_index() (that have been added in > 46ebdfe164). It has been the case "only" 36 times during my test case. > > I think that 2. is less of a concern but I think that 1. is something that needs > to be addressed because the leader process is not honouring its cost delay wait > time in a noticeable way (at least during my test case). > > I did not think of a proposal yet, just sharing my investigation as to why > v2 has been faster than master during the vacuuming indexes phase. I think that a reasonable approach is to make the reporting from the parallel workers to the leader less aggressive (meaning: occur less frequently). Please find attached v3, that: - ensures that there is at least 1 second between 2 reports, per parallel worker, to the leader. - ensures that the reported delayed time is still correct (keeps track of the delayed time between 2 reports). - does not add any extra pg_clock_gettime_ns() calls (as compared to v2). Remarks: 1. Having a time based only approach to throttle the reporting of the parallel workers sounds reasonable. I don't think that the number of parallel workers has to come into play as: 1.1) the more parallel workers are used, the smaller the impact of the leader on the vacuum index phase duration/workload (because the vacuum work is spread across more processes). 1.2) the fewer parallel workers there are, the less the leader will be interrupted (fewer parallel workers would report their delayed time). 2. The throttling is not based on the cost limit as that value is distributed proportionally among the parallel workers (so we're back to the previous point). 3. The throttling is not based on the actual cost delay value because the leader could be interrupted at the beginning, the middle or any other part of the wait, and we are more interested in the frequency of the interrupts. 4. A 1 second reporting "throttling" looks like a reasonable threshold as: 4.1 the idea is to have a significant impact when the leader could have been interrupted, say, hundreds/thousands of times per second. 4.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum multiple times per second (so a one second reporting granularity seems ok). With this approach in place (v3 attached applied), during my test case: - the leader has been interrupted about 2500 times (instead of about 345000 times with v2) - the vacuum index phase duration is very close to the master one (it has been 4 seconds faster (over an 8 minutes 40 seconds duration), instead of 3 minutes faster with v2). Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
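To make the throttling approach concrete, here is a minimal sketch of what a parallel worker could do after each cost-delay sleep (an illustration only, not the actual v3 code: the constant and function names are invented, and the leader path is left out). It accumulates the slept time locally and forwards it to the leader at most once per second, reusing a timestamp already taken for the sleep itself so no extra clock calls are needed:

    #include "postgres.h"

    #include "commands/progress.h"
    #include "portability/instr_time.h"
    #include "utils/backend_progress.h"

    #define WORKER_DELAY_REPORT_INTERVAL_MS 1000    /* illustrative threshold */

    static double     delayed_since_last_report = 0;    /* ms slept but not yet reported */
    static instr_time last_report_time;                 /* zero-initialized: first call reports */

    /* Hypothetical helper called by a parallel worker after each cost-delay sleep */
    static void
    report_delay_throttled(instr_time sleep_end, double slept_ms)
    {
        instr_time  since_last = sleep_end;

        delayed_since_last_report += slept_ms;
        INSTR_TIME_SUBTRACT(since_last, last_report_time);

        if (INSTR_TIME_GET_MILLISEC(since_last) >= WORKER_DELAY_REPORT_INTERVAL_MS)
        {
            pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                                (int64) delayed_since_last_report);
            delayed_since_last_report = 0;
            last_report_time = sleep_end;
        }
    }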
>> 2. the leader being interrupted while waiting is also already happening on master >> due to the pgstat_progress_parallel_incr_param() calls in >> parallel_vacuum_process_one_index() (that have been added in >> 46ebdfe164). It has been the case "only" 36 times during my test case. 46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports progress, and we currently don't handle interrupts by restarting the sleep with the remaining time. nanosleep does provide the ability to restart with the remaining time [1], but I don't think it's worth the effort to ensure more accurate vacuum delays for the leader process. > 1. Having a time based only approach to throttle I do agree with a time based approach overall. > 1.1) the more parallel workers is used, the less the impact of the leader on > the vacuum index phase duration/workload is (because the repartition is done > on more processes). Did you mean "because the vacuum is done on more processes"? When a leader is operating on a large index(es) during the entirety of the vacuum operation, wouldn't more parallel workers end up interrupting the leader more often? This is why I think reporting even less often than every 1 second (more below) will be better. > 3. A 1 second reporting "throttling" looks a reasonable threshold as: > 3.1 the idea is to have a significant impact when the leader could have been > interrupted say hundred/thousand times per second. > 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum > multiple times per second (so a one second reporting granularity seems ok). I feel 1 second may still be too frequent. What about 10 seconds (or 30 seconds)? I think this metric in particular will be mainly useful for vacuum runs that are running for minutes or more, making reporting every 10 or 30 seconds still useful. It just occurred to me also that pgstat_progress_parallel_incr_param should have a code comment that it will interrupt a leader process and cause activity such as a sleep to end early. Regards, Sami Imseih Amazon Web Services (AWS) [1] https://man7.org/linux/man-pages/man2/nanosleep.2.html
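For reference, the nanosleep(2) behaviour Sami mentions: when a signal interrupts the call, the remaining time is written back, so an uninterruptible sleep could be built along these lines (a hypothetical helper for illustration, not something the patch or pg_usleep() currently does):

    #include <errno.h>
    #include <time.h>

    /* Hypothetical helper: keep sleeping until the full duration has elapsed */
    static void
    sleep_full_duration(long microsec)
    {
        struct timespec req,
                        rem;

        req.tv_sec = microsec / 1000000L;
        req.tv_nsec = (microsec % 1000000L) * 1000;

        /* restart with the remaining time whenever a signal interrupts the nap */
        while (nanosleep(&req, &rem) == -1 && errno == EINTR)
            req = rem;
    }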
Hi, On Tue, Jun 25, 2024 at 01:12:16AM +0000, Imseih (AWS), Sami wrote: Thanks for the feedback! > >> 2. the leader being interrupted while waiting is also already happening on master > >> due to the pgstat_progress_parallel_incr_param() calls in > >> parallel_vacuum_process_one_index() (that have been added in > >> 46ebdfe164). It has been the case "only" 36 times during my test case. > > 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports > progress, and we currently don't handle interrupts by restarting the sleep with > the remaining time. nanosleep does provide the ability to restart with the remaining > time [1], but I don't think it's worth the effort to ensure more accurate > vacuum delays for the leader process. +1. I don't think it's necessary to have a 100% accurate delay for all the times the delay is involved. I think that's a heuristic parameter (along with the cost limit). What matters in the end is by how much you've been able to pause the whole vacuum (not on a sleep-by-sleep basis). > > 1. Having a time based only approach to throttle > > I do agree with a time based approach overall. > > > > 1.1) the more parallel workers is used, the less the impact of the leader on > > the vacuum index phase duration/workload is (because the repartition is done > > on more processes). > > Did you mean " because the vacuum is done on more processes"? Yes. > When a leader is operating on a large index(s) during the entirety > of the vacuum operation, wouldn't more parallel workers end up > interrupting the leader more often? That's right, but my point was about the impact on the "whole" duration time and "whole" workload (leader + workers included) and not about the number of times the leader is interrupted. If there are, say, 100 workers, then interrupting the leader (1 process out of 101) is probably less of an issue, as it means that there is a lot of work to be done to keep those 100 workers busy. I don't think the size of the index the leader is vacuuming has an impact. I think that having the leader vacuum a 100 GB index or 100 x 1GB indexes is the same (as long as all the other workers are active during all that time). > > 3. A 1 second reporting "throttling" looks a reasonable threshold as: > > > 3.1 the idea is to have a significant impact when the leader could have been > > interrupted say hundred/thousand times per second. > > > 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum > > multiple times per second (so a one second reporting granularity seems ok). > > I feel 1 second may still be too frequent. Maybe we'll need more measurements, but this is what my test case is made of: vacuum_cost_delay = 1, vacuum_cost_limit = 10, 8 parallel workers, 1 leader, 21 indexes (about 1GB each, one 40MB), all in memory. This led to: with a 1 second reporting frequency, the leader has been interrupted about 2500 times over 8m39s, leading to about the same time as on master (8m43s). > What about 10 seconds ( or 30 seconds )? I'm not sure (may need more measurements) but it would probably complicate the reporting a bit (as with the current v3 we'd miss reporting the indexes that take less time than the threshold to complete). > I think this metric in particular will be mainly useful for vacuum runs that are > running for minutes or more, making reporting every 10 or 30 seconds > still useful. Agree. OTOH, one could be interested in diagnosing what happened during, say, a 5 second peak in I/O resource consumption/latency.
Sampling pg_stat_progress_vacuum at a 1 second interval and seeing by how much the vacuum has been paused during that time could help too (especially if it is made of a lot of parallel workers, which could lead to a lot of I/O). But it would miss data if we were reporting at a larger interval. > It just occurred to me also that pgstat_progress_parallel_incr_param > should have a code comment that it will interrupt a leader process and > cause activity such as a sleep to end early. Good point, I'll add a comment for it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jun 24, 2024 at 7:50 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Sat, Jun 22, 2024 at 12:48:33PM +0000, Bertrand Drouvot wrote: > > 1. vacuuming indexes time has been longer on master because with v2, the leader > > has been interrupted 342605 times while waiting, then making v2 "faster". > > > > 2. the leader being interrupted while waiting is also already happening on master > > due to the pgstat_progress_parallel_incr_param() calls in > > parallel_vacuum_process_one_index() (that have been added in > > 46ebdfe164). It has been the case "only" 36 times during my test case. > > > > I think that 2. is less of a concern but I think that 1. is something that needs > > to be addressed because the leader process is not honouring its cost delay wait > > time in a noticeable way (at least during my test case). > > > > I did not think of a proposal yet, just sharing my investigation as to why > > v2 has been faster than master during the vacuuming indexes phase. Thank you for the benchmarking and analyzing the results! I agree with your analysis and was surprised by the fact that the more times workers go to sleep, the more times the leader wakes up. > > I think that a reasonable approach is to make the reporting from the parallel > workers to the leader less aggressive (means occur less frequently). > > Please find attached v3, that: > > - ensures that there is at least 1 second between 2 reports, per parallel worker, > to the leader. > > - ensures that the reported delayed time is still correct (keep track of the > delayed time between 2 reports). > > - does not add any extra pg_clock_gettime_ns() calls (as compare to v2). > Sounds good to me. I think it's better to keep the logic for throttling the reporting the delay message simple. It's an important consideration but executing parallel vacuum with delays would be less likely to be used in practice. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
> 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports > progress, and we currently don't handle interrupts by restarting the sleep with > the remaining time. nanosleep does provide the ability to restart with the remaining > time [1], but I don't think it's worth the effort to ensure more accurate > vacuum delays for the leader process. After discussing offline with Bertrand, it may be better to have a solution that deals with the interrupts and allows the sleep to continue to completion. This will simplify this patch and will be useful for other cases in which parallel workers need to send a message to the leader. This is the thread [1] for that discussion. [1] https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000%40email.amazonses.com Regards, Sami
Hi, On Fri, Jun 28, 2024 at 08:07:39PM +0000, Imseih (AWS), Sami wrote: > > 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports > > progress, and we currently don't handle interrupts by restarting the sleep with > > the remaining time. nanosleep does provide the ability to restart with the remaining > > time [1], but I don't think it's worth the effort to ensure more accurate > > vacuum delays for the leader process. > > After discussing offline with Bertrand, it may be better to have > a solution to deal with the interrupts and allows the sleep to continue to > completion. This will simplify this patch and will be useful > for other cases in which parallel workers need to send a message > to the leader. This is the thread [1] for that discussion. > > [1] https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000%40email.amazonses.com > Yeah, I think it would make sense to put this thread on hold until we know more about the outcome of [1] (which you mentioned above). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Sep 05, 2024 at 04:59:54AM +0000, Bertrand Drouvot wrote: > Please find attached v6, a mandatory rebase due to catversion bump conflict. > I'm removing the catversion bump from the patch as it generates too frequent > conflicts (just mention it needs to be done in the commit message). v6 looks generally reasonable to me. I think the nap_time_since_last_report variable needs to be marked static, though. One thing that occurs to me is that this information may not be particularly useful when parallel workers are used. Without parallelism, it's easy enough to figure out the percentage of time that your VACUUM is spending asleep, but when there are parallel workers, it may be hard to deduce much of anything from the value. I'm not sure that this is a deal-breaker for the patch, though, if for no other reason than it'll most likely be used for autovacuum, which doesn't use parallel vacuum yet. If there are no other concerns, I'll plan on committing this one soon after a bit of editorialization. -- nathan
Hi, On Wed, Sep 18, 2024 at 04:04:53PM -0500, Nathan Bossart wrote: > On Thu, Sep 05, 2024 at 04:59:54AM +0000, Bertrand Drouvot wrote: > > Please find attached v6, a mandatory rebase due to catversion bump conflict. > > I'm removing the catversion bump from the patch as it generates too frequent > > conflicts (just mention it needs to be done in the commit message). > > v6 looks generally reasonable to me. Thanks for looking at it! > I think the > nap_time_since_last_report variable needs to be marked static, though. Agree. > One thing that occurs to me is that this information may not be > particularly useful when parallel workers are used. Without parallelism, > it's easy enough to figure out the percentage of time that your VACUUM is > spending asleep, but when there are parallel workers, it may be hard to > deduce much of anything from the value. I think that if the number of parallel workers being used is the same across runs, then one can measure "accurately" the impact of some changes (set vacuum_cost_delay=... for example) on the delay. Without the patch one could just guess, as many other factors could impact the vacuum duration (load on the system, I/O latency, ...). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com