Thread: Track the amount of time waiting due to cost_delay
Hi hackers, During the last pgconf.dev I attended Robert’s presentation about autovacuum [1] and it reminded me of an idea I had some time ago: $SUBJECT Please find attached a patch doing so by adding a new field (aka "time_delayed") to the pg_stat_progress_vacuum view. Currently one can change [autovacuum_]vacuum_cost_delay and [autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of the changes on the vacuum duration: one could observe the vacuum duration variation but the correlation to the changes is not accurate, as many other factors could impact the vacuum duration (load on the system, I/O latency, ...). This new field reports the time that the vacuum has to sleep due to cost delay: it could be useful to 1) measure the impact of the current cost_delay and cost_limit settings and 2) experiment with new values (and then help with decision making for those parameters). The patch is relatively small thanks to the work that has been done in f1889729dd (to allow parallel workers to report to the leader). [1]: https://www.pgevents.ca/events/pgconfdev2024/schedule/session/29-how-autovacuum-goes-wrong-and-can-we-please-make-it-stop-doing-that/ Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote: > During the last pgconf.dev I attended Robert´s presentation about autovacuum and > it made me remember of an idea I had some time ago: $SUBJECT This sounds like useful information to me. I wonder if we should also surface the effective cost limit for each autovacuum worker. > Currently one can change [autovacuum_]vacuum_cost_delay and > [auto vacuum]vacuum_cost_limit but has no reliable way to measure the impact of > the changes on the vacuum duration: one could observe the vacuum duration > variation but the correlation to the changes is not accurate (as many others > factors could impact the vacuum duration (load on the system, i/o latency,...)). IIUC you'd need to get information from both pg_stat_progress_vacuum and pg_stat_activity in order to know what percentage of time was being spent in cost delay. Is that how you'd expect for this to be used in practice? > pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); > pg_usleep(msec * 1000); > pgstat_report_wait_end(); > + /* Report the amount of time we slept */ > + if (VacuumSharedCostBalance != NULL) > + pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); > + else > + pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); Hm. Should we measure the actual time spent sleeping, or is a rough estimate good enough? I believe pg_usleep() might return early (e.g., if the process is signaled) or late, so this field could end up being inaccurate, although probably not by much. If we're okay with millisecond granularity, my first instinct is that what you've proposed is fine, but I figured I'd bring it up anyway. -- nathan
Hi, On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote: > On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote: > > During the last pgconf.dev I attended Robert´s presentation about autovacuum and > > it made me remember of an idea I had some time ago: $SUBJECT > > This sounds like useful information to me. Thanks for looking at it! > I wonder if we should also > surface the effective cost limit for each autovacuum worker. I'm not sure about it as I think that it could be misleading: one could query pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is due to _this_ cost_limit. But that's not necessarily true as the cost_limit could have changed multiple times since the vacuum started. So, unless there is frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed and the cost_limit could be misleading IMHO. > > Currently one can change [autovacuum_]vacuum_cost_delay and > > [auto vacuum]vacuum_cost_limit but has no reliable way to measure the impact of > > the changes on the vacuum duration: one could observe the vacuum duration > > variation but the correlation to the changes is not accurate (as many others > > factors could impact the vacuum duration (load on the system, i/o latency,...)). > > IIUC you'd need to get information from both pg_stat_progress_vacuum and > pg_stat_activity in order to know what percentage of time was being spent > in cost delay. Is that how you'd expect for this to be used in practice? Yeah, one could use a query such as: select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid) for example. Worth providing an example somewhere in the docs? > > pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); > > pg_usleep(msec * 1000); > > pgstat_report_wait_end(); > > + /* Report the amount of time we slept */ > > + if (VacuumSharedCostBalance != NULL) > > + pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); > > + else > > + pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); > > Hm. Should we measure the actual time spent sleeping, or is a rough > estimate good enough? I believe pg_usleep() might return early (e.g., if > the process is signaled) or late, so this field could end up being > inaccurate, although probably not by much. If we're okay with millisecond > granularity, my first instinct is that what you've proposed is fine, but I > figured I'd bring it up anyway. Thanks for bringing that up! I had the same thought when writing the code and came to the same conclusion. I think that's a good enough estimation, especially during a long-running vacuum (which is probably the case where users care the most). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote: > On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote: >> I wonder if we should also >> surface the effective cost limit for each autovacuum worker. > > I'm not sure about it as I think that it could be misleading: one could query > pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is > due to _this_ cost_limit. But that's not necessary true as the cost_limit could > have changed multiple times since the vacuum started. So, unless there is > frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed and > the cost_limit could be misleadind IMHO. Well, that's true for the delay, too, right (at least as of commit 7d71d3d)? -- nathan
>> This sounds like useful information to me. > Thanks for looking at it! The VacuumDelay wait event is the only visibility currently available to gauge the cost_delay. Having this information advertised by pg_stat_progress_vacuum as is being proposed is much better. However, I also think that the "number of times" the vacuum went into delay will be needed as well. Both values will be useful to tune cost_delay and cost_limit. It may also make sense to accumulate the total_time in delay and the number of times delayed in a cumulative statistics [0] view to allow a user to trend this information over time. I don't think this info fits in any of the existing views, i.e. pg_stat_database, so maybe a new view for cumulative vacuum stats may be needed. This is likely a separate discussion, but calling it out here. >> IIUC you'd need to get information from both pg_stat_progress_vacuum and >> pg_stat_activity in order to know what percentage of time was being spent >> in cost delay. Is that how you'd expect for this to be used in practice? > Yeah, one could use a query such as: > select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid) Maybe all progress views should just expose the "beentry->st_activity_start_timestamp" to let the user know when the current operation began. Regards, Sami Imseih Amazon Web Services (AWS) [0] https://www.postgresql.org/docs/current/monitoring-stats.html
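As a rough sketch of what Sami's "number of times" suggestion could look like on top of the v1 hunk quoted earlier in the thread, a per-sleep counter could be incremented next to the time report. Note that PROGRESS_VACUUM_NUM_DELAYS is an invented column index used for illustration only; it is not part of the posted patch.

    /* Report the amount of time we slept and (hypothetically) how many times */
    if (VacuumSharedCostBalance != NULL)
    {
        pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
        pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_NUM_DELAYS, 1);
    }
    else
    {
        pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
        pgstat_progress_incr_param(PROGRESS_VACUUM_NUM_DELAYS, 1);
    }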
On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > Hm. Should we measure the actual time spent sleeping, or is a rough > estimate good enough? I believe pg_usleep() might return early (e.g., if > the process is signaled) or late, so this field could end up being > inaccurate, although probably not by much. If we're okay with millisecond > granularity, my first instinct is that what you've proposed is fine, but I > figured I'd bring it up anyway. I bet you could also sleep for longer than planned, throwing the numbers off in the other direction. I'm always suspicious of this sort of thing. I tend to find nothing gives me the right answer unless I assume that the actual sleep times are randomly and systematically different from the intended sleep times by arbitrarily large amounts. I think we should at least do some testing: if we measure both the intended sleep time and the actual sleep time, how close are they? Does it change if the system is under crushing load (which might elongate sleeps) or if we spam SIGUSR1 against the vacuum process (which might shorten them)? -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jun 10, 2024 at 02:20:16PM -0500, Nathan Bossart wrote: > On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote: > > On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote: > >> I wonder if we should also > >> surface the effective cost limit for each autovacuum worker. > > > > I'm not sure about it as I think that it could be misleading: one could query > > pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is > > due to _this_ cost_limit. But that's not necessary true as the cost_limit could > > have changed multiple times since the vacuum started. So, unless there is > > frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed and > > the cost_limit could be misleadind IMHO. > > Well, that's true for the delay, too, right (at least as of commit > 7d71d3d)? Yeah right, but the patch exposes the total amount of time the vacuum has been delayed (not the cost_delay per se), which does not sound misleading to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Jun 10, 2024 at 08:12:46PM +0000, Imseih (AWS), Sami wrote: > >> This sounds like useful information to me. > > > Thanks for looking at it! > > The VacuumDelay is the only visibility available to > gauge the cost_delay. Having this information > advertised by pg_stat_progress_vacuum as is being proposed > is much better. Thanks for looking at it! > However, I also think that the > "number of times" the vacuum went into delay will be needed > as well. Both values will be useful to tune cost_delay and cost_limit. Yeah, I think that's a good idea. With v1 one could figure out how many times the delay has been triggered but that does not work anymore if: 1) cost_delay changed during the vacuum duration or 2) the patch changes the way time_delayed is measured/reported (meaning: report the actual wait time and not the theoretical time as v1 does). > > It may also make sense to accumulate the total_time in delay > and the number of times delayed in a cumulative statistics [0] > view to allow a user to trend this information overtime. > I don't think this info fits in any of the existing views, i.e. > pg_stat_database, so maybe a new view for cumulative > vacuum stats may be needed. This is likely a separate > discussion, but calling it out here. +1 > >> IIUC you'd need to get information from both pg_stat_progress_vacuum and > >> pg_stat_activity in order to know what percentage of time was being spent > >> in cost delay. Is that how you'd expect for this to be used in practice? > > > Yeah, one could use a query such as: > > > select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid) > > Maybe all progress views should just expose the "beentry->st_activity_start_timestamp " > to let the user know when the current operation began. Yeah, maybe. I think this is likely a separate discussion too; thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Jun 10, 2024 at 3:05 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi hackers, > > During the last pgconf.dev I attended Robert’s presentation about autovacuum and > it made me remember of an idea I had some time ago: $SUBJECT > > Please find attached a patch doing so by adding a new field (aka "time_delayed") > to the pg_stat_progress_vacuum view. > > Currently one can change [autovacuum_]vacuum_cost_delay and > [auto vacuum]vacuum_cost_limit but has no reliable way to measure the impact of > the changes on the vacuum duration: one could observe the vacuum duration > variation but the correlation to the changes is not accurate (as many others > factors could impact the vacuum duration (load on the system, i/o latency,...)). > > This new field reports the time that the vacuum has to sleep due to cost delay: > it could be useful to 1) measure the impact of the current cost_delay and > cost_limit settings and 2) when experimenting new values (and then help for > decision making for those parameters). > > The patch is relatively small thanks to the work that has been done in > f1889729dd (to allow parallel worker to report to the leader). Thank you for the proposal and the patch. I understand the motivation of this patch. Besides the point Nathan mentioned, I'm slightly worried that a massive number of parallel worker messages could be sent to the leader process when the cost_limit value is low. FWIW when I want to confirm the vacuum delay effect, I often use the information from the DEBUG2 log message in the VacuumUpdateCosts() function. Exposing these data (per-worker dobalance, cost_limit, cost_delay, active, and failsafe) somewhere in a view might also be helpful for users for checking vacuum delay effects. It wouldn't measure the impact of the changes on the vacuum duration, though. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote: > On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart > <nathandbossart@gmail.com> wrote: > > Hm. Should we measure the actual time spent sleeping, or is a rough > > estimate good enough? I believe pg_usleep() might return early (e.g., if > > the process is signaled) or late, so this field could end up being > > inaccurate, although probably not by much. If we're okay with millisecond > > granularity, my first instinct is that what you've proposed is fine, but I > > figured I'd bring it up anyway. > > I bet you could also sleep for longer than planned, throwing the > numbers off in the other direction. Thanks for looking at it! Agree, that's how I read "or late" from Nathan's comment above. > I'm always suspicious of this sort of thing. I tend to find nothing > gives me the right answer unless I assume that the actual sleep times > are randomly and systematically different from the intended sleep > times but arbitrarily large amounts. I think we should at least do > some testing: if we measure both the intended sleep time and the > actual sleep time, how close are they? Does it change if the system is > under crushing load (which might elongate sleeps) or if we spam > SIGUSR1 against the vacuum process (which might shorten them)? OTOH Sami proposed in [1] to count the number of times the vacuum went into delay. I think that's a good idea. His idea makes me think that, in addition to the number of waits, it would make sense to measure the "actual" sleep time (and not the intended one), so that one could measure the difference between the intended wait time (number of waits * cost_delay, if it does not change during the vacuum duration) and the actual measured wait time. So I think that in v2 we could: 1) measure the actual wait time instead, 2) count the number of times the vacuum slept. We could also 3) report the effective cost limit (as proposed by Nathan up-thread) (I think that 3) could be misleading but I'll yield to majority opinion if people think it's not). Thoughts? [1]: https://www.postgresql.org/message-id/A0935130-7C4B-4094-B6E4-C7D5086D50EF%40amazon.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
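To make the "actual wait time" idea concrete, here is a minimal sketch of what the measurement around pg_usleep() could look like, reusing the progress-report calls from the v1 hunk quoted up-thread. This is an illustration only, not the actual v2 patch: the helper name is invented, and PROGRESS_VACUUM_TIME_DELAYED is the column index proposed by the patch (not yet in core).

    #include "postgres.h"

    #include "commands/progress.h"
    #include "commands/vacuum.h"
    #include "portability/instr_time.h"
    #include "utils/backend_progress.h"
    #include "utils/wait_event.h"

    /* Hypothetical helper: sleep for msec and report the measured sleep time */
    static void
    vacuum_sleep_and_report(double msec)
    {
        instr_time  start_time,
                    elapsed;

        INSTR_TIME_SET_CURRENT(start_time);
        pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
        pg_usleep(msec * 1000);
        pgstat_report_wait_end();
        INSTR_TIME_SET_CURRENT(elapsed);
        INSTR_TIME_SUBTRACT(elapsed, start_time);

        /* Report what was actually slept, not what was requested */
        if (VacuumSharedCostBalance != NULL)
            pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                                (int64) INSTR_TIME_GET_MILLISEC(elapsed));
        else
            pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                       (int64) INSTR_TIME_GET_MILLISEC(elapsed));
    }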
Hi, On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote: > Thank you for the proposal and the patch. I understand the motivation > of this patch. Thanks for looking at it! > Beside the point Nathan mentioned, I'm slightly worried > that massive parallel messages could be sent to the leader process > when the cost_limit value is low. I see, I can/will do some testing in this area and share the numbers. > > FWIW when I want to confirm the vacuum delay effect, I often use the > information from the DEBUG2 log message in VacuumUpdateCosts() > function. Exposing these data (per-worker dobalance, cost_lmit, > cost_delay, active, and failsafe) somewhere in a view might also be > helpful for users for checking vacuum delay effects. Do you mean add time_delayed in pg_stat_progress_vacuum and cost_limit + the other data you mentioned above in another dedicated view? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote: > I'm always suspicious of this sort of thing. I tend to find nothing > gives me the right answer unless I assume that the actual sleep times > are randomly and systematically different from the intended sleep > times but arbitrarily large amounts. I think we should at least do > some testing: if we measure both the intended sleep time and the > actual sleep time, how close are they? Does it change if the system is > under crushing load (which might elongate sleeps) or if we spam > SIGUSR1 against the vacuum process (which might shorten them)? Though I (now) think that it would make sense to record the actual delay time instead (see [1]), I think it's interesting to do some testing as you suggested. With record_actual_time.txt (attached) applied on top of v1, we can see the intended and actual wait time. On my system, "no load at all" except the vacuum running, I see no diff: Tue Jun 11 09:22:06 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 41107 | 41107 | 00:00:42.301851 (1 row) Tue Jun 11 09:22:07 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 42076 | 42076 | 00:00:43.301848 (1 row) Tue Jun 11 09:22:08 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 43045 | 43045 | 00:00:44.301854 (1 row) But if I launch pg_reload_conf() 10 times in a row, I can see: Tue Jun 11 09:22:09 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 44064 | 44034 | 00:00:45.302965 (1 row) Tue Jun 11 09:22:10 2024 (every 1s) pid | relid | phase | time_delayed | actual_time_delayed | duration -------+-------+---------------+--------------+---------------------+----------------- 54754 | 16385 | scanning heap | 45033 | 45003 | 00:00:46.301858 As we can see the actual wait time is 30ms less than the intended wait time with this simple test. So I still think we should go with 1) actual wait time and 2) report the number of waits (as mentioned in [1]). Does that make sense to you? [1]: https://www.postgresql.org/message-id/Zmf712A5xcOM9Hlg%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote: > So I think that in v2 we could: 1) measure the actual wait time instead, 2) > count the number of times the vacuum slept. We could also 3) reports the > effective cost limit (as proposed by Nathan up-thread) (I think that 3) could > be misleading but I'll yield to majority opinion if people think it's not). I still think the effective cost limit would be useful, if for no other reason than to help reinforce that it is distributed among the autovacuum workers. We could document that this value may change over the lifetime of a worker to help avoid misleading folks. -- nathan
On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > As we can see the actual wait time is 30ms less than the intended wait time with > this simple test. So I still think we should go with 1) actual wait time and 2) > report the number of waits (as mentioned in [1]). Does that make sense to you? I like the idea of reporting the actual wait time better, provided that we verify that doing so isn't too expensive. I think it probably isn't, because in a long-running VACUUM there is likely to be disk I/O, so the CPU overhead of a few extra gettimeofday() calls should be fairly low by comparison. I wonder if there's a noticeable hit when everything is in-memory. I guess probably not, because with any sort of normal configuration, we shouldn't be delaying after every block we process, so the cost of those gettimeofday() calls should still be getting spread across quite a bit of real work. That said, I'm not sure this experiment shows a real problem with the idea of showing intended wait time. It does establish the concept that repeated signals can throw our numbers off, but 30ms isn't much of a discrepancy. I'm worried about being off by a factor of two, or an order of magnitude. I think we still don't know if that can happen, but if we're going to show actual wait time anyway, then we don't need to explore the problems with other hypothetical systems too much. I'm not convinced that reporting the number of waits is useful. If we were going to report a possibly-inaccurate amount of actual waiting, then also reporting the number of waits might make it easier to figure out when the possibly-inaccurate number was in fact inaccurate. But I think it's way better to report an accurate amount of actual waiting, and then I'm not sure what we gain by also reporting the number of waits. -- Robert Haas EDB: http://www.enterprisedb.com
On 6/11/24 13:13, Robert Haas wrote: > On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: >> As we can see the actual wait time is 30ms less than the intended wait time with >> this simple test. So I still think we should go with 1) actual wait time and 2) >> report the number of waits (as mentioned in [1]). Does that make sense to you? > > I like the idea of reporting the actual wait time better, provided > that we verify that doing so isn't too expensive. I think it probably > isn't, because in a long-running VACUUM there is likely to be disk > I/O, so the CPU overhead of a few extra gettimeofday() calls should be > fairly low by comparison. I wonder if there's a noticeable hit when > everything is in-memory. I guess probably not, because with any sort > of normal configuration, we shouldn't be delaying after every block we > process, so the cost of those gettimeofday() calls should still be > getting spread across quite a bit of real work. Does it even require a call to gettimeofday()? The code in vacuum calculates an msec value and calls pg_usleep(msec * 1000). I don't think it is necessary to measure how long that nap was. Regards, Jan
> I'm not convinced that reporting the number of waits is useful. If we > were going to report a possibly-inaccurate amount of actual waiting, > then also reporting the number of waits might make it easier to figure > out when the possibly-inaccurate number was in fact inaccurate. But I > think it's way better to report an accurate amount of actual waiting, > and then I'm not sure what we gain by also reporting the number of > waits. I think including the number of times vacuum went into sleep will help paint a full picture of the effect of tuning the vacuum_cost_delay and vacuum_cost_limit for the user, even if we are reporting accurate amounts of actual sleeping. This is particularly true for autovacuum in which the cost limit is spread across all autovacuum workers, and knowing how many times autovacuum went to sleep will be useful along with the total time spent sleeping. Regards, Sami
On Tue, Jun 11, 2024 at 06:19:23PM +0000, Imseih (AWS), Sami wrote: >> I'm not convinced that reporting the number of waits is useful. If we >> were going to report a possibly-inaccurate amount of actual waiting, >> then also reporting the number of waits might make it easier to figure >> out when the possibly-inaccurate number was in fact inaccurate. But I >> think it's way better to report an accurate amount of actual waiting, >> and then I'm not sure what we gain by also reporting the number of >> waits. > > I think including the number of times vacuum went into sleep > will help paint a full picture of the effect of tuning the vacuum_cost_delay > and vacuum_cost_limit for the user, even if we are reporting accurate > amounts of actual sleeping. > > This is particularly true for autovacuum in which the cost limit is spread > across all autovacuum workers, and knowing how many times autovacuum > went to sleep will be useful along with the total time spent sleeping. I'm struggling to think of a scenario in which the number of waits would be useful, assuming you already know the amount of time spent waiting. Even if the number of waits is huge, it doesn't tell you much else AFAICT. I'd be much more likely to adjust the cost settings based on the percentage of time spent sleeping. -- nathan
On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > I'm struggling to think of a scenario in which the number of waits would be > useful, assuming you already know the amount of time spent waiting. Even > if the number of waits is huge, it doesn't tell you much else AFAICT. I'd > be much more likely to adjust the cost settings based on the percentage of > time spent sleeping. This is also how I see it. -- Robert Haas EDB: http://www.enterprisedb.com
>> I'm struggling to think of a scenario in which the number of waits would be >> useful, assuming you already know the amount of time spent waiting. Even >> if the number of waits is huge, it doesn't tell you much else AFAICT. I'd >> be much more likely to adjust the cost settings based on the percentage of >> time spent sleeping. > This is also how I see it. I think it may be useful for a user to be able to determine the "average sleep time" for a vacuum, especially because the vacuum cost limit and delay can be adjusted on the fly for a running vacuum. If we only show the total sleep time, the user could make wrong assumptions about how long each sleep took and might assume that all sleep delays for a particular vacuum run have been uniform in duration, when in fact they may not have been. Regards, Sami
Hi, On Tue, Jun 11, 2024 at 11:40:36AM -0500, Nathan Bossart wrote: > On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote: > > So I think that in v2 we could: 1) measure the actual wait time instead, 2) > > count the number of times the vacuum slept. We could also 3) reports the > > effective cost limit (as proposed by Nathan up-thread) (I think that 3) could > > be misleading but I'll yield to majority opinion if people think it's not). > > I still think the effective cost limit would be useful, if for no other > reason than to help reinforce that it is distributed among the autovacuum > workers. I also think it can be useful; my concern is more about putting this information in pg_stat_progress_vacuum. What about Sawada-san's proposal in [1]? (We could create a new view that would contain those data: per-worker dobalance, cost_limit, cost_delay, active, and failsafe.) [1]: https://www.postgresql.org/message-id/CAD21AoDOu%3DDZcC%2BPemYmCNGSwbgL1s-5OZkZ1Spd5pSxofWNCw%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Tue, Jun 11, 2024 at 02:48:30PM -0400, Robert Haas wrote: > On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > I'm struggling to think of a scenario in which the number of waits would be > > useful, assuming you already know the amount of time spent waiting. If we provide the actual time spent waiting, providing the number of waits would allow one to see whether there is a difference between the actual time and the intended time (i.e., number of waits * cost_delay, assuming the cost_delay stays the same for the whole vacuum duration). That should trigger some thoughts if the difference is large enough. I think that what we are doing here is to somehow add instrumentation around the "WAIT_EVENT_VACUUM_DELAY" wait event. If we were to add instrumentation for wait events (generally speaking) we'd probably also expose the number of waits per wait event (in addition to the time waited). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Tue, Jun 11, 2024 at 01:13:48PM -0400, Robert Haas wrote: > On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > As we can see the actual wait time is 30ms less than the intended wait time with > > this simple test. So I still think we should go with 1) actual wait time and 2) > > report the number of waits (as mentioned in [1]). Does that make sense to you? > > I like the idea of reporting the actual wait time better, +1 > provided > that we verify that doing so isn't too expensive. I think it probably > isn't, because in a long-running VACUUM there is likely to be disk > I/O, so the CPU overhead of a few extra gettimeofday() calls should be > fairly low by comparison. Agree. > I wonder if there's a noticeable hit when > everything is in-memory. I guess probably not, because with any sort > of normal configuration, we shouldn't be delaying after every block we > process, so the cost of those gettimeofday() calls should still be > getting spread across quite a bit of real work. I did some testing, with: shared_buffers = 12GB vacuum_cost_delay = 1 autovacuum_vacuum_cost_delay = 1 max_parallel_maintenance_workers = 0 max_parallel_workers = 0 added to a default config file. A table and all its indexes were fully in memory, the numbers are: postgres=# SELECT n.nspname, c.relname, count(*) AS buffers FROM pg_buffercache b JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database())) JOIN pg_namespace n ON n.oid = c.relnamespace GROUP BY n.nspname, c.relname ORDER BY 3 DESC LIMIT 11; nspname | relname | buffers ---------+-------------------+--------- public | large_tbl | 222280 public | large_tbl_pkey | 5486 public | large_tbl_filler7 | 1859 public | large_tbl_filler4 | 1859 public | large_tbl_filler1 | 1859 public | large_tbl_filler6 | 1859 public | large_tbl_filler3 | 1859 public | large_tbl_filler2 | 1859 public | large_tbl_filler5 | 1859 public | large_tbl_filler8 | 1859 public | large_tbl_version | 1576 (11 rows) The observed timings when vacuuming this table are: On master: vacuum phase: cumulative duration --------------------------------- scanning heap: 00:00:37.808184 vacuuming indexes: 00:00:41.808176 vacuuming heap: 00:00:54.808156 On master patched with actual time delayed: vacuum phase: cumulative duration --------------------------------- scanning heap: 00:00:36.502104 (time_delayed: 22202) vacuuming indexes: 00:00:41.002103 (time_delayed: 23769) vacuuming heap: 00:00:54.302096 (time_delayed: 34886) As we can see there is no noticeable degradation while the vacuum entered about 34886 times in this instrumentation code path (cost_delay was set to 1). > That said, I'm not sure this experiment shows a real problem with the > idea of showing intended wait time. It does establish the concept that > repeated signals can throw our numbers off, but 30ms isn't much of a > discrepancy. Yeah, the idea was just to show how easy it is to create a 30ms discrepancy. > I'm worried about being off by a factor of two, or an > order of magnitude. I think we still don't know if that can happen, > but if we're going to show actual wait time anyway, then we don't need > to explore the problems with other hypothetical systems too much. Agree. > I'm not convinced that reporting the number of waits is useful. 
If we > were going to report a possibly-inaccurate amount of actual waiting, > then also reporting the number of waits might make it easier to figure > out when the possibly-inaccurate number was in fact inaccurate. But I > think it's way better to report an accurate amount of actual waiting, > and then I'm not sure what we gain by also reporting the number of > waits. Sami shared his thoughts in [1] and [2] and so did I in [3]. If some of us still don't think that reporting the number of waits is useful then we can probably start without it. [1]: https://www.postgresql.org/message-id/0EA474B6-BF88-49AE-82CA-C1A9A3C17727%40amazon.com [2]: https://www.postgresql.org/message-id/E12435E2-5FCA-49B0-9ADB-0E7153F95E2D%40amazon.com [3]: https://www.postgresql.org/message-id/ZmmGG4e%2BqTBD2kfn%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Tue, Jun 11, 2024 at 08:26:23AM +0000, Bertrand Drouvot wrote: > Hi, > > On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote: > > > Thank you for the proposal and the patch. I understand the motivation > > of this patch. > > Thanks for looking at it! > > > Beside the point Nathan mentioned, I'm slightly worried > > that massive parallel messages could be sent to the leader process > > when the cost_limit value is low. > > I see, I can/will do some testing in this area and share the numbers. Here is the result of the test. It has been launched several times and it produced the same (surprising result) each time. ====================== Context ================================================ The testing has been done with this relation (large_tbl) and its indexes: postgres=# SELECT n.nspname, c.relname, count(*) AS buffers FROM pg_buffercache b JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database())) JOIN pg_namespace n ON n.oid = c.relnamespace GROUP BY n.nspname, c.relname ORDER BY 3 DESC LIMIT 22; nspname | relname | buffers ---------+--------------------+--------- public | large_tbl | 222280 public | large_tbl_filler13 | 125000 public | large_tbl_filler6 | 125000 public | large_tbl_filler5 | 125000 public | large_tbl_filler3 | 125000 public | large_tbl_filler15 | 125000 public | large_tbl_filler4 | 125000 public | large_tbl_filler20 | 125000 public | large_tbl_filler18 | 125000 public | large_tbl_filler14 | 125000 public | large_tbl_filler8 | 125000 public | large_tbl_filler11 | 125000 public | large_tbl_filler19 | 125000 public | large_tbl_filler7 | 125000 public | large_tbl_filler1 | 125000 public | large_tbl_filler12 | 125000 public | large_tbl_filler9 | 125000 public | large_tbl_filler17 | 125000 public | large_tbl_filler16 | 125000 public | large_tbl_filler10 | 125000 public | large_tbl_filler2 | 125000 public | large_tbl_pkey | 5486 (22 rows) All of them completly fit in memory (to avoid I/O read latency during the vacuum). The config, outside of default is: max_wal_size = 4GB shared_buffers = 30GB vacuum_cost_delay = 1 autovacuum_vacuum_cost_delay = 1 max_parallel_maintenance_workers = 8 max_parallel_workers = 10 vacuum_cost_limit = 10 autovacuum_vacuum_cost_limit = 10 My system is not overloaded, has enough resources to run this test and only this test is running. 
====================== Results ================================================ ========== With v2 (attached) applied on master postgres=# VACUUM (PARALLEL 8) large_tbl; VACUUM Time: 1146873.016 ms (19:06.873) The duration is splitted that way: Vacuum phase: cumulative time (cumulative time delayed) ======================================================= scanning heap: 00:08:16.414628 (time_delayed: 444370) vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293) vacuuming heap: 00:19:06.814617 (time_delayed: 2767540) I sampled active sessions from pg_stat_activity (one second interval), here is the summary during the vacuuming indexes phase (ordered by count): leader_pid | pid | wait_event | count ------------+--------+----------------+------- 452996 | 453225 | VacuumDelay | 366 452996 | 453223 | VacuumDelay | 363 452996 | 453226 | VacuumDelay | 362 452996 | 453224 | VacuumDelay | 361 452996 | 453222 | VacuumDelay | 359 452996 | 453221 | VacuumDelay | 359 | 452996 | VacuumDelay | 331 | 452996 | CPU | 30 452996 | 453224 | WALWriteLock | 23 452996 | 453222 | WALWriteLock | 20 452996 | 453226 | WALWriteLock | 20 452996 | 453221 | WALWriteLock | 19 | 452996 | WalSync | 18 452996 | 453225 | WALWriteLock | 18 452996 | 453223 | WALWriteLock | 16 | 452996 | WALWriteLock | 15 452996 | 453221 | CPU | 14 452996 | 453222 | CPU | 14 452996 | 453223 | CPU | 12 452996 | 453224 | CPU | 10 452996 | 453226 | CPU | 10 452996 | 453225 | CPU | 8 452996 | 453223 | WalSync | 4 452996 | 453221 | WalSync | 2 452996 | 453226 | WalWrite | 2 452996 | 453221 | WalWrite | 1 | 452996 | ParallelFinish | 1 452996 | 453224 | WalSync | 1 452996 | 453225 | WalSync | 1 452996 | 453222 | WalWrite | 1 452996 | 453225 | WalWrite | 1 452996 | 453222 | WalSync | 1 452996 | 453226 | WalSync | 1 ========== On master (v2 not applied) postgres=# VACUUM (PARALLEL 8) large_tbl; VACUUM Time: 1322598.087 ms (22:02.598) Surprisingly it has been longer on master by about 3 minutes. 
Let's see how the time is splitted: Vacuum phase: cumulative time ============================= scanning heap: 00:08:07.061196 vacuuming indexes: 00:17:50.961228 vacuuming heap: 00:22:02.561199 I sampled active sessions from pg_stat_activity (one second interval), here is the summary during the vacuuming indexes phase (ordered by count): leader_pid | pid | wait_event | count ------------+--------+-------------------+------- 468682 | 468858 | VacuumDelay | 548 468682 | 468862 | VacuumDelay | 547 468682 | 468859 | VacuumDelay | 547 468682 | 468860 | VacuumDelay | 545 468682 | 468857 | VacuumDelay | 543 468682 | 468861 | VacuumDelay | 542 | 468682 | VacuumDelay | 378 | 468682 | ParallelFinish | 182 468682 | 468861 | WALWriteLock | 19 468682 | 468857 | WALWriteLock | 19 468682 | 468859 | WALWriteLock | 18 468682 | 468858 | WALWriteLock | 16 468682 | 468860 | WALWriteLock | 15 468682 | 468862 | WALWriteLock | 15 468682 | 468862 | CPU | 12 468682 | 468857 | CPU | 10 468682 | 468859 | CPU | 10 468682 | 468861 | CPU | 10 | 468682 | CPU | 9 468682 | 468860 | CPU | 9 468682 | 468860 | WalSync | 8 | 468682 | WALWriteLock | 7 468682 | 468858 | WalSync | 6 468682 | 468858 | CPU | 6 468682 | 468862 | WalSync | 3 468682 | 468857 | WalSync | 3 468682 | 468861 | WalSync | 3 468682 | 468859 | WalSync | 2 468682 | 468861 | WalWrite | 2 468682 | 468857 | WalWrite | 1 468682 | 468858 | WalWrite | 1 468682 | 468861 | WALBufMappingLock | 1 468682 | 468857 | WALBufMappingLock | 1 | 468682 | WALBufMappingLock | 1 ====================== Observations =========================================== As compare to v2: 1. scanning heap time is about the same 2. vacuuming indexes time is about 3 minutes longer on master 3. vacuuming heap time is about the same One difference we can see in the sampling, is that on master the "ParallelFinish" has been sampled about 182 times (so could be _the_ 3 minutes of interest) for the leader. On master the vacuum indexes phase has been running between 2024-06-13 10:11:34 and 2024-06-13 10:21:15. If I extract the exact minutes and the counts for the "ParallelFinish" wait event I get: minute | wait_event | count --------+----------------+------- 18 | ParallelFinish | 48 19 | ParallelFinish | 60 20 | ParallelFinish | 60 21 | ParallelFinish | 14 So it's likely that the leader waited on ParallelFinish during about 3 minutes at the end of the vacuuming indexes phase (as this wait appeared during consecutives samples). ====================== Conclusion ============================================= 1. During the scanning heap and vacuuming heap phases no noticeable performance degradation has been observed with v2 applied (as compare to master) (cc'ing Robert as it's also related to his question about noticeable hit when everything is in-memory in [1]). 2. During the vacuuming indexes phase, v2 has been faster (as compare to master). The reason is that on master the leader has been waiting during about 3 minutes on "ParallelFinish" at the end. ====================== Remarks ================================================ As v2 is attached, please find below a summary about the current state of this thread: 1. v2 implements delay_time as the actual wait time (and not the intended wait time as proposed in v1). 2. some measurements have been done to check the impact of this new instrumentation (see this email and [2]): no noticeable performance degradation has been observed (and surprisingly that's the opposite as mentioned above). 3. 
there is an ongoing discussion about exposing the number of waits [2]. 4. there is an ongoing discussion about exposing the effective cost limit [3]. 5. it could be interesting to have a closer look at why the leader waits for about 3 minutes on "ParallelFinish" on master and not with v2 applied (but that's probably out of scope for this thread). [1]: https://www.postgresql.org/message-id/CA%2BTgmoZiC%3DzeCDYuMpB%2BGb2yK%3DrTQCGMu0VoxehocKyHxr9Erg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/ZmmOOPwMFIltkdsN%40ip-10-97-1-34.eu-west-3.compute.internal [3]: https://www.postgresql.org/message-id/Zml9%2Bu37iS7DFkJL%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Thu, Jun 13, 2024 at 11:56:26AM +0000, Bertrand Drouvot wrote: > ====================== Observations =========================================== > > As compare to v2: > > 1. scanning heap time is about the same > 2. vacuuming indexes time is about 3 minutes longer on master > 3. vacuuming heap time is about the same I had a closer look to understand why the vacuuming indexes time has been about 3 minutes longer on master. During the vacuuming indexes phase, the leader is helping vacuum the indexes until it reaches WaitForParallelWorkersToFinish() (meaning that once all the remaining indexes are being handled by the parallel workers, the leader has nothing more to do, so it waits for the parallel workers to finish). While the leader process is involved in index vacuuming, it is also subject to waits due to cost delay. But with v2 applied, the leader may be interrupted by the parallel workers while it is waiting (due to the new pgstat_progress_parallel_incr_param() calls that the patch is adding). So, with v2 applied, the leader waits less (as it is interrupted while waiting) when involved in index vacuuming, and that's why v2 is "faster" than master. To put some numbers on it, I counted the number of times the leader's pg_usleep() has been interrupted (by counting the number of times the nanosleep() did return a value < 0 in pg_usleep()). Here they are: v2: the leader has been interrupted about 342605 times master: the leader has been interrupted about 36 times The ones on master are mainly coming from the pgstat_progress_parallel_incr_param() calls in parallel_vacuum_process_one_index(). The additional ones on v2 are coming from the pgstat_progress_parallel_incr_param() calls added in vacuum_delay_point(). ======== Conclusion ====== 1. vacuuming indexes time has been longer on master because with v2, the leader has been interrupted 342605 times while waiting, making v2 "faster". 2. the leader being interrupted while waiting is also already happening on master due to the pgstat_progress_parallel_incr_param() calls in parallel_vacuum_process_one_index() (that have been added in 46ebdfe164). It has been the case "only" 36 times during my test case. I think that 2. is less of a concern but I think that 1. is something that needs to be addressed, because the leader process is not honouring its cost delay wait time in a noticeable way (at least during my test case). I have not come up with a proposal yet; I'm just sharing my investigation as to why v2 has been faster than master during the vacuuming indexes phase. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
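For context on why an interrupt shortens the leader's nap: on non-Windows platforms pg_usleep() boils down to roughly the following (a simplified sketch, not a verbatim copy of src/port/pgsleep.c), and the nanosleep() call simply returns early on a signal without being restarted:

    #include <time.h>

    /* Simplified sketch of the non-Windows path of pg_usleep() */
    static void
    pg_usleep_sketch(long microsec)
    {
        if (microsec > 0)
        {
            struct timespec delay;

            delay.tv_sec = microsec / 1000000L;
            delay.tv_nsec = (microsec % 1000000L) * 1000;
            (void) nanosleep(&delay, NULL);     /* not retried if interrupted */
        }
    }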
Hi, On Sat, Jun 22, 2024 at 12:48:33PM +0000, Bertrand Drouvot wrote: > 1. vacuuming indexes time has been longer on master because with v2, the leader > has been interrupted 342605 times while waiting, then making v2 "faster". > > 2. the leader being interrupted while waiting is also already happening on master > due to the pgstat_progress_parallel_incr_param() calls in > parallel_vacuum_process_one_index() (that have been added in > 46ebdfe164). It has been the case "only" 36 times during my test case. > > I think that 2. is less of a concern but I think that 1. is something that needs > to be addressed because the leader process is not honouring its cost delay wait > time in a noticeable way (at least during my test case). > > I did not think of a proposal yet, just sharing my investigation as to why > v2 has been faster than master during the vacuuming indexes phase. I think that a reasonable approach is to make the reporting from the parallel workers to the leader less aggressive (meaning: occur less frequently). Please find attached v3, that: - ensures that there is at least 1 second between 2 reports, per parallel worker, to the leader. - ensures that the reported delayed time is still correct (keeps track of the delayed time between 2 reports). - does not add any extra pg_clock_gettime_ns() calls (as compared to v2). Remarks: 1. Having a time based only approach to throttle the reporting of the parallel workers sounds reasonable. I don't think that the number of parallel workers has to come into play as: 1.1) the more parallel workers are used, the smaller the impact of the leader on the vacuum index phase duration/workload (because the vacuum work is spread across more processes). 1.2) the fewer parallel workers there are, the less the leader will be interrupted (fewer parallel workers would report their delayed time). 2. The throttling is not based on the cost limit as that value is distributed proportionally among the parallel workers (so we're back to the previous point). 3. The throttling is not based on the actual cost delay value because the leader could be interrupted at the beginning, the middle or any other part of the wait, and we are more interested in the frequency of the interrupts. 4. A 1 second reporting "throttling" looks like a reasonable threshold as: 4.1 the idea is to have a significant impact when the leader could have been interrupted, say, hundreds/thousands of times per second. 4.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum multiple times per second (so a one second reporting granularity seems ok). With this approach in place (v3 attached applied), during my test case: - the leader has been interrupted about 2500 times (instead of about 345000 times with v2) - the vacuum index phase duration is very close to the master one (it has been 4 seconds faster (over an 8 minutes 40 seconds duration), instead of 3 minutes faster with v2). Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
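To make the throttling approach concrete, here is a minimal sketch of what a parallel worker could do after each cost-delay sleep (an illustration only, not the actual v3 code: the constant and function names are invented, and the leader path is left out). It accumulates the slept time locally and forwards it to the leader at most once per second, reusing a timestamp already taken for the sleep itself so no extra clock calls are needed:

    #include "postgres.h"

    #include "commands/progress.h"
    #include "portability/instr_time.h"
    #include "utils/backend_progress.h"

    #define WORKER_DELAY_REPORT_INTERVAL_MS 1000    /* illustrative threshold */

    static double     delayed_since_last_report = 0;    /* ms slept but not yet reported */
    static instr_time last_report_time;                 /* zero-initialized: first call reports */

    /* Hypothetical helper called by a parallel worker after each cost-delay sleep */
    static void
    report_delay_throttled(instr_time sleep_end, double slept_ms)
    {
        instr_time  since_last = sleep_end;

        delayed_since_last_report += slept_ms;
        INSTR_TIME_SUBTRACT(since_last, last_report_time);

        if (INSTR_TIME_GET_MILLISEC(since_last) >= WORKER_DELAY_REPORT_INTERVAL_MS)
        {
            pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                                (int64) delayed_since_last_report);
            delayed_since_last_report = 0;
            last_report_time = sleep_end;
        }
    }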
>> 2. the leader being interrupted while waiting is also already happening on master >> due to the pgstat_progress_parallel_incr_param() calls in >> parallel_vacuum_process_one_index() (that have been added in >> 46ebdfe164). It has been the case "only" 36 times during my test case. 46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports progress, and we currently don't handle interrupts by restarting the sleep with the remaining time. nanosleep does provide the ability to restart with the remaining time [1], but I don't think it's worth the effort to ensure more accurate vacuum delays for the leader process. > 1. Having a time based only approach to throttle I do agree with a time based approach overall. > 1.1) the more parallel workers is used, the less the impact of the leader on > the vacuum index phase duration/workload is (because the repartition is done > on more processes). Did you mean "because the vacuum is done on more processes"? When a leader is operating on a large index(es) during the entirety of the vacuum operation, wouldn't more parallel workers end up interrupting the leader more often? This is why I think reporting even less often than every 1 second (more below) will be better. > 3. A 1 second reporting "throttling" looks a reasonable threshold as: > 3.1 the idea is to have a significant impact when the leader could have been > interrupted say hundred/thousand times per second. > 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum > multiple times per second (so a one second reporting granularity seems ok). I feel 1 second may still be too frequent. What about 10 seconds (or 30 seconds)? I think this metric in particular will be mainly useful for vacuum runs that are running for minutes or more, making reporting every 10 or 30 seconds still useful. It just occurred to me also that pgstat_progress_parallel_incr_param should have a code comment that it will interrupt a leader process and cause activity such as a sleep to end early. Regards, Sami Imseih Amazon Web Services (AWS) [1] https://man7.org/linux/man-pages/man2/nanosleep.2.html
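For reference, the nanosleep(2) behaviour Sami mentions: when a signal interrupts the call, the remaining time is written back, so an uninterruptible sleep could be built along these lines (a hypothetical helper for illustration, not something the patch or pg_usleep() currently does):

    #include <errno.h>
    #include <time.h>

    /* Hypothetical helper: keep sleeping until the full duration has elapsed */
    static void
    sleep_full_duration(long microsec)
    {
        struct timespec req,
                        rem;

        req.tv_sec = microsec / 1000000L;
        req.tv_nsec = (microsec % 1000000L) * 1000;

        /* restart with the remaining time whenever a signal interrupts the nap */
        while (nanosleep(&req, &rem) == -1 && errno == EINTR)
            req = rem;
    }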
Hi, On Tue, Jun 25, 2024 at 01:12:16AM +0000, Imseih (AWS), Sami wrote: Thanks for the feedback! > >> 2. the leader being interrupted while waiting is also already happening on master > >> due to the pgstat_progress_parallel_incr_param() calls in > >> parallel_vacuum_process_one_index() (that have been added in > >> 46ebdfe164). It has been the case "only" 36 times during my test case. > > 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports > progress, and we currently don't handle interrupts by restarting the sleep with > the remaining time. nanosleep does provide the ability to restart with the remaining > time [1], but I don't think it's worth the effort to ensure more accurate > vacuum delays for the leader process. +1. I don't think it's necessary to have a 100% accurate delay for all the times the delay is involved. I think that's a heuristic parameter (along with the cost limit). What matters in the end is by how much you've been able to pause the whole vacuum (not on a sleep-by-sleep basis). > > 1. Having a time based only approach to throttle > > I do agree with a time based approach overall. > > > > 1.1) the more parallel workers is used, the less the impact of the leader on > > the vacuum index phase duration/workload is (because the repartition is done > > on more processes). > > Did you mean " because the vacuum is done on more processes"? Yes. > When a leader is operating on a large index(s) during the entirety > of the vacuum operation, wouldn't more parallel workers end up > interrupting the leader more often? That's right, but my point was about the impact on the "whole" duration time and "whole" workload (leader + workers included) and not about the number of times the leader is interrupted. If there are, say, 100 workers, then interrupting the leader (1 process out of 101) is probably less of an issue, as it means that there is a lot of work to be done to keep those 100 workers busy. I don't think the size of the index the leader is vacuuming has an impact. I think that having the leader vacuum a 100 GB index or 100 x 1GB indexes is the same (as long as all the other workers are active during all that time). > > 3. A 1 second reporting "throttling" looks a reasonable threshold as: > > > 3.1 the idea is to have a significant impact when the leader could have been > > interrupted say hundred/thousand times per second. > > > 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum > > multiple times per second (so a one second reporting granularity seems ok). > > I feel 1 second may still be too frequent. Maybe we'll need more measurements, but this is what my test case is made of: vacuum_cost_delay = 1, vacuum_cost_limit = 10, 8 parallel workers, 1 leader, 21 indexes (about 1GB each, one 40MB), all in memory. This led to: with a 1 second reporting frequency, the leader has been interrupted about 2500 times over 8m39s, leading to about the same time as on master (8m43s). > What about 10 seconds ( or 30 seconds )? I'm not sure (may need more measurements) but it would probably complicate the reporting a bit (as with the current v3 we'd miss reporting the indexes that take less time than the threshold to complete). > I think this metric in particular will be mainly useful for vacuum runs that are > running for minutes or more, making reporting every 10 or 30 seconds > still useful. Agree. OTOH, one could be interested in diagnosing what happened during, say, a 5 second peak in I/O resource consumption/latency.
Sampling pg_stat_progress_vacuum at a 1 second interval and seeing by how much the vacuum has been paused during that time could help too (especially if it is made of a lot of parallel workers, which could lead to a lot of I/O). But it would miss data if we were reporting at a larger interval. > It just occurred to me also that pgstat_progress_parallel_incr_param > should have a code comment that it will interrupt a leader process and > cause activity such as a sleep to end early. Good point, I'll add a comment for it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jun 24, 2024 at 7:50 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Sat, Jun 22, 2024 at 12:48:33PM +0000, Bertrand Drouvot wrote: > > 1. vacuuming indexes time has been longer on master because with v2, the leader > > has been interrupted 342605 times while waiting, then making v2 "faster". > > > > 2. the leader being interrupted while waiting is also already happening on master > > due to the pgstat_progress_parallel_incr_param() calls in > > parallel_vacuum_process_one_index() (that have been added in > > 46ebdfe164). It has been the case "only" 36 times during my test case. > > > > I think that 2. is less of a concern but I think that 1. is something that needs > > to be addressed because the leader process is not honouring its cost delay wait > > time in a noticeable way (at least during my test case). > > > > I did not think of a proposal yet, just sharing my investigation as to why > > v2 has been faster than master during the vacuuming indexes phase. Thank you for the benchmarking and analyzing the results! I agree with your analysis and was surprised by the fact that the more times workers go to sleep, the more times the leader wakes up. > > I think that a reasonable approach is to make the reporting from the parallel > workers to the leader less aggressive (means occur less frequently). > > Please find attached v3, that: > > - ensures that there is at least 1 second between 2 reports, per parallel worker, > to the leader. > > - ensures that the reported delayed time is still correct (keep track of the > delayed time between 2 reports). > > - does not add any extra pg_clock_gettime_ns() calls (as compare to v2). > Sounds good to me. I think it's better to keep the logic for throttling the reporting the delay message simple. It's an important consideration but executing parallel vacuum with delays would be less likely to be used in practice. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
> 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports > progress, and we currently don't handle interrupts by restarting the sleep with > the remaining time. nanosleep does provide the ability to restart with the remaining > time [1], but I don't think it's worth the effort to ensure more accurate > vacuum delays for the leader process. After discussing offline with Bertrand, it may be better to have a solution that deals with the interrupts and allows the sleep to continue to completion. This will simplify this patch and will be useful for other cases in which parallel workers need to send a message to the leader. This is the thread [1] for that discussion. [1] https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000%40email.amazonses.com Regards, Sami
Hi, On Fri, Jun 28, 2024 at 08:07:39PM +0000, Imseih (AWS), Sami wrote: > > 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports > > progress, and we currently don't handle interrupts by restarting the sleep with > > the remaining time. nanosleep does provide the ability to restart with the remaining > > time [1], but I don't think it's worth the effort to ensure more accurate > > vacuum delays for the leader process. > > After discussing offline with Bertrand, it may be better to have > a solution to deal with the interrupts and allows the sleep to continue to > completion. This will simplify this patch and will be useful > for other cases in which parallel workers need to send a message > to the leader. This is the thread [1] for that discussion. > > [1] https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000%40email.amazonses.com > Yeah, I think it would make sense to put this thread on hold until we know more about the outcome of [1] (which you mentioned above). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Sep 05, 2024 at 04:59:54AM +0000, Bertrand Drouvot wrote: > Please find attached v6, a mandatory rebase due to catversion bump conflict. > I'm removing the catversion bump from the patch as it generates too frequent > conflicts (just mention it needs to be done in the commit message). v6 looks generally reasonable to me. I think the nap_time_since_last_report variable needs to be marked static, though. One thing that occurs to me is that this information may not be particularly useful when parallel workers are used. Without parallelism, it's easy enough to figure out the percentage of time that your VACUUM is spending asleep, but when there are parallel workers, it may be hard to deduce much of anything from the value. I'm not sure that this is a deal-breaker for the patch, though, if for no other reason than it'll most likely be used for autovacuum, which doesn't use parallel vacuum yet. If there are no other concerns, I'll plan on committing this one soon after a bit of editorialization. -- nathan
Hi, On Wed, Sep 18, 2024 at 04:04:53PM -0500, Nathan Bossart wrote: > On Thu, Sep 05, 2024 at 04:59:54AM +0000, Bertrand Drouvot wrote: > > Please find attached v6, a mandatory rebase due to catversion bump conflict. > > I'm removing the catversion bump from the patch as it generates too frequent > > conflicts (just mention it needs to be done in the commit message). > > v6 looks generally reasonable to me. Thanks for looking at it! > I think the > nap_time_since_last_report variable needs to be marked static, though. Agree. > One thing that occurs to me is that this information may not be > particularly useful when parallel workers are used. Without parallelism, > it's easy enough to figure out the percentage of time that your VACUUM is > spending asleep, but when there are parallel workers, it may be hard to > deduce much of anything from the value. I think that if the number of parallel workers being used is the same across runs, then one can measure "accurately" the impact of some changes (set vacuum_cost_delay=... for example) on the delay. Without the patch one could just guess, as many other factors could impact the vacuum duration (load on the system, I/O latency, ...). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com