Thread: Track the amount of time waiting due to cost_delay

Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi hackers,

During the last pgconf.dev I attended Robert's presentation about autovacuum [1] and
it reminded me of an idea I had some time ago: $SUBJECT

Please find attached a patch doing so by adding a new field (named "time_delayed")
to the pg_stat_progress_vacuum view.

Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one can observe the variation in vacuum
duration, but attributing it to those changes is unreliable, as many other
factors can affect the duration (load on the system, I/O latency, ...).

This new field reports the time the vacuum has spent sleeping due to cost delay:
it could be useful to 1) measure the impact of the current cost_delay and
cost_limit settings and 2) experiment with new values (and so help with tuning
those parameters).

The patch is relatively small thanks to the work that has been done in
f1889729dd (to allow parallel workers to report progress to the leader).

[1]:
https://www.pgevents.ca/events/pgconfdev2024/schedule/session/29-how-autovacuum-goes-wrong-and-can-we-please-make-it-stop-doing-that/

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Track the amount of time waiting due to cost_delay

From
Nathan Bossart
Date:
On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote:
> During the last pgconf.dev I attended Robert's presentation about autovacuum and
> it reminded me of an idea I had some time ago: $SUBJECT

This sounds like useful information to me.  I wonder if we should also
surface the effective cost limit for each autovacuum worker.

> Currently one can change [autovacuum_]vacuum_cost_delay and
> [autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
> the changes on the vacuum duration: one can observe the variation in vacuum
> duration, but attributing it to those changes is unreliable, as many other
> factors can affect the duration (load on the system, I/O latency, ...).

IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay.  Is that how you'd expect for this to be used in practice?

>          pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
>          pg_usleep(msec * 1000);
>          pgstat_report_wait_end();
> +        /* Report the amount of time we slept */
> +        if (VacuumSharedCostBalance != NULL)
> +            pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> +        else
> +            pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);

Hm.  Should we measure the actual time spent sleeping, or is a rough
estimate good enough?  I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much.  If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.

-- 
nathan



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote:
> > During the last pgconf.dev I attended Robert's presentation about autovacuum and
> > it reminded me of an idea I had some time ago: $SUBJECT
> 
> This sounds like useful information to me.

Thanks for looking at it!

> I wonder if we should also
> surface the effective cost limit for each autovacuum worker.

I'm not sure about it as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
due to _this_ cost_limit. But that's not necessarily true as the cost_limit could
have changed multiple times since the vacuum started. So, unless there is
frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit together could be misleading IMHO.

> > Currently one can change [autovacuum_]vacuum_cost_delay and
> > [autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
> > the changes on the vacuum duration: one can observe the variation in vacuum
> > duration, but attributing it to those changes is unreliable, as many other
> > factors can affect the duration (load on the system, I/O latency, ...).
> 
> IIUC you'd need to get information from both pg_stat_progress_vacuum and
> pg_stat_activity in order to know what percentage of time was being spent
> in cost delay.  Is that how you'd expect for this to be used in practice?

Yeah, one could use a query such as:

select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid)

for example. Would it be worth providing an example somewhere in the docs?

> >          pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
> >          pg_usleep(msec * 1000);
> >          pgstat_report_wait_end();
> > +        /* Report the amount of time we slept */
> > +        if (VacuumSharedCostBalance != NULL)
> > +            pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> > +        else
> > +            pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> 
> Hm.  Should we measure the actual time spent sleeping, or is a rough
> estimate good enough?  I believe pg_usleep() might return early (e.g., if
> the process is signaled) or late, so this field could end up being
> inaccurate, although probably not by much.  If we're okay with millisecond
> granularity, my first instinct is that what you've proposed is fine, but I
> figured I'd bring it up anyway.

Thanks for bringing that up! I had the same thought when writing the code and
came to the same conclusion. I think that's a good enough estimate, especially
during a long-running vacuum (which is probably the case where users care the
most).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Nathan Bossart
Date:
On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote:
> On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
>> I wonder if we should also
>> surface the effective cost limit for each autovacuum worker.
> 
> I'm not sure about it as I think that it could be misleading: one could query
> pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
> due to _this_ cost_limit. But that's not necessarily true as the cost_limit could
> have changed multiple times since the vacuum started. So, unless there is
> frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
> the cost_limit together could be misleading IMHO.

Well, that's true for the delay, too, right (at least as of commit
7d71d3d)?

-- 
nathan



Re: Track the amount of time waiting due to cost_delay

From
"Imseih (AWS), Sami"
Date:
>> This sounds like useful information to me.

> Thanks for looking at it!

The  VacuumDelay is the only visibility available to
gauge the cost_delay. Having this information
advertised by pg_stat_progress_vacuum as is being proposed
is much better. However, I also think that the
"number of times"  the vacuum went into delay will be needed
as well. Both values will be useful to tune cost_delay and cost_limit. 

It may also make sense to accumulate the total_time in delay
and the number of times delayed in a cumulative statistics [0]
view to allow a user to trend this information over time.
I don't think this info fits in any of the existing views, i.e.
pg_stat_database, so maybe a new view for cumulative
vacuum stats may be needed. This is likely a separate
discussion, but calling it out here.

>> IIUC you'd need to get information from both pg_stat_progress_vacuum and
>> pg_stat_activity in order to know what percentage of time was being spent
>> in cost delay.  Is that how you'd expect for this to be used in practice?

> Yeah, one could use a query such as:

> select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid)

Maybe all  progress views should just expose the "beentry->st_activity_start_timestamp " 
to let the user know when the current operation began.


Regards,

Sami Imseih
Amazon Web Services (AWS)


[0] https://www.postgresql.org/docs/current/monitoring-stats.html





Re: Track the amount of time waiting due to cost_delay

From
Robert Haas
Date:
On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> Hm.  Should we measure the actual time spent sleeping, or is a rough
> estimate good enough?  I believe pg_usleep() might return early (e.g., if
> the process is signaled) or late, so this field could end up being
> inaccurate, although probably not by much.  If we're okay with millisecond
> granularity, my first instinct is that what you've proposed is fine, but I
> figured I'd bring it up anyway.

I bet you could also sleep for longer than planned, throwing the
numbers off in the other direction.

I'm always suspicious of this sort of thing. I tend to find nothing
gives me the right answer unless I assume that the actual sleep times
are randomly and systematically different from the intended sleep
times by arbitrarily large amounts. I think we should at least do
some testing: if we measure both the intended sleep time and the
actual sleep time, how close are they? Does it change if the system is
under crushing load (which might elongate sleeps) or if we spam
SIGUSR1 against the vacuum process (which might shorten them)?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
On Mon, Jun 10, 2024 at 02:20:16PM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote:
> > On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
> >> I wonder if we should also
> >> surface the effective cost limit for each autovacuum worker.
> > 
> > I'm not sure about it as I think that it could be misleading: one could query
> > pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
> > due to _this_ cost_limit. But that's not necessarily true as the cost_limit could
> > have changed multiple times since the vacuum started. So, unless there is
> > frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
> > the cost_limit together could be misleading IMHO.
> 
> Well, that's true for the delay, too, right (at least as of commit
> 7d71d3d)?

Yeah right, but the patch exposes the total amount of time the vacuum has
been delayed (not the cost_delay per se), which does not sound misleading to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Mon, Jun 10, 2024 at 08:12:46PM +0000, Imseih (AWS), Sami wrote:
> >> This sounds like useful information to me.
> 
> > Thanks for looking at it!
> 
> The  VacuumDelay is the only visibility available to
> gauge the cost_delay. Having this information
> advertised by pg_stat_progress_vacuum as is being proposed
> is much better.

Thanks for looking at it!

> However, I also think that the
> "number of times"  the vacuum went into delay will be needed
> as well. Both values will be useful to tune cost_delay and cost_limit. 

Yeah, I think that's a good idea. With v1 one could figure out how many times
the delay has been triggered, but that does not work anymore if: 1) cost_delay
changed during the vacuum or 2) the patch changes the way time_delayed
is measured/reported (i.e., reports the actual wait time and not the theoretical
time, as v1 does).

> 
> It may also make sense to accumulate the total_time in delay
> and the number of times delayed in a cumulative statistics [0]
> view to allow a user to trend this information over time.
> I don't think this info fits in any of the existing views, i.e.
> pg_stat_database, so maybe a new view for cumulative
> vacuum stats may be needed. This is likely a separate
> discussion, but calling it out here.

+1

> >> IIUC you'd need to get information from both pg_stat_progress_vacuum and
> >> pg_stat_activity in order to know what percentage of time was being spent
> >> in cost delay.  Is that how you'd expect for this to be used in practice?
> 
> > Yeah, one could use a query such as:
> 
> > select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid)
> 
> Maybe all  progress views should just expose the "beentry->st_activity_start_timestamp " 
> to let the user know when the current operation began.

Yeah, maybe. I think this is likely a separate discussion too; thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Masahiko Sawada
Date:
Hi,

On Mon, Jun 10, 2024 at 3:05 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi hackers,
>
> During the last pgconf.dev I attended Robert's presentation about autovacuum and
> it reminded me of an idea I had some time ago: $SUBJECT
>
> Please find attached a patch doing so by adding a new field (named "time_delayed")
> to the pg_stat_progress_vacuum view.
>
> Currently one can change [autovacuum_]vacuum_cost_delay and
> [autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
> the changes on the vacuum duration: one can observe the variation in vacuum
> duration, but attributing it to those changes is unreliable, as many other
> factors can affect the duration (load on the system, I/O latency, ...).
>
> This new field reports the time the vacuum has spent sleeping due to cost delay:
> it could be useful to 1) measure the impact of the current cost_delay and
> cost_limit settings and 2) experiment with new values (and so help with tuning
> those parameters).
>
> The patch is relatively small thanks to the work that has been done in
> f1889729dd (to allow parallel workers to report progress to the leader).

Thank you for the proposal and the patch. I understand the motivation
of this patch. Besides the point Nathan mentioned, I'm slightly worried
that a massive number of parallel messages could be sent to the leader process
when the cost_limit value is low.

FWIW when I want to confirm the vacuum delay effect, I often use the
information from the DEBUG2 log message in the VacuumUpdateCosts()
function. Exposing these data (per-worker dobalance, cost_limit,
cost_delay, active, and failsafe) somewhere in a view might also be
helpful for users for checking vacuum delay effects. It's not meant to
measure the impact of the changes on the vacuum duration, though.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
> On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
> > Hm.  Should we measure the actual time spent sleeping, or is a rough
> > estimate good enough?  I believe pg_usleep() might return early (e.g., if
> > the process is signaled) or late, so this field could end up being
> > inaccurate, although probably not by much.  If we're okay with millisecond
> > granularity, my first instinct is that what you've proposed is fine, but I
> > figured I'd bring it up anyway.
> 
> I bet you could also sleep for longer than planned, throwing the
> numbers off in the other direction.

Thanks for looking at it! Agree, that's how I read "or late" from Nathan's
comment above.

> I'm always suspicious of this sort of thing. I tend to find nothing
> gives me the right answer unless I assume that the actual sleep times
> are randomly and systematically different from the intended sleep
> times by arbitrarily large amounts. I think we should at least do
> some testing: if we measure both the intended sleep time and the
> actual sleep time, how close are they? Does it change if the system is
> under crushing load (which might elongate sleeps) or if we spam
> SIGUSR1 against the vacuum process (which might shorten them)?

OTOH Sami proposed in [1] to count the number of times the vacuum went into
delay. I think that's a good idea. His idea makes me think that, in addition to
the number of waits, it would then make sense to measure the "actual" sleep time
(and not the intended one): one could then compare the intended wait time
(number of waits * cost delay, if it does not change during the vacuum) with
the actual measured wait time.

So I think that in v2 we could: 1) measure the actual wait time instead, 2)
count the number of times the vacuum slept. We could also 3) report the
effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
be misleading, but I'll yield to the majority opinion if people think it's not).

Thoughts?


[1]: https://www.postgresql.org/message-id/A0935130-7C4B-4094-B6E4-C7D5086D50EF%40amazon.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:

> Thank you for the proposal and the patch. I understand the motivation
> of this patch.

Thanks for looking at it!

> Besides the point Nathan mentioned, I'm slightly worried
> that a massive number of parallel messages could be sent to the leader process
> when the cost_limit value is low.

I see, I can/will do some testing in this area and share the numbers.

> 
> FWIW when I want to confirm the vacuum delay effect, I often use the
> information from the DEBUG2 log message in the VacuumUpdateCosts()
> function. Exposing these data (per-worker dobalance, cost_limit,
> cost_delay, active, and failsafe) somewhere in a view might also be
> helpful for users for checking vacuum delay effects.

Do you mean add time_delayed in pg_stat_progress_vacuum and cost_limit + the
other data you mentioned above in another dedicated view?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
> I'm always suspicious of this sort of thing. I tend to find nothing
> gives me the right answer unless I assume that the actual sleep times
> are randomly and systematically different from the intended sleep
> times by arbitrarily large amounts. I think we should at least do
> some testing: if we measure both the intended sleep time and the
> actual sleep time, how close are they? Does it change if the system is
> under crushing load (which might elongate sleeps) or if we spam
> SIGUSR1 against the vacuum process (which might shorten them)?

Though I (now) think that it would make sense to record the actual delay time 
instead (see [1]), I think it's interesting to do some testing as you suggested.

With record_actual_time.txt (attached) applied on top of v1, we can see the
intended and actual wait time.
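
For reference, such instrumentation could look roughly like this (a minimal
sketch only, not the attached record_actual_time.txt itself;
PROGRESS_VACUUM_ACTUAL_TIME_DELAYED is a hypothetical extra counter next to the
proposed PROGRESS_VACUUM_TIME_DELAYED one, and the parallel-worker branch is
left out for brevity):

    /* in vacuum_delay_point(), where msec is the intended nap time */
    instr_time  delay_start,
                delay_elapsed;

    INSTR_TIME_SET_CURRENT(delay_start);

    pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
    pg_usleep(msec * 1000);
    pgstat_report_wait_end();

    INSTR_TIME_SET_CURRENT(delay_elapsed);
    INSTR_TIME_SUBTRACT(delay_elapsed, delay_start);

    /* intended sleep time, in ms */
    pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
    /* actual (measured) sleep time, in ms */
    pgstat_progress_incr_param(PROGRESS_VACUUM_ACTUAL_TIME_DELAYED,
                               (int64) INSTR_TIME_GET_MILLISEC(delay_elapsed));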

On my system, with "no load at all" except the vacuum running, I see no diff:

                         Tue Jun 11 09:22:06 2024 (every 1s)

  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        41107 |               41107 | 00:00:42.301851
(1 row)

                         Tue Jun 11 09:22:07 2024 (every 1s)

  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        42076 |               42076 | 00:00:43.301848
(1 row)

                         Tue Jun 11 09:22:08 2024 (every 1s)

  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        43045 |               43045 | 00:00:44.301854
(1 row)

But if I launch pg_reload_conf() 10 times in a row, I can see:

                         Tue Jun 11 09:22:09 2024 (every 1s)

  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        44064 |               44034 | 00:00:45.302965
(1 row)

                         Tue Jun 11 09:22:10 2024 (every 1s)

  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        45033 |               45003 | 00:00:46.301858


As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?


[1]: https://www.postgresql.org/message-id/Zmf712A5xcOM9Hlg%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Track the amount of time waiting due to cost_delay

From
Nathan Bossart
Date:
On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote:
> So I think that in v2 we could: 1) measure the actual wait time instead, 2)
> count the number of times the vacuum slept. We could also 3) report the
> effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
> be misleading, but I'll yield to the majority opinion if people think it's not).

I still think the effective cost limit would be useful, if for no other
reason than to help reinforce that it is distributed among the autovacuum
workers.  We could document that this value may change over the lifetime of
a worker to help avoid misleading folks.

-- 
nathan



Re: Track the amount of time waiting due to cost_delay

From
Robert Haas
Date:
On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> As we can see the actual wait time is 30ms less than the intended wait time with
> this simple test. So I still think we should go with 1) actual wait time and 2)
> report the number of waits (as mentioned in [1]). Does that make sense to you?

I like the idea of reporting the actual wait time better, provided
that we verify that doing so isn't too expensive. I think it probably
isn't, because in a long-running VACUUM there is likely to be disk
I/O, so the CPU overhead of a few extra gettimeofday() calls should be
fairly low by comparison. I wonder if there's a noticeable hit when
everything is in-memory. I guess probably not, because with any sort
of normal configuration, we shouldn't be delaying after every block we
process, so the cost of those gettimeofday() calls should still be
getting spread across quite a bit of real work.

That said, I'm not sure this experiment shows a real problem with the
idea of showing intended wait time. It does establish the concept that
repeated signals can throw our numbers off, but 30ms isn't much of a
discrepancy. I'm worried about being off by a factor of two, or an
order of magnitude. I think we still don't know if that can happen,
but if we're going to show actual wait time anyway, then we don't need
to explore the problems with other hypothetical systems too much.

I'm not convinced that reporting the number of waits is useful. If we
were going to report a possibly-inaccurate amount of actual waiting,
then also reporting the number of waits might make it easier to figure
out when the possibly-inaccurate number was in fact inaccurate. But I
think it's way better to report an accurate amount of actual waiting,
and then I'm not sure what we gain by also reporting the number of
waits.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Track the amount of time waiting due to cost_delay

From
Jan Wieck
Date:
On 6/11/24 13:13, Robert Haas wrote:
> On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>> As we can see the actual wait time is 30ms less than the intended wait time with
>> this simple test. So I still think we should go with 1) actual wait time and 2)
>> report the number of waits (as mentioned in [1]). Does that make sense to you?
> 
> I like the idea of reporting the actual wait time better, provided
> that we verify that doing so isn't too expensive. I think it probably
> isn't, because in a long-running VACUUM there is likely to be disk
> I/O, so the CPU overhead of a few extra gettimeofday() calls should be
> fairly low by comparison. I wonder if there's a noticeable hit when
> everything is in-memory. I guess probably not, because with any sort
> of normal configuration, we shouldn't be delaying after every block we
> process, so the cost of those gettimeofday() calls should still be
> getting spread across quite a bit of real work.

Does it even require a call to gettimeofday()? The code in vacuum 
calculates an msec value and calls pg_usleep(msec * 1000). I don't think 
it is necessary to measure how long that nap was.


Regards, Jan




Re: Track the amount of time waiting due to cost_delay

From
"Imseih (AWS), Sami"
Date:
> I'm not convinced that reporting the number of waits is useful. If we
> were going to report a possibly-inaccurate amount of actual waiting,
> then also reporting the number of waits might make it easier to figure
> out when the possibly-inaccurate number was in fact inaccurate. But I
> think it's way better to report an accurate amount of actual waiting,
> and then I'm not sure what we gain by also reporting the number of
> waits.

I think including the number of times vacuum went into sleep 
will help paint a full picture of the effect of tuning the vacuum_cost_delay 
and vacuum_cost_limit for the user, even if we are reporting accurate 
amounts of actual sleeping.

This is particularly true for autovacuum in which the cost limit is spread
across all autovacuum workers, and knowing how many times autovacuum
went to sleep will be useful along with the total time spent sleeping.

Regards,

Sami


Re: Track the amount of time waiting due to cost_delay

From
Nathan Bossart
Date:
On Tue, Jun 11, 2024 at 06:19:23PM +0000, Imseih (AWS), Sami wrote:
>> I'm not convinced that reporting the number of waits is useful. If we
>> were going to report a possibly-inaccurate amount of actual waiting,
>> then also reporting the number of waits might make it easier to figure
>> out when the possibly-inaccurate number was in fact inaccurate. But I
>> think it's way better to report an accurate amount of actual waiting,
>> and then I'm not sure what we gain by also reporting the number of
>> waits.
> 
> I think including the number of times vacuum went into sleep 
> will help paint a full picture of the effect of tuning the vacuum_cost_delay 
> and vacuum_cost_limit for the user, even if we are reporting accurate 
> amounts of actual sleeping.
> 
> This is particularly true for autovacuum in which the cost limit is spread
> across all autovacuum workers, and knowing how many times autovacuum
> went to sleep will be useful along with the total time spent sleeping.

I'm struggling to think of a scenario in which the number of waits would be
useful, assuming you already know the amount of time spent waiting.  Even
if the number of waits is huge, it doesn't tell you much else AFAICT.  I'd
be much more likely to adjust the cost settings based on the percentage of
time spent sleeping.

-- 
nathan



Re: Track the amount of time waiting due to cost_delay

From
Robert Haas
Date:
On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I'm struggling to think of a scenario in which the number of waits would be
> useful, assuming you already know the amount of time spent waiting.  Even
> if the number of waits is huge, it doesn't tell you much else AFAICT.  I'd
> be much more likely to adjust the cost settings based on the percentage of
> time spent sleeping.

This is also how I see it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Track the amount of time waiting due to cost_delay

From
"Imseih (AWS), Sami"
Date:
>> I'm struggling to think of a scenario in which the number of waits would be
>> useful, assuming you already know the amount of time spent waiting. Even
>> if the number of waits is huge, it doesn't tell you much else AFAICT. I'd
>> be much more likely to adjust the cost settings based on the percentage of
>> time spent sleeping.


> This is also how I see it.

I think it may be useful for a user to be able to determine the "average
sleep time" for a vacuum, especially because the vacuum cost
limit and delay can be adjusted on the fly for a running vacuum.

If we only show the total sleep time, the user could make wrong
assumptions about how long each sleep took: they might
assume that all sleep delays for a particular vacuum run have been
uniform in duration, when in fact they may not have been.


Regards,

Sami 






Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 11, 2024 at 11:40:36AM -0500, Nathan Bossart wrote:
> On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote:
> > So I think that in v2 we could: 1) measure the actual wait time instead, 2)
> > count the number of times the vacuum slept. We could also 3) report the
> > effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
> > be misleading, but I'll yield to the majority opinion if people think it's not).
> 
> I still think the effective cost limit would be useful, if for no other
> reason than to help reinforce that it is distributed among the autovacuum
> workers.

I also think it can be useful; my concern is more about putting this information in
pg_stat_progress_vacuum. What about Sawada-san's proposal in [1]? (we could
create a new view that would contain those data: per-worker dobalance, cost_limit,
cost_delay, active, and failsafe).

[1]: https://www.postgresql.org/message-id/CAD21AoDOu%3DDZcC%2BPemYmCNGSwbgL1s-5OZkZ1Spd5pSxofWNCw%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 11, 2024 at 02:48:30PM -0400, Robert Haas wrote:
> On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> > I'm struggling to think of a scenario in which the number of waits would be
> > useful, assuming you already know the amount of time spent waiting.

If we provide the actual time spent waiting, providing the number of waits would
allow one to see if there is a diff between the actual time and the intended time
(i.e., number of waits * cost_delay, should the cost_delay remain the same during
the vacuum). That should trigger some thoughts if the diff is large
enough.

I think that what we are doing here is somehow adding instrumentation around the
"WAIT_EVENT_VACUUM_DELAY" wait event. If we were to add instrumentation for wait
events (generally speaking) we'd probably also expose the number of waits per
wait event (in addition to the time waited).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 11, 2024 at 01:13:48PM -0400, Robert Haas wrote:
> On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > As we can see the actual wait time is 30ms less than the intended wait time with
> > this simple test. So I still think we should go with 1) actual wait time and 2)
> > report the number of waits (as mentioned in [1]). Does that make sense to you?
> 
> I like the idea of reporting the actual wait time better,

+1

> provided
> that we verify that doing so isn't too expensive. I think it probably
> isn't, because in a long-running VACUUM there is likely to be disk
> I/O, so the CPU overhead of a few extra gettimeofday() calls should be
> fairly low by comparison.

Agree.

> I wonder if there's a noticeable hit when
> everything is in-memory. I guess probably not, because with any sort
> of normal configuration, we shouldn't be delaying after every block we
> process, so the cost of those gettimeofday() calls should still be
> getting spread across quite a bit of real work.

I did some testing, with:

shared_buffers = 12GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 0
max_parallel_workers = 0

added to a default config file.

A table and all its indexes were fully in memory, the numbers are:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
             FROM pg_buffercache b JOIN pg_class c
             ON b.relfilenode = pg_relation_filenode(c.oid) AND
                b.reldatabase IN (0, (SELECT oid FROM pg_database
                                      WHERE datname = current_database()))
             JOIN pg_namespace n ON n.oid = c.relnamespace
             GROUP BY n.nspname, c.relname
             ORDER BY 3 DESC
             LIMIT 11;

 nspname |      relname      | buffers
---------+-------------------+---------
 public  | large_tbl         |  222280
 public  | large_tbl_pkey    |    5486
 public  | large_tbl_filler7 |    1859
 public  | large_tbl_filler4 |    1859
 public  | large_tbl_filler1 |    1859
 public  | large_tbl_filler6 |    1859
 public  | large_tbl_filler3 |    1859
 public  | large_tbl_filler2 |    1859
 public  | large_tbl_filler5 |    1859
 public  | large_tbl_filler8 |    1859
 public  | large_tbl_version |    1576
(11 rows)


The observed timings when vacuuming this table are:

On master:

vacuum phase: cumulative duration
---------------------------------

scanning heap: 00:00:37.808184
vacuuming indexes: 00:00:41.808176
vacuuming heap: 00:00:54.808156

On master patched with actual time delayed:

vacuum phase: cumulative duration
---------------------------------

scanning heap: 00:00:36.502104 (time_delayed: 22202)
vacuuming indexes: 00:00:41.002103 (time_delayed: 23769)
vacuuming heap: 00:00:54.302096 (time_delayed: 34886)

As we can see there is no noticeable degradation, while the vacuum entered this
instrumentation code path about 34886 times (cost_delay was set to 1).

> That said, I'm not sure this experiment shows a real problem with the
> idea of showing intended wait time. It does establish the concept that
> repeated signals can throw our numbers off, but 30ms isn't much of a
> discrepancy.

Yeah, the idea was just to show how easy it is to create a 30ms discrepancy.

> I'm worried about being off by a factor of two, or an
> order of magnitude. I think we still don't know if that can happen,
> but if we're going to show actual wait time anyway, then we don't need
> to explore the problems with other hypothetical systems too much.

Agree.

> I'm not convinced that reporting the number of waits is useful. If we
> were going to report a possibly-inaccurate amount of actual waiting,
> then also reporting the number of waits might make it easier to figure
> out when the possibly-inaccurate number was in fact inaccurate. But I
> think it's way better to report an accurate amount of actual waiting,
> and then I'm not sure what we gain by also reporting the number of
> waits.

Sami shared his thoughts in [1] and [2] and so did I in [3]. If some of us still
don't think that reporting the number of waits is useful then we can probably
start without it.

[1]: https://www.postgresql.org/message-id/0EA474B6-BF88-49AE-82CA-C1A9A3C17727%40amazon.com
[2]: https://www.postgresql.org/message-id/E12435E2-5FCA-49B0-9ADB-0E7153F95E2D%40amazon.com
[3]: https://www.postgresql.org/message-id/ZmmGG4e%2BqTBD2kfn%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 11, 2024 at 08:26:23AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
> 
> > Thank you for the proposal and the patch. I understand the motivation
> > of this patch.
> 
> Thanks for looking at it!
> 
> > Besides the point Nathan mentioned, I'm slightly worried
> > that a massive number of parallel messages could be sent to the leader process
> > when the cost_limit value is low.
> 
> I see, I can/will do some testing in this area and share the numbers.

Here is the result of the test. It has been launched several times and it
produced the same (surprising) result each time.

====================== Context ================================================

The testing has been done with this relation (large_tbl) and its indexes:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
             FROM pg_buffercache b JOIN pg_class c
             ON b.relfilenode = pg_relation_filenode(c.oid) AND
                b.reldatabase IN (0, (SELECT oid FROM pg_database
                                      WHERE datname = current_database()))
             JOIN pg_namespace n ON n.oid = c.relnamespace
             GROUP BY n.nspname, c.relname
             ORDER BY 3 DESC
             LIMIT 22;

 nspname |      relname       | buffers
---------+--------------------+---------
 public  | large_tbl          |  222280
 public  | large_tbl_filler13 |  125000
 public  | large_tbl_filler6  |  125000
 public  | large_tbl_filler5  |  125000
 public  | large_tbl_filler3  |  125000
 public  | large_tbl_filler15 |  125000
 public  | large_tbl_filler4  |  125000
 public  | large_tbl_filler20 |  125000
 public  | large_tbl_filler18 |  125000
 public  | large_tbl_filler14 |  125000
 public  | large_tbl_filler8  |  125000
 public  | large_tbl_filler11 |  125000
 public  | large_tbl_filler19 |  125000
 public  | large_tbl_filler7  |  125000
 public  | large_tbl_filler1  |  125000
 public  | large_tbl_filler12 |  125000
 public  | large_tbl_filler9  |  125000
 public  | large_tbl_filler17 |  125000
 public  | large_tbl_filler16 |  125000
 public  | large_tbl_filler10 |  125000
 public  | large_tbl_filler2  |  125000
 public  | large_tbl_pkey     |    5486
(22 rows)

All of them completely fit in memory (to avoid I/O read latency during the vacuum).

The config, outside of default is:

max_wal_size = 4GB
shared_buffers = 30GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 8
max_parallel_workers = 10
vacuum_cost_limit = 10
autovacuum_vacuum_cost_limit = 10

My system is not overloaded, has enough resources to run this test and only this
test is running.

====================== Results ================================================

========== With v2 (attached) applied on master

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1146873.016 ms (19:06.873)

The duration is split as follows:

Vacuum phase: cumulative time (cumulative time delayed)
=======================================================
scanning heap: 00:08:16.414628 (time_delayed: 444370)
vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293)
vacuuming heap: 00:19:06.814617 (time_delayed: 2767540)

I sampled active sessions from pg_stat_activity (one second interval), here is
the summary during the vacuuming indexes phase (ordered by count):

 leader_pid |  pid   |   wait_event   | count
------------+--------+----------------+-------
     452996 | 453225 | VacuumDelay    |   366
     452996 | 453223 | VacuumDelay    |   363
     452996 | 453226 | VacuumDelay    |   362
     452996 | 453224 | VacuumDelay    |   361
     452996 | 453222 | VacuumDelay    |   359
     452996 | 453221 | VacuumDelay    |   359
            | 452996 | VacuumDelay    |   331
            | 452996 | CPU            |    30
     452996 | 453224 | WALWriteLock   |    23
     452996 | 453222 | WALWriteLock   |    20
     452996 | 453226 | WALWriteLock   |    20
     452996 | 453221 | WALWriteLock   |    19
            | 452996 | WalSync        |    18
     452996 | 453225 | WALWriteLock   |    18
     452996 | 453223 | WALWriteLock   |    16
            | 452996 | WALWriteLock   |    15
     452996 | 453221 | CPU            |    14
     452996 | 453222 | CPU            |    14
     452996 | 453223 | CPU            |    12
     452996 | 453224 | CPU            |    10
     452996 | 453226 | CPU            |    10
     452996 | 453225 | CPU            |     8
     452996 | 453223 | WalSync        |     4
     452996 | 453221 | WalSync        |     2
     452996 | 453226 | WalWrite       |     2
     452996 | 453221 | WalWrite       |     1
            | 452996 | ParallelFinish |     1
     452996 | 453224 | WalSync        |     1
     452996 | 453225 | WalSync        |     1
     452996 | 453222 | WalWrite       |     1
     452996 | 453225 | WalWrite       |     1
     452996 | 453222 | WalSync        |     1
     452996 | 453226 | WalSync        |     1


========== On master (v2 not applied)

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1322598.087 ms (22:02.598)

Surprisingly, it took about 3 minutes longer on master.

Let's see how the time is split:

Vacuum phase: cumulative time
=============================
scanning heap: 00:08:07.061196
vacuuming indexes: 00:17:50.961228
vacuuming heap: 00:22:02.561199

I sampled active sessions from pg_stat_activity (one second interval), here is
the summary during the vacuuming indexes phase (ordered by count):

 leader_pid |  pid   |    wait_event     | count
------------+--------+-------------------+-------
     468682 | 468858 | VacuumDelay       |   548
     468682 | 468862 | VacuumDelay       |   547
     468682 | 468859 | VacuumDelay       |   547
     468682 | 468860 | VacuumDelay       |   545
     468682 | 468857 | VacuumDelay       |   543
     468682 | 468861 | VacuumDelay       |   542
            | 468682 | VacuumDelay       |   378
            | 468682 | ParallelFinish    |   182
     468682 | 468861 | WALWriteLock      |    19
     468682 | 468857 | WALWriteLock      |    19
     468682 | 468859 | WALWriteLock      |    18
     468682 | 468858 | WALWriteLock      |    16
     468682 | 468860 | WALWriteLock      |    15
     468682 | 468862 | WALWriteLock      |    15
     468682 | 468862 | CPU               |    12
     468682 | 468857 | CPU               |    10
     468682 | 468859 | CPU               |    10
     468682 | 468861 | CPU               |    10
            | 468682 | CPU               |     9
     468682 | 468860 | CPU               |     9
     468682 | 468860 | WalSync           |     8
            | 468682 | WALWriteLock      |     7
     468682 | 468858 | WalSync           |     6
     468682 | 468858 | CPU               |     6
     468682 | 468862 | WalSync           |     3
     468682 | 468857 | WalSync           |     3
     468682 | 468861 | WalSync           |     3
     468682 | 468859 | WalSync           |     2
     468682 | 468861 | WalWrite          |     2
     468682 | 468857 | WalWrite          |     1
     468682 | 468858 | WalWrite          |     1
     468682 | 468861 | WALBufMappingLock |     1
     468682 | 468857 | WALBufMappingLock |     1
            | 468682 | WALBufMappingLock |     1

====================== Observations ===========================================

Compared to v2:

1. scanning heap time is about the same
2. vacuuming indexes time is about 3 minutes longer on master
3. vacuuming heap time is about the same

One difference we can see in the sampling is that on master the "ParallelFinish"
wait event has been sampled about 182 times for the leader (so that could be _the_
3 minutes of interest).

On master the vacuum indexes phase has been running between 2024-06-13 10:11:34
and 2024-06-13 10:21:15. If I extract the exact minutes and the counts for the
"ParallelFinish" wait event I get:

 minute |   wait_event   | count
--------+----------------+-------
     18 | ParallelFinish |    48
     19 | ParallelFinish |    60
     20 | ParallelFinish |    60
     21 | ParallelFinish |    14

So it's likely that the leader waited on ParallelFinish for about 3 minutes
at the end of the vacuuming indexes phase (as this wait appeared during
consecutive samples).

====================== Conclusion =============================================

1. During the scanning heap and vacuuming heap phases no noticeable performance
degradation has been observed with v2 applied (as compared to master) (cc'ing
Robert as it's also related to his question about a noticeable hit when everything
is in-memory in [1]).

2. During the vacuuming indexes phase, v2 has been faster (as compared to master).
The reason is that on master the leader has been waiting for about 3 minutes
on "ParallelFinish" at the end.

====================== Remarks ================================================

As v2 is attached, please find below a summary about the current state of this
thread:

1. v2 implements delay_time as the actual wait time (and not the intended wait
time as proposed in v1).

2. some measurements have been done to check the impact of this new
instrumentation (see this email and [2]): no noticeable performance degradation
has been observed (surprisingly, quite the opposite, as mentioned above).

3. there is an ongoing discussion about exposing the number of waits [2].

4. there is an ongoing discussion about exposing the effective cost limit [3].

5. it could be interesting to have a closer look at why the leader is waiting
for 3 minutes on "ParallelFinish" on master and not with v2 applied (but that's
probably out of scope for this thread).

[1]:
https://www.postgresql.org/message-id/CA%2BTgmoZiC%3DzeCDYuMpB%2BGb2yK%3DrTQCGMu0VoxehocKyHxr9Erg%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZmmOOPwMFIltkdsN%40ip-10-97-1-34.eu-west-3.compute.internal
[3]: https://www.postgresql.org/message-id/Zml9%2Bu37iS7DFkJL%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Thu, Jun 13, 2024 at 11:56:26AM +0000, Bertrand Drouvot wrote:
> ====================== Observations ===========================================
> 
> Compared to v2:
> 
> 1. scanning heap time is about the same
> 2. vacuuming indexes time is about 3 minutes longer on master
> 3. vacuuming heap time is about the same

I had a closer look to understand why the vacuuming indexes time has been about
3 minutes longer on master.

During the vacuuming indexes phase, the leader is helping vacuum the indexes
until it reaches WaitForParallelWorkersToFinish() (i.e., when all the remaining
indexes are currently handled by the parallel workers, the leader has nothing
more to do and so it waits for the parallel workers to finish).

During the time the leader process is involved in index vacuuming, it is
also subject to waits due to cost delay.

But with v2 applied, the leader may be interrupted by the parallel workers while
it is waiting (due to the new pgstat_progress_parallel_incr_param() calls that
the patch is adding).

So, with v2 applied, the leader waits less (as it is interrupted while waiting)
when it is involved in index vacuuming, and that's why v2 is "faster" than
master.

To put some numbers on this, I counted the number of times the leader's pg_usleep()
has been interrupted (by counting the number of times nanosleep() returned a
value < 0 in pg_usleep()). Here they are:

v2: the leader has been interrupted about 342605 times
master: the leader has been interrupted about 36 times

The ones on master are mainly coming from the pgstat_progress_parallel_incr_param() 
calls in parallel_vacuum_process_one_index().

The additional ones on v2 are coming from the pgstat_progress_parallel_incr_param()
calls added in vacuum_delay_point().
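
For the record, the counting was done with a throwaway hack inside pg_usleep(),
roughly along these lines (an illustration only, not something meant to be
committed). On non-Windows platforms pg_usleep() already builds a struct
timespec and calls nanosleep(), which returns a value < 0 when the sleep is
interrupted by a signal:

    struct timespec delay;

    delay.tv_sec = microsec / 1000000L;
    delay.tv_nsec = (microsec % 1000000L) * 1000;

    if (nanosleep(&delay, NULL) < 0)
        interrupted_sleeps++;   /* hypothetical debug counter (EINTR case) */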

======== Conclusion ======

1. vacuuming indexes time has been longer on master because with v2, the leader
has been interrupted 342605 times while waiting, thus making v2 "faster".

2. the leader being interrupted while waiting is also already happening on master
due to the pgstat_progress_parallel_incr_param() calls in
parallel_vacuum_process_one_index() (that have been added in 
46ebdfe164). It has been the case "only" 36 times during my test case.

I think that 2. is less of a concern but I think that 1. is something that needs
to be addressed because the leader process is not honouring its cost delay wait
time in a noticeable way (at least during my test case).

I did not think of a proposal yet, just sharing my investigation as to why
v2 has been faster than master during the vacuuming indexes phase.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Sat, Jun 22, 2024 at 12:48:33PM +0000, Bertrand Drouvot wrote:
> 1. vacuuming indexes time has been longer on master because with v2, the leader
> has been interrupted 342605 times while waiting, then making v2 "faster".
> 
> 2. the leader being interrupted while waiting is also already happening on master
> due to the pgstat_progress_parallel_incr_param() calls in
> parallel_vacuum_process_one_index() (that have been added in 
> 46ebdfe164). It has been the case "only" 36 times during my test case.
> 
> I think that 2. is less of a concern but I think that 1. is something that needs
> to be addressed because the leader process is not honouring its cost delay wait
> time in a noticeable way (at least during my test case).
> 
> I did not think of a proposal yet, just sharing my investigation as to why
> v2 has been faster than master during the vacuuming indexes phase.

I think that a reasonable approach is to make the reporting from the parallel
workers to the leader less aggressive (i.e., occur less frequently).

Please find attached v3, that:

- ensures that there is at least 1 second between 2 reports, per parallel worker,
to the leader.

- ensures that the reported delayed time is still correct (keeps track of the
delayed time between 2 reports).

- does not add any extra pg_clock_gettime_ns() calls (as compared to v2).
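
To illustrate the idea, here is a rough sketch of the per-worker reporting logic
(an illustration only, not the v3 code itself; in particular this sketch takes an
extra clock reading, which the actual patch avoids as mentioned above):

    /* in vacuum_delay_point(), where msec is the time we just slept */
    #define WORKER_DELAY_REPORT_INTERVAL_MS 1000

    static instr_time  last_report_time;               /* time of the last report */
    static int64       delay_since_last_report = 0;    /* delayed time, in ms */

    if (IsParallelWorker())
    {
        instr_time  elapsed;

        delay_since_last_report += msec;

        INSTR_TIME_SET_CURRENT(elapsed);
        INSTR_TIME_SUBTRACT(elapsed, last_report_time);

        if (INSTR_TIME_GET_MILLISEC(elapsed) >= WORKER_DELAY_REPORT_INTERVAL_MS)
        {
            /* at most one report per second, per worker, to the leader */
            pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                                delay_since_last_report);
            delay_since_last_report = 0;
            INSTR_TIME_SET_CURRENT(last_report_time);
        }
    }
    else
        pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);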

Remarks:

1. Having a time based only approach to throttle the reporting of the parallel
workers sounds reasonable. I don't think that the number of parallel workers has
to come into play as:

 1.1) the more parallel workers are used, the less the impact of the leader on
 the vacuum index phase duration/workload is (because the repartition is done
 on more processes).

 1.2) the fewer parallel workers there are, the less the leader will be interrupted
 (fewer parallel workers would report their delayed time).

2. The throttling is not based on the cost limit as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The throttling is not based on the actual cost delay value because the leader
could be interrupted at the beginning, the middle or whatever part of the wait, and
we are more interested in the frequency of the interrupts.

4. A 1 second reporting "throttling" looks like a reasonable threshold as:

 4.1 the idea is to have a significant impact when the leader could have been
interrupted, say, hundreds/thousands of times per second.

 4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).

With this approach in place (v3 attached applied), during my test case:

- the leader has been interrupted about 2500 times (instead of about 345000
times with v2)

- the vacuum index phase duration is very close to the master one (it has been
4 seconds faster (over an 8 minutes 40 seconds run), instead of 3
minutes faster with v2).

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Track the amount of time waiting due to cost_delay

From
"Imseih (AWS), Sami"
Date:
>> 2. the leader being interrupted while waiting is also already happening on master
>> due to the pgstat_progress_parallel_incr_param() calls in
>> parallel_vacuum_process_one_index() (that have been added in
>> 46ebdfe164). It has been the case "only" 36 times during my test case.

46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports
progress, and we currently don't handle interrupts by restarting the sleep with
the remaining time. nanosleep does provide the ability to restart with the remaining
time [1], but I don't think it's worth the effort to ensure more accurate
vacuum delays for the leader process.
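
For reference, the restart idiom from [1] would look something like this (a
sketch only; as said above, probably not worth doing here):

    struct timespec delay,
                    remain;

    delay.tv_sec = microsec / 1000000L;
    delay.tv_nsec = (microsec % 1000000L) * 1000;

    /* if interrupted by a signal, go back to sleep for the remaining time */
    while (nanosleep(&delay, &remain) < 0 && errno == EINTR)
        delay = remain;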


> 1. Having a time based only approach to throttle 

I do agree with a time based approach overall.


> 1.1) the more parallel workers are used, the less the impact of the leader on
> the vacuum index phase duration/workload is (because the repartition is done
> on more processes).

Did you mean "because the vacuum is done on more processes"?

When a leader is operating on a large index (or indexes) during the entirety
of the vacuum operation, wouldn't more parallel workers end up
interrupting the leader more often? This is why I think reporting even less
often than every 1 second (more below) will be better.

> 4. A 1 second reporting "throttling" looks like a reasonable threshold as:

> 4.1 the idea is to have a significant impact when the leader could have been
> interrupted, say, hundreds/thousands of times per second.

> 4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
> multiple times per second (so a one second reporting granularity seems ok).

I feel 1 second may still be too frequent. 
What about 10 seconds ( or 30 seconds )? 
I think this metric in particular will be mainly useful for vacuum runs that are 
running for minutes or more, making reporting every 10 or 30 seconds 
still useful.

It also just occurred to me that pgstat_progress_parallel_incr_param
should have a code comment noting that it will interrupt the leader process and
cause activity such as a sleep to end early.



Regards,

Sami Imseih
Amazon Web Services (AWS)


[1] https://man7.org/linux/man-pages/man2/nanosleep.2.html



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 25, 2024 at 01:12:16AM +0000, Imseih (AWS), Sami wrote:

Thanks for the feedback!

> >> 2. the leader being interrupted while waiting is also already happening on master
> >> due to the pgstat_progress_parallel_incr_param() calls in
> >> parallel_vacuum_process_one_index() (that have been added in
> >> 46ebdfe164). It has been the case "only" 36 times during my test case.
> 
> 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports
> progress, and we currently don't handle interrupts by restarting the sleep with
> the remaining time. nanosleep does provide the ability to restart with the remaining
> time [1], but I don't think it's worth the effort to ensure more accurate
> vacuum delays for the leader process. 

+1. I don't think it's necessary to have a 100% accurate delay for all the
times the delay is involved. I think that's a heuristic parameter (along
with the cost limit). What matters in the end is by how much you've been able to
pause the whole vacuum (and not on a sleep-by-sleep basis).

> > 1. Having a time based only approach to throttle 
> 
> I do agree with a time based approach overall.
> 
> 
> > 1.1) the more parallel workers are used, the less the impact of the leader on
> > the vacuum index phase duration/workload is (because the repartition is done
> > on more processes).
> 
> Did you mean " because the vacuum is done on more processes"? 

Yes.

> When a leader is operating on a large index(s) during the entirety
> of the vacuum operation, wouldn't more parallel workers end up
> interrupting the leader more often?

That's right but my point was about the impact on the "whole" duration time and
"whole" workload (leader + workers included) and not about the number of times the
leader is interrupted. If there is say 100 workers then interrupting the leader
(1 process out of 101) is probably less of an issue as it means that there is a
lot of work to be done to have those 100 workers busy. I don't think the size of
the index the leader is vacuuming has an impact. I think that having the leader
vacuuming a 100 GB index or 100 x 1GB indexes is the same (as long as all the
other workers are actives during all that time).

> > 3. A 1 second reporting "throttling" looks a reasonable threshold as:
> 
> > 3.1 the idea is to have a significant impact when the leader could have been
> > interrupted say hundred/thousand times per second.
> 
> > 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum
> > multiple times per second (so a one second reporting granularity seems ok).
> 
> I feel 1 second may still be too frequent. 

Maybe we'll need more measurements, but here is what my test case is made of:

vacuum_cost_delay = 1
vacuum_cost_limit = 10
8 parallel workers, 1 leader
21 indexes (about 1GB each, one 40MB), all in memory

which led to:

With a 1-second reporting frequency, the leader has been interrupted about 2500
times over 8m39s, leading to about the same duration as on master (8m43s).

> What about 10 seconds ( or 30 seconds )? 

I'm not sure (we may need more measurements), but it would probably complicate the
reporting a bit (with the current v3 we'd miss reporting for indexes that
take less time than the threshold to complete).

> I think this metric in particular will be mainly useful for vacuum runs that are 
> running for minutes or more, making reporting every 10 or 30 seconds 
> still useful.

Agree. OTOH, one could be interested in diagnosing what happened during a, say,
5-second peak in I/O resource consumption/latency. Sampling pg_stat_progress_vacuum
at a 1-second interval and seeing by how much the vacuum has been paused during that
time could help too (especially if a lot of parallel workers are involved, which
could lead to a lot of I/O). But we would miss data if we report at a coarser
interval.

> It just occurred to me also that pgstat_progress_parallel_incr_param 
> should have a code comment that it will interrupt a leader process and
> cause activity such as a sleep to end early.

Good point, I'll add a comment for it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Masahiko Sawada
Date:
On Mon, Jun 24, 2024 at 7:50 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Sat, Jun 22, 2024 at 12:48:33PM +0000, Bertrand Drouvot wrote:
> > 1. vacuuming indexes time has been longer on master because with v2, the leader
> > has been interrupted 342605 times while waiting, then making v2 "faster".
> >
> > 2. the leader being interrupted while waiting is also already happening on master
> > due to the pgstat_progress_parallel_incr_param() calls in
> > parallel_vacuum_process_one_index() (that have been added in
> > 46ebdfe164). It has been the case "only" 36 times during my test case.
> >
> > I think that 2. is less of a concern but I think that 1. is something that needs
> > to be addressed because the leader process is not honouring its cost delay wait
> > time in a noticeable way (at least during my test case).
> >
> > I did not think of a proposal yet, just sharing my investigation as to why
> > v2 has been faster than master during the vacuuming indexes phase.

Thank you for benchmarking and analyzing the results! I agree with
your analysis and was surprised by the fact that the more often the
workers go to sleep, the more often the leader wakes up.

>
> I think that a reasonable approach is to make the reporting from the parallel
> workers to the leader less aggressive (means occur less frequently).
>
> Please find attached v3, that:
>
> - ensures that there is at least 1 second between 2 reports, per parallel worker,
> to the leader.
>
> - ensures that the reported delayed time is still correct (keep track of the
> delayed time between 2 reports).
>
> - does not add any extra pg_clock_gettime_ns() calls (as compare to v2).
>

Sounds good to me. I think it's better to keep the logic for
throttling the delay reports simple. It's an important
consideration, but running parallel vacuum with cost delays is less
likely to be used in practice.
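
To make the idea concrete, here is a rough sketch of what such per-worker
throttling could look like (an illustration only, not the actual v3 code;
the function name report_vacuum_delay and the constant
WORKER_REPORT_INTERVAL_MS are made up for the example):

    #define WORKER_REPORT_INTERVAL_MS   1000    /* hypothetical name */

    static double nap_time_since_last_report = 0;

    /*
     * Accumulate the nap time locally and only push it to the leader once
     * at least one second has been accumulated, so the leader is
     * interrupted far less often.
     */
    static void
    report_vacuum_delay(double msec)
    {
        if (VacuumSharedCostBalance != NULL)
        {
            /* parallel vacuum: throttle the reports sent to the leader */
            nap_time_since_last_report += msec;
            if (nap_time_since_last_report >= WORKER_REPORT_INTERVAL_MS)
            {
                pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                                    nap_time_since_last_report);
                nap_time_since_last_report = 0;
            }
        }
        else
            pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
    }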

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
"Imseih (AWS), Sami"
Date:
> 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports
> progress, and we currently don't handle interrupts by restarting the sleep with
> the remaining time. nanosleep does provide the ability to restart with the remaining
> time [1], but I don't think it's worth the effort to ensure more accurate
> vacuum delays for the leader process. 

After discussing offline with Bertrand, it may be better to have
a solution that deals with the interrupts and allows the sleep to continue to
completion. This will simplify this patch and will be useful
for other cases in which parallel workers need to send a message
to the leader. This is the thread [1] for that discussion.

[1]
https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000%40email.amazonses.com

Regards,

Sami



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Fri, Jun 28, 2024 at 08:07:39PM +0000, Imseih (AWS), Sami wrote:
> > 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports
> > progress, and we currently don't handle interrupts by restarting the sleep with
> > the remaining time. nanosleep does provide the ability to restart with the remaining
> > time [1], but I don't think it's worth the effort to ensure more accurate
> > vacuum delays for the leader process. 
> 
> After discussing offline with Bertrand, it may be better to have 
> a solution to deal with the interrupts and allows the sleep to continue to
> completion. This will simplify this patch and will be useful 
> for other cases in which parallel workers need to send a message
> to the leader. This is the thread [1] for that discussion.
> 
> [1]
https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000%40email.amazonses.com
> 

Yeah, I think it would make sense to put this thread on hold until we know more
about the outcome of [1] (the thread you mentioned above).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Track the amount of time waiting due to cost_delay

From
Nathan Bossart
Date:
On Thu, Sep 05, 2024 at 04:59:54AM +0000, Bertrand Drouvot wrote:
> Please find attached v6, a mandatory rebase due to catversion bump conflict.
> I'm removing the catversion bump from the patch as it generates too frequent
> conflicts (just mention it needs to be done in the commit message).

v6 looks generally reasonable to me.  I think the
nap_time_since_last_report variable needs to be marked static, though.

One thing that occurs to me is that this information may not be
particularly useful when parallel workers are used.  Without parallelism,
it's easy enough to figure out the percentage of time that your VACUUM is
spending asleep, but when there are parallel workers, it may be hard to
deduce much of anything from the value.  I'm not sure that this is a
deal-breaker for the patch, though, if for no other reason than it'll most
likely be used for autovacuum, which doesn't use parallel vacuum yet.

If there are no other concerns, I'll plan on committing this one soon after
a bit of editorialization.

-- 
nathan



Re: Track the amount of time waiting due to cost_delay

From
Bertrand Drouvot
Date:
Hi,

On Wed, Sep 18, 2024 at 04:04:53PM -0500, Nathan Bossart wrote:
> On Thu, Sep 05, 2024 at 04:59:54AM +0000, Bertrand Drouvot wrote:
> > Please find attached v6, a mandatory rebase due to catversion bump conflict.
> > I'm removing the catversion bump from the patch as it generates too frequent
> > conflicts (just mention it needs to be done in the commit message).
> 
> v6 looks generally reasonable to me.

Thanks for looking at it!

> I think the
> nap_time_since_last_report variable needs to be marked static, though.

Agree.

> One thing that occurs to me is that this information may not be
> particularly useful when parallel workers are used.  Without parallelism,
> it's easy enough to figure out the percentage of time that your VACUUM is
> spending asleep, but when there are parallel workers, it may be hard to
> deduce much of anything from the value.

I think that if the number of parallel workers being used is the same across
runs then one can measure "accurately" the impact of some changes (setting
vacuum_cost_delay=... for example) on the delay. Without the patch one could just
guess, as many other factors could impact the vacuum duration (load on the system,
i/o latency,...).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com