Re: Why does [auto-]vacuum delay not report a wait event? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Why does [auto-]vacuum delay not report a wait event?
Date
Msg-id 20200322002457.dvepqpj4legcmuxn@alap3.anarazel.de
Whole thread Raw
In response to Re: Why does [auto-]vacuum delay not report a wait event?  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: Why does [auto-]vacuum delay not report a wait event?  (Peter Geoghegan <pg@bowt.ie>)
Re: Why does [auto-]vacuum delay not report a wait event?  (Justin Pryzby <pryzby@telsasoft.com>)
Re: Why does [auto-]vacuum delay not report a wait event?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

> On Thu, Mar 19, 2020 at 03:44:49PM -0700, Andres Freund wrote:
> > But uh, unfortunately the vacuum delay code just sleeps without setting
> > a wait event:
> ...
> > Seems like it should instead use a new wait event in the PG_WAIT_TIMEOUT
> > class?
> > 
> > Given how frequently we run into trouble with [auto]vacuum throttling
> > being a problem, and there not being any way to monitor that currently,
> > that seems like it'd be a significant improvement, given the effort?
> 
> I see that pg_sleep sets WAIT_EVENT_PG_SLEEP, but pg_usleep doesn't, I guess
> because the overhead is significant for a small number of usecs, and because it
> doesn't work for pg_usleep to set a generic event if callers want to be able to
> set a more specific wait event.

I don't think the overhead is a meaningful issue - compared to yielding
to the kernel / context switching, setting the wait event isn't a
significant cost.

I think the issue is more the second part - it's used as part of other
things using their own wait events potentially.

I think we should just rip out pg_usleep and replace it with latch
waits. While the case at hand is user configurable (but the max wait
time is 100ms, so it's not too bad), it's generally not great to sleep
without ensuring that interrupts are handled.  Nor is it great that we
don't sleep again if the sleep is interrupted.  There may be a case or
two where we don't want to layer ontop of latches (perhaps the spinlock
delay loop?), but pretty much everywhere else a different routine would
make more sense.


> Also, I noticed that SLEEP_ON_ASSERT comment (31338352b) wants to use pg_usleep
> "which seems too short.".  Surely it should use pg_sleep, added at 782eefc58 a
> few years later ?

I don't see problem with using sleep here?


> Also, there was a suggestion recently that this should have a separate
> vacuum_progress phase:
> |vacuumlazy.c:#define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL    50 /* ms */
> |vacuumlazy.c:pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L);
> 
> I was planning to look at that eventually ; should it have a wait event instead
> or in addition ?

A separate phase? How would that look like? We don't want to replace the
knowledge that currently e.g. the heap scan is in progress?



> > It'd probably also be helpful to report the total time [auto]vacuum
> > spent being delayed for vacuum verbose/autovacuum logging, but imo
> > that'd be a parallel feature to a wait event, not a replacement.
> 
> This requires wider changes than I anticipated.
> 
> 2020-03-20 22:35:51.308 CDT [16534] LOG:  automatic aggressive vacuum of table "template1.pg_catalog.pg_class": index
scans:1
 
>         pages: 0 removed, 11 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 6 removed, 405 remain, 0 are dead but not yet removable, oldest xmin: 1574
>         buffer usage: 76 hits, 7 misses, 8 dirtied
>         avg read rate: 16.378 MB/s, avg write rate: 18.718 MB/s
>         Cost-based delay: 2 msec
>         system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 
> VACUUM VERBOSE wouldn't normally be run with cost_delay > 0, so that field will
> typically be zero, so I made it conditional

I personally dislike conditional output like that, because it makes
parsing the output harder.


> , which is supposedly why it's written like that, even though that's
> not otherwise being used since 17eaae98.

Well, it's also just hard to otherwise manage this long translatable
strings. And we're essentially still conditional, due to the 'msgfmt'
branches - if the whole output were output in a single appendStringInfo
call, we'd have to duplicate all the following format strings too.

Personally I really wish we'd just merge the vacuum verbose and the
autovacuum lgoging code, even if it causes some temporary pain.


On 2020-03-20 23:07:51 -0500, Justin Pryzby wrote:
> From 68c5ad8c7a9feb0c68afad310e3f52c21c3cdbaf Mon Sep 17 00:00:00 2001
> From: Justin Pryzby <pryzbyj@telsasoft.com>
> Date: Fri, 20 Mar 2020 20:47:30 -0500
> Subject: [PATCH v1 1/2] Report wait event for cost-based vacuum delay
> 
> ---
>  doc/src/sgml/monitoring.sgml    | 2 ++
>  src/backend/commands/vacuum.c   | 2 ++
>  src/backend/postmaster/pgstat.c | 3 +++
>  src/include/pgstat.h            | 3 ++-
>  4 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 5bffdcce10..46c99a55b7 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -1507,6 +1507,8 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
>            (<filename>pg_wal</filename>, archive or stream) before trying
>            again to retrieve WAL data, at recovery.
>           </entry>
> +         <entry><literal>VacuumDelay</literal></entry>
> +         <entry>Waiting in a cost-based vacuum delay point.</entry>
>          </row>
>          <row>
>           <entry morerows="68"><literal>IO</literal></entry>
> diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
> index d625d17bf4..59731d687f 100644
> --- a/src/backend/commands/vacuum.c
> +++ b/src/backend/commands/vacuum.c
> @@ -2019,7 +2019,9 @@ vacuum_delay_point(void)
>          if (msec > VacuumCostDelay * 4)
>              msec = VacuumCostDelay * 4;
>  
> +        pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
>          pg_usleep((long) (msec * 1000));
> +        pgstat_report_wait_end();
>  
>          VacuumCostBalance = 0;
>  
> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> index d29c211a76..742ec07b59 100644
> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -3824,6 +3824,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
>          case WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL:
>              event_name = "RecoveryRetrieveRetryInterval";
>              break;
> +        case WAIT_EVENT_VACUUM_DELAY:
> +            event_name = "VacuumDelay";
> +            break;
>              /* no default case, so that compiler will warn */
>      }
>  
> diff --git a/src/include/pgstat.h b/src/include/pgstat.h
> index 851d0a7246..4db40e23cc 100644
> --- a/src/include/pgstat.h
> +++ b/src/include/pgstat.h
> @@ -848,7 +848,8 @@ typedef enum
>      WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
>      WAIT_EVENT_PG_SLEEP,
>      WAIT_EVENT_RECOVERY_APPLY_DELAY,
> -    WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL
> +    WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
> +    WAIT_EVENT_VACUUM_DELAY,
>  } WaitEventTimeout;

Looks good to me - unless somebody protests I'm going to apply this
shortly.


> From 8153d909cabb94474890f0b55c7733f33923e3c5 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby <pryzbyj@telsasoft.com>
> Date: Fri, 20 Mar 2020 22:08:09 -0500
> Subject: [PATCH v1 2/2] vacuum to report time spent in cost-based delay
> 
> ---
>  src/backend/access/gin/ginfast.c      |  6 +++---
>  src/backend/access/gin/ginvacuum.c    |  6 +++---
>  src/backend/access/gist/gistvacuum.c  |  2 +-
>  src/backend/access/hash/hash.c        |  2 +-
>  src/backend/access/heap/vacuumlazy.c  | 17 +++++++++++------
>  src/backend/access/nbtree/nbtree.c    |  2 +-
>  src/backend/access/spgist/spgvacuum.c |  4 ++--
>  src/backend/commands/vacuum.c         |  8 ++++++--
>  src/include/access/genam.h            |  1 +
>  src/include/commands/vacuum.h         |  2 +-
>  10 files changed, 30 insertions(+), 20 deletions(-)
> 
> diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
> index 11d7ec067a..c99dc4a8be 100644
> --- a/src/backend/access/gin/ginfast.c
> +++ b/src/backend/access/gin/ginfast.c
> @@ -892,7 +892,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
>           */
>          processPendingPage(&accum, &datums, page, FirstOffsetNumber);
>  
> -        vacuum_delay_point();
> +        stats->delay_msec += vacuum_delay_point();
>  
>          /*
>           * Is it time to flush memory to disk?    Flush if we are at the end of
> @@ -929,7 +929,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
>              {
>                  ginEntryInsert(ginstate, attnum, key, category,
>                                 list, nlist, NULL);
> -                vacuum_delay_point();
> +                stats->delay_msec += vacuum_delay_point();
>              }
>  
>              /*
> @@ -1002,7 +1002,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
>          /*
>           * Read next page in pending list
>           */
> -        vacuum_delay_point();
> +        stats->delay_msec += vacuum_delay_point();
>          buffer = ReadBuffer(index, blkno);
>          LockBuffer(buffer, GIN_SHARE);
>          page = BufferGetPage(buffer);

On a green field I'd really like to pass a 'vacuum state' struct to
vacuum_delay_point(). But that likely would be too invasive to add,
because it seems like it'd would have to be wired to a number of
functions that can be used by extensions etc (like the bulk delete
callbacks).  Then we'd just have vacuum_delay_point() internally sum up
the waiting time.

Given the current style of vacuum_delay_point() calculating everything
via globals (which I hate), it might be less painful to do this by
adding another global to track the sleeps via a global alongside
VacuumCostBalance?


> diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
> index 4871b7ff4d..86a9c7fdaa 100644
> --- a/src/backend/access/hash/hash.c
> +++ b/src/backend/access/hash/hash.c
> @@ -709,7 +709,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
>          bool        retain_pin = false;
>          bool        clear_dead_marking = false;
>  
> -        vacuum_delay_point();
> +        // XXX stats->delay_msec += vacuum_delay_point();
>  
>          page = BufferGetPage(buf);
>          opaque = (HashPageOpaque) PageGetSpecialPointer(page);

I assume this is because there's no stats object reachable here? Should
still continue to call vacuum_delay_point ;)

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: ssl passphrase callback
Next
From: Jeff Davis
Date:
Subject: Re: Additional size of hash table is alway zero for hash aggregates