Re: Add index scan progress to pg_stat_progress_vacuum - Mailing list pgsql-hackers

From Bossart, Nathan
Subject Re: Add index scan progress to pg_stat_progress_vacuum
Date
Msg-id 7874FB21-FAA5-49BD-8386-2866552656C7@amazon.com
Whole thread Raw
In response to Add index scan progress to pg_stat_progress_vacuum  ("Imseih (AWS), Sami" <simseih@amazon.com>)
Responses Re: Add index scan progress to pg_stat_progress_vacuum
Re: Add index scan progress to pg_stat_progress_vacuum
List pgsql-hackers
On 12/1/21, 3:02 PM, "Imseih (AWS), Sami" <simseih@amazon.com> wrote:
> The current implementation of pg_stat_progress_vacuum does not
> provide progress on which index is being vacuumed making it
> difficult for a user to determine if the "vacuuming indexes" phase
> is making progress. By exposing which index is being scanned as well
> as the total progress the scan has made for the current cycle, a
> user can make better estimations on when the vacuum will complete.

+1

> The proposed patch adds 4 new columns to pg_stat_progress_vacuum:
>
> 1. indrelid - the relid of the index being vacuumed
> 2. index_blks_total - total number of blocks to be scanned in the
> current cycle
> 3. index_blks_scanned - number of blocks scanned in the current
> cycle
> 4. leader_pid - if the pid for the pg_stat_progress_vacuum entry is
> a leader or a vacuum worker. This patch places an entry for every
> worker pid ( if parallel ) as well as the leader pid

nitpick: Shouldn't index_blks_scanned be index_blks_vacuumed?  IMO it
is more analogous to heap_blks_vacuumed.

This will tell us which indexes are currently being vacuumed and the
current progress of those operations, but it doesn't tell us which
indexes have already been vacuumed or which ones are pending vacuum.
I think such information is necessary to truly understand the current
progress of vacuuming indexes, and I can think of a couple of ways we
might provide it:

  1. Make the new columns you've proposed return arrays.  This isn't
     very clean, but it would keep all the information for a given
     vacuum operation in a single row.  The indrelids column would be
     populated with all the indexes that have been vacuumed, need to
     be vacuumed, or are presently being vacuumed.  The other index-
     related columns would then have the associated stats and the
     worker PID (which might be the same as the pid column depending
     on whether parallel index vacuum was being done).  Alternatively,
     the index column could have an array of records, each containing
     all the information for a given index.
  2. Create a new view for just index vacuum progress information.
     This would have similar information as 1.  There would be an
     entry for each index that has been vacuumed, needs to be
     vacuumed, or is currently being vacuumed.  And there would be an
     easy way to join with pg_stat_progress_vacuum (e.g., leader_pid,
     which again might be the same as our index vacuum PID depending
     on whether we were doing parallel index vacuum).  Note that it
     would be possible for the PID of these entries to be null before
     and after we process the index.
  3. Instead of adding columns to pg_stat_progress_vacuum, adjust the
     current ones to be more general, and then add new entries for
     each of the indexes that have been, need to be, or currently are
     being vacuumed.  This is the most similar option to your current
     proposal, but instead of introducing a column like
     index_blks_total, we'd rename heap_blks_total to blks_total and
     use that for both the heap and indexes.  I think we'd still want
     to add a leader_pid column.  Again, we have to be prepared for
     the PID to be null in this case.  Or we could just make the pid
     column always refer to the leader, and we could introduce a
     worker_pid column.  That might create confusion, though.

I wish option #1 was cleaner, because I think it would be really nice
to have all this information in a single row.  However, I don't expect
much support for a 3-dimensional view, so I suspect option #2
(creating a separate view for index vacuum progress) is the way to go.
The other benefit of option #2 versus option #3 or your original
proposal is that it cleanly separates the top-level vacuum operations
and the index vacuum operations, which are related at the moment, but
which might not always be tied so closely together.

Nathan


pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Support for NSS as a libpq TLS backend
Next
From: Daniel Gustafsson
Date:
Subject: Re: Support for NSS as a libpq TLS backend