Re: Proposal for Resumable Vacuum (again ...) - Mailing list pgsql-hackers

From Jay
Subject Re: Proposal for Resumable Vacuum (again ...)
Date
Msg-id CAPdcCKrcou6UvnQV0d1_E_XpLnC3ugnO+8x8WdADS7h3gARNuw@mail.gmail.com
Whole thread Raw
In response to Proposal for Resumable Vacuum (again ...)  (Jay <jsudrikoss@gmail.com>)
List pgsql-hackers
Hi All,

A revised proposal with a few minor corrections (thanks to wolfgang20121964@yahoo.de for pointing out the error):

<Start> 

I am aware of a few previous attempts and discussions on this topic that were eventually shelved or didn't materialize:


Still, I want to revive this topic for its obvious benefits.

I do not have a patch or code changes ready. The changes could be tricky and will take effort, possibly some refactoring. Hence, before starting the work, I would like to get the proposal reviewed and build consensus, to avoid redundant effort.

Why do we need it? 

As more and more large businesses adopt PostgreSQL, it is only fair that we make essential utilities like vacuum more manageable and scalable. Data sizes will only grow, and maintenance windows will shrink as businesses operate 24x7 across time zones. Making the database more manageable with the least overhead is going to be a pressing need.

To avoid repetition and duplicate effort, I have picked up the snippet below from a previous email conversation on the list (Ref: https://www.postgresql.org/message-id/45E2A6AE.1080805@oss.ntt.co.jp)

<Quote>
For a large table, although it can be vacuumed by enabling vacuum cost-based delay, the processing may last for several days (maybe hours). It definitely has a negative effect on system performance. So if systems which have maintenance time, it is preferred to vacuum in the maintenance window. Vacuum tasks can be split into small subtasks, and they can be scheduled into maintenance window time slots. This can reduce the impact of vacuum to system service.

But currently vacuum tasks can not be split: if an interrupt or error occurs during vacuum processing, vacuum totally forgets what it has done and terminates itself. Following vacuum on the same table has to scan from the beginning of the heap block. This proposal enable vacuum has capability to stop and resume.
</Quote> 

External Interface

This feature is especially useful when tables are quite large, their bloat is high, and vacuum runs are expected to take a long time.

Ref: https://www.postgresql.org/docs/current/sql-vacuum.html
VACUUM [ ( option [, ...] [, { FOR TIME = hh:mm | RESUME [ FOR TIME = hh:mm ] } ] ) ] [ table_and_columns [, ...] ]

The additional options give the flexibility to run vacuum for a limited time and stop, or to resume a vacuum from the point where it was previously stopped.

When vacuum is invoked with the ‘for time ...’ option, it will periodically persist to disk the intermediate state of the dead tuples accumulated as it progresses. It will run for the specified time and stop after that duration.

When a vacuum invoked with the ‘for time ...’ option stops automatically after the specified time, or is interrupted manually, and vacuum is next invoked with the ‘resume’ option, it will check the stored state of the last run and start as closely as possible to where it left off, avoiding repeated work.

When resumed, it will either run for the specified duration again (if one is given) or run to completion (if not).

When vacuum is invoked with the ‘resume’ option and there is no earlier incomplete run with the ‘for time’ option, the ‘resume’ option will be ignored, with a message in the error log.

When vacuum is invoked without the ‘for time’ or ‘resume’ options after preceding incomplete runs with those options, the persisted data from the previous runs is discarded and deleted. This is important because successive runs with ‘for time’ or ‘resume’ assume the persisted data is valid: no run in between has invalidated it, and the state of the heap pages with respect to vacuum is unchanged for the saved vacuum horizon.

In the rest of this proposal, vacuum invoked with the ‘for time’ or ‘resume’ options described above is referred to as ‘resumable vacuum’.
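To illustrate the external interface, the proposed invocations could look like the following. This is a sketch of the hypothetical syntax above; the feature does not exist yet, and the exact option spellings are assumptions:

```sql
-- Run vacuum for at most 2 hours, persisting progress to disk as it goes
VACUUM (VERBOSE, FOR TIME = 02:00) big_table;

-- Later: resume from the saved state and run for another hour
VACUUM (VERBOSE, RESUME FOR TIME = 01:00) big_table;

-- Resume and run to completion (no duration given)
VACUUM (VERBOSE, RESUME) big_table;

-- A plain vacuum discards and deletes any persisted state
-- from earlier partial runs on this table
VACUUM big_table;
```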

Internal Changes (High level)

For each table, vacuum progresses in the following steps or phases (taken from the documentation)
https://www.postgresql.org/docs/current/progress-reporting.html#VACUUM-PHASES :

1. Initializing -   VACUUM is preparing to begin scanning the heap. This phase is expected to be very brief.

2. Scanning heap - VACUUM is currently scanning the heap. It will prune and defragment each page if required, and possibly perform freezing activity. The heap_blks_scanned column can be used to monitor the progress of the scan.

3. Vacuuming Indexes - VACUUM is currently vacuuming the indexes. If a table has any indexes, this will happen at least once per vacuum, after the heap has been completely scanned. It may happen multiple times per vacuum if maintenance_work_mem (or, in the case of autovacuum, autovacuum_work_mem if set) is insufficient to store the number of dead tuples found.

4. Vacuuming Heap - VACUUM is currently vacuuming the heap. Vacuuming the heap is distinct from scanning the heap, and occurs after each instance of vacuuming indexes. If heap_blks_scanned is less than heap_blks_total, the system will return to scanning the heap after this phase is completed; otherwise, it will begin cleaning up indexes after this phase is completed. 

5. Cleaning up indexes - VACUUM is currently cleaning up indexes. This occurs after the heap has been completely scanned and all vacuuming of the indexes and the heap has been completed. 

6. Truncating heap - VACUUM is currently truncating the heap so as to return empty pages at the end of the relation to the operating system. This occurs after cleaning up indexes.

7. Performing final cleanup - VACUUM is performing final cleanup. During this phase, VACUUM will vacuum the free space map, update statistics in pg_class, and report statistics to the cumulative statistics system. When this phase is completed, VACUUM will end.

During its run, resumable vacuum will store the following information on disk:
  1. Database Name/Id, Table Name/Oid
  2. Vacuum horizon (established during the ‘initializing’ phase)
  3. Phase of its run (from the list above)
  4. Array of dead tuple ids accumulated in the ‘heap scan’ phase
  5. If the phase is #4 or later above, the progress of vacuum within the dead tuple id array (the index of the tuple id up to which processing is done, refreshed with some frequency, e.g. per heap page)
  6. Number of times the dead tuple id array spilled over due to memory limitations
Of the above, #3 to #6 will be updated as the run progresses.

When it is resumed, vacuum will first check the persisted information from the last run. It will:
  1. Retrieve all the stored information on disk from the last run (listed above)
  2. Check what phase the vacuum was in when the last run stopped
  3. Adapt its further actions based on that phase:
    1. If the previous run stopped in the initialization phase, the new run starts from scratch.
    2. If the previous run stopped during the heap scan phase, it will use the persisted array of dead tids and start the scan from the last accumulated dead tuple.
    3. If the previous run stopped during the ‘vacuuming indexes’ phase, it will start this phase over, from the beginning of the stored dead tuple id array.
    4. If the previous run stopped during the ‘vacuuming heap’ phase and there were no previous spillovers, it will continue vacuuming the heap for the rest of the dead tuple array and proceed to the further phases. If there were previous spillovers, it will continue vacuuming the heap for the rest of the dead tuple array, then return to phase 2 if the heap scan is still incomplete, following the loop that already exists.
    5. If the previous run stopped during any of the remaining phases, it will simply complete the remaining work and exit.

Highlights of the proposal

  • Indexes can change across runs. Note that no re-run above depends on the last state of the indexes; this proposal neither depends on nor stores the previous state of the indexes.
  • This approach adds no overhead to the DML code path. The changes are limited to the vacuum operation and are just enough to make it resumable. There are no drastic changes to the regular flow.
  • This approach doesn’t change the core functions apart from conditionally persisting the vacuum progress information to disk. Thus, any future enhancements to the core functions can easily be accommodated.

Please review and comment so that we can conclude whether this looks like a reasonable enhancement.

<End>

On Sun, Mar 24, 2024 at 9:57 AM Jay <jsudrikoss@gmail.com> wrote:
