Thread: Resumable vacuum proposal and design overview

Resumable vacuum proposal and design overview

From
Galy Lee
Date:
Hi

We are developing a new feature for vacuum, here is a brief overview
about it.

Introduction
------------

A) What is it?

This feature makes vacuum resumable. Vacuum remembers the point at
which it stopped, and the next vacuum resumes the interrupted operation
from that point.

The SQL syntax for this feature looks like:

  # VACUUM ALL  --> same functionality as the current VACUUM
  # VACUUM      --> uses the information saved by a previously
                    interrupted vacuum to continue processing

B) Why do we need it?

A large table can be vacuumed with cost-based delay enabled, but the
processing may then last for several days, which clearly has a negative
effect on system performance. On systems that have a maintenance
window, it is preferable to vacuum inside that window: the vacuum task
can be split into small subtasks and scheduled into maintenance window
time slots, which reduces the impact of vacuum on the system's service.

Currently, however, a vacuum task cannot be split: if an interrupt or
error occurs during vacuum processing, vacuum forgets everything it has
done and terminates. The next vacuum on the same table has to scan
again from the beginning of the heap. This proposal gives vacuum the
capability to stop and resume.

C) How can we use it?

This feature allows autovacuum or a cron-based vacuum scheduler to
implement more sophisticated vacuum scheduling schemes built around a
*maintenance window*.

For example, if the system has a two-hour maintenance window every day,
vacuum can be interrupted this way:

  SET statement_timeout TO 7200000;  -- two hours (2*3600*1000 ms)
  VACUUM FREEZE tablename;
  SET statement_timeout TO DEFAULT;

Or it can be interrupted by SIGINT directly.

Autovacuum or a manual vacuum scheduler can use this feature to split a
large vacuum task into small subtasks, which can then be scheduled
according to the system load.

Design Overview
---------------

A) Vacuum internal overview

Concurrent vacuum mainly performs the following steps to vacuum a table:

  1. scan the heap to collect the dead tuple list
  2. (if the table has indexes) scan and sweep the indexes
  3. sweep the dead tuples collected in step 1
  4. perform additional index cleanup operations
  5. (if enough free space is found at the end) truncate the table
  6. register free space in the FSM
  7. update the statistics of the table

If maintenance memory is not sufficient to hold all of the dead tuples,
steps 1-3 are repeated.
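
To make the repetition concrete, here is a small self-contained C
sketch (illustrative only; the table size, dead-tuple density, and
memory budget are assumptions, and none of this is taken from the
PostgreSQL sources) that counts how many times steps 2-3 run for a
given maintenance memory budget:

    #include <stdio.h>

    /* Illustrative simulation, not PostgreSQL code: each 6-byte TID of
     * a dead tuple is collected until the maintenance memory budget is
     * full, at which point the indexes and heap are swept (steps 2-3)
     * and the heap scan (step 1) continues from the next block. */
    int main(void)
    {
        const long total_blocks   = 1000000;   /* assumed table size    */
        const long dead_per_block = 20;        /* assumed dead tuples   */
        const long mem_bytes      = 64L * 1024 * 1024;  /* 64MB budget  */
        const long tid_bytes      = 6;         /* block(4) + offset(2)  */
        const long max_tids       = mem_bytes / tid_bytes;

        long collected = 0, cycles = 0;

        for (long blk = 0; blk < total_blocks; blk++)
        {
            collected += dead_per_block;       /* step 1: scan heap     */
            if (collected >= max_tids)
            {
                cycles++;                      /* steps 2-3: sweep      */
                collected = 0;
            }
        }
        if (collected > 0)
            cycles++;                          /* final partial cycle   */

        printf("index/heap sweep cycles: %ld\n", cycles);
        return 0;
    }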

C) Where to stop

The first option is to accept a stop request at the boundary of each
step, and to resume from the unfinished step. But some steps take a
short time and some take a long time: for example, step 1 takes more
than 40% of the total time, while steps 5 and 6 take less than 1%.
This granularity is too coarse.

The second option is to accept a stop request before vacuum begins to
process each block in steps 1-4. In the current vacuum implementation,
vacuum_delay_point is already placed at such locations. This option has
much better granularity than option 1.

This implementation accepts stop requests at the *block level* in steps 1-4.
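
As a rough illustration of a block-level stop (a hypothetical,
self-contained sketch; the flag and function names are stand-ins, and
the real check would sit next to vacuum_delay_point):

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical sketch: a stop flag (a plain variable here; in a
     * real backend it might be set from a signal handler or shared
     * memory) is checked once per heap block, at the same spots where
     * vacuum_delay_point() is called today.  When set, the scan stops
     * and the block to resume from is remembered. */
    static volatile bool got_stop_request = false;

    static long scan_heap(long start_block, long total_blocks)
    {
        for (long blk = start_block; blk < total_blocks; blk++)
        {
            if (got_stop_request)
                return blk;            /* resume point: this block      */

            /* ... collect dead tuples from block blk (step 1) ... */

            if (blk == 1234)           /* simulate a request arriving   */
                got_stop_request = true;
        }
        return total_blocks;           /* scanned everything            */
    }

    int main(void)
    {
        long resume_at = scan_heap(0, 100000);
        printf("stopped; next vacuum resumes heap scan at block %ld\n",
               resume_at);
        return 0;
    }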

D) How to stop and resume

- stop:
    When vacuum stops in steps 1-4, it performs the following: vacuum
    saves the dead tuple list, the heap block number at which it
    stopped, the unswept index relations, the unswept dead tuples, and
    the FreezeLimit into a disk file.

    Free space information collected in steps 1-3 can be registered in
    the FSM when vacuum is interrupted.

- resume:
    When vacuum resumes, it reads back the saved information, skips the
    finished operations, and then continues with the remaining
    operations.
    There are two additional issues which need to be discussed here:

    *) FreezeLimit.
       There are two options when selecting the FreezeLimit for a
       resumed vacuum: (a) the FreezeLimit of the interrupted vacuum,
       or (b) the FreezeLimit of the resuming vacuum. FreezeLimit (b)
       is safe, but because the heap blocks are not fully scanned under
       it, relfrozenxid should still be updated with FreezeLimit (a) at
       the end of vacuum, and CLOG can only be truncated up to
       FreezeLimit (a).

    *) Resuming index operations.
       There are two possible resume levels when vacuum is interrupted
       in step 2: (a) skip the *index relations* which have been swept
       completely, or (b) skip the *index blocks* which have been
       swept. Level (a) is safe and simple to implement; level (b)
       needs to consider scenarios where a leaf page is split, and
       further investigation is needed to clarify whether it is safe.

       This implementation adopts *level (a) resuming*.

Implementation Plan
-------------------

We are working on the patch now; I will send the WIP patch to the list
later. I am sorry for this late proposal, but I hope it can go into 8.3.

Your comments and ideas are welcome.

Best Regards
Galy Lee (lee.galy@oss.ntt.co.jp)
NTT Open Source Software Center


Re: Resumable vacuum proposal and design overview

From
"Simon Riggs"
Date:
On Mon, 2007-02-26 at 18:21 +0900, Galy Lee wrote:

> This implementation accepts stop request at *blocks level* in step 1-4.
> 
> D) How to stop and resume
> 
> - stop:
> 
>     When vacuum stop in step 1-4, vacuum perform following things:
>     vacuum saves dead tuple list, the heap block number on which it
>     stop, unswept index relations, unswept dead tuple and FreezeLimit
>     into a disk file.
> 
>     Free space information collected in step 1-3 can be registered to
>     FSM when vacuum is interrupted.
> 
> - resume:
> 
>     When vacuum is resuming, it reads out saved information and skip the
>     finished operations, then continue to finish remaining operations.

I understand and agree with the need to reduce the time taken for long
VACUUM operations. Doing VACUUM in stages makes a lot of sense to me.

What I'm not sure about is the stop/resume logic you've presented.
Stopping this at *any* delay point means you have to dump the dead tuple
list and reload it again later. That means we can't really trust the
list we are given when we resume - the user may provide an incorrect or
old dead tuple list. If the system manages the dead tuple list we may
need to keep such files around for long periods, which doesn't sound
great either.

ISTM simpler to make the optional stop/restart point be after one full
cycle of cleaning, so exactly at the point where we discard one tuple
list and we move on the next. That way, we have no need to store and
reload the dead tuple list, with all of the complexity that involves.
The only thing we'd need to remember would be the block number that the
last partial VACUUM had reached, which would be simple to store in the
catalog. Restarting VACUUM would then occur from the stored block
number, not from the first block. This approach seems like a
stepping-stone on the way to the full proposal you've suggested, so you
could extend further if still required.

Interrupting a running VACUUM could be made possible by having VACUUM
check for a smart stop request at normal vacuum delay points. This would
be made with a system function: e.g. stop_vacuums(). Once a smart stop
request is received, it would finish the rest of its current cycle
before ending. If in step 1, it would not scan further, but would
complete steps 2 and 3 for the dead tuples harvested so far. The smart
stop request would then be a system-wide request for manually run
VACUUMs to end as soon as possible, because normal operation is about to
re-commence.
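
A rough sketch of how such a smart stop could behave (illustrative
only; stop_vacuums() does not exist today, and the flag below is a
stand-in):

    #include <stdio.h>
    #include <stdbool.h>

    /* Illustrative only.  A "smart stop" flag (which something like a
     * hypothetical stop_vacuums() function would set system-wide) is
     * checked at the normal delay points.  Once set, the heap scan
     * stops harvesting, but steps 2-3 still run for the dead tuples
     * gathered so far, so no partial state has to be written out. */
    static volatile bool smart_stop_requested = false;

    int main(void)
    {
        const long total_blocks = 100000;
        long harvested = 0;

        for (long blk = 0; blk < total_blocks; blk++)
        {
            if (smart_stop_requested)
                break;                  /* stop scanning further        */

            harvested += 5;             /* step 1: collect dead tuples  */

            if (blk == 30000)           /* simulate stop_vacuums()      */
                smart_stop_requested = true;
        }

        /* steps 2-3 always complete for what was harvested */
        printf("cleaning indexes and heap for %ld dead tuples, then exiting\n",
               harvested);
        return 0;
    }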

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Resumable vacuum proposal and design overview

From
tomas@tuxteam.de
Date:

On Mon, Feb 26, 2007 at 06:21:50PM +0900, Galy Lee wrote:
> Hi
> 
> We are developing a new feature for vacuum, here is a brief overview
> about it.

[...]

> Concurrent vacuum mainly has the following steps to vacuum a table:
> 
>   1. scan heap to collect dead tuple list
>   2. (if the table has indexes) scan and sweep indexes
>   3. sweep dead tuples collected in step 1
>   4. perform additional index cleanup operation
>   5. (if a certain of free space found) truncate table
>   6. register free space to FSM
>   7. update statistics of the table

[...]

> This implementation accepts stop request at *blocks level* in step 1-4.

WARNING: I don't really know what I'm talking about -- but considering
that vacuuming a large table across several "maintenance windows" is
one of the envisioned scenarios, it might make sense to try to update
the statistics (i.e. to do partially step 7) on each partial run.
Otherwise the table's stats might wander off for quite long times?

Regards
-- tomás



Re: Resumable vacuum proposal and design overview

From
Tom Lane
Date:
tomas@tuxteam.de writes:
> WARNING: I don't really know what I'm talking about -- but considering
> that vaccuming a large table across several "maintainance windows" is
> one of the envisioned scenarios, it might make sense to try to update
> the statistics (i.e. to do partially step 7) on each partial run.
> Otherwise the table's stats might wander off for quite long times?

You can handle that by issuing ANALYZE as a separate operation; I see no
need to complicate this proposal even further by having it make magic
changes to the behavior of VACUUM ANALYZE.

Or were you speaking of the pg_class.reltuples count?  Keeping that
from diverging indefinitely far from reality will indeed be a problem
with this sort of thing.  We've already seen some issues related to the
stats collector's similar count.
        regards, tom lane


Re: Resumable vacuum proposal and design overview

From
Galy Lee
Date:
Simon Riggs wrote:
>>old dead tuple list. If the system manages the dead tuple list we may
>>need to keep such files around for long periods, which doesn't sound
>>great either.

The system manages such files. The files are kept in a location like
$PGDATA/pg_vacuum. They are removed when CLUSTER, DROP TABLE, ALTER
TABLE, VACUUM, etc. change the physical layout of the heap. I think
this is a reasonable way.

>>ISTM simpler to make the optional stop/restart point be after one full
>>cycle of cleaning, so exactly at the point where we discard one tuple
>>list and we move on the next.

Let me just summarize your ideas here:

  Where to stop?
   - stop after one full cycle of cleaning has finished

  How to stop?
   - when stopping in step 1, go straight to steps 2 and 3 to clean
     the dead tuples harvested so far
   - when stopping in step 2 or step 3, vacuum ignores the stop
     request and finishes all of the steps

  Merit:
   - since it does not require an external file to store the dead
     tuple list, it simplifies the implementation

  Demerit:
   - the time between the stop request being issued and vacuum
     actually stopping is still quite unpredictable (considering that
     step 2 sometimes takes 2 or 3 hours, and step 3 may take as long
     as step 1)
   - it is difficult to fit vacuum into a maintenance window frame

The point here is: *how long* a wait can we accept for vacuum to stop?

For example, there is one table:
   - The table is a hundreds GBs table.
   - It takes 4-8 hours to vacuum such a large table.
   - Enabling cost-based delay may make it last for 24 hours.
   - It can be vacuumed during night time for 2-4 hours.

It is true that there is no strict requirement that vacuum be
interruptible immediately, but it should stop in a *predictable way*.
In the above example, if we have to wait for the end of one full cycle
of cleaning, it may take up to 8 hours for vacuum to stop after it has
received the stop request. That seems quite unacceptable.





Re: Resumable vacuum proposal and design overview

From
Tom Lane
Date:
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> For example, there is one table:
>    - The table is a hundreds GBs table.
>    - It takes 4-8 hours to vacuum such a large table.
>    - Enabling cost-based delay may make it last for 24 hours.
>    - It can be vacuumed during night time for 2-4 hours.

> It is true there is no such restrict requirement that vacuum
> need to be interrupt immediately, but it should be stopped in an
> *predictable way*. In the above example, if we have to wait for the end
>  of one full cycle of cleaning, it may take up to 8 hours for vacuum to
> stop after it has received stop request. This seems quit unacceptable.

I think you misunderstood what Simon means by "cycle", because you are
claiming that one cycle == one complete table VACUUM; if that were so
then what he is proposing would be exactly the status quo.  What he's
proposing (which is the same thing I was going to say, until I saw he'd
beat me to it) is that you should be prepared to stop after any one
cycle of fill-work-mem-and-clean-indexes.  This should not take that
long given the current physical-scan-order implementation of
btbulkdelete, especially if you don't set maintenance_work_mem too
large.  Moreover, it avoids a boatload of risks associated with assuming
that a batch of TIDs you've hidden somewhere are still valid.  It's not
hard to come up with scenarios where you could be discarding live tuples
because of reliance on a stale TID-list file.  State that consists only
of a next-heap-page-to-scan is just a whole lot more robust.
        regards, tom lane


Re: Resumable vacuum proposal and design overview

From
"Jim C. Nasby"
Date:
On Tue, Feb 27, 2007 at 11:44:28AM +0900, Galy Lee wrote:
> For example, there is one table:
>    - The table is a hundreds GBs table.
>    - It takes 4-8 hours to vacuum such a large table.
>    - Enabling cost-based delay may make it last for 24 hours.
>    - It can be vacuumed during night time for 2-4 hours.
> 
> It is true there is no such restrict requirement that vacuum
> need to be interrupt immediately, but it should be stopped in an
> *predictable way*. In the above example, if we have to wait for the end
>  of one full cycle of cleaning, it may take up to 8 hours for vacuum to
> stop after it has received stop request. This seems quit unacceptable.

Even with very large tables, you could likely still fit things into a
specific time frame by adjusting how much time is spent scanning for
dead tuples. The idea would be to give vacuum a target run time, and it
would monitor how much time it had remaining, taking into account how
long it should take to scan the indexes based on how long it's been
taking to scan the heap. When the amount of time left becomes less than
the estimate of the amount of time required to scan the indexes (and
clean the heap), you stop the heap scan and start scanning indexes. As
long as the IO workload on the machine doesn't vary wildly between the
heap scan and the rest of the vacuum process, I would expect this to
work out fairly well.
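
A back-of-the-envelope sketch of that heuristic (every name and number
below is made up for illustration, not taken from any patch):

    #include <stdio.h>

    /* Illustrative sketch of the time-budget idea: while scanning the
     * heap, keep estimating how long the index scan + heap sweep will
     * take, and stop the heap scan once the remaining budget only just
     * covers that estimate. */
    int main(void)
    {
        const double budget_sec        = 2.0 * 3600;  /* 2h window       */
        const double sec_per_heap_page = 0.004;       /* measured so far */
        const double index_pages       = 500000;      /* all indexes     */
        const double sec_per_idx_page  = 0.003;       /* assumed rate    */
        const double total_heap_pages  = 3000000;

        double elapsed = 0, cleanup_estimate = 0;
        long   page;

        for (page = 0; page < (long) total_heap_pages; page++)
        {
            /* index scan cost + second heap pass over the pages scanned
             * so far (0.2 is an assumed relative cost of that pass) */
            cleanup_estimate = index_pages * sec_per_idx_page
                               + page * sec_per_heap_page * 0.2;
            if (budget_sec - elapsed <= cleanup_estimate)
                break;                      /* switch to index scan now */
            elapsed += sec_per_heap_page;   /* step 1 work on this page */
        }

        printf("stop heap scan at page %ld of %.0f "
               "(%.0f s scanning, %.0f s reserved for cleanup)\n",
               page, total_heap_pages, elapsed, cleanup_estimate);
        return 0;
    }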

While not as nice as the ability to 'stop on a dime' as Tom puts it,
this would be much easier and safer to implement. If there's still a
need for something better after that we could revisit it at that time.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Resumable vacuum proposal and design overview

From
"Simon Riggs"
Date:
On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote:
> On Tue, Feb 27, 2007 at 11:44:28AM +0900, Galy Lee wrote:
> > For example, there is one table:
> >    - The table is a hundreds GBs table.
> >    - It takes 4-8 hours to vacuum such a large table.
> >    - Enabling cost-based delay may make it last for 24 hours.
> >    - It can be vacuumed during night time for 2-4 hours.
> > 
> > It is true there is no such restrict requirement that vacuum
> > need to be interrupt immediately, but it should be stopped in an
> > *predictable way*. In the above example, if we have to wait for the end
> >  of one full cycle of cleaning, it may take up to 8 hours for vacuum to
> > stop after it has received stop request. This seems quit unacceptable.
> 
> Even with very large tables, you could likely still fit things into a
> specific time frame by adjusting how much time is spent scanning for
> dead tuples. The idea would be to give vacuum a target run time, and it
> would monitor how much time it had remaining, taking into account how
> long it should take to scan the indexes based on how long it's been
> taking to scan the heap. When the amount of time left becomes less than
> the estimate of the amount of time required to scan the indexes (and
> clean the heap), you stop the heap scan and start scanning indexes. As
> long as the IO workload on the machine doesn't vary wildly between the
> heap scan and the rest of the vacuum process, I would expect this to
> work out fairly well.
> 
> While not as nice as the ability to 'stop on a dime' as Tom puts it,
> this would be much easier and safer to implement. If there's still a
> need for something better after that we could revisit it at that time.

I do like this idea, but it also seems easy to calculate that bit
yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how
long it takes to finish. Adjust X until you have it right.

If we did want to automate it, a vacuum_target_duration userset GUC
would make sense, following Jim's thought. =0 would mean
run-to-completion.

Getting it to work well for VACUUM FULL would be more than a little
interesting though.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Resumable vacuum proposal and design overview

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote:
>> ... The idea would be to give vacuum a target run time, and it
>> would monitor how much time it had remaining, taking into account how
>> long it should take to scan the indexes based on how long it's been
>> taking to scan the heap. When the amount of time left becomes less than
>> the estimate of the amount of time required to scan the indexes (and
>> clean the heap), you stop the heap scan and start scanning indexes.

> I do like this idea, but it also seems easy to calculate that bit
> yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how
> long it takes to finish. Adjust X until you have it right.

One problem with it is that a too-small target would result in vacuum
proceeding to scan indexes after having accumulated only a few dead
tuples, resulting in increases (potentially enormous ones) in the total
work needed to vacuum the table completely.

I think it's sufficient to have two cases: abort now, and restart from
the last cycle-completion point next time (this would basically just be
SIGINT); or set a flag to stop at the next cycle-completion point.


It occurs to me that we may be thinking about this the wrong way
entirely.  Perhaps a more useful answer to the problem of using a
defined maintenance window is to allow VACUUM to respond to changes in
the vacuum cost delay settings on-the-fly.  So when your window closes,
you don't abandon your work so far, you just throttle your I/O rate back
to whatever's considered acceptable for daytime vacuuming.
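
As a toy illustration of what reacting on-the-fly could look like
(nothing here is backend code; SIGHUP and a single variable stand in
for a real postgresql.conf reload):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Toy illustration only: the point is just that the delay setting
     * is re-read inside the work loop, not only when idle. */
    static volatile sig_atomic_t reload_pending = 0;

    static void hup_handler(int signo)
    {
        (void) signo;
        reload_pending = 1;
    }

    int main(void)
    {
        unsigned delay_ms = 20;              /* "vacuum_cost_delay"      */

        signal(SIGHUP, hup_handler);
        printf("pid %d: send SIGHUP to simulate a config reload\n",
               (int) getpid());

        for (int block = 0; block < 600; block++)
        {
            if (reload_pending)              /* re-read settings mid-run */
            {
                reload_pending = 0;
                delay_ms = 200;              /* pretend the new conf says so */
                printf("reloaded at block %d: delay now %u ms\n",
                       block, delay_ms);
            }
            /* cost-based delay point for this block */
            usleep(delay_ms * 1000);
        }
        return 0;
    }
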
        regards, tom lane


Re: Resumable vacuum proposal and design overview

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote:
>>> ... The idea would be to give vacuum a target run time, and it
>>> would monitor how much time it had remaining, taking into account how
>>> long it should take to scan the indexes based on how long it's been
>>> taking to scan the heap. When the amount of time left becomes less than
>>> the estimate of the amount of time required to scan the indexes (and
>>> clean the heap), you stop the heap scan and start scanning indexes.
> 
>> I do like this idea, but it also seems easy to calculate that bit
>> yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how
>> long it takes to finish. Adjust X until you have it right.
> 
> One problem with it is that a too-small target would result in vacuum
> proceeding to scan indexes after having accumulated only a few dead
> tuples, resulting in increases (potentially enormous ones) in the total
> work needed to vacuum the table completely.
> 
> I think it's sufficient to have two cases: abort now, and restart from
> the last cycle-completion point next time (this would basically just be
> SIGINT); or set a flag to stop at the next cycle-completion point.
> 
> 
> It occurs to me that we may be thinking about this the wrong way
> entirely.  Perhaps a more useful answer to the problem of using a
> defined maintenance window is to allow VACUUM to respond to changes in
> the vacuum cost delay settings on-the-fly.  So when your window closes,
> you don't abandon your work so far, you just throttle your I/O rate back
> to whatever's considered acceptable for daytime vacuuming.

I thought we already did that?  Which BTW was part of my plan on how to 
deal with a vacuum that is still running after its maintenance window 
has expired.


Re: Resumable vacuum proposal and design overview

From
Tom Lane
Date:
"Matthew T. O'Connor" <matthew@tocr.com> writes:
> Tom Lane wrote:
>> It occurs to me that we may be thinking about this the wrong way
>> entirely.  Perhaps a more useful answer to the problem of using a
>> defined maintenance window is to allow VACUUM to respond to changes in
>> the vacuum cost delay settings on-the-fly.  So when your window closes,
>> you don't abandon your work so far, you just throttle your I/O rate back
>> to whatever's considered acceptable for daytime vacuuming.

> I thought we already did that?

No, we only react to SIGHUP when idle.  I think that's a good policy for
standard backends, but for autovacuum it might be appropriate to check
more often.
        regards, tom lane


Re: Resumable vacuum proposal and design overview

From
"Simon Riggs"
Date:
On Tue, 2007-02-27 at 12:23 -0500, Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@tocr.com> writes:
> > Tom Lane wrote:
> >> It occurs to me that we may be thinking about this the wrong way
> >> entirely.  Perhaps a more useful answer to the problem of using a
> >> defined maintenance window is to allow VACUUM to respond to changes in
> >> the vacuum cost delay settings on-the-fly.  So when your window closes,
> >> you don't abandon your work so far, you just throttle your I/O rate back
> >> to whatever's considered acceptable for daytime vacuuming.
> 
> > I thought we already did that?
> 
> No, we only react to SIGHUP when idle.  I think that's a good policy for
> standard backends, but for autovacuum it might be appropriate to check
> more often.

You mean react to changes while in the middle of a VACUUM? Sounds like a
great idea to me.

Not sure why you'd do that just for autovacuum though. Sounds like it
would be good in all cases. Vacuum and autovacuum have different vacuum
delays, after all.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Resumable vacuum proposal and design overview

From
Galy Lee
Date:
Tom Lane wrote:
>One problem with it is that a too-small target would result in vacuum
>proceeding to scan indexes after having accumulated only a few dead
>tuples, resulting in increases (potentially enormous ones) in the total
>work needed to vacuum the table completely.

Yeah. This is also my big concern about your idea and Simon's.
Every vacuum stop causes an index scan, which means the total vacuum
time depends on how many times vacuum has stopped.

>I think it's sufficient to have two cases: abort now, and restart from
>the last cycle-completion point next time (this would basically just be
If there is only one cycle, then there is a problem for this approach.
(If maintenance work memory is not so small, this situation is normal.)

>or set a flag to stop at the next cycle-completion point.

The extra cost of cleaning the indexes may prevent this approach from
working in practice.

>Perhaps a more useful answer to the problem of using a
>defined maintenance window is to allow VACUUM to respond to changes in
>the vacuum cost delay settings on-the-fly.

This is a good idea! Itagaki also talked about exactly the same idea
with me yesterday.

But if we change the parameters on the fly to make vacuum less
aggressive, my concern is: are there any potential problems with
running a vacuum over several days? Although I have no plan to touch
VACUUM FULL, it seems that concurrent VACUUM also holds an exclusive
lock when truncating the table. I am a little worried about this kind
of problem with this approach.

Also, maybe we need a shared memory area to share the cost-delay
parameters between VACUUMs. Or are there any other ideas?



Re: Resumable vacuum proposal and design overview

From
Tom Lane
Date:
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> Tom Lane wrote:
>> ... or set a flag to stop at the next cycle-completion point.

> The extra cost to clean indexes may prevent this approach to work in
> practices.

Huh?  There is no extra cost in what I suggested; it'll perform
exactly the same number of index scans that it would do anyway.

>> Perhaps a more useful answer to the problem of using a
>> defined maintenance window is to allow VACUUM to respond to changes in
>> the vacuum cost delay settings on-the-fly.

> This is a good idea! Itagaki also have talked about exactly the same
> idea to me yesterday.

> But if we change the parameters on-fly to make vacuum less aggressive,
> my concern is that: is there any potential problems to run vacuum in
>  several days?

If the table is sufficiently large, that could happen anyway.  The
issues here, I think, are to not eat resources that foreground processes
need (which vacuum-cost-delay addresses) and to not block vacuuming of
hot-update tables (which can be addressed by allowing multiple autovac
workers).  So I'm not really convinced that being able to stop a table
vacuum halfway is critical.
        regards, tom lane


Re: Resumable vacuum proposal and design overview

From
Galy Lee
Date:

Tom Lane wrote:
> Huh?  There is no extra cost in what I suggested; it'll perform
> exactly the same number of index scans that it would do anyway.

What I wanted to say is this: if we can stop at any point, we can make
maintenance memory large enough to contain all of the dead tuples, and
then the indexes only need to be cleaned once. No matter how many times
vacuum stops, the indexes are cleaned only once.

But in your proposal, the indexes will be scanned as many times as
vacuum stops. That extra index cleaning is the extra cost compared with
the stop-on-a-dime approach. When vacuuming a large table by stopping 8
times, tests show the extra cost can be one third of the stop-on-a-dime
approach.


>So I'm not really convinced that being able to stop a table
> vacuum halfway is critical.
To run vacuum on the same table for a long period, it is critical
to be sure:
1. not to eat resources that foreground processes need
2. not to block vacuuming of hot-updated tables
3. not to block any transaction, and not to block any backup activity

In the current implementation of concurrent vacuum, the third is
obviously not satisfied. The first issue that comes to my mind is
lazy_truncate_heap: it takes an AccessExclusiveLock for a long time,
which is problematic. Unless we change such mechanisms to ensure that
there is no problem with running vacuum on the same table for several
days, we cannot say we don't need to stop halfway.


Best Regards,
-- 
Galy Lee <lee.galy@oss.ntt.co.jp>
NTT Open Source Software Center


Re: Resumable vacuum proposal and design overview

From
Tom Lane
Date:
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> If we can stop at any point, we can make maintenance memory large
> sufficient to contain all of the dead tuples, then we only need to
> clean index for once. No matter how many times vacuum stops,
> indexes are cleaned for once.

I beg your pardon?  You're the one who's been harping on the
table-so-large-it-takes-days-to-vacuum scenario.  How you figure that
you can store all the dead TIDs in working memory?
        regards, tom lane


Re: Resumable vacuum proposal and design overview

From
tomas@tuxteam.de
Date:

On Mon, Feb 26, 2007 at 01:39:40PM -0500, Tom Lane wrote:
[...]
> Or were you speaking of the pg_class.reltuples count?

Yes (modulo my warning, that is)

Regards
-- tomás



Re: Resumable vacuum proposal and design overview

From
"Simon Riggs"
Date:
On Wed, 2007-02-28 at 13:53 +0900, Galy Lee wrote:
> 
> Tom Lane wrote:
> > Huh?  There is no extra cost in what I suggested; it'll perform
> > exactly the same number of index scans that it would do anyway.
> 
> The things I wanted to say is that:
> If we can stop at any point, we can make maintenance memory large
> sufficient to contain all of the dead tuples, then we only need to
> clean index for once. No matter how many times vacuum stops,
> indexes are cleaned for once.

I agree that the cycle-at-a-time approach could perform more poorly with
repeated stop-start. The reason for the suggestion was robustness, not
performance. If you provide the wrong dead-tuple-list to VACUUM, you
will destroy the integrity of a table, which can result in silent data
loss. You haven't explained how saving the dead-tuple-list could be done
in a safe manner and it seems risky to me.

> But in your proposal, indexes will be scan as many as vacuum stops.
> Those extra indexes cleaning are thought as the extra cost compared
> with stop-on-dime approach. To vacuum a large table by stopping 8
> times, tests show the extra cost can be one third of the stop-on-dime
>  approach.

But the VACUUM is being run during your maintenance window, so why do
you care about performance of VACUUM during that time? There is some
inefficiency in the VACUUM process, but seems like a high price to pay
for more robustness. 

Does the loss of efficiency during VACUUM translate directly into
reduced performance during operational periods? I think not.

Deferring completion of VACUUM means deferring refreshing the FSM.
Allowing cycle-at-a-time VACUUM would allow the FSM to be updated after
each run, thus releasing space for reuse again. ISTM that the
saving-dead-list approach would defer the updating of the FSM for many
days in your situation.

If you would like to reduce VACUUM times have you considered
partitioning? It can be very effective at isolating changes and is
designed specifically to cope with large data maintenance issues. If
there are issues that prevent the use of partitioning in your case,
perhaps we should be discussing those instead?

Migration from a non-partitioned environment to a partitioned one is
quite simple from 8.2 onwards.

> >So I'm not really convinced that being able to stop a table
> > vacuum halfway is critical.
> To run vacuum on the same table for a long period, it is critical
> to be sure:
> 1. not to eat resources that foreground processes needs
> 2. not to block vacuuming of hot-updated tables
> 3. not to block any transaction, not to block any backup activities
> 
> In the current implementation of concurrent vacuum, the third is not
> satisfied obviously, the first issue comes to my mind is the
> lazy_truncate_heap, it takes AccessExclusiveLock for a long time,
> that is problematic. 

Are you saying you know for certain this lock is held for a long time,
or are you just saying you think it is? If you have some evidence for
long truncation times then that would be a separate issue of concern,
since that might starve out normal users. Please say more?

ISTM that if you can refresh the FSM more frequently you will have less
need to truncate the relation at the end of each run. After some time, I
would expect that no truncation would be required because of the cyclic
reuse of space within the table, rather than extension/truncation.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Resumable vacuum proposal and design overview

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Wed, 2007-02-28 at 13:53 +0900, Galy Lee wrote:
>> In the current implementation of concurrent vacuum, the third is not
>> satisfied obviously, the first issue comes to my mind is the
>> lazy_truncate_heap, it takes AccessExclusiveLock for a long time,
>> that is problematic. 

> Are you saying you know for certain this lock is held for a long time,
> or are you just saying you think it is? If you have some evidence for
> long truncation times then that would be a separate issue of concern,
> since that might starve out normal users. Please say more?

lazy_truncate_heap does a ConditionalLockAcquire, that is, it won't
succeed in acquiring the exclusive lock if there is any competition.
And I find it hard to believe that it will hold the lock very long
if it does get it --- in most scenarios it won't be possible to remove
very many pages, so the scan won't take long.  (Of course that is
arguably a bug, but until you can fix things so that an average VACUUM
*can* remove a lot of pages, it's hardly profitable to worry about
whether this step creates a concurrency problem.)

So I agree with Simon: if you want us to believe there's a performance
issue here, please present some evidence.
        regards, tom lane


Re: Resumable vacuum proposal and design overview

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Galy Lee <lee.galy@oss.ntt.co.jp> writes:
>> If we can stop at any point, we can make maintenance memory large
>> sufficient to contain all of the dead tuples, then we only need to
>> clean index for once. No matter how many times vacuum stops,
>> indexes are cleaned for once.
> 
> I beg your pardon?  You're the one who's been harping on the
> table-so-large-it-takes-days-to-vacuum scenario.  How you figure that
> you can store all the dead TIDs in working memory?

This reminds me of an idea I had while looking at the bitmap index 
patch: We could store the dead TIDs more efficiently in a bitmap, 
allowing tables to be vacuumed in lesser cycles.

Of course, that's orthogonal to the above discussion.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Resumable vacuum proposal and design overview

From
Galy Lee
Date:

Simon Riggs wrote:
> You haven't explained how saving the dead-tuple-list could be done
> in a safe mannner and it seems risky to me.

The files are placed in a new directory, $PGDATA/pg_vacuum, with the
name spcNode.dbNode.relNode, one file for each relation whose vacuum
has been interrupted.

Each file has a format like:

1. VacStateFileHeader
2. VacStateData
3. Dead Tuple list
4. CRC32

The files are removed
 - when the original physical heap files are removed,
 - when VACUUM FULL has been issued,
 - or after the content has been read into memory,
 - etc.

Is there any potential big risk there? Correct me if I am
wrong.
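
For concreteness, the layout could look roughly like the sketch below
(illustrative only; the field names and exact contents are
placeholders, not the final format):

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative layout only -- names and fields are placeholders. */
    typedef struct VacStateFileHeader
    {
        uint32_t magic;             /* identifies a pg_vacuum state file  */
        uint32_t version;
    } VacStateFileHeader;

    typedef struct VacStateData
    {
        uint32_t spcNode;           /* relation the state belongs to      */
        uint32_t dbNode;
        uint32_t relNode;
        uint32_t next_heap_block;   /* where the heap scan resumes        */
        uint32_t unswept_indexes;   /* indexes still to sweep             */
        uint32_t freeze_limit_xid;  /* FreezeLimit of interrupted vacuum   */
        uint64_t num_dead_tuples;   /* number of entries that follow      */
    } VacStateData;

    typedef struct DeadTupleEntry
    {
        uint32_t block;             /* 6 bytes of TID on disk; the         */
        uint16_t offset;            /* in-memory struct may be padded      */
    } DeadTupleEntry;

    /* The file is: header, state, num_dead_tuples entries, then a CRC32
     * of everything before it, so a torn or stale file can be rejected. */
    int main(void)
    {
        printf("header %zu B, state %zu B, %zu B per dead tuple (in memory)"
               " + 4 B CRC\n",
               sizeof(VacStateFileHeader), sizeof(VacStateData),
               sizeof(DeadTupleEntry));
        return 0;
    }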

> Deferring completion of VACUUM means deferring refreshing the FSM.

I borrowed the code from the DSM patch to merge free space info into
the FSM when vacuum stops.

> Are you saying you know for certain this lock is held for a long time,
> or are you just saying you think it is? If you have some evidence for
> long truncation times then that would be a separate issue of concern,
> since that might starve out normal users. Please say more?

Sorry, I *thought* it was. The benchmark has not shown that kind of
problem anyway. Thanks for the clarification. :)

Regards,
-- 
Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center


Re: Resumable vacuum proposal and design overview

From
"Zeugswetter Andreas ADI SD"
Date:
> > The things I wanted to say is that:
> > If we can stop at any point, we can make maintenance memory large
> > sufficient to contain all of the dead tuples, then we only need to
> > clean index for once. No matter how many times vacuum
> stops, indexes
> > are cleaned for once.
>
> I agree that the cycle-at-a-time approach could perform more
> poorly with repeated stop-start. The reason for the
> suggestion was robustness, not performance. If you provide

It performs more poorly, but it also gives immediate gain, since part of
the table is readily vacuumed. If you do it all in one pass with stop
resume, the first visible effect may be several days after you start
vacuuming. And, basically you need to pretend the vacuum transaction is
still running after the first stop. Else dead tuple reuse ala HOT is not
possible (or the ctid list needs to be reevaluated during resume, which
per se is not efficient).

> the wrong dead-tuple-list to VACUUM, you will destroy the
> integrity of a table, which can result in silent data loss.
> You haven't explained how saving the dead-tuple-list could be
> done in a safe mannner and it seems risky to me.

Agreed. It does not seem possible to do this efficiently.

Andreas



Re: Resumable vacuum proposal and design overview

From
"Zeugswetter Andreas ADI SD"
Date:
> > You haven't explained how saving the dead-tuple-list could be done
in
> > a safe mannner and it seems risky to me.
>
> The files are placed in a new directory $PGDATA/pg_vacuum
> with the name: spcNode.dbNode.relNode for each relations
> which have been interrupted during vacuum.
>
> It has the format likes:
>
> 1. VacStateFileHeader
> 2. VacStateData
> 3. Dead Tuple list
> 4. CRC32
>
> The files are removed
>  - when original physical heap files are removed,
>  - when vacuum full have been issued,
>  - or after the content has been read in memory.
>  - etc.
>
> Is there any potential big risk there? Correct me if I am wrong.

The main risk is not a corrupt file or broken list. The risk is that a
ctid in the list points at a tuple that is not dead anymore. To avoid
that risk you would need to:
1. keep the vacuum lock open
2. leave the vacuum tx open

(or reevaluate visibility of list members upon resume)

Andreas


Re: Resumable vacuum proposal and design overview

From
"Simon Riggs"
Date:
On Wed, 2007-02-28 at 11:19 +0100, Zeugswetter Andreas ADI SD wrote:
> > > The things I wanted to say is that:
> > > If we can stop at any point, we can make maintenance memory large 
> > > sufficient to contain all of the dead tuples, then we only need to 
> > > clean index for once. No matter how many times vacuum 
> > stops, indexes 
> > > are cleaned for once.
> > 
> > I agree that the cycle-at-a-time approach could perform more 
> > poorly with repeated stop-start. The reason for the 
> > suggestion was robustness, not performance. If you provide 
> 
> It performs more poorly, but it also gives immediate gain, since part of
> the table is readily vacuumed. If you do it all in one pass with stop
> resume, the first visible effect may be several days after you start
> vacuuming. 

I think that in itself is enough to tip the scales.

> And, basically you need to pretend the vacuum transaction is
> still running after the first stop. Else dead tuple reuse ala HOT is not
> possible (or the ctid list needs to be reevaluated during resume, which
> per se is not efficient). 

Ah, I see you got there ahead of me. Yes, it would prevent HOT from
performing retail VACUUMs on heap blocks. (I'm not saying HOT will be
accepted/acceptable, but I'd rather not have its acceptability hinge on
a use case that seems so rare).

One proposal that we do still have in discussion is Heikki's patch to
re-evaluate the OldestXmin while the VACUUM runs. That's something we'd
definitely want to do in a restartable VACUUM anyway. But my thought is
that it actually increases quite dramatically the number of dead rows
harvested during VACUUM (a good thing), which is likely to increase the
number of cycles required to complete a large table (no problem, because
of the increased efficiency of the VACUUM). I think there's a strong
argument to make VACUUM refresh rather than rebuild the FSM after each
cycle rather than wait until the end, whether or not we stop/start the
VACUUM. In any long running VACUUM that seems very worthwhile.

Big VACUUM needs big memory. Using huge settings of maintenance_work_mem
dedicated solely to VACUUM seems like it could be a waste of resources
in many cases. It may be much better to allow 1 GB of memory to be used
to cache indexes better, which would improve performance of other
applications, as well as improving the index scan time during VACUUM. So
scanning indexes more often during VACUUM isn't necessarily bad either,
unless your table is truly huge, in which case you should use
partitioning to reduce it.

Galy, please hear that people like your idea and understand your use
case, but just don't like all of the proposal, just the main thrust of
it. The usual way is that 
(people that agree + amount of your exact idea remaining) = 100%

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Resumable vacuum proposal and design overview

From
"Simon Riggs"
Date:
On Wed, 2007-02-28 at 09:38 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> >> If we can stop at any point, we can make maintenance memory large
> >> sufficient to contain all of the dead tuples, then we only need to
> >> clean index for once. No matter how many times vacuum stops,
> >> indexes are cleaned for once.
> > 
> > I beg your pardon?  You're the one who's been harping on the
> > table-so-large-it-takes-days-to-vacuum scenario.  How you figure that
> > you can store all the dead TIDs in working memory?
> 
> This reminds me of an idea I had while looking at the bitmap index 
> patch: We could store the dead TIDs more efficiently in a bitmap, 
> allowing tables to be vacuumed in lesser cycles.
> 
> Of course, that's orthogonal to the above discussion.

I like the idea. 

How much memory would it save during VACUUM on a 1 billion row table
with 200 million dead rows? Would that reduce the number of cycles a
normal non-interrupted VACUUM would perform?

Would it work efficiently for all of the current index AMs? Each index
might use the index slightly differently during cleanup, I'm not sure.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Resumable vacuum proposal and design overview

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> How much memory would it save during VACUUM on a 1 billion row table
> with 200 million dead rows? Would that reduce the number of cycles a
> normal non-interrupted VACUUM would perform?

It would depend on how many dead tuples you have per-page. If you have a very
large table with only one dead tuple per page then it might even be larger.
But in the usual case it would be smaller.

Also note that it would have to be non-lossy.
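
To put rough numbers on that (illustrative arithmetic; the 6 bytes per
TID, ~290 line pointers per 8K page, and 100 rows per page are all
assumptions):

    #include <stdio.h>

    /* Illustrative arithmetic only.  Assumptions: 6 bytes per TID, a
     * non-lossy bitmap with one bit per possible line pointer, roughly
     * 290 line pointers per 8K page, and about 100 rows per page. */
    int main(void)
    {
        const double pages          = 10e6;       /* ~1e9 rows           */
        const double slots_per_page = 290;
        const double tid_bytes      = 6;
        const double dead[2]        = { 200e6, 10e6 };  /* dense, sparse */

        for (int i = 0; i < 2; i++)
        {
            double array_mb  = dead[i] * tid_bytes / (1024 * 1024);
            double bitmap_mb = pages * slots_per_page / 8 / (1024 * 1024);
            printf("%4.0fM dead tuples: TID array %5.0f MB, page bitmap %4.0f MB\n",
                   dead[i] / 1e6, array_mb, bitmap_mb);
        }
        return 0;
    }

With 200M dead tuples the bitmap comes out to roughly a third of the
plain array; with only one dead tuple per page the array is far
smaller, which is exactly the caveat above.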

My only objection to this idea, and it's not really an objection at all, is
that I think we want to head in the direction of making indexes cheaper to
scan and doing the index scan phase more often. That reduces the need for
multiple concurrent vacuums and makes the problem of busy tables getting
starved less of a concern.

That doesn't mean there's any downside to making the dead tuple list take less
memory but I think the upside is limited. As we optimize our index
representations with GII and bitmapped indexes scanning them gets easier and
easier anyways. And you don't really want to wait too long before you get the
benefit of the recovered space in the table.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Resumable vacuum proposal and design overview

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> 
>> How much memory would it save during VACUUM on a 1 billion row table
>> with 200 million dead rows? Would that reduce the number of cycles a
>> normal non-interrupted VACUUM would perform?
> 
> It would depend on how many dead tuples you have per-page. If you have a very
> large table with only one dead tuple per page then it might even be larger.
> But in the usual case it would be smaller.

FWIW, there are some unused bits in the current representation, so it might 
actually be possible to design it so that it's never larger.

One optimization to the current structure, instead of switching to a 
bitmap, would be to store the block number just once for each block, 
followed by a variable length list of offsets. It'd complicate the 
binary search though.
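
A sketch of what that per-block representation could look like, with a
rough size comparison (illustrative only, not a patch):

    #include <stdint.h>
    #include <stdio.h>

    /* Illustration of storing the block number once per block, followed
     * by a count and a variable-length list of 2-byte offsets, instead
     * of a flat array of 6-byte TIDs. */
    typedef struct BlockDeadTuples
    {
        uint32_t block;            /* heap block number, stored once      */
        uint16_t noffsets;         /* number of dead offsets that follow  */
        /* uint16_t offsets[noffsets] follow in the serialized form, so   */
        /* the per-block overhead is 6 bytes when packed.                 */
    } BlockDeadTuples;

    int main(void)
    {
        const double blocks         = 1e6;     /* assumed table size      */
        const double dead_per_block = 20;      /* assumed density         */

        double flat_mb    = blocks * dead_per_block * 6 / (1024 * 1024);
        double grouped_mb = blocks * (6 + dead_per_block * 2) / (1024 * 1024);

        printf("flat TID array %.0f MB, block-grouped %.0f MB\n",
               flat_mb, grouped_mb);
        return 0;
    }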

> Also note that it would have to be non-lossy.

Yep. Or actually, it might be useful to forget some dead tids if it 
allowed you to memorize a larger number of other dead tids. Hmm, what a 
weird thought :).

Another insight I had while thinking about this is that the dead tid 
list behaves quite nicely from a OS memory management point of view. In 
the 1st vacuum phase, the array is filled in sequence, which means that 
the OS can swap out the early parts of it and use the memory for buffer 
cache instead. In the index scan phase, it's randomly accessed, but if 
the table is clustered, it's in fact not completely random access. In 
the 2nd vacuum pass, the array is scanned sequentially again. I'm not 
sure how that works out in practice, but you might want to use a larger 
maintenance_work_mem than you'd think.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Resumable vacuum proposal and design overview

From
Galy Lee
Date:
Simon Riggs wrote:
> Galy, please hear that people like your idea and understand your use
> case, but just don't like all of the proposal, just the main thrust of
> it. The usual way is that 
> (people that agree + amount of your exact idea remaining) = 100%

Thank you. I am glad to hear that. :) But I still can not kill my
idea yet.

Let's come to the core issue we care about: do we need the
stop-on-a-dime feature to stop vacuum immediately? As I said before: if
there are problems with long-running vacuum, then yes, we *do need* to
stop vacuum immediately.

I am still not convinced that we don't have such problems. The
potential problem with running a long vacuum is that it may block some
foreground transactions, like ALTER TABLE; if it is true that a
long-running vacuum blocks some DDL for a long time, that is a big
problem. I think we need to stop vacuum immediately to handle that kind
of problem.

I admit that the implementation is more complex, but I cannot see any
big problem with saving the dead tuples out and reading them in again
(like two-phase commit does). Why do we need to hold the lock and the
transaction? We can release the lock and abandon the transaction ID;
vacuum can take the lock and get a new ID when restarting. Why do we
need to worry about whether a dead tuple is still alive? Only vacuum
will sweep them, and HOT cannot touch the tuple until we have finished
sweeping.

But it is true that in some scenarios it is better to stop after the
cycle has finished, so that the effect of vacuum appears quickly. So I
hope the final design can combine the two.



Re: Resumable vacuum proposal and design overview

From
Tom Lane
Date:
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> Let's come to the core issue we care about: do we need the stop-on-dime
> feature to stop vacuum immediately?  As my previous opinion: if there
> are some problems for long running vacuum, yes we *did need* to stop
> vacuum immediately.

There's always SIGINT.  The question you are avoiding is whether it's
really worth adding a lot of overhead to make vacuum able to stop
without losing some work.

> I admit that the implementation is much complex, but I can not
> see any big problems to save the dead tuples out and read it in
> again(like two phase commit does).

The big problem is that this creates a number of failure scenarios that
didn't exist before.  Your proposal to store the dead-tuple TIDs in a
separate file scares the heck out of me: there are any number of ways
for that to get out-of-sync with the underlying relation.  If you don't
see the point here, go back to re-read the documentation about PITR and
how one creates a base backup and exactly why that behavior is proof
against crashes.
        regards, tom lane


Re: Resumable vacuum proposal and design overview

From
"Zeugswetter Andreas ADI SD"
Date:
> I admit that the implementation is much complex, but I can
> not see any big problems to save the dead tuples out and read
> it in again(like two phase commit does). Why do we need to
> hold the lock and transaction? We can open the lock and
> abandon the transaction ID, vacuum can take the lock and get

> a new ID when restarting. Why do we need to worry about if
> the dead tuple is still alive, only vacuum will sweep them,
> HOT can not touch the tuple until we have finished sweeping.

One imho important (not necessarily mandatory) aspect of HOT is, that it
does parts of what vacuum would usually do.

Thus:
    1. resume, load ctid list
    2. continue filling ctid list
    3. remove index tuples for these ctids (* problem *)


You have just removed index entries for possibly now live tuples that
have been reused by HOT.
Unless ...

Andreas



Re: Resumable vacuum proposal and design overview

From
"Zeugswetter Andreas ADI SD"
Date:
> One imho important (not necessarily mandatory) aspect of HOT
> is, that it does parts of what vacuum would usually do.
>
> Thus:
>     1. resume, load ctid list
>     2. continue filling ctid list
>     3. remove index tuples for these ctids (* problem *)
>
> You have just removed index entries for possibly now live
> tuples that have been reused by HOT.

Sorry, my error: this is no problem, since HOT can only reuse a slot
that has no direct index entries (part of an old HOT chain).
So with HOT it is only important that the first heap scan ignores
(does not add to the list) HOT-chained tuples.

So, I still don't feel comfortable with the idea of breaking the
visibility rules (using a ctid that is days old and globalxmin not
knowing), even if I do not currently see a direct problem that cannot be
worked around (like removing all lists upon startup recovery).

Andreas


Re: Resumable vacuum proposal and design overview

From
"Jim C. Nasby"
Date:
On Wed, Feb 28, 2007 at 10:14:24PM +0000, Heikki Linnakangas wrote:
> cache instead. In the index scan phase, it's randomly accessed, but if 
> the table is clustered, it's in fact not completely random access. In 
> the 2nd vacuum pass, the array is scanned sequentially again. I'm not 

Only if there's only one index on the table... otherwise I'd argue that
you're much less likely to be searching the TID list incrementally.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Resumable vacuum proposal and design overview

From
Heikki Linnakangas
Date:
Jim C. Nasby wrote:
> On Wed, Feb 28, 2007 at 10:14:24PM +0000, Heikki Linnakangas wrote:
>> cache instead. In the index scan phase, it's randomly accessed, but if 
>> the table is clustered, it's in fact not completely random access. In 
>> the 2nd vacuum pass, the array is scanned sequentially again. I'm not 
> 
> Only if there's only one index on the table... otherwise I'd argue that
> you're much less likely to be searching the TID list incrementally.

Yeah, sure.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com