Thread: Resumable vacuum proposal and design overview
Hi,

We are developing a new feature for vacuum; here is a brief overview of it.

Introduction
------------

A) What is it?

This feature gives vacuum a resumable capability. Vacuum remembers the point at which it stopped and resumes the interrupted vacuum operation from that point the next time. The SQL syntax for this feature looks like:

  # VACUUM ALL  --> same functionality as the current VACUUM
  # VACUUM      --> uses the information saved by a previous interrupted
                    vacuum to continue vacuum processing

B) Why do we need it?

A large table can be vacuumed with cost-based delay enabled, but the processing may then last for several days, which has a definite negative effect on system performance. For systems that have a maintenance window, it is preferable to vacuum inside that window. If the vacuum task can be split into small subtasks, they can be scheduled into maintenance window time slots, which reduces the impact of vacuum on system service.

Currently, however, a vacuum task cannot be split: if an interrupt or error occurs during vacuum processing, vacuum forgets everything it has done and terminates itself, and the next vacuum on the same table has to scan again from the first heap block. This proposal gives vacuum the capability to stop and resume.

C) How can we use it?

This feature lets autovacuum or a cron-based vacuum scheduler build more sophisticated vacuum schedules combined with a *maintenance window*. For example, if the system has a two-hour maintenance window every day, vacuum can be interrupted this way:

  SET statement_timeout TO 2*3600*1000;  -- two hours
  VACUUM FREEZE talname;
  SET statement_timeout TO DEFAULT;

Or it can be interrupted by SIGINT directly. Autovacuum or a manual vacuum scheduler can split a large vacuum task into small subtasks with this feature, and the subtasks can be scheduled according to system load.

Design Overview
---------------

A) Vacuum internal overview

Concurrent vacuum performs the following main steps to vacuum a table:

  1. scan the heap to collect the dead tuple list
  2. (if the table has indexes) scan and sweep the indexes
  3. sweep the dead tuples collected in step 1
  4. perform additional index cleanup operations
  5. (if a certain amount of free space is found at the end) truncate the table
  6. register free space in the FSM
  7. update the statistics of the table

If maintenance memory is not sufficient to contain all of the dead tuples, steps 1-3 are repeated.

C) Where to stop

The first option is to accept a stop request at the boundary of each step and resume from the unfinished step. But some steps take a short time and some take a long time: for example, step 1 takes more than 40% of the total time, while steps 5 and 6 take less than 1%. This granularity is too coarse.

The second option is to accept a stop request before vacuum begins to process any one of the blocks in steps 1-4. In the current vacuum implementation, vacuum_delay_point is already placed at such locations. This option has much finer granularity than option 1.

This implementation accepts stop requests at the *block level* in steps 1-4.

D) How to stop and resume

- stop:

  When vacuum stops in steps 1-4, it does the following: it saves the dead tuple list, the heap block number at which it stopped, the unswept index relations, the unswept dead tuples and the FreezeLimit into a disk file.

  Free space information collected in steps 1-3 can be registered in the FSM when vacuum is interrupted.

- resume:

  When vacuum resumes, it reads the saved information back, skips the finished operations, and continues with the remaining operations.
There are two additional issues which need to be discussed here:

*) FreezeLimit

There are two options for selecting the FreezeLimit of a resuming vacuum: (a) the FreezeLimit of the interrupted vacuum, or (b) the FreezeLimit of the resuming vacuum. FreezeLimit-(b) is safe, but since not all heap blocks will have been scanned with it, when FreezeLimit-(b) is used the relfrozenxid should still be updated with FreezeLimit-(a) at the end of vacuum, and CLOG can only be truncated by FreezeLimit-(a).

*) Resuming index operation

There are two possible resume levels when vacuum is interrupted in step 2: (a) skip the *index relations* which have been swept completely, or (b) skip the *index blocks* which have been swept. Level (a) is safe and simple to implement; level (b) needs to consider scenarios in which a leaf page is split, and further investigation is needed to clarify whether it is safe. This implementation adopts *level (a) resuming*.

Implementation Plan
-------------------

We are working on the patch now; I will send the WIP patch to the list later. I am sorry for this late proposal, but I hope it can go into 8.3. Your comments and ideas are welcome.

Best Regards

Galy Lee (lee.galy@oss.ntt.co.jp)
NTT Open Source Software Center
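To make section D above more concrete, here is a minimal standalone C sketch (my own illustration; the names are hypothetical and not taken from the actual patch) of the kind of state that would have to be written to disk at a block-level stop and read back on resume:

  /*
   * Hypothetical resume state for an interrupted vacuum (editorial sketch,
   * not the proposed on-disk format).  Free-space information is not kept
   * here because the proposal flushes it to the FSM at stop time.
   */
  #include <stdint.h>

  typedef struct ResumeTid
  {
      uint32_t    block;                 /* heap block number */
      uint16_t    offset;                /* line pointer within the block */
  } ResumeTid;

  typedef struct VacResumeState
  {
      uint32_t    next_heap_block;       /* where the step-1 scan should restart */
      uint32_t    freeze_limit_xid;      /* FreezeLimit of the interrupted vacuum */
      uint32_t    indexes_already_swept; /* step-2 progress, counted in whole indexes */
      uint32_t    num_dead_tuples;       /* number of entries that follow */
      ResumeTid   dead_tuples[];         /* dead tuples not yet swept from the heap */
  } VacResumeState;

Carrying the FreezeLimit of the interrupted run matters because, as noted above, relfrozenxid and CLOG truncation must be driven by that older limit.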
On Mon, 2007-02-26 at 18:21 +0900, Galy Lee wrote:
> This implementation accepts stop requests at the *block level* in steps 1-4.
>
> D) How to stop and resume
>
> - stop:
>
>   When vacuum stops in steps 1-4, it saves the dead tuple list, the heap
>   block number at which it stopped, the unswept index relations, the
>   unswept dead tuples and the FreezeLimit into a disk file.
>
>   Free space information collected in steps 1-3 can be registered in the
>   FSM when vacuum is interrupted.
>
> - resume:
>
>   When vacuum resumes, it reads the saved information back, skips the
>   finished operations, and continues with the remaining operations.

I understand and agree with the need to reduce the time taken for long VACUUM operations. Doing VACUUM in stages makes a lot of sense to me. What I'm not sure about is the stop/resume logic you've presented.

Stopping at *any* delay point means you have to dump the dead tuple list and reload it again later. That means we can't really trust the list we are given when we resume - the user may provide an incorrect or old dead tuple list. If the system manages the dead tuple list we may need to keep such files around for long periods, which doesn't sound great either.

ISTM simpler to make the optional stop/restart point be after one full cycle of cleaning, so exactly at the point where we discard one tuple list and move on to the next. That way, we have no need to store and reload the dead tuple list, with all of the complexity that involves. The only thing we'd need to remember would be the block number that the last partial VACUUM had reached, which would be simple to store in the catalog. Restarting VACUUM would then occur from the stored block number, not from the first block. This approach seems like a stepping-stone on the way to the full proposal you've suggested, so you could extend further if still required.

Interrupting a running VACUUM could be made possible by having VACUUM check for a smart stop request at normal vacuum delay points. This would be made with a system function, e.g. stop_vacuums(). Once a smart stop request is received, VACUUM would finish the rest of its current cycle before ending. If in step 1, it would not scan further, but would complete steps 2 and 3 for the dead tuples harvested so far. The smart stop request would then be a system-wide request for manually run VACUUMs to end as soon as possible, because normal operation is about to re-commence.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
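For illustration, a minimal standalone C sketch of the smart-stop idea (my own sketch; stop_vacuums() and every name below are hypothetical, not an existing API):

  #include <stdbool.h>

  /* Set (e.g. via a shared-memory flag) when stop_vacuums() is called. */
  static volatile bool smart_stop_requested = false;

  void
  request_smart_stop(void)
  {
      smart_stop_requested = true;
  }

  /*
   * Checked at each vacuum delay point during the heap scan (step 1).
   * Returning true means: stop harvesting new dead tuples, but still run
   * steps 2 and 3 for the tuples collected so far, so that no dead-tuple
   * list has to be carried over to the next run.
   */
  bool
  should_finish_cycle_and_stop(void)
  {
      return smart_stop_requested;
  }

Because the request is only honored between cycles, the sole state worth persisting is the block number the last completed cycle reached.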
On Mon, Feb 26, 2007 at 06:21:50PM +0900, Galy Lee wrote:
> Hi
>
> We are developing a new feature for vacuum; here is a brief overview
> of it.
[...]
> Concurrent vacuum performs the following main steps to vacuum a table:
>
> 1. scan the heap to collect the dead tuple list
> 2. (if the table has indexes) scan and sweep the indexes
> 3. sweep the dead tuples collected in step 1
> 4. perform additional index cleanup operations
> 5. (if a certain amount of free space is found at the end) truncate the table
> 6. register free space in the FSM
> 7. update the statistics of the table
[...]
> This implementation accepts stop requests at the *block level* in steps 1-4.

WARNING: I don't really know what I'm talking about -- but considering that vacuuming a large table across several "maintenance windows" is one of the envisioned scenarios, it might make sense to try to update the statistics (i.e. to do step 7 partially) on each partial run. Otherwise the table's stats might wander off for quite a long time?

Regards
-- tomás
tomas@tuxteam.de writes:
> WARNING: I don't really know what I'm talking about -- but considering
> that vacuuming a large table across several "maintenance windows" is
> one of the envisioned scenarios, it might make sense to try to update
> the statistics (i.e. to do step 7 partially) on each partial run.
> Otherwise the table's stats might wander off for quite a long time?

You can handle that by issuing ANALYZE as a separate operation; I see no need to complicate this proposal even further by having it make magic changes to the behavior of VACUUM ANALYZE.

Or were you speaking of the pg_class.reltuples count? Keeping that from diverging indefinitely far from reality will indeed be a problem with this sort of thing. We've already seen some issues related to the stats collector's similar count.

    regards, tom lane
Simon Riggs wrote:
> old dead tuple list. If the system manages the dead tuple list we may
> need to keep such files around for long periods, which doesn't sound
> great either.

The system manages such files. The files are kept in a location like $PGDATA/pg_vacuum, and they are removed when CLUSTER, DROP TABLE, ALTER TABLE, VACUUM etc. changes the physical layout of the heap. I think this is a reasonable way to handle them.

> ISTM simpler to make the optional stop/restart point be after one full
> cycle of cleaning, so exactly at the point where we discard one tuple
> list and move on to the next.

Let me summarize your idea here:

Where to stop?
 - stop after one full cycle of cleaning has finished

How to stop?
 - When a stop is requested in step 1, go straight to steps 2 and 3 to
   clean the dead tuples harvested so far.
 - When a stop is requested in step 2 or step 3, vacuum ignores the
   request and finishes all of the steps.

Merit:
 - Because it does not require an external file to store the dead tuple
   list, it simplifies the implementation.

Demerit:
 - The time between the stop request being issued and vacuum actually
   stopping is still quite unpredictable. (Consider that step 2 sometimes
   takes 2 or 3 hours, and step 3 may take as long as step 1.)
 - It is difficult to fit vacuum into a maintenance window frame.

The point here is *how long* we can accept vacuum to take to stop. For example, consider one table:
 - The table is hundreds of GB.
 - It takes 4-8 hours to vacuum such a large table.
 - Enabling cost-based delay may make it last for 24 hours.
 - It can be vacuumed during the night for 2-4 hours.

It is true that there is no strict requirement that vacuum be interrupted immediately, but it should stop in a *predictable* way. In the above example, if we have to wait for the end of one full cycle of cleaning, it may take up to 8 hours for vacuum to stop after it has received the stop request. This seems quite unacceptable.
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> For example, consider one table:
>  - The table is hundreds of GB.
>  - It takes 4-8 hours to vacuum such a large table.
>  - Enabling cost-based delay may make it last for 24 hours.
>  - It can be vacuumed during the night for 2-4 hours.

> It is true that there is no strict requirement that vacuum be
> interrupted immediately, but it should stop in a *predictable* way. In
> the above example, if we have to wait for the end of one full cycle of
> cleaning, it may take up to 8 hours for vacuum to stop after it has
> received the stop request. This seems quite unacceptable.

I think you misunderstood what Simon means by "cycle", because you are claiming that one cycle == one complete table VACUUM; if that were so then what he is proposing would be exactly the status quo.

What he's proposing (which is the same thing I was going to say, until I saw he'd beat me to it) is that you should be prepared to stop after any one cycle of fill-work-mem-and-clean-indexes. This should not take that long given the current physical-scan-order implementation of btbulkdelete, especially if you don't set maintenance_work_mem too large. Moreover, it avoids a boatload of risks associated with assuming that a batch of TIDs you've hidden somewhere are still valid. It's not hard to come up with scenarios where you could be discarding live tuples because of reliance on a stale TID-list file. State that consists only of a next-heap-page-to-scan is just a whole lot more robust.

    regards, tom lane
On Tue, Feb 27, 2007 at 11:44:28AM +0900, Galy Lee wrote:
> For example, consider one table:
>  - The table is hundreds of GB.
>  - It takes 4-8 hours to vacuum such a large table.
>  - Enabling cost-based delay may make it last for 24 hours.
>  - It can be vacuumed during the night for 2-4 hours.
>
> It is true that there is no strict requirement that vacuum be
> interrupted immediately, but it should stop in a *predictable* way. In
> the above example, if we have to wait for the end of one full cycle of
> cleaning, it may take up to 8 hours for vacuum to stop after it has
> received the stop request. This seems quite unacceptable.

Even with very large tables, you could likely still fit things into a specific time frame by adjusting how much time is spent scanning for dead tuples. The idea would be to give vacuum a target run time, and it would monitor how much time it had remaining, taking into account how long it should take to scan the indexes based on how long it's been taking to scan the heap. When the amount of time left becomes less than the estimate of the amount of time required to scan the indexes (and clean the heap), you stop the heap scan and start scanning indexes. As long as the IO workload on the machine doesn't vary wildly between the heap scan and the rest of the vacuum process, I would expect this to work out fairly well.

While not as nice as the ability to 'stop on a dime' as Tom puts it, this would be much easier and safer to implement. If there's still a need for something better after that we could revisit it at that time.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
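A minimal standalone C sketch of the time-budget heuristic described above (my own illustration with hypothetical names; estimating the cleanup time from the heap-scan progress so far is an assumption about how it might be done):

  #include <stdbool.h>

  typedef struct VacuumTimeBudget
  {
      double      target_secs;      /* total run time the DBA allows */
      double      elapsed_secs;     /* time spent so far in this run */
      double      est_cleanup_secs; /* estimated time for the index scan plus
                                     * heap sweep, re-estimated as dead tuples
                                     * and heap-scan timings accumulate */
  } VacuumTimeBudget;

  /*
   * Checked at each block during the heap scan: once the remaining budget
   * no longer covers the estimated cleanup work, stop collecting dead
   * tuples and move on to the index scan with what has been harvested.
   */
  static bool
  time_to_start_index_scan(const VacuumTimeBudget *b)
  {
      return (b->target_secs - b->elapsed_secs) <= b->est_cleanup_secs;
  }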
On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote:
> On Tue, Feb 27, 2007 at 11:44:28AM +0900, Galy Lee wrote:
> > It is true that there is no strict requirement that vacuum be
> > interrupted immediately, but it should stop in a *predictable* way.
> > In the above example, if we have to wait for the end of one full
> > cycle of cleaning, it may take up to 8 hours for vacuum to stop after
> > it has received the stop request. This seems quite unacceptable.
>
> Even with very large tables, you could likely still fit things into a
> specific time frame by adjusting how much time is spent scanning for
> dead tuples. The idea would be to give vacuum a target run time, and it
> would monitor how much time it had remaining, taking into account how
> long it should take to scan the indexes based on how long it's been
> taking to scan the heap. When the amount of time left becomes less than
> the estimate of the amount of time required to scan the indexes (and
> clean the heap), you stop the heap scan and start scanning indexes. As
> long as the IO workload on the machine doesn't vary wildly between the
> heap scan and the rest of the vacuum process, I would expect this to
> work out fairly well.
>
> While not as nice as the ability to 'stop on a dime' as Tom puts it,
> this would be much easier and safer to implement. If there's still a
> need for something better after that we could revisit it at that time.

I do like this idea, but it also seems easy to calculate that bit yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how long it takes to finish. Adjust X until you have it right.

If we did want to automate it, a vacuum_target_duration userset GUC would make sense, following Jim's thought. =0 would mean run-to-completion.

Getting it to work well for VACUUM FULL would be more than a little interesting though.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote: >> ... The idea would be to give vacuum a target run time, and it >> would monitor how much time it had remaining, taking into account how >> long it should take to scan the indexes based on how long it's been >> taking to scan the heap. When the amount of time left becomes less than >> the estimate of the amount of time required to scan the indexes (and >> clean the heap), you stop the heap scan and start scanning indexes. > I do like this idea, but it also seems easy to calculate that bit > yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how > long it takes to finish. Adjust X until you have it right. One problem with it is that a too-small target would result in vacuum proceeding to scan indexes after having accumulated only a few dead tuples, resulting in increases (potentially enormous ones) in the total work needed to vacuum the table completely. I think it's sufficient to have two cases: abort now, and restart from the last cycle-completion point next time (this would basically just be SIGINT); or set a flag to stop at the next cycle-completion point. It occurs to me that we may be thinking about this the wrong way entirely. Perhaps a more useful answer to the problem of using a defined maintenance window is to allow VACUUM to respond to changes in the vacuum cost delay settings on-the-fly. So when your window closes, you don't abandon your work so far, you just throttle your I/O rate back to whatever's considered acceptable for daytime vacuuming. regards, tom lane
Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: >> On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote: >>> ... The idea would be to give vacuum a target run time, and it >>> would monitor how much time it had remaining, taking into account how >>> long it should take to scan the indexes based on how long it's been >>> taking to scan the heap. When the amount of time left becomes less than >>> the estimate of the amount of time required to scan the indexes (and >>> clean the heap), you stop the heap scan and start scanning indexes. > >> I do like this idea, but it also seems easy to calculate that bit >> yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how >> long it takes to finish. Adjust X until you have it right. > > One problem with it is that a too-small target would result in vacuum > proceeding to scan indexes after having accumulated only a few dead > tuples, resulting in increases (potentially enormous ones) in the total > work needed to vacuum the table completely. > > I think it's sufficient to have two cases: abort now, and restart from > the last cycle-completion point next time (this would basically just be > SIGINT); or set a flag to stop at the next cycle-completion point. > > > It occurs to me that we may be thinking about this the wrong way > entirely. Perhaps a more useful answer to the problem of using a > defined maintenance window is to allow VACUUM to respond to changes in > the vacuum cost delay settings on-the-fly. So when your window closes, > you don't abandon your work so far, you just throttle your I/O rate back > to whatever's considered acceptable for daytime vacuuming. I thought we already did that? Which BTW was part of my plan on how to deal with a vacuum that is still running after it's maintenance window has expired.
"Matthew T. O'Connor" <matthew@tocr.com> writes: > Tom Lane wrote: >> It occurs to me that we may be thinking about this the wrong way >> entirely. Perhaps a more useful answer to the problem of using a >> defined maintenance window is to allow VACUUM to respond to changes in >> the vacuum cost delay settings on-the-fly. So when your window closes, >> you don't abandon your work so far, you just throttle your I/O rate back >> to whatever's considered acceptable for daytime vacuuming. > I thought we already did that? No, we only react to SIGHUP when idle. I think that's a good policy for standard backends, but for autovacuum it might be appropriate to check more often. regards, tom lane
On Tue, 2007-02-27 at 12:23 -0500, Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@tocr.com> writes:
>> I thought we already did that?
>
> No, we only react to SIGHUP when idle. I think that's a good policy for
> standard backends, but for autovacuum it might be appropriate to check
> more often.

You mean react to changes while in the middle of a VACUUM? Sounds like a great idea to me. Not sure why you'd do that just for autovacuum though. Sounds like it would be good in all cases. Vacuum and autovacuum have different vacuum delays, after all.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> One problem with it is that a too-small target would result in vacuum
> proceeding to scan indexes after having accumulated only a few dead
> tuples, resulting in increases (potentially enormous ones) in the total
> work needed to vacuum the table completely.

Yeah. This is also my big concern about your and Simon's idea. Every vacuum stop causes an index scan, which means the total time of vacuum is proportional to how many times vacuum has stopped.

> I think it's sufficient to have two cases: abort now, and restart from
> the last cycle-completion point next time (this would basically just be
> SIGINT);

If there is only one cycle, this approach has a problem. (If maintenance work memory is not too small, that situation is normal.)

> or set a flag to stop at the next cycle-completion point.

The extra cost of cleaning indexes may prevent this approach from working in practice.

> Perhaps a more useful answer to the problem of using a
> defined maintenance window is to allow VACUUM to respond to changes in
> the vacuum cost delay settings on-the-fly.

This is a good idea! Itagaki also talked to me about exactly the same idea yesterday.

But if we change the parameters on the fly to make vacuum less aggressive, my concern is: are there any potential problems with running a vacuum over several days? Although I have no plan to touch VACUUM FULL, it seems that concurrent VACUUM also holds an exclusive lock when truncating the table. I am a little worried about this kind of problem with this approach.

Also, maybe we need some shared memory area to share the cost-delay parameters between VACUUMs, or are there any other ideas?
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> Tom Lane wrote:
>> ... or set a flag to stop at the next cycle-completion point.

> The extra cost of cleaning indexes may prevent this approach from
> working in practice.

Huh? There is no extra cost in what I suggested; it'll perform exactly the same number of index scans that it would do anyway.

>> Perhaps a more useful answer to the problem of using a
>> defined maintenance window is to allow VACUUM to respond to changes in
>> the vacuum cost delay settings on-the-fly.

> This is a good idea! Itagaki also talked to me about exactly the same
> idea yesterday.

> But if we change the parameters on the fly to make vacuum less
> aggressive, my concern is: are there any potential problems with
> running a vacuum over several days?

If the table is sufficiently large, that could happen anyway. The issues here, I think, are to not eat resources that foreground processes need (which vacuum-cost-delay addresses) and to not block vacuuming of hot-update tables (which can be addressed by allowing multiple autovac workers). So I'm not really convinced that being able to stop a table vacuum halfway is critical.

    regards, tom lane
Tom Lane wrote:
> Huh? There is no extra cost in what I suggested; it'll perform
> exactly the same number of index scans that it would do anyway.

What I wanted to say is this: if we can stop at any point, we can make maintenance memory large enough to contain all of the dead tuples, and then we only need to clean each index once. No matter how many times vacuum stops, the indexes are cleaned once.

But in your proposal, the indexes will be scanned as many times as vacuum stops. Those extra index cleanings are the extra cost compared with the stop-on-a-dime approach. For vacuuming a large table with 8 stops, tests show the extra cost can amount to one third of the stop-on-a-dime approach.

> So I'm not really convinced that being able to stop a table
> vacuum halfway is critical.

To run vacuum on the same table for a long period, it is critical to be sure:

 1. not to eat resources that foreground processes need
 2. not to block vacuuming of hot-updated tables
 3. not to block any transaction, and not to block any backup activities

In the current implementation of concurrent vacuum, the third is obviously not satisfied. The first issue that comes to my mind is lazy_truncate_heap: it takes AccessExclusiveLock for a long time, which is problematic. Unless we change such mechanisms to ensure that there is no problem with running vacuum on the same table for several days, we cannot say we don't need to stop it halfway.

Best Regards,
--
Galy Lee <lee.galy@oss.ntt.co.jp>
NTT Open Source Software Center
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> If we can stop at any point, we can make maintenance memory large
> enough to contain all of the dead tuples, and then we only need to
> clean each index once. No matter how many times vacuum stops, the
> indexes are cleaned once.

I beg your pardon? You're the one who's been harping on the table-so-large-it-takes-days-to-vacuum scenario. How do you figure that you can store all the dead TIDs in working memory?

    regards, tom lane
On Mon, Feb 26, 2007 at 01:39:40PM -0500, Tom Lane wrote:
[...]
> Or were you speaking of the pg_class.reltuples count?

Yes (modulo my warning, that is)

Regards
-- tomás
On Wed, 2007-02-28 at 13:53 +0900, Galy Lee wrote:
> Tom Lane wrote:
>> Huh? There is no extra cost in what I suggested; it'll perform
>> exactly the same number of index scans that it would do anyway.
>
> What I wanted to say is this: if we can stop at any point, we can make
> maintenance memory large enough to contain all of the dead tuples, and
> then we only need to clean each index once. No matter how many times
> vacuum stops, the indexes are cleaned once.

I agree that the cycle-at-a-time approach could perform more poorly with repeated stop-start. The reason for the suggestion was robustness, not performance. If you provide the wrong dead-tuple-list to VACUUM, you will destroy the integrity of a table, which can result in silent data loss. You haven't explained how saving the dead-tuple-list could be done in a safe manner, and it seems risky to me.

> But in your proposal, the indexes will be scanned as many times as
> vacuum stops. Those extra index cleanings are the extra cost compared
> with the stop-on-a-dime approach. For vacuuming a large table with 8
> stops, tests show the extra cost can amount to one third of the
> stop-on-a-dime approach.

But the VACUUM is being run during your maintenance window, so why do you care about performance of VACUUM during that time? There is some inefficiency in the VACUUM process, but it seems like a high price to pay for more robustness. Does the loss of efficiency during VACUUM translate directly into reduced performance during operational periods? I think not.

Deferring completion of VACUUM means deferring refreshing the FSM. Allowing cycle-at-a-time VACUUM would allow the FSM to be updated after each run, thus releasing space for reuse again. ISTM that the saving-dead-list approach would defer the updating of the FSM for many days in your situation.

If you would like to reduce VACUUM times have you considered partitioning? It can be very effective at isolating changes and is designed specifically to cope with large data maintenance issues. If there are issues that prevent the use of partitioning in your case, perhaps we should be discussing those instead? Migration from a non-partitioned environment to a partitioned one is quite simple from 8.2 onwards.

>> So I'm not really convinced that being able to stop a table
>> vacuum halfway is critical.

> To run vacuum on the same table for a long period, it is critical
> to be sure:
>  1. not to eat resources that foreground processes need
>  2. not to block vacuuming of hot-updated tables
>  3. not to block any transaction, and not to block any backup activities
>
> In the current implementation of concurrent vacuum, the third is
> obviously not satisfied. The first issue that comes to my mind is
> lazy_truncate_heap: it takes AccessExclusiveLock for a long time,
> which is problematic.

Are you saying you know for certain this lock is held for a long time, or are you just saying you think it is? If you have some evidence for long truncation times then that would be a separate issue of concern, since that might starve out normal users. Please say more?

ISTM that if you can refresh the FSM more frequently you will have less need to truncate the relation at the end of each run. After some time, I would expect that no truncation would be required because of the cyclic reuse of space within the table, rather than extension/truncation.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Wed, 2007-02-28 at 13:53 +0900, Galy Lee wrote: >> In the current implementation of concurrent vacuum, the third is not >> satisfied obviously, the first issue comes to my mind is the >> lazy_truncate_heap, it takes AccessExclusiveLock for a long time, >> that is problematic. > Are you saying you know for certain this lock is held for a long time, > or are you just saying you think it is? If you have some evidence for > long truncation times then that would be a separate issue of concern, > since that might starve out normal users. Please say more? lazy_truncate_heap does a ConditionalLockAcquire, that is, it won't succeed in acquiring the exclusive lock if there is any competition. And I find it hard to believe that it will hold the lock very long if it does get it --- in most scenarios it won't be possible to remove very many pages, so the scan won't take long. (Of course that is arguably a bug, but until you can fix things so that an average VACUUM *can* remove a lot of pages, it's hardly profitable to worry about whether this step creates a concurrency problem.) So I agree with Simon: if you want us to believe there's a performance issue here, please present some evidence. regards, tom lane
Tom Lane wrote:
> Galy Lee <lee.galy@oss.ntt.co.jp> writes:
>> If we can stop at any point, we can make maintenance memory large
>> enough to contain all of the dead tuples, and then we only need to
>> clean each index once. No matter how many times vacuum stops, the
>> indexes are cleaned once.
>
> I beg your pardon? You're the one who's been harping on the
> table-so-large-it-takes-days-to-vacuum scenario. How do you figure that
> you can store all the dead TIDs in working memory?

This reminds me of an idea I had while looking at the bitmap index patch: we could store the dead TIDs more efficiently in a bitmap, allowing tables to be vacuumed in fewer cycles.

Of course, that's orthogonal to the above discussion.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Simon Riggs wrote:
> You haven't explained how saving the dead-tuple-list could be done
> in a safe manner, and it seems risky to me.

The files are placed in a new directory, $PGDATA/pg_vacuum, with the name spcNode.dbNode.relNode for each relation whose vacuum has been interrupted.

Each file has a format like:

 1. VacStateFileHeader
 2. VacStateData
 3. Dead tuple list
 4. CRC32

The files are removed
 - when the original physical heap files are removed,
 - when VACUUM FULL has been issued,
 - or after the contents have been read into memory,
 - etc.

Is there any potential big risk there? Correct me if I am wrong.

> Deferring completion of VACUUM means deferring refreshing the FSM.

I borrow the code from the DSM patch to merge free space info into the FSM when vacuum stops.

> Are you saying you know for certain this lock is held for a long time,
> or are you just saying you think it is? If you have some evidence for
> long truncation times then that would be a separate issue of concern,
> since that might starve out normal users. Please say more?

Sorry. I *thought* it was. The benchmark has not shown that kind of problem anyway. Thanks for the clarification. :)

Regards,
--
Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center
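For illustration, a minimal standalone C sketch of such a file header (my own; field names are hypothetical and not taken from the patch):

  #include <stdint.h>

  #define VAC_STATE_MAGIC   0x56414331       /* arbitrary "VAC1" tag */

  typedef struct VacStateFileHeader
  {
      uint32_t    magic;           /* must equal VAC_STATE_MAGIC */
      uint32_t    version;         /* bumped on any format change */
      uint32_t    data_length;     /* bytes of VacStateData plus dead tuple list */
  } VacStateFileHeader;

  /*
   * Layout: VacStateFileHeader, then VacStateData, then the dead tuple
   * list, then a CRC32 computed over everything that precedes it.  On
   * resume, a file whose magic, version, length or CRC does not match is
   * simply discarded and the vacuum starts from scratch, so a corrupted
   * file is never trusted.
   */

Note that a CRC only guards against corruption of the file itself; as pointed out elsewhere in the thread, it cannot tell whether the heap has changed underneath the saved TIDs.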
>> What I wanted to say is this: if we can stop at any point, we can
>> make maintenance memory large enough to contain all of the dead
>> tuples, and then we only need to clean each index once. No matter how
>> many times vacuum stops, the indexes are cleaned once.
>
> I agree that the cycle-at-a-time approach could perform more
> poorly with repeated stop-start. The reason for the
> suggestion was robustness, not performance.

It performs more poorly, but it also gives an immediate gain, since part of the table is readily vacuumed. If you do it all in one pass with stop/resume, the first visible effect may come several days after you start vacuuming.

And, basically, you need to pretend the vacuum transaction is still running after the first stop. Otherwise dead tuple reuse a la HOT is not possible (or the ctid list needs to be reevaluated during resume, which per se is not efficient).

> If you provide the wrong dead-tuple-list to VACUUM, you will destroy
> the integrity of a table, which can result in silent data loss.
> You haven't explained how saving the dead-tuple-list could be
> done in a safe manner, and it seems risky to me.

Agreed. It does not seem to be efficiently possible.

Andreas
>> You haven't explained how saving the dead-tuple-list could be done in
>> a safe manner, and it seems risky to me.
>
> The files are placed in a new directory, $PGDATA/pg_vacuum,
> with the name spcNode.dbNode.relNode for each relation
> whose vacuum has been interrupted.
>
> Each file has a format like:
>
>  1. VacStateFileHeader
>  2. VacStateData
>  3. Dead tuple list
>  4. CRC32
>
> The files are removed
>  - when the original physical heap files are removed,
>  - when VACUUM FULL has been issued,
>  - or after the contents have been read into memory,
>  - etc.
>
> Is there any potential big risk there? Correct me if I am wrong.

The main risk is not a corrupt file or a broken list. The risk is that a ctid in the list points at a tuple that is not dead anymore. To avoid that risk you would need to:

 1. keep the vacuum lock open
 2. leave the vacuum tx open (or reevaluate the visibility of list members upon resume)

Andreas
On Wed, 2007-02-28 at 11:19 +0100, Zeugswetter Andreas ADI SD wrote:
>>> What I wanted to say is this: if we can stop at any point, we can
>>> make maintenance memory large enough to contain all of the dead
>>> tuples, and then we only need to clean each index once. No matter
>>> how many times vacuum stops, the indexes are cleaned once.
>>
>> I agree that the cycle-at-a-time approach could perform more
>> poorly with repeated stop-start. The reason for the
>> suggestion was robustness, not performance.
>
> It performs more poorly, but it also gives an immediate gain, since
> part of the table is readily vacuumed. If you do it all in one pass
> with stop/resume, the first visible effect may come several days after
> you start vacuuming.

I think that in itself is enough to tip the scales.

> And, basically, you need to pretend the vacuum transaction is still
> running after the first stop. Otherwise dead tuple reuse a la HOT is
> not possible (or the ctid list needs to be reevaluated during resume,
> which per se is not efficient).

Ah, I see you got there ahead of me. Yes, it would prevent HOT from performing retail VACUUMs on heap blocks. (I'm not saying HOT will be accepted/acceptable, but I'd rather not have its acceptability hinge on a use case that seems so rare.)

One proposal that we do still have in discussion is Heikki's patch to re-evaluate the OldestXmin while the VACUUM runs. That's something we'd definitely want to do in a restartable VACUUM anyway. But my thought is that it actually increases quite dramatically the number of dead rows harvested during VACUUM (a good thing), which is likely to increase the number of cycles required to complete a large table (no problem, because of the increased efficiency of the VACUUM).

I think there's a strong argument to make VACUUM refresh rather than rebuild the FSM after each cycle rather than wait until the end, whether or not we stop/start the VACUUM. In any long running VACUUM that seems very worthwhile.

Big VACUUM needs big memory. Using huge settings of maintenance_work_mem dedicated solely to VACUUM seems like it could be a waste of resources in many cases. It may be much better to allow 1 GB of memory to be used to cache indexes better, which would improve performance of other applications, as well as improving the index scan time during VACUUM. So scanning indexes more often during VACUUM isn't necessarily bad either, unless your table is truly huge, in which case you should use partitioning to reduce it.

Galy, please hear that people like your idea and understand your use case, but just don't like all of the proposal, just the main thrust of it. The usual way is that

  (people that agree + amount of your exact idea remaining) = 100%

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Wed, 2007-02-28 at 09:38 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I beg your pardon? You're the one who's been harping on the
>> table-so-large-it-takes-days-to-vacuum scenario. How do you figure that
>> you can store all the dead TIDs in working memory?
>
> This reminds me of an idea I had while looking at the bitmap index
> patch: we could store the dead TIDs more efficiently in a bitmap,
> allowing tables to be vacuumed in fewer cycles.
>
> Of course, that's orthogonal to the above discussion.

I like the idea.

How much memory would it save during VACUUM on a 1 billion row table with 200 million dead rows? Would that reduce the number of cycles a normal non-interrupted VACUUM would perform?

Would it work efficiently for all of the current index AMs? Each index might use the index slightly differently during cleanup, I'm not sure.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > How much memory would it save during VACUUM on a 1 billion row table > with 200 million dead rows? Would that reduce the number of cycles a > normal non-interrupted VACUUM would perform? It would depend on how many dead tuples you have per-page. If you have a very large table with only one dead tuple per page then it might even be larger. But in the usual case it would be smaller. Also note that it would have to be non-lossy. My only objection to this idea, and it's not really an objection at all, is that I think we want to head in the direction of making indexes cheaper to scan and doing the index scan phase more often. That reduces the need for multiple concurrent vacuums and makes the problem of busy tables getting starved less of a concern. That doesn't mean there's any downside to making the dead tuple list take less memory but I think the upside is limited. As we optimize our index representations with GII and bitmapped indexes scanning them gets easier and easier anyways. And you don't really want to wait too long before you get the benefit of the recovered space in the table. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> How much memory would it save during VACUUM on a 1 billion row table
>> with 200 million dead rows? Would that reduce the number of cycles a
>> normal non-interrupted VACUUM would perform?
>
> It would depend on how many dead tuples you have per-page. If you have a very
> large table with only one dead tuple per page then it might even be larger.
> But in the usual case it would be smaller.

FWIW, there are some unused bits in the current representation, so it might actually be possible to design it so that it's never larger.

One optimization to the current structure, instead of switching to a bitmap, would be to store the block number just once for each block, followed by a variable-length list of offsets. It'd complicate the binary search though.

> Also note that it would have to be non-lossy.

Yep. Or actually, it might be useful to forget some dead tids if that allowed you to memorize a larger number of other dead tids. Hmm, what a weird thought :).

Another insight I had while thinking about this is that the dead tid list behaves quite nicely from an OS memory management point of view. In the 1st vacuum phase, the array is filled in sequence, which means that the OS can swap out the early parts of it and use the memory for buffer cache instead. In the index scan phase, it's randomly accessed, but if the table is clustered, it's in fact not completely random access. In the 2nd vacuum pass, the array is scanned sequentially again. I'm not sure how that works out in practice, but you might want to use a larger maintenance_work_mem than you'd think.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
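A minimal standalone C sketch of the block-number-plus-offsets layout described above (my own illustration; the names are hypothetical):

  #include <stdint.h>

  /*
   * One variable-length record per heap page that contains dead tuples:
   * the block number is stored once, followed by the dead line pointer
   * numbers on that page.
   */
  typedef struct DeadTuplesOnPage
  {
      uint32_t    blkno;          /* heap block number, stored once */
      uint16_t    noffsets;       /* number of dead line pointers on the page */
      uint16_t    offsets[];      /* ascending line pointer numbers */
  } DeadTuplesOnPage;

  /*
   * Records are packed back to back in block-number order.  Because they
   * are variable length, the "is this TID dead?" test during the index
   * scan needs a separate array of record positions, so it can first
   * binary-search on blkno and then binary-search within offsets[] --
   * which is the extra complication mentioned above.
   */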
Simon Riggs wrote:
> Galy, please hear that people like your idea and understand your use
> case, but just don't like all of the proposal, just the main thrust of
> it. The usual way is that
>
>   (people that agree + amount of your exact idea remaining) = 100%

Thank you. I am glad to hear that. :) But I cannot kill my idea yet.

Let's come to the core issue we care about: do we need the stop-on-a-dime feature to stop vacuum immediately? As I said before: if there are problems with long-running vacuum, then yes, we *do* need to stop vacuum immediately. I am still not convinced that we don't have such problems. The potential problem with running a long vacuum is that it may block some foreground transactions, like ALTER TABLE; if it is true that a long-running vacuum can block some DDL for a long time, that is a big problem. I think we need to stop vacuum immediately to handle such problems.

I admit that the implementation is more complex, but I cannot see any big problems with saving the dead tuples out and reading them in again (as two-phase commit does). Why do we need to hold the lock and the transaction? We can release the lock and abandon the transaction ID; vacuum can take the lock and get a new ID when restarting. Why do we need to worry about whether a dead tuple is still alive? Only vacuum will sweep them, and HOT cannot touch the tuple until we have finished sweeping.

But it is true that in some scenarios it is better to stop after the cycle has finished, so that the effect of vacuum appears quickly. So I hope the final design can combine the two.
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> Let's come to the core issue we care about: do we need the stop-on-a-dime
> feature to stop vacuum immediately? As I said before: if there are
> problems with long-running vacuum, then yes, we *do* need to stop
> vacuum immediately.

There's always SIGINT. The question you are avoiding is whether it's really worth adding a lot of overhead to make vacuum able to stop without losing some work.

> I admit that the implementation is more complex, but I cannot see any
> big problems with saving the dead tuples out and reading them in again
> (as two-phase commit does).

The big problem is that this creates a number of failure scenarios that didn't exist before. Your proposal to store the dead-tuple TIDs in a separate file scares the heck out of me: there are any number of ways for that to get out-of-sync with the underlying relation. If you don't see the point here, go back to re-read the documentation about PITR and how one creates a base backup and exactly why that behavior is proof against crashes.

    regards, tom lane
> I admit that the implementation is more complex, but I cannot see any
> big problems with saving the dead tuples out and reading them in again
> (as two-phase commit does). Why do we need to hold the lock and the
> transaction? We can release the lock and abandon the transaction ID;
> vacuum can take the lock and get a new ID when restarting. Why do we
> need to worry about whether a dead tuple is still alive? Only vacuum
> will sweep them, and HOT cannot touch the tuple until we have finished
> sweeping.

One imho important (not necessarily mandatory) aspect of HOT is that it does parts of what vacuum would usually do. Thus:

 1. resume, load the ctid list
 2. continue filling the ctid list
 3. remove index tuples for these ctids  (* problem *)

You have just removed index entries for possibly-now-live tuples that have been reused by HOT. Unless ...

Andreas
> One imho important (not necessarily mandatory) aspect of HOT is that
> it does parts of what vacuum would usually do. Thus:
>
>  1. resume, load the ctid list
>  2. continue filling the ctid list
>  3. remove index tuples for these ctids  (* problem *)
>
> You have just removed index entries for possibly-now-live tuples that
> have been reused by HOT.

Sorry, my error; this is no problem, since HOT can only reuse a slot that has no direct index entries (part of an old HOT chain). So with HOT it is only important that the 1st heap scan ignores (does not add to the list) HOT-chained tuples.

Still, I don't feel comfortable with the idea of breaking the visibility rules (using a ctid that is days old and that the global xmin does not know about), even if I do not currently see a direct problem that cannot be worked around (like removing all the lists upon startup recovery).

Andreas
On Wed, Feb 28, 2007 at 10:14:24PM +0000, Heikki Linnakangas wrote:
> cache instead. In the index scan phase, it's randomly accessed, but if
> the table is clustered, it's in fact not completely random access. In
> the 2nd vacuum pass, the array is scanned sequentially again.

Only if there's only one index on the table... otherwise I'd argue that you're much less likely to be searching the TID list incrementally.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
Jim C. Nasby wrote:
> On Wed, Feb 28, 2007 at 10:14:24PM +0000, Heikki Linnakangas wrote:
>> cache instead. In the index scan phase, it's randomly accessed, but if
>> the table is clustered, it's in fact not completely random access. In
>> the 2nd vacuum pass, the array is scanned sequentially again.
>
> Only if there's only one index on the table... otherwise I'd argue that
> you're much less likely to be searching the TID list incrementally.

Yeah, sure.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com