Thread: VACUUMs and WAL
Looking at a VACUUM's WAL records makes me think twice about the way we issue a VACUUM. 1. First we scan the heap, issuing a HEAP2 clean record for every block that needs cleaning. 2. Then we scan the index, issuing WAL records as appropriate. 3. Then we rescan the heap, issuing a HEAP2 clean record for every block. I don't see a reason why we would issue 2 WAL records per block for a VACUUM, nor why we would prune and remove in two steps, dirtying the block each time. Seems like we could write approximately half the amount of data that we do. Surely we can come up with a better plan than that one? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Tue, 2008-10-28 at 08:49 +0000, Simon Riggs wrote: > Looking at a VACUUM's WAL records makes me think twice about the way we > issue a VACUUM. > > 1. First we scan the heap, issuing a HEAP2 clean record for every block > that needs cleaning. IIRC the first heap pass just collects info and does nothing else. Is this just an empty/do-nothing WAL record ? > 2. Then we scan the index, issuing WAL records as appropriate. > > 3. Then we rescan the heap, issuing a HEAP2 clean record for every > block. > > I don't see a reason why we would issue 2 WAL records per block for a > VACUUM, nor why we would prune and remove in two steps, dirtying the > block each time. The first pass should just be collecting info and not dirtying anything. Could it be side effect of setting some transaction visibility bits on first visit ? In that case It would be good, if we could disable doing that that for vacuum. > Seems like we could write approximately half the amount > of data that we do. > > Surely we can come up with a better plan than that one? ------------------- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability Training, Services and Support
On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote: > On Tue, 2008-10-28 at 08:49 +0000, Simon Riggs wrote: > > Looking at a VACUUM's WAL records makes me think twice about the way we > > issue a VACUUM. > > > > 1. First we scan the heap, issuing a HEAP2 clean record for every block > > that needs cleaning. > > IIRC the first heap pass just collects info and does nothing else. > Is this just an empty/do-nothing WAL record ? 8.3 changed that; it used to work that way. I guess I never looked at the amount of WAL being generated. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > I don't see a reason why we would issue 2 WAL records per block for a > VACUUM, nor why we would prune and remove in two steps, dirtying the > block each time. Seems like we could write approximately half the amount > of data that we do. > > Surely we can come up with a better plan than that one? This sounds like the same issue Pavan identified and proposed solving by rotating the three passes so that we do step 3 at the beginning of the next vacuum run, so it can be done in the same pass as step 1. To do that he proposed we do: 1. scan heap doing two things: a) remove any marked tuples if they were marked by a previous vacuum which committed andb) prune and mark any tuples we find are deletable for a future vacuum to remove. 2. scan indexes and remove the tuples we marked in 1b. The main blocking issue IIRC was: How to mark the tuples in a way which is safe if vacuum aborts. He suggested putting a vacuum xid in pg_class. Other suggestions were to mark the pages in the page header (which I thought made the most sense) or to put the xid in the line pointer (since nothing else needs to go there, the tuples are already cleaned). There might also have been a question of how to deal with the statistics. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
On Tue, 2008-10-28 at 10:59 +0000, Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > > I don't see a reason why we would issue 2 WAL records per block for a > > VACUUM, nor why we would prune and remove in two steps, dirtying the > > block each time. Seems like we could write approximately half the amount > > of data that we do. > > > > Surely we can come up with a better plan than that one? > > This sounds like the same issue Pavan identified and proposed solving by > rotating the three passes so that we do step 3 at the beginning of the next > vacuum run, so it can be done in the same pass as step 1. > > To do that he proposed we do: > > 1. scan heap doing two things: a) remove any marked tuples if they were marked > by a previous vacuum which committed and b) prune and mark any tuples we > find are deletable for a future vacuum to remove. > > 2. scan indexes and remove the tuples we marked in 1b. It's fairly hard to remove the second heap pass completely. I think what I am suggesting is two heap passes, but writing WAL and dirtying blocks on only one of the passes. The biggest I/O cost comes from the writes, not the reads, ISTM. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2008-10-28 at 10:59 +0000, Gregory Stark wrote: >> To do that he proposed we do: >> >> 1. scan heap doing two things: a) remove any marked tuples if they were marked >> by a previous vacuum which committed and b) prune and mark any tuples we >> find are deletable for a future vacuum to remove. >> >> 2. scan indexes and remove the tuples we marked in 1b. > > It's fairly hard to remove the second heap pass completely. > > I think what I am suggesting is two heap passes, but writing WAL and > dirtying blocks on only one of the passes. How small a patch would it be? I guess you just need to disable all pruning in the first pass and do it in the second patch? I would still rather see Pavan's optimizationo if we can do it cleanly, but if it's not going to happen and this is trivial then sure, we may as well. > > The biggest I/O cost comes from the writes, not the reads, ISTM. It's counter-intuitive but actually it's usually the other way around. Writes can be buffered, re-ordered, and scheduled during otherwise idle time. Reads however are always blocking. However in this situation I think you may be right. Vacuum is doing a sequential scan through the table so if the OS has to interrupt that scan to go do some writes it'll end up having to go back and forth. That would be a *lot* slower than just doing a sequential scan. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
On Tue, 2008-10-28 at 10:10 +0000, Simon Riggs wrote: > On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote: > > On Tue, 2008-10-28 at 08:49 +0000, Simon Riggs wrote: > > > Looking at a VACUUM's WAL records makes me think twice about the way we > > > issue a VACUUM. > > > > > > 1. First we scan the heap, issuing a HEAP2 clean record for every block > > > that needs cleaning. > > > > IIRC the first heap pass just collects info and does nothing else. > > Is this just an empty/do-nothing WAL record ? > > 8.3 changed that; it used to work that way. I guess I never looked at > the amount of WAL being generated. I can't see how it is safe to do anything more than just lookups on first pass. There will be dangling index pointers if the system crashes/is rebooted or the vacuum is just interrupted after cleaning some heap pages but before cleaning corresponding index pages. --------------- Hannu
Hannu Krosing wrote: > On Tue, 2008-10-28 at 10:10 +0000, Simon Riggs wrote: >> On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote: >>> On Tue, 2008-10-28 at 08:49 +0000, Simon Riggs wrote: >>>> Looking at a VACUUM's WAL records makes me think twice about the way we >>>> issue a VACUUM. >>>> >>>> 1. First we scan the heap, issuing a HEAP2 clean record for every block >>>> that needs cleaning. >>> IIRC the first heap pass just collects info and does nothing else. >>> Is this just an empty/do-nothing WAL record ? >> 8.3 changed that; it used to work that way. I guess I never looked at >> the amount of WAL being generated. > > I can't see how it is safe to do anything more than just lookups on > first pass. What's done in the first pass is the same HOT pruning that is done opportunistically on other page accesses as well. IIRC it's required for correctness, though I can't remember what exactly the issue was. I don't think the extra WAL volume is a problem; VACUUM doesn't generate much WAL, anyway. As for the extra data page writes it causes; yeah, that might cause some I/O that could be avoided, but remember that the first pass often dirties buffers anyway to set hint bits. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndQuadrant.com> writes: > I think what I am suggesting is two heap passes, but writing WAL and > dirtying blocks on only one of the passes. I think you've all forgotten about hint-bit setting. The assumption is that the first VACUUM pass is going to update a lot of hint bits and we might as well get some other work done with the same write. Now of course that doesn't necessarily entail a WAL write too, but it makes this less than a slam-dunk win. Also, I think that the reason the code ended up this way is that there were pretty severe difficulties in making the VACUUM code cope correctly with un-pruned tuples. Pavan might remember more about that. regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > Looking at a VACUUM's WAL records makes me think twice about the way we > issue a VACUUM. > 1. First we scan the heap, issuing a HEAP2 clean record for every block > that needs cleaning. > 2. Then we scan the index, issuing WAL records as appropriate. > 3. Then we rescan the heap, issuing a HEAP2 clean record for every > block. The first pass removes dead HOT tuples. The second pass removes dead normal tuples (it does NOT write "every block", only those with dead tuples). In principle the set of pages written in pass 1 might be completely disjoint from the set of pages written in pass 2 (though I admit that's probably not real likely). > Surely we can come up with a better plan than that one? Maybe, but it's not as bad as you're painting it. regards, tom lane
On Tue, 2008-10-28 at 14:28 +0200, Heikki Linnakangas wrote: > Hannu Krosing wrote: > > On Tue, 2008-10-28 at 10:10 +0000, Simon Riggs wrote: > >> On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote: > >>> On Tue, 2008-10-28 at 08:49 +0000, Simon Riggs wrote: > >>>> Looking at a VACUUM's WAL records makes me think twice about the way we > >>>> issue a VACUUM. > >>>> > >>>> 1. First we scan the heap, issuing a HEAP2 clean record for every block > >>>> that needs cleaning. > >>> IIRC the first heap pass just collects info and does nothing else. > >>> Is this just an empty/do-nothing WAL record ? > >> 8.3 changed that; it used to work that way. I guess I never looked at > >> the amount of WAL being generated. > > > > I can't see how it is safe to do anything more than just lookups on > > first pass. > > What's done in the first pass is the same HOT pruning that is done > opportunistically on other page accesses as well. IIRC it's required for > correctness, though I can't remember what exactly the issue was. Are you sure it is a correctness thing ? Maybe HOT pruning just happened to be in a path used by vacuum to read pages. > I don't think the extra WAL volume is a problem; Probably not ( unless you need to ship your WAL records via a very expensive network connection ). If it is a simple performance problem, then it can probably be fixed by just running VACUUM slower. > VACUUM doesn't generate > much WAL, anyway. As for the extra data page writes it causes; yeah, > that might cause some I/O that could be avoided, but remember that the > first pass often dirties buffers anyway to set hint bits. Still, can't we special-case HOT pruning and hint-bit change WAL-logging for first the pass of vacuum ? They both seem redundant in case of VACUUM. --------------- Hannu