Thread: Proposal for DROP TABLE rollback mechanism
(This is mostly directed at Vadim, but kibitzing is welcome.) Here's what I plan to do to make DROP TABLE rollbackable and clean up the handling of CREATE TABLE rollback. Comments? Overview: 1. smgrcreate() will create the file for the relation same as it does now, and will add the rel's RelFileNode information to an smgr-private list of rels created in the current transaction. 2. smgrunlink() normally will NOT immediately delete the file; instead it will perform smgrclose() and then add the rel's RelFileNode information to an smgr-private list of rels to be deleted at commit. However, if the file appears in the list created by smgrcreate() --- ie, the rel is being created and deleted in the same xact --- then we can delete it immediately. In this case we remove the file from the smgrcreate list and do not put it on the unlink list. 3. smgrcommit() will delete all the files mentioned in the list created by smgrunlink, then discard both lists. 4. smgrabort() will delete all the files mentioned in the list created by smgrcreate, then discard both lists. Points 1 and 4 will replace the existing relcache-based mechanism for deleting files created in the current xact when the xact aborts. Various details: To support deleting files at xact commit/abort, we will need something like an "mdblindunlink" entrypoint to md.c. I am inclined to simply redefine mdunlink to take a RelFileNode instead of a complete Relation, rather than supporting two entrypoints --- I don't think there'll be any future use for the existing mdunlink. Objections? bufmgr.c's ReleaseRelationBuffers drops any dirty buffers for the target rel, and therefore it must NOT be called inside the transaction (else, rollback would mean we'd lost data). I am inclined to let it continue to behave that way, but to call it from smgrcommit/smgrabort, not from anywhere else. This would mean changing its API to take a RelFileNode, but that seems no big problem. This way, dirty buffers for a doomed relation will be allowed to live until transaction commit, in the hopes that we will be able to discard them unwritten. Will remove notices in DROP TABLE etc. warning that these operations are not rollbackable. Note that CREATE/DROP DATABASE is still not rollback-able, and so those two ops will continue to elog(ERROR) when called in a transaction block. Ditto for VACUUM; probably also ditto for REINDEX, though I haven't looked closely at that yet. The temp table name mapper will need to be modified so that it can undo all current-xact changes to its name mapping list at xact abort. Currently I think it only handles undoing additions, not deletions/renames. This does not need to be WAL-aware, does it? WAL: AFAICS, things will behave properly if calls to smgrcreate/smgrunlink are logged as WAL events. For redo, they are executed just the same as normal, except they shouldn't complain if the target file already exists (or already doesn't exist, for unlink). Undo of smgrcreate is just immediate mdunlink; undo of smgrunlink is a no-op. I have not studied the WAL code enough to be prepared to add the logging/undo/redo code, and it looks like you haven't implemented that anyway yet for smgr.c, so I will leave that part to you, OK? regards, tom lane
> 1. smgrcreate() will create the file for the relation same as it does now, > and will add the rel's RelFileNode information to an smgr-private list of > rels created in the current transaction. > > 2. smgrunlink() normally will NOT immediately delete the file; instead it > will perform smgrclose() and then add the rel's RelFileNode information to ^^^^^^^^^^^^^^^ Seems that smgrclose still need in Relation (where rd_fd lives and this is bad) and you're going to put just file node to smgrunlink (below). Shouldn't relation be removed from relcache (and smgrclose called from there) before smgrunlink? > an smgr-private list of rels to be deleted at commit. However, if the > file appears in the list created by smgrcreate() --- ie, the rel is being > created and deleted in the same xact --- then we can delete it > immediately. In this case we remove the file from the smgrcreate list > and do not put it on the unlink list. (This wouldn't work for savepoints but ok for now.) > 3. smgrcommit() will delete all the files mentioned in the list created > by smgrunlink, then discard both lists. > > 4. smgrabort() will delete all the files mentioned in the list created > by smgrcreate, then discard both lists. > > Points 1 and 4 will replace the existing relcache-based mechanism for > deleting files created in the current xact when the xact aborts. > > > Various details: > > To support deleting files at xact commit/abort, we will need something > like an "mdblindunlink" entrypoint to md.c. I am inclined to simply > redefine mdunlink to take a RelFileNode instead of a complete Relation, > rather than supporting two entrypoints --- I don't think there'll be any > future use for the existing mdunlink. Objections? No one. Actually I would like to see all smgr entries taking RelFileNode instead of Relation. This would require smgr cache to map nodes to fd-es. Having this we could get rid of all blind smgr entries: smgropen would put file node & fd into smgr cache and all other smgrmethods would act as blind ones do now - no file node found in cache then open file, performe op, close file. But probably it's too late to implement this now. > bufmgr.c's ReleaseRelationBuffers drops any dirty buffers for the target > rel, and therefore it must NOT be called inside the transaction (else, > rollback would mean we'd lost data). I am inclined to let it continue > to behave that way, but to call it from smgrcommit/smgrabort, not from > anywhere else. This would mean changing its API to take a RelFileNode, > but that seems no big problem. This way, dirty buffers for a doomed > relation will be allowed to live until transaction commit, in the hopes > that we will be able to discard them unwritten. Mmmm, why not call FlushRelationBuffers? Calling bufmgr from smgr doesn't look like right thing. ? > Will remove notices in DROP TABLE etc. warning that these operations > are not rollbackable. Note that CREATE/DROP DATABASE is still not > rollback-able, and so those two ops will continue to elog(ERROR) when > called in a transaction block. Ditto for VACUUM; probably also ditto > for REINDEX, though I haven't looked closely at that yet. > > The temp table name mapper will need to be modified so that it can > undo all current-xact changes to its name mapping list at xact abort. > Currently I think it only handles undoing additions, not > deletions/renames. This does not need to be WAL-aware, does it? > > > WAL: > > AFAICS, things will behave properly if calls to smgrcreate/smgrunlink > are logged as WAL events. For redo, they are executed just the same Yes, there will be logging for smgrcreate, but not for smgrunlink because of we'll make real unlinking after commit. All what is needed for WAL is list of file nodes to remove - I need to put this list into commit log record to ensure that files are removed on redo and need in ability to remove file immediately in this case. > as normal, except they shouldn't complain if the target file already > exists (or already doesn't exist, for unlink). Undo of smgrcreate > is just immediate mdunlink; undo of smgrunlink is a no-op. > > I have not studied the WAL code enough to be prepared to add the > logging/undo/redo code, and it looks like you haven't implemented that > anyway yet for smgr.c, so I will leave that part to you, OK? Unfortunately, there will be no undo in 7.1 -:( I've found that to undo index updates we would need in either compensation records or in xmin/cmin in index tuples. So, we'll still live with dust in storage -:( Redo is much easy. Vadim
"Vadim Mikheev" <vmikheev@sectorbase.com> writes: >> 2. smgrunlink() normally will NOT immediately delete the file; instead it >> will perform smgrclose() and then add the rel's RelFileNode information to > ^^^^^^^^^^^^^^^ > Seems that smgrclose still need in Relation (where rd_fd lives and this is > bad) and you're going to put just file node to smgrunlink (below). Shouldn't > relation be removed from relcache (and smgrclose called from there) > before smgrunlink? No, the way the existing higher-level code works is first to call smgrunlink (NOT smgrclose) and then remove the relcache entry. I don't see a need to change that. In the interval where we're waiting to commit, there will be no relcache entry, only a RelFileNode value sitting in smgr's list of files to delete at commit. > Actually I would like to see all smgr entries taking RelFileNode > instead of Relation. I agree, but I think that's a project for a future release. Not enough time for it for 7.1. >> bufmgr.c's ReleaseRelationBuffers drops any dirty buffers for the target >> rel, and therefore it must NOT be called inside the transaction (else, >> rollback would mean we'd lost data). I am inclined to let it continue >> to behave that way, but to call it from smgrcommit/smgrabort, not from >> anywhere else. This would mean changing its API to take a RelFileNode, >> but that seems no big problem. This way, dirty buffers for a doomed >> relation will be allowed to live until transaction commit, in the hopes >> that we will be able to discard them unwritten. > Mmmm, why not call FlushRelationBuffers? Calling bufmgr from smgr > doesn't look like right thing. ? Yes, it's a little bit ugly, but if we call FlushRelationBuffers then we will likely be doing some useless writes (to flush out pages that we are only going to throw away anyway). If we leave the buffers alone till commit, then we'll only write out pages if we need to recycle a buffer for another use during that transaction. Also, I don't feel comfortable with the idea of doing FlushRelationBuffers mid-transaction and then relying on the buffer cache to still be empty of pages for that relation much later on when we finally commit. Sure, it *should* be empty, but I'll be happier if we flush the buffer cache immediately before deleting the file. What might make sense is to make a pass over the buffer cache at the time of DROP (inside the transaction) to make sure there are no pinned buffers for the rel --- if so, we want to elog() during the transaction not after commit. We could also release any non-dirty buffers at that point. Then after commit we know we don't care about the dirty buffers anymore, so we come back and discard them. regards, tom lane
> > Mmmm, why not call FlushRelationBuffers? Calling bufmgr from smgr > > doesn't look like right thing. ? > > Yes, it's a little bit ugly, but if we call FlushRelationBuffers then we > will likely be doing some useless writes (to flush out pages that we are > only going to throw away anyway). If we leave the buffers alone till > commit, then we'll only write out pages if we need to recycle a buffer > for another use during that transaction. BTW, why do we force buffers to disk in FlushRelationBuffers at all? Seems all what is required is to flush them *from* pool, not *to* disk immediately. In WAL bufmgr version (temporary in xlog_bufmgr.c) I've changed this. Actually I wouldn't worry about these writes at all - drop relation is rare case, - but ok. > Also, I don't feel comfortable with the idea of doing > FlushRelationBuffers mid-transaction and then relying on the buffer > cache to still be empty of pages for that relation much later on when > we finally commit. Sure, it *should* be empty, but I'll be happier > if we flush the buffer cache immediately before deleting the file. (It *must* be empty -:) Sorry, I don't understand "*immediately* before" in multi-user environment. Relation must be excl locked and this must be only guarantee for us, not time of doing anything with this relation.) > What might make sense is to make a pass over the buffer cache at the > time of DROP (inside the transaction) to make sure there are no pinned > buffers for the rel --- if so, we want to elog() during the transaction > not after commit. We could also release any non-dirty buffers at > that point. Then after commit we know we don't care about the dirty > buffers anymore, so we come back and discard them. Please note that there is xlog_bufmgr.c If you'll add/change something in bufmgr please let me know later. Vadim
"Vadim Mikheev" <vmikheev@sectorbase.com> writes: > BTW, why do we force buffers to disk in FlushRelationBuffers at all? > Seems all what is required is to flush them *from* pool, not *to* disk > immediately. Good point. Seems like it'd be sufficient to do a standard async write rather than write + fsync. We'd still need some additional logic at commit time, however, because we want to make sure there are no BufferDirtiedByMe bits set for the file we're about to delete... >> Also, I don't feel comfortable with the idea of doing >> FlushRelationBuffers mid-transaction and then relying on the buffer >> cache to still be empty of pages for that relation much later on when >> we finally commit. Sure, it *should* be empty, but I'll be happier >> if we flush the buffer cache immediately before deleting the file. > (It *must* be empty -:) In the absence of bugs, yes, it must be empty for the case where we are committing a DROP. But I want a safety check to catch those bugs. Besides, what of the case where we are aborting a CREATE? There must be a bufmgr call that will allow us to ensure there are no buffers left for the relation we intend to delete. How about this: change FlushRelationBuffers so that it does standard async writes for dirty buffers and then removes all the rel's buffers from the pool. This is invoked inside the transaction, same as now. Make DROP TABLE call this routine, *not* RemoveRelationBuffers. Then call RemoveRelationBuffers from smgr during transaction commit or abort. In the commit case, there really shouldn't be any buffers for the rel, so we can emit an elog NOTICE (it's too late for ERROR, no?) if we find any. But in the abort case, we'd not be surprised to find buffers, even dirty buffers, and we just want to throw 'em away. This would also be the place to clean out the BufferDirtiedByMe state. If you don't like the smgr->bufmgr call, the reason for that is we're asking smgr to keep the list of pending relation deletes. We could make a new module, logically above smgr and bufmgr, to keep that list instead. But I don't think it's worth the trouble. > Please note that there is xlog_bufmgr.c If you'll add/change something in > bufmgr please let me know later. Will keep you informed. regards, tom lane
> > BTW, why do we force buffers to disk in FlushRelationBuffers at all? > > Seems all what is required is to flush them *from* pool, not *to* disk > > immediately. > > Good point. Seems like it'd be sufficient to do a standard async write > rather than write + fsync. > > We'd still need some additional logic at commit time, however, because > we want to make sure there are no BufferDirtiedByMe bits set for the > file we're about to delete... Note that there is no BufferDirtiedByMe in WAL bufmgr version. > How about this: change FlushRelationBuffers so that it does standard > async writes for dirty buffers and then removes all the rel's buffers ^^^^^^^^^^^^^^^^^^^^^^^^^ When it's used from vacuum no reason to do this. > from the pool. This is invoked inside the transaction, same as now. > Make DROP TABLE call this routine, *not* RemoveRelationBuffers. > Then call RemoveRelationBuffers from smgr during transaction commit or > abort. In the commit case, there really shouldn't be any buffers for > the rel, so we can emit an elog NOTICE (it's too late for ERROR, no?) ^^^^^^^^^^^^^^^^^^^^^ Sure, but good time for Assert -:) > if we find any. But in the abort case, we'd not be surprised to find > buffers, even dirty buffers, and we just want to throw 'em away. This > would also be the place to clean out the BufferDirtiedByMe state. BTW, currently smgrcommit is called twice on commit *before* xact status in pg_log updated and so you can't use it to remove files. In WAL version smgrcommit isn't called at all now but we easy can add smgrcommit call after commit record is logged. Vadim
"Vadim Mikheev" <vmikheev@sectorbase.com> writes: > Please note that there is xlog_bufmgr.c If you'll add/change something in > bufmgr please let me know later. Per your request: I've changed bufmgr.c. I think I made appropriate changes in xlog_bufmgr, but please check. The changes were: 1. Modify FlushRelationBuffers to do plain write, not flush (no fsync) of dirty buffers. This was per your suggestion. FlushBuffer() now takes an extra parameter indicating whether fsync is wanted. I think this change does not affect xlog_bufmgr at all. 2. Rename ReleaseRelationBuffers to DropRelationBuffers to make it more clear what it's doing. 3. Add a DropRelFileNodeBuffers, which is just like DropRelationBuffers except it takes a RelFileNode argument. This is used by smgr to ensure that the buffer cache is clear of buffers for a rel about to be deleted. 4. Update comments about usage of DropRelationBuffers and FlushRelationBuffers. Rollback of DROP TABLE now works in non-WAL code, and seems to work in WAL code too. I did not add WAL logging, because I'm not quite sure what to do, so rollforward probably does the wrong thing. Could you deal with that part? smgr.c is the place that keeps the list of what to delete at commit or abort. regards, tom lane
> Rollback of DROP TABLE now works in non-WAL code, and seems to work in > WAL code too. I did not add WAL logging, because I'm not quite sure > what to do, so rollforward probably does the wrong thing. Could you > deal with that part? smgr.c is the place that keeps the list of what > to delete at commit or abort. Ok, thanks! I'll take list of relfilenodes to delete just before commit and put it into commit record. Vadim