Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | CA+TgmoZNkkyOGOoQTsjWbptywHG58=YOn9XyBxLynq+JONo96g@mail.gmail.com Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
Re: POC: Cleaning up orphaned files using undo logs |
List | pgsql-hackers |
On Sun, May 12, 2019 at 2:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have removed some of the globals and also improved some comments. I don't like the discard_lock very much. Perhaps it's OK, but I hope that there are better alternatives. One problem with Thomas Munro pointed out to me in off-list discussion is that the discard_lock has to be held by anyone reading undo even if the undo they are reading and the undo that the discard worker wants to discard are in completely different parts of the undo log. Somebody could be trying to read an undo page written 1 second ago while the discard worker is trying to discard an undo page written to the same undo log 1 hour ago. Those things need not block each other, but with this design they will. Another problem is that we end up holding it across an I/O; there's precedent for that, but it's not particularly good precedent. Let's see if we can do better. My first idea was that we should just make this the caller's problem instead of handling it in this layer. Undo is retained for committed transactions until they are all-visible, and the reason for that is that we presume that nobody can be interested in the data for MVCC purposes unless there's a snapshot that can't see the results of the transaction in question. Once the committed transaction is all-visible, that's nobody, so it should be fine to just discard the undo any time we like. That won't work with the existing zheap code, which currently sometimes follows undo chains for transactions that are all-visible, but I think that's a problem we should fix rather than something we should force the undo layer to support. We'd still need something kinda like the discard_lock for aborted transactions, though, because as soon as you release the buffer lock on a table page, the undo workers could apply all the undo to that page and then discard it, and then you could afterwards try to look up the undo pointer which you had retrieved from that page and stored in backend-local memory. One thing we could probably do is make that a heavyweight lock on the XID itself, so if you observe that an XID is aborted, you have to go get this lock in ShareLock mode, then recheck the page, and only then consult the undo; discarding the undo for an aborted transaction would require AccessExclusiveLock on the XID. This solution gets rid of the LWLock for committed undo; for aborted undo, it avoids the false sharing and non-interruptibility that an LWLock imposes. But then I had what I think may be a better idea. Let's add a new ReadBufferMode that suppresses the actual I/O; if the buffer is not already present in shared_buffers, it allocates a buffer but returns it without doing any I/O, so the caller must be prepared for BM_VALID to be unset. I don't know what to call this, so I'll call it RBM_ALLOCATE (leaving room for possible future variants like RBM_ALLOCATE_AND_LOCK). Then, the protocol for reading an undo buffer would go like this: 1. Read the buffer with RBM_ALLOCATE, thus acquiring a pin on the relevant buffer. 2. Check whether the buffer precedes the discard horizon for that undo log stored in shared memory. 3. If so, use the ForgetBuffer() code we have in the zheap branch to deallocate the buffer and stop here. The undo is not available to be read, whether it's still physically present or not. 4. Otherwise, if the buffer is not valid, call ReadBufferExtended again, or some new function, to make it so. Remember to release all of our pins. The protocol for discarding an undo buffer would go like this: 1. Advance the discard horizon in shared memory. 2. Take a cleanup lock on each buffer that ought to be discarded. Remember the dirty ones and forget the others. 3. WAL-log the discard operation. 4. Revisit the dirty buffers we remembered in step 2 and forget them. The idea is that, once we've advanced the discard horizon in shared memory, any readers that come along later are responsible for making sure that they never do I/O on any older undo. They may create some invalid buffers in shared memory, but they'll hopefully also get rid of them if they do, and if they error out for some reason before doing so, that buffer should age out naturally. So, the discard worker just needs to worry about buffers that already exist. Once it's taken a cleanup lock on each buffer, it knows that there are no I/O operations and in fact no buffer usage of any kind still in progress from before it moved the in-memory discard horizon. Anyone new that comes along will clean up after themselves. We postpone forgetting dirty buffers until after we've successfully WAL-logged the discard, in case we fail to do so. With this design, we don't add any new cases where a lock of any kind must be held across an I/O, and there's also no false sharing. Furthermore, unlike the previous proposal, this will work nicely with something like old_snapshot_threshold. The previous design relies on undo not getting discarded while anyone still cares about it, but old_snapshot_threshold, if applied to zheap, would have the express goal of discarding undo while somebody still cares about it. With this design, we could support old_snapshot_threshold by having undo readers error out in step #2 if the transaction is committed and not visible to our snapshot but yet the undo is discarded. Heck, we can do that anyway as a safety check, basically for free, and just tailor the error message depending on whether old_snapshot_threshold is such that the condition is expected to be possible. While I'm kvetching, I can't help noticing that undoinsert.c contains functions both for inserting undo and also for reading it, which seems like a loose end that needs to be tied up somehow. I'm mildly inclined to think that we should rename the file to something more generic (e.g. undoaccess.h) rather than splitting it into two files (e.g. undoinsert.c and undoread.c). Also, it looks to me like you need to go through what is currently undoinsert.h and look for stuff that can be made private to the .c file. I don't see why thing like MAX_PREPARED_UNDO need to be exposed at all, and for things like PreparedUndoSpace it seems like it would suffice to just do 'struct PreparedUndoSpace; typedef struct PreparedUndoSpace PreparedUndoSpace;' in the header and put the actual 'struct PreparedUndoSpace { ... };' definition in the .c file. And UnlockReleaseUndoBuffers has a declaration but no longer has a definition, so I think that can go away too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: