Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CA+Tgmoamn-tqf2tStdR8Y6A_G+9KDFTGWdOv3KOLDj6vO-L0cg@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers

On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> The idea behind having the loop inside the undo machinery was that
> while traversing the blkprev chain, we can read all the undo records
> on the same undo page under one buffer lock.

That's not a bad goal, although invoking a user-supplied callback while holding a buffer lock is a little scary. If we stick with that, it had better be clearly documented. It's perhaps worth noting that ReadBuffer() is noticeably more expensive in terms of CPU consumption than LockBuffer(), so it may work out that we keep a pin to avoid redoing that work but worry less about retaking the buffer locks. But I'm not sure: avoiding buffer locks is clearly good, too.

I have a couple of observations after poking at this some more. One is that an interface that iterates backward through the undo chain until the callback returns true is not necessarily adequate. Here's an example: suppose you want to walk backward through the undo chain until you find the first undo record that corresponds to a change you can see, and then return the undo record immediately prior to that. zheap doesn't really need this, because it (mostly?) stores in the current record the XID of the next record we're going to look up, along with the XID of the first record we're going to look up in the chain -- so it can tell from the record it's found so far whether it should bother looking up the next record. That, however, would not necessarily be true for some other AM.

Suppose you just store an undo pointer in the tuple header, as Heikki proposed to do for zedstore. Suppose further that each record has the undo pointer for the previous record that modified that TID but not necessarily the XID. Then imagine a TID where we do an insert and a bunch of in-place updates, and then a scan comes along with an old snapshot. It seems to me that what you need to do is walk backward through the chain of undo records until you see one that has an XID which is visible to your snapshot; the version of the tuple that you want is then in the payload of the next-newer undo record. So what you want to do is something like this:

    look up the undo pointer in the tuple; call that the current undo record
    loop:
    - look up the undo pointer in the current undo record; call that the
      previous undo record
    - if the XID from the previous undo record is visible to the snapshot,
      stop; use the tuple version from the current undo record
    - otherwise, release the current undo record and let the previous undo
      record become the new current undo record

I'm not sure if this is actually a reasonable design from a performance point of view, but it doesn't sound totally crazy, and it does illustrate the point that there might be cases that are too complicated for a loop-until-true model. In this kind of loop, at any given time you are holding onto two undo records while working your way back through the undo log, and you just can't make that work with the UndoFetchRecord callback interface. Possibly you could have a context object that can hold onto one or a few buffer pins:

    BeginUndoFetch(&cxt);
    uur = UndoFetchRecord(&cxt, urp); // maybe do this a bunch of times
    FinishUndoFetch(&cxt);

...but I'm not sure if that's exactly what we want either. Still, if there is a significant savings from avoiding repinning and relocking the buffer, we want to make it easy for people to get that advantage as often as possible.
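To make that concrete, here's a rough C sketch of the walk-back loop on top of a context interface like the one above. Every name here (UndoRecPtr, UnpackedUndoRecord, the uur_* fields, UndoRecordRelease, xid_visible_in_snapshot, tuple_version_from_undo) is a placeholder of my own invention rather than the actual patch API; the point is only to show the shape of a caller that needs two records in hand at once:

    /*
     * Hypothetical sketch, not the patch API: walk backward through the
     * undo chain for a tuple until we find a record whose XID is visible
     * to the given snapshot, then return the tuple version stored in the
     * next-newer record.  Note that two undo records are held at a time.
     */
    TupleVersion *
    fetch_visible_version(UndoRecPtr tuple_urp, Snapshot snapshot)
    {
        UndoFetchContext cxt;
        UnpackedUndoRecord *current;
        TupleVersion *result = NULL;

        BeginUndoFetch(&cxt);

        /* Look up the undo pointer in the tuple: the current undo record. */
        current = UndoFetchRecord(&cxt, tuple_urp);

        while (current != NULL)
        {
            /* Follow the chain to the previous undo record. */
            UnpackedUndoRecord *previous =
                UndoFetchRecord(&cxt, current->uur_prevundo);

            if (previous == NULL ||
                xid_visible_in_snapshot(previous->uur_xid, snapshot))
            {
                /*
                 * The previous change is visible (or we hit the end of the
                 * chain), so the version we want is the one in the current
                 * record.  Copy it out before releasing anything.
                 */
                result = tuple_version_from_undo(current);
                if (previous != NULL)
                    UndoRecordRelease(previous);
                break;
            }

            /* Not visible yet: step back one record and keep going. */
            UndoRecordRelease(current);
            current = previous;
        }

        if (current != NULL)
            UndoRecordRelease(current);
        FinishUndoFetch(&cxt);
        return result;
    }

One consequence is that the context would have to be able to hold pins for both records simultaneously, so "one or a few buffer pins" probably needs to mean at least two.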
Another point that has occurred to me is that it is probably impossible to avoid a fairly large number of duplicate undo fetches. For instance, suppose somebody runs an UPDATE on a tuple that has been recently updated. The tuple_update method just gets a TID + snapshot, so the AM basically has to go look up the tuple all over again, including checking whether the latest version of the tuple is the one visible to our snapshot. So that means repinning and relocking the same buffers and decoding the same undo record all over again. I'm not exactly sure what to do about this, but it seems like a potential performance problem. I wonder if it's feasible to cache undo lookups so that in common cases we can just reuse the result of a previous lookup instead of doing a new one, and I wonder whether it's possible to make that fast enough that it actually helps...
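For illustration only, the sort of thing I have in mind is a small per-backend cache keyed by undo record pointer, consulted before doing a real fetch. All of the names below are made up (CopyUndoRecord, FreeUndoRecordCopy, and the wrapper itself), and this deliberately ignores the hard part, namely invalidating entries when the undo they describe is discarded:

    #define UNDO_FETCH_CACHE_SIZE 64    /* small, per-backend, direct-mapped */

    typedef struct UndoFetchCacheEntry
    {
        UndoRecPtr  urp;                /* which undo record this entry holds */
        UnpackedUndoRecord *record;     /* decoded copy owned by the cache */
        bool        valid;
    } UndoFetchCacheEntry;

    static UndoFetchCacheEntry undo_fetch_cache[UNDO_FETCH_CACHE_SIZE];

    /*
     * Hypothetical wrapper around UndoFetchRecord: on a hit we skip the
     * pin, lock, and decode steps entirely; on a miss we fetch, stash a
     * deep copy, and drop the pin immediately.  Callers never release
     * what this returns; the cache owns it.
     */
    static UnpackedUndoRecord *
    CachedUndoFetchRecord(UndoFetchContext *cxt, UndoRecPtr urp)
    {
        UndoFetchCacheEntry *entry =
            &undo_fetch_cache[urp % UNDO_FETCH_CACHE_SIZE];
        UnpackedUndoRecord *uur;

        if (entry->valid && entry->urp == urp)
            return entry->record;       /* hit: reuse the previous lookup */

        uur = UndoFetchRecord(cxt, urp);    /* miss: do the real work */
        if (uur == NULL)
            return NULL;

        if (entry->valid)
            FreeUndoRecordCopy(entry->record);  /* evict the old occupant */
        entry->record = CopyUndoRecord(uur);    /* hypothetical deep copy */
        entry->urp = urp;
        entry->valid = true;

        UndoRecordRelease(uur);         /* the copy lets us drop the pin */
        return entry->record;
    }

Whether the lookup and copying can be made cheap enough to beat just repinning the buffer is, of course, exactly the open question.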
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company