Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Robert Haas
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CA+Tgmoamn-tqf2tStdR8Y6A_G+9KDFTGWdOv3KOLDj6vO-L0cg@mail.gmail.com
In response to Re: POC: Cleaning up orphaned files using undo logs  (Dilip Kumar <dilipbalaut@gmail.com>)
On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> The idea behind having the loop inside the undo machinery was that
> while traversing the blkprev chain, we can read all the undo records
> on the same undo page under one buffer lock.

That's not a bad goal, although invoking a user-supplied callback
while holding a buffer lock is a little scary.  If we stick with that,
it had better be clearly documented.  Perhaps worth noting:
ReadBuffer() is noticeably more expensive in terms of CPU consumption
than LockBuffer().  So it may work out that the win is in keeping a
pin to avoid repeating ReadBuffer(), and that retaking the buffer
locks matters less.  But I'm not sure: avoiding buffer locks is
clearly good, too.

I have a couple of observations after poking at this some more.  One
is that it's not necessarily adequate to have an interface that
iterates backward through the undo chain until the callback returns
true.  Here's an example: suppose you want to walk backward through
the undo chain until you find the first undo record that corresponds
to a change you can see, and then return the undo record immediately
prior to that.  zheap doesn't really need this, because it (mostly?)
stores the XID of the next record we're going to look up in the
current record, and the XID of the first record we're going to look up
in the chain -- so it can tell from the record it's found so far
whether it should bother looking up the next record. That, however,
would not necessarily be true for some other AM.

Suppose you just store an undo pointer in the tuple header, as Heikki
proposed to do for zedstore.  Suppose further that each record has the
undo pointer for the previous record that modified that TID but not
necessarily the XID.  Then imagine a TID where we do an insert and a
bunch of in-place updates.  Then, a scan comes along with an old
snapshot.  It seems to me that what you need to do is walk backward
through the chain of undo records until you see one that has an XID
that is visible to your snapshot, and then the version of the tuple
that you want is in the payload of the next-newer undo record. So what
you want to do is something like this:

look up the undo pointer in the tuple; call that the current undo record.
loop:
- look up the undo pointer in the current undo record; call that the
  previous undo record.
- if the XID from the previous undo record is visible, then stop; use
  the tuple version from the current undo record.
- otherwise, release the current undo record and let the previous undo
  record become the new current undo record.
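In toy form, that loop might look like the following.  This is just a
sketch: the in-memory array stands in for the undo log, the struct
fields and function names are all made up, and a single XID cutoff
stands in for real snapshot visibility.  The one real point it
illustrates is that the walk inspects the *previous* record's XID
while returning the payload of the *current* (next-newer) one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory model of an undo chain.  Each record stores the
 * index of the previous undo record for the same TID (-1 at the start of
 * the chain), the XID that wrote it, and the pre-change tuple version it
 * carries as its payload. */
typedef struct UndoRec
{
	int			prev;		/* previous undo record, -1 if none */
	uint32_t	xid;		/* transaction that wrote this record */
	int			payload;	/* stand-in for the pre-image tuple version */
} UndoRec;

/* Snapshot visibility reduced to a single cutoff for illustration:
 * XIDs below the cutoff are visible. */
static bool
xid_visible(uint32_t xid, uint32_t snapshot_xmin)
{
	return xid < snapshot_xmin;
}

/* Walk backward from the record named in the tuple header.  When the
 * previous record's XID is visible, the version we want is the pre-image
 * carried by the current, next-newer record.  Returns -1 if the chain is
 * exhausted, i.e. no version is visible to this snapshot. */
static int
fetch_visible_version(const UndoRec *undo, int cur, uint32_t snapshot_xmin)
{
	for (;;)
	{
		int		prev = undo[cur].prev;

		if (prev < 0)
			return -1;					/* ran out of undo */
		if (xid_visible(undo[prev].xid, snapshot_xmin))
			return undo[cur].payload;	/* next-newer record holds it */
		cur = prev;						/* keep walking back */
	}
}
```

Note that at every step the walk needs two records in hand, the current
one and the previous one, which is exactly what the loop-until-true
callback interface can't express.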

I'm not sure if this is actually a reasonable design from a
performance point of view, but it doesn't sound totally crazy, and
it's just to illustrate the point that there might be cases that are
too complicated for a loop-until-true model.  In this kind of loop, at
any given time you are holding onto two undo records, working your way
back through the undo log, and you just can't make this work with the
UndoFetchRecord callback interface. Possibly you could have a context
object that could hold onto one or a few buffer pins:

BeginUndoFetch(&cxt);
uur = UndoFetchRecord(&cxt, urp); // maybe do this a bunch of times
FinishUndoFetch(&cxt);

...but I'm not sure if that's exactly what we want either. Still, if
there is a significant savings from avoiding repinning and relocking
the buffer, we want to make it easy for people to get that advantage
as often as possible.

Another point that has occurred to me is that it is probably
impossible to avoid a fairly large number of duplicate undo fetches.
For instance, suppose somebody runs an UPDATE on a tuple that has been
recently updated.  The tuple_update method just gets a TID + snapshot,
so the AM basically has to go look up the tuple all over again,
including checking whether the latest version of the tuple is the one
visible to our snapshot.  So that means repinning and relocking the
same buffers and decoding the same undo record all over again.  I'm
not exactly sure what to do about this, but it seems like a potential
performance problem. I wonder if it's feasible to cache undo lookups
so that in common cases we can just reuse the result of a previous
lookup instead of doing a new one, and I wonder whether it's possible
to make that fast enough that it actually helps...
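Just to pin down what "cache undo lookups" might mean, here is a toy
direct-mapped cache keyed by undo record pointer.  The sizes, names,
and the fake "decode" step are all placeholders, and a real version
would need an invalidation story; the sketch only shows the shape of
skipping the pin/lock/decode path on a repeated lookup:

```c
#include <assert.h>
#include <stdint.h>

/* Toy direct-mapped cache of decoded undo records, keyed by undo record
 * pointer, illustrating reuse of a recent lookup instead of repinning,
 * relocking, and re-decoding. */
#define UNDO_CACHE_SLOTS 16

typedef struct CachedUndo
{
	uint64_t	urp;		/* undo record pointer, 0 = empty slot */
	int			payload;	/* stand-in for the decoded record */
} CachedUndo;

static CachedUndo undo_cache[UNDO_CACHE_SLOTS];
static int	slow_fetches;	/* counts full (pin + lock + decode) fetches */

/* The expensive path the cache is meant to avoid. */
static int
fetch_undo_slow(uint64_t urp)
{
	slow_fetches++;
	return (int) (urp * 7 % 1000);	/* fake decoded contents */
}

static int
fetch_undo_cached(uint64_t urp)
{
	CachedUndo *slot = &undo_cache[urp % UNDO_CACHE_SLOTS];

	if (slot->urp != urp)
	{
		slot->payload = fetch_undo_slow(urp);
		slot->urp = urp;
	}
	return slot->payload;
}
```

The open question from above still stands: whether the hit rate in the
update-a-recently-updated-tuple case is high enough to pay for the
bookkeeping.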

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
