Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Robert Haas
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CA+TgmobvqZoTX9XHkw45c-VbbY6PLD+HNS4vw8yyrmy5gBidJA@mail.gmail.com
Whole thread Raw
In response to Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: POC: Cleaning up orphaned files using undo logs
List pgsql-hackers
On Mon, May 27, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Apart from fixing the above comments, the patch is rebased on latest
> undo patchset.  As of now, I have split the binaryheap.c changes into
> a separate patch.  We are stilll enhancing the patch to compute
> oldestXidHavingUnappliedUndo which touches various parts of patch, so
> splitting further without completing that can make it a bit difficult
> to work on that.

Some review comments around execute_undo_actions:

The 'nopartial' argument to execute_undo_actions is confusing.  First,
it would probably be worth spelling it out instead of abbreviation:
not_partial_transaction rather than nopartial.  Second, it is usually
better to phrase parameter names in terms of what they are rather than
in terms of what they are not: complete_transaction rather than
not_partial_transaction.  Third, it's unclear from these comments why
we'd be undoing something other than a complete transaction.  It looks
as though the answer is that this flag will be false when we're
undoing a subxact -- in which case, why not invert the sense of the
flag and call it 'bool subxact'?  I might be wrong, but it seems like
that would be a whole lot clearer. Fourth, the block at the top of
this function, guarded by nopartial, seems like it must be vulnerable
to race conditions.  If we're undoing the complete transaction, then
it checks whether UndoFetchRecord() still returns anything.  But that
could change not just at the beginning of the function, but also any
time in the middle, or so it seems to me.  I doubt that this is the
right level at which to deal with this sort of interlocking. I think
there should be some higher-level mechanism that prevents two
processes from trying to undo the same transaction at the same time,
like a heavyweight lock or some kind of flag in the shared memory data
structure that keeps track of pending undo, so that we never even
reach this code unless we know that this XID needs undo work and no
other process is already doing it.  If you're the only one undoing XID
123456, then there shouldn't be any chance of the undo disappearing
from underneath you.  And we definitely want to guarantee that only
one process is undoing any given XID at a time.

The 'blk_chain_complete' variable which is set in this function and
passed down to execute_undo_actions_page() and then to the rmgr's
rm_undo callback also looks problematic.  First, not every AM that
uses undo may even have the notion of a 'block chain'; zedstore for
example uses TIDs as a 48-bit integer, not a block + offset number, so
it's really not going to have a 'block chain.'  Second, even in
zheap's usage, it seems to me that the block chain could be complete
even when this gets set to false. It gets set to true when we're
undoing a toplevel transaction (not a subxact) and we were able to
fetch all of the undo for that toplevel transaction. But even if
that's not true, the chain for an individual block could still be
complete, because all the remaining undo for the block at issue
might've been in the chunk of undo we already read; the remaining undo
could be for other blocks.  For that reason, I can't see how the zheap
code that relies on this value can be correct; it uses this value to
decide whether to stick zeroes in the transaction slot, but if the
scenario described above happened, then I suppose the XID would not
get cleared from the slot during undo.  Maybe zheap is just relying on
that being harmless, since if all of the undo actions have been
correctly executed for the page, the fact that the transaction slot is
still bogusly used by an aborted xact won't matter; nothing will refer
to it. However, it seems to me that it would be better for zheap to
set things up so that the first undo record for a particular txn/page
combination is flagged in some way (in the payload!) so that undo can
zero the slot if the action being undone is the one that claimed the
slot.  That seems cleaner on principle, and it also avoids having
supposedly AM-independent code pass down details that are driven by
zheap's particular needs.

While it's probably moot since I think this code should go away
anyway, I find it poor style to write something like:

+ if (nopartial && !UndoRecPtrIsValid(urec_ptr))
+ blk_chain_complete = true;
+ else
+ blk_chain_complete = false;

"if (x) y = true; else y = false;" can be more compactly written as "y
= x;", like this:

blk_chain_complete = nopartial && !UndoRecPtrIsValid(urec_ptr);

I think that the signature for rm_undo can be simplified considerably.
I think blk_chain_complete should go away for the reasons discussed
above.  Also, based on our conversations with Heikki at PGCon, we
decided that we should not presume that the AM wants the records
grouped by block, so the blkno argument should go away.  In addition,
I don't see much reason to have a first_idx argument. Instead of
passing a pointer to the caller's entire array and telling the
callback where to start looking, couldn't we just pass a pointer to
the first record the callback should examine, i.e. instead of passing
urp_array, pass urp_array + first_idx.  Then instead of having a
last_idx argument, have an argument for the number of entries in the
array, computed as last_idx - first_idx + 1.  With those changes,
rm_undo would look like this:

bool (*rm_undo) (UndoRecInfo *urp_array, int count, Oid reloid,
FullTransactionId full_xid);

Now for the $10m question: why even pass reloid and full_xid?  Aren't
those values going to be present inside every UnpackedUndoRecord?  Why
not just let the callback get them from the first record (or however
it wants to do things)?  Perhaps there is some documentation value
here in that it implies that the value will be the same for every
record, but we could also handle that by just documenting in the
appropriate places that undo is done by transaction and relation and
therefore the callback is entitled to assume that the same value will
be present in every record.  Then again, I am not sure we really want
the callback to assume that reloid doesn't change.  I don't see a
reason offhand not to just pass as many records as we have for a given
transaction and let the callback do what it likes.  So maybe that's
another reason to get rid of the reloid argument, at least.  And then
we could document that all the record will have the same full_xid
(unless we decide that we don't want to guarantee that either).

Additionally, it strikes me that urp_array is not the greatest name.
Generally, putting _array into the name of the variable to indicate
that it's an array doesn't seem all that great from a coding-style
perspective.  I mean, sometimes it's the best you can do, but it's not
amazing.  And urp seems like it's using an abbreviation without any
real reason.  For contrast, consider this existing precedent:

extern SysScanDesc systable_beginscan_ordered(Relation heapRelation,
                                                   Relation indexRelation,
                                                   Snapshot snapshot,
                                                   int nkeys, ScanKey key);

Or this one:

extern TupleDesc CreateTupleDesc(int natts, Form_pg_attribute *attrs);

Notice that in each case the array parameter (which is the last one)
is named based on what data it contains rather than on the fact that
it is an array.

Finally, I observe that rm_undo returns a Boolean, but it's not used
for anything.  The only call to rm_undo in the current patch set is in
execute_undo_actions_page, which returns that value to the caller, but
the callers just discard it.  I suppose maybe this was intended to
report success or failure, but I think the way that rm_undo will
report failure is to ERROR.  Or, if we want to allow a fail-soft
behavior for some reason, then the callers all need to check the
value.  I'm not sure whether there's a use case for that or not.

Putting all that together, I suggest a signature like this:

void (*rm_undo) (int nrecords, UndoRecInfo *records);

Or if we decide we need to have a fail-soft behavior, then like this:

bool (*rm_undo) (int nrecords, UndoRecInfo *records);

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: PG 12 draft release notes
Next
From: Bruce Momjian
Date:
Subject: Re: PG 12 draft release notes