Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CAA4eK1JukkDw3hgouwRRaFwQ-EX9k3NUPYU+5_1-bRfSrshTpw@mail.gmail.com
Whole thread Raw
In response to Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: POC: Cleaning up orphaned files using undo logs
List pgsql-hackers
On Thu, Jun 13, 2019 at 3:13 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, May 27, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Apart from fixing the above comments, the patch is rebased on latest
> > undo patchset.  As of now, I have split the binaryheap.c changes into
> > a separate patch.  We are stilll enhancing the patch to compute
> > oldestXidHavingUnappliedUndo which touches various parts of patch, so
> > splitting further without completing that can make it a bit difficult
> > to work on that.
>
> Some review comments around execute_undo_actions:
>
> The 'nopartial' argument to execute_undo_actions is confusing.  First,
> it would probably be worth spelling it out instead of abbreviation:
> not_partial_transaction rather than nopartial.  Second, it is usually
> better to phrase parameter names in terms of what they are rather than
> in terms of what they are not: complete_transaction rather than
> not_partial_transaction.  Third, it's unclear from these comments why
> we'd be undoing something other than a complete transaction.  It looks
> as though the answer is that this flag will be false when we're
> undoing a subxact -- in which case, why not invert the sense of the
> flag and call it 'bool subxact'?  I might be wrong, but it seems like
> that would be a whole lot clearer.
>

The idea was that it could be use for multiple purposes like 'rolling
back complete xact', 'rolling back subxact', 'rollback at page-level'
or any similar future need even though not all code paths use that
function.  I am not wedded to any particular name here, but among your
suggestions complete_transaction sounds better to me.  Are you okay
going with that?

> Fourth, the block at the top of
> this function, guarded by nopartial, seems like it must be vulnerable
> to race conditions.  If we're undoing the complete transaction, then
> it checks whether UndoFetchRecord() still returns anything.  But that
> could change not just at the beginning of the function, but also any
> time in the middle, or so it seems to me.
>

It won't change in between because we have ensured at top-level that
no two processes can start executing pending undo at the same time.
Basically, anyone wants to execute the undo actions will have an entry
in rollback hash table and that will be marked as in-progress.  As
mentioned in comments, the race is only "after discard worker
fetches the record and found that this transaction need to be rolled
back, backend might concurrently execute the actions and remove the
request from rollback hash table."

>  I doubt that this is the
> right level at which to deal with this sort of interlocking. I think
> there should be some higher-level mechanism that prevents two
> processes from trying to undo the same transaction at the same time,
> like a heavyweight lock or some kind of flag in the shared memory data
> structure that keeps track of pending undo, so that we never even
> reach this code unless we know that this XID needs undo work
>

Introducing heavyweight lock can create different sort of problems
because we need to hold it till all the actions are applied to avoid
what I have mentioned above.  The problem will be that discard worker
will be blocked till backend/undo worker applies the complete actions
unless we just take this lock conditionally in discard worker.

Another way could be that we re-fetch the undo record when we are
registering the undo request under RollbackRequestLock and check it's
status again becuase in that case backend or other undoworker won't be
able to remove the request from hash table concurrently.  However, the
advantage of checking it in execute_undo_actions is that we can
optimize it in the future to avoid re-fetching this record when
actually fetching the records to apply undo actions.

> and no
> other process is already doing it.
>

This part is already ensured in the current code.

>
> The 'blk_chain_complete' variable which is set in this function and
> passed down to execute_undo_actions_page() and then to the rmgr's
> rm_undo callback also looks problematic.
>

I agree this parameter should go away from the generic interface
considering the requirements from zedstore.

>  First, not every AM that
> uses undo may even have the notion of a 'block chain'; zedstore for
> example uses TIDs as a 48-bit integer, not a block + offset number, so
> it's really not going to have a 'block chain.'  Second, even in
> zheap's usage, it seems to me that the block chain could be complete
> even when this gets set to false. It gets set to true when we're
> undoing a toplevel transaction (not a subxact) and we were able to
> fetch all of the undo for that toplevel transaction. But even if
> that's not true, the chain for an individual block could still be
> complete, because all the remaining undo for the block at issue
> might've been in the chunk of undo we already read; the remaining undo
> could be for other blocks.  For that reason, I can't see how the zheap
> code that relies on this value can be correct; it uses this value to
> decide whether to stick zeroes in the transaction slot, but if the
> scenario described above happened, then I suppose the XID would not
> get cleared from the slot during undo.  Maybe zheap is just relying on
> that being harmless, since if all of the undo actions have been
> correctly executed for the page, the fact that the transaction slot is
> still bogusly used by an aborted xact won't matter; nothing will refer
> to it. However, it seems to me that it would be better for zheap to
> set things up so that the first undo record for a particular txn/page
> combination is flagged in some way (in the payload!) so that undo can
> zero the slot if the action being undone is the one that claimed the
> slot.  That seems cleaner on principle, and it also avoids having
> supposedly AM-independent code pass down details that are driven by
> zheap's particular needs.
>

Yeah, we can do what you are suggesting for zheap or in many cases, we
should be able to detect it via uur_blkprev of the last record of
page.  The invalid value will indicate that the chain for the page is
complete.

>
> I think that the signature for rm_undo can be simplified considerably.
> I think blk_chain_complete should go away for the reasons discussed
> above.  Also, based on our conversations with Heikki at PGCon, we
> decided that we should not presume that the AM wants the records
> grouped by block, so the blkno argument should go away.  In addition,
> I don't see much reason to have a first_idx argument. Instead of
> passing a pointer to the caller's entire array and telling the
> callback where to start looking, couldn't we just pass a pointer to
> the first record the callback should examine, i.e. instead of passing
> urp_array, pass urp_array + first_idx.  Then instead of having a
> last_idx argument, have an argument for the number of entries in the
> array, computed as last_idx - first_idx + 1.  With those changes,
> rm_undo would look like this:
>
> bool (*rm_undo) (UndoRecInfo *urp_array, int count, Oid reloid,
> FullTransactionId full_xid);
>

I agree.

> Now for the $10m question: why even pass reloid and full_xid?  Aren't
> those values going to be present inside every UnpackedUndoRecord?  Why
> not just let the callback get them from the first record (or however
> it wants to do things)?  Perhaps there is some documentation value
> here in that it implies that the value will be the same for every
> record, but we could also handle that by just documenting in the
> appropriate places that undo is done by transaction and relation and
> therefore the callback is entitled to assume that the same value will
> be present in every record.  Then again, I am not sure we really want
> the callback to assume that reloid doesn't change.  I don't see a
> reason offhand not to just pass as many records as we have for a given
> transaction and let the callback do what it likes.  So maybe that's
> another reason to get rid of the reloid argument, at least.  And then
> we could document that all the record will have the same full_xid
> (unless we decide that we don't want to guarantee that either).
>
> Additionally, it strikes me that urp_array is not the greatest name.
> Generally, putting _array into the name of the variable to indicate
> that it's an array doesn't seem all that great from a coding-style
> perspective.  I mean, sometimes it's the best you can do, but it's not
> amazing.  And urp seems like it's using an abbreviation without any
> real reason.  For contrast, consider this existing precedent:
>
> extern SysScanDesc systable_beginscan_ordered(Relation heapRelation,
>                                                    Relation indexRelation,
>                                                    Snapshot snapshot,
>                                                    int nkeys, ScanKey key);
>
> Or this one:
>
> extern TupleDesc CreateTupleDesc(int natts, Form_pg_attribute *attrs);
>
> Notice that in each case the array parameter (which is the last one)
> is named based on what data it contains rather than on the fact that
> it is an array.
>

Agreed, will change accordingly.

> Finally, I observe that rm_undo returns a Boolean, but it's not used
> for anything.  The only call to rm_undo in the current patch set is in
> execute_undo_actions_page, which returns that value to the caller, but
> the callers just discard it.  I suppose maybe this was intended to
> report success or failure, but I think the way that rm_undo will
> report failure is to ERROR.
>

For Error case, it is fine to report failure, but there can be cases
where we don't need to apply undo actions like when the relation is
dropped/truncated, undo actions are already applied.  The original
idea was to cover such cases by the return value.  I agree that
currently, caller ignores this value, but there is some value in
keeping it.  So, I am in favor of a signature with bool as the return
value.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: didier
Date:
Subject: Re: Generating partitioning tuple conversion maps faster
Next
From: Liudmila Mantrova
Date:
Subject: Re: SQL/JSON path issues/questions