Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | CA+TgmobvqZoTX9XHkw45c-VbbY6PLD+HNS4vw8yyrmy5gBidJA@mail.gmail.com Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
|
List | pgsql-hackers |
On Mon, May 27, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > Apart from fixing the above comments, the patch is rebased on latest > undo patchset. As of now, I have split the binaryheap.c changes into > a separate patch. We are stilll enhancing the patch to compute > oldestXidHavingUnappliedUndo which touches various parts of patch, so > splitting further without completing that can make it a bit difficult > to work on that. Some review comments around execute_undo_actions: The 'nopartial' argument to execute_undo_actions is confusing. First, it would probably be worth spelling it out instead of abbreviation: not_partial_transaction rather than nopartial. Second, it is usually better to phrase parameter names in terms of what they are rather than in terms of what they are not: complete_transaction rather than not_partial_transaction. Third, it's unclear from these comments why we'd be undoing something other than a complete transaction. It looks as though the answer is that this flag will be false when we're undoing a subxact -- in which case, why not invert the sense of the flag and call it 'bool subxact'? I might be wrong, but it seems like that would be a whole lot clearer. Fourth, the block at the top of this function, guarded by nopartial, seems like it must be vulnerable to race conditions. If we're undoing the complete transaction, then it checks whether UndoFetchRecord() still returns anything. But that could change not just at the beginning of the function, but also any time in the middle, or so it seems to me. I doubt that this is the right level at which to deal with this sort of interlocking. I think there should be some higher-level mechanism that prevents two processes from trying to undo the same transaction at the same time, like a heavyweight lock or some kind of flag in the shared memory data structure that keeps track of pending undo, so that we never even reach this code unless we know that this XID needs undo work and no other process is already doing it. If you're the only one undoing XID 123456, then there shouldn't be any chance of the undo disappearing from underneath you. And we definitely want to guarantee that only one process is undoing any given XID at a time. The 'blk_chain_complete' variable which is set in this function and passed down to execute_undo_actions_page() and then to the rmgr's rm_undo callback also looks problematic. First, not every AM that uses undo may even have the notion of a 'block chain'; zedstore for example uses TIDs as a 48-bit integer, not a block + offset number, so it's really not going to have a 'block chain.' Second, even in zheap's usage, it seems to me that the block chain could be complete even when this gets set to false. It gets set to true when we're undoing a toplevel transaction (not a subxact) and we were able to fetch all of the undo for that toplevel transaction. But even if that's not true, the chain for an individual block could still be complete, because all the remaining undo for the block at issue might've been in the chunk of undo we already read; the remaining undo could be for other blocks. For that reason, I can't see how the zheap code that relies on this value can be correct; it uses this value to decide whether to stick zeroes in the transaction slot, but if the scenario described above happened, then I suppose the XID would not get cleared from the slot during undo. Maybe zheap is just relying on that being harmless, since if all of the undo actions have been correctly executed for the page, the fact that the transaction slot is still bogusly used by an aborted xact won't matter; nothing will refer to it. However, it seems to me that it would be better for zheap to set things up so that the first undo record for a particular txn/page combination is flagged in some way (in the payload!) so that undo can zero the slot if the action being undone is the one that claimed the slot. That seems cleaner on principle, and it also avoids having supposedly AM-independent code pass down details that are driven by zheap's particular needs. While it's probably moot since I think this code should go away anyway, I find it poor style to write something like: + if (nopartial && !UndoRecPtrIsValid(urec_ptr)) + blk_chain_complete = true; + else + blk_chain_complete = false; "if (x) y = true; else y = false;" can be more compactly written as "y = x;", like this: blk_chain_complete = nopartial && !UndoRecPtrIsValid(urec_ptr); I think that the signature for rm_undo can be simplified considerably. I think blk_chain_complete should go away for the reasons discussed above. Also, based on our conversations with Heikki at PGCon, we decided that we should not presume that the AM wants the records grouped by block, so the blkno argument should go away. In addition, I don't see much reason to have a first_idx argument. Instead of passing a pointer to the caller's entire array and telling the callback where to start looking, couldn't we just pass a pointer to the first record the callback should examine, i.e. instead of passing urp_array, pass urp_array + first_idx. Then instead of having a last_idx argument, have an argument for the number of entries in the array, computed as last_idx - first_idx + 1. With those changes, rm_undo would look like this: bool (*rm_undo) (UndoRecInfo *urp_array, int count, Oid reloid, FullTransactionId full_xid); Now for the $10m question: why even pass reloid and full_xid? Aren't those values going to be present inside every UnpackedUndoRecord? Why not just let the callback get them from the first record (or however it wants to do things)? Perhaps there is some documentation value here in that it implies that the value will be the same for every record, but we could also handle that by just documenting in the appropriate places that undo is done by transaction and relation and therefore the callback is entitled to assume that the same value will be present in every record. Then again, I am not sure we really want the callback to assume that reloid doesn't change. I don't see a reason offhand not to just pass as many records as we have for a given transaction and let the callback do what it likes. So maybe that's another reason to get rid of the reloid argument, at least. And then we could document that all the record will have the same full_xid (unless we decide that we don't want to guarantee that either). Additionally, it strikes me that urp_array is not the greatest name. Generally, putting _array into the name of the variable to indicate that it's an array doesn't seem all that great from a coding-style perspective. I mean, sometimes it's the best you can do, but it's not amazing. And urp seems like it's using an abbreviation without any real reason. For contrast, consider this existing precedent: extern SysScanDesc systable_beginscan_ordered(Relation heapRelation, Relation indexRelation, Snapshot snapshot, int nkeys, ScanKey key); Or this one: extern TupleDesc CreateTupleDesc(int natts, Form_pg_attribute *attrs); Notice that in each case the array parameter (which is the last one) is named based on what data it contains rather than on the fact that it is an array. Finally, I observe that rm_undo returns a Boolean, but it's not used for anything. The only call to rm_undo in the current patch set is in execute_undo_actions_page, which returns that value to the caller, but the callers just discard it. I suppose maybe this was intended to report success or failure, but I think the way that rm_undo will report failure is to ERROR. Or, if we want to allow a fail-soft behavior for some reason, then the callers all need to check the value. I'm not sure whether there's a use case for that or not. Putting all that together, I suggest a signature like this: void (*rm_undo) (int nrecords, UndoRecInfo *records); Or if we decide we need to have a fail-soft behavior, then like this: bool (*rm_undo) (int nrecords, UndoRecInfo *records); -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: