Re: POC: Cleaning up orphaned files using undo logs — pgsql-hackers

From: Amit Kapila
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CAA4eK1KnnWHKiPT_JYV=XGXPHZGdZCtvH5WqcoRoyBvKsEqDjg@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:

I am responding to some of the points where I need more input or where some discussion is required. Some of the points need more thought, and I will respond to those later; some are quite straightforward and don't need much discussion.

> > +/*
> > + * Binary heap comparison function to compare the time at which an error
> > + * occurred for transactions.
> > + *
> > + * The error queue is sorted by next_retry_at and err_occurred_at. Currently,
> > + * the next_retry_at has some constant delay time (see PushErrorQueueElem), so
> > + * it doesn't make much sense to sort by both values. However, in future, if
> > + * we have some different algorithm for next_retry_at, then it will work
> > + * seamlessly.
> > + */
>
> Why is it useful to have error_occurred_at be part of the comparison at
> all? If we need a tiebraker, err_occurred_at isn't that (if we can get
> conflicts for next_retry_at, then we can also get conflicts in
> err_occurred_at). Seems better to use something actually guaranteed to
> be unique for a tiebreaker.

This was to distinguish the case where a request is failing multiple times from the case where the request failed the first time. I agree that we need a better unique identifier like FullTransactionId, though. Do let me know if you have any other suggestion.

> > + /*
> > + * The rollback hash table is used to avoid duplicate undo requests by
> > + * backends and discard worker. The table must be able to accomodate all
> > + * active undo requests. The undo requests must appear in both xid and
> > + * size requests queues or neither. In same transaction, there can be two
> > + * requests one for logged relations and another for unlogged relations.
> > + * So, the rollback hash table size should be equal to two request queues,
> > + * an error queue (currently this is same as request queue) and max
>
> "the same"?
> I assume this intended to mean the same size?

Yes. I will add the word "size" to be more clear.

> > + * backends. This will ensure that it won't get filled.
> > + */
>
> How does this ensure anything?

Because based on this we will have a hard limit on the number of undo requests, after which we won't allow more requests. See the more detailed explanation of this later in this email. I think the comment needs to be updated.

> > + * the binary heaps which can change.
> > + */
> > + Assert(LWLockHeldByMeInMode(RollbackRequestLock, LW_EXCLUSIVE));
> > +
> > + /*
> > + * We normally push the rollback request to undo workers if the size of
> > + * same is above a certain threshold.
> > + */
> > + if (req_size >= rollback_overflow_size * 1024 * 1024)
> > + {
>
> Why is this being checked with the lock held? Seems like this should be
> handled in a pre-check?

Yeah, it can be a pre-check, but I thought it was better to encapsulate everything in the function, as this is not an expensive check. I think we can move it outside the lock to avoid any such confusion.

> > + * allow_peek - if true, peeks a few element from each queue to check whether
> > + * any request matches current dbid.
> > + * remove_from_queue - if true, picks an element from the queue whose dbid
> > + * matches current dbid and remove it from the queue before returning the same
> > + * to caller.
> > + * urinfo - this is an OUT parameter that returns the details of undo request
> > + * whose undo action is still pending.
> > + * in_other_db_out - this is an OUT parameter. If we've not found any work
> > + * for current database, but there is work for some other database, we set
> > + * this parameter as true.
> > + */
> > +bool
> > +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo *urinfo,
> > + bool *in_other_db_out)
> > +{
>
> > + /*
> > + * If some undo worker is already processing the rollback request or
> > + * it is already processed, then we drop that request from the queue
> > + * and fetch the next entry from the queue.
> > + */
> > + if (!rh || UndoRequestIsInProgress(rh))
> > + {
> > + RemoveRequestFromQueue(cur_queue, 0);
> > + cur_undo_queue++;
> > + continue;
> > + }
>
> When is it possible to hit the in-progress case?

The same request is in two queues. It is possible that, while the request is being processed from the xid queue by one worker, the request from the other queue is picked up by another worker. I think this case won't exist after making the queues rbtree-based.

> > +/*
> > + * UpdateUndoApplyProgress - Updates how far undo actions from a particular
> > + * log have been applied while rolling back a transaction. This progress is
> > + * measured in terms of undo block number of the undo log till which the
> > + * undo actions have been applied.
> > + */
> > +static void
> > +UpdateUndoApplyProgress(UndoRecPtr progress_urec_ptr,
> > + BlockNumber block_num)
> > +{
> > + UndoLogCategory category;
> > + UndoRecordInsertContext context = {{0}};
> > +
> > + category =
> > + UndoLogNumberGetCategory(UndoRecPtrGetLogNo(progress_urec_ptr));
> > +
> > + /*
> > + * We don't need to update the progress for temp tables as they get
> > + * discraded after startup.
> > + */
> > + if (category == UNDO_TEMP)
> > + return;
> > +
> > + BeginUndoRecordInsert(&context, category, 1, NULL);
> > +
> > + /*
> > + * Prepare and update the undo apply progress in the transaction header.
> > + */
> > + UndoRecordPrepareApplyProgress(&context, progress_urec_ptr, block_num);
> > +
> > + START_CRIT_SECTION();
> > +
> > + /* Update the progress in the transaction header.
> > + */
> > + UndoRecordUpdateTransInfo(&context, 0);
> > +
> > + /* WAL log the undo apply progress. */
> > + {
> > + XLogRecPtr lsn;
> > + xl_undoapply_progress xlrec;
> > +
> > + xlrec.urec_ptr = progress_urec_ptr;
> > + xlrec.progress = block_num;
> > +
> > + XLogBeginInsert();
> > + XLogRegisterData((char *) &xlrec, sizeof(xlrec));
> > +
> > + RegisterUndoLogBuffers(&context, 1);
> > + lsn = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS);
> > + UndoLogBuffersSetLSN(&context, lsn);
> > + }
> > +
> > + END_CRIT_SECTION();
> > +
> > + /* Release undo buffers. */
> > + FinishUndoRecordInsert(&context);
> > +}
>
> This whole prepare/execute split for updating apply pregress, and next
> undo pointers makes no sense to me.

Can you explain what your concern is here? Basically, in the prepare phase, we read and lock the buffer, and in the actual update phase (which is under a critical section), we update the contents of the shared buffer. This is the same idea as we use in many places in the code.

> > typedef struct TwoPhaseFileHeader
> > {
> > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader
> > uint16 gidlen; /* length of the GID - GID follows the header */
> > XLogRecPtr origin_lsn; /* lsn of this record at origin node */
> > TimestampTz origin_timestamp; /* time of prepare at origin node */
> > +
> > + /*
> > + * We need the locations of the start and end undo record pointers when
> > + * rollbacks are to be performed for prepared transactions using undo-based
> > + * relations. We need to store this information in the file as the user
> > + * might rollback the prepared transaction after recovery and for that we
> > + * need its start and end undo locations.
> > + */
> > + UndoRecPtr start_urec_ptr[UndoLogCategories];
> > + UndoRecPtr end_urec_ptr[UndoLogCategories];
> > } TwoPhaseFileHeader;
>
> Why do we not need that knowledge for undo processing of a non-prepared
> transaction?

The non-prepared transaction also needs to be aware of that.
It is stored in TransactionStateData. I am not sure I understand your question here.

> > + * applying undo via top-level transaction, if we get an error,
> > + * then it is handled by ReleaseResourcesAndProcessUndo
>
> Where and how does it handle that? Maybe I misunderstand what you mean?

It is handled in ProcessUndoRequestForEachLogCat, which is called from ReleaseResourcesAndProcessUndo. Basically, the error is handled in the catch block, and we insert the request into the error queue. The function name should be changed in the comment.

> > + case TBLOCK_UNDO:
> > + /*
> > + * We reach here when we got error while applying undo
> > + * actions, so we don't want to again start applying it. Undo
> > + * workers can take care of it.
> > + *
> > + * AbortTransaction is already done, still need to release
> > + * locks and perform cleanup.
> > + */
> > + ResetUndoActionsInfo();
> > + ResourceOwnerRelease(s->curTransactionOwner,
> > + RESOURCE_RELEASE_LOCKS,
> > + false,
> > + true);
> > + s->state = TRANS_ABORT;
> > + CleanupTransaction();
>
> Hm. Why is it ok that we only perform that cleanup action? Either the
> rest of potentially held resources will get cleaned up somehow as well,
> in which case this ResourceOwnerRelease() ought to be redundant, or
> we're potentially leaking important resources like buffer pins, relcache
> references and whatnot here?

I had initially used AbortTransaction() here for such things, but I was not sure whether that is the right thing to do when we reach here in this state, because AbortTransaction has already been done once we reach here. A similar thing happens for the TBLOCK_SUBUNDO state a few lines below, where I had used AbortSubTransaction. Now, one problem I faced when AbortSubTransaction got invoked in this code path was that it internally invokes RecordTransactionAbort->XidCacheRemoveRunningXids, which results in the error "did not find subXID %u in MyProc".
The reason is obvious: we had already removed it when AbortSubTransaction was invoked before applying the undo actions. The releasing of locks is the thing we have delayed to allow undo actions to be applied, and that is done here. The other idea I had was to call AbortTransaction/AbortSubTransaction but somehow avoid calling RecordTransactionAbort when in this state. Do you have any suggestion for dealing with this?

> > +{
> > + TransactionState s = CurrentTransactionState;
> > + bool result;
> > + int i;
> > +
> > + /*
> > + * We don't want to apply the undo actions when we are already cleaning up
> > + * for FATAL error. See ReleaseResourcesAndProcessUndo.
> > + */
> > + if (SemiCritSectionCount > 0)
> > + {
> > + ResetUndoActionsInfo();
> > + return;
> > + }
>
> Wait what? Semi critical sections?

Robert suggested this idea upthread [1] (see the paragraph starting with "I am not a fan of applying_subxact_undo...") to deal with cases where we get an error while applying undo actions and need to promote the error to FATAL. We have two such cases as of now in this patch: one is when we process the temp log category and the other is when we are rolling back sub-transactions. The detailed reasons are mentioned in the function execute_undo_actions. I think this can be used for other things as well in the future.

> > + for (i = 0; i < UndoLogCategories; i++)
> > + {
> > + /*
> > + * We can't push the undo actions for temp table to background
> > + * workers as the the temp tables are only accessible in the
> > + * backend that has created them.
> > + */
> > + if (i != UNDO_TEMP && UndoRecPtrIsValid(s->latestUrecPtr[i]))
> > + {
> > + result = RegisterUndoRequest(s->latestUrecPtr[i],
> > + s->startUrecPtr[i],
> > + MyDatabaseId,
> > + GetTopFullTransactionId());
> > + s->undoRequestResgistered[i] = result;
> > + }
> > + }
>
> Give code like this I have a hard time seing what the point of having
> separate queue entries for the different persistency levels is.

It is not for this case; rather, it is for the case of the discard worker (a background worker), where we process the transactions at the log level. The permanent and unlogged transactions will be in separate logs and can be encountered at different times, so this leads to having separate entries for them. I am planning to try to unify them based on some of the discussions in this email thread.

> > +void
> > +ReleaseResourcesAndProcessUndo(void)
> > +{
> > + TransactionState s = CurrentTransactionState;
> > +
> > + /*
> > + * We don't want to apply the undo actions when we are already cleaning up
> > + * for FATAL error. One of the main reasons is that we might be already
> > + * processing undo actions for a (sub)transaction when we reach here
> > + * (for ex. error happens while processing undo actions for a
> > + * subtransaction).
> > + */
> > + if (SemiCritSectionCount > 0)
> > + {
> > + ResetUndoActionsInfo();
> > + return;
> > + }
> > +
> > + if (!NeedToPerformUndoActions())
> > + return;
> > +
> > + /*
> > + * State should still be TRANS_ABORT from AbortTransaction().
> > + */
> > + if (s->state != TRANS_ABORT)
> > + elog(FATAL, "ReleaseResourcesAndProcessUndo: unexpected state %s",
> > + TransStateAsString(s->state));
> > +
> > + /*
> > + * Do abort cleanup processing before applying the undo actions. We must
> > + * do this before applying the undo actions to remove the effects of
> > + * failed transaction.
> > + */
> > + if (IsSubTransaction())
> > + {
> > + AtSubCleanup_Portals(s->subTransactionId);
> > + s->blockState = TBLOCK_SUBUNDO;
> > + }
> > + else
> > + {
> > + AtCleanup_Portals(); /* now safe to release portal memory */
> > + AtEOXact_Snapshot(false, true); /* and release the transaction's
> > + * snapshots */
>
> Why do precisely these actions need to be performed here?

This is to get the transaction into a clean state. Before calling this function, AbortTransaction has been performed, and there are a few more things we need to do for cleanup.

> > + s->fullTransactionId = InvalidFullTransactionId;
> > + s->subTransactionId = TopSubTransactionId;
> > + s->blockState = TBLOCK_UNDO;
> > + }
> > +
> > + s->state = TRANS_UNDO;
>
> This seems guaranteed to constantly be out of date with other
> modifications of the commit/abort sequence.

It is similar to how we change state in Abort(Sub)Transaction, and we change the state back to TRANS_ABORT after applying undo in this function. So I am not sure how it can get out of date. Do you have a better suggestion here?

> > +bool
> > +ProcessUndoRequestForEachLogCat(FullTransactionId fxid, Oid dbid,
> > + UndoRecPtr *end_urec_ptr, UndoRecPtr *start_urec_ptr,
> > + bool *undoRequestResgistered, bool isSubTrans)
> > +{
> > + UndoRequestInfo urinfo;
> > + int i;
> > + uint32 save_holdoff;
> > + bool success = true;
> > +
> > + for (i = 0; i < UndoLogCategories; i++)
> > + {
> > + if (end_urec_ptr[i] && !undoRequestResgistered[i])
> > + {
> > + save_holdoff = InterruptHoldoffCount;
> > +
> > + PG_TRY();
> > + {
> > + /* for subtransactions, we do partial rollback. */
> > + execute_undo_actions(fxid,
> > + end_urec_ptr[i],
> > + start_urec_ptr[i],
> > + !isSubTrans);
> > + }
> > + PG_CATCH();
> > + {
> > + /*
> > + * Add the request into an error queue so that it can be
> > + * processed in a timely fashion.
> > + *
> > + * If we fail to add the request in an error queue, then mark
> > + * the entry status as invalid and continue to process the
> > + * remaining undo requests if any. This request will be later
> > + * added back to the queue by discard worker.
> > + */
> > + ResetUndoRequestInfo(&urinfo);
> > + urinfo.dbid = dbid;
> > + urinfo.full_xid = fxid;
> > + urinfo.start_urec_ptr = start_urec_ptr[i];
> > + if (!InsertRequestIntoErrorUndoQueue(&urinfo))
> > + RollbackHTMarkEntryInvalid(urinfo.full_xid,
> > + urinfo.start_urec_ptr);
> > + /*
> > + * Errors can reset holdoff count, so restore back. This is
> > + * required because this function can be called after holding
> > + * interrupts.
> > + */
> > + InterruptHoldoffCount = save_holdoff;
> > +
> > + /* Send the error only to server log. */
> > + err_out_to_client(false);
> > + EmitErrorReport();
> > +
> > + success = false;
> > +
> > + /* We should never reach here when we are in a semi-critical-section. */
> > + Assert(SemiCritSectionCount == 0);
>
> This seems entirely and completely broken. You can't just catch an
> exception and continue. What if somebody held an lwlock when the error
> was thrown? A buffer pin?

The caller deals with that. For example, when this is called from FinishPreparedTransaction, we do AbortOutOfAnyTransaction, and when it is called from ReleaseResourcesAndProcessUndo, we just release locks. I think we might need to do something additional for ReleaseResourcesAndProcessUndo. Earlier, I had AbortTransaction here as well, but I was not sure whether that is the right thing to do, especially because it will lead to RecordTransactionAbort being called twice: once when we do AbortTransaction before applying the undo actions and once when we do it after catching the exception. As I said earlier, maybe the right way is to just avoid calling RecordTransactionAbort again.

> > +to complete the requests by themselves.
> > +There is an exception to it where when
> > +error queue becomes full, we just mark the request as 'invalid' and continue to
> > +process other requests if any. The discard worker will find this errored
> > +transaction at later point of time and again add it to the request queues.
>
> You say it's an exception, but you do not explain why that exception is
> there.

The exception is when the error queue becomes full. The idea is that the individual queues can be full, but not the hash table.

> Nor why that's not a problem for:
>
> > +We have the hard limit (proportional to the size of the rollback hash table)
> > +for the number of transactions that can have pending undo. This can help us
> > +in computing the value of oldestXidHavingUnappliedUndo and allowing us not to
> > +accumulate pending undo for a long time which will eventually block the
> > +discard of undo.

The reason why it is not a problem is that we don't remove the entry from the hash table; rather, we just mark it such that the discard worker can later add it back to the queues. I am not sure if I understood your question completely, but let me try to explain this idea in a bit more detail.

The basic idea is that the rollback hash table has space equivalent to all three queues plus (2 * MaxBackends). We will stop allowing new transactions that want to write undo once the hash table has entries equivalent to all three queues and we have 2 * MaxBackends already attached to undo logs that are not committed. Assume each queue size is 5 and MaxBackends = 10; then we can ideally have 35 entries (3 * 5 + 2 * 10) in the hash table.

The way all this relates to the error queue being full is like this: say the number of hash table entries equals 15, which indicates all queues are full, and 10 backends are connected to two different logs each (permanent and unlogged). Next, one of the transactions errors out and tries to roll back; at this stage, it will add an entry to the hash table and try to execute the actions.
While executing the actions, it gets an error and can't add to the error queue because the queue is full, so at this stage it just marks the hash table entry as invalid and proceeds (consider that this happens for both the logged and unlogged categories). So, at this stage, we will have 17 entries in the hash table, and the other 9 backends are attached to 18 logs, which makes space for 35 xacts if the system crashes at this stage. The backend which errored out again tries to perform an operation for which it needs to perform undo. Now, we won't allow this backend to perform that action, because if it crashed after performing the operation and before committing, the hash table would overflow. Currently, there are some problems with the hash table overflow checks in the code that need to be fixed.

> > + /* There might not be any undo log and hibernation might be needed. */
> > + *hibernate = true;
> > +
> > + StartTransactionCommand();
>
> Why do we need this? I assume it's so we can have a resource owner?

Yes, and another reason is that we are using dbid_exists in this function.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com