Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CAA4eK1JudageWm5fF8F-5mAg5hXBSvggSMDeZPd-JSH0-_ggjA@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses: Re: POC: Cleaning up orphaned files using undo logs
List: pgsql-hackers
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Tue, 23 Jul 2019 at 08:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > --------------
> > >
> > > + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> > > I was thinking what happens if for some reason
> > > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
> > > entry will not be marked invalid, and so there will be no undo action
> > > carried out because I think the undo worker will exit. What happens
> > > next with this entry?
> > The same entry is present in two queues (xid and size), so next time it
> > will be executed from the second queue based on its priority in that
> > queue. However, if it fails again a second time in the same way, then
> > we will be in trouble, because now the hash table has the entry, but none
> > of the queues has it, so none of the workers will attempt to
> > execute it again. Also, when the discard worker again tries to register it,
> > we won't allow adding the entry to a queue, thinking either some backend
> > is executing the same request or it must already be part of some queue.
> >
> > One possibility to deal with this could be that we somehow allow the
> > discard worker to register it again in the queue, or we can do this in a
> > critical section so that it allows a system restart on error. However,
> > the main thing is: is it possible for InsertRequestIntoErrorUndoQueue
> > to fail unless there is some bug in the code? If so, we might want
> > to have an Assert for this rather than handling this condition.
>
> Yes, I also think that the function would error out only because of
> can't-happen cases, like "too many locks taken" or "out of binary heap
> slots" or "out of memory" (this last one is not such a can't-happen
> case). These cases happen probably due to some bugs, I suppose. But I
> was wondering: generally, when the code errors out with such
> can't-happen elog() calls, the worst thing that happens is that the
> transaction gets aborted. Whereas, in this case, the worst thing that
> could happen is that the undo action would never get executed, which
> means selects for this tuple will keep on accessing the undo log?
>

Yeah, or in zheap, we have a page-wise rollback facility which rolls back
the transaction for a particular page (this gets triggered whenever we try
to update/delete a tuple which was last updated by an aborted xact, or when
we try to reuse the slot of an aborted xact), so we don't need to traverse
the undo chain.

> This does not sound like any data consistency issue, so we should be
> fine after all?
>

I will see if we can have an Assert in the code for this.

>
> --------------
>
> +if (UndoGetWork(false, false, &urinfo, NULL) &&
> + IsUndoWorkerAvailable())
> + UndoWorkerLaunch(urinfo);
>
> There is no lock acquired between IsUndoWorkerAvailable() and
> UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
> returns true, there is a small window where UndoWorkerLaunch() does
> not find any worker slot with in_use false, causing an assertion failure
> for (worker != NULL).
> --------------
>

Yeah, I think UndoWorkerLaunch should be able to return without launching
a worker in such a case.

>
> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + /* Failed to start worker, so clean up the worker slot. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> + UndoWorkerCleanup(worker);
> + LWLockRelease(UndoWorkerLock);
> +
> + return false;
> + }
>
> Is it intentional that there is no (warning?) message logged when we
> can't register a bg worker?
> -------------

I don't think it was intentional. I think it will be good to have a warning
here.

I agree with all your other comments.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com