Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Dmitry Dolgov |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | 20210117154457.ekxrgdxxa3qfsscq@localhost Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Antonin Houska <ah@cybertec.at>) |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
|
List | pgsql-hackers |
> On Fri, Dec 04, 2020 at 10:22:42AM +0100, Antonin Houska wrote: > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > If you want to track at undo record level, then won't it lead to > > > > performance overhead and probably additional WAL overhead considering > > > > this action needs to be WAL-logged. I think recording at page-level > > > > might be a better idea. > > > > > > I'm not worried about WAL because the undo execution needs to be WAL-logged > > > anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be > > > evaluated regarding performance is the (exclusive) locking of the page that > > > carries the progress information. > > > > > > > That is just for one kind of smgr, think how you will do it for > > something like zheap. Their idea is to collect all the undo records > > (unless the undo for a transaction is very large) for one zheap-page > > and apply them together, so maintaining the status at each undo record > > level will surely lead to a large amount of additional WAL. See below > > how and why we have decided to do it differently. > > > > > I'm still not sure whether this info should > > > be on every page or only in the chunk header. In either case, we have a > > > problem if there are two or more chunks created by different transactions on > > > the same page, and if more than on of these transactions need to perform > > > undo. I tend to believe that this should happen rarely though. > > > > > > > I think we need to maintain this information at the transaction level > > and need to update it after processing a few blocks, at least that is > > what was decided and implemented earlier. We also need to update it > > when the log is switched or all the actions of the transaction were > > applied. The reasoning is that for short transactions it won't matter > > and for larger transactions, it is good to update it after a few pages > > to avoid WAL and locking overhead. Also, it is better if we collect > > the undo in bulk, this is proved to be beneficial for large > > transactions. > > Attached is what I originally did not include in the patch series, see the > part 0012. I have no better idea so far. The progress information is stored in > the chunk header. > > To avoid too frequent locking, maybe the UpdateLastAppliedRecord() function > can be modified so it recognizes when it's necessary to update the progress > info. Also the user (zheap) should think when it should call the function. > Since I've included 0012 now as a prerequisite for discarding (0013), > currently it's only necessary to update the progress at undo log chunk > boundary. > > In this version of the patch series I wanted to publish the remaining ideas I > haven't published yet. Thanks for the updated patch. As I've mentioned off the list I'm slowly looking through it with the intent to concentrate on undo progress tracking. But before I will post anything I want to mention couple of strange issues I see, otherwise I will forget for sure. Maybe it's already known, but running several times 'make installcheck' against a freshly build postgres with the patch applied from time to time I observe various errors. This one happens on a crash recovery, seems like UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in the replay process: TRAP: FailedAssertion("page_offset + this_page_bytes <= uph->ud_insertion_point", File: "undopage.c", Line: 300) postgres: startup recovering 000000010000000000000012(ExceptionalCondition+0xa1)[0x558b38b8a350] postgres: startup recovering 000000010000000000000012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e] postgres: startup recovering 000000010000000000000012(UndoReplay+0xa1d)[0x558b38766f32] postgres: startup recovering 000000010000000000000012(XactUndoReplay+0x77)[0x558b38769281] postgres: startup recovering 000000010000000000000012(smgr_redo+0x1af)[0x558b387aa7bd] This one is somewhat similar: TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: "undopage.c", Line: 287) postgres: undo worker for database 36893 (ExceptionalCondition+0xa1)[0x5559c90f1350] postgres: undo worker for database 36893 (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3] postgres: undo worker for database 36893 (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008] postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989] There are also here and there messages about not found undo files: ERROR: cannot open undo segment file 'base/undo/000008.0000020000': No such file or directory WARNING: failed to undo transaction I haven't found out the trigger yet, but got an impression that it happens after create_table tests.
pgsql-hackers by date: