Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From: Antonin Houska
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: 92024.1605176256@antos
In response to: Re: POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> wrote:

> On Thu, Nov 28, 2019 at 3:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote:
> > > Oops, right. So it should just be added to the if condition. Will do.
> >
> > It's been a couple of months and the discussion has stale. It seems
> > also that the patch was waiting for an update. So I am marking it as
> > RwF for now. Please feel free to update it if you feel that's not
> > adapted.
>
> Thanks. We decided to redesign a couple of aspects of the undo
> storage and record layers that this patch was intended to demonstrate,
> and work on that is underway. More on that soon.

As my boss expressed in his recent blog post, we'd like to contribute to the zheap development, and a couple of developers from other companies are interested in this as well. Amit Kapila suggested that the "cleanup of orphaned files" feature is a good starting point for getting the code into PG core, so I've spent some time on it and tried to rebase the patch set. In fact, what I did is not mere rebasing against the current master branch: besides various bug fixes, I've also made some design changes.

Incorporated the new Undo Record Set (URS) infrastructure
---------------------------------------------------------

This is also pointed out in [0]. I started from [1] and tried to implement some missing parts (e.g. proper closing of the URSs after a crash), introduced an UNDO_DEBUG preprocessor macro which makes the undo log segments very small, and fixed some bugs that the small segments exposed.

The most significant change I've made is the removal of the undo requests from the checkpoint. I could not find any particular bug or race condition related to including the requests in the checkpoint, but I concluded that it's easier to reason about consistency and checkpoint timing if we scan the undo log on restart (after recovery has finished) and create the requests from scratch.
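To illustrate the restart-time scan, here is a toy sketch in C. All names (ToyUndoLog, rebuild_undo_requests, the struct fields) are invented for this example and are not the patch's actual data structures, which live in shared memory and consult real transaction state; the point is only the shape of the idea: requests are derived from the logs themselves, not restored from the checkpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of an undo log: a discard pointer, an insert pointer, and
 * the commit status of the owning transaction.  (Hypothetical names,
 * for illustration only.)
 */
typedef struct ToyUndoLog
{
	unsigned long discard_ptr;	/* everything below this is discarded */
	unsigned long insert_ptr;	/* next byte to be written */
	bool		xact_committed;	/* did the owning transaction commit? */
} ToyUndoLog;

/*
 * After recovery has finished, scan all undo logs and count the ones
 * that still contain undo of a transaction that did not commit: each
 * of those needs an undo request, rebuilt from scratch rather than
 * carried through the checkpoint.
 */
static int
rebuild_undo_requests(const ToyUndoLog *logs, size_t nlogs)
{
	int			nrequests = 0;

	for (size_t i = 0; i < nlogs; i++)
	{
		bool		has_undo = logs[i].discard_ptr < logs[i].insert_ptr;

		if (has_undo && !logs[i].xact_committed)
			nrequests++;
	}
	return nrequests;
}

/* Example data: a committed log, an aborted log with undo, an empty log. */
static const ToyUndoLog demo_logs[] = {
	{.discard_ptr = 0, .insert_ptr = 100, .xact_committed = true},
	{.discard_ptr = 50, .insert_ptr = 200, .xact_committed = false},
	{.discard_ptr = 80, .insert_ptr = 80, .xact_committed = false},
};
```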
[2] shows where I ended up before I started to rebase this patch set.

No background undo
------------------

Reduced complexity of the patch seems to be the priority at the moment. Amit suggested that cleanup of an orphaned relation file is simple enough to be done in the foreground, and I agree.

The "undo worker" is still there, but it only processes undo requests after server restart: since relation data can only be changed in a transaction, it seems cleaner to launch a background worker for this than to hack the startup process. Since the concept of undo requests is closely related to the undo worker, I removed undorequest.c too. The new (much simpler) undo worker gets the information on incomplete / aborted transactions from the undo log, as mentioned above.

SMGR enhancement
----------------

I used the 0001 patch from [3] rather than [4], although it's more invasive, because I noticed somewhere in the discussion that there should be no reserved database OID for the undo log. (InvalidOid cannot be used because it's already in use for shared catalogs.)

Components added
----------------

The pg_undo_dump utility and a test framework for undoread.c. BTW, undoread.c seems to need some refactoring.

The following are a few areas which are not implemented yet because more discussion is needed there:

Discarding
----------

There's no discard worker for the URS infrastructure yet. I thought about discarding the undo log during checkpoint, but checkpoint should probably do more straightforward tasks than calculating a new discard pointer for each undo log, so a background worker is needed. A few notes on that:

* Until the zheap AM gets added, only the transaction that creates the undo records needs to access them. This assumption should make the discarding algorithm a bit simpler. Note that with zheap, other transactions need to look for old versions of tuples, so the concept of an oldestXidHavingUndo variable is needed there.
* It's rather simple to pass the URS pointer to the discard worker when the transaction has either committed or its undo has been executed. If the URS consists of a single chunk, the discard pointer can simply be advanced to the end of the chunk. But if there are multiple chunks, the discard worker might need to scan quite some amount of the undo log, because (IIUC) chunks of different URSs can be interleaved (if there's not enough space for a record in log 1, log 2 is used, but before we get to discarding, another transaction could have added its chunk to log 1) and because the chunks only contain backward links, not forward ones. Adding a forward link to the chunk header would make chunk closing more complex. How about storing the type header (which includes the XID) in each chunk instead of only in the first chunk of the URS? That way we could check for each chunk separately whether it can be discarded.

* If the URS belongs to an aborted transaction, or to a transaction that could not finish due to a server crash, the transaction status alone does not justify discarding: we also need to be sure that the underlying undo records have been applied. So if we want to do without the oldestXidHavingUndo variable, some sort of undo progress tracking is needed; see below.

Do not execute the same undo record multiple times
--------------------------------------------------

Although I've noticed that the zheap code checks whether a particular undo action has already been undone, I think this functionality fits better in the URS layer. Also note in [1] (i.e. the undo layer, no zheap) that the header comment of AtSubAbort_XactUndo() refers to this problem.

I've tried to implement such a thing (not included in this patch) by adding a last_rec_applied field to UndoRecordSetChunkHeader.
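To make the intent concrete, here is a toy C sketch of such a progress field. Except for the idea behind last_rec_applied, everything here (ToyChunk, the functions, record indexes instead of undo pointers) is invented for illustration; there is no buffer locking or WAL, and a "crash" is simulated by simply stopping early.

```c
#include <assert.h>

#define NO_RECORD (-1)

/*
 * Toy chunk: records are applied in reverse order, from the last index
 * down to 0.  last_rec_applied mimics the proposed chunk-header field:
 * it points at the next record to apply, so records at higher indexes
 * are never executed a second time.
 */
typedef struct ToyChunk
{
	int			nrecords;
	int			last_rec_applied;	/* index of next record to apply */
	int			apply_count[8];		/* how many times each record ran */
} ToyChunk;

/* Entering the UNDO stage: point at the last record of the chunk. */
static void
start_undo_stage(ToyChunk *chunk)
{
	chunk->last_rec_applied = chunk->nrecords - 1;
}

/*
 * Apply pending undo records, moving the pointer backwards after each
 * one.  max_records lets the caller simulate a crash mid-way; a later
 * call resumes exactly where the previous one stopped.
 */
static void
apply_pending_undo(ToyChunk *chunk, int max_records)
{
	while (chunk->last_rec_applied != NO_RECORD && max_records-- > 0)
	{
		chunk->apply_count[chunk->last_rec_applied]++;
		chunk->last_rec_applied--;
	}
}

/* Example: three records; "crash" after applying one, then resume. */
static ToyChunk demo_chunk = {.nrecords = 3, .last_rec_applied = NO_RECORD};

static void
run_demo(void)
{
	start_undo_stage(&demo_chunk);
	apply_pending_undo(&demo_chunk, 1);		/* record 2 applied, then crash */
	apply_pending_undo(&demo_chunk, 100);	/* resume: records 1 and 0 */
}
```

In this toy version each record is applied exactly once even though the UNDO stage is restarted, which is the property the field is meant to guarantee; the open question in the text (exclusive lock on the chunk-header buffer per record, or page-level tracking) is about where this counter should live, not about the mechanism itself.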
When the UNDO stage of the transaction starts, this field is set to the last undo record of the given chunk. Once that record is applied, the pointer moves to the previous record in terms of undo pointer (i.e. the next record to be applied - the records are applied in reverse order), and so on. For recovery purposes, the pointer is maintained in a similar way to the ud_insertion_point field of UndoPageHeaderData.

However, although I haven't tested performance yet, I wonder if it's o.k. to lock the buffer containing the chunk header exclusively for each undo record execution. I wonder if there's a better place to store the progress information, maybe at page level?

I can spend more time on this project, but need a hint which part I should focus on. Other hackers might face the same problem. Thanks for any suggestions.

[0] https://www.postgresql.org/message-id/CA%2BTgmoZwkqXs3hpT_nd17fyMnZDkg8yU%3D5kG%2BHQw%2B80rumiwUA%40mail.gmail.com
[1] https://github.com/EnterpriseDB/zheap/tree/undo-record-set
[2] https://github.com/cybertec-postgresql/postgres/tree/undo-record-set-ah
[3] https://www.postgresql.org/message-id/CA%2BhUKGJfznxutTwpMLKPMjU_k9GhERoogyxx2Sf105LOA2La2A%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com

--
Antonin Houska
Web: https://www.cybertec-postgresql.com