From: Andres Freund
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: 20190814065745.2faw3hirvfhbrdwe@alap3.anarazel.de
In response to: Re: POC: Cleaning up orphaned files using undo logs (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2019-08-06 14:18:42 -0700, Andres Freund wrote:
> Here's the last section of my low-level review. Plan to write a higher
> level summary afterwards, now that I have a better picture of the code.

General comments:

- For a feature of this complexity, there's very little architectural documentation. Some individual parts have a bit, but there's basically nothing over-arching. That makes it extremely hard for anybody who is not already involved to understand the design constraints, and even for people who are involved it's hard to understand. I think it's very important for this feature to have a document that first explains what its goals and non-goals are, and then explains the chosen architecture with reference to those constraints. Once that exists, it's a lot easier to review this patchset, to discuss the overall architecture, etc.

- There are too many different email threads and git branches. The same patches are discussed in different threads, and layers exist in slightly diverging versions in different git trees, again making it very hard for anybody not primarily focusing on undo to join the discussion. I think most of the older git branches should be renamed to something indicating their historic status. The remaining branches should be referenced from a wiki page (linked to in each submission of a new patch version) explaining what they're about. I don't think it's realistic to have people infer meaning from the current branch names (undo, proposal-undo-log, undo-log-storage, undo-log-storage-v2, undo_interface_v1, undoprocessing).

  Given the size of the overall project it's quite possibly not realistic to manage the whole work in a single git branch. With separate git branches, as currently done, it's however hard to understand which version of which layer is being used. I think at the very least higher layers need to indicate which version of the underlying layers they are based on; I suggest just adding a "v01: " style prefix to all commit subjects a branch is rebased onto.

  It's also currently hard to understand which version of a layer is being discussed. I think all submissions need to include a version number (cf. git format-patch's -v option), and that version ought to be included in the subject line. Each new major version of a patch should be started as a reply to the first message of a thread, to keep the structure of the discussion in a manageable shape. New versions should include explicit explanations of the major changes compared to the last version.

- The code quality of pretty significant parts of the patchset is not even close to being good enough. There are areas with a lot of code duplication. There are very few higher-level explanations for interfaces. There's a lot of "i++; /* increment i to increment it */" style comments, but not enough higher-level comments. There are significant issues with parts of the code that aren't noted anywhere in comments, leading to reviewers having to repeatedly re-discover them (and wasting time on that). There are different naming styles in related code without a discernible pattern (e.g. UndoRecordSetInfo being followed by get_undo_rec_cid_offset). The word ordering of different layers is confusing (e.g. BeginUndoRecordInsert vs UndoLogBeginInsert vs PrepareUndoInsert). Different important externally visible functions have names that don't make it possible to determine which is supposed to do what (PrepareUndoInsert vs BeginUndoRecordInsert).
More specific comments:

- The whole sequencing of undo record insertion in combination with WAL logging does not appear to be right. It's a bit hard to say, because there's very little documentation on what the intended default sequence of operations is. My understanding is that the currently expected pattern is to:

  1) Collect information / perform work needed to perform the action that needs to be UNDO logged, e.g. perform visibility determinations, wait for lockers, compute the new infomask, etc. This will likely end with the "content" page(s) (e.g. a table's page) being exclusively locked.
  2) Estimate the space needed for all UNDO logging (BeginUndoRecordInsert).
  3) Prepare each undo record, which includes building its content: PrepareUndoInsert(). This acquires, pins and locks buffers for undo.
  4) Begin a critical section.
  5) Modify the content page, mark the buffer dirty.
  6) Write UNDO, using InsertPreparedUndo().
  7) Associate the undo with the WAL record (RegisterUndoLogBuffers).
  8) Write WAL.
  9) End the critical section.

  But despite reading through the code, including READMEs, I'm doubtful that's quite the intended pattern. It REALLY can't be right that one needs to parse many function headers to figure out how the most basic use of undo could possibly work. There needs to be very clear documentation about how to write undo records. Functions that sound like they're involved need actually useful comments, rather than just restatements of their function names (cf. RegisterUndoLogBuffers, UndoLogBuffersSetLSN, UndoLogRegister).

- I think there are two fairly fundamental, and related, problems with the sequence outlined above:

  - We can't search for target buffers to store undo data while holding the "table" page content-locked. That can involve writing out multiple pages until we find a usable victim buffer, which can take a pretty long time, and while that's happening the content page would currently be locked. Note how e.g. heapam.c is careful not to hold *any* content locks while potentially performing IO. I think the current interface makes that hard.

    The easy way to solve this would be to require sites performing UNDO logging to acquire victim pages before even acquiring other content locks. Perhaps the better approach would be for the undo layer to hold onto a number of clean buffers, and to keep the last page of an already-written-to undo log pinned.

  - We can't search for victim buffers for further undo data while already holding other undo pages content-locked. Doing so means that while we're e.g. doing IO to clean out the new page, old undo data on the previous page can't be read.

    This seems easier to fix. Instead of PrepareUndoInsert() acquiring, pinning and locking buffers, it would need to be split into two operations: one that acquires buffers and pins them, and one that locks them. I think it's quite possible that the locking operation could just be delayed until InsertPreparedUndo(). But if we solve the above problem, most of this might already be solved.
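To illustrate the ordering I have in mind, here is a deliberately crude, self-contained sketch. The stub functions are invented for illustration and are not the patchset's actual API; the point is only where the undo buffer pinning, the content lock, the deferred undo buffer locking, and the critical section fall relative to each other:

/*
 * Illustrative only: stand-alone stubs approximating the proposed ordering.
 * None of these are the patchset's real functions or signatures.
 */
#include <stdio.h>

static void pin_undo_buffers(void)       { puts("acquire + pin (but don't lock) undo buffers"); }
static void lock_content_page(void)      { puts("exclusively lock table page"); }
static void begin_critical_section(void) { puts("START_CRIT_SECTION()"); }
static void lock_undo_buffers(void)      { puts("lock previously pinned undo buffers"); }
static void modify_content_page(void)    { puts("modify table page, mark buffer dirty"); }
static void insert_prepared_undo(void)   { puts("copy undo record(s) into undo buffers"); }
static void register_undo_with_wal(void) { puts("associate undo buffers with WAL record"); }
static void write_wal(void)              { puts("XLogInsert()"); }
static void end_critical_section(void)   { puts("END_CRIT_SECTION()"); }

int
main(void)
{
    /* 1) visibility checks, waiting for lockers etc., before any content lock */

    /* 2+3) estimate undo size, then find, acquire and pin undo buffers --
     * potentially doing victim-buffer IO -- while nothing is content-locked */
    pin_undo_buffers();

    /* only now take the content lock on the table page */
    lock_content_page();

    begin_critical_section();
    lock_undo_buffers();        /* deferred locking, e.g. as part of InsertPreparedUndo() */
    modify_content_page();
    insert_prepared_undo();
    register_undo_with_wal();
    write_wal();
    end_critical_section();

    return 0;
}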
- To me the current split between the packed and unpacked UNDO record formats makes very little sense: the reasoning behind having them is poorly documented, if at all; it results in extremely verbose code; and it isn't extensible. When preparing to insert an undo record, the in-buffer size is computed with UndoRecordHeaderSize() (which needs to know about all optional data) from within PrepareUndoInsert() (which has a bunch of additional knowledge about the record format). Then during insertion, InsertPreparedUndo() first copies the UnpackedUndoRecord into an UndoPackContext (again needing format knowledge), and then, via InsertUndoData(), copies that in small increments into the respective buffers (again needing knowledge about the complete record format; two copies even). Besides the code duplication, that also means the memory copies are very inefficient, because they're all done in tiny increments, multiple times.

  When reading undo it's similar: UnpackUndoData(), again in small chunks, reads the buffer data into an UndoPackContext (another full copy of the unpacked record format). But then FinishUnpackUndo() *again* copies all that data, into an actual UnpackedUndoRecord (again, with a copy of the record format, albeit slightly different looking).

  I'm not convinced by Heikki's argument that we shouldn't have structure within undo records. In my opinion that is a significant weakness of how WAL was initially designed, and even after Heikki's work it still is a problem. But this isn't the right design either.

  Even if we were to stay with the current fixed record format, I think the current code needs a substantial redesign:

  - I think 'packing' during insertion needs to serialize into a char* allocation during PrepareUndoInsert, computing the size in parallel (or perhaps in InsertPreparedUndo, but probably not). The size of the record should never be split across page boundaries (i.e. we'd leave that space unused if we otherwise would need to split the size). The actual insertion would then be a maximally sized memcpy() (so we'd do as many memcpys as there are buffers the record spans, rather than one for each sub-part of a record). That makes it possible to remove most of the duplicated knowledge of the record format, and makes insertions faster (by doing only large memcpys while holding exclusive content locks).

  - When reading an undo record, the whole stage of UnpackUndoData() reading data into the UndoPackContext would be omitted, reading directly into the UnpackedUndoRecord. That removes one further copy of the record format.

  - To avoid having separate copies of the record format logic, I'd probably encode it into *one* array of metadata. If we had {offsetof(UnpackedUndoRecord, member), membersize(UnpackedUndoRecord, member), flag} we could fairly trivially remove most knowledge from the places currently knowing about the record format.

  I have some vague ideas for how to specify the format in a way that is more extensible, but with more structure than just a blob of data. But I don't think they're there yet.
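To make the metadata-array idea slightly more concrete, here's a stand-alone toy example (the struct, field names and flags are all invented; this is not the actual UnpackedUndoRecord layout), showing how one table of {offset, size, flag} entries can drive sizing, packing and unpacking without duplicating format knowledge:

/* Toy illustration only: one metadata array shared by size/pack/unpack. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define TOY_HAS_BLOCK   0x01
#define TOY_HAS_OFFSET  0x02

typedef struct ToyUndoRecord
{
    uint8_t  flags;     /* which optional fields are present */
    uint16_t type;
    uint32_t xid;
    uint64_t block;     /* optional, TOY_HAS_BLOCK */
    uint16_t offset;    /* optional, TOY_HAS_OFFSET */
} ToyUndoRecord;

typedef struct ToyFieldDesc
{
    size_t  off;        /* offsetof() of the field in ToyUndoRecord */
    size_t  size;       /* sizeof() of the field */
    uint8_t flag;       /* 0 = always present */
} ToyFieldDesc;

static const ToyFieldDesc toy_format[] = {
    {offsetof(ToyUndoRecord, flags),  sizeof(uint8_t),  0},
    {offsetof(ToyUndoRecord, type),   sizeof(uint16_t), 0},
    {offsetof(ToyUndoRecord, xid),    sizeof(uint32_t), 0},
    {offsetof(ToyUndoRecord, block),  sizeof(uint64_t), TOY_HAS_BLOCK},
    {offsetof(ToyUndoRecord, offset), sizeof(uint16_t), TOY_HAS_OFFSET},
};
#define TOY_NFIELDS (sizeof(toy_format) / sizeof(toy_format[0]))

/* serialize rec into buf (or just compute the size if buf is NULL) */
static size_t
toy_pack(const ToyUndoRecord *rec, char *buf)
{
    size_t len = 0;

    for (size_t i = 0; i < TOY_NFIELDS; i++)
    {
        if (toy_format[i].flag != 0 && !(rec->flags & toy_format[i].flag))
            continue;           /* optional field not present */
        if (buf)
            memcpy(buf + len, (const char *) rec + toy_format[i].off,
                   toy_format[i].size);
        len += toy_format[i].size;
    }
    return len;
}

/* deserialize from buf, driven by the same table */
static void
toy_unpack(ToyUndoRecord *rec, const char *buf)
{
    /* flags are serialized first, so we know which optional fields follow */
    memcpy(&rec->flags, buf, sizeof(rec->flags));
    size_t len = sizeof(rec->flags);

    for (size_t i = 1; i < TOY_NFIELDS; i++)
    {
        if (toy_format[i].flag != 0 && !(rec->flags & toy_format[i].flag))
            continue;
        memcpy((char *) rec + toy_format[i].off, buf + len, toy_format[i].size);
        len += toy_format[i].size;
    }
}

int
main(void)
{
    ToyUndoRecord rec = {TOY_HAS_BLOCK, 1, 42, 7, 0};
    ToyUndoRecord copy = {0};
    char          buf[64];

    size_t len = toy_pack(&rec, buf);
    toy_unpack(&copy, buf);
    printf("packed %zu bytes, xid=%u block=%llu\n",
           len, (unsigned) copy.xid, (unsigned long long) copy.block);
    return 0;
}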
- The interface to read undo also doesn't seem right to me. For one, there are various different ways to read records, with associated code duplication (batch; batch in "one page" mode, though that's being removed now I think; single record mode).

  I think the batch mode is too restrictive. We might not need this for the first merged version, but I think before long we're going to want to be able to efficiently traverse all the undo chains we need to determine the visibility of all tuples on a page. Otherwise we'll cause a lot of additional synchronous read IO, and will repeatedly re-fetch information, especially during sequential scans for an older snapshot.

  I think I briefly outlined this in an earlier email: my current thought is that the batch interface (which the non-batch interface should just be a tiny helper around) should basically be a queue of "to-be-fetched" undo records. When batch-reading an entire transaction, all blocks get put onto that queue. When traversing multiple chains, the chains are processed in a breadth-first fashion (by just looking at the queue, and pushing additional work to the end). That allows prefetch requests to be issued efficiently for blocks to be read in the near future.

  I think that batch reading should just copy the underlying data into a char* buffer. Only the records currently being used by higher layers should get exploded into an unpacked record. That will reduce memory usage quite noticeably (and I suspect it will also drastically reduce the overhead of a large memory context with a lot of small allocations that then get individually freed). It will make the sorting of undo a bit more CPU inefficient, because individual records will need to be partially unpacked for comparison, but I think that's going to be a far smaller loss than the win.
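As a purely illustrative, self-contained sketch (the block numbers and helper functions are stand-ins, not the real undo reader interface), the queue-based, breadth-first traversal with prefetching could look roughly like this:

/*
 * Toy sketch of a breadth-first "to-be-fetched" queue of undo blocks.
 * read_block()/prefetch_block()/discover_next() are stubs standing in for
 * a synchronous buffer read, a posix_fadvise-style prefetch, and following
 * an undo chain to its predecessor block.
 */
#include <stdio.h>
#include <stdbool.h>

#define QUEUE_SIZE     64
#define PREFETCH_DEPTH 8

static void read_block(long blk)     { printf("read block %ld\n", blk); }
static void prefetch_block(long blk) { printf("prefetch block %ld\n", blk); }

/* stub: pretend each record points at one earlier block, chains end at 1 */
static bool
discover_next(long blk, long *next)
{
    if (blk <= 1)
        return false;
    *next = blk / 2;
    return true;
}

int
main(void)
{
    long queue[QUEUE_SIZE];
    int  head = 0, tail = 0;

    /* seed the queue with the chain tails of interest, e.g. one per tuple */
    long seeds[] = {40, 37, 22};
    for (int i = 0; i < 3; i++)
        queue[tail++] = seeds[i];

    while (head < tail)
    {
        /* issue prefetches for work already queued behind the head
         * (a real implementation would remember what it already prefetched) */
        for (int i = head + 1; i < tail && i <= head + PREFETCH_DEPTH; i++)
            prefetch_block(queue[i]);

        long blk = queue[head++];
        read_block(blk);            /* synchronous read of the head element */

        /* push newly discovered predecessors to the end: breadth-first */
        long next;
        if (discover_next(blk, &next) && tail < QUEUE_SIZE)
            queue[tail++] = next;
    }
    return 0;
}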
- My reading of the current xact.c integration is that it's not workable as is: undo is executed outside of a valid transaction state, exceptions aren't properly undone, logic would need to be duplicated to a significant degree, and it introduces a new kind of critical section.

Greetings,

Andres Freund