Re: In-placre persistance change of a relation - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Re: In-placre persistance change of a relation |
Date | |
Msg-id | 20241105.132526.317017969504645384.horikyota.ntt@gmail.com Whole thread Raw |
In response to | In-placre persistance change of a relation (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
List | pgsql-hackers |
Thank you for the quick comments. At Thu, 31 Oct 2024 23:24:36 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 31/10/2024 10:01, Kyotaro Horiguchi wrote: > > After some delays, here’s the new version. In this update, UNDO logs > > are WAL-logged and processed in memory under most conditions. During > > checkpoints, they’re flushed to files, which are then read when a > > specific XID’s UNDO log is accessed for the first time during > > recovery. > > The biggest changes are in patches 0001 through 0004 (equivalent to > > the previous 0001-0002). After that, there aren’t any major > > changes. Since this update involves removing some existing features, > > I’ve split these parts into multiple smaller identity transformations > > to make them clearer. > > As for changes beyond that, the main one is lifting the previous > > restriction on PREPARE for transactions after a persistence > > change. This was made possible because, with the shift to in-memory > > processing of UNDO logs, commit-time crash recovery detection is now > > simpler. Additional changes include completely removing the > > abort-handling portion from the pendingDeletes mechanism (0008-0010). > > In this patch version, the undo log is kept in dynamic shared > memory. It can grow indefinitely. On a checkpoint, it's flushed to > disk. > > If I'm reading it correctly, the undo records are kept in the DSA area > even after it's flushed to disk. That's not necessary; system never > needs to read the undo log unless there's a crash, so there's no need The system also needs to read the undo log whenever additional undo logs are added. In this version, I’ve moved all abort-time pendingDeletes data entirely to the undo logs. In other words, the DSA area is expanded in exchange for reducing the pendingDelete list. As a result, there is minimal impact on overall memory usage. Additionally, the current flushing code is straightforward because it relies on the in-memory primary image. If we drop the in-memory image during flush, we might need exclusive locking or possibly some ordering techniques. Anyway, I’ll consider that approach. > to keep it in memory after it's been flushed to disk. That's true > today; we could start relying on the undo log to clean up on abort > even when there's no crash, but I think it's a good design to not do > that and rely on backend-private state for non-crash transaction > abort. Hmm. Sounds reasonable. In the next version, I'll revert the changes to pendingDeletes and adjust it to just discard the log on regular aborts. > I'd suggest doing this the other way 'round. Let's treat the on-disk > representation as the primary representation, not the in-memory > one. Let's use a small fixed-size shared memory area just as a write > buffer to hold the dirty undo log entries that haven't been written to > disk yet. Most transactions are short, so most undo log entries never > need to be flushed to disk, but I think it'll be simpler to think of > it that way. On checkpoint, flush all the buffered dirty entries from > memory to disk and clear the buffer. Also do that if the buffer fills > up. I'd like to clarify the specific concept of these fixed-length memory slots. Is it something like this: each slot is keyed by an XID, followed by an in-file offset and a series of, say, 1024-byte areas? When writing a log for a new XID, if no slot is available, the backend would immediately evict the slot with the smallest XID to disk to free up space. If an existing slot runs out of space while writing new logs, the backend would flush it immediately and continue using the area. Is this correct? Additionally, if multiple processes try to write to a single slot, stricter locking might be needed. For example, if a slot is evicted by a backend other than its user, exclusive control might be required during the file write. jjjIs there any effective way to avoid such locking? In the current patch set, I’m avoiding any impact on the backend from checkpointer file writes by treating the in-memory image as primary. And regarding the number of these areas… although I’m not entirely sure, it seems unlikely we’d have hundreds of sessions simultaneously creating tables, so would it make sense to make this configurable, with a default of around 32 areas? > A high-level overview comment of the on-disk format would be nice. If > I understand correctly, there's a magic constant at the beginning of > each undo file, followed by UndoLogRecords. There are no other file > headers and no page structure within the file. Right. > That format seems reasonable. For cross-checking, maybe add the XID to > the file header too. There is a separate CRC value on each record, > which is nice, but not strictly necessary since the writes to the UNDO > log are WAL-logged. The WAL needs CRCs on each record to detect the > end of log, but the UNDO log doesn't need that. Anyway, it's fine. For the first point, I considered it when designing the previous patch set but chose not to implement it. As for the CRC, you're right - it’s simply a leftover from the previous design. I have no issues with following both points. > I somehow dislike the file per subxid design. I'm sure it works, it's > just more of a feeling that it doesn't feel right. I'm somewhat > worried about ending up with lots of files, if you e.g. use temporary > tables with subtransactions heavily. Could we have just one file per I first thought the same thing when working on the previos patch. > top-level XID? I guess that can become a problem too, if you have a > lot of aborted subtransactions. The UNDO records for the aborted > subtransactions would bloat the undo file. But maybe that's > nevertheless better? In the current patch set, normal abort processing is handled by the UNDO log, so maintaining the performance of the UNDO process is essential. If we were to return this to pendingDeletes, it might also be feasible to add an XID cancellation record to the UNDO log and scan the entire file once before executing individual logs. I’ll give it some thought. At Mon, 28 Oct 2024 15:33:41 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 31/08/2024 19:09, Kyotaro Horiguchi wrote: > > Subject: [PATCH v34 03/16] Remove function for retaining files on > > outer > > transaction aborts > > The function RelationPreserveStorage() was initially created to keep > > storage files committed in a subtransaction on the abort of outer > > transactions. It was introduced by commit b9b8831ad6 in 2010, but no > > use case for this behavior has emerged since then. If we move the > > at-commit removal feature of storage files from pendingDeletes to the > > UNDO log system, the UNDO system would need to accept the cancellation > > of already logged entries, which makes the system overly complex with > > no benefit. Therefore, remove the feature. > > I don't think that's quite right. I don't think this was meant for > subtransaction aborts, but to make sure that if the top-transaction > aborts after AtEOXact_RelationMap() has already been called, we don't > remove the new relation. AtEOXact_RelationMap() is called very late in Hmm. I believe I wrote that. It prevents storage removal once it’s committed in any subtransaction, even if that subtransaction is finally aborted, including by the top transaction. > the commit process to keep the window as small as possible, but if it > nevertheless happens, the consequences are pretty bad if you remove a > relation file that is in fact needed. However, on second thought, it does seem odd. I may have confused something here. If pendingDeletes is restored and undo cancellation is implemented, this change would be unnecessary. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: