Re: In-placre persistance change of a relation - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: In-placre persistance change of a relation |
Date | |
Msg-id | 1f201ea8-b1e3-4606-9525-c5817e651cda@iki.fi |
In response to | Re: In-placre persistance change of a relation (Michael Paquier <michael@paquier.xyz>) |
List | pgsql-hackers |

On 31/10/2024 10:01, Kyotaro Horiguchi wrote:
> After some delays, here’s the new version. In this update, UNDO logs
> are WAL-logged and processed in memory under most conditions. During
> checkpoints, they’re flushed to files, which are then read when a
> specific XID’s UNDO log is accessed for the first time during
> recovery.
>
> The biggest changes are in patches 0001 through 0004 (equivalent to
> the previous 0001-0002). After that, there aren’t any major
> changes. Since this update involves removing some existing features,
> I’ve split these parts into multiple smaller identity transformations
> to make them clearer.
>
> As for changes beyond that, the main one is lifting the previous
> restriction on PREPARE for transactions after a persistence
> change. This was made possible because, with the shift to in-memory
> processing of UNDO logs, commit-time crash recovery detection is now
> simpler. Additional changes include completely removing the
> abort-handling portion from the pendingDeletes mechanism (0008-0010).

In this patch version, the undo log is kept in dynamic shared memory. It can grow indefinitely. On a checkpoint, it's flushed to disk.

If I'm reading it correctly, the undo records are kept in the DSA area even after it's flushed to disk. That's not necessary; system never needs to read the undo log unless there's a crash, so there's no need to keep it in memory after it's been flushed to disk. That's true today; we could start relying on the undo log to clean up on abort even when there's no crash, but I think it's a good design to not do that and rely on backend-private state for non-crash transaction abort.

I'd suggest doing this the other way 'round. Let's treat the on-disk representation as the primary representation, not the in-memory one. Let's use a small fixed-size shared memory area just as a write buffer to hold the dirty undo log entries that haven't been written to disk yet. Most transactions are short, so most undo log entries never need to be flushed to disk, but I think it'll be simpler to think of it that way. On checkpoint, flush all the buffered dirty entries from memory to disk and clear the buffer. Also do that if the buffer fills up.

A high-level overview comment of the on-disk format would be nice. If I understand correctly, there's a magic constant at the beginning of each undo file, followed by UndoLogRecords. There are no other file headers and no page structure within the file. That format seems reasonable. For cross-checking, maybe add the XID to the file header too. There is a separate CRC value on each record, which is nice, but not strictly necessary since the writes to the UNDO log are WAL-logged. The WAL needs CRCs on each record to detect the end of log, but the UNDO log doesn't need that. Anyway, it's fine.

I somehow dislike the file per subxid design. I'm sure it works, it's just more of a feeling that it doesn't feel right. I'm somewhat worried about ending up with lots of files, if you e.g. use temporary tables with subtransactions heavily. Could we have just one file per top-level XID? I guess that can become a problem too, if you have a lot of aborted subtransactions. The UNDO records for the aborted subtransactions would bloat the undo file. But maybe that's nevertheless better?

--
Heikki Linnakangas
Neon (https://neon.tech)
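
The write-buffer idea and the file layout discussed above can be sketched roughly as follows. This is an illustration only, not code from the patch: the names `UndoRec`, `UndoFileHeader`, `UndoWriteBuffer`, `undo_open`, `undo_append`, and `undo_flush`, the magic value, and the record fields are all hypothetical. It also simplifies heavily: one transaction and one file, ordinary process memory instead of shared memory, no locking, no WAL-logging, and no per-record CRC.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define UNDO_MAGIC   0x554E444FU  /* hypothetical magic value (ASCII "UNDO") */
#define UNDO_BUF_CAP 4            /* tiny fixed capacity so a demo can fill it */

/* Hypothetical record layout; the patch's real UndoLogRecord differs. */
typedef struct { uint32_t xid; uint32_t relnumber; } UndoRec;

/* Hypothetical file header: magic constant plus the XID for cross-checking. */
typedef struct { uint32_t magic; uint32_t xid; } UndoFileHeader;

typedef struct {
    UndoRec recs[UNDO_BUF_CAP]; /* dirty entries not yet written to disk */
    int     ndirty;
    FILE   *file;               /* on-disk log: the primary representation */
    int     nflushes;           /* flush counter, for illustration only */
} UndoWriteBuffer;

/* Create the undo file and write its header. Returns 0 on success. */
static int undo_open(UndoWriteBuffer *b, const char *path, uint32_t xid)
{
    UndoFileHeader h = { UNDO_MAGIC, xid };

    memset(b, 0, sizeof(*b));
    b->file = fopen(path, "wb");
    if (b->file == NULL)
        return -1;
    return fwrite(&h, sizeof(h), 1, b->file) == 1 ? 0 : -1;
}

/* Write all buffered entries to disk and clear the buffer.
 * Intended to be called at checkpoint time and when the buffer is full. */
static int undo_flush(UndoWriteBuffer *b)
{
    size_t n = (size_t) b->ndirty;

    if (n == 0)
        return 0;
    if (fwrite(b->recs, sizeof(UndoRec), n, b->file) != n)
        return -1;
    if (fflush(b->file) != 0)
        return -1;
    b->ndirty = 0;
    b->nflushes++;
    return 0;
}

/* Buffer one entry, flushing first if the fixed-size buffer is full. */
static int undo_append(UndoWriteBuffer *b, UndoRec rec)
{
    if (b->ndirty == UNDO_BUF_CAP && undo_flush(b) != 0)
        return -1;
    assert(b->ndirty < UNDO_BUF_CAP);
    b->recs[b->ndirty++] = rec;
    return 0;
}
```

In this sketch, a transaction that appends fewer than `UNDO_BUF_CAP` entries and finishes before the next checkpoint never causes a record write, which matches the observation that most undo log entries never need to reach disk. Nothing is kept in memory once it has been flushed.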