Re: In-placre persistance change of a relation - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: In-placre persistance change of a relation
Msg-id 1f201ea8-b1e3-4606-9525-c5817e651cda@iki.fi
In response to Re: In-placre persistance change of a relation  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On 31/10/2024 10:01, Kyotaro Horiguchi wrote:
> After some delays, here’s the new version. In this update, UNDO logs
> are WAL-logged and processed in memory under most conditions. During
> checkpoints, they’re flushed to files, which are then read when a
> specific XID’s UNDO log is accessed for the first time during
> recovery.
> 
> The biggest changes are in patches 0001 through 0004 (equivalent to
> the previous 0001-0002). After that, there aren’t any major
> changes. Since this update involves removing some existing features,
> I’ve split these parts into multiple smaller identity transformations
> to make them clearer.
> 
> As for changes beyond that, the main one is lifting the previous
> restriction on PREPARE for transactions after a persistence
> change. This was made possible because, with the shift to in-memory
> processing of UNDO logs, commit-time crash recovery detection is now
> simpler. Additional changes include completely removing the
> abort-handling portion from the pendingDeletes mechanism (0008-0010).

In this patch version, the undo log is kept in dynamic shared memory. It 
can grow indefinitely. On a checkpoint, it's flushed to disk.

If I'm reading it correctly, the undo records are kept in the DSA area 
even after they've been flushed to disk. That's not necessary; the system 
never needs to read the undo log unless there's a crash, so there's no 
need to keep it in memory after it's been flushed to disk. That's true 
today; we could start relying on the undo log to clean up on abort even 
when there's no crash, but I think it's a good design to not do that and 
to rely on backend-private state for non-crash transaction abort.


I'd suggest doing this the other way 'round. Let's treat the on-disk 
representation as the primary representation, not the in-memory one. 
Let's use a small fixed-size shared memory area just as a write buffer 
to hold the dirty undo log entries that haven't been written to disk 
yet. Most transactions are short, so most undo log entries never need to 
be flushed to disk, but I think it'll be simpler to think of it that 
way. On checkpoint, flush all the buffered dirty entries from memory to 
disk and clear the buffer. Also do that if the buffer fills up.
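
To make the write-buffer idea concrete, here is a minimal sketch, not code from the patch. All names (UndoEntry, UndoWriteBuffer, undo_buffer_add and so on), the tiny capacity, and the use of a stdio FILE in place of the real undo file are assumptions for illustration; locking and WAL-logging are left out entirely. Dropping a finished transaction's unflushed entries is also an assumption, meant to show why short transactions would never touch disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical capacity, kept tiny so the behavior is easy to follow. */
#define UNDO_BUF_CAPACITY 4

/* Hypothetical undo entry: which transaction, and what to clean up. */
typedef struct UndoEntry
{
    uint32_t xid;
    uint32_t relnumber;
} UndoEntry;

/* Fixed-size buffer holding only entries not yet written to disk. */
typedef struct UndoWriteBuffer
{
    UndoEntry entries[UNDO_BUF_CAPACITY];
    size_t    nused;
    size_t    nflushed_total;   /* bookkeeping for the example only */
} UndoWriteBuffer;

/*
 * Write every buffered entry to 'out' and clear the buffer. A checkpoint
 * would call this. Returns the number of entries written, or -1 on error.
 */
static int
undo_buffer_flush(UndoWriteBuffer *buf, FILE *out)
{
    size_t n = buf->nused;

    if (n > 0 && fwrite(buf->entries, sizeof(UndoEntry), n, out) != n)
        return -1;
    buf->nflushed_total += n;
    buf->nused = 0;
    return (int) n;
}

/* Add one entry; if the buffer is full, flush it to disk first. */
static int
undo_buffer_add(UndoWriteBuffer *buf, FILE *out, UndoEntry e)
{
    if (buf->nused == UNDO_BUF_CAPACITY && undo_buffer_flush(buf, out) < 0)
        return -1;
    assert(buf->nused < UNDO_BUF_CAPACITY);
    buf->entries[buf->nused++] = e;
    return 0;
}

/*
 * Assumption: when a transaction ends, its still-buffered entries can
 * simply be discarded, so they never reach disk at all.
 */
static void
undo_buffer_forget_xid(UndoWriteBuffer *buf, uint32_t xid)
{
    size_t keep = 0;

    for (size_t i = 0; i < buf->nused; i++)
    {
        if (buf->entries[i].xid != xid)
            buf->entries[keep++] = buf->entries[i];
    }
    buf->nused = keep;
}
```

With this shape, memory use is bounded by the buffer size regardless of how long transactions run, and the on-disk file is the only place a crash-recovery reader ever needs to look once the WAL has been replayed.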

A high-level overview comment of the on-disk format would be nice. If I 
understand correctly, there's a magic constant at the beginning of each 
undo file, followed by UndoLogRecords. There are no other file headers 
and no page structure within the file.

That format seems reasonable. For cross-checking, maybe add the XID to 
the file header too. There is a separate CRC value on each record, which 
is nice, but not strictly necessary since the writes to the UNDO log are 
WAL-logged. The WAL needs CRCs on each record to detect the end of log, 
but the UNDO log doesn't need that. Anyway, it's fine.
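
As a sketch of what that layout could look like with the XID added to the header: the struct names, field order, the magic value, and the helper functions below are all hypothetical, not taken from the patch. The CRC is a plain bitwise CRC-32C written out here only to keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define UNDO_FILE_MAGIC 0x554E444Fu     /* hypothetical magic value */

/* File header: magic constant plus the owning XID for cross-checking. */
typedef struct UndoFileHeader
{
    uint32_t magic;
    uint32_t xid;
} UndoFileHeader;

/* Per-record header; records follow the file header back to back. */
typedef struct UndoRecordHeader
{
    uint32_t crc;       /* CRC-32C of the payload */
    uint32_t len;       /* payload length in bytes */
} UndoRecordHeader;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

static int
undo_file_write_header(FILE *f, uint32_t xid)
{
    UndoFileHeader h = {UNDO_FILE_MAGIC, xid};

    return fwrite(&h, sizeof h, 1, f) == 1 ? 0 : -1;
}

/* Returns 0 if the magic and XID match what the caller expects, else -1. */
static int
undo_file_check_header(FILE *f, uint32_t expected_xid)
{
    UndoFileHeader h;

    if (fread(&h, sizeof h, 1, f) != 1)
        return -1;
    return (h.magic == UNDO_FILE_MAGIC && h.xid == expected_xid) ? 0 : -1;
}

static int
undo_file_append(FILE *f, const void *payload, uint32_t len)
{
    UndoRecordHeader rh;

    assert(payload != NULL || len == 0);
    rh.crc = crc32c(payload, len);
    rh.len = len;
    if (fwrite(&rh, sizeof rh, 1, f) != 1)
        return -1;
    if (len > 0 && fwrite(payload, 1, len, f) != len)
        return -1;
    return 0;
}

/* Returns 1 if a record was read, 0 if nothing is left, -1 if damaged. */
static int
undo_file_read_record(FILE *f, void *buf, uint32_t bufsize, uint32_t *len)
{
    UndoRecordHeader rh;
    size_t n = fread(&rh, 1, sizeof rh, f);

    if (n == 0)
        return 0;
    if (n != sizeof rh || rh.len > bufsize)
        return -1;
    if (rh.len > 0 && fread(buf, 1, rh.len, f) != rh.len)
        return -1;
    if (crc32c(buf, rh.len) != rh.crc)
        return -1;
    *len = rh.len;
    return 1;
}
```

The XID check catches a file being opened under the wrong transaction, which the magic constant alone cannot. The per-record CRC here only detects damage; unlike in the WAL, it is not what determines where the log ends.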


I somehow dislike the file-per-subxid design. I'm sure it works; it just 
doesn't feel right to me. I'm somewhat worried about ending up with lots 
of files if you, for example, use temporary tables with subtransactions 
heavily. Could we have just one file per top-level XID? I guess that can 
become a problem too, if you have a lot of aborted subtransactions: the 
UNDO records for the aborted subtransactions would bloat the undo file. 
But maybe that's nevertheless better?

-- 
Heikki Linnakangas
Neon (https://neon.tech)



