Re: In-placre persistance change of a relation - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: In-placre persistance change of a relation
Msg-id 20241105.132526.317017969504645384.horikyota.ntt@gmail.com
In response to In-placre persistance change of a relation  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Thank you for the quick comments.

At Thu, 31 Oct 2024 23:24:36 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> On 31/10/2024 10:01, Kyotaro Horiguchi wrote:
> > After some delays, here’s the new version. In this update, UNDO logs
> > are WAL-logged and processed in memory under most conditions. During
> > checkpoints, they’re flushed to files, which are then read when a
> > specific XID’s UNDO log is accessed for the first time during
> > recovery.
> > The biggest changes are in patches 0001 through 0004 (equivalent to
> > the previous 0001-0002). After that, there aren’t any major
> > changes. Since this update involves removing some existing features,
> > I’ve split these parts into multiple smaller identity transformations
> > to make them clearer.
> > As for changes beyond that, the main one is lifting the previous
> > restriction on PREPARE for transactions after a persistence
> > change. This was made possible because, with the shift to in-memory
> > processing of UNDO logs, commit-time crash recovery detection is now
> > simpler. Additional changes include completely removing the
> > abort-handling portion from the pendingDeletes mechanism (0008-0010).
> 
> In this patch version, the undo log is kept in dynamic shared
> memory. It can grow indefinitely. On a checkpoint, it's flushed to
> disk.
> 
> If I'm reading it correctly, the undo records are kept in the DSA area
> even after it's flushed to disk. That's not necessary; system never
> needs to read the undo log unless there's a crash, so there's no need

The system also needs to read the undo log whenever additional undo
logs are added. In this version, I’ve moved all abort-time
pendingDeletes data entirely to the undo logs. In other words, the DSA
area is expanded in exchange for reducing the pendingDelete list. As a
result, there is minimal impact on overall memory usage. Additionally,
the current flushing code is straightforward because it relies on the
in-memory primary image. If we drop the in-memory image during flush,
we might need exclusive locking or possibly some ordering
techniques. Anyway, I’ll consider that approach.

> to keep it in memory after it's been flushed to disk. That's true
> today; we could start relying on the undo log to clean up on abort
> even when there's no crash, but I think it's a good design to not do
> that and rely on backend-private state for non-crash transaction
> abort.

Hmm. Sounds reasonable. In the next version, I'll revert the changes
to pendingDeletes and adjust it to just discard the log on regular
aborts.

> I'd suggest doing this the other way 'round. Let's treat the on-disk
> representation as the primary representation, not the in-memory
> one. Let's use a small fixed-size shared memory area just as a write
> buffer to hold the dirty undo log entries that haven't been written to
> disk yet. Most transactions are short, so most undo log entries never
> need to be flushed to disk, but I think it'll be simpler to think of
> it that way. On checkpoint, flush all the buffered dirty entries from
> memory to disk and clear the buffer. Also do that if the buffer fills
> up.

I'd like to clarify the specific concept of these fixed-length memory
slots. Is it something like this: each slot is keyed by an XID,
followed by an in-file offset and a series of, say, 1024-byte areas?
When writing a log for a new XID, if no slot is available, the backend
would immediately evict the slot with the smallest XID to disk to free
up space. If an existing slot runs out of space while writing new
logs, the backend would flush it immediately and continue using the
area. Is this correct? Additionally, if multiple processes try to
write to a single slot, stricter locking might be needed. For example,
if a slot is evicted by a backend other than its user, exclusive
control might be required during the file write. Is there any
effective way to avoid such locking? In the current patch set, I’m
avoiding any impact on the backend from checkpointer file writes by
treating the in-memory image as primary. And regarding the number of
these areas… although I’m not entirely sure, it seems unlikely we’d
have hundreds of sessions simultaneously creating tables, so would it
make sense to make this configurable, with a default of around 32
areas?
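
To make the question concrete, here is a rough sketch of the slot
scheme as described in the paragraph above. It is an illustration
only: every name and size in it (UndoSlot, UndoBuffer, undo_append,
the 1024-byte area, the 32 slots) is hypothetical and not taken from
the patch set, the "flush" only counts bytes instead of writing a
file, locking is omitted entirely, and XIDs are compared with a plain
"<", which ignores wraparound.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNDO_SLOT_AREA_SIZE 1024    /* bytes buffered per slot (assumed) */
#define UNDO_NUM_SLOTS      32      /* default suggested above */
#define INVALID_XID         0u

typedef struct UndoSlot
{
    uint32_t xid;           /* owning XID, INVALID_XID if the slot is free */
    uint64_t file_offset;   /* where the buffered bytes belong in the file */
    uint32_t used;          /* bytes currently buffered in area[] */
    char     area[UNDO_SLOT_AREA_SIZE];
} UndoSlot;

typedef struct UndoBuffer
{
    UndoSlot slots[UNDO_NUM_SLOTS];
    uint64_t bytes_flushed; /* stand-in for real file writes */
} UndoBuffer;

/* Pretend to write the buffered bytes to the XID's undo file. */
static void
undo_slot_flush(UndoBuffer *buf, UndoSlot *slot)
{
    buf->bytes_flushed += slot->used;
    slot->file_offset += slot->used;
    slot->used = 0;
}

/* Return the slot for xid; if none is free, evict the smallest XID. */
static UndoSlot *
undo_slot_acquire(UndoBuffer *buf, uint32_t xid)
{
    UndoSlot *free_slot = NULL;
    UndoSlot *victim = NULL;

    for (int i = 0; i < UNDO_NUM_SLOTS; i++)
    {
        UndoSlot *s = &buf->slots[i];

        if (s->xid == xid)
            return s;
        if (s->xid == INVALID_XID)
        {
            if (free_slot == NULL)
                free_slot = s;
        }
        else if (victim == NULL || s->xid < victim->xid)
            victim = s;
    }

    if (free_slot == NULL)
    {
        assert(victim != NULL);
        undo_slot_flush(buf, victim);   /* evict to disk */
        free_slot = victim;
    }

    /*
     * A newly assigned slot starts at offset 0 here.  Real code would have
     * to recover the offset from the existing file if this XID had been
     * evicted earlier.
     */
    free_slot->xid = xid;
    free_slot->file_offset = 0;
    free_slot->used = 0;
    return free_slot;
}

/* Buffer one undo record, flushing the slot first if it would overflow. */
static void
undo_append(UndoBuffer *buf, uint32_t xid, const void *rec, uint32_t len)
{
    UndoSlot *slot = undo_slot_acquire(buf, xid);

    assert(len <= UNDO_SLOT_AREA_SIZE);
    if (slot->used + len > UNDO_SLOT_AREA_SIZE)
        undo_slot_flush(buf, slot);
    memcpy(slot->area + slot->used, rec, len);
    slot->used += len;
}
```

The locking question is exactly about undo_slot_flush() in the
eviction path: there it runs on a slot that belongs to a different
backend.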

> A high-level overview comment of the on-disk format would be nice. If
> I understand correctly, there's a magic constant at the beginning of
> each undo file, followed by UndoLogRecords. There are no other file
> headers and no page structure within the file.

Right.

> That format seems reasonable. For cross-checking, maybe add the XID to
> the file header too. There is a separate CRC value on each record,
> which is nice, but not strictly necessary since the writes to the UNDO
> log are WAL-logged. The WAL needs CRCs on each record to detect the
> end of log, but the UNDO log doesn't need that. Anyway, it's fine.

For the first point, I considered it when designing the previous patch
set but chose not to implement it. As for the CRC, you're right - it’s
simply a leftover from the previous design. I have no issues with
following both points.
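
For reference, a minimal sketch of the layout being discussed, with
the XID added to the file header and no per-record CRC. This is an
assumption-laden illustration, not the patch's code: the magic value,
the struct and function names, and the record header fields (length,
type) are all made up here, and an in-memory byte array stands in for
the file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNDO_FILE_MAGIC 0x554E444FU     /* hypothetical value */

typedef struct UndoFileHeader
{
    uint32_t magic;
    uint32_t xid;       /* proposed addition, for cross-checking */
} UndoFileHeader;

typedef struct UndoRecordHeader
{
    uint32_t length;    /* payload bytes following this header */
    uint8_t  type;      /* record type (hypothetical) */
} UndoRecordHeader;

/* Write the file header; returns the new size of the image. */
static size_t
undo_file_init(uint8_t *buf, size_t cap, uint32_t xid)
{
    UndoFileHeader hdr = {UNDO_FILE_MAGIC, xid};

    assert(cap >= sizeof(hdr));
    memcpy(buf, &hdr, sizeof(hdr));
    return sizeof(hdr);
}

/* Append one record right after the existing ones; no page structure. */
static size_t
undo_file_append(uint8_t *buf, size_t cap, size_t size,
                 uint8_t type, const void *payload, uint32_t len)
{
    UndoRecordHeader rec;

    memset(&rec, 0, sizeof(rec));       /* zero the padding bytes too */
    rec.length = len;
    rec.type = type;
    assert(size + sizeof(rec) + len <= cap);
    memcpy(buf + size, &rec, sizeof(rec));
    memcpy(buf + size + sizeof(rec), payload, len);
    return size + sizeof(rec) + len;
}

/*
 * Validate the header against the expected XID and walk the records.
 * Returns the number of records, or -1 on a header mismatch or a
 * truncated record.
 */
static int
undo_file_check(const uint8_t *buf, size_t size, uint32_t expected_xid)
{
    UndoFileHeader hdr;
    size_t  pos = sizeof(hdr);
    int     nrecords = 0;

    if (size < sizeof(hdr))
        return -1;
    memcpy(&hdr, buf, sizeof(hdr));
    if (hdr.magic != UNDO_FILE_MAGIC || hdr.xid != expected_xid)
        return -1;

    while (pos < size)
    {
        UndoRecordHeader rec;

        if (size - pos < sizeof(rec))
            return -1;
        memcpy(&rec, buf + pos, sizeof(rec));
        pos += sizeof(rec);
        if (size - pos < rec.length)
            return -1;
        pos += rec.length;
        nrecords++;
    }
    return nrecords;
}
```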

> I somehow dislike the file per subxid design. I'm sure it works, it's
> just more of a feeling that it doesn't feel right. I'm somewhat
> worried about ending up with lots of files, if you e.g. use temporary
> tables with subtransactions heavily. Could we have just one file per

I first thought the same thing when working on the previous patch.

> top-level XID? I guess that can become a problem too, if you have a
> lot of aborted subtransactions. The UNDO records for the aborted
> subtransactions would bloat the undo file. But maybe that's
> nevertheless better?

In the current patch set, normal abort processing is handled by the
UNDO log, so maintaining the performance of the UNDO process is
essential. If we were to return this to pendingDeletes, it might also
be feasible to add an XID cancellation record to the UNDO log and scan
the entire file once before executing individual logs. I’ll give it
some thought.
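
A rough sketch of how that could look with one file per top-level
XID, assuming a cancellation record simply names a subtransaction
whose earlier records must be skipped. The record layout and all
names (UndoRec, UNDO_REC_CANCEL_XID, undo_apply) are hypothetical,
and the undo action itself is only counted, not executed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum
{
    UNDO_REC_ACTION,        /* something to undo, e.g. remove a file */
    UNDO_REC_CANCEL_XID     /* voids all action records of the given XID */
} UndoRecType;

typedef struct UndoRec
{
    UndoRecType type;
    uint32_t    xid;        /* writer of the action, or the cancelled XID */
} UndoRec;

#define MAX_CANCELLED 64    /* arbitrary limit for this sketch */

static bool
xid_is_cancelled(const uint32_t *cancelled, size_t n, uint32_t xid)
{
    for (size_t i = 0; i < n; i++)
    {
        if (cancelled[i] == xid)
            return true;
    }
    return false;
}

/* Returns the number of action records that would be executed. */
static size_t
undo_apply(const UndoRec *recs, size_t nrecs)
{
    uint32_t cancelled[MAX_CANCELLED];
    size_t   ncancelled = 0;
    size_t   executed = 0;

    /* Pass 1: scan the whole file once and collect the cancelled XIDs. */
    for (size_t i = 0; i < nrecs; i++)
    {
        if (recs[i].type == UNDO_REC_CANCEL_XID)
        {
            assert(ncancelled < MAX_CANCELLED);
            cancelled[ncancelled++] = recs[i].xid;
        }
    }

    /* Pass 2: execute only the records that were not cancelled. */
    for (size_t i = 0; i < nrecs; i++)
    {
        if (recs[i].type == UNDO_REC_ACTION &&
            !xid_is_cancelled(cancelled, ncancelled, recs[i].xid))
            executed++;     /* real code would perform the undo action */
    }
    return executed;
}
```

The extra scan is needed because a cancellation record is always
written after the records it cancels.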


At Mon, 28 Oct 2024 15:33:41 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> On 31/08/2024 19:09, Kyotaro Horiguchi wrote:
> > Subject: [PATCH v34 03/16] Remove function for retaining files on
> > outer
> >  transaction aborts
> > The function RelationPreserveStorage() was initially created to keep
> > storage files committed in a subtransaction on the abort of outer
> > transactions. It was introduced by commit b9b8831ad6 in 2010, but no
> > use case for this behavior has emerged since then. If we move the
> > at-commit removal feature of storage files from pendingDeletes to the
> > UNDO log system, the UNDO system would need to accept the cancellation
> > of already logged entries, which makes the system overly complex with
> > no benefit. Therefore, remove the feature.
> 
> I don't think that's quite right. I don't think this was meant for
> subtransaction aborts, but to make sure that if the top-transaction
> aborts after AtEOXact_RelationMap() has already been called, we don't
> remove the new relation. AtEOXact_RelationMap() is called very late in

Hmm. I believe I wrote that. It prevents storage removal once it’s
committed in any subtransaction, even if that subtransaction is
finally aborted, including by the top transaction.

> the commit process to keep the window as small as possible, but if it
> nevertheless happens, the consequences are pretty bad if you remove a
> relation file that is in fact needed.

However, on second thought, it does seem odd. I may have confused
something here. If pendingDeletes is restored and undo cancellation is
implemented, this change would be unnecessary.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


