Re: Disabling Heap-Only Tuples - Mailing list pgsql-hackers

From James Locke
Subject Re: Disabling Heap-Only Tuples
Msg-id CAGEtbYUe=9bJdj9Pd4BY5RA-gC2hSoHo0BbfYeJ_t7z_0+z2vg@mail.gmail.com
In response to Re: Disabling Heap-Only Tuples  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: Disabling Heap-Only Tuples
List pgsql-hackers
On Fri, May 8, 2026 at 2:00 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Hello James,
>
> On 2026-May-08, James Locke wrote:
>
> > Attached is a POC to enable userland table compaction: A top-level COMPACT
> > command that performs the relocation directly in the server, with a
> > stripped-down heap_relocate primitive instead of full UPDATE, and a
> > built-in prune-and-truncate pass so it runs to a useful end state in one
> > command.
>
> How does this implementation handle the case of a seqscan in the middle
> of scanning the table, which has already skipped the destination page
> and not yet the page from where the table is to be removed?  There needs
> to be a way to distinguish which of these to show (it must be exactly
> one), and you didn't mention this in your description.

It's the same invariant a cross-page UPDATE relies on, and heap_relocate inherits it because the on-disk representation and the WAL record are identical to those of a regular update.

heap_relocate sets the source's xmax and the new tuple's xmin to the same xid (the relocator's, call it R), and both writes go through one log_heap_update WAL record. So when HeapTupleSatisfiesMVCC asks "is this visible" for either tuple, it ends up asking the same XidInMVCCSnapshot(R, snap) question against the seqscan's snapshot: once for the destination's xmin and once for the source's xmax. Same xid, same answer.

Take a concrete case where the tuple is relocated from block 200 to block 5. The seqscan reads block 5 first and sees no live tuple there, either because the relocation hasn't happened yet, or because it has but R is still in the snapshot's xip list, so the new tuple's xmin reads as in-progress. Then COMPACT commits. The seqscan reaches block 200 still using the snapshot it took at scan start, which treats R the same way it did at block 5; snapshots don't change mid-scan. So either both pages treated R as committed (block 5 returned the row already, block 200 now sees the source as dead) or both treated it as running (block 5 saw nothing, block 200 returns the source). Exactly one.
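
To make the argument above easy to check, here is a toy model in Python. It is a sketch, not PostgreSQL code: the Snapshot and HeapTuple classes, the xid values, and the helper names are all made up for illustration. It also assumes every xid eventually commits, and it ignores hint bits, subtransactions, and the scan's own transaction. The only point is that both tuples' visibility hinges on the same "is R in the snapshot" test.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Toy MVCC snapshot (hypothetical, much simpler than SnapshotData)."""
    xmax: int                          # xids >= xmax had not started yet
    xip: FrozenSet[int] = frozenset()  # xids still running at snapshot time


@dataclass(frozen=True)
class HeapTuple:
    """Toy tuple header: just the block it lives on plus xmin/xmax."""
    block: int
    xmin: int
    xmax: Optional[int] = None


def xid_in_snapshot(xid: int, snap: Snapshot) -> bool:
    """True if the snapshot regards xid as still in progress."""
    return xid >= snap.xmax or xid in snap.xip


def visible(tup: HeapTuple, snap: Snapshot) -> bool:
    """Simplified visibility test, assuming all xids eventually commit."""
    if xid_in_snapshot(tup.xmin, snap):
        return False               # inserter not committed as far as we know
    if tup.xmax is None:
        return True                # never deleted or superseded
    return xid_in_snapshot(tup.xmax, snap)  # deleter still "running" => visible


R = 50                                             # the relocator's xid
DEST = HeapTuple(block=5, xmin=R)                  # new copy
SOURCE = HeapTuple(block=200, xmin=10, xmax=R)     # old copy


def visible_blocks(snap: Snapshot) -> List[int]:
    """Blocks on which a scan using this snapshot would return the row."""
    return [t.block for t in (DEST, SOURCE) if visible(t, snap)]
```

With a snapshot taken before R started (Snapshot(xmax=50)) or while R was running (Snapshot(xmax=60, xip=frozenset({50}))), visible_blocks returns only block 200. With one taken after R committed (Snapshot(xmax=60)), it returns only block 5. There is no snapshot in this model for which it returns both or neither, because the two checks are the same question asked about the same xid.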

The page-level atomicity comes from log_heap_update registering both buffers in one record and the modifications happening inside one CRIT_SECTION with exclusive content locks on both pages; concurrent share-locking readers can't see a half-applied state.

James
