Re: Disabling Heap-Only Tuples - Mailing list pgsql-hackers

From James Locke
Subject Re: Disabling Heap-Only Tuples
Date
Msg-id CAGEtbYUFNMQ1XqX2VFNM-RZ_2AZaUmU970V20Guvx2L6soMw9Q@mail.gmail.com
Whole thread
In response to Re: Disabling Heap-Only Tuples  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: Disabling Heap-Only Tuples
List pgsql-hackers
On Fri, May 8, 2026 at 12:45 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Wed, 13 Mar 2024 at 14:27, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> >
> > On Thu, 2023-09-21 at 16:18 -0700, Andres Freund wrote:
> > > I think a minimal working approach could be to have the configuration be based
> > > on the relation size vs space known to the FSM. If the target block of an
> > > update is higher than ((relation_size - fsm_free_space) *
> > > new_reloption_or_guc), try finding the target block via the FSM, even if
> > > there's space on the page.
> >
> > That sounds like a good way forward.
> >
> > The patch is in state "needs review", but it got review.  I'll change it to
> > "waiting for author".
>
> Then I'll withdraw this patch as I don't currently have (nor expect to
> have anytime soon) the bandwitdh or expertise to rewrite this patch to
> include a system that calculates the free space available in a
> relation.
>
> I've added a TODO item in the UPDATE section with a backlink to this
> thread so the discussion isn't lost.

Attached is a POC to enable userland table compaction: A top-level COMPACT command that performs the relocation directly in the server, with a stripped-down heap_relocate primitive instead of full UPDATE, and a built-in prune-and-truncate pass so it runs to a useful end state in one command.  Rough comparison:

  Approach           Driver          Disk overhead   One shot
  -------------------------------------------------------------
  VACUUM FULL        server          ~live size      yes
  REPACK             server          ~live size      yes
  local_update_limit + UPDATE loop   user            ~1 page         no
  COMPACT (this POC) server          ~1 page         yes

Design summary:

  - heap_relocate: low-level "move this tuple to that page" primitive,
    sibling of heap_update.  Skips HOT, modified-attr analysis, toast,
    and replica-identity extraction (data is byte-identical).
    Concurrency: handles key-share lockers and multixacts via the
    same logic as heap_update (no-wait variant).  Skips tuples held
    under stronger locks or being updated by a live transaction; the
    caller revisits them on a future run.

  - XLH_UPDATE_RELOCATED: new flag on xl_heap_update.  Replay treats
    these as ordinary updates; logical decoding's DecodeUpdate
    filters them out so subscribers see no phantom UPDATE events.

  - lazy_compact_heap: new internal vacuum phase, walks pages
    high-to-low, snapshots live tuples, calls heap_relocate with
    FSM-chosen low-numbered targets, and inserts matching index
    entries via index_insert(UNIQUE_CHECK_NO).

  - COMPACT command: runs three vacuum() invocations per relation
    (compact, then prune+truncate in a fresh xact, then optional
    analyze).  No AccessExclusiveLock except briefly during truncate.

How it engages with the concerns previously raised on this thread:

  - Index bloat: confirmed and unavoidable -- every
    relocation creates a new index entry, the old one stays until
    the next ordinary vacuum reaps it.  For an index-heavy table,
    REINDEX CONCURRENTLY after COMPACT is the recommended remedy.
    Documented in compact.sgml.

  - Tuning difficulty: no parameter to tune.  The user
    runs COMPACT and the server figures out what to move where.  At
    the cost of being less flexible than local_update_limit (you
    can't, e.g., compact only the top 50 GB of a 100 GB table, you
    compact what the FSM and the high-water heuristic decide to
    compact).

  - Page selection: the compaction loop only places
    tuples on pages strictly lower than the one it's draining, so
    progress is monotonic and tail pages will be empty if compaction
    succeeds (the truncation pass then reclaims them).

  - Low-fillfactor interaction: not directly relevant
    here. COMPACT doesn't change UPDATE behaviour generally, only
    runs an explicit relocation loop.  fillfactor is honoured when
    picking targets.

 This POC is intended as an end-to-end demonstration that the use case can be served by a server-side automated command rather than a userland UPDATE loop.  All five patches build cleanly; make check-world
    passes; isolation spec covers concurrent FOR UPDATE / UPDATE /  FOR KEY SHARE / multixact / REPEATABLE READ readers.

  - Isn't: production-ready.  Notable rough edges:

      * The three-pass vacuum() structure inside ExecCompact is
        correct but ugly.  Folding the prune-and-truncate pass back
        into a single vacuum() invocation that retries with a fresh
        snapshot would be better.  This requires letting vacuum_rel
        commit and start a new transaction mid-flight; doable but
        invasive.

      * Index updates inside lazy_compact_heap call index_insert
        directly with UNIQUE_CHECK_NO.  An earlier attempt to use
        ExecInsertIndexTuples crashed because the executor
        scaffolding (es_snapshot, ECxt, range table) isn't fully
        constructible from outside the executor.  Worth a closer
        look from someone who knows that area.

      * contrib/pg_compact_test (in 0005) exposes the lower
        primitives to SQL for development testing.  Not really
        appropriate for in-tree.  Useful for review.

The two approaches are not exclusive.  local_update_limit serves users
who want a knob and don't want a new command; COMPACT serves users
who want one-shot behaviour.  Posting this so the current thread can
consider whether server-side automation is on the table at all,
before settling on the reloption design.

Patch series structure:

  0001 -- New hio primitive, RelationGetSpecificBufferForTuple
  0002 -- New heap primitive, heap_relocate (+ XLH_UPDATE_RELOCATED)
  0003 -- New internal vacuum phase, lazy_compact_heap
  0004 -- New COMPACT command (grammar, executor, docs)
  0005 -- Tests (regression, isolation, contrib test)

Benchmark on a 24 MB / 100K-row workload (90% deleted at the head,
then VACUUMed):

  Strategy        Mean WAL    Mean time
  COMPACT         2.97 MB     51 ms
  VACUUM FULL     2.39 MB     50 ms
  REPACK          2.43 MB     43 ms

COMPACT writes ~24% more WAL than the rewrite-based strategies (one
cross-page heap update + index inserts per relocated tuple, vs. a
single bulk relation rewrite).  Final size is identical for all
three.  Peak extra disk: ~1 page for COMPACT, ~live-data-size for
the others.

James
Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Adding REPACK [concurrently]
Next
From: Ayush Tiwari
Date:
Subject: Disallow whole-row index references with virtual generated columns?