Home > mailing lists

Adding REPACK [concurrently] - Mailing list pgsql-hackers

From	Alvaro Herrera
Subject	Adding REPACK [concurrently]
Date	July 27, 2025 00:56:04
Msg-id	202507262156.sb455angijk6@alvherre.pgsql Whole thread Raw
Responses	Re: Adding REPACK [concurrently] Re: Adding REPACK [concurrently] Re: Adding REPACK [concurrently] Re: Adding REPACK [concurrently] Re: Adding REPACK [concurrently] Re: Adding REPACK [concurrently] Re: Adding REPACK [concurrently]
List	pgsql-hackers

Tree view

Hello,

Here's a patch to add REPACK and eventually the CONCURRENTLY flag to it.
This is coming from [1].  The ultimate goal is to have an in-core tool
to allow concurrent table rewrite to get rid of bloat; right now, VACUUM
FULL does that, but it's not concurrent.  Users have resorted to using
the pg_repack third-party tool, which is ancient and uses a weird
internal implementation, as well as pg_squeeze, which uses logical
decoding to capture changes that occur during the table rewrite.  The
patch submitted here, largely by Antonin Houska with some changes by me,
is based on the the pg_squeeze code which he authored, and first
introduces a new command called REPACK to absorb both VACUUM FULL and
CLUSTER, followed by addition of a CONCURRENTLY flag to allow some forms
of REPACK to operate online using logical decoding.

Essentially, this first patch just reshuffles the CLUSTER code to create
the REPACK command.

I made a few changes from Antonin's original at [2].  First, I modified
the grammar to support "REPACK [tab] USING INDEX" without specifying the
index name.  With this change, all possibilities of the old commands are
covered, which gives us the chance to flag them as obsolete.  (This is
good, because having VACUUM FULL do something completely different from
regular VACUUM confuses users all the time; and on the other hand,
having a command called CLUSTER which is at odds with what most people
think of as a "database cluster" is also confusing.)

Here's a list of existing commands, and how to write them in the current
patch's proposal for REPACK:

-- re-clusters all tables that have a clustered index set
CLUSTER                     -> REPACK USING INDEX

-- clusters the given table using the given index
CLUSTER tab USING idx       -> REPACK tab USING INDEX idx

-- clusters this table using a clustered index; error if no index clustered
CLUSTER tab                 -> REPACK tab USING INDEX

-- vacuum-full all tables
VACUUM FULL                 -> REPACK

-- vacuum-full the specified table
VACUUM FULL tab             -> REPACK tab


My other change to Antonin's patch is that I made REPACK USING INDEX set
the 'indisclustered' flag to the index being used, so REPACK behaves
identically to CLUSTER.  We can discuss whether we really want this.
For instance we could add an option so that by default REPACK omits
persisting the clustered index, and instead it only does that when you
give it some special option, say something like
  "REPACK (persist_clustered_index=true) tab USING INDEX idx"
Overall I'm not sure this is terribly interesting, since clustered
indexes are not very useful for most users anyway.

I made a few other minor changes not worthy of individual mention, and
there are a few others pending, such as updates to the
pg_stat_progress_repack view infrastructure, as well as phasing out
pg_stat_progress_cluster (maybe the latter would offer a subset of the
former; not yet sure about this.)  Also, I'd like to work on adding a
`repackdb` command for completeness.

On repackdb: I think is going to be very similar to vacuumdb, mostly in
that it is going to need to be able to run tasks in parallel; but there
are things it doesn't have to deal with, such as analyze-in-stages,
which I think is a large burden.  I estimate about 1k LOC there,
extremely similar to vacuumdb.  Maybe it makes sense to share the source
code and make the new executable a symlink instead, with some additional
code to support the two different modes.  Again, I'm not sure about
this -- I like the idea, but I'd have to see the implementation.

I'll be rebasing the rest of Antonin's patch series afterwards,
including the logical decoding changes necessary for CONCURRENTLY.  In
the meantime, if people want to review those, which would be very
valuable, they can go back to branch master from around the time he
submitted it and apply the old patches there.

[1] https://postgr.es/m/76278.1724760050@antos
[2] https://postgr.es/m/152010.1751307725@localhost

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Attachment

v1-0001-Add-REPACK-command.patch

pgsql-hackers by date:

From: Corey Huinker
Date: 26 July 2025, 22:38:09
Subject: Re: vacuumdb changes for stats import/export

From: Alexander Korotkov
Date: 27 July 2025, 01:51:13
Subject: Re: MergeAppend could consider sorting cheapest child path

Adding REPACK [concurrently] - Mailing list pgsql-hackers

Attachment

Previous

Next