Adding REPACK [concurrently] - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Adding REPACK [concurrently] |
Date | |
Msg-id | 202507262156.sb455angijk6@alvherre.pgsql Whole thread Raw |
Responses |
Re: Adding REPACK [concurrently]
Re: Adding REPACK [concurrently] |
List | pgsql-hackers |
Hello, Here's a patch to add REPACK and eventually the CONCURRENTLY flag to it. This is coming from [1]. The ultimate goal is to have an in-core tool to allow concurrent table rewrite to get rid of bloat; right now, VACUUM FULL does that, but it's not concurrent. Users have resorted to using the pg_repack third-party tool, which is ancient and uses a weird internal implementation, as well as pg_squeeze, which uses logical decoding to capture changes that occur during the table rewrite. The patch submitted here, largely by Antonin Houska with some changes by me, is based on the the pg_squeeze code which he authored, and first introduces a new command called REPACK to absorb both VACUUM FULL and CLUSTER, followed by addition of a CONCURRENTLY flag to allow some forms of REPACK to operate online using logical decoding. Essentially, this first patch just reshuffles the CLUSTER code to create the REPACK command. I made a few changes from Antonin's original at [2]. First, I modified the grammar to support "REPACK [tab] USING INDEX" without specifying the index name. With this change, all possibilities of the old commands are covered, which gives us the chance to flag them as obsolete. (This is good, because having VACUUM FULL do something completely different from regular VACUUM confuses users all the time; and on the other hand, having a command called CLUSTER which is at odds with what most people think of as a "database cluster" is also confusing.) Here's a list of existing commands, and how to write them in the current patch's proposal for REPACK: -- re-clusters all tables that have a clustered index set CLUSTER -> REPACK USING INDEX -- clusters the given table using the given index CLUSTER tab USING idx -> REPACK tab USING INDEX idx -- clusters this table using a clustered index; error if no index clustered CLUSTER tab -> REPACK tab USING INDEX -- vacuum-full all tables VACUUM FULL -> REPACK -- vacuum-full the specified table VACUUM FULL tab -> REPACK tab My other change to Antonin's patch is that I made REPACK USING INDEX set the 'indisclustered' flag to the index being used, so REPACK behaves identically to CLUSTER. We can discuss whether we really want this. For instance we could add an option so that by default REPACK omits persisting the clustered index, and instead it only does that when you give it some special option, say something like "REPACK (persist_clustered_index=true) tab USING INDEX idx" Overall I'm not sure this is terribly interesting, since clustered indexes are not very useful for most users anyway. I made a few other minor changes not worthy of individual mention, and there are a few others pending, such as updates to the pg_stat_progress_repack view infrastructure, as well as phasing out pg_stat_progress_cluster (maybe the latter would offer a subset of the former; not yet sure about this.) Also, I'd like to work on adding a `repackdb` command for completeness. On repackdb: I think is going to be very similar to vacuumdb, mostly in that it is going to need to be able to run tasks in parallel; but there are things it doesn't have to deal with, such as analyze-in-stages, which I think is a large burden. I estimate about 1k LOC there, extremely similar to vacuumdb. Maybe it makes sense to share the source code and make the new executable a symlink instead, with some additional code to support the two different modes. Again, I'm not sure about this -- I like the idea, but I'd have to see the implementation. I'll be rebasing the rest of Antonin's patch series afterwards, including the logical decoding changes necessary for CONCURRENTLY. In the meantime, if people want to review those, which would be very valuable, they can go back to branch master from around the time he submitted it and apply the old patches there. [1] https://postgr.es/m/76278.1724760050@antos [2] https://postgr.es/m/152010.1751307725@localhost -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Attachment
pgsql-hackers by date: