Re: why there is not VACUUM FULL CONCURRENTLY? - Mailing list pgsql-hackers
From | Euler Taveira |
---|---|
Subject | Re: why there is not VACUUM FULL CONCURRENTLY? |
Date | |
Msg-id | ada2ca01-3629-4145-b365-664cc0a2e066@app.fastmail.com Whole thread Raw |
In response to | Re: why there is not VACUUM FULL CONCURRENTLY? (Antonin Houska <ah@cybertec.at>) |
List | pgsql-hackers |
On Tue, Apr 1, 2025, at 10:31 AM, Antonin Houska wrote:
One more version, hopefully to make cfbot happy (I missed the bug because Idid not set the RELCACHE_FORCE_RELEASE macro in my environment.)
I started reviewing this patch. It was in my radar to review it but I didn't
have spare time until now.
The syntax is fine for me. It is really close to CLUSTER syntax but it adds one
detail: INDEX keyword. It is a good approach because USING clause can be
expanded in the future if required.
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>cluster a table according to an index</refpurpose>
+ </refnamediv>
This description is not accurate because the index is optional. It means if the
index is not specified, it is a "rewrite" instead of a "cluster". One
suggestion is to use "rewrite" because it says nothing about the tuple order. A
"rewrite a table" or "rewrite a table to reclaim space" are good candidates. On
the other hand, the command is called "repack" and it should be a strong
candidate for the verb in this description. (I'm surprised that repack is not a
recent term [1]). It seems a natural choice.
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX<replaceable class="parameter">index_name</replaceable> ] ]
You missed a space after INDEX.
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
If we decide for another term (other than "repacked") then it should reflect
here and in some other parts ("repacking" is also used) too.
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order.
That's true for CLUSTER but not for REPACK. Index is optional.
+ Prints a progress report as each table is clustered
+ at <literal>INFO</literal> level.
s/clustered/repacked/?
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is clustered using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.
Ditto.
+ <para>
+ Cluster the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal>:
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
It sounds strange to use "Repack" in the other examples but this one it says
"Cluster". Let's use the same terminology.
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
The warnings, notes, and tips are usually placed *after* the description.
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Is it worth to rename table_relation_copy_for_cluster() and
heapam_relation_copy_for_cluster() to replace cluster with repack?
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,
Do you really need command? IIUC REPACK is the only command that will used by
this view. There is no need to differentiate commands here.
+ *
+ * 'cmd' indicates which commands is being executed. REPACK should be the only
+ * caller of this function in the future.
command.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
pg_index is a local catalog. To be consistent while clustering a shared
catalog, it should set indisclustered in all existent databases because in each
pg_index table there is a tuple for the referred index. As the comment says it
is not possible.
euler=# select relname, relkind, pg_relation_filepath(oid) from pg_class where relname = 'pg_index';
relname | relkind | pg_relation_filepath
----------+---------+----------------------
pg_index | r | base/16424/2610
(1 row)
euler=# select indexrelid::regclass, indexrelid::regclass, indisclustered from pg_index where indrelid = 'pg_database'::regclass;
indexrelid | indexrelid | indisclustered
---------------------------+---------------------------+----------------
pg_database_datname_index | pg_database_datname_index | f
pg_database_oid_index | pg_database_oid_index | f
(2 rows)
euler=# \c postgres
You are now connected to database "postgres" as user "euler".
postgres=# select relname, relkind, pg_relation_filepath(oid) from pg_class where relname = 'pg_index';
relname | relkind | pg_relation_filepath
----------+---------+----------------------
pg_index | r | base/5/2610
(1 row)
postgres=# select indexrelid::regclass, indexrelid::regclass, indisclustered from pg_index where indrelid = 'pg_database'::regclass;
indexrelid | indexrelid | indisclustered
---------------------------+---------------------------+----------------
pg_database_datname_index | pg_database_datname_index | f
pg_database_oid_index | pg_database_oid_index | f
(2 rows)
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster a shared catalog")));
+ errmsg("cannot %s a shared catalog", cmd_str)));
I'm confused about this change. Why is it required?
If it prints this message only for CLUSTER command, you don't need to have a
generic message. This kind of message is not good for translation. If you need
multiple verbs here, I advise you to break it into multiple messages.
- {
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ cmd_str)));
Ditto.
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));
If the idea is to remove CLUSTER and VACUUM from this routine in the future, I
wouldn't include formatting.h just for asc_toupper(). Instead, I would use an
if condition. I think it will be easy to remove this code path when the time
comes.
- errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on index \"%s\" because access method does not support clustering",
+ cmd_str, RelationGetRelationName(OldIndex))));
Ditto. I don't think check_index_is_clusterable() should be changed. The action
is "cluster" independently of the command. You can keep "cluster" until we
completely remove CLUSTER command and then we can replace this term with
"repack". It also applies to cluster_is_permitted_for_relation().
- errmsg("cannot cluster on partial index \"%s\"",
+ errmsg("cannot %s on partial index \"%s\"",
+ cmd_str,
RelationGetRelationName(OldIndex))));
Ditto.
- errmsg("cannot cluster on invalid index \"%s\"",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on invalid index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));
Ditto.
- (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
+ (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap),
RelationGetRelationName(OldIndex))));
This is bad for translation. Use complete sentences.
- (errmsg("clustering \"%s.%s\" using sequential scan and sort",
+ (errmsg("%sing \"%s.%s\" using sequential scan and sort",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
Ditto.
- (errmsg("vacuuming \"%s.%s\"",
+ (errmsg("%sing \"%s.%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
Ditto.
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but do not care about indexes.
+ */
Since the goal is to remove CLUSTER in the future, provide a comment that
doesn't mention routines that will certainly be removed. Hence, there is no
need to fix them in the future.
+ /*
+ * Get all indexes that have indisclustered set and that the current user
+ * has the appropriate privileges for.
+ */
This comment is not true.
ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
+ (errmsg("permission denied to %s \"%s\", skipping it",
+ CLUSTER_COMMAND_STR(cmd),
get_rel_name(relid))));
Fix for translation.
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, ¶ms,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
This code path is confusing. It took me some time (after reading
process_single_relation() that could have a better name) to understand it. I
don't have a good suggestion but it should have at least one comment explaining
what the purpose is.
+/*
+ * REPACK a single relation.
+ *
+ * Return NULL if done, relation reference if the caller needs to process it
+ * (because the relation is partitioned).
+ */
This comment should be expanded. As I said in the previous hunk, there isn't
sufficient information to understand how process_single_relation() works.
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
I'm wondering if there is an easy way to avoid these rules.
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_ANALYZE,
PROGRESS_COMMAND_CLUSTER,
+ PROGRESS_COMMAND_REPACK,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
It is just a matter of style but I have the habit to include new stuff at the
end.
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
You forgot to remove the USING INDEX here.
I'm still review the other patches (that is basically the implementation of
CONCURRENTLY) and to avoid a long review, I'm sending the 0001 review. Anyway,
0001 is independent of the other patches and should be applied separately.
pgsql-hackers by date: