Re: pg_dump versus hash partitioning - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Re: pg_dump versus hash partitioning
Date
Msg-id 20230316090352.v2m2d473fcv5tli6@jrouhaud
Whole thread Raw
In response to Re: pg_dump versus hash partitioning  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pg_dump versus hash partitioning  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Mon, Mar 13, 2023 at 07:39:12PM -0400, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
> >> The trick is to detect in pg_restore whether pg_dump chose to do
> >> load-via-partition-root.
>
> > Given that this approach wouldn't help with existing dump files (at least if
> > using COPY, in any case the one using INSERT are doomed), so I'm slightly in
> > favor of the first approach, and later add an easy and non magic incantation
> > way to produce dumps that don't depend on partitioning.
>
> Yeah, we need to do both.  Attached find an updated patch series:

I didn't find a CF entry, is it intended?

> 0001: TAP test that exhibits both this deadlock problem and the
> different-hash-codes problem.  I'm not sure if we want to commit
> this, or if it should be in exactly this form --- the second set
> of tests with a manual --load-via-partition-root switch will be
> pretty redundant after this patch series.

I think there should be at least the first set of tests committed.  I would
also be happy to see some test with the --insert case, even if those are
technically redundant too, as it's quite cheap to do once you setup the
cluster.

> 0002: Make pg_restore detect load-via-partition-root by examining the
> COPY commands embedded in the dump, and skip the TRUNCATE if so,
> thereby fixing the deadlock issue.  This is the best we can do for
> legacy dump files, I think, but it should be good enough.

is_load_via_partition_root():

+ * In newer archive files this can be detected by checking for a special
+ * comment placed in te->defn.  In older files we have to fall back to seeing
+ * if the COPY statement targets the named table or some other one.  This
+ * will not work for data dumped as INSERT commands, so we could give a false
+ * negative in that case; fortunately, that's a rarely-used option.

I'm not sure if you intend to keep the current 0002 - 0003 separation, but if
yes the part about te->defn and possible fallback should be moved to 0003.  In
0002 we're only looking at the COPY statement.

-                    * the run then we wrap the COPY in a transaction and
-                    * precede it with a TRUNCATE.  If archiving is not on
-                    * this prevents WAL-logging the COPY.  This obtains a
-                    * speedup similar to that from using single_txn mode in
-                    * non-parallel restores.
+                    * the run and we are not restoring a
+                    * load-via-partition-root data item then we wrap the COPY
+                    * in a transaction and precede it with a TRUNCATE.  If
+                    * archiving is not on this prevents WAL-logging the COPY.
+                    * This obtains a speedup similar to that from using
+                    * single_txn mode in non-parallel restores.

I know you're only inheriting this comment, but isn't it well outdated and not
accurate anymore?  I'm assuming that "archiving is not on" was an acceptable
way to mean "wal_level < archive" at some point, but it's now completely
misleading.

Minor nitpicking:
- should the function name prefixed with a "_" like almost all nearby code?
- should there be an assert that the given toc entry is indeed a TABLE DATA?

FWIW it unsurprisingly fixes the problem on my original use case.

> 0003: Also detect load-via-partition-root by adding a label in the
> dump.  This is a more bulletproof solution going forward.
>
> 0004-0006: same as previous patches, but rebased over these.
> This gets us to a place where the new TAP test passes.

+getPartitioningInfo(Archive *fout)
+{
+   PQExpBuffer query;
+   PGresult   *res;
+   int         ntups;
+
+   /* no partitions before v10 */
+   if (fout->remoteVersion < 100000)
+       return;
+ [...]
+   /*
+    * Unsafe partitioning schemes are exactly those for which hash enum_ops
+    * appears among the partition opclasses.  We needn't check partstrat.
+    *
+    * Note that this query may well retrieve info about tables we aren't
+    * going to dump and hence have no lock on.  That's okay since we need not
+    * invoke any unsafe server-side functions.
+    */
+   appendPQExpBufferStr(query,
+                        "SELECT partrelid FROM pg_partitioned_table WHERE\n"
+                        "(SELECT c.oid FROM pg_opclass c JOIN pg_am a "
+                        "ON c.opcmethod = a.oid\n"
+                        "WHERE opcname = 'enum_ops' "
+                        "AND opcnamespace = 'pg_catalog'::regnamespace "
+                        "AND amname = 'hash') = ANY(partclass)");

Hash partitioning was added with pg11, should we bypass pg10 too with a comment
saying that we only care about hash, at least for the forseeable future?

Other than that, the patchset looks quite good to me, modulo the upcoming doc
changes.

More generally, I also think that forcing --load-via-partition-root for
known unsafe partitioning seems like the best compromise.  I'm not sure if we
should have an option to turn it off though.



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Next
From: Juan José Santamaría Flecha
Date:
Subject: Re: Fix fseek() detection of unseekable files on WIN32