Thread: pg_dump versus hash partitioning

pg_dump versus hash partitioning

From
Tom Lane
Date:
Over at [1] we have a complaint that dump-and-restore fails for
hash-partitioned tables if a partitioning column is an enum,
because the enum values are unlikely to receive the same OIDs
in the destination database as they had in the source, and the
hash codes are dependent on those OIDs.  So restore tries to
load rows into the wrong leaf tables, and it's all a mess.
The patch approach proposed at [1] doesn't really work, but
what does work is to use pg_dump's --load-via-partition-root
option, so that the tuple routing decisions are all re-made.
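
Roughly the sort of setup I mean, with made-up names (a minimal
sketch, not the actual case from [1]):

create type mood as enum ('sad', 'ok', 'happy');
create table t (id int, m mood) partition by hash (m);
create table t_p0 partition of t for values with (modulus 2, remainder 0);
create table t_p1 partition of t for values with (modulus 2, remainder 1);
-- hashenum() hashes the enum value's OID, so once the destination
-- database has recreated the type with different OIDs, 'ok' can route
-- to a different partition than it did on the source.  A plain dump's
-- per-partition COPY then trips over the partition constraint, while
-- --load-via-partition-root COPYs through t and re-routes each row.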

I'd initially proposed that we force --load-via-partition-root
if we notice that we have hash partitioning on an enum column.
But the more I thought about this, the more comparable problems
came to mind:

1. Hash partitioning on text columns will likely fail if the
destination uses a different encoding.

2. Hash partitioning on float columns will fail if you use
--extra-float-digits to round off the values.  And then
there's the fact that the behavior of strtod() might vary
across platforms.

3. Hash partitioning on floats is also endian-dependent,
and the same is likely true for some other types.

4. Anybody want to bet that complex types such as jsonb
are entirely free of similar hazards?  (Yes, somebody
thought it'd be a good idea to provide jsonb_hash.)

In general, we've never thought that hash values are
required to be consistent across platforms.

That was leading me to think that we should force
--load-via-partition-root for any hash-partitioned table,
just to pre-emptively avoid these problems.  But then
I remembered that

5. Range partitioning on text columns will likely fail if the
destination uses a different collation.
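
As a made-up illustration of that last point (assuming the source uses
the C collation and the destination a linguistic one, or vice versa):

create table words (w text) partition by range (w);
create table words_lo partition of words for values from (minvalue) to ('n');
create table words_hi partition of words for values from ('n') to (maxvalue);
-- Under a linguistic collation 'Zebra' compares above 'n' and lands in
-- words_hi; under C it compares below 'n' and belongs in words_lo.  A
-- dump taken on one system therefore COPYs the row into a partition
-- whose constraint rejects it on the other.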

This is looking like a very large-caliber foot-gun, isn't it?
And remember that --load-via-partition-root acts at pg_dump
time, not restore.  If all you have is a dump file with no
opportunity to go back and get a new one, and it won't load
into your new server, you have got a nasty problem.

I don't think this is an acceptable degree of risk, considering
that the primary use-cases for pg_dump involve target systems
that aren't 100.00% identical to the source.

So here's what I think we should actually do: make
--load-via-partition-root the default.  We can provide a
switch to turn it off, for those who are brave or foolish
enough to risk that in the name of saving a few cycles,
but it ought to be the default.

Furthermore, I think we should make this happen in time for
next week's releases.  I can write the patch easily enough,
but we need a consensus PDQ that this is what to do.

Anyone want to bikeshed on the spelling of the new switch?
I'm contemplating "--load-via-partition-leaf" or perhaps
"--no-load-via-partition-root".

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/765e5968-6c39-470f-95bf-7b14e6b9a1c0%40app.fastmail.com



Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Wed, Feb 1, 2023 at 11:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Over at [1] we have a complaint that dump-and-restore fails for
> hash-partitioned tables if a partitioning column is an enum,
> because the enum values are unlikely to receive the same OIDs
> in the destination database as they had in the source, and the
> hash codes are dependent on those OIDs.

It seems to me that this is the root of the problem. We can't expect
to hash on something that's not present in the dump file and have
anything work.

> So here's what I think we should actually do: make
> --load-via-partition-root the default.  We can provide a
> switch to turn it off, for those who are brave or foolish
> enough to risk that in the name of saving a few cycles,
> but it ought to be the default.
>
> Furthermore, I think we should make this happen in time for
> next week's releases.  I can write the patch easily enough,
> but we need a consensus PDQ that this is what to do.

This seems extremely precipitous to me and I'm against it. I like the
fact that we have --load-via-partition-root, but it is a bit of a
hack. You don't get a single copy into the partition root, you get one
per child table -- and those COPY statements are listed as data for
the partitions where the data lives now, not for the parent table. I
am not completely sure whether there is a scenario where that's an
issue, but it's certainly an oddity. Also, and I think pretty
significantly, using --load-via-partition-root forces you to pay the
overhead of rerouting every tuple to the target partition whether you
need it or not, which is potentially a large unnecessary expense. I
don't think we should just foist that kind of overhead onto everyone
in every situation for every data type because somebody had a problem
in a certain case.

And even if we do decide to do that at some point, I don't think it is
right at all to rush such a change out on a short time scale, with
little time to mull over consequences and alternative fixes. I think
that could easily hurt more people than it helps.

I think that not all of the cases that you list are of the same type.
Loading a dump under a different encoding or on a different endianness
are surely corner cases. They might come up for some people
occasionally, but they're not typical. In the case of endianness,
that's because little-endian has pretty much taken over the world; in
the case of encoding, that's because converting data between encodings
is a real pain, and combining that with a database dump and restore is
likely to be very little fun. It's hard to argue that collation
changes fall into the same category: we know that they get changed all
the time, often silently. But none of us database geeks think that's a
good thing: just that it's a thing that we have to deal with.

The enum case seems completely different to me. That's not the result
of trying to migrate your data to another architecture or of the glibc
maintainers not believing in sorting working the same way on Tuesday
that it did on Monday. That's the result of the PostgreSQL project
hashing data in a way that does not make any real sense for the
application at hand. Any hash function that we use for partitioning
has to work based on data that is preserved by the dump-and-restore
process. I would argue that the float case is not of the same kind:
yes, if you round your data off, then the values are going to hash
differently, but if you truncate your strings, those will hash
differently too. Duh. Intentionally changing the value is supposed to
change the hash code, that's kind of the point of hashing.

So I think we should be asking ourselves what we could do about the
enum case specifically, rather than resorting to a bazooka.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 1, 2023 at 11:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Over at [1] we have a complaint that dump-and-restore fails for
>> hash-partitioned tables if a partitioning column is an enum,
>> because the enum values are unlikely to receive the same OIDs
>> in the destination database as they had in the source, and the
>> hash codes are dependent on those OIDs.

> It seems to me that this is the root of the problem.

Well, that was what I thought too to start with, but I now think that
it is far too narrow-minded a view of the problem.  The real issue
is something I said that you trimmed:

>> In general, we've never thought that hash values are
>> required to be consistent across platforms.

hashenum() is doing something that is perfectly fine in the context
of hash indexes, which have traditionally been our standard of how
reproducible hash values need to be.  Hash partitioning has moved
those goalposts, and if there was any discussion of the consequences
of that, I didn't see it.

>> Furthermore, I think we should make this happen in time for
>> next week's releases.  I can write the patch easily enough,
>> but we need a consensus PDQ that this is what to do.

> This seems extremely precipitous to me and I'm against it.

Yeah, it's been this way for a while, so if there's debate then
we can let it go a while longer.

> So I think we should be asking ourselves what we could do about the
> enum case specifically, rather than resorting to a bazooka.

My idea of solving this with a bazooka would be to deprecate hash
partitioning.  Quite frankly that's where I think we will end up
someday, but I'm not going to try to make that argument today.

In the meantime, I think we need to recognize that hash values are
not very portable.  I do not think we do our users a service by
letting them discover the corner cases the hard way.

I'd be willing to compromise on the intermediate idea of forcing
--load-via-partition-root just for hashed partitioning.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... I like the
> fact that we have --load-via-partition-root, but it is a bit of a
> hack. You don't get a single copy into the partition root, you get one
> per child table -- and those COPY statements are listed as data for
> the partitions where the data lives now, not for the parent table. I
> am not completely sure whether there is a scenario where that's an
> issue, but it's certainly an oddity.

I spent a bit more time thinking about that, and while I agree that
it's an oddity, I don't see that it matters in the case of hash
partitioning.  You would notice an issue if you tried to do a selective
restore of just one partition --- but under what circumstance would
that be a useful thing to do?  By definition, under hash partitioning
there is no user-meaningful difference between different partitions.
Moreover, in the case at hand you would get constraint failures without
--load-via-partition-root, or tuple routing failures with it,
so what's the difference?  (Unless you'd created all the partitions
to start with and were doing a selective restore of just one partition's
data, in which case the outcome is "fails" or "works" respectively.)

> Also, and I think pretty
> significantly, using --load-via-partition-root forces you to pay the
> overhead of rerouting every tuple to the target partition whether you
> need it or not, which is potentially a large unnecessary expense.

Oddly, I always thought that we prioritize correctness over speed.
I don't mind having an option that allows people to select a less-safe
way of doing this, but I do think it's unwise for less-safe to be the
default, especially when it's something you can't fix after the fact.

What do you think of "--load-via-partition-root=on/off/auto", where
auto means "not with hash partitions" or the like?  I'm still
uncomfortable about the collation aspect, but I'm willing to concede
that range partitioning is less likely to fail in this way than hash.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Wed, Feb 1, 2023 at 1:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, that was what I thought too to start with, but I now think that
> it is far too narrow-minded a view of the problem.  The real issue
> is something I said that you trimmed:
>
> >> In general, we've never thought that hash values are
> >> required to be consistent across platforms.
>
> hashenum() is doing something that is perfectly fine in the context
> of hash indexes, which have traditionally been our standard of how
> reproducible hash values need to be.  Hash partitioning has moved
> those goalposts, and if there was any discussion of the consequences
> of that, I didn't see it.

Right, I don't think it was ever discussed, and I think that was a
mistake on my part.

> In the meantime, I think we need to recognize that hash values are
> not very portable.  I do not think we do our users a service by
> letting them discover the corner cases the hard way.

I think you're not really engaging with the argument that "not
completely portable" and "totally broken" are two different things,
and I still think that's an important point here. One idea that I had
is to add a flag somewhere to indicate whether a particular opclass or
opfamily is suitable for hash partitioning, or perhaps better, an
alternative to opcdefault that sets the default for partitioning,
which could be different from the default for indexing. Then we could
either prohibit this case, or make it work. Of course we would have to
define what "suitable for hash partitioning" means, but "would be
likely to survive a dump and reload on the same machine without any
software changes" is probably a reasonable minimum standard.

I don't think the fact that our *traditional* standard for how stable
a hash function needs to be has been XYZ carries any water. Needs
change over time, and we adapt the code to meet the new needs. Since
we have no system for type properties in PostgreSQL -- a design
decision I find questionable -- we tie all such properties to operator
classes. That's why, for example, we have HASHEXTENDED_PROC, even
though hash indexes don't need 64-bit hash values or a seed. We added
that for hash partitioning, and it's now used in other places too,
because 32 bits aren't enough for everything just because they're
enough for hash indexes, and seeds are handy. That's also why we have
BTINRANGE_PROC, which doesn't exist to support btree indexes, but
rather window frames. The fact that a certain feature requires us to
graft some additional stuff into the operator class/family mechanism,
or that it doesn't quite work with everything that's already part of
that mechanism, isn't an argument against the feature. That's just how
we do things around here. Indeed, if somebody, instead of implementing
hash partitioning by tying it into hash opfamilies, were to make up
some completely new hashing infrastructure that had exactly the
properties they wanted for partitioning, that would be *totally*
unacceptable and surely a reason for rejecting such a feature
outright. The fact that it tries to make use of the existing
infrastructure is a good thing about that feature, not a bad thing,
even though it is turning out that there are some problems.

On the question of whether hash partitioning is a good feature in
general, I can only say that I disagree with what seems to be your
position, which as best as I can tell is "it sucks and we should kill
it with fire". I do think that partitioning in general leaves a lot to
be desired in PostgreSQL in general, and perhaps the issues are even
worse for hash partitioning than they are elsewhere. However, I think
that the solution to that is for people to keep putting more work into
making it better, not to give up and say "ah, partitioning (or hash
partitioning specifically) is a stupid feature that nobody wants". To
think that, you have to be living in a bubble. It's unfortunate that
with all the work that so many people have put into this area we don't
have better results to show for it, but AFAICS there's no help for
that but to keep hacking. Amit Langote's work on locking before
pruning, for example, will be hugely impactful for some kinds of
queries if it gets committed, and it's been a long time coming, partly
because so many other problems needed to be sorted out first. But you
can't run the simplest workload with any kind of partitioning, range,
hash, whatever, and not run into that problem immediately.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 1, 2023 at 1:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In the meantime, I think we need to recognize that hash values are
>> not very portable.  I do not think we do our users a service by
>> letting them discover the corner cases the hard way.

> I think you're not really engaging with the argument that "not
> completely portable" and "totally broken" are two different things,
> and I still think that's an important point here.

I don't buy that argument.  From the user's perspective, it's broken
if her use-case fails.  "It works for most people" is cold comfort,
most especially so if there's no convenient path to fixing it after
a failure.

> I don't think the fact that our *traditional* standard for how stable
> a hash function needs to be has been XYZ carries any water.

Well, it wouldn't need to if we had a practical way of changing the
behavior of an existing hash function, but guess what: we don't.
Andrew's original proposal for fixing this was exactly to change the
behavior of hashenum().  There were some problems with the idea of
depending on enumsortorder instead of enum OID, but the really
fundamental issue is that you can't change hashing behavior without
breaking pg_upgrade completely.  Not only will your hash indexes be
corrupt, but your hash-partitioned tables will be broken, in exactly
the same way that we're trying to solve for dump/reload cases (which
of course will *also* be broken by redefining the hash function, if
you didn't use --load-via-partition-root).  Moreover, while we can
always advise people to reindex, there's no similarly easy way to fix
broken partitioning.

That being the case, I don't think moving the goalposts for hash
function stability is going to lead to a workable solution.

> On the question of whether hash partitioning is a good feature in
> general, I can only say that I disagree with what seems to be your
> position, which as best as I can tell is "it sucks and we should kill
> it with fire".

As I said, I'm not prepared to litigate that case today ... but
I do have a sneaking suspicion that we will eventually reach that
conclusion.  In any case, if we don't want to reach that conclusion,
we need some practical solution to these dump/reload problems.
Have you got a better idea than --load-via-partition-root?

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Peter Geoghegan
Date:
On Wed, Feb 1, 2023 at 12:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Also, and I think pretty
> > significantly, using --load-via-partition-root forces you to pay the
> > overhead of rerouting every tuple to the target partition whether you
> > need it or not, which is potentially a large unnecessary expense.
>
> Oddly, I always thought that we prioritize correctness over speed.
> I don't mind having an option that allows people to select a less-safe
> way of doing this, but I do think it's unwise for less-safe to be the
> default, especially when it's something you can't fix after the fact.

+1

As you pointed out already, pg_dump's primary use-cases all involve
target systems that aren't identical to the source system. Anybody
using pg_dump is unlikely to be particularly concerned about
performance.

> What do you think of "--load-via-partition-root=on/off/auto", where
> auto means "not with hash partitions" or the like?  I'm still
> uncomfortable about the collation aspect, but I'm willing to concede
> that range partitioning is less likely to fail in this way than hash.

Currently, pg_dump ignores collation versions entirely, except when
run by pg_upgrade. So pg_dump already understands that sometimes it's
important that the collation behavior be completely identical when the
database is restored, and sometimes it's desirable to produce a dump
with any available "logically equivalent" collation. This is about the
high level requirements, which makes sense to me.

ISTM that range partitioning is missing a good high level model that
builds on that. What's really needed is a fully worked out abstraction
that recognizes how collations can be equivalent for some purposes,
but not other purposes. The indirection between "logical and physical
collations" is underdeveloped. There isn't even an official name for
that idea.

-- 
Peter Geoghegan



Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Wed, Feb 1, 2023 at 3:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I spent a bit more time thinking about that, and while I agree that
> it's an oddity, I don't see that it matters in the case of hash
> partitioning.  You would notice an issue if you tried to do a selective
> restore of just one partition --- but under what circumstance would
> that be a useful thing to do?  By definition, under hash partitioning
> there is no user-meaningful difference between different partitions.
> Moreover, in the case at hand you would get constraint failures without
> --load-via-partition-root, or tuple routing failures with it,
> so what's the difference?  (Unless you'd created all the partitions
> to start with and were doing a selective restore of just one partition's
> data, in which case the outcome is "fails" or "works" respectively.)

I guess I was worried that pg_dump's dependency ordering stuff might
do something weird in some case that I'm not smart enough to think up.

> > Also, and I think pretty
> > significantly, using --load-via-partition-root forces you to pay the
> > overhead of rerouting every tuple to the target partition whether you
> > need it or not, which is potentially a large unnecessary expense.
>
> Oddly, I always thought that we prioritize correctness over speed.

That's a bit rich.

It seems to me that the job of pg_dump is to produce a dump that, when
reloaded on another system, recreates the same database state. That
means that we end up with all of the same objects, each defined in the
same way, and that all of the tables end up with all the same contents
that they had before. Here, you'd like to argue that it's perfectly
fine if we instead insert some of the rows into different tables than
where they were on the original system. Under normal circumstances, of
course, we wouldn't consider any such thing, because then we would not
be faithfully replicating the database state, which would be
incorrect. But here you want to argue that it's OK to create a
different database state because trying to recreate the same one would
produce an error and the user might not like getting an error so let's
just do something else instead and not even bother telling them.

As you have quite rightly pointed out, the --load-via-partition-root
behavior is useful for working around a variety of unfortunate things
that can happen. If a user is willing to say that getting a row into
one partition of some table is just as good as getting it into another
partition of that same table and that you don't mind paying the cost
associated with that, then that is something that we can do for that
user. But just as we normally prioritize correctness over speed, so
also do we normally throw errors when things aren't right instead of
silently accepting bad input. The partitions in this scenario are
tables that have constraints. If a dump contains a row that doesn't
satisfy some constraint on the table into which it is being loaded,
that's an error. Keep in mind that there's no rule that a user can't
query a partition directly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Peter Geoghegan
Date:
On Wed, Feb 1, 2023 at 1:14 PM Robert Haas <robertmhaas@gmail.com> wrote:
> It seems to me that the job of pg_dump is to produce a dump that, when
> reloaded on another system, recreates the same database state. That
> means that we end up with all of the same objects, each defined in the
> same way, and that all of the tables end up with all the same contents
> that they had before. Here, you'd like to argue that it's perfectly
> fine if we instead insert some of the rows into different tables than
> where they were on the original system. Under normal circumstances, of
> course, we wouldn't consider any such thing, because then we would not
> be faithfully replicating the database state, which would be
> incorrect. But here you want to argue that it's OK to create a
> different database state because trying to recreate the same one would
> produce an error and the user might not like getting an error so let's
> just do something else instead and not even bother telling them.

This is a misrepresentation of Tom's words. It isn't actually
self-evident what "we end up with all of the same objects, each
defined in the same way, and that all of the tables end up with all
the same contents that they had before" actually means here, in
general. Tom's main concern seems to be just that -- the ambiguity
itself.

If there was a fully worked out idea of what that would mean, then I
suspect it would be quite subtle and complicated -- it's an inherently
tricky area. You seem to be saying that the way that this stuff
currently works is correct by definition, except when it isn't.

-- 
Peter Geoghegan



Re: pg_dump versus hash partitioning

From
Peter Geoghegan
Date:
On Wed, Feb 1, 2023 at 12:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think the fact that our *traditional* standard for how stable
> a hash function needs to be has been XYZ carries any water. Needs
> change over time, and we adapt the code to meet the new needs. Since
> we have no system for type properties in PostgreSQL -- a design
> decision I find questionable -- we tie all such properties to operator
> classes.

Are you familiar with B-Tree opclass support function 4, equalimage?
It's used to determine whether a B-Tree index can use deduplication at
CREATE INDEX time. ISTM that the requirements are rather similar here
-- perhaps even identical.

See: https://www.postgresql.org/docs/devel/btree-support-funcs.html
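
A quick way to see which btree operator families already provide it
(just an exploratory catalog query, nothing more):

select opf.opfname, amp.amproclefttype::regtype, amp.amproc
from pg_amproc amp
join pg_opfamily opf on opf.oid = amp.amprocfamily
join pg_am am on am.oid = opf.opfmethod
where am.amname = 'btree'
  and amp.amprocnum = 4;  -- btree support function 4, equalimage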

-- 
Peter Geoghegan



Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Wed, Feb 1, 2023 at 4:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I don't think the fact that our *traditional* standard for how stable
> > a hash function needs to be has been XYZ carries any water.
>
> Well, it wouldn't need to if we had a practical way of changing the
> behavior of an existing hash function, but guess what: we don't.
> Andrew's original proposal for fixing this was exactly to change the
> behavior of hashenum().  There were some problems with the idea of
> depending on enumsortorder instead of enum OID, but the really
> fundamental issue is that you can't change hashing behavior without
> breaking pg_upgrade completely.  Not only will your hash indexes be
> corrupt, but your hash-partitioned tables will be broken, in exactly
> the same way that we're trying to solve for dump/reload cases (which
> of course will *also* be broken by redefining the hash function, if
> you didn't use --load-via-partition-root).  Moreover, while we can
> always advise people to reindex, there's no similarly easy way to fix
> broken partitioning.
>
> That being the case, I don't think moving the goalposts for hash
> function stability is going to lead to a workable solution.

I don't see that there is any easy, clean way to solve this in
released branches. The idea that I proposed could be implemented in
master, and I think it is the right kind of fix, but it is not
back-patchable. However, I think your argument rests on the premise
that making --load-via-partition-root the default behavior in some or
all cases will not break anything for anyone, and I'm skeptical. I
think that's a significant behavior change and that some people will
notice, and some will find it an improvement while others will find it
worse than the current behavior. I also think that there must be a lot
more people using partitioning in general, and even hash partitioning
specifically, than there are people using hash partitioning on an enum
column.

Personally, I would rather disallow this case in the back-branches --
i.e. make pg_dump barf if it is encountered and block CREATE TABLE
from setting up any new situations of this type -- than foist
--load-via-partition-root on many people who aren't affected by the
issue. I'm not saying that's a great answer, but we have to pick from
the choices we have.

I also don't accept that if someone has hit this issue they are just
hosed and there's no way out. Yeah, it's not a lot of fun: you
probably have to use "CREATE TABLE unpartitioned AS SELECT * FROM
borked; DROP TABLE borked;" or so to rescue your data. But what would
we do if we discovered that the btree opclass sorts 1 before 0, or
something? Surely we wouldn't refuse to fix the opclass just because
some users have existing indexes on disk that would be out of order
with the new opclass definition. We'd just change it and people would
have to deal. People with indexes would need to reindex. People with
partitioning boundaries between 0 and 1 would need to repartition.
This case isn't the same because hashenum() isn't broken in general,
just for this particular purpose. But I think you're trying to argue
that we should fix this by changing something other than the thing
that is broken, and I don't agree with that.
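
To spell that rescue out as a rough sketch with the made-up names
above, you pull the rows out, recreate the partitioned table, and then
reinsert through the root so the tuples get re-routed using the local
hash values:

create table unpartitioned as select * from borked;
drop table borked;
-- recreate "borked" and its partitions as before, then:
insert into borked select * from unpartitioned;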

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> It seems to me that the job of pg_dump is to produce a dump that, when
> reloaded on another system, recreates the same database state. That
> means that we end up with all of the same objects, each defined in the
> same way, and that all of the tables end up with all the same contents
> that they had before.

No, its job is to produce *logically* the same state.  We don't expect
it to produce, say, the same CTID value that each row had before ---
if you want that, you use pg_upgrade or physical replication, not
pg_dump.  There are fairly strong constraints on how identical we
can make the results given that we're not replicating physical state.
And that's not even touching the point that frequently what the user
is after is not an identical copy anyway.

> Here, you'd like to argue that it's perfectly
> fine if we instead insert some of the rows into different tables than
> where they were on the original system.

I can agree with that argument for range or list partitioning, where
the partitions have some semantic meaning to the user.  I don't buy it
for hash partitioning.  It was an implementation artifact to begin
with that a given row ended up in partition 3 not partition 11, so why
would users care which partition it ends up in after a dump/reload?
If they think there is a difference between the partitions, they need
education.

> ... But just as we normally prioritize correctness over speed, so
> also do we normally throw errors when things aren't right instead of
> silently accepting bad input. The partitions in this scenario are
> tables that have constraints.

Again, for hash partitioning those constraints are implementation
artifacts not something that users should have any concern with.

> Keep in mind that there's no rule that a user can't
> query a partition directly.

Sure, and in the case of a hash partition, what he is going to
get is an implementation-dependent subset of the rows.  I use
"implementation-dependent" here in the same sense that the SQL
standard does, which is "there is a rule but the implementation
doesn't have to tell you what it is".  In particular, we are not
bound to make the subset be the same on different installations.
We already didn't promise that, because of issues like endianness.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Wed, Feb 1, 2023 at 4:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This is a misrepresentation of Tom's words. It isn't actually
> self-evident what "we end up with all of the same objects, each
> defined in the same way, and that all of the tables end up with all
> the same contents that they had before" actually means here, in
> general. Tom's main concern seems to be just that -- the ambiguity
> itself.

As far as I can see, Tom didn't admit that there was any ambiguity. He
just said that I was advocating for wrong behavior for the sake of
performance. I don't think that is what I was doing. I also don't
really think Tom thinks that that is what I was doing. But it is what
he said I was doing.

I do agree with you that the ambiguity is the root of the issue. I
mean, if we can't put the rows back into the same partitions where
they were before, does the user care about that, or do they only care
that the rows end up in some partition of the toplevel partitioned
table? I think that there's no single definition of correctness that
is the only defensible one here, and we can't know which one the user
wants a priori. I also think that they can care which one they're
getting, and thus I think that changing the default in a minor release
is a bad plan. Tom, as I understand it, is arguing that the
--load-via-partition-root behavior has negligible downsides and is
almost categorically better than the current default behavior, and
thus making that the new default in some or all situations in a minor
release is totally fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Tom, as I understand it, is arguing that the
> --load-via-partition-root behavior has negligible downsides and is
> almost categorically better than the current default behavior, and
> thus making that the new default in some or all situations in a minor
> release is totally fine.

I think it's categorically better than a failed restore.  I wouldn't
be proposing this if there were no such problem; but there is,
and I don't buy your apparent position that we should leave affected
users to cope as best they can.  Yes, it's clearly only a minority
of users that are affected, else we'd have heard complaints before.
But it could be absolutely catastrophic for an affected user,
if they're trying to restore their only backup.  I'd rather impose
an across-the-board cost on all users of hash partitioning than
risk such outcomes for a few.

Also, you've really offered no evidence for your apparent position
that --load-via-partition-root has unacceptable overhead.  We've
done enough work on partition routing over the last few years that
whatever measurements might've originally justified that idea
don't necessarily apply anymore.  Admittedly, I've not measured
it either.  But we don't tell people to avoid partitioning because
INSERT is unduly expensive.  Partition routing is just the cost of
doing business in that space.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Peter Geoghegan
Date:
On Wed, Feb 1, 2023 at 2:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 1, 2023 at 4:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > This is a misrepresentation of Tom's words. It isn't actually
> > self-evident what "we end up with all of the same objects, each
> > defined in the same way, and that all of the tables end up with all
> > the same contents that they had before" actually means here, in
> > general. Tom's main concern seems to be just that -- the ambiguity
> > itself.
>
> As far as I can see, Tom didn't admit that there was any ambiguity.

Tom said:

"I spent a bit more time thinking about that, and while I agree that
it's an oddity, I don't see that it matters in the case of hash
partitioning.  You would notice an issue if you tried to do a
selective restore of just one partition --- but under what
circumstance would that be a useful thing to do?"

While the word ambiguity may not have actually been used, Tom very
clearly admitted some ambiguity. But even if he didn't, so what? It's
perfectly obvious that that's the major underlying issue, and that
this is a high level problem rather than a low level problem.

> He just said that I was advocating for wrong behavior for the sake of
> performance. I don't think that is what I was doing. I also don't
> really think Tom thinks that that is what I was doing. But it is what
> he said I was doing.

And I think that you're making a mountain out of a molehill.

> I do agree with you that the ambiguity is the root of the issue. I
> mean, if we can't put the rows back into the same partitions where
> they were before, does the user care about that, or do they only care
> that the rows end up in some partition of the toplevel partitioned
> table?

That was precisely the question that Tom posed to you, in the same
email as the one that you found objectionable.

> Tom, as I understand it, is arguing that the
> --load-via-partition-root behavior has negligible downsides and is
> almost categorically better than the current default behavior, and
> thus making that the new default in some or all situations in a minor
> release is totally fine.

I don't know why you seem to think that it was such an absolutist
position as that.

You mentioned "minor releases" here. Who said anything about that?

-- 
Peter Geoghegan



Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Wed, Feb 1, 2023 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Here, you'd like to argue that it's perfectly
> > fine if we instead insert some of the rows into different tables than
> > where they were on the original system.
>
> I can agree with that argument for range or list partitioning, where
> the partitions have some semantic meaning to the user.  I don't buy it
> for hash partitioning.  It was an implementation artifact to begin
> with that a given row ended up in partition 3 not partition 11, so why
> would users care which partition it ends up in after a dump/reload?
> If they think there is a difference between the partitions, they need
> education.

I see your point. I think it's somewhat valid. However, I also think
it muddies the definition of what pg_dump is allowed to do in a way
that I do not like. I think there's a difference between the CTID or
XMAX of a tuple changing and it ending up in a totally different
partition. It feels like it makes the definition of correctness
subjective: we do think that people care about range and list
partitions as individual entities, so we'll put the rows back where
they were and complain if we can't, but we don't think they think
about hash partitions that way, so we will err on the side of making
the dump restoration succeed. That's a level of guessing what the user
meant that I think is uncomfortable. I feel like when somebody around
here discovers that sort of reasoning in some other software's code,
or in a proposed patch, it pretty often gets blasted on this mailing
list with considerable vigor.

I think you can construct plausible cases where it's not just
academic. For instance, suppose I intend to use some kind of logical
replication system, not necessarily the one built into PostgreSQL, to
replicate data between two systems. Before engaging that system, I
need to make the initial database contents match. The system is
oblivious to partitioning, and just replicates each table to a table
with a matching name. Well, if the partitions don't actually match up
1:1, I kind of need to know about that. In this use case, the rows
silently moving around doesn't meet my needs. Or, suppose I dump a
database and restore it into a second one. It works perfectly. I then
run a comparison tool of some sort that compares the two databases.
EDB has such a
tool! I don't know whether it would perform the comparison via the
partition root or not, because I don't know how it works. But I find
it pretty plausible that some such tool would show differences between
the source and target databases. Now, if I had done the data migration
using --load-with-partition-root, I would expect that. I might even be
looking for it, to see what got moved around. But otherwise, it might
be unexpected.

Another subtle problem with this whole situation is: suppose that on
host A, I set up a table hash-partitioned by an enum column and make a
bunch of hash partitions. Then, on host B, I set up the same table
with a bunch of foreign table partitions, each corresponding to the
matching partition on the other node. I guess that just doesn't work,
whereas if the column were of any other data type, it would work. If
it were a collatable data type, it would need the collations and
collation definitions to match, too.
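
A rough sketch of that setup, assuming postgres_fdw, an already-defined
server host_a, and made-up names:

create type mood as enum ('sad', 'ok', 'happy');
create table t (id int, m mood) partition by hash (m);
create foreign table t_p0 partition of t
    for values with (modulus 2, remainder 0)
    server host_a options (table_name 't_p0');
create foreign table t_p1 partition of t
    for values with (modulus 2, remainder 1)
    server host_a options (table_name 't_p1');
-- With most key types this pairs each local partition with its remote
-- counterpart.  With an enum key, the two hosts can hash the same value
-- differently (their enum OIDs differ), so local tuple routing no
-- longer matches how the rows are actually distributed remotely.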

I know these things are subtle, and maybe these specific things will
never happen to anyone, or nobody will care. I don't know. I just have
a really hard time accepting a categorical statement that this just
can't ever matter to anyone.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> You mentioned "minor releases" here. Who said anything about that?

I did: I'd like to back-patch the fix if possible.  I think changing
the default --load-via-partition-root choice could be back-patchable.

If Robert is resistant to that but would accept it in master,
I'd settle for that in preference to having no fix.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 1, 2023 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I can agree with that argument for range or list partitioning, where
>> the partitions have some semantic meaning to the user.  I don't buy it
>> for hash partitioning.  It was an implementation artifact to begin
>> with that a given row ended up in partition 3 not partition 11, so why
>> would users care which partition it ends up in after a dump/reload?
>> If they think there is a difference between the partitions, they need
>> education.

> I see your point. I think it's somewhat valid. However, I also think
> it muddies the definition of what pg_dump is allowed to do in a way
> that I do not like. I think there's a difference between the CTID or
> XMAX of a tuple changing and it ending up in a totally different
> partition. It feels like it makes the definition of correctness
> subjective: we do think that people care about range and list
> partitions as individual entities, so we'll put the rows back where
> they were and complain if we can't, but we don't think they think
> about hash partitions that way, so we will err on the side of making
> the dump restoration succeed. That's a level of guessing what the user
> meant that I think is uncomfortable.

I see your point too, and to me it's evidence for the position that
we should never have done hash partitioning in the first place.
It's precisely because you want to analyze it in the same terms
as range/list partitioning that we have these issues.  Or we could
have built it on some other infrastructure than hash index opclasses
... but we didn't do that, and now we have a mess.  I don't see a way
out other than relaxing the guarantees about how hash partitioning
works compared to the other kinds.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
"David G. Johnston"
Date:
On Wed, Feb 1, 2023 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > You mentioned "minor releases" here. Who said anything about that?
>
> I did: I'd like to back-patch the fix if possible.  I think changing
> the default --load-via-partition-root choice could be back-patchable.
>
> If Robert is resistant to that but would accept it in master,
> I'd settle for that in preference to having no fix.

I'm for accepting --load-via-partition-root as the solution to this
problem.  I think it is better than doing nothing and, at the moment,
I don't see any alternatives to pick from.

As evidenced by the current influx of collation problems related to
indexes, we would be foolish to discount the collation issues with just
plain text, so limiting this only to the enum case (which is a
must-have) doesn't seem wise.

pg_dump should be conservative in what it produces - which in this
situation means having minimal environmental dependencies and internal
volatility.

In the interest of compatibility, having an escape hatch to do things
as they are done today is something we should provide.  We got this one
wrong and that is going to cause some pain.  Though at least with the
escape hatch we shouldn't be dealing with as many unresolvable
complaints as when we back-patched removing the public schema from
search_path.

In the worst case, being conservative, the user can always at minimum
restore their dump file into a local database and get access to their
data in a usable format with minimal hassle.  The few that would like
"same table name or bust" behavior because of external or
application-specific requirements can still get that behavior.

David J.

Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 1, 2023 at 4:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That being the case, I don't think moving the goalposts for hash
>> function stability is going to lead to a workable solution.

> I don't see that there is any easy, clean way to solve this in
> released branches. The idea that I proposed could be implemented in
> master, and I think it is the right kind of fix, but it is not
> back-patchable.

You waved your arms about inventing some new hashing infrastructure,
but it was phrased in such a way that it wasn't clear to me if that
was actually a serious proposal or not.  But if it is: how will you
get around the fact that any change to hashing behavior will break
pg_upgrade of existing hash-partitioned tables?  New infrastructure
avails nothing if it has to be bug-compatible with the old.  So I'm
not sure how restricting the fix to master helps us.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Peter Geoghegan
Date:
On Wed, Feb 1, 2023 at 2:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's precisely because you want to analyze it in the same terms
> as range/list partitioning that we have these issues.  Or we could
> have built it on some other infrastructure than hash index opclasses
> ... but we didn't do that, and now we have a mess.  I don't see a way
> out other than relaxing the guarantees about how hash partitioning
> works compared to the other kinds.

I suspect that using hash index opclasses was the right design --
sticking to the same approach to hashing seems valuable. I agree with
your overall conclusion, though, since it doesn't seem sensible to
allow hashing behavior to ever be anything greater than an
implementation detail. On general principle. What happens when a hash
function is discovered to have a huge flaw, as happened a couple of
times before now?

It's just the same with collations, where a particular collation
shouldn't be expected to have perfectly stable behavior across a dump
and reload. While admitting that possibility does open the door to
problems, in particular problems when range partitioning is in use,
those problems at least make sense. And they probably won't come up
very often -- collation updates don't often contain enormous
gratuitous differences that are liable to create dump/reload hazards
with range partitioning. It is the least worst approach, overall. In
theory, and in practice.

-- 
Peter Geoghegan



Re: pg_dump versus hash partitioning

From
David Rowley
Date:
On Thu, 2 Feb 2023 at 11:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Geoghegan <pg@bowt.ie> writes:
> > You mentioned "minor releases" here. Who said anything about that?
>
> I did: I'd like to back-patch the fix if possible.  I think changing
> the default --load-via-partition-root choice could be back-patchable.
>
> If Robert is resistant to that but would accept it in master,
> I'd settle for that in preference to having no fix.

I'm not sure it'll help the discussion any, but on master, the
performance gap between using --load-via-partition-root and not using
it should be somewhat closed due to 3592e0ff9. So using that is likely
not as terrible as it once was.

[1] does not show results for inserting directly into partitions, but
the benchmark results do show that performance is better than it was
without the caching. The order that pg_dump outputs the rows should
mean the cache is hit most of the time for RANGE partitioned tables,
at least, and likely more often than not for LIST. HASH partitioning
is not really affected by that commit. The idea there is that it's
probably as cheap to hash as it is to do an equality check with the
last Datum.

Digging into the history a bit, I found [2] and particularly [3] that
seem to indicate this option was thought about due to concerns about
hash functions not returning consistent results on different
architectures. I suspect it might have been defaulted to load into the
leaf partitions for performance reasons, however. I mean, why else
would you?

David

[1] https://www.postgresql.org/message-id/CAApHDvqFeW5hvQqprXOLuGMMJSf%2B1C%2BWk4w_L-M03sVduF3oYg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAGPqQf0C1he087bz9xRBOGZBuESYz9X=Fp8Ca_g+TfHgAff75g@mail.gmail.com
[3] https://www.postgresql.org/message-id/CA%2BTgmoZFn7TJ7QBsFatnuEE%3DGYGdZSNXqr9489n5JBsdy5rFfA%40mail.gmail.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> Digging into the history a bit, I found [2] and particularly [3] that
> seem to indicate this option was thought about due to concerns about
> hash functions not returning consistent results on different
> architectures. I suspect it might have been defaulted to load into the
> leaf partitions for performance reasons, however. I mean, why else
> would you?
> [3] https://www.postgresql.org/message-id/CA%2BTgmoZFn7TJ7QBsFatnuEE%3DGYGdZSNXqr9489n5JBsdy5rFfA%40mail.gmail.com

Hah ... so we went over almost exactly this same ground in 2017.
The consensus at that point seemed to be that actual problems
would be rare enough that we'd not need to impose the overhead
of --load-via-partition-root by default.  Now, AFAICS that was
based on exactly zero hard evidence, as to either the frequency
of actual problems or the cost of --load-via-partition-root.
Our optimism about the former seems to have been mostly borne out,
given the lack of complaints since then; but I still think our
pessimism about the latter is on shaky grounds.

Anyway, after re-reading the old thread I wonder if my first instinct
(force --load-via-partition-root for enum hash cases only) was the
best compromise after all.  I'm not sure how painful it is to get
pg_dump to detect such cases, but it's probably possible.
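
For what it's worth, the detection itself looks like it could be driven
by a catalog query along these lines (just a sketch; it ignores
expression keys and only looks at a single level of the hierarchy):

select pt.partrelid::regclass
from pg_partitioned_table pt
join pg_attribute a on a.attrelid = pt.partrelid
                   and a.attnum = any (pt.partattrs)
join pg_type t on t.oid = a.atttypid
where pt.partstrat = 'h'  -- hash-partitioned
  and t.typtype = 'e';    -- on an enum key column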

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Laurenz Albe
Date:
On Wed, 2023-02-01 at 17:49 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Feb 1, 2023 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > I can agree with that argument for range or list partitioning, where
> > > the partitions have some semantic meaning to the user.  I don't buy it
> > > for hash partitioning.  It was an implementation artifact to begin
> > > with that a given row ended up in partition 3 not partition 11, so why
> > > would users care which partition it ends up in after a dump/reload?
> > > If they think there is a difference between the partitions, they need
> > > education.
>
> > I see your point. I think it's somewhat valid. However, I also think
> > it muddies the definition of what pg_dump is allowed to do in a way
> > that I do not like. I think there's a difference between the CTID or
> > XMAX of a tuple changing and it ending up in a totally different
> > partition. It feels like it makes the definition of correctness
> > subjective: we do think that people care about range and list
> > partitions as individual entities, so we'll put the rows back where
> > they were and complain if we can't, but we don't think they think
> > about hash partitions that way, so we will err on the side of making
> > the dump restoration succeed. That's a level of guessing what the user
> > meant that I think is uncomfortable.
>
> I see your point too, and to me it's evidence for the position that
> we should never have done hash partitioning in the first place.

You suggested earlier that we deprecate hash partitioning.  That's a bit
much, but I'd say that most use cases of hash partitioning that I can
imagine would involve integers.  We could warn against using hash
partitioning for data types other than numbers and date/time in the
documentation.

I also understand the uneasiness about rows changing partitions during
a dump/restore, but I cannot think of a better way out.

> What do you think of "--load-via-partition-root=on/off/auto", where
> auto means "not with hash partitions" or the like?

That's perhaps the best way.  So users who know that their hash
partitions won't change and want the small speed benefit can have it.

Yours,
Laurenz Albe



Re: pg_dump versus hash partitioning

From
Alvaro Herrera
Date:
On 2023-Feb-01, Robert Haas wrote:

> I think you can construct plausible cases where it's not just
> academic. For instance, suppose I intend to use some kind of logical
> replication system, not necessarily the one built into PostgreSQL, to
> replicate data between two systems. Before engaging that system, I
> need to make the initial database contents match. The system is
> oblivious to partitioning, and just replicates each table to a table
> with a matching name.

This only works if that other system's hashing behavior is identical to
Postgres' for hashing that particular enum; there's no other way that
you could make the tables match exactly in the way you propose.  What
this tells me is that it's not really reasonable for users to expect
that this situation would actually work.  It is totally reasonable for
range and list, but not for hash.


If the idea of --load-via-partition-root=auto is going to be the fix for
this problem, then it has to consider that hash partitioning might be in
a level below the topmost one.  For example,

create type colors as enum ('blue', 'red', 'green');
create table topmost (prim int, col colors, a int) partition by range (prim);
create table parent partition of topmost for values from (0) to (1000) partition by hash (col);
create table child1 partition of parent for values with (modulus 3, remainder 0);
create table child2 partition of parent for values with (modulus 3, remainder 1);
create table child3 partition of parent for values with (modulus 3, remainder 2);

If you dump this with --load-via-partition-root, for child1 it'll give you this:

--
-- Data for Name: child1; Type: TABLE DATA; Schema: public; Owner: alvherre
--

COPY public.topmost (prim, col, a) FROM stdin;
\.

which is what we want; so for --load-via-partition-root=auto (or
whatever), we need to ensure that we detect hash partitioning all the
way down from the topmost to the leaves.
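
A hedged sketch of what "all the way down" detection could look like,
walking each partition tree from its topmost root:

with recursive tree(root, relid, strat) as (
    select pt.partrelid, pt.partrelid, pt.partstrat
    from pg_partitioned_table pt
    join pg_class c on c.oid = pt.partrelid
    where not c.relispartition       -- start at topmost roots
  union all
    select t.root, pt.partrelid, pt.partstrat
    from tree t
    join pg_inherits i on i.inhparent = t.relid
    join pg_partitioned_table pt on pt.partrelid = i.inhrelid
)
select distinct root::regclass
from tree
where strat = 'h';                   -- hash used anywhere in the tree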

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La libertad es como el dinero; el que no la sabe emplear la pierde" (Alvarez)



Re: pg_dump versus hash partitioning

From
Andrew Dunstan
Date:
On 2023-02-01 We 20:03, Tom Lane wrote:
>
> Anyway, after re-reading the old thread I wonder if my first instinct
> (force --load-via-partition-root for enum hash cases only) was the
> best compromise after all.  I'm not sure how painful it is to get
> pg_dump to detect such cases, but it's probably possible.


Given the other problems you enumerated upthread, I'd be more inclined
to go with your other suggestion of
"--load-via-partition-root=on/off/auto" (with the default presumably
"auto").


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Wed, Feb 1, 2023 at 6:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You waved your arms about inventing some new hashing infrastructure,
> but it was phrased in such a way that it wasn't clear to me if that
> was actually a serious proposal or not.  But if it is: how will you
> get around the fact that any change to hashing behavior will break
> pg_upgrade of existing hash-partitioned tables?  New infrastructure
> avails nothing if it has to be bug-compatible with the old.  So I'm
> not sure how restricting the fix to master helps us.

I think what I'd propose to do is invent a new method of hashing enums
that can be used for hash partitioning on enum columns going forward,
and make it the default for hash partitioning going forward. The
existing method can continue to be used for hash indexes, to avoid
breaking on disk compatibility.

Now, for the back branches, if you wanted to force
--load-via-partition-root specifically for the case of hash
partitioning on an enum column, I think that would be fine. It's a
hack, but there's no way out in the back branches that is not a hack.
What I don't like is the idea of enabling --load-via-partition-root in
the back branches more broadly, e.g. in all cases whatsoever, or in
all cases involving hash partitioning. If we really want to, we can
make --load-via-partition-root the new default categorically in
master, and make the pg_dump option --no-load-via-partition-root. I'm
not convinced that's a good idea, but maybe it is, and you know, major
releases change behavior sometimes, that's how life goes. Minor
releases, though, should minimize behavior changes, IMHO. It's not at
all nice to apply a critical security update and find that pg_dump
works differently to fix a problem you weren't having.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-02-01 We 20:03, Tom Lane wrote:
>> Anyway, after re-reading the old thread I wonder if my first instinct
>> (force --load-via-partition-root for enum hash cases only) was the
>> best compromise after all.  I'm not sure how painful it is to get
>> pg_dump to detect such cases, but it's probably possible.

> Given the other problems you enumerated upthread, I'd be more inclined
> to go with your other suggestion of
> "--load-via-partition-root=on/off/auto" (with the default presumably
> "auto").

Hmm ... is there any actual value in "off" in this case?  We can be
just about certain that dump/reload of a hashed enum key will fail.

If we made "auto" also use --load-via-partition-root for range keys
having collation properties, there'd be more of an argument for
letting users override it.
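
For example (a sketch of that hazard, with made-up names), the routing of
rows in

create table people (name text, info int) partition by range (name);
create table people_a_m partition of people for values from ('A') to ('N');
create table people_n_z partition of people for values from ('N') to (MAXVALUE);

depends on how the destination's collation compares strings: a name that
sorts below 'N' on the source but not on the destination would fail the
leaf's partition constraint on a plain restore, whereas
load-via-partition-root would simply re-route it.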

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> ... so for --load-via-partition-root=auto (or
> whatever), we need to ensure that we detect hash partitioning all the
> way down from the topmost to the leaves.

Yeah, that had already occurred to me, which is why I was not feeling
confident about it being an easy hack in pg_dump.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Here's a set of draft patches around this issue.

0001 does what I last suggested, ie force load-via-partition-root for
leaf tables underneath a partitioned table with a partitioned-by-hash
enum column.  It wasn't quite as messy as I first feared, although we do
need a new query (and pg_dump now knows more about pg_partitioned_table
than it used to).

I was a bit unhappy to read this in the documentation:

        It is best not to use parallelism when restoring from an archive made
        with this option, because <application>pg_restore</application> will
        not know exactly which partition(s) a given archive data item will
        load data into.  This could result in inefficiency due to lock
        conflicts between parallel jobs, or perhaps even restore failures due
        to foreign key constraints being set up before all the relevant data
        is loaded.

This made me wonder if this could be a usable solution at all, but
after thinking for awhile, I don't see how the claim about foreign key
constraints is anything but FUD.  pg_dump/pg_restore have sufficient
dependency logic to prevent that from happening.  I think we can just
drop the "or perhaps ..." clause here, and tolerate the possible
inefficiency as better than failing.

0002 and 0003 are not part of the bug fix, but are some performance
improvements I noticed while working on this.  0002 is pretty minor,
but 0003 is possibly significant if you have a ton of partitions.
I haven't done any performance measurement on it though.

            regards, tom lane

From 0c66c9f1211e16366cf471ef48a52e5447c79424 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 14 Feb 2023 13:09:04 -0500
Subject: [PATCH v1 1/3] Force load-via-partition-root for hash partitioning on
 an enum column.

Without this, dump and restore will almost always fail because the
enum values will receive different OIDs in the destination database,
and their hash codes will therefore be different.  (Improving the
hash algorithm would not make this situation better, and would
indeed break other cases as well.)

Discussion: https://postgr.es/m/1376149.1675268279@sss.pgh.pa.us
---
 src/bin/pg_dump/common.c  |  18 +++---
 src/bin/pg_dump/pg_dump.c | 120 +++++++++++++++++++++++++++++++++++---
 src/bin/pg_dump/pg_dump.h |   2 +
 3 files changed, 124 insertions(+), 16 deletions(-)

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index a43f2e5553..3d0cfc1b8e 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -230,6 +230,9 @@ getSchemaData(Archive *fout, int *numTablesPtr)
     pg_log_info("flagging inherited columns in subtables");
     flagInhAttrs(fout->dopt, tblinfo, numTables);

+    pg_log_info("reading partitioning data");
+    getPartitioningInfo(fout);
+
     pg_log_info("reading indexes");
     getIndexes(fout, tblinfo, numTables);

@@ -285,7 +288,6 @@ static void
 flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
               InhInfo *inhinfo, int numInherits)
 {
-    DumpOptions *dopt = fout->dopt;
     int            i,
                 j;

@@ -301,18 +303,18 @@ flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
             continue;

         /*
-         * Normally, we don't bother computing anything for non-target tables,
-         * but if load-via-partition-root is specified, we gather information
-         * on every partition in the system so that getRootTableInfo can trace
-         * from any given to leaf partition all the way up to the root.  (We
-         * don't need to mark them as interesting for getTableAttrs, though.)
+         * Normally, we don't bother computing anything for non-target tables.
+         * However, we must find the parents of non-root partitioned tables in
+         * any case, so that we can trace from leaf partitions up to the root
+         * (in case a leaf is to be dumped but its parents are not).  We need
+         * not mark such parents interesting for getTableAttrs, though.
          */
         if (!tblinfo[i].dobj.dump)
         {
             mark_parents = false;

-            if (!dopt->load_via_partition_root ||
-                !tblinfo[i].ispartition)
+            if (!(tblinfo[i].relkind == RELKIND_PARTITIONED_TABLE &&
+                  tblinfo[i].ispartition))
                 find_parents = false;
         }

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 527c7651ab..51b317aa3d 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -320,6 +320,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AH);
 static TableInfo *getRootTableInfo(const TableInfo *tbinfo);
+static bool forcePartitionRootLoad(const TableInfo *tbinfo);


 int
@@ -2228,11 +2229,13 @@ dumpTableData_insert(Archive *fout, const void *dcontext)
             insertStmt = createPQExpBuffer();

             /*
-             * When load-via-partition-root is set, get the root table name
-             * for the partition table, so that we can reload data through the
-             * root table.
+             * When load-via-partition-root is set or forced, get the root
+             * table name for the partition table, so that we can reload data
+             * through the root table.
              */
-            if (dopt->load_via_partition_root && tbinfo->ispartition)
+            if (tbinfo->ispartition &&
+                (dopt->load_via_partition_root ||
+                 forcePartitionRootLoad(tbinfo)))
                 targettab = getRootTableInfo(tbinfo);
             else
                 targettab = tbinfo;
@@ -2430,6 +2433,35 @@ getRootTableInfo(const TableInfo *tbinfo)
     return parentTbinfo;
 }

+/*
+ * forcePartitionRootLoad
+ *     Check if we must force load_via_partition_root for this partition.
+ *
+ * This is required if any level of ancestral partitioned table has an
+ * unsafe partitioning scheme.
+ */
+static bool
+forcePartitionRootLoad(const TableInfo *tbinfo)
+{
+    TableInfo  *parentTbinfo;
+
+    Assert(tbinfo->ispartition);
+    Assert(tbinfo->numParents == 1);
+
+    parentTbinfo = tbinfo->parents[0];
+    if (parentTbinfo->unsafe_partitions)
+        return true;
+    while (parentTbinfo->ispartition)
+    {
+        Assert(parentTbinfo->numParents == 1);
+        parentTbinfo = parentTbinfo->parents[0];
+        if (parentTbinfo->unsafe_partitions)
+            return true;
+    }
+
+    return false;
+}
+
 /*
  * dumpTableData -
  *      dump the contents of a single table
@@ -2456,11 +2488,13 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
         dumpFn = dumpTableData_copy;

         /*
-         * When load-via-partition-root is set, get the root table name for
-         * the partition table, so that we can reload data through the root
-         * table.
+         * When load-via-partition-root is set or forced, get the root table
+         * name for the partition table, so that we can reload data through
+         * the root table.
          */
-        if (dopt->load_via_partition_root && tbinfo->ispartition)
+        if (tbinfo->ispartition &&
+            (dopt->load_via_partition_root ||
+             forcePartitionRootLoad(tbinfo)))
         {
             TableInfo  *parentTbinfo;

@@ -6744,6 +6778,76 @@ getInherits(Archive *fout, int *numInherits)
     return inhinfo;
 }

+/*
+ * getPartitioningInfo
+ *      get information about partitioning
+ *
+ * For the most part, we only collect partitioning info about tables we
+ * intend to dump.  However, this function has to consider all partitioned
+ * tables in the database, because we need to know about parents of partitions
+ * we are going to dump even if the parents themselves won't be dumped.
+ *
+ * Specifically, what we need to know is whether each partitioned table
+ * has an "unsafe" partitioning scheme that requires us to force
+ * load-via-partition-root mode for its children.  Currently the only case
+ * for which we force that is hash partitioning on enum columns, since the
+ * hash codes depend on enum value OIDs which won't be replicated across
+ * dump-and-reload.  There are other cases in which load-via-partition-root
+ * might be necessary, but we expect users to cope with them.
+ */
+void
+getPartitioningInfo(Archive *fout)
+{
+    PQExpBuffer query;
+    PGresult   *res;
+    int            ntups;
+
+    /* no partitions before v10 */
+    if (fout->remoteVersion < 100000)
+        return;
+    /* needn't bother if schema-only dump */
+    if (fout->dopt->schemaOnly)
+        return;
+
+    query = createPQExpBuffer();
+
+    /*
+     * Unsafe partitioning schemes are exactly those for which hash enum_ops
+     * appears among the partition opclasses.  We needn't check partstrat.
+     *
+     * Note that this query may well retrieve info about tables we aren't
+     * going to dump and hence have no lock on.  That's okay since we need not
+     * invoke any unsafe server-side functions.
+     */
+    appendPQExpBufferStr(query,
+                         "SELECT partrelid FROM pg_partitioned_table WHERE\n"
+                         "(SELECT c.oid FROM pg_opclass c JOIN pg_am a "
+                         "ON c.opcmethod = a.oid\n"
+                         "WHERE opcname = 'enum_ops' "
+                         "AND opcnamespace = 'pg_catalog'::regnamespace "
+                         "AND amname = 'hash') = ANY(partclass)");
+
+    res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);
+
+    ntups = PQntuples(res);
+
+    for (int i = 0; i < ntups; i++)
+    {
+        Oid            tabrelid = atooid(PQgetvalue(res, i, 0));
+        TableInfo  *tbinfo;
+
+        tbinfo = findTableByOid(tabrelid);
+        if (tbinfo == NULL)
+            pg_fatal("failed sanity check, table OID %u appearing in pg_partitioned_table not found",
+                     tabrelid);
+        tbinfo->unsafe_partitions = true;
+    }
+
+    PQclear(res);
+
+    destroyPQExpBuffer(query);
+}
+
 /*
  * getIndexes
  *      get information about every index on a dumpable table
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index e7cbd8d7ed..bfd3cf17b7 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -318,6 +318,7 @@ typedef struct _tableInfo
     bool        dummy_view;        /* view's real definition must be postponed */
     bool        postponed_def;    /* matview must be postponed into post-data */
     bool        ispartition;    /* is table a partition? */
+    bool        unsafe_partitions;    /* is it an unsafe partitioned table? */

     /*
      * These fields are computed only if we decide the table is interesting
@@ -715,6 +716,7 @@ extern ConvInfo *getConversions(Archive *fout, int *numConversions);
 extern TableInfo *getTables(Archive *fout, int *numTables);
 extern void getOwnedSeqs(Archive *fout, TableInfo tblinfo[], int numTables);
 extern InhInfo *getInherits(Archive *fout, int *numInherits);
+extern void getPartitioningInfo(Archive *fout);
 extern void getIndexes(Archive *fout, TableInfo tblinfo[], int numTables);
 extern void getExtendedStatistics(Archive *fout);
 extern void getConstraints(Archive *fout, TableInfo tblinfo[], int numTables);
--
2.31.1

From 5f73193c120ce04f4e5bb3530fe1ee84afc841f1 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 14 Feb 2023 13:18:29 -0500
Subject: [PATCH v1 2/3] Don't create useless TableAttachInfo objects.

It's silly to create a TableAttachInfo object that we're not
going to dump, when we know perfectly well at creation time
that it won't be dumped.

Discussion: https://postgr.es/m/1376149.1675268279@sss.pgh.pa.us
---
 src/bin/pg_dump/common.c  | 3 ++-
 src/bin/pg_dump/pg_dump.c | 3 ---
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index 3d0cfc1b8e..ce9d2a8ee8 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -336,7 +336,8 @@ flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
         }

         /* Create TableAttachInfo object if needed */
-        if (tblinfo[i].dobj.dump && tblinfo[i].ispartition)
+        if ((tblinfo[i].dobj.dump & DUMP_COMPONENT_DEFINITION) &&
+            tblinfo[i].ispartition)
         {
             TableAttachInfo *attachinfo;

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 51b317aa3d..35bba8765f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -16082,9 +16082,6 @@ dumpTableAttach(Archive *fout, const TableAttachInfo *attachinfo)
     if (dopt->dataOnly)
         return;

-    if (!(attachinfo->partitionTbl->dobj.dump & DUMP_COMPONENT_DEFINITION))
-        return;
-
     q = createPQExpBuffer();

     if (!fout->is_prepared[PREPQUERY_DUMPTABLEATTACH])
--
2.31.1

From 7cc336043e750ec154185b972970b2364f0f57ee Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 14 Feb 2023 13:35:52 -0500
Subject: [PATCH v1 3/3] Simplify pg_dump's creation of parent-table links.

Instead of trying to optimize this by skipping creation of the
links for tables we don't plan to dump, just create them all in
bulk with a single scan over the pg_inherits data.  The previous
approach was more or less O(N^2) in the number of pg_inherits
entries, not to mention being way too complicated.

Discussion: https://postgr.es/m/1376149.1675268279@sss.pgh.pa.us
---
 src/bin/pg_dump/common.c  | 125 ++++++++++++++++----------------------
 src/bin/pg_dump/pg_dump.h |   5 +-
 2 files changed, 54 insertions(+), 76 deletions(-)

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index ce9d2a8ee8..84d99ee8b6 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -83,8 +83,6 @@ static void flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
                           InhInfo *inhinfo, int numInherits);
 static void flagInhIndexes(Archive *fout, TableInfo *tblinfo, int numTables);
 static void flagInhAttrs(DumpOptions *dopt, TableInfo *tblinfo, int numTables);
-static void findParentsByOid(TableInfo *self,
-                             InhInfo *inhinfo, int numInherits);
 static int    strInArray(const char *pattern, char **arr, int arr_size);
 static IndxInfo *findIndexByOid(Oid oid);

@@ -288,45 +286,70 @@ static void
 flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
               InhInfo *inhinfo, int numInherits)
 {
+    TableInfo  *child = NULL;
+    TableInfo  *parent = NULL;
     int            i,
                 j;

-    for (i = 0; i < numTables; i++)
+    /*
+     * Set up links from child tables to their parents.
+     *
+     * We used to attempt to skip this work for tables that are not to be
+     * dumped; but the optimizable cases are rare in practice, and setting up
+     * these links in bulk is cheaper than the old way.  (Note in particular
+     * that it's very rare for a child to have more than one parent.)
+     */
+    for (i = 0; i < numInherits; i++)
     {
-        bool        find_parents = true;
-        bool        mark_parents = true;
-
-        /* Some kinds never have parents */
-        if (tblinfo[i].relkind == RELKIND_SEQUENCE ||
-            tblinfo[i].relkind == RELKIND_VIEW ||
-            tblinfo[i].relkind == RELKIND_MATVIEW)
-            continue;
-
         /*
-         * Normally, we don't bother computing anything for non-target tables.
-         * However, we must find the parents of non-root partitioned tables in
-         * any case, so that we can trace from leaf partitions up to the root
-         * (in case a leaf is to be dumped but its parents are not).  We need
-         * not mark such parents interesting for getTableAttrs, though.
+         * Skip a hashtable lookup if it's same table as last time.  This is
+         * unlikely for the child, but less so for the parent.  (Maybe we
+         * should ask the backend for a sorted array to make it more likely?
+         * Not clear the sorting effort would be repaid, though.)
          */
-        if (!tblinfo[i].dobj.dump)
+        if (child == NULL ||
+            child->dobj.catId.oid != inhinfo[i].inhrelid)
         {
-            mark_parents = false;
+            child = findTableByOid(inhinfo[i].inhrelid);

-            if (!(tblinfo[i].relkind == RELKIND_PARTITIONED_TABLE &&
-                  tblinfo[i].ispartition))
-                find_parents = false;
+            /*
+             * If we find no TableInfo, assume the pg_inherits entry is for a
+             * partitioned index, which we don't need to track.
+             */
+            if (child == NULL)
+                continue;
         }
+        if (parent == NULL ||
+            parent->dobj.catId.oid != inhinfo[i].inhparent)
+        {
+            parent = findTableByOid(inhinfo[i].inhparent);
+            if (parent == NULL)
+                pg_fatal("failed sanity check, parent OID %u of table \"%s\" (OID %u) not found",
+                         inhinfo[i].inhparent,
+                         child->dobj.name,
+                         child->dobj.catId.oid);
+        }
+        /* Add this parent to the child's list of parents. */
+        if (child->numParents > 0)
+            child->parents = pg_realloc_array(child->parents,
+                                              TableInfo *,
+                                              child->numParents + 1);
+        else
+            child->parents = pg_malloc_array(TableInfo *, 1);
+        child->parents[child->numParents++] = parent;
+    }

-        /* If needed, find all the immediate parent tables. */
-        if (find_parents)
-            findParentsByOid(&tblinfo[i], inhinfo, numInherits);
-
+    /*
+     * Now consider all child tables and mark parents interesting as needed.
+     */
+    for (i = 0; i < numTables; i++)
+    {
         /*
          * If needed, mark the parents as interesting for getTableAttrs and
-         * getIndexes.
+         * getIndexes.  We only need this for direct parents of dumpable
+         * tables.
          */
-        if (mark_parents)
+        if (tblinfo[i].dobj.dump)
         {
             int            numParents = tblinfo[i].numParents;
             TableInfo **parents = tblinfo[i].parents;
@@ -979,52 +1002,6 @@ findOwningExtension(CatalogId catalogId)
 }


-/*
- * findParentsByOid
- *      find a table's parents in tblinfo[]
- */
-static void
-findParentsByOid(TableInfo *self,
-                 InhInfo *inhinfo, int numInherits)
-{
-    Oid            oid = self->dobj.catId.oid;
-    int            i,
-                j;
-    int            numParents;
-
-    numParents = 0;
-    for (i = 0; i < numInherits; i++)
-    {
-        if (inhinfo[i].inhrelid == oid)
-            numParents++;
-    }
-
-    self->numParents = numParents;
-
-    if (numParents > 0)
-    {
-        self->parents = pg_malloc_array(TableInfo *, numParents);
-        j = 0;
-        for (i = 0; i < numInherits; i++)
-        {
-            if (inhinfo[i].inhrelid == oid)
-            {
-                TableInfo  *parent;
-
-                parent = findTableByOid(inhinfo[i].inhparent);
-                if (parent == NULL)
-                    pg_fatal("failed sanity check, parent OID %u of table \"%s\" (OID %u) not found",
-                             inhinfo[i].inhparent,
-                             self->dobj.name,
-                             oid);
-                self->parents[j++] = parent;
-            }
-        }
-    }
-    else
-        self->parents = NULL;
-}
-
 /*
  * parseOidArray
  *      parse a string of numbers delimited by spaces into a character array
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index bfd3cf17b7..bb4e981e7d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -320,6 +320,9 @@ typedef struct _tableInfo
     bool        ispartition;    /* is table a partition? */
     bool        unsafe_partitions;    /* is it an unsafe partitioned table? */

+    int            numParents;        /* number of (immediate) parent tables */
+    struct _tableInfo **parents;    /* TableInfos of immediate parents */
+
     /*
      * These fields are computed only if we decide the table is interesting
      * (it's either a table to dump, or a direct parent of a dumpable table).
@@ -352,8 +355,6 @@ typedef struct _tableInfo
     /*
      * Stuff computed only for dumpable tables.
      */
-    int            numParents;        /* number of (immediate) parent tables */
-    struct _tableInfo **parents;    /* TableInfos of immediate parents */
     int            numIndexes;        /* number of indexes */
     struct _indxInfo *indexes;    /* indexes */
     struct _tableDataInfo *dataObj; /* TableDataInfo, if dumping its data */
--
2.31.1


Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Tue, Feb 14, 2023 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This made me wonder if this could be a usable solution at all, but
> after thinking for awhile, I don't see how the claim about foreign key
> constraints is anything but FUD.  pg_dump/pg_restore have sufficient
> dependency logic to prevent that from happening.  I think we can just
> drop the "or perhaps ..." clause here, and tolerate the possible
> inefficiency as better than failing.

Right, but isn't that dependency logic based around the fact that the
inserts are targeting the original partition? Like, suppose partition
A has a foreign key that is not present on partition B. A row that is
originally in partition B gets rerouted into partition A. It must now
satisfy the foreign key constraint when, previously, that was
unnecessary.
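
A minimal sketch of that scenario (hypothetical names; the key is an enum so
that re-routing across dump/restore is actually plausible):

create type color as enum ('blue', 'red');
create table t (c color, b int) partition by hash (c);
create table t_a partition of t for values with (modulus 2, remainder 0);
create table t_b partition of t for values with (modulus 2, remainder 1);
create table ref (b int primary key);
alter table t_a add foreign key (b) references ref (b);  -- FK on one partition only

If the enum's OIDs change across dump and restore, a row dumped from t_b can
be re-routed into t_a and must then satisfy a constraint it never had to
satisfy before.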

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Feb 14, 2023 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This made me wonder if this could be a usable solution at all, but
>> after thinking for awhile, I don't see how the claim about foreign key
>> constraints is anything but FUD.  pg_dump/pg_restore have sufficient
>> dependency logic to prevent that from happening.  I think we can just
>> drop the "or perhaps ..." clause here, and tolerate the possible
>> inefficiency as better than failing.

> Right, but isn't that dependency logic based around the fact that the
> inserts are targeting the original partition? Like, suppose partition
> A has a foreign key that is not present on partition B. A row that is
> originally in partition B gets rerouted into partition A. It must now
> satisfy the foreign key constraint when, previously, that was
> unnecessary.

Well, that's a user error not pg_dump's fault.  Particularly so for hash
partitioning, where there is no defensible reason to make the partitions
semantically different.

There could be a risk of a timing problem, namely that parallel pg_restore
tries to check an FK constraint before all the relevant data has arrived.
But in practice I don't believe that either.  We load all the data in
"data" phase and then create indexes and check FKs in "post-data" phase,
and I do not believe that parallel restore weakens that separation
(because it's enforced by a post-data boundary object in the dependencies).

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Mon, Feb 27, 2023 at 11:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, that's a user error not pg_dump's fault.  Particularly so for hash
> partitioning, where there is no defensible reason to make the partitions
> semantically different.

I am still of the opinion that you're going down a dangerous path of
redefining pg_dump's mission from "dump and restore the database, as
it actually exists" to "dump and restore the database, unless the user
did something that I think is silly".

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Feb 27, 2023 at 11:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, that's a user error not pg_dump's fault.  Particularly so for hash
>> partitioning, where there is no defensible reason to make the partitions
>> semantically different.

> I am still of the opinion that you're going down a dangerous path of
> redefining pg_dump's mission from "dump and restore the database, as
> it actually exists" to "dump and restore the database, unless the user
> did something that I think is silly".

Let's not attack straw men, shall we?  I'm defining pg_dump's mission
as "dump and restore the database successfully".  Failure to restore
does not help anyone, especially if they are in a disaster recovery
situation where it's not possible to re-take the dump.  It's not like
there's no precedent for having pg_dump tweak things to ensure a
successful restore.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Robert Haas
Date:
On Mon, Feb 27, 2023 at 12:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Mon, Feb 27, 2023 at 11:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Well, that's a user error not pg_dump's fault.  Particularly so for hash
> >> partitioning, where there is no defensible reason to make the partitions
> >> semantically different.
>
> > I am still of the opinion that you're going down a dangerous path of
> > redefining pg_dump's mission from "dump and restore the database, as
> > it actually exists" to "dump and restore the database, unless the user
> > did something that I think is silly".
>
> Let's not attack straw men, shall we?  I'm defining pg_dump's mission
> as "dump and restore the database successfully".  Failure to restore
> does not help anyone, especially if they are in a disaster recovery
> situation where it's not possible to re-take the dump.  It's not like
> there's no precedent for having pg_dump tweak things to ensure a
> successful restore.

Sure, but I was responding to your assertion that there's no case in
which --load-via-partition-root could cause a restore failure. I'm not
sure that's accurate. The fact that the case is something that nobody
is especially likely to do doesn't mean it doesn't exist, and ISTM we
have had more than a few pg_dump bug reports that come down to someone
having done something which we didn't foresee.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Sure, but I was responding to your assertion that there's no case in
> which --load-via-partition-root could cause a restore failure. I'm not
> sure that's accurate.

Perhaps it's not, but it's certainly far less likely to cause a restore
failure than the behavior I want to replace.

More to the current point perhaps, I doubt that it's likely enough to
cause a restore failure to justify the existing docs warning.  There
may have been a time when the warning was justified, but I don't believe
it today.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Julien Rouhaud
Date:
On Tue, Feb 14, 2023 at 02:21:33PM -0500, Tom Lane wrote:
> Here's a set of draft patches around this issue.
>
> 0001 does what I last suggested, ie force load-via-partition-root for
> leaf tables underneath a partitioned table with a partitioned-by-hash
> enum column.  It wasn't quite as messy as I first feared, although we do
> need a new query (and pg_dump now knows more about pg_partitioned_table
> than it used to).
>
> I was a bit unhappy to read this in the documentation:
>
>         It is best not to use parallelism when restoring from an archive made
>         with this option, because <application>pg_restore</application> will
>         not know exactly which partition(s) a given archive data item will
>         load data into.  This could result in inefficiency due to lock
>         conflicts between parallel jobs, or perhaps even restore failures due
>         to foreign key constraints being set up before all the relevant data
>         is loaded.
>
> This made me wonder if this could be a usable solution at all, but
> after thinking for awhile, I don't see how the claim about foreign key
> constraints is anything but FUD.  pg_dump/pg_restore have sufficient
> dependency logic to prevent that from happening.  I think we can just
> drop the "or perhaps ..." clause here, and tolerate the possible
> inefficiency as better than failing.

Working on a side project that can cause dumped hash partitions to be
routed to a different partition, I realized that --load-via-partition-root can
indeed cause a deadlock in such cases, without any FK dependency or anything
else.

The problem is that each worker performs a TRUNCATE TABLE ONLY followed by
a COPY of the original partition's data inside a transaction, and that
obviously leads to a deadlock if the original (locked) partition and the
partition the data is restored into are different.
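
Roughly, each worker issues something like this for a table it created
earlier in the run (illustrative names):

BEGIN;
TRUNCATE TABLE ONLY public.t0_p1;      -- ACCESS EXCLUSIVE lock on the original partition
COPY public.t0 (en, data) FROM stdin;  -- via the root, so rows can be routed into
                                       -- t0_p2 or t0_p3, which other workers may
                                       -- hold the same kind of lock on
COMMIT;

Two workers each holding the exclusive lock on their own partition while
waiting for a lock on the other's partition is all it takes.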



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> Working on a side project that can cause dumped hash partitions to be
> routed to a different partition, I realized that --load-via-partition-root can
> indeed cause a deadlock in such cases, without any FK dependency or anything
> else.

> The problem is that each worker performs a TRUNCATE TABLE ONLY followed by
> a COPY of the original partition's data inside a transaction, and that
> obviously leads to a deadlock if the original (locked) partition and the
> partition the data is restored into are different.

Oh, interesting.  I wonder if we can rearrange things to avoid that.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Julien Rouhaud
Date:
On Fri, Mar 10, 2023 at 10:10:14PM -0500, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > Working on a side project that can cause dumped hash partitions to be
> > routed to a different partition, I realized that --load-via-partition-root can
> > indeed cause a deadlock in such cases, without any FK dependency or anything
> > else.
>
> > The problem is that each worker performs a TRUNCATE TABLE ONLY followed by
> > a COPY of the original partition's data inside a transaction, and that
> > obviously leads to a deadlock if the original (locked) partition and the
> > partition the data is restored into are different.
>
> Oh, interesting.  I wonder if we can rearrange things to avoid that.

The BEGIN + TRUNCATE is only there to avoid generating WAL records just in case
the wal_level is minimal.  I don't remember if that optimization still exists,
but if yes we could avoid doing that if the server's wal_level is replica or
higher?  That's not perfect but it would help in many cases.



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> The BEGIN + TRUNCATE is only there to avoid generating WAL records just in case
> the wal_level is minimal.  I don't remember if that optimization still exists,
> but if yes we could avoid doing that if the server's wal_level is replica or
> higher?  That's not perfect but it would help in many cases.

After thinking about this, it seems like a better idea is to skip the
TRUNCATE if we're doing load-via-partition-root.  In that case it's
clearly a dangerous thing to do regardless of deadlock worries, since
it risks discarding previously-loaded data that came over from another
partition.  (IOW this seems like an independent, pre-existing bug in
load-via-partition-root mode.)

The trick is to detect in pg_restore whether pg_dump chose to do
load-via-partition-root.  If we have a COPY statement we can fairly
easily examine it to see if the target table is what we expect or
something else.  However, if the table was dumped as INSERT statements
it'd be far messier; the INSERTs are not readily accessible from the
code that needs to make the decision.
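
With the example from upthread, the data item for child1 comes out one of
two ways:

COPY public.child1 (prim, col, a) FROM stdin;   -- ordinary dump
COPY public.topmost (prim, col, a) FROM stdin;  -- --load-via-partition-root

so pg_restore can just compare the table named in the COPY against the TOC
entry's own namespace and tag.  (For data dumped as INSERT statements there
is of course no COPY text to look at.)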

What I propose we do about that is further tweak things so that
load-via-partition-root forces dumping via COPY.  AFAIK the only
compelling use-case for dump-as-INSERTs is in transferring data
to a non-Postgres database, which is a context in which dumping
partitioned tables as such is pretty hopeless anyway.  (I wonder if
we should have some way to dump all the contents of a partitioned
table as if it were unpartitioned, to support such migration.)

An alternative could be to extend the archive TOC format to record
directly whether a given TABLE DATA object loads data via partition
root or normally.  Potentially we could do that without an archive
format break by defining te->defn for TABLE DATA to be empty for
normal dumps (as it is now) or something like "-- load via partition root"
for the load-via-partition-root case.  However, depending on examination
of the COPY command would already work for the vast majority of existing
archive files, so I feel like it might be the preferable choice.

Thoughts?

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Justin Pryzby
Date:
On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
> What I propose we do about that is further tweak things so that
> load-via-partition-root forces dumping via COPY.  AFAIK the only
> compelling use-case for dump-as-INSERTs is in transferring data
> to a non-Postgres database, which is a context in which dumping
> partitioned tables as such is pretty hopeless anyway.  (I wonder if
> we should have some way to dump all the contents of a partitioned
> table as if it were unpartitioned, to support such migration.)

I think that's what this other thread is about.

https://commitfest.postgresql.org/42/4130/
pg_dump all child tables with the root table



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
>> What I propose we do about that is further tweak things so that
>> load-via-partition-root forces dumping via COPY.  AFAIK the only
>> compelling use-case for dump-as-INSERTs is in transferring data
>> to a non-Postgres database, which is a context in which dumping
>> partitioned tables as such is pretty hopeless anyway.  (I wonder if
>> we should have some way to dump all the contents of a partitioned
>> table as if it were unpartitioned, to support such migration.)

> I think that what this other thread is about.
> https://commitfest.postgresql.org/42/4130/
> pg_dump all child tables with the root table

As far as I understood (didn't actually read the latest patch) that
one is just about easily selecting all the partitions of a partitioned
table when doing a selective dump.  It's not helping you produce a
non-Postgres-specific dump.

Although I guess by combining load-via-partition-root, data-only mode,
and dump-as-inserts you could produce a clean collection of
non-partition-dependent INSERT commands ... so maybe we'd better not
force dump-as-inserts off.  I'm starting to like the te->defn hack
more.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Julien Rouhaud
Date:
On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > The BEGIN + TRUNCATE is only there to avoid generating WAL records just in case
> > the wal_level is minimal.  I don't remember if that optimization still exists,
> > but if yes we could avoid doing that if the server's wal_level is replica or
> > higher?  That's not perfect but it would help in many cases.
>
> After thinking about this, it seems like a better idea is to skip the
> TRUNCATE if we're doing load-via-partition-root.  In that case it's
> clearly a dangerous thing to do regardless of deadlock worries, since
> it risks discarding previously-loaded data that came over from another
> partition.  (IOW this seems like an independent, pre-existing bug in
> load-via-partition-root mode.)

It seems quite unlikely that one could actually truncate already-restored data
without eventually running into a deadlock, but it's still possible, so agreed.

> The trick is to detect in pg_restore whether pg_dump chose to do
> load-via-partition-root.  If we have a COPY statement we can fairly
> easily examine it to see if the target table is what we expect or
> something else.  However, if the table was dumped as INSERT statements
> it'd be far messier; the INSERTs are not readily accessible from the
> code that needs to make the decision.
>
> What I propose we do about that is further tweak things so that
> load-via-partition-root forces dumping via COPY.  AFAIK the only
> compelling use-case for dump-as-INSERTs is in transferring data
> to a non-Postgres database, which is a context in which dumping
> partitioned tables as such is pretty hopeless anyway.

It seems acceptable to me.

> (I wonder if
> we should have some way to dump all the contents of a partitioned
> table as if it were unpartitioned, to support such migration.)

(this would be nice to have)

> An alternative could be to extend the archive TOC format to record
> directly whether a given TABLE DATA object loads data via partition
> root or normally.  Potentially we could do that without an archive
> format break by defining te->defn for TABLE DATA to be empty for
> normal dumps (as it is now) or something like "-- load via partition root"
> for the load-via-partition-root case.  However, depending on examination
> of the COPY command would already work for the vast majority of existing
> archive files, so I feel like it might be the preferable choice.

Given that this approach wouldn't help with existing dump files (at least the
COPY-based ones can be handled by examining the COPY command; the ones using
INSERT are doomed in any case), I'm slightly in favor of the first approach,
and of later adding an easy, non-magic way to produce dumps that don't depend
on partitioning.  It would mean that such dumps could only be produced with
pg16 client binaries, but those also work against older server versions, so it
doesn't seem like a huge problem in the long run.



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
>> The trick is to detect in pg_restore whether pg_dump chose to do
>> load-via-partition-root.

> Given that this approach wouldn't help with existing dump files (at least the
> COPY-based ones can be handled by examining the COPY command; the ones using
> INSERT are doomed in any case), I'm slightly in favor of the first approach,
> and of later adding an easy, non-magic way to produce dumps that don't depend
> on partitioning.

Yeah, we need to do both.  Attached find an updated patch series:

0001: TAP test that exhibits both this deadlock problem and the
different-hash-codes problem.  I'm not sure if we want to commit
this, or if it should be in exactly this form --- the second set
of tests with a manual --load-via-partition-root switch will be
pretty redundant after this patch series.

0002: Make pg_restore detect load-via-partition-root by examining the
COPY commands embedded in the dump, and skip the TRUNCATE if so,
thereby fixing the deadlock issue.  This is the best we can do for
legacy dump files, I think, but it should be good enough.

0003: Also detect load-via-partition-root by adding a label in the
dump.  This is a more bulletproof solution going forward.

0004-0006: same as previous patches, but rebased over these.
This gets us to a place where the new TAP test passes.

I've not done anything about modifying the documentation, but I still
think we could remove the warning label on --load-via-partition-root.

            regards, tom lane

From 33acd81f90ccd1ebdd75fbdf58024edb8f10f6a1 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Mar 2023 19:01:42 -0400
Subject: [PATCH v2 1/6] Create a test case for dump-and-restore of hashed-enum
 partitioning.

I'm not sure we want to commit this, at least not in this form,
but it's able to demonstrate both the deadlock problem (when using
--load-via-partition-root) and the change-of-hash-code problem
(when not).
---
 src/bin/pg_dump/meson.build               |  1 +
 src/bin/pg_dump/t/004_pg_dump_parallel.pl | 66 +++++++++++++++++++++++
 2 files changed, 67 insertions(+)
 create mode 100644 src/bin/pg_dump/t/004_pg_dump_parallel.pl

diff --git a/src/bin/pg_dump/meson.build b/src/bin/pg_dump/meson.build
index ab4c25c781..b2fb7ac77f 100644
--- a/src/bin/pg_dump/meson.build
+++ b/src/bin/pg_dump/meson.build
@@ -96,6 +96,7 @@ tests += {
       't/001_basic.pl',
       't/002_pg_dump.pl',
       't/003_pg_dump_with_server.pl',
+      't/004_pg_dump_parallel.pl',
       't/010_dump_connstr.pl',
     ],
   },
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
new file mode 100644
index 0000000000..c3f7d20b13
--- /dev/null
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -0,0 +1,66 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $dbname1 = 'regression_src';
+my $dbname2 = 'regression_dest1';
+my $dbname3 = 'regression_dest2';
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+my $backupdir = $node->backup_dir;
+
+$node->run_log([ 'createdb', $dbname1 ]);
+$node->run_log([ 'createdb', $dbname2 ]);
+$node->run_log([ 'createdb', $dbname3 ]);
+
+$node->safe_psql(
+    $dbname1,
+    qq{
+create type digit as enum ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9');
+create table t0 (en digit, data int) partition by hash(en);
+create table t0_p1 partition of t0 for values with (modulus 3, remainder 0);
+create table t0_p2 partition of t0 for values with (modulus 3, remainder 1);
+create table t0_p3 partition of t0 for values with (modulus 3, remainder 2);
+insert into t0 select (x%10)::text::digit, x from generate_series(1,1000) x;
+    });
+
+$node->command_ok(
+    [
+        'pg_dump', '-Fd', '--no-sync', '-j2', '-f', "$backupdir/dump1",
+        $node->connstr($dbname1)
+    ],
+    'parallel dump');
+
+$node->command_ok(
+    [
+        'pg_restore', '-v',
+        '-d',         $node->connstr($dbname2),
+        '-j3',        "$backupdir/dump1"
+    ],
+    'parallel restore');
+
+$node->command_ok(
+    [
+        'pg_dump', '-Fd', '--no-sync', '-j2', '-f', "$backupdir/dump2",
+        '--load-via-partition-root', $node->connstr($dbname1)
+    ],
+    'parallel dump via partition root');
+
+$node->command_ok(
+    [
+        'pg_restore', '-v',
+        '-d',         $node->connstr($dbname3),
+        '-j3',        "$backupdir/dump2"
+    ],
+    'parallel restore via partition root');
+
+done_testing();
--
2.31.1

From 685b4a1c0aaa59b2b6047051d082838b88528587 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Mar 2023 19:05:05 -0400
Subject: [PATCH v2 2/6] Avoid TRUNCATE when restoring load-via-partition-root
 data.

This fixes the deadlock problem, allowing 004_pg_dump_parallel.pl to
succeed in the test where there's a manual --load-via-partition-root
switch.  It relies on the dump having used COPY commands, though.
---
 src/bin/pg_dump/pg_backup_archiver.c | 61 ++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 7 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 61ebb8fe85..cb3a2c21f5 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -86,6 +86,7 @@ static RestorePass _tocEntryRestorePass(TocEntry *te);
 static bool _tocEntryIsACL(TocEntry *te);
 static void _disableTriggersIfNecessary(ArchiveHandle *AH, TocEntry *te);
 static void _enableTriggersIfNecessary(ArchiveHandle *AH, TocEntry *te);
+static bool is_load_via_partition_root(TocEntry *te);
 static void buildTocEntryArrays(ArchiveHandle *AH);
 static void _moveBefore(TocEntry *pos, TocEntry *te);
 static int    _discoverArchiveFormat(ArchiveHandle *AH);
@@ -887,6 +888,8 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
                 }
                 else
                 {
+                    bool        use_truncate;
+
                     _disableTriggersIfNecessary(AH, te);

                     /* Select owner and schema as necessary */
@@ -898,13 +901,23 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)

                     /*
                      * In parallel restore, if we created the table earlier in
-                     * the run then we wrap the COPY in a transaction and
-                     * precede it with a TRUNCATE.  If archiving is not on
-                     * this prevents WAL-logging the COPY.  This obtains a
-                     * speedup similar to that from using single_txn mode in
-                     * non-parallel restores.
+                     * the run and we are not restoring a
+                     * load-via-partition-root data item then we wrap the COPY
+                     * in a transaction and precede it with a TRUNCATE.  If
+                     * archiving is not on this prevents WAL-logging the COPY.
+                     * This obtains a speedup similar to that from using
+                     * single_txn mode in non-parallel restores.
+                     *
+                     * We mustn't do this for load-via-partition-root cases
+                     * because some data might get moved across partition
+                     * boundaries, risking deadlock and/or loss of previously
+                     * loaded data.  (We assume that all partitions of a
+                     * partitioned table will be treated the same way.)
                      */
-                    if (is_parallel && te->created)
+                    use_truncate = is_parallel && te->created &&
+                        !is_load_via_partition_root(te);
+
+                    if (use_truncate)
                     {
                         /*
                          * Parallel restore is always talking directly to a
@@ -942,7 +955,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
                     AH->outputKind = OUTPUT_SQLCMDS;

                     /* close out the transaction started above */
-                    if (is_parallel && te->created)
+                    if (use_truncate)
                         CommitTransaction(&AH->public);

                     _enableTriggersIfNecessary(AH, te);
@@ -1036,6 +1049,40 @@ _enableTriggersIfNecessary(ArchiveHandle *AH, TocEntry *te)
              fmtQualifiedId(te->namespace, te->tag));
 }

+/*
+ * Detect whether a TABLE DATA TOC item is performing "load via partition
+ * root", that is the target table is an ancestor partition rather than the
+ * table the TOC item is nominally for.
+ *
+ * In newer archive files this can be detected by checking for a special
+ * comment placed in te->defn.  In older files we have to fall back to seeing
+ * if the COPY statement targets the named table or some other one.  This
+ * will not work for data dumped as INSERT commands, so we could give a false
+ * negative in that case; fortunately, that's a rarely-used option.
+ */
+static bool
+is_load_via_partition_root(TocEntry *te)
+{
+    if (te->copyStmt && *te->copyStmt)
+    {
+        PQExpBuffer copyStmt = createPQExpBuffer();
+        bool        result;
+
+        /*
+         * Build the initial part of the COPY as it would appear if the
+         * nominal target table is the actual target.  If we see anything
+         * else, it must be a load-via-partition-root case.
+         */
+        appendPQExpBuffer(copyStmt, "COPY %s ",
+                          fmtQualifiedId(te->namespace, te->tag));
+        result = strncmp(te->copyStmt, copyStmt->data, copyStmt->len) != 0;
+        destroyPQExpBuffer(copyStmt);
+        return result;
+    }
+    /* Assume it's not load-via-partition-root */
+    return false;
+}
+
 /*
  * This is a routine that is part of the dumper interface, hence the 'Archive*' parameter.
  */
--
2.31.1

From 28b122cb9be438e08e3754672559a9e59ea39711 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Mar 2023 19:08:40 -0400
Subject: [PATCH v2 3/6] Add a marker to TABLE DATA items using
 load-via-partition-root.

Set the te->defn field of such an item to a comment like

-- load via partition root <table name>

This provides a mechanism for load-via-partition-root detection
that works even with --inserts mode, and as a side benefit the
dump contents are a bit less confusing.  We still need the COPY
test to work with old dump files, though.
---
 src/bin/pg_dump/pg_backup_archiver.c | 11 ++++++--
 src/bin/pg_dump/pg_dump.c            | 39 ++++++++++++++++------------
 2 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index cb3a2c21f5..1ecd719db2 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1063,6 +1063,9 @@ _enableTriggersIfNecessary(ArchiveHandle *AH, TocEntry *te)
 static bool
 is_load_via_partition_root(TocEntry *te)
 {
+    if (te->defn &&
+        strncmp(te->defn, "-- load via partition root ", 27) == 0)
+        return true;
     if (te->copyStmt && *te->copyStmt)
     {
         PQExpBuffer copyStmt = createPQExpBuffer();
@@ -3002,8 +3005,12 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
             res = res & ~REQ_DATA;
     }

-    /* If there's no definition command, there's no schema component */
-    if (!te->defn || !te->defn[0])
+    /*
+     * If there's no definition command, there's no schema component.  Treat
+     * "load via partition root" comments as not schema.
+     */
+    if (!te->defn || !te->defn[0] ||
+        strncmp(te->defn, "-- load via partition root ", 27) == 0)
         res = res & ~REQ_SCHEMA;

     /*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4217908f84..b0376ee56c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -2443,34 +2443,38 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
     PQExpBuffer copyBuf = createPQExpBuffer();
     PQExpBuffer clistBuf = createPQExpBuffer();
     DataDumperPtr dumpFn;
+    char       *tdDefn = NULL;
     char       *copyStmt;
     const char *copyFrom;

     /* We had better have loaded per-column details about this table */
     Assert(tbinfo->interesting);

+    /*
+     * When load-via-partition-root is set, get the root table name for the
+     * partition table, so that we can reload data through the root table.
+     * Then construct a comment to be inserted into the TOC entry's defn
+     * field, so that such cases can be identified reliably.
+     */
+    if (dopt->load_via_partition_root && tbinfo->ispartition)
+    {
+        TableInfo  *parentTbinfo;
+
+        parentTbinfo = getRootTableInfo(tbinfo);
+        copyFrom = fmtQualifiedDumpable(parentTbinfo);
+        printfPQExpBuffer(copyBuf, "-- load via partition root %s",
+                          copyFrom);
+        tdDefn = pg_strdup(copyBuf->data);
+    }
+    else
+        copyFrom = fmtQualifiedDumpable(tbinfo);
+
     if (dopt->dump_inserts == 0)
     {
         /* Dump/restore using COPY */
         dumpFn = dumpTableData_copy;
-
-        /*
-         * When load-via-partition-root is set, get the root table name for
-         * the partition table, so that we can reload data through the root
-         * table.
-         */
-        if (dopt->load_via_partition_root && tbinfo->ispartition)
-        {
-            TableInfo  *parentTbinfo;
-
-            parentTbinfo = getRootTableInfo(tbinfo);
-            copyFrom = fmtQualifiedDumpable(parentTbinfo);
-        }
-        else
-            copyFrom = fmtQualifiedDumpable(tbinfo);
-
         /* must use 2 steps here 'cause fmtId is nonreentrant */
-        appendPQExpBuffer(copyBuf, "COPY %s ",
+        printfPQExpBuffer(copyBuf, "COPY %s ",
                           copyFrom);
         appendPQExpBuffer(copyBuf, "%s FROM stdin;\n",
                           fmtCopyColumnList(tbinfo, clistBuf));
@@ -2498,6 +2502,7 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
                                        .owner = tbinfo->rolname,
                                        .description = "TABLE DATA",
                                        .section = SECTION_DATA,
+                                       .createStmt = tdDefn,
                                        .copyStmt = copyStmt,
                                        .deps = &(tbinfo->dobj.dumpId),
                                        .nDeps = 1,
--
2.31.1

From e68c58928285679663dcea32ea98409a31f15a5a Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Mar 2023 19:11:44 -0400
Subject: [PATCH v2 4/6] Force load-via-partition-root for hash partitioning on
 an enum column.

Without this, dump and restore will almost always fail because the
enum values will receive different OIDs in the destination database,
and their hash codes will therefore be different.  (Improving the
hash algorithm would not make this situation better, and would indeed
break other cases as well.)  This allows 004_pg_dump_parallel.pl to
pass in full.

Discussion: https://postgr.es/m/1376149.1675268279@sss.pgh.pa.us
---
 src/bin/pg_dump/common.c  |  18 +++---
 src/bin/pg_dump/pg_dump.c | 122 +++++++++++++++++++++++++++++++++++---
 src/bin/pg_dump/pg_dump.h |   2 +
 3 files changed, 125 insertions(+), 17 deletions(-)

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index a2b74901e4..44558b60e8 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -230,6 +230,9 @@ getSchemaData(Archive *fout, int *numTablesPtr)
     pg_log_info("flagging inherited columns in subtables");
     flagInhAttrs(fout->dopt, tblinfo, numTables);

+    pg_log_info("reading partitioning data");
+    getPartitioningInfo(fout);
+
     pg_log_info("reading indexes");
     getIndexes(fout, tblinfo, numTables);

@@ -285,7 +288,6 @@ static void
 flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
               InhInfo *inhinfo, int numInherits)
 {
-    DumpOptions *dopt = fout->dopt;
     int            i,
                 j;

@@ -301,18 +303,18 @@ flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
             continue;

         /*
-         * Normally, we don't bother computing anything for non-target tables,
-         * but if load-via-partition-root is specified, we gather information
-         * on every partition in the system so that getRootTableInfo can trace
-         * from any given to leaf partition all the way up to the root.  (We
-         * don't need to mark them as interesting for getTableAttrs, though.)
+         * Normally, we don't bother computing anything for non-target tables.
+         * However, we must find the parents of non-root partitioned tables in
+         * any case, so that we can trace from leaf partitions up to the root
+         * (in case a leaf is to be dumped but its parents are not).  We need
+         * not mark such parents interesting for getTableAttrs, though.
          */
         if (!tblinfo[i].dobj.dump)
         {
             mark_parents = false;

-            if (!dopt->load_via_partition_root ||
-                !tblinfo[i].ispartition)
+            if (!(tblinfo[i].relkind == RELKIND_PARTITIONED_TABLE &&
+                  tblinfo[i].ispartition))
                 find_parents = false;
         }

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b0376ee56c..7f40b455af 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -320,6 +320,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AH);
 static TableInfo *getRootTableInfo(const TableInfo *tbinfo);
+static bool forcePartitionRootLoad(const TableInfo *tbinfo);


 int
@@ -2227,11 +2228,13 @@ dumpTableData_insert(Archive *fout, const void *dcontext)
             insertStmt = createPQExpBuffer();

             /*
-             * When load-via-partition-root is set, get the root table name
-             * for the partition table, so that we can reload data through the
-             * root table.
+             * When load-via-partition-root is set or forced, get the root
+             * table name for the partition table, so that we can reload data
+             * through the root table.
              */
-            if (dopt->load_via_partition_root && tbinfo->ispartition)
+            if (tbinfo->ispartition &&
+                (dopt->load_via_partition_root ||
+                 forcePartitionRootLoad(tbinfo)))
                 targettab = getRootTableInfo(tbinfo);
             else
                 targettab = tbinfo;
@@ -2429,6 +2432,35 @@ getRootTableInfo(const TableInfo *tbinfo)
     return parentTbinfo;
 }

+/*
+ * forcePartitionRootLoad
+ *     Check if we must force load_via_partition_root for this partition.
+ *
+ * This is required if any level of ancestral partitioned table has an
+ * unsafe partitioning scheme.
+ */
+static bool
+forcePartitionRootLoad(const TableInfo *tbinfo)
+{
+    TableInfo  *parentTbinfo;
+
+    Assert(tbinfo->ispartition);
+    Assert(tbinfo->numParents == 1);
+
+    parentTbinfo = tbinfo->parents[0];
+    if (parentTbinfo->unsafe_partitions)
+        return true;
+    while (parentTbinfo->ispartition)
+    {
+        Assert(parentTbinfo->numParents == 1);
+        parentTbinfo = parentTbinfo->parents[0];
+        if (parentTbinfo->unsafe_partitions)
+            return true;
+    }
+
+    return false;
+}
+
 /*
  * dumpTableData -
  *      dump the contents of a single table
@@ -2451,12 +2483,14 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
     Assert(tbinfo->interesting);

     /*
-     * When load-via-partition-root is set, get the root table name for the
-     * partition table, so that we can reload data through the root table.
-     * Then construct a comment to be inserted into the TOC entry's defn
-     * field, so that such cases can be identified reliably.
+     * When load-via-partition-root is set or forced, get the root table name
+     * for the partition table, so that we can reload data through the root
+     * table.  Then construct a comment to be inserted into the TOC entry's
+     * defn field, so that such cases can be identified reliably.
      */
-    if (dopt->load_via_partition_root && tbinfo->ispartition)
+    if (tbinfo->ispartition &&
+        (dopt->load_via_partition_root ||
+         forcePartitionRootLoad(tbinfo)))
     {
         TableInfo  *parentTbinfo;

@@ -6764,6 +6798,76 @@ getInherits(Archive *fout, int *numInherits)
     return inhinfo;
 }

+/*
+ * getPartitioningInfo
+ *      get information about partitioning
+ *
+ * For the most part, we only collect partitioning info about tables we
+ * intend to dump.  However, this function has to consider all partitioned
+ * tables in the database, because we need to know about parents of partitions
+ * we are going to dump even if the parents themselves won't be dumped.
+ *
+ * Specifically, what we need to know is whether each partitioned table
+ * has an "unsafe" partitioning scheme that requires us to force
+ * load-via-partition-root mode for its children.  Currently the only case
+ * for which we force that is hash partitioning on enum columns, since the
+ * hash codes depend on enum value OIDs which won't be replicated across
+ * dump-and-reload.  There are other cases in which load-via-partition-root
+ * might be necessary, but we expect users to cope with them.
+ */
+void
+getPartitioningInfo(Archive *fout)
+{
+    PQExpBuffer query;
+    PGresult   *res;
+    int            ntups;
+
+    /* no partitions before v10 */
+    if (fout->remoteVersion < 100000)
+        return;
+    /* needn't bother if schema-only dump */
+    if (fout->dopt->schemaOnly)
+        return;
+
+    query = createPQExpBuffer();
+
+    /*
+     * Unsafe partitioning schemes are exactly those for which hash enum_ops
+     * appears among the partition opclasses.  We needn't check partstrat.
+     *
+     * Note that this query may well retrieve info about tables we aren't
+     * going to dump and hence have no lock on.  That's okay since we need not
+     * invoke any unsafe server-side functions.
+     */
+    appendPQExpBufferStr(query,
+                         "SELECT partrelid FROM pg_partitioned_table WHERE\n"
+                         "(SELECT c.oid FROM pg_opclass c JOIN pg_am a "
+                         "ON c.opcmethod = a.oid\n"
+                         "WHERE opcname = 'enum_ops' "
+                         "AND opcnamespace = 'pg_catalog'::regnamespace "
+                         "AND amname = 'hash') = ANY(partclass)");
+
+    res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);
+
+    ntups = PQntuples(res);
+
+    for (int i = 0; i < ntups; i++)
+    {
+        Oid            tabrelid = atooid(PQgetvalue(res, i, 0));
+        TableInfo  *tbinfo;
+
+        tbinfo = findTableByOid(tabrelid);
+        if (tbinfo == NULL)
+            pg_fatal("failed sanity check, table OID %u appearing in pg_partitioned_table not found",
+                     tabrelid);
+        tbinfo->unsafe_partitions = true;
+    }
+
+    PQclear(res);
+
+    destroyPQExpBuffer(query);
+}
+
 /*
  * getIndexes
  *      get information about every index on a dumpable table
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index cdca0b993d..ffa37265c8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -318,6 +318,7 @@ typedef struct _tableInfo
     bool        dummy_view;        /* view's real definition must be postponed */
     bool        postponed_def;    /* matview must be postponed into post-data */
     bool        ispartition;    /* is table a partition? */
+    bool        unsafe_partitions;    /* is it an unsafe partitioned table? */

     /*
      * These fields are computed only if we decide the table is interesting
@@ -714,6 +715,7 @@ extern ConvInfo *getConversions(Archive *fout, int *numConversions);
 extern TableInfo *getTables(Archive *fout, int *numTables);
 extern void getOwnedSeqs(Archive *fout, TableInfo tblinfo[], int numTables);
 extern InhInfo *getInherits(Archive *fout, int *numInherits);
+extern void getPartitioningInfo(Archive *fout);
 extern void getIndexes(Archive *fout, TableInfo tblinfo[], int numTables);
 extern void getExtendedStatistics(Archive *fout);
 extern void getConstraints(Archive *fout, TableInfo tblinfo[], int numTables);
--
2.31.1
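
As a concrete illustration of the failure mode described in 0004's commit
message, here is a minimal sketch of a schema that hits it (the type, table,
and partition names are illustrative only, not taken from the patch or its
tests):

    CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');

    CREATE TABLE moods (id int, m mood) PARTITION BY HASH (m);
    CREATE TABLE moods_p0 PARTITION OF moods
        FOR VALUES WITH (MODULUS 2, REMAINDER 0);
    CREATE TABLE moods_p1 PARTITION OF moods
        FOR VALUES WITH (MODULUS 2, REMAINDER 1);

    INSERT INTO moods SELECT g, 'happy' FROM generate_series(1, 10) g;

    -- Which leaf each row lands in is decided by hashing the enum value's OID.
    SELECT tableoid::regclass AS leaf, count(*) FROM moods GROUP BY 1;

Because a restore into a fresh database assigns new OIDs to the enum labels,
a per-leaf COPY from an old dump can aim rows at the wrong partition; loading
through the root ("moods" here) re-runs tuple routing and avoids the problem.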

From 5836a653373a382ee3d3b287ab5377230852a554 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Mar 2023 19:13:35 -0400
Subject: [PATCH v2 5/6] Don't create useless TableAttachInfo objects.

It's silly to create a TableAttachInfo object that we're not
going to dump, when we know perfectly well at creation time
that it won't be dumped.

Discussion: https://postgr.es/m/1376149.1675268279@sss.pgh.pa.us
---
 src/bin/pg_dump/common.c  | 3 ++-
 src/bin/pg_dump/pg_dump.c | 3 ---
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index 44558b60e8..df48d0bb17 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -336,7 +336,8 @@ flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
         }

         /* Create TableAttachInfo object if needed */
-        if (tblinfo[i].dobj.dump && tblinfo[i].ispartition)
+        if ((tblinfo[i].dobj.dump & DUMP_COMPONENT_DEFINITION) &&
+            tblinfo[i].ispartition)
         {
             TableAttachInfo *attachinfo;

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 7f40b455af..a4a06b4a20 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -16120,9 +16120,6 @@ dumpTableAttach(Archive *fout, const TableAttachInfo *attachinfo)
     if (dopt->dataOnly)
         return;

-    if (!(attachinfo->partitionTbl->dobj.dump & DUMP_COMPONENT_DEFINITION))
-        return;
-
     q = createPQExpBuffer();

     if (!fout->is_prepared[PREPQUERY_DUMPTABLEATTACH])
--
2.31.1

From dcdf1c516d06afa75759d38e0903b8585b95ef72 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Mar 2023 19:16:48 -0400
Subject: [PATCH v2 6/6] Simplify pg_dump's creation of parent-table links.

Instead of trying to optimize this by skipping creation of the
links for tables we don't plan to dump, just create them all in
bulk with a single scan over the pg_inherits data.  The previous
approach was more or less O(N^2) in the number of pg_inherits
entries, not to mention being way too complicated.

Discussion: https://postgr.es/m/1376149.1675268279@sss.pgh.pa.us
---
 src/bin/pg_dump/common.c  | 125 ++++++++++++++++----------------------
 src/bin/pg_dump/pg_dump.h |   5 +-
 2 files changed, 54 insertions(+), 76 deletions(-)

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index df48d0bb17..5d988986ed 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -83,8 +83,6 @@ static void flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
                           InhInfo *inhinfo, int numInherits);
 static void flagInhIndexes(Archive *fout, TableInfo *tblinfo, int numTables);
 static void flagInhAttrs(DumpOptions *dopt, TableInfo *tblinfo, int numTables);
-static void findParentsByOid(TableInfo *self,
-                             InhInfo *inhinfo, int numInherits);
 static int    strInArray(const char *pattern, char **arr, int arr_size);
 static IndxInfo *findIndexByOid(Oid oid);

@@ -288,45 +286,70 @@ static void
 flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
               InhInfo *inhinfo, int numInherits)
 {
+    TableInfo  *child = NULL;
+    TableInfo  *parent = NULL;
     int            i,
                 j;

-    for (i = 0; i < numTables; i++)
+    /*
+     * Set up links from child tables to their parents.
+     *
+     * We used to attempt to skip this work for tables that are not to be
+     * dumped; but the optimizable cases are rare in practice, and setting up
+     * these links in bulk is cheaper than the old way.  (Note in particular
+     * that it's very rare for a child to have more than one parent.)
+     */
+    for (i = 0; i < numInherits; i++)
     {
-        bool        find_parents = true;
-        bool        mark_parents = true;
-
-        /* Some kinds never have parents */
-        if (tblinfo[i].relkind == RELKIND_SEQUENCE ||
-            tblinfo[i].relkind == RELKIND_VIEW ||
-            tblinfo[i].relkind == RELKIND_MATVIEW)
-            continue;
-
         /*
-         * Normally, we don't bother computing anything for non-target tables.
-         * However, we must find the parents of non-root partitioned tables in
-         * any case, so that we can trace from leaf partitions up to the root
-         * (in case a leaf is to be dumped but its parents are not).  We need
-         * not mark such parents interesting for getTableAttrs, though.
+         * Skip a hashtable lookup if it's same table as last time.  This is
+         * unlikely for the child, but less so for the parent.  (Maybe we
+         * should ask the backend for a sorted array to make it more likely?
+         * Not clear the sorting effort would be repaid, though.)
          */
-        if (!tblinfo[i].dobj.dump)
+        if (child == NULL ||
+            child->dobj.catId.oid != inhinfo[i].inhrelid)
         {
-            mark_parents = false;
+            child = findTableByOid(inhinfo[i].inhrelid);

-            if (!(tblinfo[i].relkind == RELKIND_PARTITIONED_TABLE &&
-                  tblinfo[i].ispartition))
-                find_parents = false;
+            /*
+             * If we find no TableInfo, assume the pg_inherits entry is for a
+             * partitioned index, which we don't need to track.
+             */
+            if (child == NULL)
+                continue;
         }
+        if (parent == NULL ||
+            parent->dobj.catId.oid != inhinfo[i].inhparent)
+        {
+            parent = findTableByOid(inhinfo[i].inhparent);
+            if (parent == NULL)
+                pg_fatal("failed sanity check, parent OID %u of table \"%s\" (OID %u) not found",
+                         inhinfo[i].inhparent,
+                         child->dobj.name,
+                         child->dobj.catId.oid);
+        }
+        /* Add this parent to the child's list of parents. */
+        if (child->numParents > 0)
+            child->parents = pg_realloc_array(child->parents,
+                                              TableInfo *,
+                                              child->numParents + 1);
+        else
+            child->parents = pg_malloc_array(TableInfo *, 1);
+        child->parents[child->numParents++] = parent;
+    }

-        /* If needed, find all the immediate parent tables. */
-        if (find_parents)
-            findParentsByOid(&tblinfo[i], inhinfo, numInherits);
-
+    /*
+     * Now consider all child tables and mark parents interesting as needed.
+     */
+    for (i = 0; i < numTables; i++)
+    {
         /*
          * If needed, mark the parents as interesting for getTableAttrs and
-         * getIndexes.
+         * getIndexes.  We only need this for direct parents of dumpable
+         * tables.
          */
-        if (mark_parents)
+        if (tblinfo[i].dobj.dump)
         {
             int            numParents = tblinfo[i].numParents;
             TableInfo **parents = tblinfo[i].parents;
@@ -996,52 +1019,6 @@ findOwningExtension(CatalogId catalogId)
 }


-/*
- * findParentsByOid
- *      find a table's parents in tblinfo[]
- */
-static void
-findParentsByOid(TableInfo *self,
-                 InhInfo *inhinfo, int numInherits)
-{
-    Oid            oid = self->dobj.catId.oid;
-    int            i,
-                j;
-    int            numParents;
-
-    numParents = 0;
-    for (i = 0; i < numInherits; i++)
-    {
-        if (inhinfo[i].inhrelid == oid)
-            numParents++;
-    }
-
-    self->numParents = numParents;
-
-    if (numParents > 0)
-    {
-        self->parents = pg_malloc_array(TableInfo *, numParents);
-        j = 0;
-        for (i = 0; i < numInherits; i++)
-        {
-            if (inhinfo[i].inhrelid == oid)
-            {
-                TableInfo  *parent;
-
-                parent = findTableByOid(inhinfo[i].inhparent);
-                if (parent == NULL)
-                    pg_fatal("failed sanity check, parent OID %u of table \"%s\" (OID %u) not found",
-                             inhinfo[i].inhparent,
-                             self->dobj.name,
-                             oid);
-                self->parents[j++] = parent;
-            }
-        }
-    }
-    else
-        self->parents = NULL;
-}
-
 /*
  * parseOidArray
  *      parse a string of numbers delimited by spaces into a character array
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index ffa37265c8..283cd1a602 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -320,6 +320,9 @@ typedef struct _tableInfo
     bool        ispartition;    /* is table a partition? */
     bool        unsafe_partitions;    /* is it an unsafe partitioned table? */

+    int            numParents;        /* number of (immediate) parent tables */
+    struct _tableInfo **parents;    /* TableInfos of immediate parents */
+
     /*
      * These fields are computed only if we decide the table is interesting
      * (it's either a table to dump, or a direct parent of a dumpable table).
@@ -351,8 +354,6 @@ typedef struct _tableInfo
     /*
      * Stuff computed only for dumpable tables.
      */
-    int            numParents;        /* number of (immediate) parent tables */
-    struct _tableInfo **parents;    /* TableInfos of immediate parents */
     int            numIndexes;        /* number of indexes */
     struct _indxInfo *indexes;    /* indexes */
     struct _tableDataInfo *dataObj; /* TableDataInfo, if dumping its data */
--
2.31.1


Re: pg_dump versus hash partitioning

From
Julien Rouhaud
Date:
On Mon, Mar 13, 2023 at 07:39:12PM -0400, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
> >> The trick is to detect in pg_restore whether pg_dump chose to do
> >> load-via-partition-root.
>
> > Given that this approach wouldn't help with existing dump files (at least if
> > using COPY; in any case the ones using INSERT are doomed), I'm slightly in
> > favor of the first approach, and of later adding an easy, non-magic way to
> > produce dumps that don't depend on partitioning.
>
> Yeah, we need to do both.  Attached find an updated patch series:

I didn't find a CF entry; is that intentional?

> 0001: TAP test that exhibits both this deadlock problem and the
> different-hash-codes problem.  I'm not sure if we want to commit
> this, or if it should be in exactly this form --- the second set
> of tests with a manual --load-via-partition-root switch will be
> pretty redundant after this patch series.

I think at least the first set of tests should be committed.  I would also be
happy to see some tests for the --insert case, even if those are technically
redundant too, as they're quite cheap to run once you've set up the cluster.

> 0002: Make pg_restore detect load-via-partition-root by examining the
> COPY commands embedded in the dump, and skip the TRUNCATE if so,
> thereby fixing the deadlock issue.  This is the best we can do for
> legacy dump files, I think, but it should be good enough.

is_load_via_partition_root():

+ * In newer archive files this can be detected by checking for a special
+ * comment placed in te->defn.  In older files we have to fall back to seeing
+ * if the COPY statement targets the named table or some other one.  This
+ * will not work for data dumped as INSERT commands, so we could give a false
+ * negative in that case; fortunately, that's a rarely-used option.

I'm not sure if you intend to keep the current 0002 - 0003 separation, but if
so, the part about te->defn and the possible fallback should be moved to 0003.
In 0002 we're only looking at the COPY statement.
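
For illustration, with a hypothetical root table "moods" and leaf "moods_p0":
a dump made with --load-via-partition-root still contains one data item per
leaf, but each item's embedded COPY targets the root rather than the leaf,
which is what this fallback check keys on.  Roughly:

    -- default dump: the leaf's data item COPYs into the leaf itself
    COPY public.moods_p0 (id, m) FROM stdin;

    -- --load-via-partition-root dump: the leaf's data item COPYs into the root
    COPY public.moods (id, m) FROM stdin;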

-                    * the run then we wrap the COPY in a transaction and
-                    * precede it with a TRUNCATE.  If archiving is not on
-                    * this prevents WAL-logging the COPY.  This obtains a
-                    * speedup similar to that from using single_txn mode in
-                    * non-parallel restores.
+                    * the run and we are not restoring a
+                    * load-via-partition-root data item then we wrap the COPY
+                    * in a transaction and precede it with a TRUNCATE.  If
+                    * archiving is not on this prevents WAL-logging the COPY.
+                    * This obtains a speedup similar to that from using
+                    * single_txn mode in non-parallel restores.

I know you're only inheriting this comment, but isn't it well outdated and not
accurate anymore?  I'm assuming that "archiving is not on" was an acceptable
way to mean "wal_level < archive" at some point, but it's now completely
misleading.

Minor nitpicking:
- should the function name be prefixed with a "_", like almost all nearby code?
- should there be an assert that the given TOC entry is indeed a TABLE DATA entry?

FWIW, it unsurprisingly fixes the problem with my original use case.

> 0003: Also detect load-via-partition-root by adding a label in the
> dump.  This is a more bulletproof solution going forward.
>
> 0004-0006: same as previous patches, but rebased over these.
> This gets us to a place where the new TAP test passes.

+getPartitioningInfo(Archive *fout)
+{
+   PQExpBuffer query;
+   PGresult   *res;
+   int         ntups;
+
+   /* no partitions before v10 */
+   if (fout->remoteVersion < 100000)
+       return;
+ [...]
+   /*
+    * Unsafe partitioning schemes are exactly those for which hash enum_ops
+    * appears among the partition opclasses.  We needn't check partstrat.
+    *
+    * Note that this query may well retrieve info about tables we aren't
+    * going to dump and hence have no lock on.  That's okay since we need not
+    * invoke any unsafe server-side functions.
+    */
+   appendPQExpBufferStr(query,
+                        "SELECT partrelid FROM pg_partitioned_table WHERE\n"
+                        "(SELECT c.oid FROM pg_opclass c JOIN pg_am a "
+                        "ON c.opcmethod = a.oid\n"
+                        "WHERE opcname = 'enum_ops' "
+                        "AND opcnamespace = 'pg_catalog'::regnamespace "
+                        "AND amname = 'hash') = ANY(partclass)");

Hash partitioning was added in pg11; should we bypass pg10 too, with a comment
saying that we only care about hash, at least for the foreseeable future?
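
Run by hand in psql, the catalog check quoted above amounts to roughly the
following; the ::regclass cast is only added here for readable output:

    SELECT partrelid::regclass
    FROM pg_partitioned_table
    WHERE (SELECT c.oid
           FROM pg_opclass c JOIN pg_am a ON c.opcmethod = a.oid
           WHERE opcname = 'enum_ops'
             AND opcnamespace = 'pg_catalog'::regnamespace
             AND amname = 'hash') = ANY (partclass);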

Other than that, the patchset looks quite good to me, modulo the upcoming doc
changes.

More generally, I also think that forcing --load-via-partition-root for
known unsafe partitioning seems like the best compromise.  I'm not sure if we
should have an option to turn it off though.



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Mon, Mar 13, 2023 at 07:39:12PM -0400, Tom Lane wrote:
>> Yeah, we need to do both.  Attached find an updated patch series:

> I didn't find a CF entry; is that intentional?

Yeah, it's there:

https://commitfest.postgresql.org/42/4226/

> I'm not sure if you intend to keep the current 0002 - 0003 separation, but if
> so, the part about te->defn and the possible fallback should be moved to 0003.
> In 0002 we're only looking at the COPY statement.

I was intending to smash it all into one commit --- the separation is
just to ease review.

> I know you're only inheriting this comment, but isn't it well outdated and not
> accurate anymore?  I'm assuming that "archiving is not on" was an acceptable
> way to mean "wal_level < archive" at some point, but it's now completely
> misleading.

Sure, want to propose wording?

> Hash partitioning was added in pg11; should we bypass pg10 too, with a comment
> saying that we only care about hash, at least for the foreseeable future?

Good point.  With v10 already out of support, it hardly matters in the
real world, but we might as well not expend cycles when we clearly
needn't.

> More generally, I also think that forcing --load-via-partition-root for
> known unsafe partitioning seems like the best compromise.  I'm not sure if we
> should have an option to turn it off though.

I think the odds of that yielding a usable dump are nil, so I don't
see why we should bother.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Julien Rouhaud
Date:
On Thu, Mar 16, 2023 at 08:43:56AM -0400, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > On Mon, Mar 13, 2023 at 07:39:12PM -0400, Tom Lane wrote:
> >> Yeah, we need to do both.  Attached find an updated patch series:
>
> > I didn't find a CF entry; is that intentional?
>
> Yeah, it's there:
>
> https://commitfest.postgresql.org/42/4226/

Ah, thanks!  I was searching for "pg_dump" on the page.

> > I'm not sure if you intend to keep the current 0002 - 0003 separation, but if
> > so, the part about te->defn and the possible fallback should be moved to 0003.
> > In 0002 we're only looking at the COPY statement.
>
> I was intending to smash it all into one commit --- the separation is
> just to ease review.

OK, no problem then, and I agree it's better to squash them.

> > I know you're only inheriting this comment, but isn't it well outdated and not
> > accurate anymore?  I'm assuming that "archiving is not on" was an acceptable
> > way to mean "wal_level < archive" at some point, but it's now completely
> > misleading.
>
> Sure, want to propose wording?

Just mention the exact wal_level; something like:

                     * [...].  If wal_level is set to minimal this prevents
                     * WAL-logging the COPY.  This obtains a speedup similar to
                     * [...]


> > More generally, I also think that forcing --load-via-partition-root for
> > known unsafe partitioning seems like the best compromise.  I'm not sure if we
> > should have an option to turn it off though.
>
> I think the odds of that yielding a usable dump are nil, so I don't
> see why we should bother.

No objection from me.



Re: pg_dump versus hash partitioning

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Thu, Mar 16, 2023 at 08:43:56AM -0400, Tom Lane wrote:
>> I think the odds of that yielding a usable dump are nil, so I don't
>> see why we should bother.

> No objection from me.

OK, pushed with the discussed changes.

            regards, tom lane



Re: pg_dump versus hash partitioning

From
Julien Rouhaud
Date:
On Fri, Mar 17, 2023 at 01:44:12PM -0400, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > On Thu, Mar 16, 2023 at 08:43:56AM -0400, Tom Lane wrote:
> >> I think the odds of that yielding a usable dump are nil, so I don't
> >> see why we should bother.
>
> > No objection from me.
>
> OK, pushed with the discussed changes.

Great news, thanks a lot!