Re: pg_dump partitions can lead to inconsistent state after restore - Mailing list pgsql-hackers

From Amit Langote
Subject Re: pg_dump partitions can lead to inconsistent state after restore
Date
Msg-id CA+HiwqFxACGUPeWLfjx-ySzaiRqkMYoMHxFH9-t2MvfT1TGgsg@mail.gmail.com
Whole thread Raw
In response to Re: pg_dump partitions can lead to inconsistent state after restore  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Thu, Apr 25, 2019 at 4:01 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> So, while testing this I noticed that pg_restore fails with deadlocks if
> you do a parallel restore if the --load-via-partition-root switch was
> given to pg_dump.  Is that a known bug?

Was investigating --load-via-partition-root with a coworker and came
across the following note in the documentation:

https://www.postgresql.org/docs/11/app-pgdump.html

"It is best not to use parallelism when restoring from an archive made
with this option, because pg_restore will not know exactly which
partition(s) a given archive data item will load data into. This could
result in inefficiency due to lock conflicts between parallel jobs, or
perhaps even reload failures due to foreign key constraints being set
up before all the relevant data is loaded."

Apparently, this note was added as a result of the following discussion:

https://www.postgresql.org/message-id/flat/13624.1535486019%40sss.pgh.pa.us

So, while the documentation doesn't explicitly list deadlocks as
possible risk, Tom hinted in the first email that it's possible.

I set out to reproduce one and was able to, although I'm not sure if
it's the same deadlock as seen by Alvaro.  Steps I used to reproduce:

# in the source database

create table foo (a int primary key);
insert into foo select generate_series(1, 1000000);
create table ht (a int) partition by hash (a);
select 'create table ht' || i || ' partition of ht for values with
(modulus 100, remainder ' || i -1 || ');' from generate_series(1, 100)
i;
\gexec
insert into ht select generate_series(1, 1000000);
alter table ht add foreign key (a) references foo (a);

# in shell
pg_dump --load-via-partition-root -Fd -f /tmp/dump
createdb targetdb
pg_restore -d targetdb -j 2 /tmp/dump

The last step reports deadlocks; in the server log:

ERROR:  deadlock detected
DETAIL:  Process 14213 waits for RowExclusiveLock on relation 17447 of
database 17443; blocked by process 14212.
        Process 14212 waits for ShareRowExclusiveLock on relation
17507 of database 17443; blocked by process 14213.
        Process 14213: COPY public.ht (a) FROM stdin;

        Process 14212: ALTER TABLE public.ht
            ADD CONSTRAINT ht_a_fkey FOREIGN KEY (a) REFERENCES public.foo(a);

Here, the process adding the foreign key has got the lock on the
parent and trying to lock a partition to add the foreign key to it.
The process doing COPY (via root) has apperently locked the partition
and waiting for the lock on the parent to do actual copying.  Looking
into why the latter had got a lock on the partition at all if it
hasn't started the copying yet, I noticed that it was locked when
TRUNCATE was executed on it earlier in the same transaction as part of
some WAL-related optimization, which is something that only happens in
the parallel restore mode.  I was under the impression that the TABLE
DATA archive item (its TocEntry) would have no trace of the partition
if it was dumped with --load-via-partition-root, but that's not the
case.  --load-via-partition-root only dictates that the command that
will be dumped for the item will use the root parent as COPY target,
even though the TocEntry itself is owned by the partition.

Maybe, a way to prevent the deadlock would be for the process that
will do copy-into-given-partition-via-root-parent to do a `LOCK TABLE
root_parent` before `TRUNCATE the_partition`, but we'll need to get
hold of the root parent from the TocEntry somehow.  Turns out it's
only present in the TocEntry.copyStmt, from where it will have to
parsed out.  Maybe that's the only thing we could do without breaking
the archive format though.

Thoughts?

By the way, I couldn't think of ways to reproduce any of the hazards
mentioned in the documentations of using parallel mode to restore an
archive written with pg_dump --load-via-root-parent, but maybe I just
haven't tried hard enough.

Thanks,
Amit



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: clean up docs for v12
Next
From: Dean Rasheed
Date:
Subject: Re: Multivariate MCV list vs. statistics target