Thread: pg_dump partitions can lead to inconsistent state after restore

pg_dump partitions can lead to inconsistent state after restore

From
Alvaro Herrera
Date:
Per my comment at https://postgr.es/m/20190422225129.GA6126@alvherre.pgsql
I think that pg_dump can possibly cause bogus partition definitions,
when the users explicitly decide to join tables as partitions that have
different column ordering than the parent table.  Any COPY or INSERT
command without an explicit column list that tries to put tuples in the
table will fail after the restore.

Tom Lane said:

> I haven't looked at the partitioning code, but I am quite sure that that's
> always happened for old-style inheritance children, and I imagine pg_dump
> is just duplicating that old behavior.

Actually, the new code is unrelated to the old one; for legacy
inheritance, the children are always created exactly as they were
created at definition time.  If you use ALTER TABLE ... INHERITS
(attach a table as a children after creation) then obviously the child
table cannot be modified to match its new parent; and pg_dump reproduces
the exact column ordering that the table originally had.  If you use
"CREATE TABLE ... INHERITS (parent)" then the child columns are reordered
*at that point* (creation time); the dump will, again, reproduce the
exact same definition.


I think failing to reproduce the exact same definition is a pg_dump bug
that should be fixed and backpatched to pg10.  It's just sheer luck that
nobody has complained of being bitten by it.

-- 
Álvaro Herrera                                http://www.twitter.com/alvherre



Re: pg_dump partitions can lead to inconsistent state after restore

From
David Rowley
Date:
On Wed, 24 Apr 2019 at 06:50, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Per my comment at https://postgr.es/m/20190422225129.GA6126@alvherre.pgsql
> I think that pg_dump can possibly cause bogus partition definitions,
> when the users explicitly decide to join tables as partitions that have
> different column ordering than the parent table.  Any COPY or INSERT
> command without an explicit column list that tries to put tuples in the
> table will fail after the restore.

Yeah, pg_dump itself is broken here, never mind dreaming up some other
user command.

We do use a column list when doing COPY, but with --inserts (not
--column-inserts) we don't include a column list.

All it takes is:

postgres=# create table listp (a int, b text) partition by list(a);
CREATE TABLE
postgres=# create table listp1 (b text, a int);
CREATE TABLE
postgres=# alter table listp attach partition listp1 for values in(1);
ALTER TABLE
postgres=# insert into listp values(1,'One');
INSERT 0 1
postgres=# \q

$ createdb test1
$ pg_dump --inserts postgres | psql test1
...
ERROR:  invalid input syntax for type integer: "One"
LINE 1: INSERT INTO public.listp1 VALUES ('One', 1);

That settles the debate on the other thread...

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump partitions can lead to inconsistent state after restore

From
Amit Langote
Date:
On 2019/04/24 10:19, David Rowley wrote:
> On Wed, 24 Apr 2019 at 06:50, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Per my comment at https://postgr.es/m/20190422225129.GA6126@alvherre.pgsql
>> I think that pg_dump can possibly cause bogus partition definitions,
>> when the users explicitly decide to join tables as partitions that have
>> different column ordering than the parent table.  Any COPY or INSERT
>> command without an explicit column list that tries to put tuples in the
>> table will fail after the restore.
> 
> Yeah, pg_dump itself is broken here, never mind dreaming up some other
> user command.
> 
> We do use a column list when doing COPY, but with --inserts (not
> --column-inserts) we don't include a column list.
> 
> All it takes is:
> 
> postgres=# create table listp (a int, b text) partition by list(a);
> CREATE TABLE
> postgres=# create table listp1 (b text, a int);
> CREATE TABLE
> postgres=# alter table listp attach partition listp1 for values in(1);
> ALTER TABLE
> postgres=# insert into listp values(1,'One');
> INSERT 0 1
> postgres=# \q
> 
> $ createdb test1
> $ pg_dump --inserts postgres | psql test1
> ...
> ERROR:  invalid input syntax for type integer: "One"
> LINE 1: INSERT INTO public.listp1 VALUES ('One', 1);
> 
> That settles the debate on the other thread...

+1 to fixing this, although +0.5 to back-patching.

The reason no one has complained so far of being bitten by this may be
that, as each of one us has said at least once on the other thread, users
are not very likely to create partitions with different column orders to
begin with.  Maybe, that isn't a reason to leave it as is though.

Thanks,
Amit




Re: pg_dump partitions can lead to inconsistent state after restore

From
David Rowley
Date:
On Wed, 24 Apr 2019 at 14:53, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2019/04/24 10:19, David Rowley wrote:
> > ERROR:  invalid input syntax for type integer: "One"
> > LINE 1: INSERT INTO public.listp1 VALUES ('One', 1);
> >
> > That settles the debate on the other thread...
>
> +1 to fixing this, although +0.5 to back-patching.
>
> The reason no one has complained so far of being bitten by this may be
> that, as each of one us has said at least once on the other thread, users
> are not very likely to create partitions with different column orders to
> begin with.  Maybe, that isn't a reason to leave it as is though.

Well, you could probably class most of the bugs that make their way
through feature freeze, alpha and beta as unlikely.  I don't think
that gives us an excuse to leave them as bugs.  If someone reported it
we'd most likely go and fix it then anyway, so I really don't see the
point in waiting until then.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump partitions can lead to inconsistent state after restore

From
Alvaro Herrera
Date:
So, while testing this I noticed that pg_restore fails with deadlocks if
you do a parallel restore if the --load-via-partition-root switch was
given to pg_dump.  Is that a known bug?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump partitions can lead to inconsistent state after restore

From
Amit Langote
Date:
On Thu, Apr 25, 2019 at 4:01 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> So, while testing this I noticed that pg_restore fails with deadlocks if
> you do a parallel restore if the --load-via-partition-root switch was
> given to pg_dump.  Is that a known bug?

Was investigating --load-via-partition-root with a coworker and came
across the following note in the documentation:

https://www.postgresql.org/docs/11/app-pgdump.html

"It is best not to use parallelism when restoring from an archive made
with this option, because pg_restore will not know exactly which
partition(s) a given archive data item will load data into. This could
result in inefficiency due to lock conflicts between parallel jobs, or
perhaps even reload failures due to foreign key constraints being set
up before all the relevant data is loaded."

Apparently, this note was added as a result of the following discussion:

https://www.postgresql.org/message-id/flat/13624.1535486019%40sss.pgh.pa.us

So, while the documentation doesn't explicitly list deadlocks as
possible risk, Tom hinted in the first email that it's possible.

I set out to reproduce one and was able to, although I'm not sure if
it's the same deadlock as seen by Alvaro.  Steps I used to reproduce:

# in the source database

create table foo (a int primary key);
insert into foo select generate_series(1, 1000000);
create table ht (a int) partition by hash (a);
select 'create table ht' || i || ' partition of ht for values with
(modulus 100, remainder ' || i -1 || ');' from generate_series(1, 100)
i;
\gexec
insert into ht select generate_series(1, 1000000);
alter table ht add foreign key (a) references foo (a);

# in shell
pg_dump --load-via-partition-root -Fd -f /tmp/dump
createdb targetdb
pg_restore -d targetdb -j 2 /tmp/dump

The last step reports deadlocks; in the server log:

ERROR:  deadlock detected
DETAIL:  Process 14213 waits for RowExclusiveLock on relation 17447 of
database 17443; blocked by process 14212.
        Process 14212 waits for ShareRowExclusiveLock on relation
17507 of database 17443; blocked by process 14213.
        Process 14213: COPY public.ht (a) FROM stdin;

        Process 14212: ALTER TABLE public.ht
            ADD CONSTRAINT ht_a_fkey FOREIGN KEY (a) REFERENCES public.foo(a);

Here, the process adding the foreign key has got the lock on the
parent and trying to lock a partition to add the foreign key to it.
The process doing COPY (via root) has apperently locked the partition
and waiting for the lock on the parent to do actual copying.  Looking
into why the latter had got a lock on the partition at all if it
hasn't started the copying yet, I noticed that it was locked when
TRUNCATE was executed on it earlier in the same transaction as part of
some WAL-related optimization, which is something that only happens in
the parallel restore mode.  I was under the impression that the TABLE
DATA archive item (its TocEntry) would have no trace of the partition
if it was dumped with --load-via-partition-root, but that's not the
case.  --load-via-partition-root only dictates that the command that
will be dumped for the item will use the root parent as COPY target,
even though the TocEntry itself is owned by the partition.

Maybe, a way to prevent the deadlock would be for the process that
will do copy-into-given-partition-via-root-parent to do a `LOCK TABLE
root_parent` before `TRUNCATE the_partition`, but we'll need to get
hold of the root parent from the TocEntry somehow.  Turns out it's
only present in the TocEntry.copyStmt, from where it will have to
parsed out.  Maybe that's the only thing we could do without breaking
the archive format though.

Thoughts?

By the way, I couldn't think of ways to reproduce any of the hazards
mentioned in the documentations of using parallel mode to restore an
archive written with pg_dump --load-via-root-parent, but maybe I just
haven't tried hard enough.

Thanks,
Amit