Thread: pg_dump partitions can lead to inconsistent state after restore
Per my comment at https://postgr.es/m/20190422225129.GA6126@alvherre.pgsql I think that pg_dump can possibly cause bogus partition definitions, when the users explicitly decide to join tables as partitions that have different column ordering than the parent table. Any COPY or INSERT command without an explicit column list that tries to put tuples in the table will fail after the restore. Tom Lane said: > I haven't looked at the partitioning code, but I am quite sure that that's > always happened for old-style inheritance children, and I imagine pg_dump > is just duplicating that old behavior. Actually, the new code is unrelated to the old one; for legacy inheritance, the children are always created exactly as they were created at definition time. If you use ALTER TABLE ... INHERITS (attach a table as a children after creation) then obviously the child table cannot be modified to match its new parent; and pg_dump reproduces the exact column ordering that the table originally had. If you use "CREATE TABLE ... INHERITS (parent)" then the child columns are reordered *at that point* (creation time); the dump will, again, reproduce the exact same definition. I think failing to reproduce the exact same definition is a pg_dump bug that should be fixed and backpatched to pg10. It's just sheer luck that nobody has complained of being bitten by it. -- Álvaro Herrera http://www.twitter.com/alvherre
On Wed, 24 Apr 2019 at 06:50, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Per my comment at https://postgr.es/m/20190422225129.GA6126@alvherre.pgsql > I think that pg_dump can possibly cause bogus partition definitions, > when the users explicitly decide to join tables as partitions that have > different column ordering than the parent table. Any COPY or INSERT > command without an explicit column list that tries to put tuples in the > table will fail after the restore. Yeah, pg_dump itself is broken here, never mind dreaming up some other user command. We do use a column list when doing COPY, but with --inserts (not --column-inserts) we don't include a column list. All it takes is: postgres=# create table listp (a int, b text) partition by list(a); CREATE TABLE postgres=# create table listp1 (b text, a int); CREATE TABLE postgres=# alter table listp attach partition listp1 for values in(1); ALTER TABLE postgres=# insert into listp values(1,'One'); INSERT 0 1 postgres=# \q $ createdb test1 $ pg_dump --inserts postgres | psql test1 ... ERROR: invalid input syntax for type integer: "One" LINE 1: INSERT INTO public.listp1 VALUES ('One', 1); That settles the debate on the other thread... -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2019/04/24 10:19, David Rowley wrote: > On Wed, 24 Apr 2019 at 06:50, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> Per my comment at https://postgr.es/m/20190422225129.GA6126@alvherre.pgsql >> I think that pg_dump can possibly cause bogus partition definitions, >> when the users explicitly decide to join tables as partitions that have >> different column ordering than the parent table. Any COPY or INSERT >> command without an explicit column list that tries to put tuples in the >> table will fail after the restore. > > Yeah, pg_dump itself is broken here, never mind dreaming up some other > user command. > > We do use a column list when doing COPY, but with --inserts (not > --column-inserts) we don't include a column list. > > All it takes is: > > postgres=# create table listp (a int, b text) partition by list(a); > CREATE TABLE > postgres=# create table listp1 (b text, a int); > CREATE TABLE > postgres=# alter table listp attach partition listp1 for values in(1); > ALTER TABLE > postgres=# insert into listp values(1,'One'); > INSERT 0 1 > postgres=# \q > > $ createdb test1 > $ pg_dump --inserts postgres | psql test1 > ... > ERROR: invalid input syntax for type integer: "One" > LINE 1: INSERT INTO public.listp1 VALUES ('One', 1); > > That settles the debate on the other thread... +1 to fixing this, although +0.5 to back-patching. The reason no one has complained so far of being bitten by this may be that, as each of one us has said at least once on the other thread, users are not very likely to create partitions with different column orders to begin with. Maybe, that isn't a reason to leave it as is though. Thanks, Amit
On Wed, 24 Apr 2019 at 14:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > On 2019/04/24 10:19, David Rowley wrote: > > ERROR: invalid input syntax for type integer: "One" > > LINE 1: INSERT INTO public.listp1 VALUES ('One', 1); > > > > That settles the debate on the other thread... > > +1 to fixing this, although +0.5 to back-patching. > > The reason no one has complained so far of being bitten by this may be > that, as each of one us has said at least once on the other thread, users > are not very likely to create partitions with different column orders to > begin with. Maybe, that isn't a reason to leave it as is though. Well, you could probably class most of the bugs that make their way through feature freeze, alpha and beta as unlikely. I don't think that gives us an excuse to leave them as bugs. If someone reported it we'd most likely go and fix it then anyway, so I really don't see the point in waiting until then. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
So, while testing this I noticed that pg_restore fails with deadlocks if you do a parallel restore if the --load-via-partition-root switch was given to pg_dump. Is that a known bug? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 25, 2019 at 4:01 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > So, while testing this I noticed that pg_restore fails with deadlocks if > you do a parallel restore if the --load-via-partition-root switch was > given to pg_dump. Is that a known bug? Was investigating --load-via-partition-root with a coworker and came across the following note in the documentation: https://www.postgresql.org/docs/11/app-pgdump.html "It is best not to use parallelism when restoring from an archive made with this option, because pg_restore will not know exactly which partition(s) a given archive data item will load data into. This could result in inefficiency due to lock conflicts between parallel jobs, or perhaps even reload failures due to foreign key constraints being set up before all the relevant data is loaded." Apparently, this note was added as a result of the following discussion: https://www.postgresql.org/message-id/flat/13624.1535486019%40sss.pgh.pa.us So, while the documentation doesn't explicitly list deadlocks as possible risk, Tom hinted in the first email that it's possible. I set out to reproduce one and was able to, although I'm not sure if it's the same deadlock as seen by Alvaro. Steps I used to reproduce: # in the source database create table foo (a int primary key); insert into foo select generate_series(1, 1000000); create table ht (a int) partition by hash (a); select 'create table ht' || i || ' partition of ht for values with (modulus 100, remainder ' || i -1 || ');' from generate_series(1, 100) i; \gexec insert into ht select generate_series(1, 1000000); alter table ht add foreign key (a) references foo (a); # in shell pg_dump --load-via-partition-root -Fd -f /tmp/dump createdb targetdb pg_restore -d targetdb -j 2 /tmp/dump The last step reports deadlocks; in the server log: ERROR: deadlock detected DETAIL: Process 14213 waits for RowExclusiveLock on relation 17447 of database 17443; blocked by process 14212. Process 14212 waits for ShareRowExclusiveLock on relation 17507 of database 17443; blocked by process 14213. Process 14213: COPY public.ht (a) FROM stdin; Process 14212: ALTER TABLE public.ht ADD CONSTRAINT ht_a_fkey FOREIGN KEY (a) REFERENCES public.foo(a); Here, the process adding the foreign key has got the lock on the parent and trying to lock a partition to add the foreign key to it. The process doing COPY (via root) has apperently locked the partition and waiting for the lock on the parent to do actual copying. Looking into why the latter had got a lock on the partition at all if it hasn't started the copying yet, I noticed that it was locked when TRUNCATE was executed on it earlier in the same transaction as part of some WAL-related optimization, which is something that only happens in the parallel restore mode. I was under the impression that the TABLE DATA archive item (its TocEntry) would have no trace of the partition if it was dumped with --load-via-partition-root, but that's not the case. --load-via-partition-root only dictates that the command that will be dumped for the item will use the root parent as COPY target, even though the TocEntry itself is owned by the partition. Maybe, a way to prevent the deadlock would be for the process that will do copy-into-given-partition-via-root-parent to do a `LOCK TABLE root_parent` before `TRUNCATE the_partition`, but we'll need to get hold of the root parent from the TocEntry somehow. Turns out it's only present in the TocEntry.copyStmt, from where it will have to parsed out. Maybe that's the only thing we could do without breaking the archive format though. Thoughts? By the way, I couldn't think of ways to reproduce any of the hazards mentioned in the documentations of using parallel mode to restore an archive written with pg_dump --load-via-root-parent, but maybe I just haven't tried hard enough. Thanks, Amit