Re: Suggestion for --truncate-tables to pg_restore - Mailing list pgsql-hackers

From Josh Kupershmidt
Subject Re: Suggestion for --truncate-tables to pg_restore
Date
Msg-id CAK3UJRHFHN_q6AOib7mtchwS3KXdDGyPg5awWAHOF3R=1DdjOw@mail.gmail.com
Whole thread Raw
In response to Re: Suggestion for --truncate-tables to pg_restore  ("Karl O. Pinc" <kop@meme.com>)
Responses Re: Suggestion for --truncate-tables to pg_restore  ("Karl O. Pinc" <kop@meme.com>)
Re: Suggestion for --truncate-tables to pg_restore  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi Karl,
I signed on to review this patch for the current CF. Most of the
background for the patch seems to be in the message below, so I'm
going to respond to this one first.

On Fri, Sep 21, 2012 at 8:54 AM, Karl O. Pinc <kop@meme.com> wrote:
> On 09/20/2012 12:24:49 PM, Karl O. Pinc wrote:
>
>> I've had problems using pg_restore --data-only when
>> restoring individual schemas (which contain data which
>> has had bad things done to it).  --clean does not work
>> well because of dependent objects in other schemas.

OK.

> ----
>
> First, the problem:
>
> Begin with the following structure:
>
> CREATE TABLE schemaA.foo (id PRIMARY KEY, data INT);
>
> CREATE VIEW schemaB.bar AS SELECT * FROM schemaA.foo;
>
> Then, by accident, somebody does:
>
> UPDATE schemaA.foo SET data = data + (RANDOM() * 1000)::INT;
>
> So, you want to restore the data into schemaA.foo.
> But schemaA.foo has (bad) data in it that must first
> be removed. It would seem that using
>
>   pg_restore --clean -n schemaA -t foo my_pg_dump_backup
>
> would solve the problem, it would drop schemaA.foo,
> recreate it, and then restore the data.  But this does
> not work.  schemaA.foo does not drop because it's
> got a dependent database object, schemaB.bar.

Right.

> Of course there are manual work-arounds.  One of these
> is truncating schemaA.foo and then doing a pg_restore
> with --data-only.

Simply doing TRUNCATE manually was the first thought that occurred to
me when I saw your example.

> The manual work-arounds become
> increasingly burdensome as you need to restore more
> tables.  The case that motivated me was an attempt
> to restore the data in an entire schema, one which
> contained a significant number of tables.

TBH, I didn't find the example above particularly compelling for
demonstrating the need for this feature. If you've just got one table
with dependent views which needs to be restored, it's pretty easy to
manually TRUNCATE and have pg_restore --data-only reload the table.
(And easy enough to combine the truncate and restore into a single
transaction in case anything goes wrong, if need be.)

But I'm willing to grant that this proposed feature is potentially as
useful as existing restore-jiggering options like --disable-triggers.
And I guess I could see that if you're really stuck having to perform
a --data-only restore of many tables, this feature could come in
handy.

> So, the idea here is to be able to do a data-only
> restore, first truncating the data in the tables
> being restored to remove the existing corrupted data.
>
> The proposal is to add a --truncate-tables option
> to pg_restore.
>
> ----
>
> There were some comments on syntax.
>
> I proposed to use -u as a short option.  This was
> thought confusing, given it's use in other
> Unix command line programs (mysql).   Since there's
> no obvious short option, forget it.  Just have
> a long option.

Agreed.

> Another choice is to avoid introducing yet another
> option and instead overload --clean so that when
> doing a --data-only restore --clean truncates tables
> and otherwise --clean retains the existing behavior of
> dropping and re-creating the restored objects.

I like the --truncate-tables flag idea better than overloading
--clean, since it makes the purpose immediately apparent.

> (I tested pg_restore with 9.1 and when --data-only is
> used --clean is ignored, it does not even produce a warning.
> This is arguably a bug.)

+1 for having pg_restore bail out with an error if the user specifies
--data-only --clean. By the same token, --clean and --truncate-tables
together should also raise a "not allowed" error.

> ----
>
> More serious objections were raised regarding semantics.
>
> What if, instead, the initial structure looked like:
>
> CREATE TABLE schemaA.foo
>   (id PRIMARY KEY, data INT);
>
> CREATE TABLE schemaB.bar
>   (id INT CONSTRAINT "bar_on_foo" REFERENCES foo
>  , moredata INT);
>
> With a case like this, in most real-world situations, you'd
> have to use pg_restore with --disable-triggers if you wanted
> to use --data-only and --truncate-tables.  The possibility of
> foreign key referential integrity corruption is obvious.

Why would --disable-triggers be used in this example? I don't think
you could use --truncate-tables to restore only table "foo", because
you would get:
 ERROR:  cannot truncate a table referenced in a foreign key constraint

(At least, I think so, not having tried with the actual patch.) You
could use --disable-triggers when restoring "bar". Though if you're
just enabling that option for performance purposes, and are unable to
guarantee that the restore will leave the tables in a consistent
state, well then it seems like you shouldn't use the option.

> Aside:  Unless you're restoring databases in their entirety
> the pg_restore --disable-triggers option makes it easy to
> introduce foreign key referential integrity corruption.

Yup, and I think the docs could do more to warn users about
--disable-triggers in particular. And I see you've submitted a
separate patch along those lines.

> In fact, since pg_restore does not wrap it's operations
> in one big transaction, it's easy to attempt restoration
> of a portion of a database, have part of the process
> succeed and part of it fail (due to either schema
> or data dependencies), and be left off worse
> than before you started.

That's what the --single-transaction option is for.

> So, the discussion went, pg_restore is just another
> application and introducing more options
> which could lead to corruption of referential integrity is
> a bad idea.

I do agree that increasing the ways for pg_restore to be a foot-gun is
a Bad Idea.

> But pg_restore should not be thought of as just another
> front-end.  It should be thought of as a data recovery
> tool.

I don't totally agree that charter for pg_restore should be a "data
recovery tool" (i.e. general purpose), but for the sake of this patch
we can leave that aside.

>  Recovering some data and being left with referential
> integrity problems is better than having no data.

Well, this is really a judgment call, and I have a hunch that many
admins would actually choose "none of the above". And I think this
point gets to the crux of whether the --truncate-tables option will be
useful as-is.

For your first example, --truncate-tables seems to have some use, so
that the admin isn't forced to recreate various views which may be
dependent on the table. (Though it's not too difficult to work around
this case today.)

For the second example involving FKs, I'm a little bit more hesitant
about  endorsing the use of --truncate-tables combined with
--disable-triggers to potentially allow bogus FKs. I know this is
possible to some extent today using the --disable-triggers option, but
it makes me nervous to be adding a mode to pg_restore if it's
primarily designed to work together with --disable-triggers as a
larger foot-gun.

> This
> is true even if, due to different users owning different
> schemas and so forth, nobody knows exactly what
> might be broken.
>
> Yes, but we can do better.  (The unstated sub-text being that
> we don't want to introduce an inferior feature which
> will then need to be supported forever.)
>
> How could we do better:
>
> Here I will record only the ideas related to restore,
> although there was some mention of dump as well.
>
> There has apparently been some discussion of writing
> a foreign data wrapper which would operate on a database
> dump.  This might (in ways that are not immediately
> obvious to me) address this issue.

That's an interesting idea. If you could SELECT directly out of a dump
file via FDW, you could handle the restore process purely in SQL. But
not directly relevant to this patch.

> The restore process could, based on what table data needs
> restoration, look at foreign key dependencies and produce a
> list of the tables which all must be restored into order to
> ensure foreign key referential integrity.

One mode of operation of pg_restore is outputting to a file or pipe
for subsequent processing, which of course wouldn't work with this
idea of having the restore be dependent on the target database
structure.

> [snip more discussion of pg_restore possibly reordering objects and ensuring integrity]

For the purposes of actually completing a review of the patch in
question, I'm going to avoid further blue-sky speculation here.

Just a few initial comments about the doc portion of the patch:

+        This option is only relevant when performing a data-only

If we are going to restrict the --truncate-tables option to be used
with --data-only, "only allowed" may be more clear than "only
relevant".

+       <para>
+         The <option>--disable-triggers</option> will almost always
+         always need to be used in conjunction with this option to
+         disable check constraints on foreign keys.
+       </para>

For the first example you posted, of a view dependent on a table which
needed to be restored, this advice would not be accurate. IMO it's a
little dangerous advising users to "almost always" use a foot-gun like
--disable-triggers.

I'm out of time for today, and I haven't had a chance to actually try
out the patch, but I wanted to send off my thoughts so far. I should
have a chance for another look later this week.

Josh



pgsql-hackers by date:

Previous
From: Abhijit Menon-Sen
Date:
Subject: Re: [PATCH] binary heap implementation
Next
From: Michael Paquier
Date:
Subject: Re: logical changeset generation v3