Thread: Avoid duplicated rows when restoring data from pg_dumpall ??

Avoid duplicated rows when restoring data from pg_dumpall ??

From

Pablo Alonso-Villaverde Roza

Date:

24 August 2009, 10:54:20

Hi all,

I'm getting duplicated rows in some of my tables when I try to restore data from a dump file generated with 'pg_dumpall'.
I use:
pg_dumpall -c -U postgres > mybackup.sql
to dump all the data, roles, etc. into a file.

Then I try to restore the data using:
psql -U postgres -f mybackup.sql

The only "way" I have found to solve this problem is deleting my database before restoring the data, so everything is 're-created' in the restore process without generating duplicated rows.

I thought, that the "-c" flag on pg_dumpall would force a DROP of any previous data structures on the server but...it looks like it doesn't work as I expected and ...when I restore data I get duplicated rows.

How can I get a "clean" restore automatically? I mean, getting exactly the same data there was on the server at the moment I made the 'pg_dumpall'

Excuse me for my poor English, by the way.

Thanks in advance,

Re: Avoid duplicated rows when restoring data from pg_dumpall ??

From

Tom Lane

Date:

24 August 2009, 14:47:27

Pablo Alonso-Villaverde Roza <pavroza@gmail.com> writes:
> I'm getting duplicated rows in some of my tables when I try to restore data
> from a dump file generated with 'pg_dumpall'.

Probably all of them, actually ...

> The only "way" I have found to solve this problem is deleting my database
> before restoring the data, so everything is 're-created' in the restore
> process without generating duplicated rows.

A pg_dumpall script expects to be restored into an empty cluster.  This
is not a bug.

> I thought, that the "-c" flag on pg_dumpall would force a DROP of any
> previous data structures on the server but...it looks like it doesn't work
> as I expected and ...when I restore data I get duplicated rows.

The -c flag should cause the script to DROP all your databases first.
But that switch has been known to have bugs in the past, and in any case
it could fail if there are other sessions connected to those databases.
Did you pay attention to whether the restore reported any errors?
What PG version is this?

            regards, tom lane

Re: Avoid duplicated rows when restoring data from pg_dumpall ??

From

Pablo Alonso-Villaverde Roza

Date:

26 August 2009, 07:04:33

Hi Tom, thanks for your fast reply,

- As regards the duplicated rows, no, I don't get duplicated rows in all the tables stored in the database because
some tables have primary-keys (and/or UNIQUE) constraints. These constraints don't allow the restore process to
duplicate rows. In fact, it is a kind of "solution" I've tried...(add an extra column with a primary-key or unique constraint, to this tables), and it "works". The restore process doesn't generate duplicated rows, because the constraint does not allow the insertion
of new duplicated data. Anyway..it looks like a kind of 'poor solution' :-)

- Ok, thanks for the info, I thought pg_dumpall would work as I desired even on non-empty clusters.
Of course...if there is no previous data, the restore process will never create duplicated rows.

- Yes, the restore process generates errors, because it tries to re-generate data structures that exist in the database server
at that moment. Even if I delete my own databases, some errors will appear (because I cannot delete the internal stuff of
of the server -> the 'postgres' database for example) . Those errors, could be ignored in most cases I think, but perhaps create a kind of "bad feeling" about the result of the restore process, or can "hide" other more important errors when you get a huge ammount of info on the screen either.

- I forgot to mention it. I'm using PostgreSQL 8.3 on Windows xp.

Thanks a lot for your advices,

Regards, Pablo

2009/8/24 Tom Lane <tgl@sss.pgh.pa.us>

Pablo Alonso-Villaverde Roza <pavroza@gmail.com> writes:
> I'm getting duplicated rows in some of my tables when I try to restore data
> from a dump file generated with 'pg_dumpall'.

Probably all of them, actually ...

> The only "way" I have found to solve this problem is deleting my database
> before restoring the data, so everything is 're-created' in the restore
> process without generating duplicated rows.

A pg_dumpall script expects to be restored into an empty cluster. This
is not a bug.

> I thought, that the "-c" flag on pg_dumpall would force a DROP of any
> previous data structures on the server but...it looks like it doesn't work
> as I expected and ...when I restore data I get duplicated rows.

The -c flag should cause the script to DROP all your databases first.
But that switch has been known to have bugs in the past, and in any case
it could fail if there are other sessions connected to those databases.
Did you pay attention to whether the restore reported any errors?
What PG version is this?

regards, tom lane

Re: Avoid duplicated rows when restoring data from pg_dumpall ??

From

Andy Shellam

Date:

26 August 2009, 16:52:41

Hi Pablo

> - As regards the duplicated rows, no, I don't get duplicated rows in
> all the tables stored in the database because
> some tables have primary-keys (and/or UNIQUE) constraints. These
> constraints don't allow the restore process to
> duplicate rows.  In fact, it is a kind of "solution" I've tried...(add
> an extra column with a primary-key or unique constraint, to this
> tables), and it "works". The restore process doesn't generate
> duplicated rows, because the constraint does not allow the insertion
> of new duplicated data.  Anyway..it looks like a kind of 'poor
> solution'  :-)

Yeah, that's one solution, only trouble being if the data in the
existing table is different to what's in the restore script (for a
record with the same ID) it won't be updated.

e.g.: in this example your restored database will be inconsistent with
the backup.
your table: Field1 = 1 (ID), Field2 = A, Field3 = B
restore script: Field1 = 1 (ID), Field2 = B, Field3 = B

>
> - Ok, thanks for the info, I thought pg_dumpall would work as I
> desired even on non-empty clusters.
> Of course...if there is no previous data, the restore process will
> never create duplicated rows.

Exactly.  If you're looking for some form of replication (i.e.
master-to-slave) look at Slony - it fires triggers on the master that
insert data into the slave.  It has its limitations but AFAIK it's a
workable solution.
>
> - Yes, the restore process generates errors, because it tries to
> re-generate data structures that exist in the database server
> at that moment. Even if I delete my own databases, some errors will
> appear (because I cannot delete the internal stuff of
> of the server -> the 'postgres' database for example) . Those errors,
> could be ignored in most cases I think, but perhaps create a kind of
> "bad feeling" about the result of the restore process,  or can "hide"
> other more important errors when you get a huge ammount of info on the
> screen either.

You can delete the "postgres" database - it's an empty database that's
created when the server is initialised so you've got something to
connect to.  It's safe to delete, as long as you have another database
you can connect to, but there's no real reason to unless it's in your
restore script (e.g. from pg_dumpall.)

http://www.postgresql.org/docs/8.3/static/manage-ag-templatedbs.html

Regards,
Andy

Re: Avoid duplicated rows when restoring data from pg_dumpall ??

From

Pablo Alonso-Villaverde Roza

Date:

27 August 2009, 07:06:38

Hi Andy,

First of all, thank you very much for your advices,

2009/8/26 Andy Shellam <andy-lists@networkmail.eu>

Hi Pablo

- As regards the duplicated rows, no, I don't get duplicated rows in all the tables stored in the database because
some tables have primary-keys (and/or UNIQUE) constraints. These constraints don't allow the restore process to
duplicate rows. In fact, it is a kind of "solution" I've tried...(add an extra column with a primary-key or unique constraint, to this tables), and it "works". The restore process doesn't generate duplicated rows, because the constraint does not allow the insertion
of new duplicated data. Anyway..it looks like a kind of 'poor solution' :-)

Yeah, that's one solution, only trouble being if the data in the existing table is different to what's in the restore script (for a record with the same ID) it won't be updated.

e.g.: in this example your restored database will be inconsistent with the backup.
your table: Field1 = 1 (ID), Field2 = A, Field3 = B
restore script: Field1 = 1 (ID), Field2 = B, Field3 = B

Yes!, you're right, that could be a serious problem...some modified data will keep the same and not restored to the status it had
when the backup was made. Damn...

I've tought about a "heavy-method" solution.....make an "dumb-proof" utility to backup and restore the 'data' directory of the PostgresSQL server. It is not considered as good as a SQL dump, because it requires to shutdown the server, but that's not
a problem for me.

- Ok, thanks for the info, I thought pg_dumpall would work as I desired even on non-empty clusters.
Of course...if there is no previous data, the restore process will never create duplicated rows.

Exactly. If you're looking for some form of replication (i.e. master-to-slave) look at Slony - it fires triggers on the master that insert data into the slave. It has its limitations but AFAIK it's a workable solution.

No, I was not looking for a replication system. Just a "simple" method to backup and restore data for users with very very low knowledge about computers administration and zero knowledge about postgreSQL. I'll not be able to spend much time with the users, so they should be able to make administration tasks without much supervision, and I was looking for a kind of "fire-and-forget" procedure Something they could do without many details to worry about.

- Yes, the restore process generates errors, because it tries to re-generate data structures that exist in the database server
at that moment. Even if I delete my own databases, some errors will appear (because I cannot delete the internal stuff of
of the server -> the 'postgres' database for example) . Those errors, could be ignored in most cases I think, but perhaps create a kind of "bad feeling" about the result of the restore process, or can "hide" other more important errors when you get a huge ammount of info on the screen either.

You can delete the "postgres" database - it's an empty database that's created when the server is initialised so you've got something to connect to. It's safe to delete, as long as you have another database you can connect to, but there's no real reason to unless it's in your restore script (e.g. from pg_dumpall.)

http://www.postgresql.org/docs/8.3/static/manage-ag-templatedbs.html

Regards,
Andy

Ah, ok, I thought it was NOT possible to delete postgres because I was not allowed to do it from pgAdmin.

Thanks again,

Regards Pablo