Thread: How to efficiently duplicate a whole schema?

How to efficiently duplicate a whole schema?

From: Sebastien Lemieux
Hi,

  I'm running Red Hat 7.2 with PostgreSQL 7.3.2, and I have two schemas in
the same database: 'db' and 'db_dev'.  Both contain a set of >20 tables for
a total of less than 50 MB of data each (on the order of 50k rows in
total).  Once in a while (often these days!), I need to synchronize the
dev version from the production 'db'.  Currently, I do this by setting
constraints to deferred, deleting everything in db_dev, and then issuing a
series of insert ... select ... statements to copy data from each table in
db to the equivalent table in db_dev.
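
For concreteness, the procedure amounts to something like this, repeated
per table ('some_table' is a hypothetical name standing in for each real
table):

    -- Deletes in one transaction, inserts in a second, as described above.
    BEGIN;
    SET CONSTRAINTS ALL DEFERRED;
    DELETE FROM db_dev.some_table;
    COMMIT;

    BEGIN;
    SET CONSTRAINTS ALL DEFERRED;
    INSERT INTO db_dev.some_table SELECT * FROM db.some_table;
    COMMIT;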

  This approach used to run in less than 30 seconds in MySQL, but in
PostgreSQL it currently takes around 30 minutes.  The postmaster process
runs at 100% CPU the whole time.  I enclosed all the delete statements
in one transaction and all the insert statements in a second transaction.
All the time is spent at the commit of both transactions.

  Is there a more straightforward way to synchronize a development
database with a production one?  Is there any way to increase the
performance of this delete/insert combination?  I've got indexes and
constraints on most tables; could that be the problem?  At some point in
the future, I will also need to make a copy of a whole schema ('db' into
'db_backup'); what would be an efficient way to do that?

  These are the parameters I've adjusted in the postgresql.conf:

max_connections = 16
shared_buffers = 3000
max_fsm_relations = 2000
max_fsm_pages = 20000
sort_mem = 20000
vacuum_mem = 20000
effective_cache_size = 15000

  And this is the memory state of the machine:

slemieux@neptune> free
             total       used       free     shared    buffers     cached
Mem:       2059472    2042224      17248      24768     115712    1286572
-/+ buffers/cache:     639940    1419532
Swap:      2096440     490968    1605472

thanks,

--
Sebastien Lemieux
Bioinformatics, post-doc
Elitra-canada


Re: How to efficiently duplicate a whole schema?

From: "Nick Fankhauser"
Sebastien-

I have a similar nightly process to keep our development system synched with
production. I just do a complete pg_dump of production, do a dropdb &
createdb to empty the database for development, and then restore the whole
db from the pg_dump file. Our database is about 12 GB currently, and it
takes less than one hour to dump & restore back into dev if I go through a
file. I can go even faster by piping the data to eliminate one set of reads
& writes to the disk:

dropdb dev
createdb dev
pg_dump prod | psql dev

This of course only works if you haven't changed your data structure in the
development area, but it is very simple and reasonably quick.

In situations where the data structure has changed, I run a more complex
system that deletes data rather than dropping the whole db, but I always
drop the indexes in development before restoring data and recreate them
afterwards.

-Nick




Re: How to efficiently duplicate a whole schema?

From: Tom Lane
Sebastien Lemieux <slemieux@elitra.com> writes:
> All the time is taken at the commit of both transaction.

Sounds like the culprit is foreign-key checks.

One obvious question is whether you have your foreign keys set up
efficiently in the first place.  As a rule, the referenced and
referencing columns should have identical datatypes and both should
be indexed.  (PG will often let you create foreign key constraints
that don't meet these rules ... but performance will suffer.)
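
In DDL, that rule looks roughly like this (table and index names are
hypothetical).  The referencing column matches the referenced serial's
underlying integer type, and gets an explicit index, since PG only
requires an index (via the unique constraint) on the referenced side:

    CREATE TABLE parent (
        id serial PRIMARY KEY              -- serial = integer + sequence
    );
    CREATE TABLE child (
        id serial PRIMARY KEY,
        parent_id integer NOT NULL REFERENCES parent (id)
    );
    -- PG does not create this index automatically; the FK triggers need it:
    CREATE INDEX child_parent_id_idx ON child (parent_id);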

Also, what procedure are you using to delete all the old data?  What
I'd recommend is
    ANALYZE table;
    TRUNCATE table;
    INSERT new data;
The idea here is to make sure that the planner's statistics reflect the
"full" state of the table, not the "empty" state.  Otherwise it may pick
plans for the foreign key checks that are optimized for small tables.

            regards, tom lane

Re: How to efficiently duplicate a whole schema?

From: "scott.marlowe"
On Wed, 6 Aug 2003, Tom Lane wrote:

> Sebastien Lemieux <slemieux@elitra.com> writes:
> > All the time is taken at the commit of both transaction.
>
> Sounds like the culprit is foreign-key checks.
>
> One obvious question is whether you have your foreign keys set up
> efficiently in the first place.  As a rule, the referenced and
> referencing columns should have identical datatypes and both should
> be indexed.  (PG will often let you create foreign key constraints
> that don't meet these rules ... but performance will suffer.)

Is this one of those things that should spit out a NOTICE when it happens?
I.e., when a table is created with a REFERENCES clause that uses a
different type than the parent, would it be a good idea to issue a
"NOTICE: parent and child fields are not of the same type"?



Re: How to efficiently duplicate a whole schema?

From: Tom Lane
"scott.marlowe" <scott.marlowe@ihs.com> writes:
> On Wed, 6 Aug 2003, Tom Lane wrote:
>> One obvious question is whether you have your foreign keys set up
>> efficiently in the first place.  As a rule, the referenced and
>> referencing columns should have identical datatypes and both should
>> be indexed.  (PG will often let you create foreign key constraints
>> that don't meet these rules ... but performance will suffer.)

> Is this one of those things that should spit out a NOTICE when it happens?
> I.e. when a table is created with a references and uses a different type
> than the parent, would it be a good idea to issue a "NOTICE: parent and
> child fields are not of the same type"

I could see doing that for unequal data types, but I'm not sure if it's
reasonable to do it for lack of index.  Usually you won't have created
the referencing column's index yet when you create the FK constraint,
so any warning would just be noise.  (The referenced column's index *is*
checked for, since we require it to be unique.)

            regards, tom lane

Re: How to efficiently duplicate a whole schema?

From: "scott.marlowe"
On Wed, 6 Aug 2003, Tom Lane wrote:

> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > On Wed, 6 Aug 2003, Tom Lane wrote:
> >> One obvious question is whether you have your foreign keys set up
> >> efficiently in the first place.  As a rule, the referenced and
> >> referencing columns should have identical datatypes and both should
> >> be indexed.  (PG will often let you create foreign key constraints
> >> that don't meet these rules ... but performance will suffer.)
>
> > Is this one of those things that should spit out a NOTICE when it happens?
> > I.e. when a table is created with a references and uses a different type
> > than the parent, would it be a good idea to issue a "NOTICE: parent and
> > child fields are not of the same type"
>
> I could see doing that for unequal data types, but I'm not sure if it's
> reasonable to do it for lack of index.  Usually you won't have created
> the referencing column's index yet when you create the FK constraint,
> so any warning would just be noise.  (The referenced column's index *is*
> checked for, since we require it to be unique.)

Sure.  I wasn't thinking of the index issue anyway, just the type
mismatch.


Re: How to efficiently duplicate a whole schema?

From: Sebastien Lemieux
On Wed, 6 Aug 2003, Tom Lane wrote:

> Sebastien Lemieux <slemieux@elitra.com> writes:
> > All the time is taken at the commit of both transaction.
>
> Sounds like the culprit is foreign-key checks.
>
> One obvious question is whether you have your foreign keys set up
> efficiently in the first place.  As a rule, the referenced and
> referencing columns should have identical datatypes and both should
> be indexed.  (PG will often let you create foreign key constraints
> that don't meet these rules ... but performance will suffer.)

I've checked, and all the foreign keys are set up between 'serial' (the
primary key of the referenced table) and 'integer not null' (the foreign
key field).  Would those count as the same type?  A couple of my foreign
keys are not indexed; I'll fix that.  The latter seems to do the job,
since I can now synchronize in about 75 seconds (compared to 30 minutes),
which seems good enough.

> Also, what procedure are you using to delete all the old data?  What
> I'd recommend is
>     ANALYZE table;
>     TRUNCATE table;
>     INSERT new data;
> The idea here is to make sure that the planner's statistics reflect the
> "full" state of the table, not the "empty" state.  Otherwise it may pick
> plans for the foreign key checks that are optimized for small tables.

I added the 'analyze' but without any noticeable gain in speed.  I can't
use 'truncate' since I need to 'set constraints all deferred'.  I guess
the bottom line is that I really need to first drop all constraints and
indexes, synchronize, and then rebuild the indexes and check the
constraints.  But for that I'll need to reorganize my code a little bit!
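
A rough sketch of that reorganization for one table, with hypothetical
constraint and index names:

    ALTER TABLE db_dev.child DROP CONSTRAINT child_parent_id_fkey;
    DROP INDEX db_dev.child_parent_id_idx;

    -- bulk transfer, with no FK triggers or index maintenance in the way
    DELETE FROM db_dev.child;
    INSERT INTO db_dev.child SELECT * FROM db.child;

    CREATE INDEX child_parent_id_idx ON db_dev.child (parent_id);
    ALTER TABLE db_dev.child ADD CONSTRAINT child_parent_id_fkey
        FOREIGN KEY (parent_id) REFERENCES db_dev.parent (id);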

In the meantime, how bad a decision would it be to simply remove all the
foreign key constraints?  Right now I think they are causing more
problems than they are avoiding...

thanks,

--
Sebastien Lemieux
Bioinformatics, post-doc
Elitra-canada

Re: How to efficiently duplicate a whole schema?

From: Tom Lane
Sebastien Lemieux <slemieux@elitra.com> writes:
> On Wed, 6 Aug 2003, Tom Lane wrote:
>> The idea here is to make sure that the planner's statistics reflect the
>> "full" state of the table, not the "empty" state.  Otherwise it may pick
>> plans for the foreign key checks that are optimized for small tables.

> I added the 'analyze' but without any noticable gain in speed.  I can't
> use 'truncate' since I need to 'set constraints all deferred'.

What are you using, exactly?

            regards, tom lane

Re: How to efficiently duplicate a whole schema?

From: Stephan Szabo
On Wed, 6 Aug 2003, Sebastien Lemieux wrote:

> On Wed, 6 Aug 2003, Tom Lane wrote:
>
> > Sebastien Lemieux <slemieux@elitra.com> writes:
> > > All the time is taken at the commit of both transaction.
> >
> > Sounds like the culprit is foreign-key checks.
> >
> > One obvious question is whether you have your foreign keys set up
> > efficiently in the first place.  As a rule, the referenced and
> > referencing columns should have identical datatypes and both should
> > be indexed.  (PG will often let you create foreign key constraints
> > that don't meet these rules ... but performance will suffer.)
>
> I've checked and all the foreign keys are setup between 'serial' (the
> primary key of the referenced table) and 'integer not null' (the foreign
> key field).  Would that be same type?  A couple of my foreign keys are not
> indexed, I'll fix that.  The latter seems to do the job, since I can now
> synchronize in about 75 seconds (compared to 30 minutes), which seems good
> enough.

Another thing might be the management of the trigger queue.  I don't think
7.3.2 had the optimization for limiting the scans of the queue when you
have lots of deferred triggers.  It looks like 7.3.4 may though.


Re: How to efficiently duplicate a whole schema?

From: Tom Lane
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Another thing might be the management of the trigger queue.  I don't think
> 7.3.2 had the optimization for limiting the scans of the queue when you
> have lots of deferred triggers.  It looks like 7.3.4 may though.

Good point.  We put that in in 7.3.3, according to the CVS logs.

            regards, tom lane

Re: How to efficiently duplicate a whole schema?

From: Sebastien Lemieux
> >> The idea here is to make sure that the planner's statistics reflect the
> >> "full" state of the table, not the "empty" state.  Otherwise it may pick
> >> plans for the foreign key checks that are optimized for small tables.
>
> > I added the 'analyze' but without any noticable gain in speed.  I can't
> > use 'truncate' since I need to 'set constraints all deferred'.
>
> What are you using, exactly?

What I want to do:

  let tables be the list of tables

  for t in tables:
    delete from db_dev.t;

  for t in tables:
    insert into db_dev.t (...) select ... from db.t;

Some of my foreign keys create reference loops in my schema, so there is
no ordering of the deletes and inserts that satisfies the constraints at
all times.  I have to enclose those two loops in a 'set constraints all
deferred' to avoid complaints from the constraints.

I tried dropping the indexes first, doing the transfer, and recreating
the indexes: no gain.  So rebuilding the indexes doesn't take significant
time.

I then tried removing all the foreign key constraints and replacing the
deletes with truncate, and it now runs in about 25 seconds.  The downside
is that I lose the foreign key integrity verification, but because of the
reference loops in my schema it has caused me more problems than it has
avoided so far.  So I can live with that!
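
With the foreign keys gone, each table reduces to a truncate plus a copy
(hypothetical table name; note that in 7.3 TRUNCATE cannot run inside a
transaction block, so it sits outside the insert transaction):

    -- repeat per table, in any order, since no FK constraints remain
    TRUNCATE TABLE db_dev.some_table;
    BEGIN;
    INSERT INTO db_dev.some_table SELECT * FROM db.some_table;
    COMMIT;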

Thanks all!

--
Sebastien Lemieux
Bioinformatics, post-doc
Elitra-canada