Thread: Anyone working on pg_dump dependency ordering?

Anyone working on pg_dump dependency ordering?

From
Tom Lane
Date:
I'm thinking about attacking pg_dump's lack of knowledge about using
dependencies to determine a safe dump order.  But if there's someone
out there actively working on the problem, I don't want to tread on
your toes ... anyone?

Also, if you've got uncommitted patches for pg_dump, please let me know.
        regards, tom lane


Re: Anyone working on pg_dump dependency ordering?

From
Christopher Kings-Lynne
Date:
> I'm thinking about attacking pg_dump's lack of knowledge about using
> dependencies to determine a safe dump order.  But if there's someone
> out there actively working on the problem, I don't want to tread on
> your toes ... anyone?

I've done a whole lot of _thinking_, but basically no _doing_, so go 
right ahead :)

I may as well let you know my thoughts:

There are two levels (sort of) of dependency.  The first is that whole 
classes of objects can be dependent on whole other classes.  eg. 
databases depend on users, or ALL FK's can be dumped after ALL tables, 
etc..  It would make the dump more readable if you dumped those definite 
dependencies in that order, rather than shuffling everything up.

The second level of dependency is when a bunch of object types can 
depend on each other.  The current solution for that is to sort by OID, 
but this fails when it is possible to add a dependency to an object 
after it has been created.

eg:

- Adding a column (with a type) to a table
- All the CREATE OR REPLACE commands
- etc.

Hence, a full db wide topological sort might not be necessary.

Lastly, I presume it's possible to create a system of circular 
dependencies (eg create or replace view), which really cannot be solved 
without a system of 'shells', similar to that needed to dump types and 
their i/o functions.

Views seem to be by far the nastiest object.  They can be dependent on 
almost everything in the database.

> Also, if you've got uncommitted patches for pg_dump, please let me know.

Yes, my 'COMMENT ON' mega patch in the queue contains dumping of 
conversions and comments on a bunch of objects.

BTW, if you commit that patch - you might want to change my comment on 
type patch to put the "" around "any", and change the results file 
appropriately.  I noticed I accidentally included that in the patch, and 
was about to mention it.

CHris




Re: Anyone working on pg_dump dependency ordering?

From
Andreas Pflug
Date:
Christopher Kings-Lynne wrote:

>
>
> There are two levels (sort of) of dependency.  The first is that whole 
> classes of objects can be dependent on whole other classes.  eg. 
> databases depend on users, or ALL FK's can be dumped after ALL tables, 
> etc..  It would make the dump more readable if you dumped those 
> definite dependencies in that order, rather than shuffling everything up.
>
I agree that dumping should be done class-wise (Tables, Functions, 
Views) whenever possible, but I don't agree on FKs dumped separately 
from the table. IMHO indexes and constraints belong to the table, and 
modifying the dump will be hard if a table's code is scattered all around.


Regards,
Andreas




Re: Anyone working on pg_dump dependency ordering?

From
Stephan Szabo
Date:
On Sat, 22 Nov 2003, Andreas Pflug wrote:

> Christopher Kings-Lynne wrote:
>
> >
> >
> > There are two levels (sort of) of dependency.  The first is that whole
> > classes of objects can be dependent on whole other classes.  eg.
> > databases depend on users, or ALL FK's can be dumped after ALL tables,
> > etc..  It would make the dump more readable if you dumped those
> > definite dependencies in that order, rather than shuffling everything up.
> >
> I agree that dumping should be done class-wise (Tables, Functions,
> Views) whenever possible, but I don't agree on FKs dumped separately
> from the table. IMHO indexes and constraints belong to the table, and
> modifying the dump will be hard if a table's code is scattered all around.

You're going to potentially have the constraints scattered in any case due
to circular dependency chains. I'd think that having all the constraints
in one place would be easier than trying to go through the list of tables
that might be in a circular chain in order to find the constraints.


Re: Anyone working on pg_dump dependency ordering?

From
Andreas Pflug
Date:
Stephan Szabo wrote:

>You're going to potentially have the constraints scattered in any case due
>to circular dependency chains. I'd think that having all the constraints
>in one place would be easier than trying to go through the list of tables
>that might be in a circular chain in order to find the constraints.
>  
>
I still disagree. cyclic dependencies should be avoided anyhow. You'll 
get an awful lot of trouble loading data in such a case. Some database 
systems refuse to create such stuff right away (mssql).

Regards,
Andreas




Re: Anyone working on pg_dump dependency ordering?

From
Rod Taylor
Date:
On Sat, 2003-11-22 at 16:53, Andreas Pflug wrote:
> Stephan Szabo wrote:
> 
> >You're going to potentially have the constraints scattered in any case due
> >to circular dependency chains. I'd think that having all the constraints
> >in one place would be easier than trying to go through the list of tables
> >that might be in a circular chain in order to find the constraints.
> >  
> >
> I still disagree. cyclic dependencies should be avoided anyhow. You'll 
> get an awful lot of trouble loading data in such a case. Some database 
> systems refuse to create such stuff right away (mssql).

CREATE TABLE a (col integer primary key);
CREATE TABLE b (col integer primary key);
ALTER TABLE a ADD FOREIGN KEY (col) REFERENCES b INITIALLY DEFERRED;
ALTER TABLE b ADD FOREIGN KEY (col) REFERENCES a;

How does MSSQL deal with the above?



Re: Anyone working on pg_dump dependency ordering?

From
Andreas Pflug
Date:
Rod Taylor wrote:

>
>
>CREATE TABLE a (col integer primary key);
>CREATE TABLE b (col integer primary key);
>ALTER TABLE a ADD FOREIGN KEY (col) REFERENCES b INITIALLY DEFERRED;
>ALTER TABLE b ADD FOREIGN KEY (col) REFERENCES a;
>
>How does MSSQL deal with the above?#
>  
>
It depends. Restricting FKs are generated silently, while ON DELETE 
CASCADE will throw a message and refuse to create. MSSQL doesn't know 
about deferred FKs; no chance to enter spuriously inconsistent data.
Still, using cyclic references is IMHO bad design style. I can't accept 
an exceptional case as reason to break *all* table's definition into 
pieces. The CREATE TABLE syntax shows that I'm probably not the only one 
thinking like this: it may include all constraint definitions as well.

There might be discussions whether its better to script
CREATE TABLE xxx ..;
ALTER TABLE xxx ADD PRIMARY KEY ....;
ALTER TABLE xxx ADD FOREIGN KEY ....;
or
CREATE TABLE xxx (...., PRIMARY KEY (..), FOREIGN KEY (..));

I'd opt for the second version (a little formatted, maybe :-)

Regards,
Andreas



Re: Anyone working on pg_dump dependency ordering?

From
Rod Taylor
Date:
> >CREATE TABLE a (col integer primary key);
> >CREATE TABLE b (col integer primary key);
> >ALTER TABLE a ADD FOREIGN KEY (col) REFERENCES b INITIALLY DEFERRED;
> >ALTER TABLE b ADD FOREIGN KEY (col) REFERENCES a;

> Still, using cyclic references is IMHO bad design style. I can't accept 

They're extremely useful when you have normalized data and a very
expensive (but repeatable) process whose results you want to cache for
performance reasons. It enforces that original data and cache are both
added in a single transaction.

But you're right. Normally they're a bad idea.

> an exceptional case as reason to break *all* table's definition into 
> pieces. The CREATE TABLE syntax shows that I'm probably not the only one 
> thinking like this: it may include all constraint definitions as well.
> 
> There might be discussions whether its better to script
> CREATE TABLE xxx ..;
> ALTER TABLE xxx ADD PRIMARY KEY ....;
> ALTER TABLE xxx ADD FOREIGN KEY ....;
> or
> CREATE TABLE xxx (...., PRIMARY KEY (..), FOREIGN KEY (..));
> 
> I'd opt for the second version (a little formatted, maybe :-)

Well.. the second one will be much slower when the foreign keys verify.
Primary, unique constraints I'll buy in the create statement. Check
constraints and defaults are a little fuzzier.

Logic will be required to pull them out in the event they call functions
which depends on the table or we enable subqueries (assertion like
constraints) in them.






Re: pg_dump dependency / physical hot backup

From
Andreas Pflug
Date:
Rod Taylor wrote:

>>There might be discussions whether its better to script
>>CREATE TABLE xxx ..;
>>ALTER TABLE xxx ADD PRIMARY KEY ....;
>>ALTER TABLE xxx ADD FOREIGN KEY ....;
>>or
>>CREATE TABLE xxx (...., PRIMARY KEY (..), FOREIGN KEY (..));
>>
>>I'd opt for the second version (a little formatted, maybe :-)
>>    
>>
>
>Well.. the second one will be much slower when the foreign keys verify.
>  
>
Verifying zero rows in the freshly created table should be quite fast...

>Primary, unique constraints I'll buy in the create statement. Check
>constraints and defaults are a little fuzzier.
>
>Logic will be required to pull them out in the event they call functions
>which depends on the table or we enable subqueries (assertion like
>constraints) in them.
>
>  
>
Yes, complicated constructions might prevent creating a table's objects 
in a single step. It's probably possible to design an object that can't 
be extracted automatically and restored at all (e.g. a view using itself).

I wonder if there have been discussions about other ways to 
backup/restore. The current pg_dump/pg_restore way will rebuild all 
indexes etc, resulting in a fresh and consistent database after restore, 
enabling backend upgrades, but it's tricky to guarantee everything runs 
smoothly. And it can be quite slow. In a case of a disaster recovery, 
this is probably very unlucky.
I wonder if a mixture of pg_dump and physical cluster backup(currently 
only possible if backend is shut down) could be implemented, i.e. a 
BACKUP sql command. This command should stream out all data from the 
physical files, taking a snapshot. When restoring, the command would 
create a new database from the input data, by more or less copying the 
data to files. This way, all (hopefully non-existent) inconsistencies in 
the database would be restored as well (probably including non-vacuumed 
rows), but because no checks are applied the restore process would be as 
fast as possible. This would be possible only for the same 
backend/architecture version, but in case of disaster recovery that's 
enough.

Regards,
Andreas




Re: Anyone working on pg_dump dependency ordering?

From
Tom Lane
Date:
Rod Taylor <pg@rbt.ca> writes:
> Well.. the second one will be much slower when the foreign keys verify.
> Primary, unique constraints I'll buy in the create statement. Check
> constraints and defaults are a little fuzzier.

FK, primary, and unique constraints are already split out from the
CREATE TABLE for performance reasons.  We could think about folding them
back in in a schema-only dump, but in a full dump I don't think it's
negotiable --- you really want to load the table data before you install
these constraints.
        regards, tom lane


Re: Anyone working on pg_dump dependency ordering?

From
Andreas Pflug
Date:
Tom Lane wrote:

>Rod Taylor <pg@rbt.ca> writes:
>  
>
>>Well.. the second one will be much slower when the foreign keys verify.
>>Primary, unique constraints I'll buy in the create statement. Check
>>constraints and defaults are a little fuzzier.
>>    
>>
>
>FK, primary, and unique constraints are already split out from the
>CREATE TABLE for performance reasons.  We could think about folding them
>back in in a schema-only dump, but in a full dump I don't think it's
>negotiable --- you really want to load the table data before you install
>these constraints.
>  
>
IMHO here we have opposite requirements: The dump/restore process should 
run as fast as possible, so constraints have to be generated separately, 
but pg_dump is also used to reengineer and slightly modify the schema, 
where it's helpful if definitions are grouped. I've been asked for 
complete schema extraction features in pgAdmin3, and I replied 'use 
pg_dump'. It seems that pg_dump tries to serve both requirements, being 
a compromise where two dedicated tools could do it better.

Regards,
Andreas





Re: pg_dump dependency / physical hot backup

From
Christopher Kings-Lynne
Date:
> Verifying zero rows in the freshly created table should be quite fast...

It's hundreds of times faster to add an index to a full table than add 
rows to a table with an index.

Chris




Re: Anyone working on pg_dump dependency ordering?

From
ow
Date:
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FK, primary, and unique constraints are already split out from the
> CREATE TABLE for performance reasons.  We could think about folding them
> back in in a schema-only dump, but in a full dump I don't think it's
> negotiable --- you really want to load the table data before you install
> these constraints.

IMHO, not only data need to loaded before FK constraints are created but also
there has got to be a feature to allow creation of an FK constraint WITHOUT
doing the verification that all loaded/existing records satisfy the FK
constraint. The ability to create a FK constraint without verification of
existing records should exist by itself (maybe only to superuser) and also as
an option in pg_restore. 

More details:
http://archives.postgresql.org/pgsql-admin/2003-11/msg00308.php
http://archives.postgresql.org/pgsql-admin/2003-11/msg00323.php


Thanks







__________________________________
Do you Yahoo!?
Free Pop-Up Blocker - Get it now
http://companion.yahoo.com/


Re: Anyone working on pg_dump dependency ordering?

From
prinsarian@zonnet.nl (Arian Prins)
Date:
chriskl@familyhealth.com.au (Christopher Kings-Lynne) wrote in message news:<3FBEC8D3.3040807@familyhealth.com.au>...
> Lastly, I presume it's possible to create a system of circular 
> dependencies (eg create or replace view), which really cannot be solved 
> without a system of 'shells', similar to that needed to dump types and 
> their i/o functions.
> 
> Views seem to be by far the nastiest object.  They can be dependent on 
> almost everything in the database.

Hello Group,

It might be an idea to keep track of all data-definition changes
during the lifetime of a database. Keep all the
SQL-definition-commands in a seperate systemtable somewhere. Then,
when the schema is dumped you'd have the choice of dumping:
- the most recent schema with database-determined ordering (as is the
case in the current situation)
- "replay" the "recorded" datadefinition of the past, so you know the
data definition is executed in a "sound" sequence.

Of course, data itself would be extracted seperate of the definitions
and there would also be the need to remove constrains while the data
is being loaded. This last thing could be done by parsing the
"recorded" data-definition-commands or by first blindly running the
"recorded" commands, then removing (or disabling) any constraints that
are there and finally replacing (or re-enabling) the constraints.

Negative side of this idea is obviously that you need to make changes
to a lot of internals of the postgresql database. AND you need to make
perfectly sure that the state of the database is always consistent
with the state of the recorded SQL.

Good Luck,
Arian.