Thread: Anyone working on pg_dump dependency ordering?
I'm thinking about attacking pg_dump's lack of knowledge about using
dependencies to determine a safe dump order.  But if there's someone
out there actively working on the problem, I don't want to tread on
your toes ... anyone?

Also, if you've got uncommitted patches for pg_dump, please let me know.

			regards, tom lane
> I'm thinking about attacking pg_dump's lack of knowledge about using
> dependencies to determine a safe dump order.  But if there's someone
> out there actively working on the problem, I don't want to tread on
> your toes ... anyone?

I've done a whole lot of _thinking_, but basically no _doing_, so go
right ahead :)  I may as well let you know my thoughts:

There are two levels (sort of) of dependency.  The first is that whole
classes of objects can be dependent on whole other classes, e.g.
databases depend on users, or ALL FKs can be dumped after ALL tables,
etc.  It would make the dump more readable if you dumped those definite
dependencies in that order, rather than shuffling everything up.

The second level of dependency is when a bunch of object types can
depend on each other.  The current solution for that is to sort by OID,
but this fails when it is possible to add a dependency to an object
after it has been created, e.g.:

- Adding a column (with a type) to a table
- All the CREATE OR REPLACE commands
- etc.

Hence, a full db-wide topological sort might not be necessary.

Lastly, I presume it's possible to create a system of circular
dependencies (e.g. CREATE OR REPLACE VIEW), which really cannot be
solved without a system of 'shells', similar to that needed to dump
types and their i/o functions.

Views seem to be by far the nastiest object.  They can be dependent on
almost everything in the database.

> Also, if you've got uncommitted patches for pg_dump, please let me know.

Yes, my 'COMMENT ON' mega patch in the queue contains dumping of
conversions and comments on a bunch of objects.  BTW, if you commit that
patch, you might want to change my comment-on-type patch to put the ""
around "any", and change the results file appropriately.  I noticed I
accidentally included that in the patch, and was about to mention it.

Chris
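P.S. A concrete illustration of how the OID ordering breaks down (the
table and domain names are made up just for the example):

-- The table is created first, so it gets the lower OID ...
CREATE TABLE invoices (id integer PRIMARY KEY);

-- ... the domain is created later, with a higher OID ...
CREATE DOMAIN money_amount AS numeric(12,2);

-- ... but now the table depends on the domain.
ALTER TABLE invoices ADD COLUMN total money_amount;

-- A dump in OID order would emit invoices (including its money_amount
-- column) before CREATE DOMAIN money_amount, and the restore would fail.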
Christopher Kings-Lynne wrote:

> There are two levels (sort of) of dependency.  The first is that whole
> classes of objects can be dependent on whole other classes, e.g.
> databases depend on users, or ALL FKs can be dumped after ALL tables,
> etc.  It would make the dump more readable if you dumped those definite
> dependencies in that order, rather than shuffling everything up.

I agree that dumping should be done class-wise (Tables, Functions,
Views) whenever possible, but I don't agree with FKs being dumped
separately from the table.  IMHO indexes and constraints belong to the
table, and modifying the dump will be hard if a table's code is
scattered all around.

Regards,
Andreas
On Sat, 22 Nov 2003, Andreas Pflug wrote:

> Christopher Kings-Lynne wrote:
> >
> > There are two levels (sort of) of dependency.  The first is that whole
> > classes of objects can be dependent on whole other classes, e.g.
> > databases depend on users, or ALL FKs can be dumped after ALL tables,
> > etc.  It would make the dump more readable if you dumped those definite
> > dependencies in that order, rather than shuffling everything up.
>
> I agree that dumping should be done class-wise (Tables, Functions,
> Views) whenever possible, but I don't agree with FKs being dumped
> separately from the table.  IMHO indexes and constraints belong to the
> table, and modifying the dump will be hard if a table's code is
> scattered all around.

You're going to potentially have the constraints scattered in any case
due to circular dependency chains.  I'd think that having all the
constraints in one place would be easier than trying to go through the
list of tables that might be in a circular chain in order to find the
constraints.
Stephan Szabo wrote:

> You're going to potentially have the constraints scattered in any case
> due to circular dependency chains.  I'd think that having all the
> constraints in one place would be easier than trying to go through the
> list of tables that might be in a circular chain in order to find the
> constraints.

I still disagree.  Cyclic dependencies should be avoided anyhow; you'll
get an awful lot of trouble loading data in such a case.  Some database
systems refuse to create such stuff right away (MSSQL).

Regards,
Andreas
On Sat, 2003-11-22 at 16:53, Andreas Pflug wrote:
> Stephan Szabo wrote:
>
> > You're going to potentially have the constraints scattered in any case
> > due to circular dependency chains.  I'd think that having all the
> > constraints in one place would be easier than trying to go through the
> > list of tables that might be in a circular chain in order to find the
> > constraints.
>
> I still disagree.  Cyclic dependencies should be avoided anyhow; you'll
> get an awful lot of trouble loading data in such a case.  Some database
> systems refuse to create such stuff right away (MSSQL).

CREATE TABLE a (col integer primary key);
CREATE TABLE b (col integer primary key);
ALTER TABLE a ADD FOREIGN KEY (col) REFERENCES b INITIALLY DEFERRED;
ALTER TABLE b ADD FOREIGN KEY (col) REFERENCES a;

How does MSSQL deal with the above?
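(For the record, loading data into that pair works in Postgres precisely
because the first constraint is deferred -- a minimal sketch, using the
tables above:)

BEGIN;
-- the FK from a to b is INITIALLY DEFERRED, so this insert is allowed
-- even though b has no matching row yet
INSERT INTO a (col) VALUES (1);
-- the FK from b to a is checked immediately, and a(1) already exists
INSERT INTO b (col) VALUES (1);
COMMIT;  -- the deferred check on a -> b passes now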
Rod Taylor wrote:

> CREATE TABLE a (col integer primary key);
> CREATE TABLE b (col integer primary key);
> ALTER TABLE a ADD FOREIGN KEY (col) REFERENCES b INITIALLY DEFERRED;
> ALTER TABLE b ADD FOREIGN KEY (col) REFERENCES a;
>
> How does MSSQL deal with the above?

It depends.  Restricting FKs are generated silently, while ON DELETE
CASCADE will throw a message and refuse to create.  MSSQL doesn't know
about deferred FKs, so there's no chance to enter spuriously
inconsistent data.

Still, using cyclic references is IMHO bad design style.  I can't accept
an exceptional case as a reason to break *all* tables' definitions into
pieces.  The CREATE TABLE syntax shows that I'm probably not the only
one thinking like this: it may include all constraint definitions as
well.

There might be discussions whether it's better to script

CREATE TABLE xxx ..;
ALTER TABLE xxx ADD PRIMARY KEY ....;
ALTER TABLE xxx ADD FOREIGN KEY ....;

or

CREATE TABLE xxx (...., PRIMARY KEY (..), FOREIGN KEY (..));

I'd opt for the second version (a little formatted, maybe :-)

Regards,
Andreas
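P.S. Concretely, the two styles would look roughly like this (a made-up
table; it assumes a customers table with primary key id already exists):

-- first style: bare table, constraints added afterwards
CREATE TABLE orders (
    id          integer,
    customer_id integer
);
ALTER TABLE orders ADD PRIMARY KEY (id);
ALTER TABLE orders ADD FOREIGN KEY (customer_id) REFERENCES customers (id);

-- second style: everything in the CREATE TABLE itself
CREATE TABLE orders (
    id          integer,
    customer_id integer,
    PRIMARY KEY (id),
    FOREIGN KEY (customer_id) REFERENCES customers (id)
);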
> > CREATE TABLE a (col integer primary key);
> > CREATE TABLE b (col integer primary key);
> > ALTER TABLE a ADD FOREIGN KEY (col) REFERENCES b INITIALLY DEFERRED;
> > ALTER TABLE b ADD FOREIGN KEY (col) REFERENCES a;

> Still, using cyclic references is IMHO bad design style.  I can't accept

They're extremely useful when you have normalized data and a very
expensive (but repeatable) process whose results you want to cache for
performance reasons.  It enforces that the original data and the cache
are both added in a single transaction.

But you're right.  Normally they're a bad idea.

> an exceptional case as a reason to break *all* tables' definitions into
> pieces.  The CREATE TABLE syntax shows that I'm probably not the only
> one thinking like this: it may include all constraint definitions as
> well.
>
> There might be discussions whether it's better to script
> CREATE TABLE xxx ..;
> ALTER TABLE xxx ADD PRIMARY KEY ....;
> ALTER TABLE xxx ADD FOREIGN KEY ....;
> or
> CREATE TABLE xxx (...., PRIMARY KEY (..), FOREIGN KEY (..));
>
> I'd opt for the second version (a little formatted, maybe :-)

Well.. the second one will be much slower when the foreign keys verify.

Primary and unique constraints I'll buy in the create statement.  Check
constraints and defaults are a little fuzzier.

Logic will be required to pull them out in the event they call functions
which depend on the table, or if we enable subqueries (assertion-like
constraints) in them.
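A contrived sketch of the kind of case I mean (all names invented):

CREATE TABLE accounts (
    id      integer PRIMARY KEY,
    balance numeric
);

-- the function queries the table, so it cannot exist before the table does
CREATE FUNCTION balance_ok(integer) RETURNS boolean AS '
    SELECT count(*) < 100 FROM accounts WHERE id = $1
' LANGUAGE sql;

-- and therefore the check constraint has to be split out of CREATE TABLE
ALTER TABLE accounts ADD CONSTRAINT accounts_sane CHECK (balance_ok(id));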
Rod Taylor wrote:

> > There might be discussions whether it's better to script
> > CREATE TABLE xxx ..;
> > ALTER TABLE xxx ADD PRIMARY KEY ....;
> > ALTER TABLE xxx ADD FOREIGN KEY ....;
> > or
> > CREATE TABLE xxx (...., PRIMARY KEY (..), FOREIGN KEY (..));
> >
> > I'd opt for the second version (a little formatted, maybe :-)
>
> Well.. the second one will be much slower when the foreign keys verify.

Verifying zero rows in the freshly created table should be quite fast...

> Primary and unique constraints I'll buy in the create statement.  Check
> constraints and defaults are a little fuzzier.
>
> Logic will be required to pull them out in the event they call functions
> which depend on the table, or if we enable subqueries (assertion-like
> constraints) in them.

Yes, complicated constructions might prevent creating a table's objects
in a single step.  It's probably possible to design an object that can't
be extracted automatically and restored at all (e.g. a view using
itself).

I wonder if there have been discussions about other ways to
backup/restore.  The current pg_dump/pg_restore way will rebuild all
indexes etc., resulting in a fresh and consistent database after restore
and enabling backend upgrades, but it's tricky to guarantee everything
runs smoothly, and it can be quite slow.  In the case of disaster
recovery, that's quite unfortunate.

I wonder if a mixture of pg_dump and physical cluster backup (currently
only possible if the backend is shut down) could be implemented, i.e. a
BACKUP SQL command.  This command would stream out all data from the
physical files, taking a snapshot.  When restoring, the command would
create a new database from the input data, by more or less copying the
data to files.  This way, all (hopefully non-existent) inconsistencies
in the database would be restored as well (probably including
non-vacuumed rows), but because no checks are applied the restore
process would be as fast as possible.  This would be possible only for
the same backend/architecture version, but in case of disaster recovery
that's enough.

Regards,
Andreas
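P.S. Purely hypothetical syntax, just to make the idea concrete (nothing
like this exists today):

-- stream a physical snapshot of the database out of the backend ...
BACKUP DATABASE mydb TO '/backup/mydb.phys';
-- ... and recreate the files more or less verbatim on restore
RESTORE DATABASE mydb FROM '/backup/mydb.phys';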
Rod Taylor <pg@rbt.ca> writes:
> Well.. the second one will be much slower when the foreign keys verify.
> Primary and unique constraints I'll buy in the create statement.  Check
> constraints and defaults are a little fuzzier.

FK, primary, and unique constraints are already split out from the
CREATE TABLE for performance reasons.  We could think about folding them
back in in a schema-only dump, but in a full dump I don't think it's
negotiable --- you really want to load the table data before you install
these constraints.

			regards, tom lane
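P.S. That is, a full dump comes out roughly in this shape (table and
constraint names invented for illustration, and it assumes a customers
table dumped the same way; pg_dump normally uses COPY for the data, the
INSERTs here just stand in for that step):

CREATE TABLE orders (
    id          integer,
    customer_id integer
);

-- bulk-load the data while no indexes or FK checks are in the way
INSERT INTO orders VALUES (1, 1);
INSERT INTO orders VALUES (2, 1);

-- then install the constraints, each verified in a single pass
ALTER TABLE ONLY orders ADD CONSTRAINT orders_pkey PRIMARY KEY (id);
ALTER TABLE ONLY orders ADD CONSTRAINT orders_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id);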
Tom Lane wrote:

> Rod Taylor <pg@rbt.ca> writes:
> > Well.. the second one will be much slower when the foreign keys verify.
> > Primary and unique constraints I'll buy in the create statement.  Check
> > constraints and defaults are a little fuzzier.
>
> FK, primary, and unique constraints are already split out from the
> CREATE TABLE for performance reasons.  We could think about folding them
> back in in a schema-only dump, but in a full dump I don't think it's
> negotiable --- you really want to load the table data before you install
> these constraints.

IMHO we have opposing requirements here: the dump/restore process should
run as fast as possible, so constraints have to be generated separately,
but pg_dump is also used to reengineer and slightly modify the schema,
where it's helpful if definitions are grouped.  I've been asked for
complete schema extraction features in pgAdmin3, and I replied 'use
pg_dump'.  It seems that pg_dump tries to serve both requirements, being
a compromise where two dedicated tools could each do a better job.

Regards,
Andreas
> Verifying zero rows in the freshly created table should be quite fast...

It's hundreds of times faster to add an index to a full table than to
add rows to a table that already has the index.

Chris
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FK, primary, and unique constraints are already split out from the
> CREATE TABLE for performance reasons.  We could think about folding them
> back in in a schema-only dump, but in a full dump I don't think it's
> negotiable --- you really want to load the table data before you install
> these constraints.

IMHO, not only does the data need to be loaded before FK constraints are
created, but there also has to be a feature that allows creating an FK
constraint WITHOUT verifying that all loaded/existing records satisfy
it.  The ability to create an FK constraint without verification of
existing records should exist by itself (maybe restricted to the
superuser) and also as an option in pg_restore.

More details:
http://archives.postgresql.org/pgsql-admin/2003-11/msg00308.php
http://archives.postgresql.org/pgsql-admin/2003-11/msg00323.php

Thanks
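Something like this, syntax-wise (the NOVALIDATE keyword below is purely
invented for illustration; it is not existing PostgreSQL syntax, and the
table names are made up):

-- hypothetical: install the constraint, enforce it for future changes,
-- but skip the scan that verifies the rows already loaded
ALTER TABLE orders
    ADD FOREIGN KEY (customer_id) REFERENCES customers (id)
    NOVALIDATE;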
chriskl@familyhealth.com.au (Christopher Kings-Lynne) wrote in message
news:<3FBEC8D3.3040807@familyhealth.com.au>...
> Lastly, I presume it's possible to create a system of circular
> dependencies (e.g. CREATE OR REPLACE VIEW), which really cannot be
> solved without a system of 'shells', similar to that needed to dump
> types and their i/o functions.
>
> Views seem to be by far the nastiest object.  They can be dependent on
> almost everything in the database.

Hello Group,

It might be an idea to keep track of all data-definition changes during
the lifetime of a database: keep all the SQL definition commands in a
separate system table somewhere.  Then, when the schema is dumped, you'd
have the choice of dumping:

- the most recent schema with database-determined ordering (as is the
  case in the current situation), or
- a "replay" of the "recorded" data definition of the past, so you know
  the data definition is executed in a "sound" sequence.

Of course, the data itself would be extracted separately from the
definitions, and there would also be the need to remove constraints
while the data is being loaded.  This last part could be done by parsing
the "recorded" data-definition commands, or by first blindly running the
"recorded" commands, then removing (or disabling) any constraints that
are there, and finally replacing (or re-enabling) the constraints.

The negative side of this idea is obviously that you need to make
changes to a lot of PostgreSQL internals, AND you need to make perfectly
sure that the state of the database is always consistent with the state
of the recorded SQL.

Good Luck,
Arian.
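P.S. A rough sketch of the shape such a system table might take (the
name and columns are invented; a real catalog table would of course not
be created this way):

-- hypothetical catalog table recording every DDL statement in execution
-- order, so that a dump could simply replay it
CREATE TABLE pg_ddl_log (
    logged_at  timestamp with time zone NOT NULL DEFAULT now(),
    username   name NOT NULL,
    command    text NOT NULL   -- the original SQL text of the command
);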