Thread: pg_dump in 7.4
Hi, Has anyone given much thought to improving pg_dump's object order algorithm for 7.4? It seems that now we have dependencies, it should just be a matter of doing a breadth-first or depth-first search over the pg_depend table to generate a valid order of oids. To allow for mess-ups in that table, the next step would be to add to the end of the list of oids any objects that for whatever reason aren't in the dependency system. (Is this possible? Manual hacking can do it methinks...) Does this sound like an idea? I've just become rather frustrated trying to do a test reload of our 7.2.3 dump into 7.3b5. The problem is all the tsearch types are declared after the tables that actually use them! Chris
At 01:33 PM 13/11/2002 +0800, Christopher Kings-Lynne wrote: >Does this sound like an idea? It does, but in keeping with allowing pg_restore to be quite flexible, I'd like to see the dependency data stored in the dump file, then processed at restore-time. >I've just become rather frustrated trying to do a test reload of our 7.2.3 >dump into 7.3b5. The problem is all the tsearch types are declared after >the tables that actually use them! pg_dump already has rudimentary dependency tracking (one level deep); each item can have a list of oid's it depends on. You *could* patch it to add the types to the table dependencies. In the future I'd imagine we'll just dump the OIDs of all first level dependencies for each object, then at restore-time, process them in whatever order the user requests (defaulting to dependency-order). ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 03 5330 3172 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
> pg_dump already has rudimentary dependency tracking (one level > deep); each > item can have a list of oid's it depends on. You *could* patch it to add > the types to the table dependencies. > > In the future I'd imagine we'll just dump the OIDs of all first level > dependencies for each object, then at restore-time, process them in > whatever order the user requests (defaulting to dependency-order). Well, the problem is that you can add a new type and then add a column to a really old table that uses that type - that causes pain. Lots of other people have also reported the "view dumped before table it is based on" problem. Chris
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes: > Has anyone given much thought to improving pg_dump's object order algorithm > for 7.4? It seems that now we have dependencies, it should just be a matter > of doing a breadth-first or depth-first search over the pg_depend table to > generate a valid order of oids. I've thought about this a little. Using the pg_depend data would obviously give pg_dump a big leg up on the problem, but it's by no means a trivial task even so. Some things to think about: * We don't store dependencies for SQL functions to things mentioned in the SQL function body. (Maybe we should, but we don't.) So there's data missing in that case, and possibly other cases. * pg_dump really ought to still work for pre-7.3 databases. What will its fallback strategy be like when it hasn't got pg_depend? * With ALTER TABLE, CREATE OR REPLACE FUNCTION, etc, it's not hard at all to create situations with circular dependencies. Right now pg_dump is guaranteed to produce an unusable dump in such cases, but our ambition should be to make it work. That means pg_dump needs a strategy for breaking circular dependency paths, by itself using ALTER when needed to postpone a reference. The thought that I'd been toying with is to build a list of inter-object dependencies (using pg_depend if available, else fall back on pg_dump's native wit, ie, the rather limited set of dependencies it already understands). Then do a topological sort, preferring to maintain OID order in cases where the object ordering is underspecified. When the sort fails (ie, there's a circular dependency) then modify the set of dumpable objects by breaking some object into two parts (a base declaration and an ALTER command); this changes the dependencies too. Repeat the sort and adjustment steps until the sort succeeds. If you're not familiar with topological sorts, look at Knuth or your favorite algorithms text. There are one or two instances of the method in PG already (deadlock detection, for example). regards, tom lane
Philip Warner <pjw@rhyme.com.au> writes: > At 01:33 PM 13/11/2002 +0800, Christopher Kings-Lynne wrote: >> Does this sound like an idea? > It does, but in keeping with allowing pg_restore to be quite flexible, I'd > like to see the dependency data stored in the dump file, then processed at > restore-time. No, because that doesn't help for the plain-text-dump case. I think we should solve the dependencies during pg_dump and output the objects in a safe order. pg_restore can keep its options for rearranging the order, but those should become vestigial, or at least only needed in bizarre cases. regards, tom lane
On Wed, 2002-11-13 at 00:33, Christopher Kings-Lynne wrote: > for 7.4? It seems that now we have dependencies, it should just be a matter > of doing a breadth-first or depth-first search over the pg_depend table to > generate a valid order of oids. The biggest trick will be trying to re-combine the ALTER ... ADD CONSTRAINT and ALTER ... SET DEFAULT statements back into CREATE TABLE, but that seems to partially solve itself simply by using an ALAP algorithm (as late as possible) and being picky about the paths you try first, as they'll get pushed together. > To allow for mess-ups in that table, the next step would be to add to the > end of the list of oids any objects that for whatever reason aren't in the > dependency system. (Is this possible? Manual hacking can do it > methinks...) Are there any objects which are not in the dependency system, other than comments? Comments can be done at the time the object is created. -- Rod Taylor
At 08:36 AM 13/11/2002 -0500, Tom Lane wrote: >No, because that doesn't help for the plain-text-dump case. Wrong. The way pg_dump does a plain-text dump is to do a fake restore. Same code. Which meand if make the restore work correctly and everything works. > I think we >should solve the dependencies during pg_dump and output the objects in a >safe order. We should do this anyway, just to make the restorations quicker (as we do already). > pg_restore can keep its options for rearranging the order, >but those should become vestigial, or at least only needed in bizarre >cases. Ordering is not the problem; it's allowing the user to dump a single table and associated type definitions that requires the pg_dump really *must* dump dependency information. Then the dump/restore code can sort it appropriately. The way the code works at the moment is: - Dump definitions in any convenient order, creating an in-memory TOC. - Sort the TOC entries appropriately (using code in pg_backup_archiver). - Dump the definitions and data to file/stdout (using code in pg_backup_archiver). We need is to replace the naieve quicksort with a more complex sorting system. The suggestion of breaking items into create/alter etc is interesting - I assume you are thinking of function bodies? Or is there something else? ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 03 5330 3172 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
At 01:50 PM 13/11/2002 +0800, Christopher Kings-Lynne wrote: >Well, the problem is that you can add a new type and then add a column to a >really old table that uses that type - that causes pain\ You may have misunderstood; I meant to add each type used by the table to the deps list for a table (and each function used by a view etc etc). Current implementation leaves the deps list blank for tables, and only lists the tables for a view (I think). The problem with the current deps list is that (a) it assumes that OID order is important and (b) it does not do any analysis of the topology of the dependencies. The latter will be substantially improved if we can get pg_depend deps into the dump file, and if we can do a useful analysis of the dependencies. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 03 5330 3172 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
At 08:52 AM 13/11/2002 -0500, Rod Taylor wrote: >The biggest trick will be trying to re-combine the ALTER ... ADD >CONSTRAINT and ALTER ... SET DEFAULT statements back into CREATE TABLE I'm not sure this would be worth the effort - I'll grant it would be cute, but getting pg_dump to understand SQL seems a little ambitious. We'd probably end up defining a portable schema definition language just for dump files. To achieve Tom's suggestion it might be simpler to store two versions - the 'full' version, and the 'fully deconstructed' version. If our analysis of the dependencies meant we needed to break up an object, then we use the latter. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 03 5330 3172 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
On Wed, 2002-11-13 at 09:08, Philip Warner wrote: > At 08:52 AM 13/11/2002 -0500, Rod Taylor wrote: > >The biggest trick will be trying to re-combine the ALTER ... ADD > >CONSTRAINT and ALTER ... SET DEFAULT statements back into CREATE TABLE > > I'm not sure this would be worth the effort - I'll grant it would be cute, > but getting pg_dump to understand SQL seems a little ambitious. We'd > probably end up defining a portable schema definition language just for > dump files. > To achieve Tom's suggestion it might be simpler to store two versions - the > 'full' version, and the 'fully deconstructed' version. If our analysis of > the dependencies meant we needed to break up an object, then we use the > latter. Different approaches to the same result. To me, the dependency tree is already (mostly) broken up to start with. So at some point you need to teach something to re-combine in the pg_attrdef -> pg_class dependencies among others. But the opposite approach of starting with the large objects and breaking up where required would be just as good, especially if you only breakup the little bits that are required. Starting with all functions broken up into their two parts (definition and body) with the body dependent on the definition and recombining later if they sort side by side seems much easier than trying to resolve cycles and break it up at a later time. An ALAP scheduling algorithm will almost always sort these things to be side by side to allow combining on a second pass by something with the intelligence. -- Rod Taylor
Philip Warner <pjw@rhyme.com.au> writes: > The suggestion of breaking items into create/alter etc is interesting - I > assume you are thinking of function bodies? Or is there something else? Let's see --- foreign-key constraints are an obvious source of possible circularities, but I see pg_dump already dumps those as separate objects. I recall thinking that column default and constraint clauses might need to be broken out too, but I'm not sure why I thought that (maybe because they can call SQL functions?). Anything else? A simple-minded approach would be to *always* add these things via ALTER commands at the end, but to keep dumps legible it would be nicer to keep them in the original table definition whenever possible. regards, tom lane
At 09:29 AM 13/11/2002 -0500, Rod Taylor wrote: >An ALAP scheduling algorithm will almost always sort these things to be >side by side to allow combining on a second pass by something with the >intelligence. Do we have a list of dependency data that we collect? eg. do we know about functions used in views and indexes? At this stage it's probably worth making a list of what we think is achievable in 7.4, and what we want to achieve ultimately. I certainly agree about recombining function headers and bodies, but there are a bunch of things that I think we can leave deconstructed and always move to the end of the restore, eg: - constraints - sequences set (not really a dependency problem) - indexes - comments AFAIR, we can only split function bodies for non-SQL functions. For views we need to know the types of casts within the view, the types returned by the view, the functions used by the view as well as the tables referenced. AFAIR, there is no way to break up a view, so it has to go after each ancestor. For a table, it should be sufficient to know the constraints & types; we can add constraints later, but I'd be reluctant to get into doing 'ALTER TABLE ADD COLUMN...'. Indexes may have a function and/or a cast? Create Index I on T( cast(id as my_type) )? I'd guess constraints can depend on multiple tables, views(?), types, & functions. Not sure what else. We can't really break these down. So it looks like the only contentious item might be table attrs? is that right? ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 03 5330 3172 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
> Do we have a list of dependency data that we collect? eg. do we know about > functions used in views and indexes? At this stage it's probably worth > - constraints > - sequences set (not really a dependency problem) > - indexes > - comments I can make a complete list tonight of whats captured. Shall we tack the list onto the end of section 3.13 (pg_depend): http://developer.postgresql.org/docs/postgres/catalog-pg-depend.html > For a table, it should be sufficient to know the constraints & types; we > can add constraints later, but I'd be reluctant to get into doing 'ALTER > TABLE ADD COLUMN...'. Shouldn't ever need to do an ALTER TABLE ADD COLUMN. But I can certainly come up with a case for ALTER TABLE SET DEFAULT. > Indexes may have a function and/or a cast? Create Index I on T( cast(id as > my_type) )? > > I'd guess constraints can depend on multiple tables, views(?), types, & > functions. Not sure what else. We can't really break these down. They can via functions. And you can break down a function and table, but not really types or views. CREATE FUNCTION func .... 'SELECT TRUE;' LANGUAGE 'sql'; CREATE <items requiring function>; -- Fill in function body. CREATE OR REPLACE FUNCTION func ... '<real query>' LANGUAGE 'sql'; > So it looks like the only contentious item might be table attrs? is that right? More likely to be functions. As everything else (I can think of) is easily altered into place. -- Rod Taylor
> The thought that I'd been toying with is to build a list of inter-object > dependencies (using pg_depend if available, else fall back on pg_dump's > native wit, ie, the rather limited set of dependencies it already > understands). Then do a topological sort, preferring to maintain OID > order in cases where the object ordering is underspecified. When the > sort fails (ie, there's a circular dependency) then modify the set of > dumpable objects by breaking some object into two parts (a base > declaration and an ALTER command); this changes the dependencies too. > Repeat the sort and adjustment steps until the sort succeeds. > > If you're not familiar with topological sorts, look at Knuth or your > favorite algorithms text. There are one or two instances of the method > in PG already (deadlock detection, for example). C'mon - I _do_ have an honours degree in computer science ;) I was actually trying to think of the correct term when i wrote my original email. I had 'PERT charts' in my head, but I couldn't remember the term for the required order of activities - thanks for reminding me :) Chris
At 02:53 PM 13/11/2002 -0500, Rod Taylor wrote: >I can make a complete list tonight of whats captured. Sounds good. If you can also indicate which parts of functions are captured - arguments, return type and body? IIRC, only SQL functions are compiled at define-time, so other functions *shouldn't* be a major problem if they are out of order in *most* cases. Since pg_dump already has some intrinsic knowledge of dependencies, if there is much missing from pg_depend, we could probably add it to pg_dump (it many cases it will just be a matter of getting oids as well as names in SQL statements). In terms of supporting older dump files, I think it should all work: they already have space for deps, mostly empty, and as Tom suggested, we should be able to dump leaf nodes in OID order. From the PoV of pg_dump, the algorthim becomes: 1. generate in-memory TOC in a convenient format. 2. pick the lowest OID *leaf* node. If none, goto 5. 3. remove it from deps of other TOC entries 4. goto 2. [cyclic] 5. pick the lowest OID node. See if we can break it. If not repeat with more nodes until we can break one. If none, then goto step 8. 6. Break up the node. 7. Goto 2. [cyclic, no resolution] 8. Pick the lowest OID node. 9. Goto 3. There are a few issues here: (a) we need to know dependencies of *parts* of objects in order to do the breakup. To me this implies we should break them up at dump time (ie. have FUNCTION_DEFINITION and FUNCTION_BODY TOC entries; it gets nastier with tables, but TABLE_DEFINITION, TABLE_CONSTRAINT, ATTRIBUTE_CONSTRAINT, and ATTRIBUTE_DEFAULT come to mind. So time later we can write cunning code to recombine them (volunteers?). (b) Step 6 may be pointless. We probably need to verify that we will end up with a leaf node as a result of the breakup: I presume someone has a good algorithm. But if we do have a cycle, I'd guess we should just revert to an OID ordering, since while we may pick the topologically optimal node, it may well not be optimal from PostgreSQL point of view: if the node fails to be defined, then everything else that depends on it will fail. This has the advantage of being simple. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 03 5330 3172 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Tom Lane <tgl@sss.pgh.pa.us> writes: > * We don't store dependencies for SQL functions to things mentioned in > the SQL function body. (Maybe we should, but we don't.) So there's > data missing in that case, and possibly other cases. This might be interesting to do, and we could tie it into the need to invalidate PL/PgSQL functions that depend on a database object when the object is changed. Perhaps when the function is defined, we run all the SQL queries in the function body through the parser/analyzer/rewriter, and then generate dependencies on the Query trees we get back? In any case, there would be a limit to what we could divine from the function definition (e.g. we'd get practically no info about a function defined in C) -- but this might make things a little nicer, anyway. Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
At 12:39 AM 14/11/2002 -0500, Neil Conway wrote: >Perhaps when the function is defined, we run all the SQL queries in >the function body through the parser/analyzer/rewriter, and then >generate dependencies on the Query trees we get back? Won't work for functions that build dynamic queries. But it might be interesting/worthwhile to allow user-specified dependencies; that way if a user has problems with database dumps etc, they could manually add dependencies for C functions, dynamic query functions (where possible) etc. It's probably more trouble than it's worth, but worth considering. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 03 5330 3172 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: > Won't work for functions that build dynamic queries. Granted, but the the intent is to (a) solve some, but not necessarily all, of the dump-order problems (b) drop functions that depend on a database object when the database object itself is dropped -- i.e. youdon't *want* to worry about dynamically built queries, as they are fine already Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Philip Warner <pjw@rhyme.com.au> writes: > At 12:39 AM 14/11/2002 -0500, Neil Conway wrote: >> Perhaps when the function is defined, we run all the SQL queries in >> the function body through the parser/analyzer/rewriter, and then >> generate dependencies on the Query trees we get back? > Won't work for functions that build dynamic queries. That's irrelevant for SQL functions, though, and pg_dump does not need to worry about dependencies inside other types of functions (since only SQL functions' bodies are examined during CREATE FUNCTION). Neil's solution is exactly what I was thinking we should do. Or almost exactly, anyway: we don't want to run the rewriter before extracting dependencies. regards, tom lane
On Wed, 2002-11-13 at 23:37, Philip Warner wrote: > At 02:53 PM 13/11/2002 -0500, Rod Taylor wrote: > >I can make a complete list tonight of whats captured. > > Sounds good Below is a summary of what pg_depend tracks that might be useful. Skipped a number of dependencies that are internal only (ie. toast table dependencies) as they will be regenerated correctly if their 'owners' are generated correctly. <Expression Dependencies> include:- Operators- Functions- Relations (and columns)- Aggregates Attributes (Columns) depend on:- Type of attribute Tables depend on:- Namespace- Parent tables (if inheritance) Default expressions depend on:- Table- <Expression Dependencies> Indexes depend on:- Constraint (where unique / primary key constraint)- Index procedure- Index operator- Attributes of indexedrelation Aggregates depend on:- Transformation function- Final function (if required) Foreign Keys depend on:- Foreign key'd relation and its attributes- Constrained relation and its attributes- Unique Indexon the foreign key'd relation Check Constraints depend on:- <Expression Dependencies> <- includes parent relation- Domain type (if check constraint ondomain -- v7.4) Operators depend on:- Namespace- Left operator type- Right operator type- Result operator type- Code function- Rest function-Join function Functions depend on:- Namespace- Language- Return type- Argument types (all) Types (domains included) depend on:- Namespace- Input type- Output type- Element type (if array)- Base type (if domain)-Default value -> <Expression Dependencies> Casts depend on:- Source type- Target type- Cast function Operator Classes depend on:- Namespaces- Input type- Key data type (if different than input type)- Dependencies on operatorsin the class- Dependencies on procedures in the class Languages depend on:- Call function- Validation function Triggers depend on:- Trigger function- Relation trigger is on- Constrained relation (if constraint trigger) Rules depend on:- Relation rule is on- Qualifying condition -> <Expression Dependencies>- resulting Query Tree -> <ExpressionDependencies> Missing:- Body of all functions -- Rod Taylor <rbt@rbt.ca>
At 11:43 AM 15/11/2002 -0500, Rod Taylor wrote: >Below is a summary of what pg_depend tracks that might be useful. This looks excellent. If people are happy with my earlier outline, it should be reasonably easy to proceed... ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 03 5330 3172 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
On Sat, 2002-11-16 at 15:49, Alvaro Herrera wrote: > On Fri, Nov 15, 2002 at 11:43:47AM -0500, Rod Taylor wrote: > > > Below is a summary of what pg_depend tracks that might be useful. > > Skipped a number of dependencies that are internal only (ie. toast table > > dependencies) as they will be regenerated correctly if their 'owners' > > are generated correctly. > > > > > > Tables depend on: > > - Namespace > > - Parent tables (if inheritance) > > And columns? I only see table dependencies. tablecmds.c line 943 > > Indexes depend on: > > - Constraint (where unique / primary key constraint) > > - Index procedure > > - Index operator > > - Attributes of indexed relation > > On function if functional maybe? (Is this "procedure"?) Yes, forgot that marker beside 'Index Procedure'. -- Rod Taylor <rbt@rbt.ca>