Thread: pg_dump: Sorted output, referential integrity statements
Hi, While daily using pg_dump for a while the need for the following features grew significantly. I finally came to the point of implementing them myself ;-) - pg_dump outputs the data unsorted but to manage the data in a version control system you need it consistently sorted. So a flag to sort by either primary key or left to right would be of great value. (--sorted ?) - pg_dump outputs referential constraints as 3 triggers (near to two different tables) per constraint. A mode which outputs the original statement (alter table ... add constraint) would be more sql standard conformant, portable and readable. But ... you might get into trouble if the referenced table creation command is output later. If we call this switch --sql-standard it might also prefer the short (standard compliant) form for index creation [create index X on Y(Z,W)] and some other things. So, I'm kindly asking for your opinion regarding this two features. Does anybody plan to implement them? Do you have reasons against? Christof
On Thu, 6 Dec 2001, Christof Petig wrote: > - pg_dump outputs referential constraints as 3 triggers (near to two > different tables) per constraint. A mode which outputs the original > statement (alter table ... add constraint) would be more sql standard > conformant, portable and readable. But ... you might get into trouble if > the referenced table creation command is output later. There's some interesting timing things with this. Pretty much the alter statements have to be after the creates for all the tables at least due to recursive constraints. When you're using insert statements (-d) since the restore doesn't appear to be in a transaction, all the data needs to have been loaded as well (again due to recursive constraints). In fact, there's *no* guarantee that even with a transaction that a restore of the current database state statement by statement will succeed since the user may have done odd things to insert the data. If the data's already there, the alter table is going to check each row for validity which can be kinda slow right now on big restores, we'd probably need to make a better check.
At 12:05 6/12/01 +0100, Christof Petig wrote: > >- pg_dump outputs the data unsorted Not quite correct; it outputs them in an order that is designed to improve the chances of dependencies being satisfied, and improve the performance of a full restore (a modified OID order). > but to manage the data in a version >control system you need it consistently sorted. So a flag to sort by >either primary key or left to right would be of great value. (--sorted >?) Not really very generalizable when you consider user defined types, triggers etc. >- pg_dump outputs referential constraints as 3 triggers (near to two >different tables) per constraint. A mode which outputs the original >statement (alter table ... add constraint) would be more sql standard Abosolutely; with time we are moving pg_dump to use standard SQL. >So, I'm kindly asking for your opinion regarding this two features. >Does anybody plan to implement them? No plans for the first one, but sorting by ('object-type', 'object-name') would be close to trivial, if there is much interest/support for it. The second (SQL conformance) is high on my list; a few people (Chris & Stephen?) have been working hard to implement 'alter table add/etc constraint'. When this is stable, we will move pg_dump in that direction. But as of 7.1, there were still wrinkles in the the implementation that meant it was unsuitable for pg_dump. Not sure about the status in 7.2. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Stephan Szabo wrote: Since nobody answered concerning the sort issue, I guess - nobody is planning or implementing this - nobody disagrees this might be handy to have > On Thu, 6 Dec 2001, Christof Petig wrote: > > > - pg_dump outputs referential constraints as 3 triggers (near to two > > different tables) per constraint. A mode which outputs the original > > statement (alter table ... add constraint) would be more sql standard > > conformant, portable and readable. But ... you might get into trouble if > > the referenced table creation command is output later. > > There's some interesting timing things with this. Pretty much the > alter statements have to be after the creates for all the tables at least > due to recursive constraints. When you're using insert statements (-d) > since the restore doesn't appear to be in a transaction, all the data > needs to have been loaded as well (again due to recursive constraints). > In fact, there's *no* guarantee that even with a transaction that a > restore of the current database state statement by statement will succeed > since the user may have done odd things to insert the data. > If the data's already there, the alter table is going to check each row > for validity which can be kinda slow right now on big restores, we'd > probably need to make a better check. The propose was mainly made to make the output more readable if you dump a single table (per pg_dump call). This would also use portable sql commands so it's easier to migrate data (given that you also specify -D). Yours Christof
Philip Warner wrote: Ah, yes. Now I remember it was you improving pg_dump. > At 12:05 6/12/01 +0100, Christof Petig wrote: > > > >- pg_dump outputs the data unsorted > > Not quite correct; it outputs them in an order that is designed to improve > the chances of dependencies being satisfied, and improve the performance of > a full restore (a modified OID order). That's perfect - unless you want to diff two pg_dumps > > but to manage the data in a version > >control system you need it consistently sorted. So a flag to sort by > >either primary key or left to right would be of great value. (--sorted > >?) > > Not really very generalizable when you consider user defined types, > triggers etc. Hmmm. But if we have a primary key on columns (A,B,C) and request the data 'order by A,B,C' this should be portable, shouldn't it? If we don't have a primary key simply ordering by 1,2,3,...n should also work. Or am I missing something? > >- pg_dump outputs referential constraints as 3 triggers (near to two > >different tables) per constraint. A mode which outputs the original > >statement (alter table ... add constraint) would be more sql standard > > Abosolutely; with time we are moving pg_dump to use standard SQL. Great news. > >So, I'm kindly asking for your opinion regarding this two features. > >Does anybody plan to implement them? > > No plans for the first one, but sorting by ('object-type', 'object-name') > would be close to trivial, if there is much interest/support for it. I don't understand what you mean by 'sorting by object-type/name', can you give me an example. Simply adding an (optional) order by clause was the one I intended. > The second (SQL conformance) is high on my list; a few people (Chris & > Stephen?) have been working hard to implement 'alter table add/etc > constraint'. When this is stable, we will move pg_dump in that direction. > But as of 7.1, there were still wrinkles in the the implementation that > meant it was unsuitable for pg_dump. Not sure about the status in 7.2. Oh, I was targeting 7.2. I can not surely tell about 7.2, but have seen cvs logs implementing similar things. Christof
On Fri, Dec 07, 2001 at 03:16:26PM +0100, Christof Petig wrote: > Philip Warner wrote: > > Ah, yes. Now I remember it was you improving pg_dump. > > > At 12:05 6/12/01 +0100, Christof Petig wrote: > > > > > >- pg_dump outputs the data unsorted > > > > Not quite correct; it outputs them in an order that is designed to improve > > the chances of dependencies being satisfied, and improve the performance of > > a full restore (a modified OID order). > > That's perfect - unless you want to diff two pg_dumps I've ran into this myself. However, I've never wanted to diff a full dump, usually just schema comparisions - I usually _know_ which database has the current data, I just want to be sure I can move it over. For schema comparisions, it's easy enough to generate a 'diffable' file that reflects the schema, something like: select relname||'.'||attname from pg_class c, pg_attribute a where attrelid=c.oid and attnum >0 and relname !~ '^pg' order by relname,attname; Hmm, I do see that sometimes it'd be nice to do a full diff, really. The 'oid order' was a nice hack to avoid having to do a full dependency analysis on db objects, but they're not stable. I think with oids going away as much as possible, anyway, we're probably going to have to bite the bullet and do dependencies, one way or another. There are a number of features that are often requested that all boil down to knowing dependencies: dropping the auto-generated sequence for a serial, along with the table - reparsing various functions/views/etc. when the underlying tables are modified, etc. Ross
At 15:16 7/12/01 +0100, Christof Petig wrote: >> >> Not really very generalizable when you consider user defined types, >> triggers etc. > >Hmmm. But if we have a primary key on columns (A,B,C) and request the data >'order by A,B,C' this should be portable, shouldn't it? >If we don't have a primary key simply ordering by 1,2,3,...n should also work. >Or am I missing something? My mistake; I thought you wanted to compare metadata. Sorting data by PK seems like a reasonable thing to do. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
At 10:02 8/12/01 +1100, Philip Warner wrote: >At 15:16 7/12/01 +0100, Christof Petig wrote: >>> >>> Not really very generalizable when you consider user defined types, >>> triggers etc. >> >>Hmmm. But if we have a primary key on columns (A,B,C) and request the data >>'order by A,B,C' this should be portable, shouldn't it? >>If we don't have a primary key simply ordering by 1,2,3,...n should also >work. >>Or am I missing something? > >My mistake; I thought you wanted to compare metadata. Sorting data by PK >seems like a reasonable thing to do. > To make the dump diff-able, we probably need to sort the metadata by type & name as well. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
> > > but to manage the data in a version > > >control system you need it consistently sorted. So a flag to sort by > > >either primary key or left to right would be of great value. (--sorted > > >?) > > > > Not really very generalizable when you consider user defined types, > > triggers etc. > > Hmmm. But if we have a primary key on columns (A,B,C) and request the data > 'order by A,B,C' this should be portable, shouldn't it? > If we don't have a primary key simply ordering by 1,2,3,...n > should also work. > Or am I missing something? I can see how ordering a dump by the primary key would be a neat way of 'clustering' your data after a restore, however I have qualms about the scalability of such a scheme. What if someone has a 100GB table? They may have arranged things so that they never get a sort from it or something, or it might take ages. However I guess if it's an optional parameter it might be neat. My feeling is that it won't happen unless you actually code it into a patch that makes it a parameter to pg_dump. Having an actual patch is a great way of getting something you want done ;) Alternatively, have you tried just writing a PERL script (or some clever sed script) that will just sort the COPY FROM sections...? Chris
Christopher Kings-Lynne wrote: > > > > but to manage the data in a version > > > >control system you need it consistently sorted. So a flag to sort by > > > >either primary key or left to right would be of great value. (--sorted > > > >?) > > > > > > Not really very generalizable when you consider user defined types, > > > triggers etc. > > > > Hmmm. But if we have a primary key on columns (A,B,C) and request the data > > 'order by A,B,C' this should be portable, shouldn't it? > > If we don't have a primary key simply ordering by 1,2,3,...n > > should also work. > > Or am I missing something? > > I can see how ordering a dump by the primary key would be a neat way of > 'clustering' your data after a restore, however I have qualms about the > scalability of such a scheme. What if someone has a 100GB table? They may > have arranged things so that they never get a sort from it or something, or > it might take ages. However I guess if it's an optional parameter it might > be neat. > > My feeling is that it won't happen unless you actually code it into a patch > that makes it a parameter to pg_dump. Having an actual patch is a great way > of getting something you want done ;) > > Alternatively, have you tried just writing a PERL script (or some clever sed > script) that will just sort the COPY FROM sections...? That's beyond my perl skills. And I believe sed to be not the right tool. (hmm, perhaps split (at 'COPY FROM' and at '\.'), then sort, then cat ... many (perhaps big) temporary files, let the db do the hard work) But making a patch to pg_dump is a matter of (say) up to 4 hours. I'll do it since you seem to like it and nobody started doing it so far. Christof
> But making a patch to pg_dump is a matter of (say) up to 4 hours. > I'll do it since you seem to like it and nobody started doing it so far. Well, I'm in no way a major developer, so even if I do like it, I don't know what the chances are of it making its way into the tree. Chris
Christopher Kings-Lynne wrote: > > But making a patch to pg_dump is a matter of (say) up to 4 hours. > > I'll do it since you seem to like it and nobody started doing it so far. > > Well, I'm in no way a major developer, so even if I do like it, I don't know > what the chances are of it making its way into the tree. If I stop using C++ comments '//', the chance might grew better ;-) [I apologize again] Since Philip also likes it ... I would say it's a good feature to have and up to now most of my patches went into the tree. So the chances are not that bad (though definitely not for 7.2). Christof
> The second (SQL conformance) is high on my list; a few people (Chris & > Stephen?) have been working hard to implement 'alter table add/etc > constraint'. When this is stable, we will move pg_dump in that direction. > But as of 7.1, there were still wrinkles in the the implementation that > meant it was unsuitable for pg_dump. Not sure about the status in 7.2. Well, the biggest thing I see on using alter table add constraint for foreign keys is the expense involved if you do it after the tables are populated. I chose the theoretical cleanliness of checking each row using the code we had over the speed of doing a special check for the alter table case, although I'm considering reversing that for 7.3 to make the alter table more reasonable and make it possible for you to consider doing it.
At 13:34 7/12/01 -0800, Stephan Szabo wrote: > >Well, the biggest thing I see on using alter table add constraint for >foreign keys is the expense involved if you do it after the tables are >populated. Is it really worse than loading the tables with the constraint in place? >I chose the theoretical cleanliness of checking each row >using the code we had over the speed of doing a special check for the >alter table case, Out of curiosity - what was the difference? By the sounds of it, we may get 'alter table' in pg_dump by 7.3 or 7.4. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
On Tue, 11 Dec 2001, Philip Warner wrote: > At 13:34 7/12/01 -0800, Stephan Szabo wrote: > > > >Well, the biggest thing I see on using alter table add constraint for > >foreign keys is the expense involved if you do it after the tables are > >populated. > > Is it really worse than loading the tables with the constraint in place? I'd say its better than while loading, but currently the check isn't performed at all I think, because the create constraint trigger statements are after data load and they don't check the data at all. At least that's how I remember it, I could be wrong. > >I chose the theoretical cleanliness of checking each row > >using the code we had over the speed of doing a special check for the > >alter table case, > > Out of curiosity - what was the difference? The check could be performed in a single statment on the fktable with a not exists (limit 1). I've sort of hoped that the optimizer would be able to potentially pick a better plan than run the subselect once for every row in the fktable. :) But at the time, I wasn't comfortable with mucking with the triggers themselves, and that would have meant having two things that each had a copy of the fk check logic. > By the sounds of it, we may get 'alter table' in pg_dump by 7.3 or 7.4. That'd be cool. :)
Christof Petig wrote: > Christopher Kings-Lynne wrote: > > > > But making a patch to pg_dump is a matter of (say) up to 4 hours. > > > I'll do it since you seem to like it and nobody started doing it so far. > > > > Well, I'm in no way a major developer, so even if I do like it, I don't know > > what the chances are of it making its way into the tree. > > If I stop using C++ comments '//', the chance might grew better ;-) [I > apologize again] > > Since Philip also likes it ... > I would say it's a good feature to have. Here's the patch. It's not as efficient as it might be (if dumpTable_order_by had indinfo around) but it works. I'm not clear about quoting when using sorted output in 'COPY' style. So if anybody has good test cases around (tables with strange characters), please check it. Also I don't know whether the sorting behaviour is sensible when it comes to inheritance. Can someone using inheritance please check it. If you like the patch I'll provide documentation patches. ----- This patch implements: -T alias '--sort' which sorts by primary key / the columns in output order Yours Christof
Attachment
Stephan Szabo wrote: > > On Tue, 11 Dec 2001, Philip Warner wrote: > > > At 13:34 7/12/01 -0800, Stephan Szabo wrote: > > > > > >Well, the biggest thing I see on using alter table add constraint for > > >foreign keys is the expense involved if you do it after the tables are > > >populated. > > > > Is it really worse than loading the tables with the constraint in place? > > I'd say its better than while loading, but currently the check isn't > performed at all I think, because the create constraint trigger > statements are after data load and they don't check the data at all. > At least that's how I remember it, I could be wrong. You're not. This discussion came up a couple of times, and the answer is allways the same. We don't want to define the constraints with ALTER TABLE because this means checking data on restore that doesn'tneed to be checked at all (in theory). If he has a crash of a critical system and restores from a dump, Ibet the farm that he wants it FAST. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com # _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
On Tue, 11 Dec 2001, Jan Wieck wrote: > Stephan Szabo wrote: > > > > On Tue, 11 Dec 2001, Philip Warner wrote: > > > > > At 13:34 7/12/01 -0800, Stephan Szabo wrote: > > > > > > > >Well, the biggest thing I see on using alter table add constraint for > > > >foreign keys is the expense involved if you do it after the tables are > > > >populated. > > > > > > Is it really worse than loading the tables with the constraint in place? > > > > I'd say its better than while loading, but currently the check isn't > > performed at all I think, because the create constraint trigger > > statements are after data load and they don't check the data at all. > > At least that's how I remember it, I could be wrong. > > You're not. This discussion came up a couple of times, and > the answer is allways the same. > > We don't want to define the constraints with ALTER TABLE > because this means checking data on restore that doesn't need > to be checked at all (in theory). If he has a crash of a > critical system and restores from a dump, I bet the farm that > he wants it FAST. I'd say as an optional parameter to dump, it's definately not a bad idea (like the idea of a --sql or whatever) since the user has to explicitly ask for it. I think for the rest of the cases it comes down to what people want it to do.
At 10:34 11/12/01 -0500, Jan Wieck wrote: > > We don't want to define the constraints with ALTER TABLE > because this means checking data on restore that doesn't need > to be checked at all (in theory). If he has a crash of a > critical system and restores from a dump, I bet the farm that > he wants it FAST. This is just an argument for (a) using ALTER TABLE (since it will also prevent PK indexes being created, and make it FASTer), and (b) the ability to 'SET ALL CONSTRAINTS OFF' (or similar) to prevent the ALTER TABLE from forcing validation of the constraint. The current situation of creating constraint triggers is IMO not acceptable in the long term. There are also enough people who just restore one table to warrant the ability for pg_dump to optionally run with constraints ON. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
If you're going to allow bypassing data integrity checks (great for speed!) perhaps one should be introduced to quickly confirm the integrity of the file itself? A checksum on the first line will validate the contents through the rest of the file. It'll take a few minutes to confirm a multi-GB sized file but in comparison to load time it may be worthwhile to look into. That way you can ensure it's the same as it was when it was dumped and fsck or other accidental editing didn't remove the middle of it. Intentional modifications won't be stopped but backups should be treated the same as the database is security wise. -- Rod Taylor This message represents the official view of the voices in my head ----- Original Message ----- From: "Philip Warner" <pjw@rhyme.com.au> To: "Jan Wieck" <janwieck@yahoo.com>; "Stephan Szabo" <sszabo@megazone23.bigpanda.com> Cc: "Christof Petig" <christof@petig-baender.de>; "PostgreSQL Hackers" <pgsql-hackers@postgresql.org> Sent: Tuesday, December 11, 2001 10:03 PM Subject: Re: [HACKERS] pg_dump: Sorted output, referential integrity > At 10:34 11/12/01 -0500, Jan Wieck wrote: > > > > We don't want to define the constraints with ALTER TABLE > > because this means checking data on restore that doesn't need > > to be checked at all (in theory). If he has a crash of a > > critical system and restores from a dump, I bet the farm that > > he wants it FAST. > > This is just an argument for (a) using ALTER TABLE (since it will > also prevent PK indexes being created, and make it FASTer), and > (b) the ability to 'SET ALL CONSTRAINTS OFF' (or similar) to > prevent the ALTER TABLE from forcing validation of the constraint. > > The current situation of creating constraint triggers is IMO not > acceptable in the long term. > > There are also enough people who just restore one table to warrant > the ability for pg_dump to optionally run with constraints ON. > > > ---------------------------------------------------------------- > Philip Warner | __---_____ > Albatross Consulting Pty. Ltd. |----/ - \ > (A.B.N. 75 008 659 498) | /(@) ______---_ > Tel: (+61) 0500 83 82 81 | _________ \ > Fax: (+61) 0500 83 82 82 | ___________ | > Http://www.rhyme.com.au | / \| > | --________-- > PGP key available upon request, | / > and from pgp5.ai.mit.edu:11371 |/ > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
Jan Wieck writes: > We don't want to define the constraints with ALTER TABLE > because this means checking data on restore that doesn't need > to be checked at all (in theory). If he has a crash of a > critical system and restores from a dump, I bet the farm that > he wants it FAST. Um, if he has a *crash* of a *critical* system, doesn't he want his data checked before he puts it back online? -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut wrote: > Jan Wieck writes: > > > We don't want to define the constraints with ALTER TABLE > > because this means checking data on restore that doesn't need > > to be checked at all (in theory). If he has a crash of a > > critical system and restores from a dump, I bet the farm that > > he wants it FAST. > > Um, if he has a *crash* of a *critical* system, doesn't he want his data > checked before he puts it back online? The data came (in theory!!!) from an intact, consistent database. So the dump content is (theoretically) knownto be consistent, thus no check required. The difference between theory and practice? There is none, theoretically :-) Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com # _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com