Re: pg_dump --split patch - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: pg_dump --split patch
Date
Msg-id AANLkTim+sFO7N539V5C+yZFx7_fTFQxdHxtCUyhPD-3V@mail.gmail.com
In response to Re: pg_dump --split patch  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pg_dump --split patch  (Andrew Dunstan <andrew@dunslane.net>)
Re: pg_dump --split patch  (Greg Smith <greg@2ndquadrant.com>)
List pgsql-hackers
2010/12/29 Tom Lane <tgl@sss.pgh.pa.us>

If you've solved the deterministic-ordering problem, then this entire
patch is quite useless.  You can just run a normal dump and diff it.


No, that's only half true.

Diff will do a good job minimizing the "size" of the diff output, yes, but such a diff is still quite useless if you want to quickly grasp the context of the change.

If you have hundreds of functions, looking only at the changed source lines is not enough to figure out which functions were modified, unless you have the brain power to memorize every single line of code and can work out the function name just from the old and new lines of code.

To understand a change to my database functions, I would start at the top level, focusing only on the names of the functions modified/added/removed.
At this stage, you want as little information as possible about each change, such as only the names of the functions.
To get such a list of changed functions, you cannot compare two full schema plain-text dumps using diff, as it would only reveal the changed lines, not the names of the functions, unless you are lucky enough to get the function name within the (by default) 3 lines of copied context.

While you could increase the number of copied lines of context to a value which would ensure you would see the name of the function in the diff, that is not feasible if you want to quickly "get a picture" of the code areas modified, since you would then need to read through even more lines of diff output.
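
To illustrate the point about context lines, here is a minimal sketch using Python's difflib as a stand-in for diff. The function calc_fee and its body are made up for the example; the only claim is that a change deep inside a long function body yields a hunk that never mentions the function name at 3 lines of context, and that raising the context makes the name visible at the cost of a much longer diff.

```python
import difflib

def make_dump(changed_line):
    """Build a fake plain-text dump holding one long function (hypothetical)."""
    lines = ["CREATE FUNCTION calc_fee(amount numeric) RETURNS numeric AS $$", "BEGIN"]
    lines += [f"    -- step {i}" for i in range(1, 9)]
    lines += [changed_line]
    lines += [f"    -- step {i}" for i in range(9, 17)]
    lines += ["END;", "$$ LANGUAGE plpgsql;"]
    return lines

old = make_dump("    RETURN amount * 0.01;")
new = make_dump("    RETURN amount * 0.02;")

# Default-style unified diff with 3 lines of context.
short_diff = list(difflib.unified_diff(old, new, "old.sql", "new.sql", n=3, lineterm=""))
# Same diff with enough context to reach the CREATE FUNCTION line.
long_diff = list(difflib.unified_diff(old, new, "old.sql", "new.sql", n=20, lineterm=""))

name_in_short = any("calc_fee" in line for line in short_diff)
name_in_long = any("calc_fee" in line for line in long_diff)

print("\n".join(short_diff))
print("function name visible with n=3: ", name_in_short)
print("function name visible with n=20:", name_in_long)
print("diff lines: n=3 ->", len(short_diff), " n=20 ->", len(long_diff))
```

With hundreds of functions, the larger context setting multiplies the amount of output to read, which is the trade-off described above.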

For a less database-centric system where you don't have hundreds of stored procedures, I would agree it's not an issue to keep track of changes by diffing entire schema files, but for extremely database-centric systems, such as the one we have developed at my company, it's not possible to "get the whole picture" of a change by analyzing diffs of entire schema dumps.

The patch has been updated:

*) Only split objects whose namespace (schema) is not null
*) Append all objects of same tag (name) of same type (desc) of same namespace (schema) to the same file (i.e., do not append -2, -3, like before) (Suggested by David Wilson, thanks.)
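
The two rules above can be sketched roughly as follows. This is not the patch's code (the patch is C inside pg_dump); the schema/desc/tag.sql path layout, the function name split_objects, and the sample objects are assumptions made only for illustration.

```python
from collections import defaultdict

def split_objects(objects):
    """Group dump objects into per-file lists of definitions.

    Each object is a (schema, desc, tag, definition) tuple. The path layout
    used here is hypothetical, not necessarily what the patch writes.
    """
    files = defaultdict(list)
    for schema, desc, tag, definition in objects:
        if schema is None:
            # Rule 1: objects without a namespace are not split out.
            continue
        path = f"{schema}/{desc}/{tag}.sql"
        # Rule 2: same schema, type and name -> append to the same file,
        # instead of creating tag-2.sql, tag-3.sql, ...
        files[path].append(definition)
    return dict(files)

# Hypothetical sample: two overloads of one function, plus a schema-less object.
objects = [
    ("public", "FUNCTION", "calc_fee", "CREATE FUNCTION calc_fee(numeric) ..."),
    ("public", "FUNCTION", "calc_fee", "CREATE FUNCTION calc_fee(numeric, text) ..."),
    ("public", "TABLE", "accounts", "CREATE TABLE accounts (...);"),
    (None, "ENCODING", "ENCODING", "SET client_encoding = 'UTF8';"),
]

result = split_objects(objects)
for path, definitions in sorted(result.items()):
    print(path, len(definitions))
```

Both overloads of calc_fee end up in a single file, so the file name alone identifies the function regardless of how many argument variants exist.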

I also experimented with adding "ORDER BY pronargs" and "ORDER BY pronargs DESC" to the queries in getFuncs() in pg_dump.c, but it had no effect on the order in which functions of the same name but different number of arguments were dumped.
Perhaps functions are already sorted?
Anyway, it doesn't matter that much; keeping all functions of the same name in the same file is a fair trade-off, I think. The main advantage is the ability to quickly get a picture of the names of all changed functions; the secondary advantage is optimizing the actual diff output.
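
As a rough sketch of that main advantage: once each function lives in its own file, the list of changed function names falls out of a plain directory comparison. The directory layout and file names below are hypothetical, and the helper summarize_changes is not part of the patch.

```python
import tempfile
from pathlib import Path

def summarize_changes(old_dir, new_dir):
    """Return (added, removed, modified) relative paths of .sql files."""
    def load(root):
        root = Path(root)
        return {p.relative_to(root).as_posix(): p.read_bytes() for p in root.rglob("*.sql")}
    old, new = load(old_dir), load(new_dir)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, modified

def write_dump(root, files):
    """Write a {relative_path: text} mapping under root (demo helper)."""
    for rel, text in files.items():
        path = Path(root) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)

# Demo with two made-up split dumps.
with tempfile.TemporaryDirectory() as old_dir, tempfile.TemporaryDirectory() as new_dir:
    write_dump(old_dir, {
        "public/FUNCTION/calc_fee.sql": "RETURN amount * 0.01;",
        "public/FUNCTION/old_report.sql": "SELECT 1;",
        "public/FUNCTION/unchanged.sql": "SELECT 2;",
    })
    write_dump(new_dir, {
        "public/FUNCTION/calc_fee.sql": "RETURN amount * 0.02;",
        "public/FUNCTION/new_report.sql": "SELECT 3;",
        "public/FUNCTION/unchanged.sql": "SELECT 2;",
    })
    added, removed, modified = summarize_changes(old_dir, new_dir)

print("added:   ", added)
print("removed: ", removed)
print("modified:", modified)
```

Only after seeing this short list would you open the diff of an individual file to read the actual line changes.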


--
Best regards,

Joel Jacobson
Glue Finance

E: jj@gluefinance.com
T: +46 70 360 38 01

Postal address:
Glue Finance AB
Box  549
114 11  Stockholm
Sweden

Visiting address:
Glue Finance AB
Birger Jarlsgatan 14
114 34 Stockholm
Sweden