Thread: Backup using GiT?
I have recently had to teach myself how to use git and the thought came to me
that this tool might provide a fairly low setup cost way of passing pg_dumps
over the network to our off site data store. Think Rsync, but on a file
content basis; just the content diff gets transmitted.

GiT works by compressing deltas of the contents of successive versions of file
systems under repository control. It treats binary objects as just another
object under control. The question is, are successive (compressed) dumps of an
altered database sufficiently similar to make the deltas small enough to
warrant this approach?

Comments? (not on my sanity, please)

--
***  E-Mail is NOT a SECURE channel  ***
James B. Byrne                 mailto:ByrneJB@Harte-Lyne.ca
Harte & Lyne Limited           http://www.harte-lyne.ca
9 Brockley Drive               vox: +1 905 561 1241
Hamilton, Ontario              fax: +1 905 561 0757
Canada  L8E 3C3
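For concreteness, a minimal sketch of the workflow being asked about: dump the
database into a git working tree, commit, and push to an off-site remote,
letting git work out the deltas. The database name mydb, the repository path
/srv/pg-backup, and the remote name offsite are hypothetical placeholders, and
whether the deltas stay small is exactly the open question.

    #!/usr/bin/env python
    # Sketch only: commit successive pg_dump outputs into a git repository and
    # push them off site.  Database name, repository path, remote and branch
    # names are hypothetical placeholders.
    import subprocess
    from datetime import date

    DB = "mydb"                  # hypothetical database name
    REPO = "/srv/pg-backup"      # hypothetical, already-initialised git working tree
    DUMP = "%s/%s.sql" % (REPO, DB)

    # Plain-text (uncompressed) dump; whether a compressed dump would still
    # delta well is the question raised above.
    with open(DUMP, "w") as out:
        subprocess.run(["pg_dump", "--format=plain", DB], stdout=out, check=True)

    subprocess.run(["git", "-C", REPO, "add", DUMP], check=True)
    subprocess.run(["git", "-C", REPO, "commit", "--allow-empty",
                    "-m", "dump of %s, %s" % (DB, date.today())], check=True)
    subprocess.run(["git", "-C", REPO, "push", "offsite", "master"], check=True)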
James B. Byrne wrote:

> GiT works by compressing deltas of the contents of successive versions of
> file systems under repository control. It treats binary objects as just
> another object under control. The question is, are successive (compressed)
> dumps of an altered database sufficiently similar to make the deltas small
> enough to warrant this approach?

Hmm, perhaps a not completely insane thing to do would be to use another
pg_dump "format" (i.e. a new -F option) which stores the output in a Git repo.

--
Alvaro Herrera                       http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
James B. Byrne wrote:

> I have recently had to teach myself how to use git and the thought came to
> me that this tool might provide a fairly low setup cost way of passing
> pg_dumps over the network to our off site data store. Think Rsync, but on a
> file content basis; just the content diff gets transmitted.
>
> GiT works by compressing deltas of the contents of successive versions of
> file systems under repository control. It treats binary objects as just
> another object under control. The question is, are successive (compressed)
> dumps of an altered database sufficiently similar to make the deltas small
> enough to warrant this approach?
>
> Comments? (not on my sanity, please)

It probably depends on the number of changes in the database. For example, a
vacuum followed by an insert could result in records that were previously at
the start of the dump ending up somewhere else, like the middle of the dump
(i.e., a dead tuple is marked as available, then that space is "used" for an
insert). In such a case you would end up with a row that was unchanged but in
a different location in the file. Would Git then back that up? I would think
so. So in essence you'd be getting "at least a diff, but likely more".

Of course, I'm assuming you are just dumping the data in a table using
pg_dump. Once you start talking about a dumpall, you might find that smaller
changes (e.g., giving a user a new privilege) cause the rest of the file to be
offset even more. Add compression into the mix and I think you could find
there is little or no similarity at all.

On the other hand, if you were only doing inserts into an optimized (no dead
tuples) table, I would think you'd get a much better result. Perhaps you would
be better off using PITR in such cases?

--
Chander Ganesan
Open Technology Group, Inc.
One Copley Parkway, Suite 210
Morrisville, NC  27560
919-463-0999/877-258-8987
http://www.otg-nc.com
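The row-relocation effect described above is easy to see without a database at
all. A toy illustration in pure Python, with made-up rows: identical data in a
different physical order already produces a near-total line-level diff, before
compression even enters the picture.

    # Toy illustration: the same rows dumped in a different physical order
    # (e.g. after VACUUM frees space that a later INSERT reuses) give a large
    # textual diff even though no data changed.
    import difflib
    import random

    rows = ["%d\tuser%d\tuser%d@example.com" % (i, i, i) for i in range(1000)]

    dump_before = rows          # original physical order
    dump_after = rows[:]        # same data ...
    random.shuffle(dump_after)  # ... different physical order

    changed = sum(1 for line in difflib.unified_diff(dump_before, dump_after, lineterm="")
                  if line.startswith(("+", "-")) and not line.startswith(("+++", "---")))
    print("changed diff lines for identical data:", changed)  # approaches 2 * len(rows)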
"James B. Byrne" <byrnejb@harte-lyne.ca> writes: > I have recently had to teach myself how to use git and the thought came to me > that this tool might provide a fairly low setup cost way of passing pg_dumps > over the network to our off site data store. Think Rsync, but on a file > content basis; just the content diff gets transmitted. Huh? rsync does within-file diff transmission already. > GiT works by compressing deltas of the contents of successive versions of file > systems under repository control. It treats binary objects as just another > object under control. The question is, are successive (compressed) dumps of > an altered database sufficiently similar to make the deltas small enough to > warrant this approach? No. If you compress it, you can be pretty certain that the output will be different from the first point of difference to the end of the file. You'd have to work on uncompressed output, which might cost more than you'd end up saving ... regards, tom lane
Tom Lane wrote:

> "James B. Byrne" <byrnejb@harte-lyne.ca> writes:
>
> > GiT works by compressing deltas of the contents of successive versions of
> > file systems under repository control. It treats binary objects as just
> > another object under control. The question is, are successive (compressed)
> > dumps of an altered database sufficiently similar to make the deltas small
> > enough to warrant this approach?
>
> No.  If you compress it, you can be pretty certain that the output will be
> different from the first point of difference to the end of the file. You'd
> have to work on uncompressed output, which might cost more than you'd end up
> saving ...

The other problem is that since the tables are not dumped in any consistent
order, it's pretty unlikely that you'd get any similarity between two dumps of
the same table. To get any benefit, you'd need to get pg_dump to dump sorted
tuples.

--
Alvaro Herrera                       http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
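pg_dump itself has no option to sort tuples, but for a plain-data export the
same effect can be approximated from outside with COPY and an explicit ORDER
BY, so that successive dumps of unchanged data come out byte-identical. A
sketch, assuming a database mydb and a table mytable with primary key id (all
hypothetical names):

    # Sketch only: export one table's rows in a stable order so successive
    # dumps line up.  "mydb", "mytable" and "id" are hypothetical placeholders.
    import subprocess

    sql = "COPY (SELECT * FROM mytable ORDER BY id) TO STDOUT"
    with open("mytable.copy", "w") as out:
        subprocess.run(["psql", "-X", "-d", "mydb", "-c", sql], stdout=out, check=True)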
On Fri, Jun 13, 2008 at 11:11 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

> The other problem is that since the tables are not dumped in any consistent
> order, it's pretty unlikely that you'd get any similarity between two dumps
> of the same table. To get any benefit, you'd need to get pg_dump to dump
> sorted tuples.

The idea of using Git for backing up databases is not that bad. I would
propose the following (sketched below):

-- dump the creation script in a separate file (or maybe one file per object:
   table, view, function, etc.);
-- dump the content of each table into its own file;
-- dump the tuples sorted, in plain text (as COPY data or INSERTs, maybe), as
   Alvaro suggested;
-- don't use compression (as Tom and Chander suggested), because Git already
   compresses its packed files.

One advantage of using Git in this manner would be change tracking: with a
simple git diff you could see the modifications (inserts, updates, deletes,
schema alterations). Going a step further, you could also do merges between
multiple databases with the same structure (each database would have its own
branch). Just imagine how simple a database schema upgrade would be in most
situations, when both the development and the deployed schema have been
modified and we want to bring them into sync.

As a conclusion, I would subscribe to such an idea.

Ciprian Craciun.
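A sketch of the layout proposed above, under the same assumptions as the
earlier snippets (database name, repository path, table names and sort keys
are all hypothetical): schema in one file, each table's rows sorted and
uncompressed in its own file, then a single commit, after which git diff shows
row-level changes.

    # Sketch of the proposed layout: schema in one file, one sorted plain-text
    # COPY file per table, no compression, one commit per run.  All names are
    # hypothetical placeholders.
    import subprocess

    DB = "mydb"
    REPO = "/srv/pg-backup"
    TABLES = {"customers": "id", "orders": "id"}   # table -> sort key

    def run_to_file(cmd, path):
        with open(path, "w") as out:
            subprocess.run(cmd, stdout=out, check=True)

    # Schema only; splitting it one file per object would need extra
    # post-processing and is left out to keep the sketch short.
    run_to_file(["pg_dump", "--schema-only", DB], "%s/schema.sql" % REPO)

    # One sorted, plain-text COPY file per table.
    for table, key in sorted(TABLES.items()):
        sql = "COPY (SELECT * FROM %s ORDER BY %s) TO STDOUT" % (table, key)
        run_to_file(["psql", "-X", "-d", DB, "-c", sql], "%s/%s.copy" % (REPO, table))

    subprocess.run(["git", "-C", REPO, "add", "-A"], check=True)
    subprocess.run(["git", "-C", REPO, "commit", "--allow-empty", "-m", "table dumps"], check=True)
    # "git diff HEAD~1" in REPO now shows inserts, updates and deletes as line changes.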