Thread: Backup using GiT?
I have recently had to teach myself how to use git and the thought came to me
that this tool might provide a fairly low setup cost way of passing pg_dumps
over the network to our off site data store. Think Rsync, but on a file
content basis; just the content diff gets transmitted.

GiT works by compressing deltas of the contents of successive versions of file
systems under repository control. It treats binary objects as just another
object under control. The question is, are successive (compressed) dumps of an
altered database sufficiently similar to make the deltas small enough to
warrant this approach?

Comments? (not on my sanity, please)

--
***  E-Mail is NOT a SECURE channel  ***
James B. Byrne                 mailto:ByrneJB@Harte-Lyne.ca
Harte & Lyne Limited           http://www.harte-lyne.ca
9 Brockley Drive               vox: +1 905 561 1241
Hamilton, Ontario              fax: +1 905 561 0757
Canada  L8E 3C3
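For concreteness, a minimal sketch of the workflow being asked about: dump the
database into a git working tree, commit, and push to an off-site remote,
letting git work out the deltas. The database name mydb, the repository path
/srv/pg-backup, and the remote name offsite are hypothetical placeholders, and
whether the deltas stay small is exactly the open question.

    #!/usr/bin/env python
    # Sketch only: commit successive pg_dump outputs into a git repository and
    # push them off site.  Database name, repository path, remote and branch
    # names are hypothetical placeholders.
    import subprocess
    from datetime import date

    DB = "mydb"                  # hypothetical database name
    REPO = "/srv/pg-backup"      # hypothetical, already-initialised git working tree
    DUMP = "%s/%s.sql" % (REPO, DB)

    # Plain-text (uncompressed) dump; whether a compressed dump would still
    # delta well is the question raised above.
    with open(DUMP, "w") as out:
        subprocess.run(["pg_dump", "--format=plain", DB], stdout=out, check=True)

    subprocess.run(["git", "-C", REPO, "add", DUMP], check=True)
    subprocess.run(["git", "-C", REPO, "commit", "--allow-empty",
                    "-m", "dump of %s, %s" % (DB, date.today())], check=True)
    subprocess.run(["git", "-C", REPO, "push", "offsite", "master"], check=True)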
James B. Byrne wrote:

> GiT works by compressing deltas of the contents of successive versions of
> file systems under repository control. It treats binary objects as just
> another object under control. The question is, are successive (compressed)
> dumps of an altered database sufficiently similar to make the deltas small
> enough to warrant this approach?

Hmm, perhaps a not completely insane thing to do would be to use another
pg_dump "format" (i.e. a new -F option) which stores the output in a Git repo.

--
Alvaro Herrera                       http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
James B. Byrne wrote:

> I have recently had to teach myself how to use git and the thought came to
> me that this tool might provide a fairly low setup cost way of passing
> pg_dumps over the network to our off site data store. Think Rsync, but on a
> file content basis; just the content diff gets transmitted.
>
> GiT works by compressing deltas of the contents of successive versions of
> file systems under repository control. It treats binary objects as just
> another object under control. The question is, are successive (compressed)
> dumps of an altered database sufficiently similar to make the deltas small
> enough to warrant this approach?
>
> Comments? (not on my sanity, please)

It probably depends on the number of changes in the database. For example, a
vacuum followed by an insert could result in records that were previously at
the start of the dump ending up somewhere else, like the middle of the dump
(i.e., a dead tuple is marked as available, then that space is "used" for an
insert). In such a case you would end up with a row that was unchanged but in
a different location in the file. Would Git then back that up? I would think
so. So in essence you'd be getting "at least a diff, but likely more".

Of course, I'm assuming you are just dumping the data in a table using
pg_dump. Once you start talking about a dumpall, you might find that smaller
changes (e.g., giving a user a new privilege) cause the rest of the file to be
offset even more. Add compression into the mix and I think you could find
there is little or no similarity at all.

On the other hand, if you were only doing inserts into an optimized (no dead
tuples) table, I would think you'd get a much better result. Perhaps you would
be better off using PITR in such cases?

--
Chander Ganesan
Open Technology Group, Inc.
One Copley Parkway, Suite 210
Morrisville, NC  27560
919-463-0999/877-258-8987
http://www.otg-nc.com
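The row-relocation effect described above is easy to see without a database at
all. A toy illustration in pure Python, with made-up rows: identical data in a
different physical order already produces a near-total line-level diff, before
compression even enters the picture.

    # Toy illustration: the same rows dumped in a different physical order
    # (e.g. after VACUUM frees space that a later INSERT reuses) give a large
    # textual diff even though no data changed.
    import difflib
    import random

    rows = ["%d\tuser%d\tuser%d@example.com" % (i, i, i) for i in range(1000)]

    dump_before = rows          # original physical order
    dump_after = rows[:]        # same data ...
    random.shuffle(dump_after)  # ... different physical order

    changed = sum(1 for line in difflib.unified_diff(dump_before, dump_after, lineterm="")
                  if line.startswith(("+", "-")) and not line.startswith(("+++", "---")))
    print("changed diff lines for identical data:", changed)  # approaches 2 * len(rows)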
"James B. Byrne" <byrnejb@harte-lyne.ca> writes: > I have recently had to teach myself how to use git and the thought came to me > that this tool might provide a fairly low setup cost way of passing pg_dumps > over the network to our off site data store. Think Rsync, but on a file > content basis; just the content diff gets transmitted. Huh? rsync does within-file diff transmission already. > GiT works by compressing deltas of the contents of successive versions of file > systems under repository control. It treats binary objects as just another > object under control. The question is, are successive (compressed) dumps of > an altered database sufficiently similar to make the deltas small enough to > warrant this approach? No. If you compress it, you can be pretty certain that the output will be different from the first point of difference to the end of the file. You'd have to work on uncompressed output, which might cost more than you'd end up saving ... regards, tom lane
Tom Lane wrote:

> "James B. Byrne" <byrnejb@harte-lyne.ca> writes:
>
> > GiT works by compressing deltas of the contents of successive versions of
> > file systems under repository control. It treats binary objects as just
> > another object under control. The question is, are successive (compressed)
> > dumps of an altered database sufficiently similar to make the deltas small
> > enough to warrant this approach?
>
> No.  If you compress it, you can be pretty certain that the output will be
> different from the first point of difference to the end of the file. You'd
> have to work on uncompressed output, which might cost more than you'd end up
> saving ...

The other problem is that since the tables are not dumped in any consistent
order, it's pretty unlikely that you'd get any similarity between two dumps of
the same table. To get any benefit, you'd need to get pg_dump to dump sorted
tuples.

--
Alvaro Herrera                       http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
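pg_dump itself has no option to sort tuples, but for a plain-data export the
same effect can be approximated from outside with COPY and an explicit ORDER
BY, so that successive dumps of unchanged data come out byte-identical. A
sketch, assuming a database mydb and a table mytable with primary key id (all
hypothetical names):

    # Sketch only: export one table's rows in a stable order so successive
    # dumps line up.  "mydb", "mytable" and "id" are hypothetical placeholders.
    import subprocess

    sql = "COPY (SELECT * FROM mytable ORDER BY id) TO STDOUT"
    with open("mytable.copy", "w") as out:
        subprocess.run(["psql", "-X", "-d", "mydb", "-c", sql], stdout=out, check=True)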
On Fri, Jun 13, 2008 at 11:11 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

> The other problem is that since the tables are not dumped in any consistent
> order, it's pretty unlikely that you'd get any similarity between two dumps
> of the same table. To get any benefit, you'd need to get pg_dump to dump
> sorted tuples.

The idea of using Git for backing up databases is not that bad. I would
propose the following (sketched below):

-- dump the creation script in a separate file (or maybe one file per object:
   table, view, function, etc.);
-- dump the content of each table into its own file;
-- dump the tuples sorted, in plain text (as COPY data or INSERTs, maybe), as
   Alvaro suggested;
-- don't use compression (as Tom and Chander suggested), because Git already
   compresses its packed files.

One advantage of using Git in this manner would be change tracking: with a
simple git diff you could see the modifications (inserts, updates, deletes,
schema alterations). Going a step further, you could also do merges between
multiple databases with the same structure (each database would have its own
branch). Just imagine how simple a database schema upgrade would be in most
situations, when both the development and the deployed schema have been
modified and we want to bring them into sync.

As a conclusion, I would subscribe to such an idea.

Ciprian Craciun.
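A sketch of the layout proposed above, under the same assumptions as the
earlier snippets (database name, repository path, table names and sort keys
are all hypothetical): schema in one file, each table's rows sorted and
uncompressed in its own file, then a single commit, after which git diff shows
row-level changes.

    # Sketch of the proposed layout: schema in one file, one sorted plain-text
    # COPY file per table, no compression, one commit per run.  All names are
    # hypothetical placeholders.
    import subprocess

    DB = "mydb"
    REPO = "/srv/pg-backup"
    TABLES = {"customers": "id", "orders": "id"}   # table -> sort key

    def run_to_file(cmd, path):
        with open(path, "w") as out:
            subprocess.run(cmd, stdout=out, check=True)

    # Schema only; splitting it one file per object would need extra
    # post-processing and is left out to keep the sketch short.
    run_to_file(["pg_dump", "--schema-only", DB], "%s/schema.sql" % REPO)

    # One sorted, plain-text COPY file per table.
    for table, key in sorted(TABLES.items()):
        sql = "COPY (SELECT * FROM %s ORDER BY %s) TO STDOUT" % (table, key)
        run_to_file(["psql", "-X", "-d", DB, "-c", sql], "%s/%s.copy" % (REPO, table))

    subprocess.run(["git", "-C", REPO, "add", "-A"], check=True)
    subprocess.run(["git", "-C", REPO, "commit", "--allow-empty", "-m", "table dumps"], check=True)
    # "git diff HEAD~1" in REPO now shows inserts, updates and deletes as line changes.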