Thread: managing git disk space usage

managing git disk space usage

From
Robert Haas
Date:
Tom and, I believe, also Andrew have expressed some concerns about the
space that will be taken up by having multiple copies of the git
repository on their systems.  While most users can probably get by
with a single repository, committers will likely need one for each
back-branch that they work with, and we have quite a few of those.

After playing around with this a bit, I've come to the conclusion that
there are a couple of possible options but they've all got some
drawbacks.

1. Clone the origin.  Then, clone the clone n times locally.  This
uses hard links, so it saves disk space.  But, every time you want to
pull, you first have to pull to the "main" clone, and then to each of
the "slave" clones.  And same thing when you want to push.

2. Clone the origin n times.  Use more disk space.  Live with it.  :-)

3. Clone the origin once.  Apply patches to multiple branches by
switching branches.  Playing around with it, this is probably a
tolerable way to work when you're only going back one or two branches
but it's certainly a big nuisance when you're going back 5-7 branches.

4. Clone the origin.  Use that to get at the master branch.  Then
clone that clone n-1 times, one for each back-branch.  This makes it a
bit easier to push and pull when you're only dealing with the master
branch, but you still have the double push/double pull problem for all
the other branches.

5. Use git clone --shared or git clone --reference or
git-new-workdir.  While I once thought this was the solution, I can't
take very seriously any solution that has a warning in the manual that
says, essentially, git gc may corrupt your repository if you do this.

I'm not really sure which of these I'm going to do yet, and I'm not
sure what to recommend to others, either.
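
For illustration, here is a minimal sketch of option 1 and its double pull. It is not a recommended setup. A throwaway local repository stands in for the real origin so the script runs offline; the names upstream, main and rel84 are assumptions made up for this example.

```shell
#!/bin/sh
# Sketch of option 1: one "main" clone plus a local clone of that clone.
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the real origin (assumption, not the project repository).
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
echo one > upstream/README
git -C upstream add README
git -C upstream -c user.name=demo -c user.email=demo@example.com commit -q -m "first"

# The "main" clone, then a clone of the clone. Cloning from a plain
# local path lets git hard-link the object files instead of copying them.
git clone -q upstream main
git clone -q main rel84

# Something new shows up in the origin.
echo two >> upstream/README
git -C upstream -c user.name=demo -c user.email=demo@example.com commit -q -a -m "second"

# The double pull: first into the main clone, then into each clone of it.
git -C main pull -q --ff-only
git -C rel84 pull -q --ff-only

upstream_head=$(git -C upstream rev-parse HEAD)
rel84_head=$(git -C rel84 rev-parse HEAD)
echo "upstream=$upstream_head rel84=$rel84_head"
```

If the pull into main is skipped, the pull into rel84 finds nothing new, which is the nuisance described in option 1.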

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: managing git disk space usage

From
Aidan Van Dyk
Date:
* Robert Haas <robertmhaas@gmail.com> [100720 13:04]:
> 3. Clone the origin once.  Apply patches to multiple branches by
> switching branches.  Playing around with it, this is probably a
> tolerable way to work when you're only going back one or two branches
> but it's certainly a big nuisance when you're going back 5-7 branches.

This is what I do when I'm working on a project that has completely
proper dependencies, and you don't need to always re-run configure
between different branches.  I use ccache heavily, so configure takes
longer than a complete build with a couple-dozen
actually-not-previously-seen changes...

But *all* dependencies need to be proper in the build system, or you end
up needing a git-clean-type-cleanup between branch switches, forcing a
new configure run too, which takes too much time...

Maybe this will cause make dependencies to be refined in PG ;-)

It has the advantage that if "back patching" (or in reality, forward
patching) all happens in one repository, the git conflict machinery is all
using the same cache of resolutions, meaning that if you apply the same
patch to two different branches, with identical code/conflict, you don't
need to do the whole conflict resolution by hand from scratch in the second
branch.
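
A minimal sketch of that effect, assuming the "cache of resolutions" meant here is git's rerere mechanism (off by default, enabled with rerere.enabled). The repository, the file f, and the branch names old, b1 and b2 are invented for the example.

```shell
#!/bin/sh
# Sketch: rerere replays a hand-made conflict resolution on a second branch.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master
git config rerere.enabled true   # turn on the resolution cache

echo "base" > f
git add f
git commit -q -m "base"

# Two back branches that carry the same older code.
git checkout -q -b old
echo "old-branch text" > f
git commit -q -a -m "old code"
git branch b1
git branch b2

# The fix lands on master.
git checkout -q master
echo "new text" > f
git commit -q -a -m "fix"
fix=$(git rev-parse HEAD)

# First back branch: the pick conflicts and is resolved by hand.
git checkout -q b1
git cherry-pick "$fix" >/dev/null 2>&1 || true
echo "resolved text" > f
git add f
git commit -q -m "fix (back-patched)"

# Second back branch: identical conflict, so rerere rewrites the file
# in the working tree with the recorded resolution. The pick still
# stops, and the result has to be reviewed, added and committed.
git checkout -q b2
git cherry-pick "$fix" >/dev/null 2>&1 || true
result=$(cat f)
echo "f on b2 after the second pick: $result"
```

The recorded resolutions live inside the one repository, so separate clones do not share them.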

> 5. Use git clone --shared or git clone --reference or
> git-new-workdir.  While I once thought this was the solution, I can't
> take very seriously any solution that has a warning in the manual that
> says, essentially, git gc may corrupt your repository if you do this.

This is the type of setup I often use.  I have a "central" set of git
repos that are straight mirror-clones of project repositories, kept
up-to-date automatically via cron.  And any time I clone a work repo,
I use --reference.

Since I make sure I don't "remove" anything from the reference repo, I
don't have to worry about losing objects other repositories might be
using from the "cache" repo.  In case anyone is wondering, that's:

   git clone --mirror $REPO /data/src/cache/$project.git
   git --git-dir=/data/src/cache/$project.git config gc.auto 0

And then in crontab:

   git --git-dir=/data/src/cache/$project.git fetch --quiet --all

With gc.auto disabled, and the only commands ever run being "git fetch",
no objects are removed, even if a remote rewinds and throws away
commits.
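
A minimal sketch of the whole arrangement, including the work clone made with --reference. A temporary local repository stands in for the project repository, and the directory names (upstream, cache.git, work) are assumptions for the example rather than the paths used above.

```shell
#!/bin/sh
# Sketch: a mirror "cache" repository and a work clone that borrows from it.
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the project repository (assumption).
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
echo one > upstream/README
git -C upstream add README
git -C upstream -c user.name=demo -c user.email=demo@example.com commit -q -m "first"
repo_url="file://$tmp/upstream"

# The cache: a mirror clone with automatic gc switched off.
git clone -q --mirror "$repo_url" cache.git
git --git-dir=cache.git config gc.auto 0

# What the cron job would run to keep the cache current.
git --git-dir=cache.git fetch --quiet --all

# A work clone that borrows objects from the cache via an alternates file.
git clone -q --reference "$tmp/cache.git" "$repo_url" work

alternates=$(cat work/.git/objects/info/alternates)
gc_auto=$(git --git-dir=cache.git config gc.auto)
echo "work borrows objects from: $alternates"
```

The work clone depends on the cache for as long as the alternates file exists, which is why nothing may ever be removed from the cache.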

But this way means that the separate repos only share the "past, from
central repository" history, which means that you have to jump through
hoops if you want to be able to use git's handy
merging/cherry-picking/conflict tools when trying to rebase/port
patches between branches.  You're pretty much limited to exporting a
patch, changing to the new branch-repository, and applying the patch.

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: managing git disk space usage

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> 2. Clone the origin n times.  Use more disk space.  Live with it.  :-)

But each copy uses almost 0.36% of the formatted space on my 150GB
drive!
-Kevin


Re: managing git disk space usage

From
Peter Eisentraut
Date:
On tis, 2010-07-20 at 13:28 -0400, Aidan Van Dyk wrote:
> But *all* dependencies need to be proper in the build system, or you end
> up needing a git-clean-type-cleanup between branch switches, forcing a
> new configure run too, which takes too much time...

This realization, while true, doesn't really help, because we are
talking about maintaining 5+ year old back branches, where we are not
going to fiddle with the build system at this time.  Also, the switch
from 9.0 to 9.1 the other day showed everyone who cared to watch that
the dependencies are currently not correct for major version switches,
so this method will definitely not work at the moment.



Re: managing git disk space usage

From
Peter Eisentraut
Date:
On tis, 2010-07-20 at 13:04 -0400, Robert Haas wrote:
> 2. Clone the origin n times.  Use more disk space.  Live with it.  :-)

Well, I plan to use cp -a to avoid cloning over the network n times, but
other than that, that was my plan.  My .git directory currently takes 283
MB, so I think I can just about live with that.
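
A minimal sketch of the cp -a approach, assuming one copy per back branch. The stand-in repository and the directory name postgresql-8.4 are made up for the example; REL8_4_STABLE is used as a sample branch name.

```shell
#!/bin/sh
# Sketch: clone once, then copy the clone for each back branch.
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the real origin (assumption), with one back branch.
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
echo one > upstream/README
git -C upstream add README
git -C upstream -c user.name=demo -c user.email=demo@example.com commit -q -m "first"
git -C upstream branch REL8_4_STABLE

# The only clone that would go over the network.
git clone -q "file://$tmp/upstream" postgresql

# A full copy per back branch. Nothing is shared between the copies,
# but each one keeps the original origin URL and can pull and push
# directly.
cp -a postgresql postgresql-8.4
git -C postgresql-8.4 checkout -q -t origin/REL8_4_STABLE

copy_branch=$(git -C postgresql-8.4 rev-parse --abbrev-ref HEAD)
copy_url=$(git -C postgresql-8.4 config remote.origin.url)
main_url=$(git -C postgresql config remote.origin.url)
echo "copy is on $copy_branch, origin is $copy_url"
```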



Re: managing git disk space usage

From
Andrew Dunstan
Date:

Robert Haas wrote:
> Tom and, I believe, also Andrew have expressed some concerns about the
> space that will be taken up by having multiple copies of the git
> repository on their systems.  While most users can probably get by
> with a single repository, committers will likely need one for each
> back-branch that they work with, and we have quite a few of those.
>
> After playing around with this a bit, I've come to the conclusion that
> there are a couple of possible options but they've all got some
> drawbacks.
>
> 1. Clone the origin.  Then, clone the clone n times locally.  This
> uses hard links, so it saves disk space.  But, every time you want to
> pull, you first have to pull to the "main" clone, and then to each of
> the "slave" clones.  And same thing when you want to push.

You can have a cron job that does the first pull fairly frequently. It 
should be a fairly cheap operation unless the git protocol is dumber 
than I think.

The second pull is the equivalent of what we do now with "cvs update".

Given that, you could push commits direct to the authoritative repo and 
wait for the cron job to catch up your local base clone.

I think that's the pattern I will probably try to follow.

cheers

andrew


Re: managing git disk space usage

From
Robert Haas
Date:
On Wed, Jul 21, 2010 at 6:17 AM, Abhijit Menon-Sen <ams@toroid.org> wrote:
> At 2010-07-20 13:04:12 -0400, robertmhaas@gmail.com wrote:
>>
>> 1. Clone the origin.  Then, clone the clone n times locally.  This
>> uses hard links, so it saves disk space.  But, every time you want to
>> pull, you first have to pull to the "main" clone, and then to each of
>> the "slave" clones.  And same thing when you want to push.
>
> If your extra clones are for occasionally-touched back branches, then:
>
> (a) In my experience, it is almost always much easier to work with many
> branches and move patches between them rather than use multiple clones;
> but
>
> (b) You don't need to do the double-pull and push. Clone your local
> repository as many times as needed, but create new git-remote(1)s in
> each extra clone and pull/push only the branch you care about directly
> from or to the remote. That way, you'll start off with the bulk of the
> storage shared with your main local repository, and "waste" a few KB
> when you make (presumably infrequent) new changes.

Ah, that is clever.  Perhaps we need to write up directions on how to do that.

> But that brings me to another point:
>
> In my experience (doing exactly this kind of old-branch-maintenance with
> Archiveopteryx), git doesn't help you much if you want to backport (i.e.
> cherry-pick) changes from a development branch to old release branches.
> It is much more helpful when you make changes to the *oldest* applicable
> branch and bring it *forward* to your development branch (by merging the
> old branch into your master). Cherry-picking can be done, but it becomes
> painful after a while.

Well, per previous discussion, we're not going to change that at this
point, or maybe ever.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: managing git disk space usage

From
Magnus Hagander
Date:
On Wed, Jul 21, 2010 at 12:39, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 21, 2010 at 6:17 AM, Abhijit Menon-Sen <ams@toroid.org> wrote:
>> At 2010-07-20 13:04:12 -0400, robertmhaas@gmail.com wrote:
>>>
>>> 1. Clone the origin.  Then, clone the clone n times locally.  This
>>> uses hard links, so it saves disk space.  But, every time you want to
>>> pull, you first have to pull to the "main" clone, and then to each of
>>> the "slave" clones.  And same thing when you want to push.
>>
>> If your extra clones are for occasionally-touched back branches, then:
>>
>> (a) In my experience, it is almost always much easier to work with many
>> branches and move patches between them rather than use multiple clones;
>> but
>>
>> (b) You don't need to do the double-pull and push. Clone your local
>> repository as many times as needed, but create new git-remote(1)s in
>> each extra clone and pull/push only the branch you care about directly
>> from or to the remote. That way, you'll start off with the bulk of the
>> storage shared with your main local repository, and "waste" a few KB
>> when you make (presumably infrequent) new changes.
>
> Ah, that is clever.  Perhaps we need to write up directions on how to do that.

Yeah, that's the way I work with some projects at least.


>> But that brings me to another point:
>>
>> In my experience (doing exactly this kind of old-branch-maintenance with
>> Archiveopteryx), git doesn't help you much if you want to backport (i.e.
>> cherry-pick) changes from a development branch to old release branches.
>> It is much more helpful when you make changes to the *oldest* applicable
>> branch and bring it *forward* to your development branch (by merging the
>> old branch into your master). Cherry-picking can be done, but it becomes
>> painful after a while.
>
> Well, per previous discussion, we're not going to change that at this
> point, or maybe ever.

Nope, the deal was definitely that we stick to the current workflow.

Yes, this means we can't use git cherry-pick or similar git-specific
tools to make life easier. But it shouldn't make life harder than it
is *now*, with cvs.


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: managing git disk space usage

From
Abhijit Menon-Sen
Date:
At 2010-07-20 13:04:12 -0400, robertmhaas@gmail.com wrote:
>
> 1. Clone the origin.  Then, clone the clone n times locally.  This
> uses hard links, so it saves disk space.  But, every time you want to
> pull, you first have to pull to the "main" clone, and then to each of
> the "slave" clones.  And same thing when you want to push.

If your extra clones are for occasionally-touched back branches, then:

(a) In my experience, it is almost always much easier to work with many
branches and move patches between them rather than use multiple clones;
but

(b) You don't need to do the double-pull and push. Clone your local
repository as many times as needed, but create new git-remote(1)s in
each extra clone and pull/push only the branch you care about directly
from or to the remote. That way, you'll start off with the bulk of the
storage shared with your main local repository, and "waste" a few KB
when you make (presumably infrequent) new changes.

But that brings me to another point:

In my experience (doing exactly this kind of old-branch-maintenance with
Archiveopteryx), git doesn't help you much if you want to backport (i.e.
cherry-pick) changes from a development branch to old release branches.
It is much more helpful when you make changes to the *oldest* applicable
branch and bring it *forward* to your development branch (by merging the
old branch into your master). Cherry-picking can be done, but it becomes
painful after a while.

See http://toroid.org/ams/etc/git-merge-vs-p4-integrate for more.

-- ams


Re: managing git disk space usage

From
Abhijit Menon-Sen
Date:
At 2010-07-21 06:39:28 -0400, robertmhaas@gmail.com wrote:
>
> Perhaps we need to write up directions on how to do that.

I'll write them if you tell me where to put them. It's trivial.

> Well, per previous discussion, we're not going to change that at this
> point, or maybe ever.

Sure. I just wanted to mention it, because it's something I learned the
hard way. It's also true that back-porting changes is a bigger deal for
Postgres than it was for me (in the sense that it's an exception rather
than a routine activity), and individual changes are usually backported
as soon as, or very soon after, they are committed; so it should be less
painful on the whole.

Another point, in response to Magnus's followup:

At 2010-07-21 12:42:03 +0200, magnus@hagander.net wrote:
>
> Yes, this means we can't use git cherry-pick or similar git-specific
> tools to make life easier.

No, that's not right. You *can* use cherry-pick; in fact, it's the sane
way to backport the occasional change. What you can't do is efficiently
manage a queue of changes to be backported to multiple branches. But as
I said above, that's not exactly what we want to do for Postgres, so it
should not matter too much.

-- ams


Re: managing git disk space usage

From
Robert Haas
Date:
On Wed, Jul 21, 2010 at 6:56 AM, Abhijit Menon-Sen <ams@toroid.org> wrote:
> At 2010-07-21 06:39:28 -0400, robertmhaas@gmail.com wrote:
>>
>> Perhaps we need to write up directions on how to do that.
>
> I'll write them if you tell me where to put them. It's trivial.

Post 'em here or drop them on the wiki and post a link.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: managing git disk space usage

From
Abhijit Menon-Sen
Date:
At 2010-07-21 06:57:53 -0400, robertmhaas@gmail.com wrote:
>
> Post 'em here or drop them on the wiki and post a link.

1. Clone the remote repository as usual:
   git clone git://git.postgresql.org/git/postgresql.git

2. Create as many local clones as you want:
   git clone postgresql foobar

3. In each clone (supposing you care about branch xyzzy):
   3.1. git remote set-url origin ssh://whatever/postgresql.git
   3.2. git remote update && git remote prune origin
   3.3. git checkout -t origin/xyzzy
   3.4. git branch -d master
   3.5. Edit .git/config and set origin.fetch thus:
        [remote "origin"]
            fetch = +refs/heads/xyzzy:refs/remotes/origin/xyzzy
        (You can git config remote.origin.fetch '+refs/...' if you're
        squeamish about editing the config file.)
   3.6. That's it. git pull and git push will work correctly.

(This will replace the "origin" remote that pointed at your local
postgresql.git clone with one that points to the real remote; but you
could also add a remote definition named something other than "origin",
in which case you'd need to "git push thatname" etc.)
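
The steps above can be sketched as one script. A temporary local repository stands in for the real remote so that it runs offline; the stand-in, its URL and the final test commit are assumptions for illustration, while foobar and xyzzy are the placeholder names from the steps.

```shell
#!/bin/sh
# Sketch: a hard-linked local clone that talks directly to the real remote.
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the real remote (assumption), with a back branch "xyzzy".
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
echo one > upstream/README
git -C upstream add README
git -C upstream -c user.name=demo -c user.email=demo@example.com commit -q -m "first"
git -C upstream branch xyzzy
remote_url="file://$tmp/upstream"

# 1. Clone the remote repository as usual.
git clone -q "$remote_url" postgresql

# 2. Create a local clone (objects are hard-linked, not copied).
git clone -q postgresql foobar
cd foobar

# 3.1 Point "origin" at the real remote instead of the local clone.
git remote set-url origin "$remote_url"
# 3.2 Fetch the remote's branches and drop stale tracking refs.
git remote update >/dev/null 2>&1 && git remote prune origin
# 3.3 Check out the one branch this clone is for.
git checkout -q -t origin/xyzzy
# 3.4 Remove the local master branch, which this clone does not need.
git branch -q -d master
# 3.5 Fetch only that branch from now on.
git config remote.origin.fetch '+refs/heads/xyzzy:refs/remotes/origin/xyzzy'

# A change made here goes straight to the remote, with no double push.
echo fix >> README
git -c user.name=demo -c user.email=demo@example.com commit -q -a -m "back-branch fix"
git push -q origin xyzzy

fetch_spec=$(git config remote.origin.fetch)
local_head=$(git rev-parse HEAD)
pushed_head=$(git -C "$tmp/upstream" rev-parse xyzzy)
echo "fetch=$fetch_spec local=$local_head pushed=$pushed_head"
```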

-- ams


Re: managing git disk space usage

From
Dimitri Fontaine
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:
> * Robert Haas <robertmhaas@gmail.com> [100720 13:04]:
>  
>> 3. Clone the origin once.  Apply patches to multiple branches by
>> switching branches.  Playing around with it, this is probably a
>> tolerable way to work when you're only going back one or two branches
>> but it's certainly a big nuisance when you're going back 5-7 branches.
>
> This is what I do when I'm working on a project that has completely
> proper dependencies, and you don't need to always re-run configure
> between different branches.  I use ccache heavily, so configure takes
> longer than a complete build with a couple-dozen
> actually-not-previously-seen changes...
>
> But *all* dependencies need to be proper in the build system, or you end
> up needing a git-clean-type-cleanup between branch switches, forcing a
> new configure run too, which takes too much time...
>
> Maybe this will cause make dependencies to be refined in PG ;-)

Well, there's also the VPATH possibility, where all your build objects
are stored out of the way of the repo. So you could check out the branch
you're interested in, change to the associated build directory and
build there. And automate that of course.

Regards,
-- 
dim


Re: managing git disk space usage

From
Alvaro Herrera
Date:
Excerpts from Dimitri Fontaine's message of mié jul 21 15:00:48 -0400 2010:

> Well, there's also the VPATH possibility, where all your build objects
> are stored out of the way of the repo. So you could check out the branch
> you're interested in, change to the associated build directory and
> build there. And automate that of course.

This does not work as cleanly as you suppose, because some "build
objects" are stored in the source tree.  configure being one of them.
So if you switch branches, configure is rerun even in a VPATH build,
which is undesirable.


Re: managing git disk space usage

From
Dimitri Fontaine
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> This does not work as cleanly as you suppose, because some "build
> objects" are stored in the source tree.  configure being one of them.
> So if you switch branches, configure is rerun even in a VPATH build,
> which is undesirable.

Ouch. Reading -hackers led me to thinking this had received a cleaning
effort in the Makefiles, so that any generated file appears in the build
directory. Sorry to learn that's not (yet?) the case.

Regards,
-- 
dim


Re: managing git disk space usage

From
Peter Eisentraut
Date:
On ons, 2010-07-21 at 23:06 +0200, Dimitri Fontaine wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > This does not work as cleanly as you suppose, because some "build
> > objects" are stored in the source tree.  configure being one of them.
> > So if you switch branches, configure is rerun even in a VPATH build,
> > which is undesirable.
> 
> Ouch. Reading -hackers led me to thinking this had received a cleaning
> effort in the Makefiles, so that any generated file appears in the build
> directory. Sorry to learn that's not (yet?) the case.

It is, but not in the back branches.