Thread: managing git disk space usage
Tom and, I believe, also Andrew have expressed some concerns about the space that will be taken up by having multiple copies of the git repository on their systems. While most users can probably get by with a single repository, committers will likely need one for each back-branch that they work with, and we have quite a few of those.

After playing around with this a bit, I've come to the conclusion that there are a couple of possible options, but they've all got some drawbacks.

1. Clone the origin. Then, clone the clone n times locally. This uses hard links, so it saves disk space. But, every time you want to pull, you first have to pull to the "main" clone, and then to each of the "slave" clones. And same thing when you want to push.

2. Clone the origin n times. Use more disk space. Live with it. :-)

3. Clone the origin once. Apply patches to multiple branches by switching branches. Playing around with it, this is probably a tolerable way to work when you're only going back one or two branches, but it's certainly a big nuisance when you're going back 5-7 branches.

4. Clone the origin. Use that to get at the master branch. Then clone that clone n-1 times, one for each back-branch. This makes it a bit easier to push and pull when you're only dealing with the master branch, but you still have the double push/double pull problem for all the other branches.

5. Use git clone --shared or git clone --reference or git-new-workdir. While I once thought this was the solution, I can't take very seriously any solution that has a warning in the manual that says, essentially, git gc may corrupt your repository if you do this.

I'm not really sure which of these I'm going to do yet, and I'm not sure what to recommend to others, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
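Option 1 can be sketched concretely like this; the directory names are illustrative assumptions, not anything prescribed in the thread:

```shell
# One "main" clone plus locally-cloned copies. Cloning a local path
# hard-links the object files, so the copies cost little extra disk.
git clone git://git.postgresql.org/git/postgresql.git main
git clone main rel8_4   # hard-linked "slave" clone
git clone main rel8_3

# The drawback: every update is a double pull, first into the main
# clone, then into each slave (and pushes go the other way).
(cd main   && git pull)
(cd rel8_4 && git pull)
(cd rel8_3 && git pull)
```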
* Robert Haas <robertmhaas@gmail.com> [100720 13:04]:

> 3. Clone the origin once. Apply patches to multiple branches by
> switching branches. Playing around with it, this is probably a
> tolerable way to work when you're only going back one or two branches
> but it's certainly a big nuisance when you're going back 5-7 branches.

This is what I do when I'm working on a project that has completely proper dependencies, and you don't need to always re-run configure between different branches. I use ccache heavily, so configure takes longer than a complete build with a couple dozen actually-not-previously-seen changes...

But *all* dependencies need to be proper in the build system, or you end up needing a git-clean-type cleanup between branch switches, forcing a new configure run too, which takes too much time... Maybe this will cause make dependencies to be refined in PG ;-)

It has the advantage that if "back patching" (or in reality, forward patching) all happens in one repository, the git conflict machinery is all using the same cache of resolutions, meaning that if you apply the same patch to two different branches with identical code/conflict, you don't need to do the whole conflict resolution by hand from scratch in the second branch.

> 5. Use git clone --shared or git clone --reference or
> git-new-workdir. While I once thought this was the solution, I can't
> take very seriously any solution that has a warning in the manual that
> says, essentially, git gc may corrupt your repository if you do this.

This is the type of setup I often use. I keep a "central" set of git repos that are straight mirror-clones of project repositories, kept up-to-date via cron. And any time I clone a work repo, I use --reference. Since I make sure I don't "remove" anything from the reference repo, I don't have to worry about losing objects other repositories might be using from the "cache" repo.
In case anyone is wondering, that's:

    git clone --mirror $REPO /data/src/cache/$project.git
    git --git-dir=/data/src/cache/$project.git config gc.auto 0

And then in crontab:

    git --git-dir=/data/src/cache/$project.git fetch --quiet --all

With gc.auto disabled, and the only commands ever run being "git fetch", no objects are removed, even if a remote rewinds and throws away commits.

But this way means that the separate repos only share the "past, from central repository" history, which means that you have to jump through hoops if you want to be able to use git's handy merging/cherry-picking/conflict tools when trying to rebase/port patches between branches. You're pretty much limited to exporting a patch, changing to the new branch-repository, and applying the patch.

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
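For completeness, the work-repo clone that goes with this cache setup would presumably look something like the following; the upstream URL and target directory are placeholders in the spirit of the paths above:

```shell
# Clone a work repo that borrows objects from the local cache repo
# (recorded in .git/objects/info/alternates) instead of copying them.
git clone --reference /data/src/cache/$project.git \
    git://git.example.org/$project.git $project-work
```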
Robert Haas <robertmhaas@gmail.com> wrote:

> 2. Clone the origin n times. Use more disk space. Live with it. :-)

But each copy uses almost 0.36% of the formatted space on my 150GB drive!

-Kevin
On Tue, 2010-07-20 at 13:28 -0400, Aidan Van Dyk wrote:

> But *all* dependencies need to be proper in the build system, or you
> end up needing a git-clean-type cleanup between branch switches,
> forcing a new configure run too, which takes too much time...

This realization, while true, doesn't really help, because we are talking about maintaining 5+ year old back branches, where we are not going to fiddle with the build system at this time.

Also, the switch from 9.0 to 9.1 the other day showed everyone who cared to watch that the dependencies are currently not correct for major version switches, so this method will definitely not work at the moment.
On Tue, 2010-07-20 at 13:04 -0400, Robert Haas wrote:

> 2. Clone the origin n times. Use more disk space. Live with it. :-)

Well, I plan to use cp -a to avoid cloning over the network n times, but other than that, that was my plan. My .git directory currently takes 283 MB, so I think I can just about live with that.
Robert Haas wrote:

> Tom and, I believe, also Andrew have expressed some concerns about the
> space that will be taken up by having multiple copies of the git
> repository on their systems. While most users can probably get by
> with a single repository, committers will likely need one for each
> back-branch that they work with, and we have quite a few of those.
>
> After playing around with this a bit, I've come to the conclusion that
> there are a couple of possible options but they've all got some
> drawbacks.
>
> 1. Clone the origin. Then, clone the clone n times locally. This
> uses hard links, so it saves disk space. But, every time you want to
> pull, you first have to pull to the "main" clone, and then to each of
> the "slave" clones. And same thing when you want to push.

You can have a cron job that does the first pull fairly frequently. It should be a fairly cheap operation unless the git protocol is dumber than I think. The second pull is the equivalent of what we do now with "cvs update".

Given that, you could push commits direct to the authoritative repo and wait for the cron job to catch up your local base clone. I think that's the pattern I will probably try to follow.

cheers

andrew
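A crontab entry along those lines might look like this; the repository path and the fetch interval are assumptions, not something specified in the thread:

```shell
# Fetch into the base clone every 15 minutes; the hard-linked
# "slave" clones can then pull from it cheaply at any time.
*/15 * * * * git --git-dir=$HOME/git/postgresql/.git fetch --quiet origin
```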
On Wed, Jul 21, 2010 at 6:17 AM, Abhijit Menon-Sen <ams@toroid.org> wrote:
> At 2010-07-20 13:04:12 -0400, robertmhaas@gmail.com wrote:
>>
>> 1. Clone the origin. Then, clone the clone n times locally. This
>> uses hard links, so it saves disk space. But, every time you want to
>> pull, you first have to pull to the "main" clone, and then to each of
>> the "slave" clones. And same thing when you want to push.
>
> If your extra clones are for occasionally-touched back branches, then:
>
> (a) In my experience, it is almost always much easier to work with many
> branches and move patches between them rather than use multiple clones;
> but
>
> (b) You don't need to do the double-pull and push. Clone your local
> repository as many times as needed, but create new git-remote(1)s in
> each extra clone and pull/push only the branch you care about directly
> from or to the remote. That way, you'll start off with the bulk of the
> storage shared with your main local repository, and "waste" a few KB
> when you make (presumably infrequent) new changes.

Ah, that is clever. Perhaps we need to write up directions on how to do that.

> But that brings me to another point:
>
> In my experience (doing exactly this kind of old-branch-maintenance with
> Archiveopteryx), git doesn't help you much if you want to backport (i.e.
> cherry-pick) changes from a development branch to old release branches.
> It is much more helpful when you make changes to the *oldest* applicable
> branch and bring it *forward* to your development branch (by merging the
> old branch into your master). Cherry-picking can be done, but it becomes
> painful after a while.

Well, per previous discussion, we're not going to change that at this point, or maybe ever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, Jul 21, 2010 at 12:39, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 21, 2010 at 6:17 AM, Abhijit Menon-Sen <ams@toroid.org> wrote:
>> At 2010-07-20 13:04:12 -0400, robertmhaas@gmail.com wrote:
>>>
>>> 1. Clone the origin. Then, clone the clone n times locally. This
>>> uses hard links, so it saves disk space. But, every time you want to
>>> pull, you first have to pull to the "main" clone, and then to each of
>>> the "slave" clones. And same thing when you want to push.
>>
>> If your extra clones are for occasionally-touched back branches, then:
>>
>> (a) In my experience, it is almost always much easier to work with many
>> branches and move patches between them rather than use multiple clones;
>> but
>>
>> (b) You don't need to do the double-pull and push. Clone your local
>> repository as many times as needed, but create new git-remote(1)s in
>> each extra clone and pull/push only the branch you care about directly
>> from or to the remote. That way, you'll start off with the bulk of the
>> storage shared with your main local repository, and "waste" a few KB
>> when you make (presumably infrequent) new changes.
>
> Ah, that is clever. Perhaps we need to write up directions on how to do that.

Yeah, that's the way I work with some projects at least.

>> But that brings me to another point:
>>
>> In my experience (doing exactly this kind of old-branch-maintenance with
>> Archiveopteryx), git doesn't help you much if you want to backport (i.e.
>> cherry-pick) changes from a development branch to old release branches.
>> It is much more helpful when you make changes to the *oldest* applicable
>> branch and bring it *forward* to your development branch (by merging the
>> old branch into your master). Cherry-picking can be done, but it becomes
>> painful after a while.
>
> Well, per previous discussion, we're not going to change that at this
> point, or maybe ever.

Nope, the deal was definitely that we stick to the current workflow. Yes, this means we can't use git cherry-pick or similar git-specific tools to make life easier. But it shouldn't make life harder than it is *now*, with CVS.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
At 2010-07-20 13:04:12 -0400, robertmhaas@gmail.com wrote:
>
> 1. Clone the origin. Then, clone the clone n times locally. This
> uses hard links, so it saves disk space. But, every time you want to
> pull, you first have to pull to the "main" clone, and then to each of
> the "slave" clones. And same thing when you want to push.

If your extra clones are for occasionally-touched back branches, then:

(a) In my experience, it is almost always much easier to work with many branches and move patches between them rather than use multiple clones; but

(b) You don't need to do the double-pull and push. Clone your local repository as many times as needed, but create new git-remote(1)s in each extra clone and pull/push only the branch you care about directly from or to the remote. That way, you'll start off with the bulk of the storage shared with your main local repository, and "waste" a few KB when you make (presumably infrequent) new changes.

But that brings me to another point:

In my experience (doing exactly this kind of old-branch maintenance with Archiveopteryx), git doesn't help you much if you want to backport (i.e. cherry-pick) changes from a development branch to old release branches. It is much more helpful when you make changes to the *oldest* applicable branch and bring it *forward* to your development branch (by merging the old branch into your master). Cherry-picking can be done, but it becomes painful after a while.

See http://toroid.org/ams/etc/git-merge-vs-p4-integrate for more.

-- ams
At 2010-07-21 06:39:28 -0400, robertmhaas@gmail.com wrote:
>
> Perhaps we need to write up directions on how to do that.

I'll write them if you tell me where to put them. It's trivial.

> Well, per previous discussion, we're not going to change that at this
> point, or maybe ever.

Sure. I just wanted to mention it, because it's something I learned the hard way. It's also true that back-porting changes is a bigger deal for Postgres than it was for me (in the sense that it's an exception rather than a routine activity), and individual changes are usually backported as soon as, or very soon after, they are committed; so it should be less painful on the whole.

Another point, in response to Magnus's followup:

At 2010-07-21 12:42:03 +0200, magnus@hagander.net wrote:
>
> Yes, this means we can't use git cherry-pick or similar git-specific
> tools to make life easier.

No, that's not right. You *can* use cherry-pick; in fact, it's the sane way to backport the occasional change. What you can't do is efficiently manage a queue of changes to be backported to multiple branches. But as I said above, that's not exactly what we want to do for Postgres, so it should not matter too much.

-- ams
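The occasional-backport case being called sane here would look roughly like this; the branch name and commit hash are hypothetical:

```shell
# Backport one commit from the development branch to a release branch.
# The -x flag appends "(cherry picked from commit ...)" to the new
# commit message, preserving a pointer back to the original commit.
git checkout REL8_4_STABLE
git cherry-pick -x 9fa6e38
```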
On Wed, Jul 21, 2010 at 6:56 AM, Abhijit Menon-Sen <ams@toroid.org> wrote:
> At 2010-07-21 06:39:28 -0400, robertmhaas@gmail.com wrote:
>>
>> Perhaps we need to write up directions on how to do that.
>
> I'll write them if you tell me where to put them. It's trivial.

Post 'em here or drop them on the wiki and post a link.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
At 2010-07-21 06:57:53 -0400, robertmhaas@gmail.com wrote:
>
> Post 'em here or drop them on the wiki and post a link.

1. Clone the remote repository as usual:

       git clone git://git.postgresql.org/git/postgresql.git

2. Create as many local clones as you want:

       git clone postgresql foobar

3. In each clone (supposing you care about branch xyzzy):

   3.1. git remote set-url origin ssh://whatever/postgresql.git
   3.2. git remote update && git remote prune origin
   3.3. git checkout -t origin/xyzzy
   3.4. git branch -d master
   3.5. Edit .git/config and set origin.fetch thus:

            [remote "origin"]
                fetch = +refs/heads/xyzzy:refs/remotes/origin/xyzzy

        (You can git config remote.origin.fetch '+refs/...' if you're
        squeamish about editing the config file.)

   3.6. That's it. git pull and git push will work correctly.

(This will replace the "origin" remote that pointed at your local postgresql.git clone with one that points to the real remote; but you could also add a remote definition named something other than "origin", in which case you'd need to "git push thatname" etc.)

-- ams
Aidan Van Dyk <aidan@highrise.ca> writes:
> * Robert Haas <robertmhaas@gmail.com> [100720 13:04]:
>
>> 3. Clone the origin once. Apply patches to multiple branches by
>> switching branches. Playing around with it, this is probably a
>> tolerable way to work when you're only going back one or two branches
>> but it's certainly a big nuisance when you're going back 5-7 branches.
>
> This is what I do when I'm working on a project that has completely
> proper dependencies, and you don't need to always re-run configure
> between different branches. I use ccache heavily, so configure takes
> longer than a complete build with a couple-dozen
> actually-not-previously-seen changes...
>
> But *all* dependencies need to be proper in the build system, or you end
> up needing a git-clean-type cleanup between branch switches, forcing a
> new configure run too, which takes too much time...
>
> Maybe this will cause make dependencies to be refined in PG ;-)

Well, there's also the VPATH possibility, where all your build objects are stored out of the way of the repo. So you could check out the branch you're interested in, change to the associated build directory, and build there. And automate that, of course.

Regards,

-- 
dim
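The per-branch VPATH arrangement suggested here could be sketched as follows; the directory layout is an assumption, and since the commands need a real source tree this is a command fragment rather than something runnable standalone:

```shell
# One shared source checkout, one build directory per branch.
# configure and make run from the build dir, so object files never
# land inside the repository working tree.
cd ~/pg/postgresql && git checkout REL8_4_STABLE
mkdir -p ~/pg/build-8.4 && cd ~/pg/build-8.4
~/pg/postgresql/configure --prefix=$HOME/pg/install-8.4
make
```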
Excerpts from Dimitri Fontaine's message of Wed Jul 21 15:00:48 -0400 2010:

> Well, there's also the VPATH possibility, where all your build objects
> are stored out of the way of the repo. So you could check out the branch
> you're interested in, change to the associated build directory and
> build there. And automate that of course.

This does not work as cleanly as you suppose, because some "build objects" are stored in the source tree, configure being one of them. So if you switch branches, configure is rerun even in a VPATH build, which is undesirable.
Alvaro Herrera <alvherre@commandprompt.com> writes:
> This does not work as cleanly as you suppose, because some "build
> objects" are stored in the source tree, configure being one of them.
> So if you switch branches, configure is rerun even in a VPATH build,
> which is undesirable.

Ouch. Reading -hackers led me to think this had received a cleaning effort in the Makefiles, so that any generated file appears in the build directory. Sorry to learn that's not (yet?) the case.

Regards,

-- 
dim
On Wed, 2010-07-21 at 23:06 +0200, Dimitri Fontaine wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> This does not work as cleanly as you suppose, because some "build
>> objects" are stored in the source tree, configure being one of them.
>> So if you switch branches, configure is rerun even in a VPATH build,
>> which is undesirable.
>
> Ouch. Reading -hackers led me to think this had received a cleaning
> effort in the Makefiles, so that any generated file appears in the build
> directory. Sorry to learn that's not (yet?) the case.

It is, but not in the back branches.