Thread: PostgreSQL git mirror status
The PostgreSQL git mirror at git.postgresql.org/git/postgresql.git was screwed up on Dec 13th. All the history up to that date was duplicated four times, and strange "fixup" commits appeared in back-branches. In addition, there was the old issue that back-branches were not being updated. Both issues have now been fixed: the repository was reset to the state before the screwup on Dec 13th, and all the patches after that were reapplied.

This means that if you have a clone that has been updated (pulled) since that date, the next time you issue fetch or pull it will fail, complaining about "non fast-forward" updates. You will need to use the --force option to force it. If you have any local branches in your repository, you will need to rebase them over the new head, with something like:

  git rebase origin/master

Let's hope that the script can now keep the mirror up-to-date without manual intervention. Let me know if there are problems.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
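The recovery described above can be sketched as a short shell session. This is only an illustration, not part of the original mail: it assumes the remote is named "origin" and the local branch is "master", and uses the modern "git rebase" spelling of the command.

```shell
# Recovering a clone after the upstream history rewrite (sketch).
# Assumes the remote is called "origin" and you work on "master".
git fetch --force origin      # accept the non-fast-forward update of origin/*
git checkout master
git rebase origin/master      # replay any local commits onto the rewritten head
```

Local commits that duplicate upstream ones are dropped by the rebase; anything genuinely local is replayed on top of the new head.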
On Tue, 2008-12-30 at 22:18 +0200, Heikki Linnakangas wrote:
> Let's hope that the script can now keep the mirror up-to-date without
> manual intervention. Let me know if there are problems.

Thanks for putting in the time. I find the git repo very helpful, especially during patch review.

One question though: does "git repack" ever get run? Sometimes the repository seems a little slow, but maybe that's just because it's big.

Regards,
Jeff Davis
On Wednesday 31 December 2008 00:00:39 Jeff Davis wrote:
> One question though: does "git repack" ever get run?

Yes, after every update.
On Tue, Dec 30, 2008 at 2:58 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Wednesday 31 December 2008 00:00:39 Jeff Davis wrote:
>> One question though: does "git repack" ever get run?
>
> Yes, after every update.

Follow-up question: does "git repack -a -d" ever get run? I have also noticed slow fetching, and I can see (via the HTTP URL at http://git.postgresql.org/git/postgresql.git/objects/pack/) that there is a relatively large number of packs, which could be to blame.

Important side note: I don't think "git repack -a -d" is 'safe' with dumber git protocols like HTTP, so any in-progress HTTP-based pulls may encounter 'interesting' effects at the moment the repack finishes and prunes away old packs.

If you want to be really thorough, consider heeding the mail archived at http://gcc.gnu.org/ml/gcc/2007-12/msg00165.html and running an extensive repack overnight. It *might* be worth it if it has not been done at least once already.

fdr
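For reference, the pack consolidation under discussion is a one-liner. This sketch is not from the thread; the repository path is a stand-in, and the objects/pack layout shown is that of a bare repository like the mirror.

```shell
# Count the packs, consolidate them into one, and count again (sketch).
cd /path/to/postgresql.git        # hypothetical path to the bare mirror
ls objects/pack/*.pack | wc -l    # many small packs before
git repack -a -d                  # -a: repack all objects; -d: delete redundant packs
ls objects/pack/*.pack | wc -l    # a single pack afterwards
```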
Daniel Farina wrote:
> Follow-up question: does "git repack -a -d" ever get run?

No.

> If you want to be really thorough, consider heeding the mail archived
> at http://gcc.gnu.org/ml/gcc/2007-12/msg00165.html and running an
> extensive repack overnight. It *might* be worth it if it has not been
> done at least once already.

Well, if you want to give it a try and then report back about whether there were any noticeable effects ...
On Wed, Jan 7, 2009 at 1:58 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> Well, if you want to give it a try and then report back about whether there
> were any noticeable effects ...

I ran a regular git repack -a -d. This took about 3.5 cpu-intensive hours, but made object counting *much* (I cannot stress that enough) faster and shrank the repository dramatically: 361M to 246M. I also won't have any more open-file-limit problems (things like git fsck --full would fail because of too many open files until I raised ulimit -n). I should also mention that cloning over HTTP seems completely broken because of the huge number of packs...potentially also an open-file-limit issue.

You may want to run 'git repack -a -d' as well, but I'd advise waiting until tomorrow, when I'll write up my full report and compare this result with much more aggressive packing options. My estimate is that, starting from the already-repacked repository, finding new deltas will take about nine hours with extremely aggressive settings. Aggressive packing has a higher likelihood of being worthwhile on projects as large as Postgres, so we'll see. After that I can either solidify the recipe I used, so you can burn another fifteen or so hours of compute time to re-derive the result, or I can simply give you the generated pack. You can use 'git fsck --full' to verify the pack's integrity.

I suggest running 'git repack -a -d' to consolidate packs every once in a while, maybe monthly or semi-monthly. It's quite cheap if there aren't too many packs and/or loose objects. Aggressive repacking such as what I'm doing may only be useful on a yearly basis or even longer...unless git learns some better ways to build packs.

I also hope you (and everyone else) have git version >= 1.5.3, when the pack format changed.

fdr
Okay, final report:

I suggest running 'git gc' from time to time instead of repack directly. On modern git versions it seems smart enough to have sensible limits and generally do the right thing to keep a repository in shape, in spite of its name suggesting it's 'just' for garbage collection. It will also detect an excessive number of packs and consolidate them. Tweaking the gc options may be preferable to messing around with repack options directly, but I found there was no need to tweak anything to see a large improvement.

Secondly, 'git gc' has the '--aggressive' option. This used to do something really misleading, but I'm pretty sure it's fixed 'now', although I couldn't point you to an exact version. This makes life easy: just run 'git gc --aggressive' once in a long while. Given the current data it seems the pack should be about 100M afterwards.

Thirdly, I found a lot of garbage. There was no garbage when I used wget to fetch a copy of the repo (over 600000 objects), but when I pushed to a git clone, git chose to send only something in the 300000-object range. I suspect the difference is in the reflog or something, but I still can't explain why there was so much garbage not connected to any branch or tag. Regardless, all the branches seem present and 'git fsck' says everything is okay. I'm trying to figure out where those extra objects are reachable from, but that's mostly for completeness -- everything seems to be working convincingly.

I only have access to a machine where I've set up a 'dumb' git repo that serves only via HTTP. It's at http://fdr.lolrus.org/postgresql.git

If you are interested in grabbing a verbatim copy of my objects and repo, you can run the following to get an exact, untouched mirror:

  $ wget -np -erobots=off -r http://fdr.lolrus.org/postgresql.git

You will probably have to delete any spurious 'index.html' files that wget grabs before the repository will work as-is.
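The "sensible limits" that make plain 'git gc' safe to run routinely are configuration values, not repack flags. A sketch of the relevant knobs follows; this is illustrative rather than from the thread, and the values shown are git's documented defaults.

```shell
# Thresholds consulted by "git gc --auto" (sketch; values are git's defaults).
git config gc.auto 6700           # run auto-gc once this many loose objects accumulate
git config gc.autoPackLimit 50    # consolidate once this many packs exist
git gc --auto                     # cheap: does nothing unless a threshold is hit
```

A plain 'git gc' (without --auto) always repacks; the --auto form is what porcelain commands invoke after operations that create objects.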
Conclusion: 361M (plus pathological performance issues) to 246M (just repacking) to 110M (aggressive repacking).

fdr

Addendum: I tried repack with much deeper delta chains (that's what took so long to compute, as alluded to in my previous email), and it did cut the size down by another 20 megs or so, but many operations became much more costly because of the long chains. Giving back those 20 megs buys a lot of performance, so I think the default 'git gc --aggressive' uses a more reasonable trade-off.
On Fri, Jan 9, 2009 at 2:53 AM, Daniel Farina <drfarina@acm.org> wrote:
> $ wget -np -erobots=off -r http://fdr.lolrus.org/postgresql.git

Important correction: as I recall, a trailing slash is needed on that. Fixed:

  $ wget -np -erobots=off -r http://fdr.lolrus.org/postgresql.git/

fdr
Daniel Farina wrote:
> Secondly, 'git gc' has the '--aggressive' option. This used to do
> something really misleading, but I'm pretty sure it's fixed 'now',
> although I couldn't point you to an exact version. This makes life
> easy: just run 'git gc --aggressive' once in a long while. Given the
> current data it seems that the pack should be about 100M
> afterwards.

Wow, that's impressive! How long does a "git gc --aggressive" run take?

> Thirdly, I found a lot of garbage. There was no garbage when I used
> wget to fetch a copy of repo (and over 600000 objects) but then when I
> pushed to a git clone git chose only to send something in the 300000
> object range. I suspect the difference is in the reflog or something,
> but I still can't explain why there was so much garbage that's not
> connected to branches or tags. Regardless, all the branches seem
> present and 'git fsck' says everything is okay. I'm trying to figure
> out where those extra objects are reachable from, but that's mostly
> for completeness -- everything seems to be working convincingly.

That could be because of the duplicated history we had there in December, which I then fixed. I reset the branches to just before the screwup, and then ran fromcvs to catch up with CVS HEAD again. That duplicated history is probably still there, but not reachable from any branches or tags.

Should we run "git prune" to get rid of the garbage?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Jan 9, 2009 at 3:06 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Wow, that's impressive! How long does a "git gc --aggressive" run take?

Actually, not that long. The main step that takes forever at this point (starting from scratch) is counting all those objects. The actual gc --aggressive time could probably be measured in minutes, and under an hour on a reasonably fast machine.

> That could be because of the duplicated history we had there in December,
> which I then fixed. I reset the branches to just before the screwup, and then
> ran fromcvs to catch up with CVS HEAD again. That duplicated history is
> probably still there, but not reachable from any branches or tags.
>
> Should we run "git prune" to get rid of the garbage?

Sounds like a good candidate, but I don't think that alone will do it. I've had to do something like this before, when I temporarily added some large blobs to my git repository to move them between home and work. I have isolated the problem to being the reflog, which sounds about right. The "git reflog" man page says it has ways to delete and/or expire entries so they can be pruned, so try that first (and then tell me if it worked as you expected, and what you did).

If it doesn't (i.e. for some reason does not prune properly) and you are sure you won't need the reflog, it seems you can just delete the 'logs' directory under the git repository (you may notice that the repository at lolrus.org works fine but has no 'logs' directory). That leaves you in the same state as having no reflog at all, after which a regular 'git gc' will collect most of those objects.

"But wait, there's more!" You'll then want to run a 'git prune', as it seems that gc will still keep some objects around because they're inside the gc grace period, which I believe is distinct from the reflog. In this case it seems we really want them gone.

Given this information it seems like the right steps are something like this:

1. Somehow expire and/or delete the reflogs so those objects register as garbage:
   * by using the 'git reflog' expiration/deletion commands (preferred, if one can figure out their behavior exactly),
   * or by just deleting $GITREPO/logs (works for me at the moment).
2. Run 'git gc --aggressive'
3. Run 'git prune'

Alternatively, just steal the pack from fdr.lolrus.org, as mentioned above.

fdr
Daniel Farina wrote:
> Secondly, 'git gc' has the '--aggressive' option. This used to do
> something really misleading, but I'm pretty sure it's fixed 'now',
> although I couldn't point you to an exact version. This makes life
> easy: just run 'git gc --aggressive' once in a long while. Given the
> current data it seems that the pack should be about 100M
> afterwards.

git gc --aggressive has now been run, and the repository has shrunk significantly. Thanks for the investigation.
On Thu, Jan 15, 2009 at 2:26 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> git gc --aggressive has now been run...

I did a little more investigation. Actually, it seems the --aggressive settings aren't 'fixed' after all, but they happen to be okay for Postgres. To avoid spreading misconceptions, I figured I should post this for completeness: I believe --aggressive runs something akin to "git repack -a -f -d --window=100 --depth=100" by default (tweakable via configuration options, I think). In fact, repacking the emacs git repository with --aggressive causes the pack to explode in size. My guess is that some projects would benefit from larger window sizes (although repacking then takes longer and is more computationally expensive). Too-long delta chains can degrade performance, so some care must be exercised with --depth.

I have tried the settings "git repack -a -f -d --window=250 --depth=250" suggested in the mail by Linus Torvalds posted previously. For the Postgres repository it probably shaves off another 10MB, so the difference is somewhat negligible; I have tried much more aggressive settings still and have not seen appreciable gain beyond that (perhaps creeping towards another 10MB of savings). It may be worth doing if you have extra time on your hands.

Also, unless git.postgresql.org is using object alternates/shared repos, you may want to consider deleting the ./logs/ directory or expiring it with 'git reflog': there's a lot of garbage objects that will remain reachable otherwise and will not be collected.

fdr
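For reference, the Torvalds-style invocation discussed above is reproduced here as a sketch; on a repository the size of Postgres this is the multi-hour overnight job described earlier in the thread, not a routine maintenance command.

```shell
# Very aggressive repack: -f recomputes every delta from scratch, a large
# --window examines more candidate objects per delta, and a large --depth
# allows longer delta chains (smaller pack, slower to build and to access).
git repack -a -d -f --window=250 --depth=250
```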