Re: PostgreSQL GIT mirror status - Mailing list pgsql-www
From | Daniel Farina |
---|---|
Subject | Re: PostgreSQL GIT mirror status |
Date | |
Msg-id | 7b97c5a40901090253w25ddd4e5q2e104e58a998610f@mail.gmail.com |
In response to | Re: PostgreSQL GIT mirror status ("Daniel Farina" <drfarina@acm.org>) |
Responses | Re: PostgreSQL GIT mirror status |
List | pgsql-www |
Okay, final report:

I suggest running 'git gc' from time to time instead of running repack directly. On modern git versions it seems smart enough to apply sensible limits and generally do the right thing to keep a repository in shape, even though its name suggests it's really 'just' for garbage collection. It also detects an excessive number of packs and consolidates them. Tweaking the gc options may be preferable to messing around with repack options directly, but I found no need to tweak anything to see a large improvement.

Secondly, 'git gc' has the '--aggressive' option. This used to do something really misleading, but I'm pretty sure it's fixed 'now', although I couldn't point you to an exact version. This makes life easy: just run 'git gc --aggressive' once in a long while. Given the current data, the pack should be about 100M afterwards.

Thirdly, I found a lot of garbage. When I used wget to fetch a copy of the repo there was no garbage (and over 600000 objects), but when I then pushed to a git clone, git chose to send only something in the 300000 object range. I suspect the difference is in the reflog or something, but I still can't explain why there was so much garbage that's not connected to branches or tags. Regardless, all the branches seem to be present and 'git fsck' says everything is okay. I'm trying to figure out where those extra objects are reachable from, but that's mostly for completeness -- everything seems to be working convincingly.

I only have access to a machine where I've set up a 'dumb' git repo that only serves via http. It's at http://fdr.lolrus.org/postgresql.git

If you are interested in grabbing a verbatim copy of my objects and repo, you can run the following to get an exact, untouched mirror:

    $ wget -np -erobots=off -r http://fdr.lolrus.org/postgresql.git

You will probably have to delete any spurious 'index.html' files that wget grabs before the repository will work as-is.
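A minimal sketch of the cleanup and sanity checks described above, in case it is useful. It assumes wget wrote the mirror to its default host-prefixed directory (fdr.lolrus.org/postgresql.git) and that nothing in the repository is legitimately named index.html. The function names are made up for illustration, and this is not necessarily how the numbers in this report were produced.

```shell
#!/bin/sh
# Sketch only: function names and paths are hypothetical.

# Delete the directory-listing pages (index.html, index.html?C=N;O=D, ...)
# that a recursive wget leaves behind inside the mirrored repository.
clean_mirror() {
    find "$1" -type f -name 'index.html*' -exec rm -f {} +
}

# Count the objects that fsck reports as unreachable from any ref.
# --no-reflogs stops reflog entries from counting as references.
count_unreachable() {
    git --git-dir="$1" fsck --unreachable --no-reflogs 2>/dev/null |
        grep -c '^unreachable'
}

# Example (not run here):
#   clean_mirror fdr.lolrus.org/postgresql.git
#   git --git-dir=fdr.lolrus.org/postgresql.git fsck
#   count_unreachable fdr.lolrus.org/postgresql.git
```

Comparing the count with and without --no-reflogs is one way to test the guess that the reflog accounts for the extra objects.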
Conclusion: 361M (plus pathological performance issues) to 246M (just repacking) to 110M (aggressive repacking).

fdr

Addendum: I tried repack with much deeper delta chains (that's what took so long to compute, as alluded to in my previous email) and it did cut down the size by another 20 megs or so, but many operations are much more costly because of the long chains. The extra 20 megs of size buys a lot of performance, so I think the default 'git gc --aggressive' uses a more reasonable trade-off.
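For anyone who wants to reproduce the size comparison, here is a rough sketch that reads the pack size from 'git count-objects -v' before and after a gc run. The helper names are hypothetical, the repository path is whatever you pass in, and the size-pack figure is reported in KiB.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Extract the on-disk pack size (KiB) from `git count-objects -v` on stdin.
size_pack_kib() {
    awk '/^size-pack:/ { print $2 }'
}

# Run `git gc` on the given git dir, passing any extra arguments
# (for example --aggressive) through, and print the pack size before and after.
gc_and_report() {
    repo=$1
    shift
    before=$(git --git-dir="$repo" count-objects -v | size_pack_kib)
    git --git-dir="$repo" gc --quiet "$@"
    after=$(git --git-dir="$repo" count-objects -v | size_pack_kib)
    echo "size-pack: ${before} KiB -> ${after} KiB"
}

# Example (not run here):
#   gc_and_report postgresql.git
#   gc_and_report postgresql.git --aggressive
```

The deeper-chain experiment would correspond to something like 'git repack -a -d -f' with larger --depth and --window values. The values actually used are not given in this thread, so none are filled in here.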