Re: PostgreSQL GIT mirror status - Mailing list pgsql-www

From Daniel Farina
Subject Re: PostgreSQL GIT mirror status
Date
Msg-id 7b97c5a40901090253w25ddd4e5q2e104e58a998610f@mail.gmail.com
In response to Re: PostgreSQL GIT mirror status  ("Daniel Farina" <drfarina@acm.org>)
Responses Re: PostgreSQL GIT mirror status  ("Daniel Farina" <drfarina@acm.org>)
Re: PostgreSQL GIT mirror status  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Re: PostgreSQL GIT mirror status  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-www
Okay, final report:

I suggest running 'git gc' from time to time instead of running repack
directly. On modern git versions it seems smart enough to apply some
sensible limits and generally do the right thing to keep a repository
in shape, even though its name suggests it's really 'just' for
garbage collection. It also detects an excessive number of packs and
consolidates them. Tweaking the gc options may be preferable to messing
around with repack options directly, but I found no tweaking was
needed to see a large improvement.
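
As a rough sketch of that routine housekeeping, the following builds a
throwaway repository so it can be run anywhere; on the real mirror you
would just cd into the repository and run the 'git gc' line. The
gc.autoPackLimit value shown is only an illustrative number, not
something I tuned.

```shell
#!/bin/sh
set -eu

# Throwaway repository (an assumption for the demo, not the real mirror).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.invalid"
git config user.name "Demo"

# A few commits so there are loose objects to pack.
for i in 1 2 3; do
    echo "revision $i" > file.txt
    git add file.txt
    git commit -q -m "commit $i"
done

# Optional knob: how many packs 'git gc --auto' tolerates before it
# consolidates them. 20 is an illustrative value only.
git config gc.autoPackLimit 20

# Loose-object and pack counts before housekeeping.
git count-objects -v

# Routine housekeeping: pack loose objects and consolidate packs.
git gc --quiet

# Same counters afterwards.
git count-objects -v
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packs_after=$(git count-objects -v | awk '/^packs:/ {print $2}')
```

Comparing the two 'git count-objects -v' outputs is a cheap way to see
whether a gc run actually did anything.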

Secondly, 'git gc' has the '--aggressive' option. This used to do
something really misleading, but I'm pretty sure it's fixed 'now',
although I couldn't point you to an exact version. This makes life
easy: just run 'git gc --aggressive' once in a long while. Given the
current data it seems that the pack should be about 100M
afterwards.
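
A minimal sketch of the occasional aggressive run, again on a throwaway
repository (an assumption for the demo) so it is safe to try. The
size-pack figure from 'git count-objects -v' is reported in KiB and is
what you would compare against the roughly 100M estimate on the real
mirror.

```shell
#!/bin/sh
set -eu

# Throwaway repository standing in for the mirror.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.invalid"
git config user.name "Demo"

for i in 1 2 3 4 5; do
    echo "revision $i" >> history.txt
    git add history.txt
    git commit -q -m "commit $i"
done

# Slow, thorough repack: recomputes deltas instead of reusing old ones.
git gc --aggressive --quiet

# Report the resulting pack count and pack size (KiB).
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')
size_kib=$(git count-objects -v | awk '/^size-pack:/ {print $2}')
echo "packs: $packs, pack size: ${size_kib} KiB"
```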

Thirdly, I found a lot of garbage. There was no garbage in the copy of
the repo I fetched with wget (over 600000 objects), but when I then
pushed to a git clone, git chose to send only something in the 300000
object range. I suspect the difference is in the reflog or something,
but I still can't explain why there was so much garbage that's not
connected to branches or tags. Regardless, all the branches seem
present and 'git fsck' says everything is okay. I'm trying to figure
out where those extra objects are reachable from, but that's mostly
for completeness -- everything seems to be working convincingly.
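
One way to look into this kind of discrepancy, sketched on a throwaway
repository with a deliberately orphaned blob (both are assumptions for
the demo): count the objects reachable from refs, then ask 'git fsck'
which objects are unreachable once reflogs are ignored.

```shell
#!/bin/sh
set -eu

# Throwaway repository standing in for the mirror.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.invalid"
git config user.name "Demo"

echo one > a.txt
git add a.txt
git commit -q -m "first"

# Write a blob that no commit, tree, or ref points to.
echo "orphan" | git hash-object -w --stdin >/dev/null

# Objects reachable from any ref (commits, trees, blobs).
reachable=$(git rev-list --objects --all | wc -l | tr -d ' ')

# Objects not reachable from any ref, with reflog entries ignored.
unreachable=$(git fsck --unreachable --no-reflogs 2>/dev/null \
    | grep -c '^unreachable')

echo "reachable: $reachable, unreachable: $unreachable"
```

If the unreachable count drops sharply when '--no-reflogs' is left off,
the extra objects are being kept alive by the reflog.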

I only have access to a machine where I've set up a 'dumb' git repo
that only serves via http. It's at
http://fdr.lolrus.org/postgresql.git

If you are interested in grabbing a verbatim copy of my objects and
repo, you can run the following to get an exact, untouched mirror:

$ wget -np -erobots=off -r http://fdr.lolrus.org/postgresql.git

You will probably have to delete any spurious 'index.html' files that
wget grabs before the repository will work as-is.
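
A hedged sketch of that cleanup step. The directory below is a made-up
stand-in for what 'wget -r' leaves behind, so the file names are
assumptions; it also assumes the mirror is a bare repository, where no
legitimate file is called index.html.

```shell
#!/bin/sh
set -eu

# Fake mirror tree (hypothetical layout, created only for the demo).
mirror=$(mktemp -d)/fdr.lolrus.org/postgresql.git
mkdir -p "$mirror/objects/pack" "$mirror/refs/heads"
: > "$mirror/HEAD"
: > "$mirror/index.html"
: > "$mirror/objects/index.html"
: > "$mirror/objects/pack/index.html?C=N;O=D"

# Remove the directory-listing pages wget saved next to the real files.
find "$mirror" -type f -name 'index.html*' -exec rm -f {} +

# Count what is left; the repository files themselves are untouched.
remaining=$(find "$mirror" -type f -name 'index.html*' | wc -l | tr -d ' ')
echo "leftover listing pages: $remaining"
```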

Conclusion: 361M (plus pathological performance issues) to 246M (just
repacking) to 110M (aggressive repacking).

fdr


Addendum:

I tried repack with much deeper delta chains (that's what took so long
to compute, as alluded to in my previous email) and it did cut the
size down by another 20 megs or so, but many operations are much more
costly because of the long chains. Accepting the extra 20 megs buys a
lot of performance, so I think the default 'git gc --aggressive' makes
a more reasonable trade-off.
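
For reference, a deeper-chain repack looks roughly like the sketch
below. The --depth and --window values are illustrative assumptions,
not the exact numbers from my run, and the repository is a throwaway
one so the command is safe to try.

```shell
#!/bin/sh
set -eu

# Throwaway repository standing in for the mirror.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.invalid"
git config user.name "Demo"

for i in 1 2 3 4 5; do
    echo "revision $i" >> history.txt
    git add history.txt
    git commit -q -m "commit $i"
done

# -a: put everything into a single pack
# -d: drop the packs made redundant by it
# -f: recompute deltas rather than reusing existing ones
# --depth / --window: allow longer delta chains and a wider search
#   (illustrative values; larger means smaller packs but slower access)
git repack -a -d -f --depth=500 --window=250 -q

# Sanity check and resulting pack count.
git fsck >/dev/null 2>&1
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')
echo "packs after repack: $packs"
```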

