Thread: PostgreSQL GIT mirror status

PostgreSQL GIT mirror status

From

Heikki Linnakangas

Date:

30 December 2008, 16:19:02

The PostgreSQL GIT mirror at git.postgresql.org/git/postgresql.git was 
screwed up on Dec 13th. All the history up to that date was duplicated 
four times, and strange "fixup" commits appeared in back-branches. In 
addition, there was the old issue that back-branches were not being updated.

Both issues have now been fixed. The repository was "reset" to the 
situation before the screwup on Dec 13th, and all the patches after that 
were reapplied.

This means that if you have a clone that has been updated (pulled) since 
that date, the next time you issue fetch or pull, it will fail, 
complaining about "non fast-forward" updates. You will need to use the 
--force option to force it.

If you have any local branches in your repository, you will need to 
rebase them over the new head. With something like:
  git-rebase origin/master
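
For readers who want to try the recovery on a throwaway copy first, here is a minimal sketch of the forced fetch and rebase described above. It is an illustration under stated assumptions, not the procedure used on the mirror: the repository names "upstream" and "clone", the branch "mywork", and the file names are all made up, and it uses the "git rebase" spelling of current git rather than "git-rebase".

```shell
#!/bin/sh
# Sketch only: "upstream", "clone" and "mywork" are hypothetical names.
set -e
work=$(mktemp -d)
cd "$work"
# Throwaway identity so commits work in a clean environment.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Stand-in for the mirror: one good commit, then one from the bad import.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
echo base > f && git add f && git commit -q -m "base"
echo patch >> f && git commit -q -a -m "patch (bad import)"
cd ..

# A user clones and starts a local branch on top of the bad history.
git clone -q upstream clone
cd clone
git checkout -q -b mywork
echo mine > local.txt && git add local.txt && git commit -q -m "local work"
cd ..

# The mirror is reset to before the bad import and the patch is reapplied.
cd upstream
git reset -q --hard HEAD~1
echo patch >> f && git commit -q -a -m "patch (reapplied)"
cd ..

# In the clone: accept the non-fast-forward update, then rebase local work.
cd clone
git fetch -q --force origin
git rebase -q origin/master
ahead=$(git rev-list origin/master..mywork | wc -l | tr -d ' ')
echo "commits on mywork that are not in origin/master: $ahead"
```

The rebase leaves out the old copy of the patch because the reapplied commit introduces the same change, so only the local commit ends up on top of the new head.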


Let's hope that the script can now keep the mirror up-to-date without 
manual intervention. Let me know if there's problems.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: PostgreSQL GIT mirror status

From

Jeff Davis

Date:

30 December 2008, 18:00:46

On Tue, 2008-12-30 at 22:18 +0200, Heikki Linnakangas wrote:
> Let's hope that the script can now keep the mirror up-to-date without 
> manual intervention. Let me know if there's problems.
> 

Thanks for putting in the time. I find the git repo very helpful,
especially during patch review.

One question though: does "git repack" ever get run? Sometimes the
repository seems a little slow, but maybe that's just because it's big.

Regards,
	Jeff Davis

Re: PostgreSQL GIT mirror status

From

Peter Eisentraut

Date:

30 December 2008, 18:58:16

On Wednesday 31 December 2008 00:00:39 Jeff Davis wrote:
> One question though: does "git repack" ever get run?

Yes, after every update.

Re: PostgreSQL GIT mirror status

From

"Daniel Farina"

Date:

05 January 2009, 19:15:06

On Tue, Dec 30, 2008 at 2:58 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Wednesday 31 December 2008 00:00:39 Jeff Davis wrote:
>> One question though: does "git repack" ever get run?
>
> Yes, after every update.
>

Follow-up question: does "git repack -a -d" ever get run?

I have also noticed slow fetching, and have seen (via the HTTP URL
at http://git.postgresql.org/git/postgresql.git/objects/pack/) that
there seems to be a relatively large number of packs that could be to
blame.

Important side note: I don't think "git repack -a -d" is 'safe' with
dumber git protocols like HTTP, so any in-progress HTTP-based pulls
may encounter 'interesting' effects at the moment the repack finishes
and prunes away old packs.

If you want to be really thorough, consider heeding the mail archived
at http://gcc.gnu.org/ml/gcc/2007-12/msg00165.html and running an
extensive repack overnight. It *might* be worth it if it has not been
done at least once already.

fdr

Re: PostgreSQL GIT mirror status

From

Peter Eisentraut

Date:

07 January 2009, 05:58:36

Daniel Farina wrote:
> Follow-up question: does "git repack -a -d" ever get run?

No.

> If you want to be really thorough, consider heeding the mail archived
> at http://gcc.gnu.org/ml/gcc/2007-12/msg00165.html and running an
> extensive repack overnight. It *might* be worth it if it has not been
> done at least once already.

Well, if you want to give it a try and then report back about whether 
there were any noticeable effects ...

Re: PostgreSQL GIT mirror status

From

"Daniel Farina"

Date:

08 January 2009, 05:57:05

On Wed, Jan 7, 2009 at 1:58 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>
> Well, if you want to give it a try and then report back about whether there
> were any noticeable effects ...
>

I ran a regular git repack -a -d. This took about 3.5 cpu-intensive
hours, but made object counting *much* (I cannot stress that enough)
faster and made the repository shrink dramatically: 361M to 246M. I
also won't have any more open-file-limit problems (things like git
fsck --full would fail because of too many open files until I raised
ulimit -n). I should also mention that cloning from http seems
completely broken because of the huge number of packs...potentially
also an open file limit issue.

You may want to run 'git repack -a -d' also, but I'd advise waiting
until tomorrow when I write up my full report and compare that with
the much more aggressive packing options. My estimate is that, starting
from the already-repacked repository, finding new deltas will take
about nine hours with extremely aggressive settings. It has a higher
likelihood of being worthwhile on projects as large as Postgres, so
we'll see.

After this I can either solidify the recipe I used and you can burn
another fifteen or so hours of compute time to re-derive this result
or I can simply give you the pack generated. You can use 'git fsck
--full' to ensure the pack's fidelity.

I suggest running 'git repack -a -d' to consolidate packs every once
in a while, maybe monthly or semi-monthly. It's quite cheap if there
aren't so many packs and/or loose objects. Aggressive repacking such
as what I'm doing may only be useful on a yearly basis or even
longer...unless git learns some better ways to build packs. I also
hope you (and everyone else) have git version >= 1.5.3, when the pack
format changed.
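
To make the effect of consolidating packs concrete, here is a small sketch on a throwaway repository. It is not taken from the mirror's setup: the temporary directory, file names, and the count_packs helper are hypothetical, and the three incremental repacks only simulate an update script that adds a new pack each time.

```shell
#!/bin/sh
# Sketch only: builds a throwaway repository that holds several small packs.
set -e
repo=$(mktemp -d)
cd "$repo"
# Throwaway identity so commits work in a clean environment.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
git init -q .

# Hypothetical helper: "git count-objects -v" prints a "packs: N" line.
count_packs() {
    git count-objects -v | awk '/^packs:/ { print $2 }'
}

for i in 1 2 3; do
    echo "change $i" > "file$i.txt"
    git add "file$i.txt"
    git commit -q -m "commit $i"
    git repack -q          # incremental: packs only objects not yet packed
done
packs_before=$(count_packs)

git repack -a -d -q        # -a: everything into one pack, -d: drop old packs
packs_after=$(count_packs)

echo "packs before: $packs_before, after: $packs_after"
git fsck --full > /dev/null   # a non-zero exit here would indicate corruption
```

Run from cron, the same "git repack -a -d" line is what a monthly consolidation job would amount to.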

fdr

Re: PostgreSQL GIT mirror status

From

"Daniel Farina"

Date:

09 January 2009, 06:53:29

Okay, final report:

I suggest running 'git gc' from time to time instead of repack
directly. It seems smart enough on modern git versions to have some
sensible limits and generally do the right thing to keep a repository
in shape, in spite of its name suggesting it's really 'just' for
garbage collection. It'll also detect an excessive number of packs and
consolidate them. Tweaking the gc options may be preferable to messing
around with repack options directly, but I found there was no need to
tweak to see large improvement.

Secondly, 'git gc' has the '--aggressive' option. This used to do
something really misleading, but I'm pretty sure it's fixed 'now',
although I couldn't point you to an exact version. This makes life
easy: just run 'git gc --aggressive' once in a long while. Given the
current data it seems that the pack should be about 100M
afterwards.

Thirdly, I found a lot of garbage. There was no garbage when I used
wget to fetch a copy of repo (and over 600000 objects) but then when I
pushed to a git clone git chose only to send something in the 300000
object range. I suspect the difference is in the reflog or something,
but I still can't explain why there was so much garbage that's not
connected to branches or tags. Regardless, all the branches seem
present and 'git fsck' says everything is okay. I'm trying to figure
out where those extra objects are reachable from, but that's mostly
for completeness -- everything seems to be working convincingly.

I only have access to a machine where I've set up a 'dumb' git repo
that only serves via http. It's at
http://fdr.lolrus.org/postgresql.git

If you are interested in grabbing a verbatim copy of my objects and
repo, you can run the following to get an exact, untouched mirror:

$ wget -np -erobots=off -r http://fdr.lolrus.org/postgresql.git

You will probably have to delete any spurious 'index.html' files that
wget grabs before the repository will work as-is.

Conclusion: 361M (plus pathological performance issues) to 246M (just
repacking) to 110M (aggressive repacking).

fdr


Addendum:

I tried repack with much deeper delta chains (that's what took so long
to compute as alluded to in my previous email) and it did cut down
size by another 20 megs or so, but many operations are much more
costly because of the long chains. The 20 meg increase in size buys a
lot of performance, so I think default 'git gc --aggressive' uses a
more reasonable trade-off.

Re: PostgreSQL GIT mirror status

From

"Daniel Farina"

Date:

09 January 2009, 06:55:26

On Fri, Jan 9, 2009 at 2:53 AM, Daniel Farina <drfarina@acm.org> wrote:
> $ wget -np -erobots=off -r http://fdr.lolrus.org/postgresql.git
>

Important correction: a trailing slash is needed on that, in my recollection.

Fixed:

$ wget -np -erobots=off -r http://fdr.lolrus.org/postgresql.git/

fdr

Re: PostgreSQL GIT mirror status

From

Heikki Linnakangas

Date:

09 January 2009, 07:06:29

Daniel Farina wrote:
> Secondly, 'git gc' has the '--aggressive' option. This used to do
> something really misleading, but I'm pretty sure it's fixed 'now',
> although I couldn't point you to an exact version. This makes life
> easy: just run 'git gc --aggressive' once in a long while. Given the
> current data it seems that the pack should be about 100M
> afterwards.

Wow, that's impressive! How long does a "git gc --aggressive" run take?

> Thirdly, I found a lot of garbage. There was no garbage when I used
> wget to fetch a copy of repo (and over 600000 objects) but then when I
> pushed to a git clone git chose only to send something in the 300000
> object range. I suspect the difference is in the reflog or something,
> but I still can't explain why there was so much garbage that's not
> connected to branches or tags. Regardless, all the branches seem
> present and 'git fsck' says everything is okay. I'm trying to figure
> out where those extra objects are reachable from, but that's mostly
> for completeness -- everything seems to be working convincingly.

That could be because of the duplicated history we had there in 
December, that I then fixed. I reset the branches to just before the 
screwup, and then ran fromcvs to catch up with CVS HEAD again. That 
duplicated history is probably still there, but not reachable from any 
branches or tags.

Should we run "git prune" to get rid of the garbage?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: PostgreSQL GIT mirror status

From

"Daniel Farina"

Date:

09 January 2009, 13:56:16

On Fri, Jan 9, 2009 at 3:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Wow, that's impressive! How long does a "git gc --aggressive" run take?

Actually, not that long. The main step that takes forever at this
point (starting from scratch) is counting all those objects. The
actual gc --aggressive time could probably be measured in minutes and
< 1hr on a reasonably fast machine.

> That could be because of the duplicated history we had there in December,
> that I then fixed. I reset the branches to just before the screwup, and then
> ran fromcvs to catch up with CVS HEAD again. That duplicated history is
> probably still there, but not reachable from any branches or tags.
>
> Should we run "git prune" to get rid of the garbage?
>

Sounds like a good candidate, but I don't think that alone will do
it. I've had to do something like this before when I temporarily added
some large blobs to my git repository to move them between home and
work.

I have isolated the problem to being the reflog, which sounds
about right. The "git reflog" man page says it has ways to delete
and/or expire these to be pruned, so try that first (and then tell me
if it worked as you expected, and what you did).

If it doesn't (i.e. for some reason is not pruning properly) and if
you are sure you won't need the reflog it seems that you can just
delete the 'logs' directory under the git repository (you may notice
that it seems that the repository at lolrus.org works fine, but has no
'logs' directory). That seems to be the same state as having no reflog
at all, after which a regular 'git gc' will collect most of those
objects.

"But wait, there's more!"

You'll then want to run a 'git prune', as it seems that gc will still
keep some objects around because they're inside the gc grace period,
which I believe to be distinct from the reflog. In this case it seems
that we really want them gone.

Given this information it seems like the right steps are something
like this:
1. Somehow expire and/or delete the reflogs so they register as
   garbage.
    * By making use of the 'git reflog' expiration/deletion commands
      (preferred, if one can figure out their behavior exactly)
    * Or just deleting $GITREPO/logs. (works for me at the moment)
2. Run 'git gc --aggressive'
3. Run 'git prune'
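
The steps above can be sketched on a throwaway repository as follows. This is an illustration under assumptions, not a record of what was run on the mirror: the "scratch" branch and file names are made up, the abandoned commit stands in for the duplicated history, and "--prune=now" on the gc line is an addition here that asks gc to skip its grace period.

```shell
#!/bin/sh
# Sketch only: "scratch" and the file names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
# Throwaway identity so commits work in a clean environment.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
git init -q .
echo base > f && git add f && git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)

# Create a commit, then drop the only branch that points at it.
git checkout -q -b scratch
echo junk > junk && git add junk && git commit -q -m "to be abandoned"
orphan=$(git rev-parse HEAD)
git checkout -q "$main"
git branch -q -D scratch

# At this point only the reflog still refers to the abandoned commit.
git cat-file -e "$orphan" && echo "abandoned commit still present"

# Step 1: expire every reflog entry so the commit registers as garbage.
git reflog expire --expire=now --all

# Step 2: repack aggressively; --prune=now skips the gc grace period.
git gc --quiet --aggressive --prune=now

# Step 3: explicit prune, as in the list above; a no-op if gc already
# removed the unreachable objects.
git prune

if git cat-file -e "$orphan" 2>/dev/null; then
    echo "abandoned commit survived"
else
    echo "abandoned commit is gone"
fi
```

Deleting the logs directory instead of running "git reflog expire" would replace step 1 only; the gc and prune steps stay the same.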

Alternatively, just steal the pack from fdr.lolrus.org, as mentioned
above.

fdr

Re: PostgreSQL GIT mirror status

From

Peter Eisentraut

Date:

15 January 2009, 06:27:18

Daniel Farina wrote:
> Secondly, 'git gc' has the '--aggressive' option. This used to do
> something really misleading, but I'm pretty sure it's fixed 'now',
> although I couldn't point you to an exact version. This makes life
> easy: just run 'git gc --aggressive' once in a long while. Given the
> current data it seems that the pack should be about 100M
> afterwards.

git gc --aggressive has now been run, and the repository has shrunk 
significantly.

Thanks for the investigation.

Re: PostgreSQL GIT mirror status

From

"Daniel Farina"

Date:

15 January 2009, 13:22:31

On Thu, Jan 15, 2009 at 2:26 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> git gc --aggressive has now been run...

I did a little bit more investigation. Actually, it seems that the
--aggressive settings aren't 'fixed', but they seem okay for Postgres...

To avoid spreading misconception, I figured I should post this for
completeness:

I think --aggressive runs something akin to "git repack -a -f -d
--window=100 --depth=100" by default, (tweakable using configuration
options, I think). In fact, repacking the emacs git repository with
--aggressive causes the pack to explode in size. My guess is that some
projects would benefit from larger window sizes (although repacking
then takes longer and is more computationally expensive). Too-long
delta chains can degrade performance, so some care must be exercised
with --depth.

I have tried with the settings "git repack -a -f -d --window=250
--depth=250" as suggested in the mail by Linus Torvalds posted
previously. For the Postgres git it probably shaves off another 10MB,
so it seems the difference is somewhat negligible, as I have tried
much more aggressive settings and have not seen appreciable gain
beyond that (perhaps creeping towards another 10MB savings). It may be
worth doing if you have extra time on your hands.
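
For anyone who wants to experiment with these settings on a small scale, a sketch follows. It runs on a throwaway repository with made-up file names, so it says nothing about the sizes to expect on the Postgres repository; the gc.aggressiveWindow and gc.aggressiveDepth keys are the configuration options as named in current git, which may differ from what was available at the time of this thread.

```shell
#!/bin/sh
# Sketch only: a throwaway repository with a few revisions of one file.
set -e
repo=$(mktemp -d)
cd "$repo"
# Throwaway identity so commits work in a clean environment.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
git init -q .

for i in 1 2 3 4 5; do
    echo "revision $i of some slowly growing file" >> data.txt
    git add data.txt
    git commit -q -m "revision $i"
done

# Explicit repack: -f recomputes deltas instead of reusing existing ones;
# --window and --depth are the values quoted in the thread.
git repack -a -d -f -q --window=250 --depth=250

# Alternatively, record the same limits so "git gc --aggressive" uses them.
git config gc.aggressiveWindow 250
git config gc.aggressiveDepth 250
git gc --quiet --aggressive

packs=$(git count-objects -v | awk '/^packs:/ { print $2 }')
commits=$(git rev-list HEAD | wc -l | tr -d ' ')
echo "packs: $packs, commits: $commits"
git count-objects -v
```

Comparing the "size-pack" line of "git count-objects -v" before and after is one way to judge whether a larger window was worth the extra time.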

Also, unless git.postgresql.org is using object alternates/shared
repos, you may want to consider deleting the ./logs/ directory or
expiring the reflogs with 'git reflog': there are a lot of garbage
objects that will remain reachable otherwise and will not be collected.

fdr