Thread: robots.txt on git.postgresql.org
I note that git.postgresql.org's robots.txt refuses permission to crawl the git repository:

http://git.postgresql.org/robots.txt

User-agent: *
Disallow: /

I'm curious what motivates this. It's certainly useful to be able to search for commits. I frequently type git commit hashes into Google to find the commit in other projects. I think I've even done it in Postgres before and not had a problem. Maybe Google brought up GitHub or something else.

FWIW, the reason I noticed this is that I searched for "postgresql git log" and the first hit was "see the commit that fixed the issue, with all the gory details", which linked to
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a6e0cd7b76c04acc8c8f868a3bcd0f9ff13e16c8

This was indexed despite the robots.txt because it was linked to from elsewhere (hence the interesting link title). There are ways to ask Google not to index pages if that's really what we're after, but I don't see why we would be.

--
greg
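For reference, the usual ways to ask engines not to index a page are a robots meta tag or an X-Robots-Tag response header; both only take effect if crawlers are allowed to fetch the page, which is exactly why the commitdiff above got indexed despite the Disallow. A minimal sketch, assuming the per-page tag goes in the HTML head and the header is set site-wide via Apache with mod_headers enabled:

    <meta name="robots" content="noindex">

    # or, site-wide from the web server:
    Header set X-Robots-Tag "noindex"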
On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
> I note that git.postgresql.org's robots.txt refuses permission to crawl
> the git repository:
>
> http://git.postgresql.org/robots.txt
>
> User-agent: *
> Disallow: /
>
> I'm curious what motivates this. It's certainly useful to be able to
> search for commits.

Gitweb is horribly slow. I don't think anybody with a bigger git repo using gitweb can afford to let all the crawlers go through it.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 07/09/2013 11:24 AM, Greg Stark wrote:
> I note that git.postgresql.org's robots.txt refuses permission to crawl
> the git repository:
>
> http://git.postgresql.org/robots.txt
>
> User-agent: *
> Disallow: /
>
> I'm curious what motivates this. It's certainly useful to be able to
> search for commits. I frequently type git commit hashes into Google to
> find the commit in other projects. I think I've even done it in
> Postgres before and not had a problem. Maybe Google brought up GitHub
> or something else.
>
> FWIW, the reason I noticed this is that I searched for "postgresql git
> log" and the first hit was "see the commit that fixed the issue, with
> all the gory details", which linked to
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a6e0cd7b76c04acc8c8f868a3bcd0f9ff13e16c8
>
> This was indexed despite the robots.txt because it was linked to from
> elsewhere (hence the interesting link title). There are ways to ask
> Google not to index pages if that's really what we're after, but I
> don't see why we would be.

It's certainly not universal. For example, the only reason I found buildfarm client commit d533edea5441115d40ffcd02bd97e64c4d5814d9, for which the repo is housed at GitHub, is that Google has indexed the buildfarm commits mailing list on pgfoundry.

Do we have a robots.txt on the postgres mailing list archives site?

cheers

andrew
Andres Freund <andres@2ndquadrant.com> writes:
> Gitweb is horribly slow. I don't think anybody with a bigger git repo
> using gitweb can afford to let all the crawlers go through it.

What's blocking alternatives from being considered? I already mentioned cgit, which has the advantage of clearly showing the latest patch on all the active branches in its default view; that would match our branch usage pretty well, I think.

http://git.zx2c4.com/cgit/
http://git.gnus.org/cgit/gnus.git/

Regards,
--
Dimitri Fontaine    http://2ndQuadrant.fr
PostgreSQL : Expertise, Formation et Support
On Tue, Jul 9, 2013 at 5:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>> the git repository:
>>
>> http://git.postgresql.org/robots.txt
>>
>> User-agent: *
>> Disallow: /
>>
>> I'm curious what motivates this. It's certainly useful to be able to
>> search for commits.
>
> Gitweb is horribly slow. I don't think anybody with a bigger git repo
> using gitweb can afford to let all the crawlers go through it.

Yes, this is the reason it's been blocked. That machine basically died every time Google or Bing or Baidu or the like hit it, giving horrible response times and timeouts for actual users.

We might be able to do something better about that now that we can do better rate limiting, but it's like playing whack-a-mole. The basic software is just fantastically slow.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Tue, Jul 9, 2013 at 5:56 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> What's blocking alternatives from being considered? I already mentioned
> cgit, which has the advantage of clearly showing the latest patch on all
> the active branches in its default view; that would match our branch
> usage pretty well, I think.

Time and testing. For one thing, we need something that works with the fact that we have multiple repositories on that same box. It may well be that these do, but it needs to be verified. It also needs to be able to give an overview, to selectively hide some repositories, etc.

Oh, and we need stable wheezy packages for them, or we'll be paying even more in maintenance. AFAICT there aren't any for cgit, but maybe I'm searching for the wrong thing...

If they do all those things, and people do like those interfaces, then sure, we can do that.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
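For reference, the multiple-repositories, overview, and selective-hiding requirements would map onto cgit's cgitrc configuration roughly as follows. This is only a sketch: the paths and repository names are made up, and repo.hide in particular is an assumption about what the packaged cgit version supports.

    # hypothetical /etc/cgitrc for several repos on one box
    section=Core
    repo.url=postgresql.git
    repo.path=/srv/git/postgresql.git
    repo.desc=PostgreSQL source tree

    section=Infrastructure
    repo.url=pgweb.git
    repo.path=/srv/git/pgweb.git
    repo.desc=postgresql.org website

    # hide a repo from the index page while still serving it
    repo.url=private-thing.git
    repo.path=/srv/git/private-thing.git
    repo.hide=1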
Magnus Hagander <magnus@hagander.net> writes:
> On Tue, Jul 9, 2013 at 5:56 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>> What's blocking alternatives from being considered? I already mentioned
>> cgit, which has the advantage of clearly showing the latest patch on all
>> the active branches in its default view; that would match our branch
>> usage pretty well, I think.
> ...
> If they do all those things, and people do like those interfaces, then
> sure, we can do that.

cgit is what Red Hat is using, and I have to say I don't like it much. I find gitweb much more pleasant overall. There are a few nice things in cgit, but lots of things that are worse.

regards, tom lane
Magnus Hagander <magnus@hagander.net> writes:
> Oh, and we need stable wheezy packages for them, or we'll be paying
> even more in maintenance. AFAICT there aren't any for cgit, but maybe
> I'm searching for the wrong thing...

Seems to be a loser on that front too.

--
Dimitri Fontaine    http://2ndQuadrant.fr
PostgreSQL : Expertise, Formation et Support
On 07/09/2013 11:30 PM, Andres Freund wrote:
> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>> the git repository:
>>
>> http://git.postgresql.org/robots.txt
>>
>> User-agent: *
>> Disallow: /
>>
>> I'm curious what motivates this. It's certainly useful to be able to
>> search for commits.
>
> Gitweb is horribly slow. I don't think anybody with a bigger git repo
> using gitweb can afford to let all the crawlers go through it.

Wouldn't whacking a reverse proxy in front be a pretty reasonable option? There's a disk space cost, but using Apache's mod_proxy or similar would do quite nicely.

--
Craig Ringer    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
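For the record, the mod_proxy-plus-cache idea would look roughly like this. A minimal sketch, assuming wheezy-era Apache 2.2 with mod_proxy, mod_cache, and mod_disk_cache loaded; the backend address, cache path, and expiry are made up:

    # reverse-proxy gitweb and cache responses on disk
    ProxyPass        /gitweb/ http://127.0.0.1:8080/gitweb/
    ProxyPassReverse /gitweb/ http://127.0.0.1:8080/gitweb/

    CacheEnable disk /gitweb/
    CacheRoot /var/cache/apache2/gitweb
    CacheIgnoreNoLastMod On      # cache responses even if they lack Last-Modified
    CacheDefaultExpire 3600      # keep cached pages for an hour by default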
On Wed, Jul 10, 2013 at 9:25 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 07/09/2013 11:30 PM, Andres Freund wrote:
>> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>>> the git repository:
>>>
>>> http://git.postgresql.org/robots.txt
>>>
>>> User-agent: *
>>> Disallow: /
>>>
>>> I'm curious what motivates this. It's certainly useful to be able to
>>> search for commits.
>>
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> Wouldn't whacking a reverse proxy in front be a pretty reasonable
> option? There's a disk space cost, but using Apache's mod_proxy or
> similar would do quite nicely.

It's already sitting behind Varnish, but the vast majority of pages on that site would only ever be hit by crawlers anyway, so I doubt that would help a great deal, as those pages would likely expire from the cache before it really saved us anything.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
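One knob that could stretch the existing Varnish cache a little: commit pages are immutable once the hash exists, so their TTL can be very long. A rough sketch in Varnish 3-style VCL; the URL pattern and TTL are assumptions, and it doesn't solve the long-tail problem described above, since most of those pages are never requested twice:

    sub vcl_fetch {
        # commit and commitdiff pages never change for a given hash,
        # so keep them around much longer than the default TTL
        if (req.url ~ "a=commit" || req.url ~ "a=commitdiff") {
            set beresp.ttl = 7d;
        }
    }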
On Wed, Jul 10, 2013 at 10:25 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 07/09/2013 11:30 PM, Andres Freund wrote:
>> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>>> the git repository:
>>>
>>> http://git.postgresql.org/robots.txt
>>>
>>> User-agent: *
>>> Disallow: /
>>>
>>> I'm curious what motivates this. It's certainly useful to be able to
>>> search for commits.
>>
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> Wouldn't whacking a reverse proxy in front be a pretty reasonable
> option? There's a disk space cost, but using Apache's mod_proxy or
> similar would do quite nicely.

We already run this; that's what we did to make it survive at all. The problem is that there are so many thousands of different URLs you can get to on that site, and Google indexes them all by default. It was before we had this that the site regularly died.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
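A possible middle ground between blocking everything and letting crawlers loose on every view would be to disallow only the expensive gitweb actions. This is just a sketch: the wildcard patterns are an extension honored by the major engines (Google, Bing, Yandex) but not by every crawler, and the choice of which actions to block is a guess:

    User-agent: *
    Disallow: /*a=blame
    Disallow: /*a=blobdiff
    Disallow: /*a=snapshot
    Disallow: /*a=search
    Disallow: /*a=history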
On Wed, Jul 10, 2013 at 9:36 AM, Magnus Hagander <magnus@hagander.net> wrote:
> We already run this; that's what we did to make it survive at all. The
> problem is that there are so many thousands of different URLs you can get
> to on that site, and Google indexes them all by default.

There's also https://support.google.com/webmasters/answer/48620?hl=en which lets us control how fast the Google crawler crawls. I think it's adaptive, though, so if the pages are slow it should be crawling slowly.

--
greg
On 2013-07-11 14:43:21 +0100, Greg Stark wrote:
> On Wed, Jul 10, 2013 at 9:36 AM, Magnus Hagander <magnus@hagander.net> wrote:
> > We already run this; that's what we did to make it survive at all. The
> > problem is that there are so many thousands of different URLs you can get
> > to on that site, and Google indexes them all by default.
>
> There's also https://support.google.com/webmasters/answer/48620?hl=en
> which lets us control how fast the Google crawler crawls. I think it's
> adaptive, though, so if the pages are slow it should be crawling slowly.

The problem is that gitweb gives you access to more than a million pages...

Revisions:
  git rev-list --all origin/master | wc -l
  => 77123

Branches:
  git branch --all | grep origin | wc -l

Views per commit: commit, commitdiff, tree

So, slow crawling isn't going to help very much.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
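For the record, the back-of-envelope arithmetic from those figures, counting only the per-commit views (trees and blobs per commit push the total far higher):

    # rough page-count estimate from the numbers above
    echo $(( 77123 * 3 ))    # commit + commitdiff + tree views
    # => 231369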
On Thu, Jul 11, 2013 at 3:43 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Jul 10, 2013 at 9:36 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> We already run this; that's what we did to make it survive at all. The
>> problem is that there are so many thousands of different URLs you can get
>> to on that site, and Google indexes them all by default.
>
> There's also https://support.google.com/webmasters/answer/48620?hl=en
> which lets us control how fast the Google crawler crawls. I think it's
> adaptive, though, so if the pages are slow it should be crawling slowly.

Sure, but there are plenty of other search engines as well, not just Google... Google is actually "reasonably good" at scaling back its own speed, in my experience. Which is not true of all the others.

Of course, it also has the problem of then taking a long time to actually crawl the site, since there are so many different URLs...

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
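For the non-Google engines, the usual lever is a Crawl-delay line in robots.txt: Bing and Yandex honor it, while Googlebot ignores it and only respects the Webmaster Tools setting Greg mentioned. A sketch, with the delay value picked arbitrarily:

    User-agent: *
    Crawl-delay: 10    # ask compliant crawlers to wait 10 seconds between requests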