Thread: robots.txt on git.postgresql.org

robots.txt on git.postgresql.org

From
Greg Stark
Date:
I note that git.postgresql.org's robots.txt refuses permission to crawl
the git repository:

http://git.postgresql.org/robots.txt

User-agent: *
Disallow: /


I'm curious what motivates this. It's certainly useful to be able to
search for commits. I frequently type git commit hashes into Google to
find the commit in other projects. I think I've even done it in
Postgres before and not had a problem. Maybe Google brought up github
or something else.

Fwiw the reason I noticed this is because I searched for "postgresql
git log" and the first hit was for "see the commit that fixed the
issue, with all the gory details" which linked to
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a6e0cd7b76c04acc8c8f868a3bcd0f9ff13e16c8

This was indexed despite the robots.txt because it was linked to from
elsewhere (Hence the interesting link title). There are ways to ask
Google not to index pages if that's really what we're after but I
don't see why we would be.
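
(Purely as an illustration of what that would look like, not a proposal:
the usual way to ask for non-indexing is a noindex hint rather than a
crawl block, e.g.

    <meta name="robots" content="noindex">

in the page, or an equivalent X-Robots-Tag: noindex response header. That
tells Google not to list a page even though it may still fetch it, which
is the opposite of what our Disallow does: Disallow stops the fetching,
but as above the URL can still show up in results if it's linked from
elsewhere.)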

-- 
greg



Re: robots.txt on git.postgresql.org

From
Andres Freund
Date:
On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
> I note that git.postgresql.org's robots.txt refuses permission to crawl
> the git repository:
> 
> http://git.postgresql.org/robots.txt
> 
> User-agent: *
> Disallow: /
> 
> 
> I'm curious what motivates this. It's certainly useful to be able to
> search for commits.

Gitweb is horribly slow. I don't think anybody with a bigger git repo
using gitweb can afford to let all the crawlers go through it.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: robots.txt on git.postgresql.org

From
Andrew Dunstan
Date:
On 07/09/2013 11:24 AM, Greg Stark wrote:
> I note that git.postgresql.org's robots.txt refuses permission to crawl
> the git repository:
>
> http://git.postgresql.org/robots.txt
>
> User-agent: *
> Disallow: /
>
>
> I'm curious what motivates this. It's certainly useful to be able to
> search for commits. I frequently type git commit hashes into Google to
> find the commit in other projects. I think I've even done it in
> Postgres before and not had a problem. Maybe Google brought up github
> or something else.
>
> Fwiw the reason I noticed this is because I searched for "postgresql
> git log" and the first hit was for "see the commit that fixed the
> issue, with all the gory details" which linked to
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a6e0cd7b76c04acc8c8f868a3bcd0f9ff13e16c8
>
> This was indexed despite the robots.txt because it was linked to from
> elsewhere (Hence the interesting link title). There are ways to ask
> Google not to index pages if that's really what we're after but I
> don't see why we would be.



It's certainly not universal. For example, the only reason I found 
buildfarm client commit d533edea5441115d40ffcd02bd97e64c4d5814d9, for 
which the repo is housed at GitHub, is that Google has indexed the 
buildfarm commits mailing list on pgfoundry. Do we have a robots.txt on 
the postgres mailing list archives site?
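
A quick check would be something like

    curl -s http://archives.postgresql.org/robots.txt

assuming archives.postgresql.org is still the right host for them.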

cheers

andrew



Re: robots.txt on git.postgresql.org

From
Dimitri Fontaine
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> Gitweb is horribly slow. I don't think anybody with a bigger git repo
> using gitweb can afford to let all the crawlers go through it.

What's blocking alternatives from being considered? I already mentioned
cgit, which has the advantage of clearly showing the latest patch on all
the active branches in its default view, which I think would match our
branch usage pretty well.
 http://git.zx2c4.com/cgit/ http://git.gnus.org/cgit/gnus.git/
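
For what it's worth, pointing cgit at a directory full of repositories is
roughly this much cgitrc (a sketch based on the documentation, not a
tested setup for our box; the paths are made up):

    root-title=PostgreSQL git repositories
    scan-path=/srv/git
    cache-root=/var/cache/cgit
    cache-size=1000

and it caches the generated pages itself, which should help with exactly
the crawler load problem.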

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support



Re: robots.txt on git.postgresql.org

From
Magnus Hagander
Date:
On Tue, Jul 9, 2013 at 5:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>> the git repository:
>>
>> http://git.postgresql.org/robots.txt
>>
>> User-agent: *
>> Disallow: /
>>
>>
>> I'm curious what motivates this. It's certainly useful to be able to
>> search for commits.
>
> Gitweb is horribly slow. I don't think anybody with a bigger git repo
> using gitweb can afford to let all the crawlers go through it.

Yes, this is the reason it's been blocked. That machine basically died
every time Google or Bing or Baidu or the like hit it, giving horrible
response times and timeouts for actual users.

We might be able to do something better about that now that we can do
better rate limiting, but it's like playing whack-a-mole. The basic
software is just fantastically slow.
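
If we did open it up, the cheapest knob at the robots level would be
something like

    User-agent: *
    Crawl-delay: 10

which Bing and Yandex respect but Google ignores (Google's rate has to be
set via Webmaster Tools instead). Just a sketch of what "better rate
limiting" could include, on top of whatever we do at the proxy.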


-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/



Re: robots.txt on git.postgresql.org

From
Magnus Hagander
Date:
On Tue, Jul 9, 2013 at 5:56 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> What's blocking alternatives from being considered? I already mentioned
> cgit, which has the advantage of clearly showing the latest patch on all
> the active branches in its default view, which I think would match our
> branch usage pretty well.

Time and testing.

For one thing, we need something that works with the fact that we have
multiple repositories on that same box. It may well be that these do,
but it needs to be verified. And to be able to give an overview. And to
be able to selectively hide some repositories. Etc.
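
For reference, what gitweb gives us today is roughly this (illustrative
config, not our actual one):

    $projectroot   = "/srv/git";               # all repos on the box live under here
    $export_ok     = "git-daemon-export-ok";   # only repos containing this file are listed
    $strict_export = "true";                   # and unlisted repos aren't reachable by URL either

so whatever replaces it needs equivalents for all of that.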

Oh, and we need stable wheezy packages for them, or we'll be paying
even more in maintenance. AFAICT, there aren't any for cgit, but maybe
I'm searching for the wrong thing...

If they do all those things, and people do like those interfaces, then
sure, we can do that.


-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/



Re: robots.txt on git.postgresql.org

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Tue, Jul 9, 2013 at 5:56 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>> What's blocking alternatives from being considered? I already mentioned
>> cgit, which has the advantage of clearly showing the latest patch on all
>> the active branches in its default view, which I think would match our
>> branch usage pretty well.

> ...
> If they do all those things, and people do like those interfaces, then
> sure, we can do that.

cgit is what Red Hat is using, and I have to say I don't like it much.
I find gitweb much more pleasant overall.  There are a few nice things
in cgit but lots of things that are worse.
        regards, tom lane



Re: robots.txt on git.postgresql.org

From
Dimitri Fontaine
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Oh, and we need stable wheezy packages for them, or we'll be paying
> even more in maintenance. AFAICT, there aren't any for cgit, but maybe
> I'm searching for the wrong thing..

Seems to be a loser on that front too.
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support



Re: robots.txt on git.postgresql.org

From
Craig Ringer
Date:
On 07/09/2013 11:30 PM, Andres Freund wrote:
> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>> the git repository:
>>
>> http://git.postgresql.org/robots.txt
>>
>> User-agent: *
>> Disallow: /
>>
>>
>> I'm curious what motivates this. It's certainly useful to be able to
>> search for commits.
> 
> Gitweb is horribly slow. I don't think anybody with a bigger git repo
> using gitweb can afford to let all the crawlers go through it.

Wouldn't whacking a reverse proxy in front be a pretty reasonable
option? There's a disk space cost, but using Apache's mod_proxy or
similar would do quite nicely.
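
Something along these lines, say, with mod_proxy plus mod_cache /
mod_disk_cache (a sketch only, assuming gitweb sits on a local backend
port):

    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
    CacheEnable disk /
    CacheRoot "/var/cache/apache2/gitweb"
    CacheDefaultExpire 86400

Commit pages never change once they exist, so they could be cached pretty
much forever.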

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: robots.txt on git.postgresql.org

From
Dave Page
Date:
On Wed, Jul 10, 2013 at 9:25 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 07/09/2013 11:30 PM, Andres Freund wrote:
>> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>>> the git repository:
>>>
>>> http://git.postgresql.org/robots.txt
>>>
>>> User-agent: *
>>> Disallow: /
>>>
>>>
>>> I'm curious what motivates this. It's certainly useful to be able to
>>> search for commits.
>>
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> Wouldn't whacking a reverse proxy in front be a pretty reasonable
> option? There's a disk space cost, but using Apache's mod_proxy or
> similar would do quite nicely.

It's already sitting behind Varnish, but the vast majority of pages on
that site would only ever be hit by crawlers anyway, so I doubt that'd
help a great deal as those pages would likely expire from the cache
before it really saved us anything.
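
Unless we forced very long TTLs on the commit pages specifically, which
never change once they exist, with something along these lines in the VCL
(Varnish 3-ish syntax, just a sketch):

    sub vcl_fetch {
        if (req.url ~ "a=commit") {
            set beresp.ttl = 30d;
        }
    }

but even then the cache would need to hold an awful lot of objects before
it made a real dent in crawler traffic.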

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: robots.txt on git.postgresql.org

From
Magnus Hagander
Date:
On Wed, Jul 10, 2013 at 10:25 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 07/09/2013 11:30 PM, Andres Freund wrote:
>> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>>> the git repository:
>>>
>>> http://git.postgresql.org/robots.txt
>>>
>>> User-agent: *
>>> Disallow: /
>>>
>>>
>>> I'm curious what motivates this. It's certainly useful to be able to
>>> search for commits.
>>
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> Wouldn't whacking a reverse proxy in front be a pretty reasonable
> option? There's a disk space cost, but using Apache's mod_proxy or
> similar would do quite nicely.

We already run this, that's what we did to make it survive at all. The
problem is there are so many thousands of different URLs you can get
to on that site, and google indexes them all by default.

It was before we had this that the site regularly died.


-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/



Re: robots.txt on git.postgresql.org

From
Greg Stark
Date:
On Wed, Jul 10, 2013 at 9:36 AM, Magnus Hagander <magnus@hagander.net> wrote:
> We already run this, that's what we did to make it survive at all. The
> problem is there are so many thousands of different URLs you can get
> to on that site, and google indexes them all by default.

There's also https://support.google.com/webmasters/answer/48620?hl=en
which lets us control how fast the Google crawler crawls. I think it's
adaptive, though, so if the pages are slow it should be crawling slowly.


-- 
greg



Re: robots.txt on git.postgresql.org

From
Andres Freund
Date:
On 2013-07-11 14:43:21 +0100, Greg Stark wrote:
> On Wed, Jul 10, 2013 at 9:36 AM, Magnus Hagander <magnus@hagander.net> wrote:
> > We already run this, that's what we did to make it survive at all. The
> > problem is there are so many thousands of different URLs you can get
> > to on that site, and google indexes them all by default.
> 
> There's also https://support.google.com/webmasters/answer/48620?hl=en
> which lets us control how fast the Google crawler crawls. I think it's
> adaptive, though, so if the pages are slow it should be crawling slowly.

The problem is that gitweb gives you access to more than a million
pages...
Revisions: git rev-list --all origin/master|wc -l => 77123
Branches: git branch --all|grep origin|wc -l
Views per commit: commit, commitdiff, tree
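
Back-of-the-envelope with just the commit count:

    77123 commits * 3 views  =  ~230k URLs

and that is before the per-directory tree pages and per-file blob, blame
and history pages hanging off every one of those commits, which is what
takes it well past a million.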

So, slow crawling isn't going to help very much.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: robots.txt on git.postgresql.org

From
Magnus Hagander
Date:
On Thu, Jul 11, 2013 at 3:43 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Jul 10, 2013 at 9:36 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> We already run this, that's what we did to make it survive at all. The
>> problem is there are so many thousands of different URLs you can get
>> to on that site, and google indexes them all by default.
>
> There's also https://support.google.com/webmasters/answer/48620?hl=en
> which lets us control how fast the Google crawler crawls. I think it's
> adaptive, though, so if the pages are slow it should be crawling slowly.

Sure, but there are plenty of other search engines as well, not just
Google... Google is actually "reasonably good" at scaling back its own
speed, in my experience, which is not true of all the others. Of course,
it also has the problem of then taking a long time to actually crawl the
site, since there are so many different URLs...

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/