Re: robots.txt on git.postgresql.org - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: robots.txt on git.postgresql.org
Date
Msg-id CABUevEyUM-CEmmBcHmX6VrnkHj8O7xYk6ZvfdSfk-T8O4jd-Vw@mail.gmail.com
In response to Re: robots.txt on git.postgresql.org  (Craig Ringer <craig@2ndquadrant.com>)
Responses Re: robots.txt on git.postgresql.org
List pgsql-hackers
On Wed, Jul 10, 2013 at 10:25 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 07/09/2013 11:30 PM, Andres Freund wrote:
>> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>>> the git repository:
>>>
>>> http://git.postgresql.org/robots.txt
>>>
>>> User-agent: *
>>> Disallow: /
>>>
>>>
>>> I'm curious what motivates this. It's certainly useful to be able to
>>> search for commits.
>>
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> Wouldn't whacking a reverse proxy in front be a pretty reasonable
> option? There's a disk space cost, but using Apache's mod_proxy or
> similar would do quite nicely.
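
Craig's suggestion could be sketched as an Apache caching reverse proxy along these lines (a minimal illustration only; the paths, port, and cache TTL here are assumptions, not the actual postgresql.org setup):

```apache
# Hypothetical mod_proxy + mod_cache front end for a slow gitweb backend.
# Cached pages are served from disk instead of re-rendering on every hit.
CacheRoot          /var/cache/apache2/gitweb
CacheEnable        disk /
CacheDefaultExpire 3600          # assumed TTL: serve cached pages for an hour

ProxyPass          / http://localhost:8080/
ProxyPassReverse   / http://localhost:8080/
```

The disk-space cost Craig mentions comes from the cache directory; the win is that repeated crawler hits on the same commit page no longer reach gitweb at all.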

We already run one; that's what we did to make it survive at all. The
problem is that there are so many thousands of different URLs you can
reach on that site, and Google indexes them all by default.

It was before we had this that the site regularly died.
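
A middle ground between "Disallow: /" and letting crawlers hit every URL would be a robots.txt that blocks only the expensive, effectively unbounded gitweb views while leaving commit pages indexable. This is purely a hypothetical sketch, not the live configuration, and note that wildcard patterns and Crawl-delay are non-standard extensions honored by some crawlers but not all:

```
# Hypothetical: allow commit pages, block costly gitweb actions.
User-agent: *
Disallow: /gitweb.cgi?*a=blame*
Disallow: /gitweb.cgi?*a=snapshot*
Crawl-delay: 10
```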


--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


