Re: robots.txt on git.postgresql.org - Mailing list pgsql-hackers

From Dave Page
Subject Re: robots.txt on git.postgresql.org
Date
Msg-id CA+OCxoyOiOLbk8PM_HJCfnNj=uxgmOYz+cA4s40CUm9vWYSOeA@mail.gmail.com
In response to Re: robots.txt on git.postgresql.org  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
On Wed, Jul 10, 2013 at 9:25 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 07/09/2013 11:30 PM, Andres Freund wrote:
>> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>>> the git repository:
>>>
>>> http://git.postgresql.org/robots.txt
>>>
>>> User-agent: *
>>> Disallow: /
>>>
>>>
>>> I'm curious what motivates this. It's certainly useful to be able to
>>> search for commits.
>>
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> Wouldn't whacking a reverse proxy in front be a pretty reasonable
> option? There's a disk space cost, but using Apache's mod_proxy or
> similar would do quite nicely.

It's already sitting behind Varnish, but the vast majority of pages on
that site would only ever be hit by crawlers anyway, so I doubt caching
would help a great deal: those pages would likely expire from the cache
before they were ever requested a second time.
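One option sometimes used in this situation is to give the crawler-only pages a much longer TTL in Varnish, on the theory that gitweb commit pages are effectively immutable once published. A minimal VCL sketch of that idea (the URL pattern and TTL here are illustrative assumptions, not the actual git.postgresql.org configuration):

```vcl
vcl 4.0;

sub vcl_backend_response {
    # Hypothetical: match gitweb pages; the real path layout may differ.
    if (bereq.url ~ "^/gitweb") {
        # Commit pages rarely change once published, so cache them
        # far longer than the default TTL.
        set beresp.ttl = 7d;
    }
}
```

Whether this helps depends on the crawl pattern: if crawlers rarely revisit the same page within the TTL, even a long-lived cache saves little backend work.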

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


