Thread: Static mirror generation

Static mirror generation

From
Alexey Borzov
Date:
Greetings.

I've committed a script for static mirror generation. Unlike previous
such scripts, it is just a generic spider that follows links and does
not have any knowledge about the site structure. This immediately
helped to fix several problems.
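
For illustration, a generic link-following spider of the kind described
above might look roughly like the sketch below. It is not the committed
script: the `crawl` function, the `LinkExtractor` class and the `fetch`
callable (which returns a status code and the page text) are all
hypothetical names, and a real mirror script would also write each page
to disk and rewrite links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href/src attribute values found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl restricted to the host of start_url.

    fetch(url) is a hypothetical callable returning (status, html_text).
    Returns (pages, errors): fetched pages keyed by URL, and a list of
    (url, status) for every response that was not 200.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages, errors = {}, []
    while queue:
        url = queue.popleft()
        status, body = fetch(url)
        if status != 200:
            errors.append((url, status))  # this becomes the broken-link report
            continue
        pages[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            target, _fragment = urldefrag(urljoin(url, link))
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages, errors
```

Because the spider only knows about links, anything reachable from the
start page gets mirrored, and anything that fails to load shows up in
`errors` without any site-specific configuration.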

The mirror of the website, not including the docs, is generated in ~5
minutes. I don't yet know how long it will take with all the docs; I got
tired after the first 1.5 hours.

I've set up a proof-of-concept static mirror @ http://oc.cs.msu.su/pgorg/

The mirror uses Apache's content negotiation, so that if you have
Russian set up as the preferred language in your browser, it'll come up
in Russian, else in English.
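
As a rough illustration of the language choice described above, the
sketch below picks between "ru" and "en" from a browser's
Accept-Language header. It is a simplification and not Apache's actual
negotiation algorithm; the function name `pick_language` and its
parameters are assumptions made for this example.

```python
def pick_language(accept_language, available=("en", "ru"), default="en"):
    """Return the best available language for an Accept-Language header."""
    choices = []
    for position, item in enumerate(accept_language.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means q=1
        for param in parts[1:]:
            key, _, value = param.strip().partition("=")
            if key == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Sort key: highest quality first, then order of appearance.
        choices.append((-quality, position, tag))
    for neg_quality, _position, tag in sorted(choices):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        primary = tag.split("-")[0]  # "ru-RU" matches "ru"
        if primary in available:
            return primary
    return default
```

With this sketch, a header such as `ru,en;q=0.8` selects Russian, while
a header that lists neither language falls back to English.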

Re: Static mirror generation

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-www-owner@postgresql.org
> [mailto:pgsql-www-owner@postgresql.org] On Behalf Of Alexey Borzov
> Sent: 17 June 2004 15:19
> To: pgsql-www@postgresql.org
> Subject: [pgsql-www] Static mirror generation
>
> Greetings.
>
> I've committed a script for static mirror generation. Unlike
> previous such scripts, it is just a generic spider that
> follows links and does not have any knowledge about the site
> structure. This immediately helped to fix several problems.
>
> The mirror of the website, not including the docs, is generated
> in ~5 minutes. I don't yet know how long it will take with all
> the docs; I got tired after the first 1.5 hours.

Meep, that's slow. The current build takes just a few minutes when the
server is behaving. Still, the idea of using a crawler is a good one -
at least that way nothing will get forgotten, and presumably it will
create a report of any broken links?

> I've set up a proof-of-concept static mirror @
> http://oc.cs.msu.su/pgorg/
>
> The mirror uses Apache's content negotiation, so that if you
> have Russian set up as the preferred language in your
> browser, it'll come up in Russian, else in English.

OK, sounds good. Nice work :-)

Regards, Dave.

Re: Static mirror generation

From
Alexey Borzov
Date:
Hi,

Dave Page wrote:
>>The mirror of the website, not including the docs, is generated
>>in ~5 minutes. I don't yet know how long it will take with all
>>the docs; I got tired after the first 1.5 hours.
>
> Meep, that's slow. The current build takes just a few minutes when the
> server is behaving.

I suspect this is because of the limited resources allocated to the dev
server. Marc may know better.

> Still, the idea of using a crawler is a good one -
> at least that way nothing will get forgotten, and presumably it will
> create a report of any broken links?

Yes, of course:
Jun 17 10:15:52 mirror [error] HTTP error 404 at page
http://www.alexey.beta.postgresql.org/images/editorschoice2003.jpg
Jun 17 10:16:56 mirror [error] HTTP error 404 at page
http://www.alexey.beta.postgresql.org/presskit/en/presskit74.html
Jun 17 10:17:31 mirror [error] HTTP error 404 at page
http://www.alexey.beta.postgresql.org/pgsql-bugs@postgresql.org

These are in news/events texts, I suppose. Couldn't find them in files.
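
A report like the one above can be summarised with a few lines of
Python. This is only a sketch against the log format quoted here; the
helper names are hypothetical. It also illustrates one possible
explanation (an assumption, not something confirmed in this thread) for
the third entry: an href containing a bare e-mail address without the
"mailto:" scheme resolves as a relative path on the site.

```python
import re
from urllib.parse import urljoin, urlparse

# The URL may be wrapped onto the following line, so allow any whitespace.
LOG_LINE = re.compile(r"\[error\] HTTP error (\d{3}) at page\s+(\S+)")


def broken_links(log_text):
    """Return (status, url) pairs from a spider log in the format above."""
    return [(int(status), url) for status, url in LOG_LINE.findall(log_text)]


def looks_like_bare_email(url):
    """True if the last path segment contains '@' (likely a missing mailto:)."""
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    return "@" in last_segment


# An href written without the "mailto:" scheme is treated as a relative
# link. The base URL here is assumed; the linking page is not known.
resolved = urljoin("http://www.alexey.beta.postgresql.org/",
                   "pgsql-bugs@postgresql.org")
```

Here `resolved` comes out as the same URL that appears in the third log
entry, which suggests searching the news/events texts for the bare
address rather than for a file of that name.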

Re: Static mirror generation

From
"Marc G. Fournier"
Date:
On Fri, 18 Jun 2004, Alexey Borzov wrote:

> Hi,
>
> Dave Page wrote:
>>> The mirror of the website, not including the docs, is generated in ~5
>>> minutes. I don't yet know how long it will take with all the docs; I got
>>> tired after the first 1.5 hours.
>>
>> Meep, that's slow. The current build takes just a few minutes when the
>> server is behaving.
>
> I suspect this is because of the limited resources allocated to the dev
> server. Marc may know better.

If it's the same server, and the current build takes minutes ... how could
limited resources make the difference?  it's the same resources whether
using the current build, or the spider ... :)

What I'm suspecting is that part of it is 'local machine' vs 'network lag'
though ...


----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Static mirror generation

From
Alexey Borzov
Date:
Hi!

Marc G. Fournier wrote:
>> I suspect this is because of the limited resources allocated to the
>> dev server. Marc may know better.
>
> If it's the same server, and the current build takes minutes ... how
> could limited resources make the difference?  it's the same resources
> whether using the current build, or the spider ... :)

I am confused here. You mean that www.postgresql.org and
alexey.beta.postgresql.org are on the same machine?

> What I'm suspecting is that part of it is 'local machine' vs 'network
> lag' though ...

I suspect that has to do with the server load... I tried doing the same mirror
feat now and it completed in less than a minute.

ab -n 1000 -c 10 http://alexey.beta.postgresql.org
gave me 10 requests per second, which is small for my tastes, but reasonable.
While yesterday / earlier today I saw ridiculous ~1 second page generation times.
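
To put the `ab` figure in context, the arithmetic can be sketched as
below. The first function is Little's law (mean latency = requests in
flight / throughput), which is also how `ab` derives its mean "Time per
request". The second assumes a spider that fetches pages strictly one
after another. Both function names are hypothetical, and the page count
in the usage note is an illustrative assumption, not a figure from this
thread.

```python
def mean_latency_seconds(concurrency, requests_per_second):
    """Little's law: mean time per request under a fixed concurrency."""
    return concurrency / requests_per_second


def serial_crawl_seconds(page_count, seconds_per_page):
    """Total time for a spider fetching pages one at a time."""
    return page_count * seconds_per_page


# -c 10 at 10 requests per second: mean time per request under that load.
latency_under_load = mean_latency_seconds(10, 10)
```

With ten concurrent clients and 10 requests per second, each request
waits about one second on average under that load. For a serial spider,
a hypothetical 5400 pages at one second each would take 1.5 hours, so
per-page generation time dominates the total mirror time.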

Re: Static mirror generation

From
"Marc G. Fournier"
Date:
On Fri, 18 Jun 2004, Alexey Borzov wrote:

> Hi!
>
> Marc G. Fournier wrote:
>>> I suspect this is because of the limited resources allocated to the dev
>>> server. Marc may know better.
>>
>> If it's the same server, and the current build takes minutes ... how could
>> limited resources make the difference?  it's the same resources whether
>> using the current build, or the spider ... :)
>
> I am confused here. You mean that www.postgresql.org and
> alexey.beta.postgresql.org are on the same machine?

of course ...

> I suspect that has to do with the server load... I tried doing the same
> mirror feat now and it completed in less than a minute.

that could be ... it's why I'm ordering a Dual-Athlon ...


Re: Static mirror generation

From
David Fetter
Date:
On Fri, Jun 18, 2004 at 04:12:18PM -0300, Marc G. Fournier wrote:
> On Fri, 18 Jun 2004, Alexey Borzov wrote:
>
> >Hi!
> >
> >Marc G. Fournier wrote:
> >>>I suspect this is because of the limited resources allocated to
> >>>the dev server. Marc may know better.
> >>
> >>If it's the same server, and the current build takes minutes ...
> >>how could limited resources make the difference?  it's the same
> >>resources whether using the current build, or the spider ... :)
> >
> >I am confused here. You mean that www.postgresql.org and
> >alexey.beta.postgresql.org are on the same machine?
>
> of course ...
>
> >I suspect that has to do with the server load... I tried doing the
> >same mirror feat now and it completed in less than a minute.
>
> that could be ... it's why I'm ordering a Dual-Athlon ...

BTW, I've got more interest from people to chip in personally.  If
this can wait 'til August, the PG Foundation should be able to cut you
a check.  If it can't, I can coordinate some donations.

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100   mobile: +1 415 235 3778

Remember to vote!

Re: Static mirror generation

From
"Marc G. Fournier"
Date:
On Fri, 18 Jun 2004, David Fetter wrote:

> On Fri, Jun 18, 2004 at 04:12:18PM -0300, Marc G. Fournier wrote:
>> On Fri, 18 Jun 2004, Alexey Borzov wrote:
>>
>>> Hi!
>>>
>>> Marc G. Fournier wrote:
>>>>> I suspect this is because of the limited resources allocated to
>>>>> the dev server. Marc may know better.
>>>>
>>>> If it's the same server, and the current build takes minutes ...
>>>> how could limited resources make the difference?  it's the same
>>>> resources whether using the current build, or the spider ... :)
>>>
>>> I am confused here. You mean that www.postgresql.org and
>>> alexey.beta.postgresql.org are on the same machine?
>>
>> of course ...
>>
>>> I suspect that has to do with the server load... I tried doing the
>>> same mirror feat now and it completed in less than a minute.
>>
>> that could be ... it's why I'm ordering a Dual-Athlon ...
>
> BTW, I've got more interest from people to chip in personally.  If
> this can wait 'til August, the PG Foundation should be able to cut you
> a check.  If it can't, I can coordinate some donations.

Once the money that I'm expecting gets in (it's "in transit"), I'm going to
be ordering both the new server and the new switch ... if the one quote I
got today from a supplier in the US is any indication of what I can get,
then as long as I can co-ordinate shipping from there to the co-lo facility,
the savings over what it would cost me either here or in Panama would
allow me to pick up servers a bit more often ...

The one quote I got was ~3/5ths the cost I was quoted in Panama, and about
the same savings over what it would cost me here in Canada ... I always
knew I lived in the wrong country for some things *sigh*


Re: Static mirror generation

From
"Dave Page"
Date:


> -----Original Message-----
> From: Alexey Borzov [mailto:borz_off@cs.msu.su]
> Sent: Fri 6/18/2004 7:56 PM
> To: Marc G. Fournier
> Cc: Dave Page; pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Static mirror generation
>
> ab -n 1000 -c 10 http://alexey.beta.postgresql.org
> gave me 10 requests per second, which is small for my tastes, but reasonable.
> While yesterday / earlier today I saw ridiculous ~1 second page generation times.

My guess is that you tested while the db backup was running. I've been
caught that way before - serves us right for working in the middle of the
night Canadian time :-)

Regards, Dave