Thread: 404s

404s

From
Simon Riggs
Date:
Do we keep track of 404 errors on the .org website?

If it's not possible, do we use a link checker?

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: 404s

From
"Dave Page"
Date:
On Wed, May 28, 2008 at 9:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Do we keep track of 404 errors on the .org website?

The spider logs internal errors (or used to, I haven't looked at
recent versions). Why, did you find one?

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: 404s

From
Simon Riggs
Date:
On Wed, 2008-05-28 at 10:09 +0100, Dave Page wrote:
> On Wed, May 28, 2008 at 9:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> > Do we keep track of 404 errors on the .org website?
> 
> The spider logs internal errors (or used to, I haven't looked at
> recent versions). Why, did you find one?

Yes. I'm trying to understand why we didn't spot the 404s, nor perform a
link check that would do that.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: 404s

From
"Dave Page"
Date:
On Wed, May 28, 2008 at 10:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Wed, 2008-05-28 at 10:09 +0100, Dave Page wrote:
>> On Wed, May 28, 2008 at 9:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >
>> > Do we keep track of 404 errors on the .org website?
>>
>> The spider logs internal errors (or used to, I haven't looked at
>> recent versions). Why, did you find one?
>
> Yes. I'm trying to understand why we didn't spot the 404s, nor perform a
> link check that would do that.

Probably because no one checked the log recently (we learn about errors
through other channels, but not about 404 warnings).

Care to share what you found?

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: 404s

From
Stefan Kaltenbrunner
Date:
Dave Page wrote:
> On Wed, May 28, 2008 at 10:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Wed, 2008-05-28 at 10:09 +0100, Dave Page wrote:
>>> On Wed, May 28, 2008 at 9:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> Do we keep track of 404 errors on the .org website?
>>> The spider logs internal errors (or used to, I haven't looked at
>>> recent versions). Why, did you find one?
>> Yes. I'm trying to understand why we didn't spot the 404s, nor perform a
>> link check that would do that.
> 
> Probably because noone checked the log recently (we know if errors
> occur through other channels, but not 404 warnings).

Well, the logs still have a fair number of false positives.
This is partly due to the mirror script being a bit careless at times in
what it considers a valid URL; the other part is URLs that we
once had and referenced in, say, a press release that are no longer valid
(be it a website reorg or a decision to rename directories on the FTP site).
OTOH it seems that we have at least one really broken URL in the press
FAQ page - will see if we can fix that ...


Stefan


Re: 404s

From
"Joshua D. Drake"
Date:

On Wed, 2008-05-28 at 18:23 +0200, Stefan Kaltenbrunner wrote:
> Dave Page wrote:

> > Probably because noone checked the log recently (we know if errors
> > occur through other channels, but not 404 warnings).
> 
> well the logs still have a fair number of false positives.
> This partly due to  the mirror script being a bit careless at times in 
> what it should consider a valid url and the other part is url's that we 
> once had and referenced in say a press release that are no longer valid 
> (be it website reorg or a decision to rename directories on the ftp site).
> otoh it seems that we have at least one really broken URL in the press 
> FAQ page - will see if we can fix that ...
> 

What if we used a rewrite rule on 404 to actually bring up a single
entry form that said, "Report broken page: <email> <submit>". 


Sincerely,

Joshua D. Drake



> 
> Stefan
> 



Re: 404s

From
Stefan Kaltenbrunner
Date:
Joshua D. Drake wrote:
> 
> On Wed, 2008-05-28 at 18:23 +0200, Stefan Kaltenbrunner wrote:
>> Dave Page wrote:
> 
>>> Probably because noone checked the log recently (we know if errors
>>> occur through other channels, but not 404 warnings).
>> well the logs still have a fair number of false positives.
>> This partly due to  the mirror script being a bit careless at times in 
>> what it should consider a valid url and the other part is url's that we 
>> once had and referenced in say a press release that are no longer valid 
>> (be it website reorg or a decision to rename directories on the ftp site).
>> otoh it seems that we have at least one really broken URL in the press 
>> FAQ page - will see if we can fix that ...
>>
> 
> What if we used a rewrite rule on 404 to actually bring up a single
> entry form that said, "Report broken page: <email> <submit>". 

I think it would be more reasonable to look into what it would take to
remove the (obvious) false positives and have the mirror script report
new ones automatically during site build.
Though I think what Simon was actually referring to are URLs pointing to
external sites, which we could maybe check on submission (events, training,
whatever) and refuse to accept if they are broken.
The mirroring does not really care about external sites, so we would only
be able to spot mistakes that lead to URLs that end up on wwwmaster
(like one being interpreted as a relative link or such), not ones that are
broken otherwise (domain misspelled, simply wrong, ...).
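
A minimal sketch of what such a submission-time check might look like, in
Python. This is not how the site's forms work today; the names
validate_submitted_url and http_status are made up for illustration. It
rejects anything that is not an absolute http(s) URL (which would catch the
"interpreted as a relative link" case) and then asks the remote server for a
status code:

```python
import http.client
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def http_status(url, timeout=10):
    """Return the final HTTP status for url, or None if it cannot be reached."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; keep the status code.
        return err.code
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, malformed URL, ...
        return None


def validate_submitted_url(url, fetch_status=http_status):
    """Return (accepted, reason) for a URL entered in a submission form.

    fetch_status is injectable so the check can be exercised offline.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False, "not an absolute http(s) URL"
    status = fetch_status(url)
    if status is None:
        return False, "host unreachable"
    if status >= 400:
        return False, "HTTP %d" % status
    return True, "ok"
```

Some servers answer HEAD with 405 even though GET works, so a real check
would probably want a GET fallback before refusing a submission.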


Stefan


Re: 404s

From
Simon Riggs
Date:
On Wed, 2008-05-28 at 18:42 +0200, Stefan Kaltenbrunner wrote:

> I think it would be more reasonable to look into what it would take to 
> remove the (obvious) false positives and have the mirror script report 
> new ones automatically during site build.
> Though I think what simon was actually refering to are urls pointing to 
> external sites which we could maybe check on events/training whatever 
> submission and refuse to accept them.
> The mirroring does not really care for external sites so we would only 
> be able to spot mistakes that lead to urls that end up on wwwmaster 
> (like it being interpreted as a relative link or such) not ones that are 
> broken otherwise (domain misspelled, simply wrong,...).

Specifically, yes. But I am worried that we aren't monitoring such a
basic quality issue. There might be lots of URLs in the Wiki that go bad
over time and we want to check on this, don't we?

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: 404s

From
Stefan Kaltenbrunner
Date:
Simon Riggs wrote:
> On Wed, 2008-05-28 at 18:42 +0200, Stefan Kaltenbrunner wrote:
> 
>> I think it would be more reasonable to look into what it would take to 
>> remove the (obvious) false positives and have the mirror script report 
>> new ones automatically during site build.
>> Though I think what simon was actually refering to are urls pointing to 
>> external sites which we could maybe check on events/training whatever 
>> submission and refuse to accept them.
>> The mirroring does not really care for external sites so we would only 
>> be able to spot mistakes that lead to urls that end up on wwwmaster 
>> (like it being interpreted as a relative link or such) not ones that are 
>> broken otherwise (domain misspelled, simply wrong,...).
> 
> Specifically, yes. But I am worried that we aren't monitoring such a
> basic quality issue. There might be lots of URLs in the Wiki that go bad
> over time and we want to check on this, don't we?

Well, on www.postgresql.org itself it is a rare (though not impossible)
issue, because most of the URLs there are internal links that are not
changed that often.
In the wiki case, everybody who spots an error there can fix it, and I
guess that there are already add-ons for MediaWiki available that can
help with that.


Stefan


Re: 404s

From
"Guido Barosio"
Date:
+1 but without the form and directly triggering an alert to slaves.

404 ? trigger_alert.php?missingurl=param

Do not rely on users if you want to improve the experience, though.

Regards,
gb.-

On Wed, May 28, 2008 at 9:35 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
>
>
> On Wed, 2008-05-28 at 18:23 +0200, Stefan Kaltenbrunner wrote:
>> Dave Page wrote:
>
>> > Probably because noone checked the log recently (we know if errors
>> > occur through other channels, but not 404 warnings).
>>
>> well the logs still have a fair number of false positives.
>> This partly due to  the mirror script being a bit careless at times in
>> what it should consider a valid url and the other part is url's that we
>> once had and referenced in say a press release that are no longer valid
>> (be it website reorg or a decision to rename directories on the ftp site).
>> otoh it seems that we have at least one really broken URL in the press
>> FAQ page - will see if we can fix that ...
>>
>
> What if we used a rewrite rule on 404 to actually bring up a single
> entry form that said, "Report broken page: <email> <submit>".
>
>
> Sincerely,
>
> Joshua D. Drake
>
>
>
>>
>> Stefan
>>
>



-- 
Guido Barosio
-----------------------
http://www.globant.com
guido.barosio@globant.com


Re: 404s

From
"Joshua D. Drake"
Date:

On Wed, 2008-05-28 at 18:42 +0200, Stefan Kaltenbrunner wrote:
> Joshua D. Drake wrote:

> > What if we used a rewrite rule on 404 to actually bring up a single
> > entry form that said, "Report broken page: <email> <submit>". 
> 
> I think it would be more reasonable to look into what it would take to 
> remove the (obvious) false positives and have the mirror script report 
> new ones automatically during site build.

If you know a way to get rid of people hunting for virus and path
execution I want to know :)

Joshua D. Drake






Re: 404s

From
Stefan Kaltenbrunner
Date:
Joshua D. Drake wrote:
> 
> On Wed, 2008-05-28 at 18:42 +0200, Stefan Kaltenbrunner wrote:
>> Joshua D. Drake wrote:
> 
>>> What if we used a rewrite rule on 404 to actually bring up a single
>>> entry form that said, "Report broken page: <email> <submit>". 
>> I think it would be more reasonable to look into what it would take to 
>> remove the (obvious) false positives and have the mirror script report 
>> new ones automatically during site build.
> 
> If you know a way to get rid of people hunting for virus and path
> execution I want to know :)

Well, I'm talking about the mirror script here - that one is spidering
our own site (and only that - no external URLs, obviously) and generating
the static HTML files for the mirrors.
It already logs 404s, though most of them are false positives because
the script misparses some (old) pages - this could be fixed, but I'm not
sure we can (or should) do much more, because we would have to
periodically spider all external(!) URLs.
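
To illustrate the "report only new ones during site build" idea, here is a
rough Python sketch. It is not the mirror script's actual code; the function
name, the ignore patterns and the idea of keeping a set of already-reported
URLs between builds are all assumptions:

```python
import re


def new_broken_links(found_404s, known_false_positives, previously_reported):
    """Return 404 URLs from this build that are worth reporting.

    found_404s            -- iterable of URLs the spider logged as 404
    known_false_positives -- regex patterns for URLs known to be misparsed
                             or intentionally gone (hypothetical list)
    previously_reported   -- set of URLs already reported in earlier builds
    """
    ignore = [re.compile(pattern) for pattern in known_false_positives]
    fresh = []
    for url in sorted(set(found_404s)):
        if url in previously_reported:
            continue
        if any(rx.search(url) for rx in ignore):
            continue
        fresh.append(url)
    return fresh
```

The build would then mail the returned list (if non-empty) and add those
URLs to the already-reported set, so each broken link is announced once.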


Stefan


Re: 404s

From
Stefan Kaltenbrunner
Date:
Guido Barosio wrote:
> +1 but without the form and directly triggering an alert to slaves.
> 
> 404 ? trigger_alert.php?missingurl=param

So anybody with wget and a simple shell script could (email) DoS
-slaves and wwwmaster in seconds?
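
For illustration only: if an alert-on-404 hook were ever built, it would need
something like the throttle below in front of the mailer. This is a Python
sketch under assumed names (AlertThrottle, window_seconds, max_alerts), not
an existing component. Within each time window it allows at most one alert
per URL and caps the total number of alerts, so a flood of bogus requests
cannot turn into a flood of mail:

```python
import time


class AlertThrottle:
    """Allow at most one alert per URL and max_alerts in total per window."""

    def __init__(self, window_seconds=3600, max_alerts=20, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.max_alerts = max_alerts
        self.clock = clock  # injectable so the logic can be tested
        self.window_start = clock()
        self.seen = set()  # URLs already alerted on in this window

    def should_alert(self, url):
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            # A new window begins: forget what was sent before.
            self.window_start = now
            self.seen.clear()
        if url in self.seen:
            return False
        if len(self.seen) >= self.max_alerts:
            return False
        self.seen.add(url)
        return True
```

Only URLs that actually triggered an alert are remembered, so memory use
stays bounded by max_alerts no matter how many distinct URLs are requested.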

> 
> Do not rely on users if you want to improve the experience, though.

Keep in mind that we can only detect relative URLs on our OWN
infrastructure and also that 99% of the website traffic is on
www.postgresql.org with purely static (mirrored) content, no PHP (or
whatever) support and are only partly under our control.
Only wwwmaster is dynamic, but only a fraction of traffic ends up there.


Stefan


Re: 404s

From
"Guido Barosio"
Date:
mod_friends = true; /* commitment @ postgresql.org makes my life easy */

> so anybody with wget and a simply shellscript could can (email) DoS -slaves
> and wwwmaster in seconds ?

(so curl+post wouldn't be a kiddie workaround, or are you planning to
implement CAPTCHA? [ BTW, I've heard about CAPTCHA bypassing, as easy
as dating my sister ] )

Hmmmm, what about http hooks? (rock *'s)

http://httpd.apache.org/docs/2.0/developer/hooks.html ---> 100%
transparent though.

2 cents.

>> Do not rely on users if you want to improve the experience, though.
>
> keep in mind that we can only detect relative urls on our OWN infrastructure
> and also that 99% of the website traffic is on www.postgresql.org with
> ourely static (mirrored) content, no PHP (or whatever) support and are only
> partly under our control.
> only wwwmaster is dynamic but only a fraction of traffic ends up there.

Ta, txs!

-- 
Guido Barosio
-----------------------
http://www.globant.com
guido.barosio@globant.com


Re: 404s

From
"Guido Barosio"
Date:
GET http://www.postgresql.org/blah

Not Found

The requested URL /blah was not found on this server.
Apache/2.2.3 (Debian) mod_python/3.2.10 Python/2.4.4 PHP/4.4.4-8+etch4
Server at www.postgresql.org Port 80

Though, we should at least hide (my sister's phone) some details.
(even when  Facebook shows all her pictures and makes my life
impossible!). Furthermore, we could take that "blah" string and search
the site in order to ease the experience, presenting a result letting
the user decide what to do.
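
A tiny sketch of that last idea in Python, assuming the 404 handler has a
list of known site paths available (the function name suggest_pages and that
list are assumptions). It uses difflib from the standard library to offer
the closest matches for the path that was not found:

```python
import difflib


def suggest_pages(missing_path, known_paths, limit=3):
    """Return up to `limit` known paths that look similar to missing_path."""
    return difflib.get_close_matches(missing_path, known_paths, n=limit, cutoff=0.6)
```

A request for a mistyped path such as "/dosc/" would then get "/docs/"
offered as a suggestion, while a request with nothing similar on the site
gets an empty list and the plain 404 page.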

gb.-

On Wed, May 28, 2008 at 11:54 AM, Guido Barosio <gbarosio@gmail.com> wrote:
> mod_friends = true; /* commitment @ postgresql.org makes my life easy */
>
>> so anybody with wget and a simply shellscript could can (email) DoS -slaves
>> and wwwmaster in seconds ?
>
> (so curl+post wouldn't be a kiddie workarround or are you planning to
> implement CAPTCHA? [ BTW, I've heard about CAPTCHA bypassing, as easy
> as dating my syster] )
>
> Hmmmm, what about http hooks? (rock *'s)
>
> http://httpd.apache.org/docs/2.0/developer/hooks.html ---> 100%
> transparent though.
>
> 2 cents.
>
>>> Do not rely on users if you want to improve the experience, though.
>>
>> keep in mind that we can only detect relative urls on our OWN infrastructure
>> and also that 99% of the website traffic is on www.postgresql.org with
>> ourely static (mirrored) content, no PHP (or whatever) support and are only
>> partly under our control.
>> only wwwmaster is dynamic but only a fraction of traffic ends up there.
>
> Ta, txs!
>
> --
> Guido Barosio
> -----------------------
> http://www.globant.com
> guido.barosio@globant.com
>



-- 
Guido Barosio
-----------------------
http://www.globant.com
guido.barosio@globant.com


Re: 404s

From
Tino Wildenhain
Date:
Stefan Kaltenbrunner wrote:
...
> well the logs still have a fair number of false positives.
> This partly due to  the mirror script being a bit careless at times in 
> what it should consider a valid url and the other part is url's that we 
> once had and referenced in say a press release that are no longer valid 
> (be it website reorg or a decision to rename directories on the ftp site).
> otoh it seems that we have at least one really broken URL in the press 
> FAQ page - will see if we can fix that ...

I also noticed the docs URL changed unexpectedly; this broke my
bookmarks. Generally it's not a good idea to change such links.

T.

Re: 404s

From
Magnus Hagander
Date:
Tino Wildenhain wrote:
> Stefan Kaltenbrunner wrote:
> ...
> > well the logs still have a fair number of false positives.
> > This partly due to  the mirror script being a bit careless at times
> > in what it should consider a valid url and the other part is url's
> > that we once had and referenced in say a press release that are no
> > longer valid (be it website reorg or a decision to rename
> > directories on the ftp site). otoh it seems that we have at least
> > one really broken URL in the press FAQ page - will see if we can
> > fix that ...
> 
> I also notized the docs URL changed unexpectedly, this broke my 
> Bookmarks. Generally its not a good idea to change such links.

Example, please? From what, to what?

//Magnus