Thread: 404s
Do we keep track of 404 errors on the .org website? If that's not possible, do we use a link checker?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, May 28, 2008 at 9:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Do we keep track of 404 errors on the .org website?

The spider logs internal errors (or used to; I haven't looked at recent versions). Why, did you find one?

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
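[Editorial sketch: the link check Simon asks about can be illustrated in a few lines. This is a hypothetical example, not the actual postgresql.org spider; the `fetch_status` callable stands in for a real HTTP request so the logic is testable offline.]

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def find_broken_links(html, base_url, fetch_status):
    """Return every link on the page for which fetch_status reports HTTP 404.

    fetch_status is injected (e.g. a dict lookup in tests, or a function
    doing urllib.request.urlopen and catching HTTPError in production).
    """
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return [url for url in parser.links if fetch_status(url) == 404]
```

In a real checker, `fetch_status` would issue HEAD requests and the crawl would recurse over internal pages; the sketch only shows the extract-and-check core.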
On Wed, 2008-05-28 at 10:09 +0100, Dave Page wrote:
> On Wed, May 28, 2008 at 9:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> > Do we keep track of 404 errors on the .org website?
>
> The spider logs internal errors (or used to; I haven't looked at
> recent versions). Why, did you find one?

Yes. I'm trying to understand why we didn't spot the 404s, and why we don't run a link check that would catch them.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, May 28, 2008 at 10:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Yes. I'm trying to understand why we didn't spot the 404s, and why we
> don't run a link check that would catch them.

Probably because no one checked the log recently (we hear about errors through other channels, but not 404 warnings). Care to share what you found?

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
Dave Page wrote:
> On Wed, May 28, 2008 at 10:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Yes. I'm trying to understand why we didn't spot the 404s, and why we
> > don't run a link check that would catch them.
>
> Probably because no one checked the log recently (we hear about errors
> through other channels, but not 404 warnings).

Well, the logs still have a fair number of false positives. That's partly because the mirror script is a bit careless at times about what it considers a valid URL, and partly because of URLs that we once had and referenced in, say, a press release but that are no longer valid (be it a website reorg or a decision to rename directories on the FTP site). OTOH, it seems we have at least one really broken URL on the press FAQ page - I'll see if we can fix that ...

Stefan
On Wed, 2008-05-28 at 18:23 +0200, Stefan Kaltenbrunner wrote:
> Well, the logs still have a fair number of false positives. [...]
> OTOH, it seems we have at least one really broken URL on the press
> FAQ page - I'll see if we can fix that ...

What if we used a rewrite rule on 404 to bring up a single-entry form that said, "Report broken page: <email> <submit>"?

Sincerely,

Joshua D. Drake
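[Editorial sketch: the Apache side of Joshua's idea could look roughly like the fragment below. `/report_broken.php` is an invented name, not an existing script on postgresql.org.]

```apache
# Serve a custom handler instead of the stock 404 page.  For an
# ErrorDocument handler, Apache exposes the originally requested path
# in the REDIRECT_URL environment variable, so the form can pre-fill
# the broken URL being reported.
ErrorDocument 404 /report_broken.php
```

Whether such a form is wise is exactly what the rest of the thread debates (spam, DoS on dynamic backends).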
Joshua D. Drake wrote:
> What if we used a rewrite rule on 404 to bring up a single-entry
> form that said, "Report broken page: <email> <submit>"?

I think it would be more reasonable to look into what it would take to remove the (obvious) false positives and have the mirror script report new ones automatically during the site build.
Though I think what Simon was actually referring to are URLs pointing to external sites, which we could maybe check on event/training/whatever submission and refuse to accept if broken.
The mirroring doesn't really care about external sites, so we would only be able to spot mistakes that produce URLs ending up on wwwmaster (like a URL being interpreted as a relative link or such), not ones that are broken otherwise (domain misspelled, simply wrong, ...).

Stefan
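[Editorial sketch: the filtering Stefan proposes - drop known false positives from the mirror script's 404 log and report only URLs not seen before - could look like this. The patterns and the baseline set are invented examples, not the real log format.]

```python
import re

# Hypothetical list of URL patterns known to be false positives
# (e.g. renamed ftp directories, long-dead press-release links).
KNOWN_FALSE_POSITIVES = [
    re.compile(r"^/ftp/old-dir/"),
    re.compile(r"^/press/2004/"),
]

def new_404s(logged_urls, baseline):
    """Return 404 URLs that are neither known false positives
    nor already present in the baseline from a previous site build."""
    fresh = []
    for url in logged_urls:
        if any(p.match(url) for p in KNOWN_FALSE_POSITIVES):
            continue
        if url not in baseline:
            fresh.append(url)
    return fresh
```

Run during the site build, the non-empty result could be mailed to the web team, so only genuinely new breakage generates noise.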
On Wed, 2008-05-28 at 18:42 +0200, Stefan Kaltenbrunner wrote:
> I think it would be more reasonable to look into what it would take to
> remove the (obvious) false positives and have the mirror script report
> new ones automatically during the site build. [...]

Specifically, yes. But I am worried that we aren't monitoring such a basic quality issue. There might be lots of URLs in the Wiki that go bad over time, and we want to check on this, don't we?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> Specifically, yes. But I am worried that we aren't monitoring such a
> basic quality issue. There might be lots of URLs in the Wiki that go bad
> over time, and we want to check on this, don't we?

Well, on www.postgresql.org itself it is a rare (though not impossible) issue, because most of the URLs there are internal links that don't change very often. In the wiki's case, everybody who spots an error there can fix it, and I guess there are already MediaWiki add-ons available that can help with that.

Stefan
+1, but without the form - directly triggering an alert to the slaves.

404? trigger_alert.php?missingurl=param

Do not rely on users if you want to improve the experience, though.

Regards,
gb.-

On Wed, May 28, 2008 at 9:35 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> What if we used a rewrite rule on 404 to bring up a single-entry
> form that said, "Report broken page: <email> <submit>"?

-- 
Guido Barosio
-----------------------
http://www.globant.com
guido.barosio@globant.com
On Wed, 2008-05-28 at 18:42 +0200, Stefan Kaltenbrunner wrote:
> Joshua D. Drake wrote:
> > What if we used a rewrite rule on 404 to bring up a single-entry
> > form that said, "Report broken page: <email> <submit>"?
>
> I think it would be more reasonable to look into what it would take to
> remove the (obvious) false positives and have the mirror script report
> new ones automatically during the site build.

If you know a way to get rid of people hunting for viruses and path-execution exploits, I want to know :)

Joshua D. Drake
Joshua D. Drake wrote:
> If you know a way to get rid of people hunting for viruses and
> path-execution exploits, I want to know :)

Well, I'm talking about the mirror script here - it spiders our own site (and only that; no external URLs, obviously) and generates the static HTML files for the mirrors. It already logs 404s, though most of them are false positives because the script misparses some (old) pages. That could be fixed, but I'm not sure we can (or should) do much more, because we would have to periodically spider all external(!) URLs.

Stefan
Guido Barosio wrote:
> +1, but without the form - directly triggering an alert to the slaves.
>
> 404? trigger_alert.php?missingurl=param

So anybody with wget and a simple shell script could (email-)DoS the slaves and wwwmaster in seconds?

> Do not rely on users if you want to improve the experience, though.

Keep in mind that we can only detect relative URLs on our OWN infrastructure, and also that 99% of the website traffic goes to www.postgresql.org, which serves purely static (mirrored) content, has no PHP (or whatever) support, and is only partly under our control. Only wwwmaster is dynamic, and only a fraction of the traffic ends up there.

Stefan
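[Editorial sketch: one common way to blunt the flooding concern Stefan raises is to alert at most once per URL per time window, rather than on every hit. The window length and the injected clock are assumptions for illustration.]

```python
import time

class AlertThrottle:
    """Suppress repeated alerts for the same missing URL within a window."""

    def __init__(self, window_seconds=3600, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock            # injectable for testing
        self.last_sent = {}           # url -> timestamp of last alert

    def should_alert(self, url):
        """True only if no alert for this URL was sent within the window."""
        now = self.clock()
        last = self.last_sent.get(url)
        if last is not None and now - last < self.window:
            return False
        self.last_sent[url] = now
        return True
```

With this in front of `trigger_alert.php`-style reporting, a wget loop hammering one bad URL produces a single alert per hour instead of one per request; it does not, of course, stop an attacker rotating through many distinct URLs.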
mod_friends = true; /* commitment @ postgresql.org makes my life easy */

> So anybody with wget and a simple shell script could (email-)DoS the
> slaves and wwwmaster in seconds?

(So curl+POST wouldn't be a kiddie workaround - or are you planning to implement a CAPTCHA? [BTW, I've heard about CAPTCHA bypassing; it's as easy as dating my sister.])

Hmmm, what about HTTP hooks?

http://httpd.apache.org/docs/2.0/developer/hooks.html - 100% transparent, though.

2 cents.

>> Do not rely on users if you want to improve the experience, though.
>
> Keep in mind that we can only detect relative URLs on our OWN
> infrastructure [...] Only wwwmaster is dynamic, and only a fraction
> of the traffic ends up there.

Ta, thanks!

-- 
Guido Barosio
-----------------------
http://www.globant.com
guido.barosio@globant.com
GET http://www.postgresql.org/blah

    Not Found

    The requested URL /blah was not found on this server.

    Apache/2.2.3 (Debian) mod_python/3.2.10 Python/2.4.4 PHP/4.4.4-8+etch4
    Server at www.postgresql.org Port 80

Though we should at least hide some details - the server version banner, for a start (much as I'd like Facebook to hide my sister's pictures, which make my life impossible!). Furthermore, we could take that "blah" string and search the site with it, to ease the experience: present a result and let the user decide what to do.

gb.-

On Wed, May 28, 2008 at 11:54 AM, Guido Barosio <gbarosio@gmail.com> wrote:
> mod_friends = true; /* commitment @ postgresql.org makes my life easy */
> [...]

-- 
Guido Barosio
-----------------------
http://www.globant.com
guido.barosio@globant.com
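[Editorial sketch: Guido's "search the site" idea - offering the closest-looking known paths on a 404 instead of a bare error page - can be approximated with fuzzy matching. The path list is an invented example.]

```python
import difflib

def suggest_paths(missing, known_paths, limit=3):
    """Return up to `limit` known site paths that resemble the missing one.

    Uses difflib's similarity ratio; cutoff=0.6 keeps only reasonably
    close matches, so unrelated pages aren't suggested.
    """
    return difflib.get_close_matches(missing, known_paths, n=limit, cutoff=0.6)
```

A custom 404 handler could call this against the sitemap and render "Did you mean ...?" links, keeping the page useful without exposing server internals.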
Stefan Kaltenbrunner wrote:
...
> Well, the logs still have a fair number of false positives. [...]
> OTOH, it seems we have at least one really broken URL on the press
> FAQ page - I'll see if we can fix that ...

I also noticed that the docs URL changed unexpectedly; this broke my bookmarks. Generally it's not a good idea to change such links.

T.
Tino Wildenhain wrote:
> I also noticed that the docs URL changed unexpectedly; this broke my
> bookmarks. Generally it's not a good idea to change such links.

Example, please? From what, to what?

//Magnus