Thread: Whack of changes to archives ...
Based on comments from Oleg, I did a bunch of cleaning up of the archives (which are still in the process of being "re-mhonarced") ...

First and foremost, a 'Last-Modified' date is now set, based on the timestamp of the file ... the script that generates the files is designed so that it doesn't go through and re-create old messages, so as long as I don't have to rebuild the whole thing 'yet again', the time stamps should stay pretty fixed ...

Second, I added a bunch of code that is aimed at reducing the amount of information that gets indexed by search engines. For instance, the banner and search code at the top of *every* page isn't displayed, nor is the followup/references stuff at the bottom ... and only the date index is searched, not the thread ones ... basically, content is presented, not the myriad of links to that same content ... basically, for the msg* files, I've stripped out *everything* except the message itself ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 20 January 2004 18:17
> To: pgsql-www@postgresql.org
> Subject: [pgsql-www] Whack of changes to archives ...
>
> Based on comments from Oleg, I did a bunch of cleaning up of the archives
> (which are still in the process of being "re-mhonarced") ...
>
> First and foremost, a 'Last-Modified' date is now set, based on the
> timestamp of the file ... the script that generates the files is designed
> so that it doesn't go through and re-create old messages, so as long as I
> don't have to rebuild the whole thing 'yet again', the time stamps should
> stay pretty fixed ...
>
> Second, I added a bunch of code that is aimed at reducing the amount of
> information that gets indexed by search engines. For instance, the banner
> and search code at the top of *every* page isn't displayed, nor is the
> followup/references stuff at the bottom ... and only the date index is
> searched, not the thread ones ... basically, content is presented, not the
> myriad of links to that same content ... basically, for the msg* files,
> I've stripped out *everything* except the message itself ...

Arrggh! You know I'm in the middle of indexing that lot!

The banner code etc. can be removed from indexes by wrapping it in <!--noindex--> <!--/noindex--> or similar tags as appropriate to the search engine being used (currently I'm building an ASPSeek index which is proving to work very nicely).

Regards, Dave.

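To illustrate Dave's suggestion, here is a minimal Python sketch of what an indexer might do with such markers: drop everything between an opening and a closing comment before indexing the page. The marker spelling is taken from his message; the function name `strip_noindex` and the sample page are hypothetical, and a real search engine (ASPSeek included) may use different tags and rules.

```python
import re

# Marker pair as written in the thread; real indexers each define their own
# syntax, so treat this pattern as an assumption.
NOINDEX_RE = re.compile(
    r"<!--noindex-->.*?<!--/noindex-->", re.DOTALL | re.IGNORECASE
)


def strip_noindex(html: str) -> str:
    """Return html with every noindex-wrapped region removed."""
    return NOINDEX_RE.sub("", html)


# Hypothetical archive page: banner on top, follow-up links at the bottom.
page = (
    "<!--noindex--><div>banner and search form</div><!--/noindex-->"
    "<pre>message body</pre>"
    "<!--noindex--><ul><li>follow-ups</li></ul><!--/noindex-->"
)
indexable = strip_noindex(page)
print(indexable)
```

With this approach the pages served to people and to crawlers stay identical, and the decision about what to skip moves into the indexer.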
On Tue, 2004-01-20 at 13:17, Marc G. Fournier wrote:
>
> Based on comments from Oleg, I did a bunch of cleaning up of the archives
> (which are still in the process of being "re-mhonarced") ...
>
> First and foremost, a 'Last-Modified' date is now set, based on the
> timestamp of the file ... the script that generates the files is designed
> so that it doesn't go through and re-create old messages, so as long as I
> don't have to rebuild the whole thing 'yet again', the time stamps should
> stay pretty fixed ...
>
> Second, I added a bunch of code that is aimed at reducing the amount of
> information that gets indexed by search engines. For instance, the banner
> and search code at the top of *every* page isn't displayed, nor is the
> followup/references stuff at the bottom ... and only the date index is
> searched, not the thread ones ... basically, content is presented, not the
> myriad of links to that same content ... basically, for the msg* files,
> I've stripped out *everything* except the message itself ...
>

Woohoo! Been wishing for a couple of these. Thanks Marc! :-)

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

Marc,

have you checked what exactly Last-Modified is?

megera@mira:~/app/php$ curl -I http://archives.postgresql.org/pgsql-hackers/2004-01/msg00419.php
HTTP/1.1 200 OK
Date: Tue, 20 Jan 2004 20:32:15 GMT
Server: Apache/1.3.28 (Unix) PHP/4.3.3RC1
X-Powered-By: PHP/4.3.3RC1
Last-Modified: 01/20/04 17:03:50
Content-Type: text/html

But the message itself was posted:

    * From: Lamar Owen <lowen ( at ) pari ( dot ) edu>
    * To: pgsql-hackers ( at ) postgresql ( dot ) org
    * Subject: Old binary packages.
    * Date: Mon, 19 Jan 2004 14:35:57 -0500
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Oleg

On Tue, 20 Jan 2004, Marc G. Fournier wrote:

>
> Based on comments from Oleg, I did a bunch of cleaning up of the archives
> (which are still in the process of being "re-mhonarced") ...
>
> First and foremost, a 'Last-Modified' date is now set, based on the
> timestamp of the file ... the script that generates the files is designed
> so that it doesn't go through and re-create old messages, so as long as I
> don't have to rebuild the whole thing 'yet again', the time stamps should
> stay pretty fixed ...
>
> Second, I added a bunch of code that is aimed at reducing the amount of
> information that gets indexed by search engines. For instance, the banner
> and search code at the top of *every* page isn't displayed, nor is the
> followup/references stuff at the bottom ... and only the date index is
> searched, not the thread ones ... basically, content is presented, not the
> myriad of links to that same content ... basically, for the msg* files,
> I've stripped out *everything* except the message itself ...
>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Marc,

You have to specify the Last-Modified header in strict format!!!
Please check with the cacheability script:

http://www.sai.msu.su/admin/cacheability/?query=http%3A%2F%2Farchives.postgresql.org%2Fpgsql-hackers%2F2004-01%2Fmsg00282.php&descend=on

Expires          -
Cache-Control    -
Last-Modified    invalid (01/20/04 17:03:51)
ETag             -
Content-Length   - (actual size: 13593)
Server           Apache/1.3.28 (Unix) PHP/4.3.3RC1

A good tutorial on cacheability is:

http://www.mnot.net/cache_docs/

The time in an HTTP date is Greenwich Mean Time (GMT), not local time.

For example:

Expires: Fri, 30 Oct 1998 14:19:41 GMT

	Oleg

On Tue, 20 Jan 2004, Marc G. Fournier wrote:

>
> Based on comments from Oleg, I did a bunch of cleaning up of the archives
> (which are still in the process of being "re-mhonarced") ...
>
> First and foremost, a 'Last-Modified' date is now set, based on the
> timestamp of the file ... the script that generates the files is designed
> so that it doesn't go through and re-create old messages, so as long as I
> don't have to rebuild the whole thing 'yet again', the time stamps should
> stay pretty fixed ...
>
> Second, I added a bunch of code that is aimed at reducing the amount of
> information that gets indexed by search engines. For instance, the banner
> and search code at the top of *every* page isn't displayed, nor is the
> followup/references stuff at the bottom ... and only the date index is
> searched, not the thread ones ... basically, content is presented, not the
> myriad of links to that same content ... basically, for the msg* files,
> I've stripped out *everything* except the message itself ...
>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

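As a rough illustration of the strict format Oleg is asking for, the Python sketch below turns a file's modification time into an HTTP date expressed in GMT. The archive itself is generated with PHP, so this is not the code in use there; the helper name `http_date_from_mtime` and the temporary demo file are assumptions made for the example.

```python
import calendar
import os
import tempfile
from email.utils import formatdate


def http_date_from_mtime(path: str) -> str:
    """Format a file's modification time as an HTTP date in GMT."""
    return formatdate(os.path.getmtime(path), usegmt=True)


# Demo on a temporary file whose mtime is set to the instant used in
# Oleg's Expires example (30 Oct 1998 14:19:41 GMT).
known_instant = calendar.timegm((1998, 10, 30, 14, 19, 41, 0, 0, 0))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
os.utime(demo_path, (known_instant, known_instant))
header_value = http_date_from_mtime(demo_path)
os.remove(demo_path)
print("Last-Modified:", header_value)
```

Because the conversion starts from seconds since the epoch, the server's local time zone never enters the result.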
On Tue, 20 Jan 2004, Dave Page wrote:

> Arrggh! You know I'm in the middle of indexing that lot!

Sorry ... just fixed a 'bug' in the code also, but that is in the 'detect the search engine' code, and doesn't affect the physical files ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 20 January 2004 20:42
> To: Dave Page
> Cc: Marc G. Fournier; pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Whack of changes to archives ...
>
> On Tue, 20 Jan 2004, Dave Page wrote:
>
> > Arrggh! You know I'm in the middle of indexing that lot!
>
> Sorry ... just fixed a 'bug' in the code also, but that is in
> the 'detect the search engine' code, and doesn't affect the
> physical files ...

So what is the 'detect the search engine' code?

/D

oh, nothing special, I just have PHP code in there that checks for the HTTP_USER_AGENT being set ...

On Tue, 20 Jan 2004, Dave Page wrote:

> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 20 January 2004 20:42
> > To: Dave Page
> > Cc: Marc G. Fournier; pgsql-www@postgresql.org
> > Subject: RE: [pgsql-www] Whack of changes to archives ...
> >
> > On Tue, 20 Jan 2004, Dave Page wrote:
> >
> > > Arrggh! You know I'm in the middle of indexing that lot!
> >
> > Sorry ... just fixed a 'bug' in the code also, but that is in
> > the 'detect the search engine' code, and doesn't affect the
> > physical files ...
>
> So what is the 'detect the search engine' code?
>
> /D
>

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

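Marc gives no detail beyond the check on HTTP_USER_AGENT, so the following Python sketch is only a guess at the general shape of such logic, not his PHP code. The token list, the function name `show_page_chrome`, and the rule that a missing User-Agent means an automated client are all assumptions made for illustration.

```python
# Hypothetical crawler tokens, for illustration only.
CRAWLER_TOKENS = ("googlebot", "aspseek", "slurp")


def show_page_chrome(environ: dict) -> bool:
    """Decide whether to emit the banner, search form and thread links.

    environ is a CGI-style mapping; HTTP_USER_AGENT is the variable
    named in the thread.
    """
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    if not agent:
        # Assumption: no User-Agent at all means an automated client.
        return False
    # Assumption: crawlers are recognised by a substring of their agent.
    return not any(token in agent for token in CRAWLER_TOKENS)


print(show_page_chrome({"HTTP_USER_AGENT": "Mozilla/5.0"}))
print(show_page_chrome({"HTTP_USER_AGENT": "ASPseek/1.2.10"}))
print(show_page_chrome({}))
```

A check like this only changes what is rendered per request, which fits Marc's remark that the physical files are unaffected.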
On Tue, 20 Jan 2004, Oleg Bartunov wrote:

> Marc,
>
> You have to specify the Last-Modified header in strict format!!!
> Please check with the cacheability script:
>
> http://www.sai.msu.su/admin/cacheability/?query=http%3A%2F%2Farchives.postgresql.org%2Fpgsql-hackers%2F2004-01%2Fmsg00282.php&descend=on
>
> Expires          -
> Cache-Control    -
> Last-Modified    invalid (01/20/04 17:03:51)
> ETag             -
> Content-Length   - (actual size: 13593)
> Server           Apache/1.3.28 (Unix) PHP/4.3.3RC1
>
> A good tutorial on cacheability is:
>
> http://www.mnot.net/cache_docs/
>
> The time in an HTTP date is Greenwich Mean Time (GMT), not local time.
>
> For example:
>
> Expires: Fri, 30 Oct 1998 14:19:41 GMT

'k, I set a Cache-Control header with a max-age (since from reading the above URL, it sounds more useful than the Expires), and fixed the date format of the Last-Modified ...

thanks ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

On Thu, 22 Jan 2004, Marc G. Fournier wrote:

> On Tue, 20 Jan 2004, Oleg Bartunov wrote:
>
> > Marc,
> >
> > You have to specify the Last-Modified header in strict format!!!
> > Please check with the cacheability script:
> >
> > http://www.sai.msu.su/admin/cacheability/?query=http%3A%2F%2Farchives.postgresql.org%2Fpgsql-hackers%2F2004-01%2Fmsg00282.php&descend=on
> >
> > Expires          -
> > Cache-Control    -
> > Last-Modified    invalid (01/20/04 17:03:51)
> > ETag             -
> > Content-Length   - (actual size: 13593)
> > Server           Apache/1.3.28 (Unix) PHP/4.3.3RC1
> >
> > A good tutorial on cacheability is:
> >
> > http://www.mnot.net/cache_docs/
> >
> > The time in an HTTP date is Greenwich Mean Time (GMT), not local time.
> >
> > For example:
> >
> > Expires: Fri, 30 Oct 1998 14:19:41 GMT
>
> 'k, I set a Cache-Control header with a max-age (since from reading the
> above URL, it sounds more useful than the Expires), and fixed the date
> format of the Last-Modified ...

But the date in Last-Modified is still wrong:

megera@zeon:~/preview/www.astronet.ru/db/astrosearch$ curl -I http://archives.postgresql.org/pgsql-hackers/2004-01/msg00301.php
HTTP/1.1 200 OK
Date: Thu, 22 Jan 2004 05:14:27 GMT
Server: Apache/1.3.28 (Unix) PHP/4.3.3RC1
X-Powered-By: PHP/4.3.3RC1
Last-Modified: Thu, 22 Jan 2004 04:00:44 +0000
Cache-control: max-age=2592000
Content-Type: text/html

The message was actually posted on 13 Jan 2004, but has a Last-Modified date of 22 Jan 2004.

>
> thanks ...
>
>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

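The header values quoted in this thread can be checked mechanically. The Python sketch below accepts only the preferred HTTP date layout, whose zone is the literal text GMT, and also converts the max-age value from the curl output into days. HTTP additionally allows two obsolete date layouts that this check ignores, and the name `is_http_date` is a hypothetical helper, not part of any tool mentioned above.

```python
from datetime import datetime, timedelta

# Preferred HTTP date layout; the zone is the literal text "GMT".
# %a and %b follow the process locale, so English names are assumed.
HTTP_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"


def is_http_date(value: str) -> bool:
    """Return True if value parses as a preferred-format HTTP date."""
    try:
        datetime.strptime(value, HTTP_DATE_FORMAT)
    except ValueError:
        return False
    return True


# Values copied from the messages in this thread.
samples = [
    "01/20/04 17:03:50",
    "Thu, 22 Jan 2004 04:00:44 +0000",
    "Fri, 30 Oct 1998 14:19:41 GMT",
]
results = {value: is_http_date(value) for value in samples}
for value, ok in results.items():
    print(f"{value!r}: {'valid' if ok else 'invalid'}")

# The max-age from the curl output, expressed in days.
max_age_days = timedelta(seconds=2592000).days
print("max-age=2592000 is", max_age_days, "days")
```

Note that the second sample fails only because of its `+0000` suffix; many clients tolerate that spelling, but a strict checker may not. This is separate from Oleg's point, which is that the date itself reflects regeneration time rather than posting time.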
On Thu, 22 Jan 2004, Oleg Bartunov wrote:

> But the date in Last-Modified is still wrong:
>
> megera@zeon:~/preview/www.astronet.ru/db/astrosearch$ curl -I http://archives.postgresql.org/pgsql-hackers/2004-01/msg00301.php
> HTTP/1.1 200 OK
> Date: Thu, 22 Jan 2004 05:14:27 GMT
> Server: Apache/1.3.28 (Unix) PHP/4.3.3RC1
> X-Powered-By: PHP/4.3.3RC1
> Last-Modified: Thu, 22 Jan 2004 04:00:44 +0000
> Cache-control: max-age=2592000
> Content-Type: text/html
>
> The message was actually posted on 13 Jan 2004, but has a Last-Modified
> date of 22 Jan 2004.

I tried to pull in the posting date through the mhonarc .resource file, and it didn't work ... if you have any ideas on that, please feel free to let me know, but barring that, I'm just using the file modification date right now, so that unless I have to regenerate at some point (which, unless I change the .resource file, there is no reason to), the date stays fixed ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

On Thu, 22 Jan 2004, Marc G. Fournier wrote:

> On Thu, 22 Jan 2004, Oleg Bartunov wrote:
>
> > But the date in Last-Modified is still wrong:
> >
> > megera@zeon:~/preview/www.astronet.ru/db/astrosearch$ curl -I http://archives.postgresql.org/pgsql-hackers/2004-01/msg00301.php
> > HTTP/1.1 200 OK
> > Date: Thu, 22 Jan 2004 05:14:27 GMT
> > Server: Apache/1.3.28 (Unix) PHP/4.3.3RC1
> > X-Powered-By: PHP/4.3.3RC1
> > Last-Modified: Thu, 22 Jan 2004 04:00:44 +0000
> > Cache-control: max-age=2592000
> > Content-Type: text/html
> >
> > The message was actually posted on 13 Jan 2004, but has a Last-Modified
> > date of 22 Jan 2004.
>
> I tried to pull in the posting date through the mhonarc .resource file,
> and it didn't work ... if you have any ideas on that, please feel free to
> let me know, but barring that, I'm just using the file modification date
> right now, so that unless I have to regenerate at some point (which,
> unless I change the .resource file, there is no reason to), the date
> stays fixed ...

I don't have experience with mhonarc; does it produce *static* files? If so, why do the messages have a .php extension? What's the reason for doing that? For static files you could always use the 'touch' command to change the mtime :)

If the pages are served through PHP, you should have no problem setting up a proper header.

>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
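Oleg's 'touch' idea can be sketched in Python: parse the posting date from a message's Date header and set the generated file's mtime to that instant, so that a Last-Modified derived from the file then reflects when the message was posted. How the archive scripts would obtain the Date header for each file is not covered in the thread; the function name `touch_to_posting_date` and the temporary demo file are assumptions, and the header value is the one Oleg quoted earlier.

```python
import os
import tempfile
from email.utils import formatdate, parsedate_to_datetime


def touch_to_posting_date(path: str, date_header: str) -> float:
    """Set path's access and modification times to the message's Date."""
    posted = parsedate_to_datetime(date_header)  # keeps the -0500 offset
    stamp = posted.timestamp()                   # seconds since the epoch
    os.utime(path, (stamp, stamp))
    return stamp


# Demo on a temporary file standing in for a generated msg* page.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
touch_to_posting_date(demo_path, "Mon, 19 Jan 2004 14:35:57 -0500")
last_modified = formatdate(os.path.getmtime(demo_path), usegmt=True)
os.remove(demo_path)
print("Last-Modified:", last_modified)
```

One caveat under this approach: regenerating the archive would reset the mtimes, so the touch step would have to run again after each rebuild.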