Thread: Whack of changes to archives ...

Whack of changes to archives ...

From
"Marc G. Fournier"
Date:
Based on comments from Oleg, I did a bunch of cleaning up of the archives
(which is still in the process of being "re-mhonarced") ...

First and foremost, a 'Last-Modified' date is now set, based on the
timestamp of the file ... the script that generates the files is designed
so that it doesn't go through and re-create old messages, so as long as I
don't have to rebuild the whole thing 'yet again', the time stamps should stay
pretty fixed ...

Second, I added a bunch of code that is aimed at reducing the amount of
information that gets indexed by search engines.  For instance, the banner
and search code at the top of *every* page isn't displayed, nor is the
followup/references stuff at the bottom ... and only the date index is
searched, not the thread ones ... basically, content is presented, not the
myriad of links to that same content ... basically, for the msg* files,
I've stripped out *everything* except the message itself ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Whack of changes to archives ...

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 20 January 2004 18:17
> To: pgsql-www@postgresql.org
> Subject: [pgsql-www] Whack of changes to archives ...
>
>
> Based on comments from Oleg, I did a bunch of cleaning up of
> the archives (which is still in the process of being
> "re-mhonarced") ...
>
> First and foremost, a 'Last-Modified' date is now set, based
> on the timestamp of the file ... the script that generates
> the files is designed so that it doesn't go through and
> re-create old messages, so as long as I don't have to rebuild the
> whole thing 'yet again', the time stamps should stay pretty fixed ...
>
> Second, I added a bunch of code that is aimed at reducing the
> amount of information that gets indexed by search engines.
> For instance, the banner and search code at the top of
> *every* page isn't displayed, nor is the followup/references
> stuff at the bottom ... and only the date index is searched,
> not the thread ones ... basically, content is presented, not
> the myriad of links to that same content ... basically, for
> the msg* files, I've stripped out *everything* except the
> message itself ...

Arrggh! You know I'm in the middle of indexing that lot!

The banner code etc. can be removed from indexes by wrapping it in
<!--noindex--> <!--/noindex--> or similar tags as appropriate to the
search engine being used (currently I'm building an ASPSeek index which
is proving to work very nicely).
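
As a rough illustration of the noindex idea, here is a minimal Python sketch of what an indexer might do with such markers before it indexes a page. The marker strings follow the `<!--noindex-->` / `<!--/noindex-->` form mentioned above; the helper name `strip_noindex` and the sample page are made up for this example, and a given search engine (ASPSeek included) may use different markers or rules.

```python
import re

# Matches everything between <!--noindex--> and <!--/noindex-->, markers included.
# DOTALL lets a region span several lines; the non-greedy .*? stops at the
# first closing marker, so separate regions are removed independently.
NOINDEX_RE = re.compile(r"<!--noindex-->.*?<!--/noindex-->", re.DOTALL | re.IGNORECASE)


def strip_noindex(html: str) -> str:
    """Return html with every noindex region removed (hypothetical helper)."""
    return NOINDEX_RE.sub("", html)


# Made-up page: banner and follow-up links are wrapped, the message is not.
page = (
    "<!--noindex--><div>banner and search form</div><!--/noindex-->\n"
    "<pre>the message itself</pre>\n"
    "<!--noindex--><ul><li>follow-ups</li></ul><!--/noindex-->\n"
)
print(strip_noindex(page))
```

With this approach the physical files keep their banner and links for human readers, and only the indexer's view is reduced.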

Regards Dave.

Re: Whack of changes to archives ...

From
Robert Treat
Date:
On Tue, 2004-01-20 at 13:17, Marc G. Fournier wrote:
>
> Based on comments from Oleg, I did a bunch of cleaning up of the archives
> (which is still in the process of being "re-mhonarced") ...
>
> First and foremost, a 'Last-Modified' date is now set, based on the
> timestamp of the file ... the script that generates the files is designed
> so that it doesn't go through and re-create old messages, so as long as I
> don't have to rebuild the whole thing 'yet again', the time stamps should stay
> pretty fixed ...
>
> Second, I added a bunch of code that is aimed at reducing the amount of
> information that gets indexed by search engines.  For instance, the banner
> and search code at the top of *every* page isn't displayed, nor is the
> followup/references stuff at the bottom ... and only the date index is
> searched, not the thread ones ... basically, content is presented, not the
> myriad of links to that same content ... basically, for the msg* files,
> I've stripped out *everything* except the message itself ...
>

Woohoo! Been wishing for a couple of these. Thanks Marc! :-)

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL


Re: Whack of changes to archives ...

From
Oleg Bartunov
Date:
Marc,

have you checked what exactly Last-Modified is ?

megera@mira:~/app/php$ curl -I http://archives.postgresql.org/pgsql-hackers/2004-01/msg00419.php
HTTP/1.1 200 OK
Date: Tue, 20 Jan 2004 20:32:15 GMT
Server: Apache/1.3.28 (Unix) PHP/4.3.3RC1
X-Powered-By: PHP/4.3.3RC1
Last-Modified: 01/20/04 17:03:50
Content-Type: text/html

But the message itself was posted:
    * From: Lamar Owen <lowen ( at ) pari ( dot ) edu>
    * To: pgsql-hackers ( at ) postgresql ( dot ) org
    * Subject: Old binary packages.
    * Date: Mon, 19 Jan 2004 14:35:57 -0500
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Oleg


    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Whack of changes to archives ...

From
Oleg Bartunov
Date:
Marc,

You have to specify the Last-Modified header in strict format!!!
Please check with the cacheability script:


http://www.sai.msu.su/admin/cacheability/?query=http%3A%2F%2Farchives.postgresql.org%2Fpgsql-hackers%2F2004-01%2Fmsg00282.php&descend=on

Expires         -
Cache-Control         -
Last-Modified       invalid  (01/20/04 17:03:51)
ETag         -
Content-Length        - (actual size: 13593)
Server      Apache/1.3.28 (Unix) PHP/4.3.3RC1

Good tutorial on cacheability is:

http://www.mnot.net/cache_docs/

Note that the time in an HTTP date is Greenwich Mean Time (GMT), not local time.

For example:

Expires: Fri, 30 Oct 1998 14:19:41 GMT
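
To make the "strict format" point concrete, here is a small Python sketch (not the PHP actually running on the archive server) that formats a Unix timestamp, such as a file's mtime, as an HTTP date in GMT and checks whether a header value has that shape. The names `http_date` and `is_http_date` are invented for this example, and the check assumes an English/C locale for the day and month names.

```python
from datetime import datetime
from email.utils import formatdate

# Fixed-length HTTP date layout, e.g. "Fri, 30 Oct 1998 14:19:41 GMT".
HTTP_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"


def http_date(epoch_seconds: float) -> str:
    """Format a Unix timestamp (e.g. a file's mtime) as an HTTP date in GMT."""
    return formatdate(epoch_seconds, usegmt=True)


def is_http_date(value: str) -> bool:
    """Return True if value matches the HTTP date layout above."""
    try:
        datetime.strptime(value, HTTP_DATE_FORMAT)
    except ValueError:
        return False
    return True


print(is_http_date("01/20/04 17:03:51"))              # value reported as invalid above
print(is_http_date("Fri, 30 Oct 1998 14:19:41 GMT"))  # example from the tutorial
print(http_date(0))                                   # Thu, 01 Jan 1970 00:00:00 GMT
```

Passing `os.path.getmtime(path)` to `http_date` would give a Last-Modified value based on the file timestamp, which is the approach described at the start of this thread.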



    Oleg

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Whack of changes to archives ...

From
"Marc G. Fournier"
Date:
On Tue, 20 Jan 2004, Dave Page wrote:

> Arrggh! You know I'm in the middle of indexing that lot!

Sorry ... just fixed a 'bug' in the code also, but that is in the 'detect
the search engine' code, and doesn't affect the physical files ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Whack of changes to archives ...

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 20 January 2004 20:42
> To: Dave Page
> Cc: Marc G. Fournier; pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Whack of changes to archives ...
>
> On Tue, 20 Jan 2004, Dave Page wrote:
>
> > Arrggh! You know I'm in the middle of indexing that lot!
>
> Sorry ... just fixed a 'bug' in the code also, but that is in
> the 'detect the search engine' code, and doesn't affect the
> physical files ...

So what is 'detect the search engine code'?

/D

Re: Whack of changes to archives ...

From
"Marc G. Fournier"
Date:
oh, nothing special, I just have PHP code in there that checks for the
HTTP_USER_AGENT being set ...
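
For readers wondering what such a check could look like, here is a hedged Python sketch of user-agent based detection. It is not the PHP code on the archive server, which is not shown in this thread; the token list, the function names, and the placeholder banner/footer strings are all assumptions made for illustration.

```python
# Hypothetical substrings to look for; the real list (if any) is not given here.
CRAWLER_TOKENS = ("googlebot", "aspseek", "slurp")


def is_search_engine(environ: dict) -> bool:
    """Guess from HTTP_USER_AGENT whether the request comes from a crawler."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if not user_agent:
        # Assumption of this sketch: no User-Agent means "treat as a normal visitor".
        return False
    lowered = user_agent.lower()
    return any(token in lowered for token in CRAWLER_TOKENS)


def render_message(body: str, environ: dict) -> str:
    """Give crawlers only the message body; other visitors also get banner and links."""
    if is_search_engine(environ):
        return body
    return "[banner + search form]\n" + body + "\n[follow-ups / references]"


print(render_message("message text", {"HTTP_USER_AGENT": "Googlebot/2.1"}))
print(render_message("message text", {"HTTP_USER_AGENT": "Mozilla/5.0"}))
```

Because the decision is made per request, the generated files on disk stay the same, which matches the remark that the fix "doesn't affect the physical files".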


----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Whack of changes to archives ...

From
"Marc G. Fournier"
Date:
On Tue, 20 Jan 2004, Oleg Bartunov wrote:

> Marc,
>
> You have to specify the Last-Modified header in strict format!!!
> Please check with the cacheability script:
>
>
> http://www.sai.msu.su/admin/cacheability/?query=http%3A%2F%2Farchives.postgresql.org%2Fpgsql-hackers%2F2004-01%2Fmsg00282.php&descend=on
>
> Expires         -
> Cache-Control         -
> Last-Modified       invalid  (01/20/04 17:03:51)
> ETag         -
> Content-Length        - (actual size: 13593)
> Server      Apache/1.3.28 (Unix) PHP/4.3.3RC1
>
> Good tutorial on cacheability is:
>
> http://www.mnot.net/cache_docs/
>
> Note that the time in an HTTP date is Greenwich Mean Time (GMT), not local time.
>
> For example:
>
> Expires: Fri, 30 Oct 1998 14:19:41 GMT

'k, I set a Cache-Control header with a max-age (since from reading the
above URL, it sounds more useful than the Expires), and fixed the date
format of the Last-Modified ...

thanks ...


----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Whack of changes to archives ...

From
Oleg Bartunov
Date:
On Thu, 22 Jan 2004, Marc G. Fournier wrote:

> On Tue, 20 Jan 2004, Oleg Bartunov wrote:
>
> > Marc,
> >
> > You have to specify the Last-Modified header in strict format!!!
> > Please check with the cacheability script:
> >
> >
> > http://www.sai.msu.su/admin/cacheability/?query=http%3A%2F%2Farchives.postgresql.org%2Fpgsql-hackers%2F2004-01%2Fmsg00282.php&descend=on
> >
> > Expires         -
> > Cache-Control         -
> > Last-Modified       invalid  (01/20/04 17:03:51)
> > ETag         -
> > Content-Length        - (actual size: 13593)
> > Server      Apache/1.3.28 (Unix) PHP/4.3.3RC1
> >
> > Good tutorial on cacheability is:
> >
> > http://www.mnot.net/cache_docs/
> >
> > Note that the time in an HTTP date is Greenwich Mean Time (GMT), not local time.
> >
> > For example:
> >
> > Expires: Fri, 30 Oct 1998 14:19:41 GMT
>
> 'k, I set a Cache-Control header with a max-age (since from reading the
> above URL, it sounds more useful than the Expires), and fixed the date
> format of the Last-Modified ...

But the date in Last-Modified is still wrong:

megera@zeon:~/preview/www.astronet.ru/db/astrosearch$ curl -I http://archives.postgresql.org/pgsql-hackers/2004-01/msg00301.php
HTTP/1.1 200 OK
Date: Thu, 22 Jan 2004 05:14:27 GMT
Server: Apache/1.3.28 (Unix) PHP/4.3.3RC1
X-Powered-By: PHP/4.3.3RC1
Last-Modified: Thu, 22 Jan 2004 04:00:44 +0000
Cache-control: max-age=2592000
Content-Type: text/html

The message was actually posted 13 Jan 2004, but has a Last-Modified date
of 22 Jan 2004.


>
> thanks ...
>
>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Whack of changes to archives ...

From
"Marc G. Fournier"
Date:
On Thu, 22 Jan 2004, Oleg Bartunov wrote:

> But the date in Last-Modified is still wrong:
>
> megera@zeon:~/preview/www.astronet.ru/db/astrosearch$ curl -I http://archives.postgresql.org/pgsql-hackers/2004-01/msg00301.php
> HTTP/1.1 200 OK
> Date: Thu, 22 Jan 2004 05:14:27 GMT
> Server: Apache/1.3.28 (Unix) PHP/4.3.3RC1
> X-Powered-By: PHP/4.3.3RC1
> Last-Modified: Thu, 22 Jan 2004 04:00:44 +0000
> Cache-control: max-age=2592000
> Content-Type: text/html
>
> The message was actually posted 13 Jan 2004, but has a Last-Modified date
> of 22 Jan 2004.

I tried to pull in the posting date through the mhonarc .resource file, and it
didn't work ... if you have any ideas on that, please feel free to let me
know, but barring that, I'm just using the file modification date right
now, so that unless I have to regenerate at some point (which, unless I
change the .resource file, there is no reason to), the date stays fixed
...
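
One possible way to base Last-Modified on the posting date, sketched in Python rather than in MHonArc or PHP: take the message's own Date header and convert it to an HTTP date in GMT. How the Date header would be pulled out of the archive (for instance via the .resource file) is not covered here; the function name is hypothetical, and the sample value is the Date header quoted earlier in this thread.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def last_modified_from_date_header(date_header: str) -> str:
    """Convert an RFC 2822 message Date header into an HTTP date in GMT."""
    posted = parsedate_to_datetime(date_header)  # timezone-aware, keeps the -0500 offset
    # Normalise to UTC first; format_datetime only emits "GMT" for UTC datetimes.
    return format_datetime(posted.astimezone(timezone.utc), usegmt=True)


print(last_modified_from_date_header("Mon, 19 Jan 2004 14:35:57 -0500"))
# Mon, 19 Jan 2004 19:35:57 GMT
```

A date derived this way would not change when the archive is regenerated, which the file modification time cannot guarantee.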

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Whack of changes to archives ...

From
Oleg Bartunov
Date:
On Thu, 22 Jan 2004, Marc G. Fournier wrote:

> On Thu, 22 Jan 2004, Oleg Bartunov wrote:
>
> > But the date in Last-Modified is still wrong:
> >
> > megera@zeon:~/preview/www.astronet.ru/db/astrosearch$ curl -I http://archives.postgresql.org/pgsql-hackers/2004-01/msg00301.php
> > HTTP/1.1 200 OK
> > Date: Thu, 22 Jan 2004 05:14:27 GMT
> > Server: Apache/1.3.28 (Unix) PHP/4.3.3RC1
> > X-Powered-By: PHP/4.3.3RC1
> > Last-Modified: Thu, 22 Jan 2004 04:00:44 +0000
> > Cache-control: max-age=2592000
> > Content-Type: text/html
> >
> > The message was actually posted 13 Jan 2004, but has a Last-Modified date
> > of 22 Jan 2004.
>
> I tried to pull in posting date through the mhonarc .resource file, and it
> didn't work ... if you have any ideas on that, please feel free to let me
> know, but barring that, I'm just using the file modification date right
> now, so that unless I have to regenerate at some point (which, unless I
> change the .resource file, there is no reason to), the date stays fixed
> ...

I don't have experience with mhonarc; does it produce *static* files?
If so, why do the messages have a .php extension? What's the reason for that?
For static files you could always use the 'touch' command to change the mtime :)
If the pages are served through PHP, you should have no problem setting up
a proper header.
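
The 'touch' idea can be sketched in Python with `os.utime`, which sets a file's access and modification times much like `touch` with an explicit date. This is only an illustration under assumptions: the helper name is invented, the temporary file stands in for a generated message page, and the Date value is the one quoted earlier in this thread.

```python
import os
import tempfile
from email.utils import parsedate_to_datetime


def set_mtime_from_date_header(path: str, date_header: str) -> None:
    """Set a file's access and modification times to the message's posting time."""
    posted = parsedate_to_datetime(date_header).timestamp()  # seconds since the epoch
    os.utime(path, (posted, posted))                         # (atime, mtime)


# Demo on a throwaway file standing in for a generated message page.
with tempfile.NamedTemporaryFile(suffix=".php", delete=False) as handle:
    demo_path = handle.name
set_mtime_from_date_header(demo_path, "Mon, 19 Jan 2004 14:35:57 -0500")
print(int(os.path.getmtime(demo_path)))
os.remove(demo_path)
```

With the mtime set this way, a Last-Modified header computed from the file timestamp would reflect the posting date instead of the generation date.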

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83