Thread: Postgresql.org search engine.

Postgresql.org search engine.

From
"Dave Page"
Date:
Hi guys,

As some of you may have noticed, there is now a new search engine on the
main, and archives websites. This one is based on an unreleased
(currently) port of ASPSeek which runs on PostgreSQL.

Comments etc. welcome - you should find this one *much* faster.

Marc: I believe the mnogo stuff can all be ditched now.

Regards, Dave.

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

> Hi guys,
>
> As some of you may have noticed, there is now a new search engine on the
> main, and archives websites. This one is based on an unreleased
> (currently) port of ASPSeek which runs on PostgreSQL.
>
> Comments etc. welcome - you should find this one *much* faster.
>

I'd recommend using ispell dictionaries, so that 'databases' and 'database'
produce the same results.


> Marc: I believe the mnogo stuff can all be ditched now.

agreed !

>
> Regards, Dave.

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

> Hi guys,
>
> As some of you may have noticed, there is now a new search engine on the
> main, and archives websites. This one is based on an unreleased
> (currently) port of ASPSeek which runs on PostgreSQL.

Just checked archives, and it's still using MnogoSearch?  Or is there
something that I'm supposed to be changing over there?


----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Dave Page"
Date:
Hi Oleg,

> -----Original Message-----
> From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> Sent: 30 January 2004 16:03
> To: Dave Page
> Cc: pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
>
> I'd recommend to use ispell dictionaries, so 'databases' and
> 'database'
> will produce the same results.

Thanks, installed.

BTW, searching for 'database' really makes it think! Other queries that
generate fewer hits (e.g. MVCC or psqlodbc) seem to be far quicker.

I have also added some weighting to the indexed sites to try to give
preference to those that are more 'authoritative' and of global interest
than others. Any comments or suggestions for changes welcome as always!

# Primary sites
SiteWeight http://www.postgresql.org/ 100
SiteWeight http://advocacy.postgresql.org/ 100
SiteWeight http://jdbc.postgresql.org/ 100
SiteWeight http://developer.postgresql.org/ 100

# Authoritative project sites
SiteWeight http://gborg.postgresql.org/ 75
SiteWeight http://pgadmin.postgresql.org/ 75
SiteWeight http://phppgadmin.sourceforge.net/ 75

# User contributed stuff
SiteWeight http://techdocs.postgresql.org/ 50
SiteWeight http://archives.postgresql.org/ 50

# Outside but reliable
SiteWeight http://www.varlena.com/ 25

# And the rest...
SiteWeight http://www.postgresql.cl/ 0
SiteWeight http://postgresql.ok.cz/ 0
SiteWeight http://www.postgresql.jp/ 0
SiteWeight http://pgsql-fr.tuxfamily.org/ 0
SiteWeight http://www.linuxshare.ru/ 0
SiteWeight http://www.postgres.de/ 0
SiteWeight http://www.pgsqldb.org/ 0
SiteWeight http://www.postgresql.org.br/ 0

Regards, Dave.
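
[Editor's sketch] The weighting list above is a set of `SiteWeight <url> <weight>` lines. As a minimal illustration of how such a list could be read and grouped into tiers, the Python below parses the directives from a string. The function names (`parse_site_weights`, `group_by_weight`) are hypothetical and the parsing rules are an assumption based only on the lines shown in this message; this is not ASPSeek's own config parser, and it says nothing about how ASPSeek combines the weight with its ranking.

```python
from collections import defaultdict

# A shortened copy of the list from the message above.
CONFIG = """\
# Primary sites
SiteWeight http://www.postgresql.org/ 100
SiteWeight http://gborg.postgresql.org/ 75
SiteWeight http://techdocs.postgresql.org/ 50
SiteWeight http://www.varlena.com/ 25
SiteWeight http://www.postgresql.cl/ 0
"""


def parse_site_weights(text):
    """Return {url: weight} for every 'SiteWeight <url> <weight>' line."""
    weights = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) == 3 and parts[0] == "SiteWeight":
            weights[parts[1]] = int(parts[2])
    return weights


def group_by_weight(weights):
    """Group URLs into tiers, highest weight first."""
    tiers = defaultdict(list)
    for url, weight in weights.items():
        tiers[weight].append(url)
    return dict(sorted(tiers.items(), reverse=True))


if __name__ == "__main__":
    for weight, urls in group_by_weight(parse_site_weights(CONFIG)).items():
        print(weight, urls)
```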

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 30 January 2004 16:48
> To: Dave Page
> Cc: pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
> On Fri, 30 Jan 2004, Dave Page wrote:
>
> > Hi guys,
> >
> > As some of you may have noticed, there is now a new search
> engine on
> > the main, and archives websites. This one is based on an unreleased
> > (currently) port of ASPSeek which runs on PostgreSQL.
>
> Just checked archives, and its still using MnogoSearch?  Or
> is there something that I'm supposed to be changing over there?

It's using aspseek from here. A search for 'stuff' just gave:

Documents 1-20 of total 11042 found.    Searching in 276628 documents
took 2.736 seconds.

Followed by the ASPSeeeeeek graphical page selector.

Oh, and I changed the beige boxes to light blue.

Try a ctrl-refresh perhaps?

Regards, Dave.

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
that did it, great ... mnogosearch database being zap'd! *dances a jig*

On Fri, 30 Jan 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 30 January 2004 16:48
> > To: Dave Page
> > Cc: pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> > On Fri, 30 Jan 2004, Dave Page wrote:
> >
> > > Hi guys,
> > >
> > > As some of you may have noticed, there is now a new search
> > engine on
> > > the main, and archives websites. This one is based on an unreleased
> > > (currently) port of ASPSeek which runs on PostgreSQL.
> >
> > Just checked archives, and its still using MnogoSearch?  Or
> > is there something that I'm supposed to be changing over there?
>
> It's using aspseek from here. A search for 'stuff' just gave:
>
> Documents 1-20 of total 11042 found.    Searching in 276628 documents
> took 2.736 seconds.
>
> Followed by the ASPSeeeeeek graphical page selector.
>
> Oh, and I changed the beige boxes to light blue.
>
> Try a ctrl-refresh perhaps?
>
> Regards, Dave.
>

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

> Hi Oleg,
>
> > -----Original Message-----
> > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Sent: 30 January 2004 16:03
> > To: Dave Page
> > Cc: pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> >
> > I'd recommend to use ispell dictionaries, so 'databases' and
> > 'database'
> > will produce the same results.
>
> Thanks, installed.
>
> BTW, searching for 'database' really makes it think! Other queries that
> generate less hits (eg. Mvcc or psqlodbc) seem to be far quicker.

It thinks much longer if you search for 'pgsql database' :(
I just tried and it took ~100 sec.

This is a feature of search engines based on inverted indices.
tsearch2 works the other way round - the more words in the query, the faster
the search.
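
[Editor's sketch] To illustrate the point about inverted indices: an AND query has to read one posting list per query word and intersect them, so a very common word such as 'database' brings a very long list into the query. The toy index below is a sketch under that assumption only; `build_index` and `search_all` are hypothetical names, and nothing here reflects how ASPSeek or tsearch2 are actually implemented.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, words):
    """Return ids of documents containing every word (AND query)."""
    postings = [index.get(w.lower(), set()) for w in words]
    if not postings:
        return set()
    # Start from the shortest posting list to keep the intersection cheap.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


if __name__ == "__main__":
    docs = {
        1: "pgsql database tuning",
        2: "database backup",
        3: "pgsql odbc driver",
    }
    index = build_index(docs)
    print(search_all(index, ["pgsql", "database"]))
```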

I suggest adding 'postgresql', 'pgsql' and 'postgres' to the stop-word
list :(  BTW, you could look at the word statistics and use the top N words
as stop words.
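
[Editor's sketch] The "top N words as stop words" idea can be tried with a simple frequency count. The snippet below is a minimal sketch: `top_n_stop_words` is a hypothetical helper, the tokenizer (lowercase runs of letters and digits) is an assumption, and a real index would take the counts from its own word statistics rather than re-reading the documents.

```python
import re
from collections import Counter


def top_n_stop_words(documents, n):
    """Return the n most frequent words across all documents."""
    counts = Counter()
    for doc in documents:
        # Assumed tokenizer: lowercase runs of ASCII letters and digits.
        counts.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return [word for word, _ in counts.most_common(n)]


if __name__ == "__main__":
    docs = [
        "PostgreSQL database postgresql",
        "postgresql pgsql database",
        "PostgreSQL MVCC",
    ]
    print(top_n_stop_words(docs, 2))
```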

>
> I have also added some weighting to the indexed sites to try to give
> preference to those that are more 'authoritative' and of global interest
> than others. Any comments or suggestions for changes welcome as always!

Hmm, I thought aspseek has a sort of page rank, so let it do its work.


>
> # Primary sites
> SiteWeight http://www.postgresql.org/ 100
> SiteWeight http://advocacy.postgresql.org/ 100
> SiteWeight http://jdbc.postgresql.org/ 100
> SiteWeight http://developer.postgresql.org/ 100
>
> # Authoritative project sites
> SiteWeight http://gborg.postgresql.org/ 75
> SiteWeight http://pgadmin.postgresql.org/ 75
> SiteWeight http://phppgadmin.sourceforge.net/ 75
>
> # User contributed stuff
> SiteWeight http://techdocs.postgresql.org/ 50
> SiteWeight http://archives.postgresql.org/ 50
>
> # Outside but reliable
> SiteWeight http://www.varlena.com/ 25
>
> # And the rest...
> SiteWeight http://www.postgresql.cl/ 0
> SiteWeight http://postgresql.ok.cz/ 0
> SiteWeight http://www.postgresql.jp/ 0
> SiteWeight http://pgsql-fr.tuxfamily.org/ 0
> SiteWeight http://www.linuxshare.ru/ 0
> SiteWeight http://www.postgres.de/ 0
> SiteWeight http://www.pgsqldb.org/ 0
> SiteWeight http://www.postgresql.org.br/ 0
>
> Regards, Dave.
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
"Dave Page"
Date:
It's rumoured that Oleg Bartunov once said:
> On Fri, 30 Jan 2004, Dave Page wrote:
>
>> BTW, searching for 'database' really makes it think! Other queries
>> that generate less hits (eg. Mvcc or psqlodbc) seem to be far quicker.
>
> It would think much longer if you search 'pgsql database' :(
> Just tried and got ~100 sec.
>
Meep!

>
> I suggest to include 'postgresql', 'pgsql', 'postgres' into stop words
> list :(  btw, you may look at word statistics and let top N words
> as stop words.

OK, I'll look at that after dinner - thanks.

>> I have also added some weighting to the indexed sites to try to give
>> preference to those that are more 'authoritative' and of global
>> interest than others. Any comments or suggestions for changes welcome
>> as always!
>
> Hmm, I thought aspseek has sort of page rank, so let him works.

It does, but I'm trying to give a little preference to results on sites
with maximum appeal (ie. those in English), and the most authoritative
(ie. those that are published docs rather than list archives or user
docs).
Also, bear in mind that by default results are grouped by site on the main
search page, so generally you will see results from *all* sites indexed on
a single page (sorted with the site weighting factored in), but then drill
down into a specific site which is unaffected by the site weighting.
Regards, Dave.



Re: Postgresql.org search engine.

From
Josh Berkus
Date:
Guys,

Out of curiosity, why are we not using OpenFTS for this?

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Postgresql.org search engine.

From
"Dave Page"
Date:
It's rumoured that Josh Berkus once said:
> Guys,
>
> Out of curiosity, why are we not using OpenFTS for this?

Mainly because Oleg's site uses OpenFTS and it seemed kinda pointless
duplicating that, but also because the PostgreSQL port of ASPSeek is
proving to be very good (see http://search.oztralis.com.au/ for an example
of it searching 3.2 million pages).
Regards, Dave.



Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

> It's rumoured that Oleg Bartunov once said:
> > On Fri, 30 Jan 2004, Dave Page wrote:
> >
> >> BTW, searching for 'database' really makes it think! Other queries
> >> that generate less hits (eg. Mvcc or psqlodbc) seem to be far quicker.
> >
> > It would think much longer if you search 'pgsql database' :(
> > Just tried and got ~100 sec.
> >
> Meep!
>
> >
> > I suggest to include 'postgresql', 'pgsql', 'postgres' into stop words
> > list :(  btw, you may look at word statistics and let top N words
> > as stop words.
>
> OK, I'll look at that after dinner - thanks.

bon appetit !

>
> >> I have also added some weighting to the indexed sites to try to give
> >> preference to those that are more 'authoritative' and of global
> >> interest than others. Any comments or suggestions for changes welcome
> >> as always!
> >
> > Hmm, I thought aspseek has sort of page rank, so let him works.
>
> It does, but I'm trying to give a little preference to results on sites
> with maximum appeal (ie. those in English), and the most authoritative

sounds reasonable.

> (ie. those that are published docs rather than list archives or user
> docs).
> Also, bear in mind that by default results are grouped by site on the main
> search page, so generally you will see results from *all* sites indexed on
> a single page (sorted with the site weighting factored in), but then drill
> down into a specific site which is unaffected by the site weighting.
> Regards, Dave.
>

I don't have any experience with aspseek, but one desirable feature is
spelling support for the user's query. Does aspseek have support for this?

Design suggestion: I'd like to see the most important parts of the form on the
left side. For example, the site selector would, imho, be better leftmost,
followed by grouping, format, and number of results per page.

Also, I don't like the fixed width; in my browser I have to scroll left-right
to see the whole form :)

>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Fri, 30 Jan 2004, Josh Berkus wrote:

> Guys,
>
> Out of curiosity, why are we not using OpenFTS for this?
>

Because OpenFTS isn't an end-user application; it's a search engine and
someone has to write wrappers. We have already built a mailing list archive
search based on OpenFTS/tsearch2, but haven't had time to release it to the
production server :( Expect it on www.pgsql.ru.

>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

> It's rumoured that Josh Berkus once said:
> > Guys,
> >
> > Out of curiosity, why are we not using OpenFTS for this?
>
> Mainly because Oleg's site uses OpenFTS and it seemed kinda pointless
> duplicating that, but also because the PostgreSQL port of ASPSeek is
> proving to be very good (see http://search.oztralis.com.au/ for an example
> of it searching 3.2 million pages).

Guys, there is a big difference between a semi-static index (aspseek) and
incremental indexing of incoming documents (tsearch2). Our approach is
to develop a fully automatic searchable mailing list archive with
instant indexing. So, for example, I see my postings on this subject
already in the database and *searchable*! I don't expect the aspseek search
engine at postgresql.org to have my recent postings in its index.
OpenFTS has full access to the documents' metadata, so we could limit the
search range by date, by list, or by author, so a smart user could get
reasonable search performance (relevance is very good, because it is based on
proximity). So, different searches for different purposes!



> Regards, Dave.
>
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> Sent: 30 January 2004 19:06
> To: Dave Page
> Cc: josh@agliodbs.com; pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
>
> Guys, there is a big difference between semi-static index
> (aspseek) and incremental indexing of incoming documents
> (tsearch2). Our approach is to develop fully automatical
> searchable mailing list archive with instant indexing. So,
> for example, I see my postings about subj.
> already in database and *searchable* ! I don't expect
> aspseek's search engine at postgresql.org has my recent
> postings in its index.

No it doesn't, but it probably could do with a little clever scripting
to expire the right index pages before each run.

In addition, one of the mods made in the version we are using is the
addition of an XML feed to the indexer - John (the guy responsible for
the port) is keen for me to use this for far more efficient indexing of
the archives, however I have yet to do this mainly because it requires
hacking mhonarc about to output the XML data.

> OpenFTS has full access to metadata of documents, so we could
> limit search '
> range by date, by list, by authors, so smart user could get
> reasonable search performance (relevance is very good,
> because it based on proximity). So, different searches for
> different purposes !

We don't have those fields, but the XML feed was originally written for
indexing data from online catalogues and has added fields like price.
I'd be surprised if others couldn't be added as well.

Regards, Dave.

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> Sent: 30 January 2004 18:50
> To: Dave Page
> Cc: pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Postgresql.org search engine.
>
>
> I don't have an experience with aspseek, but one desirable
> feature - spelling support for user's query. Does aspseek has
> support for this ?

You mean like Googles speeling corektor?
(http://labs.google.com/britney.html). No, ASPSeek doesn't have this,
however I wonder how hard it might be to knock something up based on
ispell and soundex... Hmmmmm....

> Design suggestion: I'd like to see most important parts of
> form at left side, for example, site selector, imho, better
> to have left most, then grouping, format, number results per page.

Yup, agreed.

> Also, I don't like fixed width, in my browser I have to
> scroll left-right to see a whole form :)

The whole site is designed that way at the moment, but the new
multilanguage version that's in development will be sizable. I hope to
update the archives to a similar design at some point as well, just as
soon as I've persuaded Marc that it's worth regenerating the messages
again!

Regards, Dave.

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Sent: 30 January 2004 19:06
> > To: Dave Page
> > Cc: josh@agliodbs.com; pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> >
> > Guys, there is a big difference between semi-static index
> > (aspseek) and incremental indexing of incoming documents
> > (tsearch2). Our approach is to develop fully automatical
> > searchable mailing list archive with instant indexing. So,
> > for example, I see my postings about subj.
> > already in database and *searchable* ! I don't expect
> > aspseek's search engine at postgresql.org has my recent
> > postings in its index.
>
> No it doesn't, but it probably could do with a little clever scripting
> to expire the right index pages before each run.
>
> In addition, one of the mods made in the version we are using is the
> addition of an XML feed to the indexer - John (the guy responsible for
> the port) is keen for me to use this for far more efficient indexing of
> the archives, however I have yet to do this mainly because it requires
> hacking mhonarc about to output the XML data.
>
> > OpenFTS has full access to metadata of documents, so we could
> > limit search '
> > range by date, by list, by authors, so smart user could get
> > reasonable search performance (relevance is very good,
> > because it based on proximity). So, different searches for
> > different purposes !
>
> We don't have those fields, but the XML feed was originally written for
> indexing data from online catalogues and has added fields like price.
> I'd be surprised if others couldn't be added as well.

This is what you need to look at to optimize search (limit the search region
by date period). The default search should be something like
"search last year's documents".
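
[Editor's sketch] A date-limited default search could look like the following. This is only an in-memory illustration: the `search` function, the message dictionaries with `date` and `body` keys, and the 365-day default window are all assumptions made for the example, not part of ASPSeek or OpenFTS.

```python
from datetime import date, timedelta


def search(messages, term, since=None, today=None):
    """Return messages containing term, limited to a date window.

    If no start date is given, default to the last 365 days.
    """
    today = today or date.today()
    since = since or today - timedelta(days=365)
    term = term.lower()
    return [
        m for m in messages
        if m["date"] >= since and term in m["body"].lower()
    ]


if __name__ == "__main__":
    messages = [
        {"date": date(2004, 1, 30), "body": "New search engine"},
        {"date": date(2001, 5, 1), "body": "Old search thread"},
    ]
    # Default window: only the recent message is returned.
    print(search(messages, "search", today=date(2004, 1, 31)))
    # Explicit wider window: both messages are returned.
    print(search(messages, "search", since=date(2000, 1, 1),
                 today=date(2004, 1, 31)))
```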

>
> Regards, Dave.
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> Sent: 30 January 2004 19:52
> To: Dave Page
> Cc: josh@agliodbs.com; pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Postgresql.org search engine.
>
>
> This is what you need to look for to optimize search (limit
> search region by date period). Default search should use
> something like search last year documents.

Oh, date is not a problem. I just haven't put it on the form yet. It's
the metadata like author, subject, listname etc. that will take more
work (though the latter is handled quite well using a subset
restriction).

Regards, Dave.

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Sent: 30 January 2004 18:50
> > To: Dave Page
> > Cc: pgsql-www@postgresql.org
> > Subject: RE: [pgsql-www] Postgresql.org search engine.
> >
> >
> > I don't have an experience with aspseek, but one desirable
> > feature - spelling support for user's query. Does aspseek has
> > support for this ?
>
> You mean like Googles speeling corektor?
> (http://labs.google.com/britney.html). No, ASPSeek doesn't have this,
> however I wonder how hard it might be to knock something up based on
> ispell and soundex... Hmmmmm....

In principle, a simple corrector could be implemented independently of
aspseek. If you have a dictionary of words, create trigrams of those
words. If a query returns too many results, create trigrams of the words in
the query and check which dictionary words are close, i.e. compute
similarity weights (Jaccard coefficients would be OK). We use a more complex
algorithm on www.pgsql.ru and in our contrib/trgm.

soundex and metaphone are good for English, while the trigram method is
universal.
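
[Editor's sketch] The trigram approach described above can be sketched in a few lines: split each word into overlapping three-character pieces, then score two words by the Jaccard coefficient of their trigram sets. The function names, the padding (two leading spaces, one trailing space) and the 0.3 threshold are assumptions chosen for illustration; this is not the algorithm used on www.pgsql.ru or in contrib/trgm.

```python
def trigrams(word):
    """Return the set of 3-character substrings of a padded, lowercased word."""
    # Padding is an assumption: it lets word boundaries contribute trigrams.
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard coefficient of the two words' trigram sets (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def suggest(query_word, dictionary, threshold=0.3):
    """Return dictionary words similar to query_word, best match first."""
    scored = [(jaccard(query_word, w), w) for w in dictionary]
    return [w for score, w in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    words = ["postgresql", "psqlodbc", "mvcc"]
    print(suggest("postgrsql", words))
```

With this padding, the misspelling "postgrsql" shares 8 of 13 distinct trigrams with "postgresql", so it clears the threshold, while "psqlodbc" and "mvcc" do not.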

>
> > Design suggestion: I'd like to see most important parts of
> > form at left side, for example, site selector, imho, better
> > to have left most, then grouping, format, number results per page.
>
> Yup, agreed.
>
> > Also, I don't like fixed width, in my browser I have to
> > scroll left-right to see a whole form :)
>
> The whole site is designed that way at the moment, but the new
> multilanguage version that's in development will be sizable. I hope to
> update the archives to a similar design at some point as well, just as
> soon as I've persuaded Marc that it's worth regenerating the messages
> again!

 Why not have fully dynamic pages for mailing lists?  A properly configured
server with caching could be very fast.

>
> Regards, Dave.
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> Sent: 30 January 2004 20:01
> To: Dave Page
> Cc: pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Postgresql.org search engine.
>
>
>  Why not have fully dynamic pages for mailing lists ?  Proper
> configured server with cacheing could be very fast.

Dunno, the lists and archives are traditionally Marc's domain :-) The
changes I'd like to see do involve stripping out the individual messages
to the absolute bare minimum of content.

Regards, Dave.

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

> the archives, however I have yet to do this mainly because it requires
> hacking mhonarc about to output the XML data.

I just did a search to see if someone else had already done this, and couldn't
find anything ... have you checked the list archives to see if anyone
is working on this as part of the main stream?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

> The whole site is designed that way at the moment, but the new
> multilanguage version that's in development will be sizable. I hope to
> update the archives to a similar design at some point as well, just as
> soon as I've persuaded Marc that it's worth regenerating the messages
> again!

This virus has taken a toll on time this week (7k viruses scanned in 48hrs)
... have regenerating planned for this weekend ... will fire you off a note
once I've started it :)

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Oleg Bartunov wrote:

>  Why not have fully dynamic pages for mailing lists ?  Proper configured
> server with cacheing could be very fast.

How do you mean?  Right now, it is dynamic to an extent, but Dave pointed
me to the <!--noindex--> stuff vs doing the PHP as I'm doing it now ...
but to implement it, I have to update the mhonarc resource file and
regenerate all the messages to match ...



----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Sent: 30 January 2004 19:52
> > To: Dave Page
> > Cc: josh@agliodbs.com; pgsql-www@postgresql.org
> > Subject: RE: [pgsql-www] Postgresql.org search engine.
> >
> >
> > This is what you need to look for to optimize search (limit
> > search region by date period). Default search should use
> > something like search last year documents.
>
> Oh, date is not a problem. I just haven't put it on the form yet. It's
> the metadata like author, subject, listname etc. that will take more
> work (though the latter is handled quite well using a subset
> restriction).

k, before I regenerate the lists, is this stuff you want me to add to the
META DATA part?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 30 January 2004 20:37
> To: Dave Page
> Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
> On Fri, 30 Jan 2004, Dave Page wrote:
>
> > the archives, however I have yet to do this mainly because
> it requires
> > hacking mhonarc about to output the XML data.
>
> I just did a search to see if someone else hadn't done this,
> and couldn't find anything ... have you checked with the list
> archives to see if anyone is working on this as part of the
> main stream?

I haven't looked at it at all yet. John wrote the XML feed code for
other purposes but suggested it for the archives as well - it did look
intriguing...

Regards, Dave

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Sent: 30 January 2004 20:01
> > To: Dave Page
> > Cc: pgsql-www@postgresql.org
> > Subject: RE: [pgsql-www] Postgresql.org search engine.
> >
> >
> >  Why not have fully dynamic pages for mailing lists ?  Proper
> > configured server with cacheing could be very fast.
>
> Dunno, the lists and archives are traditionally Marc's domain :-) The
> changes I'd like to see do involve stripping out the individual messages
> to the absolute bare minimum of content.

'k, this is how the search engines should see each message when they
index:

http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php

anything else you'd like me to strip out of there? :(

Note that this is not with the <!--noindex--> stuff you were talking
about, which, from what you've said, I'm not sure is useful, since not all
search engines will recognize it ... with the way it's coded now, as long
as I have the search engine listed, the output will look the same ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 30 January 2004 20:43
> To: Dave Page
> Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
>
> k, before I regenerate the lists, is this stuff you want me
> to add to the META DATA part?

There's not much point I don't think. It's the XML feed that might make
use of it, not the standard indexer.

What I really want to see is the absolute bare minimum in the msg files
(not even the titles that are there at the moment - speaking of which,
might be worth including them as a php var we can pick up from the
top_config.php)  - as per the example I emailed you. Then, we should be
able to do anything by editing the header and footer php include files.

Regards, Dave.

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 30 January 2004 20:37
> > To: Dave Page
> > Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> > On Fri, 30 Jan 2004, Dave Page wrote:
> >
> > > the archives, however I have yet to do this mainly because
> > it requires
> > > hacking mhonarc about to output the XML data.
> >
> > I just did a search to see if someone else hadn't done this,
> > and couldn't find anything ... have you checked with the list
> > archives to see if anyone is working on this as part of the
> > main stream?
>
> I haven't looked at it at all yet. John wrote the XML feed code for
> other purposes but suggested it for the archives as well - it did look
> intriguing...

Just a stupid question, before we go through and 'yet again regen' the
archives ... is there something different than mhonarc that would be
better?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 30 January 2004 20:49
> To: Dave Page
> Cc: Oleg Bartunov; pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
> 'k, this is how the search engines should see each message when they
> index:
>
> http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php
>
> anything else you'd like me to strip out of there? :(
>
> Note that this is not with the <!--noindex--> stuff you were
> talking about, which, from what you've said, I'm not sure is
> useful, since not all search engines will recognize it ...
> with the way its coded now, as long as I have the search
> engine listed, the output will look the same ...

As I pointed out before, the problem with that is that search engines
like Google or search.postgresql.org that cache the pages won't get any
of the thread navigation and other elements of the page.

I'd rather see the <!--noindex--> bits (or at the very least, include them
as well and don't look for aspseek or googlebot in showit()).

Regards, Dave.

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 30 January 2004 20:52
> To: Dave Page
> Cc: Marc G. Fournier; Oleg Bartunov; josh@agliodbs.com;
> pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Postgresql.org search engine.
>
>
> Just a stupid question, before we go through and 'yet again
> regen' the archives ... is there something different than
> mhonarc that would be better?

Dunno. Shall we leave it a couple of days or so and I'll take a look and
produce some test versions of what it might be nice to see? Tonight's a
bit awkward as Jo is feeling a bit odd and in her condition...

Regards, Dave.

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 30 January 2004 20:43
> > To: Dave Page
> > Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> >
> > k, before I regenerate the lists, is this stuff you want me
> > to add to the META DATA part?
>
> There's not much point I don't think. It's the XML feed that might make
> use of it, not the standard indexer.
>
> What I really want to see is the absolute bare minimum in the msg files
> (not even the titles that are there at the moment - speaking of which,
> might be worth including them as a php var we can pick up from the
> top_config.php)  - as per the example I emailed you. Then, we should be
> able to do anything by editing the header and footer php include files.

D'oh ... I was going to say that I didn't think that was possible, but, it
just might be ... seems I have a section declared twice (note that someone
else wrote this originally, I've only just begun to understand it to
modify it), so the second section is overriding the first, but I was only
ever seeing the first ...

Let me play with this over the weekend, I'll do a 'small sample set' that
you can look at the messages in, and we can go from there ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

> As I pointed out before, the problem with that is that search engines
> like Google or search.postgresql.org that cache the pages won't get any
> of the thread navigation and other elements of the page.
>
> I'd rather see the <!--noindex--> bits (or at the very least, include them
> as well and don't look for aspseek or googlebot in showit()).

'k, I have some ideas on how to do this so that we can change later if we
need ... will play with it this weekend ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 30 January 2004 21:02
> To: Dave Page
> Cc: Marc G. Fournier; Oleg Bartunov; josh@agliodbs.com;
> pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Postgresql.org search engine.
>
>
> D'oh ... I was going to say that I didn't think that was
> possible, but, it just might be ... seems I have a section
> declared twice (note that someone else wrote this originally,
> I've only just begun to understand it to modify it), so the
> second section is overriding the first, but I was only ever
> seeing the first ...

Huh? You've lost me there...

> Let me play with this over the weekend, I'll do a 'small
> sample set' that you can look at the messages in, and we can
> go from there ...

Ok. If you can do it in a directory away from the archives themselves
then I can play if need be without breaking anything by accident...

/D

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
Hmm,
what about

<meta name="robots" content="noindex,follow">

    Oleg
On Fri, 30 Jan 2004, Marc G. Fournier wrote:

> On Fri, 30 Jan 2004, Dave Page wrote:
>
> >
> >
> > > -----Original Message-----
> > > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > > Sent: 30 January 2004 20:01
> > > To: Dave Page
> > > Cc: pgsql-www@postgresql.org
> > > Subject: RE: [pgsql-www] Postgresql.org search engine.
> > >
> > >
> > >  Why not have fully dynamic pages for mailing lists?  A properly
> > > configured server with caching could be very fast.
> >
> > Dunno, the lists and archives are traditionally Marc's domain :-) The
> > changes I'd like to see do involve stripping out the individual messages
> > to the absolute bare minimum of content.
>
> 'k, this is how the search engines should see each message when they
> index:
>
> http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php
>
> anything else you'd like me to strip out of there? :(
>
> Note that this is not with the <!--noindex--> stuff you were talking
> about, which, from what you've said, I'm not sure is useful, since not all
> search engines will recognize it ... with the way it's coded now, as long
> as I have the search engine listed, the output will look the same ...

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Fri, 30 Jan 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 30 January 2004 20:43
> > To: Dave Page
> > Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> >
> > k, before I regenerate the lists, is this stuff you want me
> > to add to the META DATA part?
>
> There's not much point I don't think. It's the XML feed that might make
> use of it, not the standard indexer.
>
> What I really want to see is the absolute bare minimum in the msg files
> (not even the titles that are there at the moment - speaking of which,
> might be worth including them as a php var we can pick up from the
> top_config.php)  - as per the example I emailed you. Then, we should be
> able to do anything by editing the header and footer php include files.


I don't understand what the problem is with storing postings in raw format
in the filesystem and the metadata in postgres, with a display component that
combines both sources into a nice html page. Dave could fetch the raw postings
from the filesystem using the metadata and index them without any problem.
Marc could change the html wrapping every day and everybody is happy :)
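
A rough sketch of the metadata half of that idea, including the inter-posting references. It assumes the raw postings are ordinary RFC 2822 messages; the field names, the file paths, and the in-memory dicts standing in for a postgres table are all hypothetical, not anything that exists on the archives server.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def extract_metadata(raw_message, path):
    """Pull the threading and display fields out of one raw posting."""
    msg = message_from_string(raw_message)
    return {
        "path": path,  # hypothetical location of the raw posting on disk
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],  # None for a thread starter
        "subject": msg["Subject"],
        "posted": parsedate_to_datetime(msg["Date"]),
    }


def build_thread_index(rows):
    """Map each message id to the ids of its direct follow-ups."""
    children = {row["message_id"]: [] for row in rows}
    for row in rows:
        parent = row["in_reply_to"]
        if parent in children:
            children[parent].append(row["message_id"])
    return children


# Two made-up postings: a thread starter and one follow-up.
RAW_A = (
    "Message-ID: <a@example.org>\n"
    "Subject: search engine\n"
    "Date: Fri, 30 Jan 2004 20:01:00 +0000\n"
    "\n"
    "first posting\n"
)
RAW_B = (
    "Message-ID: <b@example.org>\n"
    "In-Reply-To: <a@example.org>\n"
    "Subject: Re: search engine\n"
    "Date: Fri, 30 Jan 2004 20:37:00 +0000\n"
    "\n"
    "a follow-up\n"
)

rows = [extract_metadata(RAW_A, "raw/a.eml"), extract_metadata(RAW_B, "raw/b.eml")]
thread_index = build_thread_index(rows)
print(thread_index)
```

The display component would then look up a row, read the file at `path`, and wrap it in whatever html is current; the indexer could read the same raw files directly.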




    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Sat, 31 Jan 2004, Oleg Bartunov wrote:

> Hmm,
> what about
>
> <meta name="robots" content="noindex,follow">

k, that would be for the index/thread pages, of course ... right?

here's a question, since you have more experience with this than I ... the
current meta tags set for the message pages themselves are:

<META NAME="robots" CONTENT="all">
<META NAME="MSSmartTagsPreventParsing" content="TRUE">
<META HTTP-EQUIV="Content-Type" content="text/html; charset=iso-8859-1">
<META NAME="keywords" content="postgresql, hackers, general, sql, admin, novice, interfaces, odbc, jdbc">
<META NAME="rating" Content="General" >
<META NAME="distribution" Content="Global" >
<META NAME="revisit-after"  Content="7 days" >
<META NAME="robots" CONTENT="follow, index, noarchive">

anything wrong with the above?  seems okay to me, just making sure that
maybe there isn't something else that I should add?
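
One thing that stands out in the list above is that it carries two robots tags ("all" and "follow, index, noarchive"). How a given crawler reconciles two such tags is crawler-specific, so the sketch below only collects and reports them. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaCollector(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.robots = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            directives = {part.strip().lower() for part in content.split(",")}
            self.robots.append(directives - {""})


def robots_directives(html):
    """Return one set of directives per robots meta tag, in document order."""
    collector = RobotsMetaCollector()
    collector.feed(html)
    return collector.robots


PAGE_HEAD = """
<META NAME="robots" CONTENT="all">
<META NAME="keywords" content="postgresql, hackers">
<META NAME="robots" CONTENT="follow, index, noarchive">
"""

found = robots_directives(PAGE_HEAD)
if len(found) > 1:
    print("more than one robots tag:", found)
```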

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Sat, 31 Jan 2004, Oleg Bartunov wrote:

> I don't understand what the problem is with storing postings in raw
> format in the filesystem and the metadata in postgres, with a display
> component that combines both sources into a nice html page. Dave could
> fetch the raw postings from the filesystem using the metadata and index
> them without any problem. Marc could change the html wrapping every day
> and everybody is happy :)

Do you have software to do this, including all the inter-posting
references and followups?  Or do you propose we write this all from
scratch?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
Marc and Dave,

at the same time, could you look at generating the right http headers
(LAST-MODIFIED), so that search engines can cache documents and don't
waste server resources? What I still don't understand is whether
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php
is a static page or a dynamic one :-? If dynamic, I don't see any
problem generating the headers; if static, you could always use the
'touch' hack to set the correct last modification date on the file.

    Oleg

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
Josh Berkus
Date:
Guys,

> Do you have software to do this, including all the inter-posting
> references and followups?  Or do you propose we write this all from
> scratch?

Robert Bernier apparently wrote something to break up mail for inclusion in a
database, and should be able to help in a couple months.  Josh Drake is also
willing to help, and has already done a prototype without header searching.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Sat, 31 Jan 2004, Oleg Bartunov wrote:

> Marc and Dave,
>
> at the same time, could you look at generating the right http headers
> (LAST-MODIFIED), so that search engines can cache documents and don't
> waste server resources? What I still don't understand is whether
> http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php
> is a static page or a dynamic one :-? If dynamic, I don't see any
> problem generating the headers; if static, you could always use the
> 'touch' hack to set the correct last modification date on the file.

Huh?  The t.php one above was just to show Dave what the search engines
are seeing (ie. minus the search/banner/links, just the message) ... it's
not part of the system, just a copy of an existing message ...

re last-modified time ... what is wrong with it?  According to my browser,
it is being displayed correctly, or are you still hung up on the fact that
it doesn't equal the posting date of the message itself?  If that is all
it is, I'm planning on trying something this weekend to get that in place,
but the last time I tried it didn't work ... again, if you have better
software you can recommend than what we are using now to generate the
archives (mhonarc), please speak up before I go through the trouble of
regenerating everything all over again ...


----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Fri, 30 Jan 2004, Josh Berkus wrote:

> Guys,
>
> > Do you have software to do this, including all the inter-posting
> > references and followups?  Or do you propose we write this all from
> > scratch?
>
> Robert Bernier apparently wrote something to break up mail for inclusion in a
> database, and should be able to help in a couple months.  Josh Drake is also
> willing to help, and has already done a prototype without header searching.

Dumping mail into a database isn't that hard to do ... there are several
projects on the 'Net right now doing that, including one that connects a
POP3 daemon into the database to download the mail ... in fact, from what
I recall of fts.postgresql.org, isn't that what Oleg/Teodor's stuff does?

I'm kinda curious here ... exactly what problem are we trying to solve
here?

Me, I'm just trying to clean up the archives so that when someone gets
their search results, they don't all show the same 'text', which I've
already accomplished ... Dave is working on improving the speed of the
searches, which he has accomplished with ASPseek ...

If I can figure out how to get the Date: of the posting into the
Last-Modified field (I know *how* it should work, but last time I tried it
ended up generating a whack of errors), then that should satisfy Oleg's
beef ...

Oleg, one question ... what do you recommend setting max-age to for
Cache-control?  Right now, I have it set to 30 days ... too long?  not
long enough?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 31 January 2004 06:15
> To: Josh Berkus
> Cc: Oleg Bartunov; Dave Page; pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
>
>
> I'm kinda curious here ... exactly what problem are we trying
> to solve here?
>

My thoughts as well, as we are starting to see suggestions for solving
non-existent problems :-)

1) We need each message file minimised - e.g. some php variables
defined, a php include for the header, the message and thread links and
a php include for the footer.

2) The php header file should use the variables defined in the messages
to generate the <TITLE></TITLE> tag and last modified dates, as well as
any other useful meta data.

3) The php header and footer files should include
<!--noindex--><!--/noindex--> tags to allow aspseek to cache the entire
page but only index the relevant content (see anywhere on the main
portal site to see how this is done - specifically, menus are not
indexed though they are still followed).

4) (Optionally, I don't think it's necessary - certainly not for
ASPSeek) The php header/footer may only display if certain user agent
strings are not detected.
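
To illustrate point 3: the whole page stays available for caching, while anything between the noindex markers is skipped when extracting text to index. This is only a toy model of that behaviour, not ASPSeek's actual parser; the function name and the sample page are invented.

```python
import re

# Matches one <!--noindex--> ... <!--/noindex--> region, non-greedily,
# across line breaks.
NOINDEX_BLOCK = re.compile(
    r"<!--noindex-->.*?<!--/noindex-->", re.DOTALL | re.IGNORECASE
)


def indexable_text(page):
    """Return the page with every <!--noindex--> region removed."""
    return NOINDEX_BLOCK.sub("", page)


PAGE = (
    "<!--noindex--><form>search</form><!--/noindex-->\n"
    "<pre>message body</pre>\n"
    "<!--noindex--><ul><li>Next by thread</li></ul><!--/noindex-->\n"
)

to_index = indexable_text(PAGE)
print(to_index)
```

The cached copy (`PAGE`) still has the search form and thread links; only `to_index` loses them, which is the difference from stripping the navigation based on the user agent.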

Regards, Dave.

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Sat, 31 Jan 2004, Marc G. Fournier wrote:

> On Sat, 31 Jan 2004, Oleg Bartunov wrote:
>
> > Marc and Dave,
> >
> > at the same time, could you look at generating the right http headers
> > (LAST-MODIFIED), so that search engines can cache documents and don't
> > waste server resources? What I still don't understand is whether
> > http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php
> > is a static page or a dynamic one :-? If dynamic, I don't see any
> > problem generating the headers; if static, you could always use the
> > 'touch' hack to set the correct last modification date on the file.
>
> Huh?  The t.php one above was just to show Dave what the search engines
> are seeing (ie. minus the search/banner/links, just the message) ... it's
> not part of the system, just a copy of an existing message ...
>
> re last-modified time ... what is wrong with it?  According to my browser,
> it is being displayed correctly, or are you still hung up on the fact that
> it doesn't equal the posting date of the message itself?  If that is all
> it is, I'm planning on trying something this weekend to get that in place,
> but the last time I tried it didn't work ... again, if you have better

yes, correct http headers are what I'd like to see; many crawlers/spiders
take them into account, which saves bandwidth and keeps the server from
being overloaded. I count two attempts: one failed because of an incorrect
date format, and the second because all pages have the same modification
date, namely the moment of page creation, not the date of posting. Apache
can generate the Last-Modified header for static pages from the file
modification time, so it's possible to use the 'touch' command to make the
file modification date equal to the date of posting. Dynamic pages are
another story: there the http header should be generated by the software
responsible for displaying the page.
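
A minimal sketch of the 'touch' idea for static pages: read the Date: header of the raw posting and stamp the generated file with it. The helper name and the temporary file are illustrative only, and it assumes the raw message is still available next to the generated page.

```python
import os
import tempfile
from email import message_from_string
from email.utils import parsedate_to_datetime


def touch_to_posting_date(page_path, raw_message):
    """Set the page's atime and mtime to the posting's Date: header."""
    posted = parsedate_to_datetime(message_from_string(raw_message)["Date"])
    stamp = posted.timestamp()
    os.utime(page_path, (stamp, stamp))
    return stamp


RAW = "Date: Fri, 09 Jan 2004 19:00:28 +0000\nSubject: example\n\nbody\n"

# Stand-in for a generated archive page.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as handle:
    handle.write(b"<pre>body</pre>")
    page = handle.name

expected = touch_to_posting_date(page, RAW)
actual = os.path.getmtime(page)
os.remove(page)
print(int(actual) == int(expected))
```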

> software you can recommend than what we are using now to generate the
> archives (mhonarc), please speak up before I go through the trouble of
> regenerating everything all over again ...

Don't know :( We have our own mailware, which you've seen on
fts.postgresql.org and which will soon appear on www.pgsql.ru, but it's
not an end-user application.

I've seen mailman (http://www.list.org/), which connects somehow with
mhonarc. A long list of MLMs is available from
http://www.sympa.org/robots.html

Also, Sympa has support for Mhonarc archives - http://www.sympa.org/


    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Sat, 31 Jan 2004, Marc G. Fournier wrote:

> On Fri, 30 Jan 2004, Josh Berkus wrote:
>
> > Guys,
> >
> > > Do you have software to do this, including all the inter-posting
> > > references and followups?  Or do you propose we write this all from
> > > scratch?
> >
> > Robert Bernier apparently wrote something to break up mail for inclusion in a
> > database, and should be able to help in a couple months.  Josh Drake is also
> > willing to help, and has already done a prototype without header searching.
>
> Dumping mail into a database isn't that hard to do ... there are several
> projects on the 'Net right now doing that, including one that connects a
> POP3 daemon into the database to download the mail ... in fact, from what
> I recall of fts.postgresql.org, isn't that what Oleg/Teodor's stuff does?
>
> I'm kinda curious here ... exactly what problem are we trying to solve
> here?
>
> Me, I'm just trying to clean up the archives so that when someone gets
> their search results, they don't all show the same 'text', which I've
> already accomplished ... Dave is working on improving the speed of the
> searches, which he has accomplished with ASPseek ...
>
> If I can figure out how to get the Date: of the posting into the
> Last-Modified field (I know *how* it should work, but last time I tried it
> ended up generating a whack of errors), then that should satisfy Oleg's
> beef ...
>
> Oleg, one question ... what do you recommend setting max-age to for
> Cache-control?  Right now, I have it set to 30 days ... too long?  not
> long enough?

in my experience Cache-control is not effective, because it's an
HTTP/1.1 feature and a lot of users come through proxies which still
don't support HTTP/1.1.
The Last-Modified header is the most universal way.
Check http://www.mnot.net/cache_docs/#CACHE-CONTROL
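
For what it's worth, this is roughly how Last-Modified saves work on a revisit: the client sends the date back in If-Modified-Since and the server can answer 304 without a body. The sketch is a simplified model with a hypothetical `respond` function, not the behaviour of any particular server or crawler.

```python
from email.utils import formatdate, parsedate_to_datetime


def respond(last_modified_epoch, if_modified_since=None):
    """Return (status, headers) for a GET, honouring If-Modified-Since."""
    headers = {"Last-Modified": formatdate(last_modified_epoch, usegmt=True)}
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            since = None  # unparseable date: fall through to a full response
        if since is not None and int(last_modified_epoch) <= since:
            return 304, headers  # unchanged since the client's copy
    return 200, headers


POSTED = 1073674828  # Fri, 09 Jan 2004 19:00:28 GMT

status_first, headers = respond(POSTED)
status_again, _ = respond(POSTED, headers["Last-Modified"])
print(status_first, status_again, headers["Last-Modified"])
# prints: 200 304 Fri, 09 Jan 2004 19:00:28 GMT
```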


    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Sat, 31 Jan 2004, Oleg Bartunov wrote:

> > If I can figure out how to get the Date: of the posting into the
> > Last-Modified field (I know *how* it should work, but last time I tried it
> > ended up generating a whack of errors), then that should satisfy Oleg's
> > beef ...

'k, figured out my error with the mhonarc resource file, and now have
posting date in as last-modified ... I'm doing this off to the side right
now, while I work out the noindex stuff for Dave, but check out:

http://archives.postgresql.org/dev

And let me know if the headers look right to you ... I took out the
Cache-control stuff ...

Let me know if there is anything else you'd like to see in there ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

META Tags on Archives

From
"Marc G. Fournier"
Date:
Oleg ... as the "resident pro" here ... does this make sense:

Messages have:

     <META NAME="robots" CONTENT="nofollow, index, archive">

And indexes have:

     <META NAME="robots" CONTENT="follow, noindex, noarchive">



----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
Oleg Bartunov
Date:
On Sun, 1 Feb 2004, Marc G. Fournier wrote:

> On Sat, 31 Jan 2004, Oleg Bartunov wrote:
>
> > > If I can figure out how to get the Date: of the posting into the
> > > Last-Modified field (I know *how* it should work, but last time I tried it
> > > ended up generating a whack of errors), then that should satisfy Oleg's
> > > beef ...
>
> 'k, figured out my error with the mhonarc resource file, and now have
> posting date in as last-modified ... I'm doing this off to the side right
> now, while I work out the noindex stuff for Dave, but check out:
>
> http://archives.postgresql.org/dev
>
> And let me know if the headers look right to you ... I took out the
> Cache-control stuff ...
>
> Let me know if there is anything else you'd like to see in there ...
>

The http headers look fine!


    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: META Tags on Archives

From
Oleg Bartunov
Date:
On Sun, 1 Feb 2004, Marc G. Fournier wrote:

>
> Oleg ... as the "resident pro" here ... does this make sense:
>
> Messages have:
>
>      <META NAME="robots" CONTENT="nofollow, index, archive">
>
> And indexes have:
>
>      <META NAME="robots" CONTENT="follow, noindex, noarchive">
>

I don't know 'archive, noarchive', but the others look ok. I'm rather
sceptical about this tag, because I don't know of any robots which
recognize it :)


    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 01 February 2004 22:12
> To: Oleg Bartunov
> Cc: Marc G. Fournier; Josh Berkus; Dave Page; pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
> And let me know if the headers look right to you ... I took
> out the Cache-control stuff ...
>
> Let me know if there is anything else you'd like to see in there ...

It looks even more complex to me now - there are what, 6 include files?

How about something more simple:

========================================================================
<?
  $last_modified = "Fri,  9 Jan 2004 19:00:28 +0000 (GMT)";
  $subject = " Re: IMPORTANT: A temporary list for Strategic Marketing";
  require("$DOCUMENT_ROOT/includes/header.php");
?>

<pre>Joshua D. Drake wrote:
>
> >There shouldn't be any tangents or general discussion on -advocacy
> >either -- that's what -general is for.  A one-time incident should not
> >lead to such drastic measures.  If the marketing plan is no longer
> >discussed on -advocacy, what is?
> >
> >
> I disagree whole heartedly. If you look at general, it is basically
> PostgreSQL-Support.

<!--noindex-->
<HR>
<UL>
<li>Prev by Date:
<strong><a href="msg00116.php">Re: IMPORTANT: A temporary list for
Strategic Marketing</a></strong>
</li>
<li>Next by Date:
<strong><a href="msg00118.php">Re: IMPORTANT: A temporary list for
Strategic Marketing</a></strong>
</li>

<li>Previous by thread:
<strong><a href="msg00126.php">Re: IMPORTANT: A temporary list for
Strategic</a></strong>
</li>
<li>Next by thread:
<strong><a href="msg00125.php">Re: IMPORTANT: A temporary list for
Strategic</a></strong>
</li>

  <LI>Index(es):
      <UL>
        <LI><A HREF="mail2.php#00117"><STRONG>Main</STRONG></A></LI>
        <LI><A HREF="thrd2.php#00117"><STRONG>Thread</STRONG></A></LI>
      </UL>
  </LI>
</UL>
<!--/noindex-->

<?
  require("$DOCUMENT_ROOT/includes/footer.php");
?>
==============================================================================

Header.php then may look something like:

==============================================================================
<?
  if(isset($last_modified)) {
    header("Last-Modified: $last_modified");
  } else {
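    // HTTP dates must be expressed in GMT, so use gmdate() rather than date("r")
    header("Last-Modified: " . gmdate("D, d M Y H:i:s", filemtime($SCRIPT_FILENAME)) . " GMT");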
  }

  // Other stuff here
?>

<HTML>
<HEAD>
  <TITLE><?php echo $subject ?></TITLE>
  <META NAME="robots" CONTENT="nofollow, index, archive">
</HEAD>

<BODY>

<!--noindex-->
  <!-- HTML code for search form etc. -->
<!--/noindex-->
==============================================================================
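
One thing to watch with the Last-Modified value: HTTP expects the fixed form "Fri, 09 Jan 2004 19:00:28 GMT", while a mail-style date such as the $last_modified value in the message file sketch carries a numeric offset and a "(GMT)" comment. The following is only a rough sketch of normalising one form into the other. It is written in Python rather than PHP purely for illustration, and the function name http_last_modified is made up, not part of the archive code.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def http_last_modified(mail_date: str) -> str:
    """Convert an RFC 2822 mail date into an HTTP-date string (always GMT)."""
    parsed = parsedate_to_datetime(mail_date)
    if parsed.tzinfo is None:
        # A "-0000" offset parses as a naive datetime; assume it means UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Shift any other offset to UTC before formatting with the "GMT" suffix.
    return format_datetime(parsed.astimezone(timezone.utc), usegmt=True)


print(http_last_modified("Fri,  9 Jan 2004 19:00:28 +0000 (GMT)"))
# prints: Fri, 09 Jan 2004 19:00:28 GMT
```

The same conversion could be done once at archive generation time, so that header.php only has to echo an already valid value.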


And footer.php (minus any footers we might add).

==============================================================================
</BODY>
</HTML>
==============================================================================

In addition, there is an awful lot of HTML comments that mhonarc has
added:

<!--X-Head-Body-Sep-End-->
<!--X-Body-of-Message-->

As examples. These seem somewhat extraneous and could be removed for ease
of reading and disk space/bandwidth usage reduction.

Oh, and on the current version the noindex tags seem to be in the wrong
places. On the index/thread pages for example, they should enclose all
the hyperlinks. The noindex tags do not stop links being followed, just
the text within them from being included in the index.
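
To make the distinction concrete, here is a small sketch in Python (illustrative only; it assumes the indexer treats the <!--noindex--> comment pair exactly as described above, and the helper names indexable_text and links_to_follow are invented for this example). Text inside the pair is dropped from what gets indexed, but links inside it are still collected for crawling.

```python
import re

NOINDEX_BLOCK = re.compile(r"<!--noindex-->.*?<!--/noindex-->",
                           re.DOTALL | re.IGNORECASE)
HREF = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def indexable_text(html: str) -> str:
    """Text an indexer would keep: noindex blocks dropped, tags stripped."""
    without_blocks = NOINDEX_BLOCK.sub(" ", html)
    return " ".join(TAG.sub(" ", without_blocks).split())


def links_to_follow(html: str) -> list[str]:
    """Links a crawler would still follow: noindex blocks are NOT excluded."""
    return HREF.findall(html)


page = """<pre>Message body text</pre>
<!--noindex-->
<ul><li>Prev by Date: <a href="msg00116.php">Re: IMPORTANT</a></li></ul>
<!--/noindex-->"""

print(indexable_text(page))   # prints: Message body text
print(links_to_follow(page))  # prints: ['msg00116.php']
```

Stopping the links from being followed is a separate job, which is what the nofollow value in the robots META tag is for.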

Regards, Dave.

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Mon, 2 Feb 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 01 February 2004 22:12
> > To: Oleg Bartunov
> > Cc: Marc G. Fournier; Josh Berkus; Dave Page; pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> > And let me know if the headers look right to you ... I took
> > out the Cache-control stuff ...
> >
> > Let me know if there is anything else you'd like to see in there ...
>
> It looks even more complex to me now - there are what, 6 include files?
>
> How about something more simple:

if you can figure out how to do it in the .resource file, please let me
know ... I've strip'd out everything that I believe can be done without
making the .resource file itself majorly confusing ...

> In addition, there is an awful lot of HTML comments that mhonarc has
> added:
>
> <!--X-Head-Body-Sep-End-->
> <!--X-Body-of-Message-->

Nothing I can do about these, there are no configuration directives that
I've found to strip those ...

> Oh, and on the current version the noindex tags seem to be in the wrong
> places. On the index/thread pages for example, they should enclose all
> the hyperlinks. The noindex tags do not stop links being followed, just
> the text within them from being included in the index.

The index/thread pages all have noindex,follow set in the META TAG ...
isn't that what that META TAG is supposed to be for?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:

I just cleaned up the <HEAD></HEAD> section of the message layout, so that
shrinks the msg*.php files by a few more lines ...

On Mon, 2 Feb 2004, Marc G. Fournier wrote:

> On Mon, 2 Feb 2004, Dave Page wrote:
>
> >
> >
> > > -----Original Message-----
> > > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > > Sent: 01 February 2004 22:12
> > > To: Oleg Bartunov
> > > Cc: Marc G. Fournier; Josh Berkus; Dave Page; pgsql-www@postgresql.org
> > > Subject: Re: [pgsql-www] Postgresql.org search engine.
> > >
> > > And let me know if the headers look right to you ... I took
> > > out the Cache-control stuff ...
> > >
> > > Let me know if there is anything else you'd like to see in there ...
> >
> > It looks even more complex to me now - there are what, 6 include files?
> >
> > How about something more simple:
>
> if you can figure out how to do it in the .resource file, please let me
> know ... I've strip'd out everything that I believe can be done without
> making the .resource file itself majorly confusing ...
>
> > In addition, there is an awful lot of HTML comments that mhonarc has
> > added:
> >
> > <!--X-Head-Body-Sep-End-->
> > <!--X-Body-of-Message-->
>
> Nothing I can do about these, there are no configuration directives that
> I've found to strip those ...
>
> > Oh, and on the current version the noindex tags seem to be in the wrong
> > places. On the index/thread pages for example, they should enclose all
> > the hyperlinks. The noindex tags do not stop links being followed, just
> > the text within them from being included in the index.
>
> The index/thread pages all have noindex,follow set in the META TAG ...
> isn't that what that META TAG is supposed to be for?
>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
>

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: META Tags on Archives

From
Jeroen Ruigrok/asmodai
Date:
-On [20040201 23:43], Marc G. Fournier (scrappy@postgresql.org) wrote:
>     <META NAME="robots" CONTENT="nofollow, index, archive">

According to http://www.robotstxt.org/wc/meta-user.html
archive|noarchive does not exist.

Where'd you find it?

--
Jeroen Ruigrok van der Werven <asmodai(at)wxs.nl> / asmodai / kita no mono
PGP fingerprint: 2D92 980E 45FE 2C28 9DB7  9D88 97E6 839B 2EAC 625B
http://www.tendra.org/   | http://diary.in-nomine.org/
The human race is challenged more than ever before to demonstrate our
mastery -- not over nature but of ourselves...

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 02 February 2004 14:11
> To: Dave Page
> Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus;
> pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Postgresql.org search engine.
>
> if you can figure out how to do it in the .resource file,
> please let me know ... I've strip'd out everything that I
> believe can be done without making the .resource file itself
> majorly confusing ...

If I make a copy of the directory to play with, how do I re-run mhonarc?
(probably won't be today though, I have a screaming headache and a broken
pbx).

>
> The index/thread pages all have noindex,follow set in the META TAG ...
> isn't that what that META TAG is supposed to be for?

Yes, but then why include the <!--noindex--> tags as well if they are in
the wrong place?

Regards, Dave.

Re: META Tags on Archives

From
"Marc G. Fournier"
Date:
On Mon, 2 Feb 2004, Jeroen Ruigrok/asmodai wrote:

> -On [20040201 23:43], Marc G. Fournier (scrappy@postgresql.org) wrote:
> >     <META NAME="robots" CONTENT="nofollow, index, archive">
>
> According to http://www.robotstxt.org/wc/meta-user.html
> archive|noarchive does not exist.
>
> Where'd you find it?

Actually, that one was in the original .resource file, but a quick search
on google shows:

http://www.bauser.com/websnob/meta/robots.html

and

http://www.google.com/webmasters/faq.html#cached

the funny thing is that this one:

http://www.katpatuka.org/pub/doc/robotexclusion.html

refers to the NOARCHIVE, but points to:

http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt

which doesn't include it ...

I love standards that everyone follows *roll eyes*

I'm gathering it's something that some use (Google does, apparently), and
some don't ...
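
As a rough illustration of how a robot that only knows the original four values might see these tags, here is a short Python sketch. The CORE_DIRECTIVES set and the function name split_robots_content are assumptions made for this example, not taken from any particular crawler. Anything outside the core set, such as archive/noarchive, ends up in the "other" list, which a given robot may honour or simply ignore.

```python
CORE_DIRECTIVES = {"index", "noindex", "follow", "nofollow"}


def split_robots_content(content: str) -> tuple[list[str], list[str]]:
    """Split a robots META CONTENT value into (core, other) directives."""
    tokens = [t.strip().lower() for t in content.split(",") if t.strip()]
    core = [t for t in tokens if t in CORE_DIRECTIVES]
    other = [t for t in tokens if t not in CORE_DIRECTIVES]
    return core, other


print(split_robots_content("nofollow, index, archive"))
# prints: (['nofollow', 'index'], ['archive'])
print(split_robots_content("follow, noindex, noarchive"))
# prints: (['follow', 'noindex'], ['noarchive'])
```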

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Mon, 2 Feb 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 02 February 2004 14:11
> > To: Dave Page
> > Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus;
> > pgsql-www@postgresql.org
> > Subject: RE: [pgsql-www] Postgresql.org search engine.
> >
> > if you can figure out how to do it in the .resource file,
> > please let me know ... I've strip'd out everything that I
> > believe can be done without making the .resource file itself
> > majorly confusing ...
>
> If I make a copy of the directory to play with, how do I re-run mhonarc?
> (probably won't be today though, I have a screaming headache and a broken
> pbx).

there is a mk-mhonarc script in the directory that you can run ...

> > The index/thread pages all have noindex,follow set in the META TAG ...
> > isn't that what that META TAG is supposed to be for?
>
> Yes, but then why include the <!--noindex--> tags as well if they are in
> the wrong place?

removed from the index page(s) ... in fact, I hadn't even put them into
the thread index pages ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Mon, 2 Feb 2004, Dave Page wrote:

> In addition, there is an awful lot of HTML comments that mhonarc has
> added:
>
> <!--X-Head-Body-Sep-End-->
> <!--X-Body-of-Message-->
>
> As examples. These seem somewhat extraneous and could be removed for ease
> of reading and disk space/bandwidth usage reduction.

I've put a note out to the mhonarc list to see if there is something I'm
missing in the docs that allows one to turn those off ... it seems to add
about 1k worth of data to each file, which, when dealing with 100's of
thousands of messages, is a fair amount of disk space ...
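
For a feel of the numbers, here is a back-of-the-envelope sketch in Python. The figure of 300,000 messages is only a placeholder for "100's of thousands", and the regex assumes every marker has the <!--X-...--> shape shown in the examples. mhonarc may rely on these markers when it updates an existing archive, so stripping them in a post-processing step is shown here purely to illustrate the saving, not as a recommendation.

```python
import re

# Matches marker comments such as <!--X-Body-of-Message--> plus a trailing newline.
MHONARC_MARKER = re.compile(r"<!--X-.*?-->\n?", re.DOTALL)


def strip_markers(html: str) -> str:
    """Remove <!--X-...--> marker comments from a generated message page."""
    return MHONARC_MARKER.sub("", html)


def total_overhead_mib(bytes_per_file: int, file_count: int) -> float:
    """Total marker overhead across the archive, in MiB."""
    return bytes_per_file * file_count / (1024 * 1024)


sample = "<!--X-Head-Body-Sep-End-->\n<!--X-Body-of-Message-->\n<pre>Hello</pre>\n"
print(strip_markers(sample), end="")                 # prints: <pre>Hello</pre>
print(round(total_overhead_mib(1024, 300_000), 1))   # prints: 293.0
```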

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 02 February 2004 14:11
> To: Dave Page
> Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus;
> pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Postgresql.org search engine.
>
> > How about something more simple:
>
> if you can figure out how to do it in the .resource file,
> please let me know ... I've strip'd out everything that I
> believe can be done without making the .resource file itself
> majorly confusing ...

OK, well frankly mhonarc looks like a nightmare to setup. I've had a
play with hypermail instead. I realise that I've yet to drop in your
search engine detection code and there is still work to be done, but how
does this look:

http://archives.postgresql.org/dave/pgsql-advocacy/

It's all pretty self contained at the moment - feel free to have a play
with it.

(mk-hypermail to rebuild, you may need to clear old files first if you
make drastic changes).

Regards, Dave

Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Wed, 4 Feb 2004, Dave Page wrote:

>
>
> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 02 February 2004 14:11
> > To: Dave Page
> > Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus;
> > pgsql-www@postgresql.org
> > Subject: RE: [pgsql-www] Postgresql.org search engine.
> >
> > > How about something more simple:
> >
> > if you can figure out how to do it in the .resource file,
> > please let me know ... I've strip'd out everything that I
> > believe can be done without making the .resource file itself
> > majorly confusing ...
>
> OK, well frankly mhonarc looks like a nightmare to setup.

Actually, it's quite easy once you read through the docs ... there are
formats in the .resource file for the Date Index, Thread Index and Message
Page ... and each of those is broken down into sub-sections ... best place
to start is:

http://www.mhonarc.org/MHonArc/doc/layout.html

and then look at each subsection as you need to modify it ...

> I've had a
> play with hypermail instead. I realise that I've yet to drop in your
> search engine detection code and there is still work to be done, but how
> does this look:
>
> http://archives.postgresql.org/dave/pgsql-advocacy/
>
> It's all pretty self contained at the moment - feel free to have a play
> with it.

k, first thing that is missing is the last-modified date isn't set right,
which makes it a no go option ... looking at the hypermail.conf file, it
more reminds me of setting up a web stats program than a list archiver ...
options are either ... you can add footers and headers, but scanning
through the docs that it points to, there doesn't seem to be any way of
adding a Last-Modified header, since they don't even seem to define
VARIABLES (ie. $DATE$ for date of posting) that you can use when
generating the archives ...

let me know if I've missed something .. ?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Postgresql.org search engine.

From
"Dave Page"
Date:

> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 04 February 2004 14:12
> To: Dave Page
> Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus;
> pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Postgresql.org search engine.
>
>
> k, first thing that is missing is the last-modified date
> isn't set right, which makes it a no go option ... looking at
> the hypermail.conf file, it more reminds me of setting up a
> web stats program than a list archiver ...

I thought the only archiver you knew was mhonarc? Not the biggest frame
of reference :-)

> options are either ... you can add footers and headers, but
> scanning through the docs that it points to, there doesn't
> seem to be any way of adding a Last-Modified header, since
> they don't even seem to define VARIABLES (ie. $DATE$ for date
> of posting) that you can use when generating the archives ...
>
> let me know if I've missed something .. ?

I was looking at this after I posted my last message. Hypermail supports
HTML templates which may be used instead of headers and footers. These
do have variables, however I couldn't see one for posting date :-(
Probably not the hardest mod in the world to add it to the program, but
unfortunately I just started a new module at Uni so am somewhat short of
spare time again...

Regards, Dave.

Re: Postgresql.org search engine.

From
John Hansen
Date:
Well, subject is easy enough to implement, as the subject is also the
page title, and we can as it is, limit searches to the title meta tag.


>> -----Original Message-----
>> From: Oleg Bartunov [mailto:oleg ( at ) sai ( dot ) msu ( dot ) su]
>> Sent: 30 January 2004 19:52
>> To: Dave Page
>> Cc: josh ( at ) agliodbs ( dot ) com; pgsql-www ( at ) postgresql ( dot ) org
>> Subject: RE: [pgsql-www] Postgresql.org search engine.
>>
>>
>> This is what you need to look for to optimize search (limit
>> search region by date period). Default search should use
>> something like search last year documents.
>
>Oh, date is not a problem. I just haven't put it on the form yet. It's
>the metadata like author, subject, listname etc. that will take more
>work (though the latter is handled quite well using a subset
>restriction).
>
>Regards, Dave.


And in reply to the following:


>> -----Original Message-----
>> From: Marc G. Fournier [mailto:scrappy ( at ) postgresql ( dot ) org]
>> Sent: 30 January 2004 20:43
>> To: Dave Page
>> Cc: Oleg Bartunov; josh ( at ) agliodbs ( dot ) com; pgsql-www ( at ) postgresql ( dot ) org
>> Subject: Re: [pgsql-www] Postgresql.org search engine.
>>
>>
>> k, before I regenerate the lists, is this stuff you want me
>> to add to the META DATA part?
>
>There's not much point I don't think. It's the XML feed that might make
>use of it, not the standard indexer.
>
>What I really want to see is the absolute bare minimum in the msg files
>(not even the titles that are there at the moment - speaking of which,
>might be worth including them as a php var we can pickup from the
>top_config.php)  - as per the example I emailed you. Then, we should be
>able to do anything by editing the header and footer php include files.
>
>Regards, Dave.

The standard indexer can too be reconfigured to take advantage of other
(non-standard) meta tags, just as easily as the XML feed code can.

Regards,

John


Re: Postgresql.org search engine.

From
"Marc G. Fournier"
Date:
On Mon, 21 Jun 2004, John Hansen wrote:

> And in reply to the following:
>
>
>>> -----Original Message-----
>>> From: Marc G. Fournier [mailto:scrappy ( at ) postgresql ( dot ) org]
>>> Sent: 30 January 2004 20:43
>>> To: Dave Page
>>> Cc: Oleg Bartunov; josh ( at ) agliodbs ( dot ) com; pgsql-www ( at ) postgresql ( dot ) org
>>> Subject: Re: [pgsql-www] Postgresql.org search engine.
>>>
>>>
>>> k, before I regenerate the lists, is this stuff you want me
>>> to add to the META DATA part?
>>
>> There's not much point I don't think. It's the XML feed that might make
>> use of it, not the standard indexer.
>>
>> What I really want to see is the absolute bare minimum in the msg files
>> (not even the titles that are there at the moment - speaking of which,
>> might be worth including them as a php var we can pickup from the
>> top_config.php)  - as per the example I emailed you. Then, we should be
>> able to do anything by editing the header and footer php include files.
>>
>> Regards, Dave.
>
> The standard indexer can too be reconfigured to take advantage of other
> (non-standard) meta tags, just as easily as the XML feed code can.

I can re-generate the archives if there is something  you think should be
added to the META tags to improve the searching, just let me know ...


----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664