Thread: Postgresql.org search engine.
Hi guys,

As some of you may have noticed, there is now a new search engine on the main and archives websites. This one is based on a (currently) unreleased port of ASPSeek which runs on PostgreSQL.

Comments etc. welcome - you should find this one *much* faster.

Marc: I believe the mnogo stuff can all be ditched now.

Regards, Dave.
On Fri, 30 Jan 2004, Dave Page wrote:

> Hi guys,
>
> As some of you may have noticed, there is now a new search engine on the
> main, and archives websites. This one is based on an unreleased
> (currently) port of ASPSeek which runs on PostgreSQL.
>
> Comments etc. welcome - you should find this one *much* faster.

I'd recommend using ispell dictionaries, so that 'databases' and 'database' produce the same results.

> Marc: I believe the mnogo stuff can all be ditched now.

Agreed!

Regards, Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
On Fri, 30 Jan 2004, Dave Page wrote:

> Hi guys,
>
> As some of you may have noticed, there is now a new search engine on the
> main, and archives websites. This one is based on an unreleased
> (currently) port of ASPSeek which runs on PostgreSQL.

Just checked archives, and it's still using MnogoSearch? Or is there something that I'm supposed to be changing over there?

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
Hi Oleg,

> -----Original Message-----
> From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> Sent: 30 January 2004 16:03
> To: Dave Page
> Cc: pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
> I'd recommend to use ispell dictionaries, so 'databases' and
> 'database' will produce the same results.

Thanks, installed.

BTW, searching for 'database' really makes it think! Other queries that generate fewer hits (e.g. Mvcc or psqlodbc) seem to be far quicker.

I have also added some weighting to the indexed sites to try to give preference to those that are more 'authoritative' and of global interest than others. Any comments or suggestions for changes welcome as always!

# Primary sites
SiteWeight http://www.postgresql.org/ 100
SiteWeight http://advocacy.postgresql.org/ 100
SiteWeight http://jdbc.postgresql.org/ 100
SiteWeight http://developer.postgresql.org/ 100

# Authoritative project sites
SiteWeight http://gborg.postgresql.org/ 75
SiteWeight http://pgadmin.postgresql.org/ 75
SiteWeight http://phppgadmin.sourceforge.net/ 75

# User contributed stuff
SiteWeight http://techdocs.postgresql.org/ 50
SiteWeight http://archives.postgresql.org/ 50

# Outside but reliable
SiteWeight http://www.varlena.com/ 25

# And the rest...
SiteWeight http://www.postgresql.cl/ 0
SiteWeight http://postgresql.ok.cz/ 0
SiteWeight http://www.postgresql.jp/ 0
SiteWeight http://pgsql-fr.tuxfamily.org/ 0
SiteWeight http://www.linuxshare.ru/ 0
SiteWeight http://www.postgres.de/ 0
SiteWeight http://www.pgsqldb.org/ 0
SiteWeight http://www.postgresql.org.br/ 0

Regards, Dave.
> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 30 January 2004 16:48
> To: Dave Page
> Cc: pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
> On Fri, 30 Jan 2004, Dave Page wrote:
>
> > Hi guys,
> >
> > As some of you may have noticed, there is now a new search engine on
> > the main, and archives websites. This one is based on an unreleased
> > (currently) port of ASPSeek which runs on PostgreSQL.
>
> Just checked archives, and its still using MnogoSearch? Or
> is there something that I'm supposed to be changing over there?

It's using aspseek from here. A search for 'stuff' just gave:

Documents 1-20 of total 11042 found. Searching in 276628 documents took 2.736 seconds.

Followed by the ASPSeeeeeek graphical page selector.

Oh, and I changed the beige boxes to light blue.

Try a ctrl-refresh perhaps?

Regards, Dave.
that did it, great ... mnogosearch database being zap'd! *dances a jig* On Fri, 30 Jan 2004, Dave Page wrote: > > > > -----Original Message----- > > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > > Sent: 30 January 2004 16:48 > > To: Dave Page > > Cc: pgsql-www@postgresql.org > > Subject: Re: [pgsql-www] Postgresql.org search engine. > > > > On Fri, 30 Jan 2004, Dave Page wrote: > > > > > Hi guys, > > > > > > As some of you may have noticed, there is now a new search > > engine on > > > the main, and archives websites. This one is based on an unreleased > > > (currently) port of ASPSeek which runs on PostgreSQL. > > > > Just checked archives, and its still using MnogoSearch? Or > > is there something that I'm supposed to be changing over there? > > It's using aspseek from here. A search for 'stuff' just gave: > > Documents 1-20 of total 11042 found. Searching in 276628 documents > took 2.736 seconds. > > Followed by the ASPSeeeeeek graphical page selector. > > Oh, and I changed the beige boxes to light blue. > > Try a ctrl-refresh perhaps? > > Regard,s Dave. > ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
On Fri, 30 Jan 2004, Dave Page wrote:

> Hi Oleg,
>
> > -----Original Message-----
> > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Sent: 30 January 2004 16:03
> > To: Dave Page
> > Cc: pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> > I'd recommend to use ispell dictionaries, so 'databases' and
> > 'database' will produce the same results.
>
> Thanks, installed.
>
> BTW, searching for 'database' really makes it think! Other queries that
> generate less hits (eg. Mvcc or psqlodbc) seem to be far quicker.

It would think much longer if you searched for 'pgsql database' :( I just tried and got ~100 sec. This is a feature of search engines based on inverted indices. tsearch2 works the other way round: the more words in the query, the faster the search.

I suggest adding 'postgresql', 'pgsql' and 'postgres' to the stop word list :( BTW, you could look at the word statistics and use the top N words as stop words.

> I have also added some weighting to the indexed sites to try to give
> preference to those that are more 'authoritative' and of global interest
> than others. Any comments or suggestions for changes welcome as always!

Hmm, I thought aspseek has a sort of page rank, so let it do its work.

> # Primary sites
> SiteWeight http://www.postgresql.org/ 100
> SiteWeight http://advocacy.postgresql.org/ 100
> SiteWeight http://jdbc.postgresql.org/ 100
> SiteWeight http://developer.postgresql.org/ 100
>
> # Authoritative project sites
> SiteWeight http://gborg.postgresql.org/ 75
> SiteWeight http://pgadmin.postgresql.org/ 75
> SiteWeight http://phppgadmin.sourceforge.net/ 75
>
> # User contributed stuff
> SiteWeight http://techdocs.postgresql.org/ 50
> SiteWeight http://archives.postgresql.org/ 50
>
> # Outside but reliable
> SiteWeight http://www.varlena.com/ 25
>
> # And the rest...
> SiteWeight http://www.postgresql.cl/ 0
> SiteWeight http://postgresql.ok.cz/ 0
> SiteWeight http://www.postgresql.jp/ 0
> SiteWeight http://pgsql-fr.tuxfamily.org/ 0
> SiteWeight http://www.linuxshare.ru/ 0
> SiteWeight http://www.postgres.de/ 0
> SiteWeight http://www.pgsqldb.org/ 0
> SiteWeight http://www.postgresql.org.br/ 0
>
> Regards, Dave.

Regards, Oleg
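The stop word suggestion in the message above (take the top N words from the word statistics, plus the project names) could be sketched roughly as follows. This is only an illustration in Python, not ASPSeek code: the function names, the tokenizer, and the choice of document frequency as the "word statistic" are all assumptions.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")


def top_n_stop_words(documents, n, always=("postgresql", "pgsql", "postgres")):
    """Return the n words that occur in the most documents, plus fixed extras.

    Document frequency is used here (an assumption) because a word found in
    nearly every page is what makes an inverted-index lookup expensive.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document.
        doc_freq.update(set(TOKEN.findall(text.lower())))
    stop_words = {word for word, _ in doc_freq.most_common(n)}
    stop_words.update(always)
    return stop_words


def strip_stop_words(query, stop_words):
    """Drop stop words from a query before it is sent to the index."""
    return [w for w in TOKEN.findall(query.lower()) if w not in stop_words]


docs = [
    "pgsql database tuning",
    "pgsql database mvcc",
    "pgsql psqlodbc driver",
]
stop = top_n_stop_words(docs, 1)
print(strip_stop_words("pgsql database", stop))
```

With this sample data, 'pgsql' is the single most common word, so the slow two-word query from the message is reduced to a one-word lookup.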
It's rumoured that Oleg Bartunov once said: > On Fri, 30 Jan 2004, Dave Page wrote: > >> BTW, searching for 'database' really makes it think! Other queries >> that generate less hits (eg. Mvcc or psqlodbc) seem to be far quicker. > > It would think much longer if you search 'pgsql database' :( > Just tried and got ~100 sec. > Meep! > > I suggest to include 'postgresql', 'pgsql', 'postgres' into stop words > list :( btw, you may look at word statistics and let top N words > as stop words. OK, I'll look at that after dinner - thanks. >> I have also added some weighting to the indexed sites to try to give >> preference to those that are more 'authoritative' and of global >> interest than others. Any comments or suggestions for changes welcome >> as always! > > Hmm, I thought aspseek has sort of page rank, so let him works. It does, but I'm trying to give a little preference to results on sites with maximum appeal (ie. those in English), and the most authoritative (ie. those that are published docs rather than list archives or user docs). Also, bear in mind that by default results are grouped by site on the main search page, so generally you will see results from *all* sites indexed on a single page (sorted with the site weighting factored in), but then drill down into a specific site which is unaffected by the site weighting. Regards, Dave.
Guys, Out of curiosity, why are we not using OpenFTS for this? -- -Josh Berkus Aglio Database Solutions San Francisco
It's rumoured that Josh Berkus once said: > Guys, > > Out of curiosity, why are we not using OpenFTS for this? Mainly because Oleg's site uses OpenFTS and it seemed kinda pointless duplicating that, but also because the PostgreSQL port of ASPSeek is proving to be very good (see http://search.oztralis.com.au/ for an example of it searching 3.2 million pages). Regards, Dave.
On Fri, 30 Jan 2004, Dave Page wrote:

> It's rumoured that Oleg Bartunov once said:
> > On Fri, 30 Jan 2004, Dave Page wrote:
> >
> >> BTW, searching for 'database' really makes it think! Other queries
> >> that generate less hits (eg. Mvcc or psqlodbc) seem to be far quicker.
> >
> > It would think much longer if you search 'pgsql database' :(
> > Just tried and got ~100 sec.
>
> Meep!
>
> > I suggest to include 'postgresql', 'pgsql', 'postgres' into stop words
> > list :( btw, you may look at word statistics and let top N words
> > as stop words.
>
> OK, I'll look at that after dinner - thanks.

Bon appetit!

> >> I have also added some weighting to the indexed sites to try to give
> >> preference to those that are more 'authoritative' and of global
> >> interest than others. Any comments or suggestions for changes welcome
> >> as always!
> >
> > Hmm, I thought aspseek has sort of page rank, so let him works.
>
> It does, but I'm trying to give a little preference to results on sites
> with maximum appeal (ie. those in English), and the most authoritative

Sounds reasonable.

> (ie. those that are published docs rather than list archives or user
> docs).
> Also, bear in mind that by default results are grouped by site on the main
> search page, so generally you will see results from *all* sites indexed on
> a single page (sorted with the site weighting factored in), but then drill
> down into a specific site which is unaffected by the site weighting.
> Regards, Dave.

I don't have any experience with aspseek, but one desirable feature is spelling support for the user's query. Does aspseek support this?

Design suggestion: I'd like to see the most important parts of the form on the left side. For example, the site selector is, imho, better placed leftmost, followed by grouping, format, and number of results per page.

Also, I don't like the fixed width; in my browser I have to scroll left-right to see the whole form :)

Regards, Oleg
On Fri, 30 Jan 2004, Josh Berkus wrote:

> Guys,
>
> Out of curiosity, why are we not using OpenFTS for this?

Because OpenFTS isn't an end-user application; it's a search engine, and someone would have to write wrappers. We have already built a mailing list archive search based on OpenFTS/tsearch2, but haven't had time to release it to the production server :( Expect it on www.pgsql.ru.

Regards, Oleg
On Fri, 30 Jan 2004, Dave Page wrote:

> It's rumoured that Josh Berkus once said:
> > Guys,
> >
> > Out of curiosity, why are we not using OpenFTS for this?
>
> Mainly because Oleg's site uses OpenFTS and it seemed kinda pointless
> duplicating that, but also because the PostgreSQL port of ASPSeek is
> proving to be very good (see http://search.oztralis.com.au/ for an example
> of it searching 3.2 million pages).

Guys, there is a big difference between a semi-static index (aspseek) and incremental indexing of incoming documents (tsearch2). Our approach is to develop a fully automatic searchable mailing list archive with instant indexing. So, for example, I see my postings on this subject already in the database and *searchable*! I don't expect the aspseek search engine at postgresql.org to have my recent postings in its index.

OpenFTS has full access to document metadata, so we could limit the search range by date, by list, or by author, so a smart user could get reasonable search performance (relevance is very good, because it is based on proximity).

So, different searches for different purposes!

> Regards, Dave.

Regards, Oleg
> -----Original Message----- > From: Oleg Bartunov [mailto:oleg@sai.msu.su] > Sent: 30 January 2004 19:06 > To: Dave Page > Cc: josh@agliodbs.com; pgsql-www@postgresql.org > Subject: Re: [pgsql-www] Postgresql.org search engine. > > > Guys, there is a big difference between semi-static index > (aspseek) and incremental indexing of incoming documents > (tsearch2). Our approach is to develop fully automatical > searchable mailing list archive with instant indexing. So, > for example, I see my postings about subj. > already in database and *searchable* ! I don't expect > aspseek's search engine at postgresql.org has my recent > postings in its index. No it doesn't, but it probably could do with a little clever scripting to expire the right index pages before each run. In addition, one of the mods made in the version we are using is the addition of an XML feed to the indexer - John (the guy responsible for the port) is keen for me to use this for far more efficient indexing of the archives, however I have yet to do this mainly because it requires hacking mhonarc about to output the XML data. > OpenFTS has full access to metadata of documents, so we could > limit search ' > range by date, by list, by authors, so smart user could get > reasonable search performance (relevance is very good, > because it based on proximity). So, different searches for > different purposes ! We don't have those fields, but the XML feed was originally written for indexing data from online catalogues and has added fields like price. I'd be surprised if others couldn't be added as well. Regards, Dave.
> -----Original Message----- > From: Oleg Bartunov [mailto:oleg@sai.msu.su] > Sent: 30 January 2004 18:50 > To: Dave Page > Cc: pgsql-www@postgresql.org > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > I don't have an experience with aspseek, but one desirable > feature - spelling support for user's query. Does aspseek has > support for this ? You mean like Googles speeling corektor? (http://labs.google.com/britney.html). No, ASPSeek doesn't have this, however I wonder how hard it might be to knock something up based on ispell and soundex... Hmmmmm.... > Design suggestion: I'd like to see most important parts of > form at left side, for example, site selector, imho, better > to have left most, then grouping, format, number results per page. Yup, agreed. > Also, I don't like fixed width, in my browser I have to > scroll left-right to see a whole form :) The whole site is designed that way at the moment, but the new multilanguage version that's in development will be sizable. I hope to update the archives to a similar design at some point as well, just as soon as I've persuaded Marc that it's worth regenerating the messages again! Regards, Dave.
On Fri, 30 Jan 2004, Dave Page wrote:

> > -----Original Message-----
> > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Sent: 30 January 2004 19:06
> > To: Dave Page
> > Cc: josh@agliodbs.com; pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> > Guys, there is a big difference between semi-static index
> > (aspseek) and incremental indexing of incoming documents
> > (tsearch2). Our approach is to develop fully automatical
> > searchable mailing list archive with instant indexing. So,
> > for example, I see my postings about subj.
> > already in database and *searchable* ! I don't expect
> > aspseek's search engine at postgresql.org has my recent
> > postings in its index.
>
> No it doesn't, but it probably could do with a little clever scripting
> to expire the right index pages before each run.
>
> In addition, one of the mods made in the version we are using is the
> addition of an XML feed to the indexer - John (the guy responsible for
> the port) is keen for me to use this for far more efficient indexing of
> the archives, however I have yet to do this mainly because it requires
> hacking mhonarc about to output the XML data.
>
> > OpenFTS has full access to metadata of documents, so we could
> > limit search range by date, by list, by authors, so smart user could get
> > reasonable search performance (relevance is very good,
> > because it based on proximity). So, different searches for
> > different purposes !
>
> We don't have those fields, but the XML feed was originally written for
> indexing data from online catalogues and has added fields like price.
> I'd be surprised if others couldn't be added as well.

This is what you need to look at to optimize search (limit the search region by date period). The default search should cover something like the last year's documents.

> Regards, Dave.

Regards, Oleg
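As a rough illustration of the "default to the last year" idea above, the sketch below restricts a result set to a date window. It is plain Python with made-up message records; the field names and the 365-day default are assumptions, and neither ASPSeek nor OpenFTS is involved.

```python
from datetime import date, timedelta


def default_window(today, days=365):
    """Return the (start, end) range a search would use when no dates are given."""
    return today - timedelta(days=days), today


def restrict_by_date(messages, start, end):
    """Keep only messages whose date falls inside the inclusive window."""
    return [m for m in messages if start <= m["date"] <= end]


# Hypothetical archive entries, for illustration only.
messages = [
    {"subject": "Postgresql.org search engine.", "date": date(2004, 1, 30)},
    {"subject": "An older thread", "date": date(2002, 5, 1)},
]
start, end = default_window(date(2004, 1, 30))
recent = restrict_by_date(messages, start, end)
print([m["subject"] for m in recent])
```

In a real engine the restriction would be applied inside the index lookup rather than after it, which is where the performance gain would come from.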
> -----Original Message----- > From: Oleg Bartunov [mailto:oleg@sai.msu.su] > Sent: 30 January 2004 19:52 > To: Dave Page > Cc: josh@agliodbs.com; pgsql-www@postgresql.org > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > This is what you need to look for to optimize search (limit > search region by date period). Default search should use > something like search last year documents. Oh, date is not a problem. I just haven't put it on the form yet. It's the metadata like author, subject, listname etc. that will take more work (though the latter is handled quite well using a subset restriction). Regards, Dave.
On Fri, 30 Jan 2004, Dave Page wrote:

> > -----Original Message-----
> > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Sent: 30 January 2004 18:50
> > To: Dave Page
> > Cc: pgsql-www@postgresql.org
> > Subject: RE: [pgsql-www] Postgresql.org search engine.
> >
> > I don't have an experience with aspseek, but one desirable
> > feature - spelling support for user's query. Does aspseek has
> > support for this ?
>
> You mean like Googles speeling corektor?
> (http://labs.google.com/britney.html). No, ASPSeek doesn't have this,
> however I wonder how hard it might be to knock something up based on
> ispell and soundex... Hmmmmm....

In principle, a simple corrector could be implemented independently of aspseek. If you have a dictionary of words, create trigrams of those words. If a query returns too many results, create trigrams of the words in the query and check which words from the dictionary are close, i.e. compute similarity weights (Jaccard coefficients would be ok). We use a more complex algorithm on www.pgsql.ru and in our contrib/trgm. Soundex and metaphone are good for English, while the trigram method is universal.

> > Design suggestion: I'd like to see most important parts of
> > form at left side, for example, site selector, imho, better
> > to have left most, then grouping, format, number results per page.
>
> Yup, agreed.
>
> > Also, I don't like fixed width, in my browser I have to
> > scroll left-right to see a whole form :)
>
> The whole site is designed that way at the moment, but the new
> multilanguage version that's in development will be sizable. I hope to
> update the archives to a similar design at some point as well, just as
> soon as I've persuaded Marc that it's worth regenerating the messages
> again!

Why not have fully dynamic pages for mailing lists? A properly configured server with caching could be very fast.

> Regards, Dave.

Regards, Oleg
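The trigram approach described above can be sketched in a few lines of Python. This is the simple variant only (trigram sets compared with a Jaccard coefficient), not the more complex algorithm mentioned for www.pgsql.ru; the padding scheme, the 0.3 threshold, and all function names are assumptions made for the example.

```python
def trigrams(word):
    """Return the set of 3-character substrings of a padded, lower-cased word."""
    # Padding (two spaces in front, one behind) is an assumption; it lets the
    # start and end of the word contribute trigrams of their own.
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest(word, dictionary, threshold=0.3, limit=3):
    """Return dictionary words whose trigram sets are close to the query word."""
    target = trigrams(word)
    scored = [(jaccard(target, trigrams(c)), c) for c in dictionary if c != word]
    close = [(score, c) for score, c in scored if score >= threshold]
    close.sort(key=lambda pair: (-pair[0], pair[1]))  # best match first
    return [c for _, c in close[:limit]]


dictionary = ["postgresql", "postgres", "psqlodbc", "database"]
print(suggest("postgersql", dictionary))
```

Because the comparison works on character sequences rather than pronunciation, the same code applies to any language the dictionary covers, which is the point made about trigrams versus soundex and metaphone.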
> -----Original Message----- > From: Oleg Bartunov [mailto:oleg@sai.msu.su] > Sent: 30 January 2004 20:01 > To: Dave Page > Cc: pgsql-www@postgresql.org > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > Why not have fully dynamic pages for mailing lists ? Proper > configured server with cacheing could be very fast. Dunno, the lists and archives are traditionally Marc's domain :-) The changes I'd like to see do involve stripping out the individual messages to the absolute bare minimum of content. Regards, Dave.
On Fri, 30 Jan 2004, Dave Page wrote: > the archives, however I have yet to do this mainly because it requires > hacking mhonarc about to output the XML data. I just did a search to see if someone else hadn't done this, and couldn't find anything ... have you checked with the list archives to see if anyone is working on this as part of the main stream? ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
On Fri, 30 Jan 2004, Dave Page wrote: > The whole site is designed that way at the moment, but the new > multilanguage version that's in development will be sizable. I hope to > update the archives to a similar design at some point as well, just as > soon as I've persuaded Marc that it's worth regenerating the messages > again! This virus has taken a toll on time this week (7k virus' scanned in 48hrs) ... have regenerating plan'd for this weekend ... will fire you off a note once I've started it :) ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
On Fri, 30 Jan 2004, Oleg Bartunov wrote:

> Why not have fully dynamic pages for mailing lists ? Proper configured
> server with cacheing could be very fast.

How do you mean? Right now, it is dynamic to an extent, but Dave pointed me to the <!--noindex--> stuff vs doing the PHP as I'm doing it now ... but to implement it, I have to update the mhonarc resource file and regenerate all the messages to match ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
On Fri, 30 Jan 2004, Dave Page wrote: > > > > -----Original Message----- > > From: Oleg Bartunov [mailto:oleg@sai.msu.su] > > Sent: 30 January 2004 19:52 > > To: Dave Page > > Cc: josh@agliodbs.com; pgsql-www@postgresql.org > > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > > > > This is what you need to look for to optimize search (limit > > search region by date period). Default search should use > > something like search last year documents. > > Oh, date is not a problem. I just haven't put it on the form yet. It's > the metadata like author, subject, listname etc. that will take more > work (though the latter is handled quite well using a subset > restriction). k, before I regenerate the lists, is this stuff you want me to add to the META DATA part? ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 30 January 2004 20:37
> To: Dave Page
> Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
> On Fri, 30 Jan 2004, Dave Page wrote:
>
> > the archives, however I have yet to do this mainly because it requires
> > hacking mhonarc about to output the XML data.
>
> I just did a search to see if someone else hadn't done this,
> and couldn't find anything ... have you checked with the list
> archives to see if anyone is working on this as part of the
> main stream?

I haven't looked at it at all yet. John wrote the XML feed code for other purposes but suggested it for the archives as well - it did look intriguing...

Regards, Dave
On Fri, 30 Jan 2004, Dave Page wrote:

> > -----Original Message-----
> > From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> > Sent: 30 January 2004 20:01
> > To: Dave Page
> > Cc: pgsql-www@postgresql.org
> > Subject: RE: [pgsql-www] Postgresql.org search engine.
> >
> > Why not have fully dynamic pages for mailing lists ? Proper
> > configured server with cacheing could be very fast.
>
> Dunno, the lists and archives are traditionally Marc's domain :-) The
> changes I'd like to see do involve stripping out the individual messages
> to the absolute bare minimum of content.

'k, this is how the search engines should see each message when they index:

http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php

anything else you'd like me to strip out of there? :(

Note that this is not with the <!--noindex--> stuff you were talking about, which, from what you've said, I'm not sure is useful, since not all search engines will recognize it ... with the way it's coded now, as long as I have the search engine listed, the output will look the same ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
> -----Original Message-----
> From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> Sent: 30 January 2004 20:43
> To: Dave Page
> Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org
> Subject: Re: [pgsql-www] Postgresql.org search engine.
>
> k, before I regenerate the lists, is this stuff you want me
> to add to the META DATA part?

There's not much point I don't think. It's the XML feed that might make use of it, not the standard indexer.

What I really want to see is the absolute bare minimum in the msg files (not even the titles that are there at the moment - speaking of which, it might be worth including them as a php var we can pick up from the top_config.php) - as per the example I emailed you. Then, we should be able to do anything by editing the header and footer php include files.

Regards, Dave.
On Fri, 30 Jan 2004, Dave Page wrote:

> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 30 January 2004 20:37
> > To: Dave Page
> > Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> > On Fri, 30 Jan 2004, Dave Page wrote:
> >
> > > the archives, however I have yet to do this mainly because it requires
> > > hacking mhonarc about to output the XML data.
> >
> > I just did a search to see if someone else hadn't done this,
> > and couldn't find anything ... have you checked with the list
> > archives to see if anyone is working on this as part of the
> > main stream?
>
> I haven't looked at it at all yet. John wrote the XML feed code for
> other purposes but suggested it for the archives as well - it did look
> intriguing...

Just a stupid question, before we go through and 'yet again regen' the archives ... is there something different than mhonarc that would be better?

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
> -----Original Message----- > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > Sent: 30 January 2004 20:49 > To: Dave Page > Cc: Oleg Bartunov; pgsql-www@postgresql.org > Subject: Re: [pgsql-www] Postgresql.org search engine. > > 'k, this is how the search engines should see each message when they > index: > > http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php > > anything else you'd like me to strip out of there? :( > > Note that this is not with the <!--noindex--> stuff you were > talking about, which, from what you've said, I'm not sure is > useful, since not all search engines will recognize it ... > with the way its coded now, as long as I have the search > engine listed, the output will look the same ... As I pointed out before, the problem with that is that search engines like Google or search.postgresql.org that cache the pages won't get any of the thread navigation and other elements of the page. I'd rather see the <!--noindex--> bits (or at very least, include them as well and don't look for aspseek or googlebot in showit()). Regards, Dave.
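To make the difference concrete: with comment markers, every client receives the same full page, and it is the indexer that skips the marked regions. The sketch below shows that idea in Python. It is not how ASPSeek is implemented, and the closing marker `<!--/noindex-->` is an assumption, since only the opening `<!--noindex-->` marker is named in this thread.

```python
import re

# Matches everything from an opening marker to the next closing marker.
# The closing form <!--/noindex--> is assumed for this illustration.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def indexable_html(page):
    """Return the page with every marked block removed, as an indexer might see it."""
    return NOINDEX_BLOCK.sub("", page)


page = (
    "<h1>Subject</h1>"
    "<!--noindex--><a href='prev.php'>Prev by thread</a><!--/noindex-->"
    "<p>Message body</p>"
)
print(indexable_html(page))
```

A crawler that caches pages still stores the complete document, navigation included, which is the behaviour asked for above; a user-agent check on the server cannot offer that, because it changes what is sent.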
> -----Original Message----- > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > Sent: 30 January 2004 20:52 > To: Dave Page > Cc: Marc G. Fournier; Oleg Bartunov; josh@agliodbs.com; > pgsql-www@postgresql.org > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > Just a stupid question, before we go through and 'yet again > regen' the archives ... is there something different then > mhonarc that would be better? Dunno. Shall we leave it a couple of days or so and I'll take a look and produce some test versions of what it might be nice to see? Tonight's a bit awkward as Jo is feeling a bit odd and in her condition... Regards, Dave.
On Fri, 30 Jan 2004, Dave Page wrote:

> > -----Original Message-----
> > From: Marc G. Fournier [mailto:scrappy@postgresql.org]
> > Sent: 30 January 2004 20:43
> > To: Dave Page
> > Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org
> > Subject: Re: [pgsql-www] Postgresql.org search engine.
> >
> > k, before I regenerate the lists, is this stuff you want me
> > to add to the META DATA part?
>
> There's not much point I don't think. It's the XML feed that might make
> use of it, not the standard indexer.
>
> What I really want to see is the absolute bare minimum in the msg files
> (not even the titles that are there at the moment - speaking of which,
> might be worth including them as a php var we can pickup from the
> top_config.php) - as per the example I emailed you. Then, we should be
> able to do anything by editing the header and footer php include files.

D'oh ... I was going to say that I didn't think that was possible, but, it just might be ... seems I have a section declared twice (note that someone else wrote this originally, I've only just begun to understand it to modify it), so the second section is overriding the first, but I was only ever seeing the first ...

Let me play with this over the weekend, I'll do a 'small sample set' that you can look at the messages in, and we can go from there ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
On Fri, 30 Jan 2004, Dave Page wrote: > As I pointed out before, the problem with that is that search engines > like Google or search.postgresql.org that cache the pages won't get any > of the thread navigation and other elements of the page. > > I'd rather see the <!--noindex--> bits (or at very least, include them > as well and don't look for aspseek or googlebot in showit()). 'k, I have some ideas on how to do this so that we can change later if we need ... will play with it this weekend ... ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
> -----Original Message----- > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > Sent: 30 January 2004 21:02 > To: Dave Page > Cc: Marc G. Fournier; Oleg Bartunov; josh@agliodbs.com; > pgsql-www@postgresql.org > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > D'oh ... I was going to say that I didn't think that was > possible, but, it just might be ... seems I have a section > declared twice (note that someone else wrote this originally, > I've only just begun to understand it to modify it), so the > second section is overriding the first, but I was only ever > seeing the first ... Huh? You've lost me there... > Let me play with this over the weekend, I'll do a 'small > sample set' that you can look at the messages in, and we can > go from there ... Ok. If you can do it in a directory away from the archives themselves then I can play if need be without breaking anything by accident... /D
Hmm, what about <meta name="robots" content="noindex,follow"> Oleg On Fri, 30 Jan 2004, Marc G. Fournier wrote: > On Fri, 30 Jan 2004, Dave Page wrote: > > > > > > > > -----Original Message----- > > > From: Oleg Bartunov [mailto:oleg@sai.msu.su] > > > Sent: 30 January 2004 20:01 > > > To: Dave Page > > > Cc: pgsql-www@postgresql.org > > > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > > > > > > > Why not have fully dynamic pages for mailing lists ? Proper > > > configured server with caching could be very fast. > > > > Dunno, the lists and archives are traditionally Marc's domain :-) The > > changes I'd like to see do involve stripping out the individual messages > > to the absolute bare minimum of content. > > 'k, this is how the search engines should see each message when they > index: > > http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php > > anything else you'd like me to strip out of there? :( > > Note that this is not with the <!--noindex--> stuff you were talking > about, which, from what you've said, I'm not sure is useful, since not all > search engines will recognize it ... with the way its coded now, as long > as I have the search engine listed, the output will look the same ... > > ---- > Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) > Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664 > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
On Fri, 30 Jan 2004, Dave Page wrote: > > > > -----Original Message----- > > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > > Sent: 30 January 2004 20:43 > > To: Dave Page > > Cc: Oleg Bartunov; josh@agliodbs.com; pgsql-www@postgresql.org > > Subject: Re: [pgsql-www] Postgresql.org search engine. > > > > > > k, before I regenerate the lists, is this stuff you want me > > to add to the META DATA part? > > There's not much point I don't think. It's the XML feed that might make > use of it, not the standard indexer. > > What I really want to see is the absolute bare minimum in the msg files > (not even the titles that are there at the moment - speaking of which, > might be worth including them as a php var we can pickup from the > top_config.php) - as per the example I emailed you. Then, we should be > able to do anything by editing the header and footer php include files. I don't understand what the problem is with having the postings stored in raw format in the filesystem, the metadata in postgres, and a show component which combines both sources into a nice html page. Dave could get the raw postings from the filesystem using the metadata and index them without any problem. Marc could change the html wrapping every day and everybody is happy :) > > Regards, Dave. > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
On Sat, 31 Jan 2004, Oleg Bartunov wrote: > Hmm, > what's about > > <meta name="robots" content="noindex,follow"> k, that would be for the index/thread pages, of course ... right? here's a question, since you have more experience with this than I ... the current meta tags set for the message pages themselves are: <META NAME="robots" CONTENT="all"> <META NAME="MSSmartTagsPreventParsing" content="TRUE"> <META HTTP-EQUIV="Content-Type" content="text/html; charset=iso-8859-1"> <META NAME="keywords" content="postgresql, hackers, general, sql, admin, novice, interfaces, odbc, jdbc"> <META NAME="rating" Content="General" > <META NAME="distribution" Content="Global" > <META NAME="revisit-after" Content="7 days" > <META NAME="robots" CONTENT="follow, index, noarchive"> anything wrong with the above? seems okay to me, just making sure that maybe there isn't something else that I should add? > > Oleg > On Fri, 30 Jan 2004, Marc G. Fournier wrote: > > > On Fri, 30 Jan 2004, Dave Page wrote: > > > > > > > > > > > > -----Original Message----- > > > > From: Oleg Bartunov [mailto:oleg@sai.msu.su] > > > > Sent: 30 January 2004 20:01 > > > > To: Dave Page > > > > Cc: pgsql-www@postgresql.org > > > > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > > > > > > > > > > Why not have fully dynamic pages for mailing lists ? Proper > > > > configured server with caching could be very fast. > > > > > > Dunno, the lists and archives are traditionally Marc's domain :-) The > > > changes I'd like to see do involve stripping out the individual messages > > > to the absolute bare minimum of content. > > > > 'k, this is how the search engines should see each message when they > > index: > > > > http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php > > > > anything else you'd like me to strip out of there? 
:( > > > > Note that this is not with the <!--noindex--> stuff you were talking > > about, which, from what you've said, I'm not sure is useful, since not all > > search engines will recognize it ... with the way its coded now, as long > > as I have the search engine listed, the output will look the same ... > > > > ---- > > Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) > > Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664 > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 5: Have you checked our extensive FAQ? > > > > http://www.postgresql.org/docs/faqs/FAQ.html > > > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, sci.researcher, hostmaster of AstroNet, > Sternberg Astronomical Institute, Moscow University (Russia) > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(095)939-16-83, +007(095)939-23-83 > ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
On Sat, 31 Jan 2004, Oleg Bartunov wrote: > I don't understand what's the problem having postings in raw format > stored in filesystem, metadata - in postgres and show component which > combines both sources to nice html page. Dave could get raw postings > from filesystem using metadata and index them without any problem. Marc > could change html wrapping everyday and everybody are happy :) Do you have software to do this, including all the inter-posting references and followups? Or do you propose we write this all from scratch? ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
Marc and Dave, at the same time, could you look into generating correct HTTP headers (Last-Modified), so that search engines can cache documents and don't waste server resources? What I still don't understand is whether http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php is a static page or a dynamic one :-? If dynamic, I don't see any problem generating the headers; if static, you could always use the 'touch' hack to set the file's last modification date correctly. Oleg On Fri, 30 Jan 2004, Dave Page wrote: > > > > -----Original Message----- > > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > > Sent: 30 January 2004 21:02 > > To: Dave Page > > Cc: Marc G. Fournier; Oleg Bartunov; josh@agliodbs.com; > > pgsql-www@postgresql.org > > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > > > > D'oh ... I was going to say that I didn't think that was > > possible, but, it just might be ... seems I have a section > > declared twice (note that someone else wrote this originally, > > I've only just begun to understand it to modify it), so the > > second section is overriding the first, but I was only ever > > seeing the first ... > > Huh? You've lost me there... > > > Let me play with this over the weekend, I'll do a 'small > > sample set' that you can look at the messages in, and we can > > go from there ... > > Ok. If you can do it in a directory away from the archives themselves > then I can play if need be without breaking anything by accident... > > /D > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Guys, > Do you have software to do this, including all the inter-posting > references and followups? Or do you propose we write this all from > scratch? Robert Bernier apparently wrote something to break up mail for inclusion in a database, and should be able to help in a couple months. Josh Drake is also willing to help, and has already done a prototype without header searching. -- -Josh Berkus Aglio Database Solutions San Francisco
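For readers wondering what the "inter-posting references and followups" amount to in practice: threading can be rebuilt from three standard mail headers (Message-ID, In-Reply-To, References). The following is only a minimal sketch, not the code Robert Bernier or Josh Drake wrote; the function name, the returned field names, and the sample message are all hypothetical, and the database insert itself is left out.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def threading_metadata(raw_message):
    """Pull out the headers needed to rebuild threads from one raw posting."""
    msg = message_from_string(raw_message)
    # References lists ancestor Message-IDs, oldest first, separated by whitespace.
    references = (msg.get("References") or "").split()
    in_reply_to = (msg.get("In-Reply-To") or "").strip() or None
    # Fall back to the last References entry when In-Reply-To is missing.
    parent = in_reply_to or (references[-1] if references else None)
    return {
        "message_id": (msg.get("Message-ID") or "").strip(),
        "parent_id": parent,
        "references": references,
        "subject": msg.get("Subject", ""),
        "posted": parsedate_to_datetime(msg["Date"]),
    }


# Hypothetical sample posting, not taken from the archives.
SAMPLE = """\
Message-ID: <reply@example.org>
In-Reply-To: <root@example.org>
References: <root@example.org>
Subject: Re: Postgresql.org search engine.
Date: Sat, 31 Jan 2004 06:15:00 +0000

Body of the posting.
"""

meta = threading_metadata(SAMPLE)
print(meta["parent_id"], meta["posted"].isoformat())
```

Each returned dictionary would map onto one row of a metadata table, with the raw posting left untouched on the filesystem as Oleg suggests.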
On Sat, 31 Jan 2004, Oleg Bartunov wrote: > Marc and Dave, > > at the same time, could you see how to generating right http headers > (LAST-MODIFIED), so search engines could cache documents and don't waste > server resources . What I still don't understand is if > http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php > is static page or dynamic :-? If dynamic I don't see any problem generating > headers, if static - you could always use 'touch' hack to set correct > last modification date to file. Huh? The t.php one above was just to show Dave what the search engines are seeing (i.e. minus the search/banner/links, just the message) ... it's not part of the system, just a copy of an existing message ... re last-modified time ... what is wrong with it? According to my browser, it is being displayed correctly, or are you still hung up on the fact that it doesn't equal the posting date of the message itself? If that is all it is, I'm planning on trying something this weekend to get that in place, but the last time I tried it didn't work ... again, if you have better software you can recommend than what we are using now to generate the archives (mhonarc), please speak up before I go through the trouble of regenerating everything all over again ... > > Oleg > > On Fri, 30 Jan 2004, Dave Page wrote: > > > > > > > > -----Original Message----- > > > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > > > Sent: 30 January 2004 21:02 > > > To: Dave Page > > > Cc: Marc G. Fournier; Oleg Bartunov; josh@agliodbs.com; > > > pgsql-www@postgresql.org > > > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > > > > > > > D'oh ... I was going to say that I didn't think that was > > > possible, but, it just might be ... 
seems I have a section > > > declared twice (note that someone else wrote this originally, > > > I've only just begun to understand it to modify it), so the > > > second section is overriding the first, but I was only ever > > > seeing the first ... > > > > Huh? You've lost me there... > > > > > Let me play with this over the weekend, I'll do a 'small > > > sample set' that you can look at the messages in, and we can > > > go from there ... > > > > Ok. If you can do it in a directory away from the archives themselves > > then I can play if need be without breaking anything by accident... > > > > /D > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 9: the planner will ignore your desire to choose an index scan if your > > joining column's datatypes do not match > > > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, sci.researcher, hostmaster of AstroNet, > Sternberg Astronomical Institute, Moscow University (Russia) > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(095)939-16-83, +007(095)939-23-83 > ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
On Fri, 30 Jan 2004, Josh Berkus wrote: > Guys, > > > Do you have software to do this, including all the inter-posting > > references and followups? Or do you propose we write this all from > > scratch? > > Robert Bernier apparently wrote something to break up mail for inclusion in a > database, and should be able to help in a couple months. Josh Drake is also > willing to help, and has already done a prototype without header searching. Dumping mail into a database isn't that hard to do ... there are several projects on the 'Net right now doing that, including one that connects a POP3 daemon into the database to download the mail ... in fact, from what I recall of fts.postgresql.org, isn't that what Oleg/Teodor's stuff does? I'm kinda curious here ... exactly what problem are we trying to solve here? Me, I'm just trying to clean up the archives so that when someone gets their search results, they don't all show the same 'text', which I've already accomplished ... Dave is working on improving the speed of the searches, which he has accomplished with ASPseek ... If I can figure out how to get the Date: of the posting into the Last-Modified field (I know *how* it should work, but last time I tried it ended up generating a whack of errors), then that should satisfy Oleg's beef ... Oleg, one question ... what do you recommend setting max-age to for Cache-control? Right now, I have it set to 30 days ... too long? not long enough? ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
> -----Original Message----- > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > Sent: 31 January 2004 06:15 > To: Josh Berkus > Cc: Oleg Bartunov; Dave Page; pgsql-www@postgresql.org > Subject: Re: [pgsql-www] Postgresql.org search engine. > > > > I'm kinda curious here ... exactly what problem are we trying > to solve here? > My thoughts as well, as we are starting to see suggestions for solving non-existent problems :-) 1) We need each message file minimised - e.g. some php variables defined, a php include for the header, the message and thread links and a php include for the footer. 2) The php header file should use the variables defined in the messages to generate the <TITLE></TITLE> tag and last modified dates, as well as any other useful meta data. 3) The php header and footer files should include <!--noindex--><!--/noindex--> tags to allow aspseek to cache the entire page but only index the relevant content (see anywhere on the main portal site to see how this is done - specifically, menus are not indexed though they are still followed). 4) (Optionally, I don't think it's necessary - certainly not for ASPSeek) The php header/footer may only display if certain user agent strings are not detected. Regards, Dave.
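Point 3 above describes the behaviour wanted from the noindex markers: text between them stays out of the index, while links inside them are still followed. The sketch below only illustrates that behaviour. It is not how ASPSeek is implemented (the thread does not say), and the function name and sample page are made up for the example.

```python
import re

# Everything between the markers is navigation and should not be indexed.
NOINDEX_BLOCK = re.compile(r"<!--noindex-->.*?<!--/noindex-->", re.DOTALL | re.IGNORECASE)
HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def split_for_indexer(html):
    """Return (text to index, links to follow) for one archive page."""
    # Links are collected from the whole page, including the noindex blocks.
    links = HREF.findall(html)
    # Text is taken only from what remains after dropping the noindex blocks.
    visible = NOINDEX_BLOCK.sub(" ", html)
    text = " ".join(TAG.sub(" ", visible).split())
    return text, links


# Hypothetical page fragment modelled on the message layout discussed here.
PAGE = (
    '<!--noindex--><a href="msg00116.php">Prev by Date</a><!--/noindex-->'
    "<pre>I disagree whole heartedly.</pre>"
)

text, links = split_for_indexer(PAGE)
print(text)
print(links)
```

The same split explains why the markers belong around the navigation links rather than around the message body.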
On Sat, 31 Jan 2004, Marc G. Fournier wrote: > On Sat, 31 Jan 2004, Oleg Bartunov wrote: > > > Marc and Dave, > > > > at the same time, could you see how to generating right http headers > > (LAST-MODIFIED), so search engines could cache documents and don't waste > > server resources . What I still don't understand is if > > http://archives.postgresql.org/pgsql-hackers/2004-01/msg00745t.php > > is static page or dynamic :-? If dynamic I don't see any problem generating > > headers, if static - you could always use 'touch' hack to set correct > > last modification date to file. > > Huh? The t.php one above was just to show Dave what the search engines > are seeing (ie. minus the search/banner/links, just the message) ... its > not part of the system, just a copy of an existing message ... > > re last-modified time ... what is wrong with it? According to my browser, > it is being displayed correctly, or are you still hung up on the fact that > it doesn't equal the posting date of the message itself? If that is all > it is, I'm planning on trying something this weekend to get that in place, > but the last time I tried it didn't work ... again, if you have better Yes, correct HTTP headers are what I'd like to see; many crawlers/spiders take them into account. That saves bandwidth and keeps the server from being overloaded. I count two attempts: one failed because of an incorrect date format, and the second because all pages have the same modification date, the moment the page was created rather than the date of posting. Apache can generate the Last-Modified header for static pages from the file modification time, so it's possible to use the 'touch' command to make the file modification date equal to the date of posting. Dynamic pages are another story: there the HTTP header should be generated by the software responsible for displaying the page. 
> software you can recommend then what we are using now to generate the > archives (mhonarc), please speak up before I go through the trouble of > regenerating everything all over again ... Don't know :( We have our mailware, which you've seen on fts.postgresql.org and soon will appear on www.pgsql.ru, but it's not an end user application. I've seen mailman (http://www.list.org/) which connects somehow with mhonarc. A wide list of MLMs is available from http://www.sympa.org/robots.html Also, Sympa has support for Mhonarc archives - http://www.sympa.org/ > > > > > > Oleg > > > > On Fri, 30 Jan 2004, Dave Page wrote: > > > > > > > > > > > > -----Original Message----- > > > > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > > > > Sent: 30 January 2004 21:02 > > > > To: Dave Page > > > > Cc: Marc G. Fournier; Oleg Bartunov; josh@agliodbs.com; > > > > pgsql-www@postgresql.org > > > > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > > > > > > > > > > D'oh ... I was going to say that I didn't think that was > > > > possible, but, it just might be ... seems I have a section > > > > declared twice (note that someone else wrote this originally, > > > > I've only just begun to understand it to modify it), so the > > > > second section is overriding the first, but I was only ever > > > > seeing the first ... > > > > > > Huh? You've lost me there... > > > > > > > Let me play with this over the weekend, I'll do a 'small > > > > sample set' that you can look at the messages in, and we can > > > > go from there ... > > > > > > Ok. If you can do it in a directory away from the archives themselves > > > then I can play if need be without breaking anything by accident... 
> > > > > > /D > > > > > > ---------------------------(end of broadcast)--------------------------- > > > TIP 9: the planner will ignore your desire to choose an index scan if your > > > joining column's datatypes do not match > > > > > > > Regards, > > Oleg > > _____________________________________________________________ > > Oleg Bartunov, sci.researcher, hostmaster of AstroNet, > > Sternberg Astronomical Institute, Moscow University (Russia) > > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > > phone: +007(095)939-16-83, +007(095)939-23-83 > > > > ---- > Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) > Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664 > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
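The 'touch' hack Oleg describes comes down to setting each generated file's modification time to the posting date taken from the message's Date: header, so that a server deriving Last-Modified from the file time reports the posting date. A minimal sketch of that step, assuming the Date: value is already at hand; the function name is hypothetical and a temporary file stands in for a generated msg*.php page.

```python
import os
import tempfile
from email.utils import parsedate_to_datetime


def touch_to_posting_date(path, date_header):
    """Set the file's access and modification times to the posting date."""
    posted = parsedate_to_datetime(date_header).timestamp()
    os.utime(path, (posted, posted))
    return posted


# Demonstration on a throwaway file rather than a real archive page.
with tempfile.NamedTemporaryFile(suffix=".php", delete=False) as handle:
    demo_path = handle.name

stamp = touch_to_posting_date(demo_path, "Fri, 9 Jan 2004 19:00:28 +0000")
mtime = os.path.getmtime(demo_path)
os.remove(demo_path)
print(stamp, mtime)
```

Anything that later rewrites the file resets the timestamp, so the step would have to run after every regeneration of the archive.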
On Sat, 31 Jan 2004, Marc G. Fournier wrote: > On Fri, 30 Jan 2004, Josh Berkus wrote: > > > Guys, > > > > > Do you have software to do this, including all the inter-posting > > > references and followups? Or do you propose we write this all from > > > scratch? > > > > Robert Bernier apparently wrote something to break up mail for inclusion in a > > database, and should be able to help in a couple months. Josh Drake is also > > willing to help, and has already done a prototype without header searching. > > Dumping mail into a database isn't that hard to do ... there are several > projects on the 'Net right now doing that, including one that connects a > POP3 daemon into the database to download the mail ... in fact, from what > I recall of fts.postgresql.org, isn't that what Oleg/Teodor's stuff does? > > I'm kinda curious here ... exactly what problem are we trying to solve > here? > > Me, I'm just trying to clean up the archives so that when someone gets > their search results, they don't all show the same 'text', which I've > already accomplished ... Dave is working on improving the speed of the > searches, which he has accomplished with ASPseek ... > > If I can figure out how to get the Date: of the posting into the > Last-Modified field (I know *how* it should work, but last time I tried it > ended up generating a whack of errors), then that should satisfy Oleg's > beef ... > > Oleg, one question ... what do you recommend setting max-age to for > Cache-control? Right now, I have it set to 30 days ... too long? not > long enough? In my experience Cache-Control is not effective, because it's an HTTP/1.1 feature and a lot of users come through proxies which still don't support HTTP/1.1. The Last-Modified header is the most universal way. Check http://www.mnot.net/cache_docs/#CACHE-CONTROL > > ---- > Marc G. 
Fournier Hub.Org Networking Services (http://www.hub.org) > Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664 > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
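The reason Last-Modified saves bandwidth is the conditional request: a crawler or proxy sends back the date it last saw in If-Modified-Since, and the server answers 304 with no body when nothing has changed. The sketch below shows that decision only; it is not the archive server's code, and the function name and return shape are assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last modified at `last_modified` (UTC)."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: fall through to a full response
        # Only compare when the client's date carries a zone, as HTTP dates do.
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers
    return 200, headers


posted = datetime(2004, 1, 31, 6, 15, tzinfo=timezone.utc)
print(conditional_get(posted))
print(conditional_get(posted, "Sat, 31 Jan 2004 06:15:00 GMT"))
```

With the posting date as Last-Modified, a message page never changes from the client's point of view, so every revisit after the first can be answered with a 304.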
On Sat, 31 Jan 2004, Oleg Bartunov wrote: > > If I can figure out how to get the Date: of the posting into the > > Last-Modified field (I know *how* it should work, but last time I tried it > > ended up generating a whack of errors), then that should satisfy Oleg's > > beef ... 'k, figured out my error with the mhonarc resource file, and now have posting date in as last-modified ... I'm doing this off to the side right now, while I work out the noindex stuff for Dave, but check out: http://archives.postgresql.org/dev And let me know if the headers look right to you ... I took out the Cache-control stuff ... Let me know if there is anything else you'd like to see in there ... ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
Oleg ... as the "resident pro" here ... does this make sense: Messages have: <META NAME="robots" CONTENT="nofollow, index, archive"> And indexes have: <META NAME="robots" CONTENT="follow, noindex, noarchive"> ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
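One way to check what a crawler would actually read from these tags is to parse a generated page and list its robots directives. This is a small sketch using Python's standard HTML parser; the class and function names are hypothetical, and how any particular robot treats unknown values such as archive/noarchive is outside what the code can show.

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Collect the directives from every <META NAME="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.append(directive.strip().lower())


def robots_directives(html):
    parser = RobotsMeta()
    parser.feed(html)
    return parser.directives


message_page = '<META NAME="robots" CONTENT="nofollow, index, archive">'
index_page = '<META NAME="robots" CONTENT="follow, noindex, noarchive">'
print(robots_directives(message_page))
print(robots_directives(index_page))
```

Running it over a page that carries two robots tags, as the earlier message layout did, returns the directives of both, which makes conflicting values easy to spot.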
On Sun, 1 Feb 2004, Marc G. Fournier wrote: > On Sat, 31 Jan 2004, Oleg Bartunov wrote: > > > > If I can figure out how to get the Date: of the posting into the > > > Last-Modified field (I know *how* it should work, but last time I tried it > > > ended up generating a whack of errors), then that should satisfy Oleg's > > > beef ... > > 'k, figured out my error with the mhonarc resource file, and now have > posting date in as last-modified ... I'm doing this off to the side right > now, while I work out the noindex stuff for Dave, but check out: > > http://archives.postgresql.org/dev > > And let me know if the headers look right to you ... I took out the > Cache-control stuff ... > > Let me know if there is anything else you'd like to see in there ... > http headers look fine! > ---- > Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) > Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664 > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
On Sun, 1 Feb 2004, Marc G. Fournier wrote: > > Oleg ... as the "resident pro" here ... does this make sense: > > Messages have: > > <META NAME="robots" CONTENT="nofollow, index, archive"> > > And indexes have: > > <META NAME="robots" CONTENT="follow, noindex, noarchive"> > I don't know 'archive, noarchive', but the others look ok. I'm rather sceptical about this tag, because I don't know of any robots which recognize it :) > > > ---- > Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) > Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664 > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
> -----Original Message----- > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > Sent: 01 February 2004 22:12 > To: Oleg Bartunov > Cc: Marc G. Fournier; Josh Berkus; Dave Page; pgsql-www@postgresql.org > Subject: Re: [pgsql-www] Postgresql.org search engine. > > And let me know if the headers look right to you ... I took > out the Cache-control stuff ... > > Let me know if there is anything else you'd like to see in there ... It looks even more complex to me now - there are what, 6 include files? How about something more simple: ======================================================================== ====== <? $last_modified = "Fri, 9 Jan 2004 19:00:28 +0000 (GMT)"; $subject = " Re: IMPORTANT: A temporary list for Strategic Marketing"; require("$DOCUMENT_ROOT/includes/header.php"); ?> <pre>Joshua D. Drake wrote: > > >There shouldn't be any tangents or general discussion on -advocacy > >either -- that's what -general is for. A one-time incident should not > >lead to such drastic measures. If the marketing plan is no longer > >discussed on -advocacy, what is? > > > > > I disagree whole heartedly. If you look at general, it is basically > PostgreSQL-Support. <!--noindex--> <HR> <UL> <li>Prev by Date: <strong><a href="msg00116.php">Re: IMPORTANT: A temporary list for Strategic Marketing</a></strong> </li> <li>Next by Date: <strong><a href="msg00118.php">Re: IMPORTANT: A temporary list for Strategic Marketing</a></strong> </li> <li>Previous by thread: <strong><a href="msg00126.php">Re: IMPORTANT: A temporary list for Strategic</a></strong> </li> <li>Next by thread: <strong><a href="msg00125.php">Re: IMPORTANT: A temporary list for Strategic</a></strong> </li> <LI>Index(es): <UL> <LI><A HREF="mail2.php#00117"><STRONG>Main</STRONG></A></LI> <LI><A HREF="thrd2.php#00117"><STRONG>Thread</STRONG></A></LI> </UL> </LI> </UL> <!--/noindex--> <? 
require("$DOCUMENT_ROOT/includes/footer.php"); ?> ======================================================================== ====== Header.php then may look something like: ======================================================================== ====== <? if(isset($last_modified)) { header("Last-Modified: $last_modified"); } else { header("Last-Modified: " .date("r", filemtime($SCRIPT_FILENAME))); } // Other stuff here ?> <HTML> <HEAD> <TITLE><?php echo $subject; ?></TITLE> <META NAME="robots" CONTENT="nofollow, index, archive"> </HEAD> <BODY> <!--noindex--> <!-- HTML code for search form etc. --> <!--/noindex--> ======================================================================== ====== And footer.php (minus any footers we might add). ======================================================================== ====== </BODY> </HTML> ======================================================================== ====== In addition, there are an awful lot of HTML comments that mhonarc has added: <!--X-Head-Body-Sep-End--> <!--X-Body-of-Message--> As examples. These seem somewhat extraneous and could be removed for ease of reading and disk space/bandwidth usage reduction. Oh, and on the current version the noindex tags seem to be in the wrong places. On the index/thread pages for example, they should enclose all the hyperlinks. The noindex tags do not stop links being followed, just the text within them from being included in the index. Regards, Dave.
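A note on the date strings in the header sketch above: HTTP expects Last-Modified in one fixed form, always in GMT, such as "Fri, 09 Jan 2004 19:00:28 GMT". A mail-style date with a numeric offset is not in that form, which may be related to the "incorrect format of date" failure Oleg mentions, though the thread does not say so. The conversion is sketched here in Python rather than PHP; the function name is hypothetical.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def http_date_from_mail_date(date_header):
    """Convert a mail Date: header value into the HTTP date form (GMT)."""
    posted = parsedate_to_datetime(date_header)
    if posted.tzinfo is None:
        # Assumption: a date without a usable zone is treated as UTC.
        posted = posted.replace(tzinfo=timezone.utc)
    return format_datetime(posted.astimezone(timezone.utc), usegmt=True)


print(http_date_from_mail_date("Fri, 9 Jan 2004 21:00:28 +0200"))
```

Doing the conversion once, when the archive is generated, would keep the per-request header code as simple as the one shown in the message.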
On Mon, 2 Feb 2004, Dave Page wrote: > > > > -----Original Message----- > > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > > Sent: 01 February 2004 22:12 > > To: Oleg Bartunov > > Cc: Marc G. Fournier; Josh Berkus; Dave Page; pgsql-www@postgresql.org > > Subject: Re: [pgsql-www] Postgresql.org search engine. > > > > And let me know if the headers look right to you ... I took > > out the Cache-control stuff ... > > > > Let me know if there is anything else you'd like to see in there ... > > It looks even more complex to me now - there are what, 6 include files? > > How about something more simple: if you can figure out how to do it in the .resource file, please let me know ... I've strip'd out everything that I believe can be done without making the .resource file itself majorly confusing ... > In addition, there is an awful lot of HTML comments that mhonarc has > added: > > <!--X-Head-Body-Sep-End--> > <!--X-Body-of-Message--> Nothing I can do about these, there are no configuration directives that I've found to strip those ... > Oh, and on the current version the noindex tags seem to be in the wrong > places. On the index/thread pages for example, they should enclose all > the hyperlinks. The noindex tags do not stop links being followed, just > the text within them from being included in the index. The index/thread pages all have noindex,follow set in the META TAG ... isn't that what that META TAG is supposed to be for? ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
I just cleaned up the <HEAD></HEAD> section of the message layout, so that shrinks the msg*.php files by a few more lines ... On Mon, 2 Feb 2004, Marc G. Fournier wrote: > On Mon, 2 Feb 2004, Dave Page wrote: > > > > > > > > -----Original Message----- > > > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > > > Sent: 01 February 2004 22:12 > > > To: Oleg Bartunov > > > Cc: Marc G. Fournier; Josh Berkus; Dave Page; pgsql-www@postgresql.org > > > Subject: Re: [pgsql-www] Postgresql.org search engine. > > > > > > And let me know if the headers look right to you ... I took > > > out the Cache-control stuff ... > > > > > > Let me know if there is anything else you'd like to see in there ... > > > > It looks even more complex to me now - there are what, 6 include files? > > > > How about something more simple: > > if you can figure out how to do it in the .resource file, please let me > know ... I've strip'd out everything that I believe can be done without > making the .resource file itself majorly confusing ... > > > In addition, there is an awful lot of HTML comments that mhonarc has > > added: > > > > <!--X-Head-Body-Sep-End--> > > <!--X-Body-of-Message--> > > Nothing I can do about these, there are no configuration directives that > I've found to strip those ... > > > Oh, and on the current version the noindex tags seem to be in the wrong > > places. On the index/thread pages for example, they should enclose all > > the hyperlinks. The noindex tags do not stop links being followed, just > > the text within them from being included in the index. > > The index/thread pages all have noindex,follow set in the META TAG ... > isn't that what that META TAG is supposed to be for? > > ---- > Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) > Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664 > ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
-On [20040201 23:43], Marc G. Fournier (scrappy@postgresql.org) wrote: > <META NAME="robots" CONTENT="nofollow, index, archive"> According to http://www.robotstxt.org/wc/meta-user.html archive|noarchive does not exist. Where'd you find it? -- Jeroen Ruigrok van der Werven <asmodai(at)wxs.nl> / asmodai / kita no mono PGP fingerprint: 2D92 980E 45FE 2C28 9DB7 9D88 97E6 839B 2EAC 625B http://www.tendra.org/ | http://diary.in-nomine.org/ The human race is challenged more than ever before to demonstrate our mastery -- not over nature but of ourselves...
> -----Original Message----- > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > Sent: 02 February 2004 14:11 > To: Dave Page > Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus; > pgsql-www@postgresql.org > Subject: RE: [pgsql-www] Postgresql.org search engine. >> > if you can figure out how to do it in the .resource file, > please let me know ... I've strip'd out everything that I > believe can be done without making the .resource file itself > majorly confusing ... If I make a copy of the directory to play with, how do I re-run mhonarc? (Probably won't be today though, I have a screaming headache and a broken PBX.) > > The index/thread pages all have noindex,follow set in the META TAG ... > isn't that what that META TAG is supposed to be for? Yes, but then why include the <!--noindex--> tags as well if they are in the wrong place? Regards, Dave.
On Mon, 2 Feb 2004, Jeroen Ruigrok/asmodai wrote: > -On [20040201 23:43], Marc G. Fournier (scrappy@postgresql.org) wrote: > > <META NAME="robots" CONTENT="nofollow, index, archive"> > > According to http://www.robotstxt.org/wc/meta-user.html > archive|noarchive does not exist. > > Where'd you find it? Actually, that one was in the original .resource file, but a quick search on Google shows: http://www.bauser.com/websnob/meta/robots.html and http://www.google.com/webmasters/faq.html#cached The funny thing is that this one: http://www.katpatuka.org/pub/doc/robotexclusion.html refers to NOARCHIVE, but points to: http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt which doesn't include it ... I love standards that everyone follows *roll eyes* I'm gathering it's something that some use (Google does, apparently), and some don't ... ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
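A minimal sketch of the point being made here, not anything taken from the actual .resource file: it splits a robots META CONTENT value into the directives listed on the robotstxt.org page cited above and everything else, which a given crawler may or may not honour. The function name `classify_robots` and the `STANDARD` set are illustrative assumptions.

```python
# Directives described in the robotstxt.org META tag document cited in the thread.
STANDARD = {"index", "noindex", "follow", "nofollow", "all", "none"}


def classify_robots(content):
    """Split a robots META CONTENT value into (standard, extension) directives."""
    # Directives are comma-separated and case-insensitive.
    directives = [d.strip().lower() for d in content.split(",") if d.strip()]
    standard = [d for d in directives if d in STANDARD]
    # Anything else (archive, noarchive, ...) is an engine-specific extension.
    extensions = [d for d in directives if d not in STANDARD]
    return standard, extensions


standard, extensions = classify_robots("nofollow, index, archive")
print(standard)    # ['nofollow', 'index']
print(extensions)  # ['archive']
```

Run against the tag quoted above, only `archive` lands in the extension list, which matches the observation that some engines use it and others ignore it.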
On Mon, 2 Feb 2004, Dave Page wrote: > > > > -----Original Message----- > > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > > Sent: 02 February 2004 14:11 > > To: Dave Page > > Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus; > > pgsql-www@postgresql.org > > Subject: RE: [pgsql-www] Postgresql.org search engine. > >> > > if you can figure out how to do it in the .resource file, > > please let me know ... I've strip'd out everything that I > > believe can be done without making the .resource file itself > > majorly confusing ... > > If I make a copy of the directory to play with, how do I re-run mhonarc? > (probably won't be today though, I have screaming headache and a broken > pbx). there is a mk-mhonarc script in the directory that you can run ... > > The index/thread pages all have noindex,follow set in the META TAG ... > > isn't that what that META TAG is supposed to be for? > > Yes, but then why include the <!--noindex--> tags as well if they are in > the wrong place? removed from the index page(s) ... in fact, I hadn't even put them into the thread index pages ... ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
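To illustrate the distinction Dave draws between the comment-style noindex tags and the META tag, here is a rough sketch of how an indexer might treat the comment markers: text between them is left out of the index, while links inside them are still collected for following. This is not the search engine's actual code; the closing marker `<!--/noindex-->`, the class name and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser


class NoindexAwareParser(HTMLParser):
    """Collects indexable text and links, honouring <!--noindex--> regions."""

    def __init__(self):
        super().__init__()
        self.skipping = False  # True while inside a noindex region
        self.text = []         # text that would go into the index
        self.links = []        # links that would still be followed

    def handle_comment(self, data):
        marker = data.strip().lower()
        if marker == "noindex":
            self.skipping = True
        elif marker == "/noindex":  # assumed closing marker
            self.skipping = False

    def handle_starttag(self, tag, attrs):
        # Links are recorded even inside a noindex region.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.text.append(data.strip())


def scan(html):
    parser = NoindexAwareParser()
    parser.feed(html)
    parser.close()
    return parser.text, parser.links


page = ('<p>Thread index</p><!--noindex-->'
        '<a href="msg00001.php">Re: search engine</a>'
        '<!--/noindex--><p>Footer</p>')
text, links = scan(page)
print(text)   # ['Thread index', 'Footer']
print(links)  # ['msg00001.php']
```

The link text is dropped from the index but the link itself survives, which is why the markers only help on index pages if they enclose the hyperlinks.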
On Mon, 2 Feb 2004, Dave Page wrote: > In addition, there is an awful lot of HTML comments that mhonarc has > added: > > <!--X-Head-Body-Sep-End--> > <!--X-Body-of-Message--> > > As examples. These seem somewhat extraneous and could be removed for ease > of reading and disk space/bandwidth usage reduction. I've put a note out to the mhonarc list to see if there is something I'm missing in the docs that allows one to turn those off ... it seems to add about 1k worth of data to each file, which, when dealing with 100's of thousands of messages, is a fair amount of disk space ... ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
> -----Original Message----- > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > Sent: 02 February 2004 14:11 > To: Dave Page > Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus; > pgsql-www@postgresql.org > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > How about something more simple: > > if you can figure out how to do it in the .resource file, > please let me know ... I've strip'd out everything that I > believe can be done without making the .resource file itself > majorly confusing ... OK, well frankly mhonarc looks like a nightmare to setup. I've had a play with hypermail instead. I realise that I've yet to drop in your search engine detection code and there is still work to be done, but how does this look: http://archives.postgresql.org/dave/pgsql-advocacy/ It's all pretty self contained at the moment - feel free to have a play with it. (mk-hypermail to rebuild, you may need to clear old files first if you make drastic changes). Regards, Dave
On Wed, 4 Feb 2004, Dave Page wrote: > > > > -----Original Message----- > > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > > Sent: 02 February 2004 14:11 > > To: Dave Page > > Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus; > > pgsql-www@postgresql.org > > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > > > How about something more simple: > > > > if you can figure out how to do it in the .resource file, > > please let me know ... I've strip'd out everything that I > > believe can be done without making the .resource file itself > > majorly confusing ... > > OK, well frankly mhonarc looks like a nightmare to setup. Actually, it's quite easy once you read through the docs ... there are formats in the .resource file for the Date Index, Thread Index and Message Page ... and each of those is broken down into sub-sections ... best place to start is: http://www.mhonarc.org/MHonArc/doc/layout.html and then look at each subsection as you need to modify it ... > I've had a > play with hypermail instead. I realise that I've yet to drop in your > search engine detection code and there is still work to be done, but how > does this look: > > http://archives.postgresql.org/dave/pgsql-advocacy/ > > It's all pretty self contained at the moment - feel free to have a play > with it. k, first thing that is missing is the last-modified date isn't set right, which makes it a no go option ... looking at the hypermail.conf file, it more reminds me of setting up a web stats program than a list archiver ... options are either ... you can add footers and headers, but scanning through the docs that it points to, there doesn't seem to be any way of adding a Last-Modified header, since they don't even seem to define VARIABLES (i.e. $DATE$ for date of posting) that you can use when generating the archives ... let me know if I've missed something .. ? ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
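For readers wondering what setting the last-modified date from the posting date involves, the following is a small sketch under stated assumptions: it converts a message's Date: header into the GMT form HTTP expects for a Last-Modified header. It is not how mhonarc or hypermail do it; the function name and the sample date are made up for illustration, and how the header is actually emitted by the archive pages is left out.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def last_modified_header(date_header):
    """Turn a message's Date: header into an HTTP Last-Modified header line."""
    posted = parsedate_to_datetime(date_header)
    if posted.tzinfo is None:
        # Assumption: a Date: header without a usable zone is treated as UTC.
        posted = posted.replace(tzinfo=timezone.utc)
    # HTTP dates are always expressed in GMT.
    http_date = format_datetime(posted.astimezone(timezone.utc), usegmt=True)
    return "Last-Modified: " + http_date


# Hypothetical posting date, four hours behind UTC.
print(last_modified_header("Wed, 4 Feb 2004 14:12:00 -0400"))
# Last-Modified: Wed, 04 Feb 2004 18:12:00 GMT
```

An archiver that exposes the posting date as a template variable makes this a one-line addition to the page header, which is the capability being asked about above.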
> -----Original Message----- > From: Marc G. Fournier [mailto:scrappy@postgresql.org] > Sent: 04 February 2004 14:12 > To: Dave Page > Cc: Marc G. Fournier; Oleg Bartunov; Josh Berkus; > pgsql-www@postgresql.org > Subject: RE: [pgsql-www] Postgresql.org search engine. > > > k, first thing that is missing is the last-modified date > isn't set right, which makes it a no go option ... looking at > the hypermail.conf file, it more reminds me of setting up a > web stats program than a list archiver ... I thought the only archiver you knew was mhonarc? Not the biggest frame of reference :-) > options are either ... you can add footers and headers, but > scanning through the docs that it points to, there doesn't > seem to be any way of adding a Last-Modified header, since > they don't even seem to define VARIABLES (i.e. $DATE$ for date > of posting) that you can use when generating the archives ... > > let me know if I've missed something .. ? I was looking at this after I posted my last message. Hypermail supports HTML templates which may be used instead of headers and footers. These do have variables; however, I couldn't see one for posting date :-( Probably not the hardest mod in the world to add it to the program, but unfortunately I just started a new module at Uni so am somewhat short of spare time again... Regards, Dave.
Well, subject is easy enough to implement, as the subject is also the page title, and we can, as it is, limit searches to the title meta tag. >> -----Original Message----- >> From: Oleg Bartunov [mailto:oleg ( at ) sai ( dot ) msu ( dot ) su] >> Sent: 30 January 2004 19:52 >> To: Dave Page >> Cc: josh ( at ) agliodbs ( dot ) com; pgsql-www ( at ) postgresql ( dot ) org >> Subject: RE: [pgsql-www] Postgresql.org search engine. >> >> >> This is what you need to look for to optimize search (limit >> search region by date period). Default search should use >> something like search last year documents. > >Oh, date is not a problem. I just haven't put it on the form yet. It's >the metadata like author, subject, listname etc. that will take more >work (though the latter is handled quite well using a subset >restriction). > >Regards, Dave. And in reply to the following: >> -----Original Message----- >> From: Marc G. Fournier [mailto:scrappy ( at ) postgresql ( dot ) org] >> Sent: 30 January 2004 20:43 >> To: Dave Page >> Cc: Oleg Bartunov; josh ( at ) agliodbs ( dot ) com; pgsql-www ( at ) postgresql ( dot ) org >> Subject: Re: [pgsql-www] Postgresql.org search engine. >> >> >> k, before I regenerate the lists, is this stuff you want me >> to add to the META DATA part? > >There's not much point I don't think. It's the XML feed that might make >use of it, not the standard indexer. > >What I really want to see is the absolute bare minimum in the msg files >(not even the titles that are there at the moment - speaking of which, >might be worth including them as a php var we can pick up from the >top_config.php) - as per the example I emailed you. Then, we should be >able to do anything by editing the header and footer php include files. > >Regards, Dave. The standard indexer can also be reconfigured to take advantage of other (non-standard) meta tags, just as easily as the XML feed code can. Regards, John
On Mon, 21 Jun 2004, John Hansen wrote: > And in reply to the following: > > >>> -----Original Message----- >>> From: Marc G. Fournier [mailto:scrappy ( at ) postgresql ( dot ) org] >>> Sent: 30 January 2004 20:43 >>> To: Dave Page >>> Cc: Oleg Bartunov; josh ( at ) agliodbs ( dot ) com; pgsql-www ( at ) postgresql ( dot ) org >>> Subject: Re: [pgsql-www] Postgresql.org search engine. >>> >>> >>> k, before I regenerate the lists, is this stuff you want me >>> to add to the META DATA part? >> >> There's not much point I don't think. It's the XML feed that might make >> use of it, not the standard indexer. >> >> What I really want to see is the absolute bare minimum in the msg files >> (not even the titles that are there at the moment - speaking of which, >> might be worth including them as a php var we can pick up from the >> top_config.php) - as per the example I emailed you. Then, we should be >> able to do anything by editing the header and footer php include files. >> >> Regards, Dave. > > The standard indexer can also be reconfigured to take advantage of other (non-standard) meta tags, just as easily as the XML feed code can. I can re-generate the archives if there is something you think should be added to the META tags to improve the searching, just let me know ... ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664