Thread: A counterproductive conversation about search.
Hello,

Now that I have effectively slapped myself silly by being rude to Tom about search, let me bring up some points about search and see if there is a way to resolve them.

The problem:

Search really isn't that good. Tom has good results with it, but I am guessing that is because he is looking for specific things, likely just in the archives, as I doubt he often searches the documentation ;).

A quick search on Google:

site:archives.postgresql.org index bloat

archives.postgresql.org/pgsql-performance/2005-04/msg00617.php
archives.postgresql.org/pgsql-performance/2005-04/msg00594.php
archives.postgresql.org/pgsql-performance/2005-04/msg00608.php

archives.postgresql.org:

http://archives.postgresql.org/pgsql-performance/2005-04/msg00575.php
http://archives.postgresql.org/pgsql-general/2004-12/msg00288.php
http://archives.postgresql.org/pgsql-general/2005-07/msg00186.php

site:www.postgresql.org create index

www.postgresql.org/docs/7.4/static/sql-createindex.html
www.postgresql.org/docs/8.1/static/sql-createindex.html
www.postgresql.org/files/documentation/books/aw_pgsql/node216.html

search.postgresql.org:

http://www.postgresql.org/files/documentation/books/aw_pgsql/node216.html
http://www.postgresql.org/files/documentation/books/pghandbuch/html/sql-createindex.html
http://developer.postgresql.org/~petere/past-events/lsm2003-slides/foil20.html

The first search is "reasonable" between the two, although it does not appear to correctly follow the thread path.

The second search, to me, is completely wrong. CREATE INDEX should always return the current documentation first. I can forgive Google for showing 7.4 first because it has been around longer and is still widely in use.

I have on multiple occasions brought up the idea of another search engine. I wrote the pgsql.ru guys and asked if they would share their code. To their credit they said they would be willing but didn't have the time to install it for us. I told them I would be happy to muscle through it if they would just answer some emails. I never heard back.

Other options include Lucene, and rolling our own.

Rolling our own really wouldn't be that hard "if" we can create a reasonably smart web page grabber. We have all the tools (tsearch2 and pg_trgm) to easily do the searches.

So is anyone up for helping develop a page grabber?

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
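Purely as an illustration of the tooling mentioned above, a minimal sketch of how tsearch2 and pg_trgm could serve such searches. All table, column, and configuration names here are hypothetical; it assumes both contrib modules are installed:

    -- hypothetical table that a page grabber would populate
    CREATE TABLE pages (
        id   serial PRIMARY KEY,
        url  text NOT NULL,
        body text,          -- page text extracted by the grabber
        fts  tsvector       -- full-text search column
    );

    -- build and index the tsvector column (tsearch2-style calls)
    UPDATE pages SET fts = to_tsvector('default', body);
    CREATE INDEX pages_fts_idx ON pages USING gist (fts);

    -- full-text search:
    SELECT url FROM pages
     WHERE fts @@ to_tsquery('default', 'create & index');

    -- trigram matching via pg_trgm, e.g. to tolerate misspellings:
    SELECT url, similarity(url, 'createindex') AS sml
      FROM pages
     WHERE similarity(url, 'createindex') > 0.3
     ORDER BY sml DESC;

Under this scheme the grabber only has to INSERT/UPDATE rows; the database does the rest.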
Joshua D. Drake wrote:
...
> Rolling our own really wouldn't be that hard "if" we can create a
> reasonably smart web page grabber. We have all the tools (tsearch2 and
> pg_trgm) to easily do the searches.
>
> So is anyone up for helping develop a page grabber?

That's not the hardest part, but why do we need to grab anything if the contents of the pages could be in the database? Admittedly, I don't know any good CMS with a PostgreSQL backend. Either way, grabbing the sources of the pages when they are published (like the DocBook source for the documentation) makes a lot more sense, imho. Ditto for the archives. It's much easier to get an idea of the structure and nature of the data when you don't have to deal with the final result (e.g. HTML).

So a couple of scripts that fire when mail comes in, when documentation is compiled, and when some other publishing takes place could really help to keep the index in sync without having to crawl all the sites over and over again.

Regards
Tino Wildenhain
On Tue, 29 Aug 2006, Tino Wildenhain wrote:

> Joshua D. Drake wrote:
> ...
>> Rolling our own really wouldn't be that hard "if" we can create a
>> reasonably smart web page grabber. We have all the tools (tsearch2 and
>> pg_trgm) to easily do the searches.
>>
>> So is anyone up for helping develop a page grabber?
>
> That's not the hardest part, but why do we need to grab anything if the
> contents of the pages could be in the database?
> ...
> So a couple of scripts that fire when mail comes in, when documentation
> is compiled, and when some other publishing takes place could really
> help to keep the index in sync without having to crawl all the sites
> over and over again.

This is exactly what we have on pgsql.ru/db/mw. We use procmail to fire our backend to process each incoming message. This is not a problem; the most complex part is the backend.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
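As a concrete illustration of that procmail hook, a recipe along these lines would hand each incoming list message to an indexing backend (the script path is hypothetical):

    # hand each incoming message to the indexing backend; the 'w' flag
    # makes procmail wait for the program and check its exit code, so a
    # failed indexing run does not count as a successful delivery
    :0 w
    | /usr/local/bin/index-message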
> -----Original Message-----
> From: pgsql-www-owner@postgresql.org
> [mailto:pgsql-www-owner@postgresql.org] On Behalf Of Joshua D. Drake
> Sent: 29 August 2006 04:12
> To: PostgreSQL WWW
> Subject: [pgsql-www] A counterproductive conversation about search.
>
> Hello,
>
> Now that I have effectively slapped myself silly by being rude to Tom
> about search, let me bring up some points about search and see if
> there is a way to resolve them.
>
> The problem:
>
> Search really isn't that good. Tom has good results with it, but I am
> guessing that is because he is looking for specific things, likely
> just in the archives, as I doubt he often searches the documentation ;).
>
> A quick search on Google:
>
> site:archives.postgresql.org index bloat
>
> archives.postgresql.org/pgsql-performance/2005-04/msg00617.php
> archives.postgresql.org/pgsql-performance/2005-04/msg00594.php
> archives.postgresql.org/pgsql-performance/2005-04/msg00608.php
>
> archives.postgresql.org:
>
> http://archives.postgresql.org/pgsql-performance/2005-04/msg00575.php
> http://archives.postgresql.org/pgsql-general/2004-12/msg00288.php
> http://archives.postgresql.org/pgsql-general/2005-07/msg00186.php
>
> site:www.postgresql.org create index
>
> www.postgresql.org/docs/7.4/static/sql-createindex.html
> www.postgresql.org/docs/8.1/static/sql-createindex.html
> www.postgresql.org/files/documentation/books/aw_pgsql/node216.html
>
> search.postgresql.org:
>
> http://www.postgresql.org/files/documentation/books/aw_pgsql/node216.html
> http://www.postgresql.org/files/documentation/books/pghandbuch/html/sql-createindex.html
> http://developer.postgresql.org/~petere/past-events/lsm2003-slides/foil20.html
>
> The first search is "reasonable" between the two, although it does not
> appear to correctly follow the thread path.

The search engine has no site-specific knowledge - it (like any other generic search engine) simply doesn't know about threading.

> The second search, to me, is completely wrong. CREATE INDEX should
> always return the current documentation first. I can forgive Google
> for showing 7.4 first because it has been around longer and is still
> widely in use.

That should be fixable by tweaking weighting values; however, the last time I suggested that I got shot down.

> I have on multiple occasions brought up the idea of another search
> engine. I wrote the pgsql.ru guys and asked if they would share their
> code. To their credit they said they would be willing but didn't have
> the time to install it for us. I told them I would be happy to muscle
> through it if they would just answer some emails. I never heard back.
>
> Other options include Lucene, and rolling our own.

Is Lucene capable of handling the size of our index? This has always been the problem we've had with other projects like Mnogosearch. They work well until you load them up with the archives, after which they simply can't cope without ridiculous amounts of hardware.

> Rolling our own really wouldn't be that hard "if" we can create a
> reasonably smart web page grabber. We have all the tools (tsearch2 and
> pg_trgm) to easily do the searches.
>
> So is anyone up for helping develop a page grabber?

We have one - it builds the static version of the main site by spidering it hourly.

Regards, Dave.
Dave Page wrote:

> That should be fixable by tweaking weighting values; however, the
> last time I suggested that I got shot down.

Not so much shot down as told that it isn't possible (at least not without rewriting code).

Site weights are for sites, not for parts of a site.

However, if you browse to the documentation you want to search first, then only that part of the website will be searched.

Example:

Go to http://www.postgresql.org/docs/8.1/static/index.html and search for 'create index':

http://search.postgresql.org/www.search?ul=http%3A%2F%2Fwww.postgresql.org%2Fdocs%2F8.1%2Fstatic%2F%25&fm=on&cs=utf-8&q=create+index
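Decoded, the query string above shows how the restriction works: the ul parameter filters results to the 8.1 documentation subtree (the trailing '%' appears to act as a wildcard):

    ul = http://www.postgresql.org/docs/8.1/static/%
    fm = on
    cs = utf-8
    q  = create index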
> -----Original Message-----
> From: John Hansen [mailto:john@geeknet.com.au]
> Sent: 29 August 2006 08:52
> To: Dave Page; Joshua D. Drake; PostgreSQL WWW
> Subject: RE: [pgsql-www] A counterproductive conversation about search.
>
> Dave Page wrote:
>
>> That should be fixable by tweaking weighting values; however, the
>> last time I suggested that I got shot down.
>
> Not so much shot down as told that it isn't possible (at least not
> without rewriting code).
>
> Site weights are for sites, not for parts of a site.

Yeah, I was thinking of Mnogosearch, where a server can include subsections so you can do things like:

ServerWeight 100
Server http://www.postgresql.org/docs/8.1/
ServerWeight 50
Server http://www.postgresql.org/docs/7.4/

I did get shot down as well though - iirc it was Oleg who was essentially saying that the search engine should not have any knowledge of the site beyond what it crawled.

> However, if you browse to the documentation you want to search first,
> then only that part of the website will be searched.
>
> Example:
>
> Go to http://www.postgresql.org/docs/8.1/static/index.html
> and search for 'create index':
>
> http://search.postgresql.org/www.search?ul=http%3A%2F%2Fwww.postgresql.org%2Fdocs%2F8.1%2Fstatic%2F%25&fm=on&cs=utf-8&q=create+index

Yeah.

/D
>> Other options include Lucene, and rolling our own.
>
> Is Lucene capable of handling the size of our index? This has always
> been the problem we've had with other projects like Mnogosearch. They
> work well until you load them up with the archives, after which they
> simply can't cope without ridiculous amounts of hardware.

I am going to say "yes" without any actual knowledge of Lucene, but that is because I am putting more trust in the fact that it is an Apache project than anything else. I will check.

>> Rolling our own really wouldn't be that hard "if" we can create a
>> reasonably smart web page grabber. We have all the tools (tsearch2 and
>> pg_trgm) to easily do the searches.
>>
>> So is anyone up for helping develop a page grabber?
>
> We have one - it builds the static version of the main site by
> spidering it hourly.

Should we look at that then?

> Regards, Dave.

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
> ServerWeight 100
> Server http://www.postgresql.org/docs/8.1/
> ServerWeight 50
> Server http://www.postgresql.org/docs/7.4/
>
> I did get shot down as well though - iirc it was Oleg who was
> essentially saying that the search engine should not have any knowledge
> of the site beyond what it crawled.

That won't work in ASPSeek. I looked into that too. It applies the weight to the whole site.

Joshua D. Drake

>> However, if you browse to the documentation you want to search first,
>> then only that part of the website will be searched.
>>
>> Example:
>>
>> Go to http://www.postgresql.org/docs/8.1/static/index.html
>> and search for 'create index':
>>
>> http://search.postgresql.org/www.search?ul=http%3A%2F%2Fwww.postgresql.org%2Fdocs%2F8.1%2Fstatic%2F%25&fm=on&cs=utf-8&q=create+index
>
> Yeah.
>
> /D

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
> -----Original Message-----
> From: Joshua D. Drake [mailto:jd@commandprompt.com]
> Sent: 29 August 2006 15:28
> To: Dave Page
> Cc: PostgreSQL WWW
> Subject: Re: [pgsql-www] A counterproductive conversation about search.
>
>>> So is anyone up for helping develop a page grabber?
>>
>> We have one - it builds the static version of the main site by
>> spidering it hourly.
>
> Should we look at that then?

Is that the "Royal" we? I'm currently in 'no more projects atm' mode, but as far as I'm concerned you're welcome to work on it yourself, on the understanding that whatever you come up with will only be accepted as a replacement for the current solution if the community (-www) agrees that it is a) better from a user perspective, b) more maintainable, and c) matches the main site's look and feel.

FWIW, one of the problems with building a true online index of the archives is that we don't know what URL a message will have until it has been indexed by the archives site. Currently, that output is then indexed by ASPSeek some time later. It might be worth considering rewriting both the archives and the search to make it all truly realtime. That shouldn't be terribly difficult, apart from generating thread indexes and forward/back links, and dealing with the historic URL problem. A SMOP, you might say... :-p

Regards, Dave.
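To make the historic URL problem concrete: a hypothetical realtime archive schema could key messages on their Message-ID header rather than on a crawl-time page number, so a stable URL exists the moment a message arrives. All names below are illustrative, not the current archives schema:

    CREATE TABLE messages (
        msgid       text PRIMARY KEY,   -- RFC 2822 Message-ID header
        in_reply_to text,               -- In-Reply-To header, if any
        list        text NOT NULL,      -- e.g. 'pgsql-www'
        posted      timestamptz NOT NULL,
        subject     text,
        body        text
    );

    -- forward/back links within a thread become simple lookups
    -- instead of a post-crawl reconstruction:
    SELECT msgid, subject FROM messages
     WHERE in_reply_to = '<parent-message-id>';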
> -----Original Message-----
> From: Joshua D. Drake [mailto:jd@commandprompt.com]
> Sent: 29 August 2006 15:32
> To: Dave Page
> Cc: John Hansen; PostgreSQL WWW
> Subject: Re: [pgsql-www] A counterproductive conversation about search.
>
>> ServerWeight 100
>> Server http://www.postgresql.org/docs/8.1/
>> ServerWeight 50
>> Server http://www.postgresql.org/docs/7.4/
>>
>> I did get shot down as well though - iirc it was Oleg who was
>> essentially saying that the search engine should not have any
>> knowledge of the site beyond what it crawled.
>
> That won't work in ASPSeek. I looked into that too. It applies the
> weight to the whole site.

Yeah, I was confusing myself with options in Mnogosearch (which is very similar in many ways).

Regards, Dave.
Hi there,

On Mon, 28 Aug 2006, Joshua D. Drake wrote:

> I have on multiple occasions brought up the idea of another search
> engine. I wrote the pgsql.ru guys and asked if they would share their
> code. To their credit they said they would be willing but didn't have
> the time to install it for us. I told them I would be happy to muscle
> through it if they would just answer some emails. I never heard back.

Joshua, we'd be happy to help the PostgreSQL community, and we actually tried in the past in developing pgsql.ru, but we have families and we're in a situation where we need money to live. We don't want to make promises we might have to break.

On pgsql.ru we have two search engines. One is a commercial version which crawls pages, indexes them, and provides search. Teodor and I are not the only owners, so there is a problem with using it. Also, I don't like the idea of using it, since it's not fully online indexing.

The second SE, based on tsearch2, is what we actually need. Several years ago (fts.postgresql.org) tsearch2 was slow, but now that we have GiN support I see no real problem with fully online indexing. We plan to renew pgsql.ru after 8.2 is released, and then we'll see how it works.

Another issue is how documents get indexed. We have a special user, called robot, which is subscribed to almost all the mailing lists, and a procmail entry instructs it to process each incoming message using our CMS. This works nicely and allows us to stay fully in sync. Of course, we depend on which messages reach the robot. This is not a problem on archives.postgresql.org, which has full control over the mailing lists.

To index www.postgresql.org I see two alternatives:

1. Periodically run a script which crawls the site.
2. Have a real CMS with a hook to the indexer.

I suspect the second way is too complex for the current state of things, so I'd stay with the first one. Given that the documentation changes slowly and only news pages require frequent reindexing, it's not a bad approximation.

Hmm, looks like a mess :( The entire system needs to be rewritten!

It's my opinion that without an understanding of what to index/search, and without financial support, the current thread is useless. Do we have any financing for that?

Regards,
Oleg

btw, we have a simple crawler for OpenFTS, available from
http://openfts.sourceforge.net/contributions.shtml
Using it, it's possible to write a simple script to index collections of documents, like the documentation. See the examples on http://mira.sai.msu.su/~megera/pgsql

> Other options include Lucene, and rolling our own.
>
> Rolling our own really wouldn't be that hard "if" we can create a
> reasonably smart web page grabber. We have all the tools (tsearch2 and
> pg_trgm) to easily do the searches.
>
> So is anyone up for helping develop a page grabber?
>
> Sincerely,
>
> Joshua D. Drake

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
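For a sketch of what fully online indexing could look like with tsearch2 and 8.2's GiN support, reusing the hypothetical messages table from earlier (the trigger function shown is the stock one shipped with tsearch2):

    ALTER TABLE messages ADD COLUMN fts tsvector;

    -- tsearch2 ships a trigger function that rebuilds the tsvector
    -- from the listed text column(s) on every INSERT or UPDATE
    CREATE TRIGGER messages_fts_update
        BEFORE INSERT OR UPDATE ON messages
        FOR EACH ROW EXECUTE PROCEDURE tsearch2(fts, body);

    -- a GiN index (new in 8.2) keeps such searches fast at archive
    -- scale, so new mail is searchable the moment procmail inserts it
    CREATE INDEX messages_fts_gin ON messages USING gin (fts);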
Oleg Bartunov wrote:

> It's my opinion that without an understanding of what to index/search,
> and without financial support, the current thread is useless. Do we
> have any financing for that?

EnterpriseDB created a PostgreSQL fund at the anniversary. Perhaps we can use some of that. I am one of the people who control the fund.

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> It's my opinion that without an understanding of what to index/search,
>> and without financial support, the current thread is useless. Do we
>> have any financing for that?
>
> EnterpriseDB created a PostgreSQL fund at the anniversary. Perhaps we
> can use some of that. I am one of the people who control the fund.

Well, I already offered them money yesterday... :) However, this really should go through the PostgreSQL Fundraising Group, don't you think?

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Oleg Bartunov wrote:
> Hi there,
>
> On Mon, 28 Aug 2006, Joshua D. Drake wrote:
>
>> I have on multiple occasions brought up the idea of another search
>> engine. I wrote the pgsql.ru guys and asked if they would share their
>> code. To their credit they said they would be willing but didn't have
>> the time to install it for us. I told them I would be happy to muscle
>> through it if they would just answer some emails. I never heard back.
>
> Joshua, we'd be happy to help the PostgreSQL community, and we actually
> tried in the past in developing pgsql.ru, but we have families and
> we're in a situation where we need money to live.

Of course :) and I understand that, which is why I sent you a sponsorship suggestion yesterday. Let's take this off-list and talk about some of the financial requirements.

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Joshua D. Drake wrote:
> Bruce Momjian wrote:
>> Oleg Bartunov wrote:
>>> It's my opinion that without an understanding of what to index/search,
>>> and without financial support, the current thread is useless. Do we
>>> have any financing for that?
>>
>> EnterpriseDB created a PostgreSQL fund at the anniversary. Perhaps we
>> can use some of that. I am one of the people who control the fund.
>
> Well, I already offered them money yesterday... :) However, this really
> should go through the PostgreSQL Fundraising Group, don't you think?

Yes, it should go through the Fundraising Group, if that is possible.

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +