Thread: A counter productive conversation about search.

From: "Joshua D. Drake"
Hello,

Now that I have effectively slapped myself silly by being rude to Tom
about search, let me bring up some points about search and see if there
is a way to resolve them.

The problem:

Search really isn't that good. Tom gets good results with it, but I am
guessing that is because he is looking for specific things, likely just
in the archives, as I doubt he often searches the documentation ;).

A quick search on google:

site:archives.postgresql.org index bloat

archives.postgresql.org/pgsql-performance/2005-04/msg00617.php
archives.postgresql.org/pgsql-performance/2005-04/msg00594.php
archives.postgresql.org/pgsql-performance/2005-04/msg00608.php

archives.postgresql.org:

http://archives.postgresql.org/pgsql-performance/2005-04/msg00575.php
http://archives.postgresql.org/pgsql-general/2004-12/msg00288.php
http://archives.postgresql.org/pgsql-general/2005-07/msg00186.php

site:www.postgresql.org create index
www.postgresql.org/docs/7.4/static/sql-createindex.html
www.postgresql.org/docs/8.1/static/sql-createindex.html
www.postgresql.org/files/documentation/books/aw_pgsql/node216.html

search.postgresql.org:
http://www.postgresql.org/files/documentation/books/aw_pgsql/node216.html
http://www.postgresql.org/files/documentation/books/pghandbuch/html/sql-createindex.html
http://developer.postgresql.org/~petere/past-events/lsm2003-slides/foil20.html

The first search returns "reasonable" results from both engines,
although neither appears to correctly follow the thread path.

The second search, to me, is completely wrong. CREATE INDEX should always
return the current documentation first. I can forgive Google for showing
7.4 first because it has been around longer and is still widely in use.

I have on multiple occasions brought up the idea of another search
engine. I wrote the pgsql.ru guys and asked if they would share their
code. To their credit, they said they would be willing but didn't have
the time to install it for us. I told them I would be happy to muscle
through it myself if they would just answer some emails. I never heard back.

Other options include Lucene, and rolling our own.

Rolling our own really wouldn't be that hard "if" we can create a
reasonably smart web page grabber. We have all the tools (tsearch2 and
pg_trgm) to easily do the searches.

So is anyone up for helping develop a page grabber?
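For illustration, here is a minimal, hypothetical sketch of the parsing half of such a grabber, using only the Python standard library (the class and function names are mine, not from any existing project): it pulls the same-site links to follow next, plus the visible text that would be fed into tsearch2.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PageExtractor(HTMLParser):
    """Collect same-site links and visible text from one fetched page."""

    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.host = urlparse(base_url).netloc
        self.links = []
        self.text_parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    url = urljoin(self.base, value).split("#")[0]
                    if urlparse(url).netloc == self.host:
                        self.links.append(url)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

def extract(base_url, html):
    """Return (unique same-site links, space-joined visible text)."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    return sorted(set(parser.links)), " ".join(parser.text_parts)
```

The crawl loop on top of this is just a queue of unvisited same-site URLs; the extracted text would go into a tsvector column so that tsearch2 (and pg_trgm for fuzzy matching) can do the actual searching.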

Sincerely,

Joshua D. Drake

--

    === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/



Re: A counter productive conversation about search.

From: Tino Wildenhain
Joshua D. Drake wrote:
...
> Rolling our own really wouldn't be that hard "if" we can create a
> reasonably smart web page grabber. We have all the tools (tsearch2 and
> pg_trgm) to easily do the searches.
>
> So is anyone up for helping develop a page grabber?

That's not the hardest part, but why do we need to grab pages at all if
their contents could be in the database? Admittedly, I don't know of
any good CMS with a PostgreSQL backend. In any case, grabbing the sources
of the pages as they are published (like the DocBook sources
for the documentation) makes a lot more sense IMHO. Ditto for the
archives. It's much easier to get an idea of the structure and nature
of the data when you don't have to deal with the final result (e.g. HTML).

So a couple of scripts that fire when mail comes in, when documentation
is compiled, and when other publishing takes place could
really help keep the index in sync without having to crawl all the sites
over and over again.
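As a concrete (and purely illustrative) sketch of the mail side of that idea: a delivery hook only has to turn one incoming message into an indexable record. The sketch below uses only the Python standard library, and the field names are assumptions of mine rather than any existing schema.

```python
import email
from email.policy import default

def header(msg, name):
    """Return a header as a plain string, or None if absent."""
    value = msg[name]
    return str(value) if value is not None else None

def message_to_doc(raw_bytes, list_name):
    """Turn one delivered list message into a record for the search index.

    Run from a procmail recipe (or similar), this indexes each message as
    it arrives, instead of re-crawling rendered archive HTML later.
    """
    msg = email.message_from_bytes(raw_bytes, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "list": list_name,
        "message_id": header(msg, "Message-ID"),
        "in_reply_to": header(msg, "In-Reply-To"),  # lets the indexer rebuild threads
        "subject": header(msg, "Subject") or "",
        "author": header(msg, "From") or "",
        "body": body.get_content() if body is not None else "",
    }
```

An INSERT of this record, with the body going into a tsvector column, is about all such a "backend" needs at minimum; thread reconstruction and URL assignment can then happen inside the database.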

Regards
Tino Wildenhain


Re: A counter productive conversation about search.

From: Oleg Bartunov
On Tue, 29 Aug 2006, Tino Wildenhain wrote:

> Joshua D. Drake wrote:
> ...
>> Rolling our own really wouldn't be that hard "if" we can create a
>> reasonably smart web page grabber. We have all the tools (tsearch2 and
>> pg_trgm) to easily do the searches.
>>
>> So is anyone up for helping develop a page grabber?
>
> Thats not the hardest part but why do we need to grab if the contents
> of the pages could be in the database? But admittedly, I don't know
> any good CMS w/ postgresql backend. But anyway, grabbing the sources
> of the pages while they are published (like the docbook stuff
> for the documentation) makes a lot more sense imho. Ditto for the
> archives. Its much easier to get an idea of the structure and nature
> of the data when you dont have to deal with the final result (e.g. HTML)
>
> So a couple of scripts that fire when mail comes in, documentation
> is compiled and when some other publishing takes place could
> really help to keep the index in sync w/o having to crawl all sites
> over and over again.

This is exactly what we have on pgsql.ru/db/mw. We use procmail to fire
our backend to process each incoming message. That part is not a problem;
the most complex piece is the backend.

>
> Regards
> Tino Wildenhain

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: A counter productive conversation about search.

From: "Dave Page"

> -----Original Message-----
> From: pgsql-www-owner@postgresql.org
> [mailto:pgsql-www-owner@postgresql.org] On Behalf Of Joshua D. Drake
> Sent: 29 August 2006 04:12
> To: PostgreSQL WWW
> Subject: [pgsql-www] A counter productive conversation about search.
>
> Hello,
>
> Now that I have effectively slapped myself silly by being rude to Tom
> about search. Let me bring up some points about search and
> see if there
> is a way to resolve them.
>
> The problem:
>
> Search really isn't that good. Tom has good results with it, but I am
> guessing that because he is looking for specific things,
> likely just in
> archives as I doubt he often searches the documentation ;).
>
> A quick search on google:
>
> site:archives.postgresql.org index bloat
>
> archives.postgresql.org/pgsql-performance/2005-04/msg00617.php
> archives.postgresql.org/pgsql-performance/2005-04/msg00594.php
> archives.postgresql.org/pgsql-performance/2005-04/msg00608.php
>
> archives.postgresql.org:
>
> http://archives.postgresql.org/pgsql-performance/2005-04/msg00575.php
> http://archives.postgresql.org/pgsql-general/2004-12/msg00288.php
> http://archives.postgresql.org/pgsql-general/2005-07/msg00186.php
>
> site:www.postgresql.org create index
> www.postgresql.org/docs/7.4/static/sql-createindex.html
> www.postgresql.org/docs/8.1/static/sql-createindex.html
> www.postgresql.org/files/documentation/books/aw_pgsql/node216.html
>
> search.postgresql.org:
> http://www.postgresql.org/files/documentation/books/aw_pgsql/node216.html
> http://www.postgresql.org/files/documentation/books/pghandbuch/html/sql-createindex.html
> http://developer.postgresql.org/~petere/past-events/lsm2003-slides/foil20.html
>
> The first search is "reasonable" between the two, although it
> does not
> appear to correctly follow the thread path.

The search engine has no site specific knowledge - it (like any other
generic search engine) simply doesn't know about threading.

> The second search to me is completely wrong. CREATE INDEX
> should always
> return the current documentation first. I can forgive google
> for showing
> 7.4 first because it has been around longer and yet is still
> widely in use.

That should be fixable by tweaking weighting values; however, the last
time I suggested that I got shot down.

> I have on multiple occasions brought up the idea of another search
> engine. I wrote the pgsql.ru guys and asked if they would share their
> code. To their benefit they said they would be willing but
> didn't have
> the time to install it for us. I told them I would be happy to muscle
> through it if they would just answer some emails. I never heard back.
>
> Other options include lucene, and rolling our own.

Is Lucene capable of handling the size of our index? This has always
been the problem we've had with other projects like MnogoSearch. They
work well until you load them up with the archives after which they
simply can't cope without ridiculous amounts of hardware.

> Rolling our own really wouldn't be that hard "if" we can create a
> reasonably smart web page grabber. We have all the tools
> (tsearch2 and
> pg_pgtrm) to easily do the searches.
>
> So is anyone up for helping develop a page grabber?

We have one - it builds the static version of the main site by spidering
it hourly.

Regards, Dave.

Re: A counter productive conversation about search.

From: "John Hansen"
Dave Page Wrote:

> That should be fixable by tweaking weighting values, however
> last time I suggested that I got shot down.

Not so much shot down as told that it isn't possible (at least not without rewriting code).

Siteweights are for sites, not for parts of a site.

However, if you browse to the documentation you want to search first, then only that part of the website will be
searched.

Example:

Go to http://www.postgresql.org/docs/8.1/static/index.html and search for 'create index'


http://search.postgresql.org/www.search?ul=http%3A%2F%2Fwww.postgresql.org%2Fdocs%2F8.1%2Fstatic%2F%25&fm=on&cs=utf-8&q=create+index


Re: A counter productive conversation about search.

From: "Dave Page"

> -----Original Message-----
> From: John Hansen [mailto:john@geeknet.com.au]
> Sent: 29 August 2006 08:52
> To: Dave Page; Joshua D. Drake; PostgreSQL WWW
> Subject: RE: [pgsql-www] A counter productive conversation
> about search.
>
> Dave Page Wrote:
>
> > That should be fixable by tweaking weighting values, however
> > last time I suggested that I got shot down.
>
> Not so much shot down, but that it isn't possible. (at least
> not without rewriting code)
>
> Siteweights are for sites, not for parts of a site.

Yeah, I was thinking of Mnogosearch where a server can include
subsections so you can do things like:

ServerWeight 100
Server http://www.postgresql.org/docs/8.1/
ServerWeight 50
Server http://www.postgresql.org/docs/7.4/

I did get shot down as well though; IIRC it was Oleg who was
essentially saying that the search engine should not have any knowledge
of the site beyond what it crawled.

> However, if you browse to the documentation you want to
> search first, then only that part of the website will be searched.
>
> Example:
>
> Go to http://www.postgresql.org/docs/8.1/static/index.html
> and search for 'create index'
>
> http://search.postgresql.org/www.search?ul=http%3A%2F%2Fwww.postgresql.org%2Fdocs%2F8.1%2Fstatic%2F%25&fm=on&cs=utf-8&q=create+index

Yeah.

/D

Re: A counter productive conversation about search.

From: "Joshua D. Drake"
>> Other options include lucene, and rolling our own.
>
> Is Lucene capable of handling the size of our index? This has always

I am going to say "yes" without any actual knowledge of Lucene, mostly
because I am putting my trust in the fact that it is an Apache project
more than anything else. I will check.

> been the problem we've had with other projects like MnogoSearch. They
> work well until you load them up with the archives after which they
> simply can't cope without ridiculous amounts of hardware.
>
>> Rolling our own really wouldn't be that hard "if" we can create a
>> reasonably smart web page grabber. We have all the tools
>> (tsearch2 and
>> pg_trgm) to easily do the searches.
>>
>> So is anyone up for helping develop a page grabber?
>
> We have one - it builds the static version of the main site by spidering
> it hourly.

Should we look at that then?

>
> Regards, Dave.
>





Re: A counter productive conversation about search.

From: "Joshua D. Drake"
> ServerWeight 100
> Server http://www.postgresql.org/docs/8.1/
> ServerWeight 50
> Server http://www.postgresql.org/docs/7.4/
>
> I did get shot down as well though - iirc it was Oleg who was
> essentially saying that the search engine should not have any knowledge
> of the site beyond what it crawled.

That won't work in ASPSeek; I looked into that too. It applies the
weight to the whole site.

Joshua D. Drake

>
>> However, if you browse to the documentation you want to
>> search first, then only that part of the website will be searched.
>>
>> Example:
>>
>> Go to http://www.postgresql.org/docs/8.1/static/index.html
>> and search for 'create index'
>>
>> http://search.postgresql.org/www.search?ul=http%3A%2F%2Fwww.postgresql.org%2Fdocs%2F8.1%2Fstatic%2F%25&fm=on&cs=utf-8&q=create+index
>
> Yeah.
>
> /D
>





Re: A counter productive conversation about search.

From: "Dave Page"

> -----Original Message-----
> From: Joshua D. Drake [mailto:jd@commandprompt.com]
> Sent: 29 August 2006 15:28
> To: Dave Page
> Cc: PostgreSQL WWW
> Subject: Re: [pgsql-www] A counter productive conversation
> about search.
>
>
> >> So is anyone up for helping develop a page grabber?
> >
> > We have one - it builds the static version of the main site
> by spidering
> > it hourly.
>
> Should we look at that then?

Is that the "Royal" we? I'm currently in 'no more projects atm' mode, but
as far as I'm concerned you're welcome to work on it yourself, on the
understanding that whatever you come up with will only be accepted as a
replacement for the current solution if the community (-www) agrees that
it is a) better from a user perspective, b) more maintainable, and
c) matches the main site look'n'feel.

FWIW, one of the problems with building a true online index of the
archives is that we don't know what URL a message might have until it
has been indexed by the archives site. Currently that is then indexed by
ASPSeek sometime later. It might be worth considering rewriting both the
archives and the search to make it all truly realtime. That shouldn't be
terribly difficult apart from generating thread indexes/forward/back
links, and dealing with the historic URL problem. A SMOP you might
say...

:-p

Regards, Dave.

Re: A counter productive conversation about search.

From: "Dave Page"

> -----Original Message-----
> From: Joshua D. Drake [mailto:jd@commandprompt.com]
> Sent: 29 August 2006 15:32
> To: Dave Page
> Cc: John Hansen; PostgreSQL WWW
> Subject: Re: [pgsql-www] A counter productive conversation
> about search.
>
>
> > ServerWeight 100
> > Server http://www.postgresql.org/docs/8.1/
> > ServerWeight 50
> > Server http://www.postgresql.org/docs/7.4/
> >
> > I did get shot down as well though - iirc it was Oleg who was
> > essentially saying that the search engine should not have
> any knowledge
> > of the site beyond what it crawled.
>
> That won't work in aspseek. I looked into that too. It applies the
> weight to the whole site.

Yeah, I was confusing myself with options in Mnogosearch (which is very
similar in many ways).

Regards, Dave.

Re: A counter productive conversation about search.

From: Oleg Bartunov
Hi there,

On Mon, 28 Aug 2006, Joshua D. Drake wrote:

>
> I have on multiple occasions brought up the idea of another search engine. I
> wrote the pgsql.ru guys and asked if they would share their code. To their
> benefit they said they would be willing but didn't have the time to install
> it for us. I told them I would be happy to muscle through it if they would
> just answer some emails. I never heard back.

Joshua, we'd be happy to help the PostgreSQL community, and we actually
tried in the past by developing pgsql.ru, but we have families and we're
in a situation where we need money to live. We don't want to promise
something we couldn't keep. On pgsql.ru we have two search engines. One
is a commercial version which crawls pages, indexes them, and provides
search; Teodor and I are not the only owners, so there is a problem with
using it. Also, I don't like the idea of using it, since it doesn't do
fully online indexing. The second search engine, based on tsearch2, is
what we actually need. Several years ago (fts.postgresql.org) tsearch2
was slow, but now that we have GIN support I see no real problem with
fully online indexing. We plan to renew pgsql.ru after releasing 8.2,
and then we'll see how it works.

Another problem is how documents get indexed. We have a special user,
called robot, which is subscribed to almost all the mailing lists, with
a procmail entry instructed to process each incoming message using our
CMS. This works nicely and allows us to stay fully in sync. Of course,
we depend on which messages come to the robot. This is not a problem for
archives.postgresql.org, which has full control over the mailing lists.

To index www.postgresql.org I see two alternatives:
1. Periodically run a script which crawls the site.
2. Have a real CMS with a hook to the indexer.

I suspect the second way is too complex for the current state of the art,
so I'd stay with the first one. Given that the documentation changes
slowly and only the news pages require frequent indexing, it's not a bad
approximation.
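The first alternative can stay very small. As a hedged sketch (the fetch and link-extraction callables are injected parameters, and nothing here reflects actual pgsql.ru code), a bounded breadth-first crawl suitable for a periodic cron run looks like:

```python
from collections import deque

def crawl(start_url, fetch, extract_links, max_pages=1000):
    """Breadth-first crawl from start_url, bounded by max_pages.

    fetch(url) returns page content (or None on failure);
    extract_links(url, page) returns the URLs found on that page.
    Injecting both keeps the loop itself trivial to test.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}  # url -> content, ready for (re)indexing
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        pages[url] = page
        for link in extract_links(url, page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Run hourly, or nightly for the slow-changing documentation, the returned pages can be diffed against what is already indexed so that only changed documents are re-fed to the full-text indexer.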

Hmm, it looks like a mess :( The entire system needs to be rewritten!

It's my opinion that without an understanding of what to index/search,
and without financial support, the current thread is useless. Do we have
any financing for that?


Regards,
     Oleg

BTW, we have a simple crawler for OpenFTS, available from
http://openfts.sourceforge.net/contributions.shtml
Using it, it's possible to write a simple script to index collections of
documents, like the documentation. See the examples at
http://mira.sai.msu.su/~megera/pgsql

>
> Other options include lucene, and rolling our own.
>
> Rolling our own really wouldn't be that hard "if" we can create a reasonably
> smart web page grabber. We have all the tools (tsearch2 and pg_trgm) to
> easily do the searches.
>
> So is anyone up for helping develop a page grabber?
>
> Sincerely,
>
> Joshua D. Drake

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: A counter productive conversation about search.

From: Bruce Momjian
Oleg Bartunov wrote:
> It's my opinion, that without understanding what to index/search and
> financial support current thread is useless. Do we have any financing
> for that ?

EnterpriseDB created a PostgreSQL fund at the anniversary.  Perhaps we
can use some of that.  I am one of the people who control the fund.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: A counter productive conversation about search.

From: "Joshua D. Drake"
Bruce Momjian wrote:
> Oleg Bartunov wrote:
>> It's my opinion, that without understanding what to index/search and
>> financial support current thread is useless. Do we have any financing
>> for that ?
>
> EnterpriseDB created a PostgreSQL fund at the anniversary.  Perhaps we
> can use some of that.  I am one of the people who control the fund.
>

Well, I already offered them money yesterday... :) However, this really
should go through the PostgreSQL Fundraising Group, don't you think?

Sincerely,

Joshua D. Drake





Re: A counter productive conversation about search.

From: "Joshua D. Drake"
Oleg Bartunov wrote:
> Hi there,
>
> On Mon, 28 Aug 2006, Joshua D. Drake wrote:
>
>>
>> I have on multiple occasions brought up the idea of another search
>> engine. I wrote the pgsql.ru guys and asked if they would share their
>> code. To their benefit they said they would be willing but didn't have
>> the time to install it for us. I told them I would be happy to muscle
>> through it if they would just answer some emails. I never heard back.
>
> Joshua, we'd be happy to help PostgreSQL community and actually we tried
> in past developing pgsql.ru, but we have families and we're in situation we
> need money to live.

Of course :) and I understand that, which is why I sent you a
sponsorship suggestion yesterday.

Let's take this off list and talk about some of the financial requirements.

Sincerely,

Joshua D. Drake




Re: A counter productive conversation about search.

From: Bruce Momjian
Joshua D. Drake wrote:
> Bruce Momjian wrote:
> > Oleg Bartunov wrote:
> >> It's my opinion, that without understanding what to index/search and
> >> financial support current thread is useless. Do we have any financing
> >> for that ?
> >
> > EnterpriseDB created a PostgreSQL fund at the anniversary.  Perhaps we
> > can use some of that.  I am one of the people who control the fund.
> >
>
> Well I already offered them money yesterday... :) However this really
> should go through the PostgreSQL Fundraising Group don't you think?

Yes, it should go through the Fundraising Group, if that is possible.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +