Re: website doc search is extremely SLOW - Mailing list pgsql-general

From Dave Cramer
Subject Re: website doc search is extremely SLOW
Date
Msg-id 1072879214.2293.3.camel@localhost.localdomain
In response to Re: website doc search is extremely SLOW  ("John Sidney-Woollett" <johnsw@wardbrook.com>)
List pgsql-general
The search engine I am using is Lucene:
http://jakarta.apache.org/lucene/docs/index.html

It too uses its own internal database format, optimized for searching.
It is quite flexible, and allows searching on arbitrary fields as well.
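
To make the "arbitrary fields" point concrete, here is a minimal sketch of a per-field inverted index in plain Java. It is not Lucene's API or its on-disk format; the class name FieldIndex, the methods add and search, and the sample documents are all made up for illustration.

```java
import java.util.*;

/** Toy per-field inverted index. Illustrative only; this is NOT Lucene's API. */
class FieldIndex {
    // field name -> term -> ids of documents containing that term in that field
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    /** Lowercase and split on non-alphanumerics; a stand-in for a real analyzer. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Index the text of one field of one document. */
    void add(int docId, String field, String text) {
        Map<String, Set<Integer>> terms =
                postings.computeIfAbsent(field, f -> new HashMap<>());
        for (String term : tokenize(text)) {
            terms.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Return ids of documents whose given field contains the term. */
    Set<Integer> search(String field, String term) {
        Map<String, Set<Integer>> terms =
                postings.getOrDefault(field, Collections.emptyMap());
        return terms.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        FieldIndex index = new FieldIndex();
        // Hypothetical documents: a title field and a body field each.
        index.add(1, "title", "CREATE TRIGGER");
        index.add(1, "body", "define a new trigger on a table");
        index.add(2, "title", "CREATE TABLE");
        index.add(2, "body", "define a new table");
        System.out.println(index.search("title", "trigger"));
        System.out.println(index.search("body", "table"));
    }
}
```

Keeping a separate term map per field is what lets a query be restricted to, say, titles only, without scanning the body text.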


The section on querying explains more:

http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
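
Among other things, that page covers field-qualified terms and proximity searches written like "create table"~5. As a rough sketch of what a proximity match means, the plain-Java check below tests whether two terms occur within a given number of word positions of each other. It uses a simplified notion of distance and is not how Lucene computes it; the class Proximity and the method within are hypothetical names.

```java
import java.util.*;

/** Simplified proximity check. Illustrative only; this is NOT Lucene's algorithm. */
class Proximity {
    /** True if termA and termB occur within maxDistance word positions of each other. */
    static boolean within(String text, String termA, String termB, int maxDistance) {
        String a = termA.toLowerCase();
        String b = termB.toLowerCase();
        String[] words = text.toLowerCase().split("[^a-z0-9]+");

        // Record every position at which each term occurs.
        List<Integer> posA = new ArrayList<>();
        List<Integer> posB = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(a)) posA.add(i);
            if (words[i].equals(b)) posB.add(i);
        }

        // Any pair of distinct positions close enough counts as a match.
        for (int pa : posA) {
            for (int pb : posB) {
                if (pa != pb && Math.abs(pa - pb) <= maxDistance) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "CREATE TEMPORARY TABLE foo";
        System.out.println(within(doc, "create", "table", 2));
        System.out.println(within(doc, "create", "table", 1));
    }
}
```

A real engine stores term positions in the index, so it can answer this without rescanning the text at query time.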

It is even possible to index text data inside a database.

Dave
On Wed, 2003-12-31 at 08:44, John Sidney-Woollett wrote:
> Wow, you're right - I could have probably saved myself a load of time! :)
>
> Although you do learn a lot reinventing the wheel... ...or at least you
> hit the same issues and insights others did before...
>
> John
>
> Ericson Smith said:
> > You should probably take a look at the Swish project. For a certain
> > project, we tried Tsearch2/Tsearch, even (gasp) MySQL fulltext search,
> > but with over 600,000 documents to index, both took too long to conduct
> > searches, especially as the database was swapped in and out of memory
> > based on search segment. MySQL full text was the most unusable.
> >
> > Swish uses its own internal DB format, and comes with a simple spider as
> > well. You can make it search by category, date and other nifty criteria
> > also.
> > http://swish-e.org
> >
> > You can take a look over at the project and do some searches to see what
> > I mean:
> > http://cbd-net.com
> >
> > Warmest regards,
> > Ericson Smith
> > Tracking Specialist/DBA
> > +-----------------------+----------------------------+
> > | http://www.did-it.com | "When I'm paid, I always   |
> > | eric@did-it.com       | follow the job through.    |
> > | 516-255-0500          | You know that." -Angel Eyes|
> > +-----------------------+----------------------------+
> >
> >
> >
> > John Sidney-Woollett wrote:
> >
> >>I think that Oleg's new search offering looks really good and fast. (I
> >>can't wait till I have some task that needs tsearch!).
> >>
> >>I agree with Dave that searching the docs is more important for me than
> >>the sites - but it would be really nice to have both, in one tool.
> >>
> >>I built something similar for the Tate Gallery in the UK - here you can
> >>select the type of content that you want returned, either static pages or
> >>dynamic. You can see the idea at
> >>http://www.tate.org.uk/search/default.jsp?terms=sunset%20oil&action=new
> >>
> >>This is custom built (using java/Oracle), supports stemming, boolean
> >>operators, exact phrase matching, relevancy and matched term highlighting.
> >>
> >>You can switch on/off the types of documents that you are not interested
> >>in. Using this analogy, a search facility that could offer you results
> >>from i) the docs and/or ii) the postgres sites static pages would be very
> >>useful.
> >>
> >>John Sidney-Woollett
> >>
> >>Dave Cramer said:
> >>
> >>
> >>>Marc,
> >>>
> >>>No it doesn't spider, it is a specialized tool for searching documents.
> >>>
> >>>I'm curious, what value is there in being able to count the number of
> >>>URLs?
> >>>
> >>>It does do things like query all documents where CREATE and TABLE are n
> >>>words apart, just as fast. I would think such queries are more valuable
> >>>for document searching?
> >>>
> >>>I think the challenge here is deciding what we want to search. I am
> >>>betting that folks use this page as they would man, i.e. to find out
> >>>what the command for create trigger is.
> >>>
> >>>As I said my offer stands to help out, but I think if the goal is to
> >>>search the entire website, then this particular tool is not useful.
> >>>
> >>>At this point I am working on indexing the sgml directly as it has less
> >>>cruft in it. For instance all the links that appear in every summary are
> >>>just noise.
> >>>
> >>>
> >>>Dave
> >>>
> >>>On Wed, 2003-12-31 at 00:44, Marc G. Fournier wrote:
> >>>
> >>>
> >>>>On Wed, 31 Dec 2003, Dave Cramer wrote:
> >>>>
> >>>>
> >>>>
> >>>>>I can modify mine to be client server if you want?
> >>>>>
> >>>>>It is a java app, so we need to be able to run jdk1.3 at least?
> >>>>>
> >>>>>
> >>>>jdk1.4 is available on the VMs ... does yours spider?  for instance, you
> >>>>mention that you have the docs indexed right now, but we are currently
> >>>>indexing:
> >>>>
> >>>>Server http://archives.postgresql.org/
> >>>>Server http://advocacy.postgresql.org/
> >>>>Server http://developer.postgresql.org/
> >>>>Server http://gborg.postgresql.org/
> >>>>Server http://pgadmin.postgresql.org/
> >>>>Server http://techdocs.postgresql.org/
> >>>>Server http://www.postgresql.org/
> >>>>
> >>>>will it be able to handle:
> >>>>
> >>>>186_archives=# select count(*) from url;
> >>>> count
> >>>>--------
> >>>> 393551
> >>>>(1 row)
> >>>>
> >>>>as fast as you are finding with just the docs?
> >>>>
> >>>>----
> >>>>Marc G. Fournier           Hub.Org Networking Services
> >>>>(http://www.hub.org)
> >>>>Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
> >>>>
> >>>>
> >>>>
> >>>--
> >>>Dave Cramer
> >>>519 939 0336
> >>>ICQ # 1467551
> >>>
> >>>
> >>>---------------------------(end of broadcast)---------------------------
> >>>TIP 9: the planner will ignore your desire to choose an index scan if your
> >>>      joining column's datatypes do not match
> >>>
> >>>
> >>>
> >>
> >>
> >>---------------------------(end of broadcast)---------------------------
> >>TIP 2: you can get off all lists at once with the unregister command
> >>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> >>
> >>
> >>
> >
>
--
Dave Cramer
519 939 0336
ICQ # 1467551

