Re: [GENERAL] Re: full text searching - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: [GENERAL] Re: full text searching
Msg-id Pine.GSO.4.33.0102082306320.22966-100000@ra.sai.msu.su
In response to Re: [GENERAL] Re: full text searching  (Ned Lilly <ned@greatbridge.com>)
Responses Re: [GENERAL] Re: full text searching  (The Hermit Hacker <scrappy@hub.org>)
List pgsql-hackers
On Thu, 8 Feb 2001, Ned Lilly wrote:

> (bcc'ed to -hackers)
>
> Gunnar Rønning wrote:
>
> > Does anybody know how Oracle has implemented their "context" search or
> > whatever it is called nowadays ?
>
> They're calling it Intermedia now ... http://www.oracle.com/intermedia/
>
> I have yet to meet an Oracle customer who likes it.
>
> I think there's a lot of agreement that this is an area where Postgres
> could use some work.  I know Oleg Bartunov has done some interesting
> work with Postgres and the search engine at the Russian portal site
> "Rambler" ... http://www.rambler.ru/ .  Oleg, could you talk a bit about
> what you guys did?

Well, we have an FTS engine based entirely on PostgreSQL. It was developed
specifically for indexing dynamic text collections such as online
news. It supports morphology and uses coordinate information and
sophisticated ranking of search results. Search and ranking are built
into Postgres. Currently the biggest collection we have is about 300,000
messages. We're not very happy with performance on a collection of that
size, and to improve it we did some research in the GiST area.
Using GiST we implemented index support for integer arrays, which greatly
improves search performance! Right now we are trying to understand
how to improve sort performance, which is the final (we hope) obstacle
for our FTS. Let me explain a bit:
Search performance is great, but in a real-life application we have to
display search results on a Web page, page by page. Results could be
sorted by relevance or by another parameter. For online news or a mailing
list archive, results are sorted by publication date. We found that most
of the time is spent sorting the full result set, while we need just
10-15 rows to display on a Web page (using ORDER BY ... LIMIT, OFFSET).
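
To make the access pattern concrete, here is a minimal sketch of fetching
one page of results sorted by publication date. Everything named in it is
an assumption for illustration: the table `messages`, its columns `id` and
`pub_date`, and the helper `fetch_page` are hypothetical, and an in-memory
SQLite database from the Python standard library stands in for Postgres
only so that the example runs on its own.

```python
import sqlite3

def fetch_page(conn, page, page_size=10):
    """Return one page of rows, newest first (page numbers start at 0)."""
    cur = conn.execute(
        "SELECT id, pub_date FROM messages "
        "ORDER BY pub_date DESC, id DESC "  # id breaks ties between equal dates
        "LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    )
    return cur.fetchall()

# Hypothetical data: 100 messages spread over 28 days.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, pub_date TEXT)")
conn.executemany(
    "INSERT INTO messages (id, pub_date) VALUES (?, ?)",
    [(i, "2001-01-%02d" % (i % 28 + 1)) for i in range(1, 101)],
)

first_page = fetch_page(conn, 0)
second_page = fetch_page(conn, 1)
```

Each call returns only `page_size` rows, but the database still has to
order every matching row before it can tell which ones belong on the page.
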
Some queries in our case produce about 50,000 rows (a search for "Putin",
for example)! Sort time is enormous and eats all the performance gain
we achieved for search. One solution we are currently investigating is
implementing partial sort in Postgres.
We don't need to sort the full set. Currently LIMIT provides a rather
simple optimization: only part of the results is transferred from the
backend to the client. We propose to stop sorting once that part of the
results is already sorted. From our experience and from the literature
we know that 95% of all hits go to the first 2 pages of search results.
In our worst case with 50,000 rows we could get the first page to display
about 5-6 times faster with partial sorting. I understand this looks like
a rather narrow area for optimization, but many people would appreciate it.
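
The general idea of a partial sort can be sketched outside the database.
The snippet below is only an illustration of top-k selection with a
bounded heap, not the patch discussed in this message; the function names
and the synthetic `(id, timestamp)` rows are assumptions made for the
example.

```python
import heapq
import random

def page_full_sort(rows, limit, offset=0):
    """Order every row by timestamp (newest first), then slice out one page."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    return ordered[offset:offset + limit]

def page_partial_sort(rows, limit, offset=0):
    """Keep only the offset+limit newest rows in order; the rest are
    compared against the current candidates but never fully sorted."""
    top = heapq.nlargest(offset + limit, rows, key=lambda r: r[1])
    return top[offset:]

# Synthetic result set: 50,000 rows with distinct timestamps.
random.seed(0)
timestamps = random.sample(range(1_000_000), 50_000)
rows = [(doc_id, ts) for doc_id, ts in enumerate(timestamps)]

first_page = page_partial_sort(rows, limit=15)
second_page = page_partial_sort(rows, limit=15, offset=15)
```

Both functions return the same page. The second one only has to keep
offset + limit candidates in order, so its cost grows with the page
position rather than with the size of the whole result set. How such a
step would fit into the Postgres executor is a separate question.
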
I remember that when I asked Jan to implement the LIMIT feature, many
friends immediately moved from MySQL to Postgres. This feature isn't
standard, but it's Web friendly and most Web applications use it.
We have a patch for 7.1, well, just a sketch we did for benchmarking
purposes. Tom isn't happy with it and we still need some help from core
developers. But it is time for the 7.1 release and we don't want to bother
developers right now. Anyway, for a medium-size collection our FTS is good
enough even using plain 7.0.3. We were planning to release FTS as open
source before the new year but got stuck on organizational problems
(still have them :-( ).

>
> If there's interest in spinning up a separate project to sit outside the
> database, a la Intermedia or Verity, we'd be happy to sponsor such a
> thing on our GreatBridge.org project hosting site (CVS, bug tracking,
> mail lists, etc.)

We plan to develop a sample application, searching the Postgres mail
archives (I have a collection from 1995), and present it for testing.
If people are happy with the performance and quality of results, we could
install it on www.postgresql.org.

>
> Regards,
> Ned
>
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



