Thread: Please comment on the following OpenFTS/tsearch2 issues!

Please comment on the following OpenFTS/tsearch2 issues!

From: "Don Walker"
This topic was originally posted to the OpenFTS-general list on April 24,
2006. There were no replies in about 22 hours so I'm reposting to this more
active list.

I'm investigating OpenFTS and tsearch2 to see if they provide enough
full-text searching features to be used in a new application. I've run into
a number of issues that I would appreciate feedback/comments/workarounds on.

1. While tsearch2 provides fairly complete boolean search expression support
with AND - &, OR - |, NOT - !, and grouping - (), OpenFTS appears to only
have support for ANDing search terms. Is there some reason it hasn't been
extended to support full tsearch2 search expressions? Has anyone modified
OpenFTS to do this?

2. Neither OpenFTS nor tsearch2 supports exact phrase matching. I've seen the
workaround to support matching a single exact phrase by modifying the WHERE
clause with textcolumn ~* 'exact phrase'. Does this give reasonable
performance? Has anyone implemented exact phrase matching in complex search
expressions like ("exact phrase1" AND term1) OR (NOT "exact phrase2" AND
"exact phrase3") ?

3. The following summarizes what I've read about performance and scalability
of OpenFTS and/or tsearch2:

a) don't expect OpenFTS/tsearch2 to perform/scale as well as dedicated
search engines like Lucene, http://lucene.apache.org/,
http://archives.postgresql.org/pgsql-general/2002-05/msg01156.php.

b) OR queries are slower than AND queries,
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/optimization.html.

c) the design trade-offs favor online indexing instead of search
performance/scalability - see Full text search engine section in
http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo

d) there are a number of things you can do to improve performance - see the
thread starting at
http://sourceforge.net/mailarchive/message.php?msg_id=11444008.

Do you agree with this summary? If you are using either OpenFTS or tsearch2
in production, has the performance been acceptable? For my application I
could be looking at several million documents averaging about 3 pages each
(I only have ballpark figures at present).

4. If you are using either OpenFTS or tsearch2 in production why did you
choose OpenFTS over tsearch2 or vice versa? One of the advantages of
tsearch2 that I can see is that, once you have set up your database and
indexed your documents, you can talk to the database directly from your
application using SQL without needing to go through Perl first. This assumes
that you're ok with tsearch2 search expression syntax so you can use
functions like to_tsquery. It also assumes that you don't need sophisticated
exact phrase matching.

5. Are there any scripts, tools, add-ons, etc. that you can recommend?


Re: Please comment on the following OpenFTS/tsearch2 issues!

From: John DeSoi
On Apr 25, 2006, at 3:45 PM, Don Walker wrote:

> 2. Neither OpenFTS nor tsearch2 supports exact phrase matching. I've seen the
> workaround to support matching a single exact phrase by modifying the WHERE
> clause with textcolumn ~* 'exact phrase'. Does this give reasonable
> performance?

It seems to work well for me, but I'm sure the results are highly
data dependent. Performance will directly depend on the size and
number of documents you must sequentially search for your phrase
after making the initial cut on the indexed words.
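
As a rough illustration of the two-stage approach described above (an
indexed cut on the individual words, then a sequential scan of the
surviving documents for the exact phrase), here is a minimal Python
sketch. It does not use tsearch2 at all: the in-memory docs dictionary,
the toy word index, and both function names are assumptions made up for
this example.

```python
import re
from collections import defaultdict

def build_word_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def phrase_search(docs, index, phrase):
    """Return sorted ids of documents containing the exact phrase, ignoring case."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return []
    # Stage 1: indexed cut, keep only documents that contain every word.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Stage 2: sequential scan of the candidates only (the costly part).
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return sorted(d for d in candidates if pattern.search(docs[d]))

docs = {
    1: "Full text search in PostgreSQL",
    2: "Search the full archive, text included",
    3: "Nothing relevant here",
}
index = build_word_index(docs)
print(phrase_search(docs, index, "full text"))  # prints [1]
```

Document 2 survives the first stage because it contains both words, and
is only rejected by the sequential scan. The more such candidates the
first stage lets through, and the longer they are, the slower the second
stage becomes.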


John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


Re: Please comment on the following OpenFTS/tsearch2 issues!

From: Teodor Sigaev
> 1. While tsearch2 provides fairly complete boolean search expression support
> with AND - &, OR - |, NOT - !, and grouping - (), OpenFTS appears to only
> have support for ANDing search terms. Is there some reason it hasn't been
> extended to support full tsearch2 search expressions? Has anyone modified
> OpenFTS to do this?

Historical reasons and simplicity, nothing more.
We didn't modify OpenFTS... People often ask us about conversion of text ->
tsquery, so 8.2 will have plainto_tsquery(), which returns the same result as
the OpenFTS query parser.
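
To make the text -> tsquery conversion concrete, here is a small Python
sketch that turns free text into an AND-only expression in tsearch2
syntax. It only illustrates the idea: the function name is hypothetical,
and the real conversion happens inside the database, where words would
also be normalized by the configured dictionaries. This sketch does not
attempt that.

```python
import re

def plain_text_to_and_query(text):
    """Join the words of `text` with '&', giving an AND-only query string."""
    # Lowercasing and \w+ tokenizing are simplifications, not tsearch2's parser.
    words = re.findall(r"\w+", text.lower())
    return " & ".join(words)

print(plain_text_to_and_query("Fat cats, black rats"))  # prints fat & cats & black & rats
```

A string produced this way has the shape of the argument to_tsquery
expects, which is why an AND-only parser is the simplest one to offer.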



> 2. Neither OpenFTS nor tsearch2 supports exact phrase matching. I've seen the
> workaround to support matching a single exact phrase by modifying the WHERE
> clause with textcolumn ~* 'exact phrase'. Does this give reasonable
> performance? Has anyone implemented exact phrase matching in complex search
> expressions like ("exact phrase1" AND term1) OR (NOT "exact phrase2" AND
> "exact phrase3") ?

We don't plan to develop phrase search until we have a clean idea of how to
support complex queries and compound words; see the discussion at
http://www.pgsql.ru/db/mw/msg.html?mid=2111601

>
> 3. The following summarizes what I've read about performance and scalability
> of OpenFTS and/or tsearch2:
>
> a) don't expect OpenFTS/tsearch2 to perform/scale as well as dedicated
> search engines like Lucene, http://lucene.apache.org/,
> http://archives.postgresql.org/pgsql-general/2002-05/msg01156.php.
Yes, a GiST index is good for online updates, but it has problems with big sets.
We plan to add an inverted index to 8.2, with which tsearch2 will work at a
speed comparable to Lucene...

The first version has already been published; look for the announcement :)


> b) OR queries are slower than AND queries,
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/optimization.html.

Yes

> Do you agree with this summary? If you are using either OpenFTS or tsearch2
> in production, has the performance been acceptable? For my application I
> could be looking at several million documents averaging about 3 pages each
> (I only have ballpark figures at present).

We know of a tsearch2 installation working with 4 million docs.

>
> 4. If you are using either OpenFTS or tsearch2 in production why did you
> choose OpenFTS over tsearch2 or vice versa? One of the advantages of
> tsearch2 that I can see is that, once you have set up your database and
> indexed your documents, you can talk to the database directly from your
> application using SQL without needing to go through Perl first. This assumes
> that you're ok with tsearch2 search expression syntax so you can use
> functions like to_tsquery. It also assumes that you don't need sophisticated
> exact phrase matching.

OpenFTS can run on a different box than pgsql, and it can index files directly
from the file system.

>
> 5. Are there any scripts, tools, add-ons, etc. that you can recommend?

We can tweak OpenFTS/tsearch2 for you.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Please comment on the following OpenFTS/tsearch2 issues!

From: Vivek Khera
On Apr 26, 2006, at 3:17 AM, Teodor Sigaev wrote:

> We know of a tsearch2 installation working with 4 million docs.
>

What are the design goals for the size of the source tables?  My
engineers are telling me of things their friends have tried and have
hit limits of tsearch2.  One was importing a large message board
(millions of rows, a few sentences of text per row) and ran into
problems (which were not detailed).

Our interest is in using it for indexing mailing lists we host.
We're looking at about 100 or so messages per day right now, with
potential growth.  Short of actually implementing it and loading up
sample data,  what guidelines can you provide as to the limits of
tsearch2 source data size?

I can imagine having 10+ million rows of 4k-byte to 10k-byte long
messages within a couple of years.
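
For a sense of scale, the raw text volume implied by these figures can be
worked out with a few lines of Python. This is back-of-the-envelope
arithmetic only: it assumes 1k = 1024 bytes and exactly 10 million rows,
and it ignores row overhead and the size of any index. The function name
is made up for the example.

```python
def raw_text_bytes(rows, bytes_per_row):
    """Total bytes of message text, ignoring row overhead and indexes."""
    return rows * bytes_per_row

ROWS = 10_000_000
low = raw_text_bytes(ROWS, 4 * 1024)    # every message at 4k bytes
high = raw_text_bytes(ROWS, 10 * 1024)  # every message at 10k bytes

# Report in decimal gigabytes (1 GB = 10**9 bytes).
print(f"{low / 1e9:.2f} GB to {high / 1e9:.2f} GB")
```

That puts the source text alone somewhere between roughly 41 GB and
102 GB.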

Re: Please comment on the following OpenFTS/tsearch2

From: Oleg Bartunov
On Thu, 27 Apr 2006, Vivek Khera wrote:

>
> On Apr 26, 2006, at 3:17 AM, Teodor Sigaev wrote:
>
>> We know of a tsearch2 installation working with 4 million docs.
>>
>
> What are the design goals for the size of the source tables?  My engineers
> are telling me of things their friends have tried and have hit limits of
> tsearch2.  One was importing a large message board (millions of rows, a few
> sentences of text per row) and ran into problems (which were not detailed).
>
> Our interest is in using it for indexing mailing lists we host.  We're
> looking at about 100 or so messages per day right now, with potential growth.
> Short of actually implementing it and loading up sample data,  what
> guidelines can you provide as to the limits of tsearch2 source data size?
>
> I can imagine having 10+ million rows of 4k-byte to 10k-byte long messages
> within a couple of years.

It should be no problem with the inverted index we just posted. Search itself
is very fast! The problem is intrinsic to relational databases: reading
data from disk. If you find 100,000 results and you want to rank them,
you have to read them from disk, which is slow. That's why we use a caching
search daemon; on a 5-million blog collection we could get 1 million
searches/day on a server with 8 GB of RAM.




     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83