Thread: Please comment on the following OpenFTS/tsearch2 issues!
This topic was originally posted to the OpenFTS-general list on April 24, 2006. There were no replies in about 22 hours, so I'm reposting to this more active list.

I'm investigating OpenFTS and tsearch2 to see whether they provide enough full-text searching features to be used in a new application. I've run into a number of issues that I would appreciate feedback, comments, or workarounds on.

1. While tsearch2 provides fairly complete boolean search expression support with AND - &, OR - |, NOT - !, and grouping - (), OpenFTS appears to support only ANDing search terms. Is there some reason it hasn't been extended to support full tsearch2 search expressions? Has anyone modified OpenFTS to do this?

2. Neither OpenFTS nor tsearch2 supports exact phrase matching. I've seen the workaround of matching a single exact phrase by modifying the WHERE clause with textcolumn ~* "exact phrase". Does this give reasonable performance? Has anyone implemented exact phrase matching in complex search expressions like ("exact phrase1" AND term1) OR (NOT "exact phrase2" AND "exact phrase3")?

3. The following summarizes what I've read about the performance and scalability of OpenFTS and/or tsearch2:

a) Don't expect OpenFTS/tsearch2 to perform or scale as well as dedicated search engines like Lucene, http://lucene.apache.org/, http://archives.postgresql.org/pgsql-general/2002-05/msg01156.php.

b) OR queries are slower than AND queries, http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/optimization.html.

c) The design trade-offs favor online indexing over search performance and scalability; see the "Full text search engine" section in http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo

d) There are a number of things you can do to improve performance; see the thread starting at http://sourceforge.net/mailarchive/message.php?msg_id=11444008.

Do you agree with this summary? If you are using either OpenFTS or tsearch2 in production, has the performance been acceptable? For my application I could be looking at several million documents averaging about 3 pages each (I only have ballpark figures at present).

4. If you are using either OpenFTS or tsearch2 in production, why did you choose OpenFTS over tsearch2 or vice versa? One of the advantages of tsearch2 that I can see is that, once you have set up your database and indexed your documents, you can talk to the database directly from your application using SQL without needing to go through Perl first. This assumes that you're ok with the tsearch2 search expression syntax, so that you can use functions like to_tsquery. It also assumes that you don't need sophisticated exact phrase matching.

5. Are there any scripts, tools, add-ons, etc. that you can recommend?
On Apr 25, 2006, at 3:45 PM, Don Walker wrote:

> 2. Neither OpenFTS nor tsearch2 supports exact phrase matching. I've seen the
> workaround of matching a single exact phrase by modifying the WHERE
> clause with textcolumn ~* "exact phrase". Does this give reasonable
> performance?

It seems to work well for me, but I'm sure the results are highly data dependent. Performance will depend directly on the size and number of documents that must be searched sequentially for your phrase after making the initial cut on the indexed words.

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL
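The two-stage approach described above (an initial cut on the indexed words, then a sequential check of the survivors for the phrase) can be sketched in plain Python. This is only an illustration of the idea, not tsearch2 code: the in-memory index, the function names build_index and phrase_search, and the sample documents are all hypothetical, and the regular-expression check merely plays the role of the ~* test in the WHERE clause.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def phrase_search(docs, index, phrase):
    """Return sorted ids of documents containing the phrase (case-insensitive)."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return []
    # Stage 1: indexed cut, the AND of every word in the phrase.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Stage 2: sequential check, run only on the documents that survived.
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return sorted(d for d in candidates if pattern.search(docs[d]))


# Hypothetical sample data.
docs = {
    1: "Full text search in PostgreSQL",
    2: "Search the full archive of text",
    3: "No match here",
}
index = build_index(docs)
print(phrase_search(docs, index, "full text"))
```

Document 2 contains both words and so passes the first stage, but it is rejected by the second. The cost of the second stage grows with the number and size of such candidates, which is why the results are data dependent.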
> 1. While tsearch2 provides fairly complete boolean search expression support
> with AND - &, OR - |, NOT - !, and grouping - (), OpenFTS appears to support
> only ANDing search terms. Is there some reason it hasn't been extended to
> support full tsearch2 search expressions? Has anyone modified OpenFTS to do
> this?

Historical reasons and simplicity, nothing more. We haven't modified OpenFTS. People often ask us about text -> tsquery conversion, so 8.2 will include plainto_tsquery(), which returns the same result as the OpenFTS query parser.

> 2. Neither OpenFTS nor tsearch2 supports exact phrase matching. I've seen the
> workaround of matching a single exact phrase by modifying the WHERE
> clause with textcolumn ~* "exact phrase". Does this give reasonable
> performance? Has anyone implemented exact phrase matching in complex search
> expressions like ("exact phrase1" AND term1) OR (NOT "exact phrase2" AND
> "exact phrase3")?

We don't plan to develop phrase search until we have a clean idea of how to support complex queries and compound words; see the discussion at http://www.pgsql.ru/db/mw/msg.html?mid=2111601

> 3. The following summarizes what I've read about the performance and
> scalability of OpenFTS and/or tsearch2:
>
> a) Don't expect OpenFTS/tsearch2 to perform or scale as well as dedicated
> search engines like Lucene, http://lucene.apache.org/,
> http://archives.postgresql.org/pgsql-general/2002-05/msg01156.php.

Yes, the GiST index is good for online updates but has problems with big sets. We plan to add an inverted index to 8.2, with which tsearch2 should work at a speed comparable to Lucene's. A first version has already been published; look for the announcement :)

> b) OR queries are slower than AND queries,
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/optimization.html.

Yes.

> Do you agree with this summary? If you are using either OpenFTS or tsearch2
> in production, has the performance been acceptable? For my application I
> could be looking at several million documents averaging about 3 pages each
> (I only have ballpark figures at present).

We know of a tsearch2 installation working with 4 million docs.

> 4. If you are using either OpenFTS or tsearch2 in production, why did you
> choose OpenFTS over tsearch2 or vice versa? One of the advantages of
> tsearch2 that I can see is that, once you have set up your database and
> indexed your documents, you can talk to the database directly from your
> application using SQL without needing to go through Perl first. This assumes
> that you're ok with the tsearch2 search expression syntax, so that you can
> use functions like to_tsquery. It also assumes that you don't need
> sophisticated exact phrase matching.

OpenFTS can run on a different box from pgsql, and OpenFTS can index files directly from the file system.

> 5. Are there any scripts, tools, add-ons, etc. that you can recommend?

We can tweak OpenFTS/tsearch2 for you.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
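As a rough picture of the text -> tsquery conversion mentioned in the reply above, the sketch below joins the words of a free-text query with the tsquery AND operator (&). It is an assumption-laden toy, not the real plainto_tsquery(): the function name plain_to_and_query is hypothetical, and it only lowercases and splits words, with no stemming, dictionary lookup, or stop-word removal.

```python
import re


def plain_to_and_query(text):
    """Join the words of free text with the tsquery AND operator (&)."""
    # Simplification: split on non-word characters and lowercase; a real
    # text search parser and its dictionaries do considerably more.
    words = re.findall(r"\w+", text.lower())
    return " & ".join(words)


print(plain_to_and_query("Full text search"))
```

The resulting string has the shape of an AND-only search expression, which matches the AND-only behaviour of OpenFTS described in the first question.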
On Apr 26, 2006, at 3:17 AM, Teodor Sigaev wrote:

> We know of a tsearch2 installation working with 4 million docs.

What are the design goals for the size of the source tables? My engineers tell me of friends who have tried tsearch2 and hit its limits. One was importing a large message board (millions of rows, a few sentences of text per row) and ran into problems (which were not detailed).

Our interest is in using it for indexing mailing lists we host. We're looking at about 100 or so messages per day right now, with potential growth. Short of actually implementing it and loading up sample data, what guidelines can you provide as to the limits of tsearch2 source data size?

I can imagine having 10+ million rows of 4k-byte to 10k-byte long messages within a couple of years.
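For a sense of scale, the figures above can be turned into a raw text volume. The short calculation below assumes exactly 10 million rows and takes 1 kB as 1000 bytes; it covers only the message text itself and says nothing about index size or table overhead, which the thread does not quantify. The helper name raw_text_gb is hypothetical.

```python
ROWS = 10_000_000  # assumption: the lower bound of "10+ million rows"


def raw_text_gb(rows, bytes_per_row):
    """Raw text volume in gigabytes (1 GB = 10**9 bytes)."""
    return rows * bytes_per_row / 1e9


low = raw_text_gb(ROWS, 4_000)    # 4k-byte messages
high = raw_text_gb(ROWS, 10_000)  # 10k-byte messages
print(f"{low:.0f} GB to {high:.0f} GB of raw message text")
```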
On Thu, 27 Apr 2006, Vivek Khera wrote:

> On Apr 26, 2006, at 3:17 AM, Teodor Sigaev wrote:
>
>> We know of a tsearch2 installation working with 4 million docs.
>
> What are the design goals for the size of the source tables? My engineers
> tell me of friends who have tried tsearch2 and hit its limits. One was
> importing a large message board (millions of rows, a few sentences of text
> per row) and ran into problems (which were not detailed).
>
> Our interest is in using it for indexing mailing lists we host. We're
> looking at about 100 or so messages per day right now, with potential growth.
> Short of actually implementing it and loading up sample data, what
> guidelines can you provide as to the limits of tsearch2 source data size?
>
> I can imagine having 10+ million rows of 4k-byte to 10k-byte long messages
> within a couple of years.

That should be no problem with the inverted index we just posted. The search itself is very fast! The problem is intrinsic to relational databases: reading data from disk. If you find 100,000 results and want to rank them, you have to read them from the hard disk, which is slow. That's why we use a caching search daemon, and on a 5 mln blog collection we could get 1 mln searches/day on a server with 8 GB of RAM.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83