Thread: TSearch2 vs. Apache Lucene

TSearch2 vs. Apache Lucene

From
Joshua Kramer
Date:
Greetings all,

I'm going to do a performance comparison with DocMgr and PG81/TSearch2 on
one end, and Apache Lucene on the other end.

In order to do this, I'm going to create a derivative of the
docmgr-autoimport script so that I can specify one file to import at a
time.  I'll then create a Perl script which logs all details (such as
timing, etc.) as the test progresses.

As test data, I have approximately 9,000 text files from Project Gutenberg
ranging in size from a few hundred bytes to 4.5M.

I plan to test the speed of import of each file.  Then, I plan to write a
web-robot in Perl that will test the speed and number of results returned.
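
A minimal sketch of the per-file timing log described above, shown in Python
rather than Perl purely for illustration.  The names `time_imports` and
`import_one` are hypothetical, and the stand-in importer only sleeps; a real
run would presumably wrap the single-file variant of docmgr-autoimport.

```python
import csv
import io
import time
from typing import Callable, Iterable


def time_imports(paths: Iterable[str],
                 import_one: Callable[[str], None],
                 log_file) -> int:
    """Run import_one on each path and log the path and elapsed seconds as CSV."""
    writer = csv.writer(log_file)
    writer.writerow(["path", "seconds"])
    count = 0
    for path in paths:
        start = time.perf_counter()
        import_one(path)  # hypothetical hook: imports exactly one file
        elapsed = time.perf_counter() - start
        writer.writerow([path, f"{elapsed:.6f}"])
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in importer (assumption) so the sketch runs without DocMgr.
    def fake_import(path: str) -> None:
        time.sleep(0.01)

    log = io.StringIO()
    imported = time_imports(["a.txt", "b.txt"], fake_import, log)
    print(imported, "files imported")
    print(log.getvalue())
```

The same loop could time search requests instead, by passing a function that
issues one query and by logging the number of results next to the elapsed time.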

Can anyone suggest a way to validate this test, or how I should configure
PG to maximise import and search speed?  Can I maximise both search speed and
import speed, or are those goals mutually exclusive?  (Note that this
will be run on limited hardware: a 900MHz Athlon with 512MB of RAM.)

Has anyone ever compared TSearch2 to Lucene, as far as performance is
concerned?

Thanks,
-Josh

Re: TSearch2 vs. Apache Lucene

From
Michael Riess
Date:
> Has anyone ever compared TSearch2 to Lucene, as far as performance is
> concerned?

I'll stay away from TSearch2 until it is fully integrated in the
postgres core (like "create index foo_text on foo (texta, textb) USING
TSearch2"). Because a full integration is unlikely to happen in the near
future (as far as I know), I'll stick to Lucene.

Mike

Re: TSearch2 vs. Apache Lucene

From
Oleg Bartunov
Date:
Folks,

tsearch2 and Lucene are very different search engines, so it would be an unfair
comparison. If you need full access to metadata and instant indexing,
you will probably find tsearch2 more suitable than Lucene. But if
you can live without those features and need to search read-only
archives, you need Lucene.

Integrating tsearch2 into pgsql would be cool, but I see no problem with
using tsearch2 as an official extension module. After we complete our
todo list, which we hope will happen for the 8.2 release, you could
forget about Lucene and other engines :) We'll be available for development
in spring and we estimate about three months for our todo list, so it's
really doable.

     Oleg

On Tue, 6 Dec 2005, Michael Riess wrote:

>
>> Has anyone ever compared TSearch2 to Lucene, as far as performance is
>> concerned?
>
> I'll stay away from TSearch2 until it is fully integrated in the postgres
> core (like "create index foo_text on foo (texta, textb) USING TSearch2").
> Because a full integration is unlikely to happen in the near future (as far
> as I know), I'll stick to Lucene.
>
> Mike

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: TSearch2 vs. Apache Lucene

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> Folks,
>
> tsearch2 and Lucene are very different search engines, so it would be an unfair
> comparison. If you need full access to metadata and instant indexing,
> you will probably find tsearch2 more suitable than Lucene. But if
> you can live without those features and need to search read-only
> archives, you need Lucene.
>
> Integrating tsearch2 into pgsql would be cool, but I see no problem with
> using tsearch2 as an official extension module. After we complete our
> todo list, which we hope will happen for the 8.2 release, you could
> forget about Lucene and other engines :) We'll be available for development
> in spring and we estimate about three months for our todo list, so it's
> really doable.

Agreed.  There isn't anything magical about a plug-in vs something
integrated, at least in PostgreSQL.  In other databases, plug-ins can't
function as fully as integrated features, but in PostgreSQL, everything is
really a plug-in because it is all abstracted.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: TSearch2 vs. Apache Lucene

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Oleg Bartunov wrote:
>> Integrating tsearch2 into pgsql would be cool, but I see no problem with
>> using tsearch2 as an official extension module.

> Agreed.  There isn't anything magical about a plug-in vs something
> integrated, at least in PostgreSQL.

The quality gap between contrib and the main system is a lot smaller
than it used to be, at least for those contrib modules that have
regression tests.  Main and contrib get equal levels of testing from
the buildfarm, so they're about on par as far as portability goes.
We could never say that before 8.1 ...

(Having said that, I think that tsearch2 will eventually become part
of core, but probably not for a while yet.)

            regards, tom lane

Re: TSearch2 vs. Apache Lucene

From
Michael Riess
Date:
Bruce Momjian wrote:
> Agreed.  There isn't anything magical about a plug-in vs something
> integrated, at least in PostgreSQL.  In other databases, plug-ins can't
> function as fully as integrated features, but in PostgreSQL, everything is
> really a plug-in because it is all abstracted.


I only remember evaluating TSearch2 about a year ago, and when I read
statements like "Vacuum and/or database dump/restore work differently
when using TSearch2, sql scripts need to be executed etc." I knew that I
would not want to go there.

But I don't doubt that it works, and that it is a sane concept.

Re: TSearch2 vs. Apache Lucene

From
Bruce Momjian
Date:
Michael Riess wrote:
> I only remember evaluating TSearch2 about a year ago, and when I read
> statements like "Vacuum and/or database dump/restore work differently
> when using TSearch2, sql scripts need to be executed etc." I knew that I
> would not want to go there.
>
> But I don't doubt that it works, and that it is a sane concept.

Good point.  I think we had some problems at that point because the API
was improved between versions.  Even if it had been integrated, we might
have had the same problem.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: TSearch2 vs. Apache Lucene

From
Russell Garrett
Date:
On 6 Dec 2005, at 16:47, Joshua Kramer wrote:
> Has anyone ever compared TSearch2 to Lucene, as far as performance
> is concerned?

In our experience (small, often-updated documents) Lucene leaves
tsearch2 in the dust. This probably has a lot to do with our usage
pattern though. For our usage it's very beneficial to have the index
on a separate machine from the data; however, in many cases this won't
make sense. Lucene is also a lot easier to "cluster" than Postgres
(it's simply a matter of NFS-mounting the index).

Russ Garrett
russ@last.fm

Re: TSearch2 vs. Apache Lucene

From
Christopher Kings-Lynne
Date:
...

So you'll avoid a non-core product and instead only use another non-core
product...?

Chris

Michael Riess wrote:
>
>> Has anyone ever compared TSearch2 to Lucene, as far as performance is
>> concerned?
>
>
> I'll stay away from TSearch2 until it is fully integrated in the
> postgres core (like "create index foo_text on foo (texta, textb) USING
> TSearch2"). Because a full integration is unlikely to happen in the near
> future (as far as I know), I'll stick to Lucene.
>
> Mike


Re: TSearch2 vs. Apache Lucene

From
Michael Riess
Date:
No, my problem is that using TSearch2 interferes with other core
components of postgres like (auto)vacuum or dump/restore.


> ...
>
> So you'll avoid a non-core product and instead only use another non-core
> product...?
>
> Chris

Re: TSearch2 vs. Apache Lucene

From
Christopher Kings-Lynne
Date:
> No, my problem is that using TSearch2 interferes with other core
> components of postgres like (auto)vacuum or dump/restore.

That's nonsense...seriously.

The only trick with dump/restore is that you have to install the
tsearch2 shared library before restoring.  That's the same as all
contribs though.

Chris


Re: TSearch2 vs. Apache Lucene

From
Michael Riess
Date:
Christopher Kings-Lynne wrote:
>> No, my problem is that using TSearch2 interferes with other core
>> components of postgres like (auto)vacuum or dump/restore.
>
> That's nonsense...seriously.
>
> The only trick with dump/restore is that you have to install the
> tsearch2 shared library before restoring.  That's the same as all
> contribs though.

Well, then it has changed since I last read the documentation. That was
about a year ago, and we have been using Lucene since then ... and as it works
quite nicely, I see no reason to switch to TSearch2. Including it in
the pgsql core would make it much more attractive to me, as it seems to
me that features become more stable once they are included in the core.
Call me paranoid, if you must ... ;-)

