Thread: For full text indexing, which is better, tsearch2 or fulltextindex

For full text indexing, which is better, tsearch2 or fulltextindex

From
LIANHE SHAO
Date:
Hi all,
Which one is better (performance/easier to use),
tsearch2 or fulltextindex?
there is an example how to use fulltextindex in the
techdocs, but I checked the contrib/fulltextindex
package, there is a WARNING that fulltextindex is
much slower than tsearch2. but tsearch2 seems
complex to use, and I can not find a good example.
Which one I should use? Any suggestions?

thanks and Regards,
William

----- Original Message -----
From: Hannu Krosing <hannu@tm.ee>
Date: Wednesday, November 26, 2003 5:33 pm
Subject: Re: [PERFORM] why index scan not working
when using 'like'?

> Tom Lane kirjutas T, 25.11.2003 kell 23:29:
> > Josh Berkus <josh@agliodbs.com> writes:
> > > In regular text fields containing words, your
problem is
> solvable with full
> > > text indexing (FTI).   Unfortunately, FTI is
not designed for
> arbitrary
> > > non-language strings.  It could be adapted,
but would require a
> lot of
> > > hacking.
> >
> > I'm not sure why you say that FTI isn't a usable
solution.  As
> long as
> > the gene symbols are separated by whitespace or
some other non-
> letters> (eg, "foo mif bar" not "foomifbar"), I'd
think FTI would
> work.
> If he wants to search on arbitrary substring, he
could change
> tokeniserin FTI to produce trigrams, so that
"foomifbar" would be
> indexed as if
> it were text "foo oom omi mif ifb fba bar" and
search for things like
> %mifb% should first do a FTI search for "mif" AND
"ifb" and then
> simpleLIKE %mifb% to weed out something like "mififb".
>
> There are ways to use trigrams for 1 and 2 letter
matches as well.
>
> -------------
> Hannu
>
>
> ---------------------------(end of
broadcast)-----------------------
> ----
> TIP 3: if posting/reading through Usenet, please
send an appropriate
>      subscribe-nomail command to
majordomo@postgresql.org so that
> your      message can get through to the mailing
list cleanly
>


Re: For full text indexing, which is better, tsearch2 or

From
Christopher Kings-Lynne
Date:
> Which one is better (performance/easier to use),
> tsearch2 or fulltextindex?
> there is an example how to use fulltextindex in the
> techdocs, but I checked the contrib/fulltextindex
> package, there is a WARNING that fulltextindex is
> much slower than tsearch2. but tsearch2 seems
> complex to use, and I can not find a good example.
> Which one I should use? Any suggestions?

I believe I wrote that warning :)

Tsearch2 is what you should use.  Yes, it's more complicated but it's
HEAPS faster and seriously powerful.

Just read the README file.

You could also try out the original tsearch (V1), but that will probably
be superceded soon, now that tsearch2 is around.

Chris



Re: For full text indexing, which is better, tsearch2 or

From
Dror Matalon
Date:
On Thu, Nov 27, 2003 at 08:51:14AM +0800, Christopher Kings-Lynne wrote:
> >Which one is better (performance/easier to use),
> >tsearch2 or fulltextindex?
> >there is an example how to use fulltextindex in the
> >techdocs, but I checked the contrib/fulltextindex
> >package, there is a WARNING that fulltextindex is
> >much slower than tsearch2. but tsearch2 seems
> >complex to use, and I can not find a good example.
> >Which one I should use? Any suggestions?
>
> I believe I wrote that warning :)
>
> Tsearch2 is what you should use.  Yes, it's more complicated but it's
> HEAPS faster and seriously powerful.
>

Can you provide some numbers please, both for creating full text indexes
as well as for searching them? I tried to use tsearch and it seemed like
just creating a full text index on million+ records took forever.

> Just read the README file.
>
> You could also try out the original tsearch (V1), but that will probably
> be superceded soon, now that tsearch2 is around.
>
> Chris
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faqs/FAQ.html

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

Re: For full text indexing, which is better, tsearch2 or

From
Steve Atkins
Date:
On Thu, Nov 27, 2003 at 08:51:14AM +0800, Christopher Kings-Lynne wrote:
> >Which one is better (performance/easier to use),
> >tsearch2 or fulltextindex?
> >there is an example how to use fulltextindex in the
> >techdocs, but I checked the contrib/fulltextindex
> >package, there is a WARNING that fulltextindex is
> >much slower than tsearch2. but tsearch2 seems
> >complex to use, and I can not find a good example.
> >Which one I should use? Any suggestions?
>
> I believe I wrote that warning :)
>
> Tsearch2 is what you should use.  Yes, it's more complicated but it's
> HEAPS faster and seriously powerful.

Does anyone have any metrics on how fast tsearch2 actually is?

I tried it on a synthetic dataset of a million documents of a hundred
words each and while insertions were impressively fast I gave up on
the search after 10 minutes.

Broken? Unusable slow? This was on the last 7.4 release candidate.

Cheers,
  Steve

Re: For full text indexing, which is better, tsearch2 or

From
Christopher Kings-Lynne
Date:
> Does anyone have any metrics on how fast tsearch2 actually is?
>
> I tried it on a synthetic dataset of a million documents of a hundred
> words each and while insertions were impressively fast I gave up on
> the search after 10 minutes.
>
> Broken? Unusable slow? This was on the last 7.4 release candidate.

I just created a 1.1million row dataset by copying one of our 30000 row
production tables and just taking out the txtidx column.  Then I
inserted it into itself until it had 1.1 million rows.

Then I created the GiST index - THAT took forever - seriously like 20
mins or half an hour or something.

Now, to find a word:

select * from tsearchtest where ftiidx ## 'curry';
Time: 9760.75 ms

The AND of two words:
Time: 103.61 ms

The AND of three words:
select * from tsearchtest where ftiidx ## 'curry&green&thai';
Time: 61.86 ms

And now a one word query now that buffers are cached:
select * from tsearchtest where ftiidx ## 'curry';
Time: 444.89 ms

So, I have no idea why you think it's slow?  Perhaps you forgot the
'create index using gist' step?

Also, if you use the NOT (!) operand, you can get yourself into a really
slow situation.

Chris




Re: For full text indexing, which is better, tsearch2 or

From
Steve Atkins
Date:
On Thu, Nov 27, 2003 at 12:41:59PM +0800, Christopher Kings-Lynne wrote:
> >Does anyone have any metrics on how fast tsearch2 actually is?
> >
> >I tried it on a synthetic dataset of a million documents of a hundred
> >words each and while insertions were impressively fast I gave up on
> >the search after 10 minutes.
> >
> >Broken? Unusable slow? This was on the last 7.4 release candidate.
>
> I just created a 1.1million row dataset by copying one of our 30000 row
> production tables and just taking out the txtidx column.  Then I
> inserted it into itself until it had 1.1 million rows.
>
> Then I created the GiST index - THAT took forever - seriously like 20
> mins or half an hour or something.
>
> Now, to find a word:
>
> select * from tsearchtest where ftiidx ## 'curry';
> Time: 9760.75 ms

> So, I have no idea why you think it's slow?  Perhaps you forgot the
> 'create index using gist' step?

No, it was indexed.

Thanks, that was the datapoint I was looking for. It _can_ run fast, so
I just need to work out what's going on. (It's hard to diagnose a slow
query when you've no idea whether it's really 'slow').

Cheers,
  Steve

Re: For full text indexing, which is better, tsearch2 or

From
Steve Atkins
Date:
On Wed, Nov 26, 2003 at 09:12:30PM -0800, Steve Atkins wrote:
> On Thu, Nov 27, 2003 at 12:41:59PM +0800, Christopher Kings-Lynne wrote:
> > >Does anyone have any metrics on how fast tsearch2 actually is?
> > >
> > >I tried it on a synthetic dataset of a million documents of a hundred
> > >words each and while insertions were impressively fast I gave up on
> > >the search after 10 minutes.
> > >
> > >Broken? Unusable slow? This was on the last 7.4 release candidate.
> >
> > I just created a 1.1million row dataset by copying one of our 30000 row
> > production tables and just taking out the txtidx column.  Then I
> > inserted it into itself until it had 1.1 million rows.
> >
> > Then I created the GiST index - THAT took forever - seriously like 20
> > mins or half an hour or something.
> >
> > Now, to find a word:
> >
> > select * from tsearchtest where ftiidx ## 'curry';
> > Time: 9760.75 ms
>
> > So, I have no idea why you think it's slow?  Perhaps you forgot the
> > 'create index using gist' step?
>
> No, it was indexed.
>
> Thanks, that was the datapoint I was looking for. It _can_ run fast, so
> I just need to work out what's going on. (It's hard to diagnose a slow
> query when you've no idea whether it's really 'slow').

Looking at it further, something is very broken, possibly with GIST
indices, possibly with tsearch2s use of 'em.

This is on a newly built 7.4 installation, built with 64 bit
datetimes, but completely stock other than that. Stock gcc 3.3.2,
Linux, somewhat elderly 2.4.18 kernel. Running on a 1.5GHz single
processor Athlon with a half gig of RAM. Configuration set to use 20%
of RAM as shared buffers (amongst other settings, this was the last of
a range I tried looking for variation).

Software RAID0 across two 7200RPM SCSI drives, reiserfs (it's a
development box, not a production system). System completely idle
apart from postgresql.

269000 rows, each row having 400 words. Analyzed.

Running the select query given below appears to pause a process trying
to insert into the table completely (locking issue? I/O bandwidth?).

top shows the select below consuming <2% of CPU and iostat shows it reading
~2800 blocks/second from each of the two RAID drives.

Physical size of the database is under 3 gigs, including toast and index
tables.

The select query takes around 6 minutes (consistently, even if the same
identical query is repeated).

For entertainment, I turned off indexscan and the query takes 1
minute with a simple seqscan.

Any thoughts?

Cheers,
  Steve

=> select count(*) from ftstest;
 count
--------
 269000
(1 row)

=> \d ftstest
                            Table "public.ftstest"
 Column |   Type   |                        Modifiers
--------+----------+----------------------------------------------------------
 idx    | integer  | not null default nextval('public.ftstest_idx_seq'::text)
 words  | text     | not null
 idxfti | tsvector | not null
Indexes:
    "ftstest_idx" gist (idxfti)

=> explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery;
                                                            QUERY PLAN
           

-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ftstest_idx on ftstest  (cost=0.00..515.90 rows=271 width=4) (actual time=219.694..376042.428
rows=4796loops=1) 
   Index Cond: (idxfti @@ '\'dominican\''::tsquery)
   Filter: (idxfti @@ '\'dominican\''::tsquery)
 Total runtime: 376061.541 ms
(4 rows)


((Set enable_indexscan=false))


=> explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on ftstest  (cost=0.00..5765.88 rows=271 width=4) (actual time=42.589..62158.285 rows=4796 loops=1)
   Filter: (idxfti @@ '\'dominican\''::tsquery)
 Total runtime: 62182.277 ms
(3 rows)

Re: For full text indexing, which is better, tsearch2 or

From
Christopher Kings-Lynne
Date:
> Any thoughts?

Actually, I ran my tests using tsearch V1.  I wonder if there has been
some weird regression between tsearch 1 and 2?

hris



Re: For full text indexing, which is better, tsearch2 or

From
Christopher Kings-Lynne
Date:
>> Any thoughts?
>
>
> Actually, I ran my tests using tsearch V1.  I wonder if there has been
> some weird regression between tsearch 1 and 2?

I also ran my tests on 7.3.4 :(

Chris



Re: For full text indexing, which is better, tsearch2 or

From
Steve Atkins
Date:
On Fri, Nov 28, 2003 at 01:18:48PM +0800, Christopher Kings-Lynne wrote:
>
> >Any thoughts?
>
> Actually, I ran my tests using tsearch V1.  I wonder if there has been
> some weird regression between tsearch 1 and 2?

Maybe. tsearch2 doesn't seem production ready in other respects
(untsearch2.sql barfs with 'aggregate stat(tsvector) does not exist'
and the openfts mailing list, where this would be more appropriate,
doesn't appear to exist according to sourceforge).

So, using the same data, modulo a few alter tables, I try tsearch, V1.
It's a little slower than V2, and again runs far faster without an
index than with it. Broken in the same way.

I have 7.2.4 running on a Sun box, so I tried that too, with similar
results. tsearch just doesn't seem to work very well on this dataset
(or any other large dataset I've tried).

Cheers,
  Steve

Re: For full text indexing, which is better, tsearch2 or

From
Christopher Kings-Lynne
Date:
> I have 7.2.4 running on a Sun box, so I tried that too, with similar
> results. tsearch just doesn't seem to work very well on this dataset
> (or any other large dataset I've tried).

Well, as I've shown - works fine for me...

I strongly suggest you repost your problem report to -hackers, since the
fact that the tsearch developers haven't chimed in implies to me that
they don't watch the performance list.

BTW, read this about Gist indexes:

http://www.postgresql.org/docs/current/static/limitations.html

(Note lack of concurrency)

Chris


Dump restoration via archive files

From
"Kamalraj Singh Madhan"
Date:
Hi,
    I'am taking dump of a huge database and do not want the restoration of
that dump to take a lot of time as is the case when you take the dump in
text files. I want to take the dump as an archive file and get it restored
in very less time. I'am not able to figure out what is the command for
taking dump of a database in a archive file. Kindly help it's urgent.

thanks and regards
Kamalraj Singh


Re: Dump restoration via archive files

From
Jeff
Date:
On Mon, 1 Dec 2003 15:47:47 +0530
"Kamalraj Singh Madhan" <kamalr@networkprograms.com> wrote:

> Hi,
>     I'am taking dump of a huge database and do not want the
>     restoration of
> that dump to take a lot of time as is the case when you take the dump
> in text files. I want to take the dump as an archive file and get it
> restored in very less time. I'am not able to figure out what is the
> command for taking dump of a database in a archive file. Kindly help
> it's urgent.
>

Fast backups are an area PG needs work in. Currently, PG has no 'archive
file backup'.  You do have the following options to get around this:

1. Take big db offline, copy $PGDATA.  Has a restore time of how long it
takes to copy $PGDATA (And optionally untar/gzip), bring db back online

2. If you are using an LVM, take a snapshot and copy the data. Like #1,
it also has a "0" restore time.

3. If you are using a pg_dump generated dump, be sure to really jack up
your sort_mem - this will be a HUGE benefit when creating indexes & if
you are using 7.4, adding the foriegn keys.  Also turning off fsync
(Don't forget to turn it back on after your restore!) cna give you some
nice speed increases.

4. If you are not using 7.4 and using pg_dump, there isn't much you can
do about adding foreign keys going stupidly slow :(


--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/

Re: Dump restoration via archive files

From
Stephan Szabo
Date:
On Mon, 1 Dec 2003, Jeff wrote:

> On Mon, 1 Dec 2003 15:47:47 +0530
> "Kamalraj Singh Madhan" <kamalr@networkprograms.com> wrote:
>
> 4. If you are not using 7.4 and using pg_dump, there isn't much you can
> do about adding foreign keys going stupidly slow :(

You can take a schema dump and a separate data only dump where the latter
specifies --disable-triggers which should disable the checks when the data
is being added.