Re: For full text indexing, which is better, tsearch2 or - Mailing list pgsql-performance

From Steve Atkins
Subject Re: For full text indexing, which is better, tsearch2 or
Date
Msg-id 20031128050417.GA14227@gp.word-to-the-wise.com
Whole thread Raw
In response to Re: For full text indexing, which is better, tsearch2 or  (Steve Atkins <steve@blighty.com>)
Responses Re: For full text indexing, which is better, tsearch2 or
List pgsql-performance
On Wed, Nov 26, 2003 at 09:12:30PM -0800, Steve Atkins wrote:
> On Thu, Nov 27, 2003 at 12:41:59PM +0800, Christopher Kings-Lynne wrote:
> > >Does anyone have any metrics on how fast tsearch2 actually is?
> > >
> > >I tried it on a synthetic dataset of a million documents of a hundred
> > >words each and while insertions were impressively fast I gave up on
> > >the search after 10 minutes.
> > >
> > >Broken? Unusable slow? This was on the last 7.4 release candidate.
> >
> > I just created a 1.1million row dataset by copying one of our 30000 row
> > production tables and just taking out the txtidx column.  Then I
> > inserted it into itself until it had 1.1 million rows.
> >
> > Then I created the GiST index - THAT took forever - seriously like 20
> > mins or half an hour or something.
> >
> > Now, to find a word:
> >
> > select * from tsearchtest where ftiidx ## 'curry';
> > Time: 9760.75 ms
>
> > So, I have no idea why you think it's slow?  Perhaps you forgot the
> > 'create index using gist' step?
>
> No, it was indexed.
>
> Thanks, that was the datapoint I was looking for. It _can_ run fast, so
> I just need to work out what's going on. (It's hard to diagnose a slow
> query when you've no idea whether it's really 'slow').

Looking at it further, something is very broken, possibly with GIST
indices, possibly with tsearch2s use of 'em.

This is on a newly built 7.4 installation, built with 64 bit
datetimes, but completely stock other than that. Stock gcc 3.3.2,
Linux, somewhat elderly 2.4.18 kernel. Running on a 1.5GHz single
processor Athlon with a half gig of RAM. Configuration set to use 20%
of RAM as shared buffers (amongst other settings, this was the last of
a range I tried looking for variation).

Software RAID0 across two 7200RPM SCSI drives, reiserfs (it's a
development box, not a production system). System completely idle
apart from postgresql.

269000 rows, each row having 400 words. Analyzed.

Running the select query given below appears to pause a process trying
to insert into the table completely (locking issue? I/O bandwidth?).

top shows the select below consuming <2% of CPU and iostat shows it reading
~2800 blocks/second from each of the two RAID drives.

Physical size of the database is under 3 gigs, including toast and index
tables.

The select query takes around 6 minutes (consistently, even if the same
identical query is repeated).

For entertainment, I turned off indexscan and the query takes 1
minute with a simple seqscan.

Any thoughts?

Cheers,
  Steve

=> select count(*) from ftstest;
 count
--------
 269000
(1 row)

=> \d ftstest
                            Table "public.ftstest"
 Column |   Type   |                        Modifiers
--------+----------+----------------------------------------------------------
 idx    | integer  | not null default nextval('public.ftstest_idx_seq'::text)
 words  | text     | not null
 idxfti | tsvector | not null
Indexes:
    "ftstest_idx" gist (idxfti)

=> explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery;
                                                            QUERY PLAN
           

-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ftstest_idx on ftstest  (cost=0.00..515.90 rows=271 width=4) (actual time=219.694..376042.428
rows=4796loops=1) 
   Index Cond: (idxfti @@ '\'dominican\''::tsquery)
   Filter: (idxfti @@ '\'dominican\''::tsquery)
 Total runtime: 376061.541 ms
(4 rows)


((Set enable_indexscan=false))


=> explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on ftstest  (cost=0.00..5765.88 rows=271 width=4) (actual time=42.589..62158.285 rows=4796 loops=1)
   Filter: (idxfti @@ '\'dominican\''::tsquery)
 Total runtime: 62182.277 ms
(3 rows)

pgsql-performance by date:

Previous
From: Stefan Champailler
Date:
Subject: Re: Impossibly slow DELETEs
Next
From: Christopher Kings-Lynne
Date:
Subject: Re: For full text indexing, which is better, tsearch2 or