Re: For full text indexing, which is better, tsearch2 or - Mailing list pgsql-performance
From | Steve Atkins |
---|---|
Subject | Re: For full text indexing, which is better, tsearch2 or |
Date | |
Msg-id | 20031128050417.GA14227@gp.word-to-the-wise.com Whole thread Raw |
In response to | Re: For full text indexing, which is better, tsearch2 or (Steve Atkins <steve@blighty.com>) |
Responses |
Re: For full text indexing, which is better, tsearch2 or
|
List | pgsql-performance |
On Wed, Nov 26, 2003 at 09:12:30PM -0800, Steve Atkins wrote: > On Thu, Nov 27, 2003 at 12:41:59PM +0800, Christopher Kings-Lynne wrote: > > >Does anyone have any metrics on how fast tsearch2 actually is? > > > > > >I tried it on a synthetic dataset of a million documents of a hundred > > >words each and while insertions were impressively fast I gave up on > > >the search after 10 minutes. > > > > > >Broken? Unusable slow? This was on the last 7.4 release candidate. > > > > I just created a 1.1million row dataset by copying one of our 30000 row > > production tables and just taking out the txtidx column. Then I > > inserted it into itself until it had 1.1 million rows. > > > > Then I created the GiST index - THAT took forever - seriously like 20 > > mins or half an hour or something. > > > > Now, to find a word: > > > > select * from tsearchtest where ftiidx ## 'curry'; > > Time: 9760.75 ms > > > So, I have no idea why you think it's slow? Perhaps you forgot the > > 'create index using gist' step? > > No, it was indexed. > > Thanks, that was the datapoint I was looking for. It _can_ run fast, so > I just need to work out what's going on. (It's hard to diagnose a slow > query when you've no idea whether it's really 'slow'). Looking at it further, something is very broken, possibly with GIST indices, possibly with tsearch2s use of 'em. This is on a newly built 7.4 installation, built with 64 bit datetimes, but completely stock other than that. Stock gcc 3.3.2, Linux, somewhat elderly 2.4.18 kernel. Running on a 1.5GHz single processor Athlon with a half gig of RAM. Configuration set to use 20% of RAM as shared buffers (amongst other settings, this was the last of a range I tried looking for variation). Software RAID0 across two 7200RPM SCSI drives, reiserfs (it's a development box, not a production system). System completely idle apart from postgresql. 269000 rows, each row having 400 words. Analyzed. Running the select query given below appears to pause a process trying to insert into the table completely (locking issue? I/O bandwidth?). top shows the select below consuming <2% of CPU and iostat shows it reading ~2800 blocks/second from each of the two RAID drives. Physical size of the database is under 3 gigs, including toast and index tables. The select query takes around 6 minutes (consistently, even if the same identical query is repeated). For entertainment, I turned off indexscan and the query takes 1 minute with a simple seqscan. Any thoughts? Cheers, Steve => select count(*) from ftstest; count -------- 269000 (1 row) => \d ftstest Table "public.ftstest" Column | Type | Modifiers --------+----------+---------------------------------------------------------- idx | integer | not null default nextval('public.ftstest_idx_seq'::text) words | text | not null idxfti | tsvector | not null Indexes: "ftstest_idx" gist (idxfti) => explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Index Scan using ftstest_idx on ftstest (cost=0.00..515.90 rows=271 width=4) (actual time=219.694..376042.428 rows=4796loops=1) Index Cond: (idxfti @@ '\'dominican\''::tsquery) Filter: (idxfti @@ '\'dominican\''::tsquery) Total runtime: 376061.541 ms (4 rows) ((Set enable_indexscan=false)) => explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery; QUERY PLAN -------------------------------------------------------------------------------------------------------------- Seq Scan on ftstest (cost=0.00..5765.88 rows=271 width=4) (actual time=42.589..62158.285 rows=4796 loops=1) Filter: (idxfti @@ '\'dominican\''::tsquery) Total runtime: 62182.277 ms (3 rows)
pgsql-performance by date: