Re: tsearch2 poor performance - Mailing list pgsql-admin
From | Oleg Bartunov |
---|---|
Subject | Re: tsearch2 poor performance |
Date | |
Msg-id | Pine.GSO.4.58.0409291600370.14980@ra.sai.msu.su Whole thread Raw |
In response to | Re: tsearch2 poor performance (Kris Kiger <kris@musicrebellion.com>) |
List | pgsql-admin |
Kris, we're working on prototype of tsearchd - full text search daemon, which maintain static inverted index outside of postgresql using the same parser, dictionary tsearch2 does. This approach could scale up fts capability preserving access to metadata, so yo may have "archive" part of your collection (tsearchd) and "online", which could be searchable with tsearch2. Here is what we have right now: pages ( tid integer, fts_index tsvector) 1. Create index select count(tdindex(tid,fts_index)) from pages; 2. Flush index select tdflush(); 3. Search select pages.tid, rank(fts_index,to_tsquery('supernovae & magellan')) as rank from pages, tdsearch(to_tsquery('supernovae & magellan')) as idx where tid=idx order by rank desc; If it's possible, you could share your data, so we could test our prototype on real data. Oleg On Mon, 27 Sep 2004, Kris Kiger wrote: > Yes, it is much better than no index of sequential scan. We may just be > looking at the best performance tsearch2 can offer on my machine. > > search_test=# explain analyze SELECT count(q) FROM product, > to_tsquery('oil') AS q WHERE vector @@ q; > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual > time=83311.552..83311.555 rows=1 loops=1) > -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32) > (actual time=0.204..81960.198 rows=226357 loops=1) > Join Filter: ("outer".vector @@ "inner".q) > -> Seq Scan on product (cost=0.00..339752.00 rows=3000000 > width=32) (actual time=0.100..27415.795 rows=3000000 loops=1) > -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual > time=0.003..0.006 rows=1 loops=3000000) > -> Function Scan on q (cost=0.00..12.50 rows=1000 > width=32) (actual time=0.020..0.024 rows=1 loops=1) > Total runtime: 83311.735 ms > (7 rows) > > search_test=# explain analyze select count(*) from product where > description like '% oil %'; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------ > Aggregate (cost=347264.01..347264.01 rows=1 width=0) (actual > time=39858.350..39858.353 rows=1 loops=1) > -> Seq Scan on product (cost=0.00..347252.00 rows=4801 width=0) > (actual time=0.100..38320.293 rows=226357 loops=1) > Filter: (description ~~ '% oil %'::text) > Total runtime: 39858.491 ms > > > >>Oleg, > >> > >> Thanks for the help on this. > >> > >> The query I used to return the 508 number is: > >> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc > >>desc, word ; > >> > >> Testing says, the more words I use, the faster the query is. My > >>original search word, 'oil', appears in 226,357 documents 233,266 times. > >> As far as distinct words go, 'oil' is middle of the road for > >>occurences. As it is set up now, the best search time I am getting on > >>this single word is roughly 22 seconds. > >> > >> > > > >Does this time (22 seconds) is still better than seq. scan (no index) > >or standard 'LIKE' ? > > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
pgsql-admin by date: