Thread: For full text indexing, which is better, tsearch2 or fulltextindex
Hi all,

Which one is better (performance/easier to use), tsearch2 or fulltextindex?
There is an example of how to use fulltextindex in the techdocs, but I checked
the contrib/fulltextindex package and there is a WARNING that fulltextindex is
much slower than tsearch2. However, tsearch2 seems complex to use, and I cannot
find a good example. Which one should I use? Any suggestions?

thanks and Regards,

William

----- Original Message -----
From: Hannu Krosing <hannu@tm.ee>
Date: Wednesday, November 26, 2003 5:33 pm
Subject: Re: [PERFORM] why index scan not working when using 'like'?

> Tom Lane kirjutas T, 25.11.2003 kell 23:29:
> > Josh Berkus <josh@agliodbs.com> writes:
> > > In regular text fields containing words, your problem is solvable with
> > > full text indexing (FTI). Unfortunately, FTI is not designed for
> > > arbitrary non-language strings. It could be adapted, but would require
> > > a lot of hacking.
> >
> > I'm not sure why you say that FTI isn't a usable solution. As long as
> > the gene symbols are separated by whitespace or some other non-letters
> > (eg, "foo mif bar" not "foomifbar"), I'd think FTI would work.
>
> If he wants to search on arbitrary substrings, he could change the
> tokeniser in FTI to produce trigrams, so that "foomifbar" would be indexed
> as if it were the text "foo oom omi mif ifb fba bar", and a search for
> something like %mifb% should first do an FTI search for "mif" AND "ifb"
> and then a simple LIKE %mifb% to weed out something like "mififb".
>
> There are ways to use trigrams for 1 and 2 letter matches as well.
>
> -------------
> Hannu
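[A rough sketch of the two-step trigram lookup Hannu describes, written with
tsearch2 syntax for concreteness; the table "genes", its columns, and the
assumption of a non-stemming configuration feeding trigrams into "symbols_fti"
are all hypothetical - the trigram tokeniser itself is the part that would
need hacking:]

-- symbols_fti is assumed to hold the trigrams of the symbols column.
-- First narrow the candidates via the index, then weed out false hits with LIKE.
SELECT *
  FROM genes
 WHERE symbols_fti @@ to_tsquery('mif & ifb')  -- trigrams taken from the search string 'mifb'
   AND symbols LIKE '%mifb%';                  -- rejects near-misses such as 'mififb'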
> Which one is better (performance/easier to use), tsearch2 or fulltextindex?
> There is an example of how to use fulltextindex in the techdocs, but I
> checked the contrib/fulltextindex package and there is a WARNING that
> fulltextindex is much slower than tsearch2. However, tsearch2 seems complex
> to use, and I cannot find a good example. Which one should I use? Any
> suggestions?

I believe I wrote that warning :)

Tsearch2 is what you should use. Yes, it's more complicated, but it's HEAPS
faster and seriously powerful. Just read the README file.

You could also try out the original tsearch (V1), but that will probably be
superseded soon, now that tsearch2 is around.

Chris
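[As a rough illustration of what the tsearch2 README walks through, a minimal
setup might look like the following; the table "docs" and its columns are made
up for the example, and depending on your locale you may need the two-argument
form to_tsvector('default', body):]

-- Assumes contrib/tsearch2 has already been loaded into the database.
-- Table and column names are illustrative only.
CREATE TABLE docs (
    id     serial PRIMARY KEY,
    body   text NOT NULL,
    idxfti tsvector
);

-- Fill the tsvector column and index it with GiST.
UPDATE docs SET idxfti = to_tsvector(body);
CREATE INDEX docs_fti_idx ON docs USING gist (idxfti);
VACUUM ANALYZE docs;

-- Keep idxfti current on INSERT and UPDATE.
CREATE TRIGGER docs_fti_update
    BEFORE INSERT OR UPDATE ON docs
    FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxfti, body);

-- Search.
SELECT id FROM docs WHERE idxfti @@ to_tsquery('curry & thai');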
On Thu, Nov 27, 2003 at 08:51:14AM +0800, Christopher Kings-Lynne wrote:
> > Which one is better (performance/easier to use), tsearch2 or fulltextindex?
> > There is an example of how to use fulltextindex in the techdocs, but I
> > checked the contrib/fulltextindex package and there is a WARNING that
> > fulltextindex is much slower than tsearch2. However, tsearch2 seems complex
> > to use, and I cannot find a good example. Which one should I use? Any
> > suggestions?
>
> I believe I wrote that warning :)
>
> Tsearch2 is what you should use. Yes, it's more complicated, but it's
> HEAPS faster and seriously powerful.

Can you provide some numbers please, both for creating full text indexes as
well as for searching them? I tried to use tsearch and it seemed like just
creating a full text index on million+ records took forever.

> Just read the README file.
>
> You could also try out the original tsearch (V1), but that will probably
> be superseded soon, now that tsearch2 is around.
>
> Chris

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com
On Thu, Nov 27, 2003 at 08:51:14AM +0800, Christopher Kings-Lynne wrote:
> > Which one is better (performance/easier to use), tsearch2 or fulltextindex?
> > There is an example of how to use fulltextindex in the techdocs, but I
> > checked the contrib/fulltextindex package and there is a WARNING that
> > fulltextindex is much slower than tsearch2. However, tsearch2 seems complex
> > to use, and I cannot find a good example. Which one should I use? Any
> > suggestions?
>
> I believe I wrote that warning :)
>
> Tsearch2 is what you should use. Yes, it's more complicated, but it's
> HEAPS faster and seriously powerful.

Does anyone have any metrics on how fast tsearch2 actually is?

I tried it on a synthetic dataset of a million documents of a hundred words
each, and while insertions were impressively fast, I gave up on the search
after 10 minutes.

Broken? Unusably slow? This was on the last 7.4 release candidate.

Cheers,
  Steve
> Does anyone have any metrics on how fast tsearch2 actually is?
>
> I tried it on a synthetic dataset of a million documents of a hundred
> words each, and while insertions were impressively fast, I gave up on
> the search after 10 minutes.
>
> Broken? Unusably slow? This was on the last 7.4 release candidate.

I just created a 1.1 million row dataset by copying one of our 30000 row
production tables and just taking out the txtidx column. Then I inserted it
into itself until it had 1.1 million rows.

Then I created the GiST index - THAT took forever - seriously like 20 mins or
half an hour or something.

Now, to find a word:

select * from tsearchtest where ftiidx ## 'curry';
Time: 9760.75 ms

The AND of two words:

Time: 103.61 ms

The AND of three words:

select * from tsearchtest where ftiidx ## 'curry&green&thai';
Time: 61.86 ms

And now a one-word query, now that the buffers are cached:

select * from tsearchtest where ftiidx ## 'curry';
Time: 444.89 ms

So I have no idea why you think it's slow. Perhaps you forgot the
'create index using gist' step?

Also, if you use the NOT (!) operator, you can get yourself into a really
slow situation.

Chris
On Thu, Nov 27, 2003 at 12:41:59PM +0800, Christopher Kings-Lynne wrote:
> > Does anyone have any metrics on how fast tsearch2 actually is?
> >
> > I tried it on a synthetic dataset of a million documents of a hundred
> > words each, and while insertions were impressively fast, I gave up on
> > the search after 10 minutes.
> >
> > Broken? Unusably slow? This was on the last 7.4 release candidate.
>
> I just created a 1.1 million row dataset by copying one of our 30000 row
> production tables and just taking out the txtidx column. Then I inserted
> it into itself until it had 1.1 million rows.
>
> Then I created the GiST index - THAT took forever - seriously like 20
> mins or half an hour or something.
>
> Now, to find a word:
>
> select * from tsearchtest where ftiidx ## 'curry';
> Time: 9760.75 ms
>
> So I have no idea why you think it's slow. Perhaps you forgot the
> 'create index using gist' step?

No, it was indexed.

Thanks, that was the data point I was looking for. It _can_ run fast, so I
just need to work out what's going on. (It's hard to diagnose a slow query
when you've no idea whether it's really 'slow'.)

Cheers,
  Steve
On Wed, Nov 26, 2003 at 09:12:30PM -0800, Steve Atkins wrote:
> On Thu, Nov 27, 2003 at 12:41:59PM +0800, Christopher Kings-Lynne wrote:
> > > Does anyone have any metrics on how fast tsearch2 actually is?
> > >
> > > I tried it on a synthetic dataset of a million documents of a hundred
> > > words each, and while insertions were impressively fast, I gave up on
> > > the search after 10 minutes.
> > >
> > > Broken? Unusably slow? This was on the last 7.4 release candidate.
> >
> > I just created a 1.1 million row dataset by copying one of our 30000 row
> > production tables and just taking out the txtidx column. Then I inserted
> > it into itself until it had 1.1 million rows.
> >
> > Then I created the GiST index - THAT took forever - seriously like 20
> > mins or half an hour or something.
> >
> > Now, to find a word:
> >
> > select * from tsearchtest where ftiidx ## 'curry';
> > Time: 9760.75 ms
> >
> > So I have no idea why you think it's slow. Perhaps you forgot the
> > 'create index using gist' step?
>
> No, it was indexed.
>
> Thanks, that was the data point I was looking for. It _can_ run fast, so I
> just need to work out what's going on. (It's hard to diagnose a slow query
> when you've no idea whether it's really 'slow'.)

Looking at it further, something is very broken, possibly with GiST indices,
possibly with tsearch2's use of 'em.

This is on a newly built 7.4 installation, built with 64-bit datetimes but
completely stock other than that. Stock gcc 3.3.2, Linux, somewhat elderly
2.4.18 kernel. Running on a 1.5GHz single-processor Athlon with half a gig of
RAM. Configuration set to use 20% of RAM as shared buffers (amongst other
settings; this was the last of a range I tried while looking for variation).
Software RAID0 across two 7200RPM SCSI drives, reiserfs (it's a development
box, not a production system). System completely idle apart from postgresql.

269000 rows, each row having 400 words. Analyzed.

Running the select query given below appears to completely pause a process
trying to insert into the table (locking issue? I/O bandwidth?). top shows the
select below consuming <2% of CPU, and iostat shows it reading ~2800
blocks/second from each of the two RAID drives. Physical size of the database
is under 3 gigs, including toast and index tables.

The select query takes around 6 minutes (consistently, even if the same
identical query is repeated). For entertainment, I turned off indexscan and
the query takes 1 minute with a simple seqscan.

Any thoughts?
Cheers,
  Steve

=> select count(*) from ftstest;
 count
--------
 269000
(1 row)

=> \d ftstest
                              Table "public.ftstest"
 Column |   Type   |                         Modifiers
--------+----------+-----------------------------------------------------------
 idx    | integer  | not null default nextval('public.ftstest_idx_seq'::text)
 words  | text     | not null
 idxfti | tsvector | not null
Indexes:
    "ftstest_idx" gist (idxfti)

=> explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ftstest_idx on ftstest  (cost=0.00..515.90 rows=271 width=4) (actual time=219.694..376042.428 rows=4796 loops=1)
   Index Cond: (idxfti @@ '\'dominican\''::tsquery)
   Filter: (idxfti @@ '\'dominican\''::tsquery)
 Total runtime: 376061.541 ms
(4 rows)

((Set enable_indexscan=false))

=> explain analyze select idx from ftstest where idxfti @@ 'dominican'::tsquery;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on ftstest  (cost=0.00..5765.88 rows=271 width=4) (actual time=42.589..62158.285 rows=4796 loops=1)
   Filter: (idxfti @@ '\'dominican\''::tsquery)
 Total runtime: 62182.277 ms
(3 rows)
> Any thoughts?

Actually, I ran my tests using tsearch V1. I wonder if there has been some
weird regression between tsearch 1 and 2?

Chris
>> Any thoughts?
>
> Actually, I ran my tests using tsearch V1. I wonder if there has been
> some weird regression between tsearch 1 and 2?

I also ran my tests on 7.3.4 :(

Chris
On Fri, Nov 28, 2003 at 01:18:48PM +0800, Christopher Kings-Lynne wrote:
> > Any thoughts?
>
> Actually, I ran my tests using tsearch V1. I wonder if there has been
> some weird regression between tsearch 1 and 2?

Maybe. tsearch2 doesn't seem production ready in other respects
(untsearch2.sql barfs with 'aggregate stat(tsvector) does not exist', and the
openfts mailing list, where this would be more appropriate, doesn't appear to
exist according to sourceforge).

So, using the same data, modulo a few alter tables, I tried tsearch V1. It's a
little slower than V2, and again runs far faster without an index than with
it. Broken in the same way.

I have 7.2.4 running on a Sun box, so I tried that too, with similar results.
tsearch just doesn't seem to work very well on this dataset (or any other
large dataset I've tried).

Cheers,
  Steve
> I have 7.2.4 running on a Sun box, so I tried that too, with similar
> results. tsearch just doesn't seem to work very well on this dataset
> (or any other large dataset I've tried).

Well, as I've shown, it works fine for me...

I strongly suggest you repost your problem report to -hackers, since the fact
that the tsearch developers haven't chimed in implies to me that they don't
watch the performance list.

BTW, read this about GiST indexes:

http://www.postgresql.org/docs/current/static/limitations.html

(Note the lack of concurrency.)

Chris
Hi,

I'm taking a dump of a huge database and do not want the restoration of that
dump to take a lot of time, as is the case when you take the dump as text
files. I want to take the dump as an archive file and get it restored in much
less time. I am not able to figure out what the command is for taking a dump
of a database into an archive file. Kindly help, it's urgent.

thanks and regards,
Kamalraj Singh
On Mon, 1 Dec 2003 15:47:47 +0530
"Kamalraj Singh Madhan" <kamalr@networkprograms.com> wrote:

> Hi,
>     I'm taking a dump of a huge database and do not want the restoration
> of that dump to take a lot of time, as is the case when you take the dump
> as text files. I want to take the dump as an archive file and get it
> restored in much less time. I am not able to figure out what the command
> is for taking a dump of a database into an archive file. Kindly help,
> it's urgent.

Fast backups are an area PG needs work in. Currently, PG has no 'archive file
backup'. You do have the following options to get around this:

1. Take the big db offline and copy $PGDATA. Restore time is however long it
   takes to copy $PGDATA back (and optionally untar/gunzip) and bring the db
   back online.

2. If you are using an LVM, take a snapshot and copy the data. Like #1, it
   also has a "0" restore time.

3. If you are using a pg_dump-generated dump, be sure to really jack up your
   sort_mem - this will be a HUGE benefit when creating indexes and, if you
   are using 7.4, when adding the foreign keys. Also, turning off fsync
   (don't forget to turn it back on after your restore!) can give you some
   nice speed increases. (A rough example of both settings follows below.)

4. If you are not using 7.4 and using pg_dump, there isn't much you can do
   about adding foreign keys going stupidly slow :(

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
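[A rough sketch of the settings Jeff mentions in point 3, as temporary
postgresql.conf entries for the duration of the restore; the numbers are only
example values, not recommendations, and should be sized to your hardware:]

# postgresql.conf (7.x), for the restore run only
sort_mem = 131072      # kilobytes, so roughly 128 MB per sort; speeds up CREATE INDEX and FK checks
fsync = false          # unsafe for normal operation; turn back on after the restore

sort_mem can also be raised for just the restoring session with SET sort_mem,
while changing fsync means editing postgresql.conf and reloading the server.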
On Mon, 1 Dec 2003, Jeff wrote:

> On Mon, 1 Dec 2003 15:47:47 +0530
> "Kamalraj Singh Madhan" <kamalr@networkprograms.com> wrote:
>
> 4. If you are not using 7.4 and using pg_dump, there isn't much you can do
>    about adding foreign keys going stupidly slow :(

You can take a schema dump and a separate data-only dump where the latter
specifies --disable-triggers, which should disable the checks when the data
is being added.
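[A rough sketch of that approach, assuming a database called "mydb" and a
restore target "mydb_restored"; both names and the file names are arbitrary,
and --disable-triggers is the pg_dump switch mentioned above:]

# Schema only.
pg_dump -s mydb > schema.sql

# Data only, with trigger/FK checks disabled while the data is loaded.
pg_dump -a --disable-triggers mydb > data.sql

# Restore: schema first, then data.
psql -d mydb_restored -f schema.sql
psql -d mydb_restored -f data.sql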