Thread: tsearch2 poor performance
Hi all. I am doing some work with tsearch2 and am not sure what to expect out of it, performance wise. Here is my setup: Table "public.product" Column | Type | Modifiers -------------+----------+------------------------------------------------- description | text | product_id | integer | default nextval('product_product_id_seq'::text) vector | tsvector | Indexes: "vector_idx" gist (vector) Triggers: tsvectorupdate BEFORE INSERT OR UPDATE ON product FOR EACH ROW EXECUTE PROCEDURE tsearch2('vector', 'description') This table has 3,000,000 rows in it. Each description field has roughly 50 characters. There are fewer than ten thousanddistinct words in my 3,000,000 rows. The vector was filled using the description fields values. I ran a vacuumfull analyze before executing any of my queries. Here are a couple of tests I performed using the tsearch index and like; search_test=# select count(*) from product where vector @@ to_tsquery('oil'); count -------- 226357 (1 row) Time: 191056.230 ms search_test=# select count(*) from product where vector @@ to_tsquery('hydrogen'); count -------- 226868 (1 row) Time: 306411.957 ms search_test=# select count(*) from product where description like '% oil %'; count -------- 226357 (1 row) Time: 38426.851 ms search_test=# select count(*) from product where description like '% hydrogen %'; count -------- 226868 (1 row) Time: 38265.421 ms Both of the likes are using a sequential scan and both of the tsearch queries use the gist index. Did I miss a configurationparameter, are these queries incorrectly using tsearch2,or is this tsearch2's average performance? Thanks inadvance for the input! Kris
Kris, could you post 'explain analyze' output ? Also, could you disable index usage (set enable_indexscan=off) and rerun search using tsearch2 ? also, could you run 'stat' function to see frequency distribution of words. See http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes for details. Oleg On Fri, 24 Sep 2004, Kris Kiger wrote: > Hi all. I am doing some work with tsearch2 and am not sure what to > expect out of it, performance wise. Here is my setup: > > Table "public.product" > Column | Type | Modifiers > -------------+----------+------------------------------------------------- > description | text | > product_id | integer | default nextval('product_product_id_seq'::text) > vector | tsvector | > Indexes: > "vector_idx" gist (vector) > Triggers: > tsvectorupdate BEFORE INSERT OR UPDATE ON product FOR EACH ROW EXECUTE PROCEDURE tsearch2('vector', 'description') > > This table has 3,000,000 rows in it. Each description field has roughly 50 characters. There are fewer than ten thousanddistinct words in my 3,000,000 rows. The vector was filled using the description fields values. I ran a vacuumfull analyze before executing any of my queries. > > Here are a couple of tests I performed using the tsearch index and like; > > search_test=# select count(*) from product where vector @@ to_tsquery('oil'); > count > -------- > 226357 > (1 row) > > Time: 191056.230 ms > > search_test=# select count(*) from product where vector @@ to_tsquery('hydrogen'); > count > -------- > 226868 > (1 row) > > Time: 306411.957 ms > > search_test=# select count(*) from product where description like '% oil %'; > count > -------- > 226357 > (1 row) > > Time: 38426.851 ms > > search_test=# select count(*) from product where description like '% hydrogen %'; > count > -------- > 226868 > (1 row) > > Time: 38265.421 ms > > > Both of the likes are using a sequential scan and both of the tsearch queries use the gist index. Did I miss a configurationparameter, are these queries incorrectly using tsearch2,or is this tsearch2's average performance? Thanks inadvance for the input! > > Kris > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Here is the explain analyze output, funny thing, after I ran josh's query, mine ran a lot faster....maybe it forced a caching?; search_test=# explain analyze select count(*) from product where vector @@ to_tsquery('oil'); Aggregate (cost=6113.09..6113.09 rows=1 width=0) (actual time=19643.372..19643.376 rows=1 loops=1) -> Index Scan using vector_idx on product (cost=0.00..6105.58 rows=3001 width=0) (actual time=0.381..18145.917 rows=226357 loops=1) Index Cond: (vector @@ '\'oil\''::tsquery) Filter: (vector @@ '\'oil\''::tsquery) Total runtime: 19643.597 ms search_test=# explain analyze select count(*) from product where vector @@ to_tsquery('hydrogen'); Aggregate (cost=6113.09..6113.09 rows=1 width=0) (actual time=19629.766..19629.769 rows=1 loops=1) -> Index Scan using vector_idx on product (cost=0.00..6105.58 rows=3001 width=0) (actual time=0.378..18127.573 rows=226868 loops=1) Index Cond: (vector @@ '\'hydrogen\''::tsquery) Filter: (vector @@ '\'hydrogen\''::tsquery) Total runtime: 19629.992 ms Here is Josh's; search_test=# explain analyze SELECT count(q) FROM product, to_tsquery('oil') AS q WHERE vector @@ q; Aggregate (cost=6150597.03..6150597.03 rows=1 width=32) (actual time=21769.526..21769.530 rows=1 loops=1) -> Nested Loop (cost=0.00..6143097.02 rows=3000001 width=32) (actual time=0.424..20450.208 rows=226357 loops=1) -> Function Scan on q (cost=0.00..12.50 rows=1000 width=32) (actual time=0.023..0.031 rows=1 loops=1) -> Index Scan using vector_idx on product (cost=0.00..6105.58 rows=3000 width=32) (actual time=0.376..18165.415 rows=226357 loops=1) Index Cond: (product.vector @@ "outer".q) Filter: (product.vector @@ "outer".q) Total runtime: 21769.786 ms Disabling Index usage slowed it down: search_test=# explain analyze select count(*) from product where vector @@ to_tsquery('hydrogen'); Aggregate (cost=347259.51..347259.51 rows=1 width=0) (actual time=24675.933..24675.936 rows=1 loops=1) -> Seq Scan on product (cost=0.00..347252.00 rows=3001 width=0) (actual time=0.320..23164.492 rows=226868 loops=1) Filter: (vector @@ '\'hydrogen\''::tsquery) Total runtime: 24676.091 ms Time: 24678.842 ms search_test=# explain analyze SELECT count(q) FROM product, to_tsquery('oil') AS q WHERE vector @@ q; Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual time=83631.201..83631.204 rows=1 loops=1) -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32) (actual time=0.214..82294.710 rows=226357 loops=1) Join Filter: ("outer".vector @@ "inner".q) -> Seq Scan on product (cost=0.00..339752.00 rows=3000000 width=32) (actual time=0.107..27563.952 rows=3000000 loops=1) -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual time=0.003..0.006 rows=1 loops=3000000) -> Function Scan on q (cost=0.00..12.50 rows=1000 width=32) (actual time=0.019..0.023 rows=1 loops=1) Total runtime: 83631.385 ms Here are the results of stat: search_test=# select * from stat('select vector from product') order by ndoc desc, nentry; word | ndoc | nentry --------------+---------+--------- anoth | 1187386 | 1477442 bear | 696668 | 780963 take | 675319 | 736410 relat | 491469 | 528259 toward | 490653 | 528369 defin | 490572 | 527099 live | 490538 | 527401 beyond | 490124 | 527957 behind | 490087 | 527735 insid | 489530 | 527074 near | 489504 | 527721 around | 489244 | 526870 mean | 478201 | 512699 complex | 440339 | 468669 light | 438685 | 468140 ball | 438567 | 468168 pit | 438293 | 467807 dress | 438128 | 467260 player | 437633 | 466753 secret | 433279 | 457246 love | 423777 | 442694 give | 423691 | 441305 need | 423336 | 434409 peopl | 423336 | 434409 believ | 423336 | 434409 rememb | 423336 | 434409 howev | 421762 | 434194 real | 419906 | 435074 furthermor | 416672 | 434413 indic | 416508 | 434919 exampl | 416508 | 434919 alway | 415543 | 432861 sometim | 415543 | 432861 see | 410706 | 434586 inde | 408379 | 434283 fruit | 363203 | 381862 cook | 362674 | 381112 graduat | 362444 | 381284 chees | 362358 | 381040 hesit | 307431 | 317550 self | 301001 | 312312 hard | 300138 | 310167 spirit | 299310 | 312092 know | 298246 | 309010 laugh | 294136 | 302392 make | 287633 | 295003 find | 287550 | 294770 goe | 279336 | 287025 team | 228000 | 234703 footbal | 228000 | 234703 void | 227914 | 234681 formless | 227914 | 234681 board | 227907 | 234797 chess | 227907 | 234797 submarin | 227869 | 234727 inferior | 227858 | 234357 viper | 227855 | 234865 cylind | 227847 | 234505 suit | 227822 | 234376 class | 227822 | 234376 action | 227822 | 234376 diskett | 227802 | 234786 roller | 227792 | 234524 coaster | 227792 | 234524 mate | 227785 | 234431 ritual | 227785 | 234431 engin | 227784 | 234575 steam | 227784 | 234575 industri | 227780 | 234312 fire | 227775 | 234532 hydrant | 227775 | 234532 briar | 227769 | 234524 patch | 227769 | 234524 mastadon | 227677 | 234665 defend | 227617 | 234410 blade | 227603 | 234356 razor | 227603 | 234356 cab | 227578 | 234554 driver | 227578 | 234554 cough | 227570 | 234324 syrup | 227570 | 234324 cowboy | 227566 | 234663 chop | 227564 | 234437 pork | 227564 | 234437 ceo | 227557 | 234760 rattlesnak | 227554 | 234323 hell | 227540 | 234313 flavor | 227540 | 234313 maelstrom | 227537 | 234404 mulch | 227531 | 234311 cyprus | 227531 | 234311 tack | 227525 | 234462 carpet | 227525 | 234462 movi | 227505 | 234207 theater | 227505 | 234207 spider | 227466 | 234524 cone | 227463 | 234198 pine | 227463 | 234198 garbag | 227459 | 234207 beer | 227443 | 234077 bottl | 227443 | 234077 polygon | 227438 | 234267 judg | 227425 | 234565 blith | 227409 | 233979 traffic | 227403 | 234051 paper | 227397 | 234028 napkin | 227397 | 234028 apart | 227393 | 233911 build | 227393 | 233911 cocker | 227368 | 233926 spaniel | 227368 | 233926 bay | 227358 | 234261 cargo | 227358 | 234261 order | 227357 | 233885 short | 227357 | 233885 polar | 227326 | 234118 demon | 227324 | 234442 minivan | 227317 | 234292 bulb | 227314 | 234089 fundrais | 227308 | 234235 eggplant | 227306 | 234202 cake | 227299 | 234075 bowl | 227299 | 234110 paycheck | 227295 | 234224 sheriff | 227292 | 234313 turkey | 227271 | 234267 turn | 227265 | 234210 signal | 227265 | 234210 chestnut | 227250 | 234104 hole | 227239 | 233975 puncher | 227239 | 233975 tabloid | 227238 | 234341 microscop | 227236 | 234067 reclin | 227234 | 233946 dolphin | 227231 | 234080 pen | 227222 | 234269 pig | 227222 | 234269 wed | 227221 | 233860 bullfrog | 227211 | 234144 truck | 227208 | 233980 pickup | 227208 | 233980 agent | 227201 | 233840 insur | 227201 | 233840 girl | 227201 | 233934 scout | 227201 | 233934 drill | 227200 | 233986 power | 227200 | 233986 ocean | 227187 | 234211 case | 227173 | 233983 crank | 227173 | 233983 squid | 227169 | 234056 senat | 227167 | 234147 fraction | 227161 | 234065 custom | 227152 | 234128 burglar | 227148 | 234014 grizzli | 227133 | 233955 wheel | 227122 | 233813 asteroid | 227108 | 233928 anomali | 227106 | 234156 acceler | 227103 | 233428 particl | 227103 | 233428 saw | 227082 | 233934 chain | 227082 | 233934 reactor | 227035 | 234061 wedg | 227033 | 234143 photon | 227029 | 234017 deficit | 227029 | 234102 vacuum | 227021 | 233760 cleaner | 227021 | 233760 cashier | 227010 | 233858 scyth | 227001 | 233928 cloud | 226981 | 233569 format | 226981 | 233569 tornado | 226968 | 234058 grand | 226936 | 233730 piano | 226936 | 233730 tripod | 226930 | 233755 tomato | 226928 | 233915 sandwich | 226923 | 233786 earring | 226912 | 233665 train | 226912 | 233712 freight | 226912 | 233712 skyscrap | 226901 | 233755 abstract | 226890 | 233658 mortician | 226883 | 233781 warranti | 226876 | 233935 atom | 226868 | 233467 hydrogen | 226868 | 233467 satellit | 226866 | 233680 corpor | 226858 | 233818 globul | 226853 | 233980 cow | 226832 | 233808 jersey | 226832 | 233808 salad | 226830 | 233400 buzzard | 226804 | 233825 lot | 226794 | 233643 park | 226794 | 233643 prime | 226793 | 233325 minist | 226793 | 233325 clot | 226780 | 233380 blood | 226780 | 233380 tuba | 226765 | 233575 tape | 226749 | 233388 record | 226749 | 233388 line | 226747 | 233574 dancer | 226747 | 233574 nation | 226736 | 233796 bartend | 226653 | 233422 hockey | 226645 | 233178 canyon | 226617 | 233699 ski | 226610 | 233451 lodg | 226610 | 233451 stovepip | 226608 | 233489 crane | 226590 | 233578 sand | 226572 | 233270 grain | 226572 | 233270 dust | 226570 | 233391 bunni | 226570 | 233391 lover | 226564 | 233628 fairi | 226554 | 233743 plaintiff | 226537 | 233563 wheelbarrow | 226520 | 233206 food | 226445 | 233228 stamp | 226445 | 233228 umbrella | 226380 | 233273 avocado | 226375 | 232942 oil | 226357 | 233266 filter | 226357 | 233266 financi | 220105 | 225116 complet | 162829 | 164065 ridicul | 162346 | 163592 handl | 162200 | 163390 singl | 162200 | 163390 single-handl | 162200 | 163390 greedili | 162123 | 163379 careless | 162009 | 163193 somewhat | 161979 | 163205 accur | 161975 | 163228 overwhelm | 161946 | 163107 usual | 161930 | 163158 ostens | 161826 | 163020 lazili | 161809 | 163133 slyli | 161803 | 163149 underhand | 161751 | 162955 non | 161585 | 162823 chalant | 161585 | 162823 non-chal | 161585 | 162823 seldom | 161525 | 162739 accident | 161511 | 162676 almost | 161508 | 162782 often | 161488 | 162733 bare | 161401 | 162659 eager | 161278 | 162513 wise | 161073 | 162341 inexor | 161042 | 162265 feverish | 160805 | 162020 thorough | 160611 | 161823 home | 154672 | 155766 return | 154672 | 155766 lost | 154655 | 155567 glori | 154655 | 155567 start | 154655 | 155567 reminisc | 154655 | 155567 rumin | 154577 | 155776 read | 154529 | 155642 magazin | 154529 | 155642 pray | 154478 | 155748 floor | 154396 | 155477 sweep | 154396 | 155477 nag | 154271 | 155259 feel | 154271 | 155259 remors | 154271 | 155259 procrastin | 154256 | 155371 wake | 154220 | 155397 sleep | 154217 | 155353 panic | 154189 | 155346 get | 154168 | 155253 drunk | 154168 | 155253 stink | 154168 | 155253 hibern | 154158 | 155358 die | 153973 | 155223 fli | 153943 | 155056 rage | 153943 | 155056 flagel | 153916 | 155067 self-flagel | 153916 | 155067 daydream | 153864 | 155043 medit | 153816 | 154935 ceas | 153735 | 154815 exist | 153735 | 154815 joy | 153672 | 154754 beam | 153672 | 154754 trembl | 153656 | 154799 loud | 153635 | 154665 hide | 153592 | 154797 break | 153559 | 154599 coffe | 153559 | 154599 earn | 153538 | 154540 mile | 153538 | 154540 flier | 153538 | 154540 frequent | 153538 | 154540 leav | 153535 | 154730 rejoic | 153226 | 154412 sell | 147231 | 148103 plan | 147046 | 147809 escap | 147046 | 147809 throw | 146973 | 147764 negoti | 146905 | 147704 prenupti | 146905 | 147704 agreement | 146905 | 147704 card | 146892 | 147731 trade | 146892 | 147731 basebal | 146892 | 147731 oper | 146888 | 147787 small | 146888 | 147787 stand | 146888 | 147787 drink | 146881 | 147727 night | 146881 | 147727 steal | 146835 | 147847 pencil | 146835 | 147847 seek | 146816 | 148029 figur | 146801 | 147908 write | 146736 | 147720 letter | 146736 | 147720 recogn | 146723 | 147823 truce | 146684 | 147630 eat | 146670 | 147874 compet | 146647 | 147760 buy | 146642 | 147522 gift | 146642 | 147522 expens | 146642 | 147522 big | 146626 | 147717 fan | 146626 | 147717 fall | 146597 | 147601 assist | 146587 | 147589 requir | 146587 | 147589 chang | 146542 | 147479 heart | 146542 | 147479 conquer | 146542 | 147695 money | 146481 | 147450 borrow | 146481 | 147450 ignor | 146475 | 147643 share | 146415 | 147404 shower | 146415 | 147404 fault | 146413 | 147361 subtl | 146413 | 147361 kind | 146402 | 147492 great | 146397 | 147367 upon | 146396 | 147366 honor | 146396 | 147366 bestow | 146396 | 147366 pee | 146394 | 147477 avoid | 146392 | 147388 contact | 146392 | 147388 pink | 146372 | 147347 slip | 146372 | 147347 aid | 146367 | 147225 teach | 146366 | 147516 sanit | 146361 | 147477 lice | 146360 | 147409 buri | 146360 | 147483 cold | 146357 | 147220 pour | 146357 | 147220 freez | 146357 | 147220 water | 146357 | 147220 sea | 146347 | 147217 deep | 146347 | 147217 fish | 146347 | 147217 organ | 146321 | 147476 grit | 146289 | 147227 satiat | 146251 | 147349 assimil | 146251 | 147377 tri | 146188 | 147200 seduc | 146188 | 147200 reach | 146132 | 147008 understand | 146132 | 147008 brainwash | 146068 | 147158 admir | 146050 | 147021 caricatur | 145989 | 147107 deriv | 145941 | 146790 pervers | 145941 | 146790 satisfact | 145941 | 146790 moral | 145854 | 146733 lectur | 145854 | 146733 befriend | 145799 | 146963 learn | 145758 | 146666 lesson | 145758 | 146666 play | 145738 | 146706 pinochl | 145738 | 146706 peek | 145698 | 146737 danc | 145555 | 146637 fashion | 78762 | 79203 muddi | 78750 | 79236 hypnot | 78747 | 79204 childlik | 78579 | 79002 loyal | 78575 | 79056 mysteri | 78554 | 79047 annoy | 78532 | 79032 slow | 78517 | 78996 twist | 78515 | 79016 unstabl | 78510 | 78945 feder | 78501 | 78967 rever | 78501 | 79008 wrinkl | 78495 | 78965 rude | 78495 | 78975 boil | 78493 | 78972 high | 78481 | 78940 paid | 78481 | 78940 geosynchron | 78478 | 78931 greasi | 78476 | 78961 cosmopolitan | 78459 | 78903 fat | 78438 | 78935 inciner | 78429 | 78896 dot | 78426 | 78864 polka | 78426 | 78864 polka-dot | 78426 | 78864 outer | 78415 | 78910 phoni | 78411 | 78895 pathet | 78405 | 78869 purpl | 78405 | 78895 frozen | 78403 | 78886 nearest | 78396 | 78879 statesmanlik | 78386 | 78830 dirt | 78376 | 78828 encrust | 78376 | 78828 dirt-encrust | 78376 | 78828 sur | 78371 | 78895 obsequi | 78369 | 78805 salti | 78360 | 78834 imagin | 78356 | 78808 south | 78325 | 78787 american | 78325 | 78787 load | 78318 | 78832 righteous | 78282 | 78760 fractur | 78281 | 78737 educ | 78278 | 78682 colleg | 78278 | 78682 college-educ | 78278 | 78682 mitochondri | 78269 | 78745 treacher | 78265 | 78697 spartan | 78252 | 78707 felin | 78244 | 78713 ravish | 78242 | 78765 patern | 78241 | 78701 psychot | 78238 | 78693 shabbi | 78228 | 78685 dreamlik | 78224 | 78642 loath | 78221 | 78653 self-loath | 78221 | 78653 world | 78203 | 78658 call | 78183 | 78610 so-cal | 78183 | 78610 radioact | 78182 | 78623 alleg | 78178 | 78664 cantanker | 78159 | 78620 makeshift | 78159 | 78648 gentl | 78156 | 78609 fri | 78143 | 78648 linguist | 78141 | 78586 overrip | 78134 | 78572 varig | 78132 | 78609 vapor | 78105 | 78548 impromptu | 78104 | 78569 actual | 78104 | 78592 self-actu | 78104 | 78592 frighten | 78100 | 78544 molten | 78100 | 78567 gratifi | 78098 | 78528 bur | 78094 | 78563 hairi | 78092 | 78563 foreign | 78083 | 78569 tatter | 78050 | 78518 frustrat | 78044 | 78474 stoic | 78036 | 78503 eurasian | 78033 | 78513 proverbi | 78031 | 78519 green | 78024 | 78450 skinni | 78023 | 78524 familiar | 78016 | 78477 optim | 78006 | 78483 bohemian | 78002 | 78500 overpr | 77983 | 78411 pompous | 77955 | 78460 difficult | 77938 | 78375 raspi | 77924 | 78461 soggi | 77912 | 78381 resplend | 77910 | 78351 blotch | 77910 | 78380 fals | 77908 | 78409 infect | 77907 | 78399 magnific | 77898 | 78350 snooti | 77897 | 78422 moron | 77886 | 78362 moldi | 77865 | 78370 precis | 77860 | 78331 crispi | 77856 | 78324 smelli | 77813 | 78279 tempor | 77810 | 78244 alaskan | 77808 | 78258 elus | 77775 | 78245 miser | 77772 | 78232 flatul | 77761 | 78201 orbit | 77723 | 78157 mean-spirit | 77660 | 78113 flabbi | 77649 | 78110 nuclear | 77609 | 78069 go | 15532 | 15545 made | 1 | 1 america | 1 | 1 If you need anything else, let me know! Kris Oleg Bartunov wrote: >Kris, > >could you post 'explain analyze' output ? >Also, could you disable index usage (set enable_indexscan=off) >and rerun search using tsearch2 ? > >also, could you run 'stat' function to see frequency distribution >of words. See http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes >for details. > >Oleg > > >>Hi all. I am doing some work with tsearch2 and am not sure what to >>expect out of it, performance wise. Here is my setup: >> >> Table "public.product" >> Column | Type | Modifiers >>-------------+----------+------------------------------------------------- >> description | text | >> product_id | integer | default nextval('product_product_id_seq'::text) >> vector | tsvector | >>Indexes: >> "vector_idx" gist (vector) >>Triggers: >> tsvectorupdate BEFORE INSERT OR UPDATE ON product FOR EACH ROW EXECUTE PROCEDURE tsearch2('vector', 'description') >> >>This table has 3,000,000 rows in it. Each description field has roughly 50 characters. There are fewer than ten thousanddistinct words in my 3,000,000 rows. The vector was filled using the description fields values. I ran a vacuumfull analyze before executing any of my queries. >> >>Here are a couple of tests I performed using the tsearch index and like; >> >>search_test=# select count(*) from product where vector @@ to_tsquery('oil'); >> count >>-------- >> 226357 >>(1 row) >> >>Time: 191056.230 ms >> >>search_test=# select count(*) from product where vector @@ to_tsquery('hydrogen'); >> count >>-------- >> 226868 >>(1 row) >> >>Time: 306411.957 ms >> >>search_test=# select count(*) from product where description like '% oil %'; >> count >>-------- >> 226357 >>(1 row) >> >>Time: 38426.851 ms >> >>search_test=# select count(*) from product where description like '% hydrogen %'; >> count >>-------- >> 226868 >>(1 row) >> >>Time: 38265.421 ms >> >> >>Both of the likes are using a sequential scan and both of the tsearch queries use the gist index. Did I miss a configurationparameter, are these queries incorrectly using tsearch2,or is this tsearch2's average performance? Thanks inadvance for the input! >> >>Kris >> >>
Can't speak to tsearch2 in specific but I have learned to be very cautious -- caching does indeed make a noticible differenceon this sort of thing, especially if you have enough RAM to hold a significant amount of the data. Either keepchanging the query target or do something violent to wipe the cache(s). Greg Williamson DBA GlobeXplorer LLC -----Original Message----- From: Kris Kiger [mailto:kris@musicrebellion.com] Sent: Friday, September 24, 2004 2:59 PM To: pgsql-admin@postgresql.org Subject: Re: [ADMIN] tsearch2 poor performance Here is the explain analyze output, funny thing, after I ran josh's query, mine ran a lot faster....maybe it forced a caching?; search_test=# explain analyze select count(*) from product where vector @@ to_tsquery('oil'); Aggregate (cost=6113.09..6113.09 rows=1 width=0) (actual time=19643.372..19643.376 rows=1 loops=1) -> Index Scan using vector_idx on product (cost=0.00..6105.58 rows=3001 width=0) (actual time=0.381..18145.917 rows=226357 loops=1) Index Cond: (vector @@ '\'oil\''::tsquery) Filter: (vector @@ '\'oil\''::tsquery) Total runtime: 19643.597 ms search_test=# explain analyze select count(*) from product where vector @@ to_tsquery('hydrogen'); Aggregate (cost=6113.09..6113.09 rows=1 width=0) (actual time=19629.766..19629.769 rows=1 loops=1) -> Index Scan using vector_idx on product (cost=0.00..6105.58 rows=3001 width=0) (actual time=0.378..18127.573 rows=226868 loops=1) Index Cond: (vector @@ '\'hydrogen\''::tsquery) Filter: (vector @@ '\'hydrogen\''::tsquery) Total runtime: 19629.992 ms Here is Josh's; search_test=# explain analyze SELECT count(q) FROM product, to_tsquery('oil') AS q WHERE vector @@ q; Aggregate (cost=6150597.03..6150597.03 rows=1 width=32) (actual time=21769.526..21769.530 rows=1 loops=1) -> Nested Loop (cost=0.00..6143097.02 rows=3000001 width=32) (actual time=0.424..20450.208 rows=226357 loops=1) -> Function Scan on q (cost=0.00..12.50 rows=1000 width=32) (actual time=0.023..0.031 rows=1 loops=1) -> Index Scan using vector_idx on product (cost=0.00..6105.58 rows=3000 width=32) (actual time=0.376..18165.415 rows=226357 loops=1) Index Cond: (product.vector @@ "outer".q) Filter: (product.vector @@ "outer".q) Total runtime: 21769.786 ms Disabling Index usage slowed it down: search_test=# explain analyze select count(*) from product where vector @@ to_tsquery('hydrogen'); Aggregate (cost=347259.51..347259.51 rows=1 width=0) (actual time=24675.933..24675.936 rows=1 loops=1) -> Seq Scan on product (cost=0.00..347252.00 rows=3001 width=0) (actual time=0.320..23164.492 rows=226868 loops=1) Filter: (vector @@ '\'hydrogen\''::tsquery) Total runtime: 24676.091 ms Time: 24678.842 ms search_test=# explain analyze SELECT count(q) FROM product, to_tsquery('oil') AS q WHERE vector @@ q; Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual time=83631.201..83631.204 rows=1 loops=1) -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32) (actual time=0.214..82294.710 rows=226357 loops=1) Join Filter: ("outer".vector @@ "inner".q) -> Seq Scan on product (cost=0.00..339752.00 rows=3000000 width=32) (actual time=0.107..27563.952 rows=3000000 loops=1) -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual time=0.003..0.006 rows=1 loops=3000000) -> Function Scan on q (cost=0.00..12.50 rows=1000 width=32) (actual time=0.019..0.023 rows=1 loops=1) Total runtime: 83631.385 ms Here are the results of stat: search_test=# select * from stat('select vector from product') order by ndoc desc, nentry; word | ndoc | nentry --------------+---------+--------- anoth | 1187386 | 1477442 bear | 696668 | 780963 take | 675319 | 736410 relat | 491469 | 528259 toward | 490653 | 528369 defin | 490572 | 527099 live | 490538 | 527401 beyond | 490124 | 527957 behind | 490087 | 527735 insid | 489530 | 527074 near | 489504 | 527721 around | 489244 | 526870 mean | 478201 | 512699 complex | 440339 | 468669 light | 438685 | 468140 ball | 438567 | 468168 pit | 438293 | 467807 dress | 438128 | 467260 player | 437633 | 466753 secret | 433279 | 457246 love | 423777 | 442694 give | 423691 | 441305 need | 423336 | 434409 peopl | 423336 | 434409 believ | 423336 | 434409 rememb | 423336 | 434409 howev | 421762 | 434194 real | 419906 | 435074 furthermor | 416672 | 434413 indic | 416508 | 434919 exampl | 416508 | 434919 alway | 415543 | 432861 sometim | 415543 | 432861 see | 410706 | 434586 inde | 408379 | 434283 fruit | 363203 | 381862 cook | 362674 | 381112 graduat | 362444 | 381284 chees | 362358 | 381040 hesit | 307431 | 317550 self | 301001 | 312312 hard | 300138 | 310167 spirit | 299310 | 312092 know | 298246 | 309010 laugh | 294136 | 302392 make | 287633 | 295003 find | 287550 | 294770 goe | 279336 | 287025 team | 228000 | 234703 footbal | 228000 | 234703 void | 227914 | 234681 formless | 227914 | 234681 board | 227907 | 234797 chess | 227907 | 234797 submarin | 227869 | 234727 inferior | 227858 | 234357 viper | 227855 | 234865 cylind | 227847 | 234505 suit | 227822 | 234376 class | 227822 | 234376 action | 227822 | 234376 diskett | 227802 | 234786 roller | 227792 | 234524 coaster | 227792 | 234524 mate | 227785 | 234431 ritual | 227785 | 234431 engin | 227784 | 234575 steam | 227784 | 234575 industri | 227780 | 234312 fire | 227775 | 234532 hydrant | 227775 | 234532 briar | 227769 | 234524 patch | 227769 | 234524 mastadon | 227677 | 234665 defend | 227617 | 234410 blade | 227603 | 234356 razor | 227603 | 234356 cab | 227578 | 234554 driver | 227578 | 234554 cough | 227570 | 234324 syrup | 227570 | 234324 cowboy | 227566 | 234663 chop | 227564 | 234437 pork | 227564 | 234437 ceo | 227557 | 234760 rattlesnak | 227554 | 234323 hell | 227540 | 234313 flavor | 227540 | 234313 maelstrom | 227537 | 234404 mulch | 227531 | 234311 cyprus | 227531 | 234311 tack | 227525 | 234462 carpet | 227525 | 234462 movi | 227505 | 234207 theater | 227505 | 234207 spider | 227466 | 234524 cone | 227463 | 234198 pine | 227463 | 234198 garbag | 227459 | 234207 beer | 227443 | 234077 bottl | 227443 | 234077 polygon | 227438 | 234267 judg | 227425 | 234565 blith | 227409 | 233979 traffic | 227403 | 234051 paper | 227397 | 234028 napkin | 227397 | 234028 apart | 227393 | 233911 build | 227393 | 233911 cocker | 227368 | 233926 spaniel | 227368 | 233926 bay | 227358 | 234261 cargo | 227358 | 234261 order | 227357 | 233885 short | 227357 | 233885 polar | 227326 | 234118 demon | 227324 | 234442 minivan | 227317 | 234292 bulb | 227314 | 234089 fundrais | 227308 | 234235 eggplant | 227306 | 234202 cake | 227299 | 234075 bowl | 227299 | 234110 paycheck | 227295 | 234224 sheriff | 227292 | 234313 turkey | 227271 | 234267 turn | 227265 | 234210 signal | 227265 | 234210 chestnut | 227250 | 234104 hole | 227239 | 233975 puncher | 227239 | 233975 tabloid | 227238 | 234341 microscop | 227236 | 234067 reclin | 227234 | 233946 dolphin | 227231 | 234080 pen | 227222 | 234269 pig | 227222 | 234269 wed | 227221 | 233860 bullfrog | 227211 | 234144 truck | 227208 | 233980 pickup | 227208 | 233980 agent | 227201 | 233840 insur | 227201 | 233840 girl | 227201 | 233934 scout | 227201 | 233934 drill | 227200 | 233986 power | 227200 | 233986 ocean | 227187 | 234211 case | 227173 | 233983 crank | 227173 | 233983 squid | 227169 | 234056 senat | 227167 | 234147 fraction | 227161 | 234065 custom | 227152 | 234128 burglar | 227148 | 234014 grizzli | 227133 | 233955 wheel | 227122 | 233813 asteroid | 227108 | 233928 anomali | 227106 | 234156 acceler | 227103 | 233428 particl | 227103 | 233428 saw | 227082 | 233934 chain | 227082 | 233934 reactor | 227035 | 234061 wedg | 227033 | 234143 photon | 227029 | 234017 deficit | 227029 | 234102 vacuum | 227021 | 233760 cleaner | 227021 | 233760 cashier | 227010 | 233858 scyth | 227001 | 233928 cloud | 226981 | 233569 format | 226981 | 233569 tornado | 226968 | 234058 grand | 226936 | 233730 piano | 226936 | 233730 tripod | 226930 | 233755 tomato | 226928 | 233915 sandwich | 226923 | 233786 earring | 226912 | 233665 train | 226912 | 233712 freight | 226912 | 233712 skyscrap | 226901 | 233755 abstract | 226890 | 233658 mortician | 226883 | 233781 warranti | 226876 | 233935 atom | 226868 | 233467 hydrogen | 226868 | 233467 satellit | 226866 | 233680 corpor | 226858 | 233818 globul | 226853 | 233980 cow | 226832 | 233808 jersey | 226832 | 233808 salad | 226830 | 233400 buzzard | 226804 | 233825 lot | 226794 | 233643 park | 226794 | 233643 prime | 226793 | 233325 minist | 226793 | 233325 clot | 226780 | 233380 blood | 226780 | 233380 tuba | 226765 | 233575 tape | 226749 | 233388 record | 226749 | 233388 line | 226747 | 233574 dancer | 226747 | 233574 nation | 226736 | 233796 bartend | 226653 | 233422 hockey | 226645 | 233178 canyon | 226617 | 233699 ski | 226610 | 233451 lodg | 226610 | 233451 stovepip | 226608 | 233489 crane | 226590 | 233578 sand | 226572 | 233270 grain | 226572 | 233270 dust | 226570 | 233391 bunni | 226570 | 233391 lover | 226564 | 233628 fairi | 226554 | 233743 plaintiff | 226537 | 233563 wheelbarrow | 226520 | 233206 food | 226445 | 233228 stamp | 226445 | 233228 umbrella | 226380 | 233273 avocado | 226375 | 232942 oil | 226357 | 233266 filter | 226357 | 233266 financi | 220105 | 225116 complet | 162829 | 164065 ridicul | 162346 | 163592 handl | 162200 | 163390 singl | 162200 | 163390 single-handl | 162200 | 163390 greedili | 162123 | 163379 careless | 162009 | 163193 somewhat | 161979 | 163205 accur | 161975 | 163228 overwhelm | 161946 | 163107 usual | 161930 | 163158 ostens | 161826 | 163020 lazili | 161809 | 163133 slyli | 161803 | 163149 underhand | 161751 | 162955 non | 161585 | 162823 chalant | 161585 | 162823 non-chal | 161585 | 162823 seldom | 161525 | 162739 accident | 161511 | 162676 almost | 161508 | 162782 often | 161488 | 162733 bare | 161401 | 162659 eager | 161278 | 162513 wise | 161073 | 162341 inexor | 161042 | 162265 feverish | 160805 | 162020 thorough | 160611 | 161823 home | 154672 | 155766 return | 154672 | 155766 lost | 154655 | 155567 glori | 154655 | 155567 start | 154655 | 155567 reminisc | 154655 | 155567 rumin | 154577 | 155776 read | 154529 | 155642 magazin | 154529 | 155642 pray | 154478 | 155748 floor | 154396 | 155477 sweep | 154396 | 155477 nag | 154271 | 155259 feel | 154271 | 155259 remors | 154271 | 155259 procrastin | 154256 | 155371 wake | 154220 | 155397 sleep | 154217 | 155353 panic | 154189 | 155346 get | 154168 | 155253 drunk | 154168 | 155253 stink | 154168 | 155253 hibern | 154158 | 155358 die | 153973 | 155223 fli | 153943 | 155056 rage | 153943 | 155056 flagel | 153916 | 155067 self-flagel | 153916 | 155067 daydream | 153864 | 155043 medit | 153816 | 154935 ceas | 153735 | 154815 exist | 153735 | 154815 joy | 153672 | 154754 beam | 153672 | 154754 trembl | 153656 | 154799 loud | 153635 | 154665 hide | 153592 | 154797 break | 153559 | 154599 coffe | 153559 | 154599 earn | 153538 | 154540 mile | 153538 | 154540 flier | 153538 | 154540 frequent | 153538 | 154540 leav | 153535 | 154730 rejoic | 153226 | 154412 sell | 147231 | 148103 plan | 147046 | 147809 escap | 147046 | 147809 throw | 146973 | 147764 negoti | 146905 | 147704 prenupti | 146905 | 147704 agreement | 146905 | 147704 card | 146892 | 147731 trade | 146892 | 147731 basebal | 146892 | 147731 oper | 146888 | 147787 small | 146888 | 147787 stand | 146888 | 147787 drink | 146881 | 147727 night | 146881 | 147727 steal | 146835 | 147847 pencil | 146835 | 147847 seek | 146816 | 148029 figur | 146801 | 147908 write | 146736 | 147720 letter | 146736 | 147720 recogn | 146723 | 147823 truce | 146684 | 147630 eat | 146670 | 147874 compet | 146647 | 147760 buy | 146642 | 147522 gift | 146642 | 147522 expens | 146642 | 147522 big | 146626 | 147717 fan | 146626 | 147717 fall | 146597 | 147601 assist | 146587 | 147589 requir | 146587 | 147589 chang | 146542 | 147479 heart | 146542 | 147479 conquer | 146542 | 147695 money | 146481 | 147450 borrow | 146481 | 147450 ignor | 146475 | 147643 share | 146415 | 147404 shower | 146415 | 147404 fault | 146413 | 147361 subtl | 146413 | 147361 kind | 146402 | 147492 great | 146397 | 147367 upon | 146396 | 147366 honor | 146396 | 147366 bestow | 146396 | 147366 pee | 146394 | 147477 avoid | 146392 | 147388 contact | 146392 | 147388 pink | 146372 | 147347 slip | 146372 | 147347 aid | 146367 | 147225 teach | 146366 | 147516 sanit | 146361 | 147477 lice | 146360 | 147409 buri | 146360 | 147483 cold | 146357 | 147220 pour | 146357 | 147220 freez | 146357 | 147220 water | 146357 | 147220 sea | 146347 | 147217 deep | 146347 | 147217 fish | 146347 | 147217 organ | 146321 | 147476 grit | 146289 | 147227 satiat | 146251 | 147349 assimil | 146251 | 147377 tri | 146188 | 147200 seduc | 146188 | 147200 reach | 146132 | 147008 understand | 146132 | 147008 brainwash | 146068 | 147158 admir | 146050 | 147021 caricatur | 145989 | 147107 deriv | 145941 | 146790 pervers | 145941 | 146790 satisfact | 145941 | 146790 moral | 145854 | 146733 lectur | 145854 | 146733 befriend | 145799 | 146963 learn | 145758 | 146666 lesson | 145758 | 146666 play | 145738 | 146706 pinochl | 145738 | 146706 peek | 145698 | 146737 danc | 145555 | 146637 fashion | 78762 | 79203 muddi | 78750 | 79236 hypnot | 78747 | 79204 childlik | 78579 | 79002 loyal | 78575 | 79056 mysteri | 78554 | 79047 annoy | 78532 | 79032 slow | 78517 | 78996 twist | 78515 | 79016 unstabl | 78510 | 78945 feder | 78501 | 78967 rever | 78501 | 79008 wrinkl | 78495 | 78965 rude | 78495 | 78975 boil | 78493 | 78972 high | 78481 | 78940 paid | 78481 | 78940 geosynchron | 78478 | 78931 greasi | 78476 | 78961 cosmopolitan | 78459 | 78903 fat | 78438 | 78935 inciner | 78429 | 78896 dot | 78426 | 78864 polka | 78426 | 78864 polka-dot | 78426 | 78864 outer | 78415 | 78910 phoni | 78411 | 78895 pathet | 78405 | 78869 purpl | 78405 | 78895 frozen | 78403 | 78886 nearest | 78396 | 78879 statesmanlik | 78386 | 78830 dirt | 78376 | 78828 encrust | 78376 | 78828 dirt-encrust | 78376 | 78828 sur | 78371 | 78895 obsequi | 78369 | 78805 salti | 78360 | 78834 imagin | 78356 | 78808 south | 78325 | 78787 american | 78325 | 78787 load | 78318 | 78832 righteous | 78282 | 78760 fractur | 78281 | 78737 educ | 78278 | 78682 colleg | 78278 | 78682 college-educ | 78278 | 78682 mitochondri | 78269 | 78745 treacher | 78265 | 78697 spartan | 78252 | 78707 felin | 78244 | 78713 ravish | 78242 | 78765 patern | 78241 | 78701 psychot | 78238 | 78693 shabbi | 78228 | 78685 dreamlik | 78224 | 78642 loath | 78221 | 78653 self-loath | 78221 | 78653 world | 78203 | 78658 call | 78183 | 78610 so-cal | 78183 | 78610 radioact | 78182 | 78623 alleg | 78178 | 78664 cantanker | 78159 | 78620 makeshift | 78159 | 78648 gentl | 78156 | 78609 fri | 78143 | 78648 linguist | 78141 | 78586 overrip | 78134 | 78572 varig | 78132 | 78609 vapor | 78105 | 78548 impromptu | 78104 | 78569 actual | 78104 | 78592 self-actu | 78104 | 78592 frighten | 78100 | 78544 molten | 78100 | 78567 gratifi | 78098 | 78528 bur | 78094 | 78563 hairi | 78092 | 78563 foreign | 78083 | 78569 tatter | 78050 | 78518 frustrat | 78044 | 78474 stoic | 78036 | 78503 eurasian | 78033 | 78513 proverbi | 78031 | 78519 green | 78024 | 78450 skinni | 78023 | 78524 familiar | 78016 | 78477 optim | 78006 | 78483 bohemian | 78002 | 78500 overpr | 77983 | 78411 pompous | 77955 | 78460 difficult | 77938 | 78375 raspi | 77924 | 78461 soggi | 77912 | 78381 resplend | 77910 | 78351 blotch | 77910 | 78380 fals | 77908 | 78409 infect | 77907 | 78399 magnific | 77898 | 78350 snooti | 77897 | 78422 moron | 77886 | 78362 moldi | 77865 | 78370 precis | 77860 | 78331 crispi | 77856 | 78324 smelli | 77813 | 78279 tempor | 77810 | 78244 alaskan | 77808 | 78258 elus | 77775 | 78245 miser | 77772 | 78232 flatul | 77761 | 78201 orbit | 77723 | 78157 mean-spirit | 77660 | 78113 flabbi | 77649 | 78110 nuclear | 77609 | 78069 go | 15532 | 15545 made | 1 | 1 america | 1 | 1 If you need anything else, let me know! Kris Oleg Bartunov wrote: >Kris, > >could you post 'explain analyze' output ? >Also, could you disable index usage (set enable_indexscan=off) >and rerun search using tsearch2 ? > >also, could you run 'stat' function to see frequency distribution >of words. See http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes >for details. > >Oleg > > >>Hi all. I am doing some work with tsearch2 and am not sure what to >>expect out of it, performance wise. Here is my setup: >> >> Table "public.product" >> Column | Type | Modifiers >>-------------+----------+------------------------------------------------- >> description | text | >> product_id | integer | default nextval('product_product_id_seq'::text) >> vector | tsvector | >>Indexes: >> "vector_idx" gist (vector) >>Triggers: >> tsvectorupdate BEFORE INSERT OR UPDATE ON product FOR EACH ROW EXECUTE PROCEDURE tsearch2('vector', 'description') >> >>This table has 3,000,000 rows in it. Each description field has roughly 50 characters. There are fewer than ten thousanddistinct words in my 3,000,000 rows. The vector was filled using the description fields values. I ran a vacuumfull analyze before executing any of my queries. >> >>Here are a couple of tests I performed using the tsearch index and like; >> >>search_test=# select count(*) from product where vector @@ to_tsquery('oil'); >> count >>-------- >> 226357 >>(1 row) >> >>Time: 191056.230 ms >> >>search_test=# select count(*) from product where vector @@ to_tsquery('hydrogen'); >> count >>-------- >> 226868 >>(1 row) >> >>Time: 306411.957 ms >> >>search_test=# select count(*) from product where description like '% oil %'; >> count >>-------- >> 226357 >>(1 row) >> >>Time: 38426.851 ms >> >>search_test=# select count(*) from product where description like '% hydrogen %'; >> count >>-------- >> 226868 >>(1 row) >> >>Time: 38265.421 ms >> >> >>Both of the likes are using a sequential scan and both of the tsearch queries use the gist index. Did I miss a configurationparameter, are these queries incorrectly using tsearch2,or is this tsearch2's average performance? Thanks inadvance for the input! >> >>Kris >> >> ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
Regardless of caching, the queries are still taking 19~20 seconds to run on the 3,000,000 rows. I've played with performance tuning and nothing seems to make much of a difference. If I am reading that list from stat correctly, then I am operating on 508 distinct words. Is this the performance I should expect from tsearch2? Or is something still awry? I'm inclined to think something else is wrong, after reading some other people's tsearch performance stats. Thanks! Kris
Kris, do you actually have only 508 disctinct words ? Could you try more complex queries, say 2-3 words. Does these queries run faster ? Oleg On Mon, 27 Sep 2004, Kris Kiger wrote: > Regardless of caching, the queries are still taking 19~20 seconds to run > on the 3,000,000 rows. I've played with performance tuning and nothing > seems to make much of a difference. If I am reading that list from stat > correctly, then I am operating on 508 distinct words. Is this the > performance I should expect from tsearch2? Or is something still awry? > I'm inclined to think something else is wrong, after reading some > other people's tsearch performance stats. Thanks! > > Kris > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Oleg, Thanks for the help on this. The query I used to return the 508 number is: SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc desc, word ; Testing says, the more words I use, the faster the query is. My original search word, 'oil', appears in 226,357 documents 233,266 times. As far as distinct words go, 'oil' is middle of the road for occurences. As it is set up now, the best search time I am getting on this single word is roughly 22 seconds. Kris Oleg Bartunov wrote: >Kris, > >do you actually have only 508 disctinct words ? Could you try >more complex queries, say 2-3 words. Does these queries run faster ? > > > Oleg >On Mon, 27 Sep 2004, Kris Kiger wrote: > > > >>Regardless of caching, the queries are still taking 19~20 seconds to run >>on the 3,000,000 rows. I've played with performance tuning and nothing >>seems to make much of a difference. If I am reading that list from stat >>correctly, then I am operating on 508 distinct words. Is this the >>performance I should expect from tsearch2? Or is something still awry? >> I'm inclined to think something else is wrong, after reading some >>other people's tsearch performance stats. Thanks! >> >>Kris >> >> >> > Regards, > Oleg > >
Hello, I might of missed this on a previous message, BUT what type of hardware are we dealing with here? Is it possible that we just don't have enough IO/RAM to push this? J Kris Kiger wrote: > Oleg, > > Thanks for the help on this. > > The query I used to return the 508 number is: > SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc > desc, word ; > Testing says, the more words I use, the faster the query is. My > original search word, 'oil', appears in 226,357 documents 233,266 times. > As far as distinct words go, 'oil' is middle of the road for > occurences. As it is set up now, the best search time I am getting on > this single word is roughly 22 seconds. > Kris > > Oleg Bartunov wrote: > >> Kris, >> >> do you actually have only 508 disctinct words ? Could you try >> more complex queries, say 2-3 words. Does these queries run faster ? >> >> >> Oleg >> On Mon, 27 Sep 2004, Kris Kiger wrote: >> >> >> >>> Regardless of caching, the queries are still taking 19~20 seconds to run >>> on the 3,000,000 rows. I've played with performance tuning and nothing >>> seems to make much of a difference. If I am reading that list from stat >>> correctly, then I am operating on 508 distinct words. Is this the >>> performance I should expect from tsearch2? Or is something still awry? >>> I'm inclined to think something else is wrong, after reading some >>> other people's tsearch performance stats. Thanks! >>> >>> Kris >>> >>> >> >> Regards, >> Oleg >> >> > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL
Attachment
Josh, I am running my tests on a dual processor PIII 1 GHz machine with 2Gb of RAM. There are four hard drives set up. One for OS/logging and three in a raid-5 configuration for the postgres data directory. Kris Joshua D. Drake wrote: > Hello, > I might of missed this on a previous message, BUT what type of > hardware are we dealing with here? Is it possible that we just don't > have enough IO/RAM to push this? > J > Kris Kiger wrote: > >> Oleg, >> Thanks for the help on this. >> The query I used to return the 508 number is: >> SELECT * FROM stat('SELECT vector FROM product') ORDER BY >> ndoc desc, word ; >> Testing says, the more words I use, the faster the query is. My >> original search word, 'oil', appears in 226,357 documents 233,266 >> times. As far as distinct words go, 'oil' is middle of the road for >> occurences. As it is set up now, the best search time I am getting >> on this single word is roughly 22 seconds. Kris >> >> Oleg Bartunov wrote: >> >>> Kris, >>> >>> do you actually have only 508 disctinct words ? Could you try >>> more complex queries, say 2-3 words. Does these queries run faster ? >>> Oleg >>> On Mon, 27 Sep 2004, Kris Kiger wrote: >>> >>>> Regardless of caching, the queries are still taking 19~20 seconds >>>> to run >>>> on the 3,000,000 rows. I've played with performance tuning and >>>> nothing >>>> seems to make much of a difference. If I am reading that list from >>>> stat >>>> correctly, then I am operating on 508 distinct words. Is this the >>>> performance I should expect from tsearch2? Or is something still >>>> awry? >>>> I'm inclined to think something else is wrong, after reading some >>>> other people's tsearch performance stats. Thanks! >>>> >>>> Kris >>> >>> Regards, >>> Oleg >> -- ______________________________ Kris Kiger Software Developer Digonex Technologies, Inc. 317.638.4174 Fax CONFIDENTIALITY NOTICE: The information in this transmission is private, confidential, may be legally privileged, is propertyof the sender and is intended solely for the use of the addressee. If you are not the addressee, you should notread, disclose, distribute, copy, use or rely upon the information contained in this transmission. If you have receivedthis transmission in error please delete or destroy it and notify DIGONEX TECHNOLOGIES, INC. immediately at (317)638-4154.
On Mon, 27 Sep 2004, Kris Kiger wrote: > Oleg, > > Thanks for the help on this. > > The query I used to return the 508 number is: > SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc > desc, word ; > > Testing says, the more words I use, the faster the query is. My > original search word, 'oil', appears in 226,357 documents 233,266 times. > As far as distinct words go, 'oil' is middle of the road for > occurences. As it is set up now, the best search time I am getting on > this single word is roughly 22 seconds. Does this time (22 seconds) is still better than seq. scan (no index) or standard 'LIKE' ? > > Kris > > Oleg Bartunov wrote: > > >Kris, > > > >do you actually have only 508 disctinct words ? Could you try > >more complex queries, say 2-3 words. Does these queries run faster ? > > > > > > Oleg > >On Mon, 27 Sep 2004, Kris Kiger wrote: > > > > > > > >>Regardless of caching, the queries are still taking 19~20 seconds to run > >>on the 3,000,000 rows. I've played with performance tuning and nothing > >>seems to make much of a difference. If I am reading that list from stat > >>correctly, then I am operating on 508 distinct words. Is this the > >>performance I should expect from tsearch2? Or is something still awry? > >> I'm inclined to think something else is wrong, after reading some > >>other people's tsearch performance stats. Thanks! > >> > >>Kris > >> > >> > >> > > Regards, > > Oleg > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Kris Kiger wrote: > Josh, > > I am running my tests on a dual processor PIII 1 GHz machine with 2Gb > of RAM. There are four hard drives set up. One for OS/logging and > three in a raid-5 configuration for the postgres data directory. > What about other factors? effective_cache_size random_page_cost sort_mem FYI that example I gave was on a Dual P4-Xeon with 4 Gigs of ram and a RAID5 over 6 drives. Just curious. J > Kris > > Joshua D. Drake wrote: > >> Hello, >> I might of missed this on a previous message, BUT what type of >> hardware are we dealing with here? Is it possible that we just don't >> have enough IO/RAM to push this? >> J >> Kris Kiger wrote: >> >>> Oleg, >>> Thanks for the help on this. >>> The query I used to return the 508 number is: >>> SELECT * FROM stat('SELECT vector FROM product') ORDER BY >>> ndoc desc, word ; >>> Testing says, the more words I use, the faster the query is. My >>> original search word, 'oil', appears in 226,357 documents 233,266 >>> times. As far as distinct words go, 'oil' is middle of the road for >>> occurences. As it is set up now, the best search time I am getting >>> on this single word is roughly 22 seconds. Kris >>> >>> Oleg Bartunov wrote: >>> >>>> Kris, >>>> >>>> do you actually have only 508 disctinct words ? Could you try >>>> more complex queries, say 2-3 words. Does these queries run faster ? >>>> Oleg >>>> On Mon, 27 Sep 2004, Kris Kiger wrote: >>>> >>>>> Regardless of caching, the queries are still taking 19~20 seconds >>>>> to run >>>>> on the 3,000,000 rows. I've played with performance tuning and >>>>> nothing >>>>> seems to make much of a difference. If I am reading that list from >>>>> stat >>>>> correctly, then I am operating on 508 distinct words. Is this the >>>>> performance I should expect from tsearch2? Or is something still >>>>> awry? >>>>> I'm inclined to think something else is wrong, after reading some >>>>> other people's tsearch performance stats. Thanks! >>>>> >>>>> Kris >>>> >>>> >>>> Regards, >>>> Oleg >>> >>> > > -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL
Attachment
Yes, it is much better than no index of sequential scan. We may just be looking at the best performance tsearch2 can offer on my machine. search_test=# explain analyze SELECT count(q) FROM product, to_tsquery('oil') AS q WHERE vector @@ q; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual time=83311.552..83311.555 rows=1 loops=1) -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32) (actual time=0.204..81960.198 rows=226357 loops=1) Join Filter: ("outer".vector @@ "inner".q) -> Seq Scan on product (cost=0.00..339752.00 rows=3000000 width=32) (actual time=0.100..27415.795 rows=3000000 loops=1) -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual time=0.003..0.006 rows=1 loops=3000000) -> Function Scan on q (cost=0.00..12.50 rows=1000 width=32) (actual time=0.020..0.024 rows=1 loops=1) Total runtime: 83311.735 ms (7 rows) search_test=# explain analyze select count(*) from product where description like '% oil %'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Aggregate (cost=347264.01..347264.01 rows=1 width=0) (actual time=39858.350..39858.353 rows=1 loops=1) -> Seq Scan on product (cost=0.00..347252.00 rows=4801 width=0) (actual time=0.100..38320.293 rows=226357 loops=1) Filter: (description ~~ '% oil %'::text) Total runtime: 39858.491 ms >>Oleg, >> >> Thanks for the help on this. >> >> The query I used to return the 508 number is: >> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc >>desc, word ; >> >> Testing says, the more words I use, the faster the query is. My >>original search word, 'oil', appears in 226,357 documents 233,266 times. >> As far as distinct words go, 'oil' is middle of the road for >>occurences. As it is set up now, the best search time I am getting on >>this single word is roughly 22 seconds. >> >> > >Does this time (22 seconds) is still better than seq. scan (no index) >or standard 'LIKE' ? > > >
On Mon, 27 Sep 2004, Kris Kiger wrote: > Yes, it is much better than no index of sequential scan. We may just be > looking at the best performance tsearch2 can offer on my machine. Hmm, tsearch2 with no index should be faster than LIKE, because tsearch2 already has *parsed* and *sorted* list. It's interesting to fetch just 226,357 documents from disk and see processing time. select count(*) from product limit 226357 offset 1; > > search_test=# explain analyze SELECT count(q) FROM product, > to_tsquery('oil') AS q WHERE vector @@ q; > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual > time=83311.552..83311.555 rows=1 loops=1) > -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32) > (actual time=0.204..81960.198 rows=226357 loops=1) > Join Filter: ("outer".vector @@ "inner".q) > -> Seq Scan on product (cost=0.00..339752.00 rows=3000000 > width=32) (actual time=0.100..27415.795 rows=3000000 loops=1) > -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual > time=0.003..0.006 rows=1 loops=3000000) > -> Function Scan on q (cost=0.00..12.50 rows=1000 > width=32) (actual time=0.020..0.024 rows=1 loops=1) > Total runtime: 83311.735 ms > (7 rows) > > search_test=# explain analyze select count(*) from product where > description like '% oil %'; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------ > Aggregate (cost=347264.01..347264.01 rows=1 width=0) (actual > time=39858.350..39858.353 rows=1 loops=1) > -> Seq Scan on product (cost=0.00..347252.00 rows=4801 width=0) > (actual time=0.100..38320.293 rows=226357 loops=1) > Filter: (description ~~ '% oil %'::text) > Total runtime: 39858.491 ms > > > >>Oleg, > >> > >> Thanks for the help on this. > >> > >> The query I used to return the 508 number is: > >> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc > >>desc, word ; > >> > >> Testing says, the more words I use, the faster the query is. My > >>original search word, 'oil', appears in 226,357 documents 233,266 times. > >> As far as distinct words go, 'oil' is middle of the road for > >>occurences. As it is set up now, the best search time I am getting on > >>this single word is roughly 22 seconds. > >> > >> > > > >Does this time (22 seconds) is still better than seq. scan (no index) > >or standard 'LIKE' ? > > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Kris Kiger wrote: > search_test=# explain analyze SELECT count(q) FROM product, > to_tsquery('oil') AS q WHERE vector @@ q; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual > time=83311.552..83311.555 rows=1 loops=1) > -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32) > (actual time=0.204..81960.198 rows=226357 loops=1) > Join Filter: ("outer".vector @@ "inner".q) > -> Seq Scan on product (cost=0.00..339752.00 rows=3000000 > width=32) (actual time=0.100..27415.795 rows=3000000 loops=1) > -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual > time=0.003..0.006 rows=1 loops=3000000) > -> Function Scan on q (cost=0.00..12.50 rows=1000 > width=32) (actual time=0.020..0.024 rows=1 loops=1) > Total runtime: 83311.735 ms > (7 rows) The explain analyze output doesn't show that a gist index on the vector column is being used. This is because either you don't have an index defined and\or the query is causing a poor plan to be chosen. I've found that putting to_tsquery in the FROM clause does not execute fast. Try rewriting the query as: explain analyze SELECT count(to_tsquery('oil')) FROM product WHERE vector @@ to_tsquery('oil'); or explain analyze SELECT count(*) FROM product WHERE vector @@ to_tsquery('oil'); George Essig
Kris, we're working on prototype of tsearchd - full text search daemon, which maintain static inverted index outside of postgresql using the same parser, dictionary tsearch2 does. This approach could scale up fts capability preserving access to metadata, so yo may have "archive" part of your collection (tsearchd) and "online", which could be searchable with tsearch2. Here is what we have right now: pages ( tid integer, fts_index tsvector) 1. Create index select count(tdindex(tid,fts_index)) from pages; 2. Flush index select tdflush(); 3. Search select pages.tid, rank(fts_index,to_tsquery('supernovae & magellan')) as rank from pages, tdsearch(to_tsquery('supernovae & magellan')) as idx where tid=idx order by rank desc; If it's possible, you could share your data, so we could test our prototype on real data. Oleg On Mon, 27 Sep 2004, Kris Kiger wrote: > Yes, it is much better than no index of sequential scan. We may just be > looking at the best performance tsearch2 can offer on my machine. > > search_test=# explain analyze SELECT count(q) FROM product, > to_tsquery('oil') AS q WHERE vector @@ q; > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual > time=83311.552..83311.555 rows=1 loops=1) > -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32) > (actual time=0.204..81960.198 rows=226357 loops=1) > Join Filter: ("outer".vector @@ "inner".q) > -> Seq Scan on product (cost=0.00..339752.00 rows=3000000 > width=32) (actual time=0.100..27415.795 rows=3000000 loops=1) > -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual > time=0.003..0.006 rows=1 loops=3000000) > -> Function Scan on q (cost=0.00..12.50 rows=1000 > width=32) (actual time=0.020..0.024 rows=1 loops=1) > Total runtime: 83311.735 ms > (7 rows) > > search_test=# explain analyze select count(*) from product where > description like '% oil %'; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------ > Aggregate (cost=347264.01..347264.01 rows=1 width=0) (actual > time=39858.350..39858.353 rows=1 loops=1) > -> Seq Scan on product (cost=0.00..347252.00 rows=4801 width=0) > (actual time=0.100..38320.293 rows=226357 loops=1) > Filter: (description ~~ '% oil %'::text) > Total runtime: 39858.491 ms > > > >>Oleg, > >> > >> Thanks for the help on this. > >> > >> The query I used to return the 508 number is: > >> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc > >>desc, word ; > >> > >> Testing says, the more words I use, the faster the query is. My > >>original search word, 'oil', appears in 226,357 documents 233,266 times. > >> As far as distinct words go, 'oil' is middle of the road for > >>occurences. As it is set up now, the best search time I am getting on > >>this single word is roughly 22 seconds. > >> > >> > > > >Does this time (22 seconds) is still better than seq. scan (no index) > >or standard 'LIKE' ? > > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Oleg, the data I have right now was generated using a random paragraph generator. The words are real words, but there are only 508 distinct keywords in the 3,000,000 records that tsearch2 will pick up, using default settings. I was using this data set for the purpose of testing tsearch2's capabilities, so it's not real world data. If you still want it, let me know where to send it and I will send you a dump of the DB. Kris Oleg Bartunov wrote: >Kris, > >we're working on prototype of tsearchd - full text search daemon, which >maintain static inverted index outside of postgresql using the same >parser, dictionary tsearch2 does. This approach could scale up >fts capability preserving access to metadata, so yo may have >"archive" part of your collection (tsearchd) and "online", which could be >searchable with tsearch2. > >Here is what we have right now: > >pages ( tid integer, fts_index tsvector) > >1. Create index >select count(tdindex(tid,fts_index)) from pages; >2. Flush index >select tdflush(); >3. Search >select pages.tid, rank(fts_index,to_tsquery('supernovae & magellan')) as rank >from pages, tdsearch(to_tsquery('supernovae & magellan')) as idx where >tid=idx order by rank desc; > >If it's possible, you could share your data, so we could test our >prototype on real data. > > > Oleg > >
Kris, is't possible to share your "paragraph generator" ? It'd be better than downloading the whole dump :) On Wed, 29 Sep 2004, Kris Kiger wrote: > Oleg, the data I have right now was generated using a random paragraph > generator. The words are real words, but there are only 508 distinct > keywords in the 3,000,000 records that tsearch2 will pick up, using > default settings. I was using this data set for the purpose of testing > tsearch2's capabilities, so it's not real world data. If you still want > it, let me know where to send it and I will send you a dump of the DB. > > Kris > > > > Oleg Bartunov wrote: > > >Kris, > > > >we're working on prototype of tsearchd - full text search daemon, which > >maintain static inverted index outside of postgresql using the same > >parser, dictionary tsearch2 does. This approach could scale up > >fts capability preserving access to metadata, so yo may have > >"archive" part of your collection (tsearchd) and "online", which could be > >searchable with tsearch2. > > > >Here is what we have right now: > > > >pages ( tid integer, fts_index tsvector) > > > >1. Create index > >select count(tdindex(tid,fts_index)) from pages; > >2. Flush index > >select tdflush(); > >3. Search > >select pages.tid, rank(fts_index,to_tsquery('supernovae & magellan')) as rank > >from pages, tdsearch(to_tsquery('supernovae & magellan')) as idx where > >tid=idx order by rank desc; > > > >If it's possible, you could share your data, so we could test our > >prototype on real data. > > > > > > Oleg > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Hey all, its me again. If I do not do a count(product_id) on my tsearch2 queries, its actually really fast, for example; explain analyze SELECT product_id FROM product, to_tsquery('bear') AS q WHERE vector @@ q LIMIT 1000; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..2081.60 rows=1000 width=4) (actual time=2.308..51.522 rows=1000 loops=1) -> Nested Loop (cost=0.00..6244798.55 rows=3000001 width=4) (actual time=2.299..45.637 rows=1000 loops=1) -> Function Scan on q (cost=0.00..12.50 rows=1000 width=32) (actual time=0.023..0.023 rows=1 loops=1) -> Index Scan using vector_idx on product (cost=0.00..6207.29 rows=3000 width=36) (actual time=2.253..37.946 rows=1000 loops=1) Index Cond: (product.vector @@ "outer".q) Filter: (product.vector @@ "outer".q) Total runtime: 122.487 ms explain analyze SELECT product_id FROM product, to_tsquery('complex') AS q WHERE vector @@ q LIMIT 1000; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..2081.60 rows=1000 width=4) (actual time=4.943..2325.949 rows=1000 loops=1) -> Nested Loop (cost=0.00..6244798.55 rows=3000001 width=4) (actual time=4.933..2319.885 rows=1000 loops=1) -> Function Scan on q (cost=0.00..12.50 rows=1000 width=32) (actual time=0.040..0.040 rows=1 loops=1) -> Index Scan using vector_idx on product (cost=0.00..6207.29 rows=3000 width=36) (actual time=4.868..2311.087 rows=1000 loops=1) Index Cond: (product.vector @@ "outer".q) Filter: (product.vector @@ "outer".q) Total runtime: 2329.389 ms From what I know, there is only one reason I can offer why a count takes approximately 30~40 seconds longer on these same queries... that is that count has to evaluate whether a value is null or not. There probably is a better reason, if anyone has any ideas, I would much appreciate you sharing! Also, why the big difference in query times in the above? bear appears 780963 times in 696668 documents complex appears 468669 times in 440339 documents. Again, thanks all! Kris
On Fri, 1 Oct 2004, Kris Kiger wrote: > Hey all, its me again. If I do not do a count(product_id) on my > tsearch2 queries, its actually really fast, for example; > Hmm, I also really want to know what's the difference ? Postgresql 8.0beta3 on Linux 2.4.25 tsearchd=# explain analyze select body from txt where fts_index @@ to_tsquery('oil') limit 1000; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..4027.67 rows=1000 width=315) (actual time=0.053..14.662 rows=1000 loops=1) -> Index Scan using fts_idx on txt (cost=0.00..12083.02 rows=3000 width=315) (actual time=0.049..12.552 rows=1000 loops=1) Index Cond: (fts_index @@ '\'oil\''::tsquery) Total runtime: 15.848 ms (4 rows) tsearchd=# explain analyze select count(body) from txt where fts_index @@ to_tsquery('oil') limit 1000; Didn't get result after 10 minutes :( Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83