Thread: Nested Loop trouble: execution time increases more than 1000-fold (long)
Hello,

We are using postgresql in a search engine on an intranet handling
thousands of documents, but we have a big problem when users use more
than two search keys.

There are more tables around, but the heart of the search engine is
made of three tables:

fiches   (f_id int4, f_title varchar)       52445 rows
engine   (f_id int4, k_id int4, weight)  11761700 rows
keywords (k_id, keyword)                  1072600 rows

A "fiche" is linked to any kind of document. The engine table counts
how many times a keyword appears in a document.

A query to search on one or two keywords is quick to execute (the
front-end creates those queries):

---------------------------------------------------------------------
select count(distinct f.f_id) as results
FROM fiches f
INNER JOIN engine e1 INNER JOIN keywords k1 USING (k_id) USING (f_id)
INNER JOIN engine e2 INNER JOIN keywords k2 USING (k_id) USING (f_id)
WHERE TRUE
AND k1.keyword like 'maintenance%'
AND k2.keyword like 'exploitation%';

QUERY PLAN
Aggregate  (cost=3953.00..3953.00 rows=1 width=4) (actual time=525.243..525.243 rows=1 loops=1)
  ->  Nested Loop  (cost=1974.79..3952.99 rows=1 width=4) (actual time=211.570..513.758 rows=6879 loops=1)
        ->  Hash Join  (cost=1974.79..3949.62 rows=1 width=8) (actual time=211.483..389.340 rows=6879 loops=1)
              Hash Cond: ("outer".f_id = "inner".f_id)
              ->  Nested Loop  (cost=0.00..1974.76 rows=11 width=4) (actual time=0.132..155.499 rows=9520 loops=1)
                    ->  Index Scan using keyword_pattern_key on keywords k2  (cost=0.00..3.51 rows=1 width=4) (actual time=0.078..1.887 rows=75 loops=1)
                          Index Cond: (((keyword)::text ~>=~ 'exploitation'::character varying) AND ((keyword)::text ~<~ 'exploitatioo'::character varying))
                          Filter: ((keyword)::text ~~ 'exploitation%'::text)
                    ->  Index Scan using k_id_key on engine e2  (cost=0.00..1954.93 rows=1306 width=8) (actual time=0.049..1.842 rows=127 loops=75)
                          Index Cond: (e2.k_id = "outer".k_id)
              ->  Hash  (cost=1974.76..1974.76 rows=11 width=4) (actual time=211.203..211.203 rows=0 loops=1)
                    ->  Nested Loop  (cost=0.00..1974.76 rows=11 width=4) (actual time=0.296..197.590 rows=11183 loops=1)
                          ->  Index Scan using keyword_pattern_key on keywords k1  (cost=0.00..3.51 rows=1 width=4) (actual time=0.189..1.351 rows=73 loops=1)
                                Index Cond: (((keyword)::text ~>=~ 'maintenance'::character varying) AND ((keyword)::text ~<~ 'maintenancf'::character varying))
                                Filter: ((keyword)::text ~~ 'maintenance%'::text)
                          ->  Index Scan using k_id_key on engine e1  (cost=0.00..1954.93 rows=1306 width=8) (actual time=0.029..2.406 rows=153 loops=73)
                                Index Cond: (e1.k_id = "outer".k_id)
        ->  Index Scan using fiches_pkey on fiches f  (cost=0.00..3.36 rows=1 width=4) (actual time=0.013..0.014 rows=1 loops=6879)
              Index Cond: (f.f_id = "outer".f_id)
Total runtime: 525.511 ms
--------------------------------------------------------------------------

But when there are three keywords or more, the planner chooses to
perform a very costly nested loop:

--------------------------------------------------------------------------
select count(distinct f.f_id) as results
FROM fiches f
INNER JOIN engine e1 INNER JOIN keywords k1 USING (k_id) USING (f_id)
INNER JOIN engine e2 INNER JOIN keywords k2 USING (k_id) USING (f_id)
INNER JOIN engine e3 INNER JOIN keywords k3 USING (k_id) USING (f_id)
WHERE TRUE
AND k1.keyword like 'maintenance%'
AND k2.keyword like 'exploitation%'
AND k3.keyword like 'numerique%';

QUERY PLAN
Aggregate  (cost=5927.90..5927.90 rows=1 width=4) (actual time=673048.168..673048.169 rows=1 loops=1)
  ->  Nested Loop  (cost=1974.79..5927.90 rows=1 width=4) (actual time=1853.789..673038.065 rows=2929 loops=1)
        ->  Nested Loop  (cost=1974.79..5924.52 rows=1 width=12) (actual time=1853.719..672881.725 rows=2929 loops=1)
              Join Filter: ("inner".f_id = "outer".f_id)
              ->  Hash Join  (cost=1974.79..3949.62 rows=1 width=8) (actual time=198.845..441.947 rows=6879 loops=1)
                    Hash Cond: ("outer".f_id = "inner".f_id)
                    ->  Nested Loop  (cost=0.00..1974.76 rows=11 width=4) (actual time=0.129..199.895 rows=9520 loops=1)
                          ->  Index Scan using keyword_pattern_key on keywords k2  (cost=0.00..3.51 rows=1 width=4) (actual time=0.077..1.918 rows=75 loops=1)
                                Index Cond: (((keyword)::text ~>=~ 'exploitation'::character varying) AND ((keyword)::text ~<~ 'exploitatioo'::character varying))
                                Filter: ((keyword)::text ~~ 'exploitation%'::text)
                          ->  Index Scan using k_id_key on engine e2  (cost=0.00..1954.93 rows=1306 width=8) (actual time=0.035..2.342 rows=127 loops=75)
                                Index Cond: (e2.k_id = "outer".k_id)
                    ->  Hash  (cost=1974.76..1974.76 rows=11 width=4) (actual time=198.650..198.650 rows=0 loops=1)
                          ->  Nested Loop  (cost=0.00..1974.76 rows=11 width=4) (actual time=0.174..187.216 rows=11183 loops=1)
                                ->  Index Scan using keyword_pattern_key on keywords k1  (cost=0.00..3.51 rows=1 width=4) (actual time=0.113..1.222 rows=73 loops=1)
                                      Index Cond: (((keyword)::text ~>=~ 'maintenance'::character varying) AND ((keyword)::text ~<~ 'maintenancf'::character varying))
                                      Filter: ((keyword)::text ~~ 'maintenance%'::text)
                                ->  Index Scan using k_id_key on engine e1  (cost=0.00..1954.93 rows=1306 width=8) (actual time=0.029..2.311 rows=153 loops=73)
                                      Index Cond: (e1.k_id = "outer".k_id)
              ->  Nested Loop  (cost=0.00..1974.76 rows=11 width=4) (actual time=0.087..90.165 rows=9553 loops=6879)
                    ->  Index Scan using keyword_pattern_key on keywords k3  (cost=0.00..3.51 rows=1 width=4) (actual time=0.049..0.628 rows=49 loops=6879)
                          Index Cond: (((keyword)::text ~>=~ 'numerique'::character varying) AND ((keyword)::text ~<~ 'numeriquf'::character varying))
                          Filter: ((keyword)::text ~~ 'numerique%'::text)
                    ->  Index Scan using k_id_key on engine e3  (cost=0.00..1954.93 rows=1306 width=8) (actual time=0.023..1.544 rows=195 loops=337071)
                          Index Cond: (e3.k_id = "outer".k_id)
        ->  Index Scan using fiches_pkey on fiches f  (cost=0.00..3.36 rows=1 width=4) (actual time=0.041..0.043 rows=1 loops=2929)
              Index Cond: (f.f_id = "outer".f_id)
Total runtime: 673048.405 ms
----------------------------------------------------------------------

More than 10 minutes!

Is there a specific reason the planner chooses this way?
Can we do something in the postgresql configuration to avoid this?
Can we force the planner to use a hash join as it does for the first
joins?

Regards,
Antoine Bajolet
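One standard way to probe this kind of planner choice, sketched here
untested against this data set: nested-loop joins can be disabled for a
single session, which reveals what plan (and runtime) the planner falls
back to. This is a diagnostic knob rather than a production fix.

----------------------------------------------------------------------
-- Diagnostic sketch: forbid nested loops for this session only, then
-- re-run the slow query under EXPLAIN ANALYZE to compare plans.
SET enable_nestloop TO off;

EXPLAIN ANALYZE
select count(distinct f.f_id) as results
FROM fiches f
INNER JOIN engine e1 INNER JOIN keywords k1 USING (k_id) USING (f_id)
INNER JOIN engine e2 INNER JOIN keywords k2 USING (k_id) USING (f_id)
INNER JOIN engine e3 INNER JOIN keywords k3 USING (k_id) USING (f_id)
WHERE k1.keyword like 'maintenance%'
  AND k2.keyword like 'exploitation%'
  AND k3.keyword like 'numerique%';

RESET enable_nestloop;  -- restore the default planner behaviour
----------------------------------------------------------------------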
On Sat, 2005-09-17 at 17:47 +0200, Antoine Bajolet wrote:
> There are more tables around, but the heart of the search engine is
> made of three tables:
>
> fiches   (f_id int4, f_title varchar)       52445 rows
> engine   (f_id int4, k_id int4, weight)  11761700 rows
> keywords (k_id, keyword)                  1072600 rows
>
> A "fiche" is linked to any kind of document.
> The engine table counts how many times a keyword appears in a document.
>
> A query to search on one or two keywords is quick to execute (the
> front-end creates those queries):
>
> Is there a specific reason the planner chooses this way?

Yes, you have an additional join for each new keyword, so there is
more work to do.

Recode your SQL with an IN subselect that retrieves all possible
keywords before it accesses the larger table. That way you should have
only one join for each new keyword.

> Can we do something in the postgresql configuration to avoid this?
> Can we force the planner to use a hash join as it does for the first
> joins?

Not required, IMHO.

Best Regards,
Simon Riggs
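One plausible reading of this suggestion, sketched here without having
been run against this schema: resolve each keyword pattern to its k_id
set first, and add one f_id membership test per additional keyword, so
the large engine table is probed once per keyword instead of being
re-joined row by row.

----------------------------------------------------------------------
-- Untested sketch of the IN-subselect rewrite for two keywords; each
-- further keyword would add one more "AND e1.f_id IN (...)" block.
SELECT count(distinct e1.f_id) AS results
FROM engine e1
WHERE e1.k_id IN (SELECT k_id FROM keywords
                  WHERE keyword LIKE 'maintenance%')
  AND e1.f_id IN (SELECT f_id FROM engine
                  WHERE k_id IN (SELECT k_id FROM keywords
                                 WHERE keyword LIKE 'exploitation%'));
----------------------------------------------------------------------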
Antoine Bajolet <antoine.bajolet@free.fr> writes:
> We are using postgresql in a search engine on an intranet handling
> thousands of documents.
> But we have a big problem when users use more than two search keys.

I think you need to increase the statistics targets for your keywords
table --- the estimates of numbers of matching rows are much too small:

>   ->  Index Scan using keyword_pattern_key on keywords k2
>       (cost=0.00..3.51 rows=1 width=4) (actual time=0.078..1.887 rows=75 loops=1)
>         Index Cond: (((keyword)::text ~>=~ 'exploitation'::character varying)
>                  AND ((keyword)::text ~<~ 'exploitatioo'::character varying))
>         Filter: ((keyword)::text ~~ 'exploitation%'::text)

A factor-of-75 error is quite likely to mislead the planner into
choosing a bad join plan.

BTW, have you looked into using a real full-text-search engine (eg,
tsearch2) instead of rolling your own like this?

			regards, tom lane
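For reference, a raised statistics target only takes effect after the
table is analyzed again; a minimal sketch (the target value 100 is the
one that ends up being used later in this thread):

----------------------------------------------------------------------
-- Raise the per-column statistics target, then re-collect statistics
-- so the planner actually sees the finer-grained histogram.
ALTER TABLE keywords ALTER COLUMN keyword SET STATISTICS 100;
ANALYZE keywords;
----------------------------------------------------------------------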
Hello,

Tom Lane wrote:
> Antoine Bajolet <antoine.bajolet@free.fr> writes:
>> We are using postgresql in a search engine on an intranet handling
>> thousands of documents.
>> But we have a big problem when users use more than two search keys.
>
> I think you need to increase the statistics targets for your keywords
> table --- the estimates of numbers of matching rows are much too small:

What value do you think I could put into an ALTER TABLE SET STATISTICS
statement?

Also, the solution given by Simon Riggs works well.
<quote>
Recode your SQL with an IN subselect that retrieves all possible
keywords before it accesses the larger table.
</quote>
But I will try the old queries with the increased statistics parameter
and compare performance.

>>   ->  Index Scan using keyword_pattern_key on keywords k2
>>       (cost=0.00..3.51 rows=1 width=4) (actual time=0.078..1.887 rows=75 loops=1)
>>         Index Cond: (((keyword)::text ~>=~ 'exploitation'::character varying)
>>                  AND ((keyword)::text ~<~ 'exploitatioo'::character varying))
>>         Filter: ((keyword)::text ~~ 'exploitation%'::text)
>
> A factor-of-75 error is quite likely to mislead the planner into
> choosing a bad join plan.
>
> BTW, have you looked into using a real full-text-search engine (eg,
> tsearch2) instead of rolling your own like this?

It seems a quite good contrib, but...

The first version of this search engine was developed in 2000...
neither tsearch2 nor tsearch existed at that time.

Also, there has been some development work around this search engine
(pertinence algorithm, filtering with user rights, weighting keywords
with rules specific to each type of document, etc.), and adapting all
of it to work the same way with tsearch2 seems a bit heavy.

Finally, each indexed document is quite big, and the chosen method
reduces disk storage: 1 GB of text content comes down to ~100 MB of
table space.

Best Regards,
Antoine Bajolet
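For readers weighing the same trade-off, the tsearch2 route would look
roughly as below. This is an illustrative sketch only, written against
the full-text API as it was later integrated into core PostgreSQL: the
column name f_tsv, the 'simple' configuration, and the index choice are
assumptions for the example, and note that it matches whole lexemes,
not prefixes the way the LIKE 'word%' queries above do.

----------------------------------------------------------------------
-- Illustrative sketch of a tsearch2-style index on document titles
-- (assumed column f_tsv; a real setup would index the full document
-- text and keep f_tsv current with a trigger).
ALTER TABLE fiches ADD COLUMN f_tsv tsvector;
UPDATE fiches SET f_tsv = to_tsvector('simple', f_title);
CREATE INDEX fiches_tsv_idx ON fiches USING gist (f_tsv);

SELECT count(*) AS results
FROM fiches
WHERE f_tsv @@ to_tsquery('simple', 'maintenance & exploitation & numerique');
----------------------------------------------------------------------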
Re,

After modifying the parameters like this:

ALTER TABLE keywords ALTER keyword SET STATISTICS 100;
ALTER TABLE keywords ALTER k_id SET STATISTICS 100;
ALTER TABLE engine ALTER k_id SET STATISTICS 100;
ALTER TABLE engine ALTER f_id SET STATISTICS 100;

vacuuming both tables, and rewriting the queries using sub-selects:

select count(distinct f.f_id) as results
FROM fiches f
INNER JOIN (SELECT distinct f_id
            FROM keywords, engine
            WHERE engine.k_id = keywords.k_id
              AND keyword like 'exploitation%') as e1 USING (f_id)
INNER JOIN (SELECT distinct f_id
            FROM keywords, engine
            WHERE engine.k_id = keywords.k_id
              AND keyword like 'maintenance%') as e2 USING (f_id)
INNER JOIN (SELECT distinct f_id
            FROM keywords, engine
            WHERE engine.k_id = keywords.k_id
              AND keyword like 'numerique%') as e3 USING (f_id);

the query time is less than 600 ms, and it increases only a little as
more keywords are added.

Thanks to Tom Lane and Simon Riggs.

Best regards,
Antoine Bajolet

Antoine Bajolet wrote:
> Hello,
>
> Tom Lane wrote:
>> Antoine Bajolet <antoine.bajolet@free.fr> writes:
>>> We are using postgresql in a search engine on an intranet handling
>>> thousands of documents.
>>> But we have a big problem when users use more than two search keys.
>>
>> I think you need to increase the statistics targets for your keywords
>> table --- the estimates of numbers of matching rows are much too small:
>>
> What value do you think I could put into an ALTER TABLE SET STATISTICS
> statement?
>
> Also, the solution given by Simon Riggs works well.
> <quote>
> Recode your SQL with an IN subselect that retrieves all possible
> keywords before it accesses the larger table.
> </quote>
>
> But I will try the old queries with the increased statistics parameter
> and compare performance.
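A possible variant of the same idea, sketched untested: pushing each
additional keyword into an EXISTS predicate avoids the per-keyword
DISTINCT pass; whether it plans better than the subselect-join form
above would have to be measured on this data.

----------------------------------------------------------------------
-- Untested EXISTS variant; each extra keyword adds one EXISTS block.
SELECT count(distinct e1.f_id) AS results
FROM engine e1
JOIN keywords k1 USING (k_id)
WHERE k1.keyword LIKE 'exploitation%'
  AND EXISTS (SELECT 1
              FROM engine e2 JOIN keywords k2 USING (k_id)
              WHERE e2.f_id = e1.f_id
                AND k2.keyword LIKE 'maintenance%')
  AND EXISTS (SELECT 1
              FROM engine e3 JOIN keywords k3 USING (k_id)
              WHERE e3.f_id = e1.f_id
                AND k3.keyword LIKE 'numerique%');
----------------------------------------------------------------------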