Thread: Nested Loop trouble : Execution time increases more 1000 time (long)

Nested Loop trouble : Execution time increases more 1000 time (long)

From
Antoine Bajolet
Date:
Hello,

We are using postgresql in a search engine on an intranet handling
throusand of documents.
But we ave a big problem when users use more than two search key.

There are more tables around, but the heart of  the search engine is
made of three tables :

fiches (f_id int4, f_title varchar)          52445 rows
engine (f_id int4, k_id int4, weight )    11761700 rows
keywords(k_id, keyword)                    1072600 rows

A "fiche" is linked to any kind of document.
The engine table counts how many times a keyword appears in a document.

A query to search on one or two keywords is quick to execute (the
front-end creates thoses queries):

---------------------------------------------------------------------
select count (distinct f.f_id) as results
FROM
fiches f
 INNER JOIN engine e1 INNER JOIN keywords k1 USING (k_id) USING (f_id)
 INNER JOIN engine e2 INNER JOIN keywords k2 USING (k_id) USING (f_id)

WHERE TRUE
AND k1.keyword like 'maintenance%'
AND k2.keyword like 'exploitation%'
;

QUERY PLAN
Aggregate  (cost=3953.00..3953.00 rows=1 width=4) (actual
time=525.243..525.243 rows=1 loops=1)
  ->  Nested Loop  (cost=1974.79..3952.99 rows=1 width=4) (actual
time=211.570..513.758 rows=6879 loops=1)
        ->  Hash Join  (cost=1974.79..3949.62 rows=1 width=8) (actual
time=211.483..389.340 rows=6879 loops=1)
              Hash Cond: ("outer".f_id = "inner".f_id)
              ->  Nested Loop  (cost=0.00..1974.76 rows=11 width=4)
(actual time=0.132..155.499 rows=9520 loops=1)
                    ->  Index Scan using keyword_pattern_key on keywords
k2  (cost=0.00..3.51 rows=1 width=4) (actual time=0.078..1.887 rows=75
loops=1)
                          Index Cond: (((keyword)::text ~>=~
'exploitation'::character varying) AND ((keyword)::text ~<~
'exploitatioo'::character varying))
                          Filter: ((keyword)::text ~~ 'exploitation%'::text)
                    ->  Index Scan using k_id_key on engine e2
(cost=0.00..1954.93 rows=1306 width=8) (actual time=0.049..1.842
rows=127 loops=75)
                          Index Cond: (e2.k_id = "outer".k_id)
              ->  Hash  (cost=1974.76..1974.76 rows=11 width=4) (actual
time=211.203..211.203 rows=0 loops=1)
                    ->  Nested Loop  (cost=0.00..1974.76 rows=11
width=4) (actual time=0.296..197.590 rows=11183 loops=1)
                          ->  Index Scan using keyword_pattern_key on
keywords k1  (cost=0.00..3.51 rows=1 width=4) (actual time=0.189..1.351
rows=73 loops=1)
                                Index Cond: (((keyword)::text ~>=~
'maintenance'::character varying) AND ((keyword)::text ~<~
'maintenancf'::character varying))
                                Filter: ((keyword)::text ~~
'maintenance%'::text)
                          ->  Index Scan using k_id_key on engine e1
(cost=0.00..1954.93 rows=1306 width=8) (actual time=0.029..2.406
rows=153 loops=73)
                                Index Cond: (e1.k_id = "outer".k_id)
        ->  Index Scan using fiches_pkey on fiches f  (cost=0.00..3.36
rows=1 width=4) (actual time=0.013..0.014 rows=1 loops=6879)
              Index Cond: (f.f_id = "outer".f_id)
Total runtime: 525.511 ms
--------------------------------------------------------------------------


But when there are three keywords or more, the planner chooses to
perform a very costly nested loop :

--------------------------------------------------------------------------
select count (distinct f.f_id) as results
FROM
fiches f
 INNER JOIN engine e1 INNER JOIN keywords k1 USING (k_id) USING (f_id)
 INNER JOIN engine e2 INNER JOIN keywords k2 USING (k_id) USING (f_id)
 INNER JOIN engine e3 INNER JOIN keywords k3 USING (k_id) USING (f_id)

WHERE TRUE
AND k1.keyword like 'maintenance%'
AND k2.keyword like 'exploitation%'
AND k3.keyword like 'numerique%'
;

QUERY PLAN
Aggregate  (cost=5927.90..5927.90 rows=1 width=4) (actual
time=673048.168..673048.169 rows=1 loops=1)
  ->  Nested Loop  (cost=1974.79..5927.90 rows=1 width=4) (actual
time=1853.789..673038.065 rows=2929 loops=1)
        ->  Nested Loop  (cost=1974.79..5924.52 rows=1 width=12) (actual
time=1853.719..672881.725 rows=2929 loops=1)
              Join Filter: ("inner".f_id = "outer".f_id)
              ->  Hash Join  (cost=1974.79..3949.62 rows=1 width=8)
(actual time=198.845..441.947 rows=6879 loops=1)
                    Hash Cond: ("outer".f_id = "inner".f_id)
                    ->  Nested Loop  (cost=0.00..1974.76 rows=11
width=4) (actual time=0.129..199.895 rows=9520 loops=1)
                          ->  Index Scan using keyword_pattern_key on
keywords k2  (cost=0.00..3.51 rows=1 width=4) (actual time=0.077..1.918
rows=75 loops=1)
                                Index Cond: (((keyword)::text ~>=~
'exploitation'::character varying) AND ((keyword)::text ~<~
'exploitatioo'::character varying))
                                Filter: ((keyword)::text ~~
'exploitation%'::text)
                          ->  Index Scan using k_id_key on engine e2
(cost=0.00..1954.93 rows=1306 width=8) (actual time=0.035..2.342
rows=127 loops=75)
                                Index Cond: (e2.k_id = "outer".k_id)
                    ->  Hash  (cost=1974.76..1974.76 rows=11 width=4)
(actual time=198.650..198.650 rows=0 loops=1)
                          ->  Nested Loop  (cost=0.00..1974.76 rows=11
width=4) (actual time=0.174..187.216 rows=11183 loops=1)
                                ->  Index Scan using keyword_pattern_key
on keywords k1  (cost=0.00..3.51 rows=1 width=4) (actual
time=0.113..1.222 rows=73 loops=1)
                                      Index Cond: (((keyword)::text ~>=~
'maintenance'::character varying) AND ((keyword)::text ~<~
'maintenancf'::character varying))
                                      Filter: ((keyword)::text ~~
'maintenance%'::text)
                                ->  Index Scan using k_id_key on engine
e1  (cost=0.00..1954.93 rows=1306 width=8) (actual time=0.029..2.311
rows=153 loops=73)
                                      Index Cond: (e1.k_id = "outer".k_id)
              ->  Nested Loop  (cost=0.00..1974.76 rows=11 width=4)
(actual time=0.087..90.165 rows=9553 loops=6879)
                    ->  Index Scan using keyword_pattern_key on keywords
k3  (cost=0.00..3.51 rows=1 width=4) (actual time=0.049..0.628 rows=49
loops=6879)
                          Index Cond: (((keyword)::text ~>=~
'numerique'::character varying) AND ((keyword)::text ~<~
'numeriquf'::character varying))
                          Filter: ((keyword)::text ~~ 'numerique%'::text)
                    ->  Index Scan using k_id_key on engine e3
(cost=0.00..1954.93 rows=1306 width=8) (actual time=0.023..1.544
rows=195 loops=337071)
                          Index Cond: (e3.k_id = "outer".k_id)
        ->  Index Scan using fiches_pkey on fiches f  (cost=0.00..3.36
rows=1 width=4) (actual time=0.041..0.043 rows=1 loops=2929)
              Index Cond: (f.f_id = "outer".f_id)
Total runtime: 673048.405 ms
----------------------------------------------------------------------
More than 10 minutes !

Is there a specific reason the planner chooses this way ?
Can whe do something on the postgresql configuration to avoid this ?
Can whe force the planner to use a hash join as it does for the first
joins ?

Regards,
Antoine Bajolet



Re: Nested Loop trouble : Execution time increases more

From
Simon Riggs
Date:
On Sat, 2005-09-17 at 17:47 +0200, Antoine Bajolet wrote:

> There are more tables around, but the heart of  the search engine is
> made of three tables :
>
> fiches (f_id int4, f_title varchar)          52445 rows
> engine (f_id int4, k_id int4, weight )    11761700 rows
> keywords(k_id, keyword)                    1072600 rows
>
> A "fiche" is linked to any kind of document.
> The engine table counts how many times a keyword appears in a document.
>
> A query to search on one or two keywords is quick to execute (the
> front-end creates thoses queries):
>

> Is there a specific reason the planner chooses this way ?

Yes, you have an additional join for each new keyword, so there is more
work to do.

Recode your SQL with an IN subselect that retrieves all possible
keywords before it accesses the larger table.

That way you should have only one join for each new keyword.

> Can whe do something on the postgresql configuration to avoid this ?
> Can whe force the planner to use a hash join as it does for the first
> joins ?

Not required, IMHO.

Best Regards, Simon Riggs



Re: Nested Loop trouble : Execution time increases more 1000 time (long)

From
Tom Lane
Date:
Antoine Bajolet <antoine.bajolet@free.fr> writes:
> We are using postgresql in a search engine on an intranet handling
> throusand of documents.
> But we ave a big problem when users use more than two search key.

I think you need to increase the statistics targets for your keywords
table --- the estimates of numbers of matching rows are much too small:

>                     ->  Index Scan using keyword_pattern_key on keywords
> k2  (cost=0.00..3.51 rows=1 width=4) (actual time=0.078..1.887 rows=75
> loops=1)
>                           Index Cond: (((keyword)::text ~>=~
> 'exploitation'::character varying) AND ((keyword)::text ~<~
> 'exploitatioo'::character varying))
>                           Filter: ((keyword)::text ~~ 'exploitation%'::text)

A factor-of-75 error is quite likely to mislead the planner into
choosing a bad join plan.

BTW, have you looked into using a real full-text-search engine (eg,
tsearch2) instead of rolling your own like this?

            regards, tom lane

Re: Nested Loop trouble : Execution time increases more

From
Antoine Bajolet
Date:
Hello,

Tom Lane a écrit :

>Antoine Bajolet <antoine.bajolet@free.fr> writes:
>
>
>>We are using postgresql in a search engine on an intranet handling
>>throusand of documents.
>>But we ave a big problem when users use more than two search key.
>>
>>
>
>I think you need to increase the statistics targets for your keywords
>table --- the estimates of numbers of matching rows are much too small:
>
>
What value you think i could put into a ALTER TABLE SET STATISTICS
statment ?

Also, the solution given by Simon Riggs works well.
<quote>

Recode your SQL with an IN subselect that retrieves all possible
keywords before it accesses the larger table.
</quote>

But i will try the old ones increasing the statistics parameter and compare performance.


>
>
>>                    ->  Index Scan using keyword_pattern_key on keywords
>>k2  (cost=0.00..3.51 rows=1 width=4) (actual time=0.078..1.887 rows=75
>>loops=1)
>>                          Index Cond: (((keyword)::text ~>=~
>>'exploitation'::character varying) AND ((keyword)::text ~<~
>>'exploitatioo'::character varying))
>>                          Filter: ((keyword)::text ~~ 'exploitation%'::text)
>>
>>
>
>A factor-of-75 error is quite likely to mislead the planner into
>choosing a bad join plan.
>
>BTW, have you looked into using a real full-text-search engine (eg,
>tsearch2) instead of rolling your own like this?
>
>
It seems a quite good contrib, but...
The first version of this search engine was developped in 2000...
tsearch2 nor tsearch existed at this time.
Also, there are some developpement works around this search engine
(pertinence algorithm, filtering with users rights, ponderating keywords
with specific rules to each type of document, etc.) and adapting all to
work in the similar way with tsearch2 seems to be a bit heavy.
At the end, each document indexed are quite big and the choosen method
reduces disk storage : 1 Go of text content traduces to ~100 Mo of table
space.

Best Regards,
Antoine Bajolet



Re: Nested Loop trouble : Execution time increases more

From
Antoine Bajolet
Date:
Re,

With modifing parameters like this :

ALTER TABLE keywords ALTER keyword SET STATISTICS 100;
ALTER TABLE keywords ALTER k_id SET STATISTICS 100;
ALTER TABLE engine ALTER k_id SET STATISTICS 100;
ALTER TABLE engine ALTER f_id SET STATISTICS 100;

vacuuming both tables
and rewriting the queries using sub-selects :

select count (distinct f.f_id) as results
FROM
fiches f
INNER JOIN (SELECT distinct f_id FROM keywords,engine WHERE engine.k_id
= keywords.k_id AND keyword like 'exploitation%') as e1 USING(f_id)
INNER JOIN (SELECT distinct f_id FROM keywords,engine WHERE engine.k_id
= keywords.k_id AND keyword like 'maintenance%') as e2 USING(f_id)
INNER JOIN (SELECT distinct f_id FROM keywords,engine WHERE engine.k_id
= keywords.k_id AND keyword like 'numerique%') as e3 USING(f_id)

The query time is less than 600 ms, and increases only a little adding
more keywords.

Thanks to Tom Lane and Simon Riggs.

Best regards,
Antoine Bajolet

Antoine Bajolet a écrit :

> Hello,
>
> Tom Lane a écrit :
>
>> Antoine Bajolet <antoine.bajolet@free.fr> writes:
>>
>>
>>> We are using postgresql in a search engine on an intranet handling
>>> throusand of documents.
>>> But we ave a big problem when users use more than two search key.
>>>
>>
>>
>> I think you need to increase the statistics targets for your keywords
>> table --- the estimates of numbers of matching rows are much too small:
>>
>>
> What value you think i could put into a ALTER TABLE SET STATISTICS
> statment ?
>
> Also, the solution given by Simon Riggs works well.
> <quote>
>
> Recode your SQL with an IN subselect that retrieves all possible
> keywords before it accesses the larger table.
> </quote>
>
> But i will try the old ones increasing the statistics parameter and
> compare performance.
>