Thread: When you really want to force a certain join type?

When you really want to force a certain join type?

From: Gunther Schadow

I have a complex query which essentially runs a finite state automaton through a WITH RECURSIVE union, adding the next state based on the previous. This is run with 100,000 or a million start states at the same time, picking a new record (token) and matching it to the FSA (a three-way join):

token inner join next token * state-transition-table -> next state

I know this doesn't really tell you much. The following might give you a glimpse:

with recursive Token as (
  select * from steps left outer join token using(event)
  limit 100000
), StartStates as (
  select pathId, start, end, m.new_state as state, m.goalId
    from Token w inner join FSA m
      on(m.token = w.token and m.old_state = w.state)
), Phrase as (
  select pathId, start, end, state, goalId
    from StartStates
  union all
  select p.pathId, p.start, n.end, n.new_state as state, n.goalId
    from Phrase p
    inner join (
      select pathId, start, end, old_state as state, new_state, f.goalId
        from Token inner join FSA f using(token)
    ) n using(pathId, end, state)
)

There are 100s of thousands of states. This join has a HUGE fan out if it is not immediately limited by the chaining criterion on the old_state. So any attempt to use a merge join or hash join, which will compute the whole big thing and only then apply the chaining criterion, just creates massive sort load and/or humongous hash tables, only to throw the vast majority away every time. But when it runs through nested loops, the indexes make it really quick.

I cannot show you the exact data, but I can show you the plan that works amazingly fast:

 Insert on good_paths  (cost=912224.51..912228.71 rows=240 width=302)
   CTE token
     ->  Limit  (cost=46728.24..81127.75 rows=100000 width=519)
           ->  Hash Left Join  (cost=46728.24..115752.23 rows=200654 width=519)
               ... this is creating the start states
   CTE path
     ->  Recursive Union  (cost=23293.75..831082.45 rows=241 width=278)
           ->  Merge Join  (cost=23293.75..289809.83 rows=171 width=278)
                 Merge Cond: ((m.old_state = w_1.state) AND (m.token = w_1.token))
                 ->  Index Scan using fsa_old_state_token_idx on fsa m  (cost=0.43..245365.63 rows=4029834 width=28)
                 ->  Materialize  (cost=23293.32..23793.32 rows=100000 width=278)
                       ->  Sort  (cost=23293.32..23543.32 rows=100000 width=278)
                             Sort Key: w_1.state, w_1.token
                             ->  CTE Scan on token w_1  (cost=0.00..2000.00 rows=100000 width=278)
           ->  Nested Loop  (cost=18295.78..54126.78 rows=7 width=278)
                 ->  Merge Join  (cost=18295.35..19120.16 rows=4275 width=340)
                       Merge Cond: ((token.pathid = p.pathid) AND (token.start = p.end))
                       ->  Sort  (cost=18169.32..18419.32 rows=100000 width=160)
                             Sort Key: token.pathid, token.start
                             ->  CTE Scan on token  (cost=0.00..2000.00 rows=100000 width=160)
                       ->  Sort  (cost=126.03..130.30 rows=1710 width=212)
                             Sort Key: p.pathid, p.end
                             ->  WorkTable Scan on path p  (cost=0.00..34.20 rows=1710 width=212)
                 ->  Index Scan using fsa_old_state_token_idx on fsa f  (cost=0.43..8.18 rows=1 width=28)

Now, when that initial token list (of start states) grows beyond this limit of 100,000, the execution plan flips:

 Insert on good_paths  (cost=2041595.63..2041606.66 rows=630 width=302)
   CTE token
     ->  Limit  (cost=46728.24..115752.23 rows=200654 width=519)
           ->  Hash Left Join  (cost=46728.24..115752.23 rows=200654 width=519)
               ... this is creating the start states
   CTE path
     ->  Recursive Union  (cost=47749.96..1925801.45 rows=633 width=278)
           ->  Merge Join  (cost=47749.96..315274.30 rows=343 width=278)
                 Merge Cond: ((m.old_state = w_1.state) AND (m.token = w_1.token))
                 ->  Index Scan using fsa_old_state_token_idx on fsa m  (cost=0.43..245365.63 rows=4029834 width=28)
                 ->  Materialize  (cost=47749.53..48752.80 rows=200654 width=278)
                       ->  Sort  (cost=47749.53..48251.16 rows=200654 width=278)
                             Sort Key: w_1.state, w_1.token
                             ->  CTE Scan on token w_1  (cost=0.00..4013.08 rows=200654 width=278)
           ->  Merge Join  (cost=158013.87..161051.45 rows=29 width=278)
                 Merge Cond: ((token.token = f.token) AND (token.pathid = p.pathid) AND (token.start = p.end))
                 ->  Sort  (cost=37459.53..37961.16 rows=200654 width=160)
                       Sort Key: token.token, token.pathid, token.start
                       ->  CTE Scan on token  (cost=0.00..4013.08 rows=200654 width=160)
                 ->  Materialize  (cost=120554.35..120966.44 rows=82419 width=228)
                       ->  Sort  (cost=120554.35..120760.39 rows=82419 width=228)
                             Sort Key: f.token, p.pathid, p.end
                             ->  Nested Loop  (cost=0.43..104808.55 rows=82419 width=228)
                                   ->  WorkTable Scan on path p  (cost=0.00..68.60 rows=3430 width=212)
                                   ->  Index Scan using fsa_old_state_token_idx on fsa f  (cost=0.43..30.30 rows=24 width=28)

Once this merge join kicks in, the query essentially stalls. (I mean, each of the limited components runs in seconds, and I can run them iteratively so that my initial set of tokens never grows past 100,000; that way I can complete everything in about linear time, each iteration taking time roughly proportional to the number of tokens. But with the merge join it doesn't complete in several times that amount of time.)

I doubt that I can find any trick to give to the planner better data which it can then use to figure out that the merge join is a bad proposition.

I wish I could just force it. I probably had this discussion here some years ago. I think that while the PostgreSQL optimizer is pretty good, there are situations such as this where its predictions do not work.

Note, for my immediate relief I have forced it by simply setting enable_mergejoin = off. This works fine, except that it converts both merge joins into nested loops, and the upper merge join was not a problem; most often a nested loop is a bad choice for bulk data. It's only in this recursive query that it sometimes makes sense.

regards,
-Gunther

Re: When you really want to force a certain join type?

From: Justin Pryzby
On Wed, Dec 28, 2022 at 10:39:14AM -0500, Gunther Schadow wrote:
> I have a complex query which essentially runs a finite state automaton
> through a with recursive union, adding the next state based on the
> previous.  This is run at 100,000 or a million start states at the same
> time, picking a new record (token), matching it to the FSA (a three-way
> join:

> There are 100s of thousands of states. This join has a HUGE fan out if it is

> I doubt that I can find any trick to give to the planner better data which
> it can then use to figure out that the merge join is a bad proposition.

> Note, for my immediate relief I have forced it by simply set
> enable_mergejoin=off. This works fine, except, it converts both into a
> nested loop, but the upper merge join was not a problem, and sometimes (most
> often) nested loop is a bad choice for bulk data. It's only for this
> recursive query it sometimes makes sense.

Maybe the new parameter in v15 would help.

https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-RECURSIVE-WORKTABLE-FACTOR
recursive_worktable_factor (floating point)

    Sets the planner's estimate of the average size of the working table
    of a recursive query, as a multiple of the estimated size of the
    initial non-recursive term of the query. This helps the planner
    choose the most appropriate method for joining the working table to
    the query's other tables. The default value is 10.0. A smaller value
    such as 1.0 can be helpful when the recursion has low “fan-out” from
    one step to the next, as for example in shortest-path queries. Graph
    analytics queries may benefit from larger-than-default values.

-- 
Justin



Re: When you really want to force a certain join type?

From: Gunther Schadow
On 12/28/2022 10:48 AM, Justin Pryzby wrote:
> Maybe the new parameter in v15 would help.
>
> https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-RECURSIVE-WORKTABLE-FACTOR
> recursive_worktable_factor (floating point)
>
>     Sets the planner's estimate of the average size of the working table
>     of a recursive query, as a multiple of the estimated size of the
>     initial non-recursive term of the query. This helps the planner
>     choose the most appropriate method for joining the working table to
>     the query's other tables. The default value is 10.0. A smaller value
>     such as 1.0 can be helpful when the recursion has low “fan-out” from
>     one step to the next, as for example in shortest-path queries. Graph
>     analytics queries may benefit from larger-than-default values.

Thanks, that's something I will try after I upgrade.

But speaking of other uses for recursive queries: I have quite a bit of experience turning graph-related traversal, search, optimization, and classification queries into SQL; in short, computing the transitive closure. Usually I have stayed away from the recursive WITH query and instead set up a start table and then performed the iterative step myself. There are two ways to go about it. Say you have a graph, simple nodes and arcs, and you want to find all paths through the graph.

Now you can set up start nodes and then extend them at the end, joining the recursive table to the simple arc table to grow each path by one step per iteration. This is what WITH RECURSIVE supports. These queries iterate as many times as the length of the longest path.

with recursive arcs as (
  select source, target, 1 as distance, ...
    from ...
), paths as (
  select * from arcs
  union all
  select a.source, b.target, a.distance + 1 as distance, ...
    from paths a inner join arcs b
      on(b.source = a.target)
)
select * from paths
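For illustration, this linear style runs as-is on SQLite's recursive CTE support; here is a minimal, runnable sketch on a made-up toy graph (the arcs data and the trimmed source/target/distance column set are hypothetical, not from the real workload):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table arcs(source integer, target integer)")
# Toy graph: a chain 1 -> 2 -> 3 -> 4 with one branch 2 -> 5.
con.executemany("insert into arcs values (?, ?)",
                [(1, 2), (2, 3), (3, 4), (2, 5)])

rows = con.execute("""
    with recursive paths(source, target, distance) as (
      select source, target, 1 from arcs
      union all
      select a.source, b.target, a.distance + 1
        from paths a inner join arcs b on (b.source = a.target)
    )
    select source, target, distance from paths
    order by source, target
""").fetchall()
print(rows)  # 8 paths in total; (1, 4, 3) is the full chain
```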

But another way is to join paths with paths. It would be this, which I think I have seen PostgreSQL unable to deal with:

with recursive arcs as (
  select source, target, 1 as distance, ...
    from ...
), paths as (
  select * from arcs
  union all
  select a.source, b.target, a.distance + 1 as distance, ...
    from paths a inner join paths b
      on(b.source = a.target)
)
select * from paths

So, instead of the recursive union joining back to the fixed table, it joins the recursive table to the recursive table. The benefit is that these queries converge much more quickly: instead of going 10 iterations to find a path of length 10, one iteration finds all paths of length up to 2 (if distance 1 is the base table of all arcs), the next finds all paths of length up to 4, then up to 8, then 16, 32, and so on. I usually do that as follows:

create table paths as
select source, target, 1 as distance, ...
  from arcs;

prepare rep as
insert into paths(source, target, distance, ...)
select a.source, b.target, a.distance + b.distance as distance, ... 
  from paths a inner join paths b on(b.source = a.target) 
except
select * from paths;

execute rep;
execute rep;
...
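As a concrete (if simplified) illustration, the same create-then-repeat pattern can be run against SQLite, standing in for PostgreSQL's PREPARE/EXECUTE, on a hypothetical chain-shaped arcs table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table arcs(source integer, target integer)")
con.executemany("insert into arcs values (?, ?)",
                [(i, i + 1) for i in range(10)])  # chain 0 -> 1 -> ... -> 10

con.execute("""create table paths as
               select source, target, 1 as distance from arcs""")

# The repeated step: join paths with paths, keeping only rows not seen yet.
rep = """insert into paths(source, target, distance)
         select a.source, b.target, a.distance + b.distance
           from paths a inner join paths b on (b.source = a.target)
         except
         select source, target, distance from paths"""

iterations = 0
while True:
    inserted = con.execute(rep).rowcount
    iterations += 1
    if inserted == 0:          # fixed point reached: nothing new was added
        break

n_paths = con.execute("select count(*) from paths").fetchone()[0]
print(iterations, n_paths)
```

Here the longest path has length 10, yet the loop reaches the fixed point after reachable lengths double to 2, 4, 8, then 10, plus one final confirming pass.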

or instead of the except, in order to minimize distances:

where not exists (select 1 from paths x
                   where x.source = a.source
                     and x.target = a.target
                     and x.distance < a.distance)

I have even done a group by in the recursive step which replaces the paths relation at every iteration (e.g. with only minimal distance paths).

Since this converges so rapidly I often prefer that approach over a recursive union query.
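The iteration-count difference is easy to check in plain Python (a hypothetical chain graph, with sets of (source, target) pairs standing in for the paths relation):

```python
# Chain graph 0 -> 1 -> ... -> 16: the longest path has length 16.
arcs = {(i, i + 1) for i in range(16)}

def extend(paths, by):
    """One iteration: keep existing paths and add every join of paths with `by`."""
    joined = {(s, t2) for (s, t) in paths for (s2, t2) in by if s2 == t}
    return paths | joined

# Linear style: always join back to the fixed arcs table.
linear, linear_iters = set(arcs), 0
while True:
    nxt = extend(linear, arcs)
    if nxt == linear:
        break
    linear, linear_iters = nxt, linear_iters + 1

# Doubling style: join the accumulated paths relation with itself.
doubling, doubling_iters = set(arcs), 0
while True:
    nxt = extend(doubling, doubling)
    if nxt == doubling:
        break
    doubling, doubling_iters = nxt, doubling_iters + 1

assert linear == doubling            # same transitive closure either way
print(linear_iters, doubling_iters)  # 15 vs. 4 productive iterations
```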

I think IBM DB2 allowed joining the recursive table with itself. Is this something you want to support at some point?

Also, why even require the RECURSIVE keyword? DB2 didn't need it, and the query analyzer should immediately see the recursion, so there is no need for that keyword.

regards,
-Gunther

Re: When you really want to force a certain join type?

From: Tom Lane
Gunther Schadow <raj@gusw.net> writes:
> Also, why even use the RECURSIVE keyword, DB2 didn't need it, and the 
> query analyzer should immediately see the recursion, so no need to have 
> that keyword.

Our reading of the SQL spec is that it's required.  The scope of
visibility of CTE names is different depending on whether you
write RECURSIVE or not, so it's not a question of "the query analyzer
should see it": the analyzer is required NOT to see it.

DB2 generally has a reputation for agreeing with the spec,
so I'm surprised to hear that they're not doing this per spec.

            regards, tom lane