Re: a path towards replacing GEQO with something better - Mailing list pgsql-hackers
From: John Naylor
Subject: Re: a path towards replacing GEQO with something better
Msg-id: CAFBsxsFYUiRGy5wVfhieKXPKW=ooFpwkpa9rS6qB=Mr4bZZ-8A@mail.gmail.com
In response to: Re: a path towards replacing GEQO with something better (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: a path towards replacing GEQO with something better
List: pgsql-hackers
On Tue, Jun 15, 2021 at 12:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 9, 2021 at 9:24 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > 3) It actually improves the existing exhaustive search, because the complexity of the join order problem depends on the query shape: a "chain" shape (linear) vs. a "star" shape (as in star schema), for the most common examples. The size of the DP table grows like this (for n >= 4):
...
> I don't quite understand the difference between the "chain" case and
> the "star" case. Can you show sample queries for each one? e.g. SELECT
> ... FROM a_1, a_2, ..., a_n WHERE <something>?
There's a very simple example in the optimizer README:
--
SELECT *
FROM tab1, tab2, tab3, tab4
WHERE tab1.col = tab2.col AND
tab2.col = tab3.col AND
tab3.col = tab4.col
Tables 1, 2, 3, and 4 are joined as:
{1 2},{2 3},{3 4}
{1 2 3},{2 3 4}
{1 2 3 4}
(other possibilities will be excluded for lack of join clauses)
SELECT *
FROM tab1, tab2, tab3, tab4
WHERE tab1.col = tab2.col AND
tab1.col = tab3.col AND
tab1.col = tab4.col
Tables 1, 2, 3, and 4 are joined as:
{1 2},{1 3},{1 4}
{1 2 3},{1 3 4},{1 2 4}
{1 2 3 4}
--
The first one is chain, and the second is star. Four is the smallest set where we have a difference. I should now point out an imprecision in my language: By "size of the DP table", the numbers in my first email refer to the number of joinrels times the number of possible joins (not paths, and ignoring commutativity). Here are the steps laid out, with cumulative counts:
join_level  # joins  cumulative # joins
chain, n=4
    2          3          3
    3          4          7
    4          3         10
star, n=4
    2          3          3
    3          6          9
    4          3         12
And of course, the chain query also has three at the last level, because it tries two left- (or right-) deep joins and one bushy join.
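For anyone who wants to verify those counts, here is a toy enumeration (not the actual standard_join_search() logic, just a sketch under the same counting rules: joins of connected sub-relations, at least one join clause between the two halves, commutativity ignored):

```python
# Toy DP-style join enumeration: count joins per level for a given
# join graph, matching the counting convention in the table above.
from itertools import combinations

def connected(rels, edges):
    # BFS over the join graph restricted to 'rels'
    rels = set(rels)
    seen, frontier = {min(rels)}, [min(rels)]
    while frontier:
        r = frontier.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == r and y in rels and y not in seen:
                    seen.add(y)
                    frontier.append(y)
    return seen == rels

def crosses(left, right, edges):
    # at least one join clause connects the two halves
    return any((a in left and b in right) or (b in left and a in right)
               for a, b in edges)

def joins_per_level(n, edges):
    rels = range(1, n + 1)
    result = []
    for level in range(2, n + 1):
        total = 0
        for subset in combinations(rels, level):
            s = set(subset)
            if not connected(s, edges):
                continue
            for k in range(1, level // 2 + 1):
                for left in combinations(subset, k):
                    left = set(left)
                    right = s - left
                    if k * 2 == level and min(subset) not in left:
                        continue  # count each equal-size split once
                    if (connected(left, edges) and connected(right, edges)
                            and crosses(left, right, edges)):
                        total += 1
        result.append(total)
    return result

chain = [(1, 2), (2, 3), (3, 4)]
star = [(1, 2), (1, 3), (1, 4)]
print(joins_per_level(4, chain))  # [3, 4, 3] -> cumulative 10
print(joins_per_level(4, star))   # [3, 6, 3] -> cumulative 12
```

The per-level numbers match the table: 10 total joins for the chain shape, 12 for the star shape.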
> One idea I just ran across in
> https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf
> is to try to economize by skipping consideration of bushy plans.
That's a good paper, and it did influence my thinking.
You likely already know this, but for the archives: if bushy plans arose only in chain queries, it wouldn't matter, because chain queries are so cheap to enumerate. But since star queries can introduce a large number of extra joins via equivalence (same join column, or an FK relationship), making them resemble "clique" queries, bushy joins get excessively numerous.
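To put a rough number on "excessively numerous" (a back-of-the-envelope illustration, not a measurement of the planner): in a clique, every split of the full join set is valid, so the top level alone considers 2^(n-1) - 1 unordered splits, of which only n are left-deep (one rel joined to the rest of the set); everything else is bushy.

```python
# Rough illustration: top-level splits for a clique of n rels.
# In a clique every subset is connected and every split has a join
# clause across it, so all 2**(n-1) - 1 unordered splits are valid.
def top_level_splits(n):
    total = 2 ** (n - 1) - 1
    left_deep = n          # single rel joined to the remaining n-1
    bushy = total - left_deep
    return total, bushy

for n in (4, 8, 12):
    print(n, top_level_splits(n))
# 4  -> (7, 3)
# 8  -> (127, 119)
# 12 -> (2047, 2035)
```

So by 12 rels, bushy splits dominate the top level by two orders of magnitude, and that's before counting the lower levels.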
> We
> could start doing that when some budget is exceeded, similar to what
> you are proposing here, but probably the budget for skipping
> consideration of bushy plans would be smaller than the budget for
> switching to IDP. The idea of skipping bushy plan generation in some
> cases makes sense to me intuitively because most of the plans
> PostgreSQL generates are mostly left-deep, and even when we do
> generate bushy plans, they're not always a whole lot better than the
> nearest equivalent left-deep plan. The paper argues that considering
> bushy plans makes measurable gains, but also that failure to consider
> such plans isn't catastrophically bad.
I think that makes sense. There are a few things we could do within the "grey zone" -- too many rels to finish exhaustive search, but not enough to justify starting directly with the greedy step -- to increase our chances of completing, and that's a very simple one.
--
John Naylor
EDB: http://www.enterprisedb.com