Thread: Partitionwise join fails under GEQO

Partitionwise join fails under GEQO

From
Tom Lane
Date:
If you run the core regression tests with geqo_threshold = 2
(injected, say, via ALTER SYSTEM SET), you will notice the
earlier tests showing some join ordering differences, which
is unsurprising ... but when it gets to partition_join, that
test just dumps core.

Investigation shows that the crash is due to a corrupt EquivalenceClass
member.  It gets that way because add_child_join_rel_equivalences
is unaware of the planner's memory allocation policies, and feels
free to mess with the EC data structures during join planning.
When GEQO's transient memory context is thrown away after one round
of GEQO planning, some still-referenced EC data goes with it,
resulting in a crash during the next round.

I band-aided around that with the attached patch, which is enough
to prevent the crash, but it's entirely unsatisfactory as a production
solution.  The problem is that each GEQO round will re-add the same
child EC members, since add_child_join_rel_equivalences has absolutely
no concept that the members it wants might be there already.  Since
GEQO tends to use a lot of rounds, this can run to significant memory
bloat, not to mention a pretty serious speed hit while EC operations
thumb through all of those duplicate expressions.

Now that I've seen this, I wonder whether add_child_join_rel_equivalences
might not be making duplicate EC entries even without GEQO.  Is there any
guarantee that it's not called repeatedly on related join-rel sets?

So I think that in addition to this, we need some check to see if the
derived EC child member is already there.  It's likely that checking
for an em_relids match would be sufficient to eliminate irrelevant
members quickly.

            regards, tom lane

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index b68a5a0ec7..8507cc8ef4 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -138,6 +138,8 @@ process_equivalence(PlannerInfo *root,
                *em2;
     ListCell   *lc1;

+    Assert(CurrentMemoryContext == root->planner_cxt);
+
     /* Should not already be marked as having generated an eclass */
     Assert(restrictinfo->left_ec == NULL);
     Assert(restrictinfo->right_ec == NULL);
@@ -2253,6 +2255,8 @@ add_child_rel_equivalences(PlannerInfo *root,
     Relids        child_relids = child_rel->relids;
     int            i;

+    Assert(CurrentMemoryContext == root->planner_cxt);
+
     /*
      * EC merging should be complete already, so we can use the parent rel's
      * eclass_indexes to avoid searching all of root->eq_classes.
@@ -2380,6 +2384,7 @@ add_child_join_rel_equivalences(PlannerInfo *root,
     Relids        top_parent_relids = child_joinrel->top_parent_relids;
     Relids        child_relids = child_joinrel->relids;
     Bitmapset  *matching_ecs;
+    MemoryContext oldcontext;
     int            i;

     Assert(IS_JOIN_REL(child_joinrel) && IS_JOIN_REL(parent_joinrel));
@@ -2387,6 +2392,8 @@ add_child_join_rel_equivalences(PlannerInfo *root,
     /* We need consider only ECs that mention the parent joinrel */
     matching_ecs = get_eclass_indexes_for_relids(root, top_parent_relids);

+    oldcontext = MemoryContextSwitchTo(root->planner_cxt);
+
     i = -1;
     while ((i = bms_next_member(matching_ecs, i)) >= 0)
     {
@@ -2486,6 +2493,8 @@ add_child_join_rel_equivalences(PlannerInfo *root,
             }
         }
     }
+
+    MemoryContextSwitchTo(oldcontext);
 }



Re: Partitionwise join fails under GEQO

From
Ashutosh Bapat
Date:
On Mon, Oct 5, 2020 at 6:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> If you run the core regression tests with geqo_threshold = 2
> (injected, say, via ALTER SYSTEM SET), you will notice the
> earlier tests showing some join ordering differences, which
> is unsurprising ... but when it gets to partition_join, that
> test just dumps core.
>
> Investigation shows that the crash is due to a corrupt EquivalenceClass
> member.  It gets that way because add_child_join_rel_equivalences
> is unaware of the planner's memory allocation policies, and feels
> free to mess with the EC data structures during join planning.
> When GEQO's transient memory context is thrown away after one round
> of GEQO planning, some still-referenced EC data goes with it,
> resulting in a crash during the next round.
>
> I band-aided around that with the attached patch, which is enough
> to prevent the crash, but it's entirely unsatisfactory as a production
> solution.  The problem is that each GEQO round will re-add the same
> child EC members, since add_child_join_rel_equivalences has absolutely
> no concept that the members it wants might be there already.  Since
> GEQO tends to use a lot of rounds, this can run to significant memory
> bloat, not to mention a pretty serious speed hit while EC operations
> thumb through all of those duplicate expressions.
>
> Now that I've seen this, I wonder whether add_child_join_rel_equivalences
> might not be making duplicate EC entries even without GEQO.  Is there any
> guarantee that it's not called repeatedly on related join-rel sets?

build_child_join_rel() is the only caller of
add_child_join_rel_equivalences(). Before the first calls the later,
the first has an Assert
 899     Assert(!find_join_rel(root, joinrel->relids));
This means that for a given child join rel this function is called
only once implying that add_child_join_rel_equivalences() is called
only once for a given child join rel. Initially I thought that this is
enough to not add duplicate child members. But to me use of
bms_overlap() in that function looks doubtful. Thinking allowed, let's
say there's an EC member with em_relids = {1, 2} 1, 2 being relids of
two partitioned tables. Let's assume that there's a third partitioned
table with relid = 3. When a child join rel between children, say 6
and 7, of 1 and 2 resp was produces the EC memeber with em_relids =
{1,2} was translated to produce and EC memeber with em_relids = {6, 7}
and em_is_child = true. But when we will add a child join rel between
(6, 7, 8) of a join between (1, 2, 3), bms_overlap({1, 2}, {1, 2, 3})
will return true and we will add one more member with em_relids = {6,
7}. So your suspicion is true. I think, using
bms_is_subset(top_parent_relids, em->em_relids) should fix that
problem. In addition, adding an Assert to make sure that we are not
adding multiple instances of same child EC member should suffice.
Thumbing through all the child EC members to eliminate duplicates
would be much costlier when a huge number of partitions are involved.

On a related but different subject, I have been thinking on-off about
the need for expression translation while planning partitions.The
translated expressions differ only in the leaf-vars and even for most
of the partitioned tables really only the varno (assuming that
partitioned table and partitions will have same layouts when a large
number of partitions are created automatically.) If we can introduce a
new type of (polymorphic) Var node which represents not just the
parent Var but also child Vars as well. The whole need for expression
translation vanishes, reducing memory requirements by many folds. Such
a Var node will indicate which varnos it covers and for a group of
varnos which varattno it represents. In most of the cases where parent
and the child share the same varattno, all the varnos form a single
set. Whole Var references pose a problem since parent's whole var
reference translates to a RowExpr containing child Var references, but
probably using varattno = 0 for children will suffice. As far as my
thoughts go, this will work in planner but when translating plan tree
into execution tree we will have to produce different Execution time
Var nodes. That should be fine since we will be dealing with only a
single plan. If we could fix that somehow, well and good.

-- 
Best Wishes,
Ashutosh Bapat



Re: Partitionwise join fails under GEQO

From
Tom Lane
Date:
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> On Mon, Oct 5, 2020 at 6:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now that I've seen this, I wonder whether add_child_join_rel_equivalences
>> might not be making duplicate EC entries even without GEQO.  Is there any
>> guarantee that it's not called repeatedly on related join-rel sets?

> build_child_join_rel() is the only caller of
> add_child_join_rel_equivalences(). Before the first calls the later,
> the first has an Assert
>  899     Assert(!find_join_rel(root, joinrel->relids));
> This means that for a given child join rel this function is called
> only once implying that add_child_join_rel_equivalences() is called
> only once for a given child join rel. Initially I thought that this is
> enough to not add duplicate child members. But to me use of
> bms_overlap() in that function looks doubtful.

Yeah.  As soon as you get into three-or-more-way joins, this code will
be called again, mentioning some of the same relids as at the lower
join level.

> On a related but different subject, I have been thinking on-off about
> the need for expression translation while planning partitions.The
> translated expressions differ only in the leaf-vars and even for most
> of the partitioned tables really only the varno (assuming that
> partitioned table and partitions will have same layouts when a large
> number of partitions are created automatically.) If we can introduce a
> new type of (polymorphic) Var node which represents not just the
> parent Var but also child Vars as well. The whole need for expression
> translation vanishes, reducing memory requirements by many folds.

Actually, the reason I have been looking at equivclass.c is that I've
been attacking the problem from a different direction.  I'm fairly
unexcited about making the definition of Var looser as you suggest
here --- I actually think that it's dangerously imprecise already,
due to not accounting for the effects of outer joins.  However,
we could avoid the growth in eclass members for large partition sets
if we simply didn't store child eclass members, instead translating
on-the-fly when we need to deal with child rels.  I have a patch
about half done, but it won't be possible to determine the true
performance implications of that idea until it's all done.  More
later.

Either approach would mean that add_child_join_rel_equivalences
goes away entirely, or at least doesn't need to store new em_is_child
entries anymore, causing the memory bloat issue to become moot.
So maybe we should just fix the wrong-context issue for now, and
live with the GEQO bloat in the back branches.

            regards, tom lane



Re: Partitionwise join fails under GEQO

From
Ashutosh Bapat
Date:
On Mon, Oct 5, 2020 at 11:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> > On Mon, Oct 5, 2020 at 6:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Now that I've seen this, I wonder whether add_child_join_rel_equivalences
> >> might not be making duplicate EC entries even without GEQO.  Is there any
> >> guarantee that it's not called repeatedly on related join-rel sets?
>
> > build_child_join_rel() is the only caller of
> > add_child_join_rel_equivalences(). Before the first calls the later,
> > the first has an Assert
> >  899     Assert(!find_join_rel(root, joinrel->relids));
> > This means that for a given child join rel this function is called
> > only once implying that add_child_join_rel_equivalences() is called
> > only once for a given child join rel. Initially I thought that this is
> > enough to not add duplicate child members. But to me use of
> > bms_overlap() in that function looks doubtful.
>
> Yeah.  As soon as you get into three-or-more-way joins, this code will
> be called again, mentioning some of the same relids as at the lower
> join level.
>
> > On a related but different subject, I have been thinking on-off about
> > the need for expression translation while planning partitions.The
> > translated expressions differ only in the leaf-vars and even for most
> > of the partitioned tables really only the varno (assuming that
> > partitioned table and partitions will have same layouts when a large
> > number of partitions are created automatically.) If we can introduce a
> > new type of (polymorphic) Var node which represents not just the
> > parent Var but also child Vars as well. The whole need for expression
> > translation vanishes, reducing memory requirements by many folds.
>
> Actually, the reason I have been looking at equivclass.c is that I've
> been attacking the problem from a different direction.  I'm fairly
> unexcited about making the definition of Var looser as you suggest
> here --- I actually think that it's dangerously imprecise already,
> due to not accounting for the effects of outer joins.  However,
> we could avoid the growth in eclass members for large partition sets
> if we simply didn't store child eclass members, instead translating
> on-the-fly when we need to deal with child rels.  I have a patch
> about half done, but it won't be possible to determine the true
> performance implications of that idea until it's all done.  More
> later.

If we translate more than ones, every time someone comes searching for
an EC member, we will leak memory in the planner memory context, which
anyway gets bloated because of the huge number of translations even
when done only once. So that needs to be done rather carefully. Of
course we could save child EC members separately in the same EC and
search that using hash or something to avoid multiple instances of the
same child EC member.

But this memory bloat isn't unique to ECs, though it shows up mostly
in ECs. We translate all the expressions, TLEs, quals and so on. That
consumes a lot of memory. The number of joins grows exponentially with
the number of relations being joined and so does this child joins. So
we need some way to avoid expression translations, where the
translated expressions just differ in leaf var nodes, in order to
support a large number of partitions.

>
> Either approach would mean that add_child_join_rel_equivalences
> goes away entirely, or at least doesn't need to store new em_is_child
> entries anymore, causing the memory bloat issue to become moot.
> So maybe we should just fix the wrong-context issue for now, and
> live with the GEQO bloat in the back branches.

Yes, I agree with that. For now your patch fixing the wrong context
issue is good enough.

-- 
Best Wishes,
Ashutosh Bapat



Re: Partitionwise join fails under GEQO

From
Tom Lane
Date:
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> On Mon, Oct 5, 2020 at 11:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... we could avoid the growth in eclass members for large partition sets
>> if we simply didn't store child eclass members, instead translating
>> on-the-fly when we need to deal with child rels.  I have a patch
>> about half done, but it won't be possible to determine the true
>> performance implications of that idea until it's all done.  More
>> later.

> If we translate more than ones, every time someone comes searching for
> an EC member, we will leak memory in the planner memory context, which
> anyway gets bloated because of the huge number of translations even
> when done only once. So that needs to be done rather carefully.

I'm not terribly concerned about that.  There's not a "huge number"
of such translations to be done --- it's more like one per index.
(Also, we could very possibly build the translations in a temp
context that gets reset every so often, if it becomes an issue.)

I am a bit worried about whether we'll be spending a lot more cycles
to do the added translations; though some of that should be bought
back by having fewer EC members to compare to.  In any event, testing
a working patch will be a lot more illuminating than speculation.

>> Either approach would mean that add_child_join_rel_equivalences
>> goes away entirely, or at least doesn't need to store new em_is_child
>> entries anymore, causing the memory bloat issue to become moot.
>> So maybe we should just fix the wrong-context issue for now, and
>> live with the GEQO bloat in the back branches.

> Yes, I agree with that. For now your patch fixing the wrong context
> issue is good enough.

Done for now.

            regards, tom lane



Re: Partitionwise join fails under GEQO

From
Robert Haas
Date:
On Mon, Oct 5, 2020 at 2:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Actually, the reason I have been looking at equivclass.c is that I've
> been attacking the problem from a different direction.  I'm fairly
> unexcited about making the definition of Var looser as you suggest
> here --- I actually think that it's dangerously imprecise already,
> due to not accounting for the effects of outer joins.  However,
> we could avoid the growth in eclass members for large partition sets
> if we simply didn't store child eclass members, instead translating
> on-the-fly when we need to deal with child rels.  I have a patch
> about half done, but it won't be possible to determine the true
> performance implications of that idea until it's all done.  More
> later.

The thing that seems a bit dumb about the current system is that we
got to a lot of trouble to generate separte tlists and quals for every
inheritance child, but then in the final plan, what difference does it
make? I see that scan nodes, unlike joins etc., don't replace Var
nodes with INNER_VAR/OUTER_VAR, but it seems like a qual attached to a
Seq Scan has only one possible relation to which it can refer, so
what's the point of having separate lists with different varnos in
each? There is an issue if column numbering is different, because then
varattnos need to differ even at execution time, but it seems like the
only possible use of the varnos in these places is during planning. By
execution time, we can't really care any more, at least not as far as
I understand it.

Just to be clear, this is not a vote against whatever you want to do
to improve things in this area. It's just an observation that we seem
to create an awful lot of bulky data structures when planning a query
like SELECT * FROM many_children, and then don't do much with them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partitionwise join fails under GEQO

From
Amit Langote
Date:
On Wed, Oct 7, 2020 at 12:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> > On Mon, Oct 5, 2020 at 11:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> ... we could avoid the growth in eclass members for large partition sets
> >> if we simply didn't store child eclass members, instead translating
> >> on-the-fly when we need to deal with child rels.  I have a patch
> >> about half done, but it won't be possible to determine the true
> >> performance implications of that idea until it's all done.  More
> >> later.

+1 to this idea.  We've seen mainly get_eclass_for_sort_expr() become
a bottleneck with large partition sets and getting rid of it would be
really great.

> > If we translate more than ones, every time someone comes searching for
> > an EC member, we will leak memory in the planner memory context, which
> > anyway gets bloated because of the huge number of translations even
> > when done only once. So that needs to be done rather carefully.
>
> I'm not terribly concerned about that.  There's not a "huge number"
> of such translations to be done --- it's more like one per index.
> (Also, we could very possibly build the translations in a temp
> context that gets reset every so often, if it becomes an issue.)

Yeah, the memory bloat from this should be minimal.  AFAICS, the only
place that demands a child expression from a given matched EC thus
needing a translation is createplan.c, which I would expect to be a
place that doesn't spend a lot of memory because there's no
exploratory processing.

> I am a bit worried about whether we'll be spending a lot more cycles
> to do the added translations; though some of that should be bought
> back by having fewer EC members to compare to.  In any event, testing
> a working patch will be a lot more illuminating than speculation.

Agreed.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com



Re: Partitionwise join fails under GEQO

From
Amit Langote
Date:
On Thu, Oct 8, 2020 at 2:58 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Oct 7, 2020 at 12:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> > > On Mon, Oct 5, 2020 at 11:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> ... we could avoid the growth in eclass members for large partition sets
> > >> if we simply didn't store child eclass members, instead translating
> > >> on-the-fly when we need to deal with child rels.  I have a patch
> > >> about half done, but it won't be possible to determine the true
> > >> performance implications of that idea until it's all done.  More
> > >> later.
>
> +1 to this idea.  We've seen mainly get_eclass_for_sort_expr() become
> a bottleneck with large partition sets and getting rid of it would be
> really great.
>
> [1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com

Oops, I linked this thread but forgot to write why.  Well, I had meant
to say that I had unsuccessfully tried to implement this idea as a PoC
back when David had started the linked discussion to address the same
problem.

-- 
Amit Langote
EDB: http://www.enterprisedb.com