Re: Partitionwise join fails under GEQO - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: Partitionwise join fails under GEQO |
Date | |
Msg-id | CAExHW5tgiLsYC_CLcaKHFFc8H56C0s9mCu_0OpahGxn=hUi_Pg@mail.gmail.com Whole thread Raw |
In response to | Partitionwise join fails under GEQO (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Partitionwise join fails under GEQO
|
List | pgsql-hackers |
On Mon, Oct 5, 2020 at 6:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > If you run the core regression tests with geqo_threshold = 2 > (injected, say, via ALTER SYSTEM SET), you will notice the > earlier tests showing some join ordering differences, which > is unsurprising ... but when it gets to partition_join, that > test just dumps core. > > Investigation shows that the crash is due to a corrupt EquivalenceClass > member. It gets that way because add_child_join_rel_equivalences > is unaware of the planner's memory allocation policies, and feels > free to mess with the EC data structures during join planning. > When GEQO's transient memory context is thrown away after one round > of GEQO planning, some still-referenced EC data goes with it, > resulting in a crash during the next round. > > I band-aided around that with the attached patch, which is enough > to prevent the crash, but it's entirely unsatisfactory as a production > solution. The problem is that each GEQO round will re-add the same > child EC members, since add_child_join_rel_equivalences has absolutely > no concept that the members it wants might be there already. Since > GEQO tends to use a lot of rounds, this can run to significant memory > bloat, not to mention a pretty serious speed hit while EC operations > thumb through all of those duplicate expressions. > > Now that I've seen this, I wonder whether add_child_join_rel_equivalences > might not be making duplicate EC entries even without GEQO. Is there any > guarantee that it's not called repeatedly on related join-rel sets? build_child_join_rel() is the only caller of add_child_join_rel_equivalences(). Before the first calls the later, the first has an Assert 899 Assert(!find_join_rel(root, joinrel->relids)); This means that for a given child join rel this function is called only once implying that add_child_join_rel_equivalences() is called only once for a given child join rel. Initially I thought that this is enough to not add duplicate child members. But to me use of bms_overlap() in that function looks doubtful. Thinking allowed, let's say there's an EC member with em_relids = {1, 2} 1, 2 being relids of two partitioned tables. Let's assume that there's a third partitioned table with relid = 3. When a child join rel between children, say 6 and 7, of 1 and 2 resp was produces the EC memeber with em_relids = {1,2} was translated to produce and EC memeber with em_relids = {6, 7} and em_is_child = true. But when we will add a child join rel between (6, 7, 8) of a join between (1, 2, 3), bms_overlap({1, 2}, {1, 2, 3}) will return true and we will add one more member with em_relids = {6, 7}. So your suspicion is true. I think, using bms_is_subset(top_parent_relids, em->em_relids) should fix that problem. In addition, adding an Assert to make sure that we are not adding multiple instances of same child EC member should suffice. Thumbing through all the child EC members to eliminate duplicates would be much costlier when a huge number of partitions are involved. On a related but different subject, I have been thinking on-off about the need for expression translation while planning partitions.The translated expressions differ only in the leaf-vars and even for most of the partitioned tables really only the varno (assuming that partitioned table and partitions will have same layouts when a large number of partitions are created automatically.) If we can introduce a new type of (polymorphic) Var node which represents not just the parent Var but also child Vars as well. The whole need for expression translation vanishes, reducing memory requirements by many folds. Such a Var node will indicate which varnos it covers and for a group of varnos which varattno it represents. In most of the cases where parent and the child share the same varattno, all the varnos form a single set. Whole Var references pose a problem since parent's whole var reference translates to a RowExpr containing child Var references, but probably using varattno = 0 for children will suffice. As far as my thoughts go, this will work in planner but when translating plan tree into execution tree we will have to produce different Execution time Var nodes. That should be fine since we will be dealing with only a single plan. If we could fix that somehow, well and good. -- Best Wishes, Ashutosh Bapat
pgsql-hackers by date: