Re: [HACKERS] Runtime Partition Pruning - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [HACKERS] Runtime Partition Pruning |
Date | |
Msg-id | CA+TgmoYBPkkWVq+sd=uJgXu2BgaKNsaL85UCuJBuz2Nxaa1FKw@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Runtime Partition Pruning (Beena Emerson <memissemerson@gmail.com>) |
Responses |
Re: [HACKERS] Runtime Partition Pruning
Re: [HACKERS] Runtime Partition Pruning |
List | pgsql-hackers |
On Thu, Dec 7, 2017 at 2:22 AM, Beena Emerson <memissemerson@gmail.com> wrote: > I have added the partition quals that are used for pruning. > > PFA the updated patch. I have changed the names of variables to make > it more appropriate, along with adding more code comments and doing > some refactoring and other code cleanups. - initClauses() seems to be a duplicate of the existing function ExecInitExprList(), except for the extra NULL test, which isn't necessary. - The executor already has a system for making sure that relations get opened and locked, and it's different from the new one scheme which set_append_subplan_indexes() implements. Relations should be locked during the executor initialization phase (i.e. ExecInitAppend) and not when the first tuple is requested (i.e. ExecAppend). Also, there's already code to lock both child relations (i.e. the scans of those relations, see InitScanRelation, ExecInitIndexScan) and non-leaf partitions (ExecLockNonLeafAppendTables). The call to find_all_inheritors() will lock all of that same stuff again *plus* the leaf partitions that were pruned during planning - moreover, if the Append is rescanned, we'll walk the partitioning structure again for every rescan. I think RelationGetPartitionDispatchInfo should be called directly from ExecInitAppend after the existing code to take locks has been called, and store a pointer to the PartitionDispatch object in the AppendState for future use. - I am surprised that set_append_subplan_indexes() needs to worry about multi-level partitioning directly. I would have thought that Amit's patch would take care of that, just returning a Bitmapset of indexes which this function could use directly. It also doesn't seem like a very good idea to convert the Bitmapset (subplans) into a list of integers (node->subplan_indexes), as set_append_subplan_indexes() does at the bottom. The Bitmapset will be a lot more efficient; we should be able to just iterate over that directly rather than converting it into a List. Note that a Bitmapset can be created with a single palloc, but an integer list needs one per list element plus one for the list itself. - I don't think it's a good idea for ExecInitAppend to copy so much information into the appendstate. It copies append_paths_size, append_paths_array, parentoid, base_params, es_param_list_info, join_clauses, but it could just as well access them directly via planstate->plan and planstate->state when they are needed. Maybe you had some thought that this would be more efficient, but it probably doesn't save much and it's unlike what we do elsewhere in the executor. - A good chunk of the rest of this logic in nodeAppend.c looks like it's going to conflict heavily with the Parallel Append patch that just went in. That's going to require some thought. There's no reason why a parallel-aware Append can't do runtime partition pruning, but we want to avoid as much overhead as possible when runtime pruning isn't chosen. In the parallel-aware case, I think we should just try to jigger things so that the plans we don't need to scan get marked pa_finished. We don't want to hold pa_lock while doing the pruning, so probably what should happen is add a new ParallelAppendState member indicating wither pruning has been done; any process which needs to choose a subplan and sees that pruning isn't done yet releases the lock, performs pruning, then reacquires the lock, marks pa_finished on all plans that we don't need to scan, marks pruning down, picks a plan, and releases the lock. There is a race condition where pruning gets conducted by multiple workers around the same time, but it often won't happen and isn't a big deal if it does; they should all get the same answer. In the non-parallel-aware case, I think we should probably replace choose_next_subplan_locally with choose_next_subplan_simply (for the non-pruning case) and choose_next_subplan_with_pruning (for the other case). - On the optimizer side of things, the new calls to find_all_inheritors() and get_leaf_part_recurse() in set_base_rel_sizes() don't look good. As in the executor stuff, that's work that the optimizer is already doing elsewhere, and we don't want to redo it here. In the case of the optimizer, the most relevant place is probably expand_partitioned_rtentry(). Another place where we already have the relation open is get_relation_info(). Any information you want to get from the relation descriptor needs to be saved in one of those functions; don't re-open the relation elsewhere. - I think you are perhaps doing the work a little too early here, especially with regards to the join clauses. In set_base_rel_sizes() we don't have a clear idea what the join order will be, or what type of join will be used, so we don't know what join clauses are relevant for run-time pruning. I don't think that trying to identify join clauses at that stage makes sense. I think that the time when we know (or can figure out) that stuff might be when we go to build a parameterized append path - see the bottom of add_paths_to_append_rel(). It's really the parameterization -- the relids listed in required_outer -- that tell us which join clauses we could potentially use for runtime pruning. I'm not sure in detail how this should work, but it seems to me that generally the right formula is probably: useful clauses from the appendrel's baserestrictinfo PLUS joinclauses that mention only vars from the relation itself and whatever's in required_outer. - You can't use random fields in PlannerInfo *root as scratch space for functions in different files to communicate with each other. I mean, there's obviously some stuff that's valid to stick into the PlannerInfo, but these are basically per-relation details which you're counting on clearing before anybody gets confused. That's not a good design, and it's probably a sign that not all of the code is in the right place yet. For example, a lot of the logic you've added to create_append_path() probably belongs in the caller -- look at how simple create_.*_path() functions generally are. Similarly, think about whether the chunk of logic added to set_rel_size() just after the call to set_append_rel_size() doesn't really belong inside that function, or conversely whether the chunk of code in set_base_rel_sizes() just before the call to set_rel_size() shuldn't be moved down. I'm hopeful that with some rejiggering of this sort you can get the code that needs to communicate closer to being all in one place and pass around whatever is needed via the parameter lists of the functions involved, or even local variables, rather than via PlannerInfo. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: