Re: [HACKERS] Runtime Partition Pruning - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Runtime Partition Pruning
Date
Msg-id CA+TgmoYBPkkWVq+sd=uJgXu2BgaKNsaL85UCuJBuz2Nxaa1FKw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Runtime Partition Pruning  (Beena Emerson <memissemerson@gmail.com>)
Responses Re: [HACKERS] Runtime Partition Pruning
Re: [HACKERS] Runtime Partition Pruning
List pgsql-hackers
On Thu, Dec 7, 2017 at 2:22 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> I have added the partition quals that are used for pruning.
>
> PFA the updated patch. I have changed the names of variables to make
> it more appropriate, along with adding more code comments and doing
> some refactoring and other code cleanups.

- initClauses() seems to be a duplicate of the existing function
ExecInitExprList(), except for the extra NULL test, which isn't
necessary.

- The executor already has a system for making sure that relations get
opened and locked, and it's different from the new one scheme which
set_append_subplan_indexes() implements. Relations should be locked
during the executor initialization phase (i.e. ExecInitAppend) and not
when the first tuple is requested (i.e. ExecAppend).  Also, there's
already code to lock both child relations (i.e. the scans of those
relations, see InitScanRelation, ExecInitIndexScan) and non-leaf
partitions (ExecLockNonLeafAppendTables).  The call to
find_all_inheritors() will lock all of that same stuff again *plus*
the leaf partitions that were pruned during planning - moreover, if
the Append is rescanned, we'll walk the partitioning structure again
for every rescan.  I think RelationGetPartitionDispatchInfo should be
called directly from ExecInitAppend after the existing code to take
locks has been called, and store a pointer to the PartitionDispatch
object in the AppendState for future use.

- I am surprised that set_append_subplan_indexes() needs to worry
about multi-level partitioning directly.  I would have thought that
Amit's patch would take care of that, just returning a Bitmapset of
indexes which this function could use directly.  It also doesn't seem
like a very good idea to convert the Bitmapset (subplans) into a list
of integers (node->subplan_indexes), as set_append_subplan_indexes()
does at the bottom.  The Bitmapset will be a lot more efficient; we
should be able to just iterate over that directly rather than
converting it into a List.  Note that a Bitmapset can be created with
a single palloc, but an integer list needs one per list element plus
one for the list itself.

- I don't think it's a good idea for ExecInitAppend to copy so much
information into the appendstate.  It copies append_paths_size,
append_paths_array, parentoid, base_params, es_param_list_info,
join_clauses, but it could just as well access them directly via
planstate->plan and planstate->state when they are needed.  Maybe you
had some thought that this would be more efficient, but it probably
doesn't save much and it's unlike what we do elsewhere in the
executor.

- A good chunk of the rest of this logic in nodeAppend.c looks like
it's going to conflict heavily with the Parallel Append patch that
just went in.  That's going to require some thought.  There's no
reason why a parallel-aware Append can't do runtime partition pruning,
but we want to avoid as much overhead as possible when runtime pruning
isn't chosen.  In the parallel-aware case, I think we should just try
to jigger things so that the plans we don't need to scan get marked
pa_finished.  We don't want to hold pa_lock while doing the pruning,
so probably what should happen is add a new ParallelAppendState member
indicating wither pruning has been done; any process which needs to
choose a subplan and sees that pruning isn't done yet releases the
lock, performs pruning, then reacquires the lock, marks pa_finished on
all plans that we don't need to scan, marks pruning down, picks a
plan, and releases the lock.  There is a race condition where pruning
gets conducted by multiple workers around the same time, but it often
won't happen and isn't a big deal if it does; they should all get the
same answer.  In the non-parallel-aware case, I think we should
probably replace choose_next_subplan_locally with
choose_next_subplan_simply (for the non-pruning case) and
choose_next_subplan_with_pruning (for the other case).

- On the optimizer side of things, the new calls to
find_all_inheritors() and get_leaf_part_recurse() in
set_base_rel_sizes() don't look good.  As in the executor stuff,
that's work that the optimizer is already doing elsewhere, and we
don't want to redo it here.  In the case of the optimizer, the most
relevant place is probably expand_partitioned_rtentry().  Another
place where we already have the relation open is get_relation_info().
Any information you want to get from the relation descriptor needs to
be saved in one of those functions; don't re-open the relation
elsewhere.

- I think you are perhaps doing the work a little too early here,
especially with regards to the join clauses.  In set_base_rel_sizes()
we don't have a clear idea what the join order will be, or what type
of join will be used, so we don't know what join clauses are relevant
for run-time pruning.  I don't think that trying to identify join
clauses at that stage makes sense.  I think that the time when we know
(or can figure out) that stuff might be when we go to build a
parameterized append path - see the bottom of
add_paths_to_append_rel().  It's really the parameterization -- the
relids listed in required_outer -- that tell us which join clauses we
could potentially use for runtime pruning.  I'm not sure in detail how
this should work, but it seems to me that generally the right formula
is probably: useful clauses from the appendrel's baserestrictinfo PLUS
joinclauses that mention only vars from the relation itself and
whatever's in required_outer.

- You can't use random fields in PlannerInfo *root as scratch space
for functions in different files to communicate with each other.  I
mean, there's obviously some stuff that's valid to stick into the
PlannerInfo, but these are basically per-relation details which you're
counting on clearing before anybody gets confused.  That's not a good
design, and it's probably a sign that not all of the code is in the
right place yet.  For example, a lot of the logic you've added to
create_append_path() probably belongs in the caller -- look at how
simple create_.*_path() functions generally are.  Similarly, think
about whether the chunk of logic added to set_rel_size() just after
the call to set_append_rel_size() doesn't really belong inside that
function, or conversely whether the chunk of code in
set_base_rel_sizes() just before the call to set_rel_size() shuldn't
be moved down.  I'm hopeful that with some rejiggering of this sort
you can get the code that needs to communicate closer to being all in
one place and pass around whatever is needed via the parameter lists
of the functions involved, or even local variables, rather than via
PlannerInfo.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: proposal: alternative psql commands quit and exit
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] parallel.c oblivion of worker-startup failures