Thread: numbering plan nodes

numbering plan nodes

From
Robert Haas
Date:
Hi,

One of the problems which the remaining parallel query patches need to
solve is to correlate Plan or PlanState nodes across multiple
backends.  This need arises in a couple of cases.  First, if a group
of workers are all executing the same plan nodes, and es_instrument !=
0, then we'll accumulate instrumentation data in each worker, but we
need to collect all of the instrumentation data from each individual
worker and add the totals from each counter in each worker to the
master's totals, so that what finally gets reported includes the
effort spent in all workers.  And of course the instrumentation data
for any given node in the worker needs to be added to the
*corresponding node* in the master, not any random node.

Second, for nodes that are actually parallel aware, like the proposed
Partial Seq Scan node, there's bound to be some shared state.  In the
case of a Partial Seq Scan, for example, we need to keep track of
which blocks are yet to be scanned.  Amit's patch is currently silly
about this: it assumes there will be only one Partial Seq Scan in a
plan, which makes it easy to find the relevant state.  But that's not
an assumption we want to be stuck with.  Here again, if there are two
or more Partial Seq Scans in the plan passed to every worker, perhaps
with an Append node on top of them, then we need a way to match them
up across all the backends, so that each group of corresponding nodes
is looking at the same bit of shared state.

To meet these needs, what I propose to do is have
set_plan_references() assign an integer ID to every plan node that is
unique across the entire plan tree.  Then, when we ship part of the
plan tree off to a worker to be executed, the integer IDs will also
get copied (by copyfuncs.c support yet to be added, but it should be
pretty boilerplate ... I think), so that whatever plan node #42 is in
the parallel group leader will also be plan node #42 in the worker.
When I started out, I had the idea of trying to number the PlanState
nodes rather than the Plan nodes, and count on the fact the master and
the worker would traverse the query tree in the same order; since the
worker had only a subtree of the whole plan, it's numbering would be
offset from the master's, but the idea was that you could fix this by
adding an offset.  This turns out to not to work, at least not the way
I tried to do it, because ExecInitNode() visits all initPlans
everywhere in the tree before recursing over the main tree, so you
only get identical numbering everywhere if you start with the same
exact tree.  In any case, relying on two traversals in separate
processes to visit everything in the same order seems a bit fragile;
putting the ID into the plan node - which is what will actually be
passed down to the worker - seems much more robust.

It would be nice if we could assign the integer IDs with no gaps, and
with the IDs of a node's children immediately following that of their
parent.  The advantage of this is that if you want to have a data
structure for every node in the tree passed to some worker - like a
struct Instrumentation in dynamic shared memory - you can just create
an array and index it by (node ID) - (node ID of uppermost child
passed to worker), and every slot will be in use, so no memory is
wasted and no lookup table is needed.  Assigning the IDs in
set_plan_references() ... almost succeeds at this.  The removal of
SubqueryScan nodes as part of that process could create gaps in the
numbering sequence, but probably few enough that we could just ignore
the problem and waste an entry in whatever arrays we create whenever
that optimization applies.

My main concern with this design is how future-proof it is.  I know
Tom is working on ripping the planner apart and putting it back
together again to allow the upper levels of the planner to use paths,
and the discussion of the SS_finalize_plan() changes made some
reference to possibly wanting to mess with set_plan_references().  I'm
not sure if any of those changes would disturb what I'm proposing
here.

Thoughts?  Better ideas?  Concept patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: numbering plan nodes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> To meet these needs, what I propose to do is have
> set_plan_references() assign an integer ID to every plan node that is
> unique across the entire plan tree.

OK.

> It would be nice if we could assign the integer IDs with no gaps, and
> with the IDs of a node's children immediately following that of their
> parent.

I do not think this should be a goal, either explicitly or implicitly.
For one thing, it will be very much in the eye of individual beholders
which nodes are children-that-should-be-consecutive.  What about
initplans, subplans, grandchild nodes?  Different use cases could easily
have different opinions about what to do with those.

> The advantage of this is that if you want to have a data
> structure for every node in the tree passed to some worker - like a
> struct Instrumentation in dynamic shared memory - you can just create
> an array and index it by (node ID) - (node ID of uppermost child
> passed to worker), and every slot will be in use, so no memory is
> wasted and no lookup table is needed.

I think you could just waste the memory, or more likely you could cope
with one level of indirection so that you only waste a pointer per
node (ie, array of pointers indexed by node ID, only fill the ones that
are relevant).

> My main concern with this design is how future-proof it is.  I know
> Tom is working on ripping the planner apart and putting it back
> together again to allow the upper levels of the planner to use paths,
> and the discussion of the SS_finalize_plan() changes made some
> reference to possibly wanting to mess with set_plan_references().

I can't imagine I'd be doing anything that would break the simple case
of "give every node a distinct ID".  If you are building in weird
assumptions about traversal order, that might indeed be a problem.
        regards, tom lane



Re: numbering plan nodes

From
Robert Haas
Date:
Thanks much for the quick reply.

On Thu, Sep 17, 2015 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It would be nice if we could assign the integer IDs with no gaps, and
>> with the IDs of a node's children immediately following that of their
>> parent.
>
> I do not think this should be a goal, either explicitly or implicitly.
> For one thing, it will be very much in the eye of individual beholders
> which nodes are children-that-should-be-consecutive.  What about
> initplans, subplans, grandchild nodes?  Different use cases could easily
> have different opinions about what to do with those.

I was aiming for: the IDs of the descendents of any given node are
consecutive and have IDs which follow that's node ID, in any old order
you like.

>> The advantage of this is that if you want to have a data
>> structure for every node in the tree passed to some worker - like a
>> struct Instrumentation in dynamic shared memory - you can just create
>> an array and index it by (node ID) - (node ID of uppermost child
>> passed to worker), and every slot will be in use, so no memory is
>> wasted and no lookup table is needed.
>
> I think you could just waste the memory, or more likely you could cope
> with one level of indirection so that you only waste a pointer per
> node (ie, array of pointers indexed by node ID, only fill the ones that
> are relevant).

Hmm, I guess that wouldn't be too bad.  A straight join between N
relations will have N base relations and N - 1 joins and maybe a few
sorts or similar.  You could get a plan tree with a lot more nodes if
your query involves inheritance hierarchies - e.g. join 4 tables each
of which has 1000 children.  But even in that case you're only talking
about 32kB of memory for the indirect table, and that's not going to
be very high on your list of problems in that case.

>> My main concern with this design is how future-proof it is.  I know
>> Tom is working on ripping the planner apart and putting it back
>> together again to allow the upper levels of the planner to use paths,
>> and the discussion of the SS_finalize_plan() changes made some
>> reference to possibly wanting to mess with set_plan_references().
>
> I can't imagine I'd be doing anything that would break the simple case
> of "give every node a distinct ID".  If you are building in weird
> assumptions about traversal order, that might indeed be a problem.

Good to know, thanks.  With the design you proposal above, we can make
this insensitive to traversal order.  The only thing that would cause
trouble is if you somehow ended up with massive gaps in the numbering
sequence - e.g. if the final plan had 6 nodes, but the IDs were all 5
digit numbers, you'd waste a silly amount of memory relative to the
size of the plan.  But a few gaps (e.g. for removed SubqueryScan
nodes) won't be a problem.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: numbering plan nodes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Sep 17, 2015 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I can't imagine I'd be doing anything that would break the simple case
>> of "give every node a distinct ID".  If you are building in weird
>> assumptions about traversal order, that might indeed be a problem.

> Good to know, thanks.  With the design you proposal above, we can make
> this insensitive to traversal order.  The only thing that would cause
> trouble is if you somehow ended up with massive gaps in the numbering
> sequence - e.g. if the final plan had 6 nodes, but the IDs were all 5
> digit numbers, you'd waste a silly amount of memory relative to the
> size of the plan.  But a few gaps (e.g. for removed SubqueryScan
> nodes) won't be a problem.

Hm ... if you quit worrying about the order-of-assignment, maybe you
could prevent gaps from removed SubqueryScan nodes by not giving them
IDs until after that decision is made?  Although it may not be worth
any extra trouble.
        regards, tom lane



Re: numbering plan nodes

From
Simon Riggs
Date:
On 17 September 2015 at 13:14, Robert Haas <robertmhaas@gmail.com> wrote:
 
My main concern with this design is how future-proof it is.

Passing array offsets sounds brittle to me.

It would screw up any attempts to post-process the plans. Later enhancements seem certain to break that scheme.

It also assumes that all actors have access to a single memory structure that describes everything.

Hopefully we are working on a parallel query system that will work intranode as well as across nodes, so access to memory should not be assumed.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: numbering plan nodes

From
Robert Haas
Date:
On Thu, Sep 17, 2015 at 4:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Sep 17, 2015 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I can't imagine I'd be doing anything that would break the simple case
>>> of "give every node a distinct ID".  If you are building in weird
>>> assumptions about traversal order, that might indeed be a problem.
>
>> Good to know, thanks.  With the design you proposal above, we can make
>> this insensitive to traversal order.  The only thing that would cause
>> trouble is if you somehow ended up with massive gaps in the numbering
>> sequence - e.g. if the final plan had 6 nodes, but the IDs were all 5
>> digit numbers, you'd waste a silly amount of memory relative to the
>> size of the plan.  But a few gaps (e.g. for removed SubqueryScan
>> nodes) won't be a problem.
>
> Hm ... if you quit worrying about the order-of-assignment, maybe you
> could prevent gaps from removed SubqueryScan nodes by not giving them
> IDs until after that decision is made?

Hmm, that might work.  I'll try it.

> Although it may not be worth
> any extra trouble.

Right.  If it doesn't turn out to be easy, I won't worry about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: numbering plan nodes

From
Robert Haas
Date:
On Thu, Sep 17, 2015 at 5:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Passing array offsets sounds brittle to me.

How would this case be any different from other places in the planner
where we already do exactly that?  For example, RTIs.

> It would screw up any attempts to post-process the plans. Later enhancements
> seem certain to break that scheme.

I'm proposing to this very late in plan generation precisely for this
reason.  I think it would be quite difficult to post-process the plans
after the planner is completely through with them.  I've never heard
of anyone wanting to do that, but even if someone did, they'd still be
fine as long as they only deleted nodes or moved them around in the
tree.  If someone wanted to do surgery on a plan to create new nodes,
they would need to set the ID field properly, but that's no harder
(and probably for the most part quite a bit easier) than setting all
the other fields in the Plan correctly.  So I'm not seeing the
problem.

> It also assumes that all actors have access to a single memory structure
> that describes everything.

Our planner assumes that in general.  How is this case different?

> Hopefully we are working on a parallel query system that will work intranode
> as well as across nodes, so access to memory should not be assumed.

Currently, the only method we have for accessing data on another node
is FDWs.  When you access data via an FDW, there is both a local plan
(constructed by the local PostgreSQL) and perhaps a remote plan
(depending on what the FDW is accessing).  The local plan is
accessible via local memory, and the remote plan doesn't need to be.
What I'm talking about here doesn't change any of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: numbering plan nodes

From
Kouhei Kaigai
Date:
> One of the problems which the remaining parallel query patches need to
> solve is to correlate Plan or PlanState nodes across multiple
> backends.  This need arises in a couple of cases.  First, if a group
> of workers are all executing the same plan nodes, and es_instrument !=
> 0, then we'll accumulate instrumentation data in each worker, but we
> need to collect all of the instrumentation data from each individual
> worker and add the totals from each counter in each worker to the
> master's totals, so that what finally gets reported includes the
> effort spent in all workers.  And of course the instrumentation data
> for any given node in the worker needs to be added to the
> *corresponding node* in the master, not any random node.
> 
> Second, for nodes that are actually parallel aware, like the proposed
> Partial Seq Scan node, there's bound to be some shared state.  In the
> case of a Partial Seq Scan, for example, we need to keep track of
> which blocks are yet to be scanned.  Amit's patch is currently silly
> about this: it assumes there will be only one Partial Seq Scan in a
> plan, which makes it easy to find the relevant state.  But that's not
> an assumption we want to be stuck with.  Here again, if there are two
> or more Partial Seq Scans in the plan passed to every worker, perhaps
> with an Append node on top of them, then we need a way to match them
> up across all the backends, so that each group of corresponding nodes
> is looking at the same bit of shared state.
>
> To meet these needs, what I propose to do is have
> set_plan_references() assign an integer ID to every plan node that is
> unique across the entire plan tree.  Then, when we ship part of the
> plan tree off to a worker to be executed, the integer IDs will also
> get copied (by copyfuncs.c support yet to be added, but it should be
> pretty boilerplate ... I think), so that whatever plan node #42 is in
> the parallel group leader will also be plan node #42 in the worker.
> When I started out, I had the idea of trying to number the PlanState
> nodes rather than the Plan nodes, and count on the fact the master and
> the worker would traverse the query tree in the same order; since the
> worker had only a subtree of the whole plan, it's numbering would be
> offset from the master's, but the idea was that you could fix this by
> adding an offset.  This turns out to not to work, at least not the way
> I tried to do it, because ExecInitNode() visits all initPlans
> everywhere in the tree before recursing over the main tree, so you
> only get identical numbering everywhere if you start with the same
> exact tree.  In any case, relying on two traversals in separate
> processes to visit everything in the same order seems a bit fragile;
> putting the ID into the plan node - which is what will actually be
> passed down to the worker - seems much more robust.
> 
> It would be nice if we could assign the integer IDs with no gaps, and
> with the IDs of a node's children immediately following that of their
> parent.  The advantage of this is that if you want to have a data
> structure for every node in the tree passed to some worker - like a
> struct Instrumentation in dynamic shared memory - you can just create
> an array and index it by (node ID) - (node ID of uppermost child
> passed to worker), and every slot will be in use, so no memory is
> wasted and no lookup table is needed.
>
I entirely agree with the idea of plan-node identifier, however,
uncertain whether the node-id shall represent physical location on
the dynamic shared memory segment, because
(1) Relatively smaller number of node type needs shared state,   thus most of array items are empty.
(2) Extension that tries to modify plan-tree using planner_hook   may need to adjust node-id also.

Even though shm_toc_lookup() has to walk on the toc entries to find
out the node-id, it happens at once on beginning of the executor at
background worker side. I don't think it makes a significant problem.

Thanks,

> Assigning the IDs in
> set_plan_references() ... almost succeeds at this.  The removal of
> SubqueryScan nodes as part of that process could create gaps in the
> numbering sequence, but probably few enough that we could just ignore
> the problem and waste an entry in whatever arrays we create whenever
> that optimization applies.
> 
> My main concern with this design is how future-proof it is.  I know
> Tom is working on ripping the planner apart and putting it back
> together again to allow the upper levels of the planner to use paths,
> and the discussion of the SS_finalize_plan() changes made some
> reference to possibly wanting to mess with set_plan_references().  I'm
> not sure if any of those changes would disturb what I'm proposing
> here.
> 
> Thoughts?  Better ideas?  Concept patch attached.
>

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: numbering plan nodes

From
Robert Haas
Date:
On Thu, Sep 17, 2015 at 9:01 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I entirely agree with the idea of plan-node identifier, however,
> uncertain whether the node-id shall represent physical location on
> the dynamic shared memory segment, because
> (1) Relatively smaller number of node type needs shared state,
>     thus most of array items are empty.
> (2) Extension that tries to modify plan-tree using planner_hook
>     may need to adjust node-id also.
>
> Even though shm_toc_lookup() has to walk on the toc entries to find
> out the node-id, it happens at once on beginning of the executor at
> background worker side. I don't think it makes a significant problem.

Yes, I was thinking that what would make sense is to have each
parallel-aware node call shm_toc_insert() using its ID as the key.
Then, we also need Instrumentation nodes.  For those, I thought we
could use some fixed, high-numbered key, and Tom's idea.

Are there extensions that use planner_hook to do surgery on the plan
tree?  What do they do, exactly?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: numbering plan nodes

From
Kouhei Kaigai
Date:
> On Thu, Sep 17, 2015 at 9:01 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > I entirely agree with the idea of plan-node identifier, however,
> > uncertain whether the node-id shall represent physical location on
> > the dynamic shared memory segment, because
> > (1) Relatively smaller number of node type needs shared state,
> >     thus most of array items are empty.
> > (2) Extension that tries to modify plan-tree using planner_hook
> >     may need to adjust node-id also.
> >
> > Even though shm_toc_lookup() has to walk on the toc entries to find
> > out the node-id, it happens at once on beginning of the executor at
> > background worker side. I don't think it makes a significant problem.
> 
> Yes, I was thinking that what would make sense is to have each
> parallel-aware node call shm_toc_insert() using its ID as the key.
> Then, we also need Instrumentation nodes.  For those, I thought we
> could use some fixed, high-numbered key, and Tom's idea.
>
Hmm, indeed, run-time statistics are needed for every node.
If an array indexed by node-id would be a hash slot, we can treat
non-contiguous node-id with no troubles.

> Are there extensions that use planner_hook to do surgery on the plan
> tree?  What do they do, exactly?
>
(Even though it will not work under Funnel,) PG-Strom often inject
a preprocessor node under Agg-node to produce partial aggregation
to reduce number of rows to be processed by CPU.

Also, I have seen a paper published by Fujitsu folks. Their module
modifies plan-tree to replace built-in scan node with their own
columnar storage scan node. http://db-event.jpn.org/deim2015/paper/195.pdf
This paper is written in Japanese, however, figure-3 in page.4 shows
what I explain above.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Re: numbering plan nodes

From
Robert Haas
Date:
On Fri, Sep 18, 2015 at 5:24 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> On Thu, Sep 17, 2015 at 9:01 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> > I entirely agree with the idea of plan-node identifier, however,
>> > uncertain whether the node-id shall represent physical location on
>> > the dynamic shared memory segment, because
>> > (1) Relatively smaller number of node type needs shared state,
>> >     thus most of array items are empty.
>> > (2) Extension that tries to modify plan-tree using planner_hook
>> >     may need to adjust node-id also.
>> >
>> > Even though shm_toc_lookup() has to walk on the toc entries to find
>> > out the node-id, it happens at once on beginning of the executor at
>> > background worker side. I don't think it makes a significant problem.
>>
>> Yes, I was thinking that what would make sense is to have each
>> parallel-aware node call shm_toc_insert() using its ID as the key.
>> Then, we also need Instrumentation nodes.  For those, I thought we
>> could use some fixed, high-numbered key, and Tom's idea.
>>
> Hmm, indeed, run-time statistics are needed for every node.
> If an array indexed by node-id would be a hash slot, we can treat
> non-contiguous node-id with no troubles.

Yeah, we could do that, but it seems more complex than we really need.

>> Are there extensions that use planner_hook to do surgery on the plan
>> tree?  What do they do, exactly?
>>
> (Even though it will not work under Funnel,) PG-Strom often inject
> a preprocessor node under Agg-node to produce partial aggregation
> to reduce number of rows to be processed by CPU.

OK.  So, if you wanted to make it work under a Funnel, you'd need to
stick a node ID on it that was higher than any assigned to any plan
node thus far.  Doesn't seem like a big deal.

> Also, I have seen a paper published by Fujitsu folks. Their module
> modifies plan-tree to replace built-in scan node with their own
> columnar storage scan node.
>   http://db-event.jpn.org/deim2015/paper/195.pdf
> This paper is written in Japanese, however, figure-3 in page.4 shows
> what I explain above.

If you replaced a node, you'd just copy the ID from the existing node.
That seems simple enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company