I don't know the parallel code, so I'm not going to comment on the overall patch, but I do have a thought on ExecSetTupleBound().
That function is accumulating cases where we recurse into a single child (result, gather, gather-merge). Recursion for a single child isn't particularly efficient. I know that when a function has just one recursive call like this, compilers will frequently turn it into a loop, but I don't know whether they can do the same when the recursive call is repeated across several branches, as it is here.
Would we be better off moving those cases into the while loop I added to avoid the recursion? So we'd end up with something like:
while ()
{
    if subquery
    else if result
    else if gather
    else if gather merge
}
if sort
else if merge append
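
To make that shape concrete, here is a rough standalone sketch in C. It is not the real function: MockNode, MockTag and mock_set_tuple_bound() are made-up stand-ins for the PlanState structs and ExecSetTupleBound(), and it skips the per-node conditions the real code has to check before descending. It only illustrates the control flow: the single-child cases are handled by iterating, and recursion is left for merge append, which has several children.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node tags; these are NOT PostgreSQL's real NodeTag values. */
typedef enum MockTag
{
    MOCK_SUBQUERY,
    MOCK_RESULT,
    MOCK_GATHER,
    MOCK_GATHER_MERGE,
    MOCK_SORT,
    MOCK_MERGE_APPEND,
    MOCK_OTHER
} MockTag;

#define MOCK_MAX_INPUTS 4

/* Hypothetical plan-state node, reduced to what the sketch needs. */
typedef struct MockNode
{
    MockTag     tag;
    struct MockNode *child;                     /* single-child nodes */
    struct MockNode *inputs[MOCK_MAX_INPUTS];   /* merge append only */
    int         ninputs;
    long        bound;                          /* -1 means "no bound set" */
} MockNode;

static void
mock_set_tuple_bound(long tuples_needed, MockNode *node)
{
    assert(node != NULL);

    /* Walk down through single-child nodes with a loop, not recursion. */
    for (;;)
    {
        if (node->tag == MOCK_SUBQUERY)
            node = node->child;
        else if (node->tag == MOCK_RESULT)
            node = node->child;
        else if (node->tag == MOCK_GATHER)
            node = node->child;
        else if (node->tag == MOCK_GATHER_MERGE)
            node = node->child;
        else
            break;              /* not a pass-through node */

        if (node == NULL)
            return;             /* pass-through node had no child */
    }

    if (node->tag == MOCK_SORT)
    {
        /* The sort can stop once it has this many tuples. */
        node->bound = tuples_needed;
    }
    else if (node->tag == MOCK_MERGE_APPEND)
    {
        /* Several children, so recursion is still needed here. */
        for (int i = 0; i < node->ninputs; i++)
            mock_set_tuple_bound(tuples_needed, node->inputs[i]);
    }
}
```

Calling mock_set_tuple_bound(10, top) on a subquery -> result -> gather -> sort chain sets the bound on the sort node in a single call frame; only a merge append adds frames, one level per merge append on the path.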
And a nit about my original fix, now that I've looked at the pg10 code more: the two casts I added (to SubqueryScanState and Node) should probably be changed to castNode() calls.
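
For anyone not familiar with the idea, the point of a castNode()-style call over a plain C cast is that the node's tag gets checked in assert-enabled builds. Below is a small standalone imitation of that pattern. Everything in it (DemoNode, DemoTag, demo_cast_impl, demo_cast_node) is hypothetical and only mimics the idea; it is not the definition from the PostgreSQL headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags and node header, loosely modelled on tagged nodes. */
typedef enum DemoTag
{
    T_DemoSubqueryScanState,
    T_DemoSortState
} DemoTag;

typedef struct DemoNode
{
    DemoTag     type;
} DemoNode;

/* Hypothetical state struct; the tagged header must be the first member. */
typedef struct DemoSubqueryScanState
{
    DemoNode    node;
    DemoNode   *subplan;
} DemoSubqueryScanState;

/* Verify the tag (when assertions are enabled), then hand the pointer back. */
static inline DemoNode *
demo_cast_impl(DemoTag expected, void *ptr)
{
    assert(ptr == NULL || ((DemoNode *) ptr)->type == expected);
    return (DemoNode *) ptr;
}

/* Checked cast: the target type name also selects the tag to verify. */
#define demo_cast_node(_type_, ptr) \
    ((_type_ *) demo_cast_impl(T_##_type_, (ptr)))
```

With this, demo_cast_node(DemoSubqueryScanState, ptr) behaves like (DemoSubqueryScanState *) ptr when the tag matches, and trips the assertion if ptr actually points at some other node type.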