Re: [HACKERS] [BUGS] [postgresql 10 beta3] unrecognized node type: 90 - Mailing list pgsql-hackers

I wrote:
> Ah-hah, I see my dromedary box is one of the ones failing, so I'll
> have a look there.  I can't reproduce it on my other machines.

OK, so this is a whole lot more broken than I thought :-(.

Bear in mind that the plan for this (omitting uninteresting detail) is
 Nested Loop Left Join
   ->  Values Scan on "*VALUES*"
   ->  Finalize GroupAggregate
         ->  Gather Merge
               ->  Partial GroupAggregate
                     ->  Sort
                           ->  Parallel Seq Scan on tenk1

What seems to be happening is that:

1. On the first pass, the parallel seqscan work gets doled out to several
workers, plus the leader, as expected.

2. When the nestloop rescans its right input, ExecReScanGatherMerge
supposes that this is good enough to handle rescanning its subnodes:
ExecReScan(node->ps.lefttree);

Leaving aside the question of why that doesn't look like nearly every
other child rescan call, what happens is that that invokes ExecReScanAgg,
which does the more usual thing:
    if (outerPlan->chgParam == NULL)
        ExecReScan(outerPlan);

and so that invokes ExecReScanSort, and then behold what ExecReScanSort
thinks it can optimize away:
     * If subnode is to be rescanned then we forget previous sort results; we
     * have to re-read the subplan and re-sort.  Also must re-sort if the
     * bounded-sort parameters changed or we didn't select randomAccess.
     *
     * Otherwise we can just rewind and rescan the sorted output.

So we never get to ExecReScanSeqScan, and thus not to heap_rescan,
with the effect that parallel_scan->phs_nallocated never gets reset.
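(For clarity, the decision that comment describes comes down to roughly the
following; this is paraphrased from memory, not a verbatim quote of
nodeSort.c:

    if (outerPlan->chgParam != NULL ||
        node->bounded != node->bounded_Done ||
        node->bound != node->bound_Done ||
        !node->randomAccess)
    {
        /* forget the old results; re-read the subplan and re-sort */
        node->sort_Done = false;
    }
    else
    {
        /* nothing appears to have changed: just rewind the sorted output */
        tuplesort_rescan((Tuplesortstate *) node->tuplesortstate);
    }

Since the seqscan's chgParam is empty and none of the other conditions hold,
we take the rewind branch.)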

3. On the next pass, we fire up all the workers as expected, but they all
perceive phs_nallocated >= rs_nblocks and conclude they have nothing to
do.  Meanwhile, in the leader, nodeSort just re-emits the sorted data it
had last time.  Net effect is that the GatherMerge node returns only the
fraction of the data that was scanned by the leader in the first pass.
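To make that concrete: each participant claims blocks by doing an atomic
fetch-and-add on the shared counter, roughly like this (again paraphrased,
not a verbatim quote of heap_parallelscan_nextpage):

    nallocated = pg_atomic_fetch_add_u64(&parallel_scan->phs_nallocated, 1);
    if (nallocated >= scan->rs_nblocks)
        page = InvalidBlockNumber;      /* whole relation already handed out */
    else
        page = (nallocated + parallel_scan->phs_startblock) % scan->rs_nblocks;

Because the counter still holds whatever value it reached at the end of the
first pass, every participant falls straight into the InvalidBlockNumber case.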

4. The fact that the test succeeds on many machines implies that the
leader process is usually doing *all* of the work.  This is in itself not
very good.  Even on the machines where it fails, the fact that the tuple
counts are usually a pretty large fraction of the expected values
indicates that the leader usually did most of the work.  We need to take
a closer look at why that is.

However, the bottom line here is that parallel scan is completely broken
for rescans, and it's not (solely) the fault of nodeGatherMerge; rather,
the issue is that nobody bothered to wire up parallelism to the rescan
parameterization mechanism.  I imagine that related bugs can be
demonstrated in 9.6 with little effort.

I think that the correct fix probably involves marking each parallel scan
plan node as dependent on a pseudo executor parameter, which the parent
Gather or GatherMerge node would flag as being changed on each rescan.
This would cue the plan layers in between that they cannot optimize on the
assumption that the leader's instance of the parallel scan will produce
exactly the same rows as it did last time, even when "nothing else
changed".  The "wtParam" pseudo parameter that's used for communication
between RecursiveUnion and its descendant WorkTableScan node is a good
model for what needs to happen.
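
To sketch the idea (purely hypothetical: "rescan_param" is an imagined new
field on the GatherMerge plan node, assigned a paramid at plan time the same
way wtParam is, and none of this exists today):

    /*
     * Hypothetical sketch only; rescan_param is an assumed new plan field.
     */
    void
    ExecReScanGatherMerge(GatherMergeState *node)
    {
        GatherMerge *gm = (GatherMerge *) node->ps.plan;
        PlanState  *outerPlan = outerPlanState(node);

        /* shut down any existing workers and reset local state, as now */

        /*
         * Advertise the pseudo parameter as changed, so that the Agg and
         * Sort nodes between here and the parallel scan see a non-NULL
         * chgParam and cannot optimize away rescanning their children.
         */
        outerPlan->chgParam = bms_add_member(outerPlan->chgParam,
                                             gm->rescan_param);

        /*
         * No explicit ExecReScan of the child is needed here: since its
         * chgParam is now non-NULL, it will be rescanned at its first
         * ExecProcNode call, per the usual executor convention.
         */
    }

Gather would presumably need the identical treatment.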

A possibly-simpler fix would be to abandon the idea that the leader
should do any of the work, but I imagine that will be unpopular.

As I mentioned, I'm outta here for the next week.  I'd be willing to
work on, or review, a patch for this when I get back.
        regards, tom lane

PS: while I was trying to decipher this, I found three or four other
minor bugs or near-bugs in nodeGatherMerge.c.  But none of them seem to be
triggering in this example.  I plan to do a round of code-review-ish fixes
there when I get back.


