Bogus rescan logic in ExecReScanCteScan

From:    Tom Lane
Subject: Bogus rescan logic in ExecReScanCteScan
Msg-id:  13458.1345060924@sss.pgh.pa.us
List:    pgsql-hackers
I looked into the misbehavior reported by Adam Mackler in
http://archives.postgresql.org/pgsql-bugs/2012-08/msg00073.php
What it seems to boil down to is a design error in ExecReScanCteScan.

Given the query

WITH RECURSIVE tab(id_key,link) AS (
    VALUES (1,17), (2,17), (3,17), (4,17), (6,17), (5,17)
),
iter (id_key, row_type, link) AS (
    SELECT 0, 'base', 17
    UNION (
        WITH remaining(id_key, row_type, link, min) AS (
            SELECT tab.id_key, 'true'::text, iter.link, MIN(tab.id_key) OVER ()
            FROM tab INNER JOIN iter USING (link)
            WHERE tab.id_key > iter.id_key
        ),
        first_remaining AS (
            SELECT id_key, row_type, link FROM remaining WHERE id_key=min
        ),
        effect AS (
            SELECT tab.id_key, 'new'::text, tab.link
            FROM first_remaining e INNER JOIN tab ON e.id_key=tab.id_key
            /* Try changing this WHERE clause to other false expressions */
            WHERE e.row_type='false'
        )
        SELECT * FROM first_remaining
        /* Try uncommenting the next line */
        UNION SELECT * FROM effect
    )
)
SELECT * FROM iter;

we get a plan like this:

 CTE Scan on iter  (cost=9.06..9.48 rows=21 width=40)
   CTE tab
     ->  Values Scan on "*VALUES*"  (cost=0.00..0.08 rows=6 width=8)
   CTE iter
     ->  Recursive Union  (cost=0.00..8.98 rows=21 width=40)
           ->  Result  (cost=0.00..0.01 rows=1 width=0)
           ->  Unique  (cost=0.84..0.86 rows=2 width=40)
                 CTE remaining
                   ->  WindowAgg  (cost=0.20..0.53 rows=2 width=8)
                         ->  Hash Join  (cost=0.20..0.51 rows=2 width=8)
                               Hash Cond: (iter.link = tab.link)
                               Join Filter: (tab.id_key > iter.id_key)
                               ->  WorkTable Scan on iter  (cost=0.00..0.20 rows=10 width=8)
                               ->  Hash  (cost=0.12..0.12 rows=6 width=8)
                                     ->  CTE Scan on tab  (cost=0.00..0.12 rows=6 width=8)
                 CTE first_remaining
                   ->  CTE Scan on remaining  (cost=0.00..0.04 rows=1 width=44)
                         Filter: (id_key = min)
                 CTE effect
                   ->  Hash Join  (cost=0.04..0.19 rows=1 width=8)
                         Hash Cond: (tab.id_key = e.id_key)
                         ->  CTE Scan on tab  (cost=0.00..0.12 rows=6 width=8)
                         ->  Hash  (cost=0.02..0.02 rows=1 width=4)
----->                         ->  CTE Scan on first_remaining e  (cost=0.00..0.02 rows=1 width=4)
                                     Filter: (row_type = 'false'::text)
                 ->  Sort  (cost=0.07..0.08 rows=2 width=40)
                       Sort Key: first_remaining.id_key, first_remaining.row_type, first_remaining.link
                       ->  Append  (cost=0.00..0.06 rows=2 width=40)
----->                       ->  CTE Scan on first_remaining  (cost=0.00..0.02 rows=1 width=40)
                             ->  CTE Scan on effect  (cost=0.00..0.02 rows=1 width=40)

where there are two CTEScan nodes (marked for effect) on the output of the
"first_remaining" CTE.  Now the first of these (the one inside the "effect"
CTE) is initialized first and thus becomes the "leader" of the group of
CTEScans reading this CTE.  However, because node ReScans happen basically in
execution order, the one under the Append is the one that actually gets a
rescan call first; in fact, that one will get run to completion before
"effect" ever gets rescanned.  So what happens while re-executing the
right-hand side of the RecursiveUnion is that the second scan of
first_remaining just regurgitates what first_remaining already output on the
previous cycle, because its rescan only rewound its own tuple pointer and
didn't change the tuplestore contents.  That leads to totally bogus results
of course.

Another way to say this is that ExecReScanCteScan only works right if the
CTE nodes of a group are ReScanned in the same order they were initialized
in, which might be true some of the time but is not to be relied on.  It's a
bit surprising that we've not had any reports about this in the nearly 4
years that code has been in there.

After reflecting on this for awhile I think that ExecReScanCteScan is just
overly complicated, and there's no reason not to let the first arrival reset
the tuplestore.  The attached patch seems to fix Adam's problem (at least,
variants of his query that seem like they should give the same answer do),
and it passes our regression tests.

Comments?
			regards, tom lane

diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 7ab15ac63e1020dd00b3ef27950615dc108b795b..d45352896b09b7580f1518450366d53cbcfdbc1e 100644
*** a/src/backend/executor/nodeCtescan.c
--- b/src/backend/executor/nodeCtescan.c
*************** ExecReScanCteScan(CteScanState *node)
*** 312,337 ****
  	ExecScanReScan(&node->ss);

! 	if (node->leader == node)
  	{
! 		/*
! 		 * The leader is responsible for clearing the tuplestore if a new scan
! 		 * of the underlying CTE is required.
! 		 */
! 		if (node->cteplanstate->chgParam != NULL)
! 		{
! 			tuplestore_clear(tuplestorestate);
! 			node->eof_cte = false;
! 		}
! 		else
! 		{
! 			tuplestore_select_read_pointer(tuplestorestate, node->readptr);
! 			tuplestore_rescan(tuplestorestate);
! 		}
  	}
  	else
  	{
! 		/* Not leader, so just rewind my own pointer */
  		tuplestore_select_read_pointer(tuplestorestate, node->readptr);
  		tuplestore_rescan(tuplestorestate);
  	}
--- 312,333 ----
  	ExecScanReScan(&node->ss);

! 	/*
! 	 * Clear the tuplestore if a new scan of the underlying CTE is required.
! 	 * This implicitly resets all the tuplestore's read pointers.  Note that
! 	 * multiple CTE nodes might redundantly clear the tuplestore; that's OK,
! 	 * and not unduly expensive.  We'll stop taking this path as soon as
! 	 * somebody has attempted to read something from the underlying CTE
! 	 * (thereby causing its chgParam to be cleared).
! 	 */
! 	if (node->leader->cteplanstate->chgParam != NULL)
  	{
! 		tuplestore_clear(tuplestorestate);
! 		node->leader->eof_cte = false;
  	}
  	else
  	{
! 		/* Else, just rewind my own pointer */
  		tuplestore_select_read_pointer(tuplestorestate, node->readptr);
  		tuplestore_rescan(tuplestorestate);
  	}