Bogus rescan logic in ExecReScanCteScan - Mailing list pgsql-hackers

From Tom Lane
Subject Bogus rescan logic in ExecReScanCteScan
Date
Msg-id 13458.1345060924@sss.pgh.pa.us
Whole thread Raw
List pgsql-hackers
I looked into the misbehavior reported by Adam Mackler in
http://archives.postgresql.org/pgsql-bugs/2012-08/msg00073.php
What it seems to boil down to is a design error in ExecReScanCteScan.
Given the query

WITH RECURSIVE
tab(id_key,link) AS ( VALUES (1,17), (2,17), (3,17), (4,17), (6,17), (5,17) ),
iter (id_key, row_type, link) AS (
    SELECT 0, 'base', 17
  UNION(
    WITH remaining(id_key, row_type, link, min) AS (
      SELECT tab.id_key, 'true'::text, iter.link, MIN(tab.id_key) OVER ()
      FROM tab INNER JOIN iter USING (link)
      WHERE tab.id_key > iter.id_key
    ),
    first_remaining AS (
      SELECT id_key, row_type, link
      FROM remaining
      WHERE id_key=min
    ),
    effect AS (
      SELECT tab.id_key, 'new'::text, tab.link
      FROM first_remaining e INNER JOIN tab ON e.id_key=tab.id_key
      /* Try changing this WHERE clause to other false expressions */
      WHERE e.row_type='false'
    )
    SELECT * FROM first_remaining
    /* Try uncommenting the next line */
    UNION SELECT * FROM effect
  )
)
SELECT * FROM iter;

we get a plan like this:

 CTE Scan on iter  (cost=9.06..9.48 rows=21 width=40)
   CTE tab
     ->  Values Scan on "*VALUES*"  (cost=0.00..0.08 rows=6 width=8)
   CTE iter
     ->  Recursive Union  (cost=0.00..8.98 rows=21 width=40)
           ->  Result  (cost=0.00..0.01 rows=1 width=0)
           ->  Unique  (cost=0.84..0.86 rows=2 width=40)
                 CTE remaining
                   ->  WindowAgg  (cost=0.20..0.53 rows=2 width=8)
                         ->  Hash Join  (cost=0.20..0.51 rows=2 width=8)
                               Hash Cond: (iter.link = tab.link)
                               Join Filter: (tab.id_key > iter.id_key)
                               ->  WorkTable Scan on iter  (cost=0.00..0.20 rows=10 width=8)
                               ->  Hash  (cost=0.12..0.12 rows=6 width=8)
                                     ->  CTE Scan on tab  (cost=0.00..0.12 rows=6 width=8)
                 CTE first_remaining
                   ->  CTE Scan on remaining  (cost=0.00..0.04 rows=1 width=44)
                         Filter: (id_key = min)
                 CTE effect
                   ->  Hash Join  (cost=0.04..0.19 rows=1 width=8)
                         Hash Cond: (tab.id_key = e.id_key)
                         ->  CTE Scan on tab  (cost=0.00..0.12 rows=6 width=8)
                         ->  Hash  (cost=0.02..0.02 rows=1 width=4)
----->                         ->  CTE Scan on first_remaining e  (cost=0.00..0.02 rows=1 width=4)
                                     Filter: (row_type = 'false'::text)
                 ->  Sort  (cost=0.07..0.08 rows=2 width=40)
                       Sort Key: first_remaining.id_key, first_remaining.row_type, first_remaining.link
                       ->  Append  (cost=0.00..0.06 rows=2 width=40)
----->                       ->  CTE Scan on first_remaining  (cost=0.00..0.02 rows=1 width=40)
                             ->  CTE Scan on effect  (cost=0.00..0.02 rows=1 width=40)

where there are two CTEScan nodes (marked for effect) on the output of
the "first_remaining" CTE.   Now the first of these (the one inside the
"effect" CTE) is initialized first and thus becomes the "leader" of the
group of CTEScans reading this CTE.  However, because node ReScans
happen basically in execution order, the one under the Append is the one
that actually gets a rescan call first; in fact, that one will get run
to completion before "effect" ever gets rescanned.  So what happens
while re-executing the right-hand side of the RecursiveUnion is that
the second scan of first_remaining just regurgitates what
first_remaining already output on the previous cycle, because its
rescan only rewound its own tuple pointer and didn't change the
tuplestore contents.  That leads to totally bogus results of course.

Another way to say this is that ExecReScanCteScan only works right if
the CTE nodes of a group are ReScanned in the same order they were
initialized in, which might be true some of the time but is not to be
relied on.  It's a bit surprising that we've not had any reports about
this in the nearly 4 years that code has been in there.

After reflecting on this for awhile I think that ExecReScanCteScan
is just overly complicated, and there's no reason not to let the first
arrival reset the tuplestore.  The attached patch seems to fix Adam's
problem (at least, variants of his query that seem like they should give
the same answer do), and it passes our regression tests.

Comments?

            regards, tom lane


diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 7ab15ac63e1020dd00b3ef27950615dc108b795b..d45352896b09b7580f1518450366d53cbcfdbc1e 100644
*** a/src/backend/executor/nodeCtescan.c
--- b/src/backend/executor/nodeCtescan.c
*************** ExecReScanCteScan(CteScanState *node)
*** 312,337 ****

      ExecScanReScan(&node->ss);

!     if (node->leader == node)
      {
!         /*
!          * The leader is responsible for clearing the tuplestore if a new scan
!          * of the underlying CTE is required.
!          */
!         if (node->cteplanstate->chgParam != NULL)
!         {
!             tuplestore_clear(tuplestorestate);
!             node->eof_cte = false;
!         }
!         else
!         {
!             tuplestore_select_read_pointer(tuplestorestate, node->readptr);
!             tuplestore_rescan(tuplestorestate);
!         }
      }
      else
      {
!         /* Not leader, so just rewind my own pointer */
          tuplestore_select_read_pointer(tuplestorestate, node->readptr);
          tuplestore_rescan(tuplestorestate);
      }
--- 312,333 ----

      ExecScanReScan(&node->ss);

!     /*
!      * Clear the tuplestore if a new scan of the underlying CTE is required.
!      * This implicitly resets all the tuplestore's read pointers.  Note that
!      * multiple CTE nodes might redundantly clear the tuplestore; that's OK,
!      * and not unduly expensive.  We'll stop taking this path as soon as
!      * somebody has attempted to read something from the underlying CTE
!      * (thereby causing its chgParam to be cleared).
!      */
!     if (node->leader->cteplanstate->chgParam != NULL)
      {
!         tuplestore_clear(tuplestorestate);
!         node->leader->eof_cte = false;
      }
      else
      {
!         /* Else, just rewind my own pointer */
          tuplestore_select_read_pointer(tuplestorestate, node->readptr);
          tuplestore_rescan(tuplestorestate);
      }

pgsql-hackers by date:

Previous
From: Fabrízio de Royes Mello
Date:
Subject: Re: CREATE SCHEMA IF NOT EXISTS
Next
From: "David E. Wheeler"
Date:
Subject: Re: CREATE SCHEMA IF NOT EXISTS