Thread: The documentation for READ COMMITTED may be incomplete or wrong

The documentation for READ COMMITTED may be incomplete or wrong

From
Aleksander Alekseev
Date:
Hi hackers,

A colleague of mine, Ante Krešić, got puzzled by the following behavior:

Setup:

postgres=# create table inh_test (id serial, value float);
CREATE TABLE
postgres=# create table inh_child_1 () INHERITS ( inh_test);
CREATE TABLE
postgres=# create table inh_child_2 () INHERITS ( inh_test);
CREATE TABLE
postgres=# insert into inh_child_1 values (1,1);
INSERT 0 1
postgres=# insert into inh_child_2 values (1,1);
INSERT 0 1

Update tuples in first transaction:

postgres=# begin;
BEGIN
postgres=*# update inh_test set value = 2 where value = 1;
UPDATE 2

Delete in second transaction while the first is still active:

postgres=# delete from inh_test where value = 1;

Commit in the first transaction and we get a delete in the second one
even though committed values do not qualify after update.

postgres=# COMMIT;

postgres=# delete from inh_test where value = 1;
DELETE 1

The same happens for declarative partitioned tables as well. When
working on a table without inheritance / partitioning the result is
different, DELETE 0.

So what's the problem?

According to the documentation [1]:

"""
UPDATE, DELETE [..] commands behave the same as SELECT in terms of
searching for target rows: they will only find target rows that were
committed as of the command start time. However, such a target row
might have already been updated (or deleted or locked) by another
concurrent transaction by the time it is found. In this case, the
would-be updater will wait for the first updating transaction to
commit or roll back (if it is still in progress). If the first updater
rolls back, then its effects are negated and the second updater can
proceed with updating the originally found row. If the first updater
commits, the second updater will ignore the row if the first updater
deleted it, otherwise it will attempt to apply its operation to the
updated version of the row. The search condition of the command (the
WHERE clause) is re-evaluated to see if the updated version of the row
still matches the search condition. If so, the second updater proceeds
with its operation using the updated version of the row.
"""

It looks like the observed behaviour contradicts the documentation. If
we read it literally the second transaction should delete 0 rows, as
it does for non-partitioned and non-inherited tables. From what I can
tell the observed behavior doesn't contradict the general guarantees
promised by READ COMMITTED.

Perhaps we should update the documentation for this case, or maybe
remove the quoted part of it.

Thoughts?

[1]: https://www.postgresql.org/docs/current/transaction-iso.html

--
Best regards,
Aleksander Alekseev



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Tom Lane
Date:
Aleksander Alekseev <aleksander@timescale.com> writes:
> A colleague of mine, Ante Krešić, got puzzled by the following behavior:

That's not a documentation problem.  That's a bug, and an extremely
nasty one.  A quick check shows that it works as expected up through
v13, but fails as described in v14 and later.  Needs bisecting ...

            regards, tom lane



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Nathan Bossart
Date:
On Thu, May 18, 2023 at 10:53:35AM -0400, Tom Lane wrote:
> That's not a documentation problem.  That's a bug, and an extremely
> nasty one.  A quick check shows that it works as expected up through
> v13, but fails as described in v14 and later.  Needs bisecting ...

git-bisect points me to 86dc900.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Tom Lane
Date:
I wrote:
> Aleksander Alekseev <aleksander@timescale.com> writes:
>> A colleague of mine, Ante Krešić, got puzzled by the following behavior:

> That's not a documentation problem.  That's a bug, and an extremely
> nasty one.  A quick check shows that it works as expected up through
> v13, but fails as described in v14 and later.  Needs bisecting ...

Ugh.  Bisecting says it broke at

86dc90056dfdbd9d1b891718d2e5614e3e432f35 is the first bad commit
commit 86dc90056dfdbd9d1b891718d2e5614e3e432f35
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Mar 31 11:52:34 2021 -0400

    Rework planning and execution of UPDATE and DELETE.

which was absolutely not supposed to be breaking any concurrent-execution
guarantees.  I wonder what we got wrong.

            regards, tom lane



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Aleksander Alekseev
Date:
Hi,

> I wonder what we got wrong.

One thing we noticed is that the description for EvalPlanQual may be wrong [1]:

"""
In UPDATE/DELETE, only the target relation needs to be handled this way.
In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE,
so we obtain lock on the current tuple version in each such relation before
executing the recheck.
"""

[1]: https://github.com/postgres/postgres/blob/master/src/backend/executor/README#L381

-- 
Best regards,
Aleksander Alekseev



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Nathan Bossart
Date:
On Thu, May 18, 2023 at 11:22:54AM -0400, Tom Lane wrote:
> Ugh.  Bisecting says it broke at
> 
> 86dc90056dfdbd9d1b891718d2e5614e3e432f35 is the first bad commit
> commit 86dc90056dfdbd9d1b891718d2e5614e3e432f35
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Wed Mar 31 11:52:34 2021 -0400
> 
>     Rework planning and execution of UPDATE and DELETE.
> 
> which was absolutely not supposed to be breaking any concurrent-execution
> guarantees.  I wonder what we got wrong.

With the reproduction steps listed upthread, I see that XMAX for both
tuples is set to the deleting transaction, but the one in inh_child_2 has
two additional infomask flags: HEAP_XMAX_EXCL_LOCK and HEAP_XMAX_LOCK_ONLY.
If I add a third table (i.e., inh_child_3), XMAX for all three tuples is
set to the deleting transaction, and only the one in inh_child_3 has the
lock bits set.  Also, in the three-table case, the DELETE statement reports
"DELETE 2".

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Thu, May 18, 2023 at 11:22:54AM -0400, Tom Lane wrote:
>> Ugh.  Bisecting says it broke at
>> commit 86dc90056dfdbd9d1b891718d2e5614e3e432f35
>> which was absolutely not supposed to be breaking any concurrent-execution
>> guarantees.  I wonder what we got wrong.

> With the reproduction steps listed upthread, I see that XMAX for both
> tuples is set to the deleting transaction, but the one in inh_child_2 has
> two additional infomask flags: HEAP_XMAX_EXCL_LOCK and HEAP_XMAX_LOCK_ONLY.
> If I add a third table (i.e., inh_child_3), XMAX for all three tuples is
> set to the deleting transaction, and only the one in inh_child_3 has the
> lock bits set.  Also, in the three-table case, the DELETE statement reports
> "DELETE 2".

Yeah.  I see the problem: when starting up an EPQ recheck, we stuff
the tuple-to-test into the epqstate->relsubs_slot[] entry for the
relation it came from, but we do nothing to the EPQ state for the
other target relations, which allows the EPQ plan to fetch rows
from those relations as usual.  If it finds a (non-updated) row
passing the qual, kaboom!  We decide the EPQ check passed.

What we need to do, I think, is set epqstate->relsubs_done[] for
all target relations except the one we are stuffing a tuple into.

While nodeModifyTable can certainly be made to do that, things are
complicated by the fact that currently ExecScanReScan thinks it ought
to clear all the relsubs_done flags, which would break things again.
I wonder if we can simply delete that code.  Dropping the
FDW/Custom-specific code there is a bit scary, but on the whole that
looks like code that got cargo-culted in rather than anything we
actually need.

The reason this wasn't a bug before 86dc90056 is that any given
plan tree could have only one target relation, so there was not
anything else to suppress.

            regards, tom lane



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Nathan Bossart
Date:
On Thu, May 18, 2023 at 04:03:36PM -0400, Tom Lane wrote:
> Yeah.  I see the problem: when starting up an EPQ recheck, we stuff
> the tuple-to-test into the epqstate->relsubs_slot[] entry for the
> relation it came from, but we do nothing to the EPQ state for the
> other target relations, which allows the EPQ plan to fetch rows
> from those relations as usual.  If it finds a (non-updated) row
> passing the qual, kaboom!  We decide the EPQ check passed.

Ah, so the EPQ check only fails for the last tuple because we won't fetch
rows from the other relations.  I think that explains the behavior I'm
seeing.

> What we need to do, I think, is set epqstate->relsubs_done[] for
> all target relations except the one we are stuffing a tuple into.

This seems generally reasonable to me.

> While nodeModifyTable can certainly be made to do that, things are
> complicated by the fact that currently ExecScanReScan thinks it ought
> to clear all the relsubs_done flags, which would break things again.
> I wonder if we can simply delete that code.  Dropping the
> FDW/Custom-specific code there is a bit scary, but on the whole that
> looks like code that got cargo-culted in rather than anything we
> actually need.

I see that part was added in 385f337 [0].  I haven't had a chance to
evaluate whether it seems necessary.

[0] https://postgr.es/m/9A28C8860F777E439AA12E8AEA7694F80117370C%40BPXM15GP.gisp.nec.co.jp

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Thu, May 18, 2023 at 04:03:36PM -0400, Tom Lane wrote:
>> What we need to do, I think, is set epqstate->relsubs_done[] for
>> all target relations except the one we are stuffing a tuple into.

> This seems generally reasonable to me.

Here's a draft patch for this.  I think it's OK for HEAD, but we are
going to need some different details for the back branches, because
adding a field to struct EPQState would create an ABI break.  Our
standard trick of shoving the field to the end in the back branches
won't help, because struct EPQState is actually embedded in
struct ModifyTableState, meaning that all the subsequent fields
therein will move.  Maybe we can get away with that, but I bet not.

I think what the back branches will have to do is reinitialize the
relsubs_done array every time through EvalPlanQual(), which is a bit sad
but probably doesn't amount to anything compared to the startup overhead
of the sub-executor.

Debian Code Search doesn't know of any outside code touching
relsubs_done, so I think we are safe in dropping that code in
ExecScanReScan.  It seems quite pointless anyway considering
that up to now, EvalPlanQualBegin has always zeroed the whole
array.

            regards, tom lane

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0186be452c..c518daa7d0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2469,7 +2469,7 @@ ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist)
  *    relation - table containing tuple
  *    rti - rangetable index of table containing tuple
  *    inputslot - tuple for processing - this can be the slot from
- *        EvalPlanQualSlot(), for the increased efficiency.
+ *        EvalPlanQualSlot() for this rel, for increased efficiency.
  *
  * This tests whether the tuple in inputslot still matches the relevant
  * quals. For that result to be useful, typically the input tuple has to be
@@ -2503,6 +2503,13 @@ EvalPlanQual(EPQState *epqstate, Relation relation,
     if (testslot != inputslot)
         ExecCopySlot(testslot, inputslot);

+    /*
+     * Mark that an EPQ tuple is available for this relation.  (If there is
+     * more than one result relation, the others remain marked as having no
+     * tuple available.)
+     */
+    epqstate->relsubs_done[rti - 1] = false;
+
     /*
      * Run the EPQ query.  We assume it will return at most one tuple.
      */
@@ -2519,11 +2526,12 @@ EvalPlanQual(EPQState *epqstate, Relation relation,
         ExecMaterializeSlot(slot);

     /*
-     * Clear out the test tuple.  This is needed in case the EPQ query is
-     * re-used to test a tuple for a different relation.  (Not clear that can
-     * really happen, but let's be safe.)
+     * Clear out the test tuple, and mark that no tuple is available here.
+     * This is needed in case the EPQ state is re-used to test a tuple for a
+     * different target relation.
      */
     ExecClearTuple(testslot);
+    epqstate->relsubs_done[rti - 1] = true;

     return slot;
 }
@@ -2532,18 +2540,26 @@ EvalPlanQual(EPQState *epqstate, Relation relation,
  * EvalPlanQualInit -- initialize during creation of a plan state node
  * that might need to invoke EPQ processing.
  *
+ * If the caller intends to use EvalPlanQual(), resultRelations should be
+ * a list of RT indexes of potential target relations for EvalPlanQual(),
+ * and we will arrange that the other listed relations don't return any
+ * tuple during an EvalPlanQual() call.  Otherwise resultRelations
+ * should be NIL.
+ *
  * Note: subplan/auxrowmarks can be NULL/NIL if they will be set later
  * with EvalPlanQualSetPlan.
  */
 void
 EvalPlanQualInit(EPQState *epqstate, EState *parentestate,
-                 Plan *subplan, List *auxrowmarks, int epqParam)
+                 Plan *subplan, List *auxrowmarks,
+                 int epqParam, List *resultRelations)
 {
     Index        rtsize = parentestate->es_range_table_size;

     /* initialize data not changing over EPQState's lifetime */
     epqstate->parentestate = parentestate;
     epqstate->epqParam = epqParam;
+    epqstate->resultRelations = resultRelations;

     /*
      * Allocate space to reference a slot for each potential rti - do so now
@@ -2760,10 +2776,20 @@ EvalPlanQualBegin(EPQState *epqstate)
         /*
          * We already have a suitable child EPQ tree, so just reset it.
          */
-        Index        rtsize = parentestate->es_range_table_size;
         PlanState  *rcplanstate = epqstate->recheckplanstate;

-        MemSet(epqstate->relsubs_done, 0, rtsize * sizeof(bool));
+        /*
+         * We reset the relsubs_done[] flags only when the caller of
+         * EvalPlanQualInit provided no result relations.  Otherwise, we keep
+         * the current state (which should have all result relations marked as
+         * "done", and others not).
+         */
+        if (epqstate->resultRelations == NIL)
+        {
+            Index        rtsize = parentestate->es_range_table_size;
+
+            memset(epqstate->relsubs_done, 0, rtsize * sizeof(bool));
+        }

         /* Recopy current values of parent parameters */
         if (parentestate->es_plannedstmt->paramExecTypes != NIL)
@@ -2931,11 +2957,20 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
     }

     /*
-     * Initialize per-relation EPQ tuple states to not-fetched.
+     * Initialize per-relation EPQ tuple states.  Result relations, if any,
+     * get marked as already-fetched; others as not-fetched.
      */
     epqstate->relsubs_done = (bool *)
         palloc0(rtsize * sizeof(bool));

+    foreach(l, epqstate->resultRelations)
+    {
+        int            rtindex = lfirst_int(l);
+
+        Assert(rtindex > 0 && rtindex <= rtsize);
+        epqstate->relsubs_done[rtindex - 1] = true;
+    }
+
     /*
      * Initialize the private state information for all the nodes in the part
      * of the plan tree we need to run.  This opens files, allocates storage
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index cf1871b0f5..ea8c5ab149 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -69,13 +69,12 @@ ExecScanFetch(ScanState *node,
         else if (epqstate->relsubs_done[scanrelid - 1])
         {
             /*
-             * Return empty slot, as we already performed an EPQ substitution
-             * for this relation.
+             * Return empty slot, as either there is no EPQ tuple for this rel
+             * or we already returned it.
              */

             TupleTableSlot *slot = node->ss_ScanTupleSlot;

-            /* Return empty slot, as we already returned a tuple */
             return ExecClearTuple(slot);
         }
         else if (epqstate->relsubs_slot[scanrelid - 1] != NULL)
@@ -88,7 +87,7 @@ ExecScanFetch(ScanState *node,

             Assert(epqstate->relsubs_rowmark[scanrelid - 1] == NULL);

-            /* Mark to remember that we shouldn't return more */
+            /* Mark to remember that we shouldn't return it again */
             epqstate->relsubs_done[scanrelid - 1] = true;

             /* Return empty slot if we haven't got a test tuple */
@@ -298,45 +297,9 @@ ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno)
 void
 ExecScanReScan(ScanState *node)
 {
-    EState       *estate = node->ps.state;
-
     /*
      * We must clear the scan tuple so that observers (e.g., execCurrent.c)
      * can tell that this plan node is not positioned on a tuple.
      */
     ExecClearTuple(node->ss_ScanTupleSlot);
-
-    /* Rescan EvalPlanQual tuple if we're inside an EvalPlanQual recheck */
-    if (estate->es_epq_active != NULL)
-    {
-        EPQState   *epqstate = estate->es_epq_active;
-        Index        scanrelid = ((Scan *) node->ps.plan)->scanrelid;
-
-        if (scanrelid > 0)
-            epqstate->relsubs_done[scanrelid - 1] = false;
-        else
-        {
-            Bitmapset  *relids;
-            int            rtindex = -1;
-
-            /*
-             * If an FDW or custom scan provider has replaced the join with a
-             * scan, there are multiple RTIs; reset the epqScanDone flag for
-             * all of them.
-             */
-            if (IsA(node->ps.plan, ForeignScan))
-                relids = ((ForeignScan *) node->ps.plan)->fs_base_relids;
-            else if (IsA(node->ps.plan, CustomScan))
-                relids = ((CustomScan *) node->ps.plan)->custom_relids;
-            else
-                elog(ERROR, "unexpected scan node: %d",
-                     (int) nodeTag(node->ps.plan));
-
-            while ((rtindex = bms_next_member(relids, rtindex)) >= 0)
-            {
-                Assert(rtindex > 0);
-                epqstate->relsubs_done[rtindex - 1] = false;
-            }
-        }
-    }
 }
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 407414fc0c..e459971d32 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -108,7 +108,6 @@ lnext:
                 /* this child is inactive right now */
                 erm->ermActive = false;
                 ItemPointerSetInvalid(&(erm->curCtid));
-                ExecClearTuple(markSlot);
                 continue;
             }
         }
@@ -370,7 +369,7 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)

     /* Now we have the info needed to set up EPQ state */
     EvalPlanQualInit(&lrstate->lr_epqstate, estate,
-                     outerPlan, epq_arowmarks, node->epqParam);
+                     outerPlan, epq_arowmarks, node->epqParam, NIL);

     return lrstate;
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index dc1a2ec551..7f5002527f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -3985,7 +3985,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
     }

     /* set up epqstate with dummy subplan data for the moment */
-    EvalPlanQualInit(&mtstate->mt_epqstate, estate, NULL, NIL, node->epqParam);
+    EvalPlanQualInit(&mtstate->mt_epqstate, estate, NULL, NIL,
+                     node->epqParam, node->resultRelations);
     mtstate->fireBSTriggers = true;

     /*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 879309b316..4b67098814 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2674,7 +2674,7 @@ apply_handle_update_internal(ApplyExecutionData *edata,
     bool        found;
     MemoryContext oldctx;

-    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1);
+    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1, NIL);
     ExecOpenIndices(relinfo, false);

     found = FindReplTupleInLocalRel(estate, localrel,
@@ -2827,7 +2827,7 @@ apply_handle_delete_internal(ApplyExecutionData *edata,
     TupleTableSlot *localslot;
     bool        found;

-    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1);
+    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1, NIL);
     ExecOpenIndices(relinfo, false);

     found = FindReplTupleInLocalRel(estate, localrel, remoterel, localindexoid,
@@ -3054,7 +3054,7 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
                      */
                     EPQState    epqstate;

-                    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1);
+                    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1, NIL);
                     ExecOpenIndices(partrelinfo, false);

                     EvalPlanQualSetSlot(&epqstate, remoteslot_part);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index f9e6bf3d4a..ac02247947 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -233,7 +233,8 @@ extern ExecAuxRowMark *ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist);
 extern TupleTableSlot *EvalPlanQual(EPQState *epqstate, Relation relation,
                                     Index rti, TupleTableSlot *inputslot);
 extern void EvalPlanQualInit(EPQState *epqstate, EState *parentestate,
-                             Plan *subplan, List *auxrowmarks, int epqParam);
+                             Plan *subplan, List *auxrowmarks,
+                             int epqParam, List *resultRelations);
 extern void EvalPlanQualSetPlan(EPQState *epqstate,
                                 Plan *subplan, List *auxrowmarks);
 extern TupleTableSlot *EvalPlanQualSlot(EPQState *epqstate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 61b3517906..155186ac1f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1169,15 +1169,16 @@ typedef struct PlanState
  */
 typedef struct EPQState
 {
-    /* Initialized at EvalPlanQualInit() time: */
-
+    /* These are initialized by EvalPlanQualInit() and do not change later: */
     EState       *parentestate;    /* main query's EState */
     int            epqParam;        /* ID of Param to force scan node re-eval */
+    List       *resultRelations;    /* integer list of RT indexes, or NIL */

     /*
-     * Tuples to be substituted by scan nodes. They need to set up, before
-     * calling EvalPlanQual()/EvalPlanQualNext(), into the slot returned by
-     * EvalPlanQualSlot(scanrelid). The array is indexed by scanrelid - 1.
+     * relsubs_slot[scanrelid - 1] holds the EPQ test tuple to be returned by
+     * the scan node for the scanrelid'th RT index, in place of performing an
+     * actual table scan.  Callers should use EvalPlanQualSlot() to fetch
+     * these slots.
      */
     List       *tuple_table;    /* tuple table for relsubs_slot */
     TupleTableSlot **relsubs_slot;
@@ -1211,8 +1212,8 @@ typedef struct EPQState
     ExecAuxRowMark **relsubs_rowmark;

     /*
-     * True if a relation's EPQ tuple has been fetched for relation, indexed
-     * by scanrelid - 1.
+     * relsubs_done[scanrelid - 1] is true if there is no EPQ tuple for this
+     * target relation or it has already been fetched in the current EPQ run.
      */
     bool       *relsubs_done;

diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out
index 6af262ec5d..f2a50f46a3 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -842,6 +842,90 @@ step c1: COMMIT;
 step writep3b: <... completed>
 step c2: COMMIT;

+starting permutation: writep4a writep4b c1 c2 readp
+step writep4a: UPDATE p SET c = 4 WHERE c = 0;
+step writep4b: UPDATE p SET b = -4 WHERE c = 0; <waiting ...>
+step c1: COMMIT;
+step writep4b: <... completed>
+step c2: COMMIT;
+step readp: SELECT tableoid::regclass, ctid, * FROM p;
+tableoid|ctid  |a|b|c
+--------+------+-+-+-
+c1      |(0,2) |0|0|1
+c1      |(0,3) |0|0|2
+c1      |(0,5) |0|1|1
+c1      |(0,6) |0|1|2
+c1      |(0,8) |0|2|1
+c1      |(0,9) |0|2|2
+c1      |(0,11)|0|0|4
+c1      |(0,12)|0|1|4
+c1      |(0,13)|0|2|4
+c1      |(0,14)|0|3|4
+c2      |(0,2) |1|0|1
+c2      |(0,3) |1|0|2
+c2      |(0,5) |1|1|1
+c2      |(0,6) |1|1|2
+c2      |(0,8) |1|2|1
+c2      |(0,9) |1|2|2
+c2      |(0,11)|1|0|4
+c2      |(0,12)|1|1|4
+c2      |(0,13)|1|2|4
+c2      |(0,14)|1|3|4
+c3      |(0,2) |2|0|1
+c3      |(0,3) |2|0|2
+c3      |(0,5) |2|1|1
+c3      |(0,6) |2|1|2
+c3      |(0,8) |2|2|1
+c3      |(0,9) |2|2|2
+c3      |(0,11)|2|0|4
+c3      |(0,12)|2|1|4
+c3      |(0,13)|2|2|4
+c3      |(0,14)|2|3|4
+(30 rows)
+
+
+starting permutation: writep4a deletep4 c1 c2 readp
+step writep4a: UPDATE p SET c = 4 WHERE c = 0;
+step deletep4: DELETE FROM p WHERE c = 0; <waiting ...>
+step c1: COMMIT;
+step deletep4: <... completed>
+step c2: COMMIT;
+step readp: SELECT tableoid::regclass, ctid, * FROM p;
+tableoid|ctid  |a|b|c
+--------+------+-+-+-
+c1      |(0,2) |0|0|1
+c1      |(0,3) |0|0|2
+c1      |(0,5) |0|1|1
+c1      |(0,6) |0|1|2
+c1      |(0,8) |0|2|1
+c1      |(0,9) |0|2|2
+c1      |(0,11)|0|0|4
+c1      |(0,12)|0|1|4
+c1      |(0,13)|0|2|4
+c1      |(0,14)|0|3|4
+c2      |(0,2) |1|0|1
+c2      |(0,3) |1|0|2
+c2      |(0,5) |1|1|1
+c2      |(0,6) |1|1|2
+c2      |(0,8) |1|2|1
+c2      |(0,9) |1|2|2
+c2      |(0,11)|1|0|4
+c2      |(0,12)|1|1|4
+c2      |(0,13)|1|2|4
+c2      |(0,14)|1|3|4
+c3      |(0,2) |2|0|1
+c3      |(0,3) |2|0|2
+c3      |(0,5) |2|1|1
+c3      |(0,6) |2|1|2
+c3      |(0,8) |2|2|1
+c3      |(0,9) |2|2|2
+c3      |(0,11)|2|0|4
+c3      |(0,12)|2|1|4
+c3      |(0,13)|2|2|4
+c3      |(0,14)|2|3|4
+(30 rows)
+
+
 starting permutation: wx2 partiallock c2 c1 read
 step wx2: UPDATE accounts SET balance = balance + 450 WHERE accountid = 'checking' RETURNING balance;
 balance
diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec
index 768f7098b9..cb56578208 100644
--- a/src/test/isolation/specs/eval-plan-qual.spec
+++ b/src/test/isolation/specs/eval-plan-qual.spec
@@ -102,11 +102,15 @@ step upsert1    {
 # when the first updated tuple was in a non-first child table.
 # writep2/returningp1 tests a memory allocation issue
 # writep3a/writep3b tests updates touching more than one table
+# writep4a/writep4b tests a case where matches in another table confused EPQ
+# writep4a/deletep4 tests the same case in the DELETE path

+step readp        { SELECT tableoid::regclass, ctid, * FROM p; }
 step readp1        { SELECT tableoid::regclass, ctid, * FROM p WHERE b IN (0, 1) AND c = 0 FOR UPDATE; }
 step writep1    { UPDATE p SET b = -1 WHERE a = 1 AND b = 1 AND c = 0; }
 step writep2    { UPDATE p SET b = -b WHERE a = 1 AND c = 0; }
 step writep3a    { UPDATE p SET b = -b WHERE c = 0; }
+step writep4a    { UPDATE p SET c = 4 WHERE c = 0; }
 step c1        { COMMIT; }
 step r1        { ROLLBACK; }

@@ -210,6 +214,8 @@ step returningp1 {
       SELECT * FROM u;
 }
 step writep3b    { UPDATE p SET b = -b WHERE c = 0; }
+step writep4b    { UPDATE p SET b = -4 WHERE c = 0; }
+step deletep4    { DELETE FROM p WHERE c = 0; }
 step readforss    {
     SELECT ta.id AS ta_id, ta.value AS ta_value,
         (SELECT ROW(tb.id, tb.value)
@@ -347,6 +353,8 @@ permutation upsert1 upsert2 c1 c2 read
 permutation readp1 writep1 readp2 c1 c2
 permutation writep2 returningp1 c1 c2
 permutation writep3a writep3b c1 c2
+permutation writep4a writep4b c1 c2 readp
+permutation writep4a deletep4 c1 c2 readp
 permutation wx2 partiallock c2 c1 read
 permutation wx2 lockwithvalues c2 c1 read
 permutation wx2_ext partiallock_ext c2 c1 read_ext

Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Tom Lane
Date:
I wrote:
> Debian Code Search doesn't know of any outside code touching
> relsubs_done, so I think we are safe in dropping that code in
> ExecScanReScan.  It seems quite pointless anyway considering
> that up to now, EvalPlanQualBegin has always zeroed the whole
> array.

Oh, belay that.  What I'd forgotten is that it's possible that
the target relation is on the inside of a nestloop, meaning that
we might need to fetch the EPQ substitute tuple more than once.
So there are three possible states: blocked (never return a
tuple), ready to return a tuple, and done returning a tuple
for this scan.  ExecScanReScan needs to reset "done" to "ready",
but not touch the "blocked" state.  The attached v2 mechanizes
that using two bool arrays.

What I'm thinking about doing to back-patch this is to replace
one of the pointer fields in EPQState with a pointer to a
subsidiary palloc'd structure, where we can put the new fields
along with the cannibalized old one.  We've done something
similar before, and it seems a lot safer than having basically
different logic in v16 than earlier branches.

            regards, tom lane

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0186be452c..c76fdf59ec 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2469,7 +2469,7 @@ ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist)
  *    relation - table containing tuple
  *    rti - rangetable index of table containing tuple
  *    inputslot - tuple for processing - this can be the slot from
- *        EvalPlanQualSlot(), for the increased efficiency.
+ *        EvalPlanQualSlot() for this rel, for increased efficiency.
  *
  * This tests whether the tuple in inputslot still matches the relevant
  * quals. For that result to be useful, typically the input tuple has to be
@@ -2503,6 +2503,14 @@ EvalPlanQual(EPQState *epqstate, Relation relation,
     if (testslot != inputslot)
         ExecCopySlot(testslot, inputslot);

+    /*
+     * Mark that an EPQ tuple is available for this relation.  (If there is
+     * more than one result relation, the others remain marked as having no
+     * tuple available.)
+     */
+    epqstate->relsubs_done[rti - 1] = false;
+    epqstate->relsubs_blocked[rti - 1] = false;
+
     /*
      * Run the EPQ query.  We assume it will return at most one tuple.
      */
@@ -2519,11 +2527,12 @@ EvalPlanQual(EPQState *epqstate, Relation relation,
         ExecMaterializeSlot(slot);

     /*
-     * Clear out the test tuple.  This is needed in case the EPQ query is
-     * re-used to test a tuple for a different relation.  (Not clear that can
-     * really happen, but let's be safe.)
+     * Clear out the test tuple, and mark that no tuple is available here.
+     * This is needed in case the EPQ state is re-used to test a tuple for a
+     * different target relation.
      */
     ExecClearTuple(testslot);
+    epqstate->relsubs_blocked[rti - 1] = true;

     return slot;
 }
@@ -2532,18 +2541,26 @@ EvalPlanQual(EPQState *epqstate, Relation relation,
  * EvalPlanQualInit -- initialize during creation of a plan state node
  * that might need to invoke EPQ processing.
  *
+ * If the caller intends to use EvalPlanQual(), resultRelations should be
+ * a list of RT indexes of potential target relations for EvalPlanQual(),
+ * and we will arrange that the other listed relations don't return any
+ * tuple during an EvalPlanQual() call.  Otherwise resultRelations
+ * should be NIL.
+ *
  * Note: subplan/auxrowmarks can be NULL/NIL if they will be set later
  * with EvalPlanQualSetPlan.
  */
 void
 EvalPlanQualInit(EPQState *epqstate, EState *parentestate,
-                 Plan *subplan, List *auxrowmarks, int epqParam)
+                 Plan *subplan, List *auxrowmarks,
+                 int epqParam, List *resultRelations)
 {
     Index        rtsize = parentestate->es_range_table_size;

     /* initialize data not changing over EPQState's lifetime */
     epqstate->parentestate = parentestate;
     epqstate->epqParam = epqParam;
+    epqstate->resultRelations = resultRelations;

     /*
      * Allocate space to reference a slot for each potential rti - do so now
@@ -2566,6 +2583,7 @@ EvalPlanQualInit(EPQState *epqstate, EState *parentestate,
     epqstate->recheckplanstate = NULL;
     epqstate->relsubs_rowmark = NULL;
     epqstate->relsubs_done = NULL;
+    epqstate->relsubs_blocked = NULL;
 }

 /*
@@ -2763,7 +2781,13 @@ EvalPlanQualBegin(EPQState *epqstate)
         Index        rtsize = parentestate->es_range_table_size;
         PlanState  *rcplanstate = epqstate->recheckplanstate;

-        MemSet(epqstate->relsubs_done, 0, rtsize * sizeof(bool));
+        /*
+         * Reset the relsubs_done[] flags to equal relsubs_blocked[], so that
+         * the EPQ run will never attempt to fetch tuples from blocked target
+         * relations.
+         */
+        memcpy(epqstate->relsubs_done, epqstate->relsubs_blocked,
+               rtsize * sizeof(bool));

         /* Recopy current values of parent parameters */
         if (parentestate->es_plannedstmt->paramExecTypes != NIL)
@@ -2931,10 +2955,22 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
     }

     /*
-     * Initialize per-relation EPQ tuple states to not-fetched.
+     * Initialize per-relation EPQ tuple states.  Result relations, if any,
+     * get marked as blocked; others as not-fetched.
      */
-    epqstate->relsubs_done = (bool *)
-        palloc0(rtsize * sizeof(bool));
+    epqstate->relsubs_done = palloc_array(bool, rtsize);
+    epqstate->relsubs_blocked = palloc0_array(bool, rtsize);
+
+    foreach(l, epqstate->resultRelations)
+    {
+        int            rtindex = lfirst_int(l);
+
+        Assert(rtindex > 0 && rtindex <= rtsize);
+        epqstate->relsubs_blocked[rtindex - 1] = true;
+    }
+
+    memcpy(epqstate->relsubs_done, epqstate->relsubs_blocked,
+           rtsize * sizeof(bool));

     /*
      * Initialize the private state information for all the nodes in the part
@@ -3010,4 +3046,5 @@ EvalPlanQualEnd(EPQState *epqstate)
     epqstate->recheckplanstate = NULL;
     epqstate->relsubs_rowmark = NULL;
     epqstate->relsubs_done = NULL;
+    epqstate->relsubs_blocked = NULL;
 }
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index cf1871b0f5..a47c8f5f71 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -69,13 +69,12 @@ ExecScanFetch(ScanState *node,
         else if (epqstate->relsubs_done[scanrelid - 1])
         {
             /*
-             * Return empty slot, as we already performed an EPQ substitution
-             * for this relation.
+             * Return empty slot, as either there is no EPQ tuple for this rel
+             * or we already returned it.
              */

             TupleTableSlot *slot = node->ss_ScanTupleSlot;

-            /* Return empty slot, as we already returned a tuple */
             return ExecClearTuple(slot);
         }
         else if (epqstate->relsubs_slot[scanrelid - 1] != NULL)
@@ -88,7 +87,7 @@ ExecScanFetch(ScanState *node,

             Assert(epqstate->relsubs_rowmark[scanrelid - 1] == NULL);

-            /* Mark to remember that we shouldn't return more */
+            /* Mark to remember that we shouldn't return it again */
             epqstate->relsubs_done[scanrelid - 1] = true;

             /* Return empty slot if we haven't got a test tuple */
@@ -306,14 +305,18 @@ ExecScanReScan(ScanState *node)
      */
     ExecClearTuple(node->ss_ScanTupleSlot);

-    /* Rescan EvalPlanQual tuple if we're inside an EvalPlanQual recheck */
+    /*
+     * Rescan EvalPlanQual tuple(s) if we're inside an EvalPlanQual recheck.
+     * But don't lose the "blocked" status of blocked target relations.
+     */
     if (estate->es_epq_active != NULL)
     {
         EPQState   *epqstate = estate->es_epq_active;
         Index        scanrelid = ((Scan *) node->ps.plan)->scanrelid;

         if (scanrelid > 0)
-            epqstate->relsubs_done[scanrelid - 1] = false;
+            epqstate->relsubs_done[scanrelid - 1] =
+                epqstate->relsubs_blocked[scanrelid - 1];
         else
         {
             Bitmapset  *relids;
@@ -335,7 +338,8 @@ ExecScanReScan(ScanState *node)
             while ((rtindex = bms_next_member(relids, rtindex)) >= 0)
             {
                 Assert(rtindex > 0);
-                epqstate->relsubs_done[rtindex - 1] = false;
+                epqstate->relsubs_done[rtindex - 1] =
+                    epqstate->relsubs_blocked[rtindex - 1];
             }
         }
     }
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 407414fc0c..e459971d32 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -108,7 +108,6 @@ lnext:
                 /* this child is inactive right now */
                 erm->ermActive = false;
                 ItemPointerSetInvalid(&(erm->curCtid));
-                ExecClearTuple(markSlot);
                 continue;
             }
         }
@@ -370,7 +369,7 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)

     /* Now we have the info needed to set up EPQ state */
     EvalPlanQualInit(&lrstate->lr_epqstate, estate,
-                     outerPlan, epq_arowmarks, node->epqParam);
+                     outerPlan, epq_arowmarks, node->epqParam, NIL);

     return lrstate;
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index dc1a2ec551..7f5002527f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -3985,7 +3985,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
     }

     /* set up epqstate with dummy subplan data for the moment */
-    EvalPlanQualInit(&mtstate->mt_epqstate, estate, NULL, NIL, node->epqParam);
+    EvalPlanQualInit(&mtstate->mt_epqstate, estate, NULL, NIL,
+                     node->epqParam, node->resultRelations);
     mtstate->fireBSTriggers = true;

     /*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 879309b316..4b67098814 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2674,7 +2674,7 @@ apply_handle_update_internal(ApplyExecutionData *edata,
     bool        found;
     MemoryContext oldctx;

-    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1);
+    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1, NIL);
     ExecOpenIndices(relinfo, false);

     found = FindReplTupleInLocalRel(estate, localrel,
@@ -2827,7 +2827,7 @@ apply_handle_delete_internal(ApplyExecutionData *edata,
     TupleTableSlot *localslot;
     bool        found;

-    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1);
+    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1, NIL);
     ExecOpenIndices(relinfo, false);

     found = FindReplTupleInLocalRel(estate, localrel, remoterel, localindexoid,
@@ -3054,7 +3054,7 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
                      */
                     EPQState    epqstate;

-                    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1);
+                    EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1, NIL);
                     ExecOpenIndices(partrelinfo, false);

                     EvalPlanQualSetSlot(&epqstate, remoteslot_part);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index f9e6bf3d4a..ac02247947 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -233,7 +233,8 @@ extern ExecAuxRowMark *ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist);
 extern TupleTableSlot *EvalPlanQual(EPQState *epqstate, Relation relation,
                                     Index rti, TupleTableSlot *inputslot);
 extern void EvalPlanQualInit(EPQState *epqstate, EState *parentestate,
-                             Plan *subplan, List *auxrowmarks, int epqParam);
+                             Plan *subplan, List *auxrowmarks,
+                             int epqParam, List *resultRelations);
 extern void EvalPlanQualSetPlan(EPQState *epqstate,
                                 Plan *subplan, List *auxrowmarks);
 extern TupleTableSlot *EvalPlanQualSlot(EPQState *epqstate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 61b3517906..68ddcd6794 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1169,15 +1169,16 @@ typedef struct PlanState
  */
 typedef struct EPQState
 {
-    /* Initialized at EvalPlanQualInit() time: */
-
+    /* These are initialized by EvalPlanQualInit() and do not change later: */
     EState       *parentestate;    /* main query's EState */
     int            epqParam;        /* ID of Param to force scan node re-eval */
+    List       *resultRelations;    /* integer list of RT indexes, or NIL */

     /*
-     * Tuples to be substituted by scan nodes. They need to set up, before
-     * calling EvalPlanQual()/EvalPlanQualNext(), into the slot returned by
-     * EvalPlanQualSlot(scanrelid). The array is indexed by scanrelid - 1.
+     * relsubs_slot[scanrelid - 1] holds the EPQ test tuple to be returned by
+     * the scan node for the scanrelid'th RT index, in place of performing an
+     * actual table scan.  Callers should use EvalPlanQualSlot() to fetch
+     * these slots.
      */
     List       *tuple_table;    /* tuple table for relsubs_slot */
     TupleTableSlot **relsubs_slot;
@@ -1211,11 +1212,17 @@ typedef struct EPQState
     ExecAuxRowMark **relsubs_rowmark;

     /*
-     * True if a relation's EPQ tuple has been fetched for relation, indexed
-     * by scanrelid - 1.
+     * relsubs_done[scanrelid - 1] is true if there is no EPQ tuple for this
+     * target relation or it has already been fetched in the current EPQ run.
      */
     bool       *relsubs_done;

+    /*
+     * relsubs_blocked[scanrelid - 1] is true if there is no EPQ tuple for
+     * this target relation.
+     */
+    bool       *relsubs_blocked;
+
     PlanState  *recheckplanstate;    /* EPQ specific exec nodes, for ->plan */
 } EPQState;

diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out
index 6af262ec5d..f2a50f46a3 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -842,6 +842,90 @@ step c1: COMMIT;
 step writep3b: <... completed>
 step c2: COMMIT;

+starting permutation: writep4a writep4b c1 c2 readp
+step writep4a: UPDATE p SET c = 4 WHERE c = 0;
+step writep4b: UPDATE p SET b = -4 WHERE c = 0; <waiting ...>
+step c1: COMMIT;
+step writep4b: <... completed>
+step c2: COMMIT;
+step readp: SELECT tableoid::regclass, ctid, * FROM p;
+tableoid|ctid  |a|b|c
+--------+------+-+-+-
+c1      |(0,2) |0|0|1
+c1      |(0,3) |0|0|2
+c1      |(0,5) |0|1|1
+c1      |(0,6) |0|1|2
+c1      |(0,8) |0|2|1
+c1      |(0,9) |0|2|2
+c1      |(0,11)|0|0|4
+c1      |(0,12)|0|1|4
+c1      |(0,13)|0|2|4
+c1      |(0,14)|0|3|4
+c2      |(0,2) |1|0|1
+c2      |(0,3) |1|0|2
+c2      |(0,5) |1|1|1
+c2      |(0,6) |1|1|2
+c2      |(0,8) |1|2|1
+c2      |(0,9) |1|2|2
+c2      |(0,11)|1|0|4
+c2      |(0,12)|1|1|4
+c2      |(0,13)|1|2|4
+c2      |(0,14)|1|3|4
+c3      |(0,2) |2|0|1
+c3      |(0,3) |2|0|2
+c3      |(0,5) |2|1|1
+c3      |(0,6) |2|1|2
+c3      |(0,8) |2|2|1
+c3      |(0,9) |2|2|2
+c3      |(0,11)|2|0|4
+c3      |(0,12)|2|1|4
+c3      |(0,13)|2|2|4
+c3      |(0,14)|2|3|4
+(30 rows)
+
+
+starting permutation: writep4a deletep4 c1 c2 readp
+step writep4a: UPDATE p SET c = 4 WHERE c = 0;
+step deletep4: DELETE FROM p WHERE c = 0; <waiting ...>
+step c1: COMMIT;
+step deletep4: <... completed>
+step c2: COMMIT;
+step readp: SELECT tableoid::regclass, ctid, * FROM p;
+tableoid|ctid  |a|b|c
+--------+------+-+-+-
+c1      |(0,2) |0|0|1
+c1      |(0,3) |0|0|2
+c1      |(0,5) |0|1|1
+c1      |(0,6) |0|1|2
+c1      |(0,8) |0|2|1
+c1      |(0,9) |0|2|2
+c1      |(0,11)|0|0|4
+c1      |(0,12)|0|1|4
+c1      |(0,13)|0|2|4
+c1      |(0,14)|0|3|4
+c2      |(0,2) |1|0|1
+c2      |(0,3) |1|0|2
+c2      |(0,5) |1|1|1
+c2      |(0,6) |1|1|2
+c2      |(0,8) |1|2|1
+c2      |(0,9) |1|2|2
+c2      |(0,11)|1|0|4
+c2      |(0,12)|1|1|4
+c2      |(0,13)|1|2|4
+c2      |(0,14)|1|3|4
+c3      |(0,2) |2|0|1
+c3      |(0,3) |2|0|2
+c3      |(0,5) |2|1|1
+c3      |(0,6) |2|1|2
+c3      |(0,8) |2|2|1
+c3      |(0,9) |2|2|2
+c3      |(0,11)|2|0|4
+c3      |(0,12)|2|1|4
+c3      |(0,13)|2|2|4
+c3      |(0,14)|2|3|4
+(30 rows)
+
+
 starting permutation: wx2 partiallock c2 c1 read
 step wx2: UPDATE accounts SET balance = balance + 450 WHERE accountid = 'checking' RETURNING balance;
 balance
diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec
index 768f7098b9..cb56578208 100644
--- a/src/test/isolation/specs/eval-plan-qual.spec
+++ b/src/test/isolation/specs/eval-plan-qual.spec
@@ -102,11 +102,15 @@ step upsert1    {
 # when the first updated tuple was in a non-first child table.
 # writep2/returningp1 tests a memory allocation issue
 # writep3a/writep3b tests updates touching more than one table
+# writep4a/writep4b tests a case where matches in another table confused EPQ
+# writep4a/deletep4 tests the same case in the DELETE path

+step readp        { SELECT tableoid::regclass, ctid, * FROM p; }
 step readp1        { SELECT tableoid::regclass, ctid, * FROM p WHERE b IN (0, 1) AND c = 0 FOR UPDATE; }
 step writep1    { UPDATE p SET b = -1 WHERE a = 1 AND b = 1 AND c = 0; }
 step writep2    { UPDATE p SET b = -b WHERE a = 1 AND c = 0; }
 step writep3a    { UPDATE p SET b = -b WHERE c = 0; }
+step writep4a    { UPDATE p SET c = 4 WHERE c = 0; }
 step c1        { COMMIT; }
 step r1        { ROLLBACK; }

@@ -210,6 +214,8 @@ step returningp1 {
       SELECT * FROM u;
 }
 step writep3b    { UPDATE p SET b = -b WHERE c = 0; }
+step writep4b    { UPDATE p SET b = -4 WHERE c = 0; }
+step deletep4    { DELETE FROM p WHERE c = 0; }
 step readforss    {
     SELECT ta.id AS ta_id, ta.value AS ta_value,
         (SELECT ROW(tb.id, tb.value)
@@ -347,6 +353,8 @@ permutation upsert1 upsert2 c1 c2 read
 permutation readp1 writep1 readp2 c1 c2
 permutation writep2 returningp1 c1 c2
 permutation writep3a writep3b c1 c2
+permutation writep4a writep4b c1 c2 readp
+permutation writep4a deletep4 c1 c2 readp
 permutation wx2 partiallock c2 c1 read
 permutation wx2 lockwithvalues c2 c1 read
 permutation wx2_ext partiallock_ext c2 c1 read_ext

Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Aleksander Alekseev
Date:
Hi,

> The attached v2 mechanizes  that using two bool arrays.

I tested the patch on several combinations of operating systems
(LInux, MacOS) and architectures (x64, RISC-V) available to me at the
moment, with both Meson and Autotools. Also I made sure
eval-plan-qual.spec fails when the C code is untouched.

The patch passed all the checks I could come up with.

--
Best regards,
Aleksander Alekseev



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Amit Langote
Date:
Thanks for the patch.

On Fri, May 19, 2023 at 8:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > Debian Code Search doesn't know of any outside code touching
> > relsubs_done, so I think we are safe in dropping that code in
> > ExecScanReScan.  It seems quite pointless anyway considering
> > that up to now, EvalPlanQualBegin has always zeroed the whole
> > array.
>
> Oh, belay that.  What I'd forgotten is that it's possible that
> the target relation is on the inside of a nestloop, meaning that
> we might need to fetch the EPQ substitute tuple more than once.
> So there are three possible states: blocked (never return a
> tuple), ready to return a tuple, and done returning a tuple
> for this scan.  ExecScanReScan needs to reset "done" to "ready",
> but not touch the "blocked" state.  The attached v2 mechanizes
> that using two bool arrays.

Aha, that's clever.  So ExecScanReScan() would only reset the
relsubs_done[] entry for the currently active ("unblocked") target
relation, because that would be the only one "unblocked" during a
given EvalPlanQual() invocation.

+    * Initialize per-relation EPQ tuple states.  Result relations, if any,
+    * get marked as blocked; others as not-fetched.

Would it be helpful to clarify that "blocked" means blocked for a
given EvalPlanQual() cycle?

+   /*
+    * relsubs_blocked[scanrelid - 1] is true if there is no EPQ tuple for
+    * this target relation.
+    */
+   bool       *relsubs_blocked;

Similarly, maybe say "no EPQ tuple for this target relation in a given
EvalPlanQual() invocation" here?

BTW, I didn't quite understand why EPQ involving resultRelations must
behave in this new way but not the EPQ during LockRows?

> What I'm thinking about doing to back-patch this is to replace
> one of the pointer fields in EPQState with a pointer to a
> subsidiary palloc'd structure, where we can put the new fields
> along with the cannibalized old one.  We've done something
> similar before, and it seems a lot safer than having basically
> different logic in v16 than earlier branches.

+1.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Fri, May 19, 2023 at 8:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> +    * Initialize per-relation EPQ tuple states.  Result relations, if any,
> +    * get marked as blocked; others as not-fetched.

> Would it be helpful to clarify that "blocked" means blocked for a
> given EvalPlanQual() cycle?

Probably best to put that with the data structure's comments.  I changed
those to look like

    /*
     * relsubs_done[scanrelid - 1] is true if there is no EPQ tuple for this
     * target relation or it has already been fetched in the current scan of
     * this target relation within the current EvalPlanQual test.
     */
    bool       *relsubs_done;

    /*
     * relsubs_blocked[scanrelid - 1] is true if there is no EPQ tuple for
     * this target relation during the current EvalPlanQual test.  We keep
     * these flags set for all relids listed in resultRelations, but
     * transiently clear the one for the relation whose tuple is actually
     * passed to EvalPlanQual().
     */
    bool       *relsubs_blocked;


> BTW, I didn't quite understand why EPQ involving resultRelations must
> behave in this new way but not the EPQ during LockRows?

LockRows doesn't have a bug: it always fills all the EPQ tuple slots
it's responsible for, and it doesn't use EvalPlanQual() anyway.
In the name of simplicity I kept the behavior exactly the same for
callers other than nodeModifyTable.

Perhaps replication/logical/worker.c could use a closer look here.
It's not entirely clear to me that the EPQ state it sets up is ever
used; but if it is I think it is okay as I have it here, because it
looks like those invocations always have just one result relation in
the plan, so there aren't any "extra" result rels that need to be
blocked.

            regards, tom lane



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Fri, May 19, 2023 at 8:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What I'm thinking about doing to back-patch this is to replace
>> one of the pointer fields in EPQState with a pointer to a
>> subsidiary palloc'd structure, where we can put the new fields
>> along with the cannibalized old one.  We've done something
>> similar before, and it seems a lot safer than having basically
>> different logic in v16 than earlier branches.

> +1.

Done that way.  I chose to replace the tuple_table field, because
it was in a convenient spot and it seemed like the field least
likely to have any outside code referencing it.

            regards, tom lane



Re: The documentation for READ COMMITTED may be incomplete or wrong

From
Aleksander Alekseev
Date:
Hi Tom,

> Done that way.  I chose to replace the tuple_table field, because
> it was in a convenient spot and it seemed like the field least
> likely to have any outside code referencing it.

Many thanks!

If it's not too much trouble could you please recommend good entry
points to learn more about the internals of this part of the system
and accompanying edge cases? Perhaps there is an experiment or two an
extension author can do in order to lower the entry threshold and/or
known bugs, limitations or wanted features one could start with?

-- 
Best regards,
Aleksander Alekseev