Thread: Converting SetOp to read its two inputs separately

From: Tom Lane
Date: 14 November 2024, 05:00:11

Here are the beginnings of a patchset to do what was discussed at [1],
namely change the SetOp node type to read its inputs as outerPlan and
innerPlan, rather than appending them together with a flag column to
show which rows came from where.

The previous thread wondered why manually DISTINCT'ing inputs with a
lot of duplicates would make an INTERSECT faster than not doing so.
I speculated that the overhead of attaching the flag column (which
typically requires an additional projection node) plus the overhead of
an Append node might have a lot to do with that, and it seems I was
right.  With this patchset on yesterday's HEAD, I get results like
these for a simple test case:

regression=# create table t1 as
regression-# select (random()*100)::int as a from generate_series(1,1000000);
SELECT 1000000
regression=# vacuum analyze t1;
VACUUM
regression=# set max_parallel_workers_per_gather TO 0;
SET
regression=# explain analyze select a from t1 intersect select a from t1;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 HashSetOp Intersect  (cost=0.00..33960.00 rows=101 width=4) (actual time=246.352..246.355 rows=101 loops=1)
   ->  Seq Scan on t1  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.025..35.646 rows=1000000 loops=1)
   ->  Seq Scan on t1 t1_1  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.011..36.163 rows=1000000 loops=1)
 Planning Time: 0.045 ms
 Execution Time: 246.372 ms
(5 rows)

regression=# explain analyze select distinct a from t1 intersect select distinct a from t1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashSetOp Intersect  (cost=33960.00..33962.52 rows=101 width=4) (actual time=238.888..238.891 rows=101 loops=1)
   ->  HashAggregate  (cost=16980.00..16981.01 rows=101 width=4) (actual time=120.749..120.756 rows=101 loops=1)
         Group Key: t1.a
         Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on t1  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.027..35.391 rows=1000000 loops=1)
   ->  HashAggregate  (cost=16980.00..16981.01 rows=101 width=4) (actual time=118.101..118.107 rows=101 loops=1)
         Group Key: t1_1.a
         Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on t1 t1_1  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.014..35.468 rows=1000000 loops=1)
 Planning Time: 0.043 ms
 Execution Time: 238.916 ms
(11 rows)

It's still a little slower without DISTINCT, but not 50% slower like
before.  I have hopes that it'll be nearly on par once I figure out
how to avoid the extra per-row slot type conversion that's being done
in the 0001 patch.

Aside from that minor TODO, the main thing that's left undone in this
patch series is to persuade the thing to exploit presorted input
paths.  Right now it fails to do so, as can be seen in some of the
regression test cases, eg

regression=# set enable_hashagg = 0;
SET
regression=# explain (costs off) select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
                            QUERY PLAN
------------------------------------------------------------------
 SetOp Except
   ->  Sort
         Sort Key: tenk1.unique1
         ->  Index Only Scan using tenk1_unique1 on tenk1
   ->  Sort
         Sort Key: tenk1_1.unique2
         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
               Filter: (unique2 <> 10)
(8 rows)

Obviously the sorts are unnecessary, but it's not seeing that.
I suppose this requires integrating generate_nonunion_paths with
the logic from commit 66c0185a3.  I tried to make sense of that,
but failed --- either I don't understand it, or there are a
number of things wrong with it.  I'd welcome some help with
getting that done.

            regards, tom lane

[1] https://www.postgresql.org/message-id/2631313.1730733484%40sss.pgh.pa.us

From 05d6ad4530dddaceb771dee66699c2f0beae61f0 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 13 Nov 2024 15:34:29 -0500
Subject: [PATCH v1 1/3] Convert SetOp to read its inputs as outerPlan and
 innerPlan.

The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.
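
To make the sorting-cost argument concrete, here is a rough worked
bound.  It is only an illustration: it assumes a comparison sort that
costs about N log2 N comparisons, with n_1 and n_2 rows in the two
inputs; these symbols are not from the patch.

```latex
\begin{align*}
(n_1 + n_2)\log_2(n_1 + n_2)
  &= n_1 \log_2(n_1 + n_2) + n_2 \log_2(n_1 + n_2) \\
  &\ge n_1 \log_2 n_1 + n_2 \log_2 n_2 .
\end{align*}
```

With equal-sized inputs (n_1 = n_2 = n) the left side is
2n log2 n + 2n, so under this model the single combined sort pays
roughly one extra comparison per row.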

This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure how to quantify how
many more comparisons we might need.
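
The group-then-merge idea can be sketched outside the executor roughly
as follows.  This is not the patch's code: sorted int arrays stand in
for the child plans, plain integer comparison stands in for SortSupport,
and SortedInput, load_group and intersect_all_sorted are hypothetical
names.  Only the INTERSECT ALL rule (emit min(numLeft, numRight) copies
of each matching group) is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one SetOp input: a sorted array of ints. */
typedef struct SortedInput
{
    const int  *vals;       /* sorted input values */
    size_t      nvals;      /* number of values */
    size_t      next;       /* index of the next unread value */
    int         first;      /* first value of the current group */
    int64_t     numTuples;  /* size of current group; 0 means exhausted */
} SortedInput;

/* Collect the next run of identical values from one input. */
static void
load_group(SortedInput *in)
{
    if (in->next >= in->nvals)
    {
        in->numTuples = 0;      /* report an empty group */
        return;
    }
    in->first = in->vals[in->next++];
    in->numTuples = 1;
    while (in->next < in->nvals)
    {
        assert(in->vals[in->next] >= in->first);    /* else mis-sorted */
        if (in->vals[in->next] != in->first)
            break;              /* crossed a group boundary */
        in->next++;
        in->numTuples++;
    }
}

/*
 * Merge the groups of two sorted inputs and write the INTERSECT ALL
 * result into out[], which must have room for nl values.
 * Returns the number of values written.
 */
static size_t
intersect_all_sorted(const int *l, size_t nl,
                     const int *r, size_t nr, int *out)
{
    SortedInput left = {l, nl, 0, 0, 0};
    SortedInput right = {r, nr, 0, 0, 0};
    size_t      nout = 0;

    load_group(&left);
    if (left.numTuples == 0)
        return 0;               /* empty left input: never read the right */
    load_group(&right);

    /* Stop as soon as the left input is exhausted. */
    while (left.numTuples > 0)
    {
        if (right.numTuples == 0 || left.first < right.first)
            load_group(&left);  /* left group has no right matches */
        else if (left.first == right.first)
        {
            int64_t     n = left.numTuples < right.numTuples ?
                left.numTuples : right.numTuples;

            for (int64_t i = 0; i < n; i++)
                out[nout++] = left.first;
            load_group(&left);
            load_group(&right);
        }
        else
            load_group(&right); /* right group has no left matches */
    }
    return nout;
}
```

Each value is compared once against its group's first value while the
group is being formed, plus one cross-comparison per pair of groups
examined, which is the extra cost the paragraph above refers to.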

For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.
TODO: the tuple hash table logic spits up if I don't force tuples from
the second input into minimal-tuple form.  I don't understand why that
is, but it seems like it shouldn't be necessary.
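
For readers who want the two-loop shape in isolation, here is a
minimal sketch.  It is not the executor's tuple hash table: it uses a
small fixed-size open-addressing table of ints, and GroupEntry, lookup,
except_all_hashed and NBUCKETS are all hypothetical.  It assumes the
left input has fewer than NBUCKETS distinct values, and it shows only
the EXCEPT ALL rule.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 64     /* hypothetical fixed size; sketch only */

typedef struct GroupEntry
{
    bool        used;
    int         key;
    int64_t     numLeft;    /* number of left-input dups in group */
    int64_t     numRight;   /* number of right-input dups in group */
} GroupEntry;

/* Find the entry for key; make a new one only if "create" is true. */
static GroupEntry *
lookup(GroupEntry *table, int key, bool create)
{
    size_t      h = ((uint32_t) key * 2654435761u) % NBUCKETS;

    for (size_t probes = 0; probes < NBUCKETS; probes++)
    {
        GroupEntry *e = &table[(h + probes) % NBUCKETS];

        if (!e->used)
        {
            if (!create)
                return NULL;
            e->used = true;
            e->key = key;
            e->numLeft = e->numRight = 0;
            return e;
        }
        if (e->key == key)
            return e;
    }
    return NULL;                /* table full; not handled in this sketch */
}

/*
 * EXCEPT ALL in two separate passes.  out[] must have room for nl
 * values; output order follows the hash table, not the input.
 */
static size_t
except_all_hashed(const int *l, size_t nl,
                  const int *r, size_t nr, int *out)
{
    GroupEntry  table[NBUCKETS];
    size_t      nout = 0;

    memset(table, 0, sizeof(table));

    /* Pass 1: every left value finds or creates its group. */
    for (size_t i = 0; i < nl; i++)
    {
        GroupEntry *e = lookup(table, l[i], true);

        if (e)
            e->numLeft++;
    }

    /* Pass 2: right values only count against existing groups. */
    if (nl > 0)
    {
        for (size_t i = 0; i < nr; i++)
        {
            GroupEntry *e = lookup(table, r[i], false);

            if (e)
                e->numRight++;
        }
    }

    /* Final scan: emit max(numLeft - numRight, 0) copies per group. */
    for (size_t b = 0; b < NBUCKETS; b++)
    {
        if (!table[b].used)
            continue;
        for (int64_t n = table[b].numLeft - table[b].numRight; n > 0; n--)
            out[nout++] = table[b].key;
    }
    return nout;
}
```

Because the second pass never creates entries, values that appear only
in the right input cost a probe but no memory, and neither pass needs to
know which side a value came from.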

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.
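
The early exit is safe because of how many copies each command emits
per group.  The function below is a sketch of the standard SQL
semantics, not a copy of the executor's set_output_count; the
SketchCmd enum and output_count are hypothetical names.  With
numLeft == 0 every branch yields zero.

```c
#include <stdint.h>

/* Hypothetical command codes for this sketch only. */
typedef enum SketchCmd
{
    SKETCH_INTERSECT,
    SKETCH_INTERSECT_ALL,
    SKETCH_EXCEPT,
    SKETCH_EXCEPT_ALL
} SketchCmd;

/* Number of output copies for a group with the given per-side counts. */
static int64_t
output_count(SketchCmd cmd, int64_t numLeft, int64_t numRight)
{
    switch (cmd)
    {
        case SKETCH_INTERSECT:
            return (numLeft > 0 && numRight > 0) ? 1 : 0;
        case SKETCH_INTERSECT_ALL:
            return (numLeft < numRight) ? numLeft : numRight;
        case SKETCH_EXCEPT:
            return (numLeft > 0 && numRight == 0) ? 1 : 0;
        case SKETCH_EXCEPT_ALL:
            return (numLeft > numRight) ? numLeft - numRight : 0;
    }
    return 0;                   /* not reached for valid commands */
}
```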
---
 src/backend/executor/nodeSetOp.c        | 531 ++++++++++++++----------
 src/backend/optimizer/README            |   2 +-
 src/backend/optimizer/plan/createplan.c |  81 ++--
 src/backend/optimizer/prep/prepunion.c  | 156 ++++---
 src/backend/optimizer/util/pathnode.c   |  50 ++-
 src/include/nodes/execnodes.h           |  31 +-
 src/include/nodes/pathnodes.h           |   9 +-
 src/include/nodes/plannodes.h           |  19 +-
 src/include/optimizer/pathnode.h        |   7 +-
 src/test/regress/expected/subselect.out |  20 +-
 src/test/regress/expected/union.out     | 226 +++++-----
 src/tools/pgindent/typedefs.list        |   2 +
 12 files changed, 614 insertions(+), 520 deletions(-)

diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index a8ac68b482..5644739d65 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -3,29 +3,30 @@
  * nodeSetOp.c
  *      Routines to handle INTERSECT and EXCEPT selection
  *
- * The input of a SetOp node consists of tuples from two relations,
- * which have been combined into one dataset, with a junk attribute added
- * that shows which relation each tuple came from.  In SETOP_SORTED mode,
- * the input has furthermore been sorted according to all the grouping
- * columns (ie, all the non-junk attributes).  The SetOp node scans each
- * group of identical tuples to determine how many came from each input
- * relation.  Then it is a simple matter to emit the output demanded by the
- * SQL spec for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ * The input of a SetOp node consists of two relations (outer and inner)
+ * with identical column sets.  In EXCEPT queries the outer relation is
+ * always the left side, while in INTERSECT cases the planner tries to
+ * make the outer relation be the smaller of the two inputs.
  *
- * In SETOP_HASHED mode, the input is delivered in no particular order,
- * except that we know all the tuples from one input relation will come before
- * all the tuples of the other.  The planner guarantees that the first input
- * relation is the left-hand one for EXCEPT, and tries to make the smaller
- * input relation come first for INTERSECT.  We build a hash table in memory
- * with one entry for each group of identical tuples, and count the number of
- * tuples in the group from each relation.  After seeing all the input, we
- * scan the hashtable and generate the correct output using those counts.
- * We can avoid making hashtable entries for any tuples appearing only in the
- * second input relation, since they cannot result in any output.
+ * In SETOP_SORTED mode, each input has been sorted according to all the
+ * grouping columns (ie, all the non-junk attributes).  The SetOp node
+ * essentially performs a merge join on the grouping columns, except that
+ * it is only interested in counting how many tuples from each input match.
+ * Then it is a simple matter to emit the output demanded by the SQL spec
+ * for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ *
+ * In SETOP_HASHED mode, the inputs are delivered in no particular order.
+ * We read the outer relation and build a hash table in memory with one entry
+ * for each group of identical tuples, counting the number of tuples in the
+ * group.  Then we read the inner relation and count the number of tuples
+ * matching each outer group.  (We can disregard any tuples appearing only
+ * in the inner relation, since they cannot result in any output.)  After
+ * seeing all the input, we scan the hashtable and generate the correct
+ * output using those counts.
  *
  * This node type is not used for UNION or UNION ALL, since those can be
- * implemented more cheaply (there's no need for the junk attribute to
- * identify the source relation).
+ * implemented more cheaply (there's no need to count the number of
+ * matching tuples).
  *
  * Note that SetOp does no qual checking nor projection.  The delivered
  * output tuples are just copies of the first-to-arrive tuple in each
@@ -54,65 +55,28 @@
 /*
  * SetOpStatePerGroupData - per-group working state
  *
- * These values are working state that is initialized at the start of
- * an input tuple group and updated for each input tuple.
- *
- * In SETOP_SORTED mode, we need only one of these structs, and it's kept in
- * the plan state node.  In SETOP_HASHED mode, the hash table contains one
- * of these for each tuple group.
+ * In SETOP_SORTED mode, we need only one of these structs, and it's just a
+ * local in setop_retrieve_sorted.  In SETOP_HASHED mode, the hash table
+ * contains one of these for each tuple group.
  */
 typedef struct SetOpStatePerGroupData
 {
-    long        numLeft;        /* number of left-input dups in group */
-    long        numRight;        /* number of right-input dups in group */
-}            SetOpStatePerGroupData;
+    int64        numLeft;        /* number of left-input dups in group */
+    int64        numRight;        /* number of right-input dups in group */
+} SetOpStatePerGroupData;
+
+typedef SetOpStatePerGroupData *SetOpStatePerGroup;


-static TupleTableSlot *setop_retrieve_direct(SetOpState *setopstate);
+static TupleTableSlot *setop_retrieve_sorted(SetOpState *setopstate);
+static void setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                             SetOpState *setopstate);
+static int    setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                                SetOpState *setopstate);
 static void setop_fill_hash_table(SetOpState *setopstate);
 static TupleTableSlot *setop_retrieve_hash_table(SetOpState *setopstate);


-/*
- * Initialize state for a new group of input values.
- */
-static inline void
-initialize_counts(SetOpStatePerGroup pergroup)
-{
-    pergroup->numLeft = pergroup->numRight = 0;
-}
-
-/*
- * Advance the appropriate counter for one input tuple.
- */
-static inline void
-advance_counts(SetOpStatePerGroup pergroup, int flag)
-{
-    if (flag)
-        pergroup->numRight++;
-    else
-        pergroup->numLeft++;
-}
-
-/*
- * Fetch the "flag" column from an input tuple.
- * This is an integer column with value 0 for left side, 1 for right side.
- */
-static int
-fetch_tuple_flag(SetOpState *setopstate, TupleTableSlot *inputslot)
-{
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
-    int            flag;
-    bool        isNull;
-
-    flag = DatumGetInt32(slot_getattr(inputslot,
-                                      node->flagColIdx,
-                                      &isNull));
-    Assert(!isNull);
-    Assert(flag == 0 || flag == 1);
-    return flag;
-}
-
 /*
  * Initialize the hash table to empty.
  */
@@ -129,10 +93,10 @@ build_hash_table(SetOpState *setopstate)
     setopstate->hashtable = BuildTupleHashTableExt(&setopstate->ps,
                                                    desc,
                                                    node->numCols,
-                                                   node->dupColIdx,
+                                                   node->cmpColIdx,
                                                    setopstate->eqfuncoids,
                                                    setopstate->hashfunctions,
-                                                   node->dupCollations,
+                                                   node->cmpCollations,
                                                    node->numGroups,
                                                    0,
                                                    setopstate->ps.state->es_query_cxt,
@@ -217,108 +181,126 @@ ExecSetOp(PlanState *pstate)
         return setop_retrieve_hash_table(node);
     }
     else
-        return setop_retrieve_direct(node);
+        return setop_retrieve_sorted(node);
 }

 /*
  * ExecSetOp for non-hashed case
  */
 static TupleTableSlot *
-setop_retrieve_direct(SetOpState *setopstate)
+setop_retrieve_sorted(SetOpState *setopstate)
 {
     PlanState  *outerPlan;
-    SetOpStatePerGroup pergroup;
-    TupleTableSlot *outerslot;
+    PlanState  *innerPlan;
     TupleTableSlot *resultTupleSlot;
-    ExprContext *econtext = setopstate->ps.ps_ExprContext;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    pergroup = (SetOpStatePerGroup) setopstate->pergroup;
+    innerPlan = innerPlanState(setopstate);
     resultTupleSlot = setopstate->ps.ps_ResultTupleSlot;

     /*
-     * We loop retrieving groups until we find one we should return
+     * If first time through, establish the invariant that setop_load_group
+     * expects: each side's nextTupleSlot is the next output from the child
+     * plan, or empty if there is no more output from it.
      */
-    while (!setopstate->setop_done)
+    if (setopstate->need_init)
     {
+        setopstate->need_init = false;
+
+        setopstate->leftInput.nextTupleSlot = ExecProcNode(outerPlan);
+
         /*
-         * If we don't already have the first tuple of the new group, fetch it
-         * from the outer plan.
+         * If the outer relation is empty, then we will emit nothing, and we
+         * don't need to read the inner relation at all.
          */
-        if (setopstate->grp_firstTuple == NULL)
+        if (TupIsNull(setopstate->leftInput.nextTupleSlot))
         {
-            outerslot = ExecProcNode(outerPlan);
-            if (!TupIsNull(outerslot))
-            {
-                /* Make a copy of the first input tuple */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-            }
-            else
-            {
-                /* outer plan produced no tuples at all */
-                setopstate->setop_done = true;
-                return NULL;
-            }
+            setopstate->setop_done = true;
+            return NULL;
         }

+        setopstate->rightInput.nextTupleSlot = ExecProcNode(innerPlan);
+
+        /* Set flags that we've not completed either side's group */
+        setopstate->leftInput.needGroup = true;
+        setopstate->rightInput.needGroup = true;
+    }
+
+    /*
+     * We loop retrieving groups until we find one we should return
+     */
+    while (!setopstate->setop_done)
+    {
+        int            cmpresult;
+        SetOpStatePerGroupData pergroup;
+
         /*
-         * Store the copied first input tuple in the tuple table slot reserved
-         * for it.  The tuple will be deleted when it is cleared from the
-         * slot.
+         * Fetch the rest of the current outer group, if we didn't already.
          */
-        ExecStoreHeapTuple(setopstate->grp_firstTuple,
-                           resultTupleSlot,
-                           true);
-        setopstate->grp_firstTuple = NULL;    /* don't keep two pointers */
+        if (setopstate->leftInput.needGroup)
+            setop_load_group(&setopstate->leftInput, outerPlan, setopstate);

-        /* Initialize working state for a new input tuple group */
-        initialize_counts(pergroup);
+        /*
+         * If no more outer groups, we're done, and don't need to look at any
+         * more of the inner relation.
+         */
+        if (setopstate->leftInput.numTuples == 0)
+        {
+            setopstate->setop_done = true;
+            break;
+        }

-        /* Count the first input tuple */
-        advance_counts(pergroup,
-                       fetch_tuple_flag(setopstate, resultTupleSlot));
+        /*
+         * Fetch the rest of the current inner group, if we didn't already.
+         */
+        if (setopstate->rightInput.needGroup)
+            setop_load_group(&setopstate->rightInput, innerPlan, setopstate);

         /*
-         * Scan the outer plan until we exhaust it or cross a group boundary.
+         * Determine whether we have matching groups on both sides (this is
+         * basically like the core logic of a merge join).
          */
-        for (;;)
-        {
-            outerslot = ExecProcNode(outerPlan);
-            if (TupIsNull(outerslot))
-            {
-                /* no more outer-plan tuples available */
-                setopstate->setop_done = true;
-                break;
-            }
-
-            /*
-             * Check whether we've crossed a group boundary.
-             */
-            econtext->ecxt_outertuple = resultTupleSlot;
-            econtext->ecxt_innertuple = outerslot;
-
-            if (!ExecQualAndReset(setopstate->eqfunction, econtext))
-            {
-                /*
-                 * Save the first input tuple of the next group.
-                 */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-                break;
-            }
+        if (setopstate->rightInput.numTuples == 0)
+            cmpresult = -1;        /* as though left input is lesser */
+        else
+            cmpresult = setop_compare_slots(setopstate->leftInput.firstTupleSlot,
+                                            setopstate->rightInput.firstTupleSlot,
+                                            setopstate);

-            /* Still in same group, so count this tuple */
-            advance_counts(pergroup,
-                           fetch_tuple_flag(setopstate, outerslot));
+        if (cmpresult < 0)
+        {
+            /* Left group is first, has no right matches */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = 0;
+            /* We'll need another left group next time */
+            setopstate->leftInput.needGroup = true;
+        }
+        else if (cmpresult == 0)
+        {
+            /* We have matching groups */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = setopstate->rightInput.numTuples;
+            /* We'll need to read from both sides next time */
+            setopstate->leftInput.needGroup = true;
+            setopstate->rightInput.needGroup = true;
+        }
+        else
+        {
+            /* Right group has no left matches, so we can ignore it */
+            setopstate->rightInput.needGroup = true;
+            continue;
         }

         /*
-         * Done scanning input tuple group.  See if we should emit any copies
-         * of result tuple, and if so return the first copy.
+         * Done scanning these input tuple groups.  See if we should emit any
+         * copies of result tuple, and if so return the first copy.  (Note
+         * that the result tuple is the same as the left input's firstTuple
+         * slot.)
          */
-        set_output_count(setopstate, pergroup);
+        set_output_count(setopstate, &pergroup);

         if (setopstate->numOutput > 0)
         {
@@ -333,84 +315,173 @@ setop_retrieve_direct(SetOpState *setopstate)
 }

 /*
- * ExecSetOp for hashed case: phase 1, read input and build hash table
+ * Load next group of tuples from one child plan or the other.
+ *
+ * On entry, we've already read the first tuple of the next group
+ * (if there is one) into input->nextTupleSlot.  This invariant
+ * is maintained on exit.
+ */
+static void
+setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                 SetOpState *setopstate)
+{
+    input->needGroup = false;
+
+    /* If we've exhausted this child plan, report an empty group */
+    if (TupIsNull(input->nextTupleSlot))
+    {
+        ExecClearTuple(input->firstTupleSlot);
+        input->numTuples = 0;
+        return;
+    }
+
+    /* Make a local copy of the first tuple for comparisons */
+    ExecStoreHeapTuple(ExecCopySlotHeapTuple(input->nextTupleSlot),
+                       input->firstTupleSlot,
+                       true);
+    /* and count it */
+    input->numTuples = 1;
+
+    /* Scan till we find the end-of-group */
+    for (;;)
+    {
+        int            cmpresult;
+
+        /* Get next input tuple, if there is one */
+        input->nextTupleSlot = ExecProcNode(inputPlan);
+        if (TupIsNull(input->nextTupleSlot))
+            break;
+
+        /* There is; does it belong to same group as firstTuple? */
+        cmpresult = setop_compare_slots(input->firstTupleSlot,
+                                        input->nextTupleSlot,
+                                        setopstate);
+        Assert(cmpresult <= 0); /* else input is mis-sorted */
+        if (cmpresult != 0)
+            break;
+
+        /* Still in same group, so count this tuple */
+        input->numTuples++;
+    }
+}
+
+/*
+ * Compare the tuples in the two given slots.
+ */
+static int
+setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                    SetOpState *setopstate)
+{
+    for (int nkey = 0; nkey < setopstate->numCols; nkey++)
+    {
+        SortSupport sortKey = setopstate->sortKeys + nkey;
+        AttrNumber    attno = sortKey->ssup_attno;
+        Datum        datum1,
+                    datum2;
+        bool        isNull1,
+                    isNull2;
+        int            compare;
+
+        datum1 = slot_getattr(s1, attno, &isNull1);
+        datum2 = slot_getattr(s2, attno, &isNull2);
+
+        compare = ApplySortComparator(datum1, isNull1,
+                                      datum2, isNull2,
+                                      sortKey);
+        if (compare != 0)
+            return compare;
+    }
+    return 0;
+}
+
+/*
+ * ExecSetOp for hashed case: phase 1, read inputs and build hash table
  */
 static void
 setop_fill_hash_table(SetOpState *setopstate)
 {
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
     PlanState  *outerPlan;
-    int            firstFlag;
-    bool        in_first_rel PG_USED_FOR_ASSERTS_ONLY;
+    PlanState  *innerPlan;
     ExprContext *econtext = setopstate->ps.ps_ExprContext;
+    bool        have_tuples = false;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    firstFlag = node->firstFlag;
-    /* verify planner didn't mess up */
-    Assert(firstFlag == 0 ||
-           (firstFlag == 1 &&
-            (node->cmd == SETOPCMD_INTERSECT ||
-             node->cmd == SETOPCMD_INTERSECT_ALL)));
+    innerPlan = innerPlanState(setopstate);

     /*
      * Process each outer-plan tuple, and then fetch the next one, until we
      * exhaust the outer plan.
      */
-    in_first_rel = true;
     for (;;)
     {
         TupleTableSlot *outerslot;
-        int            flag;
         TupleHashEntryData *entry;
         bool        isnew;

         outerslot = ExecProcNode(outerPlan);
         if (TupIsNull(outerslot))
             break;
+        have_tuples = true;

-        /* Identify whether it's left or right input */
-        flag = fetch_tuple_flag(setopstate, outerslot);
+        /* Find or build hashtable entry for this tuple's group */
+        entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+                                     &isnew, NULL);

-        if (flag == firstFlag)
+        /* If new tuple group, initialize counts to zero */
+        if (isnew)
         {
-            /* (still) in first input relation */
-            Assert(in_first_rel);
-
-            /* Find or build hashtable entry for this tuple's group */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
-                                         &isnew, NULL);
-
-            /* If new tuple group, initialize counts */
-            if (isnew)
-            {
-                entry->additional = (SetOpStatePerGroup)
-                    MemoryContextAlloc(setopstate->hashtable->tablecxt,
+            entry->additional = (SetOpStatePerGroup)
+                MemoryContextAllocZero(setopstate->hashtable->tablecxt,
                                        sizeof(SetOpStatePerGroupData));
-                initialize_counts((SetOpStatePerGroup) entry->additional);
-            }
-
-            /* Advance the counts */
-            advance_counts((SetOpStatePerGroup) entry->additional, flag);
         }
-        else
+
+        /* Advance the counts */
+        ((SetOpStatePerGroup) entry->additional)->numLeft++;
+
+        /* Must reset expression context after each hashtable lookup */
+        ResetExprContext(econtext);
+    }
+
+    /*
+     * If the outer relation is empty, then we will emit nothing, and we don't
+     * need to read the inner relation at all.
+     */
+    if (have_tuples)
+    {
+        /*
+         * Process each inner-plan tuple, and then fetch the next one, until
+         * we exhaust the inner plan.
+         */
+        for (;;)
         {
-            /* reached second relation */
-            in_first_rel = false;
+            TupleTableSlot *innerslot;
+            TupleHashEntryData *entry;
+
+            innerslot = ExecProcNode(innerPlan);
+            if (TupIsNull(innerslot))
+                break;
+
+            /* XXX hack: force the tuple into minimal form */
+            /* XXX should not have to do this */
+            ExecForceStoreHeapTuple(ExecCopySlotHeapTuple(innerslot),
+                                    setopstate->ps.ps_ResultTupleSlot,
+                                    true);
+            innerslot = setopstate->ps.ps_ResultTupleSlot;

             /* For tuples not seen previously, do not make hashtable entry */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+            entry = LookupTupleHashEntry(setopstate->hashtable, innerslot,
                                          NULL, NULL);

             /* Advance the counts if entry is already present */
             if (entry)
-                advance_counts((SetOpStatePerGroup) entry->additional, flag);
-        }
+                ((SetOpStatePerGroup) entry->additional)->numRight++;

-        /* Must reset expression context after each hashtable lookup */
-        ResetExprContext(econtext);
+            /* Must reset expression context after each hashtable lookup */
+            ResetExprContext(econtext);
+        }
     }

     setopstate->table_filled = true;
@@ -481,7 +552,6 @@ SetOpState *
 ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 {
     SetOpState *setopstate;
-    TupleDesc    outerDesc;

     /* check for unsupported flags */
     Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -494,14 +564,10 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     setopstate->ps.state = estate;
     setopstate->ps.ExecProcNode = ExecSetOp;

-    setopstate->eqfuncoids = NULL;
-    setopstate->hashfunctions = NULL;
     setopstate->setop_done = false;
     setopstate->numOutput = 0;
-    setopstate->pergroup = NULL;
-    setopstate->grp_firstTuple = NULL;
-    setopstate->hashtable = NULL;
-    setopstate->tableContext = NULL;
+    setopstate->numCols = node->numCols;
+    setopstate->need_init = true;

     /*
      * create expression context
@@ -522,52 +588,74 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * initialize child nodes
      *
-     * If we are hashing then the child plan does not need to handle REWIND
+     * If we are hashing then the child plans do not need to handle REWIND
      * efficiently; see ExecReScanSetOp.
      */
     if (node->strategy == SETOP_HASHED)
         eflags &= ~EXEC_FLAG_REWIND;
     outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
-    outerDesc = ExecGetResultType(outerPlanState(setopstate));
+    innerPlanState(setopstate) = ExecInitNode(innerPlan(node), estate, eflags);

     /*
-     * Initialize result slot and type. Setop nodes do no projections, so
-     * initialize projection info for this node appropriately.
+     * Initialize locally-allocated slots.  In hashed mode, we just need a
+     * result slot.  In sorted mode, we need one first-tuple-of-group slot for
+     * each input; we use the result slot for the left input's slot and create
+     * another for the right input.  (Note: the nextTupleSlot slots are not
+     * ours, but just point to the last slot returned by the input plan node.)
      */
     ExecInitResultTupleSlotTL(&setopstate->ps,
                               node->strategy == SETOP_HASHED ?
                               &TTSOpsMinimalTuple : &TTSOpsHeapTuple);
+    if (node->strategy != SETOP_HASHED)
+    {
+        setopstate->leftInput.firstTupleSlot =
+            setopstate->ps.ps_ResultTupleSlot;
+        setopstate->rightInput.firstTupleSlot =
+            ExecInitExtraTupleSlot(estate,
+                                   setopstate->ps.ps_ResultTupleDesc,
+                                   &TTSOpsHeapTuple);
+    }
+
+    /* Setop nodes do no projections. */
     setopstate->ps.ps_ProjInfo = NULL;

     /*
-     * Precompute fmgr lookup data for inner loop. We need both equality and
-     * hashing functions to do it by hashing, but only equality if not
-     * hashing.
+     * Precompute fmgr lookup data for inner loop.  We need equality and
+     * hashing functions to do it by hashing, while for sorting we need
+     * SortSupport data.
      */
     if (node->strategy == SETOP_HASHED)
         execTuplesHashPrepare(node->numCols,
-                              node->dupOperators,
+                              node->cmpOperators,
                               &setopstate->eqfuncoids,
                               &setopstate->hashfunctions);
     else
-        setopstate->eqfunction =
-            execTuplesMatchPrepare(outerDesc,
-                                   node->numCols,
-                                   node->dupColIdx,
-                                   node->dupOperators,
-                                   node->dupCollations,
-                                   &setopstate->ps);
+    {
+        int            nkeys = node->numCols;
+
+        setopstate->sortKeys = (SortSupport)
+            palloc0(nkeys * sizeof(SortSupportData));
+        for (int i = 0; i < nkeys; i++)
+        {
+            SortSupport sortKey = setopstate->sortKeys + i;
+
+            sortKey->ssup_cxt = CurrentMemoryContext;
+            sortKey->ssup_collation = node->cmpCollations[i];
+            sortKey->ssup_nulls_first = node->cmpNullsFirst[i];
+            sortKey->ssup_attno = node->cmpColIdx[i];
+            /* abbreviated key conversion is not useful here */
+            sortKey->abbreviate = false;

+            PrepareSortSupportFromOrderingOp(node->cmpOperators[i], sortKey);
+        }
+    }
+
+    /* Create a hash table if needed */
     if (node->strategy == SETOP_HASHED)
     {
         build_hash_table(setopstate);
         setopstate->table_filled = false;
     }
-    else
-    {
-        setopstate->pergroup =
-            (SetOpStatePerGroup) palloc0(sizeof(SetOpStatePerGroupData));
-    }

     return setopstate;
 }
@@ -575,7 +663,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 /* ----------------------------------------------------------------
  *        ExecEndSetOp
  *
- *        This shuts down the subplan and frees resources allocated
+ *        This shuts down the subplans and frees resources allocated
  *        to this node.
  * ----------------------------------------------------------------
  */
@@ -587,6 +675,7 @@ ExecEndSetOp(SetOpState *node)
         MemoryContextDelete(node->tableContext);

     ExecEndNode(outerPlanState(node));
+    ExecEndNode(innerPlanState(node));
 }


@@ -594,6 +683,7 @@ void
 ExecReScanSetOp(SetOpState *node)
 {
     PlanState  *outerPlan = outerPlanState(node);
+    PlanState  *innerPlan = innerPlanState(node);

     ExecClearTuple(node->ps.ps_ResultTupleSlot);
     node->setop_done = false;
@@ -611,34 +701,29 @@ ExecReScanSetOp(SetOpState *node)
             return;

         /*
-         * If we do have the hash table and the subplan does not have any
+         * If we do have the hash table and the subplans do not have any
          * parameter changes, then we can just rescan the existing hash table;
          * no need to build it again.
          */
-        if (outerPlan->chgParam == NULL)
+        if (outerPlan->chgParam == NULL && innerPlan->chgParam == NULL)
         {
             ResetTupleHashIterator(node->hashtable, &node->hashiter);
             return;
         }
-    }
-
-    /* Release first tuple of group, if we have made a copy */
-    if (node->grp_firstTuple != NULL)
-    {
-        heap_freetuple(node->grp_firstTuple);
-        node->grp_firstTuple = NULL;
-    }

-    /* Release any hashtable storage */
-    if (node->tableContext)
-        MemoryContextReset(node->tableContext);
+        /* Release any hashtable storage */
+        if (node->tableContext)
+            MemoryContextReset(node->tableContext);

-    /* And rebuild empty hashtable if needed */
-    if (((SetOp *) node->ps.plan)->strategy == SETOP_HASHED)
-    {
+        /* And rebuild an empty hashtable */
         ResetTupleHashTable(node->hashtable);
         node->table_filled = false;
     }
+    else
+    {
+        /* Need to re-read first input from each side */
+        node->need_init = true;
+    }

     /*
      * if chgParam of subnode is not null then plan will be re-scanned by
@@ -646,4 +731,6 @@ ExecReScanSetOp(SetOpState *node)
      */
     if (outerPlan->chgParam == NULL)
         ExecReScan(outerPlan);
+    if (innerPlan->chgParam == NULL)
+        ExecReScan(innerPlan);
 }
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 2ab4f3dbf3..f341d9f303 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -649,7 +649,7 @@ RelOptInfo      - a relation or joined relations
   GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
   MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
   WindowAggPath - a WindowAgg plan node applied to some sub-path
-  SetOpPath     - a SetOp plan node applied to some sub-path
+  SetOpPath     - a SetOp plan node applied to two sub-paths
   RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
   LockRowsPath  - a LockRows plan node applied to some sub-path
   ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f2ed0d81f6..07976c5938 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -301,9 +301,9 @@ static Unique *make_unique_from_pathkeys(Plan *lefttree,
                                          List *pathkeys, int numCols);
 static Gather *make_gather(List *qptlist, List *qpqual,
                            int nworkers, int rescan_param, bool single_copy, Plan *subplan);
-static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-                         List *distinctList, AttrNumber flagColIdx, int firstFlag,
-                         long numGroups);
+static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+                         List *tlist, Plan *lefttree, Plan *righttree,
+                         List *groupList, long numGroups);
 static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
 static Result *make_result(List *tlist, Node *resconstantqual, Plan *subplan);
 static ProjectSet *make_project_set(List *tlist, Plan *subplan);
@@ -2719,25 +2719,29 @@ static SetOp *
 create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 {
     SetOp       *plan;
-    Plan       *subplan;
+    List       *tlist = build_path_tlist(root, &best_path->path);
+    Plan       *leftplan;
+    Plan       *rightplan;
     long        numGroups;

     /*
      * SetOp doesn't project, so tlist requirements pass through; moreover we
      * need grouping columns to be labeled.
      */
-    subplan = create_plan_recurse(root, best_path->subpath,
-                                  flags | CP_LABEL_TLIST);
+    leftplan = create_plan_recurse(root, best_path->leftpath,
+                                   flags | CP_LABEL_TLIST);
+    rightplan = create_plan_recurse(root, best_path->rightpath,
+                                    flags | CP_LABEL_TLIST);

     /* Convert numGroups to long int --- but 'ware overflow! */
     numGroups = clamp_cardinality_to_long(best_path->numGroups);

     plan = make_setop(best_path->cmd,
                       best_path->strategy,
-                      subplan,
-                      best_path->distinctList,
-                      best_path->flagColIdx,
-                      best_path->firstFlag,
+                      tlist,
+                      leftplan,
+                      rightplan,
+                      best_path->groupList,
                       numGroups);

     copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6952,57 +6956,62 @@ make_gather(List *qptlist,
 }

 /*
- * distinctList is a list of SortGroupClauses, identifying the targetlist
- * items that should be considered by the SetOp filter.  The input path must
- * already be sorted accordingly.
+ * groupList is a list of SortGroupClauses, identifying the targetlist
+ * items that should be considered by the SetOp filter.  The input plans must
+ * already be sorted accordingly, if we're doing SETOP_SORTED mode.
  */
 static SetOp *
-make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-           List *distinctList, AttrNumber flagColIdx, int firstFlag,
-           long numGroups)
+make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+           List *tlist, Plan *lefttree, Plan *righttree,
+           List *groupList, long numGroups)
 {
     SetOp       *node = makeNode(SetOp);
     Plan       *plan = &node->plan;
-    int            numCols = list_length(distinctList);
+    int            numCols = list_length(groupList);
     int            keyno = 0;
-    AttrNumber *dupColIdx;
-    Oid           *dupOperators;
-    Oid           *dupCollations;
+    AttrNumber *cmpColIdx;
+    Oid           *cmpOperators;
+    Oid           *cmpCollations;
+    bool       *cmpNullsFirst;
     ListCell   *slitem;

-    plan->targetlist = lefttree->targetlist;
+    plan->targetlist = tlist;
     plan->qual = NIL;
     plan->lefttree = lefttree;
-    plan->righttree = NULL;
+    plan->righttree = righttree;

     /*
-     * convert SortGroupClause list into arrays of attr indexes and equality
+     * convert SortGroupClause list into arrays of attr indexes and comparison
      * operators, as wanted by executor
      */
-    dupColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
-    dupOperators = (Oid *) palloc(sizeof(Oid) * numCols);
-    dupCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
+    cmpOperators = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpNullsFirst = (bool *) palloc(sizeof(bool) * numCols);

-    foreach(slitem, distinctList)
+    foreach(slitem, groupList)
     {
         SortGroupClause *sortcl = (SortGroupClause *) lfirst(slitem);
         TargetEntry *tle = get_sortgroupclause_tle(sortcl, plan->targetlist);

-        dupColIdx[keyno] = tle->resno;
-        dupOperators[keyno] = sortcl->eqop;
-        dupCollations[keyno] = exprCollation((Node *) tle->expr);
-        Assert(OidIsValid(dupOperators[keyno]));
+        cmpColIdx[keyno] = tle->resno;
+        if (strategy == SETOP_HASHED)
+            cmpOperators[keyno] = sortcl->eqop;
+        else
+            cmpOperators[keyno] = sortcl->sortop;
+        Assert(OidIsValid(cmpOperators[keyno]));
+        cmpCollations[keyno] = exprCollation((Node *) tle->expr);
+        cmpNullsFirst[keyno] = sortcl->nulls_first;
         keyno++;
     }

     node->cmd = cmd;
     node->strategy = strategy;
     node->numCols = numCols;
-    node->dupColIdx = dupColIdx;
-    node->dupOperators = dupOperators;
-    node->dupCollations = dupCollations;
-    node->flagColIdx = flagColIdx;
-    node->firstFlag = firstFlag;
+    node->cmpColIdx = cmpColIdx;
+    node->cmpOperators = cmpOperators;
+    node->cmpCollations = cmpCollations;
+    node->cmpNullsFirst = cmpNullsFirst;
     node->numGroups = numGroups;

     return node;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index a0baf6d4a1..8c48109bd4 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -65,7 +65,7 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
 static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *input_path,
+                                Path *lpath, Path *rpath,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
@@ -315,8 +315,8 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * to the corresponding tlist entries of the subplan. However, since
          * the subplan was generated by generate_union_paths() or
          * generate_nonunion_paths(), and hence its tlist was generated by
-         * generate_append_tlist(), this will work.  We just tell
-         * generate_setop_tlist() to use varno 0.
+         * generate_append_tlist() or generate_setop_tlist(), this will work.
+         * We just tell generate_setop_tlist() to use varno 0.
          */
         if (flag >= 0 ||
             !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
@@ -1028,29 +1028,27 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                *path;
     List       *lpath_tlist,
                *rpath_tlist,
-               *tlist_list,
                *tlist,
-               *groupList,
-               *pathlist;
+               *groupList;
     bool        lpath_trivial_tlist,
-                rpath_trivial_tlist;
+                rpath_trivial_tlist,
+                result_trivial_tlist;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
     bool        use_hash;
     SetOpCmd    cmd;
-    int            firstFlag;

     /*
      * Tell children to fetch all tuples.
      */
     root->tuple_fraction = 0.0;

-    /* Recurse on children, ensuring their outputs are marked */
+    /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 0,
+                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,10 +1058,9 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dLeftGroups = lrel->rows;

-    lpath = lrel->cheapest_total_path;
     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 1,
+                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1073,41 +1070,51 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dRightGroups = rrel->rows;

-    rpath = rrel->cheapest_total_path;
-
     /* Undo effects of forcing tuple_fraction to 0 */
     root->tuple_fraction = save_fraction;

     /*
      * For EXCEPT, we must put the left input first.  For INTERSECT, either
      * order should give the same results, and we prefer to put the smaller
-     * input first in order to minimize the size of the hash table in the
-     * hashing case.  "Smaller" means the one with the fewer groups.
+     * input first in order to (a) minimize the size of the hash table in the
+     * hashing case, and (b) improve our chances of exploiting the executor's
+     * fast path for empty left-hand input.  "Smaller" means the one with the
+     * fewer groups.
      */
-    if (op->op == SETOP_EXCEPT || dLeftGroups <= dRightGroups)
-    {
-        pathlist = list_make2(lpath, rpath);
-        tlist_list = list_make2(lpath_tlist, rpath_tlist);
-        firstFlag = 0;
-    }
-    else
+    if (op->op != SETOP_EXCEPT && dLeftGroups > dRightGroups)
     {
-        pathlist = list_make2(rpath, lpath);
-        tlist_list = list_make2(rpath_tlist, lpath_tlist);
-        firstFlag = 1;
+        /* need to swap the two inputs */
+        RelOptInfo *tmprel;
+        List       *tmplist;
+        double        tmpd;
+
+        tmprel = lrel;
+        lrel = rrel;
+        rrel = tmprel;
+        tmplist = lpath_tlist;
+        lpath_tlist = rpath_tlist;
+        rpath_tlist = tmplist;
+        tmpd = dLeftGroups;
+        dLeftGroups = dRightGroups;
+        dRightGroups = tmpd;
     }

+    lpath = lrel->cheapest_total_path;
+    rpath = rrel->cheapest_total_path;
+
     /*
-     * Generate tlist for Append plan node.
+     * Generate tlist for SetOp plan node.
      *
-     * The tlist for an Append plan isn't important as far as the Append is
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
      * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.  In fact, it has to be real enough that the flag
-     * column is shown as a variable not a constant, else setrefs.c will get
-     * confused.
+     * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, true,
-                                  tlist_list, refnames_tlist);
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);

     *pTargetList = tlist;

@@ -1116,12 +1123,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /*
-     * Append the child results together.
-     */
-    path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
-                                       NIL, NULL, 0, false, -1);
-
     /* Identify the grouping semantics */
     groupList = generate_setop_grouplist(op, tlist);

@@ -1140,25 +1141,40 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     else
     {
-        dNumGroups = Min(dLeftGroups, dRightGroups);
+        dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }

     /*
-     * Decide whether to hash or sort, and add a sort node if needed.
+     * Decide whether to hash or sort, and add sort nodes if needed.
      */
-    use_hash = choose_hashed_setop(root, groupList, path,
+    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
                                    dNumGroups, dNumOutputRows,
                                    (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");

     if (groupList && !use_hash)
-        path = (Path *) create_sort_path(root,
-                                         result_rel,
-                                         path,
-                                         make_pathkeys_for_sortclauses(root,
-                                                                       groupList,
-                                                                       tlist),
-                                         -1.0);
+    {
+        List       *pathkeys;
+
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 lpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            lpath = (Path *) create_sort_path(root,
+                                              lpath->parent,
+                                              lpath,
+                                              pathkeys,
+                                              -1.0);
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 rpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            rpath = (Path *) create_sort_path(root,
+                                              rpath->parent,
+                                              rpath,
+                                              pathkeys,
+                                              -1.0);
+    }

     /*
      * Finally, add a SetOp path node to generate the correct output.
@@ -1178,12 +1194,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     path = (Path *) create_setop_path(root,
                                       result_rel,
-                                      path,
+                                      lpath,
+                                      rpath,
                                       cmd,
                                       use_hash ? SETOP_HASHED : SETOP_SORTED,
                                       groupList,
-                                      list_length(op->colTypes) + 1,
-                                      use_hash ? firstFlag : -1,
                                       dNumGroups,
                                       dNumOutputRows);

@@ -1285,10 +1300,13 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)

 /*
  * choose_hashed_setop - should we use hashing for a set operation?
+ *
+ * XXX probably this should go away: just make both paths and let
+ * add_path sort it out.
  */
 static bool
 choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *input_path,
+                    Path *lpath, Path *rpath,
                     double dNumGroups, double dNumOutputRows,
                     const char *construct)
 {
@@ -1327,7 +1345,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * Don't do it if it doesn't look like the hashtable will fit into
      * hash_mem.
      */
-    hashentrysize = MAXALIGN(input_path->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
+    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);

     if (hashentrysize * dNumGroups > hash_mem_limit)
         return false;
@@ -1336,9 +1354,9 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * See if the estimated cost is no more than doing it the other way.
      *
      * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group.  Note that the actual result plan might involve a SetOp or
-     * Unique node, not Agg or Group, but the cost estimates for Agg and Group
-     * should be close enough for our purposes here.
+     * group. XXX NOT TRUE: Note that the actual result plan might involve a
+     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
+     * and Group should be close enough for our purposes here.
      *
      * These path variables are dummies that just hold cost fields; we don't
      * make actual Paths for these steps.
@@ -1346,27 +1364,31 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
     cost_agg(&hashed_p, root, AGG_HASHED, NULL,
              numGroupCols, dNumGroups,
              NIL,
-             input_path->disabled_nodes,
-             input_path->startup_cost, input_path->total_cost,
-             input_path->rows, input_path->pathtarget->width);
+             lpath->disabled_nodes + rpath->disabled_nodes,
+             lpath->startup_cost + rpath->startup_cost,
+             lpath->total_cost + rpath->total_cost,
+             lpath->rows + rpath->rows,
+             lpath->pathtarget->width);

     /*
-     * Now for the sorted case.  Note that the input is *always* unsorted,
-     * since it was made by appending unrelated sub-relations together.
+     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
+     * unsorted, since it was made by appending unrelated sub-relations
+     * together.
      */
-    sorted_p.disabled_nodes = input_path->disabled_nodes;
-    sorted_p.startup_cost = input_path->startup_cost;
-    sorted_p.total_cost = input_path->total_cost;
+    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
+    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
+    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
     /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
     cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
               sorted_p.total_cost,
-              input_path->rows, input_path->pathtarget->width,
+              lpath->rows + rpath->rows,
+              lpath->pathtarget->width,
               0.0, work_mem, -1.0);
     cost_group(&sorted_p, root, numGroupCols, dNumGroups,
                NIL,
                sorted_p.disabled_nodes,
                sorted_p.startup_cost, sorted_p.total_cost,
-               input_path->rows);
+               lpath->rows + rpath->rows);

     /*
      * Now make the decision using the top-level tuple fraction.  First we
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fc97bf6ee2..e52e4b1d67 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3634,25 +3634,26 @@ create_windowagg_path(PlannerInfo *root,
  *      Creates a pathnode that represents computation of INTERSECT or EXCEPT
  *
  * 'rel' is the parent relation associated with the result
- * 'subpath' is the path representing the source of data
+ * 'leftpath' is the path representing the left-hand source of data
+ * 'rightpath' is the path representing the right-hand source of data
  * 'cmd' is the specific semantics (INTERSECT or EXCEPT, with/without ALL)
  * 'strategy' is the implementation strategy (sorted or hashed)
- * 'distinctList' is a list of SortGroupClause's representing the grouping
- * 'flagColIdx' is the column number where the flag column will be, if any
- * 'firstFlag' is the flag value for the first input relation when hashing;
- *        or -1 when sorting
- * 'numGroups' is the estimated number of distinct groups
+ * 'groupList' is a list of SortGroupClause's representing the grouping
+ * 'numGroups' is the estimated number of distinct groups in left-hand input
  * 'outputRows' is the estimated number of output rows
+ *
+ * leftpath and rightpath must produce the same columns.  Moreover, if
+ * strategy is SETOP_SORTED, leftpath and rightpath must both be sorted
+ * by all the grouping columns.
  */
 SetOpPath *
 create_setop_path(PlannerInfo *root,
                   RelOptInfo *rel,
-                  Path *subpath,
+                  Path *leftpath,
+                  Path *rightpath,
                   SetOpCmd cmd,
                   SetOpStrategy strategy,
-                  List *distinctList,
-                  AttrNumber flagColIdx,
-                  int firstFlag,
+                  List *groupList,
                   double numGroups,
                   double outputRows)
 {
@@ -3660,34 +3661,37 @@ create_setop_path(PlannerInfo *root,

     pathnode->path.pathtype = T_SetOp;
     pathnode->path.parent = rel;
-    /* SetOp doesn't project, so use source path's pathtarget */
-    pathnode->path.pathtarget = subpath->pathtarget;
+    pathnode->path.pathtarget = rel->reltarget;
     /* For now, assume we are above any joins, so no parameterization */
     pathnode->path.param_info = NULL;
     pathnode->path.parallel_aware = false;
     pathnode->path.parallel_safe = rel->consider_parallel &&
-        subpath->parallel_safe;
-    pathnode->path.parallel_workers = subpath->parallel_workers;
+        leftpath->parallel_safe && rightpath->parallel_safe;
+    pathnode->path.parallel_workers =
+        leftpath->parallel_workers + rightpath->parallel_workers;
     /* SetOp preserves the input sort order if in sort mode */
     pathnode->path.pathkeys =
-        (strategy == SETOP_SORTED) ? subpath->pathkeys : NIL;
+        (strategy == SETOP_SORTED) ? leftpath->pathkeys : NIL;

-    pathnode->subpath = subpath;
+    pathnode->leftpath = leftpath;
+    pathnode->rightpath = rightpath;
     pathnode->cmd = cmd;
     pathnode->strategy = strategy;
-    pathnode->distinctList = distinctList;
-    pathnode->flagColIdx = flagColIdx;
-    pathnode->firstFlag = firstFlag;
+    pathnode->groupList = groupList;
     pathnode->numGroups = numGroups;

     /*
      * Charge one cpu_operator_cost per comparison per input tuple. We assume
      * all columns get compared at most of the tuples.
+     *
+     * XXX all wrong for hashing
      */
-    pathnode->path.disabled_nodes = subpath->disabled_nodes;
-    pathnode->path.startup_cost = subpath->startup_cost;
-    pathnode->path.total_cost = subpath->total_cost +
-        cpu_operator_cost * subpath->rows * list_length(distinctList);
+    pathnode->path.disabled_nodes =
+        leftpath->disabled_nodes + rightpath->disabled_nodes;
+    pathnode->path.startup_cost =
+        leftpath->startup_cost + rightpath->startup_cost;
+    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
+        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 182a6956bb..a730c6ec6d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2806,27 +2806,34 @@ typedef struct HashState
 /* ----------------
  *     SetOpState information
  *
- *        Even in "sorted" mode, SetOp nodes are more complex than a simple
- *        Unique, since we have to count how many duplicates to return.  But
- *        we also support hashing, so this is really more like a cut-down
- *        form of Agg.
+ *        SetOp nodes support either sorted or hashed de-duplication.
+ *        The sorted mode is a bit like MergeJoin, the hashed mode like Agg.
  * ----------------
  */
-/* this struct is private in nodeSetOp.c: */
-typedef struct SetOpStatePerGroupData *SetOpStatePerGroup;
+typedef struct SetOpStatePerInput
+{
+    TupleTableSlot *firstTupleSlot; /* first tuple of current group */
+    int64        numTuples;        /* number of tuples in current group */
+    TupleTableSlot *nextTupleSlot;    /* next input tuple, if already read */
+    bool        needGroup;        /* do we need to load a new group? */
+} SetOpStatePerInput;

 typedef struct SetOpState
 {
     PlanState    ps;                /* its first field is NodeTag */
-    ExprState  *eqfunction;        /* equality comparator */
-    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
-    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     bool        setop_done;        /* indicates completion of output scan */
-    long        numOutput;        /* number of dups left to output */
+    int64        numOutput;        /* number of dups left to output */
+    int            numCols;        /* number of grouping columns */
+
     /* these fields are used in SETOP_SORTED mode: */
-    SetOpStatePerGroup pergroup;    /* per-group working state */
-    HeapTuple    grp_firstTuple; /* copy of first tuple of current group */
+    SortSupport sortKeys;        /* per-grouping-field sort data */
+    SetOpStatePerInput leftInput;    /* current outer-relation input state */
+    SetOpStatePerInput rightInput;    /* current inner-relation input state */
+    bool        need_init;        /* must we still read the first tuples? */
+
     /* these fields are used in SETOP_HASHED mode: */
+    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
+    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     TupleHashTable hashtable;    /* hash table with one entry per group */
     MemoryContext tableContext; /* memory context containing hash table */
     bool        table_filled;    /* hash table filled yet? */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index add0f9e45f..da3bdbab27 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2340,13 +2340,12 @@ typedef struct WindowAggPath
 typedef struct SetOpPath
 {
     Path        path;
-    Path       *subpath;        /* path representing input source */
+    Path       *leftpath;        /* paths representing input sources */
+    Path       *rightpath;
     SetOpCmd    cmd;            /* what to do, see nodes.h */
     SetOpStrategy strategy;        /* how to do it, see nodes.h */
-    List       *distinctList;    /* SortGroupClauses identifying target cols */
-    AttrNumber    flagColIdx;        /* where is the flag column, if any */
-    int            firstFlag;        /* flag value for first input relation */
-    Cardinality numGroups;        /* estimated number of groups in input */
+    List       *groupList;        /* SortGroupClauses identifying target cols */
+    Cardinality numGroups;        /* estimated number of groups in left input */
 } SetOpPath;

 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 52f29bcdb6..4633121689 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1225,23 +1225,20 @@ typedef struct SetOp
     /* how to do it, see nodes.h */
     SetOpStrategy strategy;

-    /* number of columns to check for duplicate-ness */
+    /* number of columns to compare */
     int            numCols;

     /* their indexes in the target list */
-    AttrNumber *dupColIdx pg_node_attr(array_size(numCols));
+    AttrNumber *cmpColIdx pg_node_attr(array_size(numCols));

-    /* equality operators to compare with */
-    Oid           *dupOperators pg_node_attr(array_size(numCols));
-    Oid           *dupCollations pg_node_attr(array_size(numCols));
-
-    /* where is the flag column, if any */
-    AttrNumber    flagColIdx;
+    /* comparison operators (either equality operators or sort operators) */
+    Oid           *cmpOperators pg_node_attr(array_size(numCols));
+    Oid           *cmpCollations pg_node_attr(array_size(numCols));

-    /* flag value for first input relation */
-    int            firstFlag;
+    /* nulls-first flags if sorting, otherwise not interesting */
+    bool       *cmpNullsFirst pg_node_attr(array_size(numCols));

-    /* estimated number of groups in input */
+    /* estimated number of groups in left input */
     long        numGroups;
 } SetOp;

diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 1035e6560c..5a6d0350c1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -260,12 +260,11 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
                                             bool topwindow);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
                                     RelOptInfo *rel,
-                                    Path *subpath,
+                                    Path *leftpath,
+                                    Path *rightpath,
                                     SetOpCmd cmd,
                                     SetOpStrategy strategy,
-                                    List *distinctList,
-                                    AttrNumber flagColIdx,
-                                    int firstFlag,
+                                    List *groupList,
                                     double numGroups,
                                     double outputRows);
 extern RecursiveUnionPath *create_recursiveunion_path(PlannerInfo *root,
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 2d35de3fad..154fffabae 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1231,22 +1231,18 @@ select count(*) from
     select * from onek i2 where i2.unique1 = o.unique2
   ) ss
 where o.ten = 1;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
                Filter: (ten = 1)
-         ->  Subquery Scan on ss
-               ->  HashSetOp Except
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Scan using onek_unique1 on onek i1
-                                       Index Cond: (unique1 = o.unique1)
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Index Scan using onek_unique1 on onek i2
-                                       Index Cond: (unique1 = o.unique2)
-(13 rows)
+         ->  HashSetOp Except
+               ->  Index Scan using onek_unique1 on onek i1
+                     Index Cond: (unique1 = o.unique1)
+               ->  Index Scan using onek_unique1 on onek i2
+                     Index Cond: (unique1 = o.unique2)
+(9 rows)

 select count(*) from
   onek o cross join lateral (
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index c73631a9a1..fbf24490e9 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -370,17 +370,13 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                     QUERY PLAN
-------------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  HashSetOp Intersect
-               ->  Append
-                     ->  Subquery Scan on "*SELECT* 2"
-                           ->  Seq Scan on tenk1
-                     ->  Subquery Scan on "*SELECT* 1"
-                           ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+   ->  HashSetOp Intersect
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(4 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -391,16 +387,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Index Only Scan using tenk1_unique1 on tenk1
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                     Filter: (unique2 <> 10)
-(7 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -434,19 +427,17 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                        QUERY PLAN
-------------------------------------------------------------------------------------------
+                               QUERY PLAN
+------------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  SetOp Intersect
-               ->  Sort
-                     Sort Key: "*SELECT* 2".fivethous
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Seq Scan on tenk1
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(10 rows)
+   ->  SetOp Intersect
+         ->  Sort
+               Sort Key: tenk1.fivethous
+               ->  Seq Scan on tenk1
+         ->  Sort
+               Sort Key: tenk1_1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(8 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -457,18 +448,17 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".unique1
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Index Only Scan using tenk1_unique1 on tenk1
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                           Filter: (unique2 <> 10)
-(9 rows)
+         Sort Key: tenk1.unique1
+         ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Sort
+         Sort Key: tenk1_1.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+               Filter: (unique2 <> 10)
+(8 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -528,15 +518,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -546,15 +533,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -606,17 +590,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -626,17 +609,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -669,17 +651,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -689,17 +670,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -777,17 +757,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -797,17 +776,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -970,15 +948,12 @@ set enable_sort = false;
 -- no means to disable Unique.
 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union all select from generate_series(1,3);
 --
@@ -1015,15 +990,12 @@ select from generate_series(1,5) union select from generate_series(1,3);

 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  SetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union select from generate_series(1,3);
 --
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 100afe40e1..1ef6492ef8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2608,6 +2608,8 @@ SetOpCmd
 SetOpPath
 SetOpState
 SetOpStatePerGroup
+SetOpStatePerGroupData
+SetOpStatePerInput
 SetOpStrategy
 SetOperation
 SetOperationStmt
--
2.43.5

From c4006d23afc02dbd49cc235918ee2403a8e7962a Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 13 Nov 2024 17:33:00 -0500
Subject: [PATCH v1 2/3] Remove some dead code in prepunion.c.

Now that we no longer use flag columns for SetOp, remove the
code that supported that.

This also means that there are no resjunk output columns anymore
in a set-operations plan tree.  So we can get rid of some code
that supported that, too.

It would probably be possible now to do what's speculated about in
the comments, and get rid of this code's use of transiently-created
targetlists in favor of dealing only in pathtargets (save for
generating root->processed_tlist at the very end).  However, that
would involve a good deal of notational churn and the benefit
would only be to save a little planning time.  So I'll leave that
project for some other day.
---
 src/backend/optimizer/prep/prepunion.c | 96 ++++----------------------
 1 file changed, 14 insertions(+), 82 deletions(-)
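
For readers who want the idea behind dropping the flag column: once the two
inputs arrive separately, a sorted-mode set operation can tell which side a
row came from by which input it read it from, so no marker column is needed.
The following is only a minimal standalone sketch of that idea over sorted
int arrays. It is not the executor code from this patchset; the names
group_len and sorted_intersect are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustration only (hypothetical helper, not PostgreSQL code):
 * count the rows in a sorted input that share the value at *pos,
 * and advance *pos past that group.
 */
static size_t
group_len(const int *rows, size_t nrows, size_t *pos)
{
    size_t      start = *pos;
    int         key;

    assert(start < nrows);
    key = rows[start];
    while (*pos < nrows && rows[*pos] == key)
        (*pos)++;
    return *pos - start;
}

/*
 * Illustration only: INTERSECT (all == 0) or INTERSECT ALL (all != 0)
 * over two sorted int arrays read as separate inputs.  The caller must
 * provide room in "out" for at least nleft entries.  Returns the number
 * of rows written.
 */
static size_t
sorted_intersect(const int *left, size_t nleft,
                 const int *right, size_t nright,
                 int all, int *out)
{
    size_t      l = 0;
    size_t      r = 0;
    size_t      nout = 0;

    while (l < nleft && r < nright)
    {
        if (left[l] < right[r])
            (void) group_len(left, nleft, &l);      /* group only on the left */
        else if (left[l] > right[r])
            (void) group_len(right, nright, &r);    /* group only on the right */
        else
        {
            /* Same group on both sides: the per-side counts decide the output */
            int         key = left[l];
            size_t      nl = group_len(left, nleft, &l);
            size_t      nr = group_len(right, nright, &r);
            size_t      emit = all ? (nl < nr ? nl : nr) : 1;

            while (emit-- > 0)
                out[nout++] = key;
        }
    }
    return nout;
}
```

With left = {1, 1, 2, 3, 3} and right = {1, 3, 3, 4}, this sketch yields
{1, 3} for INTERSECT and {1, 3, 3} for INTERSECT ALL.  EXCEPT would follow
the same pattern, emitting left-only groups instead of skipping them.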

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 8c48109bd4..5013e0d45d 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -40,8 +40,7 @@

 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
                                           List *colTypes, List *colCollations,
-                                          bool junkOK,
-                                          int flag, List *refnames_tlist,
+                                          List *refnames_tlist,
                                           List **pTargetList,
                                           bool *istrivial_tlist);
 static RelOptInfo *generate_recursion_path(SetOperationStmt *setOp,
@@ -69,14 +68,12 @@ static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
-                                  int flag,
                                   Index varno,
                                   bool hack_constants,
                                   List *input_tlist,
                                   List *refnames_tlist,
                                   bool *trivial_tlist);
 static List *generate_append_tlist(List *colTypes, List *colCollations,
-                                   bool flag,
                                    List *input_tlists,
                                    List *refnames_tlist);
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
@@ -160,12 +157,10 @@ plan_set_operations(PlannerInfo *root)
         /*
          * Recurse on setOperations tree to generate paths for set ops. The
          * final output paths should have just the column types shown as the
-         * output from the top-level node, plus possibly resjunk working
-         * columns (we can rely on upper-level nodes to deal with that).
+         * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
                                            topop->colTypes, topop->colCollations,
-                                           true, -1,
                                            leftmostQuery->targetList,
                                            &top_tlist,
                                            &trivial_tlist);
@@ -208,8 +203,6 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * junkOK: if true, child resjunk columns may be left in the result
- * flag: if >= 0, add a resjunk output column indicating value of flag
  * refnames_tlist: targetlist to take column names from
  *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
@@ -220,7 +213,8 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
- * Pending a redesign of how that works, this is the easy way out.
+ * XXX Now that there are no flag columns and hence no resjunk columns, we
+ * could probably refactor this file to deal only in pathtargets.
  *
  * We don't have to care about typmods here: the only allowed difference
  * between set-op input and output typmods is input is a specific typmod
@@ -229,8 +223,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
                        List *colTypes, List *colCollations,
-                       bool junkOK,
-                       int flag, List *refnames_tlist,
+                       List *refnames_tlist,
                        List **pTargetList,
                        bool *istrivial_tlist)
 {
@@ -279,7 +272,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,

         /* Figure out the appropriate target list for this subquery. */
         tlist = generate_setop_tlist(colTypes, colCollations,
-                                     flag,
                                      rtr->rtindex,
                                      true,
                                      subroot->processed_tlist,
@@ -318,16 +310,14 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * generate_append_tlist() or generate_setop_tlist(), this will work.
          * We just tell generate_setop_tlist() to use varno 0.
          */
-        if (flag >= 0 ||
-            !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
-            !tlist_same_collations(*pTargetList, colCollations, junkOK))
+        if (!tlist_same_datatypes(*pTargetList, colTypes, false) ||
+            !tlist_same_collations(*pTargetList, colCollations, false))
         {
             PathTarget *target;
             bool        trivial_tlist;
             ListCell   *lc;

             *pTargetList = generate_setop_tlist(colTypes, colCollations,
-                                                flag,
                                                 0,
                                                 false,
                                                 *pTargetList,
@@ -411,7 +401,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      */
     lrel = recurse_set_operations(setOp->larg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -423,7 +412,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -436,7 +424,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /*
      * Generate tlist for RecursiveUnion path node --- same as in Append cases
      */
-    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations, false,
+    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations,
                                   list_make2(lpath_tlist, rpath_tlist),
                                   refnames_tlist);

@@ -736,7 +724,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, false,
+    tlist = generate_append_tlist(op->colTypes, op->colCollations,
                                   tlist_list, refnames_tlist);
     *pTargetList = tlist;

@@ -1048,7 +1036,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,7 +1047,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1109,7 +1095,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
                                  0, false, lpath_tlist, refnames_tlist,
                                  &result_trivial_tlist);

@@ -1258,18 +1244,10 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
-         *
-         * Note we disallow any resjunk columns in child results.  This is
-         * necessary since the Append node that implements the union won't do
-         * any projection, and upper levels will get confused if some of our
-         * output tuples have junk and some don't.  This case only arises when
-         * we have an EXCEPT or INTERSECT as child, else there won't be
-         * resjunk anyway.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
-                                                        false, -1,
                                                         refnames_tlist,
                                                         &child_tlist,
                                                         &trivial_tlist));
@@ -1412,7 +1390,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: -1 if no flag column needed, 0 or 1 to create a const flag column
  * varno: varno to use in generated Vars
  * hack_constants: true to copy up constants (see comments in code)
  * input_tlist: targetlist of this node's input node
@@ -1421,7 +1398,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  */
 static List *
 generate_setop_tlist(List *colTypes, List *colCollations,
-                     int flag,
                      Index varno,
                      bool hack_constants,
                      List *input_tlist,
@@ -1520,7 +1496,7 @@ generate_setop_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1529,25 +1505,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag >= 0)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is the given constant */
-        expr = (Node *) makeConst(INT4OID,
-                                  -1,
-                                  InvalidOid,
-                                  sizeof(int32),
-                                  Int32GetDatum(flag),
-                                  false,
-                                  true);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-        *trivial_tlist = false; /* the extra entry makes it not trivial */
-    }
-
     return tlist;
 }

@@ -1556,7 +1513,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: true to create a flag column copied up from subplans
  * input_tlists: list of tlists for sub-plans of the Append
  * refnames_tlist: targetlist to take column names from
  *
@@ -1570,7 +1526,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  */
 static List *
 generate_append_tlist(List *colTypes, List *colCollations,
-                      bool flag,
                       List *input_tlists,
                       List *refnames_tlist)
 {
@@ -1604,8 +1559,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
         {
             TargetEntry *subtle = (TargetEntry *) lfirst(subtlistl);

-            if (subtle->resjunk)
-                continue;
+            Assert(!subtle->resjunk);
             Assert(curColType != NULL);
             if (exprType((Node *) subtle->expr) == lfirst_oid(curColType))
             {
@@ -1654,7 +1608,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1663,23 +1617,6 @@ generate_append_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is shown as copied up from subplan */
-        expr = (Node *) makeVar(0,
-                                resno,
-                                INT4OID,
-                                -1,
-                                InvalidOid,
-                                0);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-    }
-
     pfree(colTypmods);

     return tlist;
@@ -1709,12 +1646,7 @@ generate_setop_grouplist(SetOperationStmt *op, List *targetlist)
         TargetEntry *tle = (TargetEntry *) lfirst(lt);
         SortGroupClause *sgc;

-        if (tle->resjunk)
-        {
-            /* resjunk columns should not have sortgrouprefs */
-            Assert(tle->ressortgroupref == 0);
-            continue;            /* ignore resjunk columns */
-        }
+        Assert(!tle->resjunk);

         /* non-resjunk columns should have sortgroupref = resno */
         Assert(tle->ressortgroupref == tle->resno);
--
2.43.5

From f997d4e4dd4cbd681a7d96f2613ce0188d6f3bfe Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 13 Nov 2024 19:07:17 -0500
Subject: [PATCH v1 3/3] Get rid of choose_hashed_setop().

This logic is a hangover from days when we could only generate
one plan per setop node.  Now, it's better to generate multiple
Paths and let add_path choose the cheapest.  The main advantage
is not having to keep choose_hashed_setop in perfect sync with
cost estimates made elsewhere --- and indeed, it looks like it
was out of sync already.  Not to mention that it doesn't account
for presorted input, which doesn't matter too much right now
but will soon.

This does require fixing create_setop_path to generate good cost
estimates.  That was another area where we'd not maintained things
well, because it wasn't distinguishing sorted from hashed costs at
all.  With the formulas I chose, the total cost attributed to the node
is actually the same for sorted and hashed modes --- but the startup
cost is wildly different, because hashed mode does basically all the
work before returning anything.

I also modified the existing logic that disables hashed setops
so that it works by incrementing path->disabled_nodes rather
than not generating the path at all.  That seems more in keeping
with the methods established by commit e22253467.  (But should
we put that logic in create_setop_path, rather than its caller?)

This results in two changes in plans that are visible in the
regression tests.  The one in union.sql seems fine: it's now
preferring sorted over hashed mode for a zero-column INTERSECT,
since no sort is really required.  The one in subselect.sql
is trying to test nodeSetOp's rescan logic, so I extended it
to test both sorted and hashed modes.
---
 src/backend/optimizer/prep/prepunion.c  | 226 +++++++++---------------
 src/backend/optimizer/util/pathnode.c   |  50 +++++-
 src/test/regress/expected/subselect.out |  48 ++++-
 src/test/regress/expected/union.out     |   2 +-
 src/test/regress/sql/subselect.sql      |  32 +++-
 5 files changed, 200 insertions(+), 158 deletions(-)
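
A minimal sketch of the cost shape described in the commit message above,
to make it concrete why the two strategies tie on total cost but differ on
startup cost.  SketchCost, SketchStrategy and sketch_setop_cost are
hypothetical names used only for illustration; the real logic is in
create_setop_path and works on Path nodes.  The sketch assumes the left
and right costs passed in already include any Sort that sorted mode needs,
so in practice the inputs (and hence the totals) differ between the modes.

```c
#include <assert.h>

/* Hypothetical stand-in for the cost fields of a planner Path. */
typedef struct SketchCost
{
    double      startup_cost;   /* cost before the first row is returned */
    double      total_cost;     /* cost to return all rows */
    double      rows;           /* estimated row count */
} SketchCost;

typedef enum
{
    SKETCH_SORTED, SKETCH_HASHED
} SketchStrategy;

/*
 * Mirror the shape of the formulas in the patch: one cpu_operator_cost per
 * grouping column per input tuple, plus one per output tuple.
 */
static SketchCost
sketch_setop_cost(SketchStrategy strategy, SketchCost left, SketchCost right,
                  int num_cols, double output_rows, double cpu_operator_cost)
{
    SketchCost  result;
    double      cmp_cost;

    assert(num_cols >= 0);
    cmp_cost = cpu_operator_cost * (left.rows + right.rows) * num_cols;

    if (strategy == SKETCH_SORTED)
    {
        /* Sorted mode can emit output incrementally. */
        result.startup_cost = left.startup_cost + right.startup_cost;
        result.total_cost = left.total_cost + right.total_cost + cmp_cost;
    }
    else
    {
        /* Hashed mode must read all input before emitting anything. */
        result.startup_cost = left.total_cost + right.total_cost + cmp_cost;
        result.total_cost = result.startup_cost;
    }

    /* Small per-output-tuple charge, the same in both modes. */
    result.total_cost += cpu_operator_cost * output_rows;
    result.rows = output_rows;
    return result;
}
```

Given identical inputs, both strategies report the same total cost, so the
startup cost is what separates them when a LIMIT or similar makes a
fast-start plan attractive.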

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5013e0d45d..6f385a2aba 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -63,10 +63,6 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **tlist_list,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
-static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *lpath, Path *rpath,
-                                double dNumGroups, double dNumOutputRows,
-                                const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
                                   Index varno,
                                   bool hack_constants,
@@ -1025,7 +1021,8 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
-    bool        use_hash;
+    bool        can_sort;
+    bool        can_hash;
     SetOpCmd    cmd;

     /*
@@ -1130,15 +1127,76 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }
+    result_rel->rows = dNumOutputRows;
+
+    /* Select the SetOpCmd type */
+    switch (op->op)
+    {
+        case SETOP_INTERSECT:
+            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
+            break;
+        case SETOP_EXCEPT:
+            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
+            break;
+        default:
+            elog(ERROR, "unrecognized set op: %d", (int) op->op);
+            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
+            break;
+    }
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));

     /*
-     * Decide whether to hash or sort, and add sort nodes if needed.
+     * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
-    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
-                                   dNumGroups, dNumOutputRows,
-                                   (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");
+    if (can_hash)
+    {
+        Size        hash_mem_limit = get_hash_memory_limit();
+        Size        hashentrysize;
+
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_HASHED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);

-    if (groupList && !use_hash)
+        /*
+         * Mark the path as disabled if enable_hashagg is off.  While this
+         * isn't exactly a HashAgg node, it seems close enough to justify
+         * letting that switch control it.
+         */
+        if (!enable_hashagg)
+            path->disabled_nodes++;
+
+        /*
+         * Also disable if it doesn't look like the hashtable will fit into
+         * hash_mem.
+         */
+        hashentrysize = MAXALIGN(lpath->pathtarget->width) +
+            MAXALIGN(SizeofMinimalTupleHeader);
+        if (hashentrysize * dNumGroups > hash_mem_limit)
+            path->disabled_nodes++;
+
+        add_path(result_rel, path);
+    }
+
+    /*
+     * If we can sort, sort the inputs if needed before adding SetOp.
+     */
+    if (can_sort)
     {
         List       *pathkeys;

@@ -1160,36 +1218,19 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                               rpath,
                                               pathkeys,
                                               -1.0);
-    }

-    /*
-     * Finally, add a SetOp path node to generate the correct output.
-     */
-    switch (op->op)
-    {
-        case SETOP_INTERSECT:
-            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
-            break;
-        case SETOP_EXCEPT:
-            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
-            break;
-        default:
-            elog(ERROR, "unrecognized set op: %d", (int) op->op);
-            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
-            break;
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_SORTED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);
+        add_path(result_rel, path);
     }
-    path = (Path *) create_setop_path(root,
-                                      result_rel,
-                                      lpath,
-                                      rpath,
-                                      cmd,
-                                      use_hash ? SETOP_HASHED : SETOP_SORTED,
-                                      groupList,
-                                      dNumGroups,
-                                      dNumOutputRows);
-
-    result_rel->rows = path->rows;
-    add_path(result_rel, path);
+
     return result_rel;
 }

@@ -1276,115 +1317,6 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)
     set_cheapest(rel);
 }

-/*
- * choose_hashed_setop - should we use hashing for a set operation?
- *
- * XXX probably this should go away: just make both paths and let
- * add_path sort it out.
- */
-static bool
-choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *lpath, Path *rpath,
-                    double dNumGroups, double dNumOutputRows,
-                    const char *construct)
-{
-    int            numGroupCols = list_length(groupClauses);
-    Size        hash_mem_limit = get_hash_memory_limit();
-    bool        can_sort;
-    bool        can_hash;
-    Size        hashentrysize;
-    Path        hashed_p;
-    Path        sorted_p;
-    double        tuple_fraction;
-
-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupClauses);
-    can_hash = grouping_is_hashable(groupClauses);
-    if (can_hash && can_sort)
-    {
-        /* we have a meaningful choice to make, continue ... */
-    }
-    else if (can_hash)
-        return true;
-    else if (can_sort)
-        return false;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is UNION, INTERSECT, or EXCEPT */
-                 errmsg("could not implement %s", construct),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
-    /* Prefer sorting when enable_hashagg is off */
-    if (!enable_hashagg)
-        return false;
-
-    /*
-     * Don't do it if it doesn't look like the hashtable will fit into
-     * hash_mem.
-     */
-    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
-
-    if (hashentrysize * dNumGroups > hash_mem_limit)
-        return false;
-
-    /*
-     * See if the estimated cost is no more than doing it the other way.
-     *
-     * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group. XXX NOT TRUE: Note that the actual result plan might involve a
-     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
-     * and Group should be close enough for our purposes here.
-     *
-     * These path variables are dummies that just hold cost fields; we don't
-     * make actual Paths for these steps.
-     */
-    cost_agg(&hashed_p, root, AGG_HASHED, NULL,
-             numGroupCols, dNumGroups,
-             NIL,
-             lpath->disabled_nodes + rpath->disabled_nodes,
-             lpath->startup_cost + rpath->startup_cost,
-             lpath->total_cost + rpath->total_cost,
-             lpath->rows + rpath->rows,
-             lpath->pathtarget->width);
-
-    /*
-     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
-     * unsorted, since it was made by appending unrelated sub-relations
-     * together.
-     */
-    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
-    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
-    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
-    /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-    cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
-              sorted_p.total_cost,
-              lpath->rows + rpath->rows,
-              lpath->pathtarget->width,
-              0.0, work_mem, -1.0);
-    cost_group(&sorted_p, root, numGroupCols, dNumGroups,
-               NIL,
-               sorted_p.disabled_nodes,
-               sorted_p.startup_cost, sorted_p.total_cost,
-               lpath->rows + rpath->rows);
-
-    /*
-     * Now make the decision using the top-level tuple fraction.  First we
-     * have to convert an absolute count (LIMIT) into fractional form.
-     */
-    tuple_fraction = root->tuple_fraction;
-    if (tuple_fraction >= 1.0)
-        tuple_fraction /= dNumOutputRows;
-
-    if (compare_fractional_path_costs(&hashed_p, &sorted_p,
-                                      tuple_fraction) < 0)
-    {
-        /* Hashed is cheaper, so use it */
-        return true;
-    }
-    return false;
-}
-
 /*
  * Generate targetlist for a set-operation plan node
  *
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e52e4b1d67..d7422f2dd8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3681,17 +3681,51 @@ create_setop_path(PlannerInfo *root,
     pathnode->numGroups = numGroups;

     /*
-     * Charge one cpu_operator_cost per comparison per input tuple. We assume
-     * all columns get compared at most of the tuples.
-     *
-     * XXX all wrong for hashing
+     * Compute cost estimates.  As things stand, we end up with the same total
+     * cost in this node for sort and hash methods, but different startup
+     * costs.  This could be refined perhaps, but it'll do for now.
      */
     pathnode->path.disabled_nodes =
         leftpath->disabled_nodes + rightpath->disabled_nodes;
-    pathnode->path.startup_cost =
-        leftpath->startup_cost + rightpath->startup_cost;
-    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
-        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+    if (strategy == SETOP_SORTED)
+    {
+        /*
+         * In sorted mode, we can emit output incrementally.  Charge one
+         * cpu_operator_cost per comparison per input tuple.  Like cost_group,
+         * we assume all columns get compared at most of the tuples.
+         */
+        pathnode->path.startup_cost =
+            leftpath->startup_cost + rightpath->startup_cost;
+        pathnode->path.total_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
+    else
+    {
+        /*
+         * In hashed mode, we must read all the input before we can emit
+         * anything.  Also charge comparison costs to represent the cost of
+         * hash table lookups.
+         */
+        pathnode->path.startup_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+        pathnode->path.total_cost = pathnode->path.startup_cost;
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 154fffabae..6f2926c45a 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1221,8 +1221,10 @@ where o.ten = 0;
 (1 row)

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -1256,6 +1258,50 @@ where o.ten = 1;
    100
 (1 row)

+rollback;
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+                                                                                                     QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  SetOp Except
+               ->  Sort
+                     Sort Key: i1.unique1, i1.unique2, i1.two, i1.four, i1.ten, i1.twenty, i1.hundred, i1.thousand, i1.twothousand, i1.fivethous, i1.tenthous, i1.odd, i1.even, i1.stringu1, i1.stringu2, i1.string4
+                     ->  Index Scan using onek_unique1 on onek i1
+                           Index Cond: (unique1 = o.unique1)
+               ->  Sort
+                     Sort Key: i2.unique1, i2.unique2, i2.two, i2.four, i2.ten, i2.twenty, i2.hundred, i2.thousand, i2.twothousand, i2.fivethous, i2.tenthous, i2.odd, i2.even, i2.stringu1, i2.stringu2, i2.string4
+                     ->  Index Scan using onek_unique1 on onek i2
+                           Index Cond: (unique1 = o.unique2)
+(13 rows)
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+ count
+-------
+   100
+(1 row)
+
+rollback;
 --
 -- Test rescan of a RecursiveUnion node
 --
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index fbf24490e9..9474b01510 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -950,7 +950,7 @@ explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
                         QUERY PLAN
 ----------------------------------------------------------
- HashSetOp Intersect
+ SetOp Intersect
    ->  Function Scan on generate_series
    ->  Function Scan on generate_series generate_series_1
 (3 rows)
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index af6e157aca..11becd885c 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -638,8 +638,11 @@ select sum(ss.tst::int) from
 where o.ten = 0;

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
+
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -657,6 +660,33 @@ select count(*) from
   ) ss
 where o.ten = 1;

+rollback;
+
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+rollback;
+
 --
 -- Test rescan of a RecursiveUnion node
 --
--
2.43.5

Re: Converting SetOp to read its two inputs separately

From

Richard Guo

Date:

14 November 2024, 12:28:17

On Thu, Nov 14, 2024 at 11:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Aside from that minor TODO, the main thing that's left undone in this
> patch series is to persuade the thing to exploit presorted input
> paths.  Right now it fails to do so, as can be seen in some of the
> regression test cases, eg
>
> regression=# set enable_hashagg = 0;
> SET
> regression=# explain (costs off) select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
>                             QUERY PLAN
> ------------------------------------------------------------------
>  SetOp Except
>    ->  Sort
>          Sort Key: tenk1.unique1
>          ->  Index Only Scan using tenk1_unique1 on tenk1
>    ->  Sort
>          Sort Key: tenk1_1.unique2
>          ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
>                Filter: (unique2 <> 10)
> (8 rows)
>
> Obviously the sorts are unnecessary, but it's not seeing that.
> I suppose this requires integrating generate_nonunion_paths with
> the logic from commit 66c0185a3.  I tried to make sense of that,
> but failed --- either I don't understand it, or there are a
> number of things wrong with it.  I'd welcome some help with
> getting that done.

I think we may need to do the following to make this work:

1. We need to teach set_operation_ordered_results_useful() that sorted
input paths are also useful for INTERSECT/EXCEPT, so that we can have
setop_pathkeys set for the subqueries.

2. In generate_nonunion_paths(), we need to provide a valid
"interesting_pathkeys" when calling build_setop_child_paths().

Thanks
Richard

Re: Converting SetOp to read its two inputs separately

From

Richard Guo

Date:

14 November 2024, 12:32:57

On Thu, Nov 14, 2024 at 6:28 PM Richard Guo <guofenglinux@gmail.com> wrote:
> 1. We need to teach set_operation_ordered_results_useful() that sorted
> input paths are also useful for INTERSECT/EXCEPT, so that we can have
> setop_pathkeys set for the subqueries.

BTW, I noticed a typo in the comment for function
set_operation_ordered_results_useful().

  * set_operation_ordered_results_useful
  *     Return true if the given SetOperationStmt can be executed by utilizing
  *     paths that provide sorted input according to the setop's targetlist.
- *     Returns false when sorted paths are not any more useful then unsorted
+ *     Returns false when sorted paths are not any more useful than unsorted
  *     ones.

Thanks
Richard

Re: Converting SetOp to read its two inputs separately

From

Tom Lane

Date:

20 November 2024, 05:09:36

Richard Guo <guofenglinux@gmail.com> writes:
> On Thu, Nov 14, 2024 at 11:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Aside from that minor TODO, the main thing that's left undone in this
>> patch series is to persuade the thing to exploit presorted input
>> paths.

> I think we may need to do the following to make this work:
> 1. We need to teach set_operation_ordered_results_useful() that sorted
> input paths are also useful for INTERSECT/EXCEPT, so that we can have
> setop_pathkeys set for the subqueries.
> 2. In generate_nonunion_paths(), we need to provide a valid
> "interesting_pathkeys" when calling build_setop_child_paths().

Once I'd wrapped my head around how things are done now (which the
comments in prepunion.c were remarkably unhelpful about), I saw that
most of the problem for #2 just requires re-ordering things that
generate_nonunion_paths was already doing.  As for #1, I have a modest
proposal: we should get rid of set_operation_ordered_results_useful
entirely.  It's not the code that actually does useful work, and
keeping it in sync with the code that does do useful work is hard and
unnecessary.

0001-0003 below are the same as before (so the slot-munging TODO is
still there).  0004 fixes a rather basic bug for nested set-operations
and gets rid of set_operation_ordered_results_useful along the way.
Then 0005 does your step 2.

            regards, tom lane

From 801f2d63c221b0b01c228158d5b79db9a1edc629 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:26:26 -0500
Subject: [PATCH v2 1/5] Convert SetOp to read its inputs as outerPlan and
 innerPlan.

The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.

This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure how to quantify how
many more comparisons we might need.

For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.
TODO: the tuple hash table logic spits up if I don't force tuples from
the second input into minimal-tuple form.  I don't understand why that
is, but it seems like it shouldn't be necessary.

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.
---
 src/backend/executor/nodeSetOp.c        | 531 ++++++++++++++----------
 src/backend/optimizer/README            |   2 +-
 src/backend/optimizer/plan/createplan.c |  81 ++--
 src/backend/optimizer/prep/prepunion.c  | 156 ++++---
 src/backend/optimizer/util/pathnode.c   |  50 ++-
 src/include/nodes/execnodes.h           |  31 +-
 src/include/nodes/pathnodes.h           |   9 +-
 src/include/nodes/plannodes.h           |  19 +-
 src/include/optimizer/pathnode.h        |   7 +-
 src/test/regress/expected/subselect.out |  20 +-
 src/test/regress/expected/union.out     | 226 +++++-----
 src/tools/pgindent/typedefs.list        |   2 +
 12 files changed, 614 insertions(+), 520 deletions(-)
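
The two execution modes described in the commit message above can be
sketched on plain integer inputs.  This is only an illustration under
stated assumptions, not the nodeSetOp.c implementation: every name here
(SketchCmd, SketchGroup, sketch_output_count, sketch_setop_sorted,
sketch_lookup, sketch_setop_hashed) is hypothetical, rows are single ints
compared with < and == instead of SortSupport, and the hash table is a
toy linear-probing array rather than the executor's tuple hash table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum
{
    CMD_INTERSECT, CMD_INTERSECT_ALL, CMD_EXCEPT, CMD_EXCEPT_ALL
} SketchCmd;

/* Hypothetical per-group counters, in the spirit of SetOpStatePerGroupData. */
typedef struct SketchGroup
{
    int         key;
    long        num_left;       /* left-input duplicates; 0 means unused slot */
    long        num_right;      /* right-input duplicates */
} SketchGroup;

/* How many copies of a group to emit, per the SQL rules for each command. */
static long
sketch_output_count(SketchCmd cmd, long num_left, long num_right)
{
    switch (cmd)
    {
        case CMD_INTERSECT:
            return (num_left > 0 && num_right > 0) ? 1 : 0;
        case CMD_INTERSECT_ALL:
            return (num_left < num_right) ? num_left : num_right;
        case CMD_EXCEPT:
            return (num_left > 0 && num_right == 0) ? 1 : 0;
        case CMD_EXCEPT_ALL:
            return (num_left > num_right) ? num_left - num_right : 0;
    }
    return 0;
}

/* Sorted mode: both inputs ascending; out needs room for nleft values. */
static size_t
sketch_setop_sorted(SketchCmd cmd, const int *left, size_t nleft,
                    const int *right, size_t nright, int *out)
{
    size_t      li = 0, ri = 0, nout = 0;

    while (li < nleft)          /* stop once the left input is exhausted */
    {
        int         key = left[li];
        long        num_left = 0, num_right = 0;

        for (; li < nleft && left[li] == key; li++)
            num_left++;         /* form the left group */
        while (ri < nright && right[ri] < key)
            ri++;               /* right-only groups yield no output */
        for (; ri < nright && right[ri] == key; ri++)
            num_right++;        /* count the matching right group */
        for (long n = sketch_output_count(cmd, num_left, num_right); n > 0; n--)
            out[nout++] = key;
    }
    return nout;
}

/* Linear-probe lookup; returns the slot for key or the first free slot. */
static SketchGroup *
sketch_lookup(SketchGroup *table, size_t cap, int key)
{
    size_t      pos = (size_t) (unsigned int) key % cap;

    while (table[pos].num_left > 0 && table[pos].key != key)
        pos = (pos + 1) % cap;
    return &table[pos];
}

/* Hashed mode: inputs in any order; out needs room for nleft values. */
static size_t
sketch_setop_hashed(SketchCmd cmd, const int *left, size_t nleft,
                    const int *right, size_t nright, int *out)
{
    size_t      cap = 2 * nleft + 1;    /* always leaves a free slot */
    size_t      nout = 0;
    SketchGroup *table;

    if (nleft == 0)
        return 0;               /* empty left input: skip the right side */
    table = calloc(cap, sizeof(SketchGroup));
    assert(table != NULL);
    for (size_t i = 0; i < nleft; i++)  /* first loop: build groups */
    {
        SketchGroup *g = sketch_lookup(table, cap, left[i]);

        g->key = left[i];
        g->num_left++;
    }
    for (size_t i = 0; i < nright; i++) /* second loop: count matches only */
    {
        SketchGroup *g = sketch_lookup(table, cap, right[i]);

        if (g->num_left > 0)
            g->num_right++;
    }
    for (size_t pos = 0; pos < cap; pos++)
        for (long n = sketch_output_count(cmd, table[pos].num_left,
                                          table[pos].num_right); n > 0; n--)
            out[nout++] = table[pos].key;
    free(table);
    return nout;
}
```

Both functions return the number of values written to out.  The sorted
variant emits them in input order, while the hashed variant emits them in
table order, so callers should not rely on the ordering of hashed output.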

diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index a8ac68b482..5644739d65 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -3,29 +3,30 @@
  * nodeSetOp.c
  *      Routines to handle INTERSECT and EXCEPT selection
  *
- * The input of a SetOp node consists of tuples from two relations,
- * which have been combined into one dataset, with a junk attribute added
- * that shows which relation each tuple came from.  In SETOP_SORTED mode,
- * the input has furthermore been sorted according to all the grouping
- * columns (ie, all the non-junk attributes).  The SetOp node scans each
- * group of identical tuples to determine how many came from each input
- * relation.  Then it is a simple matter to emit the output demanded by the
- * SQL spec for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ * The input of a SetOp node consists of two relations (outer and inner)
+ * with identical column sets.  In EXCEPT queries the outer relation is
+ * always the left side, while in INTERSECT cases the planner tries to
+ * make the outer relation be the smaller of the two inputs.
  *
- * In SETOP_HASHED mode, the input is delivered in no particular order,
- * except that we know all the tuples from one input relation will come before
- * all the tuples of the other.  The planner guarantees that the first input
- * relation is the left-hand one for EXCEPT, and tries to make the smaller
- * input relation come first for INTERSECT.  We build a hash table in memory
- * with one entry for each group of identical tuples, and count the number of
- * tuples in the group from each relation.  After seeing all the input, we
- * scan the hashtable and generate the correct output using those counts.
- * We can avoid making hashtable entries for any tuples appearing only in the
- * second input relation, since they cannot result in any output.
+ * In SETOP_SORTED mode, each input has been sorted according to all the
+ * grouping columns (ie, all the non-junk attributes).  The SetOp node
+ * essentially performs a merge join on the grouping columns, except that
+ * it is only interested in counting how many tuples from each input match.
+ * Then it is a simple matter to emit the output demanded by the SQL spec
+ * for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ *
+ * In SETOP_HASHED mode, the inputs are delivered in no particular order.
+ * We read the outer relation and build a hash table in memory with one entry
+ * for each group of identical tuples, counting the number of tuples in the
+ * group.  Then we read the inner relation and count the number of tuples
+ * matching each outer group.  (We can disregard any tuples appearing only
+ * in the inner relation, since they cannot result in any output.)  After
+ * seeing all the input, we scan the hashtable and generate the correct
+ * output using those counts.
  *
  * This node type is not used for UNION or UNION ALL, since those can be
- * implemented more cheaply (there's no need for the junk attribute to
- * identify the source relation).
+ * implemented more cheaply (there's no need to count the number of
+ * matching tuples).
  *
  * Note that SetOp does no qual checking nor projection.  The delivered
  * output tuples are just copies of the first-to-arrive tuple in each
@@ -54,65 +55,28 @@
 /*
  * SetOpStatePerGroupData - per-group working state
  *
- * These values are working state that is initialized at the start of
- * an input tuple group and updated for each input tuple.
- *
- * In SETOP_SORTED mode, we need only one of these structs, and it's kept in
- * the plan state node.  In SETOP_HASHED mode, the hash table contains one
- * of these for each tuple group.
+ * In SETOP_SORTED mode, we need only one of these structs, and it's just a
+ * local in setop_retrieve_sorted.  In SETOP_HASHED mode, the hash table
+ * contains one of these for each tuple group.
  */
 typedef struct SetOpStatePerGroupData
 {
-    long        numLeft;        /* number of left-input dups in group */
-    long        numRight;        /* number of right-input dups in group */
-}            SetOpStatePerGroupData;
+    int64        numLeft;        /* number of left-input dups in group */
+    int64        numRight;        /* number of right-input dups in group */
+} SetOpStatePerGroupData;
+
+typedef SetOpStatePerGroupData *SetOpStatePerGroup;


-static TupleTableSlot *setop_retrieve_direct(SetOpState *setopstate);
+static TupleTableSlot *setop_retrieve_sorted(SetOpState *setopstate);
+static void setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                             SetOpState *setopstate);
+static int    setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                                SetOpState *setopstate);
 static void setop_fill_hash_table(SetOpState *setopstate);
 static TupleTableSlot *setop_retrieve_hash_table(SetOpState *setopstate);


-/*
- * Initialize state for a new group of input values.
- */
-static inline void
-initialize_counts(SetOpStatePerGroup pergroup)
-{
-    pergroup->numLeft = pergroup->numRight = 0;
-}
-
-/*
- * Advance the appropriate counter for one input tuple.
- */
-static inline void
-advance_counts(SetOpStatePerGroup pergroup, int flag)
-{
-    if (flag)
-        pergroup->numRight++;
-    else
-        pergroup->numLeft++;
-}
-
-/*
- * Fetch the "flag" column from an input tuple.
- * This is an integer column with value 0 for left side, 1 for right side.
- */
-static int
-fetch_tuple_flag(SetOpState *setopstate, TupleTableSlot *inputslot)
-{
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
-    int            flag;
-    bool        isNull;
-
-    flag = DatumGetInt32(slot_getattr(inputslot,
-                                      node->flagColIdx,
-                                      &isNull));
-    Assert(!isNull);
-    Assert(flag == 0 || flag == 1);
-    return flag;
-}
-
 /*
  * Initialize the hash table to empty.
  */
@@ -129,10 +93,10 @@ build_hash_table(SetOpState *setopstate)
     setopstate->hashtable = BuildTupleHashTableExt(&setopstate->ps,
                                                    desc,
                                                    node->numCols,
-                                                   node->dupColIdx,
+                                                   node->cmpColIdx,
                                                    setopstate->eqfuncoids,
                                                    setopstate->hashfunctions,
-                                                   node->dupCollations,
+                                                   node->cmpCollations,
                                                    node->numGroups,
                                                    0,
                                                    setopstate->ps.state->es_query_cxt,
@@ -217,108 +181,126 @@ ExecSetOp(PlanState *pstate)
         return setop_retrieve_hash_table(node);
     }
     else
-        return setop_retrieve_direct(node);
+        return setop_retrieve_sorted(node);
 }

 /*
  * ExecSetOp for non-hashed case
  */
 static TupleTableSlot *
-setop_retrieve_direct(SetOpState *setopstate)
+setop_retrieve_sorted(SetOpState *setopstate)
 {
     PlanState  *outerPlan;
-    SetOpStatePerGroup pergroup;
-    TupleTableSlot *outerslot;
+    PlanState  *innerPlan;
     TupleTableSlot *resultTupleSlot;
-    ExprContext *econtext = setopstate->ps.ps_ExprContext;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    pergroup = (SetOpStatePerGroup) setopstate->pergroup;
+    innerPlan = innerPlanState(setopstate);
     resultTupleSlot = setopstate->ps.ps_ResultTupleSlot;

     /*
-     * We loop retrieving groups until we find one we should return
+     * If first time through, establish the invariant that setop_load_group
+     * expects: each side's nextTupleSlot is the next output from the child
+     * plan, or empty if there is no more output from it.
      */
-    while (!setopstate->setop_done)
+    if (setopstate->need_init)
     {
+        setopstate->need_init = false;
+
+        setopstate->leftInput.nextTupleSlot = ExecProcNode(outerPlan);
+
         /*
-         * If we don't already have the first tuple of the new group, fetch it
-         * from the outer plan.
+         * If the outer relation is empty, then we will emit nothing, and we
+         * don't need to read the inner relation at all.
          */
-        if (setopstate->grp_firstTuple == NULL)
+        if (TupIsNull(setopstate->leftInput.nextTupleSlot))
         {
-            outerslot = ExecProcNode(outerPlan);
-            if (!TupIsNull(outerslot))
-            {
-                /* Make a copy of the first input tuple */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-            }
-            else
-            {
-                /* outer plan produced no tuples at all */
-                setopstate->setop_done = true;
-                return NULL;
-            }
+            setopstate->setop_done = true;
+            return NULL;
         }

+        setopstate->rightInput.nextTupleSlot = ExecProcNode(innerPlan);
+
+        /* Set flags that we've not completed either side's group */
+        setopstate->leftInput.needGroup = true;
+        setopstate->rightInput.needGroup = true;
+    }
+
+    /*
+     * We loop retrieving groups until we find one we should return
+     */
+    while (!setopstate->setop_done)
+    {
+        int            cmpresult;
+        SetOpStatePerGroupData pergroup;
+
         /*
-         * Store the copied first input tuple in the tuple table slot reserved
-         * for it.  The tuple will be deleted when it is cleared from the
-         * slot.
+         * Fetch the rest of the current outer group, if we didn't already.
          */
-        ExecStoreHeapTuple(setopstate->grp_firstTuple,
-                           resultTupleSlot,
-                           true);
-        setopstate->grp_firstTuple = NULL;    /* don't keep two pointers */
+        if (setopstate->leftInput.needGroup)
+            setop_load_group(&setopstate->leftInput, outerPlan, setopstate);

-        /* Initialize working state for a new input tuple group */
-        initialize_counts(pergroup);
+        /*
+         * If no more outer groups, we're done, and don't need to look at any
+         * more of the inner relation.
+         */
+        if (setopstate->leftInput.numTuples == 0)
+        {
+            setopstate->setop_done = true;
+            break;
+        }

-        /* Count the first input tuple */
-        advance_counts(pergroup,
-                       fetch_tuple_flag(setopstate, resultTupleSlot));
+        /*
+         * Fetch the rest of the current inner group, if we didn't already.
+         */
+        if (setopstate->rightInput.needGroup)
+            setop_load_group(&setopstate->rightInput, innerPlan, setopstate);

         /*
-         * Scan the outer plan until we exhaust it or cross a group boundary.
+         * Determine whether we have matching groups on both sides (this is
+         * basically like the core logic of a merge join).
          */
-        for (;;)
-        {
-            outerslot = ExecProcNode(outerPlan);
-            if (TupIsNull(outerslot))
-            {
-                /* no more outer-plan tuples available */
-                setopstate->setop_done = true;
-                break;
-            }
-
-            /*
-             * Check whether we've crossed a group boundary.
-             */
-            econtext->ecxt_outertuple = resultTupleSlot;
-            econtext->ecxt_innertuple = outerslot;
-
-            if (!ExecQualAndReset(setopstate->eqfunction, econtext))
-            {
-                /*
-                 * Save the first input tuple of the next group.
-                 */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-                break;
-            }
+        if (setopstate->rightInput.numTuples == 0)
+            cmpresult = -1;        /* as though left input is lesser */
+        else
+            cmpresult = setop_compare_slots(setopstate->leftInput.firstTupleSlot,
+                                            setopstate->rightInput.firstTupleSlot,
+                                            setopstate);

-            /* Still in same group, so count this tuple */
-            advance_counts(pergroup,
-                           fetch_tuple_flag(setopstate, outerslot));
+        if (cmpresult < 0)
+        {
+            /* Left group is first, has no right matches */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = 0;
+            /* We'll need another left group next time */
+            setopstate->leftInput.needGroup = true;
+        }
+        else if (cmpresult == 0)
+        {
+            /* We have matching groups */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = setopstate->rightInput.numTuples;
+            /* We'll need to read from both sides next time */
+            setopstate->leftInput.needGroup = true;
+            setopstate->rightInput.needGroup = true;
+        }
+        else
+        {
+            /* Right group has no left matches, so we can ignore it */
+            setopstate->rightInput.needGroup = true;
+            continue;
         }

         /*
-         * Done scanning input tuple group.  See if we should emit any copies
-         * of result tuple, and if so return the first copy.
+         * Done scanning these input tuple groups.  See if we should emit any
+         * copies of result tuple, and if so return the first copy.  (Note
+         * that the result tuple is the same as the left input's firstTuple
+         * slot.)
          */
-        set_output_count(setopstate, pergroup);
+        set_output_count(setopstate, &pergroup);

         if (setopstate->numOutput > 0)
         {
@@ -333,84 +315,173 @@ setop_retrieve_direct(SetOpState *setopstate)
 }

 /*
- * ExecSetOp for hashed case: phase 1, read input and build hash table
+ * Load next group of tuples from one child plan or the other.
+ *
+ * On entry, we've already read the first tuple of the next group
+ * (if there is one) into input->nextTupleSlot.  This invariant
+ * is maintained on exit.
+ */
+static void
+setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                 SetOpState *setopstate)
+{
+    input->needGroup = false;
+
+    /* If we've exhausted this child plan, report an empty group */
+    if (TupIsNull(input->nextTupleSlot))
+    {
+        ExecClearTuple(input->firstTupleSlot);
+        input->numTuples = 0;
+        return;
+    }
+
+    /* Make a local copy of the first tuple for comparisons */
+    ExecStoreHeapTuple(ExecCopySlotHeapTuple(input->nextTupleSlot),
+                       input->firstTupleSlot,
+                       true);
+    /* and count it */
+    input->numTuples = 1;
+
+    /* Scan till we find the end-of-group */
+    for (;;)
+    {
+        int            cmpresult;
+
+        /* Get next input tuple, if there is one */
+        input->nextTupleSlot = ExecProcNode(inputPlan);
+        if (TupIsNull(input->nextTupleSlot))
+            break;
+
+        /* There is; does it belong to same group as firstTuple? */
+        cmpresult = setop_compare_slots(input->firstTupleSlot,
+                                        input->nextTupleSlot,
+                                        setopstate);
+        Assert(cmpresult <= 0); /* else input is mis-sorted */
+        if (cmpresult != 0)
+            break;
+
+        /* Still in same group, so count this tuple */
+        input->numTuples++;
+    }
+}
+
+/*
+ * Compare the tuples in the two given slots.
+ */
+static int
+setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                    SetOpState *setopstate)
+{
+    for (int nkey = 0; nkey < setopstate->numCols; nkey++)
+    {
+        SortSupport sortKey = setopstate->sortKeys + nkey;
+        AttrNumber    attno = sortKey->ssup_attno;
+        Datum        datum1,
+                    datum2;
+        bool        isNull1,
+                    isNull2;
+        int            compare;
+
+        datum1 = slot_getattr(s1, attno, &isNull1);
+        datum2 = slot_getattr(s2, attno, &isNull2);
+
+        compare = ApplySortComparator(datum1, isNull1,
+                                      datum2, isNull2,
+                                      sortKey);
+        if (compare != 0)
+            return compare;
+    }
+    return 0;
+}
+
+/*
+ * ExecSetOp for hashed case: phase 1, read inputs and build hash table
  */
 static void
 setop_fill_hash_table(SetOpState *setopstate)
 {
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
     PlanState  *outerPlan;
-    int            firstFlag;
-    bool        in_first_rel PG_USED_FOR_ASSERTS_ONLY;
+    PlanState  *innerPlan;
     ExprContext *econtext = setopstate->ps.ps_ExprContext;
+    bool        have_tuples = false;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    firstFlag = node->firstFlag;
-    /* verify planner didn't mess up */
-    Assert(firstFlag == 0 ||
-           (firstFlag == 1 &&
-            (node->cmd == SETOPCMD_INTERSECT ||
-             node->cmd == SETOPCMD_INTERSECT_ALL)));
+    innerPlan = innerPlanState(setopstate);

     /*
      * Process each outer-plan tuple, and then fetch the next one, until we
      * exhaust the outer plan.
      */
-    in_first_rel = true;
     for (;;)
     {
         TupleTableSlot *outerslot;
-        int            flag;
         TupleHashEntryData *entry;
         bool        isnew;

         outerslot = ExecProcNode(outerPlan);
         if (TupIsNull(outerslot))
             break;
+        have_tuples = true;

-        /* Identify whether it's left or right input */
-        flag = fetch_tuple_flag(setopstate, outerslot);
+        /* Find or build hashtable entry for this tuple's group */
+        entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+                                     &isnew, NULL);

-        if (flag == firstFlag)
+        /* If new tuple group, initialize counts to zero */
+        if (isnew)
         {
-            /* (still) in first input relation */
-            Assert(in_first_rel);
-
-            /* Find or build hashtable entry for this tuple's group */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
-                                         &isnew, NULL);
-
-            /* If new tuple group, initialize counts */
-            if (isnew)
-            {
-                entry->additional = (SetOpStatePerGroup)
-                    MemoryContextAlloc(setopstate->hashtable->tablecxt,
+            entry->additional = (SetOpStatePerGroup)
+                MemoryContextAllocZero(setopstate->hashtable->tablecxt,
                                        sizeof(SetOpStatePerGroupData));
-                initialize_counts((SetOpStatePerGroup) entry->additional);
-            }
-
-            /* Advance the counts */
-            advance_counts((SetOpStatePerGroup) entry->additional, flag);
         }
-        else
+
+        /* Advance the counts */
+        ((SetOpStatePerGroup) entry->additional)->numLeft++;
+
+        /* Must reset expression context after each hashtable lookup */
+        ResetExprContext(econtext);
+    }
+
+    /*
+     * If the outer relation is empty, then we will emit nothing, and we don't
+     * need to read the inner relation at all.
+     */
+    if (have_tuples)
+    {
+        /*
+         * Process each inner-plan tuple, and then fetch the next one, until
+         * we exhaust the inner plan.
+         */
+        for (;;)
         {
-            /* reached second relation */
-            in_first_rel = false;
+            TupleTableSlot *innerslot;
+            TupleHashEntryData *entry;
+
+            innerslot = ExecProcNode(innerPlan);
+            if (TupIsNull(innerslot))
+                break;
+
+            /* XXX hack: force the tuple into minimal form */
+            /* XXX should not have to do this */
+            ExecForceStoreHeapTuple(ExecCopySlotHeapTuple(innerslot),
+                                    setopstate->ps.ps_ResultTupleSlot,
+                                    true);
+            innerslot = setopstate->ps.ps_ResultTupleSlot;

             /* For tuples not seen previously, do not make hashtable entry */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+            entry = LookupTupleHashEntry(setopstate->hashtable, innerslot,
                                          NULL, NULL);

             /* Advance the counts if entry is already present */
             if (entry)
-                advance_counts((SetOpStatePerGroup) entry->additional, flag);
-        }
+                ((SetOpStatePerGroup) entry->additional)->numRight++;

-        /* Must reset expression context after each hashtable lookup */
-        ResetExprContext(econtext);
+            /* Must reset expression context after each hashtable lookup */
+            ResetExprContext(econtext);
+        }
     }

     setopstate->table_filled = true;
@@ -481,7 +552,6 @@ SetOpState *
 ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 {
     SetOpState *setopstate;
-    TupleDesc    outerDesc;

     /* check for unsupported flags */
     Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -494,14 +564,10 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     setopstate->ps.state = estate;
     setopstate->ps.ExecProcNode = ExecSetOp;

-    setopstate->eqfuncoids = NULL;
-    setopstate->hashfunctions = NULL;
     setopstate->setop_done = false;
     setopstate->numOutput = 0;
-    setopstate->pergroup = NULL;
-    setopstate->grp_firstTuple = NULL;
-    setopstate->hashtable = NULL;
-    setopstate->tableContext = NULL;
+    setopstate->numCols = node->numCols;
+    setopstate->need_init = true;

     /*
      * create expression context
@@ -522,52 +588,74 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * initialize child nodes
      *
-     * If we are hashing then the child plan does not need to handle REWIND
+     * If we are hashing then the child plans do not need to handle REWIND
      * efficiently; see ExecReScanSetOp.
      */
     if (node->strategy == SETOP_HASHED)
         eflags &= ~EXEC_FLAG_REWIND;
     outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
-    outerDesc = ExecGetResultType(outerPlanState(setopstate));
+    innerPlanState(setopstate) = ExecInitNode(innerPlan(node), estate, eflags);

     /*
-     * Initialize result slot and type. Setop nodes do no projections, so
-     * initialize projection info for this node appropriately.
+     * Initialize locally-allocated slots.  In hashed mode, we just need a
+     * result slot.  In sorted mode, we need one first-tuple-of-group slot for
+     * each input; we use the result slot for the left input's slot and create
+     * another for the right input.  (Note: the nextTupleSlot slots are not
+     * ours, but just point to the last slot returned by the input plan node.)
      */
     ExecInitResultTupleSlotTL(&setopstate->ps,
                               node->strategy == SETOP_HASHED ?
                               &TTSOpsMinimalTuple : &TTSOpsHeapTuple);
+    if (node->strategy != SETOP_HASHED)
+    {
+        setopstate->leftInput.firstTupleSlot =
+            setopstate->ps.ps_ResultTupleSlot;
+        setopstate->rightInput.firstTupleSlot =
+            ExecInitExtraTupleSlot(estate,
+                                   setopstate->ps.ps_ResultTupleDesc,
+                                   &TTSOpsHeapTuple);
+    }
+
+    /* Setop nodes do no projections. */
     setopstate->ps.ps_ProjInfo = NULL;

     /*
-     * Precompute fmgr lookup data for inner loop. We need both equality and
-     * hashing functions to do it by hashing, but only equality if not
-     * hashing.
+     * Precompute fmgr lookup data for inner loop.  We need equality and
+     * hashing functions to do it by hashing, while for sorting we need
+     * SortSupport data.
      */
     if (node->strategy == SETOP_HASHED)
         execTuplesHashPrepare(node->numCols,
-                              node->dupOperators,
+                              node->cmpOperators,
                               &setopstate->eqfuncoids,
                               &setopstate->hashfunctions);
     else
-        setopstate->eqfunction =
-            execTuplesMatchPrepare(outerDesc,
-                                   node->numCols,
-                                   node->dupColIdx,
-                                   node->dupOperators,
-                                   node->dupCollations,
-                                   &setopstate->ps);
+    {
+        int            nkeys = node->numCols;
+
+        setopstate->sortKeys = (SortSupport)
+            palloc0(nkeys * sizeof(SortSupportData));
+        for (int i = 0; i < nkeys; i++)
+        {
+            SortSupport sortKey = setopstate->sortKeys + i;
+
+            sortKey->ssup_cxt = CurrentMemoryContext;
+            sortKey->ssup_collation = node->cmpCollations[i];
+            sortKey->ssup_nulls_first = node->cmpNullsFirst[i];
+            sortKey->ssup_attno = node->cmpColIdx[i];
+            /* abbreviated key conversion is not useful here */
+            sortKey->abbreviate = false;

+            PrepareSortSupportFromOrderingOp(node->cmpOperators[i], sortKey);
+        }
+    }
+
+    /* Create a hash table if needed */
     if (node->strategy == SETOP_HASHED)
     {
         build_hash_table(setopstate);
         setopstate->table_filled = false;
     }
-    else
-    {
-        setopstate->pergroup =
-            (SetOpStatePerGroup) palloc0(sizeof(SetOpStatePerGroupData));
-    }

     return setopstate;
 }
@@ -575,7 +663,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 /* ----------------------------------------------------------------
  *        ExecEndSetOp
  *
- *        This shuts down the subplan and frees resources allocated
+ *        This shuts down the subplans and frees resources allocated
  *        to this node.
  * ----------------------------------------------------------------
  */
@@ -587,6 +675,7 @@ ExecEndSetOp(SetOpState *node)
         MemoryContextDelete(node->tableContext);

     ExecEndNode(outerPlanState(node));
+    ExecEndNode(innerPlanState(node));
 }


@@ -594,6 +683,7 @@ void
 ExecReScanSetOp(SetOpState *node)
 {
     PlanState  *outerPlan = outerPlanState(node);
+    PlanState  *innerPlan = innerPlanState(node);

     ExecClearTuple(node->ps.ps_ResultTupleSlot);
     node->setop_done = false;
@@ -611,34 +701,29 @@ ExecReScanSetOp(SetOpState *node)
             return;

         /*
-         * If we do have the hash table and the subplan does not have any
+         * If we do have the hash table and the subplans do not have any
          * parameter changes, then we can just rescan the existing hash table;
          * no need to build it again.
          */
-        if (outerPlan->chgParam == NULL)
+        if (outerPlan->chgParam == NULL && innerPlan->chgParam == NULL)
         {
             ResetTupleHashIterator(node->hashtable, &node->hashiter);
             return;
         }
-    }
-
-    /* Release first tuple of group, if we have made a copy */
-    if (node->grp_firstTuple != NULL)
-    {
-        heap_freetuple(node->grp_firstTuple);
-        node->grp_firstTuple = NULL;
-    }

-    /* Release any hashtable storage */
-    if (node->tableContext)
-        MemoryContextReset(node->tableContext);
+        /* Release any hashtable storage */
+        if (node->tableContext)
+            MemoryContextReset(node->tableContext);

-    /* And rebuild empty hashtable if needed */
-    if (((SetOp *) node->ps.plan)->strategy == SETOP_HASHED)
-    {
+        /* And rebuild an empty hashtable */
         ResetTupleHashTable(node->hashtable);
         node->table_filled = false;
     }
+    else
+    {
+        /* Need to re-read first input from each side */
+        node->need_init = true;
+    }

     /*
      * if chgParam of subnode is not null then plan will be re-scanned by
@@ -646,4 +731,6 @@ ExecReScanSetOp(SetOpState *node)
      */
     if (outerPlan->chgParam == NULL)
         ExecReScan(outerPlan);
+    if (innerPlan->chgParam == NULL)
+        ExecReScan(innerPlan);
 }
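
[Illustrative note, not part of the patch.]  The SETOP_SORTED logic in the
nodeSetOp.c changes above may be easier to follow in a stripped-down form.
The sketch below is a standalone toy under stated assumptions: plain sorted
int arrays stand in for the two child plans, and every name in it (ToyInput,
toy_load_group, toy_output_count, toy_setop_sorted) is hypothetical and does
not exist in PostgreSQL.  The output rule in toy_output_count is not taken
from the patch (set_output_count is not shown in these hunks); it simply
follows the usual SQL semantics for the four commands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the four SetOp commands. */
typedef enum { CMD_INTERSECT, CMD_INTERSECT_ALL, CMD_EXCEPT, CMD_EXCEPT_ALL } ToyCmd;

/* One sorted input, plus the state of its current group. */
typedef struct ToyInput
{
    const int  *vals;           /* sorted input values */
    size_t      n;              /* number of values */
    size_t      next;           /* index of first value of the next group */
    int         first;          /* representative value of current group */
    int64_t     numTuples;      /* size of current group; 0 = exhausted */
} ToyInput;

/* Load the next group of equal values, counting its members. */
void
toy_load_group(ToyInput *in)
{
    if (in->next >= in->n)
    {
        in->numTuples = 0;      /* report an empty group */
        return;
    }
    in->first = in->vals[in->next++];
    in->numTuples = 1;
    while (in->next < in->n)
    {
        assert(in->vals[in->next] >= in->first);    /* else mis-sorted */
        if (in->vals[in->next] != in->first)
            break;
        in->next++;
        in->numTuples++;
    }
}

/* How many copies of a group to emit, per the usual SQL semantics. */
int64_t
toy_output_count(ToyCmd cmd, int64_t numLeft, int64_t numRight)
{
    switch (cmd)
    {
        case CMD_INTERSECT:
            return (numLeft > 0 && numRight > 0) ? 1 : 0;
        case CMD_INTERSECT_ALL:
            return (numLeft < numRight) ? numLeft : numRight;
        case CMD_EXCEPT:
            return (numLeft > 0 && numRight == 0) ? 1 : 0;
        case CMD_EXCEPT_ALL:
            return (numLeft > numRight) ? numLeft - numRight : 0;
    }
    return 0;
}

/*
 * Merge-style pass over both inputs.  "out" must have room for nl values,
 * since no command can emit more rows than the left input holds.
 * Returns the number of values written.
 */
size_t
toy_setop_sorted(ToyCmd cmd, const int *l, size_t nl,
                 const int *r, size_t nr, int *out)
{
    ToyInput    left = {l, nl, 0, 0, 0};
    ToyInput    right = {r, nr, 0, 0, 0};
    int         needLeft = 1;
    int         needRight = 1;
    size_t      nout = 0;

    for (;;)
    {
        int64_t     numLeft;
        int64_t     numRight;

        if (needLeft)
        {
            toy_load_group(&left);
            needLeft = 0;
        }
        if (left.numTuples == 0)
            break;              /* no more left groups: nothing more to emit */
        if (needRight)
        {
            toy_load_group(&right);
            needRight = 0;
        }

        if (right.numTuples == 0 || left.first < right.first)
        {
            /* left group has no right matches */
            numLeft = left.numTuples;
            numRight = 0;
            needLeft = 1;
        }
        else if (left.first == right.first)
        {
            /* matching groups: consume both */
            numLeft = left.numTuples;
            numRight = right.numTuples;
            needLeft = needRight = 1;
        }
        else
        {
            /* right group has no left matches: skip it */
            needRight = 1;
            continue;
        }

        for (int64_t k = toy_output_count(cmd, numLeft, numRight); k > 0; k--)
            out[nout++] = left.first;
    }
    return nout;
}
```

As in the patch, the loop stops as soon as the left input runs out, so the
tail of the right input is never read; and a right-side group that sorts
before the current left group is discarded without producing output.
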
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 2ab4f3dbf3..f341d9f303 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -649,7 +649,7 @@ RelOptInfo      - a relation or joined relations
   GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
   MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
   WindowAggPath - a WindowAgg plan node applied to some sub-path
-  SetOpPath     - a SetOp plan node applied to some sub-path
+  SetOpPath     - a SetOp plan node applied to two sub-paths
   RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
   LockRowsPath  - a LockRows plan node applied to some sub-path
   ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f2ed0d81f6..07976c5938 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -301,9 +301,9 @@ static Unique *make_unique_from_pathkeys(Plan *lefttree,
                                          List *pathkeys, int numCols);
 static Gather *make_gather(List *qptlist, List *qpqual,
                            int nworkers, int rescan_param, bool single_copy, Plan *subplan);
-static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-                         List *distinctList, AttrNumber flagColIdx, int firstFlag,
-                         long numGroups);
+static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+                         List *tlist, Plan *lefttree, Plan *righttree,
+                         List *groupList, long numGroups);
 static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
 static Result *make_result(List *tlist, Node *resconstantqual, Plan *subplan);
 static ProjectSet *make_project_set(List *tlist, Plan *subplan);
@@ -2719,25 +2719,29 @@ static SetOp *
 create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 {
     SetOp       *plan;
-    Plan       *subplan;
+    List       *tlist = build_path_tlist(root, &best_path->path);
+    Plan       *leftplan;
+    Plan       *rightplan;
     long        numGroups;

     /*
      * SetOp doesn't project, so tlist requirements pass through; moreover we
      * need grouping columns to be labeled.
      */
-    subplan = create_plan_recurse(root, best_path->subpath,
-                                  flags | CP_LABEL_TLIST);
+    leftplan = create_plan_recurse(root, best_path->leftpath,
+                                   flags | CP_LABEL_TLIST);
+    rightplan = create_plan_recurse(root, best_path->rightpath,
+                                    flags | CP_LABEL_TLIST);

     /* Convert numGroups to long int --- but 'ware overflow! */
     numGroups = clamp_cardinality_to_long(best_path->numGroups);

     plan = make_setop(best_path->cmd,
                       best_path->strategy,
-                      subplan,
-                      best_path->distinctList,
-                      best_path->flagColIdx,
-                      best_path->firstFlag,
+                      tlist,
+                      leftplan,
+                      rightplan,
+                      best_path->groupList,
                       numGroups);

     copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6952,57 +6956,62 @@ make_gather(List *qptlist,
 }

 /*
- * distinctList is a list of SortGroupClauses, identifying the targetlist
- * items that should be considered by the SetOp filter.  The input path must
- * already be sorted accordingly.
+ * groupList is a list of SortGroupClauses, identifying the targetlist
+ * items that should be considered by the SetOp filter.  The input plans must
+ * already be sorted accordingly, if we're doing SETOP_SORTED mode.
  */
 static SetOp *
-make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-           List *distinctList, AttrNumber flagColIdx, int firstFlag,
-           long numGroups)
+make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+           List *tlist, Plan *lefttree, Plan *righttree,
+           List *groupList, long numGroups)
 {
     SetOp       *node = makeNode(SetOp);
     Plan       *plan = &node->plan;
-    int            numCols = list_length(distinctList);
+    int            numCols = list_length(groupList);
     int            keyno = 0;
-    AttrNumber *dupColIdx;
-    Oid           *dupOperators;
-    Oid           *dupCollations;
+    AttrNumber *cmpColIdx;
+    Oid           *cmpOperators;
+    Oid           *cmpCollations;
+    bool       *cmpNullsFirst;
     ListCell   *slitem;

-    plan->targetlist = lefttree->targetlist;
+    plan->targetlist = tlist;
     plan->qual = NIL;
     plan->lefttree = lefttree;
-    plan->righttree = NULL;
+    plan->righttree = righttree;

     /*
-     * convert SortGroupClause list into arrays of attr indexes and equality
+     * convert SortGroupClause list into arrays of attr indexes and comparison
      * operators, as wanted by executor
      */
-    dupColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
-    dupOperators = (Oid *) palloc(sizeof(Oid) * numCols);
-    dupCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
+    cmpOperators = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpNullsFirst = (bool *) palloc(sizeof(bool) * numCols);

-    foreach(slitem, distinctList)
+    foreach(slitem, groupList)
     {
         SortGroupClause *sortcl = (SortGroupClause *) lfirst(slitem);
         TargetEntry *tle = get_sortgroupclause_tle(sortcl, plan->targetlist);

-        dupColIdx[keyno] = tle->resno;
-        dupOperators[keyno] = sortcl->eqop;
-        dupCollations[keyno] = exprCollation((Node *) tle->expr);
-        Assert(OidIsValid(dupOperators[keyno]));
+        cmpColIdx[keyno] = tle->resno;
+        if (strategy == SETOP_HASHED)
+            cmpOperators[keyno] = sortcl->eqop;
+        else
+            cmpOperators[keyno] = sortcl->sortop;
+        Assert(OidIsValid(cmpOperators[keyno]));
+        cmpCollations[keyno] = exprCollation((Node *) tle->expr);
+        cmpNullsFirst[keyno] = sortcl->nulls_first;
         keyno++;
     }

     node->cmd = cmd;
     node->strategy = strategy;
     node->numCols = numCols;
-    node->dupColIdx = dupColIdx;
-    node->dupOperators = dupOperators;
-    node->dupCollations = dupCollations;
-    node->flagColIdx = flagColIdx;
-    node->firstFlag = firstFlag;
+    node->cmpColIdx = cmpColIdx;
+    node->cmpOperators = cmpOperators;
+    node->cmpCollations = cmpCollations;
+    node->cmpNullsFirst = cmpNullsFirst;
     node->numGroups = numGroups;

     return node;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f2b0bcfdf5..bd0ca000e8 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -65,7 +65,7 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
 static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *input_path,
+                                Path *lpath, Path *rpath,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
@@ -315,8 +315,8 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * to the corresponding tlist entries of the subplan. However, since
          * the subplan was generated by generate_union_paths() or
          * generate_nonunion_paths(), and hence its tlist was generated by
-         * generate_append_tlist(), this will work.  We just tell
-         * generate_setop_tlist() to use varno 0.
+         * generate_append_tlist() or generate_setop_tlist(), this will work.
+         * We just tell generate_setop_tlist() to use varno 0.
          */
         if (flag >= 0 ||
             !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
@@ -1028,29 +1028,27 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                *path;
     List       *lpath_tlist,
                *rpath_tlist,
-               *tlist_list,
                *tlist,
-               *groupList,
-               *pathlist;
+               *groupList;
     bool        lpath_trivial_tlist,
-                rpath_trivial_tlist;
+                rpath_trivial_tlist,
+                result_trivial_tlist;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
     bool        use_hash;
     SetOpCmd    cmd;
-    int            firstFlag;

     /*
      * Tell children to fetch all tuples.
      */
     root->tuple_fraction = 0.0;

-    /* Recurse on children, ensuring their outputs are marked */
+    /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 0,
+                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,10 +1058,9 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dLeftGroups = lrel->rows;

-    lpath = lrel->cheapest_total_path;
     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 1,
+                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1073,41 +1070,51 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dRightGroups = rrel->rows;

-    rpath = rrel->cheapest_total_path;
-
     /* Undo effects of forcing tuple_fraction to 0 */
     root->tuple_fraction = save_fraction;

     /*
      * For EXCEPT, we must put the left input first.  For INTERSECT, either
      * order should give the same results, and we prefer to put the smaller
-     * input first in order to minimize the size of the hash table in the
-     * hashing case.  "Smaller" means the one with the fewer groups.
+     * input first in order to (a) minimize the size of the hash table in the
+     * hashing case, and (b) improve our chances of exploiting the executor's
+     * fast path for empty left-hand input.  "Smaller" means the one with the
+     * fewer groups.
      */
-    if (op->op == SETOP_EXCEPT || dLeftGroups <= dRightGroups)
-    {
-        pathlist = list_make2(lpath, rpath);
-        tlist_list = list_make2(lpath_tlist, rpath_tlist);
-        firstFlag = 0;
-    }
-    else
+    if (op->op != SETOP_EXCEPT && dLeftGroups > dRightGroups)
     {
-        pathlist = list_make2(rpath, lpath);
-        tlist_list = list_make2(rpath_tlist, lpath_tlist);
-        firstFlag = 1;
+        /* need to swap the two inputs */
+        RelOptInfo *tmprel;
+        List       *tmplist;
+        double        tmpd;
+
+        tmprel = lrel;
+        lrel = rrel;
+        rrel = tmprel;
+        tmplist = lpath_tlist;
+        lpath_tlist = rpath_tlist;
+        rpath_tlist = tmplist;
+        tmpd = dLeftGroups;
+        dLeftGroups = dRightGroups;
+        dRightGroups = tmpd;
     }

+    lpath = lrel->cheapest_total_path;
+    rpath = rrel->cheapest_total_path;
+
     /*
-     * Generate tlist for Append plan node.
+     * Generate tlist for SetOp plan node.
      *
-     * The tlist for an Append plan isn't important as far as the Append is
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
      * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.  In fact, it has to be real enough that the flag
-     * column is shown as a variable not a constant, else setrefs.c will get
-     * confused.
+     * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, true,
-                                  tlist_list, refnames_tlist);
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);

     *pTargetList = tlist;

@@ -1116,12 +1123,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /*
-     * Append the child results together.
-     */
-    path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
-                                       NIL, NULL, 0, false, -1);
-
     /* Identify the grouping semantics */
     groupList = generate_setop_grouplist(op, tlist);

@@ -1140,25 +1141,40 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     else
     {
-        dNumGroups = Min(dLeftGroups, dRightGroups);
+        dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }

     /*
-     * Decide whether to hash or sort, and add a sort node if needed.
+     * Decide whether to hash or sort, and add sort nodes if needed.
      */
-    use_hash = choose_hashed_setop(root, groupList, path,
+    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
                                    dNumGroups, dNumOutputRows,
                                    (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");

     if (groupList && !use_hash)
-        path = (Path *) create_sort_path(root,
-                                         result_rel,
-                                         path,
-                                         make_pathkeys_for_sortclauses(root,
-                                                                       groupList,
-                                                                       tlist),
-                                         -1.0);
+    {
+        List       *pathkeys;
+
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 lpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            lpath = (Path *) create_sort_path(root,
+                                              lpath->parent,
+                                              lpath,
+                                              pathkeys,
+                                              -1.0);
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 rpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            rpath = (Path *) create_sort_path(root,
+                                              rpath->parent,
+                                              rpath,
+                                              pathkeys,
+                                              -1.0);
+    }

     /*
      * Finally, add a SetOp path node to generate the correct output.
@@ -1178,12 +1194,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     path = (Path *) create_setop_path(root,
                                       result_rel,
-                                      path,
+                                      lpath,
+                                      rpath,
                                       cmd,
                                       use_hash ? SETOP_HASHED : SETOP_SORTED,
                                       groupList,
-                                      list_length(op->colTypes) + 1,
-                                      use_hash ? firstFlag : -1,
                                       dNumGroups,
                                       dNumOutputRows);

@@ -1285,10 +1300,13 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)

 /*
  * choose_hashed_setop - should we use hashing for a set operation?
+ *
+ * XXX probably this should go away: just make both paths and let
+ * add_path sort it out.
  */
 static bool
 choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *input_path,
+                    Path *lpath, Path *rpath,
                     double dNumGroups, double dNumOutputRows,
                     const char *construct)
 {
@@ -1327,7 +1345,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * Don't do it if it doesn't look like the hashtable will fit into
      * hash_mem.
      */
-    hashentrysize = MAXALIGN(input_path->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
+    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);

     if (hashentrysize * dNumGroups > hash_mem_limit)
         return false;
@@ -1336,9 +1354,9 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * See if the estimated cost is no more than doing it the other way.
      *
      * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group.  Note that the actual result plan might involve a SetOp or
-     * Unique node, not Agg or Group, but the cost estimates for Agg and Group
-     * should be close enough for our purposes here.
+     * group. XXX NOT TRUE: Note that the actual result plan might involve a
+     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
+     * and Group should be close enough for our purposes here.
      *
      * These path variables are dummies that just hold cost fields; we don't
      * make actual Paths for these steps.
@@ -1346,27 +1364,31 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
     cost_agg(&hashed_p, root, AGG_HASHED, NULL,
              numGroupCols, dNumGroups,
              NIL,
-             input_path->disabled_nodes,
-             input_path->startup_cost, input_path->total_cost,
-             input_path->rows, input_path->pathtarget->width);
+             lpath->disabled_nodes + rpath->disabled_nodes,
+             lpath->startup_cost + rpath->startup_cost,
+             lpath->total_cost + rpath->total_cost,
+             lpath->rows + rpath->rows,
+             lpath->pathtarget->width);

     /*
-     * Now for the sorted case.  Note that the input is *always* unsorted,
-     * since it was made by appending unrelated sub-relations together.
+     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
+     * unsorted, since it was made by appending unrelated sub-relations
+     * together.
      */
-    sorted_p.disabled_nodes = input_path->disabled_nodes;
-    sorted_p.startup_cost = input_path->startup_cost;
-    sorted_p.total_cost = input_path->total_cost;
+    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
+    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
+    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
     /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
     cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
               sorted_p.total_cost,
-              input_path->rows, input_path->pathtarget->width,
+              lpath->rows + rpath->rows,
+              lpath->pathtarget->width,
               0.0, work_mem, -1.0);
     cost_group(&sorted_p, root, numGroupCols, dNumGroups,
                NIL,
                sorted_p.disabled_nodes,
                sorted_p.startup_cost, sorted_p.total_cost,
-               input_path->rows);
+               lpath->rows + rpath->rows);

     /*
      * Now make the decision using the top-level tuple fraction.  First we
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fc97bf6ee2..e52e4b1d67 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3634,25 +3634,26 @@ create_windowagg_path(PlannerInfo *root,
  *      Creates a pathnode that represents computation of INTERSECT or EXCEPT
  *
  * 'rel' is the parent relation associated with the result
- * 'subpath' is the path representing the source of data
+ * 'leftpath' is the path representing the left-hand source of data
+ * 'rightpath' is the path representing the right-hand source of data
  * 'cmd' is the specific semantics (INTERSECT or EXCEPT, with/without ALL)
  * 'strategy' is the implementation strategy (sorted or hashed)
- * 'distinctList' is a list of SortGroupClause's representing the grouping
- * 'flagColIdx' is the column number where the flag column will be, if any
- * 'firstFlag' is the flag value for the first input relation when hashing;
- *        or -1 when sorting
- * 'numGroups' is the estimated number of distinct groups
+ * 'groupList' is a list of SortGroupClause's representing the grouping
+ * 'numGroups' is the estimated number of distinct groups in left-hand input
  * 'outputRows' is the estimated number of output rows
+ *
+ * leftpath and rightpath must produce the same columns.  Moreover, if
+ * strategy is SETOP_SORTED, leftpath and rightpath must both be sorted
+ * by all the grouping columns.
  */
 SetOpPath *
 create_setop_path(PlannerInfo *root,
                   RelOptInfo *rel,
-                  Path *subpath,
+                  Path *leftpath,
+                  Path *rightpath,
                   SetOpCmd cmd,
                   SetOpStrategy strategy,
-                  List *distinctList,
-                  AttrNumber flagColIdx,
-                  int firstFlag,
+                  List *groupList,
                   double numGroups,
                   double outputRows)
 {
@@ -3660,34 +3661,37 @@ create_setop_path(PlannerInfo *root,

     pathnode->path.pathtype = T_SetOp;
     pathnode->path.parent = rel;
-    /* SetOp doesn't project, so use source path's pathtarget */
-    pathnode->path.pathtarget = subpath->pathtarget;
+    pathnode->path.pathtarget = rel->reltarget;
     /* For now, assume we are above any joins, so no parameterization */
     pathnode->path.param_info = NULL;
     pathnode->path.parallel_aware = false;
     pathnode->path.parallel_safe = rel->consider_parallel &&
-        subpath->parallel_safe;
-    pathnode->path.parallel_workers = subpath->parallel_workers;
+        leftpath->parallel_safe && rightpath->parallel_safe;
+    pathnode->path.parallel_workers =
+        leftpath->parallel_workers + rightpath->parallel_workers;
     /* SetOp preserves the input sort order if in sort mode */
     pathnode->path.pathkeys =
-        (strategy == SETOP_SORTED) ? subpath->pathkeys : NIL;
+        (strategy == SETOP_SORTED) ? leftpath->pathkeys : NIL;

-    pathnode->subpath = subpath;
+    pathnode->leftpath = leftpath;
+    pathnode->rightpath = rightpath;
     pathnode->cmd = cmd;
     pathnode->strategy = strategy;
-    pathnode->distinctList = distinctList;
-    pathnode->flagColIdx = flagColIdx;
-    pathnode->firstFlag = firstFlag;
+    pathnode->groupList = groupList;
     pathnode->numGroups = numGroups;

     /*
      * Charge one cpu_operator_cost per comparison per input tuple. We assume
      * all columns get compared at most of the tuples.
+     *
+     * XXX all wrong for hashing
      */
-    pathnode->path.disabled_nodes = subpath->disabled_nodes;
-    pathnode->path.startup_cost = subpath->startup_cost;
-    pathnode->path.total_cost = subpath->total_cost +
-        cpu_operator_cost * subpath->rows * list_length(distinctList);
+    pathnode->path.disabled_nodes =
+        leftpath->disabled_nodes + rightpath->disabled_nodes;
+    pathnode->path.startup_cost =
+        leftpath->startup_cost + rightpath->startup_cost;
+    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
+        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 182a6956bb..a730c6ec6d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2806,27 +2806,34 @@ typedef struct HashState
 /* ----------------
  *     SetOpState information
  *
- *        Even in "sorted" mode, SetOp nodes are more complex than a simple
- *        Unique, since we have to count how many duplicates to return.  But
- *        we also support hashing, so this is really more like a cut-down
- *        form of Agg.
+ *        SetOp nodes support either sorted or hashed de-duplication.
+ *        The sorted mode is a bit like MergeJoin, the hashed mode like Agg.
  * ----------------
  */
-/* this struct is private in nodeSetOp.c: */
-typedef struct SetOpStatePerGroupData *SetOpStatePerGroup;
+typedef struct SetOpStatePerInput
+{
+    TupleTableSlot *firstTupleSlot; /* first tuple of current group */
+    int64        numTuples;        /* number of tuples in current group */
+    TupleTableSlot *nextTupleSlot;    /* next input tuple, if already read */
+    bool        needGroup;        /* do we need to load a new group? */
+} SetOpStatePerInput;

 typedef struct SetOpState
 {
     PlanState    ps;                /* its first field is NodeTag */
-    ExprState  *eqfunction;        /* equality comparator */
-    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
-    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     bool        setop_done;        /* indicates completion of output scan */
-    long        numOutput;        /* number of dups left to output */
+    int64        numOutput;        /* number of dups left to output */
+    int            numCols;        /* number of grouping columns */
+
     /* these fields are used in SETOP_SORTED mode: */
-    SetOpStatePerGroup pergroup;    /* per-group working state */
-    HeapTuple    grp_firstTuple; /* copy of first tuple of current group */
+    SortSupport sortKeys;        /* per-grouping-field sort data */
+    SetOpStatePerInput leftInput;    /* current outer-relation input state */
+    SetOpStatePerInput rightInput;    /* current inner-relation input state */
+    bool        need_init;        /* have we read the first tuples yet? */
+
     /* these fields are used in SETOP_HASHED mode: */
+    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
+    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     TupleHashTable hashtable;    /* hash table with one entry per group */
     MemoryContext tableContext; /* memory context containing hash table */
     bool        table_filled;    /* hash table filled yet? */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index add0f9e45f..da3bdbab27 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2340,13 +2340,12 @@ typedef struct WindowAggPath
 typedef struct SetOpPath
 {
     Path        path;
-    Path       *subpath;        /* path representing input source */
+    Path       *leftpath;        /* paths representing input sources */
+    Path       *rightpath;
     SetOpCmd    cmd;            /* what to do, see nodes.h */
     SetOpStrategy strategy;        /* how to do it, see nodes.h */
-    List       *distinctList;    /* SortGroupClauses identifying target cols */
-    AttrNumber    flagColIdx;        /* where is the flag column, if any */
-    int            firstFlag;        /* flag value for first input relation */
-    Cardinality numGroups;        /* estimated number of groups in input */
+    List       *groupList;        /* SortGroupClauses identifying target cols */
+    Cardinality numGroups;        /* estimated number of groups in left input */
 } SetOpPath;

 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 52f29bcdb6..4633121689 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1225,23 +1225,20 @@ typedef struct SetOp
     /* how to do it, see nodes.h */
     SetOpStrategy strategy;

-    /* number of columns to check for duplicate-ness */
+    /* number of columns to compare */
     int            numCols;

     /* their indexes in the target list */
-    AttrNumber *dupColIdx pg_node_attr(array_size(numCols));
+    AttrNumber *cmpColIdx pg_node_attr(array_size(numCols));

-    /* equality operators to compare with */
-    Oid           *dupOperators pg_node_attr(array_size(numCols));
-    Oid           *dupCollations pg_node_attr(array_size(numCols));
-
-    /* where is the flag column, if any */
-    AttrNumber    flagColIdx;
+    /* comparison operators (either equality operators or sort operators) */
+    Oid           *cmpOperators pg_node_attr(array_size(numCols));
+    Oid           *cmpCollations pg_node_attr(array_size(numCols));

-    /* flag value for first input relation */
-    int            firstFlag;
+    /* nulls-first flags if sorting, otherwise not interesting */
+    bool       *cmpNullsFirst pg_node_attr(array_size(numCols));

-    /* estimated number of groups in input */
+    /* estimated number of groups in left input */
     long        numGroups;
 } SetOp;

diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 1035e6560c..5a6d0350c1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -260,12 +260,11 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
                                             bool topwindow);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
                                     RelOptInfo *rel,
-                                    Path *subpath,
+                                    Path *leftpath,
+                                    Path *rightpath,
                                     SetOpCmd cmd,
                                     SetOpStrategy strategy,
-                                    List *distinctList,
-                                    AttrNumber flagColIdx,
-                                    int firstFlag,
+                                    List *groupList,
                                     double numGroups,
                                     double outputRows);
 extern RecursiveUnionPath *create_recursiveunion_path(PlannerInfo *root,
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 2d35de3fad..154fffabae 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1231,22 +1231,18 @@ select count(*) from
     select * from onek i2 where i2.unique1 = o.unique2
   ) ss
 where o.ten = 1;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
                Filter: (ten = 1)
-         ->  Subquery Scan on ss
-               ->  HashSetOp Except
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Scan using onek_unique1 on onek i1
-                                       Index Cond: (unique1 = o.unique1)
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Index Scan using onek_unique1 on onek i2
-                                       Index Cond: (unique1 = o.unique2)
-(13 rows)
+         ->  HashSetOp Except
+               ->  Index Scan using onek_unique1 on onek i1
+                     Index Cond: (unique1 = o.unique1)
+               ->  Index Scan using onek_unique1 on onek i2
+                     Index Cond: (unique1 = o.unique2)
+(9 rows)

 select count(*) from
   onek o cross join lateral (
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index c73631a9a1..fbf24490e9 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -370,17 +370,13 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                     QUERY PLAN
-------------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  HashSetOp Intersect
-               ->  Append
-                     ->  Subquery Scan on "*SELECT* 2"
-                           ->  Seq Scan on tenk1
-                     ->  Subquery Scan on "*SELECT* 1"
-                           ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+   ->  HashSetOp Intersect
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(4 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -391,16 +387,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Index Only Scan using tenk1_unique1 on tenk1
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                     Filter: (unique2 <> 10)
-(7 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -434,19 +427,17 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                        QUERY PLAN
-------------------------------------------------------------------------------------------
+                               QUERY PLAN
+------------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  SetOp Intersect
-               ->  Sort
-                     Sort Key: "*SELECT* 2".fivethous
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Seq Scan on tenk1
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(10 rows)
+   ->  SetOp Intersect
+         ->  Sort
+               Sort Key: tenk1.fivethous
+               ->  Seq Scan on tenk1
+         ->  Sort
+               Sort Key: tenk1_1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(8 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -457,18 +448,17 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".unique1
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Index Only Scan using tenk1_unique1 on tenk1
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                           Filter: (unique2 <> 10)
-(9 rows)
+         Sort Key: tenk1.unique1
+         ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Sort
+         Sort Key: tenk1_1.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+               Filter: (unique2 <> 10)
+(8 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -528,15 +518,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -546,15 +533,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -606,17 +590,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -626,17 +609,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -669,17 +651,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -689,17 +670,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -777,17 +757,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -797,17 +776,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -970,15 +948,12 @@ set enable_sort = false;
 -- no means to disable Unique.
 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union all select from generate_series(1,3);
 --
@@ -1015,15 +990,12 @@ select from generate_series(1,5) union select from generate_series(1,3);

 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  SetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union select from generate_series(1,3);
 --
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 08521d51a9..e5588c9cd9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2610,6 +2610,8 @@ SetOpCmd
 SetOpPath
 SetOpState
 SetOpStatePerGroup
+SetOpStatePerGroupData
+SetOpStatePerInput
 SetOpStrategy
 SetOperation
 SetOperationStmt
--
2.43.5

From fd8bb0da496965ef5bb40f759fb35a550a62345f Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:30:36 -0500
Subject: [PATCH v2 2/5] Remove some dead code in prepunion.c.

Now that we no longer use flag columns for SetOp, remove the
code that supported that.

This also means that there are no resjunk output columns anymore
in a set-operations plan tree.  So we can get rid of some code
that supported that, too.

It would probably be possible now to do what's speculated about in
the comments, and get rid of this code's use of transiently-created
targetlists in favor of dealing only in pathtargets (save for
generating root->processed_tlist at the very end).  However, that
would involve a good deal of notational churn and the benefit
would only be to save a little planning time.  So I'll leave that
project for some other day.
---
 src/backend/optimizer/prep/prepunion.c | 96 ++++----------------------
 1 file changed, 14 insertions(+), 82 deletions(-)
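
Reviewer's note (not part of the patch): for readers wondering why the flag column is no longer needed, here is a minimal sketch of the idea. Once the node reads its two inputs separately, it can keep a per-group count of rows from each side and derive the number of output copies from those two counts. The enum, the function name and the plain integer arguments below are hypothetical stand-ins for illustration only; they are not the identifiers used in nodeSetOp.c.

```c
#include <assert.h>

/* Hypothetical stand-in for the executor's set-operation command enum. */
typedef enum
{
    SKETCH_INTERSECT,
    SKETCH_INTERSECT_ALL,
    SKETCH_EXCEPT,
    SKETCH_EXCEPT_ALL
} SketchSetOpCmd;

/*
 * Given how many rows of one group were seen in the left (outer) and
 * right (inner) inputs, return how many copies of the group to emit.
 * No flag column is involved: the caller knows which input each row
 * came from because it reads the two inputs separately.
 */
long
sketch_output_count(SketchSetOpCmd cmd, long numLeft, long numRight)
{
    assert(numLeft >= 0 && numRight >= 0);

    switch (cmd)
    {
        case SKETCH_INTERSECT:
            /* one copy if the group appears on both sides */
            return (numLeft > 0 && numRight > 0) ? 1 : 0;
        case SKETCH_INTERSECT_ALL:
            /* as many copies as the smaller side has */
            return (numLeft < numRight) ? numLeft : numRight;
        case SKETCH_EXCEPT:
            /* one copy if the group appears only on the left */
            return (numLeft > 0 && numRight == 0) ? 1 : 0;
        case SKETCH_EXCEPT_ALL:
            /* left copies not cancelled by right copies */
            return (numLeft > numRight) ? numLeft - numRight : 0;
    }
    return 0;                   /* keep compiler quiet */
}
```

For example, a group seen three times on the left and once on the right yields one row for INTERSECT ALL and two rows for EXCEPT ALL.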

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index bd0ca000e8..977bd546e0 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -40,8 +40,7 @@

 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
                                           List *colTypes, List *colCollations,
-                                          bool junkOK,
-                                          int flag, List *refnames_tlist,
+                                          List *refnames_tlist,
                                           List **pTargetList,
                                           bool *istrivial_tlist);
 static RelOptInfo *generate_recursion_path(SetOperationStmt *setOp,
@@ -69,14 +68,12 @@ static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
-                                  int flag,
                                   Index varno,
                                   bool hack_constants,
                                   List *input_tlist,
                                   List *refnames_tlist,
                                   bool *trivial_tlist);
 static List *generate_append_tlist(List *colTypes, List *colCollations,
-                                   bool flag,
                                    List *input_tlists,
                                    List *refnames_tlist);
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
@@ -160,12 +157,10 @@ plan_set_operations(PlannerInfo *root)
         /*
          * Recurse on setOperations tree to generate paths for set ops. The
          * final output paths should have just the column types shown as the
-         * output from the top-level node, plus possibly resjunk working
-         * columns (we can rely on upper-level nodes to deal with that).
+         * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
                                            topop->colTypes, topop->colCollations,
-                                           true, -1,
                                            leftmostQuery->targetList,
                                            &top_tlist,
                                            &trivial_tlist);
@@ -208,8 +203,6 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * junkOK: if true, child resjunk columns may be left in the result
- * flag: if >= 0, add a resjunk output column indicating value of flag
  * refnames_tlist: targetlist to take column names from
  *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
@@ -220,7 +213,8 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
- * Pending a redesign of how that works, this is the easy way out.
+ * XXX Now that there are no flag columns and hence no resjunk columns, we
+ * could probably refactor this file to deal only in pathtargets.
  *
  * We don't have to care about typmods here: the only allowed difference
  * between set-op input and output typmods is input is a specific typmod
@@ -229,8 +223,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
                        List *colTypes, List *colCollations,
-                       bool junkOK,
-                       int flag, List *refnames_tlist,
+                       List *refnames_tlist,
                        List **pTargetList,
                        bool *istrivial_tlist)
 {
@@ -279,7 +272,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,

         /* Figure out the appropriate target list for this subquery. */
         tlist = generate_setop_tlist(colTypes, colCollations,
-                                     flag,
                                      rtr->rtindex,
                                      true,
                                      subroot->processed_tlist,
@@ -318,16 +310,14 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * generate_append_tlist() or generate_setop_tlist(), this will work.
          * We just tell generate_setop_tlist() to use varno 0.
          */
-        if (flag >= 0 ||
-            !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
-            !tlist_same_collations(*pTargetList, colCollations, junkOK))
+        if (!tlist_same_datatypes(*pTargetList, colTypes, false) ||
+            !tlist_same_collations(*pTargetList, colCollations, false))
         {
             PathTarget *target;
             bool        trivial_tlist;
             ListCell   *lc;

             *pTargetList = generate_setop_tlist(colTypes, colCollations,
-                                                flag,
                                                 0,
                                                 false,
                                                 *pTargetList,
@@ -411,7 +401,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      */
     lrel = recurse_set_operations(setOp->larg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -423,7 +412,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -436,7 +424,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /*
      * Generate tlist for RecursiveUnion path node --- same as in Append cases
      */
-    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations, false,
+    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations,
                                   list_make2(lpath_tlist, rpath_tlist),
                                   refnames_tlist);

@@ -736,7 +724,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, false,
+    tlist = generate_append_tlist(op->colTypes, op->colCollations,
                                   tlist_list, refnames_tlist);
     *pTargetList = tlist;

@@ -1048,7 +1036,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,7 +1047,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1109,7 +1095,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
                                  0, false, lpath_tlist, refnames_tlist,
                                  &result_trivial_tlist);

@@ -1258,18 +1244,10 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
-         *
-         * Note we disallow any resjunk columns in child results.  This is
-         * necessary since the Append node that implements the union won't do
-         * any projection, and upper levels will get confused if some of our
-         * output tuples have junk and some don't.  This case only arises when
-         * we have an EXCEPT or INTERSECT as child, else there won't be
-         * resjunk anyway.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
-                                                        false, -1,
                                                         refnames_tlist,
                                                         &child_tlist,
                                                         &trivial_tlist));
@@ -1412,7 +1390,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: -1 if no flag column needed, 0 or 1 to create a const flag column
  * varno: varno to use in generated Vars
  * hack_constants: true to copy up constants (see comments in code)
  * input_tlist: targetlist of this node's input node
@@ -1421,7 +1398,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  */
 static List *
 generate_setop_tlist(List *colTypes, List *colCollations,
-                     int flag,
                      Index varno,
                      bool hack_constants,
                      List *input_tlist,
@@ -1520,7 +1496,7 @@ generate_setop_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1529,25 +1505,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag >= 0)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is the given constant */
-        expr = (Node *) makeConst(INT4OID,
-                                  -1,
-                                  InvalidOid,
-                                  sizeof(int32),
-                                  Int32GetDatum(flag),
-                                  false,
-                                  true);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-        *trivial_tlist = false; /* the extra entry makes it not trivial */
-    }
-
     return tlist;
 }

@@ -1556,7 +1513,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: true to create a flag column copied up from subplans
  * input_tlists: list of tlists for sub-plans of the Append
  * refnames_tlist: targetlist to take column names from
  *
@@ -1570,7 +1526,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  */
 static List *
 generate_append_tlist(List *colTypes, List *colCollations,
-                      bool flag,
                       List *input_tlists,
                       List *refnames_tlist)
 {
@@ -1604,8 +1559,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
         {
             TargetEntry *subtle = (TargetEntry *) lfirst(subtlistl);

-            if (subtle->resjunk)
-                continue;
+            Assert(!subtle->resjunk);
             Assert(curColType != NULL);
             if (exprType((Node *) subtle->expr) == lfirst_oid(curColType))
             {
@@ -1654,7 +1608,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1663,23 +1617,6 @@ generate_append_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is shown as copied up from subplan */
-        expr = (Node *) makeVar(0,
-                                resno,
-                                INT4OID,
-                                -1,
-                                InvalidOid,
-                                0);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-    }
-
     pfree(colTypmods);

     return tlist;
@@ -1709,12 +1646,7 @@ generate_setop_grouplist(SetOperationStmt *op, List *targetlist)
         TargetEntry *tle = (TargetEntry *) lfirst(lt);
         SortGroupClause *sgc;

-        if (tle->resjunk)
-        {
-            /* resjunk columns should not have sortgrouprefs */
-            Assert(tle->ressortgroupref == 0);
-            continue;            /* ignore resjunk columns */
-        }
+        Assert(!tle->resjunk);

         /* non-resjunk columns should have sortgroupref = resno */
         Assert(tle->ressortgroupref == tle->resno);
--
2.43.5

From b46cd2f88ead9a6facf89064cc2ac2f7aac01134 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:35:29 -0500
Subject: [PATCH v2 3/5] Get rid of choose_hashed_setop().

This logic is a hangover from days when we could only generate
one plan per setop node.  Now, it's better to generate multiple
Paths and let add_path choose the cheapest.  The main advantage
is not having to keep choose_hashed_setop in perfect sync with
cost estimates made elsewhere --- and indeed, it looks like it
was out of sync already.  Not to mention that it doesn't account
for presorted input, which doesn't matter too much right now
but will soon.

This does require fixing create_setop_path to generate good cost
estimates.  That was another area where we'd not maintained things
well, because it wasn't distinguishing sorted from hashed costs at
all.  With the formulas I chose, the total cost attributed to the node
is actually the same for sorted and hashed modes --- but the startup
cost is wildly different, because hashed mode does basically all the
work before returning anything.

I also modified the existing logic that disables hashed setops
so that it works by incrementing path->disabled_nodes rather
than not generating the path at all.  That seems more in keeping
with the methods established by commit e22253467.  (But should
we put that logic in create_setop_path, rather than its caller?)

This results in two changes in plans that are visible in the
regression tests.  The one in union.sql seems fine: it's now
preferring sorted over hashed mode for a zero-column INTERSECT,
since no sort is really required.  The one in subselect.sql
is trying to test nodeSetOp's rescan logic, so I extended it
to test both sorted and hashed modes.
---
 src/backend/optimizer/prep/prepunion.c  | 226 +++++++++---------------
 src/backend/optimizer/util/pathnode.c   |  50 +++++-
 src/test/regress/expected/subselect.out |  48 ++++-
 src/test/regress/expected/union.out     |   2 +-
 src/test/regress/sql/subselect.sql      |  32 +++-
 5 files changed, 200 insertions(+), 158 deletions(-)
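
Reviewer's note (not part of the patch): the hashed-path handling described above can be sketched in isolation as follows. The hashed path is always built, and each reason to avoid it bumps a disabled-node counter instead of suppressing the path. The struct, the function name, the 8-byte alignment and the 24-byte header size are assumptions made so the sketch compiles on its own; the real code uses MAXALIGN, SizeofMinimalTupleHeader, get_hash_memory_limit() and path->disabled_nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed 8-byte maximum alignment; stand-in for MAXALIGN(). */
#define SKETCH_MAXALIGN(len) \
    (((size_t) (len) + 7) & ~((size_t) 7))

/* Placeholder value; stand-in for SizeofMinimalTupleHeader. */
#define SKETCH_TUPLE_HEADER 24

/* Hypothetical, stripped-down path: only the field the sketch needs. */
typedef struct
{
    int         disabled_nodes;
} SketchPath;

/*
 * Apply the two "disable" rules to an already-built hashed SetOp path:
 * one for the enable_hashagg switch, one for the estimated hash table
 * size exceeding the memory limit.
 */
void
sketch_mark_hashed_setop(SketchPath *path, bool enable_hashagg,
                         int tuple_width, double num_groups,
                         size_t hash_mem_limit)
{
    size_t      hashentrysize;

    assert(path != NULL);

    if (!enable_hashagg)
        path->disabled_nodes++;

    /* estimated bytes per hash entry: aligned tuple data plus header */
    hashentrysize = SKETCH_MAXALIGN(tuple_width) +
        SKETCH_MAXALIGN(SKETCH_TUPLE_HEADER);
    if ((double) hashentrysize * num_groups > (double) hash_mem_limit)
        path->disabled_nodes++;
}
```

Under these assumptions a 4-byte-wide tuple costs 32 bytes per entry, so 101 groups need about 3 kB and fit comfortably in any realistic limit. Because the path is only marked and not dropped, it remains available to add_path as a fallback when no enabled alternative exists.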

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 977bd546e0..21513dbb19 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -63,10 +63,6 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **tlist_list,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
-static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *lpath, Path *rpath,
-                                double dNumGroups, double dNumOutputRows,
-                                const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
                                   Index varno,
                                   bool hack_constants,
@@ -1025,7 +1021,8 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
-    bool        use_hash;
+    bool        can_sort;
+    bool        can_hash;
     SetOpCmd    cmd;

     /*
@@ -1130,15 +1127,76 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }
+    result_rel->rows = dNumOutputRows;
+
+    /* Select the SetOpCmd type */
+    switch (op->op)
+    {
+        case SETOP_INTERSECT:
+            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
+            break;
+        case SETOP_EXCEPT:
+            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
+            break;
+        default:
+            elog(ERROR, "unrecognized set op: %d", (int) op->op);
+            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
+            break;
+    }
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));

     /*
-     * Decide whether to hash or sort, and add sort nodes if needed.
+     * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
-    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
-                                   dNumGroups, dNumOutputRows,
-                                   (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");
+    if (can_hash)
+    {
+        Size        hash_mem_limit = get_hash_memory_limit();
+        Size        hashentrysize;
+
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_HASHED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);

-    if (groupList && !use_hash)
+        /*
+         * Mark the path as disabled if enable_hashagg is off.  While this
+         * isn't exactly a HashAgg node, it seems close enough to justify
+         * letting that switch control it.
+         */
+        if (!enable_hashagg)
+            path->disabled_nodes++;
+
+        /*
+         * Also disable if it doesn't look like the hashtable will fit into
+         * hash_mem.
+         */
+        hashentrysize = MAXALIGN(lpath->pathtarget->width) +
+            MAXALIGN(SizeofMinimalTupleHeader);
+        if (hashentrysize * dNumGroups > hash_mem_limit)
+            path->disabled_nodes++;
+
+        add_path(result_rel, path);
+    }
+
+    /*
+     * If we can sort, sort the inputs if needed before adding SetOp.
+     */
+    if (can_sort)
     {
         List       *pathkeys;

@@ -1160,36 +1218,19 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                               rpath,
                                               pathkeys,
                                               -1.0);
-    }

-    /*
-     * Finally, add a SetOp path node to generate the correct output.
-     */
-    switch (op->op)
-    {
-        case SETOP_INTERSECT:
-            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
-            break;
-        case SETOP_EXCEPT:
-            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
-            break;
-        default:
-            elog(ERROR, "unrecognized set op: %d", (int) op->op);
-            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
-            break;
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_SORTED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);
+        add_path(result_rel, path);
     }
-    path = (Path *) create_setop_path(root,
-                                      result_rel,
-                                      lpath,
-                                      rpath,
-                                      cmd,
-                                      use_hash ? SETOP_HASHED : SETOP_SORTED,
-                                      groupList,
-                                      dNumGroups,
-                                      dNumOutputRows);
-
-    result_rel->rows = path->rows;
-    add_path(result_rel, path);
+
     return result_rel;
 }

@@ -1276,115 +1317,6 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)
     set_cheapest(rel);
 }

-/*
- * choose_hashed_setop - should we use hashing for a set operation?
- *
- * XXX probably this should go away: just make both paths and let
- * add_path sort it out.
- */
-static bool
-choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *lpath, Path *rpath,
-                    double dNumGroups, double dNumOutputRows,
-                    const char *construct)
-{
-    int            numGroupCols = list_length(groupClauses);
-    Size        hash_mem_limit = get_hash_memory_limit();
-    bool        can_sort;
-    bool        can_hash;
-    Size        hashentrysize;
-    Path        hashed_p;
-    Path        sorted_p;
-    double        tuple_fraction;
-
-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupClauses);
-    can_hash = grouping_is_hashable(groupClauses);
-    if (can_hash && can_sort)
-    {
-        /* we have a meaningful choice to make, continue ... */
-    }
-    else if (can_hash)
-        return true;
-    else if (can_sort)
-        return false;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is UNION, INTERSECT, or EXCEPT */
-                 errmsg("could not implement %s", construct),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
-    /* Prefer sorting when enable_hashagg is off */
-    if (!enable_hashagg)
-        return false;
-
-    /*
-     * Don't do it if it doesn't look like the hashtable will fit into
-     * hash_mem.
-     */
-    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
-
-    if (hashentrysize * dNumGroups > hash_mem_limit)
-        return false;
-
-    /*
-     * See if the estimated cost is no more than doing it the other way.
-     *
-     * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group. XXX NOT TRUE: Note that the actual result plan might involve a
-     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
-     * and Group should be close enough for our purposes here.
-     *
-     * These path variables are dummies that just hold cost fields; we don't
-     * make actual Paths for these steps.
-     */
-    cost_agg(&hashed_p, root, AGG_HASHED, NULL,
-             numGroupCols, dNumGroups,
-             NIL,
-             lpath->disabled_nodes + rpath->disabled_nodes,
-             lpath->startup_cost + rpath->startup_cost,
-             lpath->total_cost + rpath->total_cost,
-             lpath->rows + rpath->rows,
-             lpath->pathtarget->width);
-
-    /*
-     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
-     * unsorted, since it was made by appending unrelated sub-relations
-     * together.
-     */
-    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
-    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
-    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
-    /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-    cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
-              sorted_p.total_cost,
-              lpath->rows + rpath->rows,
-              lpath->pathtarget->width,
-              0.0, work_mem, -1.0);
-    cost_group(&sorted_p, root, numGroupCols, dNumGroups,
-               NIL,
-               sorted_p.disabled_nodes,
-               sorted_p.startup_cost, sorted_p.total_cost,
-               lpath->rows + rpath->rows);
-
-    /*
-     * Now make the decision using the top-level tuple fraction.  First we
-     * have to convert an absolute count (LIMIT) into fractional form.
-     */
-    tuple_fraction = root->tuple_fraction;
-    if (tuple_fraction >= 1.0)
-        tuple_fraction /= dNumOutputRows;
-
-    if (compare_fractional_path_costs(&hashed_p, &sorted_p,
-                                      tuple_fraction) < 0)
-    {
-        /* Hashed is cheaper, so use it */
-        return true;
-    }
-    return false;
-}
-
 /*
  * Generate targetlist for a set-operation plan node
  *
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e52e4b1d67..d7422f2dd8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3681,17 +3681,51 @@ create_setop_path(PlannerInfo *root,
     pathnode->numGroups = numGroups;

     /*
-     * Charge one cpu_operator_cost per comparison per input tuple. We assume
-     * all columns get compared at most of the tuples.
-     *
-     * XXX all wrong for hashing
+     * Compute cost estimates.  As things stand, we end up with the same total
+     * cost in this node for sort and hash methods, but different startup
+     * costs.  This could be refined perhaps, but it'll do for now.
      */
     pathnode->path.disabled_nodes =
         leftpath->disabled_nodes + rightpath->disabled_nodes;
-    pathnode->path.startup_cost =
-        leftpath->startup_cost + rightpath->startup_cost;
-    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
-        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+    if (strategy == SETOP_SORTED)
+    {
+        /*
+         * In sorted mode, we can emit output incrementally.  Charge one
+         * cpu_operator_cost per comparison per input tuple.  Like cost_group,
+         * we assume all columns get compared at most of the tuples.
+         */
+        pathnode->path.startup_cost =
+            leftpath->startup_cost + rightpath->startup_cost;
+        pathnode->path.total_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
+    else
+    {
+        /*
+         * In hashed mode, we must read all the input before we can emit
+         * anything.  Also charge comparison costs to represent the cost of
+         * hash table lookups.
+         */
+        pathnode->path.startup_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+        pathnode->path.total_cost = pathnode->path.startup_cost;
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 154fffabae..6f2926c45a 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1221,8 +1221,10 @@ where o.ten = 0;
 (1 row)

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -1256,6 +1258,50 @@ where o.ten = 1;
    100
 (1 row)

+rollback;
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+                                                                                                     QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  SetOp Except
+               ->  Sort
+                     Sort Key: i1.unique1, i1.unique2, i1.two, i1.four, i1.ten, i1.twenty, i1.hundred, i1.thousand, i1.twothousand, i1.fivethous, i1.tenthous, i1.odd, i1.even, i1.stringu1, i1.stringu2, i1.string4
+                     ->  Index Scan using onek_unique1 on onek i1
+                           Index Cond: (unique1 = o.unique1)
+               ->  Sort
+                     Sort Key: i2.unique1, i2.unique2, i2.two, i2.four, i2.ten, i2.twenty, i2.hundred, i2.thousand, i2.twothousand, i2.fivethous, i2.tenthous, i2.odd, i2.even, i2.stringu1, i2.stringu2, i2.string4
+                     ->  Index Scan using onek_unique1 on onek i2
+                           Index Cond: (unique1 = o.unique2)
+(13 rows)
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+ count
+-------
+   100
+(1 row)
+
+rollback;
 --
 -- Test rescan of a RecursiveUnion node
 --
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index fbf24490e9..9474b01510 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -950,7 +950,7 @@ explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
                         QUERY PLAN
 ----------------------------------------------------------
- HashSetOp Intersect
+ SetOp Intersect
    ->  Function Scan on generate_series
    ->  Function Scan on generate_series generate_series_1
 (3 rows)
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index af6e157aca..11becd885c 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -638,8 +638,11 @@ select sum(ss.tst::int) from
 where o.ten = 0;

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
+
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -657,6 +660,33 @@ select count(*) from
   ) ss
 where o.ten = 1;

+rollback;
+
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+rollback;
+
 --
 -- Test rescan of a RecursiveUnion node
 --
--
2.43.5
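
For anyone who wants to play with the new costing outside the planner,
here is a rough standalone sketch of the arithmetic that the patched
create_setop_path() does.  The SketchCost struct and sketch_setop_cost()
are made up for illustration and are not part of the patch;
cpu_operator_cost is passed as a parameter rather than read from the GUC.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cost fields of a planner Path. */
typedef struct SketchCost
{
    double      startup_cost;
    double      total_cost;
    double      rows;
} SketchCost;

/*
 * Mirror of the cost arithmetic in the patched create_setop_path().
 * Only the formulas come from the patch; the names are illustrative.
 */
static SketchCost
sketch_setop_cost(SketchCost left, SketchCost right,
                  bool hashed, int num_group_cols,
                  double output_rows, double cpu_operator_cost)
{
    SketchCost  result;
    double      compare_cost;

    assert(num_group_cols >= 0 && output_rows >= 0.0);

    /* one cpu_operator_cost per comparison per input tuple */
    compare_cost = cpu_operator_cost *
        (left.rows + right.rows) * num_group_cols;

    if (hashed)
    {
        /* hashed mode must read all input before emitting anything */
        result.startup_cost =
            left.total_cost + right.total_cost + compare_cost;
        result.total_cost = result.startup_cost;
    }
    else
    {
        /* sorted mode can emit output incrementally */
        result.startup_cost = left.startup_cost + right.startup_cost;
        result.total_cost =
            left.total_cost + right.total_cost + compare_cost;
    }

    /* small per-output-tuple charge, same in both modes */
    result.total_cost += cpu_operator_cost * output_rows;
    result.rows = output_rows;

    return result;
}
```

Given identical inputs, say two 1,000,000-row scans costing 14480 each,
one grouping column, and cpu_operator_cost = 0.0025, the hashed case gets
a startup cost of about 33960 while the sorted case keeps a startup cost
of zero; the total costs come out the same.  In a real plan the sorted
case would usually have Sort nodes underneath, so its input costs would
be higher.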

From 7ec25e0de0810a0130914ed6f691049859d4917c Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:51:54 -0500
Subject: [PATCH v2 4/5] Fix bogus decisions about whether we want pre-sorted
 child paths.

As things stand, recurse_set_operations unconditionally passes down
root->parse->setOperations to subquery_planner, which then applies
set_operation_ordered_results_useful to it to decide whether we care
about producing pre-sorted paths for the subquery.

This is flat wrong.

The main problem is that root->parse->setOperations is not necessarily
the relevant ancestor node.  In the test case added here, we have
UNION ALL atop a child UNION, and the existing logic fails to
recognize that it could use pre-sorted paths for the child UNION step.

A lesser problem is that set_operation_ordered_results_useful is
inadequate anyway: it will falsely claim that we could use pre-sorted
paths for a recursive UNION.

A better way to do this is for each caller of recurse_set_operations
to pass down the relevant parent node.  Then we can further specify
that the caller should pass NULL if it has no use for pre-sorted
paths, and then we can get rid of set_operation_ordered_results_useful
altogether, eliminating the need to keep it in sync with the code that
is actually doing the work.

Note: this patch makes generate_nonunion_paths pass down the non-UNION
node as parent.  generate_nonunion_paths doesn't yet do anything with
the pre-sorted paths that will result, so no regression test plans
change.  If we were going to leave it like this, we'd pass down NULL
instead.
---
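
To make the new calling convention concrete, here is a small standalone
sketch of which parent each caller hands down to its children, as I read
the hunks below.  The SketchSetOp type and its "recursive" flag are
invented for the example (the real code reaches generate_recursion_path
by other means), so treat this as an illustration, not planner code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    SK_UNION,
    SK_INTERSECT,
    SK_EXCEPT
} SketchSetOpKind;

/* Hypothetical, cut-down stand-in for SetOperationStmt. */
typedef struct SketchSetOp
{
    SketchSetOpKind op;
    bool        all;
    bool        recursive;      /* assumption: marks a recursive UNION */
} SketchSetOp;

/*
 * Return the node that would be passed as parentOp when planning the
 * children of "op", or NULL if sorted child output is of no interest.
 */
static const SketchSetOp *
sketch_parent_for_children(const SketchSetOp *op)
{
    assert(op != NULL);

    /* generate_recursion_path: no value in sorted results */
    if (op->recursive)
        return NULL;

    /* plan_union_children: only a non-ALL UNION wants sorted input */
    if (op->op == SK_UNION)
        return op->all ? NULL : op;

    /* generate_nonunion_paths: pass down the non-UNION node itself */
    return op;
}
```

In the new test case, the outer UNION ALL passes NULL to its children,
but the inner UNION passes itself down, which is what lets its two
index scans feed a Merge Append.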
 src/backend/optimizer/plan/planner.c   |  7 ++-
 src/backend/optimizer/prep/prepunion.c | 65 +++++++++++++-------------
 src/include/optimizer/prep.h           |  1 -
 src/test/regress/expected/union.out    | 14 ++++++
 src/test/regress/sql/union.sql         |  4 ++
 5 files changed, 54 insertions(+), 37 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1f78dc3d53..df168d46ac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -608,7 +608,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Basically, this routine does the stuff that should only be done once
  * per Query object.  It then calls grouping_planner.  At one time,
@@ -1342,7 +1342,7 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Returns nothing; the useful output is in the Paths we attach to the
  * (UPPERREL_FINAL, NULL) upperrel in *root.  In addition,
@@ -3621,8 +3621,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
                                       tlist);

     /* setting setop_pathkeys might be useful to the union planner */
-    if (qp_extra->setop != NULL &&
-        set_operation_ordered_results_useful(qp_extra->setop))
+    if (qp_extra->setop != NULL)
     {
         List       *groupClauses;
         bool        sortable;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 21513dbb19..e92a088334 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -39,6 +39,7 @@


 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
+                                          SetOperationStmt *parentOp,
                                           List *colTypes, List *colCollations,
                                           List *refnames_tlist,
                                           List **pTargetList,
@@ -156,6 +157,7 @@ plan_set_operations(PlannerInfo *root)
          * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
+                                           NULL,    /* no parent */
                                            topop->colTypes, topop->colCollations,
                                            leftmostQuery->targetList,
                                            &top_tlist,
@@ -168,44 +170,31 @@ plan_set_operations(PlannerInfo *root)
     return setop_rel;
 }

-/*
- * set_operation_ordered_results_useful
- *        Return true if the given SetOperationStmt can be executed by utilizing
- *        paths that provide sorted input according to the setop's targetlist.
- *        Returns false when sorted paths are not any more useful then unsorted
- *        ones.
- */
-bool
-set_operation_ordered_results_useful(SetOperationStmt *setop)
-{
-    /*
-     * Paths sorted by the targetlist are useful for UNION as we can opt to
-     * MergeAppend the sorted paths then Unique them.  Ordered paths are no
-     * more useful than unordered ones for UNION ALL.
-     */
-    if (!setop->all && setop->op == SETOP_UNION)
-        return true;
-
-    /*
-     * EXCEPT / EXCEPT ALL / INTERSECT / INTERSECT ALL cannot yet utilize
-     * correctly sorted input paths.
-     */
-    return false;
-}
-
 /*
  * recurse_set_operations
  *      Recursively handle one step in a tree of set operations
  *
+ * setOp: current step (could be a SetOperationStmt or a leaf RangeTblRef)
+ * parentOp: parent step, or NULL if none (but see below)
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
  * refnames_tlist: targetlist to take column names from
  *
+ * parentOp should be passed as NULL unless that step is interested in
+ * getting sorted output from this step.  ("Sorted" means "sorted according
+ * to the default btree opclasses of the result column datatypes".)
+ *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
  * *pTargetList: receives the fully-fledged tlist for the subtree's top plan
  * *istrivial_tlist: true if, and only if, datatypes between parent and child
  * match.
  *
+ * If setOp is a leaf node, this function plans the sub-query but does
+ * not populate the pathlist of the returned RelOptInfo.  The caller will
+ * generate SubqueryScan paths using useful path(s) of the subquery (see
+ * build_setop_child_paths).  But this function does build the paths for
+ * set-operation nodes.
+ *
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
@@ -218,6 +207,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  */
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
+                       SetOperationStmt *parentOp,
                        List *colTypes, List *colCollations,
                        List *refnames_tlist,
                        List **pTargetList,
@@ -234,7 +224,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
     {
         RangeTblRef *rtr = (RangeTblRef *) setOp;
         RangeTblEntry *rte = root->simple_rte_array[rtr->rtindex];
-        SetOperationStmt *setops;
         Query       *subquery = rte->subquery;
         PlannerInfo *subroot;
         List       *tlist;
@@ -249,15 +238,13 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
         Assert(root->plan_params == NIL);

         /*
-         * Pass the set operation details to the subquery_planner to have it
-         * consider generating Paths correctly ordered for the set operation.
+         * Generate a subroot and Paths for the subquery.  If we have a
+         * parentOp, pass that down to encourage subquery_planner to consider
+         * suitably-sorted Paths.
          */
-        setops = castNode(SetOperationStmt, root->parse->setOperations);
-
-        /* Generate a subroot and Paths for the subquery */
         subroot = rel->subroot = subquery_planner(root->glob, subquery, root,
                                                   false, root->tuple_fraction,
-                                                  setops);
+                                                  parentOp);

         /*
          * It should not be possible for the primitive query to contain any
@@ -396,6 +383,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      * separately without any intention of combining them into one Append.
      */
     lrel = recurse_set_operations(setOp->larg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -407,6 +395,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /* The right path will want to look at the left one ... */
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -479,6 +468,10 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
  * build_setop_child_paths
  *        Build paths for the set op child relation denoted by 'rel'.
  *
+ * 'rel' is an RTE_SUBQUERY relation.  We have already generated paths within
+ * the subquery's subroot; the task here is to create SubqueryScan paths for
+ * 'rel', representing scans of the useful subquery paths.
+ *
  * interesting_pathkeys: if not NIL, also include paths that suit these
  * pathkeys, sorting any unsorted paths as required.
  * *pNumGroups: if not NULL, we estimate the number of distinct groups
@@ -1032,6 +1025,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -1043,6 +1037,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -1285,8 +1280,14 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
+         *
+         * If top_union isn't a UNION ALL, then we are interested in sorted
+         * output from the child, so pass top_union as parentOp.  Note that
+         * this isn't necessarily the child node's immediate SetOperationStmt
+         * parent, but that's fine: it's the effective parent.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
+                                                        top_union->all ? NULL : top_union,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
                                                         refnames_tlist,
diff --git a/src/include/optimizer/prep.h b/src/include/optimizer/prep.h
index a52dec285d..74bb6c1d3b 100644
--- a/src/include/optimizer/prep.h
+++ b/src/include/optimizer/prep.h
@@ -53,6 +53,5 @@ extern void preprocess_aggrefs(PlannerInfo *root, Node *clause);
  * prototypes for prepunion.c
  */
 extern RelOptInfo *plan_set_operations(PlannerInfo *root);
-extern bool set_operation_ordered_results_useful(SetOperationStmt *setop);

 #endif                            /* PREP_H */
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 9474b01510..3e1adbd992 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -466,6 +466,20 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+                               QUERY PLAN
+------------------------------------------------------------------------
+ Append
+   ->  Seq Scan on int4_tbl
+   ->  Unique
+         ->  Merge Append
+               Sort Key: tenk1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1
+               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+(7 rows)
+
 reset enable_hashagg;
 -- non-hashable type
 set enable_hashagg to on;
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index f8826514e4..10030ec1c5 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -156,6 +156,10 @@ explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+
 reset enable_hashagg;

 -- non-hashable type
--
2.43.5

From ca15dc437b2ca0d736c89dfbef79316c218b90d2 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:58:34 -0500
Subject: [PATCH v2 5/5] Teach generate_nonunion_paths to consider pre-sorted
 child paths.

Much of this patch is just re-ordering existing code so that we can
compute the desired pathkeys before we call build_setop_child_paths.

I don't think we need any new test cases: the existing cases that
change plan are enough to show that this works.

One test case in union.sql would switch from HashSetOp Except to
sorted SetOp Except if we left it alone, since the cheapest inputs
are already sorted.  But the point of that set of tests is to
exercise HashSetOp, so swing a bigger hammer.  (The same query is
re-tested with sorting further down, so this seems fine.)
---
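
A rough standalone sketch of the sorted-input selection follows, in case
it helps review.  SketchPath, SketchChoice and
sketch_choose_sorted_input() are invented names.  The fallback to an
explicit sort when the child offers no presorted path is my assumption
for this illustration, carried over from what the old code did with
create_sort_path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a child Path. */
typedef struct SketchPath
{
    double      total_cost;
    bool        sorted;         /* already sorted by the setop pathkeys? */
} SketchPath;

typedef struct SketchChoice
{
    const SketchPath *input;    /* path to feed the sorted SetOp */
    bool        needs_sort;     /* must an explicit Sort go on top? */
} SketchChoice;

static SketchChoice
sketch_choose_sorted_input(const SketchPath *paths, size_t npaths,
                           const SketchPath *cheapest)
{
    SketchChoice choice = {cheapest, false};
    const SketchPath *best = NULL;
    size_t      i;

    assert(cheapest != NULL);

    /* cheapest path is already sorted: use it as-is */
    if (cheapest->sorted)
        return choice;

    /* otherwise look for the cheapest presorted path */
    for (i = 0; i < npaths; i++)
    {
        if (paths[i].sorted &&
            (best == NULL || paths[i].total_cost < best->total_cost))
            best = &paths[i];
    }

    if (best != NULL)
        choice.input = best;
    else
        choice.needs_sort = true;   /* assumption: sort the cheapest path */

    return choice;
}
```

Note that this sketch takes a presorted path whenever one exists and does
not compare it against sorting the cheapest path.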
 src/backend/optimizer/prep/prepunion.c | 144 ++++++++++++++++---------
 src/test/regress/expected/union.out    |  37 +++----
 src/test/regress/sql/union.sql         |   5 +
 3 files changed, 115 insertions(+), 71 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e92a088334..b3a26738d0 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1010,6 +1010,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     bool        lpath_trivial_tlist,
                 rpath_trivial_tlist,
                 result_trivial_tlist;
+    List       *nonunion_pathkeys = NIL;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
@@ -1030,11 +1031,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
-    if (lrel->rtekind == RTE_SUBQUERY)
-        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
-                                NIL, &dLeftGroups);
-    else
-        dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
                                   op,
@@ -1042,9 +1038,57 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
+
+    /*
+     * Generate tlist for SetOp plan node.
+     *
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
+     * concerned, but we must make it look real anyway for the benefit of the
+     * next plan level up.
+     */
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);
+
+    *pTargetList = tlist;
+
+    /* Identify the grouping semantics */
+    groupList = generate_setop_grouplist(op, tlist);
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
+
+    if (can_sort)
+    {
+        /* Determine the pathkeys for sorting by the whole target list */
+        nonunion_pathkeys = make_pathkeys_for_sortclauses(root, groupList,
+                                                          tlist);
+
+        root->query_pathkeys = nonunion_pathkeys;
+    }
+
+    /*
+     * Now that we've got all that info, we can build the child paths.
+     */
+    if (lrel->rtekind == RTE_SUBQUERY)
+        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
+                                nonunion_pathkeys, &dLeftGroups);
+    else
+        dLeftGroups = lrel->rows;
     if (rrel->rtekind == RTE_SUBQUERY)
         build_setop_child_paths(root, rrel, rpath_trivial_tlist, rpath_tlist,
-                                NIL, &dRightGroups);
+                                nonunion_pathkeys, &dRightGroups);
     else
         dRightGroups = rrel->rows;

@@ -1080,30 +1124,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     lpath = lrel->cheapest_total_path;
     rpath = rrel->cheapest_total_path;

-    /*
-     * Generate tlist for SetOp plan node.
-     *
-     * The tlist for a SetOp plan isn't important so far as the SetOp is
-     * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.
-     */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
-                                 0, false, lpath_tlist, refnames_tlist,
-                                 &result_trivial_tlist);
-
-    /* We should not have needed any type coercions in the tlist */
-    Assert(result_trivial_tlist);
-
-    *pTargetList = tlist;
-
     /* Build result relation. */
     result_rel = fetch_upper_rel(root, UPPERREL_SETOP,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /* Identify the grouping semantics */
-    groupList = generate_setop_grouplist(op, tlist);
-
     /*
      * Estimate number of distinct groups that we'll need hashtable entries
      * for; this is the size of the left-hand input for EXCEPT, or the smaller
@@ -1139,17 +1164,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
             break;
     }

-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupList);
-    can_hash = grouping_is_hashable(groupList);
-    if (!can_sort && !can_hash)
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is INTERSECT or EXCEPT */
-                 errmsg("could not implement %s",
-                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
     /*
      * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
@@ -1189,35 +1203,63 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }

     /*
-     * If we can sort, sort the inputs if needed before adding SetOp.
+     * If we can sort, generate the cheapest sorted input paths, and add a
+     * SetOp atop those.
      */
     if (can_sort)
     {
         List       *pathkeys;
+        Path       *slpath,
+                   *srpath;

+        /* First the left input ... */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  lpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
-            lpath = (Path *) create_sort_path(root,
-                                              lpath->parent,
-                                              lpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            slpath = lpath;        /* cheapest path is already sorted */
+        else
+        {
+            slpath = get_cheapest_path_for_pathkeys(lrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (slpath == NULL)
+                slpath = (Path *) create_sort_path(root,
+                                                   lpath->parent,
+                                                   lpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }
+
+        /* and now the same for the right. */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  rpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
-            rpath = (Path *) create_sort_path(root,
-                                              rpath->parent,
-                                              rpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            srpath = rpath;        /* cheapest path is already sorted */
+        else
+        {
+            srpath = get_cheapest_path_for_pathkeys(rrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (srpath == NULL)
+                srpath = (Path *) create_sort_path(root,
+                                                   rpath->parent,
+                                                   rpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }

         path = (Path *) create_setop_path(root,
                                           result_rel,
-                                          lpath,
-                                          rpath,
+                                          slpath,
+                                          srpath,
                                           cmd,
                                           SETOP_SORTED,
                                           groupList,
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 3e1adbd992..aa73edeac5 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -385,13 +385,15 @@ select count(*) from
   5000
 (1 row)

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                         QUERY PLAN
-------------------------------------------------------------
+           QUERY PLAN
+---------------------------------
  HashSetOp Except
-   ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+   ->  Seq Scan on tenk1
+   ->  Seq Scan on tenk1 tenk1_1
          Filter: (unique2 <> 10)
 (4 rows)

@@ -401,6 +403,7 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+reset enable_indexscan;
 set enable_hashagg to off;
 explain (costs off)
 select count(*) from
@@ -427,17 +430,15 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
    ->  SetOp Intersect
          ->  Sort
                Sort Key: tenk1.fivethous
                ->  Seq Scan on tenk1
-         ->  Sort
-               Sort Key: tenk1_1.unique1
-               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(6 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -448,17 +449,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                            QUERY PLAN
-------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  SetOp Except
-   ->  Sort
-         Sort Key: tenk1.unique1
-         ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Sort
-         Sort Key: tenk1_1.unique2
-         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-               Filter: (unique2 <> 10)
-(8 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index 10030ec1c5..0be2911b0b 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -134,10 +134,15 @@ select count(*) from
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
+
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+reset enable_indexscan;
+
 set enable_hashagg to off;

 explain (costs off)
--
2.43.5

Re: Converting SetOp to read its two inputs separately

From

David Rowley

Date:

29 November 2024, 06:02:01

On Thu, 14 Nov 2024 at 22:33, Richard Guo <guofenglinux@gmail.com> wrote:
> BTW, I noticed a typo in the comment for function

> - *     Returns false when sorted paths are not any more useful then unsorted
> + *     Returns false when sorted paths are not any more useful than unsorted

I pushed a fix for that.

Thanks.

David

Re: Converting SetOp to read its two inputs separately

From

David Rowley

Date:

29 November 2024, 08:37:48

On Wed, 20 Nov 2024 at 15:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Once I'd wrapped my head around how things are done now (which the
> comments in prepunion.c were remarkably unhelpful about), I saw that
> most of the problem for #2 just requires re-ordering things that
> generate_nonunion_paths was already doing.  As for #1, I have a modest
> proposal: we should get rid of set_operation_ordered_results_useful
> entirely.  It's not the code that actually does useful work, and
> keeping it in sync with the code that does do useful work is hard and
> unnecessary.
>
> 0001-0003 below are the same as before (so the slot-munging TODO is
> still there).  0004 fixes a rather basic bug for nested set-operations
> and gets rid of set_operation_ordered_results_useful along the way.
> Then 0005 does your step 2.

Here's a quick review of all 5 patches together.

1. In setop_load_group(), the primary concern with the result of
setop_compare_slots() seems to be if the tuples match or not. If
you're not too concerned with keeping the Assert checking for
mis-sorts, then it could be much more efficient for many cases to
check the final column first. ExecBuildGroupingEqual() makes use of
this knowledge already, so it's nothing new. Maybe you could just use
an ExprState made by ExecBuildGroupingEqual() instead of
setop_compare_slots() for this case. That would allow JIT to work.

2. (related to #1) I'm unsure if it's also worth deforming all the
needed attributes in one go rather than calling slot_getattr() each
time, which will result in incrementally deforming the tuple.
ExecBuildGroupingEqual() also does that.

3.
/* XXX hack: force the tuple into minimal form */
/* XXX should not have to do this */
ExecForceStoreHeapTuple(ExecCopySlotHeapTuple(innerslot),
setopstate->ps.ps_ResultTupleSlot, true);
innerslot = setopstate->ps.ps_ResultTupleSlot;

nodeAgg.c seems to do this by using prepare_hash_slot() which deforms
the heap tuple and sets the Datums verbatim rather than making copies
of any byref ones.

4. Since the EXPLAIN output is going to change for SetOps, what are
your thoughts on being more explicit about the setop strategy being
used?  Showing "SetOp Except" isn't that informative.  You've got to
look at the non-text format to know it's using the merge method.  How
about "MergeSetOp Except"?

5. You might not be too worried, but in SetOpState if you move
need_init up to below setop_done, you'll close a 7-byte hole in that
struct.
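
As a rough illustration of point 5, here is a sketch of how field order
can open or close padding holes.  The struct and field names below are
hypothetical stand-ins, not the real SetOpState layout, and the exact
sizes depend on the platform ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout: each lone bool is followed by padding so that the
 * next pointer field can be aligned. */
typedef struct DemoSpread
{
    bool        done;
    void       *table;
    bool        need_init;
    void       *iter;
} DemoSpread;

/* Same fields, but the two bools are adjacent and share one padded slot. */
typedef struct DemoPacked
{
    bool        done;
    bool        need_init;
    void       *table;
    void       *iter;
} DemoPacked;

/* Bytes saved by grouping the small fields together. */
size_t
demo_bytes_saved(void)
{
    assert(sizeof(DemoSpread) >= sizeof(DemoPacked));
    return sizeof(DemoSpread) - sizeof(DemoPacked);
}
```

On common 64-bit ABIs the second layout should come out smaller, since
the second bool lands in what was padding after the first.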

I'm also happy if you just push the 0004 patch. Are you planning to
backpatch that? If I'd realised you were proposing to remove that
function, I'd not have fixed the typo Richard mentioned.

David

Re: Converting SetOp to read its two inputs separately

From

Tom Lane

Date:

29 November 2024, 22:44:25

David Rowley <dgrowleyml@gmail.com> writes:
> Here's a quick review of all 5 patches together.

Thanks for looking at it!

> 1. In setop_load_group(), the primary concern with the result of
> setop_compare_slots() seems to be if the tuples match or not. If
> you're not too concerned with keeping the Assert checking for
> mis-sorts, then it could be much more efficient for many cases to
> check the final column first.

Hm.  setop_compare_slots() has to deliver a 3-way compare result for
its usage in setop_retrieve_sorted(), where we're actually doing
a merge.  We could potentially use a different subroutine within
setop_load_group(), but I'm not sure it's worth the added complexity.
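
To make the trade-off concrete, here is a minimal sketch that uses plain
int arrays as stand-ins for tuples.  It does not use the executor's slot
or SortSupport APIs, and both function names are hypothetical.  A merge
needs the sign of the comparison, so it has to walk the columns in sort
order; a pure same-group test can look at the last column first, on the
assumption that it is the one most likely to differ between neighbouring
rows of sorted input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Three-way comparison in sort-key order: returns <0, 0, or >0. */
int
rows_compare(const int *a, const int *b, size_t ncols)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < ncols; i++)
    {
        if (a[i] != b[i])
            return (a[i] < b[i]) ? -1 : 1;
    }
    return 0;
}

/* Equality test only: check columns from last to first and stop at the
 * first mismatch.  Cannot tell which row sorts first. */
bool
rows_equal_lastfirst(const int *a, const int *b, size_t ncols)
{
    assert(a != NULL && b != NULL);
    for (size_t i = ncols; i > 0; i--)
    {
        if (a[i - 1] != b[i - 1])
            return false;
    }
    return true;
}
```

The second routine is cheaper for group-boundary detection but gives up
the ordering information that the merge step (and any mis-sort Assert)
would need.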

> 2. (related to #1) I'm unsure if it's also worth deforming all the
> needed attributes in one go rather than calling slot_getattr() each
> time, which will result in incrementally deforming the tuple.

Good point; probably worth doing given that the low-order
column is often going to determine the result.

>> /* XXX hack: force the tuple into minimal form */
>> /* XXX should not have to do this */

> nodeAgg.c seems to do this by using prepare_hash_slot() which deforms
> the heap tuple and sets the Datums verbatim rather than making copies
> of any byref ones.

I'll take a look at that, thanks for the pointer.

> 4. Since the EXPLAIN output is going to change for SetOps, what are
> your thoughts on being more explicit about the setop strategy being
> used?  Showing "SetOp Except" isn't that informative.  You've got to
> look at the non-text format to know it's using the merge method.  How
> about "MergeSetOp Except"?

I'm confused here.  explain.c already uses "SetOp" versus "HashSetOp"
to show the strategy.  Are you saying we should instead print
"MergeSetOp" and "HashSetOp"?  That seems like change for the sake
of change, really.

> I'm also happy if you just push the 0004 patch. Are you planning to
> backpatch that?

I've been debating whether to propose a back-patch of that.  It's
clearly a bug fix in the sense that the current code isn't doing
what was intended.  On the other hand, it isn't (I think) resulting
in any wrong answers, just plans that are a bit less efficient
than they could be, and perhaps some planning time wasted on making
Paths that can't be used.  And generally we refrain from making
unnecessary changes of plan choices in released branches, because
users value plan stability.  So on the whole I'm leaning towards
not back-patching, but I could be persuaded otherwise if anyone
feels strongly that we should.

            regards, tom lane

Re: Converting SetOp to read its two inputs separately

From

Tom Lane

Date:

30 November 2024, 02:36:35

I wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
>> nodeAgg.c seems to do this by using prepare_hash_slot() which deforms
>> the heap tuple and sets the Datums verbatim rather than making copies
>> of any byref ones.

> I'll take a look at that, thanks for the pointer.

So after digging into it, I realized that I'd been completely misled
by setop_fill_hash_table's not failing while reading the first input.
That's not because the code was correct, it was because none of our
test cases manage to reach TupleHashTableMatch() while reading the
first input.  Apparently there are no regression test cases where
tuples of the first input have identical hash codes.  But as soon
as I tried a case with duplicate rows in the first input, kaboom!
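
For readers wondering how a test suite can miss this, here is a toy
sketch of the effect.  It is not PostgreSQL's tuple hash table: every
name is hypothetical, and the hash function is an assumption chosen so
that distinct keys get distinct hash codes.  The table compares stored
hash codes first and only calls the equality callback when the codes are
identical, so input without duplicates never exercises that callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NBUCKETS 64         /* power of two; never filled in this demo */

typedef struct ToyEntry
{
    bool        used;
    uint32_t    hash;           /* full hash code, stored with the entry */
    int         key;
    long        count;
} ToyEntry;

typedef struct ToyTable
{
    ToyEntry    buckets[TOY_NBUCKETS];
    long        match_calls;    /* how often the equality callback ran */
} ToyTable;

/* Assumed hash: multiplying by an odd constant is one-to-one on uint32. */
static uint32_t
toy_hash(int key)
{
    return (uint32_t) key * 2654435761u;
}

/* Equality callback; counts its own invocations. */
static bool
toy_match(ToyTable *t, int a, int b)
{
    t->match_calls++;
    return a == b;
}

/* Find or create the entry for key (linear probing) and bump its count. */
static void
toy_insert(ToyTable *t, int key)
{
    uint32_t    h = toy_hash(key);
    size_t      i = h & (TOY_NBUCKETS - 1);

    for (;;)
    {
        ToyEntry   *e = &t->buckets[i];

        if (!e->used)
        {
            e->used = true;
            e->hash = h;
            e->key = key;
            e->count = 1;
            return;
        }
        /* Cheap hash-code test first; only equal codes reach toy_match */
        if (e->hash == h && toy_match(t, e->key, key))
        {
            e->count++;
            return;
        }
        i = (i + 1) & (TOY_NBUCKETS - 1);
    }
}

/* Load keys into a fresh table and report how often toy_match ran. */
long
toy_match_calls_for(const int *keys, size_t n)
{
    ToyTable    t;

    assert(n < TOY_NBUCKETS);   /* keep at least one bucket free */
    memset(&t, 0, sizeof(t));
    for (size_t i = 0; i < n; i++)
        toy_insert(&t, keys[i]);
    return t.match_calls;
}
```

With all-distinct keys the callback count stays at zero; each repeated
key triggers one call.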

The short answer therefore is that LookupTupleHashEntry() absolutely
requires a MinimalTuple slot as input, and there is not some magic
code path whereby it doesn't.  So I put in some code similar to
prepare_hash_slot() to do that as efficiently as can be managed.

I also switched to using slot_getallattrs() per your suggestion.
Other than that and rebasing, this is the same as v2.

            regards, tom lane

From 2f615ab929c6d130420a41253440a02865d28e6b Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 29 Nov 2024 18:14:42 -0500
Subject: [PATCH v3 1/5] Convert SetOp to read its inputs as outerPlan and
 innerPlan.

The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.

This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure how to quantify how
many more comparisons we might need.

For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.
---
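
Reader's note (not part of the commit message): the SORTED-mode logic
described above can be sketched roughly as follows.  This is a
simplified stand-in, not the code in nodeSetOp.c: ints replace tuples,
arrays replace child plans, and all names are hypothetical.  The
output-count rules are the usual SQL semantics for INTERSECT [ALL] and
EXCEPT [ALL].

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
    TOY_INTERSECT,
    TOY_INTERSECT_ALL,
    TOY_EXCEPT,
    TOY_EXCEPT_ALL
} ToyCmd;

/* Count the run of values equal to v[*pos], advancing *pos past it. */
static long
toy_load_group(const int *v, size_t n, size_t *pos)
{
    size_t      start = *pos;

    while (*pos < n && v[*pos] == v[start])
        (*pos)++;
    return (long) (*pos - start);
}

/* How many copies of a group to emit, given per-side duplicate counts. */
static long
toy_output_count(ToyCmd cmd, long nleft, long nright)
{
    switch (cmd)
    {
        case TOY_INTERSECT:
            return (nleft > 0 && nright > 0) ? 1 : 0;
        case TOY_INTERSECT_ALL:
            return (nleft < nright) ? nleft : nright;
        case TOY_EXCEPT:
            return (nleft > 0 && nright == 0) ? 1 : 0;
        case TOY_EXCEPT_ALL:
            return (nleft > nright) ? nleft - nright : 0;
    }
    return 0;
}

/*
 * Merge two sorted inputs group by group.  "out" must have room for nl
 * values.  Returns the number of values written.
 */
size_t
toy_setop_sorted(ToyCmd cmd, const int *left, size_t nl,
                 const int *right, size_t nr, int *out)
{
    size_t      lpos = 0,
                rpos = 0,
                nout = 0;

    assert(out != NULL || nl == 0);

    /* Stop as soon as the left input is exhausted */
    while (lpos < nl)
    {
        int         lval = left[lpos];
        long        nleft;
        long        nright = 0;

        /* Right values sorting before the left group have no left match */
        while (rpos < nr && right[rpos] < lval)
            rpos++;

        nleft = toy_load_group(left, nl, &lpos);
        if (rpos < nr && right[rpos] == lval)
            nright = toy_load_group(right, nr, &rpos);

        for (long k = toy_output_count(cmd, nleft, nright); k > 0; k--)
            out[nout++] = lval;
    }
    return nout;
}
```

An empty left input makes the loop exit immediately without touching the
right input, which mirrors the early-exit behavior the commit message
describes.
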
 src/backend/executor/nodeSetOp.c        | 555 ++++++++++++++----------
 src/backend/optimizer/README            |   2 +-
 src/backend/optimizer/plan/createplan.c |  81 ++--
 src/backend/optimizer/prep/prepunion.c  | 156 ++++---
 src/backend/optimizer/util/pathnode.c   |  50 ++-
 src/include/nodes/execnodes.h           |  31 +-
 src/include/nodes/pathnodes.h           |   9 +-
 src/include/nodes/plannodes.h           |  19 +-
 src/include/optimizer/pathnode.h        |   7 +-
 src/test/regress/expected/subselect.out |  20 +-
 src/test/regress/expected/union.out     | 226 +++++-----
 src/tools/pgindent/typedefs.list        |   2 +
 12 files changed, 638 insertions(+), 520 deletions(-)

diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index a8ac68b482..54d31a1f89 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -3,29 +3,30 @@
  * nodeSetOp.c
  *      Routines to handle INTERSECT and EXCEPT selection
  *
- * The input of a SetOp node consists of tuples from two relations,
- * which have been combined into one dataset, with a junk attribute added
- * that shows which relation each tuple came from.  In SETOP_SORTED mode,
- * the input has furthermore been sorted according to all the grouping
- * columns (ie, all the non-junk attributes).  The SetOp node scans each
- * group of identical tuples to determine how many came from each input
- * relation.  Then it is a simple matter to emit the output demanded by the
- * SQL spec for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ * The input of a SetOp node consists of two relations (outer and inner)
+ * with identical column sets.  In EXCEPT queries the outer relation is
+ * always the left side, while in INTERSECT cases the planner tries to
+ * make the outer relation be the smaller of the two inputs.
  *
- * In SETOP_HASHED mode, the input is delivered in no particular order,
- * except that we know all the tuples from one input relation will come before
- * all the tuples of the other.  The planner guarantees that the first input
- * relation is the left-hand one for EXCEPT, and tries to make the smaller
- * input relation come first for INTERSECT.  We build a hash table in memory
- * with one entry for each group of identical tuples, and count the number of
- * tuples in the group from each relation.  After seeing all the input, we
- * scan the hashtable and generate the correct output using those counts.
- * We can avoid making hashtable entries for any tuples appearing only in the
- * second input relation, since they cannot result in any output.
+ * In SETOP_SORTED mode, each input has been sorted according to all the
+ * grouping columns (ie, all the non-junk attributes).  The SetOp node
+ * essentially performs a merge join on the grouping columns, except that
+ * it is only interested in counting how many tuples from each input match.
+ * Then it is a simple matter to emit the output demanded by the SQL spec
+ * for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ *
+ * In SETOP_HASHED mode, the inputs are delivered in no particular order.
+ * We read the outer relation and build a hash table in memory with one entry
+ * for each group of identical tuples, counting the number of tuples in the
+ * group.  Then we read the inner relation and count the number of tuples
+ * matching each outer group.  (We can disregard any tuples appearing only
+ * in the inner relation, since they cannot result in any output.)  After
+ * seeing all the input, we scan the hashtable and generate the correct
+ * output using those counts.
  *
  * This node type is not used for UNION or UNION ALL, since those can be
- * implemented more cheaply (there's no need for the junk attribute to
- * identify the source relation).
+ * implemented more cheaply (there's no need to count the number of
+ * matching tuples).
  *
  * Note that SetOp does no qual checking nor projection.  The delivered
  * output tuples are just copies of the first-to-arrive tuple in each
@@ -54,65 +55,30 @@
 /*
  * SetOpStatePerGroupData - per-group working state
  *
- * These values are working state that is initialized at the start of
- * an input tuple group and updated for each input tuple.
- *
- * In SETOP_SORTED mode, we need only one of these structs, and it's kept in
- * the plan state node.  In SETOP_HASHED mode, the hash table contains one
- * of these for each tuple group.
+ * In SETOP_SORTED mode, we need only one of these structs, and it's just a
+ * local in setop_retrieve_sorted.  In SETOP_HASHED mode, the hash table
+ * contains one of these for each tuple group.
  */
 typedef struct SetOpStatePerGroupData
 {
-    long        numLeft;        /* number of left-input dups in group */
-    long        numRight;        /* number of right-input dups in group */
-}            SetOpStatePerGroupData;
+    int64        numLeft;        /* number of left-input dups in group */
+    int64        numRight;        /* number of right-input dups in group */
+} SetOpStatePerGroupData;
+
+typedef SetOpStatePerGroupData *SetOpStatePerGroup;


-static TupleTableSlot *setop_retrieve_direct(SetOpState *setopstate);
+static TupleTableSlot *setop_retrieve_sorted(SetOpState *setopstate);
+static void setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                             SetOpState *setopstate);
+static int    setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                                SetOpState *setopstate);
+static TupleTableSlot *setop_load_hash_tuple(SetOpState *setopstate,
+                                             TupleTableSlot *inputslot);
 static void setop_fill_hash_table(SetOpState *setopstate);
 static TupleTableSlot *setop_retrieve_hash_table(SetOpState *setopstate);


-/*
- * Initialize state for a new group of input values.
- */
-static inline void
-initialize_counts(SetOpStatePerGroup pergroup)
-{
-    pergroup->numLeft = pergroup->numRight = 0;
-}
-
-/*
- * Advance the appropriate counter for one input tuple.
- */
-static inline void
-advance_counts(SetOpStatePerGroup pergroup, int flag)
-{
-    if (flag)
-        pergroup->numRight++;
-    else
-        pergroup->numLeft++;
-}
-
-/*
- * Fetch the "flag" column from an input tuple.
- * This is an integer column with value 0 for left side, 1 for right side.
- */
-static int
-fetch_tuple_flag(SetOpState *setopstate, TupleTableSlot *inputslot)
-{
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
-    int            flag;
-    bool        isNull;
-
-    flag = DatumGetInt32(slot_getattr(inputslot,
-                                      node->flagColIdx,
-                                      &isNull));
-    Assert(!isNull);
-    Assert(flag == 0 || flag == 1);
-    return flag;
-}
-
 /*
  * Initialize the hash table to empty.
  */
@@ -129,10 +95,10 @@ build_hash_table(SetOpState *setopstate)
     setopstate->hashtable = BuildTupleHashTableExt(&setopstate->ps,
                                                    desc,
                                                    node->numCols,
-                                                   node->dupColIdx,
+                                                   node->cmpColIdx,
                                                    setopstate->eqfuncoids,
                                                    setopstate->hashfunctions,
-                                                   node->dupCollations,
+                                                   node->cmpCollations,
                                                    node->numGroups,
                                                    0,
                                                    setopstate->ps.state->es_query_cxt,
@@ -217,108 +183,126 @@ ExecSetOp(PlanState *pstate)
         return setop_retrieve_hash_table(node);
     }
     else
-        return setop_retrieve_direct(node);
+        return setop_retrieve_sorted(node);
 }

 /*
  * ExecSetOp for non-hashed case
  */
 static TupleTableSlot *
-setop_retrieve_direct(SetOpState *setopstate)
+setop_retrieve_sorted(SetOpState *setopstate)
 {
     PlanState  *outerPlan;
-    SetOpStatePerGroup pergroup;
-    TupleTableSlot *outerslot;
+    PlanState  *innerPlan;
     TupleTableSlot *resultTupleSlot;
-    ExprContext *econtext = setopstate->ps.ps_ExprContext;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    pergroup = (SetOpStatePerGroup) setopstate->pergroup;
+    innerPlan = innerPlanState(setopstate);
     resultTupleSlot = setopstate->ps.ps_ResultTupleSlot;

     /*
-     * We loop retrieving groups until we find one we should return
+     * If first time through, establish the invariant that setop_load_group
+     * expects: each side's nextTupleSlot is the next output from the child
+     * plan, or empty if there is no more output from it.
      */
-    while (!setopstate->setop_done)
+    if (setopstate->need_init)
     {
+        setopstate->need_init = false;
+
+        setopstate->leftInput.nextTupleSlot = ExecProcNode(outerPlan);
+
         /*
-         * If we don't already have the first tuple of the new group, fetch it
-         * from the outer plan.
+         * If the outer relation is empty, then we will emit nothing, and we
+         * don't need to read the inner relation at all.
          */
-        if (setopstate->grp_firstTuple == NULL)
+        if (TupIsNull(setopstate->leftInput.nextTupleSlot))
         {
-            outerslot = ExecProcNode(outerPlan);
-            if (!TupIsNull(outerslot))
-            {
-                /* Make a copy of the first input tuple */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-            }
-            else
-            {
-                /* outer plan produced no tuples at all */
-                setopstate->setop_done = true;
-                return NULL;
-            }
+            setopstate->setop_done = true;
+            return NULL;
         }

+        setopstate->rightInput.nextTupleSlot = ExecProcNode(innerPlan);
+
+        /* Set flags that we've not completed either side's group */
+        setopstate->leftInput.needGroup = true;
+        setopstate->rightInput.needGroup = true;
+    }
+
+    /*
+     * We loop retrieving groups until we find one we should return
+     */
+    while (!setopstate->setop_done)
+    {
+        int            cmpresult;
+        SetOpStatePerGroupData pergroup;
+
         /*
-         * Store the copied first input tuple in the tuple table slot reserved
-         * for it.  The tuple will be deleted when it is cleared from the
-         * slot.
+         * Fetch the rest of the current outer group, if we didn't already.
          */
-        ExecStoreHeapTuple(setopstate->grp_firstTuple,
-                           resultTupleSlot,
-                           true);
-        setopstate->grp_firstTuple = NULL;    /* don't keep two pointers */
+        if (setopstate->leftInput.needGroup)
+            setop_load_group(&setopstate->leftInput, outerPlan, setopstate);

-        /* Initialize working state for a new input tuple group */
-        initialize_counts(pergroup);
+        /*
+         * If no more outer groups, we're done, and don't need to look at any
+         * more of the inner relation.
+         */
+        if (setopstate->leftInput.numTuples == 0)
+        {
+            setopstate->setop_done = true;
+            break;
+        }

-        /* Count the first input tuple */
-        advance_counts(pergroup,
-                       fetch_tuple_flag(setopstate, resultTupleSlot));
+        /*
+         * Fetch the rest of the current inner group, if we didn't already.
+         */
+        if (setopstate->rightInput.needGroup)
+            setop_load_group(&setopstate->rightInput, innerPlan, setopstate);

         /*
-         * Scan the outer plan until we exhaust it or cross a group boundary.
+         * Determine whether we have matching groups on both sides (this is
+         * basically like the core logic of a merge join).
          */
-        for (;;)
-        {
-            outerslot = ExecProcNode(outerPlan);
-            if (TupIsNull(outerslot))
-            {
-                /* no more outer-plan tuples available */
-                setopstate->setop_done = true;
-                break;
-            }
-
-            /*
-             * Check whether we've crossed a group boundary.
-             */
-            econtext->ecxt_outertuple = resultTupleSlot;
-            econtext->ecxt_innertuple = outerslot;
-
-            if (!ExecQualAndReset(setopstate->eqfunction, econtext))
-            {
-                /*
-                 * Save the first input tuple of the next group.
-                 */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-                break;
-            }
+        if (setopstate->rightInput.numTuples == 0)
+            cmpresult = -1;        /* as though left input is lesser */
+        else
+            cmpresult = setop_compare_slots(setopstate->leftInput.firstTupleSlot,
+                                            setopstate->rightInput.firstTupleSlot,
+                                            setopstate);

-            /* Still in same group, so count this tuple */
-            advance_counts(pergroup,
-                           fetch_tuple_flag(setopstate, outerslot));
+        if (cmpresult < 0)
+        {
+            /* Left group is first, has no right matches */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = 0;
+            /* We'll need another left group next time */
+            setopstate->leftInput.needGroup = true;
+        }
+        else if (cmpresult == 0)
+        {
+            /* We have matching groups */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = setopstate->rightInput.numTuples;
+            /* We'll need to read from both sides next time */
+            setopstate->leftInput.needGroup = true;
+            setopstate->rightInput.needGroup = true;
+        }
+        else
+        {
+            /* Right group has no left matches, so we can ignore it */
+            setopstate->rightInput.needGroup = true;
+            continue;
         }

         /*
-         * Done scanning input tuple group.  See if we should emit any copies
-         * of result tuple, and if so return the first copy.
+         * Done scanning these input tuple groups.  See if we should emit any
+         * copies of result tuple, and if so return the first copy.  (Note
+         * that the result tuple is the same as the left input's firstTuple
+         * slot.)
          */
-        set_output_count(setopstate, pergroup);
+        set_output_count(setopstate, &pergroup);

         if (setopstate->numOutput > 0)
         {
@@ -333,84 +317,195 @@ setop_retrieve_direct(SetOpState *setopstate)
 }

 /*
- * ExecSetOp for hashed case: phase 1, read input and build hash table
+ * Load next group of tuples from one child plan or the other.
+ *
+ * On entry, we've already read the first tuple of the next group
+ * (if there is one) into input->nextTupleSlot.  This invariant
+ * is maintained on exit.
+ */
+static void
+setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                 SetOpState *setopstate)
+{
+    input->needGroup = false;
+
+    /* If we've exhausted this child plan, report an empty group */
+    if (TupIsNull(input->nextTupleSlot))
+    {
+        ExecClearTuple(input->firstTupleSlot);
+        input->numTuples = 0;
+        return;
+    }
+
+    /* Make a local copy of the first tuple for comparisons */
+    ExecStoreHeapTuple(ExecCopySlotHeapTuple(input->nextTupleSlot),
+                       input->firstTupleSlot,
+                       true);
+    /* and count it */
+    input->numTuples = 1;
+
+    /* Scan till we find the end-of-group */
+    for (;;)
+    {
+        int            cmpresult;
+
+        /* Get next input tuple, if there is one */
+        input->nextTupleSlot = ExecProcNode(inputPlan);
+        if (TupIsNull(input->nextTupleSlot))
+            break;
+
+        /* There is; does it belong to same group as firstTuple? */
+        cmpresult = setop_compare_slots(input->firstTupleSlot,
+                                        input->nextTupleSlot,
+                                        setopstate);
+        Assert(cmpresult <= 0); /* else input is mis-sorted */
+        if (cmpresult != 0)
+            break;
+
+        /* Still in same group, so count this tuple */
+        input->numTuples++;
+    }
+}
+
+/*
+ * Compare the tuples in the two given slots.
+ */
+static int
+setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                    SetOpState *setopstate)
+{
+    /* We'll often need to fetch all the columns, so just do it */
+    slot_getallattrs(s1);
+    slot_getallattrs(s2);
+    for (int nkey = 0; nkey < setopstate->numCols; nkey++)
+    {
+        SortSupport sortKey = setopstate->sortKeys + nkey;
+        AttrNumber    attno = sortKey->ssup_attno;
+        Datum        datum1 = s1->tts_values[attno - 1],
+                    datum2 = s2->tts_values[attno - 1];
+        bool        isNull1 = s1->tts_isnull[attno - 1],
+                    isNull2 = s2->tts_isnull[attno - 1];
+        int            compare;
+
+        compare = ApplySortComparator(datum1, isNull1,
+                                      datum2, isNull2,
+                                      sortKey);
+        if (compare != 0)
+            return compare;
+    }
+    return 0;
+}
+
+/*
+ * Prepare an input tuple for hashtable lookup.
+ *
+ * We could use the input slot directly except that LookupTupleHashEntry
+ * only accepts MinimalTuple slots, which is typically not what our child
+ * plans will return.  We abuse the SetOp node's ResultTupleSlot as a
+ * temporary slot for this purpose.
+ */
+static TupleTableSlot *
+setop_load_hash_tuple(SetOpState *setopstate, TupleTableSlot *inputslot)
+{
+    TupleTableSlot *resultslot = setopstate->ps.ps_ResultTupleSlot;
+    int            natts = resultslot->tts_tupleDescriptor->natts;
+
+    slot_getallattrs(inputslot);
+    ExecClearTuple(resultslot);
+    for (int i = 0; i < natts; i++)
+    {
+        resultslot->tts_values[i] = inputslot->tts_values[i];
+        resultslot->tts_isnull[i] = inputslot->tts_isnull[i];
+    }
+    ExecStoreVirtualTuple(resultslot);
+    return resultslot;
+}
+
+/*
+ * ExecSetOp for hashed case: phase 1, read inputs and build hash table
  */
 static void
 setop_fill_hash_table(SetOpState *setopstate)
 {
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
     PlanState  *outerPlan;
-    int            firstFlag;
-    bool        in_first_rel PG_USED_FOR_ASSERTS_ONLY;
+    PlanState  *innerPlan;
     ExprContext *econtext = setopstate->ps.ps_ExprContext;
+    bool        have_tuples = false;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    firstFlag = node->firstFlag;
-    /* verify planner didn't mess up */
-    Assert(firstFlag == 0 ||
-           (firstFlag == 1 &&
-            (node->cmd == SETOPCMD_INTERSECT ||
-             node->cmd == SETOPCMD_INTERSECT_ALL)));
+    innerPlan = innerPlanState(setopstate);

     /*
      * Process each outer-plan tuple, and then fetch the next one, until we
      * exhaust the outer plan.
      */
-    in_first_rel = true;
     for (;;)
     {
         TupleTableSlot *outerslot;
-        int            flag;
         TupleHashEntryData *entry;
         bool        isnew;

         outerslot = ExecProcNode(outerPlan);
         if (TupIsNull(outerslot))
             break;
+        have_tuples = true;

-        /* Identify whether it's left or right input */
-        flag = fetch_tuple_flag(setopstate, outerslot);
+        /* Find or build hashtable entry for this tuple's group */
+        entry = LookupTupleHashEntry(setopstate->hashtable,
+                                     setop_load_hash_tuple(setopstate,
+                                                           outerslot),
+                                     &isnew, NULL);

-        if (flag == firstFlag)
+        /* If new tuple group, initialize counts to zero */
+        if (isnew)
         {
-            /* (still) in first input relation */
-            Assert(in_first_rel);
-
-            /* Find or build hashtable entry for this tuple's group */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
-                                         &isnew, NULL);
-
-            /* If new tuple group, initialize counts */
-            if (isnew)
-            {
-                entry->additional = (SetOpStatePerGroup)
-                    MemoryContextAlloc(setopstate->hashtable->tablecxt,
+            entry->additional = (SetOpStatePerGroup)
+                MemoryContextAllocZero(setopstate->hashtable->tablecxt,
                                        sizeof(SetOpStatePerGroupData));
-                initialize_counts((SetOpStatePerGroup) entry->additional);
-            }
-
-            /* Advance the counts */
-            advance_counts((SetOpStatePerGroup) entry->additional, flag);
         }
-        else
+
+        /* Advance the counts */
+        ((SetOpStatePerGroup) entry->additional)->numLeft++;
+
+        /* Must reset expression context after each hashtable lookup */
+        ResetExprContext(econtext);
+    }
+
+    /*
+     * If the outer relation is empty, then we will emit nothing, and we don't
+     * need to read the inner relation at all.
+     */
+    if (have_tuples)
+    {
+        /*
+         * Process each inner-plan tuple, and then fetch the next one, until
+         * we exhaust the inner plan.
+         */
+        for (;;)
         {
-            /* reached second relation */
-            in_first_rel = false;
+            TupleTableSlot *innerslot;
+            TupleHashEntryData *entry;
+
+            innerslot = ExecProcNode(innerPlan);
+            if (TupIsNull(innerslot))
+                break;

             /* For tuples not seen previously, do not make hashtable entry */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+            entry = LookupTupleHashEntry(setopstate->hashtable,
+                                         setop_load_hash_tuple(setopstate,
+                                                               innerslot),
                                          NULL, NULL);

             /* Advance the counts if entry is already present */
             if (entry)
-                advance_counts((SetOpStatePerGroup) entry->additional, flag);
-        }
+                ((SetOpStatePerGroup) entry->additional)->numRight++;

-        /* Must reset expression context after each hashtable lookup */
-        ResetExprContext(econtext);
+            /* Must reset expression context after each hashtable lookup */
+            ResetExprContext(econtext);
+        }
     }

     setopstate->table_filled = true;
@@ -481,7 +576,6 @@ SetOpState *
 ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 {
     SetOpState *setopstate;
-    TupleDesc    outerDesc;

     /* check for unsupported flags */
     Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -494,14 +588,10 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     setopstate->ps.state = estate;
     setopstate->ps.ExecProcNode = ExecSetOp;

-    setopstate->eqfuncoids = NULL;
-    setopstate->hashfunctions = NULL;
     setopstate->setop_done = false;
     setopstate->numOutput = 0;
-    setopstate->pergroup = NULL;
-    setopstate->grp_firstTuple = NULL;
-    setopstate->hashtable = NULL;
-    setopstate->tableContext = NULL;
+    setopstate->numCols = node->numCols;
+    setopstate->need_init = true;

     /*
      * create expression context
@@ -522,52 +612,74 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * initialize child nodes
      *
-     * If we are hashing then the child plan does not need to handle REWIND
+     * If we are hashing then the child plans do not need to handle REWIND
      * efficiently; see ExecReScanSetOp.
      */
     if (node->strategy == SETOP_HASHED)
         eflags &= ~EXEC_FLAG_REWIND;
     outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
-    outerDesc = ExecGetResultType(outerPlanState(setopstate));
+    innerPlanState(setopstate) = ExecInitNode(innerPlan(node), estate, eflags);

     /*
-     * Initialize result slot and type. Setop nodes do no projections, so
-     * initialize projection info for this node appropriately.
+     * Initialize locally-allocated slots.  In hashed mode, we just need a
+     * result slot.  In sorted mode, we need one first-tuple-of-group slot for
+     * each input; we use the result slot for the left input's slot and create
+     * another for the right input.  (Note: the nextTupleSlot slots are not
+     * ours, but just point to the last slot returned by the input plan node.)
      */
     ExecInitResultTupleSlotTL(&setopstate->ps,
                               node->strategy == SETOP_HASHED ?
                               &TTSOpsMinimalTuple : &TTSOpsHeapTuple);
+    if (node->strategy != SETOP_HASHED)
+    {
+        setopstate->leftInput.firstTupleSlot =
+            setopstate->ps.ps_ResultTupleSlot;
+        setopstate->rightInput.firstTupleSlot =
+            ExecInitExtraTupleSlot(estate,
+                                   setopstate->ps.ps_ResultTupleDesc,
+                                   &TTSOpsHeapTuple);
+    }
+
+    /* Setop nodes do no projections. */
     setopstate->ps.ps_ProjInfo = NULL;

     /*
-     * Precompute fmgr lookup data for inner loop. We need both equality and
-     * hashing functions to do it by hashing, but only equality if not
-     * hashing.
+     * Precompute fmgr lookup data for inner loop.  We need equality and
+     * hashing functions to do it by hashing, while for sorting we need
+     * SortSupport data.
      */
     if (node->strategy == SETOP_HASHED)
         execTuplesHashPrepare(node->numCols,
-                              node->dupOperators,
+                              node->cmpOperators,
                               &setopstate->eqfuncoids,
                               &setopstate->hashfunctions);
     else
-        setopstate->eqfunction =
-            execTuplesMatchPrepare(outerDesc,
-                                   node->numCols,
-                                   node->dupColIdx,
-                                   node->dupOperators,
-                                   node->dupCollations,
-                                   &setopstate->ps);
+    {
+        int            nkeys = node->numCols;
+
+        setopstate->sortKeys = (SortSupport)
+            palloc0(nkeys * sizeof(SortSupportData));
+        for (int i = 0; i < nkeys; i++)
+        {
+            SortSupport sortKey = setopstate->sortKeys + i;

+            sortKey->ssup_cxt = CurrentMemoryContext;
+            sortKey->ssup_collation = node->cmpCollations[i];
+            sortKey->ssup_nulls_first = node->cmpNullsFirst[i];
+            sortKey->ssup_attno = node->cmpColIdx[i];
+            /* abbreviated key conversion is not useful here */
+            sortKey->abbreviate = false;
+
+            PrepareSortSupportFromOrderingOp(node->cmpOperators[i], sortKey);
+        }
+    }
+
+    /* Create a hash table if needed */
     if (node->strategy == SETOP_HASHED)
     {
         build_hash_table(setopstate);
         setopstate->table_filled = false;
     }
-    else
-    {
-        setopstate->pergroup =
-            (SetOpStatePerGroup) palloc0(sizeof(SetOpStatePerGroupData));
-    }

     return setopstate;
 }
@@ -575,7 +687,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 /* ----------------------------------------------------------------
  *        ExecEndSetOp
  *
- *        This shuts down the subplan and frees resources allocated
+ *        This shuts down the subplans and frees resources allocated
  *        to this node.
  * ----------------------------------------------------------------
  */
@@ -587,6 +699,7 @@ ExecEndSetOp(SetOpState *node)
         MemoryContextDelete(node->tableContext);

     ExecEndNode(outerPlanState(node));
+    ExecEndNode(innerPlanState(node));
 }


@@ -594,6 +707,7 @@ void
 ExecReScanSetOp(SetOpState *node)
 {
     PlanState  *outerPlan = outerPlanState(node);
+    PlanState  *innerPlan = innerPlanState(node);

     ExecClearTuple(node->ps.ps_ResultTupleSlot);
     node->setop_done = false;
@@ -611,34 +725,29 @@ ExecReScanSetOp(SetOpState *node)
             return;

         /*
-         * If we do have the hash table and the subplan does not have any
+         * If we do have the hash table and the subplans do not have any
          * parameter changes, then we can just rescan the existing hash table;
          * no need to build it again.
          */
-        if (outerPlan->chgParam == NULL)
+        if (outerPlan->chgParam == NULL && innerPlan->chgParam == NULL)
         {
             ResetTupleHashIterator(node->hashtable, &node->hashiter);
             return;
         }
-    }

-    /* Release first tuple of group, if we have made a copy */
-    if (node->grp_firstTuple != NULL)
-    {
-        heap_freetuple(node->grp_firstTuple);
-        node->grp_firstTuple = NULL;
-    }
-
-    /* Release any hashtable storage */
-    if (node->tableContext)
-        MemoryContextReset(node->tableContext);
+        /* Release any hashtable storage */
+        if (node->tableContext)
+            MemoryContextReset(node->tableContext);

-    /* And rebuild empty hashtable if needed */
-    if (((SetOp *) node->ps.plan)->strategy == SETOP_HASHED)
-    {
+        /* And rebuild an empty hashtable */
         ResetTupleHashTable(node->hashtable);
         node->table_filled = false;
     }
+    else
+    {
+        /* Need to re-read first input from each side */
+        node->need_init = true;
+    }

     /*
      * if chgParam of subnode is not null then plan will be re-scanned by
@@ -646,4 +755,6 @@ ExecReScanSetOp(SetOpState *node)
      */
     if (outerPlan->chgParam == NULL)
         ExecReScan(outerPlan);
+    if (innerPlan->chgParam == NULL)
+        ExecReScan(innerPlan);
 }
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 2ab4f3dbf3..f341d9f303 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -649,7 +649,7 @@ RelOptInfo      - a relation or joined relations
   GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
   MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
   WindowAggPath - a WindowAgg plan node applied to some sub-path
-  SetOpPath     - a SetOp plan node applied to some sub-path
+  SetOpPath     - a SetOp plan node applied to two sub-paths
   RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
   LockRowsPath  - a LockRows plan node applied to some sub-path
   ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 178c572b02..b3e2294e84 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -301,9 +301,9 @@ static Unique *make_unique_from_pathkeys(Plan *lefttree,
                                          List *pathkeys, int numCols);
 static Gather *make_gather(List *qptlist, List *qpqual,
                            int nworkers, int rescan_param, bool single_copy, Plan *subplan);
-static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-                         List *distinctList, AttrNumber flagColIdx, int firstFlag,
-                         long numGroups);
+static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+                         List *tlist, Plan *lefttree, Plan *righttree,
+                         List *groupList, long numGroups);
 static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
 static Result *make_result(List *tlist, Node *resconstantqual, Plan *subplan);
 static ProjectSet *make_project_set(List *tlist, Plan *subplan);
@@ -2719,25 +2719,29 @@ static SetOp *
 create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 {
     SetOp       *plan;
-    Plan       *subplan;
+    List       *tlist = build_path_tlist(root, &best_path->path);
+    Plan       *leftplan;
+    Plan       *rightplan;
     long        numGroups;

     /*
      * SetOp doesn't project, so tlist requirements pass through; moreover we
      * need grouping columns to be labeled.
      */
-    subplan = create_plan_recurse(root, best_path->subpath,
-                                  flags | CP_LABEL_TLIST);
+    leftplan = create_plan_recurse(root, best_path->leftpath,
+                                   flags | CP_LABEL_TLIST);
+    rightplan = create_plan_recurse(root, best_path->rightpath,
+                                    flags | CP_LABEL_TLIST);

     /* Convert numGroups to long int --- but 'ware overflow! */
     numGroups = clamp_cardinality_to_long(best_path->numGroups);

     plan = make_setop(best_path->cmd,
                       best_path->strategy,
-                      subplan,
-                      best_path->distinctList,
-                      best_path->flagColIdx,
-                      best_path->firstFlag,
+                      tlist,
+                      leftplan,
+                      rightplan,
+                      best_path->groupList,
                       numGroups);

     copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6950,57 +6954,62 @@ make_gather(List *qptlist,
 }

 /*
- * distinctList is a list of SortGroupClauses, identifying the targetlist
- * items that should be considered by the SetOp filter.  The input path must
- * already be sorted accordingly.
+ * groupList is a list of SortGroupClauses, identifying the targetlist
+ * items that should be considered by the SetOp filter.  The input plans must
+ * already be sorted accordingly, if we're doing SETOP_SORTED mode.
  */
 static SetOp *
-make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-           List *distinctList, AttrNumber flagColIdx, int firstFlag,
-           long numGroups)
+make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+           List *tlist, Plan *lefttree, Plan *righttree,
+           List *groupList, long numGroups)
 {
     SetOp       *node = makeNode(SetOp);
     Plan       *plan = &node->plan;
-    int            numCols = list_length(distinctList);
+    int            numCols = list_length(groupList);
     int            keyno = 0;
-    AttrNumber *dupColIdx;
-    Oid           *dupOperators;
-    Oid           *dupCollations;
+    AttrNumber *cmpColIdx;
+    Oid           *cmpOperators;
+    Oid           *cmpCollations;
+    bool       *cmpNullsFirst;
     ListCell   *slitem;

-    plan->targetlist = lefttree->targetlist;
+    plan->targetlist = tlist;
     plan->qual = NIL;
     plan->lefttree = lefttree;
-    plan->righttree = NULL;
+    plan->righttree = righttree;

     /*
-     * convert SortGroupClause list into arrays of attr indexes and equality
+     * convert SortGroupClause list into arrays of attr indexes and comparison
      * operators, as wanted by executor
      */
-    dupColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
-    dupOperators = (Oid *) palloc(sizeof(Oid) * numCols);
-    dupCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
+    cmpOperators = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpNullsFirst = (bool *) palloc(sizeof(bool) * numCols);

-    foreach(slitem, distinctList)
+    foreach(slitem, groupList)
     {
         SortGroupClause *sortcl = (SortGroupClause *) lfirst(slitem);
         TargetEntry *tle = get_sortgroupclause_tle(sortcl, plan->targetlist);

-        dupColIdx[keyno] = tle->resno;
-        dupOperators[keyno] = sortcl->eqop;
-        dupCollations[keyno] = exprCollation((Node *) tle->expr);
-        Assert(OidIsValid(dupOperators[keyno]));
+        cmpColIdx[keyno] = tle->resno;
+        if (strategy == SETOP_HASHED)
+            cmpOperators[keyno] = sortcl->eqop;
+        else
+            cmpOperators[keyno] = sortcl->sortop;
+        Assert(OidIsValid(cmpOperators[keyno]));
+        cmpCollations[keyno] = exprCollation((Node *) tle->expr);
+        cmpNullsFirst[keyno] = sortcl->nulls_first;
         keyno++;
     }

     node->cmd = cmd;
     node->strategy = strategy;
     node->numCols = numCols;
-    node->dupColIdx = dupColIdx;
-    node->dupOperators = dupOperators;
-    node->dupCollations = dupCollations;
-    node->flagColIdx = flagColIdx;
-    node->firstFlag = firstFlag;
+    node->cmpColIdx = cmpColIdx;
+    node->cmpOperators = cmpOperators;
+    node->cmpCollations = cmpCollations;
+    node->cmpNullsFirst = cmpNullsFirst;
     node->numGroups = numGroups;

     return node;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 9c3822f19a..1c43370005 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -65,7 +65,7 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
 static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *input_path,
+                                Path *lpath, Path *rpath,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
@@ -315,8 +315,8 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * to the corresponding tlist entries of the subplan. However, since
          * the subplan was generated by generate_union_paths() or
          * generate_nonunion_paths(), and hence its tlist was generated by
-         * generate_append_tlist(), this will work.  We just tell
-         * generate_setop_tlist() to use varno 0.
+         * generate_append_tlist() or generate_setop_tlist(), this will work.
+         * We just tell generate_setop_tlist() to use varno 0.
          */
         if (flag >= 0 ||
             !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
@@ -1028,29 +1028,27 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                *path;
     List       *lpath_tlist,
                *rpath_tlist,
-               *tlist_list,
                *tlist,
-               *groupList,
-               *pathlist;
+               *groupList;
     bool        lpath_trivial_tlist,
-                rpath_trivial_tlist;
+                rpath_trivial_tlist,
+                result_trivial_tlist;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
     bool        use_hash;
     SetOpCmd    cmd;
-    int            firstFlag;

     /*
      * Tell children to fetch all tuples.
      */
     root->tuple_fraction = 0.0;

-    /* Recurse on children, ensuring their outputs are marked */
+    /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 0,
+                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,10 +1058,9 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dLeftGroups = lrel->rows;

-    lpath = lrel->cheapest_total_path;
     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 1,
+                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1073,41 +1070,51 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dRightGroups = rrel->rows;

-    rpath = rrel->cheapest_total_path;
-
     /* Undo effects of forcing tuple_fraction to 0 */
     root->tuple_fraction = save_fraction;

     /*
      * For EXCEPT, we must put the left input first.  For INTERSECT, either
      * order should give the same results, and we prefer to put the smaller
-     * input first in order to minimize the size of the hash table in the
-     * hashing case.  "Smaller" means the one with the fewer groups.
+     * input first in order to (a) minimize the size of the hash table in the
+     * hashing case, and (b) improve our chances of exploiting the executor's
+     * fast path for empty left-hand input.  "Smaller" means the one with the
+     * fewer groups.
      */
-    if (op->op == SETOP_EXCEPT || dLeftGroups <= dRightGroups)
-    {
-        pathlist = list_make2(lpath, rpath);
-        tlist_list = list_make2(lpath_tlist, rpath_tlist);
-        firstFlag = 0;
-    }
-    else
+    if (op->op != SETOP_EXCEPT && dLeftGroups > dRightGroups)
     {
-        pathlist = list_make2(rpath, lpath);
-        tlist_list = list_make2(rpath_tlist, lpath_tlist);
-        firstFlag = 1;
+        /* need to swap the two inputs */
+        RelOptInfo *tmprel;
+        List       *tmplist;
+        double        tmpd;
+
+        tmprel = lrel;
+        lrel = rrel;
+        rrel = tmprel;
+        tmplist = lpath_tlist;
+        lpath_tlist = rpath_tlist;
+        rpath_tlist = tmplist;
+        tmpd = dLeftGroups;
+        dLeftGroups = dRightGroups;
+        dRightGroups = tmpd;
     }

+    lpath = lrel->cheapest_total_path;
+    rpath = rrel->cheapest_total_path;
+
     /*
-     * Generate tlist for Append plan node.
+     * Generate tlist for SetOp plan node.
      *
-     * The tlist for an Append plan isn't important as far as the Append is
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
      * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.  In fact, it has to be real enough that the flag
-     * column is shown as a variable not a constant, else setrefs.c will get
-     * confused.
+     * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, true,
-                                  tlist_list, refnames_tlist);
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);

     *pTargetList = tlist;

@@ -1116,12 +1123,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /*
-     * Append the child results together.
-     */
-    path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
-                                       NIL, NULL, 0, false, -1);
-
     /* Identify the grouping semantics */
     groupList = generate_setop_grouplist(op, tlist);

@@ -1140,25 +1141,40 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     else
     {
-        dNumGroups = Min(dLeftGroups, dRightGroups);
+        dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }

     /*
-     * Decide whether to hash or sort, and add a sort node if needed.
+     * Decide whether to hash or sort, and add sort nodes if needed.
      */
-    use_hash = choose_hashed_setop(root, groupList, path,
+    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
                                    dNumGroups, dNumOutputRows,
                                    (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");

     if (groupList && !use_hash)
-        path = (Path *) create_sort_path(root,
-                                         result_rel,
-                                         path,
-                                         make_pathkeys_for_sortclauses(root,
-                                                                       groupList,
-                                                                       tlist),
-                                         -1.0);
+    {
+        List       *pathkeys;
+
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 lpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            lpath = (Path *) create_sort_path(root,
+                                              lpath->parent,
+                                              lpath,
+                                              pathkeys,
+                                              -1.0);
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 rpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            rpath = (Path *) create_sort_path(root,
+                                              rpath->parent,
+                                              rpath,
+                                              pathkeys,
+                                              -1.0);
+    }

     /*
      * Finally, add a SetOp path node to generate the correct output.
@@ -1178,12 +1194,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     path = (Path *) create_setop_path(root,
                                       result_rel,
-                                      path,
+                                      lpath,
+                                      rpath,
                                       cmd,
                                       use_hash ? SETOP_HASHED : SETOP_SORTED,
                                       groupList,
-                                      list_length(op->colTypes) + 1,
-                                      use_hash ? firstFlag : -1,
                                       dNumGroups,
                                       dNumOutputRows);

@@ -1285,10 +1300,13 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)

 /*
  * choose_hashed_setop - should we use hashing for a set operation?
+ *
+ * XXX probably this should go away: just make both paths and let
+ * add_path sort it out.
  */
 static bool
 choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *input_path,
+                    Path *lpath, Path *rpath,
                     double dNumGroups, double dNumOutputRows,
                     const char *construct)
 {
@@ -1327,7 +1345,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * Don't do it if it doesn't look like the hashtable will fit into
      * hash_mem.
      */
-    hashentrysize = MAXALIGN(input_path->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
+    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);

     if (hashentrysize * dNumGroups > hash_mem_limit)
         return false;
@@ -1336,9 +1354,9 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * See if the estimated cost is no more than doing it the other way.
      *
      * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group.  Note that the actual result plan might involve a SetOp or
-     * Unique node, not Agg or Group, but the cost estimates for Agg and Group
-     * should be close enough for our purposes here.
+     * group. XXX NOT TRUE: Note that the actual result plan might involve a
+     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
+     * and Group should be close enough for our purposes here.
      *
      * These path variables are dummies that just hold cost fields; we don't
      * make actual Paths for these steps.
@@ -1346,27 +1364,31 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
     cost_agg(&hashed_p, root, AGG_HASHED, NULL,
              numGroupCols, dNumGroups,
              NIL,
-             input_path->disabled_nodes,
-             input_path->startup_cost, input_path->total_cost,
-             input_path->rows, input_path->pathtarget->width);
+             lpath->disabled_nodes + rpath->disabled_nodes,
+             lpath->startup_cost + rpath->startup_cost,
+             lpath->total_cost + rpath->total_cost,
+             lpath->rows + rpath->rows,
+             lpath->pathtarget->width);

     /*
-     * Now for the sorted case.  Note that the input is *always* unsorted,
-     * since it was made by appending unrelated sub-relations together.
+     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
+     * unsorted, since it was made by appending unrelated sub-relations
+     * together.
      */
-    sorted_p.disabled_nodes = input_path->disabled_nodes;
-    sorted_p.startup_cost = input_path->startup_cost;
-    sorted_p.total_cost = input_path->total_cost;
+    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
+    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
+    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
     /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
     cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
               sorted_p.total_cost,
-              input_path->rows, input_path->pathtarget->width,
+              lpath->rows + rpath->rows,
+              lpath->pathtarget->width,
               0.0, work_mem, -1.0);
     cost_group(&sorted_p, root, numGroupCols, dNumGroups,
                NIL,
                sorted_p.disabled_nodes,
                sorted_p.startup_cost, sorted_p.total_cost,
-               input_path->rows);
+               lpath->rows + rpath->rows);

     /*
      * Now make the decision using the top-level tuple fraction.  First we
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fc97bf6ee2..e52e4b1d67 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3634,25 +3634,26 @@ create_windowagg_path(PlannerInfo *root,
  *      Creates a pathnode that represents computation of INTERSECT or EXCEPT
  *
  * 'rel' is the parent relation associated with the result
- * 'subpath' is the path representing the source of data
+ * 'leftpath' is the path representing the left-hand source of data
+ * 'rightpath' is the path representing the right-hand source of data
  * 'cmd' is the specific semantics (INTERSECT or EXCEPT, with/without ALL)
  * 'strategy' is the implementation strategy (sorted or hashed)
- * 'distinctList' is a list of SortGroupClause's representing the grouping
- * 'flagColIdx' is the column number where the flag column will be, if any
- * 'firstFlag' is the flag value for the first input relation when hashing;
- *        or -1 when sorting
- * 'numGroups' is the estimated number of distinct groups
+ * 'groupList' is a list of SortGroupClause's representing the grouping
+ * 'numGroups' is the estimated number of distinct groups in left-hand input
  * 'outputRows' is the estimated number of output rows
+ *
+ * leftpath and rightpath must produce the same columns.  Moreover, if
+ * strategy is SETOP_SORTED, leftpath and rightpath must both be sorted
+ * by all the grouping columns.
  */
 SetOpPath *
 create_setop_path(PlannerInfo *root,
                   RelOptInfo *rel,
-                  Path *subpath,
+                  Path *leftpath,
+                  Path *rightpath,
                   SetOpCmd cmd,
                   SetOpStrategy strategy,
-                  List *distinctList,
-                  AttrNumber flagColIdx,
-                  int firstFlag,
+                  List *groupList,
                   double numGroups,
                   double outputRows)
 {
@@ -3660,34 +3661,37 @@ create_setop_path(PlannerInfo *root,

     pathnode->path.pathtype = T_SetOp;
     pathnode->path.parent = rel;
-    /* SetOp doesn't project, so use source path's pathtarget */
-    pathnode->path.pathtarget = subpath->pathtarget;
+    pathnode->path.pathtarget = rel->reltarget;
     /* For now, assume we are above any joins, so no parameterization */
     pathnode->path.param_info = NULL;
     pathnode->path.parallel_aware = false;
     pathnode->path.parallel_safe = rel->consider_parallel &&
-        subpath->parallel_safe;
-    pathnode->path.parallel_workers = subpath->parallel_workers;
+        leftpath->parallel_safe && rightpath->parallel_safe;
+    pathnode->path.parallel_workers =
+        leftpath->parallel_workers + rightpath->parallel_workers;
     /* SetOp preserves the input sort order if in sort mode */
     pathnode->path.pathkeys =
-        (strategy == SETOP_SORTED) ? subpath->pathkeys : NIL;
+        (strategy == SETOP_SORTED) ? leftpath->pathkeys : NIL;

-    pathnode->subpath = subpath;
+    pathnode->leftpath = leftpath;
+    pathnode->rightpath = rightpath;
     pathnode->cmd = cmd;
     pathnode->strategy = strategy;
-    pathnode->distinctList = distinctList;
-    pathnode->flagColIdx = flagColIdx;
-    pathnode->firstFlag = firstFlag;
+    pathnode->groupList = groupList;
     pathnode->numGroups = numGroups;

     /*
      * Charge one cpu_operator_cost per comparison per input tuple. We assume
      * all columns get compared at most of the tuples.
+     *
+     * XXX all wrong for hashing
      */
-    pathnode->path.disabled_nodes = subpath->disabled_nodes;
-    pathnode->path.startup_cost = subpath->startup_cost;
-    pathnode->path.total_cost = subpath->total_cost +
-        cpu_operator_cost * subpath->rows * list_length(distinctList);
+    pathnode->path.disabled_nodes =
+        leftpath->disabled_nodes + rightpath->disabled_nodes;
+    pathnode->path.startup_cost =
+        leftpath->startup_cost + rightpath->startup_cost;
+    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
+        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 182a6956bb..a730c6ec6d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2806,27 +2806,34 @@ typedef struct HashState
 /* ----------------
  *     SetOpState information
  *
- *        Even in "sorted" mode, SetOp nodes are more complex than a simple
- *        Unique, since we have to count how many duplicates to return.  But
- *        we also support hashing, so this is really more like a cut-down
- *        form of Agg.
+ *        SetOp nodes support either sorted or hashed de-duplication.
+ *        The sorted mode is a bit like MergeJoin, the hashed mode like Agg.
  * ----------------
  */
-/* this struct is private in nodeSetOp.c: */
-typedef struct SetOpStatePerGroupData *SetOpStatePerGroup;
+typedef struct SetOpStatePerInput
+{
+    TupleTableSlot *firstTupleSlot; /* first tuple of current group */
+    int64        numTuples;        /* number of tuples in current group */
+    TupleTableSlot *nextTupleSlot;    /* next input tuple, if already read */
+    bool        needGroup;        /* do we need to load a new group? */
+} SetOpStatePerInput;

 typedef struct SetOpState
 {
     PlanState    ps;                /* its first field is NodeTag */
-    ExprState  *eqfunction;        /* equality comparator */
-    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
-    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     bool        setop_done;        /* indicates completion of output scan */
-    long        numOutput;        /* number of dups left to output */
+    int64        numOutput;        /* number of dups left to output */
+    int            numCols;        /* number of grouping columns */
+
     /* these fields are used in SETOP_SORTED mode: */
-    SetOpStatePerGroup pergroup;    /* per-group working state */
-    HeapTuple    grp_firstTuple; /* copy of first tuple of current group */
+    SortSupport sortKeys;        /* per-grouping-field sort data */
+    SetOpStatePerInput leftInput;    /* current outer-relation input state */
+    SetOpStatePerInput rightInput;    /* current inner-relation input state */
+    bool        need_init;        /* have we read the first tuples yet? */
+
     /* these fields are used in SETOP_HASHED mode: */
+    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
+    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     TupleHashTable hashtable;    /* hash table with one entry per group */
     MemoryContext tableContext; /* memory context containing hash table */
     bool        table_filled;    /* hash table filled yet? */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index add0f9e45f..da3bdbab27 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2340,13 +2340,12 @@ typedef struct WindowAggPath
 typedef struct SetOpPath
 {
     Path        path;
-    Path       *subpath;        /* path representing input source */
+    Path       *leftpath;        /* paths representing input sources */
+    Path       *rightpath;
     SetOpCmd    cmd;            /* what to do, see nodes.h */
     SetOpStrategy strategy;        /* how to do it, see nodes.h */
-    List       *distinctList;    /* SortGroupClauses identifying target cols */
-    AttrNumber    flagColIdx;        /* where is the flag column, if any */
-    int            firstFlag;        /* flag value for first input relation */
-    Cardinality numGroups;        /* estimated number of groups in input */
+    List       *groupList;        /* SortGroupClauses identifying target cols */
+    Cardinality numGroups;        /* estimated number of groups in left input */
 } SetOpPath;

 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 52f29bcdb6..4633121689 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1225,23 +1225,20 @@ typedef struct SetOp
     /* how to do it, see nodes.h */
     SetOpStrategy strategy;

-    /* number of columns to check for duplicate-ness */
+    /* number of columns to compare */
     int            numCols;

     /* their indexes in the target list */
-    AttrNumber *dupColIdx pg_node_attr(array_size(numCols));
+    AttrNumber *cmpColIdx pg_node_attr(array_size(numCols));

-    /* equality operators to compare with */
-    Oid           *dupOperators pg_node_attr(array_size(numCols));
-    Oid           *dupCollations pg_node_attr(array_size(numCols));
-
-    /* where is the flag column, if any */
-    AttrNumber    flagColIdx;
+    /* comparison operators (either equality operators or sort operators) */
+    Oid           *cmpOperators pg_node_attr(array_size(numCols));
+    Oid           *cmpCollations pg_node_attr(array_size(numCols));

-    /* flag value for first input relation */
-    int            firstFlag;
+    /* nulls-first flags if sorting, otherwise not interesting */
+    bool       *cmpNullsFirst pg_node_attr(array_size(numCols));

-    /* estimated number of groups in input */
+    /* estimated number of groups in left input */
     long        numGroups;
 } SetOp;

diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 1035e6560c..5a6d0350c1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -260,12 +260,11 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
                                             bool topwindow);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
                                     RelOptInfo *rel,
-                                    Path *subpath,
+                                    Path *leftpath,
+                                    Path *rightpath,
                                     SetOpCmd cmd,
                                     SetOpStrategy strategy,
-                                    List *distinctList,
-                                    AttrNumber flagColIdx,
-                                    int firstFlag,
+                                    List *groupList,
                                     double numGroups,
                                     double outputRows);
 extern RecursiveUnionPath *create_recursiveunion_path(PlannerInfo *root,
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index da219aba4c..3106cf6265 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1231,22 +1231,18 @@ select count(*) from
     select * from onek i2 where i2.unique1 = o.unique2
   ) ss
 where o.ten = 1;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
                Filter: (ten = 1)
-         ->  Subquery Scan on ss
-               ->  HashSetOp Except
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Scan using onek_unique1 on onek i1
-                                       Index Cond: (unique1 = o.unique1)
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Index Scan using onek_unique1 on onek i2
-                                       Index Cond: (unique1 = o.unique2)
-(13 rows)
+         ->  HashSetOp Except
+               ->  Index Scan using onek_unique1 on onek i1
+                     Index Cond: (unique1 = o.unique1)
+               ->  Index Scan using onek_unique1 on onek i2
+                     Index Cond: (unique1 = o.unique2)
+(9 rows)

 select count(*) from
   onek o cross join lateral (
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index c73631a9a1..fbf24490e9 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -370,17 +370,13 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                     QUERY PLAN
-------------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  HashSetOp Intersect
-               ->  Append
-                     ->  Subquery Scan on "*SELECT* 2"
-                           ->  Seq Scan on tenk1
-                     ->  Subquery Scan on "*SELECT* 1"
-                           ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+   ->  HashSetOp Intersect
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(4 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -391,16 +387,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Index Only Scan using tenk1_unique1 on tenk1
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                     Filter: (unique2 <> 10)
-(7 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -434,19 +427,17 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                        QUERY PLAN
-------------------------------------------------------------------------------------------
+                               QUERY PLAN
+------------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  SetOp Intersect
-               ->  Sort
-                     Sort Key: "*SELECT* 2".fivethous
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Seq Scan on tenk1
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(10 rows)
+   ->  SetOp Intersect
+         ->  Sort
+               Sort Key: tenk1.fivethous
+               ->  Seq Scan on tenk1
+         ->  Sort
+               Sort Key: tenk1_1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(8 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -457,18 +448,17 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".unique1
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Index Only Scan using tenk1_unique1 on tenk1
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                           Filter: (unique2 <> 10)
-(9 rows)
+         Sort Key: tenk1.unique1
+         ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Sort
+         Sort Key: tenk1_1.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+               Filter: (unique2 <> 10)
+(8 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -528,15 +518,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -546,15 +533,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -606,17 +590,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -626,17 +609,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -669,17 +651,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -689,17 +670,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -777,17 +757,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -797,17 +776,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -970,15 +948,12 @@ set enable_sort = false;
 -- no means to disable Unique.
 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union all select from generate_series(1,3);
 --
@@ -1015,15 +990,12 @@ select from generate_series(1,5) union select from generate_series(1,3);

 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  SetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union select from generate_series(1,3);
 --
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b54428b38c..58bc48e0ac 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2611,6 +2611,8 @@ SetOpCmd
 SetOpPath
 SetOpState
 SetOpStatePerGroup
+SetOpStatePerGroupData
+SetOpStatePerInput
 SetOpStrategy
 SetOperation
 SetOperationStmt
--
2.43.5
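
For readers following along, the "sorted mode is a bit like MergeJoin" remark in the new SetOpState comment can be illustrated with a small standalone sketch. This is not the code from nodeSetOp.c: it assumes plain ascending int arrays in place of tuple slots and SortSupport comparisons, and every demo_* name is hypothetical. It only shows the idea of walking two sorted inputs in step, counting each group on both sides, and deciding how many copies to emit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for SetOpCmd; not the real enum. */
typedef enum
{
    DEMO_INTERSECT,
    DEMO_INTERSECT_ALL,
    DEMO_EXCEPT,
    DEMO_EXCEPT_ALL
} DemoSetOpCmd;

/* Copies of a group to emit, given its row count in each input. */
static long
demo_copies(DemoSetOpCmd cmd, long nleft, long nright)
{
    switch (cmd)
    {
        case DEMO_INTERSECT:
            return (nleft > 0 && nright > 0) ? 1 : 0;
        case DEMO_INTERSECT_ALL:
            return (nleft < nright) ? nleft : nright;
        case DEMO_EXCEPT:
            return (nleft > 0 && nright == 0) ? 1 : 0;
        case DEMO_EXCEPT_ALL:
            return (nleft > nright) ? nleft - nright : 0;
    }
    return 0;
}

/*
 * Merge two ascending arrays.  "out" must have room for nleft entries,
 * since no command ever emits more rows than the left input holds.
 * Returns the number of values written.
 */
static size_t
demo_sorted_setop(DemoSetOpCmd cmd,
                  const int *left, size_t nleft,
                  const int *right, size_t nright,
                  int *out)
{
    size_t      li = 0;
    size_t      ri = 0;
    size_t      nout = 0;

    while (li < nleft)
    {
        int         key = left[li];
        long        lcount = 0;
        long        rcount = 0;
        long        n;

        /* Count the current group in the left input. */
        while (li < nleft && left[li] == key)
        {
            li++;
            lcount++;
        }
        assert(li >= nleft || left[li] > key);  /* left must be sorted */

        /* Right-only groups never produce output, so skip them. */
        while (ri < nright && right[ri] < key)
            ri++;

        /* Count the matching group in the right input, if any. */
        while (ri < nright && right[ri] == key)
        {
            ri++;
            rcount++;
        }
        assert(ri >= nright || right[ri] > key);    /* right must be sorted */

        for (n = demo_copies(cmd, lcount, rcount); n > 0; n--)
            out[nout++] = key;
    }
    return nout;
}
```

Because only groups present in the left input can appear in the result, the loop is driven by the left side alone, which is consistent with the patch describing numGroups as the estimated number of groups in the left-hand input.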

From 369f72fcf230d28aebfe9afc91a9a7b01fd08958 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 29 Nov 2024 18:17:27 -0500
Subject: [PATCH v3 2/5] Remove some dead code in prepunion.c.

Now that we no longer use flag columns for SetOp, remove the
code that supported that.

This also means that there are no resjunk output columns anymore
in a set-operations plan tree.  So we can get rid of some code
that supported that, too.

It would probably be possible now to do what's speculated about in
the comments, and get rid of this code's use of transiently-created
targetlists in favor of dealing only in pathtargets (save for
generating root->processed_tlist at the very end).  However, that
would involve a good deal of notational churn and the benefit
would only be to save a little planning time.  So I'll leave that
project for some other day.
---
 src/backend/optimizer/prep/prepunion.c | 96 ++++----------------------
 1 file changed, 14 insertions(+), 82 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1c43370005..8d70a4b61a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -40,8 +40,7 @@

 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
                                           List *colTypes, List *colCollations,
-                                          bool junkOK,
-                                          int flag, List *refnames_tlist,
+                                          List *refnames_tlist,
                                           List **pTargetList,
                                           bool *istrivial_tlist);
 static RelOptInfo *generate_recursion_path(SetOperationStmt *setOp,
@@ -69,14 +68,12 @@ static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
-                                  int flag,
                                   Index varno,
                                   bool hack_constants,
                                   List *input_tlist,
                                   List *refnames_tlist,
                                   bool *trivial_tlist);
 static List *generate_append_tlist(List *colTypes, List *colCollations,
-                                   bool flag,
                                    List *input_tlists,
                                    List *refnames_tlist);
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
@@ -160,12 +157,10 @@ plan_set_operations(PlannerInfo *root)
         /*
          * Recurse on setOperations tree to generate paths for set ops. The
          * final output paths should have just the column types shown as the
-         * output from the top-level node, plus possibly resjunk working
-         * columns (we can rely on upper-level nodes to deal with that).
+         * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
                                            topop->colTypes, topop->colCollations,
-                                           true, -1,
                                            leftmostQuery->targetList,
                                            &top_tlist,
                                            &trivial_tlist);
@@ -208,8 +203,6 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * junkOK: if true, child resjunk columns may be left in the result
- * flag: if >= 0, add a resjunk output column indicating value of flag
  * refnames_tlist: targetlist to take column names from
  *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
@@ -220,7 +213,8 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
- * Pending a redesign of how that works, this is the easy way out.
+ * XXX Now that there are no flag columns and hence no resjunk columns, we
+ * could probably refactor this file to deal only in pathtargets.
  *
  * We don't have to care about typmods here: the only allowed difference
  * between set-op input and output typmods is input is a specific typmod
@@ -229,8 +223,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
                        List *colTypes, List *colCollations,
-                       bool junkOK,
-                       int flag, List *refnames_tlist,
+                       List *refnames_tlist,
                        List **pTargetList,
                        bool *istrivial_tlist)
 {
@@ -279,7 +272,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,

         /* Figure out the appropriate target list for this subquery. */
         tlist = generate_setop_tlist(colTypes, colCollations,
-                                     flag,
                                      rtr->rtindex,
                                      true,
                                      subroot->processed_tlist,
@@ -318,16 +310,14 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * generate_append_tlist() or generate_setop_tlist(), this will work.
          * We just tell generate_setop_tlist() to use varno 0.
          */
-        if (flag >= 0 ||
-            !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
-            !tlist_same_collations(*pTargetList, colCollations, junkOK))
+        if (!tlist_same_datatypes(*pTargetList, colTypes, false) ||
+            !tlist_same_collations(*pTargetList, colCollations, false))
         {
             PathTarget *target;
             bool        trivial_tlist;
             ListCell   *lc;

             *pTargetList = generate_setop_tlist(colTypes, colCollations,
-                                                flag,
                                                 0,
                                                 false,
                                                 *pTargetList,
@@ -411,7 +401,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      */
     lrel = recurse_set_operations(setOp->larg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -423,7 +412,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -436,7 +424,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /*
      * Generate tlist for RecursiveUnion path node --- same as in Append cases
      */
-    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations, false,
+    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations,
                                   list_make2(lpath_tlist, rpath_tlist),
                                   refnames_tlist);

@@ -736,7 +724,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, false,
+    tlist = generate_append_tlist(op->colTypes, op->colCollations,
                                   tlist_list, refnames_tlist);
     *pTargetList = tlist;

@@ -1048,7 +1036,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,7 +1047,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1109,7 +1095,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
                                  0, false, lpath_tlist, refnames_tlist,
                                  &result_trivial_tlist);

@@ -1258,18 +1244,10 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
-         *
-         * Note we disallow any resjunk columns in child results.  This is
-         * necessary since the Append node that implements the union won't do
-         * any projection, and upper levels will get confused if some of our
-         * output tuples have junk and some don't.  This case only arises when
-         * we have an EXCEPT or INTERSECT as child, else there won't be
-         * resjunk anyway.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
-                                                        false, -1,
                                                         refnames_tlist,
                                                         &child_tlist,
                                                         &trivial_tlist));
@@ -1412,7 +1390,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: -1 if no flag column needed, 0 or 1 to create a const flag column
  * varno: varno to use in generated Vars
  * hack_constants: true to copy up constants (see comments in code)
  * input_tlist: targetlist of this node's input node
@@ -1421,7 +1398,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  */
 static List *
 generate_setop_tlist(List *colTypes, List *colCollations,
-                     int flag,
                      Index varno,
                      bool hack_constants,
                      List *input_tlist,
@@ -1520,7 +1496,7 @@ generate_setop_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1529,25 +1505,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag >= 0)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is the given constant */
-        expr = (Node *) makeConst(INT4OID,
-                                  -1,
-                                  InvalidOid,
-                                  sizeof(int32),
-                                  Int32GetDatum(flag),
-                                  false,
-                                  true);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-        *trivial_tlist = false; /* the extra entry makes it not trivial */
-    }
-
     return tlist;
 }

@@ -1556,7 +1513,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: true to create a flag column copied up from subplans
  * input_tlists: list of tlists for sub-plans of the Append
  * refnames_tlist: targetlist to take column names from
  *
@@ -1570,7 +1526,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  */
 static List *
 generate_append_tlist(List *colTypes, List *colCollations,
-                      bool flag,
                       List *input_tlists,
                       List *refnames_tlist)
 {
@@ -1604,8 +1559,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
         {
             TargetEntry *subtle = (TargetEntry *) lfirst(subtlistl);

-            if (subtle->resjunk)
-                continue;
+            Assert(!subtle->resjunk);
             Assert(curColType != NULL);
             if (exprType((Node *) subtle->expr) == lfirst_oid(curColType))
             {
@@ -1654,7 +1608,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1663,23 +1617,6 @@ generate_append_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is shown as copied up from subplan */
-        expr = (Node *) makeVar(0,
-                                resno,
-                                INT4OID,
-                                -1,
-                                InvalidOid,
-                                0);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-    }
-
     pfree(colTypmods);

     return tlist;
@@ -1709,12 +1646,7 @@ generate_setop_grouplist(SetOperationStmt *op, List *targetlist)
         TargetEntry *tle = (TargetEntry *) lfirst(lt);
         SortGroupClause *sgc;

-        if (tle->resjunk)
-        {
-            /* resjunk columns should not have sortgrouprefs */
-            Assert(tle->ressortgroupref == 0);
-            continue;            /* ignore resjunk columns */
-        }
+        Assert(!tle->resjunk);

         /* non-resjunk columns should have sortgroupref = resno */
         Assert(tle->ressortgroupref == tle->resno);
--
2.43.5

From 371ca07586df16c8f9513000b11a0b5b0e21b2e9 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 29 Nov 2024 18:18:17 -0500
Subject: [PATCH v3 3/5] Get rid of choose_hashed_setop().

This logic is a hangover from days when we could only generate
one plan per setop node.  Now, it's better to generate multiple
Paths and let add_path choose the cheapest.  The main advantage
is not having to keep choose_hashed_setop in perfect sync with
cost estimates made elsewhere --- and indeed, it looks like it
was out of sync already.  Not to mention that it doesn't account
for presorted input, which doesn't matter too much right now
but will soon.

This does require fixing create_setop_path to generate good cost
estimates.  That was another area where we'd not maintained things
well, because it wasn't distinguishing sorted from hashed costs at
all.  With the formulas I chose, the total cost attributed to the node
is actually the same for sorted and hashed modes --- but the startup
cost is wildly different, because hashed mode does basically all the
work before returning anything.

I also modified the existing logic that disables hashed setops
so that it works by incrementing path->disabled_nodes rather
than not generating the path at all.  That seems more in keeping
with the methods established by commit e22253467.  (But should
we put that logic in create_setop_path, rather than its caller?)

This results in two changes in plans that are visible in the
regression tests.  The one in union.sql seems fine: it's now
preferring sorted over hashed mode for a zero-column INTERSECT,
since no sort is really required.  The one in subselect.sql
is trying to test nodeSetOp's rescan logic, so I extended it
to test both sorted and hashed modes.
---
 src/backend/optimizer/prep/prepunion.c  | 226 +++++++++---------------
 src/backend/optimizer/util/pathnode.c   |  50 +++++-
 src/test/regress/expected/subselect.out |  48 ++++-
 src/test/regress/expected/union.out     |   2 +-
 src/test/regress/sql/subselect.sql      |  32 +++-
 5 files changed, 200 insertions(+), 158 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 8d70a4b61a..23bb335354 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -63,10 +63,6 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **tlist_list,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
-static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *lpath, Path *rpath,
-                                double dNumGroups, double dNumOutputRows,
-                                const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
                                   Index varno,
                                   bool hack_constants,
@@ -1025,7 +1021,8 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
-    bool        use_hash;
+    bool        can_sort;
+    bool        can_hash;
     SetOpCmd    cmd;

     /*
@@ -1130,15 +1127,76 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }
+    result_rel->rows = dNumOutputRows;
+
+    /* Select the SetOpCmd type */
+    switch (op->op)
+    {
+        case SETOP_INTERSECT:
+            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
+            break;
+        case SETOP_EXCEPT:
+            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
+            break;
+        default:
+            elog(ERROR, "unrecognized set op: %d", (int) op->op);
+            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
+            break;
+    }
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));

     /*
-     * Decide whether to hash or sort, and add sort nodes if needed.
+     * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
-    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
-                                   dNumGroups, dNumOutputRows,
-                                   (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");
+    if (can_hash)
+    {
+        Size        hash_mem_limit = get_hash_memory_limit();
+        Size        hashentrysize;
+
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_HASHED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);

-    if (groupList && !use_hash)
+        /*
+         * Mark the path as disabled if enable_hashagg is off.  While this
+         * isn't exactly a HashAgg node, it seems close enough to justify
+         * letting that switch control it.
+         */
+        if (!enable_hashagg)
+            path->disabled_nodes++;
+
+        /*
+         * Also disable if it doesn't look like the hashtable will fit into
+         * hash_mem.
+         */
+        hashentrysize = MAXALIGN(lpath->pathtarget->width) +
+            MAXALIGN(SizeofMinimalTupleHeader);
+        if (hashentrysize * dNumGroups > hash_mem_limit)
+            path->disabled_nodes++;
+
+        add_path(result_rel, path);
+    }
+
+    /*
+     * If we can sort, sort the inputs if needed before adding SetOp.
+     */
+    if (can_sort)
     {
         List       *pathkeys;

@@ -1160,36 +1218,19 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                               rpath,
                                               pathkeys,
                                               -1.0);
-    }

-    /*
-     * Finally, add a SetOp path node to generate the correct output.
-     */
-    switch (op->op)
-    {
-        case SETOP_INTERSECT:
-            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
-            break;
-        case SETOP_EXCEPT:
-            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
-            break;
-        default:
-            elog(ERROR, "unrecognized set op: %d", (int) op->op);
-            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
-            break;
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_SORTED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);
+        add_path(result_rel, path);
     }
-    path = (Path *) create_setop_path(root,
-                                      result_rel,
-                                      lpath,
-                                      rpath,
-                                      cmd,
-                                      use_hash ? SETOP_HASHED : SETOP_SORTED,
-                                      groupList,
-                                      dNumGroups,
-                                      dNumOutputRows);
-
-    result_rel->rows = path->rows;
-    add_path(result_rel, path);
+
     return result_rel;
 }

@@ -1276,115 +1317,6 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)
     set_cheapest(rel);
 }

-/*
- * choose_hashed_setop - should we use hashing for a set operation?
- *
- * XXX probably this should go away: just make both paths and let
- * add_path sort it out.
- */
-static bool
-choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *lpath, Path *rpath,
-                    double dNumGroups, double dNumOutputRows,
-                    const char *construct)
-{
-    int            numGroupCols = list_length(groupClauses);
-    Size        hash_mem_limit = get_hash_memory_limit();
-    bool        can_sort;
-    bool        can_hash;
-    Size        hashentrysize;
-    Path        hashed_p;
-    Path        sorted_p;
-    double        tuple_fraction;
-
-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupClauses);
-    can_hash = grouping_is_hashable(groupClauses);
-    if (can_hash && can_sort)
-    {
-        /* we have a meaningful choice to make, continue ... */
-    }
-    else if (can_hash)
-        return true;
-    else if (can_sort)
-        return false;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is UNION, INTERSECT, or EXCEPT */
-                 errmsg("could not implement %s", construct),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
-    /* Prefer sorting when enable_hashagg is off */
-    if (!enable_hashagg)
-        return false;
-
-    /*
-     * Don't do it if it doesn't look like the hashtable will fit into
-     * hash_mem.
-     */
-    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
-
-    if (hashentrysize * dNumGroups > hash_mem_limit)
-        return false;
-
-    /*
-     * See if the estimated cost is no more than doing it the other way.
-     *
-     * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group. XXX NOT TRUE: Note that the actual result plan might involve a
-     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
-     * and Group should be close enough for our purposes here.
-     *
-     * These path variables are dummies that just hold cost fields; we don't
-     * make actual Paths for these steps.
-     */
-    cost_agg(&hashed_p, root, AGG_HASHED, NULL,
-             numGroupCols, dNumGroups,
-             NIL,
-             lpath->disabled_nodes + rpath->disabled_nodes,
-             lpath->startup_cost + rpath->startup_cost,
-             lpath->total_cost + rpath->total_cost,
-             lpath->rows + rpath->rows,
-             lpath->pathtarget->width);
-
-    /*
-     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
-     * unsorted, since it was made by appending unrelated sub-relations
-     * together.
-     */
-    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
-    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
-    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
-    /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-    cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
-              sorted_p.total_cost,
-              lpath->rows + rpath->rows,
-              lpath->pathtarget->width,
-              0.0, work_mem, -1.0);
-    cost_group(&sorted_p, root, numGroupCols, dNumGroups,
-               NIL,
-               sorted_p.disabled_nodes,
-               sorted_p.startup_cost, sorted_p.total_cost,
-               lpath->rows + rpath->rows);
-
-    /*
-     * Now make the decision using the top-level tuple fraction.  First we
-     * have to convert an absolute count (LIMIT) into fractional form.
-     */
-    tuple_fraction = root->tuple_fraction;
-    if (tuple_fraction >= 1.0)
-        tuple_fraction /= dNumOutputRows;
-
-    if (compare_fractional_path_costs(&hashed_p, &sorted_p,
-                                      tuple_fraction) < 0)
-    {
-        /* Hashed is cheaper, so use it */
-        return true;
-    }
-    return false;
-}
-
 /*
  * Generate targetlist for a set-operation plan node
  *
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e52e4b1d67..d7422f2dd8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3681,17 +3681,51 @@ create_setop_path(PlannerInfo *root,
     pathnode->numGroups = numGroups;

     /*
-     * Charge one cpu_operator_cost per comparison per input tuple. We assume
-     * all columns get compared at most of the tuples.
-     *
-     * XXX all wrong for hashing
+     * Compute cost estimates.  As things stand, we end up with the same total
+     * cost in this node for sort and hash methods, but different startup
+     * costs.  This could be refined perhaps, but it'll do for now.
      */
     pathnode->path.disabled_nodes =
         leftpath->disabled_nodes + rightpath->disabled_nodes;
-    pathnode->path.startup_cost =
-        leftpath->startup_cost + rightpath->startup_cost;
-    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
-        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+    if (strategy == SETOP_SORTED)
+    {
+        /*
+         * In sorted mode, we can emit output incrementally.  Charge one
+         * cpu_operator_cost per comparison per input tuple.  Like cost_group,
+         * we assume all columns get compared at most of the tuples.
+         */
+        pathnode->path.startup_cost =
+            leftpath->startup_cost + rightpath->startup_cost;
+        pathnode->path.total_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
+    else
+    {
+        /*
+         * In hashed mode, we must read all the input before we can emit
+         * anything.  Also charge comparison costs to represent the cost of
+         * hash table lookups.
+         */
+        pathnode->path.startup_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+        pathnode->path.total_cost = pathnode->path.startup_cost;
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 3106cf6265..79e52b46c0 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1221,8 +1221,10 @@ where o.ten = 0;
 (1 row)

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -1256,6 +1258,50 @@ where o.ten = 1;
    100
 (1 row)

+rollback;
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+                                                                                                     QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  SetOp Except
+               ->  Sort
+                     Sort Key: i1.unique1, i1.unique2, i1.two, i1.four, i1.ten, i1.twenty, i1.hundred, i1.thousand, i1.twothousand, i1.fivethous, i1.tenthous, i1.odd, i1.even, i1.stringu1, i1.stringu2, i1.string4
+                     ->  Index Scan using onek_unique1 on onek i1
+                           Index Cond: (unique1 = o.unique1)
+               ->  Sort
+                     Sort Key: i2.unique1, i2.unique2, i2.two, i2.four, i2.ten, i2.twenty, i2.hundred, i2.thousand, i2.twothousand, i2.fivethous, i2.tenthous, i2.odd, i2.even, i2.stringu1, i2.stringu2, i2.string4
+                     ->  Index Scan using onek_unique1 on onek i2
+                           Index Cond: (unique1 = o.unique2)
+(13 rows)
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+ count
+-------
+   100
+(1 row)
+
+rollback;
 --
 -- Test rescan of a RecursiveUnion node
 --
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index fbf24490e9..9474b01510 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -950,7 +950,7 @@ explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
                         QUERY PLAN
 ----------------------------------------------------------
- HashSetOp Intersect
+ SetOp Intersect
    ->  Function Scan on generate_series
    ->  Function Scan on generate_series generate_series_1
 (3 rows)
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index 1815a5e664..0f3a064066 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -638,8 +638,11 @@ select sum(ss.tst::int) from
 where o.ten = 0;

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
+
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -657,6 +660,33 @@ select count(*) from
   ) ss
 where o.ten = 1;

+rollback;
+
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+rollback;
+
 --
 -- Test rescan of a RecursiveUnion node
 --
--
2.43.5

From 735f55167320dec62951d0c0ea23f0a4f4702f7c Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 29 Nov 2024 18:20:16 -0500
Subject: [PATCH v3 4/5] Fix bogus decisions about whether we want pre-sorted
 child paths.

As things stand, recurse_set_operations unconditionally passes down
root->parse->setOperations to subquery_planner, which then applies
set_operation_ordered_results_useful to it to decide whether we care
about producing pre-sorted paths for the subquery.

This is flat wrong.

The main problem is that root->parse->setOperations is not necessarily
the relevant ancestor node.  In the test case added here, we have
UNION ALL atop a child UNION, and the existing logic fails to
recognize that it could use pre-sorted paths for the child UNION step.

A lesser problem is that set_operation_ordered_results_useful is
inadequate anyway: it will falsely claim that we could use pre-sorted
paths for a recursive UNION.

A better way to do this is for each caller of recurse_set_operations
to pass down the relevant parent node.  Then we can further specify
that the caller should pass NULL if it has no use for pre-sorted
paths, and then we can get rid of set_operation_ordered_results_useful
altogether, eliminating the need to keep it in sync with the code that
is actually doing the work.

Note: this patch makes generate_nonunion_paths pass down the non-UNION
node as parent.  generate_nonunion_paths doesn't yet do anything with
the pre-sorted paths that will result, so no regression test plans
change.  If we were going to leave it like this, we'd pass down NULL
instead.
---
 src/backend/optimizer/plan/planner.c   |  7 ++-
 src/backend/optimizer/prep/prepunion.c | 65 +++++++++++++-------------
 src/include/optimizer/prep.h           |  1 -
 src/test/regress/expected/union.out    | 14 ++++++
 src/test/regress/sql/union.sql         |  4 ++
 5 files changed, 54 insertions(+), 37 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..9e3ccb915c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -618,7 +618,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Basically, this routine does the stuff that should only be done once
  * per Query object.  It then calls grouping_planner.  At one time,
@@ -1352,7 +1352,7 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Returns nothing; the useful output is in the Paths we attach to the
  * (UPPERREL_FINAL, NULL) upperrel in *root.  In addition,
@@ -3631,8 +3631,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
                                       tlist);

     /* setting setop_pathkeys might be useful to the union planner */
-    if (qp_extra->setop != NULL &&
-        set_operation_ordered_results_useful(qp_extra->setop))
+    if (qp_extra->setop != NULL)
     {
         List       *groupClauses;
         bool        sortable;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 23bb335354..e92a088334 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -39,6 +39,7 @@


 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
+                                          SetOperationStmt *parentOp,
                                           List *colTypes, List *colCollations,
                                           List *refnames_tlist,
                                           List **pTargetList,
@@ -156,6 +157,7 @@ plan_set_operations(PlannerInfo *root)
          * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
+                                           NULL,    /* no parent */
                                            topop->colTypes, topop->colCollations,
                                            leftmostQuery->targetList,
                                            &top_tlist,
@@ -168,44 +170,31 @@ plan_set_operations(PlannerInfo *root)
     return setop_rel;
 }

-/*
- * set_operation_ordered_results_useful
- *        Return true if the given SetOperationStmt can be executed by utilizing
- *        paths that provide sorted input according to the setop's targetlist.
- *        Returns false when sorted paths are not any more useful than unsorted
- *        ones.
- */
-bool
-set_operation_ordered_results_useful(SetOperationStmt *setop)
-{
-    /*
-     * Paths sorted by the targetlist are useful for UNION as we can opt to
-     * MergeAppend the sorted paths then Unique them.  Ordered paths are no
-     * more useful than unordered ones for UNION ALL.
-     */
-    if (!setop->all && setop->op == SETOP_UNION)
-        return true;
-
-    /*
-     * EXCEPT / EXCEPT ALL / INTERSECT / INTERSECT ALL cannot yet utilize
-     * correctly sorted input paths.
-     */
-    return false;
-}
-
 /*
  * recurse_set_operations
  *      Recursively handle one step in a tree of set operations
  *
+ * setOp: current step (could be a SetOperationStmt or a leaf RangeTblRef)
+ * parentOp: parent step, or NULL if none (but see below)
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
  * refnames_tlist: targetlist to take column names from
  *
+ * parentOp should be passed as NULL unless that step is interested in
+ * getting sorted output from this step.  ("Sorted" means "sorted according
+ * to the default btree opclasses of the result column datatypes".)
+ *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
  * *pTargetList: receives the fully-fledged tlist for the subtree's top plan
  * *istrivial_tlist: true if, and only if, datatypes between parent and child
  * match.
  *
+ * If setOp is a leaf node, this function plans the sub-query but does
+ * not populate the pathlist of the returned RelOptInfo.  The caller will
+ * generate SubqueryScan paths using useful path(s) of the subquery (see
+ * build_setop_child_paths).  But this function does build the paths for
+ * set-operation nodes.
+ *
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
@@ -218,6 +207,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  */
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
+                       SetOperationStmt *parentOp,
                        List *colTypes, List *colCollations,
                        List *refnames_tlist,
                        List **pTargetList,
@@ -234,7 +224,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
     {
         RangeTblRef *rtr = (RangeTblRef *) setOp;
         RangeTblEntry *rte = root->simple_rte_array[rtr->rtindex];
-        SetOperationStmt *setops;
         Query       *subquery = rte->subquery;
         PlannerInfo *subroot;
         List       *tlist;
@@ -249,15 +238,13 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
         Assert(root->plan_params == NIL);

         /*
-         * Pass the set operation details to the subquery_planner to have it
-         * consider generating Paths correctly ordered for the set operation.
+         * Generate a subroot and Paths for the subquery.  If we have a
+         * parentOp, pass that down to encourage subquery_planner to consider
+         * suitably-sorted Paths.
          */
-        setops = castNode(SetOperationStmt, root->parse->setOperations);
-
-        /* Generate a subroot and Paths for the subquery */
         subroot = rel->subroot = subquery_planner(root->glob, subquery, root,
                                                   false, root->tuple_fraction,
-                                                  setops);
+                                                  parentOp);

         /*
          * It should not be possible for the primitive query to contain any
@@ -396,6 +383,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      * separately without any intention of combining them into one Append.
      */
     lrel = recurse_set_operations(setOp->larg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -407,6 +395,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /* The right path will want to look at the left one ... */
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -479,6 +468,10 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
  * build_setop_child_paths
  *        Build paths for the set op child relation denoted by 'rel'.
  *
+ * 'rel' is an RTE_SUBQUERY relation.  We have already generated paths within
+ * the subquery's subroot; the task here is to create SubqueryScan paths for
+ * 'rel', representing scans of the useful subquery paths.
+ *
  * interesting_pathkeys: if not NIL, also include paths that suit these
  * pathkeys, sorting any unsorted paths as required.
  * *pNumGroups: if not NULL, we estimate the number of distinct groups
@@ -1032,6 +1025,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -1043,6 +1037,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -1285,8 +1280,14 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
+         *
+         * If top_union isn't a UNION ALL, then we are interested in sorted
+         * output from the child, so pass top_union as parentOp.  Note that
+         * this isn't necessarily the child node's immediate SetOperationStmt
+         * parent, but that's fine: it's the effective parent.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
+                                                        top_union->all ? NULL : top_union,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
                                                         refnames_tlist,
diff --git a/src/include/optimizer/prep.h b/src/include/optimizer/prep.h
index a52dec285d..74bb6c1d3b 100644
--- a/src/include/optimizer/prep.h
+++ b/src/include/optimizer/prep.h
@@ -53,6 +53,5 @@ extern void preprocess_aggrefs(PlannerInfo *root, Node *clause);
  * prototypes for prepunion.c
  */
 extern RelOptInfo *plan_set_operations(PlannerInfo *root);
-extern bool set_operation_ordered_results_useful(SetOperationStmt *setop);

 #endif                            /* PREP_H */
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 9474b01510..3e1adbd992 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -466,6 +466,20 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+                               QUERY PLAN
+------------------------------------------------------------------------
+ Append
+   ->  Seq Scan on int4_tbl
+   ->  Unique
+         ->  Merge Append
+               Sort Key: tenk1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1
+               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+(7 rows)
+
 reset enable_hashagg;
 -- non-hashable type
 set enable_hashagg to on;
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index f8826514e4..10030ec1c5 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -156,6 +156,10 @@ explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+
 reset enable_hashagg;

 -- non-hashable type
--
2.43.5

From 2515e71a27334b5921115334fb70e5fa2fca706b Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 29 Nov 2024 18:21:37 -0500
Subject: [PATCH v3 5/5] Teach generate_nonunion_paths to consider pre-sorted
 child paths.

Much of this patch is just re-ordering existing code so that we can
compute the desired pathkeys before we call build_setop_child_paths.

I don't think we need any new test cases: the existing cases that
change plan are enough to show that this works.

One test case in union.sql would switch from HashSetOp Except to
sorted SetOp Except if we left it alone, since the cheapest inputs
are already sorted.  But the point of that set of tests is to
exercise HashSetOp, so swing a bigger hammer.  (The same query is
re-tested with sorting further down, so this seems fine.)
---
 src/backend/optimizer/prep/prepunion.c | 144 ++++++++++++++++---------
 src/test/regress/expected/union.out    |  37 +++----
 src/test/regress/sql/union.sql         |   5 +
 3 files changed, 115 insertions(+), 71 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e92a088334..b3a26738d0 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1010,6 +1010,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     bool        lpath_trivial_tlist,
                 rpath_trivial_tlist,
                 result_trivial_tlist;
+    List       *nonunion_pathkeys = NIL;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
@@ -1030,11 +1031,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
-    if (lrel->rtekind == RTE_SUBQUERY)
-        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
-                                NIL, &dLeftGroups);
-    else
-        dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
                                   op,
@@ -1042,9 +1038,57 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
+
+    /*
+     * Generate tlist for SetOp plan node.
+     *
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
+     * concerned, but we must make it look real anyway for the benefit of the
+     * next plan level up.
+     */
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);
+
+    *pTargetList = tlist;
+
+    /* Identify the grouping semantics */
+    groupList = generate_setop_grouplist(op, tlist);
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
+
+    if (can_sort)
+    {
+        /* Determine the pathkeys for sorting by the whole target list */
+        nonunion_pathkeys = make_pathkeys_for_sortclauses(root, groupList,
+                                                          tlist);
+
+        root->query_pathkeys = nonunion_pathkeys;
+    }
+
+    /*
+     * Now that we've got all that info, we can build the child paths.
+     */
+    if (lrel->rtekind == RTE_SUBQUERY)
+        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
+                                nonunion_pathkeys, &dLeftGroups);
+    else
+        dLeftGroups = lrel->rows;
     if (rrel->rtekind == RTE_SUBQUERY)
         build_setop_child_paths(root, rrel, rpath_trivial_tlist, rpath_tlist,
-                                NIL, &dRightGroups);
+                                nonunion_pathkeys, &dRightGroups);
     else
         dRightGroups = rrel->rows;

@@ -1080,30 +1124,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     lpath = lrel->cheapest_total_path;
     rpath = rrel->cheapest_total_path;

-    /*
-     * Generate tlist for SetOp plan node.
-     *
-     * The tlist for a SetOp plan isn't important so far as the SetOp is
-     * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.
-     */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
-                                 0, false, lpath_tlist, refnames_tlist,
-                                 &result_trivial_tlist);
-
-    /* We should not have needed any type coercions in the tlist */
-    Assert(result_trivial_tlist);
-
-    *pTargetList = tlist;
-
     /* Build result relation. */
     result_rel = fetch_upper_rel(root, UPPERREL_SETOP,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /* Identify the grouping semantics */
-    groupList = generate_setop_grouplist(op, tlist);
-
     /*
      * Estimate number of distinct groups that we'll need hashtable entries
      * for; this is the size of the left-hand input for EXCEPT, or the smaller
@@ -1139,17 +1164,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
             break;
     }

-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupList);
-    can_hash = grouping_is_hashable(groupList);
-    if (!can_sort && !can_hash)
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is INTERSECT or EXCEPT */
-                 errmsg("could not implement %s",
-                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
     /*
      * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
@@ -1189,35 +1203,63 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }

     /*
-     * If we can sort, sort the inputs if needed before adding SetOp.
+     * If we can sort, generate the cheapest sorted input paths, and add a
+     * SetOp atop those.
      */
     if (can_sort)
     {
         List       *pathkeys;
+        Path       *slpath,
+                   *srpath;

+        /* First the left input ... */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  lpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
-            lpath = (Path *) create_sort_path(root,
-                                              lpath->parent,
-                                              lpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            slpath = lpath;        /* cheapest path is already sorted */
+        else
+        {
+            slpath = get_cheapest_path_for_pathkeys(lrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (slpath == NULL)
+                slpath = (Path *) create_sort_path(root,
+                                                   lpath->parent,
+                                                   lpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }
+
+        /* and now the same for the right. */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  rpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
-            rpath = (Path *) create_sort_path(root,
-                                              rpath->parent,
-                                              rpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            srpath = rpath;        /* cheapest path is already sorted */
+        else
+        {
+            srpath = get_cheapest_path_for_pathkeys(rrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (srpath == NULL)
+                srpath = (Path *) create_sort_path(root,
+                                                   rpath->parent,
+                                                   rpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }

         path = (Path *) create_setop_path(root,
                                           result_rel,
-                                          lpath,
-                                          rpath,
+                                          slpath,
+                                          srpath,
                                           cmd,
                                           SETOP_SORTED,
                                           groupList,
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 3e1adbd992..aa73edeac5 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -385,13 +385,15 @@ select count(*) from
   5000
 (1 row)

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                         QUERY PLAN
-------------------------------------------------------------
+           QUERY PLAN
+---------------------------------
  HashSetOp Except
-   ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+   ->  Seq Scan on tenk1
+   ->  Seq Scan on tenk1 tenk1_1
          Filter: (unique2 <> 10)
 (4 rows)

@@ -401,6 +403,7 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+reset enable_indexscan;
 set enable_hashagg to off;
 explain (costs off)
 select count(*) from
@@ -427,17 +430,15 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
    ->  SetOp Intersect
          ->  Sort
                Sort Key: tenk1.fivethous
                ->  Seq Scan on tenk1
-         ->  Sort
-               Sort Key: tenk1_1.unique1
-               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(6 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -448,17 +449,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                            QUERY PLAN
-------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  SetOp Except
-   ->  Sort
-         Sort Key: tenk1.unique1
-         ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Sort
-         Sort Key: tenk1_1.unique2
-         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-               Filter: (unique2 <> 10)
-(8 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index 10030ec1c5..0be2911b0b 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -134,10 +134,15 @@ select count(*) from
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
+
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+reset enable_indexscan;
+
 set enable_hashagg to off;

 explain (costs off)
--
2.43.5

Re: Converting SetOp to read its two inputs separately

From

Tom Lane

Date:

02 December 2024, 00:47:52

I wrote:
> So after digging into it, I realized that I'd been completely misled
> by setop_fill_hash_table's not failing while reading the first input.
> That's not because the code was correct, it was because none of our
> test cases manage to reach TupleHashTableMatch() while reading the
> first input.  Apparently there are no regression test cases where
> tuples of the first input have identical hash codes.  But as soon
> as I tried a case with duplicate rows in the first input, kaboom!

I didn't quite believe that theory, and so I went back to look again.
Indeed we do have test cases exercising duplicate-row removal, so
why wasn't TupleHashTableMatch crashing?  The answer turns out to
be less about duplicates (although we need at least a hash-code
match to reach the problem) and more about the form of the input.
This test case from union.sql doesn't crash (with my v2 patch):

SELECT q2 FROM int8_tbl INTERSECT SELECT q1 FROM int8_tbl ORDER BY 1;

but this does:

SELECT * FROM int8_tbl INTERSECT SELECT * FROM int8_tbl ORDER BY 1;

The reason is that "SELECT *" doesn't have to do a projection,
so the SeqScan node just returns its scan tuple slot, which is
a BufferHeapTuple slot, and that triggers the wrong-slot-type
crash.  But "SELECT q2" does have to do a projection, so what
it returns is a Virtual tuple slot, and that's okay!  That's
not so much about restrictions of LookupTupleHashEntry as it
is about this optimization in ExecComputeSlotInfo:

    /* if the slot is known to always virtual we never need to deform */
    if (op->d.fetch.fixed && op->d.fetch.kind == &TTSOpsVirtual)
        return false;

That causes us not to emit the EEOP_INNER_FETCHSOME opcode that
is spitting up in the BufferHeapTuple case.  And apparently all
the rest of LookupTupleHashEntry is good with a virtual slot.
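
To make the effect of that check concrete, here is a minimal sketch in C.
The SlotKind enum, the FetchInfo struct and needs_fetch_step() are
hypothetical stand-ins invented for illustration; they are not
PostgreSQL's definitions, and the real ExecComputeSlotInfo does more than
this.  The sketch models only the one rule quoted above: no fetch step is
emitted when the slot type is fixed and known to be virtual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot kinds, standing in for the TTSOps* identities. */
typedef enum SlotKind
{
    SLOT_VIRTUAL,
    SLOT_BUFFER_HEAP,
    SLOT_MINIMAL
} SlotKind;

/* Hypothetical, trimmed-down analogue of the op->d.fetch fields. */
typedef struct FetchInfo
{
    bool        fixed;      /* is the slot kind known at compile time? */
    SlotKind    kind;       /* the kind; meaningful only if fixed */
} FetchInfo;

/*
 * Return true if a deforming fetch step must be emitted.  This mirrors
 * only the quoted rule: a slot known to always be virtual never needs one.
 */
static bool
needs_fetch_step(const FetchInfo *fetch)
{
    assert(fetch != NULL);

    if (fetch->fixed && fetch->kind == SLOT_VIRTUAL)
        return false;
    return true;
}
```

Under this model a projecting child (fixed, virtual) skips the step
entirely, which is consistent with the "SELECT q2" case never reaching
the failing opcode, while a fixed buffer-heap slot or a slot of unknown
kind still gets one.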

I can't avoid the feeling that this has all been optimized a little
too far, because it's darn near impossible to understand what has
gone wrong when something goes wrong.

Anyway, it's certainly unsafe for nodeSetOp.c to assume that its
two inputs will return slots of identical types.  Since
LookupTupleHashEntry does call this tuple-comparison code that is
happy to wire in assumptions about what the input slot type is,
we had better insert a buffering slot that all the data goes through,
as the v3 patch does.

I'm inclined to add a couple of test cases similar to

SELECT * FROM int8_tbl INTERSECT SELECT q2, q1 FROM int8_tbl ORDER BY 1,2;

so that we get some coverage of cases where the input slots
are dissimilar.

            regards, tom lane

Re: Converting SetOp to read its two inputs separately

From

Tom Lane

Date:

19 December 2024, 05:44:02

With David's recent fixes to allow telling BuildTupleHashTableExt
what input slot type to expect, it's possible to remove the per-row
slot type conversions I was doing before.  So here's an updated
patchset with that done.

The common_result_slot_type() function I wrote here perhaps
should be made generally available, but I didn't do that yet.

0002-0005 are the same as before.

            regards, tom lane

From 98dbe056964bb5d86db6598a8c3b344264351c20 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 18 Dec 2024 21:13:48 -0500
Subject: [PATCH v4 1/5] Convert SetOp to read its inputs as outerPlan and
 innerPlan.

The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.
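
As a rough worked illustration of the sorting argument (not part of
the patch), assume two inputs of N rows each and a simplified cost
model of N log2 N comparisons per sort.  Both are assumptions made
only for this example, and the model ignores the cross-comparisons
between left and right groups that the merge step adds.

```latex
\begin{align*}
C_{\text{combined}} &= 2N \log_2(2N) = 2N \log_2 N + 2N \\
C_{\text{separate}} &= 2 \cdot N \log_2 N \\
C_{\text{combined}} - C_{\text{separate}} &= 2N
\end{align*}
```

Under that model, with N = 10^6 the single combined sort costs about
41.9 million comparisons against about 39.9 million for two separate
sorts, before counting any benefit from presorted input.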

This initial patch re-implements nodeSetOp.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure how to quantify
how many more comparisons we might need.
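
The sorted-mode flow described above can be sketched in standalone C.
This is a toy model, not code from the patch: it uses sorted int
arrays in place of tuple slots, and the names SortedInput, load_group,
output_count and setop_sorted are hypothetical.  The per-group output
counts follow the usual SQL rules for INTERSECT [ALL] and EXCEPT [ALL].

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
    CMD_INTERSECT, CMD_INTERSECT_ALL, CMD_EXCEPT, CMD_EXCEPT_ALL
} SetOpCmd;

/* One sorted input plus its current group (hypothetical toy struct) */
typedef struct
{
    const int  *vals;
    size_t      n;
    size_t      pos;        /* index of the first value of the next group */
    int         first;      /* value shared by the current group */
    long        numTuples;  /* size of current group; 0 means exhausted */
} SortedInput;

/* Collect the next run of identical values from one input */
static void
load_group(SortedInput *in)
{
    in->numTuples = 0;
    if (in->pos >= in->n)
        return;             /* input exhausted: report an empty group */
    in->first = in->vals[in->pos];
    while (in->pos < in->n && in->vals[in->pos] == in->first)
    {
        in->pos++;
        in->numTuples++;
    }
    /* else input is mis-sorted */
    assert(in->pos >= in->n || in->vals[in->pos] > in->first);
}

/* How many copies of a group to emit, given its per-side counts */
static long
output_count(SetOpCmd cmd, long numLeft, long numRight)
{
    switch (cmd)
    {
        case CMD_INTERSECT:
            return (numLeft > 0 && numRight > 0) ? 1 : 0;
        case CMD_INTERSECT_ALL:
            return (numLeft < numRight) ? numLeft : numRight;
        case CMD_EXCEPT:
            return (numLeft > 0 && numRight == 0) ? 1 : 0;
        case CMD_EXCEPT_ALL:
            return (numLeft > numRight) ? numLeft - numRight : 0;
    }
    return 0;
}

/* Merge the groups of two sorted inputs; out needs room for nleft values */
static size_t
setop_sorted(SetOpCmd cmd, const int *left, size_t nleft,
             const int *right, size_t nright, int *out)
{
    SortedInput l = {left, nleft, 0, 0, 0};
    SortedInput r = {right, nright, 0, 0, 0};
    int         needLeft = 1;
    int         needRight = 1;
    size_t      nout = 0;

    for (;;)
    {
        long        numRight;

        if (needLeft)
        {
            load_group(&l);
            needLeft = 0;
        }
        if (l.numTuples == 0)
            break;          /* left exhausted: never look at right again */
        if (needRight)
        {
            load_group(&r);
            needRight = 0;
        }

        if (r.numTuples == 0 || l.first < r.first)
        {
            numRight = 0;   /* left group has no right matches */
            needLeft = 1;
        }
        else if (l.first == r.first)
        {
            numRight = r.numTuples;     /* matching groups */
            needLeft = needRight = 1;
        }
        else
        {
            needRight = 1;  /* right group has no left match; skip it */
            continue;
        }

        for (long k = output_count(cmd, l.numTuples, numRight); k > 0; k--)
            out[nout++] = l.first;
    }
    return nout;
}
```

Note that an empty left input makes the loop exit before the right
input is ever read, which is the early-exit behavior in this sketch.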

For HASHED mode, nodeSetOp's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.
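
The two-loop shape of hashed mode can be sketched the same way.  This
is a toy model under stated assumptions, not the executor code: keys
are plain ints, the table is a small fixed-size open-addressing array,
and Group, GroupTable, lookup and fill_table are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 64           /* toy capacity; must be a power of two */

typedef struct
{
    int         used;
    int         key;
    long        numLeft;        /* left-input rows in this group */
    long        numRight;       /* right-input rows in this group */
} Group;

typedef struct
{
    Group       slots[TABLE_SIZE];
    size_t      nused;
} GroupTable;

/* Find the group for key; create it only if "create" is nonzero */
static Group *
lookup(GroupTable *t, int key, int create)
{
    size_t      i = ((unsigned) key * 2654435761u) & (TABLE_SIZE - 1);

    for (;;)
    {
        Group      *g = &t->slots[i];

        if (!g->used)
        {
            if (!create)
                return NULL;
            /* keep one slot free so that probing always terminates */
            assert(t->nused < TABLE_SIZE - 1);
            g->used = 1;
            g->key = key;
            g->numLeft = g->numRight = 0;
            t->nused++;
            return g;
        }
        if (g->key == key)
            return g;
        i = (i + 1) & (TABLE_SIZE - 1);     /* linear probing */
    }
}

/* Loop 1 builds groups from the left input; loop 2 only counts matches */
static void
fill_table(GroupTable *t, const int *left, size_t nleft,
           const int *right, size_t nright)
{
    memset(t, 0, sizeof(*t));

    for (size_t i = 0; i < nleft; i++)
        lookup(t, left[i], 1)->numLeft++;

    /* empty left input: nothing can be emitted, so skip the right input */
    if (nleft == 0)
        return;

    for (size_t i = 0; i < nright; i++)
    {
        Group      *g = lookup(t, right[i], 0);

        /* rows seen only on the right never get a table entry */
        if (g != NULL)
            g->numRight++;
    }
}
```

After fill_table returns, a scan over the used slots has both counts
for every left-side group, which is all the output phase needs.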

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
 src/backend/executor/nodeSetOp.c        | 571 ++++++++++++++----------
 src/backend/optimizer/README            |   2 +-
 src/backend/optimizer/plan/createplan.c |  81 ++--
 src/backend/optimizer/prep/prepunion.c  | 156 ++++---
 src/backend/optimizer/util/pathnode.c   |  50 ++-
 src/include/nodes/execnodes.h           |  31 +-
 src/include/nodes/pathnodes.h           |   9 +-
 src/include/nodes/plannodes.h           |  19 +-
 src/include/optimizer/pathnode.h        |   7 +-
 src/test/regress/expected/subselect.out |  20 +-
 src/test/regress/expected/union.out     | 226 ++++------
 src/tools/pgindent/typedefs.list        |   2 +
 12 files changed, 645 insertions(+), 529 deletions(-)

diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index b40d81f3ff..fa00560aaf 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -3,29 +3,30 @@
  * nodeSetOp.c
  *      Routines to handle INTERSECT and EXCEPT selection
  *
- * The input of a SetOp node consists of tuples from two relations,
- * which have been combined into one dataset, with a junk attribute added
- * that shows which relation each tuple came from.  In SETOP_SORTED mode,
- * the input has furthermore been sorted according to all the grouping
- * columns (ie, all the non-junk attributes).  The SetOp node scans each
- * group of identical tuples to determine how many came from each input
- * relation.  Then it is a simple matter to emit the output demanded by the
- * SQL spec for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ * The input of a SetOp node consists of two relations (outer and inner)
+ * with identical column sets.  In EXCEPT queries the outer relation is
+ * always the left side, while in INTERSECT cases the planner tries to
+ * make the outer relation be the smaller of the two inputs.
  *
- * In SETOP_HASHED mode, the input is delivered in no particular order,
- * except that we know all the tuples from one input relation will come before
- * all the tuples of the other.  The planner guarantees that the first input
- * relation is the left-hand one for EXCEPT, and tries to make the smaller
- * input relation come first for INTERSECT.  We build a hash table in memory
- * with one entry for each group of identical tuples, and count the number of
- * tuples in the group from each relation.  After seeing all the input, we
- * scan the hashtable and generate the correct output using those counts.
- * We can avoid making hashtable entries for any tuples appearing only in the
- * second input relation, since they cannot result in any output.
+ * In SETOP_SORTED mode, each input has been sorted according to all the
+ * grouping columns (ie, all the non-junk attributes).  The SetOp node
+ * essentially performs a merge join on the grouping columns, except that
+ * it is only interested in counting how many tuples from each input match.
+ * Then it is a simple matter to emit the output demanded by the SQL spec
+ * for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ *
+ * In SETOP_HASHED mode, the inputs are delivered in no particular order.
+ * We read the outer relation and build a hash table in memory with one entry
+ * for each group of identical tuples, counting the number of tuples in the
+ * group.  Then we read the inner relation and count the number of tuples
+ * matching each outer group.  (We can disregard any tuples appearing only
+ * in the inner relation, since they cannot result in any output.)  After
+ * seeing all the input, we scan the hashtable and generate the correct
+ * output using those counts.
  *
  * This node type is not used for UNION or UNION ALL, since those can be
- * implemented more cheaply (there's no need for the junk attribute to
- * identify the source relation).
+ * implemented more cheaply (there's no need to count the number of
+ * matching tuples).
  *
  * Note that SetOp does no qual checking nor projection.  The delivered
  * output tuples are just copies of the first-to-arrive tuple in each
@@ -54,63 +55,48 @@
 /*
  * SetOpStatePerGroupData - per-group working state
  *
- * These values are working state that is initialized at the start of
- * an input tuple group and updated for each input tuple.
- *
- * In SETOP_SORTED mode, we need only one of these structs, and it's kept in
- * the plan state node.  In SETOP_HASHED mode, the hash table contains one
- * of these for each tuple group.
+ * In SETOP_SORTED mode, we need only one of these structs, and it's just a
+ * local in setop_retrieve_sorted.  In SETOP_HASHED mode, the hash table
+ * contains one of these for each tuple group.
  */
 typedef struct SetOpStatePerGroupData
 {
-    long        numLeft;        /* number of left-input dups in group */
-    long        numRight;        /* number of right-input dups in group */
-}            SetOpStatePerGroupData;
+    int64        numLeft;        /* number of left-input dups in group */
+    int64        numRight;        /* number of right-input dups in group */
+} SetOpStatePerGroupData;

+typedef SetOpStatePerGroupData *SetOpStatePerGroup;

-static TupleTableSlot *setop_retrieve_direct(SetOpState *setopstate);
+
+static TupleTableSlot *setop_retrieve_sorted(SetOpState *setopstate);
+static void setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                             SetOpState *setopstate);
+static int    setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                                SetOpState *setopstate);
 static void setop_fill_hash_table(SetOpState *setopstate);
 static TupleTableSlot *setop_retrieve_hash_table(SetOpState *setopstate);


 /*
- * Initialize state for a new group of input values.
- */
-static inline void
-initialize_counts(SetOpStatePerGroup pergroup)
-{
-    pergroup->numLeft = pergroup->numRight = 0;
-}
-
-/*
- * Advance the appropriate counter for one input tuple.
+ * If the two given PlanState nodes return the same fixed tuple slot type,
+ * return the slot ops struct for that slot type.  Else return NULL.
  */
-static inline void
-advance_counts(SetOpStatePerGroup pergroup, int flag)
+static const TupleTableSlotOps *
+common_result_slot_type(PlanState *ps1, PlanState *ps2)
 {
-    if (flag)
-        pergroup->numRight++;
-    else
-        pergroup->numLeft++;
-}
+    const TupleTableSlotOps *ops1;
+    const TupleTableSlotOps *ops2;
+    bool        isfixed;

-/*
- * Fetch the "flag" column from an input tuple.
- * This is an integer column with value 0 for left side, 1 for right side.
- */
-static int
-fetch_tuple_flag(SetOpState *setopstate, TupleTableSlot *inputslot)
-{
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
-    int            flag;
-    bool        isNull;
-
-    flag = DatumGetInt32(slot_getattr(inputslot,
-                                      node->flagColIdx,
-                                      &isNull));
-    Assert(!isNull);
-    Assert(flag == 0 || flag == 1);
-    return flag;
+    ops1 = ExecGetResultSlotOps(ps1, &isfixed);
+    if (!isfixed)
+        return NULL;
+    ops2 = ExecGetResultSlotOps(ps2, &isfixed);
+    if (!isfixed)
+        return NULL;
+    if (ops1 == ops2)
+        return ops1;
+    return NULL;
 }

 /*
@@ -126,20 +112,27 @@ build_hash_table(SetOpState *setopstate)
     Assert(node->strategy == SETOP_HASHED);
     Assert(node->numGroups > 0);

-    setopstate->hashtable = BuildTupleHashTableExt(&setopstate->ps,
-                                                   desc,
-                                                   NULL,
-                                                   node->numCols,
-                                                   node->dupColIdx,
-                                                   setopstate->eqfuncoids,
-                                                   setopstate->hashfunctions,
-                                                   node->dupCollations,
-                                                   node->numGroups,
-                                                   0,
-                                                   setopstate->ps.state->es_query_cxt,
-                                                   setopstate->tableContext,
-                                                   econtext->ecxt_per_tuple_memory,
-                                                   false);
+    /*
+     * If both child plans deliver the same fixed tuple slot type, we can tell
+     * BuildTupleHashTableExt to expect that slot type as input.  Otherwise,
+     * we'll pass NULL denoting that any slot type is possible.
+     */
+    setopstate->hashtable =
+        BuildTupleHashTableExt(&setopstate->ps,
+                               desc,
+                               common_result_slot_type(outerPlanState(setopstate),
+                                                       innerPlanState(setopstate)),
+                               node->numCols,
+                               node->cmpColIdx,
+                               setopstate->eqfuncoids,
+                               setopstate->hashfunctions,
+                               node->cmpCollations,
+                               node->numGroups,
+                               0,
+                               setopstate->ps.state->es_query_cxt,
+                               setopstate->tableContext,
+                               econtext->ecxt_per_tuple_memory,
+                               false);
 }

 /*
@@ -218,108 +211,126 @@ ExecSetOp(PlanState *pstate)
         return setop_retrieve_hash_table(node);
     }
     else
-        return setop_retrieve_direct(node);
+        return setop_retrieve_sorted(node);
 }

 /*
  * ExecSetOp for non-hashed case
  */
 static TupleTableSlot *
-setop_retrieve_direct(SetOpState *setopstate)
+setop_retrieve_sorted(SetOpState *setopstate)
 {
     PlanState  *outerPlan;
-    SetOpStatePerGroup pergroup;
-    TupleTableSlot *outerslot;
+    PlanState  *innerPlan;
     TupleTableSlot *resultTupleSlot;
-    ExprContext *econtext = setopstate->ps.ps_ExprContext;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    pergroup = (SetOpStatePerGroup) setopstate->pergroup;
+    innerPlan = innerPlanState(setopstate);
     resultTupleSlot = setopstate->ps.ps_ResultTupleSlot;

     /*
-     * We loop retrieving groups until we find one we should return
+     * If first time through, establish the invariant that setop_load_group
+     * expects: each side's nextTupleSlot is the next output from the child
+     * plan, or empty if there is no more output from it.
      */
-    while (!setopstate->setop_done)
+    if (setopstate->need_init)
     {
+        setopstate->need_init = false;
+
+        setopstate->leftInput.nextTupleSlot = ExecProcNode(outerPlan);
+
         /*
-         * If we don't already have the first tuple of the new group, fetch it
-         * from the outer plan.
+         * If the outer relation is empty, then we will emit nothing, and we
+         * don't need to read the inner relation at all.
          */
-        if (setopstate->grp_firstTuple == NULL)
+        if (TupIsNull(setopstate->leftInput.nextTupleSlot))
         {
-            outerslot = ExecProcNode(outerPlan);
-            if (!TupIsNull(outerslot))
-            {
-                /* Make a copy of the first input tuple */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-            }
-            else
-            {
-                /* outer plan produced no tuples at all */
-                setopstate->setop_done = true;
-                return NULL;
-            }
+            setopstate->setop_done = true;
+            return NULL;
         }

+        setopstate->rightInput.nextTupleSlot = ExecProcNode(innerPlan);
+
+        /* Set flags that we've not completed either side's group */
+        setopstate->leftInput.needGroup = true;
+        setopstate->rightInput.needGroup = true;
+    }
+
+    /*
+     * We loop retrieving groups until we find one we should return
+     */
+    while (!setopstate->setop_done)
+    {
+        int            cmpresult;
+        SetOpStatePerGroupData pergroup;
+
         /*
-         * Store the copied first input tuple in the tuple table slot reserved
-         * for it.  The tuple will be deleted when it is cleared from the
-         * slot.
+         * Fetch the rest of the current outer group, if we didn't already.
          */
-        ExecStoreHeapTuple(setopstate->grp_firstTuple,
-                           resultTupleSlot,
-                           true);
-        setopstate->grp_firstTuple = NULL;    /* don't keep two pointers */
+        if (setopstate->leftInput.needGroup)
+            setop_load_group(&setopstate->leftInput, outerPlan, setopstate);

-        /* Initialize working state for a new input tuple group */
-        initialize_counts(pergroup);
+        /*
+         * If no more outer groups, we're done, and don't need to look at any
+         * more of the inner relation.
+         */
+        if (setopstate->leftInput.numTuples == 0)
+        {
+            setopstate->setop_done = true;
+            break;
+        }

-        /* Count the first input tuple */
-        advance_counts(pergroup,
-                       fetch_tuple_flag(setopstate, resultTupleSlot));
+        /*
+         * Fetch the rest of the current inner group, if we didn't already.
+         */
+        if (setopstate->rightInput.needGroup)
+            setop_load_group(&setopstate->rightInput, innerPlan, setopstate);

         /*
-         * Scan the outer plan until we exhaust it or cross a group boundary.
+         * Determine whether we have matching groups on both sides (this is
+         * basically like the core logic of a merge join).
          */
-        for (;;)
-        {
-            outerslot = ExecProcNode(outerPlan);
-            if (TupIsNull(outerslot))
-            {
-                /* no more outer-plan tuples available */
-                setopstate->setop_done = true;
-                break;
-            }
-
-            /*
-             * Check whether we've crossed a group boundary.
-             */
-            econtext->ecxt_outertuple = resultTupleSlot;
-            econtext->ecxt_innertuple = outerslot;
-
-            if (!ExecQualAndReset(setopstate->eqfunction, econtext))
-            {
-                /*
-                 * Save the first input tuple of the next group.
-                 */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-                break;
-            }
+        if (setopstate->rightInput.numTuples == 0)
+            cmpresult = -1;        /* as though left input is lesser */
+        else
+            cmpresult = setop_compare_slots(setopstate->leftInput.firstTupleSlot,
+                                            setopstate->rightInput.firstTupleSlot,
+                                            setopstate);

-            /* Still in same group, so count this tuple */
-            advance_counts(pergroup,
-                           fetch_tuple_flag(setopstate, outerslot));
+        if (cmpresult < 0)
+        {
+            /* Left group is first, has no right matches */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = 0;
+            /* We'll need another left group next time */
+            setopstate->leftInput.needGroup = true;
+        }
+        else if (cmpresult == 0)
+        {
+            /* We have matching groups */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = setopstate->rightInput.numTuples;
+            /* We'll need to read from both sides next time */
+            setopstate->leftInput.needGroup = true;
+            setopstate->rightInput.needGroup = true;
+        }
+        else
+        {
+            /* Right group has no left matches, so we can ignore it */
+            setopstate->rightInput.needGroup = true;
+            continue;
         }

         /*
-         * Done scanning input tuple group.  See if we should emit any copies
-         * of result tuple, and if so return the first copy.
+         * Done scanning these input tuple groups.  See if we should emit any
+         * copies of result tuple, and if so return the first copy.  (Note
+         * that the result tuple is the same as the left input's firstTuple
+         * slot.)
          */
-        set_output_count(setopstate, pergroup);
+        set_output_count(setopstate, &pergroup);

         if (setopstate->numOutput > 0)
         {
@@ -334,84 +345,168 @@ setop_retrieve_direct(SetOpState *setopstate)
 }

 /*
- * ExecSetOp for hashed case: phase 1, read input and build hash table
+ * Load next group of tuples from one child plan or the other.
+ *
+ * On entry, we've already read the first tuple of the next group
+ * (if there is one) into input->nextTupleSlot.  This invariant
+ * is maintained on exit.
+ */
+static void
+setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                 SetOpState *setopstate)
+{
+    input->needGroup = false;
+
+    /* If we've exhausted this child plan, report an empty group */
+    if (TupIsNull(input->nextTupleSlot))
+    {
+        ExecClearTuple(input->firstTupleSlot);
+        input->numTuples = 0;
+        return;
+    }
+
+    /* Make a local copy of the first tuple for comparisons */
+    ExecStoreMinimalTuple(ExecCopySlotMinimalTuple(input->nextTupleSlot),
+                          input->firstTupleSlot,
+                          true);
+    /* and count it */
+    input->numTuples = 1;
+
+    /* Scan till we find the end-of-group */
+    for (;;)
+    {
+        int            cmpresult;
+
+        /* Get next input tuple, if there is one */
+        input->nextTupleSlot = ExecProcNode(inputPlan);
+        if (TupIsNull(input->nextTupleSlot))
+            break;
+
+        /* There is; does it belong to same group as firstTuple? */
+        cmpresult = setop_compare_slots(input->firstTupleSlot,
+                                        input->nextTupleSlot,
+                                        setopstate);
+        Assert(cmpresult <= 0); /* else input is mis-sorted */
+        if (cmpresult != 0)
+            break;
+
+        /* Still in same group, so count this tuple */
+        input->numTuples++;
+    }
+}
+
+/*
+ * Compare the tuples in the two given slots.
+ */
+static int
+setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                    SetOpState *setopstate)
+{
+    /* We'll often need to fetch all the columns, so just do it */
+    slot_getallattrs(s1);
+    slot_getallattrs(s2);
+    for (int nkey = 0; nkey < setopstate->numCols; nkey++)
+    {
+        SortSupport sortKey = setopstate->sortKeys + nkey;
+        AttrNumber    attno = sortKey->ssup_attno;
+        Datum        datum1 = s1->tts_values[attno - 1],
+                    datum2 = s2->tts_values[attno - 1];
+        bool        isNull1 = s1->tts_isnull[attno - 1],
+                    isNull2 = s2->tts_isnull[attno - 1];
+        int            compare;
+
+        compare = ApplySortComparator(datum1, isNull1,
+                                      datum2, isNull2,
+                                      sortKey);
+        if (compare != 0)
+            return compare;
+    }
+    return 0;
+}
+
+/*
+ * ExecSetOp for hashed case: phase 1, read inputs and build hash table
  */
 static void
 setop_fill_hash_table(SetOpState *setopstate)
 {
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
     PlanState  *outerPlan;
-    int            firstFlag;
-    bool        in_first_rel PG_USED_FOR_ASSERTS_ONLY;
+    PlanState  *innerPlan;
     ExprContext *econtext = setopstate->ps.ps_ExprContext;
+    bool        have_tuples = false;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    firstFlag = node->firstFlag;
-    /* verify planner didn't mess up */
-    Assert(firstFlag == 0 ||
-           (firstFlag == 1 &&
-            (node->cmd == SETOPCMD_INTERSECT ||
-             node->cmd == SETOPCMD_INTERSECT_ALL)));
+    innerPlan = innerPlanState(setopstate);

     /*
      * Process each outer-plan tuple, and then fetch the next one, until we
      * exhaust the outer plan.
      */
-    in_first_rel = true;
     for (;;)
     {
         TupleTableSlot *outerslot;
-        int            flag;
         TupleHashEntryData *entry;
         bool        isnew;

         outerslot = ExecProcNode(outerPlan);
         if (TupIsNull(outerslot))
             break;
+        have_tuples = true;

-        /* Identify whether it's left or right input */
-        flag = fetch_tuple_flag(setopstate, outerslot);
+        /* Find or build hashtable entry for this tuple's group */
+        entry = LookupTupleHashEntry(setopstate->hashtable,
+                                     outerslot,
+                                     &isnew, NULL);

-        if (flag == firstFlag)
+        /* If new tuple group, initialize counts to zero */
+        if (isnew)
         {
-            /* (still) in first input relation */
-            Assert(in_first_rel);
-
-            /* Find or build hashtable entry for this tuple's group */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
-                                         &isnew, NULL);
-
-            /* If new tuple group, initialize counts */
-            if (isnew)
-            {
-                entry->additional = (SetOpStatePerGroup)
-                    MemoryContextAlloc(setopstate->hashtable->tablecxt,
+            entry->additional = (SetOpStatePerGroup)
+                MemoryContextAllocZero(setopstate->hashtable->tablecxt,
                                        sizeof(SetOpStatePerGroupData));
-                initialize_counts((SetOpStatePerGroup) entry->additional);
-            }
-
-            /* Advance the counts */
-            advance_counts((SetOpStatePerGroup) entry->additional, flag);
         }
-        else
+
+        /* Advance the counts */
+        ((SetOpStatePerGroup) entry->additional)->numLeft++;
+
+        /* Must reset expression context after each hashtable lookup */
+        ResetExprContext(econtext);
+    }
+
+    /*
+     * If the outer relation is empty, then we will emit nothing, and we don't
+     * need to read the inner relation at all.
+     */
+    if (have_tuples)
+    {
+        /*
+         * Process each inner-plan tuple, and then fetch the next one, until
+         * we exhaust the inner plan.
+         */
+        for (;;)
         {
-            /* reached second relation */
-            in_first_rel = false;
+            TupleTableSlot *innerslot;
+            TupleHashEntryData *entry;
+
+            innerslot = ExecProcNode(innerPlan);
+            if (TupIsNull(innerslot))
+                break;

             /* For tuples not seen previously, do not make hashtable entry */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+            entry = LookupTupleHashEntry(setopstate->hashtable,
+                                         innerslot,
                                          NULL, NULL);

             /* Advance the counts if entry is already present */
             if (entry)
-                advance_counts((SetOpStatePerGroup) entry->additional, flag);
-        }
+                ((SetOpStatePerGroup) entry->additional)->numRight++;

-        /* Must reset expression context after each hashtable lookup */
-        ResetExprContext(econtext);
+            /* Must reset expression context after each hashtable lookup */
+            ResetExprContext(econtext);
+        }
     }

     setopstate->table_filled = true;
@@ -482,7 +577,6 @@ SetOpState *
 ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 {
     SetOpState *setopstate;
-    TupleDesc    outerDesc;

     /* check for unsupported flags */
     Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -495,14 +589,10 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     setopstate->ps.state = estate;
     setopstate->ps.ExecProcNode = ExecSetOp;

-    setopstate->eqfuncoids = NULL;
-    setopstate->hashfunctions = NULL;
     setopstate->setop_done = false;
     setopstate->numOutput = 0;
-    setopstate->pergroup = NULL;
-    setopstate->grp_firstTuple = NULL;
-    setopstate->hashtable = NULL;
-    setopstate->tableContext = NULL;
+    setopstate->numCols = node->numCols;
+    setopstate->need_init = true;

     /*
      * create expression context
@@ -523,52 +613,72 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * initialize child nodes
      *
-     * If we are hashing then the child plan does not need to handle REWIND
+     * If we are hashing then the child plans do not need to handle REWIND
      * efficiently; see ExecReScanSetOp.
      */
     if (node->strategy == SETOP_HASHED)
         eflags &= ~EXEC_FLAG_REWIND;
     outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
-    outerDesc = ExecGetResultType(outerPlanState(setopstate));
+    innerPlanState(setopstate) = ExecInitNode(innerPlan(node), estate, eflags);

     /*
-     * Initialize result slot and type. Setop nodes do no projections, so
-     * initialize projection info for this node appropriately.
+     * Initialize locally-allocated slots.  In hashed mode, we just need a
+     * result slot.  In sorted mode, we need one first-tuple-of-group slot for
+     * each input; we use the result slot for the left input's slot and create
+     * another for the right input.  (Note: the nextTupleSlot slots are not
+     * ours, but just point to the last slot returned by the input plan node.)
      */
-    ExecInitResultTupleSlotTL(&setopstate->ps,
-                              node->strategy == SETOP_HASHED ?
-                              &TTSOpsMinimalTuple : &TTSOpsHeapTuple);
+    ExecInitResultTupleSlotTL(&setopstate->ps, &TTSOpsMinimalTuple);
+    if (node->strategy != SETOP_HASHED)
+    {
+        setopstate->leftInput.firstTupleSlot =
+            setopstate->ps.ps_ResultTupleSlot;
+        setopstate->rightInput.firstTupleSlot =
+            ExecInitExtraTupleSlot(estate,
+                                   setopstate->ps.ps_ResultTupleDesc,
+                                   &TTSOpsMinimalTuple);
+    }
+
+    /* Setop nodes do no projections. */
     setopstate->ps.ps_ProjInfo = NULL;

     /*
-     * Precompute fmgr lookup data for inner loop. We need both equality and
-     * hashing functions to do it by hashing, but only equality if not
-     * hashing.
+     * Precompute fmgr lookup data for inner loop.  We need equality and
+     * hashing functions to do it by hashing, while for sorting we need
+     * SortSupport data.
      */
     if (node->strategy == SETOP_HASHED)
         execTuplesHashPrepare(node->numCols,
-                              node->dupOperators,
+                              node->cmpOperators,
                               &setopstate->eqfuncoids,
                               &setopstate->hashfunctions);
     else
-        setopstate->eqfunction =
-            execTuplesMatchPrepare(outerDesc,
-                                   node->numCols,
-                                   node->dupColIdx,
-                                   node->dupOperators,
-                                   node->dupCollations,
-                                   &setopstate->ps);
+    {
+        int            nkeys = node->numCols;
+
+        setopstate->sortKeys = (SortSupport)
+            palloc0(nkeys * sizeof(SortSupportData));
+        for (int i = 0; i < nkeys; i++)
+        {
+            SortSupport sortKey = setopstate->sortKeys + i;
+
+            sortKey->ssup_cxt = CurrentMemoryContext;
+            sortKey->ssup_collation = node->cmpCollations[i];
+            sortKey->ssup_nulls_first = node->cmpNullsFirst[i];
+            sortKey->ssup_attno = node->cmpColIdx[i];
+            /* abbreviated key conversion is not useful here */
+            sortKey->abbreviate = false;

+            PrepareSortSupportFromOrderingOp(node->cmpOperators[i], sortKey);
+        }
+    }
+
+    /* Create a hash table if needed */
     if (node->strategy == SETOP_HASHED)
     {
         build_hash_table(setopstate);
         setopstate->table_filled = false;
     }
-    else
-    {
-        setopstate->pergroup =
-            (SetOpStatePerGroup) palloc0(sizeof(SetOpStatePerGroupData));
-    }

     return setopstate;
 }
@@ -576,7 +686,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 /* ----------------------------------------------------------------
  *        ExecEndSetOp
  *
- *        This shuts down the subplan and frees resources allocated
+ *        This shuts down the subplans and frees resources allocated
  *        to this node.
  * ----------------------------------------------------------------
  */
@@ -588,6 +698,7 @@ ExecEndSetOp(SetOpState *node)
         MemoryContextDelete(node->tableContext);

     ExecEndNode(outerPlanState(node));
+    ExecEndNode(innerPlanState(node));
 }


@@ -595,6 +706,7 @@ void
 ExecReScanSetOp(SetOpState *node)
 {
     PlanState  *outerPlan = outerPlanState(node);
+    PlanState  *innerPlan = innerPlanState(node);

     ExecClearTuple(node->ps.ps_ResultTupleSlot);
     node->setop_done = false;
@@ -612,34 +724,29 @@ ExecReScanSetOp(SetOpState *node)
             return;

         /*
-         * If we do have the hash table and the subplan does not have any
+         * If we do have the hash table and the subplans do not have any
          * parameter changes, then we can just rescan the existing hash table;
          * no need to build it again.
          */
-        if (outerPlan->chgParam == NULL)
+        if (outerPlan->chgParam == NULL && innerPlan->chgParam == NULL)
         {
             ResetTupleHashIterator(node->hashtable, &node->hashiter);
             return;
         }
-    }
-
-    /* Release first tuple of group, if we have made a copy */
-    if (node->grp_firstTuple != NULL)
-    {
-        heap_freetuple(node->grp_firstTuple);
-        node->grp_firstTuple = NULL;
-    }

-    /* Release any hashtable storage */
-    if (node->tableContext)
-        MemoryContextReset(node->tableContext);
+        /* Release any hashtable storage */
+        if (node->tableContext)
+            MemoryContextReset(node->tableContext);

-    /* And rebuild empty hashtable if needed */
-    if (((SetOp *) node->ps.plan)->strategy == SETOP_HASHED)
-    {
+        /* And rebuild an empty hashtable */
         ResetTupleHashTable(node->hashtable);
         node->table_filled = false;
     }
+    else
+    {
+        /* Need to re-read first input from each side */
+        node->need_init = true;
+    }

     /*
      * if chgParam of subnode is not null then plan will be re-scanned by
@@ -647,4 +754,6 @@ ExecReScanSetOp(SetOpState *node)
      */
     if (outerPlan->chgParam == NULL)
         ExecReScan(outerPlan);
+    if (innerPlan->chgParam == NULL)
+        ExecReScan(innerPlan);
 }
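For anyone reading along without the rest of nodeSetOp.c at hand: the sorted strategy now behaves like a merge over two inputs that are each sorted on the grouping columns, counting the size of the current group on each side. The standalone sketch below is only an illustration of that idea, not the patch's code. It uses plain int arrays in place of tuple slots and SortSupport comparisons, and the names count_group and intersect_sorted are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the run of equal values starting at a[*pos] and advance *pos past
 * it.  This plays the role of loading one "group" from a sorted input.
 */
static long
count_group(const int *a, size_t n, size_t *pos)
{
    int     key = a[*pos];
    long    cnt = 0;

    while (*pos < n && a[*pos] == key)
    {
        cnt++;
        (*pos)++;
    }
    return cnt;
}

/*
 * INTERSECT (all = 0) or INTERSECT ALL (all = 1) over two sorted arrays.
 * Writes the result into out, which the caller must size to at least
 * the smaller of nl and nr, and returns the number of rows written.
 */
static size_t
intersect_sorted(const int *l, size_t nl,
                 const int *r, size_t nr,
                 int all, int *out)
{
    size_t  li = 0;
    size_t  ri = 0;
    size_t  nout = 0;

    assert(out != NULL);

    while (li < nl && ri < nr)
    {
        if (l[li] < r[ri])
            (void) count_group(l, nl, &li); /* group only on the left */
        else if (l[li] > r[ri])
            (void) count_group(r, nr, &ri); /* group only on the right */
        else
        {
            int     key = l[li];
            long    lc = count_group(l, nl, &li);
            long    rc = count_group(r, nr, &ri);
            long    emit = all ? (lc < rc ? lc : rc) : 1;

            while (emit-- > 0)
                out[nout++] = key;
        }
    }
    return nout;
}
```

EXCEPT would follow the same loop shape, except that left-only groups are emitted and matching groups emit either nothing or the count difference for EXCEPT ALL; that variant is left out to keep the sketch short.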
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 2ab4f3dbf3..f341d9f303 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -649,7 +649,7 @@ RelOptInfo      - a relation or joined relations
   GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
   MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
   WindowAggPath - a WindowAgg plan node applied to some sub-path
-  SetOpPath     - a SetOp plan node applied to some sub-path
+  SetOpPath     - a SetOp plan node applied to two sub-paths
   RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
   LockRowsPath  - a LockRows plan node applied to some sub-path
   ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 178c572b02..b3e2294e84 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -301,9 +301,9 @@ static Unique *make_unique_from_pathkeys(Plan *lefttree,
                                          List *pathkeys, int numCols);
 static Gather *make_gather(List *qptlist, List *qpqual,
                            int nworkers, int rescan_param, bool single_copy, Plan *subplan);
-static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-                         List *distinctList, AttrNumber flagColIdx, int firstFlag,
-                         long numGroups);
+static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+                         List *tlist, Plan *lefttree, Plan *righttree,
+                         List *groupList, long numGroups);
 static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
 static Result *make_result(List *tlist, Node *resconstantqual, Plan *subplan);
 static ProjectSet *make_project_set(List *tlist, Plan *subplan);
@@ -2719,25 +2719,29 @@ static SetOp *
 create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 {
     SetOp       *plan;
-    Plan       *subplan;
+    List       *tlist = build_path_tlist(root, &best_path->path);
+    Plan       *leftplan;
+    Plan       *rightplan;
     long        numGroups;

     /*
      * SetOp doesn't project, so tlist requirements pass through; moreover we
      * need grouping columns to be labeled.
      */
-    subplan = create_plan_recurse(root, best_path->subpath,
-                                  flags | CP_LABEL_TLIST);
+    leftplan = create_plan_recurse(root, best_path->leftpath,
+                                   flags | CP_LABEL_TLIST);
+    rightplan = create_plan_recurse(root, best_path->rightpath,
+                                    flags | CP_LABEL_TLIST);

     /* Convert numGroups to long int --- but 'ware overflow! */
     numGroups = clamp_cardinality_to_long(best_path->numGroups);

     plan = make_setop(best_path->cmd,
                       best_path->strategy,
-                      subplan,
-                      best_path->distinctList,
-                      best_path->flagColIdx,
-                      best_path->firstFlag,
+                      tlist,
+                      leftplan,
+                      rightplan,
+                      best_path->groupList,
                       numGroups);

     copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6950,57 +6954,62 @@ make_gather(List *qptlist,
 }

 /*
- * distinctList is a list of SortGroupClauses, identifying the targetlist
- * items that should be considered by the SetOp filter.  The input path must
- * already be sorted accordingly.
+ * groupList is a list of SortGroupClauses, identifying the targetlist
+ * items that should be considered by the SetOp filter.  The input plans must
+ * already be sorted accordingly, if we're doing SETOP_SORTED mode.
  */
 static SetOp *
-make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-           List *distinctList, AttrNumber flagColIdx, int firstFlag,
-           long numGroups)
+make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+           List *tlist, Plan *lefttree, Plan *righttree,
+           List *groupList, long numGroups)
 {
     SetOp       *node = makeNode(SetOp);
     Plan       *plan = &node->plan;
-    int            numCols = list_length(distinctList);
+    int            numCols = list_length(groupList);
     int            keyno = 0;
-    AttrNumber *dupColIdx;
-    Oid           *dupOperators;
-    Oid           *dupCollations;
+    AttrNumber *cmpColIdx;
+    Oid           *cmpOperators;
+    Oid           *cmpCollations;
+    bool       *cmpNullsFirst;
     ListCell   *slitem;

-    plan->targetlist = lefttree->targetlist;
+    plan->targetlist = tlist;
     plan->qual = NIL;
     plan->lefttree = lefttree;
-    plan->righttree = NULL;
+    plan->righttree = righttree;

     /*
-     * convert SortGroupClause list into arrays of attr indexes and equality
+     * convert SortGroupClause list into arrays of attr indexes and comparison
      * operators, as wanted by executor
      */
-    dupColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
-    dupOperators = (Oid *) palloc(sizeof(Oid) * numCols);
-    dupCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
+    cmpOperators = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpNullsFirst = (bool *) palloc(sizeof(bool) * numCols);

-    foreach(slitem, distinctList)
+    foreach(slitem, groupList)
     {
         SortGroupClause *sortcl = (SortGroupClause *) lfirst(slitem);
         TargetEntry *tle = get_sortgroupclause_tle(sortcl, plan->targetlist);

-        dupColIdx[keyno] = tle->resno;
-        dupOperators[keyno] = sortcl->eqop;
-        dupCollations[keyno] = exprCollation((Node *) tle->expr);
-        Assert(OidIsValid(dupOperators[keyno]));
+        cmpColIdx[keyno] = tle->resno;
+        if (strategy == SETOP_HASHED)
+            cmpOperators[keyno] = sortcl->eqop;
+        else
+            cmpOperators[keyno] = sortcl->sortop;
+        Assert(OidIsValid(cmpOperators[keyno]));
+        cmpCollations[keyno] = exprCollation((Node *) tle->expr);
+        cmpNullsFirst[keyno] = sortcl->nulls_first;
         keyno++;
     }

     node->cmd = cmd;
     node->strategy = strategy;
     node->numCols = numCols;
-    node->dupColIdx = dupColIdx;
-    node->dupOperators = dupOperators;
-    node->dupCollations = dupCollations;
-    node->flagColIdx = flagColIdx;
-    node->firstFlag = firstFlag;
+    node->cmpColIdx = cmpColIdx;
+    node->cmpOperators = cmpOperators;
+    node->cmpCollations = cmpCollations;
+    node->cmpNullsFirst = cmpNullsFirst;
     node->numGroups = numGroups;

     return node;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 9c3822f19a..1c43370005 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -65,7 +65,7 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
 static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *input_path,
+                                Path *lpath, Path *rpath,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
@@ -315,8 +315,8 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * to the corresponding tlist entries of the subplan. However, since
          * the subplan was generated by generate_union_paths() or
          * generate_nonunion_paths(), and hence its tlist was generated by
-         * generate_append_tlist(), this will work.  We just tell
-         * generate_setop_tlist() to use varno 0.
+         * generate_append_tlist() or generate_setop_tlist(), this will work.
+         * We just tell generate_setop_tlist() to use varno 0.
          */
         if (flag >= 0 ||
             !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
@@ -1028,29 +1028,27 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                *path;
     List       *lpath_tlist,
                *rpath_tlist,
-               *tlist_list,
                *tlist,
-               *groupList,
-               *pathlist;
+               *groupList;
     bool        lpath_trivial_tlist,
-                rpath_trivial_tlist;
+                rpath_trivial_tlist,
+                result_trivial_tlist;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
     bool        use_hash;
     SetOpCmd    cmd;
-    int            firstFlag;

     /*
      * Tell children to fetch all tuples.
      */
     root->tuple_fraction = 0.0;

-    /* Recurse on children, ensuring their outputs are marked */
+    /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 0,
+                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,10 +1058,9 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dLeftGroups = lrel->rows;

-    lpath = lrel->cheapest_total_path;
     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 1,
+                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1073,41 +1070,51 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dRightGroups = rrel->rows;

-    rpath = rrel->cheapest_total_path;
-
     /* Undo effects of forcing tuple_fraction to 0 */
     root->tuple_fraction = save_fraction;

     /*
      * For EXCEPT, we must put the left input first.  For INTERSECT, either
      * order should give the same results, and we prefer to put the smaller
-     * input first in order to minimize the size of the hash table in the
-     * hashing case.  "Smaller" means the one with the fewer groups.
+     * input first in order to (a) minimize the size of the hash table in the
+     * hashing case, and (b) improve our chances of exploiting the executor's
+     * fast path for empty left-hand input.  "Smaller" means the one with the
+     * fewer groups.
      */
-    if (op->op == SETOP_EXCEPT || dLeftGroups <= dRightGroups)
-    {
-        pathlist = list_make2(lpath, rpath);
-        tlist_list = list_make2(lpath_tlist, rpath_tlist);
-        firstFlag = 0;
-    }
-    else
+    if (op->op != SETOP_EXCEPT && dLeftGroups > dRightGroups)
     {
-        pathlist = list_make2(rpath, lpath);
-        tlist_list = list_make2(rpath_tlist, lpath_tlist);
-        firstFlag = 1;
+        /* need to swap the two inputs */
+        RelOptInfo *tmprel;
+        List       *tmplist;
+        double        tmpd;
+
+        tmprel = lrel;
+        lrel = rrel;
+        rrel = tmprel;
+        tmplist = lpath_tlist;
+        lpath_tlist = rpath_tlist;
+        rpath_tlist = tmplist;
+        tmpd = dLeftGroups;
+        dLeftGroups = dRightGroups;
+        dRightGroups = tmpd;
     }

+    lpath = lrel->cheapest_total_path;
+    rpath = rrel->cheapest_total_path;
+
     /*
-     * Generate tlist for Append plan node.
+     * Generate tlist for SetOp plan node.
      *
-     * The tlist for an Append plan isn't important as far as the Append is
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
      * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.  In fact, it has to be real enough that the flag
-     * column is shown as a variable not a constant, else setrefs.c will get
-     * confused.
+     * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, true,
-                                  tlist_list, refnames_tlist);
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);

     *pTargetList = tlist;

@@ -1116,12 +1123,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /*
-     * Append the child results together.
-     */
-    path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
-                                       NIL, NULL, 0, false, -1);
-
     /* Identify the grouping semantics */
     groupList = generate_setop_grouplist(op, tlist);

@@ -1140,25 +1141,40 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     else
     {
-        dNumGroups = Min(dLeftGroups, dRightGroups);
+        dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }

     /*
-     * Decide whether to hash or sort, and add a sort node if needed.
+     * Decide whether to hash or sort, and add sort nodes if needed.
      */
-    use_hash = choose_hashed_setop(root, groupList, path,
+    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
                                    dNumGroups, dNumOutputRows,
                                    (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");

     if (groupList && !use_hash)
-        path = (Path *) create_sort_path(root,
-                                         result_rel,
-                                         path,
-                                         make_pathkeys_for_sortclauses(root,
-                                                                       groupList,
-                                                                       tlist),
-                                         -1.0);
+    {
+        List       *pathkeys;
+
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 lpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            lpath = (Path *) create_sort_path(root,
+                                              lpath->parent,
+                                              lpath,
+                                              pathkeys,
+                                              -1.0);
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 rpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            rpath = (Path *) create_sort_path(root,
+                                              rpath->parent,
+                                              rpath,
+                                              pathkeys,
+                                              -1.0);
+    }

     /*
      * Finally, add a SetOp path node to generate the correct output.
@@ -1178,12 +1194,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     path = (Path *) create_setop_path(root,
                                       result_rel,
-                                      path,
+                                      lpath,
+                                      rpath,
                                       cmd,
                                       use_hash ? SETOP_HASHED : SETOP_SORTED,
                                       groupList,
-                                      list_length(op->colTypes) + 1,
-                                      use_hash ? firstFlag : -1,
                                       dNumGroups,
                                       dNumOutputRows);

@@ -1285,10 +1300,13 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)

 /*
  * choose_hashed_setop - should we use hashing for a set operation?
+ *
+ * XXX probably this should go away: just make both paths and let
+ * add_path sort it out.
  */
 static bool
 choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *input_path,
+                    Path *lpath, Path *rpath,
                     double dNumGroups, double dNumOutputRows,
                     const char *construct)
 {
@@ -1327,7 +1345,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * Don't do it if it doesn't look like the hashtable will fit into
      * hash_mem.
      */
-    hashentrysize = MAXALIGN(input_path->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
+    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);

     if (hashentrysize * dNumGroups > hash_mem_limit)
         return false;
@@ -1336,9 +1354,9 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * See if the estimated cost is no more than doing it the other way.
      *
      * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group.  Note that the actual result plan might involve a SetOp or
-     * Unique node, not Agg or Group, but the cost estimates for Agg and Group
-     * should be close enough for our purposes here.
+     * group. XXX NOT TRUE: Note that the actual result plan might involve a
+     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
+     * and Group should be close enough for our purposes here.
      *
      * These path variables are dummies that just hold cost fields; we don't
      * make actual Paths for these steps.
@@ -1346,27 +1364,31 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
     cost_agg(&hashed_p, root, AGG_HASHED, NULL,
              numGroupCols, dNumGroups,
              NIL,
-             input_path->disabled_nodes,
-             input_path->startup_cost, input_path->total_cost,
-             input_path->rows, input_path->pathtarget->width);
+             lpath->disabled_nodes + rpath->disabled_nodes,
+             lpath->startup_cost + rpath->startup_cost,
+             lpath->total_cost + rpath->total_cost,
+             lpath->rows + rpath->rows,
+             lpath->pathtarget->width);

     /*
-     * Now for the sorted case.  Note that the input is *always* unsorted,
-     * since it was made by appending unrelated sub-relations together.
+     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
+     * unsorted, since it was made by appending unrelated sub-relations
+     * together.
      */
-    sorted_p.disabled_nodes = input_path->disabled_nodes;
-    sorted_p.startup_cost = input_path->startup_cost;
-    sorted_p.total_cost = input_path->total_cost;
+    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
+    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
+    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
     /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
     cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
               sorted_p.total_cost,
-              input_path->rows, input_path->pathtarget->width,
+              lpath->rows + rpath->rows,
+              lpath->pathtarget->width,
               0.0, work_mem, -1.0);
     cost_group(&sorted_p, root, numGroupCols, dNumGroups,
                NIL,
                sorted_p.disabled_nodes,
                sorted_p.startup_cost, sorted_p.total_cost,
-               input_path->rows);
+               lpath->rows + rpath->rows);

     /*
      * Now make the decision using the top-level tuple fraction.  First we
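The input-ordering rule in generate_nonunion_paths() is small enough to restate on its own: EXCEPT must keep its left input first, while INTERSECT may swap so that the input with fewer groups comes first. The sketch below is a hedged restatement of that rule only. SketchOp, SketchInput and order_inputs are hypothetical names, and the real code swaps the RelOptInfos, the tlists and the group estimates together.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the planner's data structures */
typedef enum
{
    SKETCH_INTERSECT,
    SKETCH_EXCEPT
} SketchOp;

typedef struct SketchInput
{
    const char *name;           /* label, for illustration only */
    double      ngroups;        /* estimated number of distinct groups */
} SketchInput;

/*
 * Put the smaller input first when that is allowed.  Returns true if the
 * two inputs were swapped.  EXCEPT is not commutative, so it never swaps.
 */
static bool
order_inputs(SketchOp op, SketchInput *left, SketchInput *right)
{
    assert(left != NULL && right != NULL);

    if (op != SKETCH_EXCEPT && left->ngroups > right->ngroups)
    {
        SketchInput tmp = *left;

        *left = *right;
        *right = tmp;
        return true;
    }
    return false;
}
```

After this step the group estimate handed to the SetOp is simply the left input's, which matches the patch's change of dNumGroups from Min(dLeftGroups, dRightGroups) to dLeftGroups in the INTERSECT branch.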
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fc97bf6ee2..e52e4b1d67 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3634,25 +3634,26 @@ create_windowagg_path(PlannerInfo *root,
  *      Creates a pathnode that represents computation of INTERSECT or EXCEPT
  *
  * 'rel' is the parent relation associated with the result
- * 'subpath' is the path representing the source of data
+ * 'leftpath' is the path representing the left-hand source of data
+ * 'rightpath' is the path representing the right-hand source of data
  * 'cmd' is the specific semantics (INTERSECT or EXCEPT, with/without ALL)
  * 'strategy' is the implementation strategy (sorted or hashed)
- * 'distinctList' is a list of SortGroupClause's representing the grouping
- * 'flagColIdx' is the column number where the flag column will be, if any
- * 'firstFlag' is the flag value for the first input relation when hashing;
- *        or -1 when sorting
- * 'numGroups' is the estimated number of distinct groups
+ * 'groupList' is a list of SortGroupClause's representing the grouping
+ * 'numGroups' is the estimated number of distinct groups in left-hand input
  * 'outputRows' is the estimated number of output rows
+ *
+ * leftpath and rightpath must produce the same columns.  Moreover, if
+ * strategy is SETOP_SORTED, leftpath and rightpath must both be sorted
+ * by all the grouping columns.
  */
 SetOpPath *
 create_setop_path(PlannerInfo *root,
                   RelOptInfo *rel,
-                  Path *subpath,
+                  Path *leftpath,
+                  Path *rightpath,
                   SetOpCmd cmd,
                   SetOpStrategy strategy,
-                  List *distinctList,
-                  AttrNumber flagColIdx,
-                  int firstFlag,
+                  List *groupList,
                   double numGroups,
                   double outputRows)
 {
@@ -3660,34 +3661,37 @@ create_setop_path(PlannerInfo *root,

     pathnode->path.pathtype = T_SetOp;
     pathnode->path.parent = rel;
-    /* SetOp doesn't project, so use source path's pathtarget */
-    pathnode->path.pathtarget = subpath->pathtarget;
+    pathnode->path.pathtarget = rel->reltarget;
     /* For now, assume we are above any joins, so no parameterization */
     pathnode->path.param_info = NULL;
     pathnode->path.parallel_aware = false;
     pathnode->path.parallel_safe = rel->consider_parallel &&
-        subpath->parallel_safe;
-    pathnode->path.parallel_workers = subpath->parallel_workers;
+        leftpath->parallel_safe && rightpath->parallel_safe;
+    pathnode->path.parallel_workers =
+        leftpath->parallel_workers + rightpath->parallel_workers;
     /* SetOp preserves the input sort order if in sort mode */
     pathnode->path.pathkeys =
-        (strategy == SETOP_SORTED) ? subpath->pathkeys : NIL;
+        (strategy == SETOP_SORTED) ? leftpath->pathkeys : NIL;

-    pathnode->subpath = subpath;
+    pathnode->leftpath = leftpath;
+    pathnode->rightpath = rightpath;
     pathnode->cmd = cmd;
     pathnode->strategy = strategy;
-    pathnode->distinctList = distinctList;
-    pathnode->flagColIdx = flagColIdx;
-    pathnode->firstFlag = firstFlag;
+    pathnode->groupList = groupList;
     pathnode->numGroups = numGroups;

     /*
      * Charge one cpu_operator_cost per comparison per input tuple. We assume
      * all columns get compared at most of the tuples.
+     *
+     * XXX all wrong for hashing
      */
-    pathnode->path.disabled_nodes = subpath->disabled_nodes;
-    pathnode->path.startup_cost = subpath->startup_cost;
-    pathnode->path.total_cost = subpath->total_cost +
-        cpu_operator_cost * subpath->rows * list_length(distinctList);
+    pathnode->path.disabled_nodes =
+        leftpath->disabled_nodes + rightpath->disabled_nodes;
+    pathnode->path.startup_cost =
+        leftpath->startup_cost + rightpath->startup_cost;
+    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
+        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
     pathnode->path.rows = outputRows;

     return pathnode;
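As a sanity check on the costing in create_setop_path(), the arithmetic can be replayed outside the planner. The sketch below assumes cpu_operator_cost is at its default of 0.0025 and borrows the Seq Scan numbers from the EXPLAIN output at the top of this message (total cost 14480.00 and 1000000 rows per side, one grouping column). SketchPath and setop_cost are hypothetical names, not planner APIs.

```c
#include <assert.h>

/* Hypothetical stand-in for the few Path fields the formula uses */
typedef struct SketchPath
{
    double      startup_cost;
    double      total_cost;
    double      rows;
} SketchPath;

/*
 * Combine the two input costs the way the patched create_setop_path()
 * does: sum both inputs, then charge one cpu_operator_cost per grouping
 * column per input tuple from either side.
 */
static SketchPath
setop_cost(const SketchPath *left, const SketchPath *right,
           int ncols, double cpu_operator_cost, double output_rows)
{
    SketchPath  result;

    assert(ncols >= 0);

    result.startup_cost = left->startup_cost + right->startup_cost;
    result.total_cost = left->total_cost + right->total_cost +
        cpu_operator_cost * (left->rows + right->rows) * ncols;
    result.rows = output_rows;
    return result;
}
```

With those inputs the total is 14480 + 14480 + 0.0025 * 2000000 * 1 = 33960, which agrees with the cost=0.00..33960.00 shown for the HashSetOp node in the first plan.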
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 79d5e96021..1590b64392 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2803,27 +2803,34 @@ typedef struct HashState
 /* ----------------
  *     SetOpState information
  *
- *        Even in "sorted" mode, SetOp nodes are more complex than a simple
- *        Unique, since we have to count how many duplicates to return.  But
- *        we also support hashing, so this is really more like a cut-down
- *        form of Agg.
+ *        SetOp nodes support either sorted or hashed de-duplication.
+ *        The sorted mode is a bit like MergeJoin, the hashed mode like Agg.
  * ----------------
  */
-/* this struct is private in nodeSetOp.c: */
-typedef struct SetOpStatePerGroupData *SetOpStatePerGroup;
+typedef struct SetOpStatePerInput
+{
+    TupleTableSlot *firstTupleSlot; /* first tuple of current group */
+    int64        numTuples;        /* number of tuples in current group */
+    TupleTableSlot *nextTupleSlot;    /* next input tuple, if already read */
+    bool        needGroup;        /* do we need to load a new group? */
+} SetOpStatePerInput;

 typedef struct SetOpState
 {
     PlanState    ps;                /* its first field is NodeTag */
-    ExprState  *eqfunction;        /* equality comparator */
-    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
-    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     bool        setop_done;        /* indicates completion of output scan */
-    long        numOutput;        /* number of dups left to output */
+    int64        numOutput;        /* number of dups left to output */
+    int            numCols;        /* number of grouping columns */
+
     /* these fields are used in SETOP_SORTED mode: */
-    SetOpStatePerGroup pergroup;    /* per-group working state */
-    HeapTuple    grp_firstTuple; /* copy of first tuple of current group */
+    SortSupport sortKeys;        /* per-grouping-field sort data */
+    SetOpStatePerInput leftInput;    /* current outer-relation input state */
+    SetOpStatePerInput rightInput;    /* current inner-relation input state */
+    bool        need_init;        /* must we still read the first tuples? */
+
     /* these fields are used in SETOP_HASHED mode: */
+    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
+    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     TupleHashTable hashtable;    /* hash table with one entry per group */
     MemoryContext tableContext; /* memory context containing hash table */
     bool        table_filled;    /* hash table filled yet? */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0759e00e96..58748d2ca6 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2342,13 +2342,12 @@ typedef struct WindowAggPath
 typedef struct SetOpPath
 {
     Path        path;
-    Path       *subpath;        /* path representing input source */
+    Path       *leftpath;        /* paths representing input sources */
+    Path       *rightpath;
     SetOpCmd    cmd;            /* what to do, see nodes.h */
     SetOpStrategy strategy;        /* how to do it, see nodes.h */
-    List       *distinctList;    /* SortGroupClauses identifying target cols */
-    AttrNumber    flagColIdx;        /* where is the flag column, if any */
-    int            firstFlag;        /* flag value for first input relation */
-    Cardinality numGroups;        /* estimated number of groups in input */
+    List       *groupList;        /* SortGroupClauses identifying target cols */
+    Cardinality numGroups;        /* estimated number of groups in left input */
 } SetOpPath;

 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 52f29bcdb6..4633121689 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1225,23 +1225,20 @@ typedef struct SetOp
     /* how to do it, see nodes.h */
     SetOpStrategy strategy;

-    /* number of columns to check for duplicate-ness */
+    /* number of columns to compare */
     int            numCols;

     /* their indexes in the target list */
-    AttrNumber *dupColIdx pg_node_attr(array_size(numCols));
+    AttrNumber *cmpColIdx pg_node_attr(array_size(numCols));

-    /* equality operators to compare with */
-    Oid           *dupOperators pg_node_attr(array_size(numCols));
-    Oid           *dupCollations pg_node_attr(array_size(numCols));
-
-    /* where is the flag column, if any */
-    AttrNumber    flagColIdx;
+    /* comparison operators (either equality operators or sort operators) */
+    Oid           *cmpOperators pg_node_attr(array_size(numCols));
+    Oid           *cmpCollations pg_node_attr(array_size(numCols));

-    /* flag value for first input relation */
-    int            firstFlag;
+    /* nulls-first flags if sorting, otherwise not interesting */
+    bool       *cmpNullsFirst pg_node_attr(array_size(numCols));

-    /* estimated number of groups in input */
+    /* estimated number of groups in left input */
     long        numGroups;
 } SetOp;

diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 1035e6560c..5a6d0350c1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -260,12 +260,11 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
                                             bool topwindow);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
                                     RelOptInfo *rel,
-                                    Path *subpath,
+                                    Path *leftpath,
+                                    Path *rightpath,
                                     SetOpCmd cmd,
                                     SetOpStrategy strategy,
-                                    List *distinctList,
-                                    AttrNumber flagColIdx,
-                                    int firstFlag,
+                                    List *groupList,
                                     double numGroups,
                                     double outputRows);
 extern RecursiveUnionPath *create_recursiveunion_path(PlannerInfo *root,
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 857ba1c40c..b4cd204de7 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1231,22 +1231,18 @@ select count(*) from
     select * from onek i2 where i2.unique1 = o.unique2
   ) ss
 where o.ten = 1;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
                Filter: (ten = 1)
-         ->  Subquery Scan on ss
-               ->  HashSetOp Except
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Scan using onek_unique1 on onek i1
-                                       Index Cond: (unique1 = o.unique1)
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Index Scan using onek_unique1 on onek i2
-                                       Index Cond: (unique1 = o.unique2)
-(13 rows)
+         ->  HashSetOp Except
+               ->  Index Scan using onek_unique1 on onek i1
+                     Index Cond: (unique1 = o.unique1)
+               ->  Index Scan using onek_unique1 on onek i2
+                     Index Cond: (unique1 = o.unique2)
+(9 rows)

 select count(*) from
   onek o cross join lateral (
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index c73631a9a1..fbf24490e9 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -370,17 +370,13 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                     QUERY PLAN
-------------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  HashSetOp Intersect
-               ->  Append
-                     ->  Subquery Scan on "*SELECT* 2"
-                           ->  Seq Scan on tenk1
-                     ->  Subquery Scan on "*SELECT* 1"
-                           ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+   ->  HashSetOp Intersect
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(4 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -391,16 +387,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Index Only Scan using tenk1_unique1 on tenk1
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                     Filter: (unique2 <> 10)
-(7 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -434,19 +427,17 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                        QUERY PLAN
-------------------------------------------------------------------------------------------
+                               QUERY PLAN
+------------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  SetOp Intersect
-               ->  Sort
-                     Sort Key: "*SELECT* 2".fivethous
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Seq Scan on tenk1
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(10 rows)
+   ->  SetOp Intersect
+         ->  Sort
+               Sort Key: tenk1.fivethous
+               ->  Seq Scan on tenk1
+         ->  Sort
+               Sort Key: tenk1_1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(8 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -457,18 +448,17 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".unique1
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Index Only Scan using tenk1_unique1 on tenk1
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                           Filter: (unique2 <> 10)
-(9 rows)
+         Sort Key: tenk1.unique1
+         ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Sort
+         Sort Key: tenk1_1.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+               Filter: (unique2 <> 10)
+(8 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -528,15 +518,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -546,15 +533,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -606,17 +590,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -626,17 +609,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -669,17 +651,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -689,17 +670,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -777,17 +757,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -797,17 +776,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -970,15 +948,12 @@ set enable_sort = false;
 -- no means to disable Unique.
 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union all select from generate_series(1,3);
 --
@@ -1015,15 +990,12 @@ select from generate_series(1,5) union select from generate_series(1,3);

 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  SetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union select from generate_series(1,3);
 --
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cda21239cb..b90e2e259b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2612,6 +2612,8 @@ SetOpCmd
 SetOpPath
 SetOpState
 SetOpStatePerGroup
+SetOpStatePerGroupData
+SetOpStatePerInput
 SetOpStrategy
 SetOperation
 SetOperationStmt
--
2.43.5

From d7d5963568f43958c5055aade005d3ed722880c5 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 18 Dec 2024 21:16:18 -0500
Subject: [PATCH v4 2/5] Remove some dead code in prepunion.c.

Now that we no longer use flag columns for SetOp, remove the
code that supported that.

This also means that there are no resjunk output columns anymore
in a set-operations plan tree.  So we can get rid of some code
that supported that, too.

It would probably be possible now to do what's speculated about in
the comments, and get rid of this code's use of transiently-created
targetlists in favor of dealing only in pathtargets (save for
generating root->processed_tlist at the very end).  However, that
would involve a good deal of notational churn and the benefit
would only be to save a little planning time.  So I'll leave that
project for some other day.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
 src/backend/optimizer/prep/prepunion.c | 96 ++++----------------------
 1 file changed, 14 insertions(+), 82 deletions(-)
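
For readers who want the idea behind dropping the flag column in
miniature, here is a toy sketch.  It is not code from this patchset:
toy_setop, GroupCounts and MAX_KEY are hypothetical names, and plain
small integers stand in for tuples.  It assumes the left input creates
the groups and the right input only bumps counters of groups that
already exist, which is what the "groups in left input" comments hint
at.  Because the two inputs are read in separate passes, no per-row
marker is needed to tell where a row came from.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KEY 16				/* toy domain: keys are 0 .. MAX_KEY-1 */

/* Per-group counters, one pair per distinct key (hypothetical struct). */
typedef struct GroupCounts
{
	long		numLeft;		/* rows seen in the left input */
	long		numRight;		/* rows seen in the right input */
} GroupCounts;

/*
 * Toy INTERSECT / EXCEPT (the non-ALL forms) over two separately-read
 * inputs.  Writes the distinct result keys to out[] in ascending order
 * and returns how many were written.  out[] must have room for MAX_KEY
 * entries.
 */
size_t
toy_setop(const int *left, size_t nleft,
		  const int *right, size_t nright,
		  int is_except, int *out)
{
	GroupCounts groups[MAX_KEY] = {{0, 0}};
	size_t		nout = 0;

	/* Pass 1: every left row creates or updates its group. */
	for (size_t i = 0; i < nleft; i++)
	{
		assert(left[i] >= 0 && left[i] < MAX_KEY);
		groups[left[i]].numLeft++;
	}

	/* Pass 2: right rows only count against groups the left side made. */
	for (size_t i = 0; i < nright; i++)
	{
		assert(right[i] >= 0 && right[i] < MAX_KEY);
		if (groups[right[i]].numLeft > 0)
			groups[right[i]].numRight++;
	}

	/* Emit one row per qualifying group. */
	for (int key = 0; key < MAX_KEY; key++)
	{
		if (groups[key].numLeft == 0)
			continue;			/* no such group */
		if (is_except ? (groups[key].numRight == 0)
			: (groups[key].numRight > 0))
			out[nout++] = key;
	}
	return nout;
}
```

With left = {1, 1, 2, 3} and right = {2, 3, 3, 4}, the intersect case
yields keys 2 and 3, and the except case yields key 1.  Right-side key 4
never gets a group at all, since it cannot contribute to either result.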

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1c43370005..8d70a4b61a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -40,8 +40,7 @@

 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
                                           List *colTypes, List *colCollations,
-                                          bool junkOK,
-                                          int flag, List *refnames_tlist,
+                                          List *refnames_tlist,
                                           List **pTargetList,
                                           bool *istrivial_tlist);
 static RelOptInfo *generate_recursion_path(SetOperationStmt *setOp,
@@ -69,14 +68,12 @@ static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
-                                  int flag,
                                   Index varno,
                                   bool hack_constants,
                                   List *input_tlist,
                                   List *refnames_tlist,
                                   bool *trivial_tlist);
 static List *generate_append_tlist(List *colTypes, List *colCollations,
-                                   bool flag,
                                    List *input_tlists,
                                    List *refnames_tlist);
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
@@ -160,12 +157,10 @@ plan_set_operations(PlannerInfo *root)
         /*
          * Recurse on setOperations tree to generate paths for set ops. The
          * final output paths should have just the column types shown as the
-         * output from the top-level node, plus possibly resjunk working
-         * columns (we can rely on upper-level nodes to deal with that).
+         * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
                                            topop->colTypes, topop->colCollations,
-                                           true, -1,
                                            leftmostQuery->targetList,
                                            &top_tlist,
                                            &trivial_tlist);
@@ -208,8 +203,6 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * junkOK: if true, child resjunk columns may be left in the result
- * flag: if >= 0, add a resjunk output column indicating value of flag
  * refnames_tlist: targetlist to take column names from
  *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
@@ -220,7 +213,8 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
- * Pending a redesign of how that works, this is the easy way out.
+ * XXX Now that there are no flag columns and hence no resjunk columns, we
+ * could probably refactor this file to deal only in pathtargets.
  *
  * We don't have to care about typmods here: the only allowed difference
  * between set-op input and output typmods is input is a specific typmod
@@ -229,8 +223,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
                        List *colTypes, List *colCollations,
-                       bool junkOK,
-                       int flag, List *refnames_tlist,
+                       List *refnames_tlist,
                        List **pTargetList,
                        bool *istrivial_tlist)
 {
@@ -279,7 +272,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,

         /* Figure out the appropriate target list for this subquery. */
         tlist = generate_setop_tlist(colTypes, colCollations,
-                                     flag,
                                      rtr->rtindex,
                                      true,
                                      subroot->processed_tlist,
@@ -318,16 +310,14 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * generate_append_tlist() or generate_setop_tlist(), this will work.
          * We just tell generate_setop_tlist() to use varno 0.
          */
-        if (flag >= 0 ||
-            !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
-            !tlist_same_collations(*pTargetList, colCollations, junkOK))
+        if (!tlist_same_datatypes(*pTargetList, colTypes, false) ||
+            !tlist_same_collations(*pTargetList, colCollations, false))
         {
             PathTarget *target;
             bool        trivial_tlist;
             ListCell   *lc;

             *pTargetList = generate_setop_tlist(colTypes, colCollations,
-                                                flag,
                                                 0,
                                                 false,
                                                 *pTargetList,
@@ -411,7 +401,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      */
     lrel = recurse_set_operations(setOp->larg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -423,7 +412,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -436,7 +424,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /*
      * Generate tlist for RecursiveUnion path node --- same as in Append cases
      */
-    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations, false,
+    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations,
                                   list_make2(lpath_tlist, rpath_tlist),
                                   refnames_tlist);

@@ -736,7 +724,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, false,
+    tlist = generate_append_tlist(op->colTypes, op->colCollations,
                                   tlist_list, refnames_tlist);
     *pTargetList = tlist;

@@ -1048,7 +1036,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,7 +1047,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1109,7 +1095,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
                                  0, false, lpath_tlist, refnames_tlist,
                                  &result_trivial_tlist);

@@ -1258,18 +1244,10 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
-         *
-         * Note we disallow any resjunk columns in child results.  This is
-         * necessary since the Append node that implements the union won't do
-         * any projection, and upper levels will get confused if some of our
-         * output tuples have junk and some don't.  This case only arises when
-         * we have an EXCEPT or INTERSECT as child, else there won't be
-         * resjunk anyway.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
-                                                        false, -1,
                                                         refnames_tlist,
                                                         &child_tlist,
                                                         &trivial_tlist));
@@ -1412,7 +1390,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: -1 if no flag column needed, 0 or 1 to create a const flag column
  * varno: varno to use in generated Vars
  * hack_constants: true to copy up constants (see comments in code)
  * input_tlist: targetlist of this node's input node
@@ -1421,7 +1398,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  */
 static List *
 generate_setop_tlist(List *colTypes, List *colCollations,
-                     int flag,
                      Index varno,
                      bool hack_constants,
                      List *input_tlist,
@@ -1520,7 +1496,7 @@ generate_setop_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1529,25 +1505,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag >= 0)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is the given constant */
-        expr = (Node *) makeConst(INT4OID,
-                                  -1,
-                                  InvalidOid,
-                                  sizeof(int32),
-                                  Int32GetDatum(flag),
-                                  false,
-                                  true);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-        *trivial_tlist = false; /* the extra entry makes it not trivial */
-    }
-
     return tlist;
 }

@@ -1556,7 +1513,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: true to create a flag column copied up from subplans
  * input_tlists: list of tlists for sub-plans of the Append
  * refnames_tlist: targetlist to take column names from
  *
@@ -1570,7 +1526,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  */
 static List *
 generate_append_tlist(List *colTypes, List *colCollations,
-                      bool flag,
                       List *input_tlists,
                       List *refnames_tlist)
 {
@@ -1604,8 +1559,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
         {
             TargetEntry *subtle = (TargetEntry *) lfirst(subtlistl);

-            if (subtle->resjunk)
-                continue;
+            Assert(!subtle->resjunk);
             Assert(curColType != NULL);
             if (exprType((Node *) subtle->expr) == lfirst_oid(curColType))
             {
@@ -1654,7 +1608,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1663,23 +1617,6 @@ generate_append_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is shown as copied up from subplan */
-        expr = (Node *) makeVar(0,
-                                resno,
-                                INT4OID,
-                                -1,
-                                InvalidOid,
-                                0);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-    }
-
     pfree(colTypmods);

     return tlist;
@@ -1709,12 +1646,7 @@ generate_setop_grouplist(SetOperationStmt *op, List *targetlist)
         TargetEntry *tle = (TargetEntry *) lfirst(lt);
         SortGroupClause *sgc;

-        if (tle->resjunk)
-        {
-            /* resjunk columns should not have sortgrouprefs */
-            Assert(tle->ressortgroupref == 0);
-            continue;            /* ignore resjunk columns */
-        }
+        Assert(!tle->resjunk);

         /* non-resjunk columns should have sortgroupref = resno */
         Assert(tle->ressortgroupref == tle->resno);
--
2.43.5

From 9817f20276d3c4a719d20582155d55c2fafa103a Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 18 Dec 2024 21:20:10 -0500
Subject: [PATCH v4 3/5] Get rid of choose_hashed_setop().

This logic is a hangover from days when we could only generate
one plan per setop node.  Now, it's better to generate multiple
Paths and let add_path choose the cheapest.  The main advantage
is not having to keep choose_hashed_setop in perfect sync with
cost estimates made elsewhere --- and indeed, it looks like it
was out of sync already.  Not to mention that it doesn't account
for presorted input, which doesn't matter too much right now
but will soon.

This does require fixing create_setop_path to generate good cost
estimates.  That was another area where we'd not maintained things
well, because it wasn't distinguishing sorted from hashed costs at
all.  With the formulas I chose, the total cost attributed to the node
is actually the same for sorted and hashed modes --- but the startup
cost is wildly different, because hashed mode does basically all the
work before returning anything.

I also modified the existing logic that disables hashed setops
so that it works by incrementing path->disabled_nodes rather
than not generating the path at all.  That seems more in keeping
with the methods established by commit e22253467.  (But should
we put that logic in create_setop_path, rather than its caller?)

This results in two changes in plans that are visible in the
regression tests.  The one in union.sql seems fine: it's now
preferring sorted over hashed mode for a zero-column INTERSECT,
since no sort is really required.  The one in subselect.sql
is trying to test nodeSetOp's rescan logic, so I extended it
to test both sorted and hashed modes.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
 src/backend/optimizer/prep/prepunion.c  | 226 +++++++++---------------
 src/backend/optimizer/util/pathnode.c   |  50 +++++-
 src/test/regress/expected/subselect.out |  48 ++++-
 src/test/regress/expected/union.out     |   2 +-
 src/test/regress/sql/subselect.sql      |  32 +++-
 5 files changed, 200 insertions(+), 158 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 8d70a4b61a..23bb335354 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -63,10 +63,6 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **tlist_list,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
-static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *lpath, Path *rpath,
-                                double dNumGroups, double dNumOutputRows,
-                                const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
                                   Index varno,
                                   bool hack_constants,
@@ -1025,7 +1021,8 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
-    bool        use_hash;
+    bool        can_sort;
+    bool        can_hash;
     SetOpCmd    cmd;

     /*
@@ -1130,15 +1127,76 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }
+    result_rel->rows = dNumOutputRows;
+
+    /* Select the SetOpCmd type */
+    switch (op->op)
+    {
+        case SETOP_INTERSECT:
+            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
+            break;
+        case SETOP_EXCEPT:
+            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
+            break;
+        default:
+            elog(ERROR, "unrecognized set op: %d", (int) op->op);
+            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
+            break;
+    }
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));

     /*
-     * Decide whether to hash or sort, and add sort nodes if needed.
+     * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
-    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
-                                   dNumGroups, dNumOutputRows,
-                                   (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");
+    if (can_hash)
+    {
+        Size        hash_mem_limit = get_hash_memory_limit();
+        Size        hashentrysize;
+
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_HASHED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);

-    if (groupList && !use_hash)
+        /*
+         * Mark the path as disabled if enable_hashagg is off.  While this
+         * isn't exactly a HashAgg node, it seems close enough to justify
+         * letting that switch control it.
+         */
+        if (!enable_hashagg)
+            path->disabled_nodes++;
+
+        /*
+         * Also disable if it doesn't look like the hashtable will fit into
+         * hash_mem.
+         */
+        hashentrysize = MAXALIGN(lpath->pathtarget->width) +
+            MAXALIGN(SizeofMinimalTupleHeader);
+        if (hashentrysize * dNumGroups > hash_mem_limit)
+            path->disabled_nodes++;
+
+        add_path(result_rel, path);
+    }
+
+    /*
+     * If we can sort, sort the inputs if needed before adding SetOp.
+     */
+    if (can_sort)
     {
         List       *pathkeys;

@@ -1160,36 +1218,19 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                               rpath,
                                               pathkeys,
                                               -1.0);
-    }

-    /*
-     * Finally, add a SetOp path node to generate the correct output.
-     */
-    switch (op->op)
-    {
-        case SETOP_INTERSECT:
-            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
-            break;
-        case SETOP_EXCEPT:
-            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
-            break;
-        default:
-            elog(ERROR, "unrecognized set op: %d", (int) op->op);
-            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
-            break;
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_SORTED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);
+        add_path(result_rel, path);
     }
-    path = (Path *) create_setop_path(root,
-                                      result_rel,
-                                      lpath,
-                                      rpath,
-                                      cmd,
-                                      use_hash ? SETOP_HASHED : SETOP_SORTED,
-                                      groupList,
-                                      dNumGroups,
-                                      dNumOutputRows);
-
-    result_rel->rows = path->rows;
-    add_path(result_rel, path);
+
     return result_rel;
 }

@@ -1276,115 +1317,6 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)
     set_cheapest(rel);
 }

-/*
- * choose_hashed_setop - should we use hashing for a set operation?
- *
- * XXX probably this should go away: just make both paths and let
- * add_path sort it out.
- */
-static bool
-choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *lpath, Path *rpath,
-                    double dNumGroups, double dNumOutputRows,
-                    const char *construct)
-{
-    int            numGroupCols = list_length(groupClauses);
-    Size        hash_mem_limit = get_hash_memory_limit();
-    bool        can_sort;
-    bool        can_hash;
-    Size        hashentrysize;
-    Path        hashed_p;
-    Path        sorted_p;
-    double        tuple_fraction;
-
-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupClauses);
-    can_hash = grouping_is_hashable(groupClauses);
-    if (can_hash && can_sort)
-    {
-        /* we have a meaningful choice to make, continue ... */
-    }
-    else if (can_hash)
-        return true;
-    else if (can_sort)
-        return false;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is UNION, INTERSECT, or EXCEPT */
-                 errmsg("could not implement %s", construct),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
-    /* Prefer sorting when enable_hashagg is off */
-    if (!enable_hashagg)
-        return false;
-
-    /*
-     * Don't do it if it doesn't look like the hashtable will fit into
-     * hash_mem.
-     */
-    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
-
-    if (hashentrysize * dNumGroups > hash_mem_limit)
-        return false;
-
-    /*
-     * See if the estimated cost is no more than doing it the other way.
-     *
-     * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group. XXX NOT TRUE: Note that the actual result plan might involve a
-     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
-     * and Group should be close enough for our purposes here.
-     *
-     * These path variables are dummies that just hold cost fields; we don't
-     * make actual Paths for these steps.
-     */
-    cost_agg(&hashed_p, root, AGG_HASHED, NULL,
-             numGroupCols, dNumGroups,
-             NIL,
-             lpath->disabled_nodes + rpath->disabled_nodes,
-             lpath->startup_cost + rpath->startup_cost,
-             lpath->total_cost + rpath->total_cost,
-             lpath->rows + rpath->rows,
-             lpath->pathtarget->width);
-
-    /*
-     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
-     * unsorted, since it was made by appending unrelated sub-relations
-     * together.
-     */
-    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
-    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
-    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
-    /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-    cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
-              sorted_p.total_cost,
-              lpath->rows + rpath->rows,
-              lpath->pathtarget->width,
-              0.0, work_mem, -1.0);
-    cost_group(&sorted_p, root, numGroupCols, dNumGroups,
-               NIL,
-               sorted_p.disabled_nodes,
-               sorted_p.startup_cost, sorted_p.total_cost,
-               lpath->rows + rpath->rows);
-
-    /*
-     * Now make the decision using the top-level tuple fraction.  First we
-     * have to convert an absolute count (LIMIT) into fractional form.
-     */
-    tuple_fraction = root->tuple_fraction;
-    if (tuple_fraction >= 1.0)
-        tuple_fraction /= dNumOutputRows;
-
-    if (compare_fractional_path_costs(&hashed_p, &sorted_p,
-                                      tuple_fraction) < 0)
-    {
-        /* Hashed is cheaper, so use it */
-        return true;
-    }
-    return false;
-}
-
 /*
  * Generate targetlist for a set-operation plan node
  *
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e52e4b1d67..d7422f2dd8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3681,17 +3681,51 @@ create_setop_path(PlannerInfo *root,
     pathnode->numGroups = numGroups;

     /*
-     * Charge one cpu_operator_cost per comparison per input tuple. We assume
-     * all columns get compared at most of the tuples.
-     *
-     * XXX all wrong for hashing
+     * Compute cost estimates.  As things stand, we end up with the same total
+     * cost in this node for sort and hash methods, but different startup
+     * costs.  This could be refined perhaps, but it'll do for now.
      */
     pathnode->path.disabled_nodes =
         leftpath->disabled_nodes + rightpath->disabled_nodes;
-    pathnode->path.startup_cost =
-        leftpath->startup_cost + rightpath->startup_cost;
-    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
-        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+    if (strategy == SETOP_SORTED)
+    {
+        /*
+         * In sorted mode, we can emit output incrementally.  Charge one
+         * cpu_operator_cost per comparison per input tuple.  Like cost_group,
+         * we assume all columns get compared at most of the tuples.
+         */
+        pathnode->path.startup_cost =
+            leftpath->startup_cost + rightpath->startup_cost;
+        pathnode->path.total_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
+    else
+    {
+        /*
+         * In hashed mode, we must read all the input before we can emit
+         * anything.  Also charge comparison costs to represent the cost of
+         * hash table lookups.
+         */
+        pathnode->path.startup_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+        pathnode->path.total_cost = pathnode->path.startup_cost;
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index b4cd204de7..ebc545e246 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1221,8 +1221,10 @@ where o.ten = 0;
 (1 row)

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -1256,6 +1258,50 @@ where o.ten = 1;
    100
 (1 row)

+rollback;
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+                                                                                                     QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  SetOp Except
+               ->  Sort
+                     Sort Key: i1.unique1, i1.unique2, i1.two, i1.four, i1.ten, i1.twenty, i1.hundred, i1.thousand, i1.twothousand, i1.fivethous, i1.tenthous, i1.odd, i1.even, i1.stringu1, i1.stringu2, i1.string4
+                     ->  Index Scan using onek_unique1 on onek i1
+                           Index Cond: (unique1 = o.unique1)
+               ->  Sort
+                     Sort Key: i2.unique1, i2.unique2, i2.two, i2.four, i2.ten, i2.twenty, i2.hundred, i2.thousand, i2.twothousand, i2.fivethous, i2.tenthous, i2.odd, i2.even, i2.stringu1, i2.stringu2, i2.string4
+                     ->  Index Scan using onek_unique1 on onek i2
+                           Index Cond: (unique1 = o.unique2)
+(13 rows)
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+ count
+-------
+   100
+(1 row)
+
+rollback;
 --
 -- Test rescan of a RecursiveUnion node
 --
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index fbf24490e9..9474b01510 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -950,7 +950,7 @@ explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
                         QUERY PLAN
 ----------------------------------------------------------
- HashSetOp Intersect
+ SetOp Intersect
    ->  Function Scan on generate_series
    ->  Function Scan on generate_series generate_series_1
 (3 rows)
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index a6ab42454f..6ed3636a9e 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -638,8 +638,11 @@ select sum(ss.tst::int) from
 where o.ten = 0;

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
+
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -657,6 +660,33 @@ select count(*) from
   ) ss
 where o.ten = 1;

+rollback;
+
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+rollback;
+
 --
 -- Test rescan of a RecursiveUnion node
 --
--
2.43.5
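
For readers who want to play with the numbers, the cost arithmetic that the
3/5 patch puts into create_setop_path can be approximated outside the planner.
What follows is only a sketch under stated assumptions: SketchPath,
SketchStrategy and sketch_setop_cost are hypothetical names invented for this
illustration, not PostgreSQL APIs, and cpu_operator_cost is passed in as a
plain parameter rather than read from a GUC.  It mirrors the two branches in
the diff: both modes end up with the same total cost, but hashed mode charges
all of the input and comparison cost as startup cost.

```c
#include <assert.h>

/* Hypothetical stand-in for the few Path fields the formulas use. */
typedef struct SketchPath
{
    double      startup_cost;
    double      total_cost;
    double      rows;
} SketchPath;

/* Hypothetical stand-in for SETOP_SORTED / SETOP_HASHED. */
typedef enum SketchStrategy
{
    SKETCH_SETOP_SORTED,
    SKETCH_SETOP_HASHED
} SketchStrategy;

/*
 * Approximate the SetOp cost formulas from the patch.  num_group_cols plays
 * the role of list_length(groupList); output_rows plays the role of
 * outputRows.
 */
static SketchPath
sketch_setop_cost(const SketchPath *left, const SketchPath *right,
                  SketchStrategy strategy, int num_group_cols,
                  double output_rows, double cpu_operator_cost)
{
    SketchPath  result;
    double      input_cost;
    double      compare_cost;

    assert(num_group_cols >= 0);

    /* Cost of fully reading both inputs. */
    input_cost = left->total_cost + right->total_cost;

    /* One cpu_operator_cost per grouping column per input tuple. */
    compare_cost = cpu_operator_cost *
        (left->rows + right->rows) * num_group_cols;

    if (strategy == SKETCH_SETOP_SORTED)
    {
        /* Sorted mode can emit output incrementally. */
        result.startup_cost = left->startup_cost + right->startup_cost;
        result.total_cost = input_cost + compare_cost;
    }
    else
    {
        /* Hashed mode must read all input before emitting anything. */
        result.startup_cost = input_cost + compare_cost;
        result.total_cost = result.startup_cost;
    }

    /* Small per-output-tuple charge, the same in both modes. */
    result.total_cost += cpu_operator_cost * output_rows;
    result.rows = output_rows;

    return result;
}
```

Feeding it two inputs with startup cost 0, total cost 14480 and 1000000 rows
each, one grouping column, 101 output rows and the default cpu_operator_cost
of 0.0025 gives a startup cost of 0 for sorted mode and roughly 33960 for
hashed mode, with identical totals.  Note that the sorted-mode figure excludes
any Sort nodes the caller has to put underneath; those are costed separately.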

From a0d5e0aaf83bf4057d2fbf74ef6a72625df4748a Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 18 Dec 2024 21:24:01 -0500
Subject: [PATCH v4 4/5] Fix bogus decisions about whether we want pre-sorted
 child paths.

As things stand, recurse_set_operations unconditionally passes down
root->parse->setOperations to subquery_planner, which then applies
set_operation_ordered_results_useful to it to decide whether we care
about producing pre-sorted paths for the subquery.

This is flat wrong.

The main problem is that root->parse->setOperations is not necessarily
the relevant ancestor node.  In the test case added here, we have
UNION ALL atop a child UNION, and the existing logic fails to
recognize that it could use pre-sorted paths for the child UNION step.

A lesser problem is that set_operation_ordered_results_useful is
inadequate anyway: it will falsely claim that we could use pre-sorted
paths for a recursive UNION.

A better way to do this is for each caller of recurse_set_operations
to pass down the relevant parent node.  Then we can further specify
that the caller should pass NULL if it has no use for pre-sorted
paths, and then we can get rid of set_operation_ordered_results_useful
altogether, eliminating the need to keep it in sync with the code that
is actually doing the work.

Note: this patch makes generate_nonunion_paths pass down the non-UNION
node as parent.  generate_nonunion_paths doesn't yet do anything with
the pre-sorted paths that will result, so no regression test plans
change.  If we were going to leave it like this, we'd pass down NULL
instead.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
 src/backend/optimizer/plan/planner.c   |  7 ++-
 src/backend/optimizer/prep/prepunion.c | 65 +++++++++++++-------------
 src/include/optimizer/prep.h           |  1 -
 src/test/regress/expected/union.out    | 14 ++++++
 src/test/regress/sql/union.sql         |  4 ++
 5 files changed, 54 insertions(+), 37 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f3856c519f..7468961b01 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -616,7 +616,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Basically, this routine does the stuff that should only be done once
  * per Query object.  It then calls grouping_planner.  At one time,
@@ -1350,7 +1350,7 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Returns nothing; the useful output is in the Paths we attach to the
  * (UPPERREL_FINAL, NULL) upperrel in *root.  In addition,
@@ -3467,8 +3467,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
                                       tlist);

     /* setting setop_pathkeys might be useful to the union planner */
-    if (qp_extra->setop != NULL &&
-        set_operation_ordered_results_useful(qp_extra->setop))
+    if (qp_extra->setop != NULL)
     {
         List       *groupClauses;
         bool        sortable;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 23bb335354..e92a088334 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -39,6 +39,7 @@


 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
+                                          SetOperationStmt *parentOp,
                                           List *colTypes, List *colCollations,
                                           List *refnames_tlist,
                                           List **pTargetList,
@@ -156,6 +157,7 @@ plan_set_operations(PlannerInfo *root)
          * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
+                                           NULL,    /* no parent */
                                            topop->colTypes, topop->colCollations,
                                            leftmostQuery->targetList,
                                            &top_tlist,
@@ -168,44 +170,31 @@ plan_set_operations(PlannerInfo *root)
     return setop_rel;
 }

-/*
- * set_operation_ordered_results_useful
- *        Return true if the given SetOperationStmt can be executed by utilizing
- *        paths that provide sorted input according to the setop's targetlist.
- *        Returns false when sorted paths are not any more useful than unsorted
- *        ones.
- */
-bool
-set_operation_ordered_results_useful(SetOperationStmt *setop)
-{
-    /*
-     * Paths sorted by the targetlist are useful for UNION as we can opt to
-     * MergeAppend the sorted paths then Unique them.  Ordered paths are no
-     * more useful than unordered ones for UNION ALL.
-     */
-    if (!setop->all && setop->op == SETOP_UNION)
-        return true;
-
-    /*
-     * EXCEPT / EXCEPT ALL / INTERSECT / INTERSECT ALL cannot yet utilize
-     * correctly sorted input paths.
-     */
-    return false;
-}
-
 /*
  * recurse_set_operations
  *      Recursively handle one step in a tree of set operations
  *
+ * setOp: current step (could be a SetOperationStmt or a leaf RangeTblRef)
+ * parentOp: parent step, or NULL if none (but see below)
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
  * refnames_tlist: targetlist to take column names from
  *
+ * parentOp should be passed as NULL unless that step is interested in
+ * getting sorted output from this step.  ("Sorted" means "sorted according
+ * to the default btree opclasses of the result column datatypes".)
+ *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
  * *pTargetList: receives the fully-fledged tlist for the subtree's top plan
  * *istrivial_tlist: true if, and only if, datatypes between parent and child
  * match.
  *
+ * If setOp is a leaf node, this function plans the sub-query but does
+ * not populate the pathlist of the returned RelOptInfo.  The caller will
+ * generate SubqueryScan paths using useful path(s) of the subquery (see
+ * build_setop_child_paths).  But this function does build the paths for
+ * set-operation nodes.
+ *
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
@@ -218,6 +207,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  */
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
+                       SetOperationStmt *parentOp,
                        List *colTypes, List *colCollations,
                        List *refnames_tlist,
                        List **pTargetList,
@@ -234,7 +224,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
     {
         RangeTblRef *rtr = (RangeTblRef *) setOp;
         RangeTblEntry *rte = root->simple_rte_array[rtr->rtindex];
-        SetOperationStmt *setops;
         Query       *subquery = rte->subquery;
         PlannerInfo *subroot;
         List       *tlist;
@@ -249,15 +238,13 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
         Assert(root->plan_params == NIL);

         /*
-         * Pass the set operation details to the subquery_planner to have it
-         * consider generating Paths correctly ordered for the set operation.
+         * Generate a subroot and Paths for the subquery.  If we have a
+         * parentOp, pass that down to encourage subquery_planner to consider
+         * suitably-sorted Paths.
          */
-        setops = castNode(SetOperationStmt, root->parse->setOperations);
-
-        /* Generate a subroot and Paths for the subquery */
         subroot = rel->subroot = subquery_planner(root->glob, subquery, root,
                                                   false, root->tuple_fraction,
-                                                  setops);
+                                                  parentOp);

         /*
          * It should not be possible for the primitive query to contain any
@@ -396,6 +383,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      * separately without any intention of combining them into one Append.
      */
     lrel = recurse_set_operations(setOp->larg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -407,6 +395,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /* The right path will want to look at the left one ... */
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -479,6 +468,10 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
  * build_setop_child_paths
  *        Build paths for the set op child relation denoted by 'rel'.
  *
+ * 'rel' is an RTE_SUBQUERY relation.  We have already generated paths within
+ * the subquery's subroot; the task here is to create SubqueryScan paths for
+ * 'rel', representing scans of the useful subquery paths.
+ *
  * interesting_pathkeys: if not NIL, also include paths that suit these
  * pathkeys, sorting any unsorted paths as required.
  * *pNumGroups: if not NULL, we estimate the number of distinct groups
@@ -1032,6 +1025,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -1043,6 +1037,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -1285,8 +1280,14 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
+         *
+         * If top_union isn't a UNION ALL, then we are interested in sorted
+         * output from the child, so pass top_union as parentOp.  Note that
+         * this isn't necessarily the child node's immediate SetOperationStmt
+         * parent, but that's fine: it's the effective parent.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
+                                                        top_union->all ? NULL : top_union,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
                                                         refnames_tlist,
diff --git a/src/include/optimizer/prep.h b/src/include/optimizer/prep.h
index a52dec285d..74bb6c1d3b 100644
--- a/src/include/optimizer/prep.h
+++ b/src/include/optimizer/prep.h
@@ -53,6 +53,5 @@ extern void preprocess_aggrefs(PlannerInfo *root, Node *clause);
  * prototypes for prepunion.c
  */
 extern RelOptInfo *plan_set_operations(PlannerInfo *root);
-extern bool set_operation_ordered_results_useful(SetOperationStmt *setop);

 #endif                            /* PREP_H */
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 9474b01510..3e1adbd992 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -466,6 +466,20 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+                               QUERY PLAN
+------------------------------------------------------------------------
+ Append
+   ->  Seq Scan on int4_tbl
+   ->  Unique
+         ->  Merge Append
+               Sort Key: tenk1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1
+               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+(7 rows)
+
 reset enable_hashagg;
 -- non-hashable type
 set enable_hashagg to on;
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index f8826514e4..10030ec1c5 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -156,6 +156,10 @@ explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+
 reset enable_hashagg;

 -- non-hashable type
--
2.43.5

From 7269a1881095bdf80bf95623e4e310e927c4b361 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 18 Dec 2024 21:35:01 -0500
Subject: [PATCH v4 5/5] Teach generate_nonunion_paths to consider pre-sorted
 child paths.

Much of this patch is just re-ordering existing code so that we can
compute the desired pathkeys before we call build_setop_child_paths.

I don't think we need any new test cases: the existing cases that
change plan are enough to show that this works.

One test case in union.sql would switch from HashSetOp Except to
sorted SetOp Except if we left it alone, since the cheapest inputs
are already sorted.  But the point of that set of tests is to
exercise HashSetOp, so swing a bigger hammer.  (The same query is
re-tested with sorting further down, so this seems fine.)

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
 src/backend/optimizer/prep/prepunion.c | 144 ++++++++++++++++---------
 src/test/regress/expected/union.out    |  37 +++----
 src/test/regress/sql/union.sql         |   5 +
 3 files changed, 115 insertions(+), 71 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e92a088334..b3a26738d0 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1010,6 +1010,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     bool        lpath_trivial_tlist,
                 rpath_trivial_tlist,
                 result_trivial_tlist;
+    List       *nonunion_pathkeys = NIL;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
@@ -1030,11 +1031,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
-    if (lrel->rtekind == RTE_SUBQUERY)
-        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
-                                NIL, &dLeftGroups);
-    else
-        dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
                                   op,
@@ -1042,9 +1038,57 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
+
+    /*
+     * Generate tlist for SetOp plan node.
+     *
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
+     * concerned, but we must make it look real anyway for the benefit of the
+     * next plan level up.
+     */
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);
+
+    *pTargetList = tlist;
+
+    /* Identify the grouping semantics */
+    groupList = generate_setop_grouplist(op, tlist);
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
+
+    if (can_sort)
+    {
+        /* Determine the pathkeys for sorting by the whole target list */
+        nonunion_pathkeys = make_pathkeys_for_sortclauses(root, groupList,
+                                                          tlist);
+
+        root->query_pathkeys = nonunion_pathkeys;
+    }
+
+    /*
+     * Now that we've got all that info, we can build the child paths.
+     */
+    if (lrel->rtekind == RTE_SUBQUERY)
+        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
+                                nonunion_pathkeys, &dLeftGroups);
+    else
+        dLeftGroups = lrel->rows;
     if (rrel->rtekind == RTE_SUBQUERY)
         build_setop_child_paths(root, rrel, rpath_trivial_tlist, rpath_tlist,
-                                NIL, &dRightGroups);
+                                nonunion_pathkeys, &dRightGroups);
     else
         dRightGroups = rrel->rows;

@@ -1080,30 +1124,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     lpath = lrel->cheapest_total_path;
     rpath = rrel->cheapest_total_path;

-    /*
-     * Generate tlist for SetOp plan node.
-     *
-     * The tlist for a SetOp plan isn't important so far as the SetOp is
-     * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.
-     */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
-                                 0, false, lpath_tlist, refnames_tlist,
-                                 &result_trivial_tlist);
-
-    /* We should not have needed any type coercions in the tlist */
-    Assert(result_trivial_tlist);
-
-    *pTargetList = tlist;
-
     /* Build result relation. */
     result_rel = fetch_upper_rel(root, UPPERREL_SETOP,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /* Identify the grouping semantics */
-    groupList = generate_setop_grouplist(op, tlist);
-
     /*
      * Estimate number of distinct groups that we'll need hashtable entries
      * for; this is the size of the left-hand input for EXCEPT, or the smaller
@@ -1139,17 +1164,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
             break;
     }

-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupList);
-    can_hash = grouping_is_hashable(groupList);
-    if (!can_sort && !can_hash)
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is INTERSECT or EXCEPT */
-                 errmsg("could not implement %s",
-                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
     /*
      * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
@@ -1189,35 +1203,63 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }

     /*
-     * If we can sort, sort the inputs if needed before adding SetOp.
+     * If we can sort, generate the cheapest sorted input paths, and add a
+     * SetOp atop those.
      */
     if (can_sort)
     {
         List       *pathkeys;
+        Path       *slpath,
+                   *srpath;

+        /* First the left input ... */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  lpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
-            lpath = (Path *) create_sort_path(root,
-                                              lpath->parent,
-                                              lpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            slpath = lpath;        /* cheapest path is already sorted */
+        else
+        {
+            slpath = get_cheapest_path_for_pathkeys(lrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (slpath == NULL)
+                slpath = (Path *) create_sort_path(root,
+                                                   lpath->parent,
+                                                   lpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }
+
+        /* and now the same for the right. */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  rpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
-            rpath = (Path *) create_sort_path(root,
-                                              rpath->parent,
-                                              rpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            srpath = rpath;        /* cheapest path is already sorted */
+        else
+        {
+            srpath = get_cheapest_path_for_pathkeys(rrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (srpath == NULL)
+                srpath = (Path *) create_sort_path(root,
+                                                   rpath->parent,
+                                                   rpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }

         path = (Path *) create_setop_path(root,
                                           result_rel,
-                                          lpath,
-                                          rpath,
+                                          slpath,
+                                          srpath,
                                           cmd,
                                           SETOP_SORTED,
                                           groupList,
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 3e1adbd992..aa73edeac5 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -385,13 +385,15 @@ select count(*) from
   5000
 (1 row)

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                         QUERY PLAN
-------------------------------------------------------------
+           QUERY PLAN
+---------------------------------
  HashSetOp Except
-   ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+   ->  Seq Scan on tenk1
+   ->  Seq Scan on tenk1 tenk1_1
          Filter: (unique2 <> 10)
 (4 rows)

@@ -401,6 +403,7 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+reset enable_indexscan;
 set enable_hashagg to off;
 explain (costs off)
 select count(*) from
@@ -427,17 +430,15 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
    ->  SetOp Intersect
          ->  Sort
                Sort Key: tenk1.fivethous
                ->  Seq Scan on tenk1
-         ->  Sort
-               Sort Key: tenk1_1.unique1
-               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(6 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -448,17 +449,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                            QUERY PLAN
-------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  SetOp Except
-   ->  Sort
-         Sort Key: tenk1.unique1
-         ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Sort
-         Sort Key: tenk1_1.unique2
-         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-               Filter: (unique2 <> 10)
-(8 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index 10030ec1c5..0be2911b0b 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -134,10 +134,15 @@ select count(*) from
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
+
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+reset enable_indexscan;
+
 set enable_hashagg to off;

 explain (costs off)
--
2.43.5

Re: Converting SetOp to read its two inputs separately

From

David Rowley

Date:

19 December 2024, 09:33:20

On Thu, 19 Dec 2024 at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> With David's recent fixes to allow telling BuildTupleHashTableExt
> what input slot type to expect, it's possible to remove the per-row
> slot type conversions I was doing before.  So here's an updated
> patchset with that done.
>
> The common_result_slot_type() function I wrote here perhaps
> should be made generally available, but I didn't do that yet.

I think it would be good to make this generic as it can be at least
used in nodeRecursiveunion.c and nodeAppend.c.  For the Append usage,
I wonder if it's worth making the function accept an array of
PlanStates rather than just two of them. Then, either add a helper
function that allocates and assigns the outer and inner side to a
stack-allocated two-element array, or just do that part in nodeSetOp.c
and nodeRecursiveunion.c and pass that to the array-accepting
function.
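
A minimal sketch of the array-accepting shape suggested above, using mock
types rather than the real executor structs. SlotOps, MockPlanState,
common_slot_ops and common_child_slot_ops are hypothetical names invented
for illustration; only the idea (a generic N-way check plus a two-child
wrapper that fills a stack-allocated array) comes from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the executor types; not the real definitions. */
typedef struct SlotOps
{
    const char *name;
} SlotOps;

typedef struct MockPlanState
{
    const SlotOps *result_ops;      /* slot type this node returns */
    bool        fixed;              /* does it always return that type? */
    struct MockPlanState *outer;    /* left child, if any */
    struct MockPlanState *inner;    /* right child, if any */
} MockPlanState;

/* Return the slot type shared by all the given nodes, or NULL if none. */
const SlotOps *
common_slot_ops(MockPlanState **planstates, int nplans)
{
    const SlotOps *result;

    if (nplans <= 0)
        return NULL;
    if (!planstates[0]->fixed)
        return NULL;
    result = planstates[0]->result_ops;
    for (int i = 1; i < nplans; i++)
    {
        /* Any non-fixed or differing child defeats the optimization. */
        if (!planstates[i]->fixed || planstates[i]->result_ops != result)
            return NULL;
    }
    return result;
}

/* Two-child convenience wrapper using a stack-allocated array. */
const SlotOps *
common_child_slot_ops(MockPlanState *ps)
{
    MockPlanState *children[2];

    children[0] = ps->outer;
    children[1] = ps->inner;
    return common_slot_ops(children, 2);
}
```

With this split, a node with many children (such as Append) would call the
array form directly, while two-input nodes would go through the wrapper.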

Review:

0001:

+ /* Left group is first, has no right matches */

Should the comma be an "or"?

0002:

Should this adjust the following nodeSetOp.c comment:

 * In SETOP_SORTED mode, each input has been sorted according to all the
 * grouping columns (ie, all the non-junk attributes).  The SetOp node

This implies that there can be junk columns.

0003:

Looks fine. The only thing that I paused to think about was your
decision to double disable hashing when !enable_hashagg and the
results are bigger than get_hash_memory_limit(). I think what you have
makes sense there.

0004:

Looks fine. I see no reason to delay you pushing this.

0005:

Looks good.

David

Re: Converting SetOp to read its two inputs separately

From

Tom Lane

Date:

19 December 2024, 22:38:14

David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 19 Dec 2024 at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The common_result_slot_type() function I wrote here perhaps
>> should be made generally available, but I didn't do that yet.

> I think it would be good to make this generic as it can be at least
> used in nodeRecursiveunion.c and nodeAppend.c.

OK, done, and I added an 0006 patch that uses that infrastructure
in the obvious places.

I also addressed your remarks about comments.  Otherwise I'm feeling
like this is about ready to push.

            regards, tom lane

From 3e5c08bdadb64d87945b7328d3ba7e28b7590883 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 19 Dec 2024 12:25:31 -0500
Subject: [PATCH v5 1/6] Convert SetOp to read its inputs as outerPlan and
 innerPlan.

The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.

This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure how to quantify how
many more comparisons we might need.

For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.
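
As a rough illustration of the sorted-mode flow described above, here is a
standalone sketch over plain ascending int arrays, not the nodeSetOp.c
code. It shows only INTERSECT ALL, where each value is emitted
min(numLeft, numRight) times. The function names load_group and
intersect_all_sorted are hypothetical, and a simple integer comparison
stands in for the SortSupport comparisons.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the run of values equal to a[*pos] and advance *pos past it.
 * Caller must ensure *pos < n.
 */
size_t
load_group(const int *a, size_t n, size_t *pos)
{
    size_t      start = *pos;
    int         v = a[start];

    while (*pos < n && a[*pos] == v)
        (*pos)++;
    return *pos - start;
}

/*
 * INTERSECT ALL over two ascending arrays.  Writes results to out (which
 * must have room for nleft entries) and returns the number written.
 */
size_t
intersect_all_sorted(const int *left, size_t nleft,
                     const int *right, size_t nright,
                     int *out)
{
    size_t      lpos = 0;
    size_t      rpos = 0;
    size_t      nout = 0;

    /*
     * Stop as soon as the left input is exhausted: remaining right-hand
     * values cannot have matches.  An empty left input never touches the
     * right input at all.
     */
    while (lpos < nleft)
    {
        int         lval = left[lpos];
        size_t      numLeft = load_group(left, nleft, &lpos);
        size_t      numRight = 0;
        size_t      nemit;

        /* Skip right-hand values that sort before the left group. */
        while (rpos < nright && right[rpos] < lval)
            rpos++;

        /* If the next right group matches, count its members. */
        if (rpos < nright && right[rpos] == lval)
            numRight = load_group(right, nright, &rpos);

        nemit = (numLeft < numRight) ? numLeft : numRight;
        for (size_t i = 0; i < nemit; i++)
            out[nout++] = lval;
    }
    return nout;
}
```

For example, left = {1,1,2,3,3,3} and right = {0,1,3,3,4} yield {1,3,3};
the trailing 4 on the right is never examined.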

Also, this patch adds some infrastructure for detecting whether
child plan nodes all output the same type of tuple table slot.
If they do, the hash table logic can use slightly more efficient
code based on assuming that that's the input slot type it will see.
We'll make use of that infrastructure in other plan node types later.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
 src/backend/executor/execUtils.c        |  43 ++
 src/backend/executor/nodeSetOp.c        | 537 ++++++++++++++----------
 src/backend/optimizer/README            |   2 +-
 src/backend/optimizer/plan/createplan.c |  81 ++--
 src/backend/optimizer/prep/prepunion.c  | 156 ++++---
 src/backend/optimizer/util/pathnode.c   |  50 ++-
 src/include/executor/executor.h         |   3 +
 src/include/nodes/execnodes.h           |  31 +-
 src/include/nodes/pathnodes.h           |   9 +-
 src/include/nodes/plannodes.h           |  19 +-
 src/include/optimizer/pathnode.h        |   7 +-
 src/test/regress/expected/subselect.out |  20 +-
 src/test/regress/expected/union.out     | 254 +++++------
 src/test/regress/sql/union.sql          |   6 +
 src/tools/pgindent/typedefs.list        |   2 +
 15 files changed, 696 insertions(+), 524 deletions(-)

diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 740e8fb148..df0223129c 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -526,6 +526,49 @@ ExecGetResultSlotOps(PlanState *planstate, bool *isfixed)
     return planstate->ps_ResultTupleSlot->tts_ops;
 }

+/*
+ * ExecGetCommonSlotOps - identify common result slot type, if any
+ *
+ * If all the given PlanState nodes return the same fixed tuple slot type,
+ * return the slot ops struct for that slot type.  Else, return NULL.
+ */
+const TupleTableSlotOps *
+ExecGetCommonSlotOps(PlanState **planstates, int nplans)
+{
+    const TupleTableSlotOps *result;
+    bool        isfixed;
+
+    if (nplans <= 0)
+        return NULL;
+    result = ExecGetResultSlotOps(planstates[0], &isfixed);
+    if (!isfixed)
+        return NULL;
+    for (int i = 1; i < nplans; i++)
+    {
+        const TupleTableSlotOps *thisops;
+
+        thisops = ExecGetResultSlotOps(planstates[i], &isfixed);
+        if (!isfixed)
+            return NULL;
+        if (result != thisops)
+            return NULL;
+    }
+    return result;
+}
+
+/*
+ * ExecGetCommonChildSlotOps - as above, for the PlanState's standard children
+ */
+const TupleTableSlotOps *
+ExecGetCommonChildSlotOps(PlanState *ps)
+{
+    PlanState  *planstates[2];
+
+    planstates[0] = outerPlanState(ps);
+    planstates[1] = innerPlanState(ps);
+    return ExecGetCommonSlotOps(planstates, 2);
+}
+

 /* ----------------
  *        ExecAssignProjectionInfo
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index b40d81f3ff..09089204e8 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -3,29 +3,30 @@
  * nodeSetOp.c
  *      Routines to handle INTERSECT and EXCEPT selection
  *
- * The input of a SetOp node consists of tuples from two relations,
- * which have been combined into one dataset, with a junk attribute added
- * that shows which relation each tuple came from.  In SETOP_SORTED mode,
- * the input has furthermore been sorted according to all the grouping
- * columns (ie, all the non-junk attributes).  The SetOp node scans each
- * group of identical tuples to determine how many came from each input
- * relation.  Then it is a simple matter to emit the output demanded by the
- * SQL spec for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ * The input of a SetOp node consists of two relations (outer and inner)
+ * with identical column sets.  In EXCEPT queries the outer relation is
+ * always the left side, while in INTERSECT cases the planner tries to
+ * make the outer relation be the smaller of the two inputs.
  *
- * In SETOP_HASHED mode, the input is delivered in no particular order,
- * except that we know all the tuples from one input relation will come before
- * all the tuples of the other.  The planner guarantees that the first input
- * relation is the left-hand one for EXCEPT, and tries to make the smaller
- * input relation come first for INTERSECT.  We build a hash table in memory
- * with one entry for each group of identical tuples, and count the number of
- * tuples in the group from each relation.  After seeing all the input, we
- * scan the hashtable and generate the correct output using those counts.
- * We can avoid making hashtable entries for any tuples appearing only in the
- * second input relation, since they cannot result in any output.
+ * In SETOP_SORTED mode, each input has been sorted according to all the
+ * grouping columns.  The SetOp node essentially performs a merge join on
+ * the grouping columns, except that it is only interested in counting how
+ * many tuples from each input match.  Then it is a simple matter to emit
+ * the output demanded by the SQL spec for INTERSECT, INTERSECT ALL, EXCEPT,
+ * or EXCEPT ALL.
+ *
+ * In SETOP_HASHED mode, the inputs are delivered in no particular order.
+ * We read the outer relation and build a hash table in memory with one entry
+ * for each group of identical tuples, counting the number of tuples in the
+ * group.  Then we read the inner relation and count the number of tuples
+ * matching each outer group.  (We can disregard any tuples appearing only
+ * in the inner relation, since they cannot result in any output.)  After
+ * seeing all the input, we scan the hashtable and generate the correct
+ * output using those counts.
  *
  * This node type is not used for UNION or UNION ALL, since those can be
- * implemented more cheaply (there's no need for the junk attribute to
- * identify the source relation).
+ * implemented more cheaply (there's no need to count the number of
+ * matching tuples).
  *
  * Note that SetOp does no qual checking nor projection.  The delivered
  * output tuples are just copies of the first-to-arrive tuple in each
@@ -54,65 +55,28 @@
 /*
  * SetOpStatePerGroupData - per-group working state
  *
- * These values are working state that is initialized at the start of
- * an input tuple group and updated for each input tuple.
- *
- * In SETOP_SORTED mode, we need only one of these structs, and it's kept in
- * the plan state node.  In SETOP_HASHED mode, the hash table contains one
- * of these for each tuple group.
+ * In SETOP_SORTED mode, we need only one of these structs, and it's just a
+ * local in setop_retrieve_sorted.  In SETOP_HASHED mode, the hash table
+ * contains one of these for each tuple group.
  */
 typedef struct SetOpStatePerGroupData
 {
-    long        numLeft;        /* number of left-input dups in group */
-    long        numRight;        /* number of right-input dups in group */
-}            SetOpStatePerGroupData;
+    int64        numLeft;        /* number of left-input dups in group */
+    int64        numRight;        /* number of right-input dups in group */
+} SetOpStatePerGroupData;
+
+typedef SetOpStatePerGroupData *SetOpStatePerGroup;


-static TupleTableSlot *setop_retrieve_direct(SetOpState *setopstate);
+static TupleTableSlot *setop_retrieve_sorted(SetOpState *setopstate);
+static void setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                             SetOpState *setopstate);
+static int    setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                                SetOpState *setopstate);
 static void setop_fill_hash_table(SetOpState *setopstate);
 static TupleTableSlot *setop_retrieve_hash_table(SetOpState *setopstate);


-/*
- * Initialize state for a new group of input values.
- */
-static inline void
-initialize_counts(SetOpStatePerGroup pergroup)
-{
-    pergroup->numLeft = pergroup->numRight = 0;
-}
-
-/*
- * Advance the appropriate counter for one input tuple.
- */
-static inline void
-advance_counts(SetOpStatePerGroup pergroup, int flag)
-{
-    if (flag)
-        pergroup->numRight++;
-    else
-        pergroup->numLeft++;
-}
-
-/*
- * Fetch the "flag" column from an input tuple.
- * This is an integer column with value 0 for left side, 1 for right side.
- */
-static int
-fetch_tuple_flag(SetOpState *setopstate, TupleTableSlot *inputslot)
-{
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
-    int            flag;
-    bool        isNull;
-
-    flag = DatumGetInt32(slot_getattr(inputslot,
-                                      node->flagColIdx,
-                                      &isNull));
-    Assert(!isNull);
-    Assert(flag == 0 || flag == 1);
-    return flag;
-}
-
 /*
  * Initialize the hash table to empty.
  */
@@ -126,14 +90,19 @@ build_hash_table(SetOpState *setopstate)
     Assert(node->strategy == SETOP_HASHED);
     Assert(node->numGroups > 0);

+    /*
+     * If both child plans deliver the same fixed tuple slot type, we can tell
+     * BuildTupleHashTableExt to expect that slot type as input.  Otherwise,
+     * we'll pass NULL denoting that any slot type is possible.
+     */
     setopstate->hashtable = BuildTupleHashTableExt(&setopstate->ps,
                                                    desc,
-                                                   NULL,
+                                                   ExecGetCommonChildSlotOps(&setopstate->ps),
                                                    node->numCols,
-                                                   node->dupColIdx,
+                                                   node->cmpColIdx,
                                                    setopstate->eqfuncoids,
                                                    setopstate->hashfunctions,
-                                                   node->dupCollations,
+                                                   node->cmpCollations,
                                                    node->numGroups,
                                                    0,
                                                    setopstate->ps.state->es_query_cxt,
@@ -218,108 +187,126 @@ ExecSetOp(PlanState *pstate)
         return setop_retrieve_hash_table(node);
     }
     else
-        return setop_retrieve_direct(node);
+        return setop_retrieve_sorted(node);
 }

 /*
  * ExecSetOp for non-hashed case
  */
 static TupleTableSlot *
-setop_retrieve_direct(SetOpState *setopstate)
+setop_retrieve_sorted(SetOpState *setopstate)
 {
     PlanState  *outerPlan;
-    SetOpStatePerGroup pergroup;
-    TupleTableSlot *outerslot;
+    PlanState  *innerPlan;
     TupleTableSlot *resultTupleSlot;
-    ExprContext *econtext = setopstate->ps.ps_ExprContext;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    pergroup = (SetOpStatePerGroup) setopstate->pergroup;
+    innerPlan = innerPlanState(setopstate);
     resultTupleSlot = setopstate->ps.ps_ResultTupleSlot;

     /*
-     * We loop retrieving groups until we find one we should return
+     * If first time through, establish the invariant that setop_load_group
+     * expects: each side's nextTupleSlot is the next output from the child
+     * plan, or empty if there is no more output from it.
      */
-    while (!setopstate->setop_done)
+    if (setopstate->need_init)
     {
+        setopstate->need_init = false;
+
+        setopstate->leftInput.nextTupleSlot = ExecProcNode(outerPlan);
+
         /*
-         * If we don't already have the first tuple of the new group, fetch it
-         * from the outer plan.
+         * If the outer relation is empty, then we will emit nothing, and we
+         * don't need to read the inner relation at all.
          */
-        if (setopstate->grp_firstTuple == NULL)
+        if (TupIsNull(setopstate->leftInput.nextTupleSlot))
         {
-            outerslot = ExecProcNode(outerPlan);
-            if (!TupIsNull(outerslot))
-            {
-                /* Make a copy of the first input tuple */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-            }
-            else
-            {
-                /* outer plan produced no tuples at all */
-                setopstate->setop_done = true;
-                return NULL;
-            }
+            setopstate->setop_done = true;
+            return NULL;
         }

+        setopstate->rightInput.nextTupleSlot = ExecProcNode(innerPlan);
+
+        /* Set flags that we've not completed either side's group */
+        setopstate->leftInput.needGroup = true;
+        setopstate->rightInput.needGroup = true;
+    }
+
+    /*
+     * We loop retrieving groups until we find one we should return
+     */
+    while (!setopstate->setop_done)
+    {
+        int            cmpresult;
+        SetOpStatePerGroupData pergroup;
+
         /*
-         * Store the copied first input tuple in the tuple table slot reserved
-         * for it.  The tuple will be deleted when it is cleared from the
-         * slot.
+         * Fetch the rest of the current outer group, if we didn't already.
          */
-        ExecStoreHeapTuple(setopstate->grp_firstTuple,
-                           resultTupleSlot,
-                           true);
-        setopstate->grp_firstTuple = NULL;    /* don't keep two pointers */
+        if (setopstate->leftInput.needGroup)
+            setop_load_group(&setopstate->leftInput, outerPlan, setopstate);

-        /* Initialize working state for a new input tuple group */
-        initialize_counts(pergroup);
+        /*
+         * If no more outer groups, we're done, and don't need to look at any
+         * more of the inner relation.
+         */
+        if (setopstate->leftInput.numTuples == 0)
+        {
+            setopstate->setop_done = true;
+            break;
+        }

-        /* Count the first input tuple */
-        advance_counts(pergroup,
-                       fetch_tuple_flag(setopstate, resultTupleSlot));
+        /*
+         * Fetch the rest of the current inner group, if we didn't already.
+         */
+        if (setopstate->rightInput.needGroup)
+            setop_load_group(&setopstate->rightInput, innerPlan, setopstate);

         /*
-         * Scan the outer plan until we exhaust it or cross a group boundary.
+         * Determine whether we have matching groups on both sides (this is
+         * basically like the core logic of a merge join).
          */
-        for (;;)
-        {
-            outerslot = ExecProcNode(outerPlan);
-            if (TupIsNull(outerslot))
-            {
-                /* no more outer-plan tuples available */
-                setopstate->setop_done = true;
-                break;
-            }
-
-            /*
-             * Check whether we've crossed a group boundary.
-             */
-            econtext->ecxt_outertuple = resultTupleSlot;
-            econtext->ecxt_innertuple = outerslot;
-
-            if (!ExecQualAndReset(setopstate->eqfunction, econtext))
-            {
-                /*
-                 * Save the first input tuple of the next group.
-                 */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-                break;
-            }
+        if (setopstate->rightInput.numTuples == 0)
+            cmpresult = -1;        /* as though left input is lesser */
+        else
+            cmpresult = setop_compare_slots(setopstate->leftInput.firstTupleSlot,
+                                            setopstate->rightInput.firstTupleSlot,
+                                            setopstate);

-            /* Still in same group, so count this tuple */
-            advance_counts(pergroup,
-                           fetch_tuple_flag(setopstate, outerslot));
+        if (cmpresult < 0)
+        {
+            /* Left group is first, and has no right matches */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = 0;
+            /* We'll need another left group next time */
+            setopstate->leftInput.needGroup = true;
+        }
+        else if (cmpresult == 0)
+        {
+            /* We have matching groups */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = setopstate->rightInput.numTuples;
+            /* We'll need to read from both sides next time */
+            setopstate->leftInput.needGroup = true;
+            setopstate->rightInput.needGroup = true;
+        }
+        else
+        {
+            /* Right group has no left matches, so we can ignore it */
+            setopstate->rightInput.needGroup = true;
+            continue;
         }

         /*
-         * Done scanning input tuple group.  See if we should emit any copies
-         * of result tuple, and if so return the first copy.
+         * Done scanning these input tuple groups.  See if we should emit any
+         * copies of result tuple, and if so return the first copy.  (Note
+         * that the result tuple is the same as the left input's firstTuple
+         * slot.)
          */
-        set_output_count(setopstate, pergroup);
+        set_output_count(setopstate, &pergroup);

         if (setopstate->numOutput > 0)
         {
@@ -334,84 +321,168 @@ setop_retrieve_direct(SetOpState *setopstate)
 }

 /*
- * ExecSetOp for hashed case: phase 1, read input and build hash table
+ * Load next group of tuples from one child plan or the other.
+ *
+ * On entry, we've already read the first tuple of the next group
+ * (if there is one) into input->nextTupleSlot.  This invariant
+ * is maintained on exit.
+ */
+static void
+setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                 SetOpState *setopstate)
+{
+    input->needGroup = false;
+
+    /* If we've exhausted this child plan, report an empty group */
+    if (TupIsNull(input->nextTupleSlot))
+    {
+        ExecClearTuple(input->firstTupleSlot);
+        input->numTuples = 0;
+        return;
+    }
+
+    /* Make a local copy of the first tuple for comparisons */
+    ExecStoreMinimalTuple(ExecCopySlotMinimalTuple(input->nextTupleSlot),
+                          input->firstTupleSlot,
+                          true);
+    /* and count it */
+    input->numTuples = 1;
+
+    /* Scan till we find the end-of-group */
+    for (;;)
+    {
+        int            cmpresult;
+
+        /* Get next input tuple, if there is one */
+        input->nextTupleSlot = ExecProcNode(inputPlan);
+        if (TupIsNull(input->nextTupleSlot))
+            break;
+
+        /* There is; does it belong to same group as firstTuple? */
+        cmpresult = setop_compare_slots(input->firstTupleSlot,
+                                        input->nextTupleSlot,
+                                        setopstate);
+        Assert(cmpresult <= 0); /* else input is mis-sorted */
+        if (cmpresult != 0)
+            break;
+
+        /* Still in same group, so count this tuple */
+        input->numTuples++;
+    }
+}
+
+/*
+ * Compare the tuples in the two given slots.
+ */
+static int
+setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                    SetOpState *setopstate)
+{
+    /* We'll often need to fetch all the columns, so just do it */
+    slot_getallattrs(s1);
+    slot_getallattrs(s2);
+    for (int nkey = 0; nkey < setopstate->numCols; nkey++)
+    {
+        SortSupport sortKey = setopstate->sortKeys + nkey;
+        AttrNumber    attno = sortKey->ssup_attno;
+        Datum        datum1 = s1->tts_values[attno - 1],
+                    datum2 = s2->tts_values[attno - 1];
+        bool        isNull1 = s1->tts_isnull[attno - 1],
+                    isNull2 = s2->tts_isnull[attno - 1];
+        int            compare;
+
+        compare = ApplySortComparator(datum1, isNull1,
+                                      datum2, isNull2,
+                                      sortKey);
+        if (compare != 0)
+            return compare;
+    }
+    return 0;
+}
+
+/*
+ * ExecSetOp for hashed case: phase 1, read inputs and build hash table
  */
 static void
 setop_fill_hash_table(SetOpState *setopstate)
 {
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
     PlanState  *outerPlan;
-    int            firstFlag;
-    bool        in_first_rel PG_USED_FOR_ASSERTS_ONLY;
+    PlanState  *innerPlan;
     ExprContext *econtext = setopstate->ps.ps_ExprContext;
+    bool        have_tuples = false;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    firstFlag = node->firstFlag;
-    /* verify planner didn't mess up */
-    Assert(firstFlag == 0 ||
-           (firstFlag == 1 &&
-            (node->cmd == SETOPCMD_INTERSECT ||
-             node->cmd == SETOPCMD_INTERSECT_ALL)));
+    innerPlan = innerPlanState(setopstate);

     /*
      * Process each outer-plan tuple, and then fetch the next one, until we
      * exhaust the outer plan.
      */
-    in_first_rel = true;
     for (;;)
     {
         TupleTableSlot *outerslot;
-        int            flag;
         TupleHashEntryData *entry;
         bool        isnew;

         outerslot = ExecProcNode(outerPlan);
         if (TupIsNull(outerslot))
             break;
+        have_tuples = true;

-        /* Identify whether it's left or right input */
-        flag = fetch_tuple_flag(setopstate, outerslot);
+        /* Find or build hashtable entry for this tuple's group */
+        entry = LookupTupleHashEntry(setopstate->hashtable,
+                                     outerslot,
+                                     &isnew, NULL);

-        if (flag == firstFlag)
+        /* If new tuple group, initialize counts to zero */
+        if (isnew)
         {
-            /* (still) in first input relation */
-            Assert(in_first_rel);
-
-            /* Find or build hashtable entry for this tuple's group */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
-                                         &isnew, NULL);
-
-            /* If new tuple group, initialize counts */
-            if (isnew)
-            {
-                entry->additional = (SetOpStatePerGroup)
-                    MemoryContextAlloc(setopstate->hashtable->tablecxt,
+            entry->additional = (SetOpStatePerGroup)
+                MemoryContextAllocZero(setopstate->hashtable->tablecxt,
                                        sizeof(SetOpStatePerGroupData));
-                initialize_counts((SetOpStatePerGroup) entry->additional);
-            }
-
-            /* Advance the counts */
-            advance_counts((SetOpStatePerGroup) entry->additional, flag);
         }
-        else
+
+        /* Advance the counts */
+        ((SetOpStatePerGroup) entry->additional)->numLeft++;
+
+        /* Must reset expression context after each hashtable lookup */
+        ResetExprContext(econtext);
+    }
+
+    /*
+     * If the outer relation is empty, then we will emit nothing, and we don't
+     * need to read the inner relation at all.
+     */
+    if (have_tuples)
+    {
+        /*
+         * Process each inner-plan tuple, and then fetch the next one, until
+         * we exhaust the inner plan.
+         */
+        for (;;)
         {
-            /* reached second relation */
-            in_first_rel = false;
+            TupleTableSlot *innerslot;
+            TupleHashEntryData *entry;
+
+            innerslot = ExecProcNode(innerPlan);
+            if (TupIsNull(innerslot))
+                break;

             /* For tuples not seen previously, do not make hashtable entry */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+            entry = LookupTupleHashEntry(setopstate->hashtable,
+                                         innerslot,
                                          NULL, NULL);

             /* Advance the counts if entry is already present */
             if (entry)
-                advance_counts((SetOpStatePerGroup) entry->additional, flag);
-        }
+                ((SetOpStatePerGroup) entry->additional)->numRight++;

-        /* Must reset expression context after each hashtable lookup */
-        ResetExprContext(econtext);
+            /* Must reset expression context after each hashtable lookup */
+            ResetExprContext(econtext);
+        }
     }

     setopstate->table_filled = true;
@@ -482,7 +553,6 @@ SetOpState *
 ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 {
     SetOpState *setopstate;
-    TupleDesc    outerDesc;

     /* check for unsupported flags */
     Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -495,14 +565,10 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     setopstate->ps.state = estate;
     setopstate->ps.ExecProcNode = ExecSetOp;

-    setopstate->eqfuncoids = NULL;
-    setopstate->hashfunctions = NULL;
     setopstate->setop_done = false;
     setopstate->numOutput = 0;
-    setopstate->pergroup = NULL;
-    setopstate->grp_firstTuple = NULL;
-    setopstate->hashtable = NULL;
-    setopstate->tableContext = NULL;
+    setopstate->numCols = node->numCols;
+    setopstate->need_init = true;

     /*
      * create expression context
@@ -523,52 +589,72 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * initialize child nodes
      *
-     * If we are hashing then the child plan does not need to handle REWIND
+     * If we are hashing then the child plans do not need to handle REWIND
      * efficiently; see ExecReScanSetOp.
      */
     if (node->strategy == SETOP_HASHED)
         eflags &= ~EXEC_FLAG_REWIND;
     outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
-    outerDesc = ExecGetResultType(outerPlanState(setopstate));
+    innerPlanState(setopstate) = ExecInitNode(innerPlan(node), estate, eflags);

     /*
-     * Initialize result slot and type. Setop nodes do no projections, so
-     * initialize projection info for this node appropriately.
+     * Initialize locally-allocated slots.  In hashed mode, we just need a
+     * result slot.  In sorted mode, we need one first-tuple-of-group slot for
+     * each input; we use the result slot for the left input's slot and create
+     * another for the right input.  (Note: the nextTupleSlot slots are not
+     * ours, but just point to the last slot returned by the input plan node.)
      */
-    ExecInitResultTupleSlotTL(&setopstate->ps,
-                              node->strategy == SETOP_HASHED ?
-                              &TTSOpsMinimalTuple : &TTSOpsHeapTuple);
+    ExecInitResultTupleSlotTL(&setopstate->ps, &TTSOpsMinimalTuple);
+    if (node->strategy != SETOP_HASHED)
+    {
+        setopstate->leftInput.firstTupleSlot =
+            setopstate->ps.ps_ResultTupleSlot;
+        setopstate->rightInput.firstTupleSlot =
+            ExecInitExtraTupleSlot(estate,
+                                   setopstate->ps.ps_ResultTupleDesc,
+                                   &TTSOpsMinimalTuple);
+    }
+
+    /* Setop nodes do no projections. */
     setopstate->ps.ps_ProjInfo = NULL;

     /*
-     * Precompute fmgr lookup data for inner loop. We need both equality and
-     * hashing functions to do it by hashing, but only equality if not
-     * hashing.
+     * Precompute fmgr lookup data for inner loop.  We need equality and
+     * hashing functions to do it by hashing, while for sorting we need
+     * SortSupport data.
      */
     if (node->strategy == SETOP_HASHED)
         execTuplesHashPrepare(node->numCols,
-                              node->dupOperators,
+                              node->cmpOperators,
                               &setopstate->eqfuncoids,
                               &setopstate->hashfunctions);
     else
-        setopstate->eqfunction =
-            execTuplesMatchPrepare(outerDesc,
-                                   node->numCols,
-                                   node->dupColIdx,
-                                   node->dupOperators,
-                                   node->dupCollations,
-                                   &setopstate->ps);
+    {
+        int            nkeys = node->numCols;
+
+        setopstate->sortKeys = (SortSupport)
+            palloc0(nkeys * sizeof(SortSupportData));
+        for (int i = 0; i < nkeys; i++)
+        {
+            SortSupport sortKey = setopstate->sortKeys + i;
+
+            sortKey->ssup_cxt = CurrentMemoryContext;
+            sortKey->ssup_collation = node->cmpCollations[i];
+            sortKey->ssup_nulls_first = node->cmpNullsFirst[i];
+            sortKey->ssup_attno = node->cmpColIdx[i];
+            /* abbreviated key conversion is not useful here */
+            sortKey->abbreviate = false;

+            PrepareSortSupportFromOrderingOp(node->cmpOperators[i], sortKey);
+        }
+    }
+
+    /* Create a hash table if needed */
     if (node->strategy == SETOP_HASHED)
     {
         build_hash_table(setopstate);
         setopstate->table_filled = false;
     }
-    else
-    {
-        setopstate->pergroup =
-            (SetOpStatePerGroup) palloc0(sizeof(SetOpStatePerGroupData));
-    }

     return setopstate;
 }
@@ -576,7 +662,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 /* ----------------------------------------------------------------
  *        ExecEndSetOp
  *
- *        This shuts down the subplan and frees resources allocated
+ *        This shuts down the subplans and frees resources allocated
  *        to this node.
  * ----------------------------------------------------------------
  */
@@ -588,6 +674,7 @@ ExecEndSetOp(SetOpState *node)
         MemoryContextDelete(node->tableContext);

     ExecEndNode(outerPlanState(node));
+    ExecEndNode(innerPlanState(node));
 }


@@ -595,6 +682,7 @@ void
 ExecReScanSetOp(SetOpState *node)
 {
     PlanState  *outerPlan = outerPlanState(node);
+    PlanState  *innerPlan = innerPlanState(node);

     ExecClearTuple(node->ps.ps_ResultTupleSlot);
     node->setop_done = false;
@@ -612,34 +700,29 @@ ExecReScanSetOp(SetOpState *node)
             return;

         /*
-         * If we do have the hash table and the subplan does not have any
+         * If we do have the hash table and the subplans do not have any
          * parameter changes, then we can just rescan the existing hash table;
          * no need to build it again.
          */
-        if (outerPlan->chgParam == NULL)
+        if (outerPlan->chgParam == NULL && innerPlan->chgParam == NULL)
         {
             ResetTupleHashIterator(node->hashtable, &node->hashiter);
             return;
         }
-    }
-
-    /* Release first tuple of group, if we have made a copy */
-    if (node->grp_firstTuple != NULL)
-    {
-        heap_freetuple(node->grp_firstTuple);
-        node->grp_firstTuple = NULL;
-    }

-    /* Release any hashtable storage */
-    if (node->tableContext)
-        MemoryContextReset(node->tableContext);
+        /* Release any hashtable storage */
+        if (node->tableContext)
+            MemoryContextReset(node->tableContext);

-    /* And rebuild empty hashtable if needed */
-    if (((SetOp *) node->ps.plan)->strategy == SETOP_HASHED)
-    {
+        /* And rebuild an empty hashtable */
         ResetTupleHashTable(node->hashtable);
         node->table_filled = false;
     }
+    else
+    {
+        /* Need to re-read first input from each side */
+        node->need_init = true;
+    }

     /*
      * if chgParam of subnode is not null then plan will be re-scanned by
@@ -647,4 +730,6 @@ ExecReScanSetOp(SetOpState *node)
      */
     if (outerPlan->chgParam == NULL)
         ExecReScan(outerPlan);
+    if (innerPlan->chgParam == NULL)
+        ExecReScan(innerPlan);
 }
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 2ab4f3dbf3..f341d9f303 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -649,7 +649,7 @@ RelOptInfo      - a relation or joined relations
   GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
   MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
   WindowAggPath - a WindowAgg plan node applied to some sub-path
-  SetOpPath     - a SetOp plan node applied to some sub-path
+  SetOpPath     - a SetOp plan node applied to two sub-paths
   RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
   LockRowsPath  - a LockRows plan node applied to some sub-path
   ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 178c572b02..b3e2294e84 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -301,9 +301,9 @@ static Unique *make_unique_from_pathkeys(Plan *lefttree,
                                          List *pathkeys, int numCols);
 static Gather *make_gather(List *qptlist, List *qpqual,
                            int nworkers, int rescan_param, bool single_copy, Plan *subplan);
-static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-                         List *distinctList, AttrNumber flagColIdx, int firstFlag,
-                         long numGroups);
+static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+                         List *tlist, Plan *lefttree, Plan *righttree,
+                         List *groupList, long numGroups);
 static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
 static Result *make_result(List *tlist, Node *resconstantqual, Plan *subplan);
 static ProjectSet *make_project_set(List *tlist, Plan *subplan);
@@ -2719,25 +2719,29 @@ static SetOp *
 create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 {
     SetOp       *plan;
-    Plan       *subplan;
+    List       *tlist = build_path_tlist(root, &best_path->path);
+    Plan       *leftplan;
+    Plan       *rightplan;
     long        numGroups;

     /*
      * SetOp doesn't project, so tlist requirements pass through; moreover we
      * need grouping columns to be labeled.
      */
-    subplan = create_plan_recurse(root, best_path->subpath,
-                                  flags | CP_LABEL_TLIST);
+    leftplan = create_plan_recurse(root, best_path->leftpath,
+                                   flags | CP_LABEL_TLIST);
+    rightplan = create_plan_recurse(root, best_path->rightpath,
+                                    flags | CP_LABEL_TLIST);

     /* Convert numGroups to long int --- but 'ware overflow! */
     numGroups = clamp_cardinality_to_long(best_path->numGroups);

     plan = make_setop(best_path->cmd,
                       best_path->strategy,
-                      subplan,
-                      best_path->distinctList,
-                      best_path->flagColIdx,
-                      best_path->firstFlag,
+                      tlist,
+                      leftplan,
+                      rightplan,
+                      best_path->groupList,
                       numGroups);

     copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6950,57 +6954,62 @@ make_gather(List *qptlist,
 }

 /*
- * distinctList is a list of SortGroupClauses, identifying the targetlist
- * items that should be considered by the SetOp filter.  The input path must
- * already be sorted accordingly.
+ * groupList is a list of SortGroupClauses, identifying the targetlist
+ * items that should be considered by the SetOp filter.  The input plans must
+ * already be sorted accordingly, if we're doing SETOP_SORTED mode.
  */
 static SetOp *
-make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-           List *distinctList, AttrNumber flagColIdx, int firstFlag,
-           long numGroups)
+make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+           List *tlist, Plan *lefttree, Plan *righttree,
+           List *groupList, long numGroups)
 {
     SetOp       *node = makeNode(SetOp);
     Plan       *plan = &node->plan;
-    int            numCols = list_length(distinctList);
+    int            numCols = list_length(groupList);
     int            keyno = 0;
-    AttrNumber *dupColIdx;
-    Oid           *dupOperators;
-    Oid           *dupCollations;
+    AttrNumber *cmpColIdx;
+    Oid           *cmpOperators;
+    Oid           *cmpCollations;
+    bool       *cmpNullsFirst;
     ListCell   *slitem;

-    plan->targetlist = lefttree->targetlist;
+    plan->targetlist = tlist;
     plan->qual = NIL;
     plan->lefttree = lefttree;
-    plan->righttree = NULL;
+    plan->righttree = righttree;

     /*
-     * convert SortGroupClause list into arrays of attr indexes and equality
+     * convert SortGroupClause list into arrays of attr indexes and comparison
      * operators, as wanted by executor
      */
-    dupColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
-    dupOperators = (Oid *) palloc(sizeof(Oid) * numCols);
-    dupCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
+    cmpOperators = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpNullsFirst = (bool *) palloc(sizeof(bool) * numCols);

-    foreach(slitem, distinctList)
+    foreach(slitem, groupList)
     {
         SortGroupClause *sortcl = (SortGroupClause *) lfirst(slitem);
         TargetEntry *tle = get_sortgroupclause_tle(sortcl, plan->targetlist);

-        dupColIdx[keyno] = tle->resno;
-        dupOperators[keyno] = sortcl->eqop;
-        dupCollations[keyno] = exprCollation((Node *) tle->expr);
-        Assert(OidIsValid(dupOperators[keyno]));
+        cmpColIdx[keyno] = tle->resno;
+        if (strategy == SETOP_HASHED)
+            cmpOperators[keyno] = sortcl->eqop;
+        else
+            cmpOperators[keyno] = sortcl->sortop;
+        Assert(OidIsValid(cmpOperators[keyno]));
+        cmpCollations[keyno] = exprCollation((Node *) tle->expr);
+        cmpNullsFirst[keyno] = sortcl->nulls_first;
         keyno++;
     }

     node->cmd = cmd;
     node->strategy = strategy;
     node->numCols = numCols;
-    node->dupColIdx = dupColIdx;
-    node->dupOperators = dupOperators;
-    node->dupCollations = dupCollations;
-    node->flagColIdx = flagColIdx;
-    node->firstFlag = firstFlag;
+    node->cmpColIdx = cmpColIdx;
+    node->cmpOperators = cmpOperators;
+    node->cmpCollations = cmpCollations;
+    node->cmpNullsFirst = cmpNullsFirst;
     node->numGroups = numGroups;

     return node;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 9c3822f19a..1c43370005 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -65,7 +65,7 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
 static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *input_path,
+                                Path *lpath, Path *rpath,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
@@ -315,8 +315,8 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * to the corresponding tlist entries of the subplan. However, since
          * the subplan was generated by generate_union_paths() or
          * generate_nonunion_paths(), and hence its tlist was generated by
-         * generate_append_tlist(), this will work.  We just tell
-         * generate_setop_tlist() to use varno 0.
+         * generate_append_tlist() or generate_setop_tlist(), this will work.
+         * We just tell generate_setop_tlist() to use varno 0.
          */
         if (flag >= 0 ||
             !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
@@ -1028,29 +1028,27 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                *path;
     List       *lpath_tlist,
                *rpath_tlist,
-               *tlist_list,
                *tlist,
-               *groupList,
-               *pathlist;
+               *groupList;
     bool        lpath_trivial_tlist,
-                rpath_trivial_tlist;
+                rpath_trivial_tlist,
+                result_trivial_tlist;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
     bool        use_hash;
     SetOpCmd    cmd;
-    int            firstFlag;

     /*
      * Tell children to fetch all tuples.
      */
     root->tuple_fraction = 0.0;

-    /* Recurse on children, ensuring their outputs are marked */
+    /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 0,
+                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,10 +1058,9 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dLeftGroups = lrel->rows;

-    lpath = lrel->cheapest_total_path;
     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 1,
+                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1073,41 +1070,51 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dRightGroups = rrel->rows;

-    rpath = rrel->cheapest_total_path;
-
     /* Undo effects of forcing tuple_fraction to 0 */
     root->tuple_fraction = save_fraction;

     /*
      * For EXCEPT, we must put the left input first.  For INTERSECT, either
      * order should give the same results, and we prefer to put the smaller
-     * input first in order to minimize the size of the hash table in the
-     * hashing case.  "Smaller" means the one with the fewer groups.
+     * input first in order to (a) minimize the size of the hash table in the
+     * hashing case, and (b) improve our chances of exploiting the executor's
+     * fast path for empty left-hand input.  "Smaller" means the one with the
+     * fewer groups.
      */
-    if (op->op == SETOP_EXCEPT || dLeftGroups <= dRightGroups)
-    {
-        pathlist = list_make2(lpath, rpath);
-        tlist_list = list_make2(lpath_tlist, rpath_tlist);
-        firstFlag = 0;
-    }
-    else
+    if (op->op != SETOP_EXCEPT && dLeftGroups > dRightGroups)
     {
-        pathlist = list_make2(rpath, lpath);
-        tlist_list = list_make2(rpath_tlist, lpath_tlist);
-        firstFlag = 1;
+        /* need to swap the two inputs */
+        RelOptInfo *tmprel;
+        List       *tmplist;
+        double        tmpd;
+
+        tmprel = lrel;
+        lrel = rrel;
+        rrel = tmprel;
+        tmplist = lpath_tlist;
+        lpath_tlist = rpath_tlist;
+        rpath_tlist = tmplist;
+        tmpd = dLeftGroups;
+        dLeftGroups = dRightGroups;
+        dRightGroups = tmpd;
     }

+    lpath = lrel->cheapest_total_path;
+    rpath = rrel->cheapest_total_path;
+
     /*
-     * Generate tlist for Append plan node.
+     * Generate tlist for SetOp plan node.
      *
-     * The tlist for an Append plan isn't important as far as the Append is
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
      * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.  In fact, it has to be real enough that the flag
-     * column is shown as a variable not a constant, else setrefs.c will get
-     * confused.
+     * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, true,
-                                  tlist_list, refnames_tlist);
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);

     *pTargetList = tlist;

@@ -1116,12 +1123,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /*
-     * Append the child results together.
-     */
-    path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
-                                       NIL, NULL, 0, false, -1);
-
     /* Identify the grouping semantics */
     groupList = generate_setop_grouplist(op, tlist);

@@ -1140,25 +1141,40 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     else
     {
-        dNumGroups = Min(dLeftGroups, dRightGroups);
+        dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }

     /*
-     * Decide whether to hash or sort, and add a sort node if needed.
+     * Decide whether to hash or sort, and add sort nodes if needed.
      */
-    use_hash = choose_hashed_setop(root, groupList, path,
+    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
                                    dNumGroups, dNumOutputRows,
                                    (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");

     if (groupList && !use_hash)
-        path = (Path *) create_sort_path(root,
-                                         result_rel,
-                                         path,
-                                         make_pathkeys_for_sortclauses(root,
-                                                                       groupList,
-                                                                       tlist),
-                                         -1.0);
+    {
+        List       *pathkeys;
+
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 lpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            lpath = (Path *) create_sort_path(root,
+                                              lpath->parent,
+                                              lpath,
+                                              pathkeys,
+                                              -1.0);
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 rpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            rpath = (Path *) create_sort_path(root,
+                                              rpath->parent,
+                                              rpath,
+                                              pathkeys,
+                                              -1.0);
+    }

     /*
      * Finally, add a SetOp path node to generate the correct output.
@@ -1178,12 +1194,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     path = (Path *) create_setop_path(root,
                                       result_rel,
-                                      path,
+                                      lpath,
+                                      rpath,
                                       cmd,
                                       use_hash ? SETOP_HASHED : SETOP_SORTED,
                                       groupList,
-                                      list_length(op->colTypes) + 1,
-                                      use_hash ? firstFlag : -1,
                                       dNumGroups,
                                       dNumOutputRows);

@@ -1285,10 +1300,13 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)

 /*
  * choose_hashed_setop - should we use hashing for a set operation?
+ *
+ * XXX probably this should go away: just make both paths and let
+ * add_path sort it out.
  */
 static bool
 choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *input_path,
+                    Path *lpath, Path *rpath,
                     double dNumGroups, double dNumOutputRows,
                     const char *construct)
 {
@@ -1327,7 +1345,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * Don't do it if it doesn't look like the hashtable will fit into
      * hash_mem.
      */
-    hashentrysize = MAXALIGN(input_path->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
+    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);

     if (hashentrysize * dNumGroups > hash_mem_limit)
         return false;
@@ -1336,9 +1354,9 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * See if the estimated cost is no more than doing it the other way.
      *
      * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group.  Note that the actual result plan might involve a SetOp or
-     * Unique node, not Agg or Group, but the cost estimates for Agg and Group
-     * should be close enough for our purposes here.
+     * group. XXX NOT TRUE: Note that the actual result plan might involve a
+     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
+     * and Group should be close enough for our purposes here.
      *
      * These path variables are dummies that just hold cost fields; we don't
      * make actual Paths for these steps.
@@ -1346,27 +1364,31 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
     cost_agg(&hashed_p, root, AGG_HASHED, NULL,
              numGroupCols, dNumGroups,
              NIL,
-             input_path->disabled_nodes,
-             input_path->startup_cost, input_path->total_cost,
-             input_path->rows, input_path->pathtarget->width);
+             lpath->disabled_nodes + rpath->disabled_nodes,
+             lpath->startup_cost + rpath->startup_cost,
+             lpath->total_cost + rpath->total_cost,
+             lpath->rows + rpath->rows,
+             lpath->pathtarget->width);

     /*
-     * Now for the sorted case.  Note that the input is *always* unsorted,
-     * since it was made by appending unrelated sub-relations together.
+     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
+     * unsorted, since it was made by appending unrelated sub-relations
+     * together.
      */
-    sorted_p.disabled_nodes = input_path->disabled_nodes;
-    sorted_p.startup_cost = input_path->startup_cost;
-    sorted_p.total_cost = input_path->total_cost;
+    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
+    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
+    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
     /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
     cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
               sorted_p.total_cost,
-              input_path->rows, input_path->pathtarget->width,
+              lpath->rows + rpath->rows,
+              lpath->pathtarget->width,
               0.0, work_mem, -1.0);
     cost_group(&sorted_p, root, numGroupCols, dNumGroups,
                NIL,
                sorted_p.disabled_nodes,
                sorted_p.startup_cost, sorted_p.total_cost,
-               input_path->rows);
+               lpath->rows + rpath->rows);

     /*
      * Now make the decision using the top-level tuple fraction.  First we
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fc97bf6ee2..e52e4b1d67 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3634,25 +3634,26 @@ create_windowagg_path(PlannerInfo *root,
  *      Creates a pathnode that represents computation of INTERSECT or EXCEPT
  *
  * 'rel' is the parent relation associated with the result
- * 'subpath' is the path representing the source of data
+ * 'leftpath' is the path representing the left-hand source of data
+ * 'rightpath' is the path representing the right-hand source of data
  * 'cmd' is the specific semantics (INTERSECT or EXCEPT, with/without ALL)
  * 'strategy' is the implementation strategy (sorted or hashed)
- * 'distinctList' is a list of SortGroupClause's representing the grouping
- * 'flagColIdx' is the column number where the flag column will be, if any
- * 'firstFlag' is the flag value for the first input relation when hashing;
- *        or -1 when sorting
- * 'numGroups' is the estimated number of distinct groups
+ * 'groupList' is a list of SortGroupClause's representing the grouping
+ * 'numGroups' is the estimated number of distinct groups in left-hand input
  * 'outputRows' is the estimated number of output rows
+ *
+ * leftpath and rightpath must produce the same columns.  Moreover, if
+ * strategy is SETOP_SORTED, leftpath and rightpath must both be sorted
+ * by all the grouping columns.
  */
 SetOpPath *
 create_setop_path(PlannerInfo *root,
                   RelOptInfo *rel,
-                  Path *subpath,
+                  Path *leftpath,
+                  Path *rightpath,
                   SetOpCmd cmd,
                   SetOpStrategy strategy,
-                  List *distinctList,
-                  AttrNumber flagColIdx,
-                  int firstFlag,
+                  List *groupList,
                   double numGroups,
                   double outputRows)
 {
@@ -3660,34 +3661,37 @@ create_setop_path(PlannerInfo *root,

     pathnode->path.pathtype = T_SetOp;
     pathnode->path.parent = rel;
-    /* SetOp doesn't project, so use source path's pathtarget */
-    pathnode->path.pathtarget = subpath->pathtarget;
+    pathnode->path.pathtarget = rel->reltarget;
     /* For now, assume we are above any joins, so no parameterization */
     pathnode->path.param_info = NULL;
     pathnode->path.parallel_aware = false;
     pathnode->path.parallel_safe = rel->consider_parallel &&
-        subpath->parallel_safe;
-    pathnode->path.parallel_workers = subpath->parallel_workers;
+        leftpath->parallel_safe && rightpath->parallel_safe;
+    pathnode->path.parallel_workers =
+        leftpath->parallel_workers + rightpath->parallel_workers;
     /* SetOp preserves the input sort order if in sort mode */
     pathnode->path.pathkeys =
-        (strategy == SETOP_SORTED) ? subpath->pathkeys : NIL;
+        (strategy == SETOP_SORTED) ? leftpath->pathkeys : NIL;

-    pathnode->subpath = subpath;
+    pathnode->leftpath = leftpath;
+    pathnode->rightpath = rightpath;
     pathnode->cmd = cmd;
     pathnode->strategy = strategy;
-    pathnode->distinctList = distinctList;
-    pathnode->flagColIdx = flagColIdx;
-    pathnode->firstFlag = firstFlag;
+    pathnode->groupList = groupList;
     pathnode->numGroups = numGroups;

     /*
      * Charge one cpu_operator_cost per comparison per input tuple. We assume
      * all columns get compared at most of the tuples.
+     *
+     * XXX all wrong for hashing
      */
-    pathnode->path.disabled_nodes = subpath->disabled_nodes;
-    pathnode->path.startup_cost = subpath->startup_cost;
-    pathnode->path.total_cost = subpath->total_cost +
-        cpu_operator_cost * subpath->rows * list_length(distinctList);
+    pathnode->path.disabled_nodes =
+        leftpath->disabled_nodes + rightpath->disabled_nodes;
+    pathnode->path.startup_cost =
+        leftpath->startup_cost + rightpath->startup_cost;
+    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
+        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c8e6befca8..dee866e96b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -585,6 +585,9 @@ extern void ExecAssignExprContext(EState *estate, PlanState *planstate);
 extern TupleDesc ExecGetResultType(PlanState *planstate);
 extern const TupleTableSlotOps *ExecGetResultSlotOps(PlanState *planstate,
                                                      bool *isfixed);
+extern const TupleTableSlotOps *ExecGetCommonSlotOps(PlanState **planstates,
+                                                     int nplans);
+extern const TupleTableSlotOps *ExecGetCommonChildSlotOps(PlanState *ps);
 extern void ExecAssignProjectionInfo(PlanState *planstate,
                                      TupleDesc inputDesc);
 extern void ExecConditionalAssignProjectionInfo(PlanState *planstate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 79d5e96021..1590b64392 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2803,27 +2803,34 @@ typedef struct HashState
 /* ----------------
  *     SetOpState information
  *
- *        Even in "sorted" mode, SetOp nodes are more complex than a simple
- *        Unique, since we have to count how many duplicates to return.  But
- *        we also support hashing, so this is really more like a cut-down
- *        form of Agg.
+ *        SetOp nodes support either sorted or hashed de-duplication.
+ *        The sorted mode is a bit like MergeJoin, the hashed mode like Agg.
  * ----------------
  */
-/* this struct is private in nodeSetOp.c: */
-typedef struct SetOpStatePerGroupData *SetOpStatePerGroup;
+typedef struct SetOpStatePerInput
+{
+    TupleTableSlot *firstTupleSlot; /* first tuple of current group */
+    int64        numTuples;        /* number of tuples in current group */
+    TupleTableSlot *nextTupleSlot;    /* next input tuple, if already read */
+    bool        needGroup;        /* do we need to load a new group? */
+} SetOpStatePerInput;

 typedef struct SetOpState
 {
     PlanState    ps;                /* its first field is NodeTag */
-    ExprState  *eqfunction;        /* equality comparator */
-    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
-    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     bool        setop_done;        /* indicates completion of output scan */
-    long        numOutput;        /* number of dups left to output */
+    int64        numOutput;        /* number of dups left to output */
+    int            numCols;        /* number of grouping columns */
+
     /* these fields are used in SETOP_SORTED mode: */
-    SetOpStatePerGroup pergroup;    /* per-group working state */
-    HeapTuple    grp_firstTuple; /* copy of first tuple of current group */
+    SortSupport sortKeys;        /* per-grouping-field sort data */
+    SetOpStatePerInput leftInput;    /* current outer-relation input state */
+    SetOpStatePerInput rightInput;    /* current inner-relation input state */
+    bool        need_init;        /* have we read the first tuples yet? */
+
     /* these fields are used in SETOP_HASHED mode: */
+    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
+    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     TupleHashTable hashtable;    /* hash table with one entry per group */
     MemoryContext tableContext; /* memory context containing hash table */
     bool        table_filled;    /* hash table filled yet? */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0759e00e96..58748d2ca6 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2342,13 +2342,12 @@ typedef struct WindowAggPath
 typedef struct SetOpPath
 {
     Path        path;
-    Path       *subpath;        /* path representing input source */
+    Path       *leftpath;        /* paths representing input sources */
+    Path       *rightpath;
     SetOpCmd    cmd;            /* what to do, see nodes.h */
     SetOpStrategy strategy;        /* how to do it, see nodes.h */
-    List       *distinctList;    /* SortGroupClauses identifying target cols */
-    AttrNumber    flagColIdx;        /* where is the flag column, if any */
-    int            firstFlag;        /* flag value for first input relation */
-    Cardinality numGroups;        /* estimated number of groups in input */
+    List       *groupList;        /* SortGroupClauses identifying target cols */
+    Cardinality numGroups;        /* estimated number of groups in left input */
 } SetOpPath;

 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 52f29bcdb6..4633121689 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1225,23 +1225,20 @@ typedef struct SetOp
     /* how to do it, see nodes.h */
     SetOpStrategy strategy;

-    /* number of columns to check for duplicate-ness */
+    /* number of columns to compare */
     int            numCols;

     /* their indexes in the target list */
-    AttrNumber *dupColIdx pg_node_attr(array_size(numCols));
+    AttrNumber *cmpColIdx pg_node_attr(array_size(numCols));

-    /* equality operators to compare with */
-    Oid           *dupOperators pg_node_attr(array_size(numCols));
-    Oid           *dupCollations pg_node_attr(array_size(numCols));
-
-    /* where is the flag column, if any */
-    AttrNumber    flagColIdx;
+    /* comparison operators (either equality operators or sort operators) */
+    Oid           *cmpOperators pg_node_attr(array_size(numCols));
+    Oid           *cmpCollations pg_node_attr(array_size(numCols));

-    /* flag value for first input relation */
-    int            firstFlag;
+    /* nulls-first flags if sorting, otherwise not interesting */
+    bool       *cmpNullsFirst pg_node_attr(array_size(numCols));

-    /* estimated number of groups in input */
+    /* estimated number of groups in left input */
     long        numGroups;
 } SetOp;

diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 1035e6560c..5a6d0350c1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -260,12 +260,11 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
                                             bool topwindow);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
                                     RelOptInfo *rel,
-                                    Path *subpath,
+                                    Path *leftpath,
+                                    Path *rightpath,
                                     SetOpCmd cmd,
                                     SetOpStrategy strategy,
-                                    List *distinctList,
-                                    AttrNumber flagColIdx,
-                                    int firstFlag,
+                                    List *groupList,
                                     double numGroups,
                                     double outputRows);
 extern RecursiveUnionPath *create_recursiveunion_path(PlannerInfo *root,
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 857ba1c40c..b4cd204de7 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1231,22 +1231,18 @@ select count(*) from
     select * from onek i2 where i2.unique1 = o.unique2
   ) ss
 where o.ten = 1;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
                Filter: (ten = 1)
-         ->  Subquery Scan on ss
-               ->  HashSetOp Except
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Scan using onek_unique1 on onek i1
-                                       Index Cond: (unique1 = o.unique1)
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Index Scan using onek_unique1 on onek i2
-                                       Index Cond: (unique1 = o.unique2)
-(13 rows)
+         ->  HashSetOp Except
+               ->  Index Scan using onek_unique1 on onek i1
+                     Index Cond: (unique1 = o.unique1)
+               ->  Index Scan using onek_unique1 on onek i2
+                     Index Cond: (unique1 = o.unique2)
+(9 rows)

 select count(*) from
   onek o cross join lateral (
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index c73631a9a1..17a0dea719 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -370,17 +370,13 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                     QUERY PLAN
-------------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  HashSetOp Intersect
-               ->  Append
-                     ->  Subquery Scan on "*SELECT* 2"
-                           ->  Seq Scan on tenk1
-                     ->  Subquery Scan on "*SELECT* 1"
-                           ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+   ->  HashSetOp Intersect
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(4 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -391,16 +387,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Index Only Scan using tenk1_unique1 on tenk1
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                     Filter: (unique2 <> 10)
-(7 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -408,6 +401,34 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+-- the hashed implementation is sensitive to child plans' tuple slot types
+explain (costs off)
+select * from int8_tbl intersect select q2, q1 from int8_tbl order by 1, 2;
+                 QUERY PLAN
+---------------------------------------------
+ Sort
+   Sort Key: int8_tbl.q1, int8_tbl.q2
+   ->  HashSetOp Intersect
+         ->  Seq Scan on int8_tbl
+         ->  Seq Scan on int8_tbl int8_tbl_1
+(5 rows)
+
+select * from int8_tbl intersect select q2, q1 from int8_tbl order by 1, 2;
+        q1        |        q2
+------------------+------------------
+              123 | 4567890123456789
+ 4567890123456789 |              123
+ 4567890123456789 | 4567890123456789
+(3 rows)
+
+select q2, q1 from int8_tbl intersect select * from int8_tbl order by 1, 2;
+        q2        |        q1
+------------------+------------------
+              123 | 4567890123456789
+ 4567890123456789 |              123
+ 4567890123456789 | 4567890123456789
+(3 rows)
+
 set enable_hashagg to off;
 explain (costs off)
 select count(*) from
@@ -434,19 +455,17 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                        QUERY PLAN
-------------------------------------------------------------------------------------------
+                               QUERY PLAN
+------------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  SetOp Intersect
-               ->  Sort
-                     Sort Key: "*SELECT* 2".fivethous
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Seq Scan on tenk1
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(10 rows)
+   ->  SetOp Intersect
+         ->  Sort
+               Sort Key: tenk1.fivethous
+               ->  Seq Scan on tenk1
+         ->  Sort
+               Sort Key: tenk1_1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(8 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -457,18 +476,17 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".unique1
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Index Only Scan using tenk1_unique1 on tenk1
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                           Filter: (unique2 <> 10)
-(9 rows)
+         Sort Key: tenk1.unique1
+         ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Sort
+         Sort Key: tenk1_1.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+               Filter: (unique2 <> 10)
+(8 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -528,15 +546,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -546,15 +561,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -606,17 +618,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -626,17 +637,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
    x
@@ -669,17 +679,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -689,17 +698,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -777,17 +785,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -797,17 +804,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -970,15 +976,12 @@ set enable_sort = false;
 -- no means to disable Unique.
 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union all select from generate_series(1,3);
 --
@@ -1015,15 +1018,12 @@ select from generate_series(1,5) union select from generate_series(1,3);

 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  SetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union select from generate_series(1,3);
 --
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index f8826514e4..e34e8a2d7e 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -138,6 +138,12 @@ explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+-- the hashed implementation is sensitive to child plans' tuple slot types
+explain (costs off)
+select * from int8_tbl intersect select q2, q1 from int8_tbl order by 1, 2;
+select * from int8_tbl intersect select q2, q1 from int8_tbl order by 1, 2;
+select q2, q1 from int8_tbl intersect select * from int8_tbl order by 1, 2;
+
 set enable_hashagg to off;

 explain (costs off)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 971a150dbb..fbdb932e6b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2615,6 +2615,8 @@ SetOpCmd
 SetOpPath
 SetOpState
 SetOpStatePerGroup
+SetOpStatePerGroupData
+SetOpStatePerInput
 SetOpStrategy
 SetOperation
 SetOperationStmt
--
2.43.5

From d0518581fd6d27cd11231c0078e52fef081f7570 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 19 Dec 2024 12:26:16 -0500
Subject: [PATCH v5 2/6] Remove some dead code in prepunion.c.

Now that we no longer use flag columns for SetOp, remove the
code that supported that.

This also means that there are no resjunk output columns anymore
in a set-operations plan tree.  So we can get rid of some code
that supported that, too.

It would probably be possible now to do what's speculated about in
the comments, and get rid of this code's use of transiently-created
targetlists in favor of dealing only in pathtargets (save for
generating root->processed_tlist at the very end).  However, that
would involve a good deal of notational churn and the benefit
would only be to save a little planning time.  So I'll leave that
project for some other day.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
 src/backend/optimizer/prep/prepunion.c | 96 ++++----------------------
 1 file changed, 14 insertions(+), 82 deletions(-)
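
Reviewer note (not part of the commit): the flag column existed only so
that a single appended input stream could say which side each row came
from.  Once the node reads its two inputs separately, that information is
implicit.  The following is a toy sketch of a sorted-mode INTERSECT over
two pre-sorted int arrays; it is not the nodeSetOp.c implementation, and
intersect_sorted and its parameters are hypothetical names made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of sorted-mode INTERSECT (without ALL) over two inputs that are
 * read separately.  Both arrays must already be sorted ascending.  Each
 * value present in both inputs is written once to "out", which must have
 * room for at least Min(nleft, nright) entries.  Returns the number of
 * values written.
 *
 * Hypothetical illustration only; not PostgreSQL code.
 */
size_t
intersect_sorted(const int *left, size_t nleft,
                 const int *right, size_t nright,
                 int *out)
{
    size_t      l = 0;
    size_t      r = 0;
    size_t      nout = 0;

    assert(out != NULL);

    while (l < nleft && r < nright)
    {
        if (left[l] < right[r])
            l++;                /* value seen only on the left so far */
        else if (left[l] > right[r])
            r++;                /* value seen only on the right so far */
        else
        {
            int         v = left[l];

            /* Present on both sides: emit once, then skip duplicates. */
            out[nout++] = v;
            while (l < nleft && left[l] == v)
                l++;
            while (r < nright && right[r] == v)
                r++;
        }
    }
    return nout;
}
```

Which side a row came from is known from which array index advanced, so
no per-row marker (and no projection step to attach one) is required.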

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1c43370005..8d70a4b61a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -40,8 +40,7 @@

 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
                                           List *colTypes, List *colCollations,
-                                          bool junkOK,
-                                          int flag, List *refnames_tlist,
+                                          List *refnames_tlist,
                                           List **pTargetList,
                                           bool *istrivial_tlist);
 static RelOptInfo *generate_recursion_path(SetOperationStmt *setOp,
@@ -69,14 +68,12 @@ static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
-                                  int flag,
                                   Index varno,
                                   bool hack_constants,
                                   List *input_tlist,
                                   List *refnames_tlist,
                                   bool *trivial_tlist);
 static List *generate_append_tlist(List *colTypes, List *colCollations,
-                                   bool flag,
                                    List *input_tlists,
                                    List *refnames_tlist);
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
@@ -160,12 +157,10 @@ plan_set_operations(PlannerInfo *root)
         /*
          * Recurse on setOperations tree to generate paths for set ops. The
          * final output paths should have just the column types shown as the
-         * output from the top-level node, plus possibly resjunk working
-         * columns (we can rely on upper-level nodes to deal with that).
+         * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
                                            topop->colTypes, topop->colCollations,
-                                           true, -1,
                                            leftmostQuery->targetList,
                                            &top_tlist,
                                            &trivial_tlist);
@@ -208,8 +203,6 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * junkOK: if true, child resjunk columns may be left in the result
- * flag: if >= 0, add a resjunk output column indicating value of flag
  * refnames_tlist: targetlist to take column names from
  *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
@@ -220,7 +213,8 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
- * Pending a redesign of how that works, this is the easy way out.
+ * XXX Now that there are no flag columns and hence no resjunk columns, we
+ * could probably refactor this file to deal only in pathtargets.
  *
  * We don't have to care about typmods here: the only allowed difference
  * between set-op input and output typmods is input is a specific typmod
@@ -229,8 +223,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
                        List *colTypes, List *colCollations,
-                       bool junkOK,
-                       int flag, List *refnames_tlist,
+                       List *refnames_tlist,
                        List **pTargetList,
                        bool *istrivial_tlist)
 {
@@ -279,7 +272,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,

         /* Figure out the appropriate target list for this subquery. */
         tlist = generate_setop_tlist(colTypes, colCollations,
-                                     flag,
                                      rtr->rtindex,
                                      true,
                                      subroot->processed_tlist,
@@ -318,16 +310,14 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * generate_append_tlist() or generate_setop_tlist(), this will work.
          * We just tell generate_setop_tlist() to use varno 0.
          */
-        if (flag >= 0 ||
-            !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
-            !tlist_same_collations(*pTargetList, colCollations, junkOK))
+        if (!tlist_same_datatypes(*pTargetList, colTypes, false) ||
+            !tlist_same_collations(*pTargetList, colCollations, false))
         {
             PathTarget *target;
             bool        trivial_tlist;
             ListCell   *lc;

             *pTargetList = generate_setop_tlist(colTypes, colCollations,
-                                                flag,
                                                 0,
                                                 false,
                                                 *pTargetList,
@@ -411,7 +401,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      */
     lrel = recurse_set_operations(setOp->larg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -423,7 +412,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -436,7 +424,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /*
      * Generate tlist for RecursiveUnion path node --- same as in Append cases
      */
-    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations, false,
+    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations,
                                   list_make2(lpath_tlist, rpath_tlist),
                                   refnames_tlist);

@@ -736,7 +724,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, false,
+    tlist = generate_append_tlist(op->colTypes, op->colCollations,
                                   tlist_list, refnames_tlist);
     *pTargetList = tlist;

@@ -1048,7 +1036,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,7 +1047,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1109,7 +1095,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
                                  0, false, lpath_tlist, refnames_tlist,
                                  &result_trivial_tlist);

@@ -1258,18 +1244,10 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
-         *
-         * Note we disallow any resjunk columns in child results.  This is
-         * necessary since the Append node that implements the union won't do
-         * any projection, and upper levels will get confused if some of our
-         * output tuples have junk and some don't.  This case only arises when
-         * we have an EXCEPT or INTERSECT as child, else there won't be
-         * resjunk anyway.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
-                                                        false, -1,
                                                         refnames_tlist,
                                                         &child_tlist,
                                                         &trivial_tlist));
@@ -1412,7 +1390,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: -1 if no flag column needed, 0 or 1 to create a const flag column
  * varno: varno to use in generated Vars
  * hack_constants: true to copy up constants (see comments in code)
  * input_tlist: targetlist of this node's input node
@@ -1421,7 +1398,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  */
 static List *
 generate_setop_tlist(List *colTypes, List *colCollations,
-                     int flag,
                      Index varno,
                      bool hack_constants,
                      List *input_tlist,
@@ -1520,7 +1496,7 @@ generate_setop_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1529,25 +1505,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag >= 0)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is the given constant */
-        expr = (Node *) makeConst(INT4OID,
-                                  -1,
-                                  InvalidOid,
-                                  sizeof(int32),
-                                  Int32GetDatum(flag),
-                                  false,
-                                  true);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-        *trivial_tlist = false; /* the extra entry makes it not trivial */
-    }
-
     return tlist;
 }

@@ -1556,7 +1513,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: true to create a flag column copied up from subplans
  * input_tlists: list of tlists for sub-plans of the Append
  * refnames_tlist: targetlist to take column names from
  *
@@ -1570,7 +1526,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  */
 static List *
 generate_append_tlist(List *colTypes, List *colCollations,
-                      bool flag,
                       List *input_tlists,
                       List *refnames_tlist)
 {
@@ -1604,8 +1559,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
         {
             TargetEntry *subtle = (TargetEntry *) lfirst(subtlistl);

-            if (subtle->resjunk)
-                continue;
+            Assert(!subtle->resjunk);
             Assert(curColType != NULL);
             if (exprType((Node *) subtle->expr) == lfirst_oid(curColType))
             {
@@ -1654,7 +1608,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1663,23 +1617,6 @@ generate_append_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is shown as copied up from subplan */
-        expr = (Node *) makeVar(0,
-                                resno,
-                                INT4OID,
-                                -1,
-                                InvalidOid,
-                                0);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-    }
-
     pfree(colTypmods);

     return tlist;
@@ -1709,12 +1646,7 @@ generate_setop_grouplist(SetOperationStmt *op, List *targetlist)
         TargetEntry *tle = (TargetEntry *) lfirst(lt);
         SortGroupClause *sgc;

-        if (tle->resjunk)
-        {
-            /* resjunk columns should not have sortgrouprefs */
-            Assert(tle->ressortgroupref == 0);
-            continue;            /* ignore resjunk columns */
-        }
+        Assert(!tle->resjunk);

         /* non-resjunk columns should have sortgroupref = resno */
         Assert(tle->ressortgroupref == tle->resno);
--
2.43.5

From e70dff18a03e7ac00129cb4333ef8ca85feecbc3 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 19 Dec 2024 12:26:34 -0500
Subject: [PATCH v5 3/6] Get rid of choose_hashed_setop().

This logic is a hangover from days when we could only generate
one plan per setop node.  Now, it's better to generate multiple
Paths and let add_path choose the cheapest.  The main advantage
is not having to keep choose_hashed_setop in perfect sync with
cost estimates made elsewhere --- and indeed, it looks like it
was out of sync already.  Not to mention that it doesn't account
for presorted input, which doesn't matter too much right now
but will soon.

This does require fixing create_setop_path to generate good cost
estimates.  That was another area where we'd not maintained things
well, because it wasn't distinguishing sorted from hashed costs at
all.  With the formulas I chose, the total cost attributed to the node
is actually the same for sorted and hashed modes --- but the startup
cost is wildly different, because hashed mode does basically all the
work before returning anything.

I also modified the existing logic that disables hashed setops
so that it works by incrementing path->disabled_nodes rather
than not generating the path at all.  That seems more in keeping
with the methods established by commit e22253467.  (But should
we put that logic in create_setop_path, rather than its caller?)

This results in two changes in plans that are visible in the
regression tests.  The one in union.sql seems fine: it's now
preferring sorted over hashed mode for a zero-column INTERSECT,
since no sort is really required.  The one in subselect.sql
is trying to test nodeSetOp's rescan logic, so I extended it
to test both sorted and hashed modes.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
 src/backend/optimizer/prep/prepunion.c  | 226 +++++++++---------------
 src/backend/optimizer/util/pathnode.c   |  50 +++++-
 src/test/regress/expected/subselect.out |  48 ++++-
 src/test/regress/expected/union.out     |   2 +-
 src/test/regress/sql/subselect.sql      |  32 +++-
 5 files changed, 200 insertions(+), 158 deletions(-)
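
Reviewer note (not part of the commit): a minimal sketch of the "build
both candidates and let the comparison decide" idea described above.
ToyPath, make_hashed_path, make_sorted_path, and choose_path are
hypothetical stand-ins, and the cost arithmetic is a deliberately
simplified assumption rather than the formulas in create_setop_path.
The real add_path also weighs startup cost, pathkeys, and more; this
only shows that disabled_nodes is compared before cost, and that hashed
mode front-loads its work into startup cost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the planner's Path struct. */
typedef struct ToyPath
{
    int         disabled_nodes; /* number of disabled nodes in the path */
    double      startup_cost;   /* cost expended before the first row */
    double      total_cost;     /* cost to return all rows */
} ToyPath;

/* Would ngroups entries of entry_size bytes exceed mem_limit bytes? */
bool
hash_table_too_big(size_t entry_size, double ngroups, size_t mem_limit)
{
    return (double) entry_size * ngroups > (double) mem_limit;
}

/*
 * Hashed candidate.  Assumption: both inputs are fully consumed before any
 * row is returned, so the whole input cost is charged as startup cost.
 * The path is still built when hashing is unwanted; it is only marked.
 */
ToyPath
make_hashed_path(double input_total_cost, double output_cost,
                 bool hashing_enabled, bool too_big)
{
    ToyPath     p;

    p.disabled_nodes = 0;
    p.startup_cost = input_total_cost;
    p.total_cost = input_total_cost + output_cost;
    if (!hashing_enabled)
        p.disabled_nodes++;
    if (too_big)
        p.disabled_nodes++;
    return p;
}

/* Sorted candidate: rows can be returned while inputs are still being read. */
ToyPath
make_sorted_path(double input_startup_cost, double input_total_cost,
                 double output_cost)
{
    ToyPath     p;

    p.disabled_nodes = 0;
    p.startup_cost = input_startup_cost;
    p.total_cost = input_total_cost + output_cost;
    return p;
}

/* Fewer disabled nodes wins; ties are broken by lower total cost. */
const ToyPath *
choose_path(const ToyPath *a, const ToyPath *b)
{
    assert(a != NULL && b != NULL);

    if (a->disabled_nodes != b->disabled_nodes)
        return (a->disabled_nodes < b->disabled_nodes) ? a : b;
    return (a->total_cost <= b->total_cost) ? a : b;
}
```

With this shape, a hashed path that is disabled can still be chosen when
it is the only candidate, which the old "don't generate it at all"
approach could not express.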

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 8d70a4b61a..23bb335354 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -63,10 +63,6 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **tlist_list,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
-static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *lpath, Path *rpath,
-                                double dNumGroups, double dNumOutputRows,
-                                const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
                                   Index varno,
                                   bool hack_constants,
@@ -1025,7 +1021,8 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
-    bool        use_hash;
+    bool        can_sort;
+    bool        can_hash;
     SetOpCmd    cmd;

     /*
@@ -1130,15 +1127,76 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }
+    result_rel->rows = dNumOutputRows;
+
+    /* Select the SetOpCmd type */
+    switch (op->op)
+    {
+        case SETOP_INTERSECT:
+            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
+            break;
+        case SETOP_EXCEPT:
+            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
+            break;
+        default:
+            elog(ERROR, "unrecognized set op: %d", (int) op->op);
+            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
+            break;
+    }
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));

     /*
-     * Decide whether to hash or sort, and add sort nodes if needed.
+     * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
-    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
-                                   dNumGroups, dNumOutputRows,
-                                   (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");
+    if (can_hash)
+    {
+        Size        hash_mem_limit = get_hash_memory_limit();
+        Size        hashentrysize;
+
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_HASHED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);

-    if (groupList && !use_hash)
+        /*
+         * Mark the path as disabled if enable_hashagg is off.  While this
+         * isn't exactly a HashAgg node, it seems close enough to justify
+         * letting that switch control it.
+         */
+        if (!enable_hashagg)
+            path->disabled_nodes++;
+
+        /*
+         * Also disable if it doesn't look like the hashtable will fit into
+         * hash_mem.
+         */
+        hashentrysize = MAXALIGN(lpath->pathtarget->width) +
+            MAXALIGN(SizeofMinimalTupleHeader);
+        if (hashentrysize * dNumGroups > hash_mem_limit)
+            path->disabled_nodes++;
+
+        add_path(result_rel, path);
+    }
+
+    /*
+     * If we can sort, sort the inputs if needed before adding SetOp.
+     */
+    if (can_sort)
     {
         List       *pathkeys;

@@ -1160,36 +1218,19 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                               rpath,
                                               pathkeys,
                                               -1.0);
-    }

-    /*
-     * Finally, add a SetOp path node to generate the correct output.
-     */
-    switch (op->op)
-    {
-        case SETOP_INTERSECT:
-            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
-            break;
-        case SETOP_EXCEPT:
-            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
-            break;
-        default:
-            elog(ERROR, "unrecognized set op: %d", (int) op->op);
-            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
-            break;
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_SORTED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);
+        add_path(result_rel, path);
     }
-    path = (Path *) create_setop_path(root,
-                                      result_rel,
-                                      lpath,
-                                      rpath,
-                                      cmd,
-                                      use_hash ? SETOP_HASHED : SETOP_SORTED,
-                                      groupList,
-                                      dNumGroups,
-                                      dNumOutputRows);
-
-    result_rel->rows = path->rows;
-    add_path(result_rel, path);
+
     return result_rel;
 }

@@ -1276,115 +1317,6 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)
     set_cheapest(rel);
 }

-/*
- * choose_hashed_setop - should we use hashing for a set operation?
- *
- * XXX probably this should go away: just make both paths and let
- * add_path sort it out.
- */
-static bool
-choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *lpath, Path *rpath,
-                    double dNumGroups, double dNumOutputRows,
-                    const char *construct)
-{
-    int            numGroupCols = list_length(groupClauses);
-    Size        hash_mem_limit = get_hash_memory_limit();
-    bool        can_sort;
-    bool        can_hash;
-    Size        hashentrysize;
-    Path        hashed_p;
-    Path        sorted_p;
-    double        tuple_fraction;
-
-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupClauses);
-    can_hash = grouping_is_hashable(groupClauses);
-    if (can_hash && can_sort)
-    {
-        /* we have a meaningful choice to make, continue ... */
-    }
-    else if (can_hash)
-        return true;
-    else if (can_sort)
-        return false;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is UNION, INTERSECT, or EXCEPT */
-                 errmsg("could not implement %s", construct),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
-    /* Prefer sorting when enable_hashagg is off */
-    if (!enable_hashagg)
-        return false;
-
-    /*
-     * Don't do it if it doesn't look like the hashtable will fit into
-     * hash_mem.
-     */
-    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
-
-    if (hashentrysize * dNumGroups > hash_mem_limit)
-        return false;
-
-    /*
-     * See if the estimated cost is no more than doing it the other way.
-     *
-     * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group. XXX NOT TRUE: Note that the actual result plan might involve a
-     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
-     * and Group should be close enough for our purposes here.
-     *
-     * These path variables are dummies that just hold cost fields; we don't
-     * make actual Paths for these steps.
-     */
-    cost_agg(&hashed_p, root, AGG_HASHED, NULL,
-             numGroupCols, dNumGroups,
-             NIL,
-             lpath->disabled_nodes + rpath->disabled_nodes,
-             lpath->startup_cost + rpath->startup_cost,
-             lpath->total_cost + rpath->total_cost,
-             lpath->rows + rpath->rows,
-             lpath->pathtarget->width);
-
-    /*
-     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
-     * unsorted, since it was made by appending unrelated sub-relations
-     * together.
-     */
-    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
-    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
-    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
-    /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-    cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
-              sorted_p.total_cost,
-              lpath->rows + rpath->rows,
-              lpath->pathtarget->width,
-              0.0, work_mem, -1.0);
-    cost_group(&sorted_p, root, numGroupCols, dNumGroups,
-               NIL,
-               sorted_p.disabled_nodes,
-               sorted_p.startup_cost, sorted_p.total_cost,
-               lpath->rows + rpath->rows);
-
-    /*
-     * Now make the decision using the top-level tuple fraction.  First we
-     * have to convert an absolute count (LIMIT) into fractional form.
-     */
-    tuple_fraction = root->tuple_fraction;
-    if (tuple_fraction >= 1.0)
-        tuple_fraction /= dNumOutputRows;
-
-    if (compare_fractional_path_costs(&hashed_p, &sorted_p,
-                                      tuple_fraction) < 0)
-    {
-        /* Hashed is cheaper, so use it */
-        return true;
-    }
-    return false;
-}
-
 /*
  * Generate targetlist for a set-operation plan node
  *
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e52e4b1d67..d7422f2dd8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3681,17 +3681,51 @@ create_setop_path(PlannerInfo *root,
     pathnode->numGroups = numGroups;

     /*
-     * Charge one cpu_operator_cost per comparison per input tuple. We assume
-     * all columns get compared at most of the tuples.
-     *
-     * XXX all wrong for hashing
+     * Compute cost estimates.  As things stand, we end up with the same total
+     * cost in this node for sort and hash methods, but different startup
+     * costs.  This could be refined perhaps, but it'll do for now.
      */
     pathnode->path.disabled_nodes =
         leftpath->disabled_nodes + rightpath->disabled_nodes;
-    pathnode->path.startup_cost =
-        leftpath->startup_cost + rightpath->startup_cost;
-    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
-        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+    if (strategy == SETOP_SORTED)
+    {
+        /*
+         * In sorted mode, we can emit output incrementally.  Charge one
+         * cpu_operator_cost per comparison per input tuple.  Like cost_group,
+         * we assume all columns get compared at most of the tuples.
+         */
+        pathnode->path.startup_cost =
+            leftpath->startup_cost + rightpath->startup_cost;
+        pathnode->path.total_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
+    else
+    {
+        /*
+         * In hashed mode, we must read all the input before we can emit
+         * anything.  Also charge comparison costs to represent the cost of
+         * hash table lookups.
+         */
+        pathnode->path.startup_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+        pathnode->path.total_cost = pathnode->path.startup_cost;
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index b4cd204de7..ebc545e246 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1221,8 +1221,10 @@ where o.ten = 0;
 (1 row)

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -1256,6 +1258,50 @@ where o.ten = 1;
    100
 (1 row)

+rollback;
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+                                                                                                     QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  SetOp Except
+               ->  Sort
+                     Sort Key: i1.unique1, i1.unique2, i1.two, i1.four, i1.ten, i1.twenty, i1.hundred, i1.thousand, i1.twothousand, i1.fivethous, i1.tenthous, i1.odd, i1.even, i1.stringu1, i1.stringu2, i1.string4
+                     ->  Index Scan using onek_unique1 on onek i1
+                           Index Cond: (unique1 = o.unique1)
+               ->  Sort
+                     Sort Key: i2.unique1, i2.unique2, i2.two, i2.four, i2.ten, i2.twenty, i2.hundred, i2.thousand, i2.twothousand, i2.fivethous, i2.tenthous, i2.odd, i2.even, i2.stringu1, i2.stringu2, i2.string4
+                     ->  Index Scan using onek_unique1 on onek i2
+                           Index Cond: (unique1 = o.unique2)
+(13 rows)
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+ count
+-------
+   100
+(1 row)
+
+rollback;
 --
 -- Test rescan of a RecursiveUnion node
 --
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 17a0dea719..6fa957ca4a 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -978,7 +978,7 @@ explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
                         QUERY PLAN
 ----------------------------------------------------------
- HashSetOp Intersect
+ SetOp Intersect
    ->  Function Scan on generate_series
    ->  Function Scan on generate_series generate_series_1
 (3 rows)
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index a6ab42454f..6ed3636a9e 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -638,8 +638,11 @@ select sum(ss.tst::int) from
 where o.ten = 0;

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
+
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -657,6 +660,33 @@ select count(*) from
   ) ss
 where o.ten = 1;

+rollback;
+
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+rollback;
+
 --
 -- Test rescan of a RecursiveUnion node
 --
--
2.43.5
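
Reader's note on the create_setop_path costing in the patch above: the
two strategies charge the same per-comparison and per-output-tuple
amounts and differ only in where the startup cost lands.  The following
is a minimal standalone sketch of that arithmetic, not the planner code
itself.  ToyCost, ToySetOpStrategy and toy_setop_cost are hypothetical
names invented for illustration, and cpu_operator_cost is passed as a
parameter rather than read from a GUC.

```c
#include <assert.h>

/* Hypothetical stand-in for the cost fields of a planner Path. */
typedef struct ToyCost
{
    double      startup_cost;
    double      total_cost;
    double      rows;
} ToyCost;

/* Hypothetical mirror of the two SetOp strategies. */
typedef enum ToySetOpStrategy
{
    TOY_SETOP_SORTED,
    TOY_SETOP_HASHED
} ToySetOpStrategy;

/*
 * Mirror of the cost arithmetic in the patched create_setop_path.
 * Both strategies pay one operator cost per grouping column per input
 * tuple, plus one operator cost per output tuple.
 */
ToyCost
toy_setop_cost(ToySetOpStrategy strategy, ToyCost left, ToyCost right,
               int num_group_cols, double output_rows,
               double cpu_operator_cost)
{
    ToyCost     result;
    double      cmp_cost;

    assert(num_group_cols >= 0);

    cmp_cost = cpu_operator_cost * (left.rows + right.rows) * num_group_cols;

    if (strategy == TOY_SETOP_SORTED)
    {
        /* Sorted mode can emit output incrementally. */
        result.startup_cost = left.startup_cost + right.startup_cost;
        result.total_cost = left.total_cost + right.total_cost + cmp_cost;
    }
    else
    {
        /* Hashed mode must read all input before emitting anything. */
        result.startup_cost = left.total_cost + right.total_cost + cmp_cost;
        result.total_cost = result.startup_cost;
    }

    /* Small per-output-tuple charge, same for both strategies. */
    result.total_cost += cpu_operator_cost * output_rows;
    result.rows = output_rows;
    return result;
}
```

Fed identical inputs, the two calls return the same total cost and
differ only in startup cost, which is what the new comment in the hunk
says.  In a real plan the sorted strategy's inputs would normally carry
their own sort costs, so the totals seen by add_path will still differ.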

From ce7d49ee9865c44e6da5116c4bbc1643ada873e3 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 19 Dec 2024 12:26:57 -0500
Subject: [PATCH v5 4/6] Fix bogus decisions about whether we want pre-sorted
 child paths.

As things stand, recurse_set_operations unconditionally passes down
root->parse->setOperations to subquery_planner, which then applies
set_operation_ordered_results_useful to it to decide whether we care
about producing pre-sorted paths for the subquery.

This is flat wrong.

The main problem is that root->parse->setOperations is not necessarily
the relevant ancestor node.  In the test case added here, we have
UNION ALL atop a child UNION, and the existing logic fails to
recognize that it could use pre-sorted paths for the child UNION step.

A lesser problem is that set_operation_ordered_results_useful is
inadequate anyway: it will falsely claim that we could use pre-sorted
paths for a recursive UNION.

A better way to do this is for each caller of recurse_set_operations
to pass down the relevant parent node.  Then we can further specify
that the caller should pass NULL if it has no use for pre-sorted
paths, and then we can get rid of set_operation_ordered_results_useful
altogether, eliminating the need to keep it in sync with the code that
is actually doing the work.

Note: this patch makes generate_nonunion_paths pass down the non-UNION
node as parent.  generate_nonunion_paths doesn't yet do anything with
the pre-sorted paths that will result, so no regression test plans
change.  If we were going to leave it like this, we'd pass down NULL
instead.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
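Reader's note (below the "---", so not part of the commit message): the
rule for what each caller passes as parentOp can be summarized in a
small standalone sketch.  This is an illustration under assumptions, not
code from the patch.  ToySetOp and toy_parent_for_child are hypothetical
names, and the "recursive" flag is invented here for brevity; the real
planner does not keep such a flag in SetOperationStmt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for SetOperationStmt. */
typedef enum ToySetOpKind
{
    TOY_UNION,
    TOY_INTERSECT,
    TOY_EXCEPT
} ToySetOpKind;

typedef struct ToySetOp
{
    ToySetOpKind op;
    bool        all;        /* UNION ALL, INTERSECT ALL, EXCEPT ALL */
    bool        recursive;  /* invented flag: step of a recursive UNION */
} ToySetOp;

/*
 * Return the node a caller would pass as parentOp when recursing into a
 * child of 'parent', or NULL if sorted child output is of no interest.
 */
const ToySetOp *
toy_parent_for_child(const ToySetOp *parent)
{
    if (parent == NULL)
        return NULL;        /* top-level call: no parent */
    if (parent->recursive)
        return NULL;        /* recursive UNION: no value in sorted results */
    if (parent->op == TOY_UNION)
        return parent->all ? NULL : parent;     /* UNION ALL doesn't care */

    /* INTERSECT and EXCEPT pass themselves down, with or without ALL. */
    assert(parent->op == TOY_INTERSECT || parent->op == TOY_EXCEPT);
    return parent;
}
```

For the test query added by this patch, UNION ALL atop a child UNION,
the leaf subqueries get the inner UNION as their effective parent, so
they are asked for sorted paths even though the topmost node is a UNION
ALL.
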
 src/backend/optimizer/plan/planner.c   |  7 ++-
 src/backend/optimizer/prep/prepunion.c | 65 +++++++++++++-------------
 src/include/optimizer/prep.h           |  1 -
 src/test/regress/expected/union.out    | 14 ++++++
 src/test/regress/sql/union.sql         |  4 ++
 5 files changed, 54 insertions(+), 37 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f3856c519f..7468961b01 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -616,7 +616,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Basically, this routine does the stuff that should only be done once
  * per Query object.  It then calls grouping_planner.  At one time,
@@ -1350,7 +1350,7 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Returns nothing; the useful output is in the Paths we attach to the
  * (UPPERREL_FINAL, NULL) upperrel in *root.  In addition,
@@ -3467,8 +3467,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
                                       tlist);

     /* setting setop_pathkeys might be useful to the union planner */
-    if (qp_extra->setop != NULL &&
-        set_operation_ordered_results_useful(qp_extra->setop))
+    if (qp_extra->setop != NULL)
     {
         List       *groupClauses;
         bool        sortable;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 23bb335354..e92a088334 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -39,6 +39,7 @@


 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
+                                          SetOperationStmt *parentOp,
                                           List *colTypes, List *colCollations,
                                           List *refnames_tlist,
                                           List **pTargetList,
@@ -156,6 +157,7 @@ plan_set_operations(PlannerInfo *root)
          * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
+                                           NULL,    /* no parent */
                                            topop->colTypes, topop->colCollations,
                                            leftmostQuery->targetList,
                                            &top_tlist,
@@ -168,44 +170,31 @@ plan_set_operations(PlannerInfo *root)
     return setop_rel;
 }

-/*
- * set_operation_ordered_results_useful
- *        Return true if the given SetOperationStmt can be executed by utilizing
- *        paths that provide sorted input according to the setop's targetlist.
- *        Returns false when sorted paths are not any more useful than unsorted
- *        ones.
- */
-bool
-set_operation_ordered_results_useful(SetOperationStmt *setop)
-{
-    /*
-     * Paths sorted by the targetlist are useful for UNION as we can opt to
-     * MergeAppend the sorted paths then Unique them.  Ordered paths are no
-     * more useful than unordered ones for UNION ALL.
-     */
-    if (!setop->all && setop->op == SETOP_UNION)
-        return true;
-
-    /*
-     * EXCEPT / EXCEPT ALL / INTERSECT / INTERSECT ALL cannot yet utilize
-     * correctly sorted input paths.
-     */
-    return false;
-}
-
 /*
  * recurse_set_operations
  *      Recursively handle one step in a tree of set operations
  *
+ * setOp: current step (could be a SetOperationStmt or a leaf RangeTblRef)
+ * parentOp: parent step, or NULL if none (but see below)
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
  * refnames_tlist: targetlist to take column names from
  *
+ * parentOp should be passed as NULL unless that step is interested in
+ * getting sorted output from this step.  ("Sorted" means "sorted according
+ * to the default btree opclasses of the result column datatypes".)
+ *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
  * *pTargetList: receives the fully-fledged tlist for the subtree's top plan
  * *istrivial_tlist: true if, and only if, datatypes between parent and child
  * match.
  *
+ * If setOp is a leaf node, this function plans the sub-query but does
+ * not populate the pathlist of the returned RelOptInfo.  The caller will
+ * generate SubqueryScan paths using useful path(s) of the subquery (see
+ * build_setop_child_paths).  But this function does build the paths for
+ * set-operation nodes.
+ *
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
@@ -218,6 +207,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  */
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
+                       SetOperationStmt *parentOp,
                        List *colTypes, List *colCollations,
                        List *refnames_tlist,
                        List **pTargetList,
@@ -234,7 +224,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
     {
         RangeTblRef *rtr = (RangeTblRef *) setOp;
         RangeTblEntry *rte = root->simple_rte_array[rtr->rtindex];
-        SetOperationStmt *setops;
         Query       *subquery = rte->subquery;
         PlannerInfo *subroot;
         List       *tlist;
@@ -249,15 +238,13 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
         Assert(root->plan_params == NIL);

         /*
-         * Pass the set operation details to the subquery_planner to have it
-         * consider generating Paths correctly ordered for the set operation.
+         * Generate a subroot and Paths for the subquery.  If we have a
+         * parentOp, pass that down to encourage subquery_planner to consider
+         * suitably-sorted Paths.
          */
-        setops = castNode(SetOperationStmt, root->parse->setOperations);
-
-        /* Generate a subroot and Paths for the subquery */
         subroot = rel->subroot = subquery_planner(root->glob, subquery, root,
                                                   false, root->tuple_fraction,
-                                                  setops);
+                                                  parentOp);

         /*
          * It should not be possible for the primitive query to contain any
@@ -396,6 +383,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      * separately without any intention of combining them into one Append.
      */
     lrel = recurse_set_operations(setOp->larg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -407,6 +395,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /* The right path will want to look at the left one ... */
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -479,6 +468,10 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
  * build_setop_child_paths
  *        Build paths for the set op child relation denoted by 'rel'.
  *
+ * 'rel' is an RTE_SUBQUERY relation.  We have already generated paths within
+ * the subquery's subroot; the task here is to create SubqueryScan paths for
+ * 'rel', representing scans of the useful subquery paths.
+ *
  * interesting_pathkeys: if not NIL, also include paths that suit these
  * pathkeys, sorting any unsorted paths as required.
  * *pNumGroups: if not NULL, we estimate the number of distinct groups
@@ -1032,6 +1025,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -1043,6 +1037,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -1285,8 +1280,14 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
+         *
+         * If top_union isn't a UNION ALL, then we are interested in sorted
+         * output from the child, so pass top_union as parentOp.  Note that
+         * this isn't necessarily the child node's immediate SetOperationStmt
+         * parent, but that's fine: it's the effective parent.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
+                                                        top_union->all ? NULL : top_union,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
                                                         refnames_tlist,
diff --git a/src/include/optimizer/prep.h b/src/include/optimizer/prep.h
index a52dec285d..74bb6c1d3b 100644
--- a/src/include/optimizer/prep.h
+++ b/src/include/optimizer/prep.h
@@ -53,6 +53,5 @@ extern void preprocess_aggrefs(PlannerInfo *root, Node *clause);
  * prototypes for prepunion.c
  */
 extern RelOptInfo *plan_set_operations(PlannerInfo *root);
-extern bool set_operation_ordered_results_useful(SetOperationStmt *setop);

 #endif                            /* PREP_H */
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 6fa957ca4a..16d0807e7f 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -494,6 +494,20 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+                               QUERY PLAN
+------------------------------------------------------------------------
+ Append
+   ->  Seq Scan on int4_tbl
+   ->  Unique
+         ->  Merge Append
+               Sort Key: tenk1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1
+               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+(7 rows)
+
 reset enable_hashagg;
 -- non-hashable type
 set enable_hashagg to on;
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index e34e8a2d7e..db2aae3cc5 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -162,6 +162,10 @@ explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+
 reset enable_hashagg;

 -- non-hashable type
--
2.43.5

From 79532b0e1bcf5faa35c65743b6afc869de8efafc Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 19 Dec 2024 12:27:13 -0500
Subject: [PATCH v5 5/6] Teach generate_nonunion_paths to consider pre-sorted
 child paths.

Much of this patch is just re-ordering existing code so that we can
compute the desired pathkeys before we call build_setop_child_paths.

I don't think we need any new test cases: the existing cases that
change plan are enough to show that this works.

One test case in union.sql would switch from HashSetOp Except to
sorted SetOp Except if we left it alone, since the cheapest inputs
are already sorted.  But the point of that set of tests is to
exercise HashSetOp, so swing a bigger hammer.  (The same query is
re-tested with sorting further down, so this seems fine.)

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
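Reader's note (below the "---", so not part of the commit message): the
sorted branch of generate_nonunion_paths now picks each input in three
steps.  The sketch below restates that decision on a toy path list; it
is an illustration under assumptions, not the planner code.  ToyPath,
ToySortedChoice and toy_choose_sorted_input are hypothetical names, and
a boolean "sorted" flag stands in for the pathkeys_contained_in test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Path: just a cost and a sortedness flag. */
typedef struct ToyPath
{
    double      total_cost;
    bool        sorted;     /* output already satisfies the wanted pathkeys */
} ToyPath;

typedef enum ToySortedChoice
{
    TOY_USE_CHEAPEST,       /* cheapest path is already sorted */
    TOY_USE_PRESORTED,      /* some other path in the list is sorted */
    TOY_ADD_SORT            /* must put an explicit Sort atop the cheapest */
} ToySortedChoice;

/*
 * Decide which input path feeds a sorted SetOp.  'cheapest' is the index
 * of the cheapest-total path; *chosen receives the index of the path to
 * use (to be wrapped in a Sort when TOY_ADD_SORT is returned).
 */
ToySortedChoice
toy_choose_sorted_input(const ToyPath *paths, size_t npaths,
                        size_t cheapest, size_t *chosen)
{
    bool        found = false;
    size_t      best = 0;
    size_t      i;

    assert(npaths > 0 && cheapest < npaths);

    if (paths[cheapest].sorted)
    {
        *chosen = cheapest;
        return TOY_USE_CHEAPEST;
    }

    /* Look for the cheapest path that is already suitably sorted. */
    for (i = 0; i < npaths; i++)
    {
        if (paths[i].sorted &&
            (!found || paths[i].total_cost < paths[best].total_cost))
        {
            best = i;
            found = true;
        }
    }
    if (found)
    {
        *chosen = best;
        return TOY_USE_PRESORTED;
    }

    /* No presorted path at all: sort the cheapest one explicitly. */
    *chosen = cheapest;
    return TOY_ADD_SORT;
}
```

As in the hunk quoted below, a presorted path is taken whenever one
exists; the sketch does not compare it against the cost of sorting the
cheapest path.
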
 src/backend/optimizer/prep/prepunion.c | 144 ++++++++++++++++---------
 src/test/regress/expected/union.out    |  37 +++----
 src/test/regress/sql/union.sql         |   5 +
 3 files changed, 115 insertions(+), 71 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e92a088334..b3a26738d0 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1010,6 +1010,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     bool        lpath_trivial_tlist,
                 rpath_trivial_tlist,
                 result_trivial_tlist;
+    List       *nonunion_pathkeys = NIL;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
@@ -1030,11 +1031,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
-    if (lrel->rtekind == RTE_SUBQUERY)
-        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
-                                NIL, &dLeftGroups);
-    else
-        dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
                                   op,
@@ -1042,9 +1038,57 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
+
+    /*
+     * Generate tlist for SetOp plan node.
+     *
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
+     * concerned, but we must make it look real anyway for the benefit of the
+     * next plan level up.
+     */
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);
+
+    *pTargetList = tlist;
+
+    /* Identify the grouping semantics */
+    groupList = generate_setop_grouplist(op, tlist);
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
+
+    if (can_sort)
+    {
+        /* Determine the pathkeys for sorting by the whole target list */
+        nonunion_pathkeys = make_pathkeys_for_sortclauses(root, groupList,
+                                                          tlist);
+
+        root->query_pathkeys = nonunion_pathkeys;
+    }
+
+    /*
+     * Now that we've got all that info, we can build the child paths.
+     */
+    if (lrel->rtekind == RTE_SUBQUERY)
+        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
+                                nonunion_pathkeys, &dLeftGroups);
+    else
+        dLeftGroups = lrel->rows;
     if (rrel->rtekind == RTE_SUBQUERY)
         build_setop_child_paths(root, rrel, rpath_trivial_tlist, rpath_tlist,
-                                NIL, &dRightGroups);
+                                nonunion_pathkeys, &dRightGroups);
     else
         dRightGroups = rrel->rows;

@@ -1080,30 +1124,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     lpath = lrel->cheapest_total_path;
     rpath = rrel->cheapest_total_path;

-    /*
-     * Generate tlist for SetOp plan node.
-     *
-     * The tlist for a SetOp plan isn't important so far as the SetOp is
-     * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.
-     */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
-                                 0, false, lpath_tlist, refnames_tlist,
-                                 &result_trivial_tlist);
-
-    /* We should not have needed any type coercions in the tlist */
-    Assert(result_trivial_tlist);
-
-    *pTargetList = tlist;
-
     /* Build result relation. */
     result_rel = fetch_upper_rel(root, UPPERREL_SETOP,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /* Identify the grouping semantics */
-    groupList = generate_setop_grouplist(op, tlist);
-
     /*
      * Estimate number of distinct groups that we'll need hashtable entries
      * for; this is the size of the left-hand input for EXCEPT, or the smaller
@@ -1139,17 +1164,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
             break;
     }

-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupList);
-    can_hash = grouping_is_hashable(groupList);
-    if (!can_sort && !can_hash)
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is INTERSECT or EXCEPT */
-                 errmsg("could not implement %s",
-                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
     /*
      * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
@@ -1189,35 +1203,63 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }

     /*
-     * If we can sort, sort the inputs if needed before adding SetOp.
+     * If we can sort, generate the cheapest sorted input paths, and add a
+     * SetOp atop those.
      */
     if (can_sort)
     {
         List       *pathkeys;
+        Path       *slpath,
+                   *srpath;

+        /* First the left input ... */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  lpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
-            lpath = (Path *) create_sort_path(root,
-                                              lpath->parent,
-                                              lpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            slpath = lpath;        /* cheapest path is already sorted */
+        else
+        {
+            slpath = get_cheapest_path_for_pathkeys(lrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (slpath == NULL)
+                slpath = (Path *) create_sort_path(root,
+                                                   lpath->parent,
+                                                   lpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }
+
+        /* and now the same for the right. */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  rpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
-            rpath = (Path *) create_sort_path(root,
-                                              rpath->parent,
-                                              rpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            srpath = rpath;        /* cheapest path is already sorted */
+        else
+        {
+            srpath = get_cheapest_path_for_pathkeys(rrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (srpath == NULL)
+                srpath = (Path *) create_sort_path(root,
+                                                   rpath->parent,
+                                                   rpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }

         path = (Path *) create_setop_path(root,
                                           result_rel,
-                                          lpath,
-                                          rpath,
+                                          slpath,
+                                          srpath,
                                           cmd,
                                           SETOP_SORTED,
                                           groupList,
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 16d0807e7f..caa8fe70a0 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -385,13 +385,15 @@ select count(*) from
   5000
 (1 row)

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                         QUERY PLAN
-------------------------------------------------------------
+           QUERY PLAN
+---------------------------------
  HashSetOp Except
-   ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+   ->  Seq Scan on tenk1
+   ->  Seq Scan on tenk1 tenk1_1
          Filter: (unique2 <> 10)
 (4 rows)

@@ -401,6 +403,7 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+reset enable_indexscan;
 -- the hashed implementation is sensitive to child plans' tuple slot types
 explain (costs off)
 select * from int8_tbl intersect select q2, q1 from int8_tbl order by 1, 2;
@@ -455,17 +458,15 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
    ->  SetOp Intersect
          ->  Sort
                Sort Key: tenk1.fivethous
                ->  Seq Scan on tenk1
-         ->  Sort
-               Sort Key: tenk1_1.unique1
-               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(6 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -476,17 +477,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                            QUERY PLAN
-------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  SetOp Except
-   ->  Sort
-         Sort Key: tenk1.unique1
-         ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Sort
-         Sort Key: tenk1_1.unique2
-         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-               Filter: (unique2 <> 10)
-(8 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index db2aae3cc5..13700a6bfc 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -134,10 +134,15 @@ select count(*) from
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
+
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+reset enable_indexscan;
+
 -- the hashed implementation is sensitive to child plans' tuple slot types
 explain (costs off)
 select * from int8_tbl intersect select q2, q1 from int8_tbl order by 1, 2;
--
2.43.5

From afd29f78b98d2ab447b6169c93412614067580d8 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 19 Dec 2024 14:33:30 -0500
Subject: [PATCH v5 6/6] Use ExecGetCommonSlotOps infrastructure in more
 places.

Append, MergeAppend, and RecursiveUnion can all use the support
functions added in the first patch in this series.  The first
two can report a fixed result slot type if all their children
return the same fixed slot type.  That does nothing for the
append step itself, but might allow optimizations in the
parent plan node.  RecursiveUnion can optimize tuple hash table
operations in the same way as SetOp now does.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
---
 src/backend/executor/nodeAppend.c         | 32 +++++++++++++------
 src/backend/executor/nodeMergeAppend.c    | 38 ++++++++++++++++-------
 src/backend/executor/nodeRecursiveunion.c |  8 +++--
 3 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index ca0f54d676..b5d56569f7 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -110,6 +110,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 {
     AppendState *appendstate = makeNode(AppendState);
     PlanState **appendplanstates;
+    const TupleTableSlotOps *appendops;
     Bitmapset  *validsubplans;
     Bitmapset  *asyncplans;
     int            nplans;
@@ -176,15 +177,6 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
         appendstate->as_prune_state = NULL;
     }

-    /*
-     * Initialize result tuple type and slot.
-     */
-    ExecInitResultTupleSlotTL(&appendstate->ps, &TTSOpsVirtual);
-
-    /* node returns slots from each of its subnodes, therefore not fixed */
-    appendstate->ps.resultopsset = true;
-    appendstate->ps.resultopsfixed = false;
-
     appendplanstates = (PlanState **) palloc(nplans *
                                              sizeof(PlanState *));

@@ -227,6 +219,28 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;

+    /*
+     * Initialize Append's result tuple type and slot.  If the child plans all
+     * produce the same fixed slot type, we can use that slot type; otherwise
+     * make a virtual slot.  (Note that the result slot itself is used only to
+     * return a null tuple at end of execution; real tuples are returned to
+     * the caller in the children's own result slots.  What we are doing here
+     * is allowing the parent plan node to optimize if the Append will return
+     * only one kind of slot.)
+     */
+    appendops = ExecGetCommonSlotOps(appendplanstates, j);
+    if (appendops != NULL)
+    {
+        ExecInitResultTupleSlotTL(&appendstate->ps, appendops);
+    }
+    else
+    {
+        ExecInitResultTupleSlotTL(&appendstate->ps, &TTSOpsVirtual);
+        /* show that the output slot type is not fixed */
+        appendstate->ps.resultopsset = true;
+        appendstate->ps.resultopsfixed = false;
+    }
+
     /* Initialize async state */
     appendstate->as_asyncplans = asyncplans;
     appendstate->as_nasyncplans = nasyncplans;
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e1b9b984a7..081734e817 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -66,6 +66,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 {
     MergeAppendState *mergestate = makeNode(MergeAppendState);
     PlanState **mergeplanstates;
+    const TupleTableSlotOps *mergeops;
     Bitmapset  *validsubplans;
     int            nplans;
     int            i,
@@ -128,18 +129,6 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
     mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
                                               mergestate);

-    /*
-     * Miscellaneous initialization
-     *
-     * MergeAppend nodes do have Result slots, which hold pointers to tuples,
-     * so we have to initialize them.  FIXME
-     */
-    ExecInitResultTupleSlotTL(&mergestate->ps, &TTSOpsVirtual);
-
-    /* node returns slots from each of its subnodes, therefore not fixed */
-    mergestate->ps.resultopsset = true;
-    mergestate->ps.resultopsfixed = false;
-
     /*
      * call ExecInitNode on each of the valid plans to be executed and save
      * the results into the mergeplanstates array.
@@ -153,6 +142,31 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
         mergeplanstates[j++] = ExecInitNode(initNode, estate, eflags);
     }

+    /*
+     * Initialize MergeAppend's result tuple type and slot.  If the child
+     * plans all produce the same fixed slot type, we can use that slot type;
+     * otherwise make a virtual slot.  (Note that the result slot itself is
+     * used only to return a null tuple at end of execution; real tuples are
+     * returned to the caller in the children's own result slots.  What we are
+     * doing here is allowing the parent plan node to optimize if the
+     * MergeAppend will return only one kind of slot.)
+     */
+    mergeops = ExecGetCommonSlotOps(mergeplanstates, j);
+    if (mergeops != NULL)
+    {
+        ExecInitResultTupleSlotTL(&mergestate->ps, mergeops);
+    }
+    else
+    {
+        ExecInitResultTupleSlotTL(&mergestate->ps, &TTSOpsVirtual);
+        /* show that the output slot type is not fixed */
+        mergestate->ps.resultopsset = true;
+        mergestate->ps.resultopsfixed = false;
+    }
+
+    /*
+     * Miscellaneous initialization
+     */
     mergestate->ps.ps_ProjInfo = NULL;

     /*
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 39be4cdc3b..b577b72b50 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -37,10 +37,14 @@ build_hash_table(RecursiveUnionState *rustate)
     Assert(node->numCols > 0);
     Assert(node->numGroups > 0);

-    /* XXX is it worth working a bit harder to determine the inputOps here? */
+    /*
+     * If both child plans deliver the same fixed tuple slot type, we can tell
+     * BuildTupleHashTableExt to expect that slot type as input.  Otherwise,
+     * we'll pass NULL denoting that any slot type is possible.
+     */
     rustate->hashtable = BuildTupleHashTableExt(&rustate->ps,
                                                 desc,
-                                                NULL,
+                                                ExecGetCommonChildSlotOps(&rustate->ps),
                                                 node->numCols,
                                                 node->dupColIdx,
                                                 rustate->eqfuncoids,
--
2.43.5

Re: Converting SetOp to read its two inputs separately

From

David Rowley

Date:

19 December 2024, 23:36:56

On Fri, 20 Dec 2024 at 08:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Thu, 19 Dec 2024 at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> The common_result_slot_type() function I wrote here perhaps
> >> should be made generally available, but I didn't do that yet.
>
> > I think it would be good to make this generic as it can be at least
> > used in nodeRecursiveunion.c and nodeAppend.c.
>
> OK, done, and I added an 0006 patch that uses that infrastructure
> in the obvious places.

That looks good. Thanks for adjusting the other node types too.
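
For anyone skimming the thread, here is a rough standalone model of the rule the shared helper applies: a common slot type is reported only when every child advertises a fixed slot type and all of them agree. This is a sketch, not the PostgreSQL source; ToySlotOps, ToyPlanState and toy_common_slot_ops are hypothetical stand-ins for TupleTableSlotOps, PlanState and ExecGetCommonSlotOps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for TupleTableSlotOps; only its address matters. */
typedef struct ToySlotOps
{
    const char *name;
} ToySlotOps;

/* Hypothetical stand-in for the result-slot fields of a PlanState. */
typedef struct ToyPlanState
{
    const ToySlotOps *resultops;    /* slot type the node returns */
    bool        resultopsfixed;     /* true if it always returns that type */
} ToyPlanState;

const ToySlotOps toy_virtual = {"virtual"};
const ToySlotOps toy_minimal = {"minimal"};

/*
 * Return the slot ops shared by all children, or NULL if there are no
 * children, some child's slot type is not fixed, or the children disagree.
 */
const ToySlotOps *
toy_common_slot_ops(ToyPlanState **children, int nchildren)
{
    const ToySlotOps *result;

    assert(nchildren >= 0);
    if (nchildren == 0)
        return NULL;
    if (!children[0]->resultopsfixed)
        return NULL;
    result = children[0]->resultops;
    for (int i = 1; i < nchildren; i++)
    {
        if (!children[i]->resultopsfixed ||
            children[i]->resultops != result)
            return NULL;
    }
    return result;
}
```

A NULL result corresponds to the fallback branch in the Append and MergeAppend hunks above, where a virtual result slot is created and resultopsfixed is set to false.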

> I also addressed your remarks about comments.  Otherwise I'm feeling
> like this is about ready to push.

I think so too.

One minor detail... I think the only thing I'd like to see is the
moving of the enable_hashagg checks to increment the disabled_nodes
count in create_setop_path() instead of where it's being called.  I
understand there's only 1 caller of that function that passes
SETOP_HASHED, but it does seem nicer to put that logic where it
belongs. With how you have it now, if we were ever to grow any more
places that built SETOP_HASHED SetOpPaths, they'd also need to adjust
disabled_nodes manually and that seems easy to forget.  Also, looking
around for references to "disabled_nodes", it looks like all other
places where we fiddle with the value of disabled_nodes are in
costsize.c. I understand we do check enable_hashagg in other places,
but those all seem to be so we avoid generating some Path rather than
to determine the disabled_nodes value.

David

Re: Converting SetOp to read its two inputs separately

From

Tom Lane

Date:

20 December 2024, 00:19:05

David Rowley <dgrowleyml@gmail.com> writes:
> One minor detail... I think the only thing I'd like to see is the
> moving of the enable_hashagg checks to increment the disabled_nodes
> count in create_setop_path() instead of where it's being called.  I
> understand there's only 1 caller of that function that passes
> SETOP_HASHED, but it does seem nicer to put that logic where it
> belongs. With how you have it now, if we were ever to grow any more
> places that built SETOP_HASHED SetOpPaths, they'd also need to adjust
> disabled_nodes manually and that seems easy to forget.  Also, looking
> around for references to "disabled_nodes", it looks like all other
> places where we fiddle with the value of disabled_nodes are in
> costsize.c.

Looks like costsize.c and pathnode.c to me, but either way I take your
point.  I'd not realized that Robert set it up that way, but now I see
he did.  I agree that moving that bit of logic into
create_setop_path() seems better.  I'll make it so and push.
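
A minimal sketch of the arrangement being agreed on, with the path constructor itself accounting for enable_hashagg so that callers cannot forget. Everything here is a toy model: ToyPath, ToySetOpStrategy, toy_enable_hashagg and toy_create_setop_path are hypothetical names, and summing the children's disabled_nodes is an assumption about how the count propagates, not a quote from the committed code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum ToySetOpStrategy
{
    TOY_SETOP_SORTED,
    TOY_SETOP_HASHED
} ToySetOpStrategy;

/* Hypothetical stand-in for Path; only the disabled-node count is modeled. */
typedef struct ToyPath
{
    int         disabled_nodes;
} ToyPath;

/* Hypothetical stand-in for the enable_hashagg GUC. */
bool        toy_enable_hashagg = true;

/*
 * Build a set-operation path over two inputs.  The constructor, not the
 * caller, bumps disabled_nodes when hashing is requested but disabled.
 */
ToyPath
toy_create_setop_path(const ToyPath *left, const ToyPath *right,
                      ToySetOpStrategy strategy)
{
    ToyPath     path;

    assert(left != NULL && right != NULL);

    /* Assumption: a parent inherits the disabled counts of its inputs. */
    path.disabled_nodes = left->disabled_nodes + right->disabled_nodes;

    if (strategy == TOY_SETOP_HASHED && !toy_enable_hashagg)
        path.disabled_nodes++;

    return path;
}
```

With this shape, any future caller that requests the hashed strategy gets the penalty without having to adjust disabled_nodes by hand.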

            regards, tom lane

Re: Converting SetOp to read its two inputs separately

From

Tom Lane

Date:

20 December 2024, 01:23:06

Pushed ... and now I have one more beef about the way things are
in this area.  I don't think we should leave the compatibility
function BuildTupleHashTable() in place in HEAD.  Making it a
wrapper around a new function BuildTupleHashTableExt() was a fine
solution for preserving ABI in released branches, but that doesn't
mean we should clutter the code with unused ABI hacks forevermore.
Attached is a patch to take it out and then rename
BuildTupleHashTableExt() back to BuildTupleHashTable().

Since BuildTupleHashTableExt has already grown more arguments
in HEAD than it had in v17, renaming it doesn't increase the number of
places that will have to be touched in any extensions that were using
this infrastructure.  Removal of the compatibility wrapper could force
some code updates, but really we want those places to update anyway.
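
To illustrate why callers are better off updating: the old wrapper passes the same context for both the metadata and the table entries, and the builder treats that case as "cannot use JIT". The following is a toy model of that relationship only; ToyContext, ToyHashTable, toy_build_table_ext and toy_build_table_compat are hypothetical names, not the executor API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MemoryContext; only identity matters. */
typedef struct ToyContext
{
    const char *name;
} ToyContext;

typedef struct ToyHashTable
{
    ToyContext *metacxt;        /* long-lived metadata */
    ToyContext *tablecxt;       /* per-entry data, reset with the table */
    bool        allow_jit;      /* false when the two contexts coincide */
} ToyHashTable;

/* Models the full-featured builder: JIT only if the contexts differ. */
ToyHashTable
toy_build_table_ext(ToyContext *metacxt, ToyContext *tablecxt)
{
    ToyHashTable table;

    assert(metacxt != NULL && tablecxt != NULL);
    table.metacxt = metacxt;
    table.tablecxt = tablecxt;
    table.allow_jit = (metacxt != tablecxt);
    return table;
}

/* Models the compatibility wrapper: metadata lands in tablecxt too. */
ToyHashTable
toy_build_table_compat(ToyContext *tablecxt)
{
    return toy_build_table_ext(tablecxt, tablecxt);
}
```

In this model a table built through the wrapper always ends up with allow_jit false, while a caller that supplies a separate metadata context keeps JIT available.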

I also made an effort at fixing the woefully out-of-date
header comment for it.

            regards, tom lane

From 322de6ae7d91c28e97f31ba1aed878181bcdc5aa Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 19 Dec 2024 15:25:57 -0500
Subject: [PATCH v1] Get rid of old version of BuildTupleHashTable().

It was reasonable to preserve the old API of BuildTupleHashTable()
in the back branches, but in HEAD we should actively discourage use
of that version.  There are no remaining callers in core, so just
get rid of it.  Then rename BuildTupleHashTableExt() back to
BuildTupleHashTable().

While at it, fix up the miserably-poorly-maintained header comment
for BuildTupleHashTable[Ext].  It looks like more than one patch in
this area has had the opinion that updating comments is beneath them.
---
 src/backend/executor/execGrouping.c       | 93 +++++++++--------------
 src/backend/executor/nodeAgg.c            | 28 +++----
 src/backend/executor/nodeRecursiveunion.c | 30 ++++----
 src/backend/executor/nodeSetOp.c          | 30 ++++----
 src/backend/executor/nodeSubplan.c        | 58 +++++++-------
 src/include/executor/executor.h           | 22 ++----
 6 files changed, 115 insertions(+), 146 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 7491e53c03..0aa9e92ad9 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -135,36 +135,43 @@ execTuplesHashPrepare(int numCols,
 /*
  * Construct an empty TupleHashTable
  *
- *  inputOps: slot ops for input hash values, or NULL if unknown or not fixed
- *    numCols, keyColIdx: identify the tuple fields to use as lookup key
- *    eqfunctions: equality comparison functions to use
- *    hashfunctions: datatype-specific hashing functions to use
+ *    parent: PlanState node that will own this hash table
+ *    inputDesc: tuple descriptor for input tuples
+ *    inputOps: slot ops for input tuples, or NULL if unknown or not fixed
+ *    numCols: number of columns to be compared (length of next 4 arrays)
+ *    keyColIdx: indexes of tuple columns to compare
+ *    eqfuncoids: OIDs of equality comparison functions to use
+ *    hashfunctions: FmgrInfos of datatype-specific hashing functions to use
+ *    collations: collations to use in comparisons
  *    nbuckets: initial estimate of hashtable size
  *    additionalsize: size of data stored in ->additional
  *    metacxt: memory context for long-lived allocation, but not per-entry data
  *    tablecxt: memory context in which to store table entries
  *    tempcxt: short-lived context for evaluation hash and comparison functions
+ *    use_variable_hash_iv: if true, adjust hash IV per-parallel-worker
  *
- * The function arrays may be made with execTuplesHashPrepare().  Note they
+ * The hashfunctions array may be made with execTuplesHashPrepare().  Note they
  * are not cross-type functions, but expect to see the table datatype(s)
  * on both sides.
  *
- * Note that keyColIdx, eqfunctions, and hashfunctions must be allocated in
- * storage that will live as long as the hashtable does.
+ * Note that the keyColIdx, hashfunctions, and collations arrays must be
+ * allocated in storage that will live as long as the hashtable does.
  */
 TupleHashTable
-BuildTupleHashTableExt(PlanState *parent,
-                       TupleDesc inputDesc,
-                       const TupleTableSlotOps *inputOps,
-                       int numCols, AttrNumber *keyColIdx,
-                       const Oid *eqfuncoids,
-                       FmgrInfo *hashfunctions,
-                       Oid *collations,
-                       long nbuckets, Size additionalsize,
-                       MemoryContext metacxt,
-                       MemoryContext tablecxt,
-                       MemoryContext tempcxt,
-                       bool use_variable_hash_iv)
+BuildTupleHashTable(PlanState *parent,
+                    TupleDesc inputDesc,
+                    const TupleTableSlotOps *inputOps,
+                    int numCols,
+                    AttrNumber *keyColIdx,
+                    const Oid *eqfuncoids,
+                    FmgrInfo *hashfunctions,
+                    Oid *collations,
+                    long nbuckets,
+                    Size additionalsize,
+                    MemoryContext metacxt,
+                    MemoryContext tablecxt,
+                    MemoryContext tempcxt,
+                    bool use_variable_hash_iv)
 {
     TupleHashTable hashtable;
     Size        entrysize = sizeof(TupleHashEntryData) + additionalsize;
@@ -216,14 +223,14 @@ BuildTupleHashTableExt(PlanState *parent,
                                                     &TTSOpsMinimalTuple);

     /*
-     * If the old reset interface is used (i.e. BuildTupleHashTable, rather
-     * than BuildTupleHashTableExt), allowing JIT would lead to the generated
-     * functions to a) live longer than the query b) be re-generated each time
-     * the table is being reset. Therefore prevent JIT from being used in that
-     * case, by not providing a parent node (which prevents accessing the
-     * JitContext in the EState).
+     * If the caller fails to make the metacxt different from the tablecxt,
+     * allowing JIT would lead to the generated functions to a) live longer
+     * than the query or b) be re-generated each time the table is being
+     * reset. Therefore prevent JIT from being used in that case, by not
+     * providing a parent node (which prevents accessing the JitContext in the
+     * EState).
      */
-    allow_jit = metacxt != tablecxt;
+    allow_jit = (metacxt != tablecxt);

     /* build hash ExprState for all columns */
     hashtable->tab_hash_expr = ExecBuildHash32FromAttrs(inputDesc,
@@ -256,41 +263,9 @@ BuildTupleHashTableExt(PlanState *parent,
     return hashtable;
 }

-/*
- * BuildTupleHashTable is a backwards-compatibility wrapper for
- * BuildTupleHashTableExt(), that allocates the hashtable's metadata in
- * tablecxt. Note that hashtables created this way cannot be reset leak-free
- * with ResetTupleHashTable().
- */
-TupleHashTable
-BuildTupleHashTable(PlanState *parent,
-                    TupleDesc inputDesc,
-                    int numCols, AttrNumber *keyColIdx,
-                    const Oid *eqfuncoids,
-                    FmgrInfo *hashfunctions,
-                    Oid *collations,
-                    long nbuckets, Size additionalsize,
-                    MemoryContext tablecxt,
-                    MemoryContext tempcxt,
-                    bool use_variable_hash_iv)
-{
-    return BuildTupleHashTableExt(parent,
-                                  inputDesc,
-                                  NULL,
-                                  numCols, keyColIdx,
-                                  eqfuncoids,
-                                  hashfunctions,
-                                  collations,
-                                  nbuckets, additionalsize,
-                                  tablecxt,
-                                  tablecxt,
-                                  tempcxt,
-                                  use_variable_hash_iv);
-}
-
 /*
  * Reset contents of the hashtable to be empty, preserving all the non-content
- * state. Note that the tablecxt passed to BuildTupleHashTableExt() should
+ * state. Note that the tablecxt passed to BuildTupleHashTable() should
  * also be reset, otherwise there will be leaks.
  */
 void
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 53931c8285..54e7806fc0 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1518,20 +1518,20 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
      */
     additionalsize = aggstate->numtrans * sizeof(AggStatePerGroupData);

-    perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
-                                                perhash->hashslot->tts_tupleDescriptor,
-                                                perhash->hashslot->tts_ops,
-                                                perhash->numCols,
-                                                perhash->hashGrpColIdxHash,
-                                                perhash->eqfuncoids,
-                                                perhash->hashfunctions,
-                                                perhash->aggnode->grpCollations,
-                                                nbuckets,
-                                                additionalsize,
-                                                metacxt,
-                                                hashcxt,
-                                                tmpcxt,
-                                                DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+    perhash->hashtable = BuildTupleHashTable(&aggstate->ss.ps,
+                                             perhash->hashslot->tts_tupleDescriptor,
+                                             perhash->hashslot->tts_ops,
+                                             perhash->numCols,
+                                             perhash->hashGrpColIdxHash,
+                                             perhash->eqfuncoids,
+                                             perhash->hashfunctions,
+                                             perhash->aggnode->grpCollations,
+                                             nbuckets,
+                                             additionalsize,
+                                             metacxt,
+                                             hashcxt,
+                                             tmpcxt,
+                                             DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
 }

 /*
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index b577b72b50..8bbec46017 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -39,23 +39,23 @@ build_hash_table(RecursiveUnionState *rustate)

     /*
      * If both child plans deliver the same fixed tuple slot type, we can tell
-     * BuildTupleHashTableExt to expect that slot type as input.  Otherwise,
+     * BuildTupleHashTable to expect that slot type as input.  Otherwise,
      * we'll pass NULL denoting that any slot type is possible.
      */
-    rustate->hashtable = BuildTupleHashTableExt(&rustate->ps,
-                                                desc,
-                                                ExecGetCommonChildSlotOps(&rustate->ps),
-                                                node->numCols,
-                                                node->dupColIdx,
-                                                rustate->eqfuncoids,
-                                                rustate->hashfunctions,
-                                                node->dupCollations,
-                                                node->numGroups,
-                                                0,
-                                                rustate->ps.state->es_query_cxt,
-                                                rustate->tableContext,
-                                                rustate->tempContext,
-                                                false);
+    rustate->hashtable = BuildTupleHashTable(&rustate->ps,
+                                             desc,
+                                             ExecGetCommonChildSlotOps(&rustate->ps),
+                                             node->numCols,
+                                             node->dupColIdx,
+                                             rustate->eqfuncoids,
+                                             rustate->hashfunctions,
+                                             node->dupCollations,
+                                             node->numGroups,
+                                             0,
+                                             rustate->ps.state->es_query_cxt,
+                                             rustate->tableContext,
+                                             rustate->tempContext,
+                                             false);
 }


diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 09089204e8..54311d31e0 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -92,23 +92,23 @@ build_hash_table(SetOpState *setopstate)

     /*
      * If both child plans deliver the same fixed tuple slot type, we can tell
-     * BuildTupleHashTableExt to expect that slot type as input.  Otherwise,
+     * BuildTupleHashTable to expect that slot type as input.  Otherwise,
      * we'll pass NULL denoting that any slot type is possible.
      */
-    setopstate->hashtable = BuildTupleHashTableExt(&setopstate->ps,
-                                                   desc,
-                                                   ExecGetCommonChildSlotOps(&setopstate->ps),
-                                                   node->numCols,
-                                                   node->cmpColIdx,
-                                                   setopstate->eqfuncoids,
-                                                   setopstate->hashfunctions,
-                                                   node->cmpCollations,
-                                                   node->numGroups,
-                                                   0,
-                                                   setopstate->ps.state->es_query_cxt,
-                                                   setopstate->tableContext,
-                                                   econtext->ecxt_per_tuple_memory,
-                                                   false);
+    setopstate->hashtable = BuildTupleHashTable(&setopstate->ps,
+                                                desc,
+                                                ExecGetCommonChildSlotOps(&setopstate->ps),
+                                                node->numCols,
+                                                node->cmpColIdx,
+                                                setopstate->eqfuncoids,
+                                                setopstate->hashfunctions,
+                                                node->cmpCollations,
+                                                node->numGroups,
+                                                0,
+                                                setopstate->ps.state->es_query_cxt,
+                                                setopstate->tableContext,
+                                                econtext->ecxt_per_tuple_memory,
+                                                false);
 }

 /*
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index bb4a021919..316995ca9f 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -523,7 +523,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
      * Because the input slot for each hash table is always the slot resulting
      * from an ExecProject(), we can use TTSOpsVirtual for the input ops. This
      * saves a needless fetch inner op step for the hashing ExprState created
-     * in BuildTupleHashTableExt().
+     * in BuildTupleHashTable().
      */
     MemoryContextReset(node->hashtablecxt);
     node->havehashrows = false;
@@ -536,20 +536,20 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
     if (node->hashtable)
         ResetTupleHashTable(node->hashtable);
     else
-        node->hashtable = BuildTupleHashTableExt(node->parent,
-                                                 node->descRight,
-                                                 &TTSOpsVirtual,
-                                                 ncols,
-                                                 node->keyColIdx,
-                                                 node->tab_eq_funcoids,
-                                                 node->tab_hash_funcs,
-                                                 node->tab_collations,
-                                                 nbuckets,
-                                                 0,
-                                                 node->planstate->state->es_query_cxt,
-                                                 node->hashtablecxt,
-                                                 node->hashtempcxt,
-                                                 false);
+        node->hashtable = BuildTupleHashTable(node->parent,
+                                              node->descRight,
+                                              &TTSOpsVirtual,
+                                              ncols,
+                                              node->keyColIdx,
+                                              node->tab_eq_funcoids,
+                                              node->tab_hash_funcs,
+                                              node->tab_collations,
+                                              nbuckets,
+                                              0,
+                                              node->planstate->state->es_query_cxt,
+                                              node->hashtablecxt,
+                                              node->hashtempcxt,
+                                              false);

     if (!subplan->unknownEqFalse)
     {
@@ -565,20 +565,20 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
         if (node->hashnulls)
             ResetTupleHashTable(node->hashnulls);
         else
-            node->hashnulls = BuildTupleHashTableExt(node->parent,
-                                                     node->descRight,
-                                                     &TTSOpsVirtual,
-                                                     ncols,
-                                                     node->keyColIdx,
-                                                     node->tab_eq_funcoids,
-                                                     node->tab_hash_funcs,
-                                                     node->tab_collations,
-                                                     nbuckets,
-                                                     0,
-                                                     node->planstate->state->es_query_cxt,
-                                                     node->hashtablecxt,
-                                                     node->hashtempcxt,
-                                                     false);
+            node->hashnulls = BuildTupleHashTable(node->parent,
+                                                  node->descRight,
+                                                  &TTSOpsVirtual,
+                                                  ncols,
+                                                  node->keyColIdx,
+                                                  node->tab_eq_funcoids,
+                                                  node->tab_hash_funcs,
+                                                  node->tab_collations,
+                                                  nbuckets,
+                                                  0,
+                                                  node->planstate->state->es_query_cxt,
+                                                  node->hashtablecxt,
+                                                  node->hashtempcxt,
+                                                  false);
     }
     else
         node->hashnulls = NULL;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index dee866e96b..1c7fae0930 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -131,24 +131,18 @@ extern void execTuplesHashPrepare(int numCols,
                                   FmgrInfo **hashFunctions);
 extern TupleHashTable BuildTupleHashTable(PlanState *parent,
                                           TupleDesc inputDesc,
-                                          int numCols, AttrNumber *keyColIdx,
+                                          const TupleTableSlotOps *inputOps,
+                                          int numCols,
+                                          AttrNumber *keyColIdx,
                                           const Oid *eqfuncoids,
                                           FmgrInfo *hashfunctions,
                                           Oid *collations,
-                                          long nbuckets, Size additionalsize,
+                                          long nbuckets,
+                                          Size additionalsize,
+                                          MemoryContext metacxt,
                                           MemoryContext tablecxt,
-                                          MemoryContext tempcxt, bool use_variable_hash_iv);
-extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
-                                             TupleDesc inputDesc,
-                                             const TupleTableSlotOps *inputOps,
-                                             int numCols, AttrNumber *keyColIdx,
-                                             const Oid *eqfuncoids,
-                                             FmgrInfo *hashfunctions,
-                                             Oid *collations,
-                                             long nbuckets, Size additionalsize,
-                                             MemoryContext metacxt,
-                                             MemoryContext tablecxt,
-                                             MemoryContext tempcxt, bool use_variable_hash_iv);
+                                          MemoryContext tempcxt,
+                                          bool use_variable_hash_iv);
 extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
                                            TupleTableSlot *slot,
                                            bool *isnew, uint32 *hash);
--
2.43.5

Re: Converting SetOp to read its two inputs separately

From

David Rowley

Date:

20 December 2024, 01:53:26

On Fri, 20 Dec 2024 at 11:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pushed ... and now I have one more beef about the way things are
> in this area.  I don't think we should leave the compatibility
> function BuildTupleHashTable() in place in HEAD.  Making it a
> wrapper around a new function BuildTupleHashTableExt() was a fine
> solution for preserving ABI in released branches, but that doesn't
> mean we should clutter the code with unused ABI hacks forevermore.
> Attached is a patch to take it out and then rename
> BuildTupleHashTableExt() back to BuildTupleHashTable().

No complaints here. Thanks for cleaning that up.

I couldn't help but also notice the nbuckets parameter is using the
"long" datatype. The code in BuildTupleHashTable seems to think it's
fine to pass the long as the uint32 parameter to tuplehash_create().
I just thought if we're in the area of adjusting this API function's
signature then it might be worth fixing the "long" issue at the same
time, or at least in the same release.
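
As an illustration of the narrowing concern, here is a minimal sketch in C.
The names callee_sees(), pass_long_unchecked() and pass_long_clamped() are
hypothetical stand-ins invented for this example; they are not the real
BuildTupleHashTable() or tuplehash_create(), and the clamp shown is only one
possible guard, not what the PostgreSQL code does.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a callee that takes a uint32 element count.
 * It simply reports the value it received.
 */
static uint32_t
callee_sees(uint32_t nelements)
{
    return nelements;
}

/*
 * Passing a long where a uint32 is expected converts implicitly.
 * The value is reduced modulo 2^32, so negative or very large inputs
 * silently turn into unrelated counts.
 */
static uint32_t
pass_long_unchecked(long nbuckets)
{
    return callee_sees(nbuckets);
}

/*
 * One possible guard (an assumption of this sketch): clamp the request
 * into [1, UINT32_MAX] before narrowing.
 */
static uint32_t
pass_long_clamped(long nbuckets)
{
    if (nbuckets < 1)
        return callee_sees(1);
    /* On platforms with a 32-bit long this test can never be true. */
    if ((unsigned long) nbuckets > UINT32_MAX)
        return callee_sees(UINT32_MAX);
    return callee_sees((uint32_t) nbuckets);
}
```

With these definitions, pass_long_unchecked(-1) hands the callee UINT32_MAX,
whereas pass_long_clamped(-1) hands it 1.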

I'm also quite keen to see less use of long, as it's not a consistently
sized datatype across platforms, which can lead to platform-dependent
bugs.
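
For context on the sizing point: long is typically 64 bits on 64-bit
Unix-like systems (LP64) but 32 bits on 64-bit Windows (LLP64), whereas
int64_t is 64 bits wherever it exists. The small C sketch below makes that
difference observable; the helper names long_bits(), int64_bits() and
fits_in_long() are invented for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Width of long in bits on the platform this is compiled for. */
static size_t
long_bits(void)
{
    return sizeof(long) * CHAR_BIT;
}

/* Width of int64_t in bits; always 64 by definition of the type. */
static size_t
int64_bits(void)
{
    return sizeof(int64_t) * CHAR_BIT;
}

/*
 * Does a 64-bit value survive a round trip through long here?
 * Always true where long is 64 bits, but false for large values
 * where long is 32 bits.
 */
static int
fits_in_long(int64_t v)
{
    return v >= LONG_MIN && v <= LONG_MAX;
}
```

A value such as 5000000000 fits in long on an LP64 build but not on an
LLP64 build, which is the kind of platform-dependent behavior at issue.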

David

Re: Converting SetOp to read its two inputs separately

From

Tom Lane

Date:

20 December 2024, 02:05:01

David Rowley <dgrowleyml@gmail.com> writes:
> On Fri, 20 Dec 2024 at 11:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Attached is a patch to take it out and then rename
>> BuildTupleHashTableExt() back to BuildTupleHashTable().

> No complaints here. Thanks for cleaning that up.

Thanks, will push.

> I couldn't help but also notice the nbuckets parameter is using the
> "long" datatype. The code in BuildTupleHashTable seems to think it's
> fine to pass the long as the uint32 parameter to tuplehash_create().
> I just thought if we're in the area of adjusting this API function's
> signature then it might be worth fixing the "long" issue at the same
> time, or at least in the same release.

I'm in favor of doing that, but it seems like a separate patch,
because we'd have to chase things back a fairly long way.
For instance, the numGroups fields in Agg, RecursiveUnion, and
SetOp are all "long" at the moment, and some of the callers
are getting their arguments via clamp_cardinality_to_long()
which'd need adjustment, etc etc.
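
To make the "chasing back" concrete, a fixed-width replacement for a
clamping helper might look roughly like the C sketch below. The function
clamp_estimate_to_int64() is hypothetical and written only for this
illustration; it is not the existing clamp_cardinality_to_long(), and it
truncates rather than rounds to keep the example free of libm calls.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a floating-point row estimate into an
 * int64_t, saturating instead of overflowing.
 */
static int64_t
clamp_estimate_to_int64(double x)
{
    /* NaN compares unequal to itself; treat it as "very large". */
    if (x != x)
        return INT64_MAX;
    if (x <= 0)
        return 0;
    /* 2^63 is exactly representable as a double; anything >= it saturates. */
    if (x >= 9223372036854775808.0)
        return INT64_MAX;
    /* In range, so the conversion is well defined (fraction is truncated). */
    return (int64_t) x;
}
```

Because the result type no longer depends on the width of long, callers
would see the same saturation point on every platform.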

> I'm also quite keen to see less use of long, as it's not a consistently
> sized datatype across platforms, which can lead to platform-dependent
> bugs.

Yeah.  Switching all these places to int64 likely would be
worthwhile cleanup (but I'm not volunteering).  Also, if
tuplehash_create expects a uint32, that isn't consistent
either.

            regards, tom lane