Re: Converting SetOp to read its two inputs separately - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Converting SetOp to read its two inputs separately
Date
Msg-id 56245.1732068576@sss.pgh.pa.us
Whole thread Raw
In response to Re: Converting SetOp to read its two inputs separately  (Richard Guo <guofenglinux@gmail.com>)
List pgsql-hackers
Richard Guo <guofenglinux@gmail.com> writes:
> On Thu, Nov 14, 2024 at 11:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Aside from that minor TODO, the main thing that's left undone in this
>> patch series is to persuade the thing to exploit presorted input
>> paths.

> I think we may need to do the following to make this work:
> 1. We need to teach set_operation_ordered_results_useful() that sorted
> input paths are also useful for INTERSECT/EXCEPT, so that we can have
> setop_pathkeys set for the subqueries.
> 2. In generate_nonunion_paths(), we need to provide a valid
> "interesting_pathkeys" when calling build_setop_child_paths().

Once I'd wrapped my head around how things are done now (which the
comments in prepunion.c were remarkably unhelpful about), I saw that
most of the problem for #2 just requires re-ordering things that
generate_nonunion_paths was already doing.  As for #1, I have a modest
proposal: we should get rid of set_operation_ordered_results_useful
entirely.  It's not the code that actually does useful work, and
keeping it in sync with the code that does do useful work is hard and
unnecessary.

0001-0003 below are the same as before (so the slot-munging TODO is
still there).  0004 fixes a rather basic bug for nested set-operations
and gets rid of set_operation_ordered_results_useful along the way.
Then 0005 does your step 2.

            regards, tom lane

From 801f2d63c221b0b01c228158d5b79db9a1edc629 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:26:26 -0500
Subject: [PATCH v2 1/5] Convert SetOp to read its inputs as outerPlan and
 innerPlan.

The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.

This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure to quantify how
many more comparisons we might need.

For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.
TODO: the tuple hash table logic spits up if I don't force tuples from
the second input into minimal-tuple form.  I don't understand why that
is, but it seems like it shouldn't be necessary.

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.
---
 src/backend/executor/nodeSetOp.c        | 531 ++++++++++++++----------
 src/backend/optimizer/README            |   2 +-
 src/backend/optimizer/plan/createplan.c |  81 ++--
 src/backend/optimizer/prep/prepunion.c  | 156 ++++---
 src/backend/optimizer/util/pathnode.c   |  50 ++-
 src/include/nodes/execnodes.h           |  31 +-
 src/include/nodes/pathnodes.h           |   9 +-
 src/include/nodes/plannodes.h           |  19 +-
 src/include/optimizer/pathnode.h        |   7 +-
 src/test/regress/expected/subselect.out |  20 +-
 src/test/regress/expected/union.out     | 226 +++++-----
 src/tools/pgindent/typedefs.list        |   2 +
 12 files changed, 614 insertions(+), 520 deletions(-)

diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index a8ac68b482..5644739d65 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -3,29 +3,30 @@
  * nodeSetOp.c
  *      Routines to handle INTERSECT and EXCEPT selection
  *
- * The input of a SetOp node consists of tuples from two relations,
- * which have been combined into one dataset, with a junk attribute added
- * that shows which relation each tuple came from.  In SETOP_SORTED mode,
- * the input has furthermore been sorted according to all the grouping
- * columns (ie, all the non-junk attributes).  The SetOp node scans each
- * group of identical tuples to determine how many came from each input
- * relation.  Then it is a simple matter to emit the output demanded by the
- * SQL spec for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ * The input of a SetOp node consists of two relations (outer and inner)
+ * with identical column sets.  In EXCEPT queries the outer relation is
+ * always the left side, while in INTERSECT cases the planner tries to
+ * make the outer relation be the smaller of the two inputs.
  *
- * In SETOP_HASHED mode, the input is delivered in no particular order,
- * except that we know all the tuples from one input relation will come before
- * all the tuples of the other.  The planner guarantees that the first input
- * relation is the left-hand one for EXCEPT, and tries to make the smaller
- * input relation come first for INTERSECT.  We build a hash table in memory
- * with one entry for each group of identical tuples, and count the number of
- * tuples in the group from each relation.  After seeing all the input, we
- * scan the hashtable and generate the correct output using those counts.
- * We can avoid making hashtable entries for any tuples appearing only in the
- * second input relation, since they cannot result in any output.
+ * In SETOP_SORTED mode, each input has been sorted according to all the
+ * grouping columns (ie, all the non-junk attributes).  The SetOp node
+ * essentially performs a merge join on the grouping columns, except that
+ * it is only interested in counting how many tuples from each input match.
+ * Then it is a simple matter to emit the output demanded by the SQL spec
+ * for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ *
+ * In SETOP_HASHED mode, the inputs are delivered in no particular order.
+ * We read the outer relation and build a hash table in memory with one entry
+ * for each group of identical tuples, counting the number of tuples in the
+ * group.  Then we read the inner relation and count the number of tuples
+ * matching each outer group.  (We can disregard any tuples appearing only
+ * in the inner relation, since they cannot result in any output.)  After
+ * seeing all the input, we scan the hashtable and generate the correct
+ * output using those counts.
  *
  * This node type is not used for UNION or UNION ALL, since those can be
- * implemented more cheaply (there's no need for the junk attribute to
- * identify the source relation).
+ * implemented more cheaply (there's no need to count the number of
+ * matching tuples).
  *
  * Note that SetOp does no qual checking nor projection.  The delivered
  * output tuples are just copies of the first-to-arrive tuple in each
@@ -54,65 +55,28 @@
 /*
  * SetOpStatePerGroupData - per-group working state
  *
- * These values are working state that is initialized at the start of
- * an input tuple group and updated for each input tuple.
- *
- * In SETOP_SORTED mode, we need only one of these structs, and it's kept in
- * the plan state node.  In SETOP_HASHED mode, the hash table contains one
- * of these for each tuple group.
+ * In SETOP_SORTED mode, we need only one of these structs, and it's just a
+ * local in setop_retrieve_sorted.  In SETOP_HASHED mode, the hash table
+ * contains one of these for each tuple group.
  */
 typedef struct SetOpStatePerGroupData
 {
-    long        numLeft;        /* number of left-input dups in group */
-    long        numRight;        /* number of right-input dups in group */
-}            SetOpStatePerGroupData;
+    int64        numLeft;        /* number of left-input dups in group */
+    int64        numRight;        /* number of right-input dups in group */
+} SetOpStatePerGroupData;
+
+typedef SetOpStatePerGroupData *SetOpStatePerGroup;


-static TupleTableSlot *setop_retrieve_direct(SetOpState *setopstate);
+static TupleTableSlot *setop_retrieve_sorted(SetOpState *setopstate);
+static void setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                             SetOpState *setopstate);
+static int    setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                                SetOpState *setopstate);
 static void setop_fill_hash_table(SetOpState *setopstate);
 static TupleTableSlot *setop_retrieve_hash_table(SetOpState *setopstate);


-/*
- * Initialize state for a new group of input values.
- */
-static inline void
-initialize_counts(SetOpStatePerGroup pergroup)
-{
-    pergroup->numLeft = pergroup->numRight = 0;
-}
-
-/*
- * Advance the appropriate counter for one input tuple.
- */
-static inline void
-advance_counts(SetOpStatePerGroup pergroup, int flag)
-{
-    if (flag)
-        pergroup->numRight++;
-    else
-        pergroup->numLeft++;
-}
-
-/*
- * Fetch the "flag" column from an input tuple.
- * This is an integer column with value 0 for left side, 1 for right side.
- */
-static int
-fetch_tuple_flag(SetOpState *setopstate, TupleTableSlot *inputslot)
-{
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
-    int            flag;
-    bool        isNull;
-
-    flag = DatumGetInt32(slot_getattr(inputslot,
-                                      node->flagColIdx,
-                                      &isNull));
-    Assert(!isNull);
-    Assert(flag == 0 || flag == 1);
-    return flag;
-}
-
 /*
  * Initialize the hash table to empty.
  */
@@ -129,10 +93,10 @@ build_hash_table(SetOpState *setopstate)
     setopstate->hashtable = BuildTupleHashTableExt(&setopstate->ps,
                                                    desc,
                                                    node->numCols,
-                                                   node->dupColIdx,
+                                                   node->cmpColIdx,
                                                    setopstate->eqfuncoids,
                                                    setopstate->hashfunctions,
-                                                   node->dupCollations,
+                                                   node->cmpCollations,
                                                    node->numGroups,
                                                    0,
                                                    setopstate->ps.state->es_query_cxt,
@@ -217,108 +181,126 @@ ExecSetOp(PlanState *pstate)
         return setop_retrieve_hash_table(node);
     }
     else
-        return setop_retrieve_direct(node);
+        return setop_retrieve_sorted(node);
 }

 /*
  * ExecSetOp for non-hashed case
  */
 static TupleTableSlot *
-setop_retrieve_direct(SetOpState *setopstate)
+setop_retrieve_sorted(SetOpState *setopstate)
 {
     PlanState  *outerPlan;
-    SetOpStatePerGroup pergroup;
-    TupleTableSlot *outerslot;
+    PlanState  *innerPlan;
     TupleTableSlot *resultTupleSlot;
-    ExprContext *econtext = setopstate->ps.ps_ExprContext;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    pergroup = (SetOpStatePerGroup) setopstate->pergroup;
+    innerPlan = innerPlanState(setopstate);
     resultTupleSlot = setopstate->ps.ps_ResultTupleSlot;

     /*
-     * We loop retrieving groups until we find one we should return
+     * If first time through, establish the invariant that setop_load_group
+     * expects: each side's nextTupleSlot is the next output from the child
+     * plan, or empty if there is no more output from it.
      */
-    while (!setopstate->setop_done)
+    if (setopstate->need_init)
     {
+        setopstate->need_init = false;
+
+        setopstate->leftInput.nextTupleSlot = ExecProcNode(outerPlan);
+
         /*
-         * If we don't already have the first tuple of the new group, fetch it
-         * from the outer plan.
+         * If the outer relation is empty, then we will emit nothing, and we
+         * don't need to read the inner relation at all.
          */
-        if (setopstate->grp_firstTuple == NULL)
+        if (TupIsNull(setopstate->leftInput.nextTupleSlot))
         {
-            outerslot = ExecProcNode(outerPlan);
-            if (!TupIsNull(outerslot))
-            {
-                /* Make a copy of the first input tuple */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-            }
-            else
-            {
-                /* outer plan produced no tuples at all */
-                setopstate->setop_done = true;
-                return NULL;
-            }
+            setopstate->setop_done = true;
+            return NULL;
         }

+        setopstate->rightInput.nextTupleSlot = ExecProcNode(innerPlan);
+
+        /* Set flags that we've not completed either side's group */
+        setopstate->leftInput.needGroup = true;
+        setopstate->rightInput.needGroup = true;
+    }
+
+    /*
+     * We loop retrieving groups until we find one we should return
+     */
+    while (!setopstate->setop_done)
+    {
+        int            cmpresult;
+        SetOpStatePerGroupData pergroup;
+
         /*
-         * Store the copied first input tuple in the tuple table slot reserved
-         * for it.  The tuple will be deleted when it is cleared from the
-         * slot.
+         * Fetch the rest of the current outer group, if we didn't already.
          */
-        ExecStoreHeapTuple(setopstate->grp_firstTuple,
-                           resultTupleSlot,
-                           true);
-        setopstate->grp_firstTuple = NULL;    /* don't keep two pointers */
+        if (setopstate->leftInput.needGroup)
+            setop_load_group(&setopstate->leftInput, outerPlan, setopstate);

-        /* Initialize working state for a new input tuple group */
-        initialize_counts(pergroup);
+        /*
+         * If no more outer groups, we're done, and don't need to look at any
+         * more of the inner relation.
+         */
+        if (setopstate->leftInput.numTuples == 0)
+        {
+            setopstate->setop_done = true;
+            break;
+        }

-        /* Count the first input tuple */
-        advance_counts(pergroup,
-                       fetch_tuple_flag(setopstate, resultTupleSlot));
+        /*
+         * Fetch the rest of the current inner group, if we didn't already.
+         */
+        if (setopstate->rightInput.needGroup)
+            setop_load_group(&setopstate->rightInput, innerPlan, setopstate);

         /*
-         * Scan the outer plan until we exhaust it or cross a group boundary.
+         * Determine whether we have matching groups on both sides (this is
+         * basically like the core logic of a merge join).
          */
-        for (;;)
-        {
-            outerslot = ExecProcNode(outerPlan);
-            if (TupIsNull(outerslot))
-            {
-                /* no more outer-plan tuples available */
-                setopstate->setop_done = true;
-                break;
-            }
-
-            /*
-             * Check whether we've crossed a group boundary.
-             */
-            econtext->ecxt_outertuple = resultTupleSlot;
-            econtext->ecxt_innertuple = outerslot;
-
-            if (!ExecQualAndReset(setopstate->eqfunction, econtext))
-            {
-                /*
-                 * Save the first input tuple of the next group.
-                 */
-                setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
-                break;
-            }
+        if (setopstate->rightInput.numTuples == 0)
+            cmpresult = -1;        /* as though left input is lesser */
+        else
+            cmpresult = setop_compare_slots(setopstate->leftInput.firstTupleSlot,
+                                            setopstate->rightInput.firstTupleSlot,
+                                            setopstate);

-            /* Still in same group, so count this tuple */
-            advance_counts(pergroup,
-                           fetch_tuple_flag(setopstate, outerslot));
+        if (cmpresult < 0)
+        {
+            /* Left group is first, has no right matches */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = 0;
+            /* We'll need another left group next time */
+            setopstate->leftInput.needGroup = true;
+        }
+        else if (cmpresult == 0)
+        {
+            /* We have matching groups */
+            pergroup.numLeft = setopstate->leftInput.numTuples;
+            pergroup.numRight = setopstate->rightInput.numTuples;
+            /* We'll need to read from both sides next time */
+            setopstate->leftInput.needGroup = true;
+            setopstate->rightInput.needGroup = true;
+        }
+        else
+        {
+            /* Right group has no left matches, so we can ignore it */
+            setopstate->rightInput.needGroup = true;
+            continue;
         }

         /*
-         * Done scanning input tuple group.  See if we should emit any copies
-         * of result tuple, and if so return the first copy.
+         * Done scanning these input tuple groups.  See if we should emit any
+         * copies of result tuple, and if so return the first copy.  (Note
+         * that the result tuple is the same as the left input's firstTuple
+         * slot.)
          */
-        set_output_count(setopstate, pergroup);
+        set_output_count(setopstate, &pergroup);

         if (setopstate->numOutput > 0)
         {
@@ -333,84 +315,173 @@ setop_retrieve_direct(SetOpState *setopstate)
 }

 /*
- * ExecSetOp for hashed case: phase 1, read input and build hash table
+ * Load next group of tuples from one child plan or the other.
+ *
+ * On entry, we've already read the first tuple of the next group
+ * (if there is one) into input->nextTupleSlot.  This invariant
+ * is maintained on exit.
+ */
+static void
+setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+                 SetOpState *setopstate)
+{
+    input->needGroup = false;
+
+    /* If we've exhausted this child plan, report an empty group */
+    if (TupIsNull(input->nextTupleSlot))
+    {
+        ExecClearTuple(input->firstTupleSlot);
+        input->numTuples = 0;
+        return;
+    }
+
+    /* Make a local copy of the first tuple for comparisons */
+    ExecStoreHeapTuple(ExecCopySlotHeapTuple(input->nextTupleSlot),
+                       input->firstTupleSlot,
+                       true);
+    /* and count it */
+    input->numTuples = 1;
+
+    /* Scan till we find the end-of-group */
+    for (;;)
+    {
+        int            cmpresult;
+
+        /* Get next input tuple, if there is one */
+        input->nextTupleSlot = ExecProcNode(inputPlan);
+        if (TupIsNull(input->nextTupleSlot))
+            break;
+
+        /* There is; does it belong to same group as firstTuple? */
+        cmpresult = setop_compare_slots(input->firstTupleSlot,
+                                        input->nextTupleSlot,
+                                        setopstate);
+        Assert(cmpresult <= 0); /* else input is mis-sorted */
+        if (cmpresult != 0)
+            break;
+
+        /* Still in same group, so count this tuple */
+        input->numTuples++;
+    }
+}
+
+/*
+ * Compare the tuples in the two given slots.
+ */
+static int
+setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+                    SetOpState *setopstate)
+{
+    for (int nkey = 0; nkey < setopstate->numCols; nkey++)
+    {
+        SortSupport sortKey = setopstate->sortKeys + nkey;
+        AttrNumber    attno = sortKey->ssup_attno;
+        Datum        datum1,
+                    datum2;
+        bool        isNull1,
+                    isNull2;
+        int            compare;
+
+        datum1 = slot_getattr(s1, attno, &isNull1);
+        datum2 = slot_getattr(s2, attno, &isNull2);
+
+        compare = ApplySortComparator(datum1, isNull1,
+                                      datum2, isNull2,
+                                      sortKey);
+        if (compare != 0)
+            return compare;
+    }
+    return 0;
+}
+
+/*
+ * ExecSetOp for hashed case: phase 1, read inputs and build hash table
  */
 static void
 setop_fill_hash_table(SetOpState *setopstate)
 {
-    SetOp       *node = (SetOp *) setopstate->ps.plan;
     PlanState  *outerPlan;
-    int            firstFlag;
-    bool        in_first_rel PG_USED_FOR_ASSERTS_ONLY;
+    PlanState  *innerPlan;
     ExprContext *econtext = setopstate->ps.ps_ExprContext;
+    bool        have_tuples = false;

     /*
      * get state info from node
      */
     outerPlan = outerPlanState(setopstate);
-    firstFlag = node->firstFlag;
-    /* verify planner didn't mess up */
-    Assert(firstFlag == 0 ||
-           (firstFlag == 1 &&
-            (node->cmd == SETOPCMD_INTERSECT ||
-             node->cmd == SETOPCMD_INTERSECT_ALL)));
+    innerPlan = innerPlanState(setopstate);

     /*
      * Process each outer-plan tuple, and then fetch the next one, until we
      * exhaust the outer plan.
      */
-    in_first_rel = true;
     for (;;)
     {
         TupleTableSlot *outerslot;
-        int            flag;
         TupleHashEntryData *entry;
         bool        isnew;

         outerslot = ExecProcNode(outerPlan);
         if (TupIsNull(outerslot))
             break;
+        have_tuples = true;

-        /* Identify whether it's left or right input */
-        flag = fetch_tuple_flag(setopstate, outerslot);
+        /* Find or build hashtable entry for this tuple's group */
+        entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+                                     &isnew, NULL);

-        if (flag == firstFlag)
+        /* If new tuple group, initialize counts to zero */
+        if (isnew)
         {
-            /* (still) in first input relation */
-            Assert(in_first_rel);
-
-            /* Find or build hashtable entry for this tuple's group */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
-                                         &isnew, NULL);
-
-            /* If new tuple group, initialize counts */
-            if (isnew)
-            {
-                entry->additional = (SetOpStatePerGroup)
-                    MemoryContextAlloc(setopstate->hashtable->tablecxt,
+            entry->additional = (SetOpStatePerGroup)
+                MemoryContextAllocZero(setopstate->hashtable->tablecxt,
                                        sizeof(SetOpStatePerGroupData));
-                initialize_counts((SetOpStatePerGroup) entry->additional);
-            }
-
-            /* Advance the counts */
-            advance_counts((SetOpStatePerGroup) entry->additional, flag);
         }
-        else
+
+        /* Advance the counts */
+        ((SetOpStatePerGroup) entry->additional)->numLeft++;
+
+        /* Must reset expression context after each hashtable lookup */
+        ResetExprContext(econtext);
+    }
+
+    /*
+     * If the outer relation is empty, then we will emit nothing, and we don't
+     * need to read the inner relation at all.
+     */
+    if (have_tuples)
+    {
+        /*
+         * Process each inner-plan tuple, and then fetch the next one, until
+         * we exhaust the inner plan.
+         */
+        for (;;)
         {
-            /* reached second relation */
-            in_first_rel = false;
+            TupleTableSlot *innerslot;
+            TupleHashEntryData *entry;
+
+            innerslot = ExecProcNode(innerPlan);
+            if (TupIsNull(innerslot))
+                break;
+
+            /* XXX hack: force the tuple into minimal form */
+            /* XXX should not have to do this */
+            ExecForceStoreHeapTuple(ExecCopySlotHeapTuple(innerslot),
+                                    setopstate->ps.ps_ResultTupleSlot,
+                                    true);
+            innerslot = setopstate->ps.ps_ResultTupleSlot;

             /* For tuples not seen previously, do not make hashtable entry */
-            entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+            entry = LookupTupleHashEntry(setopstate->hashtable, innerslot,
                                          NULL, NULL);

             /* Advance the counts if entry is already present */
             if (entry)
-                advance_counts((SetOpStatePerGroup) entry->additional, flag);
-        }
+                ((SetOpStatePerGroup) entry->additional)->numRight++;

-        /* Must reset expression context after each hashtable lookup */
-        ResetExprContext(econtext);
+            /* Must reset expression context after each hashtable lookup */
+            ResetExprContext(econtext);
+        }
     }

     setopstate->table_filled = true;
@@ -481,7 +552,6 @@ SetOpState *
 ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 {
     SetOpState *setopstate;
-    TupleDesc    outerDesc;

     /* check for unsupported flags */
     Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -494,14 +564,10 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     setopstate->ps.state = estate;
     setopstate->ps.ExecProcNode = ExecSetOp;

-    setopstate->eqfuncoids = NULL;
-    setopstate->hashfunctions = NULL;
     setopstate->setop_done = false;
     setopstate->numOutput = 0;
-    setopstate->pergroup = NULL;
-    setopstate->grp_firstTuple = NULL;
-    setopstate->hashtable = NULL;
-    setopstate->tableContext = NULL;
+    setopstate->numCols = node->numCols;
+    setopstate->need_init = true;

     /*
      * create expression context
@@ -522,52 +588,74 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * initialize child nodes
      *
-     * If we are hashing then the child plan does not need to handle REWIND
+     * If we are hashing then the child plans do not need to handle REWIND
      * efficiently; see ExecReScanSetOp.
      */
     if (node->strategy == SETOP_HASHED)
         eflags &= ~EXEC_FLAG_REWIND;
     outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
-    outerDesc = ExecGetResultType(outerPlanState(setopstate));
+    innerPlanState(setopstate) = ExecInitNode(innerPlan(node), estate, eflags);

     /*
-     * Initialize result slot and type. Setop nodes do no projections, so
-     * initialize projection info for this node appropriately.
+     * Initialize locally-allocated slots.  In hashed mode, we just need a
+     * result slot.  In sorted mode, we need one first-tuple-of-group slot for
+     * each input; we use the result slot for the left input's slot and create
+     * another for the right input.  (Note: the nextTupleSlot slots are not
+     * ours, but just point to the last slot returned by the input plan node.)
      */
     ExecInitResultTupleSlotTL(&setopstate->ps,
                               node->strategy == SETOP_HASHED ?
                               &TTSOpsMinimalTuple : &TTSOpsHeapTuple);
+    if (node->strategy != SETOP_HASHED)
+    {
+        setopstate->leftInput.firstTupleSlot =
+            setopstate->ps.ps_ResultTupleSlot;
+        setopstate->rightInput.firstTupleSlot =
+            ExecInitExtraTupleSlot(estate,
+                                   setopstate->ps.ps_ResultTupleDesc,
+                                   &TTSOpsHeapTuple);
+    }
+
+    /* Setop nodes do no projections. */
     setopstate->ps.ps_ProjInfo = NULL;

     /*
-     * Precompute fmgr lookup data for inner loop. We need both equality and
-     * hashing functions to do it by hashing, but only equality if not
-     * hashing.
+     * Precompute fmgr lookup data for inner loop.  We need equality and
+     * hashing functions to do it by hashing, while for sorting we need
+     * SortSupport data.
      */
     if (node->strategy == SETOP_HASHED)
         execTuplesHashPrepare(node->numCols,
-                              node->dupOperators,
+                              node->cmpOperators,
                               &setopstate->eqfuncoids,
                               &setopstate->hashfunctions);
     else
-        setopstate->eqfunction =
-            execTuplesMatchPrepare(outerDesc,
-                                   node->numCols,
-                                   node->dupColIdx,
-                                   node->dupOperators,
-                                   node->dupCollations,
-                                   &setopstate->ps);
+    {
+        int            nkeys = node->numCols;
+
+        setopstate->sortKeys = (SortSupport)
+            palloc0(nkeys * sizeof(SortSupportData));
+        for (int i = 0; i < nkeys; i++)
+        {
+            SortSupport sortKey = setopstate->sortKeys + i;
+
+            sortKey->ssup_cxt = CurrentMemoryContext;
+            sortKey->ssup_collation = node->cmpCollations[i];
+            sortKey->ssup_nulls_first = node->cmpNullsFirst[i];
+            sortKey->ssup_attno = node->cmpColIdx[i];
+            /* abbreviated key conversion is not useful here */
+            sortKey->abbreviate = false;

+            PrepareSortSupportFromOrderingOp(node->cmpOperators[i], sortKey);
+        }
+    }
+
+    /* Create a hash table if needed */
     if (node->strategy == SETOP_HASHED)
     {
         build_hash_table(setopstate);
         setopstate->table_filled = false;
     }
-    else
-    {
-        setopstate->pergroup =
-            (SetOpStatePerGroup) palloc0(sizeof(SetOpStatePerGroupData));
-    }

     return setopstate;
 }
@@ -575,7 +663,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 /* ----------------------------------------------------------------
  *        ExecEndSetOp
  *
- *        This shuts down the subplan and frees resources allocated
+ *        This shuts down the subplans and frees resources allocated
  *        to this node.
  * ----------------------------------------------------------------
  */
@@ -587,6 +675,7 @@ ExecEndSetOp(SetOpState *node)
         MemoryContextDelete(node->tableContext);

     ExecEndNode(outerPlanState(node));
+    ExecEndNode(innerPlanState(node));
 }


@@ -594,6 +683,7 @@ void
 ExecReScanSetOp(SetOpState *node)
 {
     PlanState  *outerPlan = outerPlanState(node);
+    PlanState  *innerPlan = innerPlanState(node);

     ExecClearTuple(node->ps.ps_ResultTupleSlot);
     node->setop_done = false;
@@ -611,34 +701,29 @@ ExecReScanSetOp(SetOpState *node)
             return;

         /*
-         * If we do have the hash table and the subplan does not have any
+         * If we do have the hash table and the subplans do not have any
          * parameter changes, then we can just rescan the existing hash table;
          * no need to build it again.
          */
-        if (outerPlan->chgParam == NULL)
+        if (outerPlan->chgParam == NULL && innerPlan->chgParam == NULL)
         {
             ResetTupleHashIterator(node->hashtable, &node->hashiter);
             return;
         }
-    }
-
-    /* Release first tuple of group, if we have made a copy */
-    if (node->grp_firstTuple != NULL)
-    {
-        heap_freetuple(node->grp_firstTuple);
-        node->grp_firstTuple = NULL;
-    }

-    /* Release any hashtable storage */
-    if (node->tableContext)
-        MemoryContextReset(node->tableContext);
+        /* Release any hashtable storage */
+        if (node->tableContext)
+            MemoryContextReset(node->tableContext);

-    /* And rebuild empty hashtable if needed */
-    if (((SetOp *) node->ps.plan)->strategy == SETOP_HASHED)
-    {
+        /* And rebuild an empty hashtable */
         ResetTupleHashTable(node->hashtable);
         node->table_filled = false;
     }
+    else
+    {
+        /* Need to re-read first input from each side */
+        node->need_init = true;
+    }

     /*
      * if chgParam of subnode is not null then plan will be re-scanned by
@@ -646,4 +731,6 @@ ExecReScanSetOp(SetOpState *node)
      */
     if (outerPlan->chgParam == NULL)
         ExecReScan(outerPlan);
+    if (innerPlan->chgParam == NULL)
+        ExecReScan(innerPlan);
 }
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 2ab4f3dbf3..f341d9f303 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -649,7 +649,7 @@ RelOptInfo      - a relation or joined relations
   GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
   MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
   WindowAggPath - a WindowAgg plan node applied to some sub-path
-  SetOpPath     - a SetOp plan node applied to some sub-path
+  SetOpPath     - a SetOp plan node applied to two sub-paths
   RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
   LockRowsPath  - a LockRows plan node applied to some sub-path
   ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f2ed0d81f6..07976c5938 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -301,9 +301,9 @@ static Unique *make_unique_from_pathkeys(Plan *lefttree,
                                          List *pathkeys, int numCols);
 static Gather *make_gather(List *qptlist, List *qpqual,
                            int nworkers, int rescan_param, bool single_copy, Plan *subplan);
-static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-                         List *distinctList, AttrNumber flagColIdx, int firstFlag,
-                         long numGroups);
+static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+                         List *tlist, Plan *lefttree, Plan *righttree,
+                         List *groupList, long numGroups);
 static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
 static Result *make_result(List *tlist, Node *resconstantqual, Plan *subplan);
 static ProjectSet *make_project_set(List *tlist, Plan *subplan);
@@ -2719,25 +2719,29 @@ static SetOp *
 create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 {
     SetOp       *plan;
-    Plan       *subplan;
+    List       *tlist = build_path_tlist(root, &best_path->path);
+    Plan       *leftplan;
+    Plan       *rightplan;
     long        numGroups;

     /*
      * SetOp doesn't project, so tlist requirements pass through; moreover we
      * need grouping columns to be labeled.
      */
-    subplan = create_plan_recurse(root, best_path->subpath,
-                                  flags | CP_LABEL_TLIST);
+    leftplan = create_plan_recurse(root, best_path->leftpath,
+                                   flags | CP_LABEL_TLIST);
+    rightplan = create_plan_recurse(root, best_path->rightpath,
+                                    flags | CP_LABEL_TLIST);

     /* Convert numGroups to long int --- but 'ware overflow! */
     numGroups = clamp_cardinality_to_long(best_path->numGroups);

     plan = make_setop(best_path->cmd,
                       best_path->strategy,
-                      subplan,
-                      best_path->distinctList,
-                      best_path->flagColIdx,
-                      best_path->firstFlag,
+                      tlist,
+                      leftplan,
+                      rightplan,
+                      best_path->groupList,
                       numGroups);

     copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6952,57 +6956,62 @@ make_gather(List *qptlist,
 }

 /*
- * distinctList is a list of SortGroupClauses, identifying the targetlist
- * items that should be considered by the SetOp filter.  The input path must
- * already be sorted accordingly.
+ * groupList is a list of SortGroupClauses, identifying the targetlist
+ * items that should be considered by the SetOp filter.  The input plans must
+ * already be sorted accordingly, if we're doing SETOP_SORTED mode.
  */
 static SetOp *
-make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
-           List *distinctList, AttrNumber flagColIdx, int firstFlag,
-           long numGroups)
+make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+           List *tlist, Plan *lefttree, Plan *righttree,
+           List *groupList, long numGroups)
 {
     SetOp       *node = makeNode(SetOp);
     Plan       *plan = &node->plan;
-    int            numCols = list_length(distinctList);
+    int            numCols = list_length(groupList);
     int            keyno = 0;
-    AttrNumber *dupColIdx;
-    Oid           *dupOperators;
-    Oid           *dupCollations;
+    AttrNumber *cmpColIdx;
+    Oid           *cmpOperators;
+    Oid           *cmpCollations;
+    bool       *cmpNullsFirst;
     ListCell   *slitem;

-    plan->targetlist = lefttree->targetlist;
+    plan->targetlist = tlist;
     plan->qual = NIL;
     plan->lefttree = lefttree;
-    plan->righttree = NULL;
+    plan->righttree = righttree;

     /*
-     * convert SortGroupClause list into arrays of attr indexes and equality
+     * convert SortGroupClause list into arrays of attr indexes and comparison
      * operators, as wanted by executor
      */
-    dupColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
-    dupOperators = (Oid *) palloc(sizeof(Oid) * numCols);
-    dupCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
+    cmpOperators = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+    cmpNullsFirst = (bool *) palloc(sizeof(bool) * numCols);

-    foreach(slitem, distinctList)
+    foreach(slitem, groupList)
     {
         SortGroupClause *sortcl = (SortGroupClause *) lfirst(slitem);
         TargetEntry *tle = get_sortgroupclause_tle(sortcl, plan->targetlist);

-        dupColIdx[keyno] = tle->resno;
-        dupOperators[keyno] = sortcl->eqop;
-        dupCollations[keyno] = exprCollation((Node *) tle->expr);
-        Assert(OidIsValid(dupOperators[keyno]));
+        cmpColIdx[keyno] = tle->resno;
+        if (strategy == SETOP_HASHED)
+            cmpOperators[keyno] = sortcl->eqop;
+        else
+            cmpOperators[keyno] = sortcl->sortop;
+        Assert(OidIsValid(cmpOperators[keyno]));
+        cmpCollations[keyno] = exprCollation((Node *) tle->expr);
+        cmpNullsFirst[keyno] = sortcl->nulls_first;
         keyno++;
     }

     node->cmd = cmd;
     node->strategy = strategy;
     node->numCols = numCols;
-    node->dupColIdx = dupColIdx;
-    node->dupOperators = dupOperators;
-    node->dupCollations = dupCollations;
-    node->flagColIdx = flagColIdx;
-    node->firstFlag = firstFlag;
+    node->cmpColIdx = cmpColIdx;
+    node->cmpOperators = cmpOperators;
+    node->cmpCollations = cmpCollations;
+    node->cmpNullsFirst = cmpNullsFirst;
     node->numGroups = numGroups;

     return node;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f2b0bcfdf5..bd0ca000e8 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -65,7 +65,7 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
 static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *input_path,
+                                Path *lpath, Path *rpath,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
@@ -315,8 +315,8 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * to the corresponding tlist entries of the subplan. However, since
          * the subplan was generated by generate_union_paths() or
          * generate_nonunion_paths(), and hence its tlist was generated by
-         * generate_append_tlist(), this will work.  We just tell
-         * generate_setop_tlist() to use varno 0.
+         * generate_append_tlist() or generate_setop_tlist(), this will work.
+         * We just tell generate_setop_tlist() to use varno 0.
          */
         if (flag >= 0 ||
             !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
@@ -1028,29 +1028,27 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                *path;
     List       *lpath_tlist,
                *rpath_tlist,
-               *tlist_list,
                *tlist,
-               *groupList,
-               *pathlist;
+               *groupList;
     bool        lpath_trivial_tlist,
-                rpath_trivial_tlist;
+                rpath_trivial_tlist,
+                result_trivial_tlist;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
     bool        use_hash;
     SetOpCmd    cmd;
-    int            firstFlag;

     /*
      * Tell children to fetch all tuples.
      */
     root->tuple_fraction = 0.0;

-    /* Recurse on children, ensuring their outputs are marked */
+    /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 0,
+                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,10 +1058,9 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dLeftGroups = lrel->rows;

-    lpath = lrel->cheapest_total_path;
     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, 1,
+                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1073,41 +1070,51 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     else
         dRightGroups = rrel->rows;

-    rpath = rrel->cheapest_total_path;
-
     /* Undo effects of forcing tuple_fraction to 0 */
     root->tuple_fraction = save_fraction;

     /*
      * For EXCEPT, we must put the left input first.  For INTERSECT, either
      * order should give the same results, and we prefer to put the smaller
-     * input first in order to minimize the size of the hash table in the
-     * hashing case.  "Smaller" means the one with the fewer groups.
+     * input first in order to (a) minimize the size of the hash table in the
+     * hashing case, and (b) improve our chances of exploiting the executor's
+     * fast path for empty left-hand input.  "Smaller" means the one with the
+     * fewer groups.
      */
-    if (op->op == SETOP_EXCEPT || dLeftGroups <= dRightGroups)
-    {
-        pathlist = list_make2(lpath, rpath);
-        tlist_list = list_make2(lpath_tlist, rpath_tlist);
-        firstFlag = 0;
-    }
-    else
+    if (op->op != SETOP_EXCEPT && dLeftGroups > dRightGroups)
     {
-        pathlist = list_make2(rpath, lpath);
-        tlist_list = list_make2(rpath_tlist, lpath_tlist);
-        firstFlag = 1;
+        /* need to swap the two inputs */
+        RelOptInfo *tmprel;
+        List       *tmplist;
+        double        tmpd;
+
+        tmprel = lrel;
+        lrel = rrel;
+        rrel = tmprel;
+        tmplist = lpath_tlist;
+        lpath_tlist = rpath_tlist;
+        rpath_tlist = tmplist;
+        tmpd = dLeftGroups;
+        dLeftGroups = dRightGroups;
+        dRightGroups = tmpd;
     }

+    lpath = lrel->cheapest_total_path;
+    rpath = rrel->cheapest_total_path;
+
     /*
-     * Generate tlist for Append plan node.
+     * Generate tlist for SetOp plan node.
      *
-     * The tlist for an Append plan isn't important as far as the Append is
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
      * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.  In fact, it has to be real enough that the flag
-     * column is shown as a variable not a constant, else setrefs.c will get
-     * confused.
+     * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, true,
-                                  tlist_list, refnames_tlist);
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);

     *pTargetList = tlist;

@@ -1116,12 +1123,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /*
-     * Append the child results together.
-     */
-    path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
-                                       NIL, NULL, 0, false, -1);
-
     /* Identify the grouping semantics */
     groupList = generate_setop_grouplist(op, tlist);

@@ -1140,25 +1141,40 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     else
     {
-        dNumGroups = Min(dLeftGroups, dRightGroups);
+        dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }

     /*
-     * Decide whether to hash or sort, and add a sort node if needed.
+     * Decide whether to hash or sort, and add sort nodes if needed.
      */
-    use_hash = choose_hashed_setop(root, groupList, path,
+    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
                                    dNumGroups, dNumOutputRows,
                                    (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");

     if (groupList && !use_hash)
-        path = (Path *) create_sort_path(root,
-                                         result_rel,
-                                         path,
-                                         make_pathkeys_for_sortclauses(root,
-                                                                       groupList,
-                                                                       tlist),
-                                         -1.0);
+    {
+        List       *pathkeys;
+
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 lpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            lpath = (Path *) create_sort_path(root,
+                                              lpath->parent,
+                                              lpath,
+                                              pathkeys,
+                                              -1.0);
+        pathkeys = make_pathkeys_for_sortclauses(root,
+                                                 groupList,
+                                                 rpath_tlist);
+        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            rpath = (Path *) create_sort_path(root,
+                                              rpath->parent,
+                                              rpath,
+                                              pathkeys,
+                                              -1.0);
+    }

     /*
      * Finally, add a SetOp path node to generate the correct output.
@@ -1178,12 +1194,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }
     path = (Path *) create_setop_path(root,
                                       result_rel,
-                                      path,
+                                      lpath,
+                                      rpath,
                                       cmd,
                                       use_hash ? SETOP_HASHED : SETOP_SORTED,
                                       groupList,
-                                      list_length(op->colTypes) + 1,
-                                      use_hash ? firstFlag : -1,
                                       dNumGroups,
                                       dNumOutputRows);

@@ -1285,10 +1300,13 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)

 /*
  * choose_hashed_setop - should we use hashing for a set operation?
+ *
+ * XXX probably this should go away: just make both paths and let
+ * add_path sort it out.
  */
 static bool
 choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *input_path,
+                    Path *lpath, Path *rpath,
                     double dNumGroups, double dNumOutputRows,
                     const char *construct)
 {
@@ -1327,7 +1345,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * Don't do it if it doesn't look like the hashtable will fit into
      * hash_mem.
      */
-    hashentrysize = MAXALIGN(input_path->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
+    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);

     if (hashentrysize * dNumGroups > hash_mem_limit)
         return false;
@@ -1336,9 +1354,9 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      * See if the estimated cost is no more than doing it the other way.
      *
      * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group.  Note that the actual result plan might involve a SetOp or
-     * Unique node, not Agg or Group, but the cost estimates for Agg and Group
-     * should be close enough for our purposes here.
+     * group. XXX NOT TRUE: Note that the actual result plan might involve a
+     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
+     * and Group should be close enough for our purposes here.
      *
      * These path variables are dummies that just hold cost fields; we don't
      * make actual Paths for these steps.
@@ -1346,27 +1364,31 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
     cost_agg(&hashed_p, root, AGG_HASHED, NULL,
              numGroupCols, dNumGroups,
              NIL,
-             input_path->disabled_nodes,
-             input_path->startup_cost, input_path->total_cost,
-             input_path->rows, input_path->pathtarget->width);
+             lpath->disabled_nodes + rpath->disabled_nodes,
+             lpath->startup_cost + rpath->startup_cost,
+             lpath->total_cost + rpath->total_cost,
+             lpath->rows + rpath->rows,
+             lpath->pathtarget->width);

     /*
-     * Now for the sorted case.  Note that the input is *always* unsorted,
-     * since it was made by appending unrelated sub-relations together.
+     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
+     * unsorted, since it was made by appending unrelated sub-relations
+     * together.
      */
-    sorted_p.disabled_nodes = input_path->disabled_nodes;
-    sorted_p.startup_cost = input_path->startup_cost;
-    sorted_p.total_cost = input_path->total_cost;
+    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
+    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
+    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
     /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
     cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
               sorted_p.total_cost,
-              input_path->rows, input_path->pathtarget->width,
+              lpath->rows + rpath->rows,
+              lpath->pathtarget->width,
               0.0, work_mem, -1.0);
     cost_group(&sorted_p, root, numGroupCols, dNumGroups,
                NIL,
                sorted_p.disabled_nodes,
                sorted_p.startup_cost, sorted_p.total_cost,
-               input_path->rows);
+               lpath->rows + rpath->rows);

     /*
      * Now make the decision using the top-level tuple fraction.  First we
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fc97bf6ee2..e52e4b1d67 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3634,25 +3634,26 @@ create_windowagg_path(PlannerInfo *root,
  *      Creates a pathnode that represents computation of INTERSECT or EXCEPT
  *
  * 'rel' is the parent relation associated with the result
- * 'subpath' is the path representing the source of data
+ * 'leftpath' is the path representing the left-hand source of data
+ * 'rightpath' is the path representing the right-hand source of data
  * 'cmd' is the specific semantics (INTERSECT or EXCEPT, with/without ALL)
  * 'strategy' is the implementation strategy (sorted or hashed)
- * 'distinctList' is a list of SortGroupClause's representing the grouping
- * 'flagColIdx' is the column number where the flag column will be, if any
- * 'firstFlag' is the flag value for the first input relation when hashing;
- *        or -1 when sorting
- * 'numGroups' is the estimated number of distinct groups
+ * 'groupList' is a list of SortGroupClause's representing the grouping
+ * 'numGroups' is the estimated number of distinct groups in left-hand input
  * 'outputRows' is the estimated number of output rows
+ *
+ * leftpath and rightpath must produce the same columns.  Moreover, if
+ * strategy is SETOP_SORTED, leftpath and rightpath must both be sorted
+ * by all the grouping columns.
  */
 SetOpPath *
 create_setop_path(PlannerInfo *root,
                   RelOptInfo *rel,
-                  Path *subpath,
+                  Path *leftpath,
+                  Path *rightpath,
                   SetOpCmd cmd,
                   SetOpStrategy strategy,
-                  List *distinctList,
-                  AttrNumber flagColIdx,
-                  int firstFlag,
+                  List *groupList,
                   double numGroups,
                   double outputRows)
 {
@@ -3660,34 +3661,37 @@ create_setop_path(PlannerInfo *root,

     pathnode->path.pathtype = T_SetOp;
     pathnode->path.parent = rel;
-    /* SetOp doesn't project, so use source path's pathtarget */
-    pathnode->path.pathtarget = subpath->pathtarget;
+    pathnode->path.pathtarget = rel->reltarget;
     /* For now, assume we are above any joins, so no parameterization */
     pathnode->path.param_info = NULL;
     pathnode->path.parallel_aware = false;
     pathnode->path.parallel_safe = rel->consider_parallel &&
-        subpath->parallel_safe;
-    pathnode->path.parallel_workers = subpath->parallel_workers;
+        leftpath->parallel_safe && rightpath->parallel_safe;
+    pathnode->path.parallel_workers =
+        leftpath->parallel_workers + rightpath->parallel_workers;
     /* SetOp preserves the input sort order if in sort mode */
     pathnode->path.pathkeys =
-        (strategy == SETOP_SORTED) ? subpath->pathkeys : NIL;
+        (strategy == SETOP_SORTED) ? leftpath->pathkeys : NIL;

-    pathnode->subpath = subpath;
+    pathnode->leftpath = leftpath;
+    pathnode->rightpath = rightpath;
     pathnode->cmd = cmd;
     pathnode->strategy = strategy;
-    pathnode->distinctList = distinctList;
-    pathnode->flagColIdx = flagColIdx;
-    pathnode->firstFlag = firstFlag;
+    pathnode->groupList = groupList;
     pathnode->numGroups = numGroups;

     /*
      * Charge one cpu_operator_cost per comparison per input tuple. We assume
      * all columns get compared at most of the tuples.
+     *
+     * XXX all wrong for hashing
      */
-    pathnode->path.disabled_nodes = subpath->disabled_nodes;
-    pathnode->path.startup_cost = subpath->startup_cost;
-    pathnode->path.total_cost = subpath->total_cost +
-        cpu_operator_cost * subpath->rows * list_length(distinctList);
+    pathnode->path.disabled_nodes =
+        leftpath->disabled_nodes + rightpath->disabled_nodes;
+    pathnode->path.startup_cost =
+        leftpath->startup_cost + rightpath->startup_cost;
+    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
+        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 182a6956bb..a730c6ec6d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2806,27 +2806,34 @@ typedef struct HashState
 /* ----------------
  *     SetOpState information
  *
- *        Even in "sorted" mode, SetOp nodes are more complex than a simple
- *        Unique, since we have to count how many duplicates to return.  But
- *        we also support hashing, so this is really more like a cut-down
- *        form of Agg.
+ *        SetOp nodes support either sorted or hashed de-duplication.
+ *        The sorted mode is a bit like MergeJoin, the hashed mode like Agg.
  * ----------------
  */
-/* this struct is private in nodeSetOp.c: */
-typedef struct SetOpStatePerGroupData *SetOpStatePerGroup;
+typedef struct SetOpStatePerInput
+{
+    TupleTableSlot *firstTupleSlot; /* first tuple of current group */
+    int64        numTuples;        /* number of tuples in current group */
+    TupleTableSlot *nextTupleSlot;    /* next input tuple, if already read */
+    bool        needGroup;        /* do we need to load a new group? */
+} SetOpStatePerInput;

 typedef struct SetOpState
 {
     PlanState    ps;                /* its first field is NodeTag */
-    ExprState  *eqfunction;        /* equality comparator */
-    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
-    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     bool        setop_done;        /* indicates completion of output scan */
-    long        numOutput;        /* number of dups left to output */
+    int64        numOutput;        /* number of dups left to output */
+    int            numCols;        /* number of grouping columns */
+
     /* these fields are used in SETOP_SORTED mode: */
-    SetOpStatePerGroup pergroup;    /* per-group working state */
-    HeapTuple    grp_firstTuple; /* copy of first tuple of current group */
+    SortSupport sortKeys;        /* per-grouping-field sort data */
+    SetOpStatePerInput leftInput;    /* current outer-relation input state */
+    SetOpStatePerInput rightInput;    /* current inner-relation input state */
+    bool        need_init;        /* have we read the first tuples yet? */
+
     /* these fields are used in SETOP_HASHED mode: */
+    Oid           *eqfuncoids;        /* per-grouping-field equality fns */
+    FmgrInfo   *hashfunctions;    /* per-grouping-field hash fns */
     TupleHashTable hashtable;    /* hash table with one entry per group */
     MemoryContext tableContext; /* memory context containing hash table */
     bool        table_filled;    /* hash table filled yet? */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index add0f9e45f..da3bdbab27 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2340,13 +2340,12 @@ typedef struct WindowAggPath
 typedef struct SetOpPath
 {
     Path        path;
-    Path       *subpath;        /* path representing input source */
+    Path       *leftpath;        /* paths representing input sources */
+    Path       *rightpath;
     SetOpCmd    cmd;            /* what to do, see nodes.h */
     SetOpStrategy strategy;        /* how to do it, see nodes.h */
-    List       *distinctList;    /* SortGroupClauses identifying target cols */
-    AttrNumber    flagColIdx;        /* where is the flag column, if any */
-    int            firstFlag;        /* flag value for first input relation */
-    Cardinality numGroups;        /* estimated number of groups in input */
+    List       *groupList;        /* SortGroupClauses identifying target cols */
+    Cardinality numGroups;        /* estimated number of groups in left input */
 } SetOpPath;

 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 52f29bcdb6..4633121689 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1225,23 +1225,20 @@ typedef struct SetOp
     /* how to do it, see nodes.h */
     SetOpStrategy strategy;

-    /* number of columns to check for duplicate-ness */
+    /* number of columns to compare */
     int            numCols;

     /* their indexes in the target list */
-    AttrNumber *dupColIdx pg_node_attr(array_size(numCols));
+    AttrNumber *cmpColIdx pg_node_attr(array_size(numCols));

-    /* equality operators to compare with */
-    Oid           *dupOperators pg_node_attr(array_size(numCols));
-    Oid           *dupCollations pg_node_attr(array_size(numCols));
-
-    /* where is the flag column, if any */
-    AttrNumber    flagColIdx;
+    /* comparison operators (either equality operators or sort operators) */
+    Oid           *cmpOperators pg_node_attr(array_size(numCols));
+    Oid           *cmpCollations pg_node_attr(array_size(numCols));

-    /* flag value for first input relation */
-    int            firstFlag;
+    /* nulls-first flags if sorting, otherwise not interesting */
+    bool       *cmpNullsFirst pg_node_attr(array_size(numCols));

-    /* estimated number of groups in input */
+    /* estimated number of groups in left input */
     long        numGroups;
 } SetOp;

diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 1035e6560c..5a6d0350c1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -260,12 +260,11 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
                                             bool topwindow);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
                                     RelOptInfo *rel,
-                                    Path *subpath,
+                                    Path *leftpath,
+                                    Path *rightpath,
                                     SetOpCmd cmd,
                                     SetOpStrategy strategy,
-                                    List *distinctList,
-                                    AttrNumber flagColIdx,
-                                    int firstFlag,
+                                    List *groupList,
                                     double numGroups,
                                     double outputRows);
 extern RecursiveUnionPath *create_recursiveunion_path(PlannerInfo *root,
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 2d35de3fad..154fffabae 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1231,22 +1231,18 @@ select count(*) from
     select * from onek i2 where i2.unique1 = o.unique2
   ) ss
 where o.ten = 1;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
                Filter: (ten = 1)
-         ->  Subquery Scan on ss
-               ->  HashSetOp Except
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Scan using onek_unique1 on onek i1
-                                       Index Cond: (unique1 = o.unique1)
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Index Scan using onek_unique1 on onek i2
-                                       Index Cond: (unique1 = o.unique2)
-(13 rows)
+         ->  HashSetOp Except
+               ->  Index Scan using onek_unique1 on onek i1
+                     Index Cond: (unique1 = o.unique1)
+               ->  Index Scan using onek_unique1 on onek i2
+                     Index Cond: (unique1 = o.unique2)
+(9 rows)

 select count(*) from
   onek o cross join lateral (
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index c73631a9a1..fbf24490e9 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -370,17 +370,13 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                     QUERY PLAN
-------------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  HashSetOp Intersect
-               ->  Append
-                     ->  Subquery Scan on "*SELECT* 2"
-                           ->  Seq Scan on tenk1
-                     ->  Subquery Scan on "*SELECT* 1"
-                           ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+   ->  HashSetOp Intersect
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(4 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -391,16 +387,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Index Only Scan using tenk1_unique1 on tenk1
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                     Filter: (unique2 <> 10)
-(7 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -434,19 +427,17 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                                        QUERY PLAN
-------------------------------------------------------------------------------------------
+                               QUERY PLAN
+------------------------------------------------------------------------
  Aggregate
-   ->  Subquery Scan on ss
-         ->  SetOp Intersect
-               ->  Sort
-                     Sort Key: "*SELECT* 2".fivethous
-                     ->  Append
-                           ->  Subquery Scan on "*SELECT* 2"
-                                 ->  Seq Scan on tenk1
-                           ->  Subquery Scan on "*SELECT* 1"
-                                 ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(10 rows)
+   ->  SetOp Intersect
+         ->  Sort
+               Sort Key: tenk1.fivethous
+               ->  Seq Scan on tenk1
+         ->  Sort
+               Sort Key: tenk1_1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(8 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -457,18 +448,17 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                                  QUERY PLAN
-------------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".unique1
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Index Only Scan using tenk1_unique1 on tenk1
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-                           Filter: (unique2 <> 10)
-(9 rows)
+         Sort Key: tenk1.unique1
+         ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Sort
+         Sort Key: tenk1_1.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+               Filter: (unique2 <> 10)
+(8 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
@@ -528,15 +518,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4]))
_(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4]))
_(x);
    x
@@ -546,15 +533,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4]))
_(x);
-                  QUERY PLAN
------------------------------------------------
+            QUERY PLAN
+-----------------------------------
  HashSetOp Except
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Values Scan on "*VALUES*"
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Values Scan on "*VALUES*_1"
-(6 rows)
+   ->  Values Scan on "*VALUES*"
+   ->  Values Scan on "*VALUES*_1"
+(3 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4]))
_(x);
    x
@@ -606,17 +590,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4]))
_(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4]))
_(x);
    x
@@ -626,17 +609,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from

 explain (costs off)
 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4]))
_(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4]))
_(x);
    x
@@ -669,17 +651,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -689,17 +670,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -777,17 +757,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Intersect
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -797,17 +776,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va

 explain (costs off)
 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
-                     QUERY PLAN
------------------------------------------------------
+               QUERY PLAN
+-----------------------------------------
  SetOp Except
    ->  Sort
-         Sort Key: "*SELECT* 1".x
-         ->  Append
-               ->  Subquery Scan on "*SELECT* 1"
-                     ->  Values Scan on "*VALUES*"
-               ->  Subquery Scan on "*SELECT* 2"
-                     ->  Values Scan on "*VALUES*_1"
-(8 rows)
+         Sort Key: "*VALUES*".column1
+         ->  Values Scan on "*VALUES*"
+   ->  Sort
+         Sort Key: "*VALUES*_1".column1
+         ->  Values Scan on "*VALUES*_1"
+(7 rows)

 select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
    x
@@ -970,15 +948,12 @@ set enable_sort = false;
 -- no means to disable Unique.
 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  HashSetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union all select from generate_series(1,3);
 --
@@ -1015,15 +990,12 @@ select from generate_series(1,5) union select from generate_series(1,3);

 explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
-                              QUERY PLAN
-----------------------------------------------------------------------
+                        QUERY PLAN
+----------------------------------------------------------
  SetOp Intersect
-   ->  Append
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Function Scan on generate_series
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Function Scan on generate_series generate_series_1
-(6 rows)
+   ->  Function Scan on generate_series
+   ->  Function Scan on generate_series generate_series_1
+(3 rows)

 select from generate_series(1,5) union select from generate_series(1,3);
 --
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 08521d51a9..e5588c9cd9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2610,6 +2610,8 @@ SetOpCmd
 SetOpPath
 SetOpState
 SetOpStatePerGroup
+SetOpStatePerGroupData
+SetOpStatePerInput
 SetOpStrategy
 SetOperation
 SetOperationStmt
--
2.43.5

From fd8bb0da496965ef5bb40f759fb35a550a62345f Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:30:36 -0500
Subject: [PATCH v2 2/5] Remove some dead code in prepunion.c.

Now that we no longer use flag columns for SetOp, remove the
code that supported that.

This also means that there are no resjunk output columns anymore
in a set-operations plan tree.  So we can get rid of some code
that supported that, too.

It would probably be possible now to do what's speculated about in
the comments, and get rid of this code's use of transiently-created
targetlists in favor of dealing only in pathtargets (save for
generating root->processed_tlist at the very end).  However, that
would involve a good deal of notational churn and the benefit
would only be to save a little planning time.  So I'll leave that
project for some other day.
---
 src/backend/optimizer/prep/prepunion.c | 96 ++++----------------------
 1 file changed, 14 insertions(+), 82 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index bd0ca000e8..977bd546e0 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -40,8 +40,7 @@

 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
                                           List *colTypes, List *colCollations,
-                                          bool junkOK,
-                                          int flag, List *refnames_tlist,
+                                          List *refnames_tlist,
                                           List **pTargetList,
                                           bool *istrivial_tlist);
 static RelOptInfo *generate_recursion_path(SetOperationStmt *setOp,
@@ -69,14 +68,12 @@ static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
                                 double dNumGroups, double dNumOutputRows,
                                 const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
-                                  int flag,
                                   Index varno,
                                   bool hack_constants,
                                   List *input_tlist,
                                   List *refnames_tlist,
                                   bool *trivial_tlist);
 static List *generate_append_tlist(List *colTypes, List *colCollations,
-                                   bool flag,
                                    List *input_tlists,
                                    List *refnames_tlist);
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
@@ -160,12 +157,10 @@ plan_set_operations(PlannerInfo *root)
         /*
          * Recurse on setOperations tree to generate paths for set ops. The
          * final output paths should have just the column types shown as the
-         * output from the top-level node, plus possibly resjunk working
-         * columns (we can rely on upper-level nodes to deal with that).
+         * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
                                            topop->colTypes, topop->colCollations,
-                                           true, -1,
                                            leftmostQuery->targetList,
                                            &top_tlist,
                                            &trivial_tlist);
@@ -208,8 +203,6 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * junkOK: if true, child resjunk columns may be left in the result
- * flag: if >= 0, add a resjunk output column indicating value of flag
  * refnames_tlist: targetlist to take column names from
  *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
@@ -220,7 +213,8 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
- * Pending a redesign of how that works, this is the easy way out.
+ * XXX Now that there are no flag columns and hence no resjunk columns, we
+ * could probably refactor this file to deal only in pathtargets.
  *
  * We don't have to care about typmods here: the only allowed difference
  * between set-op input and output typmods is input is a specific typmod
@@ -229,8 +223,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
                        List *colTypes, List *colCollations,
-                       bool junkOK,
-                       int flag, List *refnames_tlist,
+                       List *refnames_tlist,
                        List **pTargetList,
                        bool *istrivial_tlist)
 {
@@ -279,7 +272,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,

         /* Figure out the appropriate target list for this subquery. */
         tlist = generate_setop_tlist(colTypes, colCollations,
-                                     flag,
                                      rtr->rtindex,
                                      true,
                                      subroot->processed_tlist,
@@ -318,16 +310,14 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
          * generate_append_tlist() or generate_setop_tlist(), this will work.
          * We just tell generate_setop_tlist() to use varno 0.
          */
-        if (flag >= 0 ||
-            !tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
-            !tlist_same_collations(*pTargetList, colCollations, junkOK))
+        if (!tlist_same_datatypes(*pTargetList, colTypes, false) ||
+            !tlist_same_collations(*pTargetList, colCollations, false))
         {
             PathTarget *target;
             bool        trivial_tlist;
             ListCell   *lc;

             *pTargetList = generate_setop_tlist(colTypes, colCollations,
-                                                flag,
                                                 0,
                                                 false,
                                                 *pTargetList,
@@ -411,7 +401,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      */
     lrel = recurse_set_operations(setOp->larg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -423,7 +412,6 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
                                   setOp->colTypes, setOp->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -436,7 +424,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /*
      * Generate tlist for RecursiveUnion path node --- same as in Append cases
      */
-    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations, false,
+    tlist = generate_append_tlist(setOp->colTypes, setOp->colCollations,
                                   list_make2(lpath_tlist, rpath_tlist),
                                   refnames_tlist);

@@ -736,7 +724,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_append_tlist(op->colTypes, op->colCollations, false,
+    tlist = generate_append_tlist(op->colTypes, op->colCollations,
                                   tlist_list, refnames_tlist);
     *pTargetList = tlist;

@@ -1048,7 +1036,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
@@ -1060,7 +1047,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     rrel = recurse_set_operations(op->rarg, root,
                                   op->colTypes, op->colCollations,
-                                  false, -1,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
@@ -1109,7 +1095,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
      * concerned, but we must make it look real anyway for the benefit of the
      * next plan level up.
      */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
                                  0, false, lpath_tlist, refnames_tlist,
                                  &result_trivial_tlist);

@@ -1258,18 +1244,10 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
-         *
-         * Note we disallow any resjunk columns in child results.  This is
-         * necessary since the Append node that implements the union won't do
-         * any projection, and upper levels will get confused if some of our
-         * output tuples have junk and some don't.  This case only arises when
-         * we have an EXCEPT or INTERSECT as child, else there won't be
-         * resjunk anyway.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
-                                                        false, -1,
                                                         refnames_tlist,
                                                         &child_tlist,
                                                         &trivial_tlist));
@@ -1412,7 +1390,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: -1 if no flag column needed, 0 or 1 to create a const flag column
  * varno: varno to use in generated Vars
  * hack_constants: true to copy up constants (see comments in code)
  * input_tlist: targetlist of this node's input node
@@ -1421,7 +1398,6 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
  */
 static List *
 generate_setop_tlist(List *colTypes, List *colCollations,
-                     int flag,
                      Index varno,
                      bool hack_constants,
                      List *input_tlist,
@@ -1520,7 +1496,7 @@ generate_setop_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1529,25 +1505,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag >= 0)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is the given constant */
-        expr = (Node *) makeConst(INT4OID,
-                                  -1,
-                                  InvalidOid,
-                                  sizeof(int32),
-                                  Int32GetDatum(flag),
-                                  false,
-                                  true);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-        *trivial_tlist = false; /* the extra entry makes it not trivial */
-    }
-
     return tlist;
 }

@@ -1556,7 +1513,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  *
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
- * flag: true to create a flag column copied up from subplans
  * input_tlists: list of tlists for sub-plans of the Append
  * refnames_tlist: targetlist to take column names from
  *
@@ -1570,7 +1526,6 @@ generate_setop_tlist(List *colTypes, List *colCollations,
  */
 static List *
 generate_append_tlist(List *colTypes, List *colCollations,
-                      bool flag,
                       List *input_tlists,
                       List *refnames_tlist)
 {
@@ -1604,8 +1559,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
         {
             TargetEntry *subtle = (TargetEntry *) lfirst(subtlistl);

-            if (subtle->resjunk)
-                continue;
+            Assert(!subtle->resjunk);
             Assert(curColType != NULL);
             if (exprType((Node *) subtle->expr) == lfirst_oid(curColType))
             {
@@ -1654,7 +1608,7 @@ generate_append_tlist(List *colTypes, List *colCollations,
                               false);

         /*
-         * By convention, all non-resjunk columns in a setop tree have
+         * By convention, all output columns in a setop tree have
          * ressortgroupref equal to their resno.  In some cases the ref isn't
          * needed, but this is a cleaner way than modifying the tlist later.
          */
@@ -1663,23 +1617,6 @@ generate_append_tlist(List *colTypes, List *colCollations,
         tlist = lappend(tlist, tle);
     }

-    if (flag)
-    {
-        /* Add a resjunk flag column */
-        /* flag value is shown as copied up from subplan */
-        expr = (Node *) makeVar(0,
-                                resno,
-                                INT4OID,
-                                -1,
-                                InvalidOid,
-                                0);
-        tle = makeTargetEntry((Expr *) expr,
-                              (AttrNumber) resno++,
-                              pstrdup("flag"),
-                              true);
-        tlist = lappend(tlist, tle);
-    }
-
     pfree(colTypmods);

     return tlist;
@@ -1709,12 +1646,7 @@ generate_setop_grouplist(SetOperationStmt *op, List *targetlist)
         TargetEntry *tle = (TargetEntry *) lfirst(lt);
         SortGroupClause *sgc;

-        if (tle->resjunk)
-        {
-            /* resjunk columns should not have sortgrouprefs */
-            Assert(tle->ressortgroupref == 0);
-            continue;            /* ignore resjunk columns */
-        }
+        Assert(!tle->resjunk);

         /* non-resjunk columns should have sortgroupref = resno */
         Assert(tle->ressortgroupref == tle->resno);
--
2.43.5

From b46cd2f88ead9a6facf89064cc2ac2f7aac01134 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:35:29 -0500
Subject: [PATCH v2 3/5] Get rid of choose_hashed_setop().

This logic is a hangover from days when we could only generate
one plan per setop node.  Now, it's better to generate multiple
Paths and let add_path choose the cheapest.  The main advantage
is not having to keep choose_hashed_setop in perfect sync with
cost estimates made elsewhere --- and indeed, it looks like it
was out of sync already.  Not to mention that it doesn't account
for presorted input, which doesn't matter too much right now
but will soon.

This does require fixing create_setop_path to generate good cost
estimates.  That was another area where we'd not maintained things
well, because it wasn't distinguishing sorted from hashed costs at
all.  With the formulas I chose, the total cost attributed to the node
is actually the same for sorted and hashed modes --- but the startup
cost is wildly different, because hashed mode does basically all the
work before returning anything.

I also modified the existing logic that disables hashed setops
so that it works by incrementing path->disabled_nodes rather
than not generating the path at all.  That seems more in keeping
with the methods established by commit e22253467.  (But should
we put that logic in create_setop_path, rather than its caller?)

This results in two changes in plans that are visible in the
regression tests.  The one in union.sql seems fine: it's now
preferring sorted over hashed mode for a zero-column INTERSECT,
since no sort is really required.  The one in subselect.sql
is trying to test nodeSetOp's rescan logic, so I extended it
to test both sorted and hashed modes.
---
 src/backend/optimizer/prep/prepunion.c  | 226 +++++++++---------------
 src/backend/optimizer/util/pathnode.c   |  50 +++++-
 src/test/regress/expected/subselect.out |  48 ++++-
 src/test/regress/expected/union.out     |   2 +-
 src/test/regress/sql/subselect.sql      |  32 +++-
 5 files changed, 200 insertions(+), 158 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 977bd546e0..21513dbb19 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -63,10 +63,6 @@ static List *plan_union_children(PlannerInfo *root,
                                  List **tlist_list,
                                  List **istrivial_tlist);
 static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
-static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                                Path *lpath, Path *rpath,
-                                double dNumGroups, double dNumOutputRows,
-                                const char *construct);
 static List *generate_setop_tlist(List *colTypes, List *colCollations,
                                   Index varno,
                                   bool hack_constants,
@@ -1025,7 +1021,8 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                 dRightGroups,
                 dNumGroups,
                 dNumOutputRows;
-    bool        use_hash;
+    bool        can_sort;
+    bool        can_hash;
     SetOpCmd    cmd;

     /*
@@ -1130,15 +1127,76 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dNumGroups = dLeftGroups;
         dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
     }
+    result_rel->rows = dNumOutputRows;
+
+    /* Select the SetOpCmd type */
+    switch (op->op)
+    {
+        case SETOP_INTERSECT:
+            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
+            break;
+        case SETOP_EXCEPT:
+            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
+            break;
+        default:
+            elog(ERROR, "unrecognized set op: %d", (int) op->op);
+            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
+            break;
+    }
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));

     /*
-     * Decide whether to hash or sort, and add sort nodes if needed.
+     * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
-    use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
-                                   dNumGroups, dNumOutputRows,
-                                   (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");
+    if (can_hash)
+    {
+        Size        hash_mem_limit = get_hash_memory_limit();
+        Size        hashentrysize;
+
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_HASHED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);

-    if (groupList && !use_hash)
+        /*
+         * Mark the path as disabled if enable_hashagg is off.  While this
+         * isn't exactly a HashAgg node, it seems close enough to justify
+         * letting that switch control it.
+         */
+        if (!enable_hashagg)
+            path->disabled_nodes++;
+
+        /*
+         * Also disable if it doesn't look like the hashtable will fit into
+         * hash_mem.
+         */
+        hashentrysize = MAXALIGN(lpath->pathtarget->width) +
+            MAXALIGN(SizeofMinimalTupleHeader);
+        if (hashentrysize * dNumGroups > hash_mem_limit)
+            path->disabled_nodes++;
+
+        add_path(result_rel, path);
+    }
+
+    /*
+     * If we can sort, sort the inputs if needed before adding SetOp.
+     */
+    if (can_sort)
     {
         List       *pathkeys;

@@ -1160,36 +1218,19 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                               rpath,
                                               pathkeys,
                                               -1.0);
-    }

-    /*
-     * Finally, add a SetOp path node to generate the correct output.
-     */
-    switch (op->op)
-    {
-        case SETOP_INTERSECT:
-            cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT;
-            break;
-        case SETOP_EXCEPT:
-            cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT;
-            break;
-        default:
-            elog(ERROR, "unrecognized set op: %d", (int) op->op);
-            cmd = SETOPCMD_INTERSECT;    /* keep compiler quiet */
-            break;
+        path = (Path *) create_setop_path(root,
+                                          result_rel,
+                                          lpath,
+                                          rpath,
+                                          cmd,
+                                          SETOP_SORTED,
+                                          groupList,
+                                          dNumGroups,
+                                          dNumOutputRows);
+        add_path(result_rel, path);
     }
-    path = (Path *) create_setop_path(root,
-                                      result_rel,
-                                      lpath,
-                                      rpath,
-                                      cmd,
-                                      use_hash ? SETOP_HASHED : SETOP_SORTED,
-                                      groupList,
-                                      dNumGroups,
-                                      dNumOutputRows);
-
-    result_rel->rows = path->rows;
-    add_path(result_rel, path);
+
     return result_rel;
 }

@@ -1276,115 +1317,6 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)
     set_cheapest(rel);
 }

-/*
- * choose_hashed_setop - should we use hashing for a set operation?
- *
- * XXX probably this should go away: just make both paths and let
- * add_path sort it out.
- */
-static bool
-choose_hashed_setop(PlannerInfo *root, List *groupClauses,
-                    Path *lpath, Path *rpath,
-                    double dNumGroups, double dNumOutputRows,
-                    const char *construct)
-{
-    int            numGroupCols = list_length(groupClauses);
-    Size        hash_mem_limit = get_hash_memory_limit();
-    bool        can_sort;
-    bool        can_hash;
-    Size        hashentrysize;
-    Path        hashed_p;
-    Path        sorted_p;
-    double        tuple_fraction;
-
-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupClauses);
-    can_hash = grouping_is_hashable(groupClauses);
-    if (can_hash && can_sort)
-    {
-        /* we have a meaningful choice to make, continue ... */
-    }
-    else if (can_hash)
-        return true;
-    else if (can_sort)
-        return false;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is UNION, INTERSECT, or EXCEPT */
-                 errmsg("could not implement %s", construct),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
-    /* Prefer sorting when enable_hashagg is off */
-    if (!enable_hashagg)
-        return false;
-
-    /*
-     * Don't do it if it doesn't look like the hashtable will fit into
-     * hash_mem.
-     */
-    hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
-
-    if (hashentrysize * dNumGroups > hash_mem_limit)
-        return false;
-
-    /*
-     * See if the estimated cost is no more than doing it the other way.
-     *
-     * We need to consider input_plan + hashagg versus input_plan + sort +
-     * group. XXX NOT TRUE: Note that the actual result plan might involve a
-     * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
-     * and Group should be close enough for our purposes here.
-     *
-     * These path variables are dummies that just hold cost fields; we don't
-     * make actual Paths for these steps.
-     */
-    cost_agg(&hashed_p, root, AGG_HASHED, NULL,
-             numGroupCols, dNumGroups,
-             NIL,
-             lpath->disabled_nodes + rpath->disabled_nodes,
-             lpath->startup_cost + rpath->startup_cost,
-             lpath->total_cost + rpath->total_cost,
-             lpath->rows + rpath->rows,
-             lpath->pathtarget->width);
-
-    /*
-     * Now for the sorted case.  XXX NOT TRUE: Note that the input is *always*
-     * unsorted, since it was made by appending unrelated sub-relations
-     * together.
-     */
-    sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
-    sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
-    sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
-    /* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-    cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
-              sorted_p.total_cost,
-              lpath->rows + rpath->rows,
-              lpath->pathtarget->width,
-              0.0, work_mem, -1.0);
-    cost_group(&sorted_p, root, numGroupCols, dNumGroups,
-               NIL,
-               sorted_p.disabled_nodes,
-               sorted_p.startup_cost, sorted_p.total_cost,
-               lpath->rows + rpath->rows);
-
-    /*
-     * Now make the decision using the top-level tuple fraction.  First we
-     * have to convert an absolute count (LIMIT) into fractional form.
-     */
-    tuple_fraction = root->tuple_fraction;
-    if (tuple_fraction >= 1.0)
-        tuple_fraction /= dNumOutputRows;
-
-    if (compare_fractional_path_costs(&hashed_p, &sorted_p,
-                                      tuple_fraction) < 0)
-    {
-        /* Hashed is cheaper, so use it */
-        return true;
-    }
-    return false;
-}
-
 /*
  * Generate targetlist for a set-operation plan node
  *
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e52e4b1d67..d7422f2dd8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3681,17 +3681,51 @@ create_setop_path(PlannerInfo *root,
     pathnode->numGroups = numGroups;

     /*
-     * Charge one cpu_operator_cost per comparison per input tuple. We assume
-     * all columns get compared at most of the tuples.
-     *
-     * XXX all wrong for hashing
+     * Compute cost estimates.  As things stand, we end up with the same total
+     * cost in this node for sort and hash methods, but different startup
+     * costs.  This could be refined perhaps, but it'll do for now.
      */
     pathnode->path.disabled_nodes =
         leftpath->disabled_nodes + rightpath->disabled_nodes;
-    pathnode->path.startup_cost =
-        leftpath->startup_cost + rightpath->startup_cost;
-    pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
-        cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+    if (strategy == SETOP_SORTED)
+    {
+        /*
+         * In sorted mode, we can emit output incrementally.  Charge one
+         * cpu_operator_cost per comparison per input tuple.  Like cost_group,
+         * we assume all columns get compared at most of the tuples.
+         */
+        pathnode->path.startup_cost =
+            leftpath->startup_cost + rightpath->startup_cost;
+        pathnode->path.total_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
+    else
+    {
+        /*
+         * In hashed mode, we must read all the input before we can emit
+         * anything.  Also charge comparison costs to represent the cost of
+         * hash table lookups.
+         */
+        pathnode->path.startup_cost =
+            leftpath->total_cost + rightpath->total_cost +
+            cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
+        pathnode->path.total_cost = pathnode->path.startup_cost;
+
+        /*
+         * Also charge a small amount per extracted tuple.  Like cost_sort,
+         * charge only operator cost not cpu_tuple_cost, since SetOp does no
+         * qual-checking or projection.
+         */
+        pathnode->path.total_cost += cpu_operator_cost * outputRows;
+    }
     pathnode->path.rows = outputRows;

     return pathnode;
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 154fffabae..6f2926c45a 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1221,8 +1221,10 @@ where o.ten = 0;
 (1 row)

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -1256,6 +1258,50 @@ where o.ten = 1;
    100
 (1 row)

+rollback;
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+                                                                                                     QUERY PLAN
                                                                                              

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  SetOp Except
+               ->  Sort
+                     Sort Key: i1.unique1, i1.unique2, i1.two, i1.four, i1.ten, i1.twenty, i1.hundred, i1.thousand,
i1.twothousand,i1.fivethous, i1.tenthous, i1.odd, i1.even, i1.stringu1, i1.stringu2, i1.string4 
+                     ->  Index Scan using onek_unique1 on onek i1
+                           Index Cond: (unique1 = o.unique1)
+               ->  Sort
+                     Sort Key: i2.unique1, i2.unique2, i2.two, i2.four, i2.ten, i2.twenty, i2.hundred, i2.thousand,
i2.twothousand,i2.fivethous, i2.tenthous, i2.odd, i2.even, i2.stringu1, i2.stringu2, i2.string4 
+                     ->  Index Scan using onek_unique1 on onek i2
+                           Index Cond: (unique1 = o.unique2)
+(13 rows)
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+ count
+-------
+   100
+(1 row)
+
+rollback;
 --
 -- Test rescan of a RecursiveUnion node
 --
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index fbf24490e9..9474b01510 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -950,7 +950,7 @@ explain (costs off)
 select from generate_series(1,5) intersect select from generate_series(1,3);
                         QUERY PLAN
 ----------------------------------------------------------
- HashSetOp Intersect
+ SetOp Intersect
    ->  Function Scan on generate_series
    ->  Function Scan on generate_series generate_series_1
 (3 rows)
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index af6e157aca..11becd885c 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -638,8 +638,11 @@ select sum(ss.tst::int) from
 where o.ten = 0;

 --
--- Test rescan of a SetOp node
+-- Test rescan of a hashed SetOp node
 --
+begin;
+set local enable_sort = off;
+
 explain (costs off)
 select count(*) from
   onek o cross join lateral (
@@ -657,6 +660,33 @@ select count(*) from
   ) ss
 where o.ten = 1;

+rollback;
+
+--
+-- Test rescan of a sorted SetOp node
+--
+begin;
+set local enable_hashagg = off;
+
+explain (costs off)
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+select count(*) from
+  onek o cross join lateral (
+    select * from onek i1 where i1.unique1 = o.unique1
+    except
+    select * from onek i2 where i2.unique1 = o.unique2
+  ) ss
+where o.ten = 1;
+
+rollback;
+
 --
 -- Test rescan of a RecursiveUnion node
 --
--
2.43.5

From 7ec25e0de0810a0130914ed6f691049859d4917c Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:51:54 -0500
Subject: [PATCH v2 4/5] Fix bogus decisions about whether we want pre-sorted
 child paths.

As things stand, recurse_set_operations unconditionally passes down
root->parse->setOperations to subquery_planner, which then applies
set_operation_ordered_results_useful to it to decide whether we care
about producing pre-sorted paths for the subquery.

This is flat wrong.

The main problem is that root->parse->setOperations is not necessarily
the relevant ancestor node.  In the test case added here, we have
UNION ALL atop a child UNION, and the existing logic fails to
recognize that it could use pre-sorted paths for the child UNION step.

A lesser problem is that set_operation_ordered_results_useful is
inadequate anyway: it will falsely claim that we could use pre-sorted
paths for a recursive UNION.

A better way to do this is for each caller of recurse_set_operations
to pass down the relevant parent node.  Then we can further specify
that the caller should pass NULL if it has no use for pre-sorted
paths, and then we can get rid of set_operation_ordered_results_useful
altogether, eliminating the need to keep it in sync with the code that
is actually doing the work.

Note: this patch makes generate_nonunion_paths pass down the non-UNION
node as parent.  generate_nonunion_paths doesn't yet do anything with
the pre-sorted paths that will result, so no regression test plans
change.  If we were going to leave it like this, we'd pass down NULL
instead.
---
 src/backend/optimizer/plan/planner.c   |  7 ++-
 src/backend/optimizer/prep/prepunion.c | 65 +++++++++++++-------------
 src/include/optimizer/prep.h           |  1 -
 src/test/regress/expected/union.out    | 14 ++++++
 src/test/regress/sql/union.sql         |  4 ++
 5 files changed, 54 insertions(+), 37 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1f78dc3d53..df168d46ac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -608,7 +608,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Basically, this routine does the stuff that should only be done once
  * per Query object.  It then calls grouping_planner.  At one time,
@@ -1342,7 +1342,7 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
  * setops is used for set operation subqueries to provide the subquery with
  * the context in which it's being used so that Paths correctly sorted for the
  * set operation can be generated.  NULL when not planning a set operation
- * child.
+ * child, or when a child of a set op that isn't interested in sorted input.
  *
  * Returns nothing; the useful output is in the Paths we attach to the
  * (UPPERREL_FINAL, NULL) upperrel in *root.  In addition,
@@ -3621,8 +3621,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
                                       tlist);

     /* setting setop_pathkeys might be useful to the union planner */
-    if (qp_extra->setop != NULL &&
-        set_operation_ordered_results_useful(qp_extra->setop))
+    if (qp_extra->setop != NULL)
     {
         List       *groupClauses;
         bool        sortable;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 21513dbb19..e92a088334 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -39,6 +39,7 @@


 static RelOptInfo *recurse_set_operations(Node *setOp, PlannerInfo *root,
+                                          SetOperationStmt *parentOp,
                                           List *colTypes, List *colCollations,
                                           List *refnames_tlist,
                                           List **pTargetList,
@@ -156,6 +157,7 @@ plan_set_operations(PlannerInfo *root)
          * output from the top-level node.
          */
         setop_rel = recurse_set_operations((Node *) topop, root,
+                                           NULL,    /* no parent */
                                            topop->colTypes, topop->colCollations,
                                            leftmostQuery->targetList,
                                            &top_tlist,
@@ -168,44 +170,31 @@ plan_set_operations(PlannerInfo *root)
     return setop_rel;
 }

-/*
- * set_operation_ordered_results_useful
- *        Return true if the given SetOperationStmt can be executed by utilizing
- *        paths that provide sorted input according to the setop's targetlist.
- *        Returns false when sorted paths are not any more useful then unsorted
- *        ones.
- */
-bool
-set_operation_ordered_results_useful(SetOperationStmt *setop)
-{
-    /*
-     * Paths sorted by the targetlist are useful for UNION as we can opt to
-     * MergeAppend the sorted paths then Unique them.  Ordered paths are no
-     * more useful than unordered ones for UNION ALL.
-     */
-    if (!setop->all && setop->op == SETOP_UNION)
-        return true;
-
-    /*
-     * EXCEPT / EXCEPT ALL / INTERSECT / INTERSECT ALL cannot yet utilize
-     * correctly sorted input paths.
-     */
-    return false;
-}
-
 /*
  * recurse_set_operations
  *      Recursively handle one step in a tree of set operations
  *
+ * setOp: current step (could be a SetOperationStmt or a leaf RangeTblRef)
+ * parentOp: parent step, or NULL if none (but see below)
  * colTypes: OID list of set-op's result column datatypes
  * colCollations: OID list of set-op's result column collations
  * refnames_tlist: targetlist to take column names from
  *
+ * parentOp should be passed as NULL unless that step is interested in
+ * getting sorted output from this step.  ("Sorted" means "sorted according
+ * to the default btree opclasses of the result column datatypes".)
+ *
  * Returns a RelOptInfo for the subtree, as well as these output parameters:
  * *pTargetList: receives the fully-fledged tlist for the subtree's top plan
  * *istrivial_tlist: true if, and only if, datatypes between parent and child
  * match.
  *
+ * If setOp is a leaf node, this function plans the sub-query but does
+ * not populate the pathlist of the returned RelOptInfo.  The caller will
+ * generate SubqueryScan paths using useful path(s) of the subquery (see
+ * build_setop_child_paths).  But this function does build the paths for
+ * set-operation nodes.
+ *
  * The pTargetList output parameter is mostly redundant with the pathtarget
  * of the returned RelOptInfo, but for the moment we need it because much of
  * the logic in this file depends on flag columns being marked resjunk.
@@ -218,6 +207,7 @@ set_operation_ordered_results_useful(SetOperationStmt *setop)
  */
 static RelOptInfo *
 recurse_set_operations(Node *setOp, PlannerInfo *root,
+                       SetOperationStmt *parentOp,
                        List *colTypes, List *colCollations,
                        List *refnames_tlist,
                        List **pTargetList,
@@ -234,7 +224,6 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
     {
         RangeTblRef *rtr = (RangeTblRef *) setOp;
         RangeTblEntry *rte = root->simple_rte_array[rtr->rtindex];
-        SetOperationStmt *setops;
         Query       *subquery = rte->subquery;
         PlannerInfo *subroot;
         List       *tlist;
@@ -249,15 +238,13 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
         Assert(root->plan_params == NIL);

         /*
-         * Pass the set operation details to the subquery_planner to have it
-         * consider generating Paths correctly ordered for the set operation.
+         * Generate a subroot and Paths for the subquery.  If we have a
+         * parentOp, pass that down to encourage subquery_planner to consider
+         * suitably-sorted Paths.
          */
-        setops = castNode(SetOperationStmt, root->parse->setOperations);
-
-        /* Generate a subroot and Paths for the subquery */
         subroot = rel->subroot = subquery_planner(root->glob, subquery, root,
                                                   false, root->tuple_fraction,
-                                                  setops);
+                                                  parentOp);

         /*
          * It should not be possible for the primitive query to contain any
@@ -396,6 +383,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
      * separately without any intention of combining them into one Append.
      */
     lrel = recurse_set_operations(setOp->larg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -407,6 +395,7 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
     /* The right path will want to look at the left one ... */
     root->non_recursive_path = lpath;
     rrel = recurse_set_operations(setOp->rarg, root,
+                                  NULL, /* no value in sorted results */
                                   setOp->colTypes, setOp->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -479,6 +468,10 @@ generate_recursion_path(SetOperationStmt *setOp, PlannerInfo *root,
  * build_setop_child_paths
  *        Build paths for the set op child relation denoted by 'rel'.
  *
+ * 'rel' is an RTE_SUBQUERY relation.  We have already generated paths within
+ * the subquery's subroot; the task here is to create SubqueryScan paths for
+ * 'rel', representing scans of the useful subquery paths.
+ *
  * interesting_pathkeys: if not NIL, also include paths that suit these
  * pathkeys, sorting any unsorted paths as required.
  * *pNumGroups: if not NULL, we estimate the number of distinct groups
@@ -1032,6 +1025,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,

     /* Recurse on children */
     lrel = recurse_set_operations(op->larg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &lpath_tlist,
@@ -1043,6 +1037,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
         dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
+                                  op,
                                   op->colTypes, op->colCollations,
                                   refnames_tlist,
                                   &rpath_tlist,
@@ -1285,8 +1280,14 @@ plan_union_children(PlannerInfo *root,

         /*
          * Not same, so plan this child separately.
+         *
+         * If top_union isn't a UNION ALL, then we are interested in sorted
+         * output from the child, so pass top_union as parentOp.  Note that
+         * this isn't necessarily the child node's immediate SetOperationStmt
+         * parent, but that's fine: it's the effective parent.
          */
         result = lappend(result, recurse_set_operations(setOp, root,
+                                                        top_union->all ? NULL : top_union,
                                                         top_union->colTypes,
                                                         top_union->colCollations,
                                                         refnames_tlist,
diff --git a/src/include/optimizer/prep.h b/src/include/optimizer/prep.h
index a52dec285d..74bb6c1d3b 100644
--- a/src/include/optimizer/prep.h
+++ b/src/include/optimizer/prep.h
@@ -53,6 +53,5 @@ extern void preprocess_aggrefs(PlannerInfo *root, Node *clause);
  * prototypes for prepunion.c
  */
 extern RelOptInfo *plan_set_operations(PlannerInfo *root);
-extern bool set_operation_ordered_results_useful(SetOperationStmt *setop);

 #endif                            /* PREP_H */
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 9474b01510..3e1adbd992 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -466,6 +466,20 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+                               QUERY PLAN
+------------------------------------------------------------------------
+ Append
+   ->  Seq Scan on int4_tbl
+   ->  Unique
+         ->  Merge Append
+               Sort Key: tenk1.unique1
+               ->  Index Only Scan using tenk1_unique1 on tenk1
+               ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+(7 rows)
+
 reset enable_hashagg;
 -- non-hashable type
 set enable_hashagg to on;
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index f8826514e4..10030ec1c5 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -156,6 +156,10 @@ explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+explain (costs off)
+select f1 from int4_tbl union all
+  (select unique1 from tenk1 union select unique2 from tenk1);
+
 reset enable_hashagg;

 -- non-hashable type
--
2.43.5

From ca15dc437b2ca0d736c89dfbef79316c218b90d2 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 19 Nov 2024 20:58:34 -0500
Subject: [PATCH v2 5/5] Teach generate_nonunion_paths to consider pre-sorted
 child paths.

Much of this patch is just re-ordering existing code so that we can
compute the desired pathkeys before we call build_setop_child_paths.

I don't think we need any new test cases: the existing cases that
change plan are enough to show that this works.

One test case in union.sql would switch from HashSetOp Except to
sorted SetOp Except if we left it alone, since the cheapest inputs
are already sorted.  But the point of that set of tests is to
exercise HashSetOp, so swing a bigger hammer.  (The same query is
re-tested with sorting further down, so this seems fine.)
---
 src/backend/optimizer/prep/prepunion.c | 144 ++++++++++++++++---------
 src/test/regress/expected/union.out    |  37 +++----
 src/test/regress/sql/union.sql         |   5 +
 3 files changed, 115 insertions(+), 71 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e92a088334..b3a26738d0 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1010,6 +1010,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     bool        lpath_trivial_tlist,
                 rpath_trivial_tlist,
                 result_trivial_tlist;
+    List       *nonunion_pathkeys = NIL;
     double        dLeftGroups,
                 dRightGroups,
                 dNumGroups,
@@ -1030,11 +1031,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &lpath_tlist,
                                   &lpath_trivial_tlist);
-    if (lrel->rtekind == RTE_SUBQUERY)
-        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
-                                NIL, &dLeftGroups);
-    else
-        dLeftGroups = lrel->rows;

     rrel = recurse_set_operations(op->rarg, root,
                                   op,
@@ -1042,9 +1038,57 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
                                   refnames_tlist,
                                   &rpath_tlist,
                                   &rpath_trivial_tlist);
+
+    /*
+     * Generate tlist for SetOp plan node.
+     *
+     * The tlist for a SetOp plan isn't important so far as the SetOp is
+     * concerned, but we must make it look real anyway for the benefit of the
+     * next plan level up.
+     */
+    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
+                                 0, false, lpath_tlist, refnames_tlist,
+                                 &result_trivial_tlist);
+
+    /* We should not have needed any type coercions in the tlist */
+    Assert(result_trivial_tlist);
+
+    *pTargetList = tlist;
+
+    /* Identify the grouping semantics */
+    groupList = generate_setop_grouplist(op, tlist);
+
+    /* Check whether the operators support sorting or hashing */
+    can_sort = grouping_is_sortable(groupList);
+    can_hash = grouping_is_hashable(groupList);
+    if (!can_sort && !can_hash)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+        /* translator: %s is INTERSECT or EXCEPT */
+                 errmsg("could not implement %s",
+                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
+                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
+
+    if (can_sort)
+    {
+        /* Determine the pathkeys for sorting by the whole target list */
+        nonunion_pathkeys = make_pathkeys_for_sortclauses(root, groupList,
+                                                          tlist);
+
+        root->query_pathkeys = nonunion_pathkeys;
+    }
+
+    /*
+     * Now that we've got all that info, we can build the child paths.
+     */
+    if (lrel->rtekind == RTE_SUBQUERY)
+        build_setop_child_paths(root, lrel, lpath_trivial_tlist, lpath_tlist,
+                                nonunion_pathkeys, &dLeftGroups);
+    else
+        dLeftGroups = lrel->rows;
     if (rrel->rtekind == RTE_SUBQUERY)
         build_setop_child_paths(root, rrel, rpath_trivial_tlist, rpath_tlist,
-                                NIL, &dRightGroups);
+                                nonunion_pathkeys, &dRightGroups);
     else
         dRightGroups = rrel->rows;

@@ -1080,30 +1124,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     lpath = lrel->cheapest_total_path;
     rpath = rrel->cheapest_total_path;

-    /*
-     * Generate tlist for SetOp plan node.
-     *
-     * The tlist for a SetOp plan isn't important so far as the SetOp is
-     * concerned, but we must make it look real anyway for the benefit of the
-     * next plan level up.
-     */
-    tlist = generate_setop_tlist(op->colTypes, op->colCollations,
-                                 0, false, lpath_tlist, refnames_tlist,
-                                 &result_trivial_tlist);
-
-    /* We should not have needed any type coercions in the tlist */
-    Assert(result_trivial_tlist);
-
-    *pTargetList = tlist;
-
     /* Build result relation. */
     result_rel = fetch_upper_rel(root, UPPERREL_SETOP,
                                  bms_union(lrel->relids, rrel->relids));
     result_rel->reltarget = create_pathtarget(root, tlist);

-    /* Identify the grouping semantics */
-    groupList = generate_setop_grouplist(op, tlist);
-
     /*
      * Estimate number of distinct groups that we'll need hashtable entries
      * for; this is the size of the left-hand input for EXCEPT, or the smaller
@@ -1139,17 +1164,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
             break;
     }

-    /* Check whether the operators support sorting or hashing */
-    can_sort = grouping_is_sortable(groupList);
-    can_hash = grouping_is_hashable(groupList);
-    if (!can_sort && !can_hash)
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-        /* translator: %s is INTERSECT or EXCEPT */
-                 errmsg("could not implement %s",
-                        (op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT"),
-                 errdetail("Some of the datatypes only support hashing, while others only support sorting.")));
-
     /*
      * If we can hash, that just requires a SetOp atop the cheapest inputs.
      */
@@ -1189,35 +1203,63 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
     }

     /*
-     * If we can sort, sort the inputs if needed before adding SetOp.
+     * If we can sort, generate the cheapest sorted input paths, and add a
+     * SetOp atop those.
      */
     if (can_sort)
     {
         List       *pathkeys;
+        Path       *slpath,
+                   *srpath;

+        /* First the left input ... */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  lpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
-            lpath = (Path *) create_sort_path(root,
-                                              lpath->parent,
-                                              lpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, lpath->pathkeys))
+            slpath = lpath;        /* cheapest path is already sorted */
+        else
+        {
+            slpath = get_cheapest_path_for_pathkeys(lrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (slpath == NULL)
+                slpath = (Path *) create_sort_path(root,
+                                                   lpath->parent,
+                                                   lpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }
+
+        /* and now the same for the right. */
         pathkeys = make_pathkeys_for_sortclauses(root,
                                                  groupList,
                                                  rpath_tlist);
-        if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
-            rpath = (Path *) create_sort_path(root,
-                                              rpath->parent,
-                                              rpath,
-                                              pathkeys,
-                                              -1.0);
+        if (pathkeys_contained_in(pathkeys, rpath->pathkeys))
+            srpath = rpath;        /* cheapest path is already sorted */
+        else
+        {
+            srpath = get_cheapest_path_for_pathkeys(rrel->pathlist,
+                                                    nonunion_pathkeys,
+                                                    NULL,
+                                                    TOTAL_COST,
+                                                    false);
+            /* Subquery failed to produce any presorted paths? */
+            if (srpath == NULL)
+                srpath = (Path *) create_sort_path(root,
+                                                   rpath->parent,
+                                                   rpath,
+                                                   pathkeys,
+                                                   -1.0);
+        }

         path = (Path *) create_setop_path(root,
                                           result_rel,
-                                          lpath,
-                                          rpath,
+                                          slpath,
+                                          srpath,
                                           cmd,
                                           SETOP_SORTED,
                                           groupList,
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 3e1adbd992..aa73edeac5 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -385,13 +385,15 @@ select count(*) from
   5000
 (1 row)

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                         QUERY PLAN
-------------------------------------------------------------
+           QUERY PLAN
+---------------------------------
  HashSetOp Except
-   ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+   ->  Seq Scan on tenk1
+   ->  Seq Scan on tenk1 tenk1_1
          Filter: (unique2 <> 10)
 (4 rows)

@@ -401,6 +403,7 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
       10
 (1 row)

+reset enable_indexscan;
 set enable_hashagg to off;
 explain (costs off)
 select count(*) from
@@ -427,17 +430,15 @@ select count(*) from
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
-                               QUERY PLAN
-------------------------------------------------------------------------
+                            QUERY PLAN
+------------------------------------------------------------------
  Aggregate
    ->  SetOp Intersect
          ->  Sort
                Sort Key: tenk1.fivethous
                ->  Seq Scan on tenk1
-         ->  Sort
-               Sort Key: tenk1_1.unique1
-               ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(6 rows)

 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -448,17 +449,13 @@ select count(*) from

 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
-                            QUERY PLAN
-------------------------------------------------------------------
+                         QUERY PLAN
+------------------------------------------------------------
  SetOp Except
-   ->  Sort
-         Sort Key: tenk1.unique1
-         ->  Index Only Scan using tenk1_unique1 on tenk1
-   ->  Sort
-         Sort Key: tenk1_1.unique2
-         ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
-               Filter: (unique2 <> 10)
-(8 rows)
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+(4 rows)

 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index 10030ec1c5..0be2911b0b 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -134,10 +134,15 @@ select count(*) from
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;

+-- this query will prefer a sorted setop unless we force it.
+set enable_indexscan to off;
+
 explain (costs off)
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;

+reset enable_indexscan;
+
 set enable_hashagg to off;

 explain (costs off)
--
2.43.5


pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: sunsetting md5 password support
Next
From: jian he
Date:
Subject: Re: Support for NO INHERIT to INHERIT state change with named NOT NULL constraints