Re: Properly pathify the union planner - Mailing list pgsql-hackers

From David Rowley
Subject Re: Properly pathify the union planner
Date
Msg-id CAApHDvoV4QgNsXsegaCJQFHK4v8-gE7_DzXmSMgm_Hg+C+4oYg@mail.gmail.com
Whole thread Raw
In response to Properly pathify the union planner  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Properly pathify the union planner
Re: Properly pathify the union planner
List pgsql-hackers
On Thu, 2 Nov 2023 at 12:42, David Rowley <dgrowleyml@gmail.com> wrote:
> One part that still needs work is the EquivalanceClass building.
> Because we only build the final targetlist for the Append after
> planning all the append child queries, I ended up having to populate
> the EquivalanceClasses backwards, i.e children first. add_eq_member()
> determines if you're passing a child member by checking if parent !=
> NULL.  Since I don't have a parent EquivalenceMember to pass,
> em_is_child gets set wrongly, and that causes problems because
> ec_has_const can get set to true when it shouldn't. This is a problem
> as it can make a PathKey redundant when it isn't.  I wonder if I'll
> need to change the signature of add_eq_member() and add an "is_child"
> bool to force the EM to be a child em... Needs more thought...

I've spent more time working on this and I ended up solving the above
problem by delaying the subquery path creation on the union children
until after we've built the top-level targetlist.  This allows the
parent eclasses to be correctly added before adding members for the
union children. (See build_setop_child_paths() in the patch)

There's been quite a bit of progress in other areas too.  Incremental
sorts now work:

# create table t1(a int primary key, b int not null);
# create table t2(a int primary key, b int not null);
# insert into t1 select x,x from generate_Series(1,1000000)x;
# insert into t2 select x,x from generate_Series(1,1000000)x;
# vacuum analyze t1,t2;


# explain (costs off) select * from t1 union select * from t2;
                    QUERY PLAN
--------------------------------------------------
 Unique
   ->  Merge Append
         Sort Key: t1.a, t1.b
         ->  Incremental Sort
               Sort Key: t1.a, t1.b
               Presorted Key: t1.a
               ->  Index Scan using t1_pkey on t1
         ->  Incremental Sort
               Sort Key: t2.a, t2.b
               Presorted Key: t2.a
               ->  Index Scan using t2_pkey on t2
(11 rows)

However, I've not yet made the MergeAppend UNIONs work when the
datatypes don't match on either side of the UNION.  For now, the
reason this does not work is due to convert_subquery_pathkeys() being
unable to find the pathkey targets in the targetlist.  The actual
targets can't be found due to the typecast.  I wondered if this could
be fixed by adding an additional projection path to the subquery when
the output columns don't match the setop->colTypes, but I'm a bit put
off by the comment in transformSetOperationTree:

> * For all non-UNKNOWN-type cases, we verify coercibility but we
> * don't modify the child's expression, for fear of changing the
> * child query's semantics.

I assume that's worried about the semantics of things like WHERE
clauses, so maybe the projection path in the subquery would be ok. I
need to spend more time on that.

Another problem I hit was add_path() pfreeing a Path that I needed.
This was happening due to how I'm building the final paths in the
subquery when setop_pathkeys are set.  Because I want to always
include the cheapest_input_path to allow that path to be used in
hash-based UNIONs, I also want to provide sorted paths so that
MergeAppend has something to work with.  I found cases where I'd
add_path() the cheapest_input_path to the final rel then also attempt
to sort that path.  Unfortunately, add_path() found the unsorted path
and the sorted path fuzzily the same cost and opted to keep the sorted
one due to it having better pathkeys. add_path() then pfree'd the
cheapest_input_path which meant the Sort's subpath was gone which
obviously caused issues in createplan.c.

For now, as a temporary fix, I've just #ifdef'd out the code in
add_path() that's pfreeing the old path.  I have drafted a patch that
refcounts Paths, but I'm unsure if that's the correct solution as I'm
only maintaining the refcounts in add_path() and add_partial_path(). I
think a true correct solution would bump the refcount when a path is
used as some other path's subpath.  That would mean having to
recursively pfree paths up until we find one with a refcount>0. Seems
a bit expensive for add_path() to do.

I've attached the updated patch.  This one is probably ready for
someone to test out. There will be more work to do, however.

David

Attachment

pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Questions regarding Index AMs and natural ordering
Next
From: Thomas Munro
Date:
Subject: Improving the comments in pqsignal()