Re: Removing unneeded self joins - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: Removing unneeded self joins
Date
Msg-id CAEZATCUwwvoaRuK+UuFF+S49KzKA3uMibjj7qUCOw_iSdtrZnA@mail.gmail.com
Whole thread Raw
In response to Re: Removing unneeded self joins  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
On Sat, 20 Jul 2024 at 12:39, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> > The new function replace_relid() looks to be the same as adjust_relid_set().
>
> They are similar, not the same.  replace_relid() has handling for
> negative newId, while adjust_relid_set() hasn't.  One thing I'd like
> to borrow from adjust_relid_set() to replace_relid() is the usage of
> IS_SPECIAL_VARNO() macro.

Ah, that makes sense. In that case, I'd say that replace_relid()
should go in analyzejoins.c (and be a local function there), since
that's the only place that requires this special negative newId
handling.

> It would be probably nice to move this logic into bms_replace_member()
> residing at bitmapset.c.  What do you think?

Maybe. It feels a little specialised though, so maybe it's not worth the effort.

I have been reviewing more of the patch, mainly focusing on the logic
in analyzejoins.c that decides when to apply SJE.

I understand broadly what the code is doing, but I still find it
somewhat hard to follow. One thing that makes it hard is that in
analyzejoins.c, "inner" and "outer" get swapped round at various
points. For example generate_join_implied_equalities() is defined like
this:

List *
generate_join_implied_equalities(PlannerInfo *root,
                                 Relids join_relids,
                                 Relids outer_relids,
                                 RelOptInfo *inner_rel,
                                 SpecialJoinInfo *sjinfo);

but remove_self_joins_one_group() calls it like this:

            restrictlist = generate_join_implied_equalities(root, joinrelids,
                                                            inner->relids,
                                                            outer, NULL);

So you have to remember that "inner" is "outer" and "outer" is "inner"
when going into generate_join_implied_equalities() from
remove_self_joins_one_group(). And the same thing happens when calling
innerrel_is_unique_ext() and match_unique_clauses(). I think all that
could be resolved by swapping "inner" and "outer" in the variable
names and comments in remove_self_joins_one_group().

Another thing I noticed in remove_self_joins_one_group() was this:

    /*
     * To enable SJE for the only degenerate case without any self
     * join clauses at all, add baserestrictinfo to this list. The
     * degenerate case works only if both sides have the same clause.
     * So doesn't matter which side to add.
     */
    selfjoinquals = list_concat(selfjoinquals, outer->baserestrictinfo);

That appears to be pointless, because is_innerrel_unique_for() will
filter the restrictlist it is given, removing those baserestrictinfo
clauses (because I think they'll always have can_join = false). And
then relation_has_unique_index_ext() will re-add them:

    /*
     * Examine the rel's restriction clauses for usable var = const clauses
     * that we can add to the restrictlist.
     */
    foreach(ic, rel->baserestrictinfo)
    {
        ... add suitable clauses
    }

where "rel" is "innerrel" from is_innerrel_unique_for(), which is
"outer" from remove_self_joins_one_group(), so it's the same set of
baserestrictinfo clauses.

Something else that looks a little messy is this in innerrel_is_unique_ext():

/*
 * innerrel_is_unique_ext
 *    Do the same as innerrel_is_unique(), but also set to '*extra_clauses'
 *    additional clauses from a baserestrictinfo list that were used to prove
 *    uniqueness.  A non NULL 'extra_clauses' indicates that we're checking
 *    for self-join and correspondingly dealing with filtered clauses.
 */
bool
innerrel_is_unique_ext(PlannerInfo *root,
                       ...
                       List **extra_clauses)
{
    bool        self_join = (extra_clauses != NULL);

    [logic depending on self_join]
}

This presumes that any caller interested in knowing the extra
baserestrictinfo clauses used to prove uniqueness must be looking at a
self join. That may be true today, but it doesn't seem like a good API
design choice. I think it would be better to just add "self_join" as
an extra parameter, and also maybe have the function return the
UniqueRelInfo containing the "extra_clauses", or NULL if it's not
unique. That way, it would be more extensible, if we wanted it to
return more information in the future.

Instead of adding relation_has_unique_index_ext(), maybe it would be
OK to just change the signature of relation_has_unique_index_for(). It
looks like it's only called from one other place outside
analyzejoins.c. Perhaps the same is true for innerrel_is_unique_ext().

Should match_unique_clauses() be comparing mergeopfamilies or opnos to
ensure that the clauses are using the same equality operator?

Regards,
Dean



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: PG buildfarm member cisticola
Next
From: "Andrey M. Borodin"
Date:
Subject: Re: [PATCH] GROUP BY ALL