Re: Removing unneeded self joins - Mailing list pgsql-hackers

From David Rowley
Subject Re: Removing unneeded self joins
Date
Msg-id CAKJS1f_cPDTRBTW4KpuD5fktVjPs2ugg704CJn=OxWXoUv-rDw@mail.gmail.com
Whole thread Raw
In response to Re: Removing unneeded self joins  (Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru>)
Responses Re: Removing unneeded self joins
List pgsql-hackers
On Sat, 23 Mar 2019 at 03:39, Alexander Kuzmenkov
<a.kuzmenkov@postgrespro.ru> wrote:
> The bug you mention later is an implementation bug that can be fixed (I
> will expand on that below). Besides this, do you think current self-join
> detection algorithm has fundamental correctness problems? I am not aware
> of such problems and this algorithm reflects my understanding of what
> constitutes a removable self-join, so it would be helpful if you could
> explain what exactly is wrong with it.

I did explain what is wrong with it, and also showed an example of why
it is broken.  I didn't see anything that looks fundamentally broken,
just the implementation needs more work.

> Your alternative idea sounds plausible, but I won't know it's correct
> for sure until I implement it, which is a nontrivial amount of work. I
> am also concerned that we will have to redo calculations similar to
> innerrel_is_unique(), because having near-zero overhead is a hard
> prerequisite for this patch, as was discussed upthread.

Are you worried about bypassing the unique rel cache is going to be too costly?

> > Here's an example of what can go wrong with your current code:
>
> This is a bug indeed. Unique index search is not exhaustive, so if many
> indexes match the join quals, we might not find the same index for both
> sides. I think this can be overcome if we switch to exhaustive search in
> relation_has_unique_index_for, and then try to find matching indexes in
> is_unique_self_join. Please see the new patch for the fix.

I really don't think modifying relation_has_unique_index_for to
collect details for up to 5 indexes is a good fix for this.  It looks
like it's still possible to trigger this, just the example would need
to be more complex. Also, you've likely just more than doubled the
runtime of a successful match in relation_has_unique_index_for().
Previously, on average we'd have found that match halfway through the
list of indexes. Now it's most likely you'll end up looking at every
index, even after a match has been found. That does not seem well
aligned to keeping the CPU overhead for the patch low.

What is wrong with just weeding out join quals that don't have the
equivalent expression on either side before passing them to
relation_has_unique_index_for()? That'll save you from getting false
matches like I showed. To make that work it just seems mostly like
you'd mostly just need to swap the order of operations in the patch,
but you'd also likely end up needing to rip out all the UniqueRelInfo
code, since I don't think that'll be needed any longer. Likely that
means your entire patch would be limited to analyzejoins.c, although
I'm unsure what of the eclass editing code should be moved into
equivclass.c.

> > I also think this way would give you the subquery GROUP BY / DISTINCT
> > self join removal for just about free.
>
>
> Could you explain how exactly we can generalize join removal to the
> DISTINCT/GROUP BY case? I understand the removable self-joins as having
> the same row on both sides of join, as identified by ctid, but I'm not
> sure how to apply this to subqueries. Your earlier DISTINCT example
> looked like it needed a different kind of join removal, with different
> conditions for when it applies (please see my previous letter for details).

Well, by removing the requirement that the unique proofs have to come
from a unique index. I don't think you need to ensure this works for
your patch, it would be nice if it did for free. I just don't think
your implementation should block it from ever working.

> Another thing I'd like to mention is that this patch splits in two
> independent parts, the detection of self-joins and their removal. While
> we are looking for agreement on the detection part, could you also
> review the removal part? I'm sure it has its own share of problems.

I'd rather focus on the detection method before reviewing the removal
code.  If there's some blocker in the detection code then the removal
code is not useful.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Ordered Partitioned Table Scans
Next
From: Christoph Berg
Date:
Subject: Enable data checksums by default