Re: Performance improvement for joins where outer side is unique - Mailing list pgsql-hackers

From David Rowley
Subject Re: Performance improvement for joins where outer side is unique
Date
Msg-id CAKJS1f_6iSqJ15b8PFfh1yJ0tyaDeyF8pMysw-4gP1ES-iHKMg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Performance improvement for joins where outer side is unique  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: Performance improvement for joins where outer side is unique  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
On 14 March 2017 at 16:37, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 14 March 2017 at 11:35, David Rowley <david.rowley@2ndquadrant.com>
> wrote:
>>
>> On 14 March 2017 at 07:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>
>>> [ getting back to this patch finally... ]
>>>
>>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>> > I've attached a patch which implements this, though only for
>>> > MergeJoin, else I'd imagine we'd also need to ensure all proofs used
>>> > for testing the uniqueness were also hash-able too. I added some XXX
>>> > comments in analyzejoin.c around the mergeopfamilies == NIL tests to
>>> > mention that Merge Join depends on all the unique proof quals having
>>> > mergeopfamilies.  This also assumes we'll never use some subset of
>>> > mergejoin-able quals for a merge join, which could be an interesting
>>> > area in the future, as we might have some btree index on a subset of
>>> > those columns to provide pre-sorted input. In short, it's a great
>>> > optimisation, but I'm a little scared we might break it one day.
>>>
>>> Umm ... it's broken already isn't it?  See the comment for
>>> generate_mergejoin_paths:
>>>
>> Thanks for looking at this again.
>>
>> Yeah confirmed. It's broken. I guess I just need to remember in the Path
>> if we got all the join quals, although I've not looked in detail what the
>> best fix is. I imagine it'll require storing something else in the JoinPath.
>
>
> OK, so I've spent some more time on this and I've come up with a solution.
>
> Basically the solution is to not skip mark and restore when joinquals
> contains any items.  This is a requirement for SEMI joins, but overly
> cautious for unique joins. However, I think it'll apply in most cases when
> Merge Join will be a win, and that's when there's a btree index on both
> sides of the join which covers all columns in the join condition.  I
> carefully commented this part of the code to explain what can be done to
> have it apply in more cases.
>
> This caused me to go and change the following code too:
>
> @@ -2676,6 +2688,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath
> *path,
>   * it off does not entitle us to deliver an invalid plan.
>   */
>   else if (innersortkeys == NIL &&
> + !((extra->inner_unique || path->jpath.jointype == JOIN_SEMI) &&
> + list_length(path->jpath.joinrestrictinfo) ==
> + list_length(path->path_mergeclauses)) &&
>   !ExecSupportsMarkRestore(inner_path))
>   path->materialize_inner = true;
>
> I've been staring at this for a while and I think it's correct. If it's
> wrong then we'd get "ERROR: unrecognized node type: <n>" from ExecMarkPos().
>
> Here we must make sure and never skip materializing the inner side, if we're
> not going to skip mark and restore in ExecInitMergeJoin().
>
> I've attached a patch which fixes the issue.
>
> Also added a regression test to try to make sure it stays fixed. There's a
> few other mostly cosmetic fixes in there too.

Patch is attached which fixes up the conflict between the expression
evaluation performance patch.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: crashes due to setting max_parallel_workers=0
Next
From: Thomas Munro
Date:
Subject: Re: WIP: [[Parallel] Shared] Hash