Re: [HACKERS] Performance improvement for joins where outer side is unique - Mailing list pgsql-hackers

From David Rowley
Subject Re: [HACKERS] Performance improvement for joins where outer side is unique
Date
Msg-id CAKJS1f8iD5mDMBsCRvV9XnrwQy9xj4R2qyTQLFaCJGAnG_UOHA@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Performance improvement for joins where outer side is unique  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: Performance improvement for joins where outer side is unique  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
On 14 March 2017 at 11:35, David Rowley <david.rowley@2ndquadrant.com> wrote:
On 14 March 2017 at 07:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[ getting back to this patch finally... ]

David Rowley <david.rowley@2ndquadrant.com> writes:
> I've attached a patch which implements this, though only for
> MergeJoin, else I'd imagine we'd also need to ensure all proofs used
> for testing the uniqueness were also hash-able too. I added some XXX
> comments in analyzejoin.c around the mergeopfamilies == NIL tests to
> mention that Merge Join depends on all the unique proof quals having
> mergeopfamilies.  This also assumes we'll never use some subset of
> mergejoin-able quals for a merge join, which could be an interesting
> area in the future, as we might have some btree index on a subset of
> those columns to provide pre-sorted input. In short, it's a great
> optimisation, but I'm a little scared we might break it one day.

Umm ... it's broken already isn't it?  See the comment for
generate_mergejoin_paths:

Thanks for looking at this again. 

Yeah confirmed. It's broken. I guess I just need to remember in the Path if we got all the join quals, although I've not looked in detail what the best fix is. I imagine it'll require storing something else in the JoinPath.

OK, so I've spent some more time on this and I've come up with a solution.

Basically the solution is to not skip mark and restore when joinquals contains any items.  This is a requirement for SEMI joins, but overly cautious for unique joins. However, I think it'll apply in most cases when Merge Join will be a win, and that's when there's a btree index on both sides of the join which covers all columns in the join condition.  I carefully commented this part of the code to explain what can be done to have it apply in more cases.

This caused me to go and change the following code too:

@@ -2676,6 +2688,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
  * it off does not entitle us to deliver an invalid plan.
  */
  else if (innersortkeys == NIL &&
+ !((extra->inner_unique || path->jpath.jointype == JOIN_SEMI) &&
+ list_length(path->jpath.joinrestrictinfo) ==
+ list_length(path->path_mergeclauses)) &&
  !ExecSupportsMarkRestore(inner_path))
  path->materialize_inner = true;

I've been staring at this for a while and I think it's correct. If it's wrong then we'd get "ERROR: unrecognized node type: <n>" from ExecMarkPos().

Here we must make sure and never skip materializing the inner side, if we're not going to skip mark and restore in ExecInitMergeJoin().

I've attached a patch which fixes the issue.

Also added a regression test to try to make sure it stays fixed. There's a few other mostly cosmetic fixes in there too.
 
--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: [HACKERS] ToDo: listagg is in ANSI/SQL:2016
Next
From: Beena Emerson
Date:
Subject: Re: [HACKERS] increasing the default WAL segment size