Problem with CVS HEAD's handling of mergejoins - Mailing list pgsql-hackers

From Tom Lane
Subject Problem with CVS HEAD's handling of mergejoins
Date
Msg-id 24443.1199844480@sss.pgh.pa.us
Whole thread Raw
Responses Re: Problem with CVS HEAD's handling of mergejoins
Re: Problem with CVS HEAD's handling of mergejoins
List pgsql-hackers
So I adjusted the patch I was working on as suggested here
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00251.php
and things started blowing up all over the place --- Assert failures,
"too few pathkeys for mergeclauses" errors, etc :-(

On investigation, the problem seems to be a bit of brain-fade on my
part.  The planner uses "PathKeys" lists both to represent what's known
about the output sort order of a Path, and to represent per-merge-clause
ordering data for a merge join.  A PathKeys list that represents sort
ordering ought to be "canonical", meaning it contains no redundant keys
--- a key might be redundant because it is the same as some prior key
in the list ("ORDER BY x,x") or because it is known equal to a constant
and thus uninteresting for sorting ("WHERE x = 1 ... ORDER BY x").
However, there are several places in the planner that expect the
pathkeys for a merge join to be one-for-one with the selected merge
clauses.

I'm not sure why we didn't come across test cases exposing this problem
earlier in beta.  A partial explanation is that the equal-to-a-constant
case was unlikely to arise before, because given "WHERE x = y AND x = 1"
the code up to now would get rid of "x = y" altogether and then never
try for a mergejoin; my patch to eliminate that behavior was what
exposed the problem.  But there are other ways to get redundant keys in
a list of candidate mergejoin clauses.

One possibility seems to be to keep a list of "raw" (not canonical)
pathkeys for each side of a list of proposed mergejoin clauses, which
is one-to-one with the clauses, with the clear understanding that the
canonical list that describes the path's sort ordering might be just a
subset of this list.  This would mean a couple of new fields in MergePath
structs, but fortunately no on-disk format changes since MergePaths
never get to disk.

A perhaps less invasive idea is to discard any proposed mergeclauses
that are redundant in this sense.  This would still require some
reshuffling of responsibility between select_mergejoin_clauses and
the code in pathkeys.c, since right now select_mergejoin_clauses
takes no account of that.  However, I'm worried that that might result
in planner failure on some FULL JOIN cases that work today, since we
require all the join clauses to be mergejoinable for a FULL JOIN.
People seem to complain when the planner fails, even for really
stupid queries ;-).  I think this would only be acceptable if we can
prove that ignoring clauses that are redundant in this sense doesn't
change the result --- which might be the case, but I'm not sure.

I think I can fix this in a day or so, but I now definitely feel that
we'll need an RC2 :-(
        regards, tom lane


pgsql-hackers by date:

Previous
From: Gregory Stark
Date:
Subject: Re: Index trouble with 8.3b4
Next
From: Bruce Momjian
Date:
Subject: Re: Problem with CVS HEAD's handling of mergejoins