Re: PATCH: use foreign keys to improve join estimates v1 - Mailing list pgsql-hackers

From David Rowley
Subject Re: PATCH: use foreign keys to improve join estimates v1
Date
Msg-id CAKJS1f91RSjkY571XL09nokjuK849ALBrCORLSAh=dMDc7bJFg@mail.gmail.com
Whole thread Raw
In response to Re: PATCH: use foreign keys to improve join estimates v1  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: PATCH: use foreign keys to improve join estimates v1  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On 19 May 2015 at 11:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
 
On 05/17/15 14:31, David Rowley wrote:

I think if you find any quals that are a part of *any* foreign key
between the 2 join tables, then you should be never assume these quals
to reduce the number of rows. I believe this should be the case even if
you don't fully match all columns of the foreign key.

If we modify your example a little, let's say your foreign key between
fact and dim is made from 3 columns (a,b,c)

If we do:

EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);

Then we should always (under normal circumstances) find at least one
matching row, although in this case since the join qual for c is
missing, we could find more than 1 matching row.

Without digging too deep here, I'd say that the best way to do this
would be to either have calc_joinrel_size_estimate() build a list of
restrictinfo items of all quals that are NOT part of any foreign key and
pass that trimmed list down to clauselist_selectivity() for selectivity
estimates. Or perhaps a better way would be just to
teach clauselist_selectivity() about foreign keys. Likely
clauselist_selectivity() would just have to skip over RestrictInfo items
that are part of a foreign key.

I'm not particularly happy about the way the current patch tweaks the selectivity, but I think simply removing the clauses is not going to work, because that implies the (removed) conditions have selectivity 1.0 (so the estimate would match a cartesian product). So we really need to estimate the selectivity, we can't just throw them away.


Of course I should have said 1.0 / <estimated rows>
 
And that's essentially what the current patch does - it realizes the clauses match a FK, and then computes the estimate as 1/N where N is the cardinality of the table with PK.

Another thing is that there may be multiple "candidate" foreign keys, and we can't just remove clauses matching at least one of them. Imagine e.g. a (somewhat artificial) example:

    CREATE TABLE parent (
       a INT,
       b INT,
       c INT,
       d INT,
       PRIMARY KEY (a,b),
       UNIQUE (c,d)
    );

    CREATE TABLE child (
       a INT,
       b INT,
       c INT,
       d INT,
       FOREIGN KEY (a,b) REFERENCES parent (a,b),
       FOREIGN KEY (c,d) REFERENCES parent (c,b)
    );

and a join on (a,b,c,d). We can't just discard all the conditions, because (a,b) and (c,d) may 'mismatch'. We know (a,b) and (c,d) matches about 1/N of the 'parent' table, but we don't know selectivity for the whole join condition.


Ok how about we ignore partial foreign key matches and just get things working when additional quals exist that are not a part of the foreign key.

I think here it would just be a matter of just deciding on which foreign key to use the quals for and which ones to estimate the old fashioned way.

How about we have a function:

List *
get_non_foreign_key_quals(PlannerInfo *root, List *restrictlist) /* XXX think of a better name */

which will be called from within calc_joinrel_size_estimate()

This function will simply take the list that's currently being sent off to clauselist_selectivity() and it would search for the biggest possible subset of foreign key columns from within that list. The returned list would be what we would pass to clauselist_selectivity(), so if nothing matched we'd just do the normal thing.

Let's say we have a restrictlist like in the join you describe above... I'll use p for parent, and c for the child table:

{p.a = c.a}, {p.b = c.b}, {p.c = c.c}, {p.d = c.d}

get_non_foreign_key_quals would parse that list looking for foreign keys going in either direction from both of these relations. It would look for the "longest chain" of quals that match a foreign key and return the quals which couldn't be matched. This "couldn't match" list would be what would be returned to clauselist_selectivity(), the return value from that function would have to be multiplied with.... I think 1.0 / min(<est inner rows>, <est outer rows>). Though I've not quite gotten my head around how outer joins change that, but I think that's ok for inner... Perhaps the divide depends on which relation the fk is on... I need to think a bit more to get my head around that...

I think you could implement this by generating a List of Bitmapset, pseudo code along the lines of:

List *
get_non_foreign_key_quals(PlannerInfo *root, List *restrictlist) /* XXX think of a better name */
{
List *fkey_matches;
foreach (foreign key in list)
{
    Bitmapset matches;
    if (foreign key is not for this rel)
       continue;
 
    foreach (qual in fkey)
    {
         int which_qual = 1;
         foreach (qual in restrict list)
         {
            if (fkey qual matches ri qual)
               matches = bmp_add_member(matches, which_qual);
            which_qual++;
         }
   }
   fkey_matches = lappend(fkey_matches, matches);
}

int longest_match = -1
int longest_match_idx;
int idx = 0;
foreach (match in fkey_matches) 
{
    int nmembers = bms_num_members(match);
    /* only change the match it beats a previous match
     * as we want to take the first longest match */
    if (nmembers  > longest_match)
    {
       longest_match = nmembers;
       longest_match_idx = idx;
   }
   idx++;
}

if (longest_match == list_length(restrictlist))
  return NULL; /* everything matches a fkey */
else
{
   List *non_fk_list;
   int which_qual = 1;
   foreach (qual in restrict_list)
   {
      if (!bms_is_member(longest_match, which_qual))
        non_fk_list = lappend(non_fk_list, qual);
     which_qual++;
   }
  return non_fk_list;
 }
}

Now in your example there's two foreign keys both of which will match 2 different quals each, this means that we have two different equally long chains. I don't propose we do anything more complex than take the first of the equal length chains. Although perhaps we'll want something stable... perhaps in order of constraint name or something, but for a first cut I highly doubt this matters. (See qsort() at the bottom of CheckConstraintFetch() for example of what I mean)


And it may be more complex, e.g. there may be two overlapping FKeys, e.b. (a,b) and (b,c) - how do you split that?

But this may be an overkill (multiple multi-column FK join), although if we could handle that reasonably, then why not ... someone out there certainly has schema like that ;-)

What I think is somewhat more realistic is conditions matching only a subset of FK columns - for example with a FK on (a,b) the join may only use (a), for example. Then again - we can't just discard that, we need to estimate that (and use it to compute the actual selectivity).

I agree that if we want to do anything fancy with this, we will have to teach clauselist_selectivity() about foreign keys.


A few other things:

+ /* TODO Can we do something like (hasindex) here? Is it necessary? */
+ if (true)

I wondered about this too in my join removals stuff. I ended up adding a pg_class flag for this. This was Tom's response:


Note that the unsetting of relhaspkey is not so great. It only happens to get unset if the primary key is dropped and no other indexes of any type exist on the relation and the table is vacuumed. So I think I understand why Tom didn't want more flags that can end up being wrongly set.

+bool
+rel_has_unique_index(PlannerInfo *root, RelOptInfo *rel)

root is not used by this function.

+ for (i = 0; i < fkinfo->nkeys; i++)
+ match &= matches[i];
+
+ break;

This looks wrong. Why use break? This seems to be broken if there's more than 1 foreign key between the two relations, as if the first of the two does not match then it looks like you'll just return false... I've not tested but that's the way the code seems to read.

Something like:

+ match = true;
+
+ for (i = 0; i < fkinfo->nkeys; i++)
+ {
+ if (!matches[i])
+ {
+ matched = false;
+ break;
+ }
+ }
+
+ /* each fkey column has been matched to the correct index column */
+ if (matched)
+ return true;
+ } /* foreach (lc, rel->fkeylist) */
+
+ return false; /* no matching fkey found */
 
With this we'll try the next fkey in the list if we didn't find a match.


Also I'm pretty sure it's just your temporary workings... but in join_matches_fkey() you're returning 0, 1, or 2 from a bool function. perhaps this won't be required anymore if my get_non_foreign_key_quals() idea works out.

Thoughts?

Regards

David Rowley

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Next
From: Cédric Villemain
Date:
Subject: Re: checkpointer continuous flushing