
From Tom Lane
Subject Re: [HACKERS] 200 = 199 + 1?
Msg-id 6709.1506527111@sss.pgh.pa.us
In response to [HACKERS] 200 = 199 + 1?  (Marko Tiikkaja <marko@joh.to>)
Responses Re: [HACKERS] 200 = 199 + 1?
List pgsql-hackers
Marko Tiikkaja <marko@joh.to> writes:
> I just came across this very peculiar behavior:

I think this is a consequence of the clamping + fallback logic in
eqjoinsel_semi.  The planner has no info about the inner select's result,
except that it can see the LIMIT clause, so it uses that as a rowcount
estimate.  get_variable_numdistinct reports DEFAULT_NUM_DISTINCT (200) with
isdefault true; but if the LIMIT is <= 200, eqjoinsel_semi's clamping logic
fires:
    * If we clamp, we can treat nd2 as being a non-default estimate; it's not
    * great, maybe, but it didn't come out of nowhere either.  This is most
    * helpful when the inner relation is empty and consequently has no stats.

and that enables the nondefault (nd2 / nd1) estimate to be used,
which happens to be dead on in this case, because the estimated nd2
is in fact exact.  With a larger LIMIT, we end up with the 0.5 default
selectivity estimate, which is way off in this particular example.
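For anyone reading along without the source open, here is a boiled-down
sketch of that clamp-plus-fallback (the real eqjoinsel_semi in
src/backend/utils/adt/selfuncs.c also folds in null fractions and MCV
matching, which I've dropped; the numbers in main() are made up purely
for illustration):

    #include <stdio.h>
    #include <stdbool.h>

    #define DEFAULT_NUM_DISTINCT 200

    /* Simplified shape of eqjoinsel_semi's ndistinct handling. */
    static double
    semi_join_selectivity(double nd1, bool isdefault1,
                          double nd2, bool isdefault2,
                          double inner_rows)
    {
        /*
         * Clamp nd2 to the inner relation's estimated size.  If we clamp,
         * nd2 becomes non-default -- this is the branch that fires when
         * the LIMIT is <= 200.
         */
        if (nd2 >= inner_rows)
        {
            nd2 = inner_rows;
            isdefault2 = false;
        }

        if (!isdefault1 && !isdefault2)
        {
            /* Usable ndistinct on both sides: estimate as nd2/nd1. */
            if (nd1 <= nd2)
                return 1.0;
            return nd2 / nd1;
        }

        /* No usable info at all: the 0.5 default under discussion. */
        return 0.5;
    }

    int
    main(void)
    {
        double nd1 = 1000.0;   /* pretend outer side has 1000 distinct values */

        /* LIMIT 200: clamp fires, nondefault path, selectivity 200/1000 */
        printf("LIMIT 200 -> %.3f\n",
               semi_join_selectivity(nd1, false,
                                     DEFAULT_NUM_DISTINCT, true, 200.0));

        /* LIMIT 201: no clamp, nd2 stays default, fall back to 0.5 */
        printf("LIMIT 201 -> %.3f\n",
               semi_join_selectivity(nd1, false,
                                     DEFAULT_NUM_DISTINCT, true, 201.0));
        return 0;
    }

With these made-up numbers the estimate jumps from 0.200 to 0.500 as the
LIMIT crosses 200, which is exactly the cliff complained of upthread.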

Now I'd be the first to say that the 0.5 default didn't have all that much
thought put into it.  One idea that occurs to me while staring at this:
why don't we use DEFAULT_EQ_SEL instead?  Or for that matter, maybe just
bulling ahead and using nd2 even if it is default would give superior
results in many cases.  (Not sure whether nd1 ought to get the same
treatment.  At least in this example, we do know nd1 with some precision;
and if we don't know either one, dividing nd2/nd1 to arrive at 1.0 doesn't
sound like a good idea.)  Or maybe take nd2 = size of inner rel if we want
to be conservative but still do something smarter than 0.5.
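To put rough numbers on those options (DEFAULT_EQ_SEL is 0.005 per
selfuncs.h; the inputs below are a made-up scenario, not a patch):

    #include <stdio.h>

    #define DEFAULT_EQ_SEL       0.005  /* planner's generic equality default */
    #define DEFAULT_NUM_DISTINCT 200

    int
    main(void)
    {
        /* Hypothetical inputs: outer ndistinct known, inner side unknown. */
        double nd1 = 1000.0;                /* from outer-side stats */
        double nd2 = DEFAULT_NUM_DISTINCT;  /* pure default, no stats */
        double inner_rows = 500.0;          /* inner rel size estimate */

        printf("status quo (0.5):      %.4f\n", 0.5);
        printf("DEFAULT_EQ_SEL:        %.4f\n", DEFAULT_EQ_SEL);
        printf("bull ahead with nd2:   %.4f\n", nd2 / nd1);
        printf("nd2 = inner rel size:  %.4f\n", inner_rows / nd1);
        return 0;
    }

Which of those lands closest depends entirely on the data, of course; the
point is just that all three alternatives stay tied to something about the
query, which the flat 0.5 does not.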

> I wonder if the nested loop shouldn't have some kind of a cap on its own
> estimate if it's wildly off of what you'd get by multiplying the child
> nodes' estimates with each other?

Nonstarter, I'm afraid.  The join relation's size estimate is determined
long before we get to a point where we could multiply the sizes of these
particular child paths to arrive at the conclusion that it should be
something different than what we decided originally.  Adjusting the size
of the nestloop result at that point would merely give it an unfair
advantage over other paths for the same join relation.  (I think it would
also break some assumptions about paths for the same relation all giving
the same number of rows, unless parameterized.)
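(To put a made-up scale on that unfairness: if the join rel had been sized
up front at 1000000 * 0.5 = 500000 rows, letting this one nestloop path
re-advertise 500 rows at costing time would make everything above it look
absurdly cheap compared to hash or merge paths still carrying the
500000-row estimate.)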

Looking at it another way, the main thing that the combination of hashagg
outer path + indexscan inner path knows that eqjoinsel_semi didn't account
for is that there's a unique index on foo.id.  But that info is available
to eqjoinsel_semi, in the sense that it's been given a nondefault estimate
that nd1 is equal to the outer relation size.  So the mistake that it's
making is to throw up its hands and use a 0.5 selectivity estimate just
because it has no info about the inner relation.  I think if we'd pushed
through the nd2/nd1 calculation after setting nd2 = size of inner rel,
we'd end up with an estimate matching the product of these path sizes.
(Caution: inadequate caffeine absorbed yet, this might be all wrong.)
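Spelling that out with made-up numbers (nothing below comes from the
original tables): say foo has 1000000 rows, so the unique index gives
nd1 = 1000000, and the inner select is LIMIT 500, so we'd set nd2 = 500.
Then

    selec = nd2 / nd1 = 500 / 1000000 = 0.0005
    rows  = outer_rows * selec = 1000000 * 0.0005 = 500

which is the same 500 you'd get from the hashagg path's ~500 groups times
the single row per probe that the unique-index scan returns.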
        regards, tom lane


