Home > mailing lists

Re: [HACKERS] <> join selectivity estimate question - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: [HACKERS] <> join selectivity estimate question
Date	March 17, 2017 23:14:12
Msg-id	15053.1489770852@sss.pgh.pa.us Whole thread Raw
In response to	Re: [HACKERS] <> join selectivity estimate question (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: [HACKERS] <> join selectivity estimate question (Robert Haas <robertmhaas@gmail.com>) Re: [HACKERS] <> join selectivity estimate question (Thomas Munro <thomas.munro@enterprisedb.com>)
List	pgsql-hackers

Tree view

I wrote:
> The problem here appears to be that we don't have any MCV list for
> the "twothousand" column (because it has a perfectly flat distribution),
> and the heuristic that eqjoinsel_semi is using for the no-MCVs case
> is falling down badly.

Oh ... wait.  eqjoinsel_semi's charter is to "estimate the fraction of the
LHS relation that has a match".  Well, at least in the given regression
test case, it's satisfying that exactly: they all do.  For instance,
this estimate is dead on:

regression=# explain analyze select * from tenk1 a where exists(select * from tenk1 b where a.twothousand =
b.twothousand);                                                       QUERY PLAN

--------------------------------------------------------------------------------
---------------------------------------------Hash Join  (cost=528.00..1123.50 rows=10000 width=244) (actual
time=9.902..15.1
02 rows=10000 loops=1)  Hash Cond: (a.twothousand = b.twothousand)

So eqjoinsel_semi is doing exactly what it thinks it's supposed to.

After a bit more thought, it seems like the bug here is that "the
fraction of the LHS that has a non-matching row" is not one minus
"the fraction of the LHS that has a matching row".  In fact, in
this example, *all* LHS rows have both matching and non-matching
RHS rows.  So the problem is that neqjoinsel is doing something
that's entirely insane for semijoin cases.

It would not be too hard to convince me that neqjoinsel should
simply return 1.0 for any semijoin/antijoin case, perhaps with
some kind of discount for nullfrac.  Whether or not there's an
equal row, there's almost always going to be non-equal row(s).
Maybe we can think of a better implementation but that seems
like the zero-order approximation.
        regards, tom lane

pgsql-hackers by date:

From: Amit Khandekar
Date: 17 March 2017, 23:12:40
Subject: Re: [HACKERS] Parallel Append implementation

From: Amit Khandekar
Date: 17 March 2017, 23:15:51
Subject: Re: [HACKERS] Parallel Append implementation

Re: [HACKERS] <> join selectivity estimate question - Mailing list pgsql-hackers

Previous

Next