[HACKERS] <> join selectivity estimate question

From: Thomas Munro
Hi hackers,

While studying a regression reported[1] against my parallel hash join
patch, I noticed that we can reach both a good and a bad plan in
unpatched master as well.  One of the causes seems to be the estimated
selectivity of a semi-join with an extra <> filter qual.

Here are some times I measured for TPCH Q21 at scale 10 and work_mem
of 1GB.  That is a query with a large anti-join and a large semi-join.

  8 workers = 8.3s
  7 workers = 8.2s
  6 workers = 8.5s
  5 workers = 8.9s
  4 workers = 9.5s
  3 workers = 39.7s
  2 workers = 36.9s
  1 worker  = 38.2s
  0 workers = 47.9s

Please see the attached query plans showing the change in plan from
Hash Semi Join to Nested Loop Semi Join that happens only once we
reach 4 workers and the (partial) base relation size becomes smaller.
The interesting thing is that the row estimates for the semi-join and
anti-join come out as 1 (I think this is 0 clamped to 1).

The same thing can be seen with a simple semi-join, if you happen to
have TPCH loaded.  Compare these two queries:

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
                  FROM lineitem l2
                 WHERE l1.l_orderkey = l2.l_orderkey);

 -> estimates 59,986,012 rows, actual rows 59,986,052 (scale 10 TPCH)

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
                  FROM lineitem l2
                 WHERE l1.l_orderkey = l2.l_orderkey
                   AND l1.l_suppkey <> l2.l_suppkey);

 -> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)

Or for a standalone example:

  CREATE TABLE foo AS
  SELECT (generate_series(1, 1000000) / 4)::int AS a,
         (generate_series(1, 1000000) % 100)::int AS b;

  ANALYZE foo;

  SELECT *
    FROM foo f1
   WHERE EXISTS (SELECT *
                   FROM foo f2
                  WHERE f1.a = f2.a);

 -> estimates 1,000,000 rows

  SELECT *
    FROM foo f1
   WHERE EXISTS (SELECT *
                   FROM foo f2
                  WHERE f1.a = f2.a
                    AND f1.b <> f2.b);

 -> estimates 1 row

I'm trying to wrap my brain around the selectivity code, but I'm too
green in this part of the planner, which I haven't previously focused
on, to grok it yet.  I'd like to understand whether this is expected
behaviour so that I can figure out how to tackle the reported
regression with my patch.  What is happening here?

Thanks for reading.

[1] https://www.postgresql.org/message-id/CAEepm%3D3Og-7-b3WOkiT%3Dc%2B6y3eZ0VVSyb1K%2BSOvF17BO5KAt0A%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


Attachment

Re: [HACKERS] <> join selectivity estimate question

From: Robert Haas
On Fri, Mar 17, 2017 at 1:54 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>  SELECT *
>    FROM lineitem l1
>   WHERE EXISTS (SELECT *
>                   FROM lineitem l2
>                  WHERE l1.l_orderkey = l2.l_orderkey);
>
>  -> estimates 59,986,012 rows, actual rows 59,986,052 (scale 10 TPCH)
>
>  SELECT *
>    FROM lineitem l1
>   WHERE EXISTS (SELECT *
>                   FROM lineitem l2
>                  WHERE l1.l_orderkey = l2.l_orderkey
>                    AND l1.l_suppkey <> l2.l_suppkey);
>
>  -> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)

The relevant code is in neqsel().  It estimates the fraction of rows
that will be equal, and then does 1 - that number.  Evidently, the
query planner thinks that l1.l_suppkey = l2.l_suppkey would almost
always be true, and therefore l1.l_suppkey <> l2.l_suppkey will almost
always be false.  I think the presumed selectivity of l1.l_suppkey =
l2.l_suppkey is being computed by var_eq_non_const(), but I'm a little
puzzled by how that function is managing to produce a selectivity estimate
of, essentially, 1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] <> join selectivity estimate question

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> The relevant code is in neqsel().  It estimates the fraction of rows
> that will be equal, and then does 1 - that number.  Evidently, the
> query planner thinks that l1.l_suppkey = l2.l_suppkey would almost
> always be true, and therefore l1.l_suppkey <> l2.l_suppkey will almost
> always be false.  I think the presumed selectivity of l1.l_suppkey =
> l2.l_suppkey is being computed by var_eq_non_const(), but I'm a little
> puzzled by how that function is managing to produce a selectivity estimate
> of, essentially, 1.

No, I believe it's going through neqjoinsel and thence to eqjoinsel_semi.
This query will have been flattened into a semijoin.
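
For reference, neqjoinsel is roughly the following (a simplified sketch
of the selfuncs.c logic, not the exact source): it computes the negator
equality operator's join selectivity and inverts it, passing jointype
and sjinfo straight through, so for a semijoin the final "1 - x" is
applied on top of eqjoinsel_semi's match fraction.

    Datum
    neqjoinsel(PG_FUNCTION_ARGS)
    {
        PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
        Oid         operator = PG_GETARG_OID(1);
        List       *args = (List *) PG_GETARG_POINTER(2);
        JoinType    jointype = (JoinType) PG_GETARG_INT16(3);
        SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4);
        Oid         eqop;
        float8      result;

        /*
         * We want 1 - eqjoinsel() where the equality operator is the one
         * associated with this != operator, that is, its negator.
         */
        eqop = get_negator(operator);
        if (eqop)
            result = DatumGetFloat8(DirectFunctionCall5(eqjoinsel,
                                                PointerGetDatum(root),
                                                ObjectIdGetDatum(eqop),
                                                PointerGetDatum(args),
                                                Int16GetDatum(jointype),
                                                PointerGetDatum(sjinfo)));
        else
            result = DEFAULT_EQ_SEL;    /* no negator; use default */

        result = 1.0 - result;
        PG_RETURN_FLOAT8(result);
    }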

I can reproduce a similarly bad estimate in the regression database:

regression=# explain select * from tenk1 a where exists(select * from tenk1 b
  where a.thousand = b.thousand and a.twothousand <> b.twothousand);
                               QUERY PLAN
--------------------------------------------------------------------------
 Hash Semi Join  (cost=583.00..1067.25 rows=1 width=244)
   Hash Cond: (a.thousand = b.thousand)
   Join Filter: (a.twothousand <> b.twothousand)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244)
   ->  Hash  (cost=458.00..458.00 rows=10000 width=8)
         ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=8)
(6 rows)

The problem here appears to be that we don't have any MCV list for
the "twothousand" column (because it has a perfectly flat distribution),
and the heuristic that eqjoinsel_semi is using for the no-MCVs case
is falling down badly.
        regards, tom lane



Re: [HACKERS] <> join selectivity estimate question

From: Tom Lane
I wrote:
> The problem here appears to be that we don't have any MCV list for
> the "twothousand" column (because it has a perfectly flat distribution),
> and the heuristic that eqjoinsel_semi is using for the no-MCVs case
> is falling down badly.

Oh ... wait.  eqjoinsel_semi's charter is to "estimate the fraction of the
LHS relation that has a match".  Well, at least in the given regression
test case, it's satisfying that exactly: they all do.  For instance,
this estimate is dead on:

regression=# explain analyze select * from tenk1 a where exists(select * from tenk1 b where a.twothousand = b.twothousand);
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Hash Join  (cost=528.00..1123.50 rows=10000 width=244) (actual time=9.902..15.102 rows=10000 loops=1)
   Hash Cond: (a.twothousand = b.twothousand)

So eqjoinsel_semi is doing exactly what it thinks it's supposed to.

After a bit more thought, it seems like the bug here is that "the
fraction of the LHS that has a non-matching row" is not one minus
"the fraction of the LHS that has a matching row".  In fact, in
this example, *all* LHS rows have both matching and non-matching
RHS rows.  So the problem is that neqjoinsel is doing something
that's entirely insane for semijoin cases.

It would not be too hard to convince me that neqjoinsel should
simply return 1.0 for any semijoin/antijoin case, perhaps with
some kind of discount for nullfrac.  Whether or not there's an
equal row, there's almost always going to be non-equal row(s).
Maybe we can think of a better implementation but that seems
like the zero-order approximation.
        regards, tom lane



Re: [HACKERS] <> join selectivity estimate question

From: Robert Haas
On Fri, Mar 17, 2017 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After a bit more thought, it seems like the bug here is that "the
> fraction of the LHS that has a non-matching row" is not one minus
> "the fraction of the LHS that has a matching row".  In fact, in
> this example, *all* LHS rows have both matching and non-matching
> RHS rows.  So the problem is that neqjoinsel is doing something
> that's entirely insane for semijoin cases.

Thanks for the analysis.  I had a niggling feeling that there might be
something of this sort going on, but I was not sure.

> It would not be too hard to convince me that neqjoinsel should
> simply return 1.0 for any semijoin/antijoin case, perhaps with
> some kind of discount for nullfrac.  Whether or not there's an
> equal row, there's almost always going to be non-equal row(s).
> Maybe we can think of a better implementation but that seems
> like the zero-order approximation.

Yeah, it's not obvious how to do better than that considering only one
clause at a time.  Of course, what we really want to know is
P(x<>y|z=t), but don't ask me how to compute that.
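
(To make that concrete with the "foo" table from upthread: P(f1.b <>
f2.b) is 1 - 1/100 = 0.99 unconditionally, but for a pair drawn from
the same, typically four-row, a-group it is only 12/16 = 0.75 -- and
yet the quantity the semijoin estimate actually needs, the fraction of
LHS rows with at least one such partner, is exactly 1.0.)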

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] <> join selectivity estimate question

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Mar 17, 2017 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It would not be too hard to convince me that neqjoinsel should
>> simply return 1.0 for any semijoin/antijoin case, perhaps with
>> some kind of discount for nullfrac.  Whether or not there's an
>> equal row, there's almost always going to be non-equal row(s).
>> Maybe we can think of a better implementation but that seems
>> like the zero-order approximation.

> Yeah, it's not obvious how to do better than that considering only one
> clause at a time.  Of course, what we really want to know is
> P(x<>y|z=t), but don't ask me how to compute that.

Yeah.  Another hole in this solution is that it means that the
estimate for x <> y will be quite different from the estimate
for NOT(x = y).  You wouldn't notice it in the field unless
somebody forgot to put a negator link on their equality operator,
but it seems like ideally we'd think of a solution that made sense
for generic NOT in this context.

No, I have no idea how to do that.
        regards, tom lane



Re: [HACKERS] <> join selectivity estimate question

From: Thomas Munro
On Sat, Mar 18, 2017 at 6:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After a bit more thought, it seems like the bug here is that "the
> fraction of the LHS that has a non-matching row" is not one minus
> "the fraction of the LHS that has a matching row".  In fact, in
> this example, *all* LHS rows have both matching and non-matching
> RHS rows.  So the problem is that neqjoinsel is doing something
> that's entirely insane for semijoin cases.
>
> It would not be too hard to convince me that neqjoinsel should
> simply return 1.0 for any semijoin/antijoin case, perhaps with
> some kind of discount for nullfrac.  Whether or not there's an
> equal row, there's almost always going to be non-equal row(s).
> Maybe we can think of a better implementation but that seems
> like the zero-order approximation.

Right.  If I temporarily hack neqjoinsel() thus:

        result = 1.0 - result;
+
+       if (jointype == JOIN_SEMI)
+               result = 1.0;
+
        PG_RETURN_FLOAT8(result);
 }

... then I obtain sensible row estimates and the following speedups
for TPCH Q21:

  8 workers = 8.3s -> 7.8s
  7 workers = 8.2s -> 7.9s
  6 workers = 8.5s -> 8.2s
  5 workers = 8.9s -> 8.5s
  4 workers = 9.5s -> 9.1s
  3 workers = 39.7s -> 9.9s
  2 workers = 36.9s -> 11.7s
  1 worker  = 38.2s -> 15.0s
  0 workers = 47.9s -> 24.7s

The plan is similar to the good plan from before even at lower worker
counts, but slightly better because the aggregation has been pushed
under the Gather node.  See attached.

-- 
Thomas Munro
http://www.enterprisedb.com


Attachment

Re: [HACKERS] <> join selectivity estimate question

From: Thomas Munro
On Sat, Mar 18, 2017 at 11:49 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sat, Mar 18, 2017 at 6:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> After a bit more thought, it seems like the bug here is that "the
>> fraction of the LHS that has a non-matching row" is not one minus
>> "the fraction of the LHS that has a matching row".  In fact, in
>> this example, *all* LHS rows have both matching and non-matching
>> RHS rows.  So the problem is that neqjoinsel is doing something
>> that's entirely insane for semijoin cases.
>>
>> It would not be too hard to convince me that neqjoinsel should
>> simply return 1.0 for any semijoin/antijoin case, perhaps with
>> some kind of discount for nullfrac.  Whether or not there's an
>> equal row, there's almost always going to be non-equal row(s).
>> Maybe we can think of a better implementation but that seems
>> like the zero-order approximation.
>
> Right.  If I temporarily hack neqjoinsel() thus:
>
>         result = 1.0 - result;
> +
> +       if (jointype == JOIN_SEMI)
> +               result = 1.0;
> +
>         PG_RETURN_FLOAT8(result);
>  }
>
> ... then I obtain sensible row estimates and the following speedups
> for TPCH Q21:
>
>   8 workers = 8.3s -> 7.8s
>   7 workers = 8.2s -> 7.9s
>   6 workers = 8.5s -> 8.2s
>   5 workers = 8.9s -> 8.5s
>   4 workers = 9.5s -> 9.1s
>   3 workers = 39.7s -> 9.9s
>   2 workers = 36.9s -> 11.7s
>   1 worker  = 38.2s -> 15.0s
>   0 workers = 47.9s -> 24.7s
>
> The plan is similar to the good plan from before even at lower worker
> counts, but slightly better because the aggregation has been pushed
> under the Gather node.  See attached.

... and so has the anti-join, probably more importantly.

Thanks for looking at this!

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] <> join selectivity estimate question

From: Dilip Kumar
On Fri, Mar 17, 2017 at 6:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Right.  If I temporarily hack neqjoinsel() thus:
>
>         result = 1.0 - result;
> +
> +       if (jointype == JOIN_SEMI)
> +               result = 1.0;
> +
>         PG_RETURN_FLOAT8(result);
>  }

I was looking into this problem.  IMHO the correct solution would be
that for JOIN_SEMI, neqjoinsel should not estimate the equijoin
selectivity using eqjoinsel_semi; instead, it should calculate the
equijoin selectivity as for an inner join and then get the selectivity
of <> as (1 - equijoin selectivity).  For an inner join we can claim
that "selectivity of '=' + selectivity of '<>' = 1", but the same is
not true for semi-join selectivity.  For a semi-join it is possible
that the selectivities of '=' and '<>' are both 1.

something like below
----------------------------

@@ -2659,7 +2659,13 @@ neqjoinsel(PG_FUNCTION_ARGS)
        SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4);
        Oid                     eqop;
        float8          result;
 
+       if (jointype = JOIN_SEMI)
+       {
+               sjinfo->jointype = JOIN_INNER;
+       }
+
        /*
         * We want 1 - eqjoinsel() where the equality operator is the one
         * associated with this != operator, that is, its negator.

We may need something similar for anti-join as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] <> join selectivity estimate question

From: Robert Haas
On Wed, May 31, 2017 at 1:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> +       if (jointype = JOIN_SEMI)
> +       {
> +               sjinfo->jointype = JOIN_INNER;
> +       }

That is pretty obviously half-baked and completely untested.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] <> join selectivity estimate question

From: Dilip Kumar
On Thu, Jun 1, 2017 at 8:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 31, 2017 at 1:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> +       if (jointype = JOIN_SEMI)
>> +       {
>> +               sjinfo->jointype = JOIN_INNER;
>> +       }
>
> That is pretty obviously half-baked and completely untested.

Actually, I was not proposing this patch; rather, I wanted to discuss
the approach.  I was claiming that for non-equal JOIN_SEMI selectivity
estimation, instead of calculating the selectivity in the existing way,
i.e. 1 - (selectivity of equal JOIN_SEMI), the better way would be
1 - (selectivity of equal).  I have only tested a standalone scenario,
where it solves the problem, but not the TPCH cases.  But I was more
interested in discussing whether the way I think it should calculate
the non-equal SEMI join selectivity makes any sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] <> join selectivity estimate question

From: Tom Lane
Dilip Kumar <dilipbalaut@gmail.com> writes:
> Actually, I was not proposing this patch; rather, I wanted to discuss
> the approach.  I was claiming that for non-equal JOIN_SEMI selectivity
> estimation, instead of calculating the selectivity in the existing way,
> i.e. 1 - (selectivity of equal JOIN_SEMI), the better way would be
> 1 - (selectivity of equal).  I have only tested a standalone scenario,
> where it solves the problem, but not the TPCH cases.  But I was more
> interested in discussing whether the way I think it should calculate
> the non-equal SEMI join selectivity makes any sense.

I don't think it does really.  The thing about a <> semijoin is that it
will succeed unless *every* join key value from the inner query is equal
to the outer key value (or is null).  That's something we should consider
to be of very low probability typically, so that the <> selectivity should
be estimated as nearly 1.0.  If the regular equality selectivity
approaches 1.0, or when there are expected to be very few rows out of the
inner query, then maybe the <> estimate should start to drop off from 1.0,
but it surely doesn't move linearly with the equality selectivity.
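
For instance, a contrived illustration reusing the "foo" table from
upthread (the "rhs" table here is made up for the example):

  CREATE TABLE rhs AS SELECT 42 AS b;  -- single distinct non-null value

  SELECT count(*)
    FROM foo f1
   WHERE EXISTS (SELECT 1 FROM rhs r WHERE f1.b <> r.b);

Only the foo rows with b = 42 fail the <> semijoin, so 99% of the rows
qualify; with two or more distinct values in rhs, every non-null f1.b
would qualify and a selectivity of 1.0 would be exactly right.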

BTW, I'd momentarily confused this thread with the one about bug #14676,
which points out that neqsel() isn't correctly accounting for nulls.
neqjoinsel() isn't either.  Not sure that we want to solve both things
in one patch though.
        regards, tom lane



Re: [HACKERS] <> join selectivity estimate question

From: Thomas Munro
On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't think it does really.  The thing about a <> semijoin is that it
> will succeed unless *every* join key value from the inner query is equal
> to the outer key value (or is null).  That's something we should consider
> to be of very low probability typically, so that the <> selectivity should
> be estimated as nearly 1.0.  If the regular equality selectivity
> approaches 1.0, or when there are expected to be very few rows out of the
> inner query, then maybe the <> estimate should start to drop off from 1.0,
> but it surely doesn't move linearly with the equality selectivity.

Ok, here I go like a bull in a china shop: please find attached a
draft patch.  Is this getting warmer?

In the comment for JOIN_SEMI I mentioned a couple of refinements I
thought of but my intuition was that we don't go for such sensitive
and discontinuous treatment of stats; so I made the simplifying
assumption that RHS always has more than 1 distinct value in it.

Anti-join <> returns all the nulls from the LHS, and then it only
returns other LHS rows if there is exactly one distinct non-null value
in RHS and it happens to be that one.  But if we make the same
assumption I described above, namely that there are always at least 2
distinct values on the RHS, then the join selectivity is just
nullfrac.

-- 
Thomas Munro
http://www.enterprisedb.com


Attachment

Re: [HACKERS] <> join selectivity estimate question

From: Ashutosh Bapat
On Thu, Jul 20, 2017 at 11:04 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't think it does really.  The thing about a <> semijoin is that it
>> will succeed unless *every* join key value from the inner query is equal
>> to the outer key value (or is null).  That's something we should consider
>> to be of very low probability typically, so that the <> selectivity should
>> be estimated as nearly 1.0.  If the regular equality selectivity
>> approaches 1.0, or when there are expected to be very few rows out of the
>> inner query, then maybe the <> estimate should start to drop off from 1.0,
>> but it surely doesn't move linearly with the equality selectivity.
>
> Ok, here I go like a bull in a china shop: please find attached a
> draft patch.  Is this getting warmer?
>
> In the comment for JOIN_SEMI I mentioned a couple of refinements I
> thought of but my intuition was that we don't go for such sensitive
> and discontinuous treatment of stats; so I made the simplifying
> assumption that RHS always has more than 1 distinct value in it.
>
> Anti-join <> returns all the nulls from the LHS, and then it only
> returns other LHS rows if there is exactly one distinct non-null value
> in RHS and it happens to be that one.  But if we make the same
> assumption I described above, namely that there are always at least 2
> distinct values on the RHS, then the join selectivity is just
> nullfrac.
>

The patch looks good to me.

+       /*
+        * For semi-joins, if there is more than one distinct key in the RHS
+        * relation then every non-null LHS row must find a match since it can
+        * only be equal to one of them.
The word "match" confusing. Google's dictionary entry gives "be equal
to (something) in quality or strength." as its meaning. May be we want
to reword it as "... LHS row must find a joining row in RHS ..."?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] <> join selectivity estimate question

From: Thomas Munro
On Thu, Jul 20, 2017 at 11:47 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, Jul 20, 2017 at 11:04 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't think it does really.  The thing about a <> semijoin is that it
>>> will succeed unless *every* join key value from the inner query is equal
>>> to the outer key value (or is null).  That's something we should consider
>>> to be of very low probability typically, so that the <> selectivity should
>>> be estimated as nearly 1.0.  If the regular equality selectivity
>>> approaches 1.0, or when there are expected to be very few rows out of the
>>> inner query, then maybe the <> estimate should start to drop off from 1.0,
>>> but it surely doesn't move linearly with the equality selectivity.
>>
>> Ok, here I go like a bull in a china shop: please find attached a
>> draft patch.  Is this getting warmer?
>>
>> In the comment for JOIN_SEMI I mentioned a couple of refinements I
>> thought of but my intuition was that we don't go for such sensitive
>> and discontinuous treatment of stats; so I made the simplifying
>> assumption that RHS always has more than 1 distinct value in it.
>>
>> Anti-join <> returns all the nulls from the LHS, and then it only
>> returns other LHS rows if there is exactly one distinct non-null value
>> in RHS and it happens to be that one.  But if we make the same
>> assumption I described above, namely that there are always at least 2
>> distinct values on the RHS, then the join selectivity is just
>> nullfrac.
>>
>
> The patch looks good to me.
>
> +       /*
> +        * For semi-joins, if there is more than one distinct key in the RHS
> +        * relation then every non-null LHS row must find a match since it can
> +        * only be equal to one of them.
> The word "match" is confusing.  Google's dictionary entry gives "be equal
> to (something) in quality or strength" as its meaning.  Maybe we want
> to reword it as "... LHS row must find a joining row in RHS ..."?

Thanks!  Yeah, here's a version with better comments.

Does anyone know how to test a situation where the join is reversed according to
get_join_variables, or "complicated cases where we can't tell for sure"?

-- 
Thomas Munro
http://www.enterprisedb.com


Attachment

Re: [HACKERS] <> join selectivity estimate question

From: Ashutosh Bapat
On Thu, Jul 20, 2017 at 5:30 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Jul 20, 2017 at 11:47 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Thu, Jul 20, 2017 at 11:04 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I don't think it does really.  The thing about a <> semijoin is that it
>>>> will succeed unless *every* join key value from the inner query is equal
>>>> to the outer key value (or is null).  That's something we should consider
>>>> to be of very low probability typically, so that the <> selectivity should
>>>> be estimated as nearly 1.0.  If the regular equality selectivity
>>>> approaches 1.0, or when there are expected to be very few rows out of the
>>>> inner query, then maybe the <> estimate should start to drop off from 1.0,
>>>> but it surely doesn't move linearly with the equality selectivity.
>>>
>>> Ok, here I go like a bull in a china shop: please find attached a
>>> draft patch.  Is this getting warmer?
>>>
>>> In the comment for JOIN_SEMI I mentioned a couple of refinements I
>>> thought of but my intuition was that we don't go for such sensitive
>>> and discontinuous treatment of stats; so I made the simplifying
>>> assumption that RHS always has more than 1 distinct value in it.
>>>
>>> Anti-join <> returns all the nulls from the LHS, and then it only
>>> returns other LHS rows if there is exactly one distinct non-null value
>>> in RHS and it happens to be that one.  But if we make the same
>>> assumption I described above, namely that there are always at least 2
>>> distinct values on the RHS, then the join selectivity is just
>>> nullfrac.
>>>
>>
>> The patch looks good to me.
>>
>> +       /*
>> +        * For semi-joins, if there is more than one distinct key in the RHS
>> +        * relation then every non-null LHS row must find a match since it can
>> +        * only be equal to one of them.
>> The word "match" is confusing.  Google's dictionary entry gives "be equal
>> to (something) in quality or strength" as its meaning.  Maybe we want
>> to reword it as "... LHS row must find a joining row in RHS ..."?
>
> Thanks!  Yeah, here's a version with better comments.

Thanks. Your version is better than mine.

>
> Does anyone know how to test a situation where the join is reversed according to
> get_join_variables, or "complicated cases where we can't tell for sure"?
>

explain select * from pg_class c right join pg_type t on (c.reltype =
t.oid); would end up with *join_is_reversed = true.  Is that what you
want?  For a semi-join, however, I don't know how to induce that.  AFAIU,
in a semi-join there is only one direction in which the join can be
specified.

I didn't get the part about "complicated cases where we can't tell for sure".
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] <> join selectivity estimate question

From: Tom Lane
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> On Thu, Jul 20, 2017 at 5:30 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Does anyone know how to test a situation where the join is reversed according to
>> get_join_variables, or "complicated cases where we can't tell for sure"?

> explain select * from pg_class c right join pg_type t on (c.reltype =
> t.oid); would end up with *join_is_reversed = true.  Is that what you
> want?  For a semi-join, however, I don't know how to induce that.  AFAIU,
> in a semi-join there is only one direction in which the join can be
> specified.

You just have to flip the <> clause around, eg instead of

explain analyze select * from tenk1 t where exists (select 1 from int4_tbl i where t.ten <> i.f1);

do

explain analyze select * from tenk1 t where exists (select 1 from int4_tbl i where i.f1 <> t.ten);

No matter what the surrounding query is like exactly, one or the
other of those should end up "join_is_reversed".

This would be a bit harder to trigger for equality clauses, where you'd
have to somehow defeat the EquivalenceClass logic's tendency to rip the
clauses apart and reassemble them according to its own whims.  But for
neqjoinsel that's not a problem.

> I didn't get the part about "complicated cases where we can't tell for sure".

You could force that with mixed relation membership on one or both sides
of the <>, for instance "(a.b + b.y) <> a.c".  I don't think it's
especially interesting for the present purpose though, since we're going
to end up with 1.0 selectivity in any case where examine_variable can't
find stats.
        regards, tom lane



Re: [HACKERS] <> join selectivity estimate question

From: Thomas Munro
On Fri, Jul 21, 2017 at 8:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> On Thu, Jul 20, 2017 at 5:30 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> Does anyone know how to test a situation where the join is reversed according to
>>> get_join_variables, or "complicated cases where we can't tell for sure"?
>
>> explain select * from pg_class c right join pg_type t on (c.reltype =
>> t.oid); would end up with *join_is_reversed = true.  Is that what you
>> want?  For a semi-join, however, I don't know how to induce that.  AFAIU,
>> in a semi-join there is only one direction in which the join can be
>> specified.
>
> You just have to flip the <> clause around, eg instead of
>
> explain analyze select * from tenk1 t
>   where exists (select 1 from int4_tbl i where t.ten <> i.f1);
>
> do
>
> explain analyze select * from tenk1 t
>   where exists (select 1 from int4_tbl i where i.f1 <> t.ten);
>
> No matter what the surrounding query is like exactly, one or the
> other of those should end up "join_is_reversed".

Ahh, I see.  Thanks for the explanation.

> This would be a bit harder to trigger for equality clauses, where you'd
> have to somehow defeat the EquivalenceClass logic's tendency to rip the
> clauses apart and reassemble them according to its own whims.  But for
> neqjoinsel that's not a problem.
>
>> I didn't get the part about "complicated cases where we can't tell for sure".
>
> You could force that with mixed relation membership on one or both sides
> of the <>, for instance "(a.b + b.y) <> a.c".  I don't think it's
> especially interesting for the present purpose though, since we're going
> to end up with 1.0 selectivity in any case where examine_variable can't
> find stats.

Thanks.  Bearing all that in mind, I ran through a series of test
scenarios and discovered that my handling for JOIN_ANTI was wrong: I
thought that I had to deal with inverting the result, but I now see
that that's handled elsewhere (calc_joinrel_size_estimate() I think).
So neqjoinsel should just treat JOIN_SEMI and JOIN_ANTI exactly the
same way.
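
In code terms the fix then becomes roughly this (a sketch under the
same at-least-two-distinct-RHS-values assumption; "nullfrac" stands
for the LHS column's null fraction from its statistics, with the
plumbing elided -- the attached patch is the authoritative version):

    if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
    {
        /*
         * Every non-null LHS value is sure to be unequal to at least
         * one of two or more distinct RHS values, so only null LHS
         * rows fail to find a match.  JOIN_ANTI returns the same
         * number; the caller (calc_joinrel_size_estimate()) inverts it.
         */
        result = 1.0 - nullfrac;
    }
    else
        result = 1.0 - result;      /* inner-join case, as before */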

That just leaves the question of whether we should try to handle the
empty RHS and single-value RHS cases using statistics.  My intuition
is that we shouldn't, but I'll be happy to change my intuition and
code that up if that is the feedback from planner gurus.

Please find attached a new version, and a test script I used, which
shows a bunch of interesting cases.  I'll add this to the commitfest.

-- 
Thomas Munro
http://www.enterprisedb.com


Attachment

Re: [HACKERS] <> join selectivity estimate question

From: Ashutosh Bapat
On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>
> Thanks.  Bearing all that in mind, I ran through a series of test
> scenarios and discovered that my handling for JOIN_ANTI was wrong: I
> thought that I had to deal with inverting the result, but I now see
> that that's handled elsewhere (calc_joinrel_size_estimate() I think).
> So neqjoinsel should just treat JOIN_SEMI and JOIN_ANTI exactly the
> same way.

I agree, especially after looking at eqjoinsel_semi(), which is used
for both semi and anti joins; that makes it clearer.

>
> That just leaves the question of whether we should try to handle the
> empty RHS and single-value RHS cases using statistics.  My intuition
> is that we shouldn't, but I'll be happy to change my intuition and
> code that up if that is the feedback from planner gurus.

An empty RHS can also result from dummy relations, which are produced
by constraint exclusion, so maybe that's an interesting case.  A
single-value RHS may be interesting with a partitioned table where all
rows in a given partition end up with the same partition key value.
But maybe those are just different patches.  I am not sure.

>
> Please find attached a new version, and a test script I used, which
> shows a bunch of interesting cases.  I'll add this to the commitfest.

I added some "stable" tests to your patch taking inspiration from the
test SQL file. I think those will be stable across machines and runs.
Please let me know if those look good to you.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment

Re: [HACKERS] <> join selectivity estimate question

From: Simon Riggs
On 6 September 2017 at 04:14, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>>
>> Thanks.  Bearing all that in mind, I ran through a series of test
>> scenarios and discovered that my handling for JOIN_ANTI was wrong: I
>> thought that I had to deal with inverting the result, but I now see
>> that that's handled elsewhere (calc_joinrel_size_estimate() I think).
>> So neqjoinsel should just treat JOIN_SEMI and JOIN_ANTI exactly the
>> same way.
>
> I agree, especially after looking at eqjoinsel_semi(), which is used
> for both semi and anti joins; that makes it clearer.
>
>>
>> That just leaves the question of whether we should try to handle the
>> empty RHS and single-value RHS cases using statistics.  My intuition
>> is that we shouldn't, but I'll be happy to change my intuition and
>> code that up if that is the feedback from planner gurus.
>
> An empty RHS can also result from dummy relations, which are produced
> by constraint exclusion, so maybe that's an interesting case.  A
> single-value RHS may be interesting with a partitioned table where all
> rows in a given partition end up with the same partition key value.
> But maybe those are just different patches.  I am not sure.
>
>>
>> Please find attached a new version, and a test script I used, which
>> shows a bunch of interesting cases.  I'll add this to the commitfest.
>
> I added some "stable" tests to your patch taking inspiration from the
> test SQL file. I think those will be stable across machines and runs.
> Please let me know if those look good to you.

Why isn't this an open item for PG10?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] <> join selectivity estimate question

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> Why isn't this an open item for PG10?

Why should it be?  This behavior has existed for a long time.
        regards, tom lane



Re: [HACKERS] <> join selectivity estimate question

From: Thomas Munro
On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> That just leaves the question of whether we should try to handle the
>> empty RHS and single-value RHS cases using statistics.  My intuition
>> is that we shouldn't, but I'll be happy to change my intuition and
>> code that up if that is the feedback from planner gurus.
>
> An empty RHS can also result from dummy relations, which are produced
> by constraint exclusion, so maybe that's an interesting case.  A
> single-value RHS may be interesting with a partitioned table where all
> rows in a given partition end up with the same partition key value.
> But maybe those are just different patches.  I am not sure.

Can you elaborate on the constraint exclusion case?  We don't care
about the selectivity of an excluded relation, do we?

Any other views on the empty and single value special cases, when
combined with [NOT] EXISTS (SELECT ... WHERE r.something <>
s.something)?  Looking at this again, my feeling is that they're too
obscure to spend time on, but others may disagree.

>> Please find attached a new version, and a test script I used, which
>> shows a bunch of interesting cases.  I'll add this to the commitfest.
>
> I added some "stable" tests to your patch taking inspiration from the
> test SQL file. I think those will be stable across machines and runs.
> Please let me know if those look good to you.

Hmm.  But they show actual rows, not plan->plan_rows, and although the
former is interesting as a sanity check the latter is the thing under
test here.  It seems like we don't have fine enough control of
EXPLAIN's output to show estimated rows but not cost.  I suppose we
could try to capture EXPLAIN's output somehow (plpgsql dynamic
execution or spool output from psql?) and then pull out just the row
estimates, maybe with extra rounding to cope with instability.
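
For example, something like this might work (an untested sketch; the
function name and regexp are made up):

  create or replace function explain_filter(query text) returns setof text
  language plpgsql as
  $$
  declare
      ln text;
  begin
      for ln in execute 'explain ' || query
      loop
          -- hide costs and widths so that only the row estimates remain
          return next regexp_replace(ln, '(cost=\S+ )|( width=\d+)', '', 'g');
      end loop;
  end;
  $$;

... and then run select * from explain_filter('select ...') in the
regression test.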

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] <> join selectivity estimate question

From: Tom Lane
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I added some "stable" tests to your patch taking inspiration from the
>> test SQL file. I think those will be stable across machines and runs.
>> Please let me know if those look good to you.

> Hmm.  But they show actual rows, not plan->plan_rows, and although the
> former is interesting as a sanity check the latter is the thing under
> test here.  It seems like we don't have fine enough control of
> EXPLAIN's output to show estimated rows but not cost.  I suppose we
> could try to capture EXPLAIN's output somehow (plpgsql dynamic
> execution or spool output from psql?) and then pull out just the row
> estimates, maybe with extra rounding to cope with instability.

Don't have time to think about the more general question right now,
but as far as the testing goes, there's already precedent for filtering
EXPLAIN output --- see explain_sq_limit() in subselect.sql.  But I'm
dubious whether the rowcount estimate could be relied on to be perfectly
machine-independent, even if you were hiding costs successfully.
        regards, tom lane



Re: [HACKERS] <> join selectivity estimate question

From: Ashutosh Bapat
On Thu, Sep 14, 2017 at 4:19 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> That just leaves the question of whether we should try to handle the
>>> empty RHS and single-value RHS cases using statistics.  My intuition
>>> is that we shouldn't, but I'll be happy to change my intuition and
>>> code that up if that is the feedback from planner gurus.
>>
>> An empty RHS can also result from dummy relations, which are produced
>> by constraint exclusion, so maybe that's an interesting case.  A
>> single-value RHS may be interesting with a partitioned table where all
>> rows in a given partition end up with the same partition key value.
>> But maybe those are just different patches.  I am not sure.
>
> Can you elaborate on the constraint exclusion case?  We don't care
> about the selectivity of an excluded relation, do we?
>

I meant that an empty RHS doesn't necessarily require an empty table;
it could happen because of a relation excluded by constraints (see
relation_excluded_by_constraints()).  So that's not as obscure as one
might think, though it's not very frequent either.  But I think we
should deal with that as a separate patch.  This patch improves the
estimate in some cases without degrading it in others, so I think we
can leave the other cases for a later patch.

> Any other views on the empty and single value special cases, when
> combined with [NOT] EXISTS (SELECT ... WHERE r.something <>
> s.something)?  Looking at this again, my feeling is that they're too
> obscure to spend time on, but others may disagree.
>
>>> Please find attached a new version, and a test script I used, which
>>> shows a bunch of interesting cases.  I'll add this to the commitfest.
>>
>> I added some "stable" tests to your patch taking inspiration from the
>> test SQL file. I think those will be stable across machines and runs.
>> Please let me know if those look good to you.
>
> Hmm.  But they show actual rows, not plan->plan_rows, and although the
> former is interesting as a sanity check the latter is the thing under
> test here.

I missed this point while adapting the tests.  Sorry.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] <> join selectivity estimate question

From: Ashutosh Bapat
On Thu, Sep 14, 2017 at 4:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> I added some "stable" tests to your patch taking inspiration from the
>>> test SQL file. I think those will be stable across machines and runs.
>>> Please let me know if those look good to you.
>
>> Hmm.  But they show actual rows, not plan->plan_rows, and although the
>> former is interesting as a sanity check the latter is the thing under
>> test here.  It seems like we don't have fine enough control of
>> EXPLAIN's output to show estimated rows but not cost.  I suppose we
>> could try to capture EXPLAIN's output somehow (plpgsql dynamic
>> execution or spool output from psql?) and then pull out just the row
>> estimates, maybe with extra rounding to cope with instability.
>
> Don't have time to think about the more general question right now,
> but as far as the testing goes, there's already precedent for filtering
> EXPLAIN output --- see explain_sq_limit() in subselect.sql.  But I'm
> dubious whether the rowcount estimate could be relied on to be perfectly
> machine-independent, even if you were hiding costs successfully.
>

Are you referring to rounding errors?  We should probably add some fuzz
factor to cover rounding errors, and only cause a diff when the difference
between expected and reported plan rows is beyond that fuzz factor.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] <> join selectivity estimate question

From: Michael Paquier
On Thu, Sep 14, 2017 at 2:23 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Are you referring to rounding errors?  We should probably add some fuzz
> factor to cover rounding errors, and only cause a diff when the difference
> between expected and reported plan rows is beyond that fuzz factor.

As far as I can see the patch proposed in
https://www.postgresql.org/message-id/CAFjFpRfXKadXLe6cS=Er8txF=W6g1htCidQ7EW6eeW=SNcnTmQ@mail.gmail.com/
did not get any reviews. So moved to next CF.
-- 
Michael


Re: [HACKERS] <> join selectivity estimate question

From: Tom Lane
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Please find attached a new version, and a test script I used, which
>> shows a bunch of interesting cases.  I'll add this to the commitfest.

> I added some "stable" tests to your patch taking inspiration from the
> test SQL file. I think those will be stable across machines and runs.
> Please let me know if those look good to you.

This seems to have stalled on the question of what the regression tests
should look like, which seems like a pretty silly thing to get hung up on
when everybody agrees the patch itself is OK.  I tried Ashutosh's proposed
test cases and was pretty unimpressed after noting that they passed
equally well against patched or unpatched backends.  In any case, as noted
upthread, we don't really like to expose exact rowcount estimates in test
cases because of the risk of platform to platform variation.  The more
usual approach for checking whether the planner is making sane estimates
is to find a query whose plan shape changes with or without the patch.
I messed around a bit till I found such a query, and committed it.
        regards, tom lane


Re: [HACKERS] <> join selectivity estimate question

From: Thomas Munro
On Thu, Nov 30, 2017 at 4:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> Please find attached a new version, and a test script I used, which
>>> shows a bunch of interesting cases.  I'll add this to the commitfest.
>
>> I added some "stable" tests to your patch taking inspiration from the
>> test SQL file. I think those will be stable across machines and runs.
>> Please let me know if those look good to you.
>
> This seems to have stalled on the question of what the regression tests
> should look like, which sems like a pretty silly thing to get hung up on
> when everybody agrees the patch itself is OK.  I tried Ashutosh's proposed
> test cases and was pretty unimpressed after noting that they passed
> equally well against patched or unpatched backends.  In any case, as noted
> upthread, we don't really like to expose exact rowcount estimates in test
> cases because of the risk of platform to platform variation.  The more
> usual approach for checking whether the planner is making sane estimates
> is to find a query whose plan shape changes with or without the patch.
> I messed around a bit till I found such a query, and committed it.

Thank you for the original pointer and the commit.  Everything here
seems to make intuitive sense and the accompanying throw-away tests
that I posted above seem to produce sensible results except in some
cases that we discussed, so I think this is progress.  There is still
something pretty funny about the cardinality estimates for TPCH Q21
which I haven't grokked though.  I suspect it is crafted to look for a
technique we don't know (an ancient challenge set by some long retired
database gurus back in 1992 that their RDBMSs know how to solve,
hopefully not in the manner of a certain car manufacturer's air
pollution tests), but I haven't yet obtained enough round tuits to dig
further.  I will, though.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] <> join selectivity estimate question

From: Robert Haas
On Wed, Nov 29, 2017 at 11:55 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Thank you for the original pointer and the commit.  Everything here
> seems to make intuitive sense and the accompanying throw-away tests
> that I posted above seem to produce sensible results except in some
> cases that we discussed, so I think this is progress.  There is still
> something pretty funny about the cardinality estimates for TPCH Q21
> which I haven't grokked though.  I suspect it is crafted to look for a
> technique we don't know (an ancient challenge set by some long retired
> database gurus back in 1992 that their RDBMSs know how to solve,
> hopefully not in the manner of a certain car manufacturer's air
> pollution tests), but I haven't yet obtained enough round tuits to dig
> further.  I will, though.

Hmm, do you have an example of the better but still-funky estimates
handy?  Like an EXPLAIN plan?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] <> join selectivity estimate question

From: Tom Lane
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> So, in that plan we saw the anti-join estimate 1 row when really there
> were 13462.  If you remove most of Q21 and keep just the anti-join
> between l1 and l3, then try removing different quals, you can see that
> the problem is not the <> qual:

>   select count(*)
>     from lineitem l1
>    where not exists (
>         select *
>           from lineitem l3
>          where l3.l_orderkey = l1.l_orderkey
>            and l3.l_suppkey <> l1.l_suppkey
>            and l3.l_receiptdate > l3.l_commitdate
>     )
>   => estimate=1 actual=8998304

ISTM this is basically another variant of ye olde column correlation
problem.  That is, we know there's always going to be an antijoin match
for the l_orderkey equality condition, and that there's always going to
be matches for the l_suppkey inequality, but what we don't know is that
l_suppkey is correlated with l_orderkey so that the two conditions aren't
satisfied at the same time.  The same thing is happening on a smaller
scale with the receiptdate/commitdate comparison.

I wonder whether the extended stats machinery could be brought to bear
on this problem.
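
If so, the raw ingredients exist as of v10 -- for example (just an
illustration; whether the join selectivity code could consume this is
exactly the open question):

  CREATE STATISTICS lineitem_order_supp (dependencies)
      ON l_orderkey, l_suppkey FROM lineitem;
  ANALYZE lineitem;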

            regards, tom lane