[HACKERS] <> join selectivity estimate question - Mailing list pgsql-hackers

From Thomas Munro
Subject [HACKERS] <> join selectivity estimate question
Date
Msg-id CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com
Whole thread Raw
Responses Re: [HACKERS] <> join selectivity estimate question  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi hackers,

While studying a regression reported[1] against my parallel hash join
patch, I noticed that we can also reach a good and a bad plan in
unpatched master.  One of the causes seems to be the estimated
selectivity of a semi-join with an extra <> filter qual.

Here are some times I measured for TPCH Q21 at scale 10 and work_mem
of 1GB.  That is a query with a large anti-join and a large semi-join.

  8 workers = 8.3s
  7 workers = 8.2s
  6 workers = 8.5s
  5 workers = 8.9s
  4 workers = 9.5s
  3 workers = 39.7s
  2 workers = 36.9s
  1 worker  = 38.2s
  0 workers = 47.9s

Please see the attached query plans showing the change in plan from
Hash Semi Join to Nested Loop Semi Join that happens only once we
reach 4 workers and the (partial) base relation size becomes smaller.
The interesting thing is that row estimate for the semi-join and
anti-join come out as 1 (I think this is 0 clamped to 1).

The same thing can be seen with a simple semi-join, if you happen to
have TPCH loaded.  Compare these two queries:

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
                  FROM lineitem l2
                 WHERE l1.l_orderkey = l2.l_orderkey);

 -> estimates 59986012 rows, actual rows 59,986,052 (scale 10 TPCH)

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
                  FROM lineitem l2
                 WHERE l1.l_orderkey = l2.l_orderkey
                   AND l1.l_suppkey <> l2.l_suppkey);

 -> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)

Or for a standalone example:

  CREATE TABLE foo AS
  SELECT (generate_series(1, 1000000) / 4)::int AS a,
         (generate_series(1, 1000000) % 100)::int AS b;

  ANALYZE foo;

  SELECT *
    FROM foo f1
   WHERE EXISTS (SELECT *
                   FROM foo f2
                  WHERE f1.a = f2.a);

 -> estimates 1,000,000 rows

  SELECT *
    FROM foo f1
   WHERE EXISTS (SELECT *
                   FROM foo f2
                  WHERE f1.a = f2.a
                    AND f1.b <> f2.b);

 -> estimates 1 row

I'm trying to wrap my brain around the selectivity code, but am too
green to grok how this part of the planner that I haven't previously
focused on works so far, and I'd like to understand whether this is
expected behaviour so that I can figure out how to tackle the reported
regression with my patch.  What is happening here?

Thanks for reading.

[1] https://www.postgresql.org/message-id/CAEepm%3D3Og-7-b3WOkiT%3Dc%2B6y3eZ0VVSyb1K%2BSOvF17BO5KAt0A%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: Kyotaro HORIGUCHI
Date:
Subject: Re: [HACKERS] Protect syscache from bloating with negative cache entries
Next
From: Beena Emerson
Date:
Subject: Re: [HACKERS] increasing the default WAL segment size