Home > mailing lists

[HACKERS] <> join selectivity estimate question - Mailing list pgsql-hackers

From	Thomas Munro
Subject	[HACKERS] <> join selectivity estimate question
Date	March 17, 2017 08:54:46
Msg-id	CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com Whole thread
Responses	Re: [HACKERS] <> join selectivity estimate question
List	pgsql-hackers

Tree view

Hi hackers,

While studying a regression reported[1] against my parallel hash join
patch, I noticed that we can also reach a good and a bad plan in
unpatched master.  One of the causes seems to be the estimated
selectivity of a semi-join with an extra <> filter qual.

Here are some times I measured for TPCH Q21 at scale 10 and work_mem
of 1GB.  That is a query with a large anti-join and a large semi-join.

  8 workers = 8.3s
  7 workers = 8.2s
  6 workers = 8.5s
  5 workers = 8.9s
  4 workers = 9.5s
  3 workers = 39.7s
  2 workers = 36.9s
  1 worker  = 38.2s
  0 workers = 47.9s

Please see the attached query plans showing the change in plan from
Hash Semi Join to Nested Loop Semi Join that happens only once we
reach 4 workers and the (partial) base relation size becomes smaller.
The interesting thing is that row estimate for the semi-join and
anti-join come out as 1 (I think this is 0 clamped to 1).

The same thing can be seen with a simple semi-join, if you happen to
have TPCH loaded.  Compare these two queries:

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
                  FROM lineitem l2
                 WHERE l1.l_orderkey = l2.l_orderkey);

 -> estimates 59986012 rows, actual rows 59,986,052 (scale 10 TPCH)

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
                  FROM lineitem l2
                 WHERE l1.l_orderkey = l2.l_orderkey
                   AND l1.l_suppkey <> l2.l_suppkey);

 -> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)

Or for a standalone example:

  CREATE TABLE foo AS
  SELECT (generate_series(1, 1000000) / 4)::int AS a,
         (generate_series(1, 1000000) % 100)::int AS b;

  ANALYZE foo;

  SELECT *
    FROM foo f1
   WHERE EXISTS (SELECT *
                   FROM foo f2
                  WHERE f1.a = f2.a);

 -> estimates 1,000,000 rows

  SELECT *
    FROM foo f1
   WHERE EXISTS (SELECT *
                   FROM foo f2
                  WHERE f1.a = f2.a
                    AND f1.b <> f2.b);

 -> estimates 1 row

I'm trying to wrap my brain around the selectivity code, but am too
green to grok how this part of the planner that I haven't previously
focused on works so far, and I'd like to understand whether this is
expected behaviour so that I can figure out how to tackle the reported
regression with my patch.  What is happening here?

Thanks for reading.

[1] https://www.postgresql.org/message-id/CAEepm%3D3Og-7-b3WOkiT%3Dc%2B6y3eZ0VVSyb1K%2BSOvF17BO5KAt0A%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

From: Kyotaro HORIGUCHI
Date: 17 March 2017, 08:23:13
Subject: Re: [HACKERS] Protect syscache from bloating with negative cache entries

From: Beena Emerson
Date: 17 March 2017, 09:08:56
Subject: Re: [HACKERS] increasing the default WAL segment size

[HACKERS] <> join selectivity estimate question - Mailing list pgsql-hackers

Attachment

Previous

Next