Hi hackers,
While studying a regression reported[1] against my parallel hash join
patch, I noticed that we can also reach a good and a bad plan in
unpatched master. One of the causes seems to be the estimated
selectivity of a semi-join with an extra <> filter qual.
Here are some times I measured for TPCH Q21 at scale 10 and work_mem
of 1GB. That is a query with a large anti-join and a large semi-join.
8 workers = 8.3s
7 workers = 8.2s
6 workers = 8.5s
5 workers = 8.9s
4 workers = 9.5s
3 workers = 39.7s
2 workers = 36.9s
1 worker = 38.2s
0 workers = 47.9s
Please see the attached query plans showing the change in plan from
Hash Semi Join to Nested Loop Semi Join that happens only once we
reach 4 workers and the (partial) base relation size becomes smaller.
The interesting thing is that row estimate for the semi-join and
anti-join come out as 1 (I think this is 0 clamped to 1).
The same thing can be seen with a simple semi-join, if you happen to
have TPCH loaded. Compare these two queries:
SELECT *
FROM lineitem l1
WHERE EXISTS (SELECT *
FROM lineitem l2
WHERE l1.l_orderkey = l2.l_orderkey);
-> estimates 59986012 rows, actual rows 59,986,052 (scale 10 TPCH)
SELECT *
FROM lineitem l1
WHERE EXISTS (SELECT *
FROM lineitem l2
WHERE l1.l_orderkey = l2.l_orderkey
AND l1.l_suppkey <> l2.l_suppkey);
-> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)
Or for a standalone example:
CREATE TABLE foo AS
SELECT (generate_series(1, 1000000) / 4)::int AS a,
(generate_series(1, 1000000) % 100)::int AS b;
ANALYZE foo;
SELECT *
FROM foo f1
WHERE EXISTS (SELECT *
FROM foo f2
WHERE f1.a = f2.a);
-> estimates 1,000,000 rows
SELECT *
FROM foo f1
WHERE EXISTS (SELECT *
FROM foo f2
WHERE f1.a = f2.a
AND f1.b <> f2.b);
-> estimates 1 row
I'm trying to wrap my brain around the selectivity code, but am too
green to grok how this part of the planner that I haven't previously
focused on works so far, and I'd like to understand whether this is
expected behaviour so that I can figure out how to tackle the reported
regression with my patch. What is happening here?
Thanks for reading.
[1] https://www.postgresql.org/message-id/CAEepm%3D3Og-7-b3WOkiT%3Dc%2B6y3eZ0VVSyb1K%2BSOvF17BO5KAt0A%40mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers