Thread: Improving worst-case merge join performance with often-null foreign key
Hi, First-time potential contributor here. We recently had an incident due to a sudden 1000x slowdown of a Postgres query (from ~10ms to ~10s) due to a join with a foreign key that was often null. We found that it was caused by a merge join with an index scan on one join path -- whenever the non-null data happened to be such that the merge join couldn't be terminated early, the index would proceed to scan all of the null rows and filter each one out individually. Since this was an inner join, this was pointless; the nulls would never have matched the join clause anyway. Test script to reproduce + example explain output: https://paste.depesz.com/s/VUj Once you're aware of it for a given index, this is a solvable issue. We solved it by adding a partial index. Including an IS NOT NULL clause in the query also seems to solve it. However, I've gotten the notion that it should be possible to solve this on the Postgres side, with no user action required. When doing an inner-join merge join with a normal equality operator, we should already "know" that we don't have to look at rows where the join keys are null, since any such comparison involving NULL will not return true. If we can just push that knowledge down to the index scan node, then we should be able to avoid this entire problem, leaving one less performance trap to trip people up. Proof-of-concept patch (please ignore style): https://github.com/steinarvk-oda/postgres/pull/1/files I have a few questions: (1) Does this approach seem reasonable and worthwhile? (2) Can I determine programmatically at query planning time whether the binary operator in an OpExpr has the property that all comparisons involving nulls will be non-true? Or, failing that, can I at least somehow hardcode and identify the built-in equality operators (which have this property)? (Integer equality and UUID equality probably covers a lot when it comes to joins.) (3) Is there a systematic way to check whether adding an "IS NOT NULL" condition would be redundant? E.g. if there is such a check in the query already, adding another is just messy. (4) Should this sort of thing be made conditional on a high null_frac? Or is that unnecessary complexity, and the simplicity of just always doing it would be better? Thanks! Steinar Kaldager
Steinar Kaldager <steinar.kaldager@oda.com> writes: > First-time potential contributor here. We recently had an incident due > to a sudden 1000x slowdown of a Postgres query (from ~10ms to ~10s) > due to a join with a foreign key that was often null. We found that it > was caused by a merge join with an index scan on one join path -- > whenever the non-null data happened to be such that the merge join > couldn't be terminated early, the index would proceed to scan all of > the null rows and filter each one out individually. Since this was an > inner join, this was pointless; the nulls would never have matched the > join clause anyway. Hmm. I don't entirely understand why the existing stop-at-nulls logic in nodeMergejoin.c didn't fix this for you. Maybe somebody has broken that? See the commentary for MJEvalOuterValues/MJEvalInnerValues. Pushing down an IS NOT NULL restriction could possibly be of value if the join is being done in the nulls-first direction, but that's an extreme minority use-case. I'm dubious that it'd be worth the overhead in general. It'd probably be more useful to make sure that the planner's cost model is aware of this effect, so that it's prodded to use nulls-last not nulls-first sort order when there are enough nulls to make a difference. regards, tom lane
On Sat, Apr 22, 2023 at 11:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Steinar Kaldager <steinar.kaldager@oda.com> writes:
> First-time potential contributor here. We recently had an incident due
> to a sudden 1000x slowdown of a Postgres query (from ~10ms to ~10s)
> due to a join with a foreign key that was often null. We found that it
> was caused by a merge join with an index scan on one join path --
> whenever the non-null data happened to be such that the merge join
> couldn't be terminated early, the index would proceed to scan all of
> the null rows and filter each one out individually. Since this was an
> inner join, this was pointless; the nulls would never have matched the
> join clause anyway.
Hmm. I don't entirely understand why the existing stop-at-nulls logic
in nodeMergejoin.c didn't fix this for you. Maybe somebody has broken
that? See the commentary for MJEvalOuterValues/MJEvalInnerValues.
I think it's just because the MergeJoin didn't see a NULL foo_id value
from test_bar tuples because all such tuples are removed by the filter
'test_bar.active', thus it does not have a chance to stop at nulls.
# select count(*) from test_bar where foo_id is null and active;
count
-------
0
(1 row)
Instead, the index scan on test_bar will have to scan all the tuples
with NULL foo_id because none of them satisfies the qual clause.
Thanks
Richard
from test_bar tuples because all such tuples are removed by the filter
'test_bar.active', thus it does not have a chance to stop at nulls.
# select count(*) from test_bar where foo_id is null and active;
count
-------
0
(1 row)
Instead, the index scan on test_bar will have to scan all the tuples
with NULL foo_id because none of them satisfies the qual clause.
Thanks
Richard
On Sun, Apr 23, 2023 at 5:29 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Sat, Apr 22, 2023 at 11:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:Steinar Kaldager <steinar.kaldager@oda.com> writes:
> First-time potential contributor here. We recently had an incident due
> to a sudden 1000x slowdown of a Postgres query (from ~10ms to ~10s)
> due to a join with a foreign key that was often null. We found that it
> was caused by a merge join with an index scan on one join path --
> whenever the non-null data happened to be such that the merge join
> couldn't be terminated early, the index would proceed to scan all of
> the null rows and filter each one out individually. Since this was an
> inner join, this was pointless; the nulls would never have matched the
> join clause anyway.
Hmm. I don't entirely understand why the existing stop-at-nulls logic
in nodeMergejoin.c didn't fix this for you. Maybe somebody has broken
that? See the commentary for MJEvalOuterValues/MJEvalInnerValues.I think it's just because the MergeJoin didn't see a NULL foo_id value
from test_bar tuples because all such tuples are removed by the filter
'test_bar.active', thus it does not have a chance to stop at nulls.
# select count(*) from test_bar where foo_id is null and active;
count
-------
0
(1 row)
Instead, the index scan on test_bar will have to scan all the tuples
with NULL foo_id because none of them satisfies the qual clause.
BTW, in Steinar's case the query runs much faster with nestloop with a
parameterized inner path, since test_foo is small and test_bar is very
large, and there is an index on test_bar.foo_id. You can see that by
"set enable_mergejoin to off":
# EXPLAIN (costs off)
SELECT SUM(test_foo.id) FROM test_bar, test_foo WHERE test_bar.foo_id = test_foo.id AND test_foo.active AND test_bar.active;
QUERY PLAN
--------------------------------------------------------------
Aggregate
-> Nested Loop
-> Seq Scan on test_foo
Filter: active
-> Index Scan using test_bar_foo_id_idx on test_bar
Index Cond: (foo_id = test_foo.id)
Filter: active
(7 rows)
In my box the total cost and execution time of mergejoin vs nestloop
are:
mergejoin nestloop
Cost estimate 1458.40 12355.15
Actual (best of 3) 3644.685 ms 13.114 ms
So it seems we have holes in cost estimate for mergejoin or nestloop
with parameterized inner path, or both.
Thanks
Richard
parameterized inner path, since test_foo is small and test_bar is very
large, and there is an index on test_bar.foo_id. You can see that by
"set enable_mergejoin to off":
# EXPLAIN (costs off)
SELECT SUM(test_foo.id) FROM test_bar, test_foo WHERE test_bar.foo_id = test_foo.id AND test_foo.active AND test_bar.active;
QUERY PLAN
--------------------------------------------------------------
Aggregate
-> Nested Loop
-> Seq Scan on test_foo
Filter: active
-> Index Scan using test_bar_foo_id_idx on test_bar
Index Cond: (foo_id = test_foo.id)
Filter: active
(7 rows)
In my box the total cost and execution time of mergejoin vs nestloop
are:
mergejoin nestloop
Cost estimate 1458.40 12355.15
Actual (best of 3) 3644.685 ms 13.114 ms
So it seems we have holes in cost estimate for mergejoin or nestloop
with parameterized inner path, or both.
Thanks
Richard
Re: Improving worst-case merge join performance with often-null foreign key
From
Steinar Kaldager
Date:
On Sun, Apr 23, 2023 at 11:30 AM Richard Guo <guofenglinux@gmail.com> wrote: > On Sat, Apr 22, 2023 at 11:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm. I don't entirely understand why the existing stop-at-nulls logic >> in nodeMergejoin.c didn't fix this for you. Maybe somebody has broken >> that? See the commentary for MJEvalOuterValues/MJEvalInnerValues. > > > I think it's just because the MergeJoin didn't see a NULL foo_id value > from test_bar tuples because all such tuples are removed by the filter > 'test_bar.active', thus it does not have a chance to stop at nulls. Agreed, this is also my understanding. Note that this isn't just a contrived test case, it's also the situation we ran into in prod. (We had a table with a lot of old inactive rows with null values for Historical Reasons, essentially kept for accounting/archival purposes. Newer, active, rows all had the foreign key column set to non-null.) I had initially considered whether this could be fixed in the merge-join execution code instead of by altering the plan, but at first glance that feels like it might be a more awkward fit. It's easy enough to stop the merge join if a null actually appears, but because of the filter, no null will ever appear. You'd have to somehow break the "stream of values" abstraction and look at where the values are actually coming from and/or which values would have appeared if they weren't filtered out. I don't know the codebase well, but to me that feels fairly hacky compared to altering the plan for the index scan. Steinar