Tom Lane wrote:
> Maciej Babinski <maciej@apathy.killer-robot.net> writes:
>> Tom Lane wrote:
>>> I see no bug here. AFAICT your "much faster" query gets that way by
>>> having eliminated all the candidate join rows on the B side.
>
>> The additional clause eliminates no rows beyond what the existing
>> clause would. Any row eliminated by "b.join_id IS NOT NULL" could not
>> possibly have satisfied "a.join_id = b.join_id".
>
> Hmm. You assume that the = operator is strict, which is probably true,
> but the hash join code isn't assuming that.
>
Yes, both queries return identical results.
> It might be worth checking for the case, though. What's happening,
> since we go ahead and put the null rows into the hash table, is that
> they all end up in the same hash chain because they all get hash code 0.
> And then that very long chain gets searched for each null outer row.
> If we knew the join operator is strict we could discard null rows
> immediately on both sides.
>
>> Please note that if the join columns are not null, but still produce
>> no matches for the join, the results are fast without the need for an
>> extra clause in the join:
>
> Yeah, because the rows get reasonably well distributed into different
> hash buckets. The optimizer will avoid a hash if it sees the data is
> not well-distributed, but IIRC it's not considering nulls when it
> makes that decision.
You're right. Filling one join column with '1', and the other with '2'
yields a plan without a hash join. Something that may be worth noting is
that while my initial problem case (200,000 rows in more complicated
tables),
as well as the sample test case I put together (10,000 rows in simple
tables)
both yield this result, changing my test case to use 20,000 rows produces
a plan that avoids hash joins as well:
maciej_test=# EXPLAIN ANALYZE SELECT * FROM a JOIN b ON a.join_id =
b.join_id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Merge Join (cost=3435.54..3635.55 rows=1 width=16) (actual
time=100.822..100.822 rows=0 loops=1)
Merge Cond: ("outer".join_id = "inner".join_id)
-> Sort (cost=1717.77..1767.77 rows=20000 width=8) (actual
time=65.401..86.333 rows=20000 loops=1)
Sort Key: a.join_id
-> Seq Scan on a (cost=0.00..289.00 rows=20000 width=8)
(actual time=0.005..21.119 rows=20000 loops=1)
-> Sort (cost=1717.77..1767.77 rows=20000 width=8) (never executed)
Sort Key: b.join_id
-> Seq Scan on b (cost=0.00..289.00 rows=20000 width=8)
(never executed)
Total runtime: 101.891 ms
(9 rows)
Time: 103.051 ms
Thanks for you time,
Maciej Babinski