Re: BUG #2930: Hash join abyssmal with many null fields. - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #2930: Hash join abyssmal with many null fields.
Date
Msg-id 25727.1169832146@sss.pgh.pa.us
Whole thread Raw
In response to BUG #2930: Hash join abyssmal with many null fields.  ("Maciej Babinski" <maciej+postgres@apathy.killer-robot.net>)
Responses Re: BUG #2930: Hash join abyssmal with many null fields.  (Maciej Babinski <maciej@killer-robot.net>)
List pgsql-bugs
Maciej Babinski <maciej@apathy.killer-robot.net> writes:
> Tom Lane wrote:
>> I see no bug here.  AFAICT your "much faster" query gets that way by
>> having eliminated all the candidate join rows on the B side.

> The additional clause eliminates no rows beyond what the existing
> clause would.  Any row eliminated by "b.join_id IS NOT NULL" could not
> possibly have satisfied "a.join_id = b.join_id".

Hmm.  You assume that the = operator is strict, which is probably true,
but the hash join code isn't assuming that.

It might be worth checking for the case, though.  What's happening,
since we go ahead and put the null rows into the hash table, is that
they all end up in the same hash chain because they all get hash code 0.
And then that very long chain gets searched for each null outer row.
If we knew the join operator is strict we could discard null rows
immediately on both sides.

> Please note that if the join columns are not null, but still produce
> no matches for the join, the results are fast without the need for an
> extra clause in the join:

Yeah, because the rows get reasonably well distributed into different
hash buckets.  The optimizer will avoid a hash if it sees the data is
not well-distributed, but IIRC it's not considering nulls when it
makes that decision.

            regards, tom lane

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #2930: Hash join abyssmal with many null fields.
Next
From: "Michael Schmidt"
Date:
Subject: BUG #2931: Can't capture pg_dump Password prompt