Re: Hash Join cost estimates - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Hash Join cost estimates
Msg-id 10502.1365107399@sss.pgh.pa.us
In response to Re: Hash Join cost estimates  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers

Stephen Frost <sfrost@snowman.net> writes:
> Looking with opannotate, there's two main hotspots in
> ExecScanHashBucket:

>  12846 17.4001 :        hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
> and
>  22100 29.9348 :        hashTuple = hashTuple->next;

Those are, of course, pretty trivial statements, so the way I read this
is that the fundamental slowdown comes from the hash table being large
compared to the CPU's cache: you're incurring lots of cache misses at
these particular fetches.  (You might be able to confirm that by setting
oprofile to count cache misses rather than wall-clock time.)

> I'm certainly curious about those, but I'm also very interested in the
> possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> depending on the work_mem setting.

Not sure about that.  That would make the hash-bucket-header array
larger without changing the size of the rest of the hash table, thus
probably making the CPU cache situation worse, not better (which would
manifest as more time spent at the first of these two statements relative
to the second).
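
To put some rough numbers on that, here's a back-of-the-envelope sketch
(this is not the actual ExecChooseHashTableSize() logic, and the tuple
counts/widths are made up; NTUP_PER_BUCKET is 10 in nodeHash.c, if memory
serves):

    #include <stdio.h>

    #define NTUP_PER_BUCKET 10	/* value in nodeHash.c, IIRC */

    /*
     * Shrinking NTUP_PER_BUCKET grows the bucket-header (pointer) array
     * proportionally, while the space used by the tuples themselves is
     * unaffected.
     */
    static void
    show_hash_footprint(double ntuples, int tupsize, int ntup_per_bucket)
    {
    	double		nbuckets = ntuples / ntup_per_bucket;
    	double		header_mb = nbuckets * sizeof(void *) / (1024.0 * 1024.0);
    	double		tuple_mb = ntuples * tupsize / (1024.0 * 1024.0);

    	printf("NTUP_PER_BUCKET=%d: %.0f buckets, header array %.1f MB, tuple data %.1f MB\n",
    		   ntup_per_bucket, nbuckets, header_mb, tuple_mb);
    }

    int
    main(void)
    {
    	/* hypothetical inner relation: 40 million tuples of ~48 bytes */
    	show_hash_footprint(40e6, 48, NTUP_PER_BUCKET);
    	show_hash_footprint(40e6, 48, 1);	/* proposed much-smaller setting */
    	return 0;
    }

The tuple data dwarfs the header array either way, but the header array
is the part you'd be growing by an order of magnitude.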

Can you add some instrumentation code to get us information about the
average/max length of the bucket chains?  It would also be worth figuring
out how many distinct hash values fall into each bucket, which would give
us a clue as to whether your two-level-list idea is worth anything.
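
Something along these lines might be enough to get started -- a rough,
untested sketch meant to drop into nodeHash.c (the function name is made
up), assuming the usual HashJoinTableData/HashJoinTupleData layout from
hashjoin.h (nbuckets, buckets[], and per-tuple next/hashvalue); it could
be called once the inner relation has been loaded:

    static void
    ExecHashDumpChainStats(HashJoinTable hashtable)
    {
    	long		total_tuples = 0;
    	long		total_distinct = 0;
    	int			max_chain = 0;
    	int			used_buckets = 0;
    	int			i;

    	for (i = 0; i < hashtable->nbuckets; i++)
    	{
    		HashJoinTuple tuple;
    		int			chain_len = 0;
    		int			distinct = 0;

    		for (tuple = hashtable->buckets[i]; tuple != NULL; tuple = tuple->next)
    		{
    			HashJoinTuple prev;
    			bool		seen = false;

    			chain_len++;

    			/*
    			 * O(chain^2) rescan for an equal hash value earlier in the
    			 * chain; fine for one-off instrumentation.
    			 */
    			for (prev = hashtable->buckets[i]; prev != tuple; prev = prev->next)
    			{
    				if (prev->hashvalue == tuple->hashvalue)
    				{
    					seen = true;
    					break;
    				}
    			}
    			if (!seen)
    				distinct++;
    		}

    		if (chain_len > 0)
    			used_buckets++;
    		if (chain_len > max_chain)
    			max_chain = chain_len;
    		total_tuples += chain_len;
    		total_distinct += distinct;
    	}

    	elog(LOG, "hashjoin: %d buckets, %d used, %ld tuples, avg chain %.2f, max chain %d, avg distinct hash values per used bucket %.2f",
    		 hashtable->nbuckets, used_buckets, total_tuples,
    		 used_buckets > 0 ? (double) total_tuples / used_buckets : 0.0,
    		 max_chain,
    		 used_buckets > 0 ? (double) total_distinct / used_buckets : 0.0);
    }
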
        regards, tom lane


