Re: Hash Join cost estimates - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Hash Join cost estimates
Msg-id 20130413030619.GN4361@tamriel.snowman.net
In response to Re: Hash Join cost estimates  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I'm certainly curious about those, but I'm also very interested in the
> > possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> > depending on the work_mem setting.
>
> Not sure about that.  That would make the hash-bucket-header array
> larger without changing the size of the rest of the hash table, thus
> probably making the CPU cache situation worse not better (which would
> manifest as more time at the first of these two statements relative to
> the second).

In the testing that I've done, I've yet to find a case where a smaller
NTUP_PER_BUCKET makes things worse, and it can yield a dramatic
improvement.  Regarding the memory situation, I'm not sure that the
longer bucket chains really help, though I'm no CPU architect.
However, assuming that the CPU is smart enough to only pull in the
bits of memory that it needs, I'm not sure how having to pull in parts
of a large array of pointers is worse than having to fetch randomly
placed entries in a bucket, especially since we have to step through
an entire bucket each time and the bucket tuples are allocated
independently, while the hash array is allocated once.  Indeed, a
given pull into cache from the hash table array seems more likely to
bring along other entries we need than a pull of a bucket tuple, which
might well include unrelated garbage.
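
To make the comparison concrete, here's a rough sketch of the layout I
have in mind (the type and field names are simplified stand-ins, not
the actual nodeHash.c definitions): the bucket-header array is a single
contiguous allocation of pointers, while the tuples hanging off each
bucket are separately allocated and only reachable by chasing
next-pointers.

#include <stddef.h>
#include <stdint.h>

/* Simplified sketch of the single-level layout; illustrative names only. */
typedef struct HashTuple
{
    struct HashTuple *next;         /* chain within the bucket; each tuple
                                     * is a separate allocation */
    uint32_t          hashvalue;
    /* ... tuple data follows ... */
} HashTuple;

typedef struct HashTable
{
    int         nbuckets;           /* power of two in this sketch */
    HashTuple **buckets;            /* one contiguous array of pointers */
} HashTable;

/* A probe is one predictable array access followed by a pointer chase
 * whose expected length is what NTUP_PER_BUCKET effectively controls. */
static HashTuple *
probe(HashTable *ht, uint32_t hashvalue)
{
    HashTuple *tup;

    for (tup = ht->buckets[hashvalue & (ht->nbuckets - 1)];
         tup != NULL;
         tup = tup->next)
    {
        if (tup->hashvalue == hashvalue)
            return tup;             /* caller still checks the join quals */
    }
    return NULL;
}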

> Can you add some instrumentation code to get us information about the
> average/max length of the bucket chains?  And maybe try to figure out
> how many distinct hash values per bucket, which would give us a clue
> whether your two-level-list idea is worth anything.

I ended up implementing the two-level system and doing some testing with
it.  It ends up making hash table building take quite a bit longer and
only improves the scan performance in very select cases.  The
improvement requires an individual bucket to have both lots of dups and
a lot of distinct values, because it only helps when it can skip a
significant number of tuples.  If we move to a system where there's
rarely more than one distinct value in a bucket, then the chances of a
serious improvement from the two-level system go down that much more.
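
For reference, the shape I was experimenting with was roughly the
following (again, the names here are simplified stand-ins rather than
what I actually tested): each bucket holds a chain of distinct hash
values, and each of those holds the chain of duplicates, so a probe
can skip or return an entire dup chain with a single comparison.

#include <stddef.h>
#include <stdint.h>

/* Rough sketch of the two-level bucket layout; illustrative names only. */
typedef struct DupTuple
{
    struct DupTuple *next;          /* tuples sharing one hash value */
    /* ... tuple data follows ... */
} DupTuple;

typedef struct HashValueEntry
{
    struct HashValueEntry *next;    /* next distinct hash value in bucket */
    uint32_t               hashvalue;
    DupTuple              *tuples;  /* all duplicates for this value */
} HashValueEntry;

/* A probe walks the list of distinct values and skips each non-matching
 * dup chain with a single comparison.  That only pays off when a bucket
 * has both many distinct values and many dups per value, which is
 * exactly what goes away once buckets rarely hold more than one
 * distinct value. */
static DupTuple *
probe_two_level(HashValueEntry *bucket, uint32_t hashvalue)
{
    HashValueEntry *entry;

    for (entry = bucket; entry != NULL; entry = entry->next)
    {
        if (entry->hashvalue == hashvalue)
            return entry->tuples;   /* whole dup chain for the caller */
    }
    return NULL;
}

The added build-time cost comes from having to search the bucket for an
existing entry for the tuple's hash value (or allocate a new one) on
every insert, instead of just pushing the tuple onto the front of the
bucket's chain.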

As such, I've come up with this (trivial) patch which simply modifies
ExecChooseHashTableSize() to ignore NTUP_PER_BUCKET (essentially
treating it as 1) when work_mem is large enough to fit the entire hash
table (which also implies that there is only one batch).  I'd love to
hear feedback from others on what this does under different conditions.
This also makes the "hash-the-small-table" case faster for the test case
which I provided earlier (http://snowman.net/~sfrost/test_case2.sql),
and it uses quite a bit less memory too.  Now that I've convinced myself
that the two-level system isn't practical and the "hash-the-small-table"
case is actually faster than "hash-the-big-table", I'll start looking
at improving on your proposal to change the costing to favor the smaller
table being hashed more often.
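
For anyone who'd rather not open the patch, here's the shape of the
change in C (a paraphrase of the logic rather than the patch text
itself; the function and parameter names here are illustrative):

#include <math.h>
#include <stdbool.h>

#define NTUP_PER_BUCKET 10          /* the existing constant */

/* Sketch of how the bucket count ends up being chosen in
 * ExecChooseHashTableSize() with the patch applied. */
static int
choose_nbuckets(double ntuples, bool fits_in_work_mem, double max_pointers)
{
    double  dbuckets;

    if (fits_in_work_mem)
        dbuckets = ceil(ntuples);                   /* one tuple per bucket */
    else
        dbuckets = ceil(ntuples / NTUP_PER_BUCKET); /* multi-batch: unchanged */

    if (dbuckets > max_pointers)                    /* keep the pointer array
                                                     * itself within work_mem */
        dbuckets = max_pointers;

    return (int) dbuckets;
}

The multi-batch path is left alone; the change only kicks in when the
whole table fits in work_mem and there's just one batch.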

    Thanks,

        Stephen

