Thread: Hash Join cost estimates

Hash Join cost estimates

From
Stephen Frost
Date:
All,
 Marty's issue w/ NestLoop reminded me that I'd put together a test case which illustrates a HashJoin doing the wrong
thing.
 The test case is here:
 http://snowman.net/~sfrost/test_case.sql
 Simply load that up and then check out this plan:
 explain select * from small_table join big_table using (id_short);
 We end up deciding to build a hash table of 4M records and then seq scan the table that only has 41K rows.  Much of
the reason for this is that the statistics point out, correctly, that the column we're using from the small table to
join isn't unique and therefore the buckets will be deeper- but it's not *nearly* as bad as we estimate.
 The bucket estimate for the small table comes back as 26, while the reality is that we only look through ~5.5 entries
per bucket, with the longest run being 21.  With the big table being hashed, the bucket estimate is 4 (though this
seems to vary more than I would expect- sometimes 4, sometimes 8, sometimes 2..), while the average number scanned
through is actually ~3.5 and the longest scan ends up being 20.
 Without the bucket question, we end up with pretty reasonable results (directly out of initial_cost_hashjoin):
 41K hashed, seqscan 4M: startup_cost = 1229.46 run_cost = 72307.39
 4M hashed, 41K seqscan: startup_cost = 115030.10 run_cost = 817.70
 When we get through dealing with the bucketing question in final_cost_hashjoin (costsize.c:2673), we have some pretty
gross results for the 'good' plan's run_cost:
 run_cost = 72307.39 + 138848.81 = 211156.20
 While the 'bad' plan's run_cost is only bumped a tiny bit:
 run_cost = 817.7 + 411.76 = 1229.46
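 For reference, those two bumps come from (roughly) the bucket-scan term in final_cost_hashjoin; here's a small
standalone sketch of that term using the row counts and bucket estimates above.  The 0.0025 is just the default
cpu_operator_cost, so treat this as an illustration of the formula rather than the actual planner code:

#include <stdio.h>

/*
 * Standalone model of the bucket-scan charge added in final_cost_hashjoin().
 * Row counts and bucket estimates are the ones quoted above; the 0.5 reflects
 * "on average we scan about half the bucket".
 */
int main(void)
{
    double cpu_operator_cost = 0.0025;  /* default; stands in for hash_qual_cost.per_tuple */

    /* hash the 41K-row table, probe it with the 4.27M-row table */
    double good_bump = cpu_operator_cost * 4272271 * 26 * 0.5;     /* 26 ~= 41176 * 0.00064, rounded */

    /* hash the 4.27M-row table, probe it with the 41K-row table */
    double bad_bump = cpu_operator_cost * 41176 * 8 * 0.5;         /* 8 = estimated entries per bucket */

    printf("probing the 41K-row hash table: %.2f\n", good_bump);   /* ~138848.81 */
    printf("probing the 4M-row hash table:  %.2f\n", bad_bump);    /* 411.76 */
    return 0;
}
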
 Resulting in total costs that look all kinds of wacky:
 41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
 4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66
 Or the 'good' plan being costed at *nearly double*.  Now, my laptop might not be the 'best' system CPU-wise, but
there's a pretty massive time difference between these plans:
 41K hashed, seqscan 4M: 2252.007 ms http://explain.depesz.com/s/FEq
 4M hashed, seqscan 41K: 2708.471 ms http://explain.depesz.com/s/FOU
 That's half a second and a good 15+% difference.
 Now, back to the bucket estimation- the ndistinct for the small table is -0.475058 (or 19561 tuples), which is about
right.  There are 23 distinct values, 18,795 duplicated, and another 841 with dup counts ranging from 3 to 10.  This
leads to an avgfreq of 0.000051; unfortunately, we're going for only 8192 buckets, so this gets moved up to 0.00012,
and then the adjustment for the MCV (which is 0.00027) bumps this all the way up to 0.00064, leading to our bucket
depth estimate of 26 (the bucket size estimate multiplied by the inner rows) and the resulting cost dominating the
overall costing.
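 To make that arithmetic concrete, here's a standalone sketch of the calculation, loosely following
estimate_hash_bucketsize() in selfuncs.c with the figures quoted above (the real code's clamping and null handling
are omitted):

#include <stdio.h>

/* Rough standalone model of the bucket-size estimate described above. */
int main(void)
{
    double inner_rows = 41176;
    double ndistinct  = 19561;          /* 0.475058 * 41176 */
    double nbuckets   = 8192;
    double mcv_freq   = 0.00027;        /* frequency of the most common value */

    double avgfreq  = 1.0 / ndistinct;  /* ~0.000051 */
    double estfract = 1.0 / nbuckets;   /* ndistinct > nbuckets, so ~0.00012 */

    /* scale up by how much the MCV exceeds the average frequency */
    if (mcv_freq > avgfreq)
        estfract *= mcv_freq / avgfreq; /* ~0.00064 */

    /* ~26.5, which the planner then rounds to the 26 quoted above */
    printf("estimated entries per bucket: %.1f\n", estfract * inner_rows);
    return 0;
}
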
 If we consider the bucketing wrong and look at what the costs would have been with the actual number of average scans
(and therefore excluding the 0.5 'adjustment'), we get:
 seqscan 41K cost: 360.28 (cpu_op_cost * 41K * 3.5)
 seqscan 4M cost: 58743.73 (cpu_op_cost * 4M * 5.5)
 which isn't exactly going in the 'right' direction for us.  Now, I'm sure that there's a cost to having to consider
more buckets, but it seems to be far less, in comparison to the hash table creation cost, than what our model would
suggest.  In the end, I think the problem here is that we are charging far too much for these bucket costs (particularly
when we're getting them so far wrong) and not nearly enough for the cost of building the hash table in the first place.
 Thoughts?  Ideas about how we can 'fix' this?  Have others run into similar issues?
     Thanks,
    Stephen

Re: Hash Join cost estimates

From
Jeff Davis
Date:
On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
>   41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
>   4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66

I think those are backwards -- typo?

>   In the end, I think the problem here is that we are charging far too
>   much for these bucket costs (particularly when we're getting them so
>   far wrong) and not nearly enough for the cost of building the hash
>   table in the first place.
> 
>   Thoughts?  Ideas about how we can 'fix' this?  Have others run into
>   similar issues?

Yes, I have run into this issue (or something very similar). I don't
understand why the bucketsize even matters much -- assuming few hash
collisions, we are not actually evaluating the quals any more times than
necessary. So why all of the hashjoin-specific logic in determining the
number of qual evaluations? The only reason I can think of is to model
the cost of comparing the hashes themselves.

Also, searching the archives turns up at least one other, but I think
I've seen more:

http://www.postgresql.org/message-id/A82128A6-4E3B-43BD-858D-21B129F7BEEB@richrelevance.com

Regards,
Jeff Davis




Re: Hash Join cost estimates

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> Yes, I have run into this issue (or something very similar). I don't
> understand why the bucketsize even matters much -- assuming few hash
> collisions, we are not actually evaluating the quals any more times than
> necessary. So why all of the hashjoin-specific logic in determining the
> number of qual evaluations? The only reason I can think of is to model
> the cost of comparing the hashes themselves.

I think the point is that there may *not* be few hash collisions ...
        regards, tom lane



Re: Hash Join cost estimates

From
Jeff Davis
Date:
On Fri, 2013-03-29 at 16:37 -0400, Tom Lane wrote: 
> Jeff Davis <pgsql@j-davis.com> writes:
> > Yes, I have run into this issue (or something very similar). I don't
> > understand why the bucketsize even matters much -- assuming few hash
> > collisions, we are not actually evaluating the quals any more times than
> > necessary. So why all of the hashjoin-specific logic in determining the
> > number of qual evaluations? The only reason I can think of is to model
> > the cost of comparing the hashes themselves.
> 
> I think the point is that there may *not* be few hash collisions ...

In Stephen's case the table was only 41K rows, so something still seems off.
Maybe we should model the likelihood of a collision based on the
cardinalities (assuming a reasonably good hash function)?

Also, I think I found an important assumption that seems dubious (in
comment for estimate_hash_bucketsize()):

"If the other relation in the join has a key distribution similar to
this one's, then the most-loaded buckets are exactly those that will be
probed most often.  Therefore, the "average" bucket size for costing
purposes should really be taken as something close to the "worst case"
bucket size.  We try to estimate this by adjusting the fraction if there
are too few distinct data values, and then scaling up by the ratio of
the most common value's frequency to the average frequency."

But the key distribution is not necessarily similar at all... the large
table might have many more distinct values.

Stephen, do you think this could explain your problem?

Regards,
Jeff Davis






Re: Hash Join cost estimates

From
Stephen Frost
Date:
Jeff,

* Jeff Davis (pgsql@j-davis.com) wrote:
> On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
> >   41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
> >   4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66
>
> I think those are backwards -- typo?

Yes, sorry, those are backwards.  The '4M hashed, seqscan 41K' entry
comes out with the lower cost and that's what we end up using.
Thanks,
    Stephen

Re: Hash Join cost estimates

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I think the point is that there may *not* be few hash collisions ...

Right, but that's actually all entirely based on concerns over there
being duplicates (hence why we consider the MCVs and ndistinct), which
makes *some* sense given that we currently have a single linked-list in
each bucket into which any dups are placed.

It occurs to me that we could get away from this by having a 2-level
system.  We hash to a bucket which contains a linked list of linked
lists.  The bucket list would be for actual collisions (which we don't
consider at all in the current bucket estimating) and each of those
entries would be a linked list of duplicates for that entry in the
bucket.  This would add some small cost for building the hash, since we
would have to step through each entry in the bucket to determine if the
entry being added is a new entry or not, but not very much unless we're
worried about lots of collisions happening (which I certainly hope isn't
the case).  Hash tables are generally expected to take more effort to
build initially anyway, so I don't see a problem putting more logic
there.  Also, we could skip checking each entry in the bucket when the
input is known to be unique and instead just skip to the end of the
bucket since the new entry can't match any existing.
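
Concretely, the bucket layout I'm picturing is something like this (purely
illustrative- these struct and field names don't exist anywhere in the tree):

/* Hypothetical two-level bucket layout (illustration only). */
typedef struct DupTuple
{
    struct DupTuple *next;      /* next tuple with the same join key */
    /* ... the tuple data itself would follow ... */
} DupTuple;

typedef struct BucketEntry
{
    struct BucketEntry *next;   /* next distinct value that hashed into this
                                 * bucket (a true hash collision) */
    unsigned int hashvalue;     /* stored hash code, for cheap skipping */
    DupTuple    *dups;          /* chain of duplicates sharing this key */
} BucketEntry;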

We could still work through the bucketing logic and add some costing to
that case for those situations where we are hashing on only one key of a
multi-key join and we expect a lot of duplicates to exist.  I'm not sure
how much that happens though- I would hope that we would use a composite
hash key most of the time that we have multi-key joins that use hash
tables.

Thoughts?
Thanks,
    Stephen

Re: Hash Join cost estimates

From
Stephen Frost
Date:
* Jeff Davis (pgsql@j-davis.com) wrote:
> In Stephen's case the table was only 41K rows, so something still seems off.
> Maybe we should model the likelihood of a collision based on the
> cardinalities (assuming a reasonably good hash function)?

It's not really 'hash collisions' that we're trying to be wary of, per
se, it's the risk of duplicates.  To push this very far in the other
direction- if you have 41k of the *same value* in the small table, then
it's currently faster to build the hash table on the large table and
then seq scan the small table (10s vs. 14s on my laptop running w/o
being plugged in, so it's a bit slower than usual).

Now, that's a pretty ridiculous case, but it seems like we're being
pretty dumb here- for every input row from the outer table, we're
looking through all 41K *duplicate* keys in that one hash bucket.  This
goes back to the suggestion I just made- if the hash bucket list
contained only unique values (which are a result of actual hash
collisions), then we'd only be doing *one* comparison for each input row
of the outer table that doesn't match- and when we found one which
*did*, we would only need to step through the dup list for that one
entry and blast back all of those rows, forgetting about the rest of the
bucket which we know can't match.
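
In rough code terms, the probe side would look something like this (again
purely illustrative, repeating the hypothetical layout from my other message
and using an integer as a stand-in for the real join key):

#include <stddef.h>

typedef struct DupTuple { struct DupTuple *next; /* tuple data follows */ } DupTuple;

typedef struct BucketEntry
{
    struct BucketEntry *next;   /* next distinct value in this bucket */
    unsigned int hashvalue;     /* stored hash code */
    unsigned int key;           /* stand-in for the real join key */
    DupTuple    *dups;          /* every tuple that shares this key */
} BucketEntry;

/* At most one real key comparison per distinct value; on a match, the caller
 * just walks the dup chain and ignores the rest of the bucket. */
static DupTuple *
probe_bucket(BucketEntry *bucket, unsigned int hashvalue, unsigned int outerkey)
{
    BucketEntry *e;

    for (e = bucket; e != NULL; e = e->next)
    {
        if (e->hashvalue != hashvalue)
            continue;               /* true collision- skip cheaply */
        if (e->key == outerkey)
            return e->dups;         /* emit every duplicate for this key */
    }
    return NULL;                    /* no match in this bucket */
}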

> Also, I think I found an important assumption that seems dubious (in
> comment for estimate_hash_bucketsize()):

I've wondered about that also.  It certainly seems quite bogus that we
can end up with an 'estimated # of entries in a bucket' that's larger
than the # of entries we've found for the MCV in the table, especially
*double* that.

> Stephen, do you think this could explain your problem?

 As I tried to articulate in my initial email- even if we had a *perfect*
answer to "how many comparisons will we need to do", the current costing
would cause us to pick the plan that, intuitively and empirically, takes
longer (hash the 4M row table), because that cost is multiplied by the
number of outer rows and the cpu_tuple_cost (charged to build the hash
table) isn't high enough relative to the cpu_op_cost (charged to do the
comparisons in the buckets).
Thanks,
    Stephen

Re: Hash Join cost estimates

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Jeff Davis (pgsql@j-davis.com) wrote:
>> In Stephen's case the table was only 41K rows, so something still seems off.
>> Maybe we should model the likelihood of a collision based on the
>> cardinalities (assuming a reasonably good hash function)?

> It's not really 'hash collisions' that we're trying to be wary of, per
> se, it's the risk of duplicates.

I spent some time looking at this.  I think the real issue is that the
code is trying to charge something close to cpu_operator_cost for each
time an outer tuple is compared to a hashtable entry.  That was actually
reasonable back when the costing code was last touched --- the current
scheme of storing and comparing the hash codes before testing the
equality operator proper only dates to commit
849074f9ae422c64501bb1d53ef840de870bf65c.  I punted on changing the cost
estimates at that time, and I think what this example is showing is that
that was a bad decision.  Really, when we're traipsing down a bucket
list, skipping over bucket entries with the wrong hash code is just
about free, or at least it's a whole lot cheaper than applying ExecQual.

Perhaps what we should do is charge the hash_qual_cost only for some
small multiple of the number of tuples that we expect will *pass* the
hash quals, which is a number we have to compute anyway.  The multiple
would represent the rate of hash-code collisions we expect.

I'd still be inclined to charge something per bucket entry, but it
should be really small, perhaps on the order of 0.01 times
cpu_operator_cost.
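
Back-of-the-envelope, using the numbers from Stephen's example (a standalone
sketch only; the 1.01 collision factor and the 0.01 multiplier are
placeholders, not proposals for the actual constants):

#include <stdio.h>

/* Sketch of the two proposed charges, with made-up constants. */
int main(void)
{
    double cpu_operator_cost = 0.0025;
    double hash_qual_cost    = 0.0025;      /* per-tuple cost of the real equality qual */

    double outer_rows        = 4272271;     /* probe side */
    double passing_tuples    = 41176;       /* planner's estimate of rows passing the hash quals */
    double bucket_entries    = 26;          /* estimated entries per bucket */
    double collision_factor  = 1.01;        /* placeholder hash-code collision rate */

    /* charge the equality operator only where we expect it to actually run ... */
    double qual_charge  = hash_qual_cost * passing_tuples * collision_factor;

    /* ... plus something very small per bucket entry that merely gets skipped */
    double chain_charge = 0.01 * cpu_operator_cost * outer_rows * bucket_entries * 0.5;

    printf("qual charge:  %.2f\n", qual_charge);    /* ~103.97 */
    printf("chain charge: %.2f\n", chain_charge);   /* ~1388.49 */
    return 0;
}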

Or we could just drop that term entirely.  It strikes me that the reason
to be worried about skewed distribution in the inner relation is not
really that it changes the run time much, but rather that it risks
blowing out work_mem if specific virtual buckets have too many members
(or at best, we're forced to increase the number of batches more than we
thought to stay under work_mem; which carries runtime costs of its own).
Maybe what we should be doing with the bucketsize numbers is estimating
peak memory consumption to gate whether we'll accept the plan at all,
rather than adding terms to the cost estimate.
        regards, tom lane



Re: Hash Join cost estimates

From
Jeff Davis
Date:
On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
> Really, when we're traipsing down a bucket
> list, skipping over bucket entries with the wrong hash code is just
> about free, or at least it's a whole lot cheaper than applying ExecQual.
> 
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway.  The multiple
> would represent the rate of hash-code collisions we expect.

+1.

> I'd still be inclined to charge something per bucket entry, but it
> should be really small, perhaps on the order of 0.01 times
> cpu_operator_cost.

> Or we could just drop that term entirely. 

FWIW, either of those are fine with me based on my limited experience.

> Maybe what we should be doing with the bucketsize numbers is estimating
> peak memory consumption to gate whether we'll accept the plan at all,
> rather than adding terms to the cost estimate.

Sounds reasonable.

Ideally, we'd have a way to continue executing even in that case; but
that's a project by itself, and would make it even more difficult to
cost accurately.

Regards,
Jeff Davis




Re: Hash Join cost estimates

From
Robert Haas
Date:
On Mon, Apr 1, 2013 at 2:35 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
>> Really, when we're traipsing down a bucket
>> list, skipping over bucket entries with the wrong hash code is just
>> about free, or at least it's a whole lot cheaper than applying ExecQual.
>>
>> Perhaps what we should do is charge the hash_qual_cost only for some
>> small multiple of the number of tuples that we expect will *pass* the
>> hash quals, which is a number we have to compute anyway.  The multiple
>> would represent the rate of hash-code collisions we expect.
>
> +1.
>
>> I'd still be inclined to charge something per bucket entry, but it
>> should be really small, perhaps on the order of 0.01 times
>> cpu_operator_cost.
>
>> Or we could just drop that term entirely.
>
> FWIW, either of those are fine with me based on my limited experience.

FWIW, I have also seen this problem and the proposed fixes sound
reasonable to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Hash Join cost estimates

From
Tom Lane
Date:
I wrote:
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway.  The multiple
> would represent the rate of hash-code collisions we expect.

I tried the attached quick-hack patch on Stephen's example.  With
work_mem set to 16MB I get these results:

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1229.46..74154.49 rows=41176 width=24) (actual time=47.723..1845.869 rows=13731 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.045..506.212 rows=4272271 loops=1)
   ->  Hash  (cost=714.76..714.76 rows=41176 width=24) (actual time=24.944..24.944 rows=41176 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2574kB
         ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.007..11.608 rows=41176 loops=1)
 Total runtime: 1847.697 ms
(7 rows)

Forcing the other plan to be chosen, I get

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=131719.10..150327.44 rows=41176 width=24) (actual time=1922.942..2810.095 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.012..10.058 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual time=1921.962..1921.962 rows=4272271 loops=1)
         Buckets: 65536  Batches: 16  Memory Usage: 9412kB
         ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.043..702.898 rows=4272271 loops=1)
 Total runtime: 2820.633 ms
(7 rows)

So that's at least going in the right direction.

I have not thought about how the calculation should be adjusted in the
semi/anti join case, nor about how we ought to repurpose the
bucket-size-variance calculations for checking whether work_mem will be
exceeded.  So this is a long way from being committable, but it seems
promising.

            regards, tom lane


Attachment

Re: Hash Join cost estimates

From
Stephen Frost
Date:
Tom, all,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> So that's at least going in the right direction.

I agree that this is going in the right direction; it certainly would
make the plan that I *expect* more likely to be chosen, however..

I've been fiddling with this on the very much larger overall database
where this test case came from and have found that hashing the large
table can actually be *faster* and appears to cause a more consistent
and constant amount of disk i/o (which is good).

The test case exhibits a bit of why this is the case- the per-tuple cost
of a hash lookup is way closer to the per-tuple cost of building the hash table
than I'd expect.

per-tuple cost to build the hash table (4.2M tuples): 0.33us
per-tuple cost to scan/do hash lookups (4.2M tuples): 0.29us (with a small hash table of only 41K entries)

The total difference is 1233.854ms vs. 1428.424ms, or only 194.570ms in
favor of scanning the big table instead of hashing it.

These numbers are just from those posted with my original email:

http://explain.depesz.com/s/FEq
http://explain.depesz.com/s/FOU

I've seen much worse though- I have one case where hash-the-big-table
took 5s and hash-the-small-table took almost 10s (total times).  I'm
trying to see if I can pull that out and isolate how it's different (and
see if it was just due to other load on the box).

What I'm trying to get at in this overall email is: why in the world is
it so expensive to do hash lookups?  I would have expected the cost of
building the hash table to be *much* more than the cost to do a hash lookup, and
that doing hash lookups against a small hash table would be fast enough
to put serious pressure on the i/o subsystem.  Instead, the building of
the hash table actually puts more pressure and can end up being more
efficient overall.  We have a process that basically does this a whole
bunch and the "hash-the-big-table" operation takes about 4.7 hrs, while
the "hash-the-small-table" approach went well past 5 hours and was only
about 70% complete.

Thoughts?
Thanks,
    Stephen

Re: Hash Join cost estimates

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> I've been fiddling with this on the very much larger overall database
> where this test case came from and have found that hashing the large
> table can actually be *faster* and appears to cause a more consistent
> and constant amount of disk i/o (which is good).

Interesting.

> What I'm trying to get at in this overall email is: why in the world is
> it so expensive to do hash lookups?

perf or oprofile reveal anything?

Also, I assume that the cases you are looking at are large enough that
even the "small" table doesn't fit in a single hash batch?  It could
well be that the answer has to do with some bogus or at least
unintuitive behavior of the batching process, and it isn't really at all
a matter of individual hash lookups being slow.

(You never did mention what work_mem setting you're testing, anyway.)
        regards, tom lane



Re: Hash Join cost estimates

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > What I'm trying to get at in this overall email is: why in the world is
> > it so expensive to do hash lookups?
>
> perf or oprofile reveal anything?

Working on a test case actually- I've got one now:
http://snowman.net/~sfrost/test_case2.sql

In this example, hashing the large table is actually 2 seconds *faster*
than hashing the small table (again, all on my laptop).

> Also, I assume that the cases you are looking at are large enough that
> even the "small" table doesn't fit in a single hash batch?

No, quite the opposite, sorry for not mentioning that before.  Either
side fits completely into memory w/ a single batch.  The explain
analyze's that I posted before show that, either way, there's only one
batch involved.

> (You never did mention what work_mem setting you're testing, anyway.)

The test case above (where I got a 2s faster run time by hashing
the big table) used a work_mem of 1GB.
Thanks!
    Stephen

Re: Hash Join cost estimates

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> perf or oprofile reveal anything?

Here's what we get from oprofile (perhaps not too surprising):

Hash the small table / scan the big table:
samples  cum. samples  %        cum. %     linenr info                 image name               symbol name
167374   167374        47.9491  47.9491    nodeHash.c:915              postgres                 ExecScanHashBucket
85041    252415        24.3624  72.3115    mcount.c:60                 libc-2.15.so             __mcount_internal
28370    280785         8.1274  80.4389    _mcount.S:33                libc-2.15.so             mcount
15856    296641         4.5424  84.9814    (no location information)   [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff)
6291     302932         1.8022  86.7836    xact.c:682                  postgres                 TransactionIdIsCurrentTransactionId
4555     307487         1.3049  88.0885    instrument.c:70             postgres                 InstrStopNode
3849     311336         1.1027  89.1912    heapam.c:711                postgres                 heapgettup_pagemode
3567     314903         1.0219  90.2130    nodeHashjoin.c:63           postgres                 ExecHashJoin

Hash the big table / scan the small table:
samples  cum. samples  %        cum. %     linenr info                 image name               symbol name
112060   112060        39.2123  39.2123    mcount.c:60                 libc-2.15.so             __mcount_internal
36547    148607        12.7886  52.0009    nodeHash.c:709              postgres                 ExecHashTableInsert
33570    182177        11.7469  63.7477    _mcount.S:33                libc-2.15.so             mcount
16383    198560         5.7328  69.4805    (no location information)   [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff)
13200    211760         4.6190  74.0995    (no location information)   no-vmlinux               /no-vmlinux
6345     218105         2.2203  76.3197    xact.c:682                  postgres                 TransactionIdIsCurrentTransactionId
5250     223355         1.8371  78.1568    nodeHash.c:915              postgres                 ExecScanHashBucket
4797     228152         1.6786  79.8354    heapam.c:711                postgres                 heapgettup_pagemode
4661     232813         1.6310  81.4664    aset.c:563                  postgres                 AllocSetAlloc
4588     237401         1.6054  83.0718    instrument.c:70             postgres                 InstrStopNode
3550     240951         1.2422  84.3140    memcpy-ssse3-back.S:60      libc-2.15.so             __memcpy_ssse3_back
3013     243964         1.0543  85.3684    aset.c:1109                 postgres                 AllocSetCheck

Looking at the 'Hash the small table / scan the big table' case, opannotate
claims that this is by far the worst offender:

147588 42.2808 :        hashTuple = hashTuple->next;

While most of the time in the 'Hash the big table / scan the small
table' is in:
34572 12.0975 :        hashTuple->next = hashtable->buckets[bucketno];

Neither of those strike me as terribly informative though.  To be
honest, I've not really played w/ oprofile all that much.  Now that I've
got things set up to support this, I'd be happy to provide more info if
anyone has suggestions on how to get something more useful.

It does look like reducing bucket depth, as I outlined before through
the use of a 2-level hashing system, might help speed up
ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
entries to consider instead of more.  Along those same lines, I really
wonder if we're being too generous wrt the bucket-depth goal of '10'
instead of, say, '1', especially when we've got plenty of work_mem
available.
Thanks,
    Stephen

Re: Hash Join cost estimates

From
Stephen Frost
Date:
* Stephen Frost (sfrost@snowman.net) wrote:
> 85041    252415        24.3624  72.3115    mcount.c:60                 libc-2.15.so             __mcount_internal
> 28370    280785         8.1274  80.4389    _mcount.S:33                libc-2.15.so             mcount
[...]

And as a side-note, I'm rebuilding w/o profiling, asserts, etc and will
run it again, though I don't really expect the top-hitters to change.
Thanks,
    Stephen

Re: Hash Join cost estimates

From
Stephen Frost
Date:
* Stephen Frost (sfrost@snowman.net) wrote:
> It does look like reducing bucket depth, as I outlined before through
> the use of a 2-level hashing system, might help speed up
> ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
> entries to consider instead of more.  Along those same lines, I really
> wonder if we're being too generous wrt the bucket-depth goal of '10'
> instead of, say, '1', especially when we've got plenty of work_mem
> available.

Rerunning using a minimally configured build (only --enable-openssl
and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1'
results in a couple of interesting things-

First, the planner actually picks the plan to hash the small table and
seqscan the big one.  That also, finally, turns out to be *faster* for
this test case.

explain analyze results here:
Hash small table / seqscan big table: http://explain.depesz.com/s/nP1
Hash big table / seqscan small table: http://explain.depesz.com/s/AUv

Here's the oprofile reports:

Hash small table / seqscan big table:
samples  cum. samples  %        cum. %     linenr info                 image name               symbol name
39023    39023         52.8574  52.8574    nodeHash.c:915              postgres                 ExecScanHashBucket
3743     42766          5.0700  57.9273    xact.c:682                  postgres                 TransactionIdIsCurrentTransactionId
3110     45876          4.2126  62.1399    nodeHashjoin.c:63           postgres                 ExecHashJoin
2561     48437          3.4689  65.6088    heapam.c:711                postgres                 heapgettup_pagemode
2427     50864          3.2874  68.8962    heapam.c:300                postgres                 heapgetpage
2395     53259          3.2441  72.1403    heaptuple.c:1028            postgres                 slot_deform_tuple
2395     55654          3.2441  75.3843    heaptuple.c:1135            postgres                 slot_getattr
2383     58037          3.2278  78.6122    nodeHash.c:786              postgres                 ExecHashGetHashValue
1811     59848          2.4530  81.0652    tqual.c:1044                postgres                 HeapTupleSatisfiesMVCC
1796     61644          2.4327  83.4979    execScan.c:111              postgres                 ExecScan
1298     62942          1.7582  85.2561    hashfunc.c:517              postgres                 hash_uint32
1274     64216          1.7257  86.9817    execProcnode.c:356          postgres                 ExecProcNode
1011     65227          1.3694  88.3511    heapam.c:1453               postgres                 heap_getnext
905      66132          1.2258  89.5770    execTuples.c:333            postgres                 ExecStoreTuple
858      66990          1.1622  90.7392    fmgr.c:1291                 postgres                 FunctionCall1Coll
835      67825          1.1310  91.8702    execQual.c:668              postgres                 ExecEvalScalarVarFast
834      68659          1.1297  92.9999    mcxt.c:126                  postgres                 MemoryContextReset
818      69477          1.1080  94.1078    nodeSeqscan.c:48            postgres                 SeqNext

Hash big table / seqscan small table:
samples  cum. samples  %        cum. %     linenr info                 image name               symbol name
38612    38612         41.2901  41.2901    nodeHash.c:709              postgres                 ExecHashTableInsert
7435     46047          7.9507  49.2408    (no location information)   no-vmlinux               /no-vmlinux
4900     50947          5.2399  54.4806    aset.c:563                  postgres                 AllocSetAlloc
3803     54750          4.0668  58.5474    xact.c:682                  postgres                 TransactionIdIsCurrentTransactionId
3335     58085          3.5663  62.1137    heapam.c:711                postgres                 heapgettup_pagemode
2532     60617          2.7076  64.8213    nodeHash.c:786              postgres                 ExecHashGetHashValue
2523     63140          2.6980  67.5193    memcpy-ssse3-back.S:60      libc-2.15.so             __memcpy_ssse3_back
2518     65658          2.6926  70.2119    heaptuple.c:1028            postgres                 slot_deform_tuple
2378     68036          2.5429  72.7549    heapam.c:300                postgres                 heapgetpage
2374     70410          2.5387  75.2935    heaptuple.c:1135            postgres                 slot_getattr
1852     72262          1.9805  77.2740    nodeHash.c:915              postgres                 ExecScanHashBucket
1831     74093          1.9580  79.2320    tqual.c:1044                postgres                 HeapTupleSatisfiesMVCC
1732     75825          1.8521  81.0841    heapam.c:1453               postgres                 heap_getnext
1320     77145          1.4116  82.4957    nodeHash.c:76               postgres                 MultiExecHash
1219     78364          1.3035  83.7992    heaptuple.c:1529            postgres                 minimal_tuple_from_heap_tuple
1212     79576          1.2961  85.0953    execProcnode.c:356          postgres                 ExecProcNode
1209     80785          1.2929  86.3881    hashfunc.c:517              postgres                 hash_uint32
1197     81982          1.2800  87.6682    execScan.c:111              postgres                 ExecScan
1139     83121          1.2180  88.8862    execTuples.c:333            postgres                 ExecStoreTuple
1010     84131          1.0801  89.9662    execTuples.c:662            postgres                 ExecFetchSlotMinimalTuple
961      85092          1.0277  90.9939    aset.c:821                  postgres                 AllocSetFree

Looking with opannotate, there's two main hotspots in
ExecScanHashBucket:
12846 17.4001 :        hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];

and
22100 29.9348 :        hashTuple = hashTuple->next;

I'm certainly curious about those, but I'm also very interested in the
possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
depending on the work_mem setting.  It's only used in
ExecChooseHashTableSize, so while making it variable or depending on
work_mem could slow planning down a bit, it's not a per-tuple cost item.
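
For anyone not staring at the code, the bucket count is derived roughly like
this- a simplified standalone model of the single-batch path, ignoring the
work_mem and pointer-array caps the real function applies:

#include <math.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 10          /* the current target bucket depth */

/* Simplified model of how ExecChooseHashTableSize() picks nbuckets when
 * everything fits in a single batch. */
static int
choose_nbuckets(double ntuples)
{
    int nbuckets = (int) ceil(ntuples / NTUP_PER_BUCKET);
    int pow2 = 1024;                /* don't go real small */

    while (pow2 < nbuckets)
        pow2 <<= 1;                 /* round up to a power of two */
    return pow2;
}

int main(void)
{
    /* the 41176-row small table ends up with 8192 buckets, matching the
     * "Buckets: 8192" shown in the earlier explain analyze output */
    printf("nbuckets = %d\n", choose_nbuckets(41176));
    return 0;
}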

Thoughts?
Thanks,
    Stephen

Re: Hash Join cost estimates

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> Looking with opannotate, there's two main hotspots in
> ExecScanHashBucket:

>  12846 17.4001 :        hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
> and
>  22100 29.9348 :        hashTuple = hashTuple->next;

Those are, of course, pretty trivial statements; so the way I read this
is that the fundamental slowdown comes from the hash table being large
compared to the CPU's cache, so that you're incurring lots of cache
misses at these particular fetches.  (You might be able to confirm that
if you can set oprofile to count cache misses rather than wall clock
time.)

> I'm certainly curious about those, but I'm also very interested in the
> possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> depending on the work_mem setting.

Not sure about that.  That would make the hash-bucket-header array
larger without changing the size of the rest of the hash table, thus
probably making the CPU cache situation worse not better (which would
manifest as more time at the first of these two statements relative to
the second).

Can you add some instrumentation code to get us information about the
average/max length of the bucket chains?  And maybe try to figure out
how many distinct hash values per bucket, which would give us a clue
whether your two-level-list idea is worth anything.
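
(Untested, but for the chain lengths at least, something quick along these
lines, run once after the table is built, ought to do; field names are from
memory, so double-check them against executor/hashjoin.h:)

#include "postgres.h"
#include "executor/hashjoin.h"

/* Dump bucket-chain statistics for a completed hash table. */
static void
report_bucket_stats(HashJoinTable hashtable)
{
    int         i;
    int         maxlen = 0;
    double      total = 0;

    for (i = 0; i < hashtable->nbuckets; i++)
    {
        HashJoinTuple tup;
        int           len = 0;

        for (tup = hashtable->buckets[i]; tup != NULL; tup = tup->next)
            len++;

        total += len;
        if (len > maxlen)
            maxlen = len;
    }

    elog(LOG, "hashjoin: %d buckets, avg chain %.2f, max chain %d",
         hashtable->nbuckets, total / hashtable->nbuckets, maxlen);
}
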
        regards, tom lane



Re: Hash Join cost estimates

From
"ktm@rice.edu"
Date:
On Thu, Apr 04, 2013 at 04:16:12PM -0400, Stephen Frost wrote:
> * Stephen Frost (sfrost@snowman.net) wrote:
> > It does look like reducing bucket depth, as I outlined before through
> > the use of a 2-level hashing system, might help speed up
> > ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
> > entries to consider instead of more.  Along those same lines, I really
> > wonder if we're being too generous wrt the bucket-depth goal of '10'
> > instead of, say, '1', especially when we've got plenty of work_mem
> > available.
> 
> Rerunning using a minimally configured build (only --enable-openssl
> and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1'
> results in a couple of interesting things-
> 
> First, the planner actually picks the plan to hash the small table and
> seqscan the big one.  That also, finally, turns out to be *faster* for
> this test case.
> 
> ...
> 
> I'm certainly curious about those, but I'm also very interested in the
> possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> depending on the work_mem setting.  It's only used in
> ExecChooseHashTableSize, so while making it variable or depending on
> work_mem could slow planning down a bit, it's not a per-tuple cost item.
> 
+1 for adjusting this based on work_mem value.

Ken



Re: Hash Join cost estimates

From
Matthias
Date:
> In this example, hashing the large table is actually 2 seconds *faster*
> than hashing the small table (again, all on my laptop).

Are you running the laptop on battery? When I've benchmarked pgsql last
time I used my laptop as well, and it only occurred to me after a lot of
trying that laptops (even with all energy saving disabled in my case)
don't always make for reliable benchmark machines. Things like your CPU
clock speed being dynamically adjusted can produce really strange results.

Also when I was running on battery the performance numbers could not be  
compared in any way to when I was running with the laptop connected  
straight to a socket. Things like IO/CPU ratio were completely different.  
And numbers on the final testing servers were even different.

Of course your test case might not be affected by this at all, but it's  
something to watch out for.

-Matthias



Re: Hash Join cost estimates

From
Stephen Frost
Date:
* Matthias (nitrogenycs@gmail.com) wrote:
> >In this example, hashing the large table is actually 2 seconds *faster*
> >than hashing the small table (again, all on my laptop).
>
> Are you running the laptop on battery? When I've benchmarked pgsql
> last time I used my laptop as well and it only occurred to me after a
> lot of trying that laptops (even with all energy saving disabled in
> my case) don't always make for reliable benchmark machines. Things
> like your CPU clockspeed being dynamically adjusted can produce
> really strange results.

Those runs were with the laptop plugged in, but I've also run it on
battery, and while the performance is certainly different between those
two cases, the relative speed of hashing vs. hash-lookup has been
consistent.  Also, that's why I provided the test case- feel free (and
please do!) to test it on any/all hardware you can find.  I'd love to hear
reports from others on their experiences.  Also, the relative speeds on
my laptop runs matched the performance (the laptop was slower, but
slower in both paths in a comparable way) on the big server where this
all originated.

> Of course your test case might not be affected by this at all, but
> it's something to watch out for.

Certainly.
Thanks,
    Stephen

Re: Hash Join cost estimates

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I'm certainly curious about those, but I'm also very interested in the
> > possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> > depending on the work_mem setting.
>
> Not sure about that.  That would make the hash-bucket-header array
> larger without changing the size of the rest of the hash table, thus
> probably making the CPU cache situation worse not better (which would
> manifest as more time at the first of these two statements relative to
> the second).

In the testing that I've done, I've yet to find a case where having a
smaller NTUP_PER_BUCKET makes things worse, and it can yield a dramatic
improvement.  Regarding the memory situation, I'm not sure that using
fewer buckets really helps the cache situation, though I'm no CPU
architect.  However, assuming
that the CPU is smart enough to only pull in bits of memory that it
needs, I'm not sure how having to pull in parts of a large array of
pointers is worse than having to go fetch randomly placed entries in a
bucket, especially since we have to step through an entire bucket each
time and the bucket tuples are allocated independently, while the hash
array is allocated once.  Indeed, it seems you'd be more likely to get
other things you need in a given pull into cache for the hash table
array than with the bucket tuple entries which might well include
unrelated garbage.

> Can you add some instrumentation code to get us information about the
> average/max length of the bucket chains?  And maybe try to figure out
> how many distinct hash values per bucket, which would give us a clue
> whether your two-level-list idea is worth anything.

I ended up implementing the two-level system and doing some testing with
it.  It ends up making hash table building take quite a bit longer and
only improves the scan performance in very select cases.  The
improvement requires an individual bucket to have both lots of dups and
a lot of distinct values, because it only helps when it can skip a
significant number of tuples.  If we move to a system where there's
rarely more than one distinct value in a bucket, then the chances of a
serious improvement from the two-level system go down that much more.

 As such, I've come up with this (trivial) patch which simply modifies
ExecChooseHashTableSize() to ignore NTUP_PER_BUCKET (essentially
treating it as 1) when work_mem is large enough to fit the entire hash
table (which also implies that there is only one batch).  I'd love to
hear feedback from others on what this does under different conditions.
This also makes the "hash-the-small-table" case faster for the test case
which I provided earlier (http://snowman.net/~sfrost/test_case2.sql),
and it uses quite a bit less memory too.  Now that I've convinced myself
that the two-level system isn't practical and the "hash-the-small-table"
case is actually faster than "hash-the-big-table", I'll start looking
at improving on your proposal to change the costing to favor the smaller
table being hashed more often.
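
For those who would rather not open the attachment, the gist of the change is
captured by this little standalone model (it is not the patch itself- the
actual patch just treats NTUP_PER_BUCKET as 1 in ExecChooseHashTableSize()
when everything fits in a single batch):

#include <math.h>
#include <stdio.h>

/* Standalone model of the idea: when the whole table fits in work_mem, size
 * the bucket array for ~1 tuple per bucket instead of NTUP_PER_BUCKET (10). */
static int
round_up_pow2(int n)
{
    int pow2 = 1024;

    while (pow2 < n)
        pow2 <<= 1;
    return pow2;
}

int main(void)
{
    double ntuples = 41176;     /* the small table from the original test case */

    int before = round_up_pow2((int) ceil(ntuples / 10));   /* NTUP_PER_BUCKET = 10 */
    int after  = round_up_pow2((int) ceil(ntuples));        /* treat it as 1 */

    printf("buckets before: %d, after: %d\n", before, after);   /* 8192 vs 65536 */
    return 0;
}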

    Thanks,

        Stephen

Attachment