Thread: Hash Join cost estimates
All,

Marty's issue w/ NestLoop reminded me that I'd put together a test case which illustrates a HashJoin doing the wrong thing. The test case is here:

http://snowman.net/~sfrost/test_case.sql

Simply load that up and then check out this plan:

explain select * from small_table join big_table using (id_short);

We end up deciding to build a hash table of 4M records and then seq scan the table that only has 41K. Much of the reason for this is that the analysis points out, correctly, that the column we're using from the small table to join isn't unique and therefore the buckets will be deeper- but it's not *nearly* as bad as we estimate.

The bucket estimate for the small table comes back as 26, while the reality is that we only look through ~5.5 entries per bucket, with the longest run being 21. With the big table being hashed, the bucket estimate is 4 (though this seems to vary way more than I would expect- sometimes 4, sometimes 8, sometimes 2..) while the average number scanned through is actually ~3.5 and the longest scan ends up being 20.

Without the bucket question, we end up with pretty reasonable results (directly out of initial_cost_hashjoin):

41K hashed, seqscan 4M: startup_cost = 1229.46, run_cost = 72307.39
4M hashed, 41K seqscan: startup_cost = 115030.10, run_cost = 817.70

When we get through dealing with the bucketing question in final_cost_hashjoin (costsize.c:2673), we have some pretty gross results for the 'good' plan's run_cost:

run_cost = 72307.39 + 138848.81 = 211156.20

While the 'bad' plan's run_cost is only bumped a tiny bit:

run_cost = 817.70 + 411.76 = 1229.46

Resulting in total costs that look all kinds of wacky:

41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66

Or the 'good' plan being costed at *nearly double*. Now, my laptop might not be the 'best' system CPU-wise, but there's a pretty massive time difference between these plans:

41K hashed, seqscan 4M: 2252.007 ms  http://explain.depesz.com/s/FEq
4M hashed, seqscan 41K: 2708.471 ms  http://explain.depesz.com/s/FOU

That's half a second and a good 15+% difference.

Now, back to the bucket estimation- the ndistinct for the small table is -0.475058 (or 19561 tuples), which is about right. There are 23 distinct values, 18,795 duplicated, and another 841 with dup counts ranging from 3 to 10. This leads to an avgfreq of 0.000051; unfortunately, we're going for only 8192 buckets, so this gets moved up to 0.00012, and then the adjustment for the MCV (which is 0.00027) bumps this all the way up to 0.00064, leading to our bucket depth estimate of 26 (bucket size estimate multiplied by the inner rows) and the resulting cost dominating the overall costing. (A rough sketch of that arithmetic is at the end of this mail.)

If we consider the bucketing wrong and look at what the costs would have been with the actual number of average scans (and therefore excluding the 0.5 'adjustment'), we get:

seqscan 41K cost: 360.28 (cpu_op_cost * 41K * 3.5)
seqscan 4M cost: 58743.73 (cpu_op_cost * 4M * 5.5)

which isn't exactly going in the 'right' direction for us. Now, I'm sure that there's a cost to having to consider more buckets, but it seems to be far less, in comparison to the hash table creation cost, than what our model would suggest.

In the end, I think the problem here is that we are charging far too much for these bucket costs (particularly when we're getting them so far wrong) and not nearly enough for the cost of building the hash table in the first place.

Thoughts? Ideas about how we can 'fix' this? Have others run into similar issues?

	Thanks,

		Stephen
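P.S. For anyone who wants to follow that arithmetic, here's a rough sketch of the calculation- a paraphrase of what estimate_hash_bucketsize() effectively does for the small table, not the actual code, and with the clamping details simplified:

/* Hypothetical paraphrase of the bucket-size estimate; not the real code. */
double
sketch_bucket_fraction(double ndistinct, double nbuckets,
                       double avgfreq, double mcvfreq)
{
    /* start from the average frequency of a distinct value */
    double estfract = 1.0 / ndistinct;      /* ~0.000051 for 19561 distinct */

    /* can't spread the distinct values over more buckets than exist */
    if (estfract < 1.0 / nbuckets)
        estfract = 1.0 / nbuckets;          /* 1/8192 => ~0.00012 */

    /* scale up by how much worse the MCV is than the average value */
    estfract *= mcvfreq / avgfreq;          /* * (0.00027 / 0.000051) => ~0.00064 */

    /* the caller multiplies by the inner row count: 0.00064 * 41176 => ~26 */
    return estfract;
}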
On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
> 41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
> 4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66

I think those are backwards -- typo?

> In the end, I think the problem here is that we are charging far too
> much for these bucket costs (particularly when we're getting them so
> far wrong) and not nearly enough for the cost of building the hash
> table in the first place.
>
> Thoughts? Ideas about how we can 'fix' this? Have others run into
> similar issues?

Yes, I have run into this issue (or something very similar).

I don't understand why the bucketsize even matters much -- assuming few hash collisions, we are not actually evaluating the quals any more times than necessary. So why all of the hashjoin-specific logic in determining the number of qual evaluations? The only reason I can think of is to model the cost of comparing the hashes themselves.

Also, searching the archives turns up at least one other, but I think I've seen more:

http://www.postgresql.org/message-id/A82128A6-4E3B-43BD-858D-21B129F7BEEB@richrelevance.com

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> Yes, I have run into this issue (or something very similar). I don't
> understand why the bucketsize even matters much -- assuming few hash
> collisions, we are not actually evaluating the quals any more times than
> necessary. So why all of the hashjoin-specific logic in determining the
> number of qual evaluations? The only reason I can think of is to model
> the cost of comparing the hashes themselves.

I think the point is that there may *not* be few hash collisions ...

			regards, tom lane
On Fri, 2013-03-29 at 16:37 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Yes, I have run into this issue (or something very similar). I don't
> > understand why the bucketsize even matters much -- assuming few hash
> > collisions, we are not actually evaluating the quals any more times than
> > necessary. So why all of the hashjoin-specific logic in determining the
> > number of qual evaluations? The only reason I can think of is to model
> > the cost of comparing the hashes themselves.
>
> I think the point is that there may *not* be few hash collisions ...

In Stephen's case the table was only 41KB, so something still seems off. Maybe we should model the likelihood of a collision based on the cardinalities (assuming a reasonably good hash function)?

Also, I think I found an important assumption that seems dubious (in the comment for estimate_hash_bucketsize()):

"If the other relation in the join has a key distribution similar to this one's, then the most-loaded buckets are exactly those that will be probed most often. Therefore, the "average" bucket size for costing purposes should really be taken as something close to the "worst case" bucket size. We try to estimate this by adjusting the fraction if there are too few distinct data values, and then scaling up by the ratio of the most common value's frequency to the average frequency."

But the key distribution is not necessarily similar at all... the large table might have many more distinct values.

Stephen, do you think this could explain your problem?

Regards,
	Jeff Davis
Jeff,

* Jeff Davis (pgsql@j-davis.com) wrote:
> On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
> > 41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
> > 4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66
>
> I think those are backwards -- typo?

Yes, sorry, those are backwards. The '4M hashed, seqscan 41K' entry comes out with the lower cost and that's what we end up using.

	Thanks,

		Stephen
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I think the point is that there may *not* be few hash collisions ...

Right, but that's actually all entirely based on concerns over there being duplicates (hence why we consider the MCVs and ndistinct), which makes *some* sense given that we currently have a single linked-list in each bucket into which any dups are placed.

It occurs to me that we could get away from this by having a 2-level system. We hash to a bucket which contains a linked list of linked lists. The bucket list would be for actual collisions (which we don't consider at all in the current bucket estimating) and each of those entries would be a linked list of duplicates for that entry in the bucket (a rough sketch of the structure is at the end of this mail).

This would add some small cost for building the hash, since we would have to step through each entry in the bucket to determine if the entry being added is a new entry or not, but not very much unless we're worried about lots of collisions happening (which I certainly hope isn't the case). Hash tables are generally expected to take more effort to build initially anyway, so I don't see a problem putting more logic there. Also, we could skip checking each entry in the bucket when the input is known to be unique and instead just skip to the end of the bucket, since the new entry can't match any existing one.

We could still work through the bucketing logic and add some costing to that case for those situations where we are hashing on only one key of a multi-key join and we expect a lot of duplicates to exist. I'm not sure how much that happens though- I would hope that we would use a composite hash key most of the time that we have multi-key joins that use hash tables.

Thoughts?

	Thanks,

		Stephen
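P.S. To make the 2-level idea a bit more concrete, here's a rough sketch of the kind of structure I have in mind- illustrative only, these are not the existing hash join structs and the names are made up:

#include <stdint.h>

/* one tuple in the list of duplicates for a given key value */
typedef struct DupEntry
{
    struct DupEntry *next;      /* next duplicate of the same key value */
    /* ... the MinimalTuple payload would hang off here ... */
} DupEntry;

/* one distinct key value that hashed into this bucket */
typedef struct BucketEntry
{
    struct BucketEntry *next;   /* next distinct key in this bucket (a real collision) */
    uint32_t    hashvalue;      /* full hash code, to skip non-matches cheaply */
    DupEntry   *dups;           /* all tuples sharing this key value */
} BucketEntry;

/*
 * Probing a bucket then means walking the (short) BucketEntry list and
 * comparing keys once per distinct value; on a match we emit the whole
 * dups list and ignore the rest of the bucket.
 */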
* Jeff Davis (pgsql@j-davis.com) wrote:
> In Stephen's case the table was only 41KB, so something still seems off.
> Maybe we should model the likelihood of a collision based on the
> cardinalities (assuming a reasonably good hash function)?

It's not really 'hash collisions' that we're trying to be wary of, per se, it's the risk of duplicates.

To push this very far in the other direction- if you have 41k of the *same value* in the small table, then it's currently faster to build the hash table on the large table and then seq scan the small table (10s vs. 14s on my laptop running w/o being plugged in, so it's a bit slower). Now, that's a pretty ridiculous case, but it seems like we're being pretty dumb here- for every input row from the outer table, we're looking through all 41K *duplicate* keys in that one hash bucket.

This goes back to the suggestion I just made- if the hash bucket list contained only unique values (which are a result of actual hash collisions), then we'd only be doing *one* comparison for each input row of the outer table that doesn't match- and when we found one which *did*, we would only need to step through the dup list for that one entry and blast back all of those rows, forgetting about the rest of the bucket which we know can't match.

> Also, I think I found an important assumption that seems dubious (in
> the comment for estimate_hash_bucketsize()):

I've wondered about that also. It certainly seems quite bogus that we can end up with an 'estimated # of entries in a bucket' that's larger than the # of entries we've found for the MCV in the table, especially *double* that.

> Stephen, do you think this could explain your problem?

As I tried to articulate in my initial email- even if we had a *perfect* answer to "how many comparisons will we need to do", the current costing would cause us to pick the plan that, intuitively and empirically, takes longer (hash the 4M row table) because that cost is multiplied times the number of outer table rows and the cpu_tuple_cost (charged to build the hash table) isn't high enough relative to the cpu_op_cost (charged to do the comparisons in the buckets).

	Thanks,

		Stephen
Stephen Frost <sfrost@snowman.net> writes:
> * Jeff Davis (pgsql@j-davis.com) wrote:
>> In Stephen's case the table was only 41KB, so something still seems off.
>> Maybe we should model the likelihood of a collision based on the
>> cardinalities (assuming a reasonably good hash function)?

> It's not really 'hash collisions' that we're trying to be wary of, per
> se, it's the risk of duplicates.

I spent some time looking at this. I think the real issue is that the code is trying to charge something close to cpu_operator_cost for each time an outer tuple is compared to a hashtable entry. That was actually reasonable back when the costing code was last touched --- the current scheme of storing and comparing the hash codes before testing the equality operator proper only dates to commit 849074f9ae422c64501bb1d53ef840de870bf65c. I punted on changing the cost estimates at that time, and I think what this example is showing is that that was a bad decision. Really, when we're traipsing down a bucket list, skipping over bucket entries with the wrong hash code is just about free, or at least it's a whole lot cheaper than applying ExecQual.

Perhaps what we should do is charge the hash_qual_cost only for some small multiple of the number of tuples that we expect will *pass* the hash quals, which is a number we have to compute anyway. The multiple would represent the rate of hash-code collisions we expect. I'd still be inclined to charge something per bucket entry, but it should be really small, perhaps on the order of 0.01 times cpu_operator_cost. Or we could just drop that term entirely.

It strikes me that the reason to be worried about skewed distribution in the inner relation is not really that it changes the run time much, but rather that it risks blowing out work_mem if specific virtual buckets have too many members (or at best, we're forced to increase the number of batches more than we thought to stay under work_mem; which carries runtime costs of its own). Maybe what we should be doing with the bucketsize numbers is estimating peak memory consumption to gate whether we'll accept the plan at all, rather than adding terms to the cost estimate.

			regards, tom lane
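PS: to be concrete about the shape of the charge I have in mind, something roughly like this (just an illustrative sketch, not a patch; the multiplier and the 0.01 factor are placeholders, and the inputs stand in for values final_cost_hashjoin already computes):

/* Illustrative sketch of the proposed per-probe charges; not a patch. */
double
sketch_probe_cost(double hash_qual_cost, double cpu_operator_cost,
                  double outer_path_rows, double expected_matches,
                  double avg_bucket_entries)
{
    /* assumed rate of hash-code collisions among non-matching entries */
    double collision_multiple = 1.05;

    /* charge the real qual evaluation only for tuples expected to match */
    double qual_cost = hash_qual_cost * expected_matches * collision_multiple;

    /* skipping a bucket entry with the wrong hash code is nearly free */
    double skip_cost = 0.01 * cpu_operator_cost *
                       outer_path_rows * avg_bucket_entries;

    return qual_cost + skip_cost;
}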
On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
> Really, when we're traipsing down a bucket
> list, skipping over bucket entries with the wrong hash code is just
> about free, or at least it's a whole lot cheaper than applying ExecQual.
>
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway. The multiple
> would represent the rate of hash-code collisions we expect.

+1.

> I'd still be inclined to charge something per bucket entry, but it
> should be really small, perhaps on the order of 0.01 times
> cpu_operator_cost.
>
> Or we could just drop that term entirely.

FWIW, either of those are fine with me based on my limited experience.

> Maybe what we should be doing with the bucketsize numbers is estimating
> peak memory consumption to gate whether we'll accept the plan at all,
> rather than adding terms to the cost estimate.

Sounds reasonable. Ideally, we'd have a way to continue executing even in that case; but that's a project by itself, and would make it even more difficult to cost accurately.

Regards,
	Jeff Davis
On Mon, Apr 1, 2013 at 2:35 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
>> Really, when we're traipsing down a bucket
>> list, skipping over bucket entries with the wrong hash code is just
>> about free, or at least it's a whole lot cheaper than applying ExecQual.
>>
>> Perhaps what we should do is charge the hash_qual_cost only for some
>> small multiple of the number of tuples that we expect will *pass* the
>> hash quals, which is a number we have to compute anyway. The multiple
>> would represent the rate of hash-code collisions we expect.
>
> +1.
>
>> I'd still be inclined to charge something per bucket entry, but it
>> should be really small, perhaps on the order of 0.01 times
>> cpu_operator_cost.
>>
>> Or we could just drop that term entirely.
>
> FWIW, either of those are fine with me based on my limited experience.

FWIW, I have also seen this problem and the proposed fixes sound reasonable to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I wrote:
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway. The multiple
> would represent the rate of hash-code collisions we expect.

I tried the attached quick-hack patch on Stephen's example. With work_mem set to 16MB I get these results:

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1229.46..74154.49 rows=41176 width=24) (actual time=47.723..1845.869 rows=13731 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.045..506.212 rows=4272271 loops=1)
   ->  Hash  (cost=714.76..714.76 rows=41176 width=24) (actual time=24.944..24.944 rows=41176 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2574kB
         ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.007..11.608 rows=41176 loops=1)
 Total runtime: 1847.697 ms
(7 rows)

Forcing the other plan to be chosen, I get

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=131719.10..150327.44 rows=41176 width=24) (actual time=1922.942..2810.095 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.012..10.058 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual time=1921.962..1921.962 rows=4272271 loops=1)
         Buckets: 65536  Batches: 16  Memory Usage: 9412kB
         ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.043..702.898 rows=4272271 loops=1)
 Total runtime: 2820.633 ms
(7 rows)

So that's at least going in the right direction.

I have not thought about how the calculation should be adjusted in the semi/anti join case, nor about how we ought to repurpose the bucket-size-variance calculations for checking whether work_mem will be exceeded. So this is a long way from being committable, but it seems promising.

			regards, tom lane
Tom, all,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> So that's at least going in the right direction.

I agree that this is going in the right direction; it certainly would make the plan that I *expect* to be chosen more likely, however..

I've been fiddling with this on the very much larger overall database where this test case came from and have found that hashing the large table can actually be *faster* and appears to cause a more consistent and constant amount of disk i/o (which is good). The test case exhibits a bit of why this is the case- the per-tuple hash lookup cost is way closer to the per-tuple cost of building the hash table than I'd expect:

per-tuple cost to build the hash table (4M tuples): 0.33us
per-tuple cost to scan/do hash lookups (4M tuples): 0.29us (with a small hash table of only 41K entries)

The total difference being: 1233.854 vs. 1428.424, or only 194.570ms in favor of scanning the big table instead of hashing it. These numbers are just from those posted with my original email:

http://explain.depesz.com/s/FEq
http://explain.depesz.com/s/FOU

I've seen much worse though- I have one case where hash-the-big-table took 5s and hash-the-small-table took almost 10s (total times). I'm trying to see if I can pull that out and isolate how it's different (and see if it was just due to other load on the box).

What I'm trying to get at in this overall email is: why in the world is it so expensive to do hash lookups? I would have expected the cost of building the hash table to be *much* more than the cost to do a hash lookup, and that doing hash lookups against a small hash table would be fast enough to put serious pressure on the i/o subsystem. Instead, the building of the hash table actually puts more pressure and can end up being more efficient overall. We have a process that basically does this a whole bunch, and the "hash-the-big-table" operation takes about 4.7 hrs, while the "hash-the-small-table" approach went well past 5 hours and was only about 70% complete.

Thoughts?

	Thanks,

		Stephen
Stephen Frost <sfrost@snowman.net> writes:
> I've been fiddling with this on the very much larger overall database
> where this test case came from and have found that hashing the large
> table can actually be *faster* and appears to cause a more consistent
> and constant amount of disk i/o (which is good).

Interesting.

> What I'm trying to get at in this overall email is: why in the world is
> it so expensive to do hash lookups?

perf or oprofile reveal anything?

Also, I assume that the cases you are looking at are large enough that even the "small" table doesn't fit in a single hash batch? It could well be that the answer has to do with some bogus or at least unintuitive behavior of the batching process, and it isn't really at all a matter of individual hash lookups being slow. (You never did mention what work_mem setting you're testing, anyway.)

			regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > What I'm trying to get at in this overall email is: why in the world is
> > it so expensive to do hash lookups?
>
> perf or oprofile reveal anything?

Working on a test case actually- I've got one now:

http://snowman.net/~sfrost/test_case2.sql

In this example, hashing the large table is actually 2 seconds *faster* than hashing the small table (again, all on my laptop).

> Also, I assume that the cases you are looking at are large enough that
> even the "small" table doesn't fit in a single hash batch?

No, quite the opposite, sorry for not mentioning that before. Either side fits completely into memory w/ a single batch. The explain analyzes that I posted before show that, either way, there's only one batch involved.

> (You never did mention what work_mem setting you're testing, anyway.)

The test case above (where I got a 2s faster run time by hashing the big table) used a work_mem of 1GB.

	Thanks!

		Stephen
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> perf or oprofile reveal anything?

Here's what we get from oprofile (perhaps not too surprising):

Hash the small table / scan the big table:

samples  cum. samples  %        cum. %   linenr info                image name    symbol name
167374   167374        47.9491  47.9491  nodeHash.c:915             postgres      ExecScanHashBucket
85041    252415        24.3624  72.3115  mcount.c:60                libc-2.15.so  __mcount_internal
28370    280785         8.1274  80.4389  _mcount.S:33               libc-2.15.so  mcount
15856    296641         4.5424  84.9814  (no location information)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff)
6291     302932         1.8022  86.7836  xact.c:682                 postgres      TransactionIdIsCurrentTransactionId
4555     307487         1.3049  88.0885  instrument.c:70            postgres      InstrStopNode
3849     311336         1.1027  89.1912  heapam.c:711               postgres      heapgettup_pagemode
3567     314903         1.0219  90.2130  nodeHashjoin.c:63          postgres      ExecHashJoin

Hash the big table / scan the small table:

samples  cum. samples  %        cum. %   linenr info                image name    symbol name
112060   112060        39.2123  39.2123  mcount.c:60                libc-2.15.so  __mcount_internal
36547    148607        12.7886  52.0009  nodeHash.c:709             postgres      ExecHashTableInsert
33570    182177        11.7469  63.7477  _mcount.S:33               libc-2.15.so  mcount
16383    198560         5.7328  69.4805  (no location information)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff)
13200    211760         4.6190  74.0995  (no location information)  no-vmlinux    /no-vmlinux
6345     218105         2.2203  76.3197  xact.c:682                 postgres      TransactionIdIsCurrentTransactionId
5250     223355         1.8371  78.1568  nodeHash.c:915             postgres      ExecScanHashBucket
4797     228152         1.6786  79.8354  heapam.c:711               postgres      heapgettup_pagemode
4661     232813         1.6310  81.4664  aset.c:563                 postgres      AllocSetAlloc
4588     237401         1.6054  83.0718  instrument.c:70            postgres      InstrStopNode
3550     240951         1.2422  84.3140  memcpy-ssse3-back.S:60     libc-2.15.so  __memcpy_ssse3_back
3013     243964         1.0543  85.3684  aset.c:1109                postgres      AllocSetCheck

Looking at the 'Hash the small table / scan the big table' case, opannotate claims that this is by far the worst offender:

147588 42.2808 :          hashTuple = hashTuple->next;

While most of the time in the 'Hash the big table / scan the small table' case is in:

 34572 12.0975 :          hashTuple->next = hashtable->buckets[bucketno];

Neither of those strike me as terribly informative though. To be honest, I've not really played w/ oprofile all that much. Now that I've got things set up to support this, I'd be happy to provide more info if anyone has suggestions on how to get something more useful.

It does look like reducing bucket depth, as I outlined before through the use of a 2-level hashing system, might help speed up ExecScanHashBucket, as it would hopefully have very few (eg: 1-2) entries to consider instead of more. Along those same lines, I really wonder if we're being too generous wrt the bucket-depth goal of '10' instead of, say, '1', especially when we've got plenty of work_mem available.

	Thanks,

		Stephen
* Stephen Frost (sfrost@snowman.net) wrote:
> 85041  252415  24.3624  72.3115  mcount.c:60   libc-2.15.so  __mcount_internal
> 28370  280785   8.1274  80.4389  _mcount.S:33  libc-2.15.so  mcount
[...]

And as a side-note, I'm rebuilding w/o profiling, asserts, etc and will run it again, though I don't really expect the top-hitters to change.

	Thanks,

		Stephen
* Stephen Frost (sfrost@snowman.net) wrote:
> It does look like reducing bucket depth, as I outlined before through
> the use of a 2-level hashing system, might help speed up
> ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
> entries to consider instead of more. Along those same lines, I really
> wonder if we're being too generous wrt the bucket-depth goal of '10'
> instead of, say, '1', especially when we've got plenty of work_mem
> available.

Rerunning using a minimally configured build (only --enable-openssl and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1' results in a couple of interesting things-

First, the planner actually picks the plan to hash the small table and seqscan the big one. That also, finally, turns out to be *faster* for this test case. explain analyze results here:

Hash small table / seqscan big table: http://explain.depesz.com/s/nP1
Hash big table / seqscan small table: http://explain.depesz.com/s/AUv

Here are the oprofile reports:

Hash small table / seqscan big table:

samples  cum. samples  %        cum. %   linenr info         image name  symbol name
39023    39023         52.8574  52.8574  nodeHash.c:915      postgres    ExecScanHashBucket
3743     42766          5.0700  57.9273  xact.c:682          postgres    TransactionIdIsCurrentTransactionId
3110     45876          4.2126  62.1399  nodeHashjoin.c:63   postgres    ExecHashJoin
2561     48437          3.4689  65.6088  heapam.c:711        postgres    heapgettup_pagemode
2427     50864          3.2874  68.8962  heapam.c:300        postgres    heapgetpage
2395     53259          3.2441  72.1403  heaptuple.c:1028    postgres    slot_deform_tuple
2395     55654          3.2441  75.3843  heaptuple.c:1135    postgres    slot_getattr
2383     58037          3.2278  78.6122  nodeHash.c:786      postgres    ExecHashGetHashValue
1811     59848          2.4530  81.0652  tqual.c:1044        postgres    HeapTupleSatisfiesMVCC
1796     61644          2.4327  83.4979  execScan.c:111      postgres    ExecScan
1298     62942          1.7582  85.2561  hashfunc.c:517      postgres    hash_uint32
1274     64216          1.7257  86.9817  execProcnode.c:356  postgres    ExecProcNode
1011     65227          1.3694  88.3511  heapam.c:1453       postgres    heap_getnext
905      66132          1.2258  89.5770  execTuples.c:333    postgres    ExecStoreTuple
858      66990          1.1622  90.7392  fmgr.c:1291         postgres    FunctionCall1Coll
835      67825          1.1310  91.8702  execQual.c:668      postgres    ExecEvalScalarVarFast
834      68659          1.1297  92.9999  mcxt.c:126          postgres    MemoryContextReset
818      69477          1.1080  94.1078  nodeSeqscan.c:48    postgres    SeqNext

Hash big table / seqscan small table:

samples  cum. samples  %        cum. %   linenr info                image name    symbol name
38612    38612         41.2901  41.2901  nodeHash.c:709             postgres      ExecHashTableInsert
7435     46047          7.9507  49.2408  (no location information)  no-vmlinux    /no-vmlinux
4900     50947          5.2399  54.4806  aset.c:563                 postgres      AllocSetAlloc
3803     54750          4.0668  58.5474  xact.c:682                 postgres      TransactionIdIsCurrentTransactionId
3335     58085          3.5663  62.1137  heapam.c:711               postgres      heapgettup_pagemode
2532     60617          2.7076  64.8213  nodeHash.c:786             postgres      ExecHashGetHashValue
2523     63140          2.6980  67.5193  memcpy-ssse3-back.S:60     libc-2.15.so  __memcpy_ssse3_back
2518     65658          2.6926  70.2119  heaptuple.c:1028           postgres      slot_deform_tuple
2378     68036          2.5429  72.7549  heapam.c:300               postgres      heapgetpage
2374     70410          2.5387  75.2935  heaptuple.c:1135           postgres      slot_getattr
1852     72262          1.9805  77.2740  nodeHash.c:915             postgres      ExecScanHashBucket
1831     74093          1.9580  79.2320  tqual.c:1044               postgres      HeapTupleSatisfiesMVCC
1732     75825          1.8521  81.0841  heapam.c:1453              postgres      heap_getnext
1320     77145          1.4116  82.4957  nodeHash.c:76              postgres      MultiExecHash
1219     78364          1.3035  83.7992  heaptuple.c:1529           postgres      minimal_tuple_from_heap_tuple
1212     79576          1.2961  85.0953  execProcnode.c:356         postgres      ExecProcNode
1209     80785          1.2929  86.3881  hashfunc.c:517             postgres      hash_uint32
1197     81982          1.2800  87.6682  execScan.c:111             postgres      ExecScan
1139     83121          1.2180  88.8862  execTuples.c:333           postgres      ExecStoreTuple
1010     84131          1.0801  89.9662  execTuples.c:662           postgres      ExecFetchSlotMinimalTuple
961      85092          1.0277  90.9939  aset.c:821                 postgres      AllocSetFree

Looking with opannotate, there are two main hotspots in ExecScanHashBucket:

12846 17.4001 :          hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];

and

22100 29.9348 :          hashTuple = hashTuple->next;

I'm certainly curious about those, but I'm also very interested in the possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable depending on the work_mem setting. It's only used in ExecChooseHashTableSize, so while making it variable or depending on work_mem could slow planning down a bit, it's not a per-tuple cost item.

Thoughts?

	Thanks,

		Stephen
Stephen Frost <sfrost@snowman.net> writes:
> Looking with opannotate, there are two main hotspots in
> ExecScanHashBucket:
>
> 12846 17.4001 :          hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
>
> and
>
> 22100 29.9348 :          hashTuple = hashTuple->next;

Those are, of course, pretty trivial statements; so the way I read this is that the fundamental slowdown comes from the hash table being large compared to the CPU's cache, so that you're incurring lots of cache misses at these particular fetches. (You might be able to confirm that if you can set oprofile to count cache misses rather than wall clock time.)

> I'm certainly curious about those, but I'm also very interested in the
> possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> depending on the work_mem setting.

Not sure about that. That would make the hash-bucket-header array larger without changing the size of the rest of the hash table, thus probably making the CPU cache situation worse not better (which would manifest as more time at the first of these two statements relative to the second).

Can you add some instrumentation code to get us information about the average/max length of the bucket chains? And maybe try to figure out how many distinct hash values per bucket, which would give us a clue whether your two-level-list idea is worth anything.

			regards, tom lane
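PS: something along these lines is roughly what I have in mind- an untested sketch assuming the current nbuckets/buckets/next fields, to be called once the table is built; counting distinct hash values per chain would be a similar loop comparing hashvalue along the way:

/* Untested sketch of bucket-chain instrumentation; not a polished patch. */
static void
report_bucket_stats(HashJoinTable hashtable)
{
    long    total_entries = 0;
    long    max_chain = 0;
    long    used_buckets = 0;
    int     i;

    for (i = 0; i < hashtable->nbuckets; i++)
    {
        long            chain_len = 0;
        HashJoinTuple   tuple;

        /* walk the chain for this bucket and count its entries */
        for (tuple = hashtable->buckets[i]; tuple != NULL; tuple = tuple->next)
            chain_len++;

        if (chain_len > 0)
        {
            used_buckets++;
            total_entries += chain_len;
            if (chain_len > max_chain)
                max_chain = chain_len;
        }
    }

    elog(LOG, "hashjoin: %d buckets, %ld used, avg chain %.2f, max chain %ld",
         hashtable->nbuckets, used_buckets,
         used_buckets > 0 ? (double) total_entries / used_buckets : 0.0,
         max_chain);
}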
On Thu, Apr 04, 2013 at 04:16:12PM -0400, Stephen Frost wrote:
> * Stephen Frost (sfrost@snowman.net) wrote:
> > It does look like reducing bucket depth, as I outlined before through
> > the use of a 2-level hashing system, might help speed up
> > ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
> > entries to consider instead of more. Along those same lines, I really
> > wonder if we're being too generous wrt the bucket-depth goal of '10'
> > instead of, say, '1', especially when we've got plenty of work_mem
> > available.
>
> Rerunning using a minimally configured build (only --enable-openssl
> and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1'
> results in a couple of interesting things-
>
> First, the planner actually picks the plan to hash the small table and
> seqscan the big one. That also, finally, turns out to be *faster* for
> this test case.
>
> ...
>
> I'm certainly curious about those, but I'm also very interested in the
> possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> depending on the work_mem setting. It's only used in
> ExecChooseHashTableSize, so while making it variable or depending on
> work_mem could slow planning down a bit, it's not a per-tuple cost item.

+1 for adjusting this based on work_mem value.

Ken
> In this example, hashing the large table is actually 2 seconds *faster*
> than hashing the small table (again, all on my laptop).

Are you running the laptop on battery? When I benchmarked pgsql last time I used my laptop as well, and it only occurred to me after a lot of trying that laptops (even with all energy saving disabled in my case) don't always make for reliable benchmark machines. Things like your CPU clockspeed being dynamically adjusted can produce really strange results. Also, when I was running on battery the performance numbers could not be compared in any way to when I was running with the laptop connected straight to a socket. Things like the IO/CPU ratio were completely different. And numbers on the final testing servers were different yet again.

Of course your test case might not be affected by this at all, but it's something to watch out for.

-Matthias
* Matthias (nitrogenycs@gmail.com) wrote:
> Are you running the laptop on battery? When I benchmarked pgsql
> last time I used my laptop as well, and it only occurred to me after a
> lot of trying that laptops (even with all energy saving disabled in
> my case) don't always make for reliable benchmark machines. Things
> like your CPU clockspeed being dynamically adjusted can produce
> really strange results.

Those runs were with the laptop plugged in, but I've also run it w/o the battery and while the performance is certainly different between those two cases, the relative speed of hashing vs. hash-lookup has been consistent.

Also, that's why I provided the test case- feel free (and please do!) test it on any/all hardware you can find. I'd love to hear reports from others on their experiences. Also, the relative speeds on my laptop runs matched the performance (the laptop was slower, but slower in both paths in a comparable way) on the big server where this is all originating.

> Of course your test case might not be affected by this at all, but
> it's something to watch out for.

Certainly.

	Thanks,

		Stephen
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I'm certainly curious about those, but I'm also very interested in the
> > possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> > depending on the work_mem setting.
>
> Not sure about that. That would make the hash-bucket-header array
> larger without changing the size of the rest of the hash table, thus
> probably making the CPU cache situation worse not better (which would
> manifest as more time at the first of these two statements relative to
> the second).

In the testing that I've done, I've yet to find a case where having a smaller NTUP_PER_BUCKET makes things worse, and it can have a dramatic improvement.

Regarding the memory situation, I'm not sure that keeping the buckets deep really helps, though I'm no CPU architect. However, assuming that the CPU is smart enough to only pull in bits of memory that it needs, I'm not sure how having to pull in parts of a large array of pointers is worse than having to go fetch randomly placed entries in a bucket, especially since we have to step through an entire bucket each time and the bucket tuples are allocated independently, while the hash array is allocated once. Indeed, it seems you'd be more likely to get other things you need in a given pull into cache with the hash table array than with the bucket tuple entries, which might well include unrelated garbage.

> Can you add some instrumentation code to get us information about the
> average/max length of the bucket chains? And maybe try to figure out
> how many distinct hash values per bucket, which would give us a clue
> whether your two-level-list idea is worth anything.

I ended up implementing the two-level system and doing some testing with it. It ends up making hash table building take quite a bit longer and only improves the scan performance in very select cases. The improvement requires an individual bucket to have both lots of dups and a lot of distinct values, because it only helps when it can skip a significant number of tuples. If we move to a system where there's rarely more than one distinct value in a bucket, then the chances of a serious improvement from the two-level system go down that much more.

As such, I've come up with this (trivial) patch which simply modifies ExecChooseHashTableSize() to ignore NTUP_PER_BUCKET (essentially treating it as 1) when work_mem is large enough to fit the entire hash table (which also implies that there is only one batch). I'd love to hear feedback from others on what this does under different conditions. This also makes the "hash-the-small-table" case faster for the test case which I provided earlier (http://snowman.net/~sfrost/test_case2.sql), and it uses quite a bit less memory too.

Now that I've convinced myself that the two-level system isn't practical and the "hash-the-small-table" case is actually faster than "hash-the-big-table", I'll start looking at improving on your proposal to change the costing to favor hashing the smaller table more often.

	Thanks,

		Stephen
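P.S. The gist of the change, paraphrased as a standalone helper rather than the actual diff against ExecChooseHashTableSize() (the real code also clamps nbuckets to a power of two and to sane limits):

/* Hypothetical paraphrase of the bucket-count choice; not the actual patch. */
static int
sketch_choose_nbuckets(double ntuples, int tupsize, long hash_table_bytes,
                       int ntup_per_bucket)
{
    double      dbuckets;

    if (ntuples * tupsize <= hash_table_bytes)
    {
        /* whole inner relation fits in work_mem (single batch): aim for
         * roughly one tuple per bucket */
        dbuckets = ntuples;
    }
    else
    {
        /* otherwise keep the existing NTUP_PER_BUCKET-based sizing */
        dbuckets = ntuples / ntup_per_bucket;
    }

    if (dbuckets < 1.0)
        dbuckets = 1.0;

    return (int) dbuckets;
}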