Thread: Horribly slow hash join
Note the time for the hash join step:

------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=357.62..26677.99 rows=93668 width=62) (actual time=741.159..443381.011 rows=49091 loops=1)
   Hash Cond: ("outer".work_today = "inner".work_units)
   ->  Hash Join  (cost=337.11..24784.11 rows=93668 width=54) (actual time=731.374..417188.519 rows=49091 loops=1)
         Hash Cond: ("outer".work_total = "inner".work_units)
         ->  Seq Scan on email_rank  (cost=0.00..22240.04 rows=254056 width=46) (actual time=582.145..1627.759 rows=49091 loops=1)
               Filter: (project_id = 8)
         ->  Hash  (cost=292.49..292.49 rows=17849 width=16) (actual time=148.944..148.944 rows=0 loops=1)
               ->  Seq Scan on rank_tie_overall o  (cost=0.00..292.49 rows=17849 width=16) (actual time=0.059..75.984 rows=17849 loops=1)
   ->  Hash  (cost=17.81..17.81 rows=1081 width=16) (actual time=8.996..8.996 rows=0 loops=1)
         ->  Seq Scan on rank_tie_today d  (cost=0.00..17.81 rows=1081 width=16) (actual time=0.080..4.635 rows=1081 loops=1)
 Total runtime: 619047.032 ms

By comparison:

stats=# set enable_hashjoin=false;
SET
stats=# explain analyze select * from email_rank, rank_tie_overall o, rank_tie_today d
        WHERE email_rank.work_today = d.work_units
          AND email_rank.work_total = o.work_units
          AND email_rank.project_id = :ProjectID;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=55391.69..56823.23 rows=93668 width=80) (actual time=2705.344..3349.318 rows=49091 loops=1)
   Merge Cond: ("outer".work_units = "inner".work_today)
   ->  Index Scan using work_units_today on rank_tie_today d  (cost=0.00..23.89 rows=1081 width=16) (actual time=0.150..4.874 rows=1081 loops=1)
   ->  Sort  (cost=55391.69..55625.86 rows=93668 width=64) (actual time=2705.153..2888.039 rows=49091 loops=1)
         Sort Key: email_rank.work_today
         ->  Merge Join  (cost=45047.64..47656.93 rows=93668 width=64) (actual time=1685.414..2494.342 rows=49091 loops=1)
               Merge Cond: ("outer".work_units = "inner".work_total)
               ->  Index Scan using work_units_overall on rank_tie_overall o  (cost=0.00..361.34 rows=17849 width=16) (actual time=0.122..79.383 rows=17849 loops=1)
               ->  Sort  (cost=45047.64..45682.78 rows=254056 width=48) (actual time=1685.228..1866.215 rows=49091 loops=1)
                     Sort Key: email_rank.work_total
                     ->  Seq Scan on email_rank  (cost=0.00..22240.04 rows=254056 width=48) (actual time=786.515..1289.101 rows=49091 loops=1)
                           Filter: (project_id = 8)
 Total runtime: 3548.087 ms

Even though the second case is only a select, it seems clear that something's wrong...

-- 
Jim C. Nasby, Database Consultant                  jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes: > Note the time for the hash join step: Have you ANALYZEd these tables lately? It looks to me like it's hashing on some column that has only a small number of distinct values, so that the hash doesn't actually help to avoid comparisons. The planner should know better than to choose such a plan, but if it's working with obsolete stats ... regards, tom lane
Yes, stats are up to date, and the values should be fairly unique.
Combined with the hash aggregate problem I saw (see my other email to
the list), do you think there could be some issue with the performance
of the hash function on FreeBSD 5.2 on AMD64?

I'll post the table you requested someplace you can grab it.

On Fri, Apr 16, 2004 at 12:34:11PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > Note the time for the hash join step:
>
> Have you ANALYZEd these tables lately?
>
> It looks to me like it's hashing on some column that has only a small
> number of distinct values, so that the hash doesn't actually help to
> avoid comparisons.  The planner should know better than to choose such
> a plan, but if it's working with obsolete stats ...
>
> 			regards, tom lane
-- 
Jim C. Nasby, Database Consultant                  jim@nasby.net
"Jim C. Nasby" <jim@nasby.net> writes: > Combined with the hash aggregate problem I saw (see my other email to > the list), do you think there could be some issue with the performance > of the hash function on FreeBSD 5.2 on AMD64? Yeah, I was wondering about that too. Hard to imagine what though. The hash function should be pretty platform-independent. regards, tom lane
[ resending because I fat-fingered the cc: to the list ]

I see the problem: all the entries in your work_units column have the
low 32 bits equal to zero.

regression=# select distinct work_units % (2^32)::bigint from Trank_work_overall;
 ?column?
----------
        0
(1 row)

The hash function for int8 only takes the low word into account, so all
of the entries end up on the same hash chain, resulting in worst-case
behavior.  This applies to both your hash join and hash aggregate cases.

We could change the hash function, perhaps, but then we'd just have
different cases where there's a problem ... hashing will always fail on
*some* set of inputs.  (Also, I have been harboring some notions of
supporting cross-type hash joins for integer types, which will not work
unless small int8 values hash the same as int4 etc.)

I guess the real issue is why are you encoding work_units like that?

			regards, tom lane
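To make the failure mode concrete, here is a minimal C sketch (illustrative only, not the actual backend code) of a hash that uses just the low 32 bits, applied to values like the work_units data above, whose low word is always zero:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: a hash that discards the high 32 bits of an int8. */
static uint32_t
hash_int8_low_word_only(int64_t val)
{
    return (uint32_t) val;          /* the high word never contributes */
}

int
main(void)
{
    /* values like the work_units data: low 32 bits all zero */
    int64_t  samples[] = { 1LL << 32, 2LL << 32, 3LL << 32, 700LL << 32 };
    uint32_t nbuckets = 1024;

    for (int i = 0; i < 4; i++)
        printf("%lld -> bucket %u\n", (long long) samples[i],
               hash_int8_low_word_only(samples[i]) % nbuckets);
    /* all four land in bucket 0: one long hash chain, worst-case behavior */
    return 0;
}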
I didn't follow the conversation from the beginning, but I imagine that you
could improve performance using the value (work_units % (2^32)::bigint)
instead of work_units.  You could even make an index on this value.  That
way, the hash function will work well.  This is not a good solution, but ...

For example:

create index ind1 on table1 ( (work_units % (2^32)::bigint) );
create index ind2 on table2 ( (work_units % (2^32)::bigint) );

select *
from table1
join table2
  on (table1.work_units % (2^32)::bigint) = (table2.work_units % (2^32)::bigint);

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Jim C. Nasby" <jim@nasby.net>
Cc: <pgsql-performance@postgreSQL.org>
Sent: Saturday, April 17, 2004 6:08 PM
Subject: Re: [PERFORM] Horribly slow hash join

> [ resending because I fat-fingered the cc: to the list ]
>
> I see the problem: all the entries in your work_units column have the
> low 32 bits equal to zero.
>
> regression=# select distinct work_units % (2^32)::bigint from Trank_work_overall;
>  ?column?
> ----------
>         0
> (1 row)
>
> The hash function for int8 only takes the low word into account, so all
> of the entries end up on the same hash chain, resulting in worst-case
> behavior.  This applies to both your hash join and hash aggregate cases.
>
> We could change the hash function, perhaps, but then we'd just have
> different cases where there's a problem ... hashing will always fail on
> *some* set of inputs.  (Also, I have been harboring some notions of
> supporting cross-type hash joins for integer types, which will not work
> unless small int8 values hash the same as int4 etc.)
>
> I guess the real issue is why are you encoding work_units like that?
>
> 			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> We could change the hash function, perhaps, but then we'd just have
> different cases where there's a problem ... hashing will always fail on
> *some* set of inputs.

Sure, but completely ignoring part of the input seems like an unfortunate
choice of hash function.

> (Also, I have been harboring some notions of supporting cross-type hash
> joins for integer types, which will not work unless small int8 values hash
> the same as int4 etc.)

The obvious way to modify the hash function is to xor the high 32 bits with
the low 32 bits.  That maintains the property you need and at least ensures
that all the bits are taken into account.

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> (Also, I have been harboring some notions of supporting cross-type hash
>> joins for integer types, which will not work unless small int8 values hash
>> the same as int4 etc.)

> The obvious way to modify the hash function is to xor the high 32 bits with
> the low 32 bits.  That maintains the property you need

No it doesn't ...

			regards, tom lane
On Sat, 17 Apr 2004, Tom Lane wrote:

> *some* set of inputs.  (Also, I have been harboring some notions of
> supporting cross-type hash joins for integer types, which will not work
> unless small int8 values hash the same as int4 etc.)

The simple solution would be to always extend integers to 64 bits (or
whatever the biggest integer is) before calculating the hash.  It makes
the hash function a little slower for smaller types, but it's mostly an
operation in the cpu with no memory involved, so it's probably not
noticeable.

-- 
/Dennis Björklund
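A small C sketch of what this suggestion amounts to (the function names here are invented for illustration, not the backend's): every integer type is widened to 64 bits and fed through one common hash, so the same numeric value hashes identically whatever its declared type:

#include <stdint.h>

/* Sketch only, with invented names: widen every integer type to 64 bits
 * before hashing, so that 7 as int2, int4 or int8 takes the same path. */
static uint32_t
hash_int64(int64_t val)
{
    /* any mixing that uses both halves will do for the sketch */
    return (uint32_t) val ^ (uint32_t) ((uint64_t) val >> 32);
}

static uint32_t hash_int2(int16_t v) { return hash_int64((int64_t) v); }
static uint32_t hash_int4(int32_t v) { return hash_int64((int64_t) v); }
static uint32_t hash_int8(int64_t v) { return hash_int64(v); }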
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> (Also, I have been harboring some notions of supporting cross-type hash
> >> joins for integer types, which will not work unless small int8 values hash
> >> the same as int4 etc.)
>
> > The obvious way to modify the hash function is to xor the high 32 bits with
> > the low 32 bits.  That maintains the property you need
>
> No it doesn't ...

Eh?  Oh, negative numbers?  So low^high^sign.

I wonder if it makes sense to check the hash distribution after generating
the table and, if it's bad, throw it away and try again with a different
hash function.  The "different hash function" would probably just be a
seed value changing.  Probably way overkill though.

-- 
greg
Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Sat, 17 Apr 2004, Tom Lane wrote:
>> *some* set of inputs.  (Also, I have been harboring some notions of
>> supporting cross-type hash joins for integer types, which will not work
>> unless small int8 values hash the same as int4 etc.)

> The simple solution would be to always extend integers to 64 bits (or
> whatever the biggest integer is) before calculating the hash.

That creates portability issues though.  We do not depend on there being
a 64-bit-int type for anything except int8 itself, and I don't want to
start doing so.

			regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
> Eh?  Oh, negative numbers?  So low^high^sign.

[ thinks about it... ]  Yeah, that would work.  We can't backpatch it
without breaking existing hash indexes on int8, but it'd be reasonable
to change for 7.5 (since at the rate things are going, we won't have
pg_upgrade for 7.5 anyway...)

> I wonder if it makes sense to check the hash distribution after generating
> the table and, if it's bad, throw it away and try again with a different
> hash function.  The "different hash function" would probably just be a
> seed value changing.  Probably way overkill though.

Yeah, it'd be a pain trying to get all the type-specific hash functions
doing that.  I'm also unconvinced that a simple change of seed value would
necessarily make the distribution better.  In the worst case, if the real
problem is that all the input values are identical, you can reseed all day
long and it won't fix it.

			regards, tom lane
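A C sketch of the low^high^sign fold being agreed on here (illustrative, not the committed code): XOR the high word into the low word, but compensate for sign extension so that a small int8 value still hashes the same as the equivalent int4 value:

#include <stdint.h>

/* Sketch of the low^high^sign idea; not the actual committed function. */
static uint32_t
hash_int8_folded(int64_t val)
{
    uint32_t lohalf = (uint32_t) val;
    uint32_t hihalf = (uint32_t) ((uint64_t) val >> 32);

    /* For values that fit in 32 bits, hihalf is 0 when val >= 0 and
     * 0xFFFFFFFF (sign extension) when val < 0.  XORing in hihalf for
     * non-negative values and ~hihalf for negative ones makes both cases
     * a no-op, so hash_int8_folded(x) == (uint32_t) x whenever x fits in
     * an int4, preserving the cross-type property, while larger values
     * still get all 64 bits mixed in. */
    lohalf ^= (val >= 0) ? hihalf : ~hihalf;

    return lohalf;
}

This is why the plain low^high XOR fails: for a small negative value the sign-extended high word is all ones, so the XOR would change the hash relative to the int4 version of the same number.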
On Sun, 18 Apr 2004, Tom Lane wrote:

> That creates portability issues though.  We do not depend on there being
> a 64-bit-int type for anything except int8 itself, and I don't want to
> start doing so.

What do you mean?  int8 is supported on all platforms, and if the hash
function converted all numbers to int8 before making the hash it would
work.  I don't see any portability problems.

-- 
/Dennis Björklund
Dennis Bjorklund <db@zigo.dhs.org> writes:
> What do you mean?  int8 is supported on all platforms

No it isn't.

			regards, tom lane
On Sun, 18 Apr 2004, Tom Lane wrote:

> > What do you mean?  int8 is supported on all platforms
>
> No it isn't.

So on platforms where it isn't, you would use int4 as the biggest int.
I don't really see that as a problem.  As long as you calculate the hash
on the biggest int on that platform, it should work.

-- 
/Dennis Björklund
On Sun, Apr 18, 2004 at 18:27:09 +0200,
  Dennis Bjorklund <db@zigo.dhs.org> wrote:
> On Sun, 18 Apr 2004, Tom Lane wrote:
> > > What do you mean?  int8 is supported on all platforms
> >
> > No it isn't.
>
> So on platforms where it isn't, you would use int4 as the biggest int.
> I don't really see that as a problem.  As long as you calculate the hash
> on the biggest int on that platform, it should work.

Another option would be to put the numbers into two int4s.  For int4 or
smaller types one of these would be zero.  int8s would be split between
the two.  The hash function would then be defined on the two int4s.
On Sun, 18 Apr 2004, Bruno Wolff III wrote:

> Another option would be to put the numbers into two int4s.  For int4 or
> smaller types one of these would be zero.  int8s would be split between
> the two.  The hash function would then be defined on the two int4s.

Sure, this is an internal calculation in the hash function.  The only
important thing is that the number 7 (for example) gives the same hash
value no matter if it is an int2 or an int8, and that the hash function
works well also for int8 numbers (which it does not today).

At least those were the properties I understood that we wanted.  We got
sidetracked into talking about what datatypes exist on all platforms;
that's not an issue at all.

-- 
/Dennis Björklund
Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Sun, 18 Apr 2004, Bruno Wolff III wrote:
> > Another option would be to put the numbers into two int4s.  For int4 or
> > smaller types one of these would be zero.  int8s would be split between
> > the two.  The hash function would then be defined on the two int4s.
>
> Sure, this is an internal calculation in the hash function.  The only
> important thing is that the number 7 (for example) gives the same hash
> value no matter if it is an int2 or an int8, and that the hash function
> works well also for int8 numbers (which it does not today).

What's missing here is that the actual API for hash functions is that the
data type provides a function that hashes to 32-bit integers.  Then the
hash code uses the 32-bit integer to crunch down to the actual number of
buckets (using mod).

The choice of 32-bit integers is purely arbitrary.  As long as it's larger
than the number of buckets in any sane hash table it's fine.  32 bits is
plenty.

I question the use of mod to crunch the hash value down though.  In the
case of int4 the mapping to 32 bits is simply the identity, so the entire
hash function ends up being simply "input mod #buckets".  It seems way too
easy to find real-world data sets where many numbers will all be multiples
of some number.  If that common divisor shares any factors with the number
of buckets, then the distribution will be very far from even, with many
empty buckets.

If the hash tables were made a power of two then it would be possible to
mix the bits of the 32-bit value and just mask off the unneeded bits.  I've
found one page via google that mentions mixing bits in a hash function, but
I would look for a more serious treatment somewhere:

http://burtleburtle.net/bob/hash/doobs.html

Incidentally, this text claims mod is extremely slow compared to bit
manipulations.  I don't know that that kind of cycle counting really is a
factor for postgres though.

Also, incidentally, this text is interesting:

http://www.isthe.com/chongo/tech/comp/fnv/

-- 
greg
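A toy C illustration of the concern about mod (the numbers are made up for the example, not taken from the thread): with the identity hash for int4 and input data that shares a factor with the bucket count, most buckets stay empty:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* identity hash for int4, then bucket = hash % nbuckets */
    const uint32_t nbuckets = 1024;
    int used[1024] = {0};
    int nused = 0;

    /* suppose every key is a multiple of 4 (e.g. ids handed out in blocks) */
    for (uint32_t v = 0; v < 100000; v += 4)
    {
        uint32_t bucket = v % nbuckets;
        if (!used[bucket])
        {
            used[bucket] = 1;
            nused++;
        }
    }
    printf("buckets used: %d of %u\n", nused, nbuckets);   /* 256 of 1024 */
    return 0;
}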
Greg Stark <gsstark@mit.edu> writes:
> If the hash tables were made a power of two then it would be possible to
> mix the bits of the 32-bit value and just mask off the unneeded bits.  I've
> found one page via google that mentions mixing bits in a hash function, but
> I would look for a more serious treatment somewhere:
> http://burtleburtle.net/bob/hash/doobs.html
> Incidentally, this text claims mod is extremely slow compared to bit
> manipulations.

Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
and is likely faster than any multiple-instruction way to do the same.

The quoted article seems to be by someone who has spent a lot of time
counting assembly cycles and none at all reading the last thirty years
worth of CS literature.  Knuth's treatment of hashing has some actual
math to it...

			regards, tom lane
Here's an interesting link that suggests that hyperthreading would be
much worse.

http://groups.google.com/groups?q=hyperthreading+dual+xeon+idle&start=10&hl=en&lr=&ie=UTF-8&c2coff=1&selm=aukkonen-FE5275.21093624062003%40shawnews.gv.shawcable.net&rnum=16

FWIW, I have anecdotal evidence that suggests this is the case: one of my
clients was seeing very large context switches with HTT turned on, and
without it things were much better.

Dave

On Mon, 2004-04-19 at 02:09, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > If the hash tables were made a power of two then it would be possible to
> > mix the bits of the 32-bit value and just mask off the unneeded bits.
> > I've found one page via google that mentions mixing bits in a hash
> > function, but I would look for a more serious treatment somewhere:
> > http://burtleburtle.net/bob/hash/doobs.html
> > Incidentally, this text claims mod is extremely slow compared to bit
> > manipulations.
>
> Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
> and is likely faster than any multiple-instruction way to do the same.
>
> The quoted article seems to be by someone who has spent a lot of time
> counting assembly cycles and none at all reading the last thirty years
> worth of CS literature.  Knuth's treatment of hashing has some actual
> math to it...
>
> 			regards, tom lane
-- 
Dave Cramer
519 939 0336
ICQ # 14675561
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
> > If the hash tables were made a power of two then it would be possible to
> > mix the bits of the 32-bit value and just mask off the unneeded bits.
> > I've found one page via google that mentions mixing bits in a hash
> > function, but I would look for a more serious treatment somewhere.
>
> Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
> and is likely faster than any multiple-instruction way to do the same.

Well, a) any modulus that has factors of two fails to mix in some bits,
and that's a lot more common than non powers of two; b) the postgres code
makes no attempt to make the number of buckets a prime; and c) even if the
number of buckets were prime, it seems it would still be too easy to find
real-world data where all the data have that prime as a factor.  As it is
they only need to have common factors to lose.

> The quoted article seems to be by someone who has spent a lot of time
> counting assembly cycles and none at all reading the last thirty years
> worth of CS literature.

Yes, well, I did note that.

-- 
greg
Dave Cramer <pg@fastcrypt.com> writes:
> Here's an interesting link that suggests that hyperthreading would be
> much worse.

Uh, this is the wrong thread.

-- 
greg
Dammit, I somehow deleted a bunch of replies to this.  Did a TODO ever
come out of this?

-- 
Jim C. Nasby, Database Consultant                  jim@nasby.net
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
> and is likely faster than any multiple-instruction way to do the same.
>
> The quoted article seems to be by someone who has spent a lot of time
> counting assembly cycles and none at all reading the last thirty years
> worth of CS literature.  Knuth's treatment of hashing has some actual
> math to it...

[Incidentally, I just found that the quoted article was independently found
by Bruce Momjian, who found it convincing enough to convert hash_any() over
to it two years ago.]

I just reviewed Knuth as well as C.L.R. and several papers from CACM and
SIGMOD.  It seems we have three options:

mod():
  Pro: empirical research shows it the best algorithm for avoiding collisions
  Con: requires the hash table be a prime size and far from a power of two.
       This is inconvenient to arrange for dynamic tables as used in postgres.

multiplication method (floor(tablesize * remainder(x * golden ratio))):
  Pro: works with any table size
  Con: I'm not clear if it can be done without floating point arithmetic.
       It seems like it ought to be possible though.

Universal hashing:
  Pro: won't create gotcha situations where the distribution of the data
       suddenly and unexpectedly creates performance problems.
  Con: more complex.

It would be trivial to switch the implementation from mod() to the
multiplicative method, which is more suited to postgres's needs.  However,
universal hashing has some appeal.  I hate the idea that a hash join could
perform perfectly well one day and suddenly become pessimal when new data
is loaded.

In a sense universal hashing is less predictable.  For a DSS system that
could be bad: a query that runs fine every day might fail one day in an
unpredictable way even though the data is unchanged.  But in another sense
it would be more predictable, in that if you run the query a thousand times
the average performance would be very close to the expected value regardless
of what the data is, whereas more traditional algorithms have some patterns
of data that will consistently perform badly.

It's probably not worth it, but postgres could maybe even be tricky and
pretest the parameters against the common values in the statistics table,
generating new ones if they fail to produce a nice distribution.  That
doesn't really guarantee anything though, except that those common values
would at least be well distributed to start with.

-- 
greg
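On the open question in the multiplication method's Con: a sketch of how the textbook technique is usually done in pure integer arithmetic (the constant is floor(2^32 / golden ratio); this illustrates the general method, it is not proposed postgres code):

#include <stdint.h>

/* Fixed-point multiplicative hashing: multiplying by floor(2^32/phi) and
 * letting the product wrap mod 2^32 is the integer equivalent of taking
 * the fractional part of x/phi; the high bits of the result then select
 * the bucket, so no floating point is needed.  It does assume a
 * power-of-two table size. */
static uint32_t
multiplicative_bucket(uint32_t hashvalue, int log2_nbuckets)
{
    return (hashvalue * 2654435769u) >> (32 - log2_nbuckets);
}

For example, with log2_nbuckets = 10 this maps any 32-bit hash value to one of 1024 buckets, and because the high bits of the product depend on all bits of the input, runs of multiples do not cluster the way they do with plain mod.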