Thread: Chain Hashing
Been following YouTube to study about Database Hash Join. https://www.youtube.com/watch?v=J0nbgXIarhc
HASH TABLE
Design Decision
#1: Hash Function → How to map a large key space into a smaller domain. → Trade-off between being fast vs. collision rate. Design Decision
#2: Hashing Scheme → How to handle key collisions after hashing. → Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.
Now I have some idea about the Hash function, but the math part I have no idea.
Hash Scheme is for handling collisions. Collision means that different input keys will have some output hash bits.
The following part is about the Chain Hashing.
Maintain a linked list of buckets for each slot in the hash table.
Resolve collisions by placing all elements with the same hash key into the same bucket.
→ To determine whether an element is present, hash to its bucket and scan for it.
→ Insertions and deletions are generalizations of lookups.
I still don't get it. Stackoverflow seems don't have good answers yet. So I come here, asking....
On Thu, May 6, 2021 at 9:48 PM Jian He <hejian.mark@gmail.com> wrote: > The following part is about the Chain Hashing. >> >> Maintain a linked list of buckets for each slot in the hash table. >> Resolve collisions by placing all elements with the same hash key into the same bucket. >> → To determine whether an element is present, hash to its bucket and scan for it. >> → Insertions and deletions are generalizations of lookups. > > > I still don't get it. Stackoverflow seems don't have good answers yet. So I come here, asking.... Not sure which part you're asking about, but here's an example. Suppose you want to put three objects (in our case tuples) into a hash table, where the first attribute is the key: ('AAA', 'Aardvark'), ('BBB', 'Bumblebee'), ('CCC', 'Cat') Suppose your hash table has 4 buckets. We can use the lower 2 bits to map hash values to bucket numbers. hash('AAA') = 0xfe7f3cba, hash('AAA') & 0x3 = 2 hash('BBB') = 0x87e3287b, hash('BBB') & 0x3 = 3 hash('CCC') = 0x194bcedf, hash('CCC') & 0x3 = 3 'BBB' and 'CCC' collided: they both want to be in bucket 3. To insert, a chaining hash table just add each object to the computed bucket's chain: +---+ | 0 | +---+ | 1 | +---+ | 2 |->('AAA', 'Aardvark') +---+ | 3 |->('BBB', 'Bumblebee')->('CCC', 'Cat') +---+ When looking up key 'CCC' during the "probe" phase of a hash join, we'll again compute hash('CCC') & 0x3 = 3, look in bucket 3, and then compare the key of every tuple in that list with 'CCC' to see if we can find any matches. That's called "lookup" in the text you quoted. It also mentions deletion, which is pretty much just lookup following by removing the matching entry from the list, but that's a general comment about hash tables. It doesn't apply to their use in hash joins: there is never a need to remove individual keys, we just build, probe and destroy the whole table. Another difference between general purpose hash tables such as you might find in a typical programming language standard library and the hash tables used to implement hash joins is the latter need to be able to tolerate duplicate keys, so the 'scan' of a bucket doesn't give up as soon as it finds a match (unless it's a semi-join): it normally has to emit all of the matches. PostgreSQL uses chaining for hash joins, but it also uses Robin Hood hash tables in some other places, including hash-based GROUP BY.