Thread: Comment fix and question about dshash.c
1. The return type of resize() function is void, so I propose part of the header comment to be removed: diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index b46f7c4cfd..b2b8fe60e1 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -672,9 +672,7 @@ delete_item(dshash_table *hash_table, dshash_table_item *item) /* * Grow the hash table if necessary to the requested number of buckets. The - * requested size must be double some previously observed size. Returns true - * if the table was successfully expanded or found to be big enough already - * (because another backend expanded it). + * requested size must be double some previously observed size. * * Must be called without any partition lock held. */ 2. Can anyone please explain this macro? /* Max entries before we need to grow. Half + quarter = 75% load factor. */ #define MAX_COUNT_PER_PARTITION(hash_table) \ (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \ BUCKETS_PER_PARTITION(hash_table->size_log2) / 4) I'm failing to understand why the maximum number of hash table entries in a partition should be smaller than the number of buckets in that partition. The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is obvious from this condition in dshash_find_or_insert() /* Check if we are getting too full. */ if (partition->count > MAX_COUNT_PER_PARTITION(hash_table)) -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote: > 1. The return type of resize() function is void, so I propose part of the > header comment to be removed: > > diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c > index b46f7c4cfd..b2b8fe60e1 100644 > --- a/src/backend/lib/dshash.c > +++ b/src/backend/lib/dshash.c > @@ -672,9 +672,7 @@ delete_item(dshash_table *hash_table, dshash_table_item *item) > > /* > * Grow the hash table if necessary to the requested number of buckets. The > - * requested size must be double some previously observed size. Returns true > - * if the table was successfully expanded or found to be big enough already > - * (because another backend expanded it). > + * requested size must be double some previously observed size. > * > * Must be called without any partition lock held. > */ Thanks, will fix. > 2. Can anyone please explain this macro? > > /* Max entries before we need to grow. Half + quarter = 75% load factor. */ > #define MAX_COUNT_PER_PARTITION(hash_table) \ > (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \ > BUCKETS_PER_PARTITION(hash_table->size_log2) / 4) > > I'm failing to understand why the maximum number of hash table entries in a > partition should be smaller than the number of buckets in that partition. > > The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is > obvious from this condition in dshash_find_or_insert() > > /* Check if we are getting too full. */ > if (partition->count > MAX_COUNT_PER_PARTITION(hash_table)) Are you saying there is a bug in this logic (which is nbuckets * 0.75 expressed as integer maths), or saying that 0.75 is not a good maximum load factor? I looked around at a couple of general purpose hash tables and saw that some used 0.75 and some used 1.0, as a wasted space-vs-collision trade-off. If I have my maths right[1], with 0.75 you expect to have 75 entries in ~53 buckets, but with 1.0 you expect to have 100 entries in ~64 buckets. It'd be a fair criticism that that's arbitrarily different to the choice made for hash joins, and dynahash's default is 1 (though it's a run-time parameter). [1] https://math.stackexchange.com/questions/470662/average-number-of-bins-occupied-when-throwing-n-balls-into-n-bins -- Thomas Munro http://www.enterprisedb.com
On Sat, Oct 27, 2018 at 10:03 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > space-vs-collision trade-off. If I have my maths right[1], with 0.75 > you expect to have 75 entries in ~53 buckets, but with 1.0 you expect > to have 100 entries in ~64 buckets. It'd be a fair criticism that Forgot to add: assuming 100 buckets, for illustration (the real number is a power of 2). -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote: > > Thanks, will fix. > > > 2. Can anyone please explain this macro? > > > > /* Max entries before we need to grow. Half + quarter = 75% load factor. */ > > #define MAX_COUNT_PER_PARTITION(hash_table) \ > > (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \ > > BUCKETS_PER_PARTITION(hash_table->size_log2) / 4) > > > > I'm failing to understand why the maximum number of hash table entries in a > > partition should be smaller than the number of buckets in that partition. > > > > The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is > > obvious from this condition in dshash_find_or_insert() > > > > /* Check if we are getting too full. */ > > if (partition->count > MAX_COUNT_PER_PARTITION(hash_table)) > > Are you saying there is a bug in this logic (which is nbuckets * 0.75 > expressed as integer maths), or saying that 0.75 is not a good maximum > load factor? I looked around at a couple of general purpose hash > tables and saw that some used 0.75 and some used 1.0, as a wasted > space-vs-collision trade-off. If I have my maths right[1], with 0.75 > you expect to have 75 entries in ~53 buckets, but with 1.0 you expect > to have 100 entries in ~64 buckets. I don't know how exactly you apply the [1] formula (what is "n" and what is "N" in your case?), but my consideration was much simpler: For example, if BUCKETS_PER_PARTITION returns 8 (power of 2 is expected here and also more convenient), then MAX_COUNT_PER_PARTITION returns 8 / 2 + 8 / 4 = 6. Thus the hashtable gets resized if we're going to add the 7th entry to the partition, i.e. we the number of entries in the partition is lower than the number of buckets. Is that o.k.? > [1] https://math.stackexchange.com/questions/470662/average-number-of-bins-occupied-when-throwing-n-balls-into-n-bins -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > > On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote: > > > > Are you saying there is a bug in this logic (which is nbuckets * 0.75 > > expressed as integer maths), or saying that 0.75 is not a good maximum > > load factor? I looked around at a couple of general purpose hash > > tables and saw that some used 0.75 and some used 1.0, as a wasted > > space-vs-collision trade-off. If I have my maths right[1], with 0.75 > > you expect to have 75 entries in ~53 buckets, but with 1.0 you expect > > to have 100 entries in ~64 buckets. > > I don't know how exactly you apply the [1] formula (what is "n" and what is > "N" in your case?), but my consideration was much simpler: For example, if > BUCKETS_PER_PARTITION returns 8 (power of 2 is expected here and also more > convenient), then MAX_COUNT_PER_PARTITION returns 8 / 2 + 8 / 4 = 6. Thus the > hashtable gets resized if we're going to add the 7th entry to the partition, > i.e. we the number of entries in the partition is lower than the number of > buckets. Is that o.k.? Well, it may be o.k. I've just checked what the fill factor means in hash index and it's also the number of entries divided by the number of buckets. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
On 10/27/2018 01:51 PM, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > >> Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> >>> On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote: >>> >>> Are you saying there is a bug in this logic (which is nbuckets * 0.75 >>> expressed as integer maths), or saying that 0.75 is not a good maximum >>> load factor? I looked around at a couple of general purpose hash >>> tables and saw that some used 0.75 and some used 1.0, as a wasted >>> space-vs-collision trade-off. If I have my maths right[1], with 0.75 >>> you expect to have 75 entries in ~53 buckets, but with 1.0 you expect >>> to have 100 entries in ~64 buckets. >> >> I don't know how exactly you apply the [1] formula (what is "n" and what is >> "N" in your case?), but my consideration was much simpler: For example, if >> BUCKETS_PER_PARTITION returns 8 (power of 2 is expected here and also more >> convenient), then MAX_COUNT_PER_PARTITION returns 8 / 2 + 8 / 4 = 6. Thus the >> hashtable gets resized if we're going to add the 7th entry to the partition, >> i.e. we the number of entries in the partition is lower than the number of >> buckets. Is that o.k.? > > Well, it may be o.k. I've just checked what the fill factor means in hash > index and it's also the number of entries divided by the number of buckets. > Using load factor ~0.75 is definitely the right thing to do. One way to interpret it is "average chain length" (or whatever is the approach in that particular hash table implementations) and one of the main points of hash tables is to eliminate linear scans. We could pick a value closer to 1.0, but experience says it's not worth it - it's easy to get a annoyingly long chains due to hash collisions, in exchange for fairly minimal space savings. That being said, I wonder if we should tweak NTUP_PER_BUCKET=1 in hash joins to a lower value. IIRC we ended up with 1.0 because originally it was set to 10.0, and we reduced it to 1.0 in 9.5 which gave us nice speedup. But I don't recall if we tried using even lower value (probably not). Maybe we don't need to do that because we only build the hash table at the very end, when we exactly know how many entries will it contain, so we don't need to do lookups and inserts at the same time, and we don't need to grow the hash table (at least in the non-parallel case). And we end up with 0.75 load factor on average, due to the doubling (the sizes are essentially uniformly distributed between 0.5+epsilon and 1.0-epsilon). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Oct 28, 2018 at 1:22 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 10/27/2018 01:51 PM, Antonin Houska wrote: > > Antonin Houska <ah@cybertec.at> wrote: > >> Thomas Munro <thomas.munro@enterprisedb.com> wrote: > >>> On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote: > >>> Are you saying there is a bug in this logic (which is nbuckets * 0.75 > >>> expressed as integer maths), or saying that 0.75 is not a good maximum > >>> load factor? I looked around at a couple of general purpose hash > >>> tables and saw that some used 0.75 and some used 1.0, as a wasted > >>> space-vs-collision trade-off. If I have my maths right[1], with 0.75 > >>> you expect to have 75 entries in ~53 buckets, but with 1.0 you expect > >>> to have 100 entries in ~64 buckets. > >> > >> I don't know how exactly you apply the [1] formula (what is "n" and what is > >> "N" in your case?), but my consideration was much simpler: For example, if > >> BUCKETS_PER_PARTITION returns 8 (power of 2 is expected here and also more > >> convenient), then MAX_COUNT_PER_PARTITION returns 8 / 2 + 8 / 4 = 6. Thus the > >> hashtable gets resized if we're going to add the 7th entry to the partition, > >> i.e. we the number of entries in the partition is lower than the number of > >> buckets. Is that o.k.? n balls = the keys being hashed and insert N bins = the hash table buckets Unless we know of some special properties of the hash function and input data, I believe we have to treat the hashes like uniformly distributed random numbers when making predictions, hence the balls-into-bins probability stuff that can tell you about the expected distribution. The expected number of occupied bins for 1 million balls into 1 million bins: select 1000000.0 * (1.0 - pow(1.0 - (1.0 / 1000000), 1000000.0)); -> 632120.7427683549057142050000000 The same thing by counting distinct positive random numbers modulo 1 million: select count(distinct (random() * 200000000)::int % 1000000) from generate_series(1, 1000000) ss; -> 632373, 632246, 631954, ... Distinct hashes of the first 1 million integers (arbitrary keys) modulo 1 million (buckets): select count(distinct abs(hashoid(n)) % 1000000) from generate_series(1, 1000000) ss(n); -> 632115 (For a moment I was confused about getting a higher number until I realised I need abs() to avoid doubling the effective number of buckets...) > > Well, it may be o.k. I've just checked what the fill factor means in hash > > index and it's also the number of entries divided by the number of buckets. > > > > Using load factor ~0.75 is definitely the right thing to do. One way to > interpret it is "average chain length" (or whatever is the approach in > that particular hash table implementations) and one of the main points > of hash tables is to eliminate linear scans. We could pick a value > closer to 1.0, but experience says it's not worth it - it's easy to get > a annoyingly long chains due to hash collisions, in exchange for fairly > minimal space savings. Using the hashes of the first 1 million integers, let's try putting them into different numbers of buckets, and see how many buckets we get with each chain length: WITH entries AS (SELECT abs(hashoid(generate_series(1, 1000000))) % NBUCKETS AS bucket), lengths AS (SELECT bucket, COUNT(*) AS chain_length FROM entries GROUP BY bucket) SELECT chain_length, COUNT(*) AS count FROM lengths GROUP BY chain_length ORDER BY 2 DESC; NBUCKETS = 1000000 (load factor 1.0) chain_length | count --------------+-------- 1 | 367537 2 | 184580 3 | 61109 4 | 15197 5 | 3079 6 | 509 7 | 95 8 | 7 9 | 2 NBUCKETS = 1333333 (load factor 0.75) chain_length | count --------------+-------- 1 | 472012 2 | 177174 3 | 44348 4 | 8361 5 | 1243 6 | 143 7 | 9 8 | 2 NBUCKETS = 2000000 (load factor 0.5) chain_length | count --------------+-------- 1 | 606451 2 | 151741 3 | 25226 4 | 3148 5 | 323 6 | 28 7 | 2 > That being said, I wonder if we should tweak NTUP_PER_BUCKET=1 in hash > joins to a lower value. IIRC we ended up with 1.0 because originally it > was set to 10.0, and we reduced it to 1.0 in 9.5 which gave us nice > speedup. But I don't recall if we tried using even lower value (probably > not). Maybe we don't need to do that because we only build the hash > table at the very end, when we exactly know how many entries will it > contain, so we don't need to do lookups and inserts at the same time, > and we don't need to grow the hash table (at least in the non-parallel > case). And we end up with 0.75 load factor on average, due to the > doubling (the sizes are essentially uniformly distributed between > 0.5+epsilon and 1.0-epsilon). Yeah, it would indeed be interesting to experiment with load factors lower than 1.0. Every link in the chain is a cache miss. In future we should work on mitigating that by prefetching during both building and probing (see nearby thread), but I suspect you can only do that effectively for the bucket header and perhaps the first tuple in the chain; if I'm right then longer chains will become even more expensive relative to short ones. By the way, aside from wasting memory, when you add extra buckets you make full/right outer hash joins slower because they scan all the buckets. -- Thomas Munro http://www.enterprisedb.com