Thread: Allow simplehash to use already-calculated hash values

Allow simplehash to use already-calculated hash values

From
Jeff Davis
Date:
The attached small patch adds new entry points to simplehash.h that
allow the caller to pass in the already-calculated hash value, so that
simplehash doesn't need to recalculate it.

This is helpful for Memory-Bounded Hash Aggregation[1], which uses the
hash value for multiple purposes. For instance, if the hash table is
full and the group is not already present in the hash table, it needs
to spill the tuple to disk. In that case, it would use the hash value
for the initial lookup, then to select the right spill partition.
Later, when it processes the batch, it will again need the same hash
value to perform a lookup. By separating the hash value calculation
from where it's used, we can avoid needlessly recalculating it for each
of these steps.

There is already an option for simplehash to cache the calculated hash
value and return it with the entry, but that doesn't quite fit the
need. The hash value is needed in cases where the lookup fails, because
that is when the tuple must be spilled; but if the lookup fails, it
returns NULL, discarding the calculated hash value.

I am including this patch separately from Hash Aggregation because it
is a small and independently-reviewable change.

In theory, this could add overhead for "SH_SCOPE extern" for callers
not specifying their own hash value, because it adds an extra external
function call. I looked at the generated LLVM and it's a simple tail
call, and I looked at the generated assembly and it's just an extra
jmp. I tested by doing a hash aggregation of 30M zeroes, which should
exercise that path a lot, and I didn't see any difference. Also, once
we actually use this for hash aggregation, there will be no "SH_SCOPE
extern" callers that don't specify the hash value anyway.

Regards,
    Jeff Davis

[1] 
https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel%40j-davis.com

Attachment

Re: Allow simplehash to use already-calculated hash values

From
Andres Freund
Date:
Hi,

On 2019-07-16 15:20:33 -0700, Jeff Davis wrote:
> The attached small patch adds new entry points to simplehash.h that
> allow the caller to pass in the already-calculated hash value, so that
> simplehash doesn't need to recalculate it.
> 
> This is helpful for Memory-Bounded Hash Aggregation[1], which uses the
> hash value for multiple purposes. For instance, if the hash table is
> full and the group is not already present in the hash table, it needs
> to spill the tuple to disk. In that case, it would use the hash value
> for the initial lookup, then to select the right spill partition.
> Later, when it processes the batch, it will again need the same hash
> value to perform a lookup. By separating the hash value calculation
> from where it's used, we can avoid needlessly recalculating it for each
> of these steps.

Makes sense to me.



> In theory, this could add overhead for "SH_SCOPE extern" for callers
> not specifying their own hash value, because it adds an extra external
> function call. I looked at the generated LLVM and it's a simple tail
> call, and I looked at the generated assembly and it's just an extra
> jmp.

How does it look for gcc? And was that with LTO enabled or not?

Is that still true when the hashtable is defined in a shared library, or
when you compile postgres as a PIE executable? I'm not sure that
compilers can optimize the external function call at least in the former
case, because the typical function resolution rules IIRC mean that
references to extern functions could be resolved to definitions in other
translation units, *even if* there's a definition in the same TU.

ISTM that it'd be best to just have a static inline helper function
employed both the hash-passing and the "traditional" insertion routines?
Then that problem ought to not exist anymore.


Greetings,

Andres Freund



Re: Allow simplehash to use already-calculated hash values

From
Jeff Davis
Date:
On Tue, 2019-07-16 at 15:46 -0700, Andres Freund wrote:
> ISTM that it'd be best to just have a static inline helper function
> employed both the hash-passing and the "traditional" insertion
> routines?
> Then that problem ought to not exist anymore.

Agreed, attached.

Regards,
    Jeff Davis


Attachment

Re: Allow simplehash to use already-calculated hash values

From
Andres Freund
Date:
Hi,

On 2019-07-17 11:17:46 -0700, Jeff Davis wrote:
> From a6aba8e53f7a36a42922add68098682c2c96683e Mon Sep 17 00:00:00 2001
> From: Jeff Davis <jdavis@postgresql.org>
> Date: Wed, 17 Jul 2019 10:52:15 -0700
> Subject: [PATCH] Allow simplehash to use already-calculated hash values.
> 
> Add _lookup_hash and _insert_hash functions for callers that have
> already calculated the hash value of the key.

I've not tested it, but this looks reasonable to me. Do you actually
need the lookup variant, or is that more for completeness?


> This is intended for use with hash algorithms that write to disk in
> partitions. The hash value can be calculated once, used to perform a
> lookup, used to select the partition, then written to the partition
> along with the tuple. When the tuple is read back, the hash value does
> not need to be recalculated.

nitpick^3: I'd s/This is intended for use/The immediate use-case is/


> +static inline    SH_ELEMENT_TYPE *
> +SH_INSERT_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash, bool *found)

I'd perhaps add a comment here along the lines of:

/*
 * This is a separate static inline function, so it can be reliably be inlined
 * into its wrapper functions even if SH_SCOPE is extern.
 */

Greetings,

Andres Freund



Re: Allow simplehash to use already-calculated hash values

From
Jeff Davis
Date:
On Wed, 2019-07-17 at 11:59 -0700, Andres Freund wrote:
> I've not tested it, but this looks reasonable to me. Do you actually
> need the lookup variant, or is that more for completeness?

Yes. If the hash table is full, I do a lookup. If not, I do an insert.

> nitpick^3: I'd s/This is intended for use/The immediate use-case is/

OK.

> > +static inline    SH_ELEMENT_TYPE *
> > +SH_INSERT_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32
> > hash, bool *found)
> 
> I'd perhaps add a comment here along the lines of:
> 
> /*
>  * This is a separate static inline function, so it can be reliably
> be inlined
>  * into its wrapper functions even if SH_SCOPE is extern.
>  */

Will do.

Regards,
    Jeff