Thread: Use compiler intrinsics for bit ops in hash

Use compiler intrinsics for bit ops in hash

From

David Fetter

Date:

14 January 2020, 17:35:53

Folks,

The recent patch for distinct windowing aggregates contained a partial
fix of the FIXME that didn't seem entirely right, so I extracted that
part, changed it to use compiler intrinsics, and submit it here.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

v1-0001-Use-compiler-intrinsics-for-bit-ops-in-hash.patch

Re: Use compiler intrinsics for bit ops in hash

From

Jesse Zhang

Date:

14 January 2020, 20:21:41

Hi David,

On Tue, Jan 14, 2020 at 9:36 AM David Fetter <david@fetter.org> wrote:
>
> Folks,
>
> The recent patch for distinct windowing aggregates contained a partial
> fix of the FIXME that didn't seem entirely right, so I extracted that
> part, changed it to use compiler intrinsics, and submit it here.

The changes in hash AM and SIMPLEHASH do look like a net positive
improvement. My biggest cringe might be in pg_bitutils:

> diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
> index 498e532308..cc9338da25 100644
> --- a/src/include/port/pg_bitutils.h
> +++ b/src/include/port/pg_bitutils.h
> @@ -145,4 +145,32 @@ pg_rotate_right32(uint32 word, int n)
>  return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
>  }
>
> +/* ceil(lg2(num)) */
> +static inline uint32
> +ceil_log2_32(uint32 num)
> +{
> + return pg_leftmost_one_pos32(num-1) + 1;
> +}
> +
> +static inline uint64
> +ceil_log2_64(uint64 num)
> +{
> + return pg_leftmost_one_pos64(num-1) + 1;
> +}
> +
> +/* calculate first power of 2 >= num
> + * per https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
> + * using BSR where available */
> +static inline uint32
> +next_power_of_2_32(uint32 num)
> +{
> + return ((uint32) 1) << (pg_leftmost_one_pos32(num-1) + 1);
> +}
> +
> +static inline uint64
> +next_power_of_2_64(uint64 num)
> +{
> + return ((uint64) 1) << (pg_leftmost_one_pos64(num-1) + 1);
> +}
> +
>  #endif /* PG_BITUTILS_H */
>

1. Is ceil_log2_64 dead code?

2. The new utilities added here (ceil_log2_32 and company,
next_power_of_2_32 and company) all require num > 1, but don't clearly
Assert (or at the very least document) so.

3. A couple of the callers can actively pass in an argument of 1, e.g.
from _hash_spareindex in hashutil.c, while some other callers are iffy
at best (simplehash.h maybe?)

4. It seems like you *really* would like an operation like LZCNT in x86
(first appearing in Haswell) that is well defined on zero input. ISTM
the alternatives are:

   a) Special case 1. That seems straightforward, but the branching cost
   on a seemingly unlikely condition seems to be a lot of performance
   loss

   b) Use architecture specific intrinsic (and possibly with CPUID
   shenanigans) like __builtin_ia32_lzcnt_u64 on x86 and use the CLZ
   intrinsic elsewhere. The CLZ GCC intrinsic seems to map to
   instructions that are well defined on zero in most ISA's other than
   x86, so maybe we can get away with special-casing x86?

Cheers,
Jesse

Re: Use compiler intrinsics for bit ops in hash

From

David Fetter

Date:

14 January 2020, 22:09:18

On Tue, Jan 14, 2020 at 12:21:41PM -0800, Jesse Zhang wrote:
> Hi David,
> 
> On Tue, Jan 14, 2020 at 9:36 AM David Fetter <david@fetter.org> wrote:
> >
> > Folks,
> >
> > The recent patch for distinct windowing aggregates contained a partial
> > fix of the FIXME that didn't seem entirely right, so I extracted that
> > part, changed it to use compiler intrinsics, and submit it here.
> 
> The changes in hash AM and SIMPLEHASH do look like a net positive
> improvement. My biggest cringe might be in pg_bitutils:

Thanks for looking at this!

> > diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
> > index 498e532308..cc9338da25 100644
> > --- a/src/include/port/pg_bitutils.h
> > +++ b/src/include/port/pg_bitutils.h
> > @@ -145,4 +145,32 @@ pg_rotate_right32(uint32 word, int n)
> >  return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
> >  }
> >
> > +/* ceil(lg2(num)) */
> > +static inline uint32
> > +ceil_log2_32(uint32 num)
> > +{
> > + return pg_leftmost_one_pos32(num-1) + 1;
> > +}
> > +
> > +static inline uint64
> > +ceil_log2_64(uint64 num)
> > +{
> > + return pg_leftmost_one_pos64(num-1) + 1;
> > +}
> > +
> > +/* calculate first power of 2 >= num
> > + * per https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
> > + * using BSR where available */
> > +static inline uint32
> > +next_power_of_2_32(uint32 num)
> > +{
> > + return ((uint32) 1) << (pg_leftmost_one_pos32(num-1) + 1);
> > +}
> > +
> > +static inline uint64
> > +next_power_of_2_64(uint64 num)
> > +{
> > + return ((uint64) 1) << (pg_leftmost_one_pos64(num-1) + 1);
> > +}
> > +
> >  #endif /* PG_BITUTILS_H */
> >
> 
> 1. Is ceil_log2_64 dead code?

Let's call it nascent code. I suspect there are places it could go, if
I look for them.  Also, it seemed silly to have one without the other.

> 2. The new utilities added here (ceil_log2_32 and company,
> next_power_of_2_32 and company) all require num > 1, but don't clearly
> Assert (or at the very least document) so.

Assert()ed.

> 3. A couple of the callers can actively pass in an argument of 1, e.g.
> from _hash_spareindex in hashutil.c, while some other callers are iffy
> at best (simplehash.h maybe?)

What would you recommend be done about this?

> 4. It seems like you *really* would like an operation like LZCNT in x86
> (first appearing in Haswell) that is well defined on zero input. ISTM
> the alternatives are:
> 
>    a) Special case 1. That seems straightforward, but the branching cost
>    on a seemingly unlikely condition seems to be a lot of performance
>    loss
> 
>    b) Use architecture specific intrinsic (and possibly with CPUID
>    shenanigans) like __builtin_ia32_lzcnt_u64 on x86 and use the CLZ
>    intrinsic elsewhere. The CLZ GCC intrinsic seems to map to
>    instructions that are well defined on zero in most ISA's other than
>    x86, so maybe we can get away with special-casing x86?

b) seems much more attractive. Is there some way to tilt the tools so
that this happens? What should I be reading up on?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

v2-0001-Use-compiler-intrinsics-for-bit-ops-in-hash.patch

Re: Use compiler intrinsics for bit ops in hash

From

Jesse Zhang

Date:

15 January 2020, 23:45:12

On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote:
> > The changes in hash AM and SIMPLEHASH do look like a net positive
> > improvement. My biggest cringe might be in pg_bitutils:
> >
> > 1. Is ceil_log2_64 dead code?
>
> Let's call it nascent code. I suspect there are places it could go, if
> I look for them.  Also, it seemed silly to have one without the other.
>

While not absolutely required, I'd like us to find at least one
place and start using it. (Clang also nags at me when we have
unused functions).

> On Tue, Jan 14, 2020 at 12:21:41PM -0800, Jesse Zhang wrote:
> > 4. It seems like you *really* would like an operation like LZCNT in x86
> > (first appearing in Haswell) that is well defined on zero input. ISTM
> > the alternatives are:
> >
> >    a) Special case 1. That seems straightforward, but the branching cost
> >    on a seemingly unlikely condition seems to be a lot of performance
> >    loss
> >
> >    b) Use architecture specific intrinsic (and possibly with CPUID
> >    shenanigans) like __builtin_ia32_lzcnt_u64 on x86 and use the CLZ
> >    intrinsic elsewhere. The CLZ GCC intrinsic seems to map to
> >    instructions that are well defined on zero in most ISA's other than
> >    x86, so maybe we can get away with special-casing x86?

i. We can detect LZCNT instruction by checking one of the
"extended feature" (EAX=80000001) bits using CPUID. Unlike the
"basic features" (EAX=1), extended feature flags have been more
vendor-specific, but fortunately it seems that the feature bit
for LZCNT is the same [1][2].

ii. We'll most likely still need to provide a fallback
implementation for processors that don't have LZCNT (either
because they are from a different vendor, or an older Intel/AMD
processor). I wonder if simply checking for 1 is "good enough".
Maybe a micro benchmark is in order?

> Is there some way to tilt the tools so that this happens?
We have a couple options here:

1. Use a separate object (a la our SSE 4.2 implemenation of
CRC). On Clang and GCC (I don't have MSVC at hand), -mabm or
-mlzcnt should cause __builtin_clz to generate the LZCNT
instruction, which is well defined on zero input. The default
configuration would translate __builtin_clz to code that
subtracts BSR from the width of the input, but BSR leaves the
destination undefined on zero input.

2. (My least favorite) use inline asm (a la our popcount
implementation).

> b) seems much more attractive. Is there some way to tilt the tools so
> that this happens? What should I be reading up on?

The enclosed references hopefully are good places to start. Let
me know if you have more ideas.

Cheers,
Jesse


References:

[1] "How to detect New Instruction support in the 4th generation Intel®
Core™ processor family"

https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family
[2] "Bit Manipulation Instruction Sets"
https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets

Re: Use compiler intrinsics for bit ops in hash

From

John Naylor

Date:

18 January 2020, 03:46:24

On Wed, Jan 15, 2020 at 6:09 AM David Fetter <david@fetter.org> wrote:
> [v2 patch]

Hi David,

I have a stylistic comment on this snippet:

- for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
- {
- if ((1 << i) <= metap->hashm_bsize)
- break;
- }
+ i =  pg_leftmost_one_pos32(metap->hashm_bsize);
  Assert(i > 0);
  metap->hashm_bmsize = 1 << i;
  metap->hashm_bmshift = i + BYTE_TO_BIT;

Naming the variable "i" made sense when it was a loop counter, but it
seems out of place now. Same with the Assert.

Also, this

+ * using BSR where available */

is not directly tied to anything in this function, or even in the
function it calls, and could get out of date easily.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Use compiler intrinsics for bit ops in hash

From

David Fetter

Date:

19 January 2020, 00:00:52

On Sat, Jan 18, 2020 at 11:46:24AM +0800, John Naylor wrote:
> On Wed, Jan 15, 2020 at 6:09 AM David Fetter <david@fetter.org> wrote:
> > [v2 patch]
> 
> Hi David,
> 
> I have a stylistic comment on this snippet:
> 
> - for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
> - {
> - if ((1 << i) <= metap->hashm_bsize)
> - break;
> - }
> + i =  pg_leftmost_one_pos32(metap->hashm_bsize);
>   Assert(i > 0);
>   metap->hashm_bmsize = 1 << i;
>   metap->hashm_bmshift = i + BYTE_TO_BIT;
> 
> Naming the variable "i" made sense when it was a loop counter, but it
> seems out of place now. Same with the Assert.

Fixed by removing the variable entirely.

> Also, this
> 
> + * using BSR where available */
> 
> is not directly tied to anything in this function, or even in the
> function it calls, and could get out of date easily.

Removed.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

v3-0001-Use-compiler-intrinsics-for-bit-ops-in-hash.patch

Re: Use compiler intrinsics for bit ops in hash

From

David Fetter

Date:

31 January 2020, 15:59:18

On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote:
> On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote:
> > > The changes in hash AM and SIMPLEHASH do look like a net positive
> > > improvement. My biggest cringe might be in pg_bitutils:
> > >
> > > 1. Is ceil_log2_64 dead code?
> >
> > Let's call it nascent code. I suspect there are places it could go, if
> > I look for them.  Also, it seemed silly to have one without the other.
> >
> 
> While not absolutely required, I'd like us to find at least one
> place and start using it. (Clang also nags at me when we have
> unused functions).

Done in the expanded patches attached.

> > On Tue, Jan 14, 2020 at 12:21:41PM -0800, Jesse Zhang wrote:
> > > 4. It seems like you *really* would like an operation like LZCNT in x86
> > > (first appearing in Haswell) that is well defined on zero input. ISTM
> > > the alternatives are:
> > >
> > >    a) Special case 1. That seems straightforward, but the branching cost
> > >    on a seemingly unlikely condition seems to be a lot of performance
> > >    loss
> > >
> > >    b) Use architecture specific intrinsic (and possibly with CPUID
> > >    shenanigans) like __builtin_ia32_lzcnt_u64 on x86 and use the CLZ
> > >    intrinsic elsewhere. The CLZ GCC intrinsic seems to map to
> > >    instructions that are well defined on zero in most ISA's other than
> > >    x86, so maybe we can get away with special-casing x86?
> 
> i. We can detect LZCNT instruction by checking one of the
> "extended feature" (EAX=80000001) bits using CPUID. Unlike the
> "basic features" (EAX=1), extended feature flags have been more
> vendor-specific, but fortunately it seems that the feature bit
> for LZCNT is the same [1][2].
> 
> ii. We'll most likely still need to provide a fallback
> implementation for processors that don't have LZCNT (either
> because they are from a different vendor, or an older Intel/AMD
> processor). I wonder if simply checking for 1 is "good enough".
> Maybe a micro benchmark is in order?

I'm not sure how I'd run one on the architectures we support.  What
I've done here is generalize our implementation to be basically like
LZCNT and TZCNT at the cost of a brief branch that might go away at
runtime.

> 2. (My least favorite) use inline asm (a la our popcount
> implementation).

Yeah, I'd like to fix that, but I kept the scope of this one
relatively narrow.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

Re: Use compiler intrinsics for bit ops in hash

From

David Fetter

Date:

26 February 2020, 08:12:24

On Fri, Jan 31, 2020 at 04:59:18PM +0100, David Fetter wrote:
> On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote:
> > On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote:
> > > > The changes in hash AM and SIMPLEHASH do look like a net positive
> > > > improvement. My biggest cringe might be in pg_bitutils:
> > > >
> > > > 1. Is ceil_log2_64 dead code?
> > >
> > > Let's call it nascent code. I suspect there are places it could go, if
> > > I look for them.  Also, it seemed silly to have one without the other.
> > >
> > 
> > While not absolutely required, I'd like us to find at least one
> > place and start using it. (Clang also nags at me when we have
> > unused functions).
> 
> Done in the expanded patches attached.

These bit-rotted a little, so I've updated them.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

Re: Use compiler intrinsics for bit ops in hash

From

David Fetter

Date:

27 February 2020, 05:56:40

On Wed, Feb 26, 2020 at 09:12:24AM +0100, David Fetter wrote:
> On Fri, Jan 31, 2020 at 04:59:18PM +0100, David Fetter wrote:
> > On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote:
> > > On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote:
> > > > > The changes in hash AM and SIMPLEHASH do look like a net positive
> > > > > improvement. My biggest cringe might be in pg_bitutils:
> > > > >
> > > > > 1. Is ceil_log2_64 dead code?
> > > >
> > > > Let's call it nascent code. I suspect there are places it could go, if
> > > > I look for them.  Also, it seemed silly to have one without the other.
> > > >
> > > 
> > > While not absolutely required, I'd like us to find at least one
> > > place and start using it. (Clang also nags at me when we have
> > > unused functions).
> > 
> > Done in the expanded patches attached.
> 
> These bit-rotted a little, so I've updated them.

05d8449e73694585b59f8b03aaa087f04cc4679a broke this patch set, so fix.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

Re: Use compiler intrinsics for bit ops in hash

From

John Naylor

Date:

27 February 2020, 06:41:49

On Thu, Feb 27, 2020 at 1:56 PM David Fetter <david@fetter.org> wrote:
> [v6 set]

Hi David,

In 0002, the pg_bitutils functions have a test (input > 0), and the
new callers ceil_log2_* and next_power_of_2_* have asserts. That seems
backward to me. I imagine some callers of bitutils will already know
the value > 0, and it's probably good to keep that branch out of the
lowest level functions. What do you think?

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Use compiler intrinsics for bit ops in hash

From

David Fetter

Date:

28 February 2020, 15:13:00

On Thu, Feb 27, 2020 at 02:41:49PM +0800, John Naylor wrote:
> On Thu, Feb 27, 2020 at 1:56 PM David Fetter <david@fetter.org> wrote:
> > [v6 set]
> 
> Hi David,
> 
> In 0002, the pg_bitutils functions have a test (input > 0), and the
> new callers ceil_log2_* and next_power_of_2_* have asserts. That seems
> backward to me.

To me, too, now that you mention it.  My thinking was a little fuzzed
by trying to accommodate platforms with intrinsics where clz is
defined for 0 inputs.

> I imagine some callers of bitutils will already know the value > 0,
> and it's probably good to keep that branch out of the lowest level
> functions. What do you think?

I don't know quite how smart compilers and CPUs are these days, so
it's unclear to me how often that branch would actually happen.

Anyhow, I'll get a revised patch set out later today.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Use compiler intrinsics for bit ops in hash

From

Jesse Zhang

Date:

02 March 2020, 20:45:21

Hi David,

On Wed, Feb 26, 2020 at 9:56 PM David Fetter <david@fetter.org> wrote:
>
> On Wed, Feb 26, 2020 at 09:12:24AM +0100, David Fetter wrote:
> > On Fri, Jan 31, 2020 at 04:59:18PM +0100, David Fetter wrote:
> > > On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote:
> > > > On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote:
> > > > > > The changes in hash AM and SIMPLEHASH do look like a net positive
> > > > > > improvement. My biggest cringe might be in pg_bitutils:
> > > > > >
> > > > > > 1. Is ceil_log2_64 dead code?
> > > > >
> > > > > Let's call it nascent code. I suspect there are places it could go, if
> > > > > I look for them.  Also, it seemed silly to have one without the other.
> > > > >
> > > >
> > > > While not absolutely required, I'd like us to find at least one
> > > > place and start using it. (Clang also nags at me when we have
> > > > unused functions).
> > >
> > > Done in the expanded patches attached.
I see that you've found use of it in dynahash, thanks!

The math in the new (from v4 to v6) patch is wrong: it yields
ceil_log2(1) = 1 or next_power_of_2(1) = 2. I can see that you lifted
the restriction of "num greater than one" for ceil_log2() in this patch
set, but it's now _more_ problematic to base those functions on
pg_leftmost_one_pos().

I'm not comfortable with your changes to pg_leftmost_one_pos() to remove
the restriction on word being non-zero. Specifically
pg_leftmost_one_pos() is made to return 0 on 0 input. While none of its
current callers (in HEAD) is harmed, this introduces muddy semantics:

1. pg_leftmost_one_pos is semantically undefined on 0 input: scanning
for a set bit in a zero word won't find it anywhere.

2. we can _try_ generalizing it to accommodate ceil_log2 by
extrapolating based on the invariant that BSR + LZCNT = 31 (or 63). In
that case, the extrapolation yields -1 for pg_leftmost_one_pos(0).

I'm not convinced that others on the list will be comfortable with the
generalization suggested in 2 above.

I've quickly put together a PoC patch on top of yours, which
re-implements ceil_log2 using LZCNT coupled with a CPUID check.
Thoughts?

Cheers,
Jesse

Attachment

0001-Use-LZCOUNT-when-possible.patch

Re: Use compiler intrinsics for bit ops in hash

From

John Naylor

Date:

04 March 2020, 08:59:32

On Thu, Feb 27, 2020 at 1:56 PM David Fetter <david@fetter.org> wrote:
>
> [v6 patch set]

Here I'm only looking at 0001. It needs rebasing, but it's trivial to
see what it does. I noticed in some places, you've replaced "long"
with uint64, but many are int64. I started making a list, but it got
too long, and I had to stop and ask: Is there a reason to change from
signed to unsigned for any of the ones that aren't directly related to
hashing code? Is there some larger pattern I'm missing?

-static long gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb);
-static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum);
+static uint64 gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb);
+static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, uint64 blocknum);

I believe these should actually use BlockNumber, if these refer to
relation blocks as opposed to temp file blocks (I haven't read the
code).

-exec_execute_message(const char *portal_name, long max_rows)
+exec_execute_message(const char *portal_name, uint64 max_rows)

The only call site of this function uses an int32, which gets its
value from pq_getmsgint, which returns uint32.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Use compiler intrinsics for bit ops in hash

From

John Naylor

Date:

04 March 2020, 09:46:48

On Tue, Mar 3, 2020 at 4:46 AM Jesse Zhang <sbjesse@gmail.com> wrote:
> The math in the new (from v4 to v6) patch is wrong: it yields
> ceil_log2(1) = 1 or next_power_of_2(1) = 2.

I think you're right.

> I can see that you lifted
> the restriction of "num greater than one" for ceil_log2() in this patch
> set, but it's now _more_ problematic to base those functions on
> pg_leftmost_one_pos().

> I'm not comfortable with your changes to pg_leftmost_one_pos() to remove
> the restriction on word being non-zero. Specifically
> pg_leftmost_one_pos() is made to return 0 on 0 input. While none of its
> current callers (in HEAD) is harmed, this introduces muddy semantics:
>
> 1. pg_leftmost_one_pos is semantically undefined on 0 input: scanning
> for a set bit in a zero word won't find it anywhere.

Right.

> I've quickly put together a PoC patch on top of yours, which
> re-implements ceil_log2 using LZCNT coupled with a CPUID check.
> Thoughts?

This patch seems to be making an assumption that an indirect function
call is faster than taking a branch (in inlined code) that the CPU
will almost always predict correctly. It would be nice to have some
numbers to compare. (against pg_count_leading_zeros_* using the "slow"
versions but statically inlined).

Stylistically, "8 * sizeof(num)" is a bit overly formal, since the
hard-coded number we want is in the name of the function.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Use compiler intrinsics for bit ops in hash

From

David Fetter

Date:

08 March 2020, 18:34:06

On Mon, Mar 02, 2020 at 12:45:21PM -0800, Jesse Zhang wrote:
> Hi David,
> 
> On Wed, Feb 26, 2020 at 9:56 PM David Fetter <david@fetter.org> wrote:
> >
> > On Wed, Feb 26, 2020 at 09:12:24AM +0100, David Fetter wrote:
> > > On Fri, Jan 31, 2020 at 04:59:18PM +0100, David Fetter wrote:
> > > > On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote:
> > > > > On Tue, Jan 14, 2020 at 2:09 PM David Fetter <david@fetter.org> wrote:
> > > > > > > The changes in hash AM and SIMPLEHASH do look like a net positive
> > > > > > > improvement. My biggest cringe might be in pg_bitutils:
> > > > > > >
> > > > > > > 1. Is ceil_log2_64 dead code?
> > > > > >
> > > > > > Let's call it nascent code. I suspect there are places it could go, if
> > > > > > I look for them.  Also, it seemed silly to have one without the other.
> > > > > >
> > > > >
> > > > > While not absolutely required, I'd like us to find at least one
> > > > > place and start using it. (Clang also nags at me when we have
> > > > > unused functions).
> > > >
> > > > Done in the expanded patches attached.
> I see that you've found use of it in dynahash, thanks!
> 
> The math in the new (from v4 to v6) patch is wrong: it yields
> ceil_log2(1) = 1 or next_power_of_2(1) = 2. I can see that you lifted
> the restriction of "num greater than one" for ceil_log2() in this patch
> set, but it's now _more_ problematic to base those functions on
> pg_leftmost_one_pos().
> 
> I'm not comfortable with your changes to pg_leftmost_one_pos() to remove
> the restriction on word being non-zero. Specifically
> pg_leftmost_one_pos() is made to return 0 on 0 input. While none of its
> current callers (in HEAD) is harmed, this introduces muddy semantics:
> 
> 1. pg_leftmost_one_pos is semantically undefined on 0 input: scanning
> for a set bit in a zero word won't find it anywhere.
> 
> 2. we can _try_ generalizing it to accommodate ceil_log2 by
> extrapolating based on the invariant that BSR + LZCNT = 31 (or 63). In
> that case, the extrapolation yields -1 for pg_leftmost_one_pos(0).
> 
> I'm not convinced that others on the list will be comfortable with the
> generalization suggested in 2 above.
> 
> I've quickly put together a PoC patch on top of yours, which
> re-implements ceil_log2 using LZCNT coupled with a CPUID check.
> Thoughts?

Per discussion on IRC with Andrew (RhodiumToad) Gierth:

The runtime detection means there's always an indirect call overhead
and no way to inline.  This is counter to what using compiler
intrinsics is supposed to do.

It's better to rely on the compiler, because:
(a) The compiler often knows whether the value can or can't be 0 and
    can therefore skip a conditional jump.
(b) If you're targeting a recent microarchitecture, the compiler can
    just use the right instruction.
(c) Even if the conditional branch is left in, it's not a big overhead.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Use compiler intrinsics for bit ops in hash

From

Jesse Zhang

Date:

08 March 2020, 23:44:21

Hi John,
Oops this email has been sitting in my outbox for 3 days...

On Wed, Mar 4, 2020 at 1:46 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> On Tue, Mar 3, 2020 at 4:46 AM Jesse Zhang <sbjesse@gmail.com> wrote:
> > I've quickly put together a PoC patch on top of yours, which
> > re-implements ceil_log2 using LZCNT coupled with a CPUID check.
> > Thoughts?
>
> This patch seems to be making an assumption that an indirect function
> call is faster than taking a branch (in inlined code) that the CPU
> will almost always predict correctly. It would be nice to have some
> numbers to compare. (against pg_count_leading_zeros_* using the "slow"
> versions but statically inlined).
>

Ah, how could I forget that... I ran a quick benchmark on my laptop, and
indeed, even though the GCC-generated code takes a hit on zero input
(Clang generates slightly different code that gives indistinguishable
runtime for zero and non-zero inputs), the inlined code (the function
input in my benchmark is never a constant literal so the branch does get
exercised at runtime) is still more than twice as fast as the function
call.

------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
BM_pfunc/0        1.57 ns         1.56 ns    447127265
BM_pfunc/1        1.56 ns         1.56 ns    449618696
BM_pfunc/8        1.57 ns         1.57 ns    443013856
BM_pfunc/64       1.57 ns         1.57 ns    448784369
BM_slow/0        0.602 ns        0.600 ns   1000000000
BM_slow/1        0.391 ns        0.390 ns   1000000000
BM_slow/8        0.392 ns        0.391 ns   1000000000
BM_slow/64       0.391 ns        0.390 ns   1000000000
BM_fast/0         1.47 ns         1.46 ns    477513921
BM_fast/1         1.47 ns         1.46 ns    473992040
BM_fast/8         1.46 ns         1.46 ns    474895755
BM_fast/64        1.47 ns         1.46 ns    477215268

For your amusement, I've attached the meat of the benchmark. To build
the code you can grab the repository at
https://github.com/d/glowing-chainsaw/tree/pfunc

> Stylistically, "8 * sizeof(num)" is a bit overly formal, since the
> hard-coded number we want is in the name of the function.

Oh yeah, overly generic code is indicative of the remnants of my C++
brain, will fix.

Cheers,
Jesse

Attachment

Re: Use compiler intrinsics for bit ops in hash

From

Jesse Zhang

Date:

09 March 2020, 00:29:25

Hi David,
On Sun, Mar 8, 2020 at 11:34 AM David Fetter <david@fetter.org> wrote:
>
> On Mon, Mar 02, 2020 at 12:45:21PM -0800, Jesse Zhang wrote:
> > Hi David,
>
> Per discussion on IRC with Andrew (RhodiumToad) Gierth:
>
> The runtime detection means there's always an indirect call overhead
> and no way to inline.  This is counter to what using compiler
> intrinsics is supposed to do.
>
> It's better to rely on the compiler, because:
> (a) The compiler often knows whether the value can or can't be 0 and
>     can therefore skip a conditional jump.

Yes, the compiler would know to eliminate the branch if the inlined
function is called with a literal argument, or it infers an invariant
from the context (like nesting inside a conditional block, or a previous
conditional "noreturn" path).

> (b) If you're targeting a recent microarchitecture, the compiler can
>     just use the right instruction.

I might be more conservative than you are on (b). The thought of
building a binary that cannot run "somewhere" where the compiler
supports by default still mortifies me.

> (c) Even if the conditional branch is left in, it's not a big overhead.
>

I 100% agree with (c), see benchmarking results upthread.

Cheers,
Jesse

Re: Use compiler intrinsics for bit ops in hash

From

David Rowley

Date:

11 March 2020, 23:42:25

On Sat, 29 Feb 2020 at 04:13, David Fetter <david@fetter.org> wrote:
>
> On Thu, Feb 27, 2020 at 02:41:49PM +0800, John Naylor wrote:
> > In 0002, the pg_bitutils functions have a test (input > 0), and the
> > new callers ceil_log2_* and next_power_of_2_* have asserts. That seems
> > backward to me.
>
> To me, too, now that you mention it.  My thinking was a little fuzzed
> by trying to accommodate platforms with intrinsics where clz is
> defined for 0 inputs.

Wouldn't it be better just to leave the existing definitions of the
pg_leftmost_one_pos* function alone?  It seems to me you're hacking
away at those just so you can support passing 1 to the new functions,
and that's giving you trouble now because you're doing num-1 to handle
the case where the number is already a power of 2. Which is
troublesome because 1-1 is 0, which you're trying to code around.

Isn't it better just to put in a run-time check for numbers that are
already a power of 2 and then get rid of the num - 1? Something like:

/*
 * pg_nextpow2_32
 *      Returns the next highest power of 2 of 'num', or 'num', if
it's already a
 *      power of 2.  'num' mustn't be 0 or be above UINT_MAX / 2.
 */
static inline uint32
pg_nextpow2_32(uint32 num)
{
    Assert(num > 0 && num <= UINT_MAX / 2);
    /* use some bitmasking tricks to see if only 1 bit is on */
    return (num & (num - 1)) == 0 ? num : ((uint32) 1) <<
(pg_leftmost_one_pos32(num) + 1);
}

I think you'll also want to mention the issue about numbers greater
than UINT_MAX / 2, as I've done above and also align your naming
conversion to what else is in that file.

I don't think Jesse's proposed solution is that great due to the
additional function call overhead for pg_count_leading_zeros_32(). The
(num & (num - 1)) == 0 I imagine will perform better, but I didn't
test it.

Also, wondering if you've looked at any of the other places where we
do "*= 2;" or "<<= 1;" inside a loop? There's quite a number that look
like candidates for using the new function.

Re: Use compiler intrinsics for bit ops in hash

From

John Naylor

Date:

12 March 2020, 09:59:25

On Thu, Mar 12, 2020 at 7:42 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> I don't think Jesse's proposed solution is that great due to the
> additional function call overhead for pg_count_leading_zeros_32(). The
> (num & (num - 1)) == 0 I imagine will perform better, but I didn't
> test it.

Right, I believe we've all landed on the same page about that. I see
two ways of doing next_power_of_2_32 without an indirect function
call, and leaving pg_leftmost_one_pos32 the same as it is now. I
haven't measured either yet (or tested for that matter):

static inline uint32
next_power_of_2_32(uint32 num)
{
    Assert(num > 0 && num <= UINT_MAX / 2);
    /* use some bitmasking tricks to see if only 1 bit is on */
    if (num & (num - 1)) == 0)
        return num;
    return ((uint32) 1) << (pg_leftmost_one_pos32(num) + 1)
}
OR
{
    Assert(num > 0 && num <= UINT_MAX / 2);
    return ((uint32) 1) << ceil_log2_32(num);
}

static inline uint32
ceil_log2_32(uint32 num)
{
    Assert(num > 0);
    if (num == 1)
        return 0;
    return pg_leftmost_one_pos32(num-1) + 1;
}

One naming thing I noticed: the name "next power of two" implies to me
num *= 2 for a power of two, not the same as the input. The latter
behavior is better called "ceil power of 2".

> Also, wondering if you've looked at any of the other places where we
> do "*= 2;" or "<<= 1;" inside a loop? There's quite a number that look
> like candidates for using the new function.

A brief look shows a few functions where this is done in a tight loop:

nodes/list.c:new_list
LWLockRegisterTranche
ensure_record_cache_typmod_slot_exists
pqCheckOutBufferSpace
ExecChooseHashTableSize
ExecHashBuildSkewHash
choose_nelem_alloc
init_htab
hash_estimate_size
hash_select_dirsize
AllocSetAlloc

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Use compiler intrinsics for bit ops in hash

From

David Rowley

Date:

07 April 2020, 11:40:52

On Thu, 12 Mar 2020 at 22:59, John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Thu, Mar 12, 2020 at 7:42 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > I don't think Jesse's proposed solution is that great due to the
> > additional function call overhead for pg_count_leading_zeros_32(). The
> > (num & (num - 1)) == 0 I imagine will perform better, but I didn't
> > test it.
>
> Right, I believe we've all landed on the same page about that. I see
> two ways of doing next_power_of_2_32 without an indirect function
> call, and leaving pg_leftmost_one_pos32 the same as it is now. I
> haven't measured either yet (or tested for that matter):

I've attached an updated patch.  It includes the modifications
mentioned above to pre-check for a power of 2 number with the bit
masking hack mentioned above. I also renamed the functions to be more
aligned to the other functions in pg_bitutils.h   I'm not convinced
pg_ceil_log2_* needs the word "ceil" in there.

I dropped the part of the patch that was changing longs to ints of a
known size. I went on and did some additional conversion in the 0003
patch. There are more laying around the code base, but I ended up
finding a bit to fix up than i had thought I would. e.g. various
places that repalloc() inside a loop that is multiplying the
allocation size by 2 each time.  The repalloc should be done at the
end, not during the loop.  I thought I might come back to those some
time in the future.

Is anyone able to have a look at this?

David

Attachment

Re: Use compiler intrinsics for bit ops in hash

From

John Naylor

Date:

07 April 2020, 12:16:47

On Tue, Apr 7, 2020 at 7:41 PM David Rowley <dgrowleyml@gmail.com> wrote:

> I've attached an updated patch.  It includes the modifications
> mentioned above to pre-check for a power of 2 number with the bit
> masking hack mentioned above. I also renamed the functions to be more
> aligned to the other functions in pg_bitutils.h   I'm not convinced
> pg_ceil_log2_* needs the word "ceil" in there.
>
> I dropped the part of the patch that was changing longs to ints of a
> known size. I went on and did some additional conversion in the 0003
> patch. There are more laying around the code base, but I ended up
> finding a bit to fix up than i had thought I would. e.g. various
> places that repalloc() inside a loop that is multiplying the
> allocation size by 2 each time.  The repalloc should be done at the
> end, not during the loop.  I thought I might come back to those some
> time in the future.
>
> Is anyone able to have a look at this?
>
> David

Hi David,

Overall looks good to me. Just a couple things I see:

It seems _hash_log2 is still in the tree, but has no callers?

- max_size = 8; /* semi-arbitrary small power of 2 */
- while (max_size < min_size + LIST_HEADER_OVERHEAD)
- max_size *= 2;
+ max_size = pg_nextpower2_32(Max(8, min_size + LIST_HEADER_OVERHEAD));

Minor nit: We might want to keep the comment that the number is
"semi-arbitrary" here as well.

- 'pg_waldump', 'scripts');
+ 'pg_validatebackup', 'pg_waldump', 'scripts');

This seems like a separate concern?

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Use compiler intrinsics for bit ops in hash

From

David Rowley

Date:

07 April 2020, 12:26:32

Hi John,

Thanks for having a look at this.

On Wed, 8 Apr 2020 at 00:16, John Naylor <john.naylor@2ndquadrant.com> wrote:
> Overall looks good to me. Just a couple things I see:
>
> It seems _hash_log2 is still in the tree, but has no callers?

Yeah, I left it in there since it was an external function.  Perhaps
we could rip it out and write something in the commit message that it
should be replaced with the newer functions.  Thinking of extension
authors here.

> - max_size = 8; /* semi-arbitrary small power of 2 */
> - while (max_size < min_size + LIST_HEADER_OVERHEAD)
> - max_size *= 2;
> + max_size = pg_nextpower2_32(Max(8, min_size + LIST_HEADER_OVERHEAD));
>
> Minor nit: We might want to keep the comment that the number is
> "semi-arbitrary" here as well.

I had dropped that as the 8 part was mentioned in the comment above:
"The minimum allocation is 8 ListCell units". I can put it back, I had
just thought it was overkill.

> - 'pg_waldump', 'scripts');
> + 'pg_validatebackup', 'pg_waldump', 'scripts');
>
> This seems like a separate concern?

That's required due to the #include "lib/simplehash.h" in
pg_validatebackup.c. I have to say, I didn't really take the time to
understand all the Perl code there, but without that change, I was
getting a link error when testing on Windows, and after I added
pg_validatebackup to that array, it worked.

David

Re: Use compiler intrinsics for bit ops in hash

From

John Naylor

Date:

07 April 2020, 13:16:17

On Tue, Apr 7, 2020 at 8:26 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Hi John,
>
> Thanks for having a look at this.
>
> On Wed, 8 Apr 2020 at 00:16, John Naylor <john.naylor@2ndquadrant.com> wrote:
> > Overall looks good to me. Just a couple things I see:
> >
> > It seems _hash_log2 is still in the tree, but has no callers?
>
> Yeah, I left it in there since it was an external function.  Perhaps
> we could rip it out and write something in the commit message that it
> should be replaced with the newer functions.  Thinking of extension
> authors here.

I'm not the best judge of where to draw the line for extensions, but
this function does have a name beginning with an underscore, which to
me is a red flag that it's internal in nature.

> > Minor nit: We might want to keep the comment that the number is
> > "semi-arbitrary" here as well.
>
> I had dropped that as the 8 part was mentioned in the comment above:
> "The minimum allocation is 8 ListCell units". I can put it back, I had
> just thought it was overkill.

Oh I see now, nevermind.

> > - 'pg_waldump', 'scripts');
> > + 'pg_validatebackup', 'pg_waldump', 'scripts');
> >
> > This seems like a separate concern?
>
> That's required due to the #include "lib/simplehash.h" in
> pg_validatebackup.c. I have to say, I didn't really take the time to
> understand all the Perl code there, but without that change, I was
> getting a link error when testing on Windows, and after I added
> pg_validatebackup to that array, it worked.

Hmm. Does pg_bitutils.h need something like this?

#ifndef FRONTEND
extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
#else
extern const uint8 pg_leftmost_one_pos[256];
extern const uint8 pg_rightmost_one_pos[256];
extern const uint8 pg_number_of_ones[256];
#endif

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Use compiler intrinsics for bit ops in hash

From

David Rowley

Date:

08 April 2020, 01:04:08

On Wed, 8 Apr 2020 at 01:16, John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Tue, Apr 7, 2020 at 8:26 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > Hi John,
> >
> > Thanks for having a look at this.
> >
> > On Wed, 8 Apr 2020 at 00:16, John Naylor <john.naylor@2ndquadrant.com> wrote:
> > > Overall looks good to me. Just a couple things I see:
> > >
> > > It seems _hash_log2 is still in the tree, but has no callers?
> >
> > Yeah, I left it in there since it was an external function.  Perhaps
> > we could rip it out and write something in the commit message that it
> > should be replaced with the newer functions.  Thinking of extension
> > authors here.
>
> I'm not the best judge of where to draw the line for extensions, but
> this function does have a name beginning with an underscore, which to
> me is a red flag that it's internal in nature.

OK. I've removed that function now and stuck a note in the commit
message to mention an alternative.

> Hmm. Does pg_bitutils.h need something like this?
>
> #ifndef FRONTEND
> extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
> extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
> extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
> #else
> extern const uint8 pg_leftmost_one_pos[256];
> extern const uint8 pg_rightmost_one_pos[256];
> extern const uint8 pg_number_of_ones[256];
> #endif

Yeah, looking at keywords.h, we hit this before in c2d1eea9e75.  Your
proposed fix works and is the same as in keywords.h, so I've gone with
that.

I've attached v8 of the patchset.

David

Attachment

Re: Use compiler intrinsics for bit ops in hash

From

John Naylor

Date:

08 April 2020, 03:06:21

On Wed, Apr 8, 2020 at 9:04 AM David Rowley <dgrowleyml@gmail.com> wrote:
> [v8]

Looks good to me, marked RFC.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Use compiler intrinsics for bit ops in hash

From

David Rowley

Date:

08 April 2020, 06:35:24

On Wed, 8 Apr 2020 at 15:06, John Naylor <john.naylor@2ndquadrant.com> wrote:
> Looks good to me, marked RFC.

Thanks a lot for reviewing those changes. I've now pushed all 3 of the patches.

David