Thread: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

From
Ranier Vilela
Date:
Hi,

IMO I believe that bitmapset can obtain an optimization in the calculation
of the WORDNUM and BITNUM macros.

As you know, in bitmapset, negative members are not allowed.

if (x < 0)
elog(ERROR, "negative bitmapset member not allowed");

Then, allow the compiler to optimize and do the calculations in unsigned.

I did a demonstration in compiler explorer.
https://godbolt.org/z/c68o43Eq1

It is demonstrated here as well.

Generated instructions: GCC 13.2

#define WORDNUM_f1(x) ((x) / BITS_PER_BITMAPWORD)
         mov eax, DWORD PTR [rbp-20]
         lea edx, [rax+63]
         test eax, eax
         cmovs eax, edx
         sar eax, 6
         mov DWORD PTR [rbp-4], eax

#define WORDNUM_f2(x) ((bitmapword) (x) / BITS_PER_BITMAPWORD)
         mov eax, DWORD PTR [rbp-20]
         cdqe
         shr rax, 6
         mov DWORD PTR [rbp-4], eax


#define BITNUM_f1(x) ((x) % BITS_PER_BITMAPWORD)
         mov edx, DWORD PTR [rbp-20]
         mov eax, edx
         sar eax, 31
         shr eax, 26
         add edx, eax
         and edx, 63
         sub edx, eax
         mov DWORD PTR [rbp-4], edx

#define BITNUM_f2(x) ((bitmapword) (x) % BITS_PER_BITMAPWORD)
        mov eax, DWORD PTR [rbp-20]
         and eax, 63
         mov DWORD PTR [rbp-4], eax

As you can see, when calculations are done in unsigned, 
the generated instructions are optimized.

Patch attached.

Best regards,
Ranier Vilela
Attachment

Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

From
Nathan Bossart
Date:
On Mon, Jan 29, 2024 at 01:30:47PM -0300, Ranier Vilela wrote:
> IMO I believe that bitmapset can obtain an optimization in the calculation
> of the WORDNUM and BITNUM macros.
> 
> As you know, in bitmapset, negative members are not allowed.
> 
> if (x < 0)
> elog(ERROR, "negative bitmapset member not allowed");
> 
> Then, allow the compiler to optimize and do the calculations in unsigned.

I'm currently +0.1 for this change.  I don't see any huge problem with
trimming a few instructions, but I'm dubious there's any measurable impact.
However, a cycle saved is a cycle earned...

-#define WORDNUM(x)    ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)    ((x) % BITS_PER_BITMAPWORD)
+#define WORDNUM(x)    ((bitmapword)(x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)    ((bitmapword)(x) % BITS_PER_BITMAPWORD)

I'm curious why we'd cast to bitmapword and not straight to uint32.  I
don't think the intent is that callers will provide a bitmapword to these
macros.  I also wonder if it's worth asserting that x is >= 0 before
casting here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

From
Ranier Vilela
Date:
Em seg., 29 de jan. de 2024 às 16:32, Nathan Bossart <nathandbossart@gmail.com> escreveu:
On Mon, Jan 29, 2024 at 01:30:47PM -0300, Ranier Vilela wrote:
> IMO I believe that bitmapset can obtain an optimization in the calculation
> of the WORDNUM and BITNUM macros.
>
> As you know, in bitmapset, negative members are not allowed.
>
> if (x < 0)
> elog(ERROR, "negative bitmapset member not allowed");
>
> Then, allow the compiler to optimize and do the calculations in unsigned.

I'm currently +0.1 for this change.  I don't see any huge problem with
trimming a few instructions, but I'm dubious there's any measurable impact.
However, a cycle saved is a cycle earned...
Bitmapset is extensively used.


-#define WORDNUM(x)     ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)      ((x) % BITS_PER_BITMAPWORD)
+#define WORDNUM(x)     ((bitmapword)(x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)      ((bitmapword)(x) % BITS_PER_BITMAPWORD)

I'm curious why we'd cast to bitmapword and not straight to uint32.  I
don't think the intent is that callers will provide a bitmapword to these
macros.
bitmapword It is the most correct and prudent option, if in the future, 
we decide to change the number of nwords to uint64.
 
  I also wonder if it's worth asserting that x is >= 0 before
casting here.
I don't think this would change anything.

best regards,
Ranier Vilela

Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

From
Nathan Bossart
Date:
On Mon, Jan 29, 2024 at 04:43:32PM -0300, Ranier Vilela wrote:
> Em seg., 29 de jan. de 2024 às 16:32, Nathan Bossart <
> nathandbossart@gmail.com> escreveu:
>> -#define WORDNUM(x)     ((x) / BITS_PER_BITMAPWORD)
>> -#define BITNUM(x)      ((x) % BITS_PER_BITMAPWORD)
>> +#define WORDNUM(x)     ((bitmapword)(x) / BITS_PER_BITMAPWORD)
>> +#define BITNUM(x)      ((bitmapword)(x) % BITS_PER_BITMAPWORD)
>>
>> I'm curious why we'd cast to bitmapword and not straight to uint32.  I
>> don't think the intent is that callers will provide a bitmapword to these
>> macros.
> 
> bitmapword It is the most correct and prudent option, if in the future,
> we decide to change the number of nwords to uint64.

If we change nwords to a uint64, I think there will be many other changes
required.  Using uint32 probably trims the instructions further on machines
with 64-bit pointers, too (e.g., cdqe).

>>   I also wonder if it's worth asserting that x is >= 0 before
>> casting here.
>>
> I don't think this would change anything.

Right, but it would offer another layer of protection against negative
integers in Bitmapsets.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

From
David Rowley
Date:
On Tue, 30 Jan 2024 at 08:32, Nathan Bossart <nathandbossart@gmail.com> wrote:
> I'm currently +0.1 for this change.  I don't see any huge problem with
> trimming a few instructions, but I'm dubious there's any measurable impact.
> However, a cycle saved is a cycle earned...

FWIW, In [1] and subsequent replies, there are several examples of
benchmarks where various bitmapset functions are sitting high in the
profiles. So I wouldn't be too surprised if such a small change to the
WORDNUM and BITNUM macros made a noticeable difference.

A benchmark speaks a thousand words, however.

David

[1] https://postgr.es/m/CAApHDvq9eq0W_aFUGrb6ba28ieuQN4zM5Uwqxy7+LMZjJc+VGg@mail.gmail.com



Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

From
Nathan Bossart
Date:
On Tue, Jan 30, 2024 at 11:23:57AM +1300, David Rowley wrote:
> On Tue, 30 Jan 2024 at 08:32, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I'm currently +0.1 for this change.  I don't see any huge problem with
>> trimming a few instructions, but I'm dubious there's any measurable impact.
>> However, a cycle saved is a cycle earned...
> 
> FWIW, In [1] and subsequent replies, there are several examples of
> benchmarks where various bitmapset functions are sitting high in the
> profiles. So I wouldn't be too surprised if such a small change to the
> WORDNUM and BITNUM macros made a noticeable difference.

Good to know, thanks.  If there is indeed demonstrable improvement, I'd
readily adjust my +0.1 to +1.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

From
Ranier Vilela
Date:

Em seg., 29 de jan. de 2024 às 19:40, Nathan Bossart <nathandbossart@gmail.com> escreveu:
On Tue, Jan 30, 2024 at 11:23:57AM +1300, David Rowley wrote:
> On Tue, 30 Jan 2024 at 08:32, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I'm currently +0.1 for this change.  I don't see any huge problem with
>> trimming a few instructions, but I'm dubious there's any measurable impact.
>> However, a cycle saved is a cycle earned...
>
> FWIW, In [1] and subsequent replies, there are several examples of
> benchmarks where various bitmapset functions are sitting high in the
> profiles. So I wouldn't be too surprised if such a small change to the
> WORDNUM and BITNUM macros made a noticeable difference.

Good to know, thanks.  If there is indeed demonstrable improvement, I'd
readily adjust my +0.1 to +1.
Following the suggestions, I did a quick test with one of the scripts.

Ubuntu 64 bits
gcc 12.3 64 bits

create table t1 (a int) partition by list(a);
select 'create table t1_'||x||' partition of t1 for values
in('||x||');' from generate_series(0,9)x;

test1.sql
select * from t1 where a > 1 and a < 3;

pgbench -U postgres -n -f test1.sql -T 15 postgres

head:

tps = 27983.182940
tps = 28916.903038
tps = 29051.878855

patched:

tps = 27517.301246
tps = 27848.684133
tps = 28669.367300


create table t2 (a int) partition by list(a);
select 'create table t2_'||x||' partition of t2 for values
in('||x||');' from generate_series(0,9999)x;

test2.sql
select * from t2 where a > 1 and a < 3;

pgbench -U postgres -n -f test2.sql -T 15 postgres

head:

tps = 27144.044463
tps = 28932.948620
tps = 29299.016248

patched:

tps = 27363.364039
tps = 28588.141586
tps = 28669.367300

To my complete surprise, the change is slower.
I can't say how, with fewer instructions, gcc makes the binary worse.

best regards,
Ranier Vilela