Thread: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)
Hi,
IMO I believe that bitmapset can obtain an optimization in the calculation
of the WORDNUM and BITNUM macros.
As you know, in bitmapset, negative members are not allowed.
if (x < 0)
elog(ERROR, "negative bitmapset member not allowed");
Then, allow the compiler to optimize and do the calculations in unsigned.
I did a demonstration in compiler explorer.
https://godbolt.org/z/c68o43Eq1
It is demonstrated here as well.
#define WORDNUM_f1(x) ((x) / BITS_PER_BITMAPWORD)
mov eax, DWORD PTR [rbp-20]
lea edx, [rax+63]
test eax, eax
cmovs eax, edx
sar eax, 6
mov DWORD PTR [rbp-4], eax
#define WORDNUM_f2(x) ((bitmapword) (x) / BITS_PER_BITMAPWORD)
mov eax, DWORD PTR [rbp-20]
cdqe
shr rax, 6
mov DWORD PTR [rbp-4], eax
#define BITNUM_f1(x) ((x) % BITS_PER_BITMAPWORD)
mov edx, DWORD PTR [rbp-20]
mov eax, edx
sar eax, 31
shr eax, 26
add edx, eax
and edx, 63
sub edx, eax
mov DWORD PTR [rbp-4], edx
#define BITNUM_f2(x) ((bitmapword) (x) % BITS_PER_BITMAPWORD)
mov eax, DWORD PTR [rbp-20]
and eax, 63
mov DWORD PTR [rbp-4], eax
IMO I believe that bitmapset can obtain an optimization in the calculation
of the WORDNUM and BITNUM macros.
As you know, in bitmapset, negative members are not allowed.
if (x < 0)
elog(ERROR, "negative bitmapset member not allowed");
Then, allow the compiler to optimize and do the calculations in unsigned.
I did a demonstration in compiler explorer.
https://godbolt.org/z/c68o43Eq1
It is demonstrated here as well.
Generated instructions: GCC 13.2
mov eax, DWORD PTR [rbp-20]
lea edx, [rax+63]
test eax, eax
cmovs eax, edx
sar eax, 6
mov DWORD PTR [rbp-4], eax
#define WORDNUM_f2(x) ((bitmapword) (x) / BITS_PER_BITMAPWORD)
mov eax, DWORD PTR [rbp-20]
cdqe
shr rax, 6
mov DWORD PTR [rbp-4], eax
#define BITNUM_f1(x) ((x) % BITS_PER_BITMAPWORD)
mov edx, DWORD PTR [rbp-20]
mov eax, edx
sar eax, 31
shr eax, 26
add edx, eax
and edx, 63
sub edx, eax
mov DWORD PTR [rbp-4], edx
#define BITNUM_f2(x) ((bitmapword) (x) % BITS_PER_BITMAPWORD)
mov eax, DWORD PTR [rbp-20]
and eax, 63
mov DWORD PTR [rbp-4], eax
As you can see, when calculations are done in unsigned,
the generated instructions are optimized.
Patch attached.
Best regards,
Ranier Vilela
Attachment
On Mon, Jan 29, 2024 at 01:30:47PM -0300, Ranier Vilela wrote: > IMO I believe that bitmapset can obtain an optimization in the calculation > of the WORDNUM and BITNUM macros. > > As you know, in bitmapset, negative members are not allowed. > > if (x < 0) > elog(ERROR, "negative bitmapset member not allowed"); > > Then, allow the compiler to optimize and do the calculations in unsigned. I'm currently +0.1 for this change. I don't see any huge problem with trimming a few instructions, but I'm dubious there's any measurable impact. However, a cycle saved is a cycle earned... -#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) -#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) +#define WORDNUM(x) ((bitmapword)(x) / BITS_PER_BITMAPWORD) +#define BITNUM(x) ((bitmapword)(x) % BITS_PER_BITMAPWORD) I'm curious why we'd cast to bitmapword and not straight to uint32. I don't think the intent is that callers will provide a bitmapword to these macros. I also wonder if it's worth asserting that x is >= 0 before casting here. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Em seg., 29 de jan. de 2024 às 16:32, Nathan Bossart <nathandbossart@gmail.com> escreveu:
On Mon, Jan 29, 2024 at 01:30:47PM -0300, Ranier Vilela wrote:
> IMO I believe that bitmapset can obtain an optimization in the calculation
> of the WORDNUM and BITNUM macros.
>
> As you know, in bitmapset, negative members are not allowed.
>
> if (x < 0)
> elog(ERROR, "negative bitmapset member not allowed");
>
> Then, allow the compiler to optimize and do the calculations in unsigned.
I'm currently +0.1 for this change. I don't see any huge problem with
trimming a few instructions, but I'm dubious there's any measurable impact.
However, a cycle saved is a cycle earned...
Bitmapset is extensively used.
-#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+#define WORDNUM(x) ((bitmapword)(x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((bitmapword)(x) % BITS_PER_BITMAPWORD)
I'm curious why we'd cast to bitmapword and not straight to uint32. I
don't think the intent is that callers will provide a bitmapword to these
macros.
bitmapword It is the most correct and prudent option, if in the future,
we decide to change the number of nwords to uint64.
I also wonder if it's worth asserting that x is >= 0 before
casting here.
I don't think this would change anything.
best regards,
Ranier Vilela
On Mon, Jan 29, 2024 at 04:43:32PM -0300, Ranier Vilela wrote: > Em seg., 29 de jan. de 2024 às 16:32, Nathan Bossart < > nathandbossart@gmail.com> escreveu: >> -#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) >> -#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) >> +#define WORDNUM(x) ((bitmapword)(x) / BITS_PER_BITMAPWORD) >> +#define BITNUM(x) ((bitmapword)(x) % BITS_PER_BITMAPWORD) >> >> I'm curious why we'd cast to bitmapword and not straight to uint32. I >> don't think the intent is that callers will provide a bitmapword to these >> macros. > > bitmapword It is the most correct and prudent option, if in the future, > we decide to change the number of nwords to uint64. If we change nwords to a uint64, I think there will be many other changes required. Using uint32 probably trims the instructions further on machines with 64-bit pointers, too (e.g., cdqe). >> I also wonder if it's worth asserting that x is >= 0 before >> casting here. >> > I don't think this would change anything. Right, but it would offer another layer of protection against negative integers in Bitmapsets. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, 30 Jan 2024 at 08:32, Nathan Bossart <nathandbossart@gmail.com> wrote: > I'm currently +0.1 for this change. I don't see any huge problem with > trimming a few instructions, but I'm dubious there's any measurable impact. > However, a cycle saved is a cycle earned... FWIW, In [1] and subsequent replies, there are several examples of benchmarks where various bitmapset functions are sitting high in the profiles. So I wouldn't be too surprised if such a small change to the WORDNUM and BITNUM macros made a noticeable difference. A benchmark speaks a thousand words, however. David [1] https://postgr.es/m/CAApHDvq9eq0W_aFUGrb6ba28ieuQN4zM5Uwqxy7+LMZjJc+VGg@mail.gmail.com
On Tue, Jan 30, 2024 at 11:23:57AM +1300, David Rowley wrote: > On Tue, 30 Jan 2024 at 08:32, Nathan Bossart <nathandbossart@gmail.com> wrote: >> I'm currently +0.1 for this change. I don't see any huge problem with >> trimming a few instructions, but I'm dubious there's any measurable impact. >> However, a cycle saved is a cycle earned... > > FWIW, In [1] and subsequent replies, there are several examples of > benchmarks where various bitmapset functions are sitting high in the > profiles. So I wouldn't be too surprised if such a small change to the > WORDNUM and BITNUM macros made a noticeable difference. Good to know, thanks. If there is indeed demonstrable improvement, I'd readily adjust my +0.1 to +1. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Em seg., 29 de jan. de 2024 às 19:40, Nathan Bossart <nathandbossart@gmail.com> escreveu:
On Tue, Jan 30, 2024 at 11:23:57AM +1300, David Rowley wrote:
> On Tue, 30 Jan 2024 at 08:32, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I'm currently +0.1 for this change. I don't see any huge problem with
>> trimming a few instructions, but I'm dubious there's any measurable impact.
>> However, a cycle saved is a cycle earned...
>
> FWIW, In [1] and subsequent replies, there are several examples of
> benchmarks where various bitmapset functions are sitting high in the
> profiles. So I wouldn't be too surprised if such a small change to the
> WORDNUM and BITNUM macros made a noticeable difference.
Good to know, thanks. If there is indeed demonstrable improvement, I'd
readily adjust my +0.1 to +1.
Following the suggestions, I did a quick test with one of the scripts.
Ubuntu 64 bits
gcc 12.3 64 bits
create table t1 (a int) partition by list(a);
select 'create table t1_'||x||' partition of t1 for values
in('||x||');' from generate_series(0,9)x;
gcc 12.3 64 bits
create table t1 (a int) partition by list(a);
select 'create table t1_'||x||' partition of t1 for values
in('||x||');' from generate_series(0,9)x;
test1.sql
select * from t1 where a > 1 and a < 3;
pgbench -U postgres -n -f test1.sql -T 15 postgres
head:
tps = 27983.182940
tps = 28916.903038
tps = 29051.878855
patched:
tps = 27517.301246
tps = 27848.684133
tps = 28669.367300
create table t2 (a int) partition by list(a);
select 'create table t2_'||x||' partition of t2 for values
in('||x||');' from generate_series(0,9999)x;
test2.sql
select * from t2 where a > 1 and a < 3;
pgbench -U postgres -n -f test2.sql -T 15 postgres
head:
tps = 27144.044463
tps = 28932.948620
tps = 29299.016248
patched:
tps = 27363.364039
tps = 28588.141586
tps = 28669.367300
head:
tps = 27144.044463
tps = 28932.948620
tps = 29299.016248
patched:
tps = 27363.364039
tps = 28588.141586
tps = 28669.367300
To my complete surprise, the change is slower.
I can't say how, with fewer instructions, gcc makes the binary worse.
I can't say how, with fewer instructions, gcc makes the binary worse.
best regards,
Ranier Vilela