Home > mailing lists

Re: Small and unlikely overflow hazard in bms_next_member() - Mailing list pgsql-hackers

From	Chao Li
Subject	Re: Small and unlikely overflow hazard in bms_next_member()
Date	April 3 16:28:03
Msg-id	D8422C13-E995-4D0D-B296-5D643DE89399@gmail.com Whole thread Raw
In response to	Re: Small and unlikely overflow hazard in bms_next_member() (David Rowley <dgrowleyml@gmail.com>)
Responses	Re: Small and unlikely overflow hazard in bms_next_member()
List	pgsql-hackers

Tree view


> On Apr 3, 2026, at 19:42, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 3 Apr 2026 at 16:24, Chao Li <li.evan.chao@gmail.com> wrote:
>> I also did a load test with a standalone c program with 4 versions:
>
> I think you'd be better off picking a more realistic Bitmapset. The
> code is doing:
>
>    int words_to_alloc = 20000; // Large set to bypass CPU cache slightly
>    Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc *
> sizeof(bitmapword));
>    bms->nwords = words_to_alloc;
>    memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
>
>    /* Set a bit far into the set to force a long scan */
>    int target_bit = (words_to_alloc - 1) * 64 + 10;
>    bms->words[words_to_alloc - 1] |= (1ULL << 10);
>
> A set with 20k words! Then you're doing:
>
>    for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0);
>
> So you're not looping correctly over the set, as looping over a set
> with 1 member requires 2 calls to the function.
>
> I'd say this unrealistically favours the unrolled loop version, as
> most of the effort is in the loop looping over empty words and you can
> do that without the extra bit-wise AND that the other versions have.
> That version also seems to apply both the 64-bit fix and the INT_MAX
> fix. Why both?  Because you're only calling the function once per
> iteration, it also means your || INT_MAX is only executed once, rather
> than n_members + 1 as it would normally be executed. That'll make your
> version seem better than it is.
>
> I fixed all that and changed the Bitmapset to something more realistic
> with how it's more likely to be seen in PostgreSQL and ran the
> benchmark with a proper bms_next_member loop. File attached and
> results below:
>
> Zen2 Linux
>
> $ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
>
> Original:  0.62113 seconds
> David's:          0.70257 seconds
> Chao's version:   1.70691 seconds
> Unrolled loop: 1.56239 seconds
> 2800000000
>
> $ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
>
> Original:  1.11263 seconds
> David's:          0.87399 seconds
> Chao's version:   0.52962 seconds
> Unrolled loop: 1.07799 seconds
> 2800000000
>
> Apple M2:
>
> Benchmarking 100000000 iterations...
>
> Original:  0.64023 seconds
> David's:          0.41087 seconds
> Chao's version:   1.21113 seconds
> Unrolled loop: 0.55924 seconds
> 2800000000
>
>> I’m curious that, when something performs differently across platforms, which platform should take priority?
>
> I don't think it matters too much for this case. If we were actually
> working on a function that was a bottleneck, then it would be worth
> looking deeper and seeing if the compilers are producing suboptimal
> code and then try to coax them into making better code. As a last
> resort, have two versions and some preprocessor to decide which one.
> However, this just isn't performance-critical enough to warrant such
> an effort. I think we're already going far too far with this. I just
> want to fix the bug. If performance increases in some way as a result
> of that, then good.
>
> If I sum up the times from each version of the above results, I get:
>
> Original: 2.37399
> David's: 1.98743
> Chao's: 3.44766
> Unrolled: 3.19962
>
> I'm happy to go with the fastest one. I expect the one with the fewest
> instructions is likely more important when running it in the planner,
> as that's less to decode. I didn't look, but I expect your loop
> unrolled version and your || INT_MAX version to have more
> instructions.
>
> David
> <test_bms_next2.c>

In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change
words_to_alloc= 2, then on my MacBook M4, the unrolled version seems to win: 
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...

Original:  0.61559 seconds
David's:      0.61651 seconds
Chao's version:      1.06033 seconds
Unrolled loop: 0.60561 seconds
2800000000
```

(I also checked on Godbolt, from the instruction-count perspective, the unrolled version actually looks the worst.)

I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the
bestchoice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we
reallywant to study the performance more deeply, I think that would be a separate topic. 

It was fun working on this patch, and thank you for your patience.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

pgsql-hackers by date:

From: Ashutosh Bapat
Date: 03 April, 16:12:58
Subject: Re: Better shared data structure management and resizable shared data structures

From: Andrew Dunstan
Date: 03 April, 16:36:31
Subject: Re: PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part

Re: Small and unlikely overflow hazard in bms_next_member() - Mailing list pgsql-hackers

Previous

Next