Re: Small and unlikely overflow hazard in bms_next_member() - Mailing list pgsql-hackers
| From | Chao Li |
|---|---|
| Subject | Re: Small and unlikely overflow hazard in bms_next_member() |
| Date | |
| Msg-id | D8422C13-E995-4D0D-B296-5D643DE89399@gmail.com Whole thread Raw |
| In response to | Re: Small and unlikely overflow hazard in bms_next_member() (David Rowley <dgrowleyml@gmail.com>) |
| Responses |
Re: Small and unlikely overflow hazard in bms_next_member()
|
| List | pgsql-hackers |
> On Apr 3, 2026, at 19:42, David Rowley <dgrowleyml@gmail.com> wrote: > > On Fri, 3 Apr 2026 at 16:24, Chao Li <li.evan.chao@gmail.com> wrote: >> I also did a load test with a standalone c program with 4 versions: > > I think you'd be better off picking a more realistic Bitmapset. The > code is doing: > > int words_to_alloc = 20000; // Large set to bypass CPU cache slightly > Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc * > sizeof(bitmapword)); > bms->nwords = words_to_alloc; > memset(bms->words, 0, words_to_alloc * sizeof(bitmapword)); > > /* Set a bit far into the set to force a long scan */ > int target_bit = (words_to_alloc - 1) * 64 + 10; > bms->words[words_to_alloc - 1] |= (1ULL << 10); > > A set with 20k words! Then you're doing: > > for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0); > > So you're not looping correctly over the set, as looping over a set > with 1 member requires 2 calls to the function. > > I'd say this unrealistically favours the unrolled loop version, as > most of the effort is in the loop looping over empty words and you can > do that without the extra bit-wise AND that the other versions have. > That version also seems to apply both the 64-bit fix and the INT_MAX > fix. Why both? Because you're only calling the function once per > iteration, it also means your || INT_MAX is only executed once, rather > than n_members + 1 as it would normally be executed. That'll make your > version seem better than it is. > > I fixed all that and changed the Bitmapset to something more realistic > with how it's more likely to be seen in PostgreSQL and ran the > benchmark with a proper bms_next_member loop. File attached and > results below: > > Zen2 Linux > > $ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next > Benchmarking 100000000 iterations... > > Original: 0.62113 seconds > David's: 0.70257 seconds > Chao's version: 1.70691 seconds > Unrolled loop: 1.56239 seconds > 2800000000 > > $ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next > Benchmarking 100000000 iterations... > > Original: 1.11263 seconds > David's: 0.87399 seconds > Chao's version: 0.52962 seconds > Unrolled loop: 1.07799 seconds > 2800000000 > > Apple M2: > > Benchmarking 100000000 iterations... > > Original: 0.64023 seconds > David's: 0.41087 seconds > Chao's version: 1.21113 seconds > Unrolled loop: 0.55924 seconds > 2800000000 > >> I’m curious that, when something performs differently across platforms, which platform should take priority? > > I don't think it matters too much for this case. If we were actually > working on a function that was a bottleneck, then it would be worth > looking deeper and seeing if the compilers are producing suboptimal > code and then try to coax them into making better code. As a last > resort, have two versions and some preprocessor to decide which one. > However, this just isn't performance-critical enough to warrant such > an effort. I think we're already going far too far with this. I just > want to fix the bug. If performance increases in some way as a result > of that, then good. > > If I sum up the times from each version of the above results, I get: > > Original: 2.37399 > David's: 1.98743 > Chao's: 3.44766 > Unrolled: 3.19962 > > I'm happy to go with the fastest one. I expect the one with the fewest > instructions is likely more important when running it in the planner, > as that's less to decode. I didn't look, but I expect your loop > unrolled version and your || INT_MAX version to have more > instructions. > > David > <test_bms_next2.c> In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc= 2, then on my MacBook M4, the unrolled version seems to win: ``` chaol@ChaodeMacBook-Air test % ./test_bms_next2 Benchmarking 100000000 iterations... Original: 0.61559 seconds David's: 0.61651 seconds Chao's version: 1.06033 seconds Unrolled loop: 0.60561 seconds 2800000000 ``` (I also checked on Godbolt, from the instruction-count perspective, the unrolled version actually looks the worst.) I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the bestchoice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we reallywant to study the performance more deeply, I think that would be a separate topic. It was fun working on this patch, and thank you for your patience. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
pgsql-hackers by date: