On Wed, Sep 24, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Here's a couple draft patches fixing the bug:
>
> - 0001 adds the missing size_t cast, to fix the overflow
> - 0002 fixes the balancing, by adjusting the hash table size limit
> - 0003 adds the missing overflow protection for nbuckets and the hash
> table limit
I've attached an alternative patch idea. It's shorter because I
avoided multiplication using algebraic equivalence. There are two
overflow checks for nbuckets and that max_pointers will be able to be
allocated, but otherwise, it avoids most of the overflow risk by
switching to division.
On the topic of max_pointers, I think there is no need to handle >
MaxAllocSize. We were willing to cope with the original values of
nbatch and nbuckets, we are just trying to optimize that. If doing so
would make us run out of memory for the arrays of poitners to buckets,
just use the last hashtable size that didn't have this problem. That
means we don't have to do any shenanigans to have powers of two for
nbuckets because we don't do clamping.
Also, I think the outer loop needs the condition nbatch > 1 not nbatch
> 0 -- when nbatch is 1, once we divide it by 2, it would end up as 0.
I'm still waiting for the 4billion row table to be created on my
machine, so I haven't verified that I get the same results as you yet.
> - 0004 rewords the comment explaining how the balancing works. Reading
> it after a couple months, I found it overly verbose / not very clear.
> I'm sure it could be improved even further.
My attached patch does build on your revised wording.
- Melanie