Re: Making empty Bitmapsets always be NULL - Mailing list pgsql-hackers

From David Rowley
Subject Re: Making empty Bitmapsets always be NULL
Date
Msg-id CAApHDvr5O41MuUjw0DQKqmAnv7QqvmLqXReEd5o4nXTzWp8-+w@mail.gmail.com
Whole thread Raw
In response to Making empty Bitmapsets always be NULL  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Making empty Bitmapsets always be NULL
List pgsql-hackers
On Wed, 1 Mar 2023 at 10:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> When I designed the Bitmapset module, I set things up so that an empty
> Bitmapset could be represented either by a NULL pointer, or by an
> allocated object all of whose bits are zero.  I've recently come to
> the conclusion that that was a bad idea and we should instead have
> a convention like the longstanding invariant for Lists: an empty
> list is represented by NIL and nothing else.

I know I'm late to the party here, but I just wanted to add that I
agree with this and also that I've never been a fan of having to
decide if it was safe to check for NULL when I needed performance or
if I needed to use bms_is_empty() because the set might have all its
words set to 0.

I suggest tightening the rule even further so instead of just empty
sets having to be represented as NULL, the rule should be that sets
should never contain any trailing zero words, which is effectively a
superset of the "empty is NULL" rule that you've just changed.

If we did this, then various functions can shake loose some crufty
code and various other functions become more optimal due to not having
to loop over trailing zero words needlessly. For example.

* bms_equal() and bms_compare() can precheck nwords match before
troubling themselves with looping over each member, and;
* bms_is_subset() can return false early if 'a' has more words than 'b', and;
* bms_subset_compare() becomes more simple as it does not have to look
for trailing 0 words, and;
* bms_nonempty_difference() can return true early if 'a' has more
words than 'b', plus no need to check for trailing zero words at the
end.

We can also chop the set down to size in; bms_intersect(),
bms_difference(), bms_int_members(), bms_del_members() and
bms_intersect() which saves looping needlessly over empty words when
various other BMS operations are performed later on the set, for
example, bms_next_member(), bms_prev_member, bms_copy(), etc.

The only reason I can think of that this would be a bad idea is that
if we want to add members again then we need to do repalloc().  If
we're only increasing the nwords back to what it had been on some
previous occasion then repalloc() is effectively a no-op, so I doubt
this will really be a net negative.  I think the effort we'll waste by
carrying around needless trailing zero words in most cases is likely
to outweigh the overhead of any no-op repallocs.  Take
bms_int_members(), for example, we'll purposefully 0 out all the
trailing words possibly having to read in new cachelines from RAM to
do so.  It would be better to leave having to read those in again
until we actually need to do something more useful with them, like
adding some new members to the set again. We'll have to dirty those
cachelines then anyway and we may have flushed those cachelines out of
the CPU cache by the time we get around to adding the new members
again.

I've coded this up in the attached and followed the lead in list.c and
added a function named check_bitmapset_invariants() which ensures the
above rule is followed. I think the code as it stands today should
really have something like that anyway.

The patch also optimizes sub-optimal newly added code which calls
bms_is_empty_internal() when we have other more optimal means to
determine if the set is empty or not.

David

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Rework LogicalOutputPluginWriterUpdateProgress
Next
From: Thomas Munro
Date:
Subject: Re: Understanding, testing and improving our Windows filesystem code