Thread: popcount

popcount

From
David Fetter
Date:
Hi,

Per request, I'd like to see about surfacing $Subject to SQL.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

Re: popcount

From
"Daniel Verite"
Date:
    David Fetter wrote:

+Datum
+byteapopcount(PG_FUNCTION_ARGS)
+{
+    bytea    *t1 = PG_GETARG_BYTEA_PP(0);
+    int        len, result;
+
+    len = VARSIZE_ANY_EXHDR(t1);
+    result = pg_popcount(VARDATA_ANY(t1), len);
+
+    PG_RETURN_INT32(result);
+}

The input may have more than 2 billion bits set to 1. The biggest possible
result should be 8 billion for bytea (1 GB with all bits set to 1).
So shouldn't this function return an int8?

There is a pre-existing function in core: bit_length(bytea) returning int4,
but it errors out for large values (in fact it's a SQL function that returns
octet_length($1)*8).

For the varbit case, it doesn't seem like it can hold so many bits, although
I don't see the limit documented in
https://www.postgresql.org/docs/current/datatype-bit.html


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: popcount

From
David Fetter
Date:
On Wed, Dec 30, 2020 at 05:27:04PM +0100, Daniel Verite wrote:
>     David Fetter wrote:
> 
> +Datum
> +byteapopcount(PG_FUNCTION_ARGS)
> +{
> +    bytea    *t1 = PG_GETARG_BYTEA_PP(0);
> +    int        len, result;
> +
> +    len = VARSIZE_ANY_EXHDR(t1);
> +    result = pg_popcount(VARDATA_ANY(t1), len);
> +
> +    PG_RETURN_INT32(result);
> +}
> 
> The input may have more than 2 billion bits set to 1. The biggest possible
> result should be 8 billion for bytea (1 GB with all bits set to 1).
> So shouldn't this function return an int8?

It does now, and thanks for looking at this.

> There is a pre-existing function in core: bit_length(bytea) returning int4,
> but it errors out for large values (in fact it's a SQL function that returns
> octet_length($1)*8).
> 
> For the varbit case, it doesn't seem like it can hold so many bits, although
> I don't see the limit documented in
> https://www.postgresql.org/docs/current/datatype-bit.html

Out of an abundance of caution, I changed that one, too.

I'd noticed earlier that ceil() doesn't need to be quite as wasteful
as the way I had it earlier, and fixed that with help from Andrew
Gierth.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

Re: popcount

From
Peter Eisentraut
Date:
On 2020-12-30 17:41, David Fetter wrote:
>> The input may have more than 2 billion bits set to 1. The biggest possible
>> result should be 8 billion for bytea (1 GB with all bits set to 1).
>> So shouldn't this function return an int8?
> It does now, and thanks for looking at this.

The documentation still reflects the previous int4 return type (in two 
different spellings, too).



Re: popcount

From
David Fetter
Date:
On Mon, Jan 11, 2021 at 03:50:54PM +0100, Peter Eisentraut wrote:
> On 2020-12-30 17:41, David Fetter wrote:
> > > The input may have more than 2 billion bits set to 1. The biggest possible
> > > result should be 8 billion for bytea (1 GB with all bits set to 1).
> > > So shouldn't this function return an int8?
> > It does now, and thanks for looking at this.
> 
> The documentation still reflects the previous int4 return type (in two
> different spellings, too).

Thanks for looking this over!

Please find attached the next version with corrected documentation.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

Re: popcount

From
Peter Eisentraut
Date:
On 2021-01-11 17:13, David Fetter wrote:
> On Mon, Jan 11, 2021 at 03:50:54PM +0100, Peter Eisentraut wrote:
>> On 2020-12-30 17:41, David Fetter wrote:
>>>> The input may have more than 2 billion bits set to 1. The biggest possible
>>>> result should be 8 billion for bytea (1 GB with all bits set to 1).
>>>> So shouldn't this function return an int8?
>>> It does now, and thanks for looking at this.
>>
>> The documentation still reflects the previous int4 return type (in two
>> different spellings, too).
> 
> Thanks for looking this over!
> 
> Please find attached the next version with corrected documentation.

The documentation entries should be placed in alphabetical order in 
their respective tables.

For the bytea function, maybe choose a simpler example that a reader can 
compute in their head.  Also for the test cases.

In varbit.c:

The comment says

+ * Returns the number of bits set in a bit array.

but it should be "bit string".

+   int8        popcount;

should be int64.

Also, pg_popcount() returns uint64, not int64.  Perhaps overflow is not 
possible here, but perhaps a small comment about why or an assertion 
could be appropriate.

+   p = VARBITS(arg1);

Why not assign that in the declaration block as well?

This comment:

+   /*
+    * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
+    * done to minimize branches and instructions.
+    */

I don't know what that is supposed to mean or why that kind of tuning 
would be necessary for a user-callable function.

+   popcount = pg_popcount((const char *)p, len);

The "const" is probably not necessary.

byteapopcount() looks better, but also needs int8 -> int64.



Re: popcount

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> [ assorted nits ]

At the level of bikeshedding ... I quite dislike using the name "popcount"
for these functions.  I'm aware that some C compilers provide primitives
of that name, but I wouldn't expect a SQL programmer to know that;
without that context the name seems pretty random and unintuitive.
Moreover, it invites confusion with SQL's use of "pop" to abbreviate
"population" in the statistical aggregates, such as var_pop().

Perhaps something along the lines of count_ones() or count_set_bits()
would be more apropos.

            regards, tom lane



Re: popcount

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> +   /*
> +    * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
> +    * done to minimize branches and instructions.
> +    */
>
> I don't know what that is supposed to mean or why that kind of tuning
> would be necessary for a user-callable function.

Also, the formula just below looks wrong in the current patch  (v3):

+ /*
+  * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
+  * done to minimize branches and instructions.
+  */
+ len = (VARBITLEN(arg1) + BITS_PER_BYTE + 1) / BITS_PER_BYTE;
+ p = VARBITS(arg1);
+
+ popcount = pg_popcount((const char *)p, len);

It should be
 len = (VARBITLEN(arg1) + BITS_PER_BYTE - 1) / BITS_PER_BYTE

If we add 1 to BITS_PER_BYTE instead of subtracting 1, when
VARBITLEN(arg1) is a multiple of BITS_PER_BYTE, "len" is one byte too
large.

The correct formula is already used in include/utils/varbit.h near the
definition of VARBITLEN:

#define VARBITTOTALLEN(BITLEN)    (((BITLEN) + BITS_PER_BYTE-1)/BITS_PER_BYTE +
\
                                 VARHDRSZ +
VARBITHDRSZ)


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: popcount

From
David Fetter
Date:
On Mon, Jan 18, 2021 at 10:34:10AM -0500, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> > [ assorted nits ]

fixed, I think.

> At the level of bikeshedding ... I quite dislike using the name "popcount"
> for these functions.  I'm aware that some C compilers provide primitives
> of that name, but I wouldn't expect a SQL programmer to know that;
> without that context the name seems pretty random and unintuitive.
> Moreover, it invites confusion with SQL's use of "pop" to abbreviate
> "population" in the statistical aggregates, such as var_pop().
> 
> Perhaps something along the lines of count_ones() or count_set_bits()
> would be more apropos.

Done that way.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

Re: popcount

From
Peter Eisentraut
Date:
On 2021-01-18 16:34, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
>> [ assorted nits ]
> 
> At the level of bikeshedding ... I quite dislike using the name "popcount"
> for these functions.  I'm aware that some C compilers provide primitives
> of that name, but I wouldn't expect a SQL programmer to know that;
> without that context the name seems pretty random and unintuitive.
> Moreover, it invites confusion with SQL's use of "pop" to abbreviate
> "population" in the statistical aggregates, such as var_pop().

I was thinking about that too, but according to 
<https://en.wikipedia.org/wiki/Hamming_weight>, popcount is an accepted 
high-level term, with "pop" also standing for "population".



Re: popcount

From
Robert Haas
Date:
On Tue, Jan 19, 2021 at 3:06 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> On 2021-01-18 16:34, Tom Lane wrote:
> > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> >> [ assorted nits ]
> >
> > At the level of bikeshedding ... I quite dislike using the name "popcount"
> > for these functions.  I'm aware that some C compilers provide primitives
> > of that name, but I wouldn't expect a SQL programmer to know that;
> > without that context the name seems pretty random and unintuitive.
> > Moreover, it invites confusion with SQL's use of "pop" to abbreviate
> > "population" in the statistical aggregates, such as var_pop().
>
> I was thinking about that too, but according to
> <https://en.wikipedia.org/wiki/Hamming_weight>, popcount is an accepted
> high-level term, with "pop" also standing for "population".

Yeah, I am not sure that it's going to be good to invent our own name
for this, although maybe. But at least I think we should make sure
there are some good comments in an easily discoverable place. Some
people seem to think every programmer in the universe should know what
things like popcount() and fls() and ffs() and stuff like that are,
but it's far from obvious and I often have to refresh my memory. Let's
make it easy for someone to figure out, if they don't know already.
Like just a comment that says "this returns the number of 1 bits in
the integer supplied as an argument" or something can save somebody a
lot of trouble.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: popcount

From
David Fetter
Date:
On Tue, Jan 19, 2021 at 07:58:12AM -0500, Robert Haas wrote:
> On Tue, Jan 19, 2021 at 3:06 AM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
> > On 2021-01-18 16:34, Tom Lane wrote:
> > > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> > >> [ assorted nits ]
> > >
> > > At the level of bikeshedding ... I quite dislike using the name "popcount"
> > > for these functions.  I'm aware that some C compilers provide primitives
> > > of that name, but I wouldn't expect a SQL programmer to know that;
> > > without that context the name seems pretty random and unintuitive.
> > > Moreover, it invites confusion with SQL's use of "pop" to abbreviate
> > > "population" in the statistical aggregates, such as var_pop().
> >
> > I was thinking about that too, but according to
> > <https://en.wikipedia.org/wiki/Hamming_weight>, popcount is an accepted
> > high-level term, with "pop" also standing for "population".
> 
> Yeah, I am not sure that it's going to be good to invent our own
> name for this, although maybe. But at least I think we should make
> sure there are some good comments in an easily discoverable place.
> Some people seem to think every programmer in the universe should
> know what things like popcount() and fls() and ffs() and stuff like
> that are, but it's far from obvious and I often have to refresh my
> memory.  Let's make it easy for someone to figure out, if they don't
> know already.

I went with count_set_bits() for the next version, and I hope that
helps clarify what it does.

> Like just a comment that says "this returns the number of 1 bits in
> the integer supplied as an argument" or something can save somebody
> a lot of trouble.

You bring up an excellent point, which is that our builtin functions
could use a lot more documentation directly to hand than they now
have.  For example, there's a lot of needless ambiguity created by
function comments which leave it up in the air as to which positional
argument does what in functions like string_to_array, which take
multiple arguments. I'll try to get a patch in for the next CF with a
fix for that, and a separate one that doesn't put it on people to use
\df+ to find the comments we do provide. There have been proposals for
including an optional space for COMMENT ON in DDL, but I suspect that
those won't fly unless and until they make their way into the
standard.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: popcount

From
Isaac Morland
Date:
On Tue, 19 Jan 2021 at 11:38, David Fetter <david@fetter.org> wrote:
You bring up an excellent point, which is that our builtin functions
could use a lot more documentation directly to hand than they now
have.  For example, there's a lot of needless ambiguity created by
function comments which leave it up in the air as to which positional
argument does what in functions like string_to_array, which take
multiple arguments. I'll try to get a patch in for the next CF with a
fix for that, and a separate one that doesn't put it on people to use
\df+ to find the comments we do provide. There have been proposals for
including an optional space for COMMENT ON in DDL, but I suspect that
those won't fly unless and until they make their way into the
standard.

Since you mention \df+, I wonder if this is the time to consider removing the source code from \df+ (since we have \sf) and adding in the proacl instead?

Re: popcount

From
John Naylor
Date:
Hi David,

Just a nitpick:

+SET bytea_output TO hex;

Since we don't see the string in the output, I don't immediately see the reason to change the output format here?

Aside from that, this patch works as expected, and is ready for committer.
--
John Naylor
EDB: http://www.enterprisedb.com

Re: popcount

From
Peter Eisentraut
Date:
On 18.03.21 13:51, John Naylor wrote:
> Hi David,
> 
> Just a nitpick:
> 
> +SET bytea_output TO hex;
> 
> Since we don't see the string in the output, I don't immediately see the 
> reason to change the output format here?
> 
> Aside from that, this patch works as expected, and is ready for committer.

I have now read the entire internet on what a suitable name for this 
function could be.  I think the emerging winner is BIT_COUNT(), which 
already exists in MySQL, and also in Python (int.bit_count()) and Java 
(Integer.bitCount()).



Re: popcount

From
David Fetter
Date:
On Sat, Mar 20, 2021 at 09:01:25PM +0100, Peter Eisentraut wrote:
> On 18.03.21 13:51, John Naylor wrote:
> > Hi David,
> > 
> > Just a nitpick:
> > 
> > +SET bytea_output TO hex;
> > 
> > Since we don't see the string in the output, I don't immediately see the
> > reason to change the output format here?

That's how I got it to work. If there's a way to make it go without
that, I'd be delighted to learn what it is :)

> > Aside from that, this patch works as expected, and is ready for committer.
> 
> I have now read the entire internet on what a suitable name for this
> function could be.  I think the emerging winner is BIT_COUNT(), which
> already exists in MySQL, and also in Python (int.bit_count()) and Java
> (Integer.bitCount()).

Thanks for doing this tedious work. Please find attached the next
version of the patch.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

Re: popcount

From
Peter Eisentraut
Date:
On 21.03.21 02:31, David Fetter wrote:
>> I have now read the entire internet on what a suitable name for this
>> function could be.  I think the emerging winner is BIT_COUNT(), which
>> already exists in MySQL, and also in Python (int.bit_count()) and Java
>> (Integer.bitCount()).
> 
> Thanks for doing this tedious work. Please find attached the next
> version of the patch.

committing, with some polishing



Re: popcount

From
David Fetter
Date:
On Tue, Mar 23, 2021 at 10:51:08AM +0100, Peter Eisentraut wrote:
> On 21.03.21 02:31, David Fetter wrote:
> > > I have now read the entire internet on what a suitable name for this
> > > function could be.  I think the emerging winner is BIT_COUNT(), which
> > > already exists in MySQL, and also in Python (int.bit_count()) and Java
> > > (Integer.bitCount()).
> > 
> > Thanks for doing this tedious work. Please find attached the next
> > version of the patch.
> 
> committing, with some polishing

Thanks!

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate