Thread: popcount
Hi, Per request, I'd like to see about surfacing $Subject to SQL. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachment
David Fetter wrote: +Datum +byteapopcount(PG_FUNCTION_ARGS) +{ + bytea *t1 = PG_GETARG_BYTEA_PP(0); + int len, result; + + len = VARSIZE_ANY_EXHDR(t1); + result = pg_popcount(VARDATA_ANY(t1), len); + + PG_RETURN_INT32(result); +} The input may have more than 2 billion bits set to 1. The biggest possible result should be 8 billion for bytea (1 GB with all bits set to 1). So shouldn't this function return an int8? There is a pre-existing function in core: bit_length(bytea) returning int4, but it errors out for large values (in fact it's a SQL function that returns octet_length($1)*8). For the varbit case, it doesn't seem like it can hold so many bits, although I don't see the limit documented in https://www.postgresql.org/docs/current/datatype-bit.html Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
On Wed, Dec 30, 2020 at 05:27:04PM +0100, Daniel Verite wrote: > David Fetter wrote: > > +Datum > +byteapopcount(PG_FUNCTION_ARGS) > +{ > + bytea *t1 = PG_GETARG_BYTEA_PP(0); > + int len, result; > + > + len = VARSIZE_ANY_EXHDR(t1); > + result = pg_popcount(VARDATA_ANY(t1), len); > + > + PG_RETURN_INT32(result); > +} > > The input may have more than 2 billion bits set to 1. The biggest possible > result should be 8 billion for bytea (1 GB with all bits set to 1). > So shouldn't this function return an int8? It does now, and thanks for looking at this. > There is a pre-existing function in core: bit_length(bytea) returning int4, > but it errors out for large values (in fact it's a SQL function that returns > octet_length($1)*8). > > For the varbit case, it doesn't seem like it can hold so many bits, although > I don't see the limit documented in > https://www.postgresql.org/docs/current/datatype-bit.html Out of an abundance of caution, I changed that one, too. I'd noticed earlier that ceil() doesn't need to be quite as wasteful as the way I had it earlier, and fixed that with help from Andrew Gierth. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachment
On 2020-12-30 17:41, David Fetter wrote: >> The input may have more than 2 billion bits set to 1. The biggest possible >> result should be 8 billion for bytea (1 GB with all bits set to 1). >> So shouldn't this function return an int8? > It does now, and thanks for looking at this. The documentation still reflects the previous int4 return type (in two different spellings, too).
On Mon, Jan 11, 2021 at 03:50:54PM +0100, Peter Eisentraut wrote: > On 2020-12-30 17:41, David Fetter wrote: > > > The input may have more than 2 billion bits set to 1. The biggest possible > > > result should be 8 billion for bytea (1 GB with all bits set to 1). > > > So shouldn't this function return an int8? > > It does now, and thanks for looking at this. > > The documentation still reflects the previous int4 return type (in two > different spellings, too). Thanks for looking this over! Please find attached the next version with corrected documentation. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachment
On 2021-01-11 17:13, David Fetter wrote: > On Mon, Jan 11, 2021 at 03:50:54PM +0100, Peter Eisentraut wrote: >> On 2020-12-30 17:41, David Fetter wrote: >>>> The input may have more than 2 billion bits set to 1. The biggest possible >>>> result should be 8 billion for bytea (1 GB with all bits set to 1). >>>> So shouldn't this function return an int8? >>> It does now, and thanks for looking at this. >> >> The documentation still reflects the previous int4 return type (in two >> different spellings, too). > > Thanks for looking this over! > > Please find attached the next version with corrected documentation. The documentation entries should be placed in alphabetical order in their respective tables. For the bytea function, maybe choose a simpler example that a reader can compute in their head. Also for the test cases. In varbit.c: The comment says + * Returns the number of bits set in a bit array. but it should be "bit string". + int8 popcount; should be int64. Also, pg_popcount() returns uint64, not int64. Perhaps overflow is not possible here, but perhaps a small comment about why or an assertion could be appropriate. + p = VARBITS(arg1); Why not assign that in the declaration block as well? This comment: + /* + * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE) + * done to minimize branches and instructions. + */ I don't know what that is supposed to mean or why that kind of tuning would be necessary for a user-callable function. + popcount = pg_popcount((const char *)p, len); The "const" is probably not necessary. byteapopcount() looks better, but also needs int8 -> int64.
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > [ assorted nits ] At the level of bikeshedding ... I quite dislike using the name "popcount" for these functions. I'm aware that some C compilers provide primitives of that name, but I wouldn't expect a SQL programmer to know that; without that context the name seems pretty random and unintuitive. Moreover, it invites confusion with SQL's use of "pop" to abbreviate "population" in the statistical aggregates, such as var_pop(). Perhaps something along the lines of count_ones() or count_set_bits() would be more apropos. regards, tom lane
Peter Eisentraut wrote: > + /* > + * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE) > + * done to minimize branches and instructions. > + */ > > I don't know what that is supposed to mean or why that kind of tuning > would be necessary for a user-callable function. Also, the formula just below looks wrong in the current patch (v3): + /* + * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE) + * done to minimize branches and instructions. + */ + len = (VARBITLEN(arg1) + BITS_PER_BYTE + 1) / BITS_PER_BYTE; + p = VARBITS(arg1); + + popcount = pg_popcount((const char *)p, len); It should be len = (VARBITLEN(arg1) + BITS_PER_BYTE - 1) / BITS_PER_BYTE If we add 1 to BITS_PER_BYTE instead of subtracting 1, when VARBITLEN(arg1) is a multiple of BITS_PER_BYTE, "len" is one byte too large. The correct formula is already used in include/utils/varbit.h near the definition of VARBITLEN: #define VARBITTOTALLEN(BITLEN) (((BITLEN) + BITS_PER_BYTE-1)/BITS_PER_BYTE + \ VARHDRSZ + VARBITHDRSZ) Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
On Mon, Jan 18, 2021 at 10:34:10AM -0500, Tom Lane wrote: > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > > [ assorted nits ] fixed, I think. > At the level of bikeshedding ... I quite dislike using the name "popcount" > for these functions. I'm aware that some C compilers provide primitives > of that name, but I wouldn't expect a SQL programmer to know that; > without that context the name seems pretty random and unintuitive. > Moreover, it invites confusion with SQL's use of "pop" to abbreviate > "population" in the statistical aggregates, such as var_pop(). > > Perhaps something along the lines of count_ones() or count_set_bits() > would be more apropos. Done that way. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachment
On 2021-01-18 16:34, Tom Lane wrote: > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: >> [ assorted nits ] > > At the level of bikeshedding ... I quite dislike using the name "popcount" > for these functions. I'm aware that some C compilers provide primitives > of that name, but I wouldn't expect a SQL programmer to know that; > without that context the name seems pretty random and unintuitive. > Moreover, it invites confusion with SQL's use of "pop" to abbreviate > "population" in the statistical aggregates, such as var_pop(). I was thinking about that too, but according to <https://en.wikipedia.org/wiki/Hamming_weight>, popcount is an accepted high-level term, with "pop" also standing for "population".
On Tue, Jan 19, 2021 at 3:06 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > On 2021-01-18 16:34, Tom Lane wrote: > > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > >> [ assorted nits ] > > > > At the level of bikeshedding ... I quite dislike using the name "popcount" > > for these functions. I'm aware that some C compilers provide primitives > > of that name, but I wouldn't expect a SQL programmer to know that; > > without that context the name seems pretty random and unintuitive. > > Moreover, it invites confusion with SQL's use of "pop" to abbreviate > > "population" in the statistical aggregates, such as var_pop(). > > I was thinking about that too, but according to > <https://en.wikipedia.org/wiki/Hamming_weight>, popcount is an accepted > high-level term, with "pop" also standing for "population". Yeah, I am not sure that it's going to be good to invent our own name for this, although maybe. But at least I think we should make sure there are some good comments in an easily discoverable place. Some people seem to think every programmer in the universe should know what things like popcount() and fls() and ffs() and stuff like that are, but it's far from obvious and I often have to refresh my memory. Let's make it easy for someone to figure out, if they don't know already. Like just a comment that says "this returns the number of 1 bits in the integer supplied as an argument" or something can save somebody a lot of trouble. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 19, 2021 at 07:58:12AM -0500, Robert Haas wrote: > On Tue, Jan 19, 2021 at 3:06 AM Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: > > On 2021-01-18 16:34, Tom Lane wrote: > > > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > > >> [ assorted nits ] > > > > > > At the level of bikeshedding ... I quite dislike using the name "popcount" > > > for these functions. I'm aware that some C compilers provide primitives > > > of that name, but I wouldn't expect a SQL programmer to know that; > > > without that context the name seems pretty random and unintuitive. > > > Moreover, it invites confusion with SQL's use of "pop" to abbreviate > > > "population" in the statistical aggregates, such as var_pop(). > > > > I was thinking about that too, but according to > > <https://en.wikipedia.org/wiki/Hamming_weight>, popcount is an accepted > > high-level term, with "pop" also standing for "population". > > Yeah, I am not sure that it's going to be good to invent our own > name for this, although maybe. But at least I think we should make > sure there are some good comments in an easily discoverable place. > Some people seem to think every programmer in the universe should > know what things like popcount() and fls() and ffs() and stuff like > that are, but it's far from obvious and I often have to refresh my > memory. Let's make it easy for someone to figure out, if they don't > know already. I went with count_set_bits() for the next version, and I hope that helps clarify what it does. > Like just a comment that says "this returns the number of 1 bits in > the integer supplied as an argument" or something can save somebody > a lot of trouble. You bring up an excellent point, which is that our builtin functions could use a lot more documentation directly to hand than they now have. For example, there's a lot of needless ambiguity created by function comments which leave it up in the air as to which positional argument does what in functions like string_to_array, which take multiple arguments. I'll try to get a patch in for the next CF with a fix for that, and a separate one that doesn't put it on people to use \df+ to find the comments we do provide. There have been proposals for including an optional space for COMMENT ON in DDL, but I suspect that those won't fly unless and until they make their way into the standard. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, 19 Jan 2021 at 11:38, David Fetter <david@fetter.org> wrote:
You bring up an excellent point, which is that our builtin functions
could use a lot more documentation directly to hand than they now
have. For example, there's a lot of needless ambiguity created by
function comments which leave it up in the air as to which positional
argument does what in functions like string_to_array, which take
multiple arguments. I'll try to get a patch in for the next CF with a
fix for that, and a separate one that doesn't put it on people to use
\df+ to find the comments we do provide. There have been proposals for
including an optional space for COMMENT ON in DDL, but I suspect that
those won't fly unless and until they make their way into the
standard.
Since you mention \df+, I wonder if this is the time to consider removing the source code from \df+ (since we have \sf) and adding in the proacl instead?
Hi David,
Just a nitpick:
+SET bytea_output TO hex;
Since we don't see the string in the output, I don't immediately see the reason to change the output format here?
Aside from that, this patch works as expected, and is ready for committer.
+SET bytea_output TO hex;
Since we don't see the string in the output, I don't immediately see the reason to change the output format here?
Aside from that, this patch works as expected, and is ready for committer.
On 18.03.21 13:51, John Naylor wrote: > Hi David, > > Just a nitpick: > > +SET bytea_output TO hex; > > Since we don't see the string in the output, I don't immediately see the > reason to change the output format here? > > Aside from that, this patch works as expected, and is ready for committer. I have now read the entire internet on what a suitable name for this function could be. I think the emerging winner is BIT_COUNT(), which already exists in MySQL, and also in Python (int.bit_count()) and Java (Integer.bitCount()).
On Sat, Mar 20, 2021 at 09:01:25PM +0100, Peter Eisentraut wrote: > On 18.03.21 13:51, John Naylor wrote: > > Hi David, > > > > Just a nitpick: > > > > +SET bytea_output TO hex; > > > > Since we don't see the string in the output, I don't immediately see the > > reason to change the output format here? That's how I got it to work. If there's a way to make it go without that, I'd be delighted to learn what it is :) > > Aside from that, this patch works as expected, and is ready for committer. > > I have now read the entire internet on what a suitable name for this > function could be. I think the emerging winner is BIT_COUNT(), which > already exists in MySQL, and also in Python (int.bit_count()) and Java > (Integer.bitCount()). Thanks for doing this tedious work. Please find attached the next version of the patch. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachment
On 21.03.21 02:31, David Fetter wrote: >> I have now read the entire internet on what a suitable name for this >> function could be. I think the emerging winner is BIT_COUNT(), which >> already exists in MySQL, and also in Python (int.bit_count()) and Java >> (Integer.bitCount()). > > Thanks for doing this tedious work. Please find attached the next > version of the patch. committing, with some polishing
On Tue, Mar 23, 2021 at 10:51:08AM +0100, Peter Eisentraut wrote: > On 21.03.21 02:31, David Fetter wrote: > > > I have now read the entire internet on what a suitable name for this > > > function could be. I think the emerging winner is BIT_COUNT(), which > > > already exists in MySQL, and also in Python (int.bit_count()) and Java > > > (Integer.bitCount()). > > > > Thanks for doing this tedious work. Please find attached the next > > version of the patch. > > committing, with some polishing Thanks! Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate