Hi all,
In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup
table. At the time, using CLZ was rejected because compiler/platform
support was not widespread enough to justify it. For other reasons, we
recently added pg_bitutils.h, which uses __builtin_clz() where available,
so it makes sense to revisit this. I modified the test in [1] (C files
attached), using two separate functions to test CLZ versus the
open-coded algorithm of pg_leftmost_one_pos32().

These are typical results on a recent Intel platform:

HEAD        5.55s
clz         4.51s
open-coded  9.67s

CLZ gives a nearly 20% speed boost on this microbenchmark. I suspect
that this microbenchmark is actually biased towards the lookup table
more than real-world workloads are, because it can monopolize the L1
cache. Sparing cache is possibly the more interesting reason to use
CLZ. The open-coded version is nearly twice as slow as HEAD, so it
makes sense to keep the current implementation as the default one, and
not use pg_leftmost_one_pos32() directly. However, with a small tweak,
we can reuse the lookup table in pg_bitutils.c instead of the bespoke
one used solely by AllocSetFreeIndex(), which also saves a couple of
cache lines. This is done in the attached patch.
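
To make the comparison concrete, here is a rough standalone sketch of
the three ways of computing the free-list index. It is not the patch
itself: the sketch_ names, the table built at runtime, and the 8-byte
minimum chunk size (SKETCH_MINBITS) are assumptions made for
illustration, and __builtin_clz() assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MINBITS 3		/* assumed: smallest chunk is 8 bytes */

/* Position of the highest set bit in a byte; entry 0 is never used. */
static uint8_t sketch_leftmost_one_pos[256];

/* Fill the table once (hypothetical helper; a real table would be static). */
static void
sketch_init_table(void)
{
	for (int i = 1; i < 256; i++)
	{
		int			pos = 0;

		for (int v = i; v > 1; v >>= 1)
			pos++;
		sketch_leftmost_one_pos[i] = (uint8_t) pos;
	}
}

/* Lookup-table variant: one branch, one byte-wide lookup. */
static int
sketch_index_table(size_t size)
{
	uint32_t	tsize;
	uint32_t	t;

	if (size <= (1u << SKETCH_MINBITS))
		return 0;
	tsize = (uint32_t) ((size - 1) >> SKETCH_MINBITS);
	assert(tsize < 65536);		/* only two bytes are ever inspected */
	t = tsize >> 8;
	/* the +1 is the "small tweak": bit position -> free-list index */
	return (t ? sketch_leftmost_one_pos[t] + 8
			: sketch_leftmost_one_pos[tsize]) + 1;
}

/* CLZ variant: no table, so no cache lines are touched. */
static int
sketch_index_clz(size_t size)
{
	if (size <= (1u << SKETCH_MINBITS))
		return 0;
	return 32 - __builtin_clz((uint32_t) ((size - 1) >> SKETCH_MINBITS));
}

/* Open-coded variant: scan down byte by byte, then one lookup. */
static int
sketch_index_open_coded(size_t size)
{
	uint32_t	word;
	int			shift = 32 - 8;

	if (size <= (1u << SKETCH_MINBITS))
		return 0;
	word = (uint32_t) ((size - 1) >> SKETCH_MINBITS);	/* nonzero here */
	while ((word >> shift) == 0)
		shift -= 8;
	return shift + sketch_leftmost_one_pos[(word >> shift) & 255] + 1;
}
```

Once sketch_init_table() has run, all three should return the same
index for every request size up to 8192 bytes; for example, a 9-byte
request maps to index 1 and an 8192-byte request to index 10. The
open-coded loop starts at the top byte, which is presumably why it
loses on small sizes like these.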
[1] https://www.postgresql.org/message-id/407d949e0907201811i13c73e18x58295566d27aadcc%40mail.gmail.com
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services