Thread: always use runtime checks for CRC-32C instructions

always use runtime checks for CRC-32C instructions

From
Nathan Bossart
Date:
This is an offshoot of the "CRC32C Parallel Computation Optimization on
ARM" thread [0].  I intend for this to be a prerequisite patch set.

Presently, for the SSE 4.2 and ARMv8 CRC instructions used in the CRC32C
code for WAL records, etc., we first check if the intrinsics are available
with the default compiler flags.  If so, we only bother compiling the
implementation that uses those intrinsics.  If not, we also check whether
the intrinsics are available with some extra CFLAGS, and if they are, we
compile both the implementation that uses the intrinsics as well as a
fallback implementation that doesn't require any special instructions.
Then, at runtime, we check what's available in the hardware and choose the
appropriate CRC32C implementation.

The aforementioned other thread [0] aims to further optimize this code by
using another instruction that requires additional configure and/or runtime
checks.  $SUBJECT has been in the back of my mind for a while, but given
proposals to add further complexity to this code, I figured it might be a
good time to propose this simplification.  Specifically, I think we
shouldn't worry about trying to compile only the special instrinics
versions, and instead always try to build both and choose the appropriate
one at runtime.

AFAICT the trade-offs aren't too bad.  With some simple testing, I see that
the runtime check occurs once at startup, so I don't anticipate any
noticeable performance impact.  I suppose each process might need to do the
check in EXEC_BACKEND builds, but even so, I suspect the difference is
negligible.

I also see that the SSE 4.2 runtime check requires the CPUID instruction,
so we wouldn't use the instrinsics for hardware that supports SSE 4.2 but
not CPUID.  However, I'm not sure such hardware even exists.  Wikipedia
says that CPUID was introduced in 1993 [1], and meson.build appears to omit
the CPUID check when determining which CRC32C implementation to use.
Furthermore, meson.build alludes to problems with some of the CPUID-related
checks:

    # XXX: The configure.ac check for __cpuid() is broken, we don't copy that
    # here. To prevent problems due to two detection methods working, stop
    # checking after one.

Are there any other reasons that we should try to avoid the runtime check
when possible?

I've attached two patches.  0001 adds a debug message to the SSE 4.2
runtime check that matches the one already present for the ARMv8 check.
This message just notes whether the runtime check found that the special
CRC instructions are available.  0002 is a first attempt at $SUBJECT.  I've
tested it on both x86 and ARM, and it seems to work as intended.  You'll
notice that I'm still checking for the intrinsics with the default compiler
flags first.  I didn't see any strong reason to change this, and doing so
allows us to avoid sending extra CFLAGS when possible.

Thoughts?

[0] https://postgr.es/m/DB9PR08MB6991329A73923BF8ED4B3422F5DBA%40DB9PR08MB6991.eurprd08.prod.outlook.com
[1] https://en.wikipedia.org/wiki/CPUID

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: always use runtime checks for CRC-32C instructions

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> The aforementioned other thread [0] aims to further optimize this code by
> using another instruction that requires additional configure and/or runtime
> checks.  $SUBJECT has been in the back of my mind for a while, but given
> proposals to add further complexity to this code, I figured it might be a
> good time to propose this simplification.  Specifically, I think we
> shouldn't worry about trying to compile only the special instrinics
> versions, and instead always try to build both and choose the appropriate
> one at runtime.

On the one hand, I agree that we need to keep the complexity from
getting out of hand.  On the other hand, I wonder if this approach
isn't optimizing for the wrong case.  How many machines that PG 17
will ever be run on in production will lack SSE 4.2 (for Intel)
or ARMv8 instructions (on the ARM side)?  It seems like a shame
to be burdening these instructions with a subroutine call for the
benefit of long-obsolete hardware versions.  Maybe that overhead
is negligible, but it doesn't sound like you tried to measure it.

I'm not quite sure what to propose instead, though.  I thought for
a little bit about a configure switch to select "test first" or
"pedal to the metal".  But in practice, package builders would
probably have to select the conservative "test first" option; and
we know that the vast majority of modern installations use prebuilt
packages, so it's not clear that this answer would help many people.

Anyway, I agree that the cost of a one-time-per-process probe should
be negligible; it's the per-use cost that I worry about.  If you can
do some measurements proving that that worry is ill-founded, then
I'm good with test-first.

            regards, tom lane



Re: always use runtime checks for CRC-32C instructions

From
Jeff Davis
Date:
On Mon, 2023-10-30 at 12:39 -0400, Tom Lane wrote:
> It seems like a shame
> to be burdening these instructions with a subroutine call for the
> benefit of long-obsolete hardware versions.

It's already doing a call to pg_comp_crc32c_sse42() regardless, right?

I assume you are concerned about the call going through a function
pointer? If so, is it possible that setting a flag and then branching
would be better?

Also, if it's a concern, should we also consider making an inlineable
version of pg_comp_crc32c_sse42()?

Regards,
    Jeff Davis




Re: always use runtime checks for CRC-32C instructions

From
Nathan Bossart
Date:
On Mon, Oct 30, 2023 at 12:39:23PM -0400, Tom Lane wrote:
> On the one hand, I agree that we need to keep the complexity from
> getting out of hand.  On the other hand, I wonder if this approach
> isn't optimizing for the wrong case.  How many machines that PG 17
> will ever be run on in production will lack SSE 4.2 (for Intel)
> or ARMv8 instructions (on the ARM side)?

For the CRC instructions in use today, I wouldn't be surprised if that
number is pretty small, but for newer or optional instructions (like ARM's
PMULL), I don't think we'll be so lucky.  Even if we do feel comfortable
assuming the presence of SSE 4.2, etc., we'll likely still need to add
runtime checks for future optimizations.

> It seems like a shame
> to be burdening these instructions with a subroutine call for the
> benefit of long-obsolete hardware versions.  Maybe that overhead
> is negligible, but it doesn't sound like you tried to measure it.

When I went to measure this, I noticed that my relatively new x86 machine
with a relatively new version of gcc uses the runtime check.  I then
skimmed through a few dozen buildfarm machines and found that, of all x86
and ARM machines that supported the specialized CRC instructions, only one
ARM machine did not use the runtime check.  Of course, this is far from a
scientific data point, but it seems to indicate that the runtime check is
the norm.

(I still need to measure it.)

> Anyway, I agree that the cost of a one-time-per-process probe should
> be negligible; it's the per-use cost that I worry about.  If you can
> do some measurements proving that that worry is ill-founded, then
> I'm good with test-first.

Will do.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: always use runtime checks for CRC-32C instructions

From
Nathan Bossart
Date:
On Mon, Oct 30, 2023 at 01:48:29PM -0700, Jeff Davis wrote:
> I assume you are concerned about the call going through a function
> pointer? If so, is it possible that setting a flag and then branching
> would be better?
> 
> Also, if it's a concern, should we also consider making an inlineable
> version of pg_comp_crc32c_sse42()?

I tested pg_waldump -z with 50M 65-byte records for the following
implementations on an ARM system:

 * slicing-by-8                                : ~3.08s
 * proposed patches applied (runtime check)    : ~2.44s
 * only CRC intrinsics implementation compiled : ~2.42s
 * forced inlining                             : ~2.38s

Avoiding the runtime check produced a 0.8% improvement, and forced inlining
produced another 1.7% improvement.  In comparison, even the runtime check
implementation produced a 20.8% improvement over the slicing-by-8 one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: always use runtime checks for CRC-32C instructions

From
Nathan Bossart
Date:
On Mon, Oct 30, 2023 at 10:36:01PM -0500, Nathan Bossart wrote:
> I tested pg_waldump -z with 50M 65-byte records for the following
> implementations on an ARM system:
> 
>  * slicing-by-8                                : ~3.08s
>  * proposed patches applied (runtime check)    : ~2.44s
>  * only CRC intrinsics implementation compiled : ~2.42s
>  * forced inlining                             : ~2.38s
> 
> Avoiding the runtime check produced a 0.8% improvement, and forced inlining
> produced another 1.7% improvement.  In comparison, even the runtime check
> implementation produced a 20.8% improvement over the slicing-by-8 one.

After reflecting on these numbers for a bit, I think I'm still inclined to
do $SUBJECT.  I considered the following:

* While it would be nice to gain a couple of percentage points for existing
  hardware, I think we'll still end up doing runtime checks most of the
  time once we add support for newer instructions.

* The performance improvements that the new instructions provide seem
  likely to outweigh these small regressions, especially for workloads with
  larger WAL records [0].

* From my quick scan of a few dozen machines on the buildfarm, it looks
  like the runtime checks are already the norm, so the number of systems
  that would be subject to a regression from v16 to v17 should be pretty
  small, in theory.  And this regression seems to be on the order of 1%
  based on the numbers above.

Do folks think this is reasonable?  Or should we instead try to squeeze
every last drop out of the current implementations by avoiding function
pointers, forcing inlining, etc.?

[0] https://postgr.es/m/20231025014539.GA977906%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: always use runtime checks for CRC-32C instructions

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Mon, Oct 30, 2023 at 10:36:01PM -0500, Nathan Bossart wrote:
>> I tested pg_waldump -z with 50M 65-byte records for the following
>> implementations on an ARM system:
>> 
>> * slicing-by-8                                : ~3.08s
>> * proposed patches applied (runtime check)    : ~2.44s
>> * only CRC intrinsics implementation compiled : ~2.42s
>> * forced inlining                             : ~2.38s
>> 
>> Avoiding the runtime check produced a 0.8% improvement, and forced inlining
>> produced another 1.7% improvement.  In comparison, even the runtime check
>> implementation produced a 20.8% improvement over the slicing-by-8 one.

I find these numbers fairly concerning.  If you can see a
couple-of-percent slowdown on a macroscopic benchmark like pg_waldump,
that implies that the percentage slowdown considering the CRC
operation alone is much worse.  So there may be other use-cases where
we would take a bigger relative hit.

> * From my quick scan of a few dozen machines on the buildfarm, it looks
>   like the runtime checks are already the norm, so the number of systems
>   that would be subject to a regression from v16 to v17 should be pretty
>   small, in theory.  And this regression seems to be on the order of 1%
>   based on the numbers above.

I did a more thorough scrape of the buildfarm results.  Of 161 animals
currently reporting configure output on HEAD, we have

      2 ARMv8 CRC instructions
     36 ARMv8 CRC instructions with runtime check
      2 LoongArch CRCC instructions
      2 SSE 4.2
     52 SSE 4.2 with runtime check
     67 slicing-by-8

While that'd seem to support your conclusion, the two using ARM CRC
*without* a runtime check are my Apple M1 Mac animals (sifaka/indri);
and I see the same selection on my laptop.  So one platform where
we'd clearly be taking a regression is M-series Macs; that's a pretty
popular platform.  The two using SSE without a check are prion and
tayra.  I notice those are using gcc 11; so perhaps the default cflags
have changed to include -msse4.2 recently?  I couldn't see much other
pattern though.  (Scraping results attached in case anybody wants to
look.)

Really this just reinforces my concern that doing a runtime check
all the time is on the wrong side of history.  I grant that we've
got to do that for anything where the availability of the instruction
is really in serious question, but I'm not very convinced that that's
a majority situation on popular platforms.

            regards, tom lane

alimoche,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 8.3.0-6) 8.3.0
batta,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 10.2.1-6) 10.2.1 20210110
blackneck,ARMv8 CRC instructions with runtime check,aarch64,Debian clang version 13.0.1-6~deb10u4
boiga,ARMv8 CRC instructions with runtime check,aarch64,clang version 14.0.5 (Fedora 14.0.5-2.fc36)
broadbill,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
bulbul,ARMv8 CRC instructions with runtime check,aarch64,clang version 15.0.7 (Red Hat
15.0.7-1.module_el8.8.0+3466+dfcbc058)
corzo,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)
desman,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1)
eelpout,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 10.2.1-6) 10.2.1 20210110
gokiburi,ARMv8 CRC instructions with runtime check,aarch64,Debian clang version 11.0.1-2
hachi,ARMv8 CRC instructions with runtime check,aarch64,Debian clang version 11.0.1-2
jackdaw,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 10.2.1-6) 10.2.1 20210110
massasauga,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
motmot,ARMv8 CRC instructions with runtime check,aarch64,clang version 16.0.6 (Fedora 16.0.6-2.fc38)
oystercatcher,ARMv8 CRC instructions with runtime check,aarch64,clang version 15.0.7 (Red Hat 15.0.7-2.el9)
potoo,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)
rudd,ARMv8 CRC instructions with runtime check,aarch64,Ubuntu clang version 12.0.0-3ubuntu1~20.04.5
shiner,ARMv8 CRC instructions with runtime check,aarch64,Ubuntu clang version 15.0.7
snakefly,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
splitfin,ARMv8 CRC instructions with runtime check,aarch64,gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
tanager,ARMv8 CRC instructions with runtime check,aarch64,clang version 7.0.1-8+deb10u2 (tags/RELEASE_701/final)
turbot,ARMv8 CRC instructions with runtime check,aarch64,gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
vimba,ARMv8 CRC instructions with runtime check,aarch64,clang version 10.0.0-4ubuntu1~18.04.2
whinchat,ARMv8 CRC instructions with runtime check,aarch64,Debian clang version 13.0.1-6~deb11u1
whiting,ARMv8 CRC instructions with runtime check,aarch64,gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
widowbird,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 10.2.1-6) 10.2.1 20210110
ziege,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-21)
arowana,slicing-by-8,aarch64,gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
parula,ARMv8 CRC instructions with runtime check,aarch64/graviton3/c7g.2xl,gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
elver,SSE 4.2 with runtime check,amd64,FreeBSD clang version 14.0.5 (https://github.com/llvm/llvm-project.git
llvmorg-14.0.5-0-gc12386ae247c)
gombessa,SSE 4.2 with runtime check,amd64,OpenBSD clang version 10.0.1
hake,SSE 4.2 with runtime check,amd64,gcc (OpenIndiana 9.3.0-oi-4) 9.3.0
mule,SSE 4.2 with runtime check,amd64,gcc (Debian 12.2.0-14) 12.2.0
plover,SSE 4.2 with runtime check,amd64,OpenBSD clang version 10.0.1
pollock,SSE 4.2 with runtime check,amd64,gcc (OmniOS 151046/12.2.0-il-0) 12.2.0
indri,ARMv8 CRC instructions,arm64,Apple clang version 15.0.0 (clang-1500.0.40.1)
sifaka,ARMv8 CRC instructions,arm64,Apple clang version 15.0.0 (clang-1500.0.40.1)
dikkop,ARMv8 CRC instructions with runtime check,arm64,FreeBSD clang version 16.0.6
(https://github.com/llvm/llvm-project.gitllvmorg-16.0.6-0-g7cbf1a259152) 
opaleye,ARMv8 CRC instructions with runtime check,arm64,FreeBSD clang version 13.0.0
(git@github.com:llvm/llvm-project.gitllvmorg-13.0.0-0-gd7b669b3a303) 
chipmunk,ARMv8 CRC instructions with runtime check,armv6l,gcc (Raspbian 4.9.2-10+deb8u2) 4.9.2
grison,ARMv8 CRC instructions with runtime check,armv7,gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516
gull,ARMv8 CRC instructions with runtime check,armv7,clang version 13.0.0
(http://git.linaro.org/toolchain/jenkins-scripts.gitd04b0fafc2d906fd9b2e8e55efb35c9cf7114e68) 
mereswine,ARMv8 CRC instructions with runtime check,armv7,gcc (Debian 10.2.1-6) 10.2.1 20210110
dangomushi,ARMv8 CRC instructions with runtime check,armv7l,clang version 7.0.1-8+deb10u2 (tags/RELEASE_701/final)
turaco,ARMv8 CRC instructions with runtime check,armv7l,gcc (Raspbian 10.2.1-6+rpi1) 10.2.1 20210110
chickadee,slicing-by-8,hppa,gcc (nb1 20220722) 10.4.0
lapwing,SSE 4.2 with runtime check,i686,gcc (Debian 4.7.2-5) 4.7.2
cisticola,LoongArch CRCC instructions,loongarch64,gcc (GCC) 8.3.0 20190222 (Loongson 8.3.0-35 vec)
nuthatch,LoongArch CRCC instructions,loongarch64,gcc (GCC) 13.2.1 20230906
mamba,slicing-by-8,macppc,cc (nb1 20220722) 10.4.0
frogfish,slicing-by-8,mips64eb; -mabi=64,gcc (Debian 4.6.3-14) 4.6.3
topminnow,slicing-by-8,mips64el; -mabi=32,gcc (Debian 4.9.2-10+deb8u1) 4.9.2
hornet,slicing-by-8,ppc64 (power7),xlc_r -D_LARGE_FILES=1
mandrill,slicing-by-8,ppc64 (power7),xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY
sungazer,slicing-by-8,ppc64 (power7),gcc (GCC) 8.3.0
tern,slicing-by-8,ppc64 (power7),gcc (GCC) 8.3.0
boa,slicing-by-8,ppc64 (power8),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
chub,slicing-by-8,ppc64 (power8),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
hoverfly,slicing-by-8,ppc64 (power8),/opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY
batfish,slicing-by-8,ppc64le,clang version 5.0.0-3~16.04.1 (tags/RELEASE_500/final)
clam,slicing-by-8,ppc64le,"gcc (GCC) 5.2.1 20151016 (Advance-Toolchain-) [ibm/gcc-5-branch, revision: 229493 merged
fromgcc-5-branch, revision 228917]" 
cuon,slicing-by-8,ppc64le,gcc (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
dhole,slicing-by-8,ppc64le,gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
urocryon,slicing-by-8,ppc64le,gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
vulpes,slicing-by-8,ppc64le,gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)
wobbegong,slicing-by-8,ppc64le,clang version 5.0.2 (tags/RELEASE_502/final)
ayu,slicing-by-8,ppc64le (power8),clang version 4.0.1-10~deb9u2 (tags/RELEASE_401/final)
babbler,slicing-by-8,ppc64le (power9),gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
bonito,slicing-by-8,ppc64le (power9),gcc (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)
buri,slicing-by-8,ppc64le (power9),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
cascabel,slicing-by-8,ppc64le (power9),gcc (Debian 10.2.1-6) 10.2.1 20210110
cavefish,slicing-by-8,ppc64le (power9),gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
chevrotain,slicing-by-8,ppc64le (power9),Debian clang version 13.0.1-6~deb11u1
chimaera,slicing-by-8,ppc64le (power9),gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
demoiselle,slicing-by-8,ppc64le (power9),clang version 5.0.1 (tags/RELEASE_501/final 312548)
elasmobranch,slicing-by-8,ppc64le (power9),gcc (SUSE Linux) 7.3.1 20180323 [gcc-7-branch revision 258812]
habu,slicing-by-8,ppc64le (power9),gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1)
kingsnake,slicing-by-8,ppc64le (power9),clang version 16.0.6 (Fedora 16.0.6-3.fc38)
krait,slicing-by-8,ppc64le (power9),gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-21)
lancehead,slicing-by-8,ppc64le (power9),clang version 16.0.6 (Red Hat 16.0.6-2.module_el8+588+6f71ce7b)
nicator,slicing-by-8,ppc64le (power9),gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)
pytilia,slicing-by-8,ppc64le (power9),clang version 15.0.7 (Red Hat 15.0.7-1.module_el8.8.0+3466+dfcbc058)
boomslang,slicing-by-8,riscv64,gcc (Debian 10.2.1-6) 10.2.1 20210110
copperhead,slicing-by-8,riscv64,gcc (Debian 10.2.1-6) 10.2.1 20210110
aracari,slicing-by-8,s390x (z15),clang version 15.0.7 (Red Hat 15.0.7-1.module+el8.8.0+17939+b58878af)
branta,slicing-by-8,s390x (z15),gcc-10 (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0
cotinga,slicing-by-8,s390x (z15),clang version 10.0.0-4ubuntu1~18.04.2
lora,slicing-by-8,s390x (z15),gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)
mamushi,slicing-by-8,s390x (z15),clang version 15.0.7 (Red Hat 15.0.7-2.el9)
perch,slicing-by-8,s390x (z15),gcc-8 (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
pike,slicing-by-8,s390x (z15),gcc (Debian 10.2.1-6) 10.2.1 20210110
pipit,slicing-by-8,s390x (z15),gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)
rinkhals,slicing-by-8,s390x (z15),Debian clang version 13.0.1-6~deb11u1
ruddy,slicing-by-8,s390x (z15),gcc-11 (SUSE Linux) 11.2.1 20210816 [revision 056e324ce46a7924b5cf10f61010cf9dd2ca10e9]
sarus,slicing-by-8,s390x (z15),Ubuntu clang version 14.0.0-1ubuntu1.1
shelduck,slicing-by-8,s390x (z15),gcc (SUSE Linux) 4.8.5
siskin,slicing-by-8,s390x (z15),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
urutau,slicing-by-8,s390x (z15),Ubuntu clang version 12.0.0-3ubuntu1~20.04.5
margay,slicing-by-8,sparc,gcc (GCC) 11.2.0
skate,slicing-by-8,sparc,gcc-4.7 (Debian 4.7.2-5) 4.7.2
snapper,slicing-by-8,sparc,gcc-4.7 (Debian 4.7.2-5) 4.7.2
grackle,slicing-by-8,sparc64,gcc (GCC) 12.2.0
ibisbill,slicing-by-8,sparc64,gcc (Debian 8.3.0-6) 8.3.0
kittiwake,slicing-by-8,sparc64,gcc (Debian 8.3.0-6) 8.3.0
mussurana,slicing-by-8,sparc64,gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
tadarida,slicing-by-8,sparc64,gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
thorntail,slicing-by-8,sparc64,gcc-11 (Debian 11.4.0-1) 11.4.0
waxbill,slicing-by-8,sparc64,gcc (Debian 12.3.0-4) 12.3.0
wrasse,slicing-by-8,sparc64,cc: Studio 12.6 Sun C 5.15 SunOS_sparc 152881-01 2018/03/16
morepork,SSE 4.2 with runtime check,x64,OpenBSD clang version 10.0.1
bushmaster,SSE 4.2 with runtime check,x86-64,clang version 17.0.3 (https://github.com/llvm/llvm-project.git
37b79e779f447f1c714af7f907e7a2ec846d1da0)
calliphoridae,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0
canebrake,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0
culicidae,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0
grassquit,SSE 4.2 with runtime check,x86-64,gcc-12 (Debian 12.3.0-11) 12.3.0
kestrel,SSE 4.2 with runtime check,x86-64,Debian clang version 13.0.1-11+b2
olingo,SSE 4.2 with runtime check,x86-64,Debian clang version 13.0.1-11+b2
skink,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0
taipan,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0
tamandua,SSE 4.2 with runtime check,x86-64,gcc-12 (Debian 12.3.0-11) 12.3.0
urutu,SSE 4.2 with runtime check,x86-64,clang version 16.0.6 (https://github.com/llvm/llvm-project.git
7cbf1a2591520c2491aa35339f227775f4d3adf6)
prion,SSE 4.2,x86_64,gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)
tayra,SSE 4.2,x86_64,gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4.3.0.4)
alabio,SSE 4.2 with runtime check,x86_64,gcc (Debian 10.2.1-6) 10.2.1 20210110
avocet,SSE 4.2 with runtime check,x86_64,gcc (SUSE Linux) 7.5.0
butterflyfish,SSE 4.2 with runtime check,x86_64,gcc (GCC) 6.3.0
caiman,SSE 4.2 with runtime check,x86_64,gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4)
conchuela,SSE 4.2 with runtime check,x86_64,clang version 14.0.6
culpeo,SSE 4.2 with runtime check,x86_64,gcc (Debian 12.2.0-14) 12.2.0
desmoxytes,SSE 4.2 with runtime check,x86_64,gcc (Debian 13.2.0-6) 13.2.0
dragonet,SSE 4.2 with runtime check,x86_64,clang version 3.9.1
flaviventris,SSE 4.2 with runtime check,x86_64,gcc (Debian 20231017-1) 14.0.0 20231017 (experimental) [master
r14-4678-gce55521bcd1]
guaibasaurus,SSE 4.2 with runtime check,x86_64,gcc (Debian 10.2.1-6) 10.2.1 20210110
hippopotamus,SSE 4.2 with runtime check,x86_64,gcc (SUSE Linux) 7.5.0
idiacanthus,SSE 4.2 with runtime check,x86_64,clang version 5.0.2
jay,SSE 4.2 with runtime check,x86_64,clang version 13.0.1
komodoensis,SSE 4.2 with runtime check,x86_64,gcc (Debian 13.2.0-6) 13.2.0
loach,SSE 4.2 with runtime check,x86_64,clang version 14.0.6
longfin,SSE 4.2 with runtime check,x86_64,Apple clang version 14.0.3 (clang-1403.0.22.14.1)
lorikeet,SSE 4.2 with runtime check,x86_64,gcc (GCC) 10.2.0
malleefowl,SSE 4.2 with runtime check,x86_64,gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219
moonjelly,SSE 4.2 with runtime check,x86_64,gcc (GCC) 14.0.0 20230429 (experimental)
myna,SSE 4.2 with runtime check,x86_64,gcc (GCC) 7.3.0
peripatus,SSE 4.2 with runtime check,x86_64,FreeBSD clang version 14.0.5 (https://github.com/llvm/llvm-project.git
llvmorg-14.0.5-0-gc12386ae247c)
petalura,SSE 4.2 with runtime check,x86_64,clang version 4.0.1
phycodurus,SSE 4.2 with runtime check,x86_64,gcc (Debian 13.2.0-6) 13.2.0
pogona,SSE 4.2 with runtime check,x86_64,gcc (Debian 13.2.0-6) 13.2.0
rhinoceros,SSE 4.2 with runtime check,x86_64,gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
schnauzer,SSE 4.2 with runtime check,x86_64,OpenBSD clang version 13.0.0
seawasp,SSE 4.2 with runtime check,x86_64,clang version 17.0.0 (https://github.com/llvm/llvm-project.git
40222ddcf8f54fe523b2d14ab7005ebf412330f1)
serinus,SSE 4.2 with runtime check,x86_64,gcc (Debian 20231017-1) 14.0.0 20231017 (experimental) [master
r14-4678-gce55521bcd1]
sidewinder,SSE 4.2 with runtime check,x86_64,cc (nb4 20200810) 7.5.0
spurfowl,SSE 4.2 with runtime check,x86_64,gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
trilobite,SSE 4.2 with runtime check,x86_64,clang version 13.0.1
xenodermus,SSE 4.2 with runtime check,x86_64,clang version 6.0.1
curculio,slicing-by-8,x86_64,gcc (GCC) 4.2.1 20070719
mantid,SSE 4.2 with runtime check,x86_64 (virtualized),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
blossomcrown,slicing-by-8,,gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
cardinalfish,slicing-by-8,,Debian clang version 14.0.6
cottonmouth,slicing-by-8,,gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0
devario,slicing-by-8,,gcc (Debian 12.2.0-9) 12.2.0

Re: always use runtime checks for CRC-32C instructions

From
Tom Lane
Date:
I wrote:
> I did a more thorough scrape of the buildfarm results.  Of 161 animals
> currently reporting configure output on HEAD, we have

Oh ... take "current" with a grain of salt there, because I just noticed
that I typo'd my search condition so that it collected results from all
systems that reported since 2022-Oct, rather than in the last month as
I'd intended.  There are just 137 animals currently reporting.

Of those, I broke down the architectures reporting using slicing-by-8:

# select arch,count(*) from results where crc = 'slicing-by-8' group by 1 order by 1;
        arch        | count
--------------------+-------
 aarch64            |     1
 macppc             |     1
 mips64eb; -mabi=64 |     1
 mips64el; -mabi=32 |     1
 ppc64 (power7)     |     4
 ppc64 (power8)     |     2
 ppc64le            |     7
 ppc64le (power8)   |     1
 ppc64le (power9)   |    15
 riscv64            |     2
 s390x (z15)        |    14
 sparc              |     1
(12 rows)

The one machine using slicing-by-8 where there might be a better
alternative is arowana, which is CentOS 7 with a pretty ancient gcc
version.  So I reject the idea that slicing-by-8 is an appropriate
baseline for comparisons.  There isn't anybody who will see an
improvement over current behavior: in the population of interest,
just about all platforms are using CRC instructions with or without
a runtime check.

            regards, tom lane



Re: always use runtime checks for CRC-32C instructions

From
Nathan Bossart
Date:
On Tue, Oct 31, 2023 at 03:16:16PM -0400, Tom Lane wrote:
> Really this just reinforces my concern that doing a runtime check
> all the time is on the wrong side of history.  I grant that we've
> got to do that for anything where the availability of the instruction
> is really in serious question, but I'm not very convinced that that's
> a majority situation on popular platforms.

Okay.  With that in mind, I think the path forward for new instructions is
as follows:

* If the special CRC instructions can be used with the default compiler
  flags, we can only use newer instructions if they can also be used with
  the default compiler flags.  (My M2 machine appears to add +crypto by
  default, so I bet your buildfarm animals would fall into this bucket.)
* Otherwise, if the CRC instructions can be used with added flags (i.e.,
  the runtime check path), we can do a runtime check for the new
  instructions as well.  (Most other buildfarm animals would fall into this
  bucket.)

Any platform that can use the CRC instructions with default compiler flags
but not the new instructions wouldn't be able to take advantage of the
proposed optimization, but it also wouldn't be subject to the small
performance regression.

If we wanted to further eliminate runtime checks for SSE 4.2 and ARMv8,
then I think things become a little trickier, as having a compiler that
understands things like +crypto would mean that you're automatically
subject to the runtime check regression (assuming we proceed with the
proposed optimization).  An alternate approach could be to only use newer
instructions if they are available with the default compiler flags, but
given the current state of the buildfarm, such optimizations might not get
much uptake for a while.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: always use runtime checks for CRC-32C instructions

From
Nathan Bossart
Date:
On Tue, Oct 31, 2023 at 03:42:33PM -0400, Tom Lane wrote:
> The one machine using slicing-by-8 where there might be a better
> alternative is arowana, which is CentOS 7 with a pretty ancient gcc
> version.  So I reject the idea that slicing-by-8 is an appropriate
> baseline for comparisons.  There isn't anybody who will see an
> improvement over current behavior: in the population of interest,
> just about all platforms are using CRC instructions with or without
> a runtime check.

I only included the slicing-by-8 benchmark to demonstrate that 1) the CRC
computations are a big portion of that pg_waldump -z command and that 2)
the CRC instructions provide significant performance gains.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: always use runtime checks for CRC-32C instructions

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> Okay.  With that in mind, I think the path forward for new instructions is
> as follows:

> * If the special CRC instructions can be used with the default compiler
>   flags, we can only use newer instructions if they can also be used with
>   the default compiler flags.  (My M2 machine appears to add +crypto by
>   default, so I bet your buildfarm animals would fall into this bucket.)
> * Otherwise, if the CRC instructions can be used with added flags (i.e.,
>   the runtime check path), we can do a runtime check for the new
>   instructions as well.  (Most other buildfarm animals would fall into this
>   bucket.)

This seems like a reasonable proposal.

> Any platform that can use the CRC instructions with default compiler flags
> but not the new instructions wouldn't be able to take advantage of the
> proposed optimization, but it also wouldn't be subject to the small
> performance regression.

Check.  For now I think that's fine.  If we get to a place where this
policy is really leaving a lot of performance on the table, we can
revisit it ... but that situation is hypothetical and may remain so.

(It's worth noting also that a package builder can move the goalposts
at will, since our idea of "default flags" is really whatever the user
says to use.)

            regards, tom lane



Re: always use runtime checks for CRC-32C instructions

From
Nathan Bossart
Date:
On Tue, Oct 31, 2023 at 04:12:40PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> Okay.  With that in mind, I think the path forward for new instructions is
>> as follows:
> 
>> * If the special CRC instructions can be used with the default compiler
>>   flags, we can only use newer instructions if they can also be used with
>>   the default compiler flags.  (My M2 machine appears to add +crypto by
>>   default, so I bet your buildfarm animals would fall into this bucket.)
>> * Otherwise, if the CRC instructions can be used with added flags (i.e.,
>>   the runtime check path), we can do a runtime check for the new
>>   instructions as well.  (Most other buildfarm animals would fall into this
>>   bucket.)
> 
> This seems like a reasonable proposal.

Great.  I think that leaves us with nothing left to do for this thread, so
I'll withdraw it from the commitfest and move the discussion back to the
original thread.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: always use runtime checks for CRC-32C instructions

From
Nathan Bossart
Date:
On Tue, Oct 31, 2023 at 03:38:17PM -0500, Nathan Bossart wrote:
> On Tue, Oct 31, 2023 at 04:12:40PM -0400, Tom Lane wrote:
>> This seems like a reasonable proposal.
> 
> Great.  I think that leaves us with nothing left to do for this thread, so
> I'll withdraw it from the commitfest and move the discussion back to the
> original thread.

(Also, thanks for the discussion.)

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com