Thread: always use runtime checks for CRC-32C instructions
This is an offshoot of the "CRC32C Parallel Computation Optimization on ARM" thread [0]. I intend for this to be a prerequisite patch set. Presently, for the SSE 4.2 and ARMv8 CRC instructions used in the CRC32C code for WAL records, etc., we first check if the intrinsics are available with the default compiler flags. If so, we only bother compiling the implementation that uses those intrinsics. If not, we also check whether the intrinsics are available with some extra CFLAGS, and if they are, we compile both the implementation that uses the intrinsics as well as a fallback implementation that doesn't require any special instructions. Then, at runtime, we check what's available in the hardware and choose the appropriate CRC32C implementation. The aforementioned other thread [0] aims to further optimize this code by using another instruction that requires additional configure and/or runtime checks. $SUBJECT has been in the back of my mind for a while, but given proposals to add further complexity to this code, I figured it might be a good time to propose this simplification. Specifically, I think we shouldn't worry about trying to compile only the special instrinics versions, and instead always try to build both and choose the appropriate one at runtime. AFAICT the trade-offs aren't too bad. With some simple testing, I see that the runtime check occurs once at startup, so I don't anticipate any noticeable performance impact. I suppose each process might need to do the check in EXEC_BACKEND builds, but even so, I suspect the difference is negligible. I also see that the SSE 4.2 runtime check requires the CPUID instruction, so we wouldn't use the instrinsics for hardware that supports SSE 4.2 but not CPUID. However, I'm not sure such hardware even exists. Wikipedia says that CPUID was introduced in 1993 [1], and meson.build appears to omit the CPUID check when determining which CRC32C implementation to use. Furthermore, meson.build alludes to problems with some of the CPUID-related checks: # XXX: The configure.ac check for __cpuid() is broken, we don't copy that # here. To prevent problems due to two detection methods working, stop # checking after one. Are there any other reasons that we should try to avoid the runtime check when possible? I've attached two patches. 0001 adds a debug message to the SSE 4.2 runtime check that matches the one already present for the ARMv8 check. This message just notes whether the runtime check found that the special CRC instructions are available. 0002 is a first attempt at $SUBJECT. I've tested it on both x86 and ARM, and it seems to work as intended. You'll notice that I'm still checking for the intrinsics with the default compiler flags first. I didn't see any strong reason to change this, and doing so allows us to avoid sending extra CFLAGS when possible. Thoughts? [0] https://postgr.es/m/DB9PR08MB6991329A73923BF8ED4B3422F5DBA%40DB9PR08MB6991.eurprd08.prod.outlook.com [1] https://en.wikipedia.org/wiki/CPUID -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Nathan Bossart <nathandbossart@gmail.com> writes: > The aforementioned other thread [0] aims to further optimize this code by > using another instruction that requires additional configure and/or runtime > checks. $SUBJECT has been in the back of my mind for a while, but given > proposals to add further complexity to this code, I figured it might be a > good time to propose this simplification. Specifically, I think we > shouldn't worry about trying to compile only the special instrinics > versions, and instead always try to build both and choose the appropriate > one at runtime. On the one hand, I agree that we need to keep the complexity from getting out of hand. On the other hand, I wonder if this approach isn't optimizing for the wrong case. How many machines that PG 17 will ever be run on in production will lack SSE 4.2 (for Intel) or ARMv8 instructions (on the ARM side)? It seems like a shame to be burdening these instructions with a subroutine call for the benefit of long-obsolete hardware versions. Maybe that overhead is negligible, but it doesn't sound like you tried to measure it. I'm not quite sure what to propose instead, though. I thought for a little bit about a configure switch to select "test first" or "pedal to the metal". But in practice, package builders would probably have to select the conservative "test first" option; and we know that the vast majority of modern installations use prebuilt packages, so it's not clear that this answer would help many people. Anyway, I agree that the cost of a one-time-per-process probe should be negligible; it's the per-use cost that I worry about. If you can do some measurements proving that that worry is ill-founded, then I'm good with test-first. regards, tom lane
On Mon, 2023-10-30 at 12:39 -0400, Tom Lane wrote: > It seems like a shame > to be burdening these instructions with a subroutine call for the > benefit of long-obsolete hardware versions. It's already doing a call to pg_comp_crc32c_sse42() regardless, right? I assume you are concerned about the call going through a function pointer? If so, is it possible that setting a flag and then branching would be better? Also, if it's a concern, should we also consider making an inlineable version of pg_comp_crc32c_sse42()? Regards, Jeff Davis
On Mon, Oct 30, 2023 at 12:39:23PM -0400, Tom Lane wrote: > On the one hand, I agree that we need to keep the complexity from > getting out of hand. On the other hand, I wonder if this approach > isn't optimizing for the wrong case. How many machines that PG 17 > will ever be run on in production will lack SSE 4.2 (for Intel) > or ARMv8 instructions (on the ARM side)? For the CRC instructions in use today, I wouldn't be surprised if that number is pretty small, but for newer or optional instructions (like ARM's PMULL), I don't think we'll be so lucky. Even if we do feel comfortable assuming the presence of SSE 4.2, etc., we'll likely still need to add runtime checks for future optimizations. > It seems like a shame > to be burdening these instructions with a subroutine call for the > benefit of long-obsolete hardware versions. Maybe that overhead > is negligible, but it doesn't sound like you tried to measure it. When I went to measure this, I noticed that my relatively new x86 machine with a relatively new version of gcc uses the runtime check. I then skimmed through a few dozen buildfarm machines and found that, of all x86 and ARM machines that supported the specialized CRC instructions, only one ARM machine did not use the runtime check. Of course, this is far from a scientific data point, but it seems to indicate that the runtime check is the norm. (I still need to measure it.) > Anyway, I agree that the cost of a one-time-per-process probe should > be negligible; it's the per-use cost that I worry about. If you can > do some measurements proving that that worry is ill-founded, then > I'm good with test-first. Will do. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, Oct 30, 2023 at 01:48:29PM -0700, Jeff Davis wrote: > I assume you are concerned about the call going through a function > pointer? If so, is it possible that setting a flag and then branching > would be better? > > Also, if it's a concern, should we also consider making an inlineable > version of pg_comp_crc32c_sse42()? I tested pg_waldump -z with 50M 65-byte records for the following implementations on an ARM system: * slicing-by-8 : ~3.08s * proposed patches applied (runtime check) : ~2.44s * only CRC intrinsics implementation compiled : ~2.42s * forced inlining : ~2.38s Avoiding the runtime check produced a 0.8% improvement, and forced inlining produced another 1.7% improvement. In comparison, even the runtime check implementation produced a 20.8% improvement over the slicing-by-8 one. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, Oct 30, 2023 at 10:36:01PM -0500, Nathan Bossart wrote: > I tested pg_waldump -z with 50M 65-byte records for the following > implementations on an ARM system: > > * slicing-by-8 : ~3.08s > * proposed patches applied (runtime check) : ~2.44s > * only CRC intrinsics implementation compiled : ~2.42s > * forced inlining : ~2.38s > > Avoiding the runtime check produced a 0.8% improvement, and forced inlining > produced another 1.7% improvement. In comparison, even the runtime check > implementation produced a 20.8% improvement over the slicing-by-8 one. After reflecting on these numbers for a bit, I think I'm still inclined to do $SUBJECT. I considered the following: * While it would be nice to gain a couple of percentage points for existing hardware, I think we'll still end up doing runtime checks most of the time once we add support for newer instructions. * The performance improvements that the new instructions provide seem likely to outweigh these small regressions, especially for workloads with larger WAL records [0]. * From my quick scan of a few dozen machines on the buildfarm, it looks like the runtime checks are already the norm, so the number of systems that would be subject to a regression from v16 to v17 should be pretty small, in theory. And this regression seems to be on the order of 1% based on the numbers above. Do folks think this is reasonable? Or should we instead try to squeeze every last drop out of the current implementations by avoiding function pointers, forcing inlining, etc.? [0] https://postgr.es/m/20231025014539.GA977906%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > On Mon, Oct 30, 2023 at 10:36:01PM -0500, Nathan Bossart wrote: >> I tested pg_waldump -z with 50M 65-byte records for the following >> implementations on an ARM system: >> >> * slicing-by-8 : ~3.08s >> * proposed patches applied (runtime check) : ~2.44s >> * only CRC intrinsics implementation compiled : ~2.42s >> * forced inlining : ~2.38s >> >> Avoiding the runtime check produced a 0.8% improvement, and forced inlining >> produced another 1.7% improvement. In comparison, even the runtime check >> implementation produced a 20.8% improvement over the slicing-by-8 one. I find these numbers fairly concerning. If you can see a couple-of-percent slowdown on a macroscopic benchmark like pg_waldump, that implies that the percentage slowdown considering the CRC operation alone is much worse. So there may be other use-cases where we would take a bigger relative hit. > * From my quick scan of a few dozen machines on the buildfarm, it looks > like the runtime checks are already the norm, so the number of systems > that would be subject to a regression from v16 to v17 should be pretty > small, in theory. And this regression seems to be on the order of 1% > based on the numbers above. I did a more thorough scrape of the buildfarm results. Of 161 animals currently reporting configure output on HEAD, we have 2 ARMv8 CRC instructions 36 ARMv8 CRC instructions with runtime check 2 LoongArch CRCC instructions 2 SSE 4.2 52 SSE 4.2 with runtime check 67 slicing-by-8 While that'd seem to support your conclusion, the two using ARM CRC *without* a runtime check are my Apple M1 Mac animals (sifaka/indri); and I see the same selection on my laptop. So one platform where we'd clearly be taking a regression is M-series Macs; that's a pretty popular platform. The two using SSE without a check are prion and tayra. I notice those are using gcc 11; so perhaps the default cflags have changed to include -msse4.2 recently? I couldn't see much other pattern though. (Scraping results attached in case anybody wants to look.) Really this just reinforces my concern that doing a runtime check all the time is on the wrong side of history. I grant that we've got to do that for anything where the availability of the instruction is really in serious question, but I'm not very convinced that that's a majority situation on popular platforms. regards, tom lane alimoche,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 8.3.0-6) 8.3.0 batta,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 10.2.1-6) 10.2.1 20210110 blackneck,ARMv8 CRC instructions with runtime check,aarch64,Debian clang version 13.0.1-6~deb10u4 boiga,ARMv8 CRC instructions with runtime check,aarch64,clang version 14.0.5 (Fedora 14.0.5-2.fc36) broadbill,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18) bulbul,ARMv8 CRC instructions with runtime check,aarch64,clang version 15.0.7 (Red Hat 15.0.7-1.module_el8.8.0+3466+dfcbc058) corzo,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4) desman,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1) eelpout,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 10.2.1-6) 10.2.1 20210110 gokiburi,ARMv8 CRC instructions with runtime check,aarch64,Debian clang version 11.0.1-2 hachi,ARMv8 CRC instructions with runtime check,aarch64,Debian clang version 11.0.1-2 jackdaw,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 10.2.1-6) 10.2.1 20210110 massasauga,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) motmot,ARMv8 CRC instructions with runtime check,aarch64,clang version 16.0.6 (Fedora 16.0.6-2.fc38) oystercatcher,ARMv8 CRC instructions with runtime check,aarch64,clang version 15.0.7 (Red Hat 15.0.7-2.el9) potoo,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4) rudd,ARMv8 CRC instructions with runtime check,aarch64,Ubuntu clang version 12.0.0-3ubuntu1~20.04.5 shiner,ARMv8 CRC instructions with runtime check,aarch64,Ubuntu clang version 15.0.7 snakefly,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) splitfin,ARMv8 CRC instructions with runtime check,aarch64,gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 tanager,ARMv8 CRC instructions with runtime check,aarch64,clang version 7.0.1-8+deb10u2 (tags/RELEASE_701/final) turbot,ARMv8 CRC instructions with runtime check,aarch64,gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 vimba,ARMv8 CRC instructions with runtime check,aarch64,clang version 10.0.0-4ubuntu1~18.04.2 whinchat,ARMv8 CRC instructions with runtime check,aarch64,Debian clang version 13.0.1-6~deb11u1 whiting,ARMv8 CRC instructions with runtime check,aarch64,gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0 widowbird,ARMv8 CRC instructions with runtime check,aarch64,gcc (Debian 10.2.1-6) 10.2.1 20210110 ziege,ARMv8 CRC instructions with runtime check,aarch64,gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-21) arowana,slicing-by-8,aarch64,gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) parula,ARMv8 CRC instructions with runtime check,aarch64/graviton3/c7g.2xl,gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) elver,SSE 4.2 with runtime check,amd64,FreeBSD clang version 14.0.5 (https://github.com/llvm/llvm-project.git llvmorg-14.0.5-0-gc12386ae247c) gombessa,SSE 4.2 with runtime check,amd64,OpenBSD clang version 10.0.1 hake,SSE 4.2 with runtime check,amd64,gcc (OpenIndiana 9.3.0-oi-4) 9.3.0 mule,SSE 4.2 with runtime check,amd64,gcc (Debian 12.2.0-14) 12.2.0 plover,SSE 4.2 with runtime check,amd64,OpenBSD clang version 10.0.1 pollock,SSE 4.2 with runtime check,amd64,gcc (OmniOS 151046/12.2.0-il-0) 12.2.0 indri,ARMv8 CRC instructions,arm64,Apple clang version 15.0.0 (clang-1500.0.40.1) sifaka,ARMv8 CRC instructions,arm64,Apple clang version 15.0.0 (clang-1500.0.40.1) dikkop,ARMv8 CRC instructions with runtime check,arm64,FreeBSD clang version 16.0.6 (https://github.com/llvm/llvm-project.gitllvmorg-16.0.6-0-g7cbf1a259152) opaleye,ARMv8 CRC instructions with runtime check,arm64,FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.gitllvmorg-13.0.0-0-gd7b669b3a303) chipmunk,ARMv8 CRC instructions with runtime check,armv6l,gcc (Raspbian 4.9.2-10+deb8u2) 4.9.2 grison,ARMv8 CRC instructions with runtime check,armv7,gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516 gull,ARMv8 CRC instructions with runtime check,armv7,clang version 13.0.0 (http://git.linaro.org/toolchain/jenkins-scripts.gitd04b0fafc2d906fd9b2e8e55efb35c9cf7114e68) mereswine,ARMv8 CRC instructions with runtime check,armv7,gcc (Debian 10.2.1-6) 10.2.1 20210110 dangomushi,ARMv8 CRC instructions with runtime check,armv7l,clang version 7.0.1-8+deb10u2 (tags/RELEASE_701/final) turaco,ARMv8 CRC instructions with runtime check,armv7l,gcc (Raspbian 10.2.1-6+rpi1) 10.2.1 20210110 chickadee,slicing-by-8,hppa,gcc (nb1 20220722) 10.4.0 lapwing,SSE 4.2 with runtime check,i686,gcc (Debian 4.7.2-5) 4.7.2 cisticola,LoongArch CRCC instructions,loongarch64,gcc (GCC) 8.3.0 20190222 (Loongson 8.3.0-35 vec) nuthatch,LoongArch CRCC instructions,loongarch64,gcc (GCC) 13.2.1 20230906 mamba,slicing-by-8,macppc,cc (nb1 20220722) 10.4.0 frogfish,slicing-by-8,mips64eb; -mabi=64,gcc (Debian 4.6.3-14) 4.6.3 topminnow,slicing-by-8,mips64el; -mabi=32,gcc (Debian 4.9.2-10+deb8u1) 4.9.2 hornet,slicing-by-8,ppc64 (power7),xlc_r -D_LARGE_FILES=1 mandrill,slicing-by-8,ppc64 (power7),xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY sungazer,slicing-by-8,ppc64 (power7),gcc (GCC) 8.3.0 tern,slicing-by-8,ppc64 (power7),gcc (GCC) 8.3.0 boa,slicing-by-8,ppc64 (power8),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) chub,slicing-by-8,ppc64 (power8),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) hoverfly,slicing-by-8,ppc64 (power8),/opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY batfish,slicing-by-8,ppc64le,clang version 5.0.0-3~16.04.1 (tags/RELEASE_500/final) clam,slicing-by-8,ppc64le,"gcc (GCC) 5.2.1 20151016 (Advance-Toolchain-) [ibm/gcc-5-branch, revision: 229493 merged fromgcc-5-branch, revision 228917]" cuon,slicing-by-8,ppc64le,gcc (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609 dhole,slicing-by-8,ppc64le,gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) urocryon,slicing-by-8,ppc64le,gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 vulpes,slicing-by-8,ppc64le,gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6) wobbegong,slicing-by-8,ppc64le,clang version 5.0.2 (tags/RELEASE_502/final) ayu,slicing-by-8,ppc64le (power8),clang version 4.0.1-10~deb9u2 (tags/RELEASE_401/final) babbler,slicing-by-8,ppc64le (power9),gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18) bonito,slicing-by-8,ppc64le (power9),gcc (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2) buri,slicing-by-8,ppc64le (power9),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) cascabel,slicing-by-8,ppc64le (power9),gcc (Debian 10.2.1-6) 10.2.1 20210110 cavefish,slicing-by-8,ppc64le (power9),gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 chevrotain,slicing-by-8,ppc64le (power9),Debian clang version 13.0.1-6~deb11u1 chimaera,slicing-by-8,ppc64le (power9),gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 demoiselle,slicing-by-8,ppc64le (power9),clang version 5.0.1 (tags/RELEASE_501/final 312548) elasmobranch,slicing-by-8,ppc64le (power9),gcc (SUSE Linux) 7.3.1 20180323 [gcc-7-branch revision 258812] habu,slicing-by-8,ppc64le (power9),gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1) kingsnake,slicing-by-8,ppc64le (power9),clang version 16.0.6 (Fedora 16.0.6-3.fc38) krait,slicing-by-8,ppc64le (power9),gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-21) lancehead,slicing-by-8,ppc64le (power9),clang version 16.0.6 (Red Hat 16.0.6-2.module_el8+588+6f71ce7b) nicator,slicing-by-8,ppc64le (power9),gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4) pytilia,slicing-by-8,ppc64le (power9),clang version 15.0.7 (Red Hat 15.0.7-1.module_el8.8.0+3466+dfcbc058) boomslang,slicing-by-8,riscv64,gcc (Debian 10.2.1-6) 10.2.1 20210110 copperhead,slicing-by-8,riscv64,gcc (Debian 10.2.1-6) 10.2.1 20210110 aracari,slicing-by-8,s390x (z15),clang version 15.0.7 (Red Hat 15.0.7-1.module+el8.8.0+17939+b58878af) branta,slicing-by-8,s390x (z15),gcc-10 (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0 cotinga,slicing-by-8,s390x (z15),clang version 10.0.0-4ubuntu1~18.04.2 lora,slicing-by-8,s390x (z15),gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4) mamushi,slicing-by-8,s390x (z15),clang version 15.0.7 (Red Hat 15.0.7-2.el9) perch,slicing-by-8,s390x (z15),gcc-8 (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0 pike,slicing-by-8,s390x (z15),gcc (Debian 10.2.1-6) 10.2.1 20210110 pipit,slicing-by-8,s390x (z15),gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10) rinkhals,slicing-by-8,s390x (z15),Debian clang version 13.0.1-6~deb11u1 ruddy,slicing-by-8,s390x (z15),gcc-11 (SUSE Linux) 11.2.1 20210816 [revision 056e324ce46a7924b5cf10f61010cf9dd2ca10e9] sarus,slicing-by-8,s390x (z15),Ubuntu clang version 14.0.0-1ubuntu1.1 shelduck,slicing-by-8,s390x (z15),gcc (SUSE Linux) 4.8.5 siskin,slicing-by-8,s390x (z15),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) urutau,slicing-by-8,s390x (z15),Ubuntu clang version 12.0.0-3ubuntu1~20.04.5 margay,slicing-by-8,sparc,gcc (GCC) 11.2.0 skate,slicing-by-8,sparc,gcc-4.7 (Debian 4.7.2-5) 4.7.2 snapper,slicing-by-8,sparc,gcc-4.7 (Debian 4.7.2-5) 4.7.2 grackle,slicing-by-8,sparc64,gcc (GCC) 12.2.0 ibisbill,slicing-by-8,sparc64,gcc (Debian 8.3.0-6) 8.3.0 kittiwake,slicing-by-8,sparc64,gcc (Debian 8.3.0-6) 8.3.0 mussurana,slicing-by-8,sparc64,gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 tadarida,slicing-by-8,sparc64,gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 thorntail,slicing-by-8,sparc64,gcc-11 (Debian 11.4.0-1) 11.4.0 waxbill,slicing-by-8,sparc64,gcc (Debian 12.3.0-4) 12.3.0 wrasse,slicing-by-8,sparc64,cc: Studio 12.6 Sun C 5.15 SunOS_sparc 152881-01 2018/03/16 morepork,SSE 4.2 with runtime check,x64,OpenBSD clang version 10.0.1 bushmaster,SSE 4.2 with runtime check,x86-64,clang version 17.0.3 (https://github.com/llvm/llvm-project.git 37b79e779f447f1c714af7f907e7a2ec846d1da0) calliphoridae,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0 canebrake,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0 culicidae,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0 grassquit,SSE 4.2 with runtime check,x86-64,gcc-12 (Debian 12.3.0-11) 12.3.0 kestrel,SSE 4.2 with runtime check,x86-64,Debian clang version 13.0.1-11+b2 olingo,SSE 4.2 with runtime check,x86-64,Debian clang version 13.0.1-11+b2 skink,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0 taipan,SSE 4.2 with runtime check,x86-64,gcc (Debian 13.2.0-6) 13.2.0 tamandua,SSE 4.2 with runtime check,x86-64,gcc-12 (Debian 12.3.0-11) 12.3.0 urutu,SSE 4.2 with runtime check,x86-64,clang version 16.0.6 (https://github.com/llvm/llvm-project.git 7cbf1a2591520c2491aa35339f227775f4d3adf6) prion,SSE 4.2,x86_64,gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4) tayra,SSE 4.2,x86_64,gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4.3.0.4) alabio,SSE 4.2 with runtime check,x86_64,gcc (Debian 10.2.1-6) 10.2.1 20210110 avocet,SSE 4.2 with runtime check,x86_64,gcc (SUSE Linux) 7.5.0 butterflyfish,SSE 4.2 with runtime check,x86_64,gcc (GCC) 6.3.0 caiman,SSE 4.2 with runtime check,x86_64,gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4) conchuela,SSE 4.2 with runtime check,x86_64,clang version 14.0.6 culpeo,SSE 4.2 with runtime check,x86_64,gcc (Debian 12.2.0-14) 12.2.0 desmoxytes,SSE 4.2 with runtime check,x86_64,gcc (Debian 13.2.0-6) 13.2.0 dragonet,SSE 4.2 with runtime check,x86_64,clang version 3.9.1 flaviventris,SSE 4.2 with runtime check,x86_64,gcc (Debian 20231017-1) 14.0.0 20231017 (experimental) [master r14-4678-gce55521bcd1] guaibasaurus,SSE 4.2 with runtime check,x86_64,gcc (Debian 10.2.1-6) 10.2.1 20210110 hippopotamus,SSE 4.2 with runtime check,x86_64,gcc (SUSE Linux) 7.5.0 idiacanthus,SSE 4.2 with runtime check,x86_64,clang version 5.0.2 jay,SSE 4.2 with runtime check,x86_64,clang version 13.0.1 komodoensis,SSE 4.2 with runtime check,x86_64,gcc (Debian 13.2.0-6) 13.2.0 loach,SSE 4.2 with runtime check,x86_64,clang version 14.0.6 longfin,SSE 4.2 with runtime check,x86_64,Apple clang version 14.0.3 (clang-1403.0.22.14.1) lorikeet,SSE 4.2 with runtime check,x86_64,gcc (GCC) 10.2.0 malleefowl,SSE 4.2 with runtime check,x86_64,gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219 moonjelly,SSE 4.2 with runtime check,x86_64,gcc (GCC) 14.0.0 20230429 (experimental) myna,SSE 4.2 with runtime check,x86_64,gcc (GCC) 7.3.0 peripatus,SSE 4.2 with runtime check,x86_64,FreeBSD clang version 14.0.5 (https://github.com/llvm/llvm-project.git llvmorg-14.0.5-0-gc12386ae247c) petalura,SSE 4.2 with runtime check,x86_64,clang version 4.0.1 phycodurus,SSE 4.2 with runtime check,x86_64,gcc (Debian 13.2.0-6) 13.2.0 pogona,SSE 4.2 with runtime check,x86_64,gcc (Debian 13.2.0-6) 13.2.0 rhinoceros,SSE 4.2 with runtime check,x86_64,gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) schnauzer,SSE 4.2 with runtime check,x86_64,OpenBSD clang version 13.0.0 seawasp,SSE 4.2 with runtime check,x86_64,clang version 17.0.0 (https://github.com/llvm/llvm-project.git 40222ddcf8f54fe523b2d14ab7005ebf412330f1) serinus,SSE 4.2 with runtime check,x86_64,gcc (Debian 20231017-1) 14.0.0 20231017 (experimental) [master r14-4678-gce55521bcd1] sidewinder,SSE 4.2 with runtime check,x86_64,cc (nb4 20200810) 7.5.0 spurfowl,SSE 4.2 with runtime check,x86_64,gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609 trilobite,SSE 4.2 with runtime check,x86_64,clang version 13.0.1 xenodermus,SSE 4.2 with runtime check,x86_64,clang version 6.0.1 curculio,slicing-by-8,x86_64,gcc (GCC) 4.2.1 20070719 mantid,SSE 4.2 with runtime check,x86_64 (virtualized),gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) blossomcrown,slicing-by-8,,gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 cardinalfish,slicing-by-8,,Debian clang version 14.0.6 cottonmouth,slicing-by-8,,gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0 devario,slicing-by-8,,gcc (Debian 12.2.0-9) 12.2.0
I wrote: > I did a more thorough scrape of the buildfarm results. Of 161 animals > currently reporting configure output on HEAD, we have Oh ... take "current" with a grain of salt there, because I just noticed that I typo'd my search condition so that it collected results from all systems that reported since 2022-Oct, rather than in the last month as I'd intended. There are just 137 animals currently reporting. Of those, I broke down the architectures reporting using slicing-by-8: # select arch,count(*) from results where crc = 'slicing-by-8' group by 1 order by 1; arch | count --------------------+------- aarch64 | 1 macppc | 1 mips64eb; -mabi=64 | 1 mips64el; -mabi=32 | 1 ppc64 (power7) | 4 ppc64 (power8) | 2 ppc64le | 7 ppc64le (power8) | 1 ppc64le (power9) | 15 riscv64 | 2 s390x (z15) | 14 sparc | 1 (12 rows) The one machine using slicing-by-8 where there might be a better alternative is arowana, which is CentOS 7 with a pretty ancient gcc version. So I reject the idea that slicing-by-8 is an appropriate baseline for comparisons. There isn't anybody who will see an improvement over current behavior: in the population of interest, just about all platforms are using CRC instructions with or without a runtime check. regards, tom lane
On Tue, Oct 31, 2023 at 03:16:16PM -0400, Tom Lane wrote: > Really this just reinforces my concern that doing a runtime check > all the time is on the wrong side of history. I grant that we've > got to do that for anything where the availability of the instruction > is really in serious question, but I'm not very convinced that that's > a majority situation on popular platforms. Okay. With that in mind, I think the path forward for new instructions is as follows: * If the special CRC instructions can be used with the default compiler flags, we can only use newer instructions if they can also be used with the default compiler flags. (My M2 machine appears to add +crypto by default, so I bet your buildfarm animals would fall into this bucket.) * Otherwise, if the CRC instructions can be used with added flags (i.e., the runtime check path), we can do a runtime check for the new instructions as well. (Most other buildfarm animals would fall into this bucket.) Any platform that can use the CRC instructions with default compiler flags but not the new instructions wouldn't be able to take advantage of the proposed optimization, but it also wouldn't be subject to the small performance regression. If we wanted to further eliminate runtime checks for SSE 4.2 and ARMv8, then I think things become a little trickier, as having a compiler that understands things like +crypto would mean that you're automatically subject to the runtime check regression (assuming we proceed with the proposed optimization). An alternate approach could be to only use newer instructions if they are available with the default compiler flags, but given the current state of the buildfarm, such optimizations might not get much uptake for a while. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Oct 31, 2023 at 03:42:33PM -0400, Tom Lane wrote: > The one machine using slicing-by-8 where there might be a better > alternative is arowana, which is CentOS 7 with a pretty ancient gcc > version. So I reject the idea that slicing-by-8 is an appropriate > baseline for comparisons. There isn't anybody who will see an > improvement over current behavior: in the population of interest, > just about all platforms are using CRC instructions with or without > a runtime check. I only included the slicing-by-8 benchmark to demonstrate that 1) the CRC computations are a big portion of that pg_waldump -z command and that 2) the CRC instructions provide significant performance gains. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > Okay. With that in mind, I think the path forward for new instructions is > as follows: > * If the special CRC instructions can be used with the default compiler > flags, we can only use newer instructions if they can also be used with > the default compiler flags. (My M2 machine appears to add +crypto by > default, so I bet your buildfarm animals would fall into this bucket.) > * Otherwise, if the CRC instructions can be used with added flags (i.e., > the runtime check path), we can do a runtime check for the new > instructions as well. (Most other buildfarm animals would fall into this > bucket.) This seems like a reasonable proposal. > Any platform that can use the CRC instructions with default compiler flags > but not the new instructions wouldn't be able to take advantage of the > proposed optimization, but it also wouldn't be subject to the small > performance regression. Check. For now I think that's fine. If we get to a place where this policy is really leaving a lot of performance on the table, we can revisit it ... but that situation is hypothetical and may remain so. (It's worth noting also that a package builder can move the goalposts at will, since our idea of "default flags" is really whatever the user says to use.) regards, tom lane
On Tue, Oct 31, 2023 at 04:12:40PM -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: >> Okay. With that in mind, I think the path forward for new instructions is >> as follows: > >> * If the special CRC instructions can be used with the default compiler >> flags, we can only use newer instructions if they can also be used with >> the default compiler flags. (My M2 machine appears to add +crypto by >> default, so I bet your buildfarm animals would fall into this bucket.) >> * Otherwise, if the CRC instructions can be used with added flags (i.e., >> the runtime check path), we can do a runtime check for the new >> instructions as well. (Most other buildfarm animals would fall into this >> bucket.) > > This seems like a reasonable proposal. Great. I think that leaves us with nothing left to do for this thread, so I'll withdraw it from the commitfest and move the discussion back to the original thread. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Oct 31, 2023 at 03:38:17PM -0500, Nathan Bossart wrote: > On Tue, Oct 31, 2023 at 04:12:40PM -0400, Tom Lane wrote: >> This seems like a reasonable proposal. > > Great. I think that leaves us with nothing left to do for this thread, so > I'll withdraw it from the commitfest and move the discussion back to the > original thread. (Also, thanks for the discussion.) -- Nathan Bossart Amazon Web Services: https://aws.amazon.com