On Sat, Nov 11, 2023 at 07:38:59PM +0700, John Naylor wrote:
> On Tue, Nov 7, 2023 at 9:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Separately, I'm wondering whether we should consider using CFLAGS_VECTORIZE
>> on the whole tree. Commit fdea253 seems to be responsible for introducing
>> this targeted autovectorization strategy, and AFAICT this was just done to
>> minimize the impact elsewhere while optimizing page checksums. Are there
>> fundamental problems with adding CFLAGS_VECTORIZE everywhere? Or is it
>> just waiting on someone to do the analysis/benchmarking?
>
> It's already the default for gcc 12 with -O2 (looking further in the
> docs, it uses the "very-cheap" vectorization cost model), so it may be
> worth investigating what the effect of that was. I can't quickly find
> the equivalent info for clang.
My x86 machine is using gcc 9.4.0, which isn't even aware of "very-cheap".
I don't see any difference with any of the cost models, though. It isn't
until I add -O3 that I see things like pg_checksum_block being inlined into
pg_checksum_page. -O3 also generates far more SSE2 instructions. I'll have
to check whether gcc 12 finds anything else within Postgres to
autovectorize with its "very-cheap" cost model...
> That being the case, if the difference you found was real, it must
> have been due to unrolling loops. What changed in the binary?

For gcc 9.4.0 on x86, the autovectorization flag alone indeed makes no
difference, while the loop unrolling one does. For Apple clang 14.0.0 on
an M2, both flags seem to generate very different machine code.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com