Re: add AVX2 support to simd.h - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: add AVX2 support to simd.h
Msg-id 20240108173715.GC2611898@nathanxps13
In response to Re: add AVX2 support to simd.h  (John Naylor <johncnaylorls@gmail.com>)
On Mon, Jan 08, 2024 at 02:01:39PM +0700, John Naylor wrote:
> On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>>   writers    sse2    avx2     %
>>       256    1195    1188    -1
>>       512     928    1054   +14
>>      1024     633     716   +13
>>      2048     332     420   +27
>>      4096     162     203   +25
>>      8192     162     182   +12
> 
> There doesn't seem to be any benefit at 256 at all. Is that expected
> and/or fine?

My unverified assumption is that the linear searches make up much less of
the benchmark at these lower client counts, so any improvements we make to
them are unlikely to show up there.  IIRC even the hash table approach that
we originally explored for XidInMVCCSnapshot() didn't do much, if anything,
for the benchmark at lower client counts.

> Here, the peak throughput seems to be around 64 writers with or
> without the patch from a couple years ago, but the slope is shallower
> after that. It would be good to make sure that it can't regress near
> the peak, even with a "long tail" case (see next paragraph). The first
> benchmark above starts at 256, so we can't tell where the peak is. It
> might be worth it to also have a microbenchmark because the systemic
> one has enough noise to obscure what's going on unless there are a
> very large number of writers. We know what a systemic benchmark can
> tell us on extreme workloads past the peak, and the microbenchmark
> would tell us "we need to see X improvement here in order to see Y
> improvement in the system benchmark".

Yes, will do.

> I suspect that there could be a regression lurking for some inputs
> that the benchmark doesn't look at: pg_lfind32() currently needs to be
> able to read 4 vector registers worth of elements before taking the
> fast path. There is then a tail of up to 15 elements that are now
> checked one-by-one, but AVX2 would increase that to 31. That's getting
> big enough to be noticeable, I suspect. It would be good to understand
> that case (n*32 + 31), because it may also be relevant now. It's also
> easy to improve for SSE2/NEON for v17.

Good idea.  If it is indeed noticeable, we might be able to "fix" it by
processing some of the tail with shorter vectors.  But that probably means
finding a way to support multiple vector sizes on the same build, which
would require some work.
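To illustrate the idea (this is a rough scalar model, not the actual simd.h
code; the helper names are hypothetical and "BLOCK" stands in for one
vector's worth of elements): instead of falling back to a one-by-one scalar
loop for the tail, the search could finish with a single overlapping block
ending at the last element.  Re-checking a few elements twice is harmless
for a membership test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one vector's worth of elements (e.g. 8 for AVX2 int32). */
#define BLOCK 8

/* Stand-in for a vector compare over BLOCK elements. */
static bool
block_contains(uint32_t key, const uint32_t *p)
{
    for (size_t i = 0; i < BLOCK; i++)
        if (p[i] == key)
            return true;
    return false;
}

/* Hypothetical sketch: search "base[0..n)" for "key", handling the tail
 * with one overlapping block instead of a scalar loop. */
static bool
lfind32_overlap(uint32_t key, const uint32_t *base, size_t n)
{
    size_t i;

    if (n < BLOCK)
    {
        /* Too small for even one block; plain scalar search. */
        for (i = 0; i < n; i++)
            if (base[i] == key)
                return true;
        return false;
    }

    /* Full blocks. */
    for (i = 0; i + BLOCK <= n; i += BLOCK)
        if (block_contains(key, base + i))
            return true;

    /* Tail: re-scan the last BLOCK elements with one overlapping block. */
    if (i < n)
        return block_contains(key, base + (n - BLOCK));

    return false;
}
```

This keeps the tail cost at one extra vector compare regardless of whether
the remainder is 1 element or 31, at the price of the overlapping re-check.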

> Also, by reading 4 registers per loop iteration, that's 128 bytes on
> AVX2. I'm not sure that matters, but we shouldn't assume it doesn't.
> Code I've seen elsewhere reads a fixed 64-byte block, and then uses 1,
> 2, or 4 registers to handle it, depending on architecture. Whether or
> not that's worth it in this case, this patch does mean future patches
> will have to wonder if they have to do anything differently depending
> on vector length, whereas now they don't. That's not a deal-breaker,
> but it is a trade-off to keep in mind.

Yeah.  Presently, this AVX2 patch just kicks the optimization down the road
a bit for the existing use-cases: you don't start using the vector
registers until there's more data to work with, which might not even be
noticeable in practice.  But it's conceivable that vector length could
matter at some point, even if it doesn't matter much now.
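For reference, the fixed-block pattern John describes could look roughly
like this (a scalar model under assumed names, not the actual simd.h code):
always consume 64 bytes per iteration, and let the vector width decide how
many registers that takes (4 x 16-byte SSE2/NEON, 2 x 32-byte AVX2), so the
loop structure never depends on vector length.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Always process a fixed 64-byte block per iteration, independent of the
 * vector width available on the build. */
#define BLOCK_BYTES 64
#define ELEMS_PER_ITER (BLOCK_BYTES / sizeof(uint32_t))     /* 16 */

/* Stand-in for 2 (AVX2) or 4 (SSE2/NEON) vector compares OR'd together. */
static bool
chunk_contains(uint32_t key, const uint32_t *p)
{
    bool found = false;

    for (size_t i = 0; i < ELEMS_PER_ITER; i++)
        found |= (p[i] == key);
    return found;
}

/* Hypothetical sketch: search "base[0..n)" for "key" in fixed 64-byte
 * chunks, with a scalar tail as in the current code. */
static bool
lfind32_fixed_block(uint32_t key, const uint32_t *base, size_t n)
{
    size_t i;

    for (i = 0; i + ELEMS_PER_ITER <= n; i += ELEMS_PER_ITER)
        if (chunk_contains(key, base + i))
            return true;

    for (; i < n; i++)          /* scalar tail */
        if (base[i] == key)
            return true;
    return false;
}
```

The design point is that ELEMS_PER_ITER stays constant across builds, so
future callers never have to reason about the vector length, only about
the fixed 64-byte granularity.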

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


