Re: [PATCH] Hex-coding optimizations using SVE on ARM. - Mailing list pgsql-hackers
From | Nathan Bossart |
---|---|
Subject | Re: [PATCH] Hex-coding optimizations using SVE on ARM. |
Date | |
Msg-id | aNWO7L43UevRErw_@nathan Whole thread Raw |
In response to | Re: [PATCH] Hex-coding optimizations using SVE on ARM. (John Naylor <johncnaylorls@gmail.com>) |
List | pgsql-hackers |
On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote: > + if (unlikely(!success)) > + i = 0; > > This is after the main loop exits, and the cold path is literally one > instruction, so the motivation is not apparent to me. Removed. I was thinking about smaller inputs when I added this, but it probably makes little difference. >> Yeah, the compiler refuses unless the value is an integer literal. I >> thought of using a switch statement to cover all the values used in-tree, >> but I didn't like that, either. > > Neither option is great, but I mildly lean towards keeping it internal > with "switch" or whatever: By putting the burden of specifying shift > amounts on separately named functions we run a risk of combinatorial > explosion in function names. Done. > (Separately, now I'm wondering if we can do the same for > vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that > would allow getting rid of ) I think so. I doubt there's any performance advantage, but it could be nice for code cleanup. (I'm assuming you meant to say vector8_ssub (renamed to vector8_ussub() in the patch) after "getting rid of.") I'll do this in the related patch in the "couple of small patches for simd.h" thread. > + * back together to form the final hex-encoded string. It might be > + * possible to squeeze out a little more gain by manually unrolling the > + * loop, but for now we don't bother. > > My position (and I think the community agrees) is that manual > unrolling is a rare desperation move that has to be justified, so we > don't need to mention its lack. Removed. > + * Some compilers are picky about casts to the same underlying type, and others > + * are picky about implicit conversions with vector types. This function does > + * the same thing as vector32_broadcast(), but it returns a Vector8 and is > + * carefully crafted to avoid compiler indigestion. > + */ > +#ifndef USE_NO_SIMD > +static inline Vector8 > +vector8_broadcast_u32(const uint32 c) > +{ > +#ifdef USE_SSE2 > + return vector32_broadcast(c); > +#elif defined(USE_NEON) > + return (Vector8) vector32_broadcast(c); > +#endif > +} > > I'm ambivalent about this: The use case doesn't seem well motivated, > since I don't know why we'd actually need to both broadcast arbitrary > integers and also view the result as bytes. Setting arbitrary bytes is > what we're really doing, and would be more likely be useful in the > future (attached, only tested on x86, and I think part of the > strangeness is the endianness you mentioned above). On the other hand, > the Arm workaround results in awful generated code compared to what > you have here. Since the "set" should be hoisted out of the outer > loop, and we already rely on this pattern for vector8_highbit_mask > anyway, it might be tolerable, and we can reduce the pain with bitwise > NOT. I think I disagree on this one. We're not broadcasting arbitrary bytes for every vector element, we're broadcasting a patten of bytes that happens to be wider than the element size. I would expect this to be a relatively common use-case. Furthermore, the "set" API is closely tethered to the vector size, which is fine for SSE2/Neon but may not work down the road (not to mention the USE_NO_SIMD path). Also, the bitwise NOT approach won't work because we need to use 0x0f000f00 and 0x000f000f to avoid angering the assertion in vector8_pack_16(), as mentioned below. > +/* > + * Pack 16-bit elements in the given vectors into a single vector of 8-bit > + * elements. NB: The upper 8-bits of each 16-bit element must be zeros, else > + * this will produce different results on different architectures. > + */ > > v10 asserted this requirement -- that still seems like a good thing? I had removed that because I worried the accumulator approach would cause it to fail (it does), but looking again, that's easy enough to work around. -- nathan
Attachment
pgsql-hackers by date: