Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Subject: Re: [PATCH] Hex-coding optimizations using SVE on ARM.
Msg-id: aNWO7L43UevRErw_@nathan
In response to: Re: [PATCH] Hex-coding optimizations using SVE on ARM.  (John Naylor <johncnaylorls@gmail.com>)
List: pgsql-hackers
On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
> + if (unlikely(!success))
> + i = 0;
> 
> This is after the main loop exits, and the cold path is literally one
> instruction, so the motivation is not apparent to me.

Removed.  I was thinking about smaller inputs when I added this, but it
probably makes little difference.

>> Yeah, the compiler refuses unless the value is an integer literal.  I
>> thought of using a switch statement to cover all the values used in-tree,
>> but I didn't like that, either.
> 
> Neither option is great, but I mildly lean towards keeping it internal
> with "switch" or whatever: By putting the burden of specifying shift
> amounts on separately named functions we run a risk of combinatorial
> explosion in function names.

Done.
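
For the archives, here is roughly what the switch approach looks like (a
sketch only; the function name, lane width, and supported shift amounts are
stand-ins rather than what the patch actually uses).  The Neon "_n_"
intrinsics require the count to be an integer constant expression, so a
wrapper that takes a runtime argument has to dispatch to literals, while
the SSE2 intrinsic is happy with a variable:

static inline Vector8
vector8_shift_right(const Vector8 v, const int n)
{
#ifdef USE_SSE2
    /* _mm_srli_epi16() accepts a non-constant count */
    return _mm_srli_epi16(v, n);
#elif defined(USE_NEON)
    /* vshrq_n_u16() insists on a literal, so enumerate the in-tree values */
    switch (n)
    {
        case 4:
            return vreinterpretq_u8_u16(vshrq_n_u16(vreinterpretq_u16_u8(v), 4));
        case 8:
            return vreinterpretq_u8_u16(vshrq_n_u16(vreinterpretq_u16_u8(v), 8));
        default:
            Assert(false);
            return v;
    }
#endif
}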

> (Separately, now I'm wondering if we can do the same for
> vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> would allow getting rid of )

I think so.  I doubt there's any performance advantage, but it could be
nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
(renamed to vector8_ussub() in the patch) after "getting rid of.")  I'll
do this in the related patch in the "couple of small patches for simd.h"
thread.
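
Something like the following is what I have in mind (an untested sketch,
with the USE_NO_SIMD path left out):

static inline bool
vector8_has_le(const Vector8 v, const uint8 c)
{
#ifdef USE_SSE2
    /* an unsigned byte x satisfies x <= c iff min(x, c) == x */
    Vector8     cmp = _mm_cmpeq_epi8(_mm_min_epu8(v, vector8_broadcast(c)), v);

    return _mm_movemask_epi8(cmp) != 0;
#elif defined(USE_NEON)
    /* the smallest element is <= c iff at least one element is <= c */
    return vminvq_u8(v) <= c;
#endif
}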

> + * back together to form the final hex-encoded string.  It might be
> + * possible to squeeze out a little more gain by manually unrolling the
> + * loop, but for now we don't bother.
> 
> My position (and I think the community agrees) is that manual
> unrolling is a rare desperation move that has to be justified, so we
> don't need to mention its lack.

Removed.

> + * Some compilers are picky about casts to the same underlying type, and others
> + * are picky about implicit conversions with vector types.  This function does
> + * the same thing as vector32_broadcast(), but it returns a Vector8 and is
> + * carefully crafted to avoid compiler indigestion.
> + */
> +#ifndef USE_NO_SIMD
> +static inline Vector8
> +vector8_broadcast_u32(const uint32 c)
> +{
> +#ifdef USE_SSE2
> + return vector32_broadcast(c);
> +#elif defined(USE_NEON)
> + return (Vector8) vector32_broadcast(c);
> +#endif
> +}
> 
> I'm ambivalent about this: The use case doesn't seem well motivated,
> since I don't know why we'd actually need to both broadcast arbitrary
> integers and also view the result as bytes. Setting arbitrary bytes is
> what we're really doing, and would be more likely be useful in the
> future (attached, only tested on x86, and I think part of the
> strangeness is the endianness you mentioned above). On the other hand,
> the Arm workaround results in awful generated code compared to what
> you have here. Since the "set" should be hoisted out of the outer
> loop, and we already rely on this pattern for vector8_highbit_mask
> anyway, it might be tolerable, and we can reduce the pain with bitwise
> NOT.

I think I disagree on this one.  We're not broadcasting arbitrary bytes for
every vector element, we're broadcasting a pattern of bytes that happens to
be wider than the element size.  I would expect this to be a relatively
common use case.  Furthermore, the "set" API is closely tethered to the
vector size, which is fine for SSE2/Neon but may not work down the road
(not to mention the USE_NO_SIMD path).  Also, the bitwise NOT approach
won't work because we need to use 0x0f000f00 and 0x000f000f (neither of
which is the complement of the other) to avoid angering the assertion in
vector8_pack_16(), as mentioned below.

> +/*
> + * Pack 16-bit elements in the given vectors into a single vector of 8-bit
> + * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
> + * this will produce different results on different architectures.
> + */
> 
> v10 asserted this requirement -- that still seems like a good thing?

I had removed that because I worried the accumulator approach would cause
it to fail (it does), but looking again, that's easy enough to work around.
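
For reference, roughly the shape the restored check could take (a
reconstruction for illustration, with a placeholder helper name; it may
end up simply inlined at the top of vector8_pack_16()):

#ifndef USE_NO_SIMD
static inline void
vector8_pack_16_check_args(const Vector8 v1, const Vector8 v2)
{
#ifdef USE_ASSERT_CHECKING
    uint16      e1[sizeof(Vector8) / sizeof(uint16)];
    uint16      e2[sizeof(Vector8) / sizeof(uint16)];

    memcpy(e1, &v1, sizeof(e1));
    memcpy(e2, &v2, sizeof(e2));

    /* the upper byte of every 16-bit element must be zero */
    for (int i = 0; i < (int) lengthof(e1); i++)
        Assert((e1[i] >> 8) == 0 && (e2[i] >> 8) == 0);
#endif
}
#endif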

-- 
nathan

