On Thu, Jul 22, 2021 at 6:16 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Neat! It's good to make it more architecture-agnostic, and I'm sure we can use quite a bit of this.
One question is whether this "one size fits all" approach will be
extensible to wider SIMD.
> to_bool(const pg_u8x16_t v)
> {
> +#if defined(USE_NEON)
> + return vmaxvq_u32((uint32x4_t) v) != 0;
>
> --> return vmaxvq_u8(*this) != 0;
I chose that lane width because I saw an unsubstantiated claim
somewhere that it might be faster, but I have no idea if it matters.
The u8 code looks more natural anyway. Changed.
> vzero()
> {
> +#if defined(USE_NEON)
> + return vmovq_n_u8(0);
>
> --> return vdupq_n_u8(0); // or equivalently, splat(0)
I guess it doesn't make a difference which builtin you use here, but I
was influenced by the ARM manual which says the vdupq form is
generated for immediate values.
> is_highbit_set(const pg_u8x16_t v)
> {
> +#if defined(USE_NEON)
> + return to_bool(bitwise_and(v, vmovq_n_u8(0x80)));
>
> --> return vmaxq_u8(v) > 0x7F
Ah, of course. Much nicer!
> +#if defined(USE_NEON)
> +static pg_attribute_always_inline pg_u8x16_t
> +vset(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
> + uint8 v4, uint8 v5, uint8 v6, uint8 v7,
> + uint8 v8, uint8 v9, uint8 v10, uint8 v11,
> + uint8 v12, uint8 v13, uint8 v14, uint8 v15)
> +{
> + uint8 pg_attribute_aligned(16) values[16] = {
> + v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15
> + };
> + return vld1q_u8(values);
> +}
>
> --> They have this strange beast instead:
>
> // Doing a load like so end ups generating worse code.
> // uint8_t array[16] = {x1, x2, x3, x4, x5, x6, x7, x8,
> // x9, x10,x11,x12,x13,x14,x15,x16};
> // return vld1q_u8(array);
> uint8x16_t x{};
> // incredibly, Visual Studio does not allow x[0] = x1
> x = vsetq_lane_u8(x1, x, 0);
> x = vsetq_lane_u8(x2, x, 1);
> x = vsetq_lane_u8(x3, x, 2);
> ...
> x = vsetq_lane_u8(x15, x, 14);
> x = vsetq_lane_u8(x16, x, 15);
> return x;
>
> Since you aligned the array, that might not have the problem alluded to above, and it looks nicer.
Strange indeed. We should probably poke around in the assember and
see... it might be that MSVC doesn't like it, and I was just
cargo-culting the alignment. I don't expect the generated code to
really "load" anything of course, it should ideally be some kind of
immediate mov...
FWIW here are some performance results from my humble RPI4:
master:
chinese | mixed | ascii
---------+-------+-------
4172 | 2763 | 1823
(1 row)
Your v15 patch:
chinese | mixed | ascii
---------+-------+-------
2267 | 1248 | 399
(1 row)
Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:
chinese | mixed | ascii
---------+-------+-------
909 | 620 | 318
(1 row)
It's so good I wonder if it's producing incorrect results :-)
I also tried to do a quick and dirty AltiVec patch to see if it could
fit into the same code "shape", with less immediate success: it works
out slower than the fallback code on the POWER7 machine I scrounged an
account on. I'm not sure what's wrong there, but maybe it's a uesful
start (I'm probably confused about endianness, or the encoding of
boolean vectors which may be different (is true 0x01or 0xff, does it
matter?), or something else, and it's falling back on errors all the
time?).