Re: [PATCH] SVE popcount support - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: [PATCH] SVE popcount support
Date
Msg-id Z6ONmQVSD5Qnpbsl@nathan
In response to Re: [PATCH] SVE popcount support  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
On Tue, Feb 04, 2025 at 09:01:33AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> +    /*
>> +     * For smaller inputs, aligning the buffer degrades the performance.
>> +     * Therefore, align the buffers only when the input size is sufficiently large.
>> +     */
> 
>> Is the inverse true, i.e., does aligning the buffer improve performance for
>> larger inputs?  I'm also curious what level of performance degradation you
>> were seeing.
> 
> Here is a comparison of all three cases. Alignment is marginally better for inputs
> above 1024B, but the difference is small. Unaligned performs better for smaller inputs.
> Aligned After 128B => the current implementation "if (aligned != buf && bytes > 4 * vec_len)"
> Always Aligned => condition "bytes > 4 * vec_len" is removed.
> Unaligned => the whole if block was removed
> 
>  buf    | Always Aligned | Aligned After 128B | Unaligned
> --------+----------------+--------------------+------------
>    16   |        37.851  |           38.203   |     34.971
>    32   |        37.859  |           38.187   |     34.972
>    64   |        37.611  |           37.405   |     34.121
>   128   |        45.357  |           45.897   |     41.890
>   256   |        62.440  |           63.454   |     58.666
>   512   |       100.120  |          102.767   |     99.861
>  1024   |       159.574  |          158.594   |    164.975
>  2048   |       282.354  |          281.198   |    283.937
>  4096   |       532.038  |          531.068   |    533.699
>  8192   |      1038.973  |         1038.083   |   1039.206
> 16384   |      2028.604  |         2025.843   |   2033.940

Hm.  These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment.  Is there any reason not
to do that?
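
For reference, here is a rough portable sketch of the kind of alignment
prologue under discussion.  It is not the patch's code: it uses scalar C
instead of SVE intrinsics, and VEC_LEN, popcount_buf() and the align_head
flag are hypothetical names (VEC_LEN = 32 is only an assumption, picked so
that "4 * vec_len" matches the 128B threshold in the table).  Calling it
with align_head = 0 corresponds to the "Unaligned" column, i.e., what is
left once the alignment section is removed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the SVE vector length in bytes. */
#define VEC_LEN 32

/* Count the set bits in a single byte (clears the lowest set bit per step). */
static uint64_t
popcount_byte(unsigned char b)
{
    uint64_t    n = 0;

    for (; b != 0; b &= (unsigned char) (b - 1))
        n++;
    return n;
}

/* Count the set bits in one VEC_LEN-byte chunk, 8 bytes at a time. */
static uint64_t
popcount_chunk(const unsigned char *p)
{
    uint64_t    n = 0;

    for (size_t i = 0; i < VEC_LEN; i += 8)
    {
        uint64_t    w;

        memcpy(&w, p + i, sizeof(w));   /* safe for unaligned addresses */
        for (; w != 0; w &= w - 1)
            n++;
    }
    return n;
}

/*
 * Popcount over buf[0 .. bytes-1].  If align_head is nonzero and the input
 * is larger than 4 * VEC_LEN, leading bytes are consumed one at a time until
 * the address is a multiple of VEC_LEN; otherwise the chunk loop starts at
 * buf directly.
 */
static uint64_t
popcount_buf(const unsigned char *buf, size_t bytes, int align_head)
{
    uint64_t    count = 0;
    size_t      i = 0;

    assert(buf != NULL || bytes == 0);

    if (align_head && bytes > 4 * VEC_LEN)
    {
        size_t      misalign = (size_t) ((uintptr_t) buf % VEC_LEN);
        size_t      head = misalign ? VEC_LEN - misalign : 0;

        for (; i < head; i++)
            count += popcount_byte(buf[i]);
    }

    /* Whole chunks. */
    for (; i + VEC_LEN <= bytes; i += VEC_LEN)
        count += popcount_chunk(buf + i);

    /* Remaining tail bytes. */
    for (; i < bytes; i++)
        count += popcount_byte(buf[i]);

    return count;
}
```

Both settings of align_head return the same count; only the access pattern
of the main loop differs, which is the part the numbers above are measuring.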

+    /* Process 2 complete vectors */
+    for (; i < loop_bytes; i += vec_len * 2)
+    {
+        vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+        accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+        vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+        accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+    }

Does this hand-rolled loop unrolling offer any particular advantage?  What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?
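
To spell out the two alternatives being asked about, here is a rough scalar
sketch.  It is an illustration only, not code from the patch: each 64-bit
word stands in for one SVE vector, and popcount_masked_x1() and
popcount_masked_x4() are hypothetical names.  The first variant has no
unrolling; the second handles four words per iteration with independent
accumulators, analogous to accum1/accum2 in the quoted loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the set bits in a 64-bit word. */
static uint64_t
popcount64(uint64_t w)
{
    uint64_t    n = 0;

    for (; w != 0; w &= w - 1)
        n++;
    return n;
}

/* Load 8 bytes from a possibly unaligned address. */
static uint64_t
load64(const unsigned char *p)
{
    uint64_t    w;

    memcpy(&w, p, sizeof(w));
    return w;
}

/* No unrolling: one word per iteration, one accumulator. */
static uint64_t
popcount_masked_x1(const unsigned char *buf, size_t bytes, unsigned char mask)
{
    /* Replicate the mask byte into all 8 bytes of a word. */
    uint64_t    mask64 = ~UINT64_C(0) / 0xFF * mask;
    uint64_t    accum = 0;
    size_t      i = 0;

    assert(buf != NULL || bytes == 0);

    for (; i + 8 <= bytes; i += 8)
        accum += popcount64(load64(buf + i) & mask64);
    for (; i < bytes; i++)
        accum += popcount64((uint64_t) (buf[i] & mask));
    return accum;
}

/* Four words per iteration, each with its own accumulator. */
static uint64_t
popcount_masked_x4(const unsigned char *buf, size_t bytes, unsigned char mask)
{
    uint64_t    mask64 = ~UINT64_C(0) / 0xFF * mask;
    uint64_t    accum1 = 0, accum2 = 0, accum3 = 0, accum4 = 0;
    size_t      i = 0;

    assert(buf != NULL || bytes == 0);

    for (; i + 32 <= bytes; i += 32)
    {
        accum1 += popcount64(load64(buf + i) & mask64);
        accum2 += popcount64(load64(buf + i + 8) & mask64);
        accum3 += popcount64(load64(buf + i + 16) & mask64);
        accum4 += popcount64(load64(buf + i + 24) & mask64);
    }

    /* Leftover whole words, then leftover bytes. */
    for (; i + 8 <= bytes; i += 8)
        accum1 += popcount64(load64(buf + i) & mask64);
    for (; i < bytes; i++)
        accum1 += popcount64((uint64_t) (buf[i] & mask));

    return accum1 + accum2 + accum3 + accum4;
}
```

Timing both variants over the same buffer sizes as in the table above (and
the equivalent SVE versions on real hardware) would show whether the extra
accumulators pay for themselves.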

-- 
nathan


