Thread: optimize several list functions with SIMD intrinsics

optimize several list functions with SIMD intrinsics

From
Nathan Bossart
Date:
I noticed that several of the List functions do simple linear searches that
can be optimized with SIMD intrinsics (as was done for XidInMVCCSnapshot in
37a6e5d).  The following table shows the time spent iterating over a list
of n elements (via list_member_int) one billion times on my x86 laptop.

      n   | head (ms) | patched (ms) 
    ------+-----------+--------------
        2 |      3884 |         3001
        4 |      5506 |         4092
        8 |      6209 |         3026
       16 |      8797 |         4458
       32 |     25051 |         7032
       64 |     37611 |        12763
      128 |     61886 |        22770
      256 |    111170 |        59885
      512 |    209612 |       103378
     1024 |    407462 |       189484

I've attached a work-in-progress patch that implements these optimizations
for both x86 and arm, and I will register this in the July commitfest.  I'm
posting this a little early in order to gauge interest.  Presumably you
shouldn't be using a List if you have many elements that must be routinely
searched, but it might be nice to speed up these functions, anyway.  This
was mostly a fun weekend project, and I don't presently have any concrete
examples of workloads where this might help.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: optimize several list functions with SIMD intrinsics

From
David Rowley
Date:
On Wed, 8 Mar 2023 at 13:25, Nathan Bossart <nathandbossart@gmail.com> wrote:
> I've attached a work-in-progress patch that implements these optimizations
> for both x86 and arm, and I will register this in the July commitfest.  I'm
> posting this a little early in order to gauge interest.

Interesting and quite impressive performance numbers.

From having a quick glance at the patch, it looks like you'll need to
take some extra time to make it work on 32-bit builds.

David



Re: optimize several list functions with SIMD intrinsics

From
Nathan Bossart
Date:
On Wed, Mar 08, 2023 at 01:54:15PM +1300, David Rowley wrote:
> Interesting and quite impressive performance numbers.

Thanks for taking a look.

> From having a quick glance at the patch, it looks like you'll need to
> take some extra time to make it work on 32-bit builds.

At the moment, the support for SIMD intrinsics in Postgres is limited to
64-bit (simd.h has the details).  But yes, if we want to make this work for
32-bit builds, additional work is required.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: optimize several list functions with SIMD intrinsics

From
Nathan Bossart
Date:
cfbot's Windows build wasn't happy with a couple of casts.  I applied a
fix similar to c6a43c2 in v2.  The patch is still very much a work in
progress.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: optimize several list functions with SIMD intrinsics

From
Ankit Kumar Pandey
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

Hello, 

Adding some review comments:

1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from 
#ifdef USE_NO_SIMD
    const ListCell *cell;
#endif
to #else like as mentioned below? This will make visual separation between #if cases more cleaner
#else
      const ListCell *cell;

    foreach(cell, list)
    {
        if (lfirst(cell) == datum)
            return true;
    }

    return false;

#endif

2. Lots of duplicated
if (list == NIL) checks before calling  list_member_inline_internal(list, datum)
Can we not add this check in list_member_inline_internal itself?
3. if (cell)
        return list_delete_cell(list, cell) in list_delete_int/oid
can we change if (cell) to if (cell != NULL) as used elsewhere?
4. list_member_inline_interal_idx , there is typo in name.
5. list_member_inline_interal_idx and list_member_inline_internal are practically same with almost 90+ % duplication.
can we not refactor both into one and allow cell or true/false as return? Although I see usage of const ListCell so
thismight be problematic.
 
6. Loop for (i = 0; i < tail_idx; i += nelem_per_iteration) for Vector32 in list.c looks duplicated from pg_lfind32 (in
pg_lfind.h),can we not reuse that?
 
7. Is it possible to add a benchmark which shows improvement against HEAD ?

Regards,
Ankit

The new status of this patch is: Waiting on Author

Re: optimize several list functions with SIMD intrinsics

From
Ankit Kumar Pandey
Date:
 > 7. Is it possible to add a benchmark which shows improvement against 
HEAD ?


Please ignore this from my earlier mail, I just saw stats now.

Thanks,

Ankit





Re: optimize several list functions with SIMD intrinsics

From
Nathan Bossart
Date:
Thanks for taking a look.

On Sat, Mar 11, 2023 at 09:41:18AM +0000, Ankit Kumar Pandey wrote:
> 1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from 
> #ifdef USE_NO_SIMD
>     const ListCell *cell;
> #endif
> to #else like as mentioned below? This will make visual separation between #if cases more cleaner

I would expect to see -Wdeclaration-after-statement warnings if we did
this.

> 2. Lots of duplicated
> if (list == NIL) checks before calling  list_member_inline_internal(list, datum)
> Can we not add this check in list_member_inline_internal itself?

We probably could.  I only extracted the NIL checks to simplify the code in
list_member_inline_internal() a bit.

> 3. if (cell)
>         return list_delete_cell(list, cell) in list_delete_int/oid
> can we change if (cell) to if (cell != NULL) as used elsewhere?

Sure.

> 4. list_member_inline_interal_idx , there is typo in name.

Will fix.

> 5. list_member_inline_interal_idx and list_member_inline_internal are practically same with almost 90+ %
duplication.
> can we not refactor both into one and allow cell or true/false as return? Although I see usage of const ListCell so
thismight be problematic.
 

The idea was to skip finding the exact match if all we care about is
whether the element exists.  This micro-optimization might be negligible,
in which case we could use list_member_inline_internal_idx() for both
cases.

> 6. Loop for (i = 0; i < tail_idx; i += nelem_per_iteration) for Vector32 in list.c looks duplicated from pg_lfind32
(inpg_lfind.h), can we not reuse that?
 

The list.c version is slightly different because we need to disregard any
matches that we find in between the data.  For example, in an integer List,
the integer will take up 4 bytes of the 8-byte ListCell (for 64-bit
platforms).

    typedef union ListCell
    {
        void       *ptr_value;
        int            int_value;
        Oid            oid_value;
        TransactionId xid_value;
    } ListCell;

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: optimize several list functions with SIMD intrinsics

From
Ankit Kumar Pandey
Date:
Agree with your points Nathan. Just a headup.

> On 14/03/23 03:10, Nathan Bossart wrote:
> On Sat, Mar 11, 2023 at 09:41:18AM +0000, Ankit Kumar Pandey wrote:
>> 1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from 
>> #ifdef USE_NO_SIMD
>>     const ListCell *cell;
>> #endif
>> to #else like as mentioned below? This will make visual separation between #if cases more cleaner
> I would expect to see -Wdeclaration-after-statement warnings if we did
> this.

This worked fine for me, no warnings on gcc 12.2.0. Not a concern though.

Thanks,

Ankit




Re: optimize several list functions with SIMD intrinsics

From
Nathan Bossart
Date:
On Wed, Mar 15, 2023 at 07:31:46PM +0530, Ankit Kumar Pandey wrote:
>> On 14/03/23 03:10, Nathan Bossart wrote:
>> On Sat, Mar 11, 2023 at 09:41:18AM +0000, Ankit Kumar Pandey wrote:
>> > 1. In list_member_ptr, will it be okay to bring `const ListCell
>> > *cell` from #ifdef USE_NO_SIMD
>> >     const ListCell *cell;
>> > #endif
>> > to #else like as mentioned below? This will make visual separation between #if cases more cleaner
>> I would expect to see -Wdeclaration-after-statement warnings if we did
>> this.
> 
> This worked fine for me, no warnings on gcc 12.2.0. Not a concern though.

Did you try building without SIMD support?  This is what I see:

    list.c: In function ‘list_member_ptr’:
    list.c:697:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
      697 |  const ListCell *cell;
          |  ^~~~~

If your build doesn't have USE_NO_SIMD defined, this warning won't appear
because the code in question will be compiled out.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: optimize several list functions with SIMD intrinsics

From
Ankit Kumar Pandey
Date:
On 15/03/23 21:53, Nathan Bossart wrote:

> Did you try building without SIMD support?  This is what I see:

>    list.c: In function ‘list_member_ptr’:
>    list.c:697:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
>      697 |  const ListCell *cell;
>          |  ^~~~~

> If your build doesn't have USE_NO_SIMD defined, this warning won't appear
> because the code in question will be compiled out.

My mistake, I tried with USE_NO_SIMD defined and it showed the warning. Sorry for the noise.

Regards,
Ankit




Re: optimize several list functions with SIMD intrinsics

From
Nathan Bossart
Date:
Here is a new patch set.  I've split it into two patches: one for the
64-bit functions, and one for the 32-bit functions.  I've also added tests
for pg_lfind64/pg_lfind64_idx and deduplicated the code a bit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: optimize several list functions with SIMD intrinsics

From
John Naylor
Date:

On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> was mostly a fun weekend project, and I don't presently have any concrete
> examples of workloads where this might help.

It seems like that should be demonstrated before seriously considering this, like a profile where the relevant list functions show up.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: optimize several list functions with SIMD intrinsics

From
Nathan Bossart
Date:
On Fri, Apr 21, 2023 at 01:50:34PM +0700, John Naylor wrote:
> On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart <nathandbossart@gmail.com>
> wrote:
>> was mostly a fun weekend project, and I don't presently have any concrete
>> examples of workloads where this might help.
> 
> It seems like that should be demonstrated before seriously considering
> this, like a profile where the relevant list functions show up.

Agreed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: optimize several list functions with SIMD intrinsics

From
Heikki Linnakangas
Date:
On 21/04/2023 23:33, Nathan Bossart wrote:
> On Fri, Apr 21, 2023 at 01:50:34PM +0700, John Naylor wrote:
>> On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart <nathandbossart@gmail.com>
>> wrote:
>>> was mostly a fun weekend project, and I don't presently have any concrete
>>> examples of workloads where this might help.
>>
>> It seems like that should be demonstrated before seriously considering
>> this, like a profile where the relevant list functions show up.
> 
> Agreed.

Grepping for "tlist_member" and "list_delete_ptr", I don't see any 
callers in hot codepaths where this could make a noticeable difference. 
So I've marked this as Returned with Feedback in the commitfest.

> I noticed that several of the List functions do simple linear searches that
> can be optimized with SIMD intrinsics (as was done for XidInMVCCSnapshot in
> 37a6e5d).  The following table shows the time spent iterating over a list
> of n elements (via list_member_int) one billion times on my x86 laptop.
> 
> 
>       n   | head (ms) | patched (ms) 
>     ------+-----------+--------------
>         2 |      3884 |         3001
>         4 |      5506 |         4092
>         8 |      6209 |         3026
>        16 |      8797 |         4458
>        32 |     25051 |         7032
>        64 |     37611 |        12763
>       128 |     61886 |        22770
>       256 |    111170 |        59885
>       512 |    209612 |       103378
>      1024 |    407462 |       189484

I'm surprised to see an improvement with n=2 and n=2. AFAICS, the 
vectorization only kicks in when n >= 8.

-- 
Heikki Linnakangas
Neon (https://neon.tech)