Thread: optimize several list functions with SIMD intrinsics
I noticed that several of the List functions do simple linear searches that can be optimized with SIMD intrinsics (as was done for XidInMVCCSnapshot in 37a6e5d). The following table shows the time spent iterating over a list of n elements (via list_member_int) one billion times on my x86 laptop. n | head (ms) | patched (ms) ------+-----------+-------------- 2 | 3884 | 3001 4 | 5506 | 4092 8 | 6209 | 3026 16 | 8797 | 4458 32 | 25051 | 7032 64 | 37611 | 12763 128 | 61886 | 22770 256 | 111170 | 59885 512 | 209612 | 103378 1024 | 407462 | 189484 I've attached a work-in-progress patch that implements these optimizations for both x86 and arm, and I will register this in the July commitfest. I'm posting this a little early in order to gauge interest. Presumably you shouldn't be using a List if you have many elements that must be routinely searched, but it might be nice to speed up these functions, anyway. This was mostly a fun weekend project, and I don't presently have any concrete examples of workloads where this might help. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, 8 Mar 2023 at 13:25, Nathan Bossart <nathandbossart@gmail.com> wrote: > I've attached a work-in-progress patch that implements these optimizations > for both x86 and arm, and I will register this in the July commitfest. I'm > posting this a little early in order to gauge interest. Interesting and quite impressive performance numbers. From having a quick glance at the patch, it looks like you'll need to take some extra time to make it work on 32-bit builds. David
On Wed, Mar 08, 2023 at 01:54:15PM +1300, David Rowley wrote: > Interesting and quite impressive performance numbers. Thanks for taking a look. > From having a quick glance at the patch, it looks like you'll need to > take some extra time to make it work on 32-bit builds. At the moment, the support for SIMD intrinsics in Postgres is limited to 64-bit (simd.h has the details). But yes, if we want to make this work for 32-bit builds, additional work is required. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
cfbot's Windows build wasn't happy with a couple of casts. I applied a fix similar to c6a43c2 in v2. The patch is still very much a work in progress. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: not tested Spec compliant: not tested Documentation: not tested Hello, Adding some review comments: 1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from #ifdef USE_NO_SIMD const ListCell *cell; #endif to #else like as mentioned below? This will make visual separation between #if cases more cleaner #else const ListCell *cell; foreach(cell, list) { if (lfirst(cell) == datum) return true; } return false; #endif 2. Lots of duplicated if (list == NIL) checks before calling list_member_inline_internal(list, datum) Can we not add this check in list_member_inline_internal itself? 3. if (cell) return list_delete_cell(list, cell) in list_delete_int/oid can we change if (cell) to if (cell != NULL) as used elsewhere? 4. list_member_inline_interal_idx , there is typo in name. 5. list_member_inline_interal_idx and list_member_inline_internal are practically same with almost 90+ % duplication. can we not refactor both into one and allow cell or true/false as return? Although I see usage of const ListCell so thismight be problematic. 6. Loop for (i = 0; i < tail_idx; i += nelem_per_iteration) for Vector32 in list.c looks duplicated from pg_lfind32 (in pg_lfind.h),can we not reuse that? 7. Is it possible to add a benchmark which shows improvement against HEAD ? Regards, Ankit The new status of this patch is: Waiting on Author
> 7. Is it possible to add a benchmark which shows improvement against HEAD ? Please ignore this from my earlier mail, I just saw stats now. Thanks, Ankit
Thanks for taking a look. On Sat, Mar 11, 2023 at 09:41:18AM +0000, Ankit Kumar Pandey wrote: > 1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from > #ifdef USE_NO_SIMD > const ListCell *cell; > #endif > to #else like as mentioned below? This will make visual separation between #if cases more cleaner I would expect to see -Wdeclaration-after-statement warnings if we did this. > 2. Lots of duplicated > if (list == NIL) checks before calling list_member_inline_internal(list, datum) > Can we not add this check in list_member_inline_internal itself? We probably could. I only extracted the NIL checks to simplify the code in list_member_inline_internal() a bit. > 3. if (cell) > return list_delete_cell(list, cell) in list_delete_int/oid > can we change if (cell) to if (cell != NULL) as used elsewhere? Sure. > 4. list_member_inline_interal_idx , there is typo in name. Will fix. > 5. list_member_inline_interal_idx and list_member_inline_internal are practically same with almost 90+ % duplication. > can we not refactor both into one and allow cell or true/false as return? Although I see usage of const ListCell so thismight be problematic. The idea was to skip finding the exact match if all we care about is whether the element exists. This micro-optimization might be negligible, in which case we could use list_member_inline_internal_idx() for both cases. > 6. Loop for (i = 0; i < tail_idx; i += nelem_per_iteration) for Vector32 in list.c looks duplicated from pg_lfind32 (inpg_lfind.h), can we not reuse that? The list.c version is slightly different because we need to disregard any matches that we find in between the data. For example, in an integer List, the integer will take up 4 bytes of the 8-byte ListCell (for 64-bit platforms). typedef union ListCell { void *ptr_value; int int_value; Oid oid_value; TransactionId xid_value; } ListCell; -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Agree with your points Nathan. Just a headup. > On 14/03/23 03:10, Nathan Bossart wrote: > On Sat, Mar 11, 2023 at 09:41:18AM +0000, Ankit Kumar Pandey wrote: >> 1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from >> #ifdef USE_NO_SIMD >> const ListCell *cell; >> #endif >> to #else like as mentioned below? This will make visual separation between #if cases more cleaner > I would expect to see -Wdeclaration-after-statement warnings if we did > this. This worked fine for me, no warnings on gcc 12.2.0. Not a concern though. Thanks, Ankit
On Wed, Mar 15, 2023 at 07:31:46PM +0530, Ankit Kumar Pandey wrote: >> On 14/03/23 03:10, Nathan Bossart wrote: >> On Sat, Mar 11, 2023 at 09:41:18AM +0000, Ankit Kumar Pandey wrote: >> > 1. In list_member_ptr, will it be okay to bring `const ListCell >> > *cell` from #ifdef USE_NO_SIMD >> > const ListCell *cell; >> > #endif >> > to #else like as mentioned below? This will make visual separation between #if cases more cleaner >> I would expect to see -Wdeclaration-after-statement warnings if we did >> this. > > This worked fine for me, no warnings on gcc 12.2.0. Not a concern though. Did you try building without SIMD support? This is what I see: list.c: In function ‘list_member_ptr’: list.c:697:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] 697 | const ListCell *cell; | ^~~~~ If your build doesn't have USE_NO_SIMD defined, this warning won't appear because the code in question will be compiled out. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On 15/03/23 21:53, Nathan Bossart wrote: > Did you try building without SIMD support? This is what I see: > list.c: In function ‘list_member_ptr’: > list.c:697:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] > 697 | const ListCell *cell; > | ^~~~~ > If your build doesn't have USE_NO_SIMD defined, this warning won't appear > because the code in question will be compiled out. My mistake, I tried with USE_NO_SIMD defined and it showed the warning. Sorry for the noise. Regards, Ankit
Here is a new patch set. I've split it into two patches: one for the 64-bit functions, and one for the 32-bit functions. I've also added tests for pg_lfind64/pg_lfind64_idx and deduplicated the code a bit. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> was mostly a fun weekend project, and I don't presently have any concrete
> examples of workloads where this might help.
It seems like that should be demonstrated before seriously considering this, like a profile where the relevant list functions show up.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Apr 21, 2023 at 01:50:34PM +0700, John Naylor wrote: > On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart <nathandbossart@gmail.com> > wrote: >> was mostly a fun weekend project, and I don't presently have any concrete >> examples of workloads where this might help. > > It seems like that should be demonstrated before seriously considering > this, like a profile where the relevant list functions show up. Agreed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On 21/04/2023 23:33, Nathan Bossart wrote: > On Fri, Apr 21, 2023 at 01:50:34PM +0700, John Naylor wrote: >> On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart <nathandbossart@gmail.com> >> wrote: >>> was mostly a fun weekend project, and I don't presently have any concrete >>> examples of workloads where this might help. >> >> It seems like that should be demonstrated before seriously considering >> this, like a profile where the relevant list functions show up. > > Agreed. Grepping for "tlist_member" and "list_delete_ptr", I don't see any callers in hot codepaths where this could make a noticeable difference. So I've marked this as Returned with Feedback in the commitfest. > I noticed that several of the List functions do simple linear searches that > can be optimized with SIMD intrinsics (as was done for XidInMVCCSnapshot in > 37a6e5d). The following table shows the time spent iterating over a list > of n elements (via list_member_int) one billion times on my x86 laptop. > > > n | head (ms) | patched (ms) > ------+-----------+-------------- > 2 | 3884 | 3001 > 4 | 5506 | 4092 > 8 | 6209 | 3026 > 16 | 8797 | 4458 > 32 | 25051 | 7032 > 64 | 37611 | 12763 > 128 | 61886 | 22770 > 256 | 111170 | 59885 > 512 | 209612 | 103378 > 1024 | 407462 | 189484 I'm surprised to see an improvement with n=2 and n=2. AFAICS, the vectorization only kicks in when n >= 8. -- Heikki Linnakangas Neon (https://neon.tech)