Thread: SIMD optimization for list_sort

SIMD optimization for list_sort

From
"Giacchino, Luca"
Date:

Hi All,

 

We propose enabling SIMD-based sort for list_sort using the x86-simd-sort library (https://github.com/intel/x86-simd-sort).

 

The existing list_sort takes a comparator function to compare pairs of ListCell. On the other hand, x86-simd-sort requires an array of numeric values to sort, and it returns an array of sorted indices. To enable x86-simd-sort, we add new list_sort_simd functions that take an extractor function. The function extracts a value (float or uint32) from a ListCell. Those values are then collected into an array for x86-simd-sort to work on. A comparator function can still be passed to be used as tie-breaker.

 

typedef float (*list_sort_extractor_float)(const ListCell *a);

typedef uint32 (*list_sort_extractor_uint32)(const ListCell *a);

 

void list_sort_simd_float(List *list, list_sort_extractor_float extract, list_sort_comparator cmp);

void list_sort_simd_uint32(List *list, list_sort_extractor_uint32 extract, list_sort_comparator cmp);

 

These functions will exist alongside the current list_sort. Existing list_sort use cases in Postgres or extensions will not be affected by default, and they can be converted to list_sort_simd functions where it makes sense in terms of performance.

 

This feature introduces the x86-simd-sort library as a dependency. We plan to make the dependency optional, using a configuration option, such as --with-x86-simd-sort. If building without this option, list_sort_simd functions could fall back to the existing list_sort (e.g., by creating a comparator function that combines the extracted value and the tie-breaker comparator function). Alternatively (or in addition), we could add a function to report whether x86-simd-sort is available.

 

We identified a first use case for list_sort_simd_float in pgvector. As part of HNSW index construction, pgvector uses list_sort to sort candidate vectors by distance. Using list_sort_simd_float, we observed reduction in index build time in some scenarios. For example, we observed 7% reduction in index build time with the gist-960 dataset and 10% with the dbpedia-openai-1000k dataset (ANN-Benchmarks, HNSW index with halfvec, m=80). We are also looking into microbenchmarks to measure list_sort performance independently.

 

We’d appreciate feedback on this approach. In the meantime, we will complete the patch to share. We also plan to extend SIMD-based sort to tuple sort in the future.

 

Thanks!

 

Luca Giacchino

 

Re: SIMD optimization for list_sort

From
Heikki Linnakangas
Date:
On 22/11/2024 01:27, Giacchino, Luca wrote:
> The existing list_sort takes a comparator function to compare pairs of 
> ListCell. On the other hand, x86-simd-sort requires an array of numeric 
> values to sort, and it returns an array of sorted indices. To enable 
> x86-simd-sort, we add new list_sort_simd functions that take an 
> extractor function. The function extracts a value (float or uint32) from 
> a ListCell. Those values are then collected into an array for x86-simd- 
> sort to work on. A comparator function can still be passed to be used as 
> tie-breaker.
> 
> typedef float (*list_sort_extractor_float)(const ListCell *a);
> 
> typedef uint32 (*list_sort_extractor_uint32)(const ListCell *a);
> 
> void list_sort_simd_float(List *list, list_sort_extractor_float extract, 
> list_sort_comparator cmp);
> 
> void list_sort_simd_uint32(List *list, list_sort_extractor_uint32 
> extract, list_sort_comparator cmp);
> 
> These functions will exist alongside the current list_sort. Existing 
> list_sort use cases in Postgres or extensions will not be affected by 
> default, and they can be converted to list_sort_simd functions where it 
> makes sense in terms of performance.

I'd suggest targeting pg_qsort() directly, instead of list_sort(). 
list_sort() is not used in very performance critical parts.

> We identified a first use case for list_sort_simd_float in pgvector. As 
> part of HNSW index construction, pgvector uses list_sort to sort 
> candidate vectors by distance. Using list_sort_simd_float, we observed 
> reduction in index build time in some scenarios. For example, we 
> observed 7% reduction in index build time with the gist-960 dataset and 
> 10% with the dbpedia-openai-1000k dataset (ANN-Benchmarks, HNSW index 
> with halfvec, m=80). We are also looking into microbenchmarks to measure 
> list_sort performance independently.

That's interesting. I'd suggest proposing this to the pgvector project 
directly, since pgvector would immediately benefit.

> We’d appreciate feedback on this approach. In the meantime, we will 
> complete the patch to share. We also plan to extend SIMD-based sort to 
> tuple sort in the future.

If you could use this to speed up tuple sorting, that would be much more 
interesting for PostgreSQL itself.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: SIMD optimization for list_sort

From
Nathan Bossart
Date:
On Fri, Nov 22, 2024 at 06:01:22PM +0200, Heikki Linnakangas wrote:
> On 22/11/2024 01:27, Giacchino, Luca wrote:
>> We´d appreciate feedback on this approach. In the meantime, we will
>> complete the patch to share. We also plan to extend SIMD-based sort to
>> tuple sort in the future.
> 
> If you could use this to speed up tuple sorting, that would be much more
> interesting for PostgreSQL itself.

+1

-- 
nathan



Re: SIMD optimization for list_sort

From
John Naylor
Date:
On Fri, Nov 22, 2024 at 6:27 AM Giacchino, Luca
<luca.giacchino@intel.com> wrote:
> We’d appreciate feedback on this approach. In the meantime, we will complete the patch to share. We also plan to
extendSIMD-based sort to tuple sort in the future. 

Coincidentally, I'll be prototyping a tuple sort that will take better
advantage of keys that are integers (whether authoritative or
abbreviated), using scalar, portable branch-free techniques. This will
require some re-architecting and I believe that will also make it
easier for you to experiment with sorting networks there.

--
John Naylor
Amazon Web Services



Re: SIMD optimization for list_sort

From
Andrei Lepikhov
Date:
On 22/11/2024 06:27, Giacchino, Luca wrote:
> We’d appreciate feedback on this approach. In the meantime, we will 
> complete the patch to share. We also plan to extend SIMD-based sort to 
> tuple sort in the future.
Nice! I continually see performance reports when sorting and group order 
impact performance. Because of that, efforts have been made to optimise 
sorts with recombination of clause order [1,2]. It would be nice to 
review your Sort operator optimisation. I hope, it might improve 
performance of the cases provided in these threads.

[1] POC: GROUP BY optimization
https://www.postgresql.org/message-id/flat/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru
[2] Consider the number of columns in the sort cost model
https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com

-- 
regards, Andrei Lepikhov




Re: SIMD optimization for list_sort

From
John Naylor
Date:
On Tue, Feb 18, 2025 at 1:49 PM R, Rakshit <rakshit.r@intel.com> wrote:
>
> Hi All,
>
> Thank you very much for your feedback! We investigated the recommendations as we developed the current patch. Please
referto the comments below. 
>
> > I'd suggest targeting pg_qsort() directly, instead of list_sort(). list_sort() is not used in very performance
criticalparts. 
> Using x86-simd-sort would require a new pg_qsort variant taking an extractor function. The new API would still have
tobubble up to functions like list_sort, which internally call pg_qsort. We targeted list_sort first, as we were aware
ofat least one use case that benefits from SIMD sorting (pgvector). As we target more use cases (such as tuple sort),
wemay need to move the SIMD-enabled implementation to a common location, and a new variant of pg_qsort may be a good
candidate.
>
> > I'd suggest proposing this to the pgvector project directly, since pgvector would immediately benefit.
> Yes, we’ll share the performance gains on HNSW index build time with the pgvector community. However, other users of
list_sort(e.g., other extensions) may benefit from the simd-sort version of list_sort as well. As it is not a
pgvector-specificoptimization, we propose it for PostgreSQL. 

I don't think "another extension might use it someday" makes a very
strong case, particularly for something that requires a new
dependency.

> > If you could use this to speed up tuple sorting, that would be much more interesting for PostgreSQL itself.
> Enabling x86-simd-sort for tuple sorting is in development. We observed promising results, but there is still work to
do(e.g., to ensure there are no regressions in different conditions). We are planning to share another patch for tuple
sortbuilding on this one. 

Note that our current implemention is highly optimized for
low-cardinality inputs. This is needed for aggregate queries. I found
this write-up of a couple scalar and vectorized sorts, and they show
this library doing very poorly on very-low cardinality inputs. I would
look into that before trying to get something in shape to share.

https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md

There is also the question of hardware support. It seems AVX-512 is
not supported well on client side, where most developers work. And
availability of any flavor is not guaranteed on server either.
Something to keep in mind.

--
John Naylor
Amazon Web Services



RE: SIMD optimization for list_sort

From
"Devulapalli, Raghuveer"
Date:
> Note that our current implemention is highly optimized for low-cardinality inputs.
> This is needed for aggregate queries. I found this write-up of a couple scalar and
> vectorized sorts, and they show this library doing very poorly on very-low
> cardinality inputs. I would look into that before trying to get something in shape to
> share.
> 
> https://github.com/Voultapher/sort-research-
> rs/blob/main/writeup/intel_avx512/text.md

That write up is fairly old and those perf problems has subsequently been fixed. See
https://github.com/intel/x86-simd-sort/pull/127and https://github.com/intel/x86-simd-sort/pull/168. I still suggest
measuringperf here for thoroughness. 
 

> 
> There is also the question of hardware support. It seems AVX-512 is not
> supported well on client side, where most developers work. And availability of
> any flavor is not guaranteed on server either.
> Something to keep in mind.

simd-sort also works on avx2 which is widely available. I would suggest benchmarking on one of the client laptops to
measurethe perf. 
 

Raghuveer

RE: SIMD optimization for list_sort

From
"R, Rakshit"
Date:
Hi All,

Thank you so much for the feedback. 

> I don't think "another extension might use it someday" makes a very strong case, 
> particularly for something that requires a new dependency.

The x86-simdsort library is an optional dependency in Postgres. Also the new list sort implementation which uses the
x86-simdsortlibrary does not impact any of the existing workflows in Postgres. We have at least one use case where
Pgvectorgets a gain of 7 - 10 % in HNSW index build time using the SIMD sort implementation. Hence we feel that this
patchcould be up streamed further. Open to further discussion on the same.
 


> Note that our current implemention is highly optimized for low-cardinality inputs.
> This is needed for aggregate queries. I found this write-up of a 
> couple scalar and vectorized sorts, and they show this library doing 
> very poorly on very-low cardinality inputs. I would look into that 
> before trying to get something in shape to share.
> 
> https://github.com/Voultapher/sort-research-
> rs/blob/main/writeup/intel_avx512/text.md

We ran our extension to stress list sort with low cardinality inputs. For eg, for an array of size 100k having repeated
valuesin the range 1-10 we still see a gain of around 20% in throughput.
 
We will collect more data for low cardinality inputs and with AVX2 too.

Thanks and regards
Rakshit Rajeev

Re: SIMD optimization for list_sort

From
John Naylor
Date:
On Fri, Feb 28, 2025 at 12:43 PM R, Rakshit <rakshit.r@intel.com> wrote:

> > I don't think "another extension might use it someday" makes a very strong case,
> > particularly for something that requires a new dependency.
>
> The x86-simdsort library is an optional dependency in Postgres. Also the new list sort implementation which uses the
x86-simdsortlibrary does not impact any of the existing workflows in Postgres. 

"Optional" and "Does not impact" are not great selling points to get
us to take a 1500 line patch. As we told you in November, list_sort
isn't critical for us. You need to start with the user and work
backwards to the technology. Don't pick a technology and try to sell
people on using it.

> We ran our extension to stress list sort with low cardinality inputs. For eg, for an array of size 100k having
repeatedvalues in the range 1-10 we still see a gain of around 20% in throughput. 
> We will collect more data for low cardinality inputs and with AVX2 too.

Thanks for the news, those are encouraging results.

--
John Naylor
Amazon Web Services



RE: SIMD optimization for list_sort

From
"Shankaran, Akash"
Date:
>> > I don't think "another extension might use it someday" makes a very strong case,
>> > particularly for something that requires a new dependency.
>>
>> The x86-simdsort library is an optional dependency in Postgres. Also the new list sort implementation which uses the
x86-simdsortlibrary does not impact any of the existing workflows in Postgres.
 

>"Optional" and "Does not impact" are not great selling points to get
>us to take a 1500 line patch. As we told you in November, list_sort
>isn't critical for us. You need to start with the user and work
>backwards to the technology. Don't pick a technology and try to sell
>people on using it.

Agree on starting from the user problem and work towards technology. As stated upthread, the problem being addressed
hereis optimizing pg_vector list_sort (which relies on postgres list_sort) to speed up HNSW index construction. The
resultsshow 7-10% improvements on vector workloads using the library. 
 

Given the same x86-simdsort library is intended to optimize 1) list_sort 2) tuple sort, it didn't make sense to
duplicatethe work to integrate it in pg_vector for list_sort, and then again in postgres for tuple-sort.
 
We're trying to tackle list_sort and tuple_sort in separate patches using the same x86-simdsort library, with the goal
tooptimize both. Let us know if this approach is preferred and if the list_sort patch could be reviewed and any other
testswe should do, or would you rather see tuple_sort benchmarks. 
 

- Akash Shankaran

Re: SIMD optimization for list_sort

From
John Naylor
Date:
On Sat, Mar 1, 2025 at 1:23 PM Shankaran, Akash
<akash.shankaran@intel.com> wrote:

> Given the same x86-simdsort library is intended to optimize 1) list_sort 2) tuple sort, it didn't make sense to
duplicatethe work to integrate it in pg_vector for list_sort, and then again in postgres for tuple-sort. 

Thank you for the context -- the motivation is clear now.

I imagine this kind of integration work is messy and difficult, which
is why I would be reluctant to write something that way if I have a
choice.

> We're trying to tackle list_sort and tuple_sort in separate patches using the same x86-simdsort library,
with the goal to optimize both. Let us know if this approach is
preferred and if the list_sort patch could be reviewed and any other
tests we should do, or would you rather see tuple_sort benchmarks.

The approach I actually prefer to is to hear about the high-level
architecture first, before being asked to look at code. Tuple sort has
special challenges, so when you're ready to start a new thread for
that, I'll be curious about your findings.

--
John Naylor
Amazon Web Services