Thread: SIMD optimization for list_sort
Hi All,
We propose enabling SIMD-based sort for list_sort using the x86-simd-sort library (https://github.com/intel/x86-simd-sort).
The existing list_sort takes a comparator function to compare pairs of ListCell. On the other hand, x86-simd-sort requires an array of numeric values to sort, and it returns an array of sorted indices. To enable x86-simd-sort, we add new list_sort_simd functions that take an extractor function. The function extracts a value (float or uint32) from a ListCell. Those values are then collected into an array for x86-simd-sort to work on. A comparator function can still be passed to be used as tie-breaker.
typedef float (*list_sort_extractor_float)(const ListCell *a);
typedef uint32 (*list_sort_extractor_uint32)(const ListCell *a);
void list_sort_simd_float(List *list, list_sort_extractor_float extract, list_sort_comparator cmp);
void list_sort_simd_uint32(List *list, list_sort_extractor_uint32 extract, list_sort_comparator cmp);
These functions will exist alongside the current list_sort. Existing list_sort use cases in Postgres or extensions will not be affected by default, and they can be converted to list_sort_simd functions where it makes sense in terms of performance.
This feature introduces the x86-simd-sort library as a dependency. We plan to make the dependency optional, using a configuration option, such as --with-x86-simd-sort. If building without this option, list_sort_simd functions could fall back to the existing list_sort (e.g., by creating a comparator function that combines the extracted value and the tie-breaker comparator function). Alternatively (or in addition), we could add a function to report whether x86-simd-sort is available.
We identified a first use case for list_sort_simd_float in pgvector. As part of HNSW index construction, pgvector uses list_sort to sort candidate vectors by distance. Using list_sort_simd_float, we observed reduction in index build time in some scenarios. For example, we observed 7% reduction in index build time with the gist-960 dataset and 10% with the dbpedia-openai-1000k dataset (ANN-Benchmarks, HNSW index with halfvec, m=80). We are also looking into microbenchmarks to measure list_sort performance independently.
We’d appreciate feedback on this approach. In the meantime, we will complete the patch to share. We also plan to extend SIMD-based sort to tuple sort in the future.
Thanks!
Luca Giacchino
On 22/11/2024 01:27, Giacchino, Luca wrote: > The existing list_sort takes a comparator function to compare pairs of > ListCell. On the other hand, x86-simd-sort requires an array of numeric > values to sort, and it returns an array of sorted indices. To enable > x86-simd-sort, we add new list_sort_simd functions that take an > extractor function. The function extracts a value (float or uint32) from > a ListCell. Those values are then collected into an array for x86-simd- > sort to work on. A comparator function can still be passed to be used as > tie-breaker. > > typedef float (*list_sort_extractor_float)(const ListCell *a); > > typedef uint32 (*list_sort_extractor_uint32)(const ListCell *a); > > void list_sort_simd_float(List *list, list_sort_extractor_float extract, > list_sort_comparator cmp); > > void list_sort_simd_uint32(List *list, list_sort_extractor_uint32 > extract, list_sort_comparator cmp); > > These functions will exist alongside the current list_sort. Existing > list_sort use cases in Postgres or extensions will not be affected by > default, and they can be converted to list_sort_simd functions where it > makes sense in terms of performance. I'd suggest targeting pg_qsort() directly, instead of list_sort(). list_sort() is not used in very performance critical parts. > We identified a first use case for list_sort_simd_float in pgvector. As > part of HNSW index construction, pgvector uses list_sort to sort > candidate vectors by distance. Using list_sort_simd_float, we observed > reduction in index build time in some scenarios. For example, we > observed 7% reduction in index build time with the gist-960 dataset and > 10% with the dbpedia-openai-1000k dataset (ANN-Benchmarks, HNSW index > with halfvec, m=80). We are also looking into microbenchmarks to measure > list_sort performance independently. That's interesting. I'd suggest proposing this to the pgvector project directly, since pgvector would immediately benefit. > We’d appreciate feedback on this approach. In the meantime, we will > complete the patch to share. We also plan to extend SIMD-based sort to > tuple sort in the future. If you could use this to speed up tuple sorting, that would be much more interesting for PostgreSQL itself. -- Heikki Linnakangas Neon (https://neon.tech)
On Fri, Nov 22, 2024 at 06:01:22PM +0200, Heikki Linnakangas wrote: > On 22/11/2024 01:27, Giacchino, Luca wrote: >> We´d appreciate feedback on this approach. In the meantime, we will >> complete the patch to share. We also plan to extend SIMD-based sort to >> tuple sort in the future. > > If you could use this to speed up tuple sorting, that would be much more > interesting for PostgreSQL itself. +1 -- nathan
On Fri, Nov 22, 2024 at 6:27 AM Giacchino, Luca <luca.giacchino@intel.com> wrote: > We’d appreciate feedback on this approach. In the meantime, we will complete the patch to share. We also plan to extendSIMD-based sort to tuple sort in the future. Coincidentally, I'll be prototyping a tuple sort that will take better advantage of keys that are integers (whether authoritative or abbreviated), using scalar, portable branch-free techniques. This will require some re-architecting and I believe that will also make it easier for you to experiment with sorting networks there. -- John Naylor Amazon Web Services
On 22/11/2024 06:27, Giacchino, Luca wrote: > We’d appreciate feedback on this approach. In the meantime, we will > complete the patch to share. We also plan to extend SIMD-based sort to > tuple sort in the future. Nice! I continually see performance reports when sorting and group order impact performance. Because of that, efforts have been made to optimise sorts with recombination of clause order [1,2]. It would be nice to review your Sort operator optimisation. I hope, it might improve performance of the cases provided in these threads. [1] POC: GROUP BY optimization https://www.postgresql.org/message-id/flat/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru [2] Consider the number of columns in the sort cost model https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com -- regards, Andrei Lepikhov
On Tue, Feb 18, 2025 at 1:49 PM R, Rakshit <rakshit.r@intel.com> wrote: > > Hi All, > > Thank you very much for your feedback! We investigated the recommendations as we developed the current patch. Please referto the comments below. > > > I'd suggest targeting pg_qsort() directly, instead of list_sort(). list_sort() is not used in very performance criticalparts. > Using x86-simd-sort would require a new pg_qsort variant taking an extractor function. The new API would still have tobubble up to functions like list_sort, which internally call pg_qsort. We targeted list_sort first, as we were aware ofat least one use case that benefits from SIMD sorting (pgvector). As we target more use cases (such as tuple sort), wemay need to move the SIMD-enabled implementation to a common location, and a new variant of pg_qsort may be a good candidate. > > > I'd suggest proposing this to the pgvector project directly, since pgvector would immediately benefit. > Yes, we’ll share the performance gains on HNSW index build time with the pgvector community. However, other users of list_sort(e.g., other extensions) may benefit from the simd-sort version of list_sort as well. As it is not a pgvector-specificoptimization, we propose it for PostgreSQL. I don't think "another extension might use it someday" makes a very strong case, particularly for something that requires a new dependency. > > If you could use this to speed up tuple sorting, that would be much more interesting for PostgreSQL itself. > Enabling x86-simd-sort for tuple sorting is in development. We observed promising results, but there is still work to do(e.g., to ensure there are no regressions in different conditions). We are planning to share another patch for tuple sortbuilding on this one. Note that our current implemention is highly optimized for low-cardinality inputs. This is needed for aggregate queries. I found this write-up of a couple scalar and vectorized sorts, and they show this library doing very poorly on very-low cardinality inputs. I would look into that before trying to get something in shape to share. https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md There is also the question of hardware support. It seems AVX-512 is not supported well on client side, where most developers work. And availability of any flavor is not guaranteed on server either. Something to keep in mind. -- John Naylor Amazon Web Services
> Note that our current implemention is highly optimized for low-cardinality inputs. > This is needed for aggregate queries. I found this write-up of a couple scalar and > vectorized sorts, and they show this library doing very poorly on very-low > cardinality inputs. I would look into that before trying to get something in shape to > share. > > https://github.com/Voultapher/sort-research- > rs/blob/main/writeup/intel_avx512/text.md That write up is fairly old and those perf problems has subsequently been fixed. See https://github.com/intel/x86-simd-sort/pull/127and https://github.com/intel/x86-simd-sort/pull/168. I still suggest measuringperf here for thoroughness. > > There is also the question of hardware support. It seems AVX-512 is not > supported well on client side, where most developers work. And availability of > any flavor is not guaranteed on server either. > Something to keep in mind. simd-sort also works on avx2 which is widely available. I would suggest benchmarking on one of the client laptops to measurethe perf. Raghuveer
Hi All, Thank you so much for the feedback. > I don't think "another extension might use it someday" makes a very strong case, > particularly for something that requires a new dependency. The x86-simdsort library is an optional dependency in Postgres. Also the new list sort implementation which uses the x86-simdsortlibrary does not impact any of the existing workflows in Postgres. We have at least one use case where Pgvectorgets a gain of 7 - 10 % in HNSW index build time using the SIMD sort implementation. Hence we feel that this patchcould be up streamed further. Open to further discussion on the same. > Note that our current implemention is highly optimized for low-cardinality inputs. > This is needed for aggregate queries. I found this write-up of a > couple scalar and vectorized sorts, and they show this library doing > very poorly on very-low cardinality inputs. I would look into that > before trying to get something in shape to share. > > https://github.com/Voultapher/sort-research- > rs/blob/main/writeup/intel_avx512/text.md We ran our extension to stress list sort with low cardinality inputs. For eg, for an array of size 100k having repeated valuesin the range 1-10 we still see a gain of around 20% in throughput. We will collect more data for low cardinality inputs and with AVX2 too. Thanks and regards Rakshit Rajeev
On Fri, Feb 28, 2025 at 12:43 PM R, Rakshit <rakshit.r@intel.com> wrote: > > I don't think "another extension might use it someday" makes a very strong case, > > particularly for something that requires a new dependency. > > The x86-simdsort library is an optional dependency in Postgres. Also the new list sort implementation which uses the x86-simdsortlibrary does not impact any of the existing workflows in Postgres. "Optional" and "Does not impact" are not great selling points to get us to take a 1500 line patch. As we told you in November, list_sort isn't critical for us. You need to start with the user and work backwards to the technology. Don't pick a technology and try to sell people on using it. > We ran our extension to stress list sort with low cardinality inputs. For eg, for an array of size 100k having repeatedvalues in the range 1-10 we still see a gain of around 20% in throughput. > We will collect more data for low cardinality inputs and with AVX2 too. Thanks for the news, those are encouraging results. -- John Naylor Amazon Web Services
>> > I don't think "another extension might use it someday" makes a very strong case, >> > particularly for something that requires a new dependency. >> >> The x86-simdsort library is an optional dependency in Postgres. Also the new list sort implementation which uses the x86-simdsortlibrary does not impact any of the existing workflows in Postgres. >"Optional" and "Does not impact" are not great selling points to get >us to take a 1500 line patch. As we told you in November, list_sort >isn't critical for us. You need to start with the user and work >backwards to the technology. Don't pick a technology and try to sell >people on using it. Agree on starting from the user problem and work towards technology. As stated upthread, the problem being addressed hereis optimizing pg_vector list_sort (which relies on postgres list_sort) to speed up HNSW index construction. The resultsshow 7-10% improvements on vector workloads using the library. Given the same x86-simdsort library is intended to optimize 1) list_sort 2) tuple sort, it didn't make sense to duplicatethe work to integrate it in pg_vector for list_sort, and then again in postgres for tuple-sort. We're trying to tackle list_sort and tuple_sort in separate patches using the same x86-simdsort library, with the goal tooptimize both. Let us know if this approach is preferred and if the list_sort patch could be reviewed and any other testswe should do, or would you rather see tuple_sort benchmarks. - Akash Shankaran
On Sat, Mar 1, 2025 at 1:23 PM Shankaran, Akash <akash.shankaran@intel.com> wrote: > Given the same x86-simdsort library is intended to optimize 1) list_sort 2) tuple sort, it didn't make sense to duplicatethe work to integrate it in pg_vector for list_sort, and then again in postgres for tuple-sort. Thank you for the context -- the motivation is clear now. I imagine this kind of integration work is messy and difficult, which is why I would be reluctant to write something that way if I have a choice. > We're trying to tackle list_sort and tuple_sort in separate patches using the same x86-simdsort library, with the goal to optimize both. Let us know if this approach is preferred and if the list_sort patch could be reviewed and any other tests we should do, or would you rather see tuple_sort benchmarks. The approach I actually prefer to is to hear about the high-level architecture first, before being asked to look at code. Tuple sort has special challenges, so when you're ready to start a new thread for that, I'll be curious about your findings. -- John Naylor Amazon Web Services