Thread: Insertion Sort Improvements

Insertion Sort Improvements

From
Benjamin Coutu
Date:
Hello,

Inspired by the recent discussions[1][2] around sort improvements, I took a look around the code and noticed the use of
asomewhat naive version of insertion sort within the broader quicksort code. 

The current implementation (see sort_template.h) is practically the textbook version of insertion sort:

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP; pm += ST_POINTER_STEP)
  for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0; pl -= ST_POINTER_STEP)
    DO_SWAP(pl, pl - ST_POINTER_STEP);

I propose to rather use the slightly more efficient variant of insertion sort where only a single assignment instead of
afully-fledged swap is performed in the inner loop: 

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP; pm += ST_POINTER_STEP) {
  DO_COPY(pm_temp, pm); /* pm_temp <- copy of pm */

  pl = pm - ST_POINTER_STEP;

  for (; pl >= a && DO_COMPARE(pl, pm_temp) > 0; pl -= ST_POINTER_STEP)
    DO_ASSIGN(pl + ST_POINTER_STEP, pl); /* pl + 1 <- pl */

  DO_COPY(pl + ST_POINTER_STEP, pm_temp); /* pl + 1 <- copy of pm_temp */
}

DO_ASSIGN and DO_COPY macros would have to be declared analogue to DO_SWAP via the template.

There is obviously a trade-off involved here as O(1) extra memory is required to hold the temporary variable and
DO_COPYmight be expensive if the sort element is large. In case of single datum sort with trivial data types this would
notbe a big issue. For large tuples on the other hand it could mean a significant overhead that might not be made up
forby the improved inner loop. One might want to limit this algorithm to a certain maximum tuple size and use the
originalinsertion sort version for larger elements (this could be decided at compile-time via sort_template.h). 

Anyways, there might be some low hanging fruit here. If it turns out to be significantly faster one might even consider
increasingthe insertion sort threshold from < 7 to < 10 sort elements. But that is a whole other discussion for another
day.

Has anyone tested such an approach before? Please let me know your thoughts.

Cheers,

Benjamin

[1] https://www.postgresql.org/message-id/flat/CAFBsxsHanJTsX9DNJppXJxwg3bU%2BYQ6pnmSfPM0uvYUaFdwZdQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAApHDvoTTtoQYfp3d0kTPF6y1pjexgLwquzKmjzvjC9NCw4RGw%40mail.gmail.com

--

Benjamin Coutu
http://www.zeyos.com



Re: Insertion Sort Improvements

From
John Naylor
Date:
On Thu, Aug 25, 2022 at 5:55 PM Benjamin Coutu <ben.coutu@zeyos.com> wrote:
>
> Hello,
>
> Inspired by the recent discussions[1][2] around sort improvements, I took a look around the code and noticed the use
ofa somewhat naive version of insertion sort within the broader quicksort code.
 
>
> The current implementation (see sort_template.h) is practically the textbook version of insertion sort:

Right. I think it's worth looking into. When I tested dual-pivot
quicksort, a threshold of 18 seemed best for my inputs, so making
insertion sort more efficient could tilt the balance more in favor of
dual-pivot. (It already seems slightly ahead, but as I mentioned in
the thread you linked, removing branches for null handling would make
it more compelling).

> DO_ASSIGN and DO_COPY macros would have to be declared analogue to DO_SWAP via the template.

I don't think you need these macros, since this optimization is only
convenient if you know the type at compile time. See the attached,
which I had laying around when I was looking at PDQuicksort. I believe
it's similar to what you have in mind. (Ignore the part about
"unguarded", it's irrelevant out of context). Notice that it avoids
unnecessary copies, but has two calls to DO_COMPARE, which is *not*
small for tuples.

> Anyways, there might be some low hanging fruit here. If it turns out to be significantly faster one might even
considerincreasing the insertion sort threshold from < 7 to < 10 sort elements. But that is a whole other discussion
foranother day.
 

I think it belongs around 10 now anyway. I also don't think you'll see
much benefit with your proposal at a threshold of 7 -- I suspect it'd
be more enlightening to test a range of thresholds with and without
the patch, to see how the inflection point shifts. That worked pretty
well when testing dual-pivot.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: Insertion Sort Improvements

From
Benjamin Coutu
Date:
> convenient if you know the type at compile time. See the attached,
> which I had laying around when I was looking at PDQuicksort. I believe
> it's similar to what you have in mind.

That looks very promising.
I also love your recent proposal of partitioning into null and non-null. I suspect that to be a clear winner.

> I think it belongs around 10 now anyway.

Yeah, I think that change is overdue given modern hardware characteristics (even with the current implementation).

Another idea could be to run a binary insertion sort and use a much higher threshold. This could significantly cut down
oncomparisons (especially with presorted runs, which are quite common in real workloads). 

If full binary search turned out to be an issue regarding cache locality, we could do it in smaller chunks, e.g. do a
microbinary search between the current position (I) and position minus chunk size (say 6-12 or so, whatever fits 1 or 2
cachelines)whenever A[I] < A[I-1] and if we don't find the position within that chunk we continue with the previous
chunk,thereby preserving cache locality. 

With less comparisons we should start keeping track of swaps and use that as an efficient way to determine
presortedness.We could change the insertion sort threshold to a swap threshold and do insertion sort until we hit the
swapthreshold. I suspect that would make the current presorted check obsolete (as binary insertion sort without or even
witha few swaps should be faster than the current presorted-check). 

Cheers, Ben



Re: Insertion Sort Improvements

From
John Naylor
Date:
On Fri, Aug 26, 2022 at 9:06 PM Benjamin Coutu <ben.coutu@zeyos.com> wrote:
>
> Another idea could be to run a binary insertion sort and use a much higher threshold. This could significantly cut
downon comparisons (especially with presorted runs, which are quite common in real workloads). 

Comparisons that must go to the full tuple are expensive enough that
this idea might have merit in some cases, but that would be a research
project.

> If full binary search turned out to be an issue regarding cache locality, we could do it in smaller chunks,

The main issue with binary search is poor branch prediction. Also, if
large chunks are bad for cache locality, isn't that a strike against
using a "much higher threshold"?

> With less comparisons we should start keeping track of swaps and use that as an efficient way to determine
presortedness.We could change the insertion sort threshold to a swap threshold and do insertion sort until we hit the
swapthreshold. I suspect that would make the current presorted check obsolete (as binary insertion sort without or even
witha few swaps should be faster than the current presorted-check). 

The thread you linked to discusses partial insertion sort as a
replacement for the pre-sorted check, along with benchmark results and
graphs IIRC. I think it's possibly worth doing, but needs more
investigation to make sure the (few) regressions I saw either: 1. were
just noise or 2. can be ameliorated. As I said in the dual pivot
thread, this would be great for dual pivot since we could reuse
partial insertion sort for choosing the pivots, reducing binary space.

--
John Naylor
EDB: http://www.enterprisedb.com



Re: Insertion Sort Improvements

From
Benjamin Coutu
Date:
Getting back to improvements for small sort runs, it might make sense to consider using in-register based sorting via
sortingnetworks for very small runs. 

This talk is specific to database sorting and illustrates how such an approach can be vectorized:
https://youtu.be/HeFwVNHsDzc?list=PLSE8ODhjZXjasmrEd2_Yi1deeE360zv5O&t=1090

It looks like some of the commercial DBMSs do this very successfully. They use 4 512bit registers (AVX-512) in this
example,which could technically store 4 * 4 sort-elements (given that the sorting key is 64 bit and the tuple pointer
is64 bit). I wonder whether this could also be done with just SSE (instead of AVX), which the project now readily
supportsthanks to your recent efforts? 



Re: Insertion Sort Improvements

From
John Naylor
Date:

On Tue, Sep 27, 2022 at 4:39 PM Benjamin Coutu <ben.coutu@zeyos.com> wrote:
>
> Getting back to improvements for small sort runs, it might make sense to consider using in-register based sorting via sorting networks for very small runs.

> It looks like some of the commercial DBMSs do this very successfully. 

"Looks like"?

> They use 4 512bit registers (AVX-512) in this example, which could technically store 4 * 4 sort-elements (given that the sorting key is 64 bit and the tuple pointer is 64 bit). I wonder whether this could also be done with just SSE (instead of AVX), which the project now readily supports thanks to your recent efforts?

SortTuples are currently 24 bytes, and supported vector registers are 16 bytes, so not sure how you think that would work.

More broadly, the more invasive a change is, the more risk is involved, and the more effort to test and evaluate. If you're serious about trying to improve insertion sort performance, the simple idea we discussed at the start of the thread is a much more modest step that has a good chance of justifying the time put into it. That is not to say it's easy, however, because testing is a non-trivial amount of work.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Insertion Sort Improvements

From
Benjamin Coutu
Date:
> "Looks like"?

I cannot find the reference, but I've read a while back that a well-known company from Redwood uses it for their
in-memorycolumnar storage. That might have just been a rumor or might have been research only - not sure. It does not
reallymatter anyways. 

> SortTuples are currently 24 bytes, and supported vector registers are 16 bytes, so not sure how you think that would
work.

The thought was to logically group multiple sort tuples together and then create a vectorized version of that group
withjust the primitive type sort key as well as a small-sized index/offset into that sort group to later swap the
correspondingsort tuple referenced by that index/offset. The sorting network would allow us to do branch-less register
basedsorting for a particular sort run. I guess this idea is moot, given ... 

> More broadly, the more invasive a change is, the more risk is involved, and the more effort to test and evaluate. If
you'reserious about trying to improve insertion sort performance, the simple idea we discussed at the start of the
threadis a much more modest step that has a good chance of justifying the time put into it. That is not to say it's
easy,however, because testing is a non-trivial amount of work. 

I absolutely agree. Let's concentrate on improving things incrementally.
Please excuse me wondering given that you have contributed some of the recent vectorization stuff and seeing that you
haveobviously experimented a lot with the sort code, that you might already have tried something along those lines or
researchedthe subject - it is definitely a very interesting topic. 

Cheers, Ben



Re: Insertion Sort Improvements

From
Benjamin Coutu
Date:
Greetings,

I would like to revisit the discussion and concur with John's perspective that incremental progress through small,
successivemodifications is the appropriate approach to move forward. Therefore, I would like to propose two distinct
ideasaimed at enhancing the functionality of insertion sort: 

1. Implementation of a more efficient variant (as described in the introductory part of this thread):

------------ OLD ------------

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP)
        DO_SWAP(pl, pl - ST_POINTER_STEP);

------------ NEW ------------

for (
    pm = a + ST_POINTER_STEP;
    pm < a + n * ST_POINTER_STEP;
    pm += ST_POINTER_STEP
) {
    ST_POINTER_TYPE tmp = *pm;

    for (
        pl = pm - ST_POINTER_STEP;
        pl >= a && DO_COMPARE(pl, &tmp) > 0;
        pl -= ST_POINTER_STEP
    )
        *(pl + ST_POINTER_STEP) = *pl;

    *(pl + ST_POINTER_STEP) = tmp;
}

------------

2. It appears that there is a consensus against disregarding the presorted check, despite its questionable value. In
lightof this, an alternative suggestion is to integrate the presorted check into the insertion sort path by utilizing
anunbounded insertion sort. Only after a maximum number of swaps have occurred should we abandon the sorting process.
Ifinsertion sort is executed on the entire array without any swaps, we can simply return. If not, and we continue with
quicksortafter the swap limit has been reached, we at least have left the array in a more sorted state, which may
likelybe advantageous for subsequent iterations. 

------------ OLD ------------

if (n < 7)
{
    for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
         pm += ST_POINTER_STEP)
        for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
             pl -= ST_POINTER_STEP)
            DO_SWAP(pl, pl - ST_POINTER_STEP);
    return;
}
presorted = 1;
for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
{
    DO_CHECK_FOR_INTERRUPTS();
    if (DO_COMPARE(pm - ST_POINTER_STEP, pm) > 0)
    {
        presorted = 0;
        break;
    }
}
if (presorted)
    return;

------------ NEW ------------

#define LIMIT_SWAPS 30 /* to be determined empirically */

int swaps = 0;

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP) {
        DO_SWAP(pl, pl - ST_POINTER_STEP);

        if (++swaps == LIMIT_SWAPS)
            goto quicksort;
    }

if (swaps == 0)
    return;

quicksort:

------------

Naturally, both modifications (with point 2 being highly speculative) are currently independent of each other, and it
isalso crucial to benchmark the combined variant as well as different values for LIMIT_SWAPS. 
I would greatly appreciate assistance in benchmarking these proposed changes. Your collaboration in this matter would
beinvaluable. 

Cheers, Ben



Re: Insertion Sort Improvements

From
John Naylor
Date:

On Tue, May 23, 2023 at 4:10 PM Benjamin Coutu <ben.coutu@zeyos.com> wrote:
>
> Greetings,
>
> I would like to revisit the discussion and concur with John's perspective that incremental progress through small, successive modifications is the appropriate approach to move forward. Therefore, I would like to propose two distinct ideas aimed at enhancing the functionality of insertion sort:
>
> 1. Implementation of a more efficient variant (as described in the introductory part of this thread):

That's worth trying out. It might also then be worth trying to push both unordered values -- the big one up / the small one down. I've seen other implementations do that, but don't remember where, or what it's called.

> 2. It appears that there is a consensus against disregarding the presorted check, despite its questionable value. In light of this, an alternative suggestion is to integrate the presorted check into the insertion sort path by utilizing an unbounded insertion sort.

"Unbounded" means no bounds check on the array. That's not possible in its current form, so I think you misunderstood something.

> Only after a maximum number of swaps have occurred should we abandon the sorting process. 

I only remember implementations tracking loop iterations, not swaps. You'd need evidence that this is better.

> If insertion sort is executed on the entire array without any swaps, we can simply return. If not, and we continue with quicksort after the swap limit has been reached, we at least have left the array in a more sorted state, which may likely be advantageous for subsequent iterations.

An important part not mentioned yet: This might only be worth doing if the previous recursion level performed no swaps during partitioning and the current pivot candidates are ordered. That's a bit of work and might not be convenient now -- it'd be trivial with dual-pivot, but I've not looked at that in a while. (Fittingly, dual-pivot requires a higher insertion sort threshold so it goes both ways.)

> I would greatly appreciate assistance in benchmarking these proposed changes. Your collaboration in this matter would be invaluable.

I advise looking in the archives for scripts from previous benchmarks. No need to reinvent the wheel -- it's enough work as it is. A key thing here for #1 is that improving insertion sort requires increasing the threshold to show the true improvement. It's currently 7, but should really be something like 10. A script that repeats tests for, say, 7 through 18 should show a concave-up shape in the times. The bottom of the bowl should shift to higher values, and that minimum is what should be compared.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Insertion Sort Improvements

From
Benjamin Coutu
Date:
> That's worth trying out. It might also then be worth trying to push both unordered values -- the big one up / the
smallone down. I've seen other implementations do that, but don't remember where, or what it's called. 

It is important that we do not do 2 compares two avoid one copy (assignment to temporary) as you did in your patch
earlierin this thread, cause compares are usually pretty costly (also two compares are then inlined, bloating the
binary).
Assigning a sort tuple to a temporary translates to pretty simple assembly code, so my suggested modification should
outperform.It cuts down the cost of the inner loop by ca. 40% comparing the assembly. And it avoids having to re-read
memoryon each comparison, as the temporary can be held in registers. 

> "Unbounded" means no bounds check on the array. That's not possible in its current form, so I think you misunderstood
something.

Sorry for the confusion. I didn't mean unbounded in the "array bound checking" sense, but in the "unrestricted number
ofloops" sense. 

> I only remember implementations tracking loop iterations, not swaps. You'd need evidence that this is better.

Well, the idea was to include the presorted check somehow. Stopping after a certain number of iterations is surely more
safethan stopping after a number of swaps, but we would then implicitly also stop our presort check. We could change
thatthough: Count loop iterations and on bailout continue with a pure presort check, but from the last position of the
insertionsort -- not all over again -- by comparing against the maximum value recorded during the insertion sort.
Thoughts?

> An important part not mentioned yet: This might only be worth doing if the previous recursion level performed no
swapsduring partitioning and the current pivot candidates are ordered. 

Agreed.

> It's currently 7, but should really be something like 10. A script that repeats tests for, say, 7 through 18 should
showa concave-up shape in the times. The bottom of the bowl should shift to higher values, and that minimum is what
shouldbe compared. 

Yeah, as alluded to before, it should be closer to 10 nowadays.
In any case it should be changed at least from 7 to 8, cause then we could at least discard the additional check for n
>7 in the quicksort code path (see /src/include/lib/sort_template.h#L322). Currently we check n < 7 and a few lines
downwe check for n > 7, if we check n < 8 for insertion sort then the second check becomes obsolete. 

Benjamin Coutu
http://www.zeyos.com