> > Sorting in smaller batches that better fit into CPU cache.
>
> More reading suggested that we are looking for a cache-oblivious
> sorting algorithm.
Since David referred to the L3 cache size as the starting point for a possible configuration parameter, what we're after is actually a cache-conscious algorithm, not a cache-oblivious one.
> One of the papers[1] mentions that in quicksort, whenever a partition becomes small enough to fit in cache,
> instead of partitioning it further, we can sort it right there with insertion sort.
> > Memory-tuned quicksort uses insertion sort to sort small subsets while they are in
> > the cache, instead of partitioning further. This increases the instruction count,
> > but reduces the cache misses
>
> If this looks like a step in the right direction, I can give it a try and do more reading and experimentation.
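
For concreteness, the quoted idea reads roughly like the sketch below: ordinary quicksort that stops partitioning once a run fits in cache and finishes that run with insertion sort while it is presumably still resident. This is untested, operates on plain ints, and CACHE_ELEMS is a made-up assumption rather than a tuned value -- I'm not claiming this matches the paper's actual implementation.

```c
#include <stddef.h>

#define CACHE_ELEMS 4096		/* assumption: elements that fit in cache */

static void
insertion_sort(int *a, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		int		key = a[i];
		size_t	j = i;

		while (j > 0 && a[j - 1] > key)
		{
			a[j] = a[j - 1];
			j--;
		}
		a[j] = key;
	}
}

static void
memory_tuned_qsort(int *a, long lo, long hi)
{
	while (hi - lo + 1 > CACHE_ELEMS)
	{
		/* Hoare partition around the middle element */
		int		pivot = a[lo + (hi - lo) / 2];
		long	i = lo - 1;
		long	j = hi + 1;

		for (;;)
		{
			do { i++; } while (a[i] < pivot);
			do { j--; } while (a[j] > pivot);
			if (i >= j)
				break;
			int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
		}

		/* Recurse into the smaller side; iterate on the larger side */
		if (j - lo < hi - j)
		{
			memory_tuned_qsort(a, lo, j);
			lo = j + 1;
		}
		else
		{
			memory_tuned_qsort(a, j + 1, hi);
			hi = j;
		}
	}

	/* Partition now fits in cache: finish it with insertion sort */
	insertion_sort(a + lo, (size_t) (hi - lo + 1));
}
```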
I'm not close enough to this thread to guess at the right direction (although I hope related work will help), but I have a couple of general remarks:
1. In my experience, academic papers like to test sorting with register-sized numbers or strings. Our SortTuples are bigger, have complex comparators, and can fall back to fields in the full tuple (see the struct sketch after this list).
2. That paper is over 20 years old. If it demonstrated something genuinely useful, some of those concepts would likely have been implemented in the real world somewhere by now. Looking for evidence of that might be a good exercise.
3. 20-year-old results may not carry over to modern hardware.
4. Open source software is more widespread in the academic world now than 20 years ago, so papers with code (maybe even the author's github) are much more useful to us in my view.
5. It's actually not terribly hard to make sorting faster for some specific cases -- the hard part is keeping other inputs of interest from regressing.
6. The bigger the change, the bigger the risk of regressing somewhere.
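
To put remark 1 in concrete terms, the element we actually sort looks roughly like this. This is paraphrased from PostgreSQL's tuplesort code and the exact field set varies across versions; the point is that comparisons start from the leading key and, on ties, chase the tuple pointer into the full tuple -- nothing like the register-sized keys the paper benchmarks.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;		/* as in postgres.h */

/* Paraphrase of PostgreSQL's SortTuple; fields vary by version */
typedef struct SortTuple
{
	void	   *tuple;			/* the full tuple, consulted on ties */
	Datum		datum1;			/* value of first key column */
	bool		isnull1;		/* is first key column NULL? */
	int			srctape;		/* source tape number, for merge steps */
} SortTuple;
```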