Home > mailing lists

Re: [GSoC 2018] Proposal Draft - Mailing list pgsql-hackers

From	Kefan Yang
Subject	Re: [GSoC 2018] Proposal Draft
Date	March 18, 2018 05:00:21
Msg-id	CAF448YJhr0q5N45wVfars29G+h7VY9qBVnXFE1eWdp18X+xQww@mail.gmail.com Whole thread
In response to	Re: [GSoC 2018] Proposal Draft (Peter Geoghegan <pg@bowt.ie>)
Responses	Re: [GSoC 2018] Proposal Draft
List	pgsql-hackers

Tree view

Thanks for your quick feedback!

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if input
array is already sorted
or applying insertion sort directly for short arrays to see if the
performance can be
improved. I am still looking for other common optimization methods.

"""

What I am trying to say here is that similar optimizations can be applied to novel algorithms or other implementations of quicksort.

The paper about "Dual-Pivot Quicksort" is really helpful and it has a Java implementation already. I can definitely make use of that.

Also, I am wondering what's the normal approach to testing if a certain sorting implementation brings performance gain in this project? More specifically, You mentioned that little progress has been made with novel algorithmic enhancement. How can we get that conclusion?

I am still working on my proposal and I will post a new version soon:)

Regards,

Kefan

2018-03-17 18:05 GMT-07:00 Peter Geoghegan <pg@bowt.ie>:

On Sat, Mar 17, 2018 at 5:34 PM, Kefan Yang <starordust@gmail.com> wrote:
> I am Kefan Yang, a third-year Computing Science student from Simon Fraser
> University, Canada. I am very interested in the sorting algorithm
> benchmarking and implementation issue you mentioned on the idealist of
> Google Summer of Code 2018.

> I've attached a proposal draft to this email. It's a brief one but I guess
> it's enough to show my current ideas:)

The proposal reads:

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if input
array is already sorted
or applying insertion sort directly for short arrays to see if the
performance can be
improved. I am still looking for other common optimization methods.

"""

But our qsort implementation, which is the Bentley & McIlroy
implementation with a couple of customizations, already does
everything you mention (I refer to the precheck algorithmic
customization, and the way that we check to see which side of a pivot
to use real recursion with to avoid stack overflows). I suggest that
you read the paper [1] -- the code that we use is almost directly
lifted from the paper. The opaque sounding variable names are the
same, and the C code from the paper is structured in exactly the same
way.

I think that this won't be a particularly easy project to get
committed. I suggest that if you go forward with it that you
specifically look into integrating "Dual-Pivot Quicksort" [2] as a
whole cloth replacement for the B&M implementation. It seems like this
has some chance of working out, because it's more or less acknowledged
by Bentley himself to be a kind of successor to his own industrial
quicksort implementation [3] -- it's derived from it. Note that Java 7
uses dual pivot quicksort when sorting integers.

In general, we've had a lot of success with optimizing sorting in the
past few years by focusing on things like avoiding cache misses in
comparators. There has been much less success with algorithmic
improvements, and no success at all with novel algorithmic
enhancements. PostgreSQL development just isn't a great way to do that
sort of thing.

BTW, I noticed that you go on to say this:

"""
However,
since disk operation is much expensive than in-memory sorting, I am
not sure if we can
observe a significant difference in this way.

"""

I think that you'd be surprised at how often this isn't true these
days, at least when sorting enough data for spilling to disk to be
relevant. The main reason for that is that the cost of writing out
runs increases linearly, and therefore eventually becomes very small
compared to the costs of sorting itself, which increases at a
linearithmic rate.

[1] http://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[2] http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
[3] http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-September/002630.html
--
Peter Geoghegan

pgsql-hackers by date:

From: Tom Lane
Date: 18 March 2018, 04:37:22
Subject: Re: Precision loss casting float to numeric

From: Tom Lane
Date: 18 March 2018, 05:04:44
Subject: Re: Precision loss casting float to numeric

Re: [GSoC 2018] Proposal Draft - Mailing list pgsql-hackers

Previous

Next