Thread: [GSoC 2018] Proposal Draft

[GSoC 2018] Proposal Draft

From
Kefan Yang
Date:
Hi everyone,

I am Kefan Yang, a third-year Computing Science student from Simon Fraser University, Canada. I am very interested in the sorting algorithm benchmarking and implementation issue you mentioned on the idealist of Google Summer of Code 2018. 

I am currently working on my proposal, but I am a little bit worried about whether I am on the right track. I've encountered some problems in the implementation of the benchmark of those sorting algorithms. I briefly described my current ideas in the proposal draft, but they may change a bit as I read more research papers. I sincerely hope to get some feedback from you to improve my proposal, especially for the implementation part.

I've attached a proposal draft to this email. It's a brief one but I guess it's enough to show my current ideas:)

Sincerely,
Kefan Yang
Attachment

Re: [GSoC 2018] Proposal Draft

From
Peter Geoghegan
Date:
On Sat, Mar 17, 2018 at 5:34 PM, Kefan Yang <starordust@gmail.com> wrote:
> I am Kefan Yang, a third-year Computing Science student from Simon Fraser
> University, Canada. I am very interested in the sorting algorithm
> benchmarking and implementation issue you mentioned on the idealist of
> Google Summer of Code 2018.

> I've attached a proposal draft to this email. It's a brief one but I guess
> it's enough to show my current ideas:)

The proposal reads:

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if input
array is already sorted
or applying insertion sort directly for short arrays to see if the
performance can be
improved. I am still looking for other common optimization methods.

"""

But our qsort implementation, which is the Bentley & McIlroy
implementation with a couple of customizations, already does
everything you mention (I refer to the precheck algorithmic
customization, and the way that we check to see which side of a pivot
to use real recursion with to avoid stack overflows). I suggest that
you read the paper [1] -- the code that we use is almost directly
lifted from the paper. The opaque sounding variable names are the
same, and the C code from the paper is structured in exactly the same
way.

I think that this won't be a particularly easy project to get
committed. I suggest that if you go forward with it that you
specifically look into integrating "Dual-Pivot Quicksort" [2] as a
whole cloth replacement for the B&M implementation. It seems like this
has some chance of working out, because it's more or less acknowledged
by Bentley himself to be a kind of successor to his own industrial
quicksort implementation [3] -- it's derived from it. Note that Java 7
uses dual pivot quicksort when sorting integers.

In general, we've had a lot of success with optimizing sorting in the
past few years by focusing on things like avoiding cache misses in
comparators. There has been much less success with algorithmic
improvements, and no success at all with novel algorithmic
enhancements. PostgreSQL development just isn't a great way to do that
sort of thing.

BTW, I noticed that you go on to say this:

"""
However,
since disk operation is much expensive than in-memory sorting, I am
not sure if we can
observe a significant difference in this way.

"""

I think that you'd be surprised at how often this isn't true these
days, at least when sorting enough data for spilling to disk to be
relevant. The main reason for that is that the cost of writing out
runs increases linearly, and therefore eventually becomes very small
compared to the costs of sorting itself, which increases at a
linearithmic rate.

[1] http://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[2] http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
[3] http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-September/002630.html
-- 
Peter Geoghegan


Re: [GSoC 2018] Proposal Draft

From
Kefan Yang
Date:
Thanks for your quick feedback!

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if input
array is already sorted
or applying insertion sort directly for short arrays to see if the
performance can be
improved. I am still looking for other common optimization methods.

"""
What I am trying to say here is that similar optimizations can be applied to novel algorithms or other implementations of quicksort.

The paper about "Dual-Pivot Quicksort" is really helpful and it has a Java implementation already. I can definitely make use of that.

Also, I am wondering what's the normal approach to testing if a certain sorting implementation brings performance gain in this project? More specifically,  You mentioned that little progress has been made with novel algorithmic enhancement. How can we get that conclusion?

I am still working on my proposal and I will post a new version soon:)

Regards,
Kefan


2018-03-17 18:05 GMT-07:00 Peter Geoghegan <pg@bowt.ie>:
On Sat, Mar 17, 2018 at 5:34 PM, Kefan Yang <starordust@gmail.com> wrote:
> I am Kefan Yang, a third-year Computing Science student from Simon Fraser
> University, Canada. I am very interested in the sorting algorithm
> benchmarking and implementation issue you mentioned on the idealist of
> Google Summer of Code 2018.

> I've attached a proposal draft to this email. It's a brief one but I guess
> it's enough to show my current ideas:)

The proposal reads:

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if input
array is already sorted
or applying insertion sort directly for short arrays to see if the
performance can be
improved. I am still looking for other common optimization methods.

"""

But our qsort implementation, which is the Bentley & McIlroy
implementation with a couple of customizations, already does
everything you mention (I refer to the precheck algorithmic
customization, and the way that we check to see which side of a pivot
to use real recursion with to avoid stack overflows). I suggest that
you read the paper [1] -- the code that we use is almost directly
lifted from the paper. The opaque sounding variable names are the
same, and the C code from the paper is structured in exactly the same
way.

I think that this won't be a particularly easy project to get
committed. I suggest that if you go forward with it that you
specifically look into integrating "Dual-Pivot Quicksort" [2] as a
whole cloth replacement for the B&M implementation. It seems like this
has some chance of working out, because it's more or less acknowledged
by Bentley himself to be a kind of successor to his own industrial
quicksort implementation [3] -- it's derived from it. Note that Java 7
uses dual pivot quicksort when sorting integers.

In general, we've had a lot of success with optimizing sorting in the
past few years by focusing on things like avoiding cache misses in
comparators. There has been much less success with algorithmic
improvements, and no success at all with novel algorithmic
enhancements. PostgreSQL development just isn't a great way to do that
sort of thing.

BTW, I noticed that you go on to say this:

"""
However,
since disk operation is much expensive than in-memory sorting, I am
not sure if we can
observe a significant difference in this way.

"""

I think that you'd be surprised at how often this isn't true these
days, at least when sorting enough data for spilling to disk to be
relevant. The main reason for that is that the cost of writing out
runs increases linearly, and therefore eventually becomes very small
compared to the costs of sorting itself, which increases at a
linearithmic rate.

[1] http://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[2] http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
[3] http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-September/002630.html
--
Peter Geoghegan

Re: [GSoC 2018] Proposal Draft

From
Peter Geoghegan
Date:
On Sat, Mar 17, 2018 at 7:00 PM, Kefan Yang <starordust@gmail.com> wrote:
> What I am trying to say here is that similar optimizations can be applied to
> novel algorithms or other implementations of quicksort.

A novel algorithm is something to avoid here, because novel techniques
tend to only work out for specific datasets and data distributions. In
general, we care about a wide variety of cases, and are very keen to
avoid regressions.

> The paper about "Dual-Pivot Quicksort" is really helpful and it has a Java
> implementation already. I can definitely make use of that.

Cool. Please be careful to pick source material that has a license
that is compatible with the PostgreSQL license.

> Also, I am wondering what's the normal approach to testing if a certain
> sorting implementation brings performance gain in this project?

It varies quite a bit. You can search the lists archives for some of
this. For example, Tomas Vondra often tests my sorting patches by
subjecting them to a variety of tests with varied distributions and
data types. His results are then collated in a spreadsheet for easy
analysis. Maybe you can replicate that.

The only obvious trend is that everything is tested using real SQL
statements (including CREATE INDEX statements). In the past,
enhancements that were successfully incorporated tended to either
obviously have little or no downside, or have a very significant
upside that made it worth talking about accepting a small regression
in a minority of less representative cases.

> More
> specifically,  You mentioned that little progress has been made with novel
> algorithmic enhancement. How can we get that conclusion?

That's simply been the trend. It isn't particularly likely that
Postgres will be a project that pioneers some kind of algorithmic
enhancement that has broad applicability, that could just as easily
benefit a general purpose library sort routine. If you're interested
in algorithmic enhancements in the sense that I understand the term,
then that's the high bar that would have to be met -- that would
amount to a new insight in an area that has already been researched in
enormous detail by many people, most of whom don't know much about
database systems in particular.

On the other hand, techniques like abbreviated keys work well because
they effectively exploit things like the first normal form, and the
fact that a high start up cost for the sort is already very likely. It
is a technique specifically tailored for a database system, and modern
hardware characteristics.

-- 
Peter Geoghegan


Re: [GSoC 2018] Proposal Draft

From
Andrey Borodin
Date:
Hi Kefar!

18 марта 2018 г., в 5:34, Kefan Yang <starordust@gmail.com> написал(а):

I am Kefan Yang, a third-year Computing Science student from Simon Fraser University, Canada. I am very interested in the sorting algorithm benchmarking and implementation issue you mentioned on the idealist of Google Summer of Code 2018. 

I am currently working on my proposal, but I am a little bit worried about whether I am on the right track. I've encountered some problems in the implementation of the benchmark of those sorting algorithms. I briefly described my current ideas in the proposal draft, but they may change a bit as I read more research papers. I sincerely hope to get some feedback from you to improve my proposal, especially for the implementation part.

I've attached a proposal draft to this email. It's a brief one but I guess it's enough to show my current ideas:)

Peter have already added some context, I'll comment a bit on contents of your proposal.

I expect implementation of both timsort (dual pivot qsort) and introsort (depth-limited qsort). MySQL recently switched to introsort (they just used STL version instead of his own qsort).
Sorting routines are important part of many DBMS parts. But, indeed, sorting of actual user data is related to qsort very distantly - often DBMS deals with external sorting.

There is built-in Postgres way to benchmark - pgbench performance. But I do not think that it's output will be statistically conclusive. I think that benchmarking will be at least as time consuming as implementation itself.

BTW I think it is good to add Postgres hash table implementation to this project too. Currently, we use Larson's linear hashing (HTAB struct, dynahash.c), which is somewhat seldom used nowadays. It would be good to tests cuckoo hashing and some open addressing against HTAB interface.

Please do not forget to add to your proposal review of other's patches. If you will produce some useful output in a patch form and you will want it committed you will have to review someone else patch. This is required by Postgres dev process.

Best regards, Andrey Borodin.

Re: [GSoC 2018] Proposal Draft

From
Kefan Yang
Date:
Hello,

Thanks again for all the feedback! They have really helped me a lot, and I've made some changes in my proposal accordingly.

The major updates in this version are in the timeline and implementation sections. I feel there's still a lot to improve, so I would appreciate any comment:)

I decide to add the new hashing table into this proposal, and I am still working on it. This part is still quite incomplete in the proposal.

Regards,

Kefan

2018-03-18 9:39 GMT-07:00 Andrey Borodin <x4mmm@yandex-team.ru>:
Hi Kefar!

18 марта 2018 г., в 5:34, Kefan Yang <starordust@gmail.com> написал(а):

I am Kefan Yang, a third-year Computing Science student from Simon Fraser University, Canada. I am very interested in the sorting algorithm benchmarking and implementation issue you mentioned on the idealist of Google Summer of Code 2018. 

I am currently working on my proposal, but I am a little bit worried about whether I am on the right track. I've encountered some problems in the implementation of the benchmark of those sorting algorithms. I briefly described my current ideas in the proposal draft, but they may change a bit as I read more research papers. I sincerely hope to get some feedback from you to improve my proposal, especially for the implementation part.

I've attached a proposal draft to this email. It's a brief one but I guess it's enough to show my current ideas:)

Peter have already added some context, I'll comment a bit on contents of your proposal.

I expect implementation of both timsort (dual pivot qsort) and introsort (depth-limited qsort). MySQL recently switched to introsort (they just used STL version instead of his own qsort).
Sorting routines are important part of many DBMS parts. But, indeed, sorting of actual user data is related to qsort very distantly - often DBMS deals with external sorting.

There is built-in Postgres way to benchmark - pgbench performance. But I do not think that it's output will be statistically conclusive. I think that benchmarking will be at least as time consuming as implementation itself.

BTW I think it is good to add Postgres hash table implementation to this project too. Currently, we use Larson's linear hashing (HTAB struct, dynahash.c), which is somewhat seldom used nowadays. It would be good to tests cuckoo hashing and some open addressing against HTAB interface.

Please do not forget to add to your proposal review of other's patches. If you will produce some useful output in a patch form and you will want it committed you will have to review someone else patch. This is required by Postgres dev process.

Best regards, Andrey Borodin.

Attachment