Thread: database sorting algorithms.
Jian He: On Sat, May 1, 2021 at 9:07 AM Jian He <hejian.mark@gmail.com> wrote: > Been self study Database, from database I deep dived into sorting algorithms. Peek a good book, because that is a hairy topic with lot of math and other previous knowledge required. > Databases can do in-memory QuickSort. It also has an on-disk MergeSort. If you ignore top-n and other stuff, just full sorts, databases uses two types of sorting, internal sorts, which are the ones used when you can fit your data on ram, and external sorts, when they have to spill to tape/disk/whatever. For internal sorts you can use quicksort, heapsort, mergesort, even insertion sort. Quicksort with fallback to insertion is a popular one ( quicksort and merge sort are divide and conquer algorithms, but once they have divided enough it is easier to fallback to a simpler algorithm ). When the data no longers fit in ram you need to make an external sort. The classic way to do this is to sort chunks as big as they fit in ram, write presorted chunks to disk and then merge those chunks. These is called merge sort too, and is similar to a memory merge sort, but not the same ( when I've done it I use two way merge for memory merge sorts, but N-way for disk sort ( current memories are so big I never need to do more than 1 merge pass since the tape era ) > For MergeSort: I follow this tutorial https://youtu.be/6pV2IF0fgKY?t=1108 (around 1 minutes only) > But I am still not fully understanding about nlogn. I understand how many passes it will take, that is logn. > Yes each pass will sort N elements. > But I still don't get the N stand for in n*logn. Basically you do logn passes over N elements. > Why does each pass take n time to sort it? I have not time for looking at the video, but basic binary merge sort is N log N because you have logN passes and each one must merge N elements. Imagine you do an 8 elements pure merge sort. 1st you split it into 4+4 elements, recurse to sort them and merge them. Merging is an N-complexity op because you need to read the 8 elements. 4 elements are split into 2+2, same thing, merge neads four reads, but you have two merges, so eight reads again. 2 elements are split into 1+1, two reads for merging, but you have four chungk, so eight again. Total, 3 passes ( log2(8) ) of eight reads. Real sorts are more complex, but the N is always present, as you always need to at least read the full input set to sort. Note, in external sort you may have N elements, of which only M fit in ram. You do C=N/M chunks , these need C*MlogM = N logMtime to be sorted and C*M=N time to be written to and read from disk. But to merge C chunks you basically do logC comparison for element, so add N*logC N(logN-logM). If you add appropiate constants and add all you'll find the final result is O(NlogN). Francisco Olarte. Francisco Olarte.
On 1 May 2021, at 17:06, Jian He wrote:
Been self study Database, from database I deep dived into sorting algorithms.
Databases can do in-memory QuickSort. It also has an on-disk MergeSort.
For MergeSort: I follow this tutorial https://youtu.be/6pV2IF0fgKY?t=1108
(around 1 minutes only)
Also check https://en.wikipedia.org/wiki/Merge_sort
But I am still not fully understanding about nlogn. I understand how many
passes it will take, that is* logn. *
Yes each pass will sort N elements.
But I still don't get the N stand for in nlogn.*
So, answering the question…
The ’n’ refers to the need to do something to each element at least once, so the sort time grows in simple proportion to the size of the list that needs to be sorted. Unfortunately that is not enough work to get the list sorted so extra steps are needed. The log(n) indicates how many extra steps are needed. So the overall performance is proportional to the number of elements (N) multiplied by the log of the number of elements, viz., N * log(N)
Regards
Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.
— H. L. Mencken, 1920
On 1 May 2021, at 17:06, Jian He wrote:
Been self study Database, from database I deep dived into sorting algorithms.
Databases can do in-memory QuickSort. It also has an on-disk MergeSort.
For MergeSort: I follow this tutorial https://youtu.be/6pV2IF0fgKY?t=1108
(around 1 minutes only)Also check https://en.wikipedia.org/wiki/Merge_sort
But I am still not fully understanding about nlogn. I understand how many
passes it will take, that is* logn. *
Yes each pass will sort N elements.
But I still don't get the N stand for in nlogn.*So, answering the question…
The ’n’ refers to the need to do something to each element at least once, so the sort time grows in simple proportion to the size of the list that needs to be sorted. Unfortunately that is not enough work to get the list sorted so extra steps are needed. The log(n) indicates how many extra steps are needed. So the overall performance is proportional to the number of elements (N) multiplied by the log of the number of elements, viz., N * log(N)Regards
Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.
— H. L. Mencken, 1920