Thread: Randomisation for ensuring nlogn complexity in quicksort
Hi all,

I have been reading the recent discussion and was researching a bit, and I think that we should really go with the idea of randomising the input data (if it is not completely presorted), to ensure that we do not get quadratic complexity.

One easy way to do that could be to take a sample of the data set, and take a pivot out of it. Still, a better way could be to take multiple samples which are spread over the data set, select a value from each of them, and then take a cumulative pivot (the median, maybe).

Anyway, I really think that if we do not go with the above ideas, then we should somehow factor in the degree of randomness of the input data when making the decision between quicksort and external merge sort for a set of rows. This shouldn't be too complex, and should give us a fixed n log n complexity even for wild data sets, without affecting existing normal data sets that are present in everyday transactions. I even believe that those data sets will also benefit from the above optimisation.

Thoughts/Comments?

Regards,

Atri

Sent from my iPad
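A minimal sketch of the sampling idea above, in plain C on an int array (not PostgreSQL's tuplesort code; the sample count, the helper names, and the evenly spaced positions are arbitrary choices made purely for illustration):

    /*
     * Hypothetical "median of spread-out samples" pivot selection.
     * Assumes nsamples <= 64 and nsamples <= n.
     */
    #include <stdlib.h>

    static int
    cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a;
        int y = *(const int *) b;

        return (x > y) - (x < y);
    }

    static int
    choose_pivot_by_sampling(const int *data, size_t n, size_t nsamples)
    {
        int     samples[64];
        size_t  step = n / nsamples;

        /* Pull one value from each of nsamples evenly spaced positions. */
        for (size_t i = 0; i < nsamples; i++)
            samples[i] = data[i * step];

        /* Sort the small sample and use its median as the pivot. */
        qsort(samples, nsamples, sizeof(int), cmp_int);
        return samples[nsamples / 2];
    }

Whether a scheme like this buys anything over the pivot selection already in place is exactly what the replies below take up.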
My PostgreSQL (9.2) is crashing after certain load tests. Currently, PostgreSQL is crashing when 800 to 1000 threads are run simultaneously on a schema with 10 million records. Not sure if we have to tweak some more parameters of Postgres.

-----
jasmine

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Randomisation-for-ensuring-nlogn-complexity-in-quicksort-tp5761907p5762011.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On Mon, Jul 01, 2013 at 04:42:04AM -0700, jasmine wrote:
> My PostgreSQL (9.2) is crashing after certain load tests. Currently, PostgreSQL is crashing when 800 to 1000 threads are run simultaneously on a schema with 10 million records. Not sure if we have to tweak some more parameters of Postgres.

Jasmine,

Please start your own thread rather than replying to some unrelated thread.

In the particular case above, please also provide reproducible test cases, including the exact version(s) of PostgreSQL to which they apply (9.2.4, e.g.), along with output from same, so other people can see what you're talking about and actually help.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Mon, Jul 1, 2013 at 10:02 PM, David Fetter <david@fetter.org> wrote:
> On Mon, Jul 01, 2013 at 04:42:04AM -0700, jasmine wrote:
>> My PostgreSQL (9.2) is crashing after certain load tests. Currently, PostgreSQL is crashing when 800 to 1000 threads are run simultaneously on a schema with 10 million records. Not sure if we have to tweak some more parameters of Postgres.
>
> Jasmine,
>
> Please start your own thread rather than replying to some unrelated thread.
>
> In the particular case above, please also provide reproducible test cases, including the exact version(s) of PostgreSQL to which they apply (9.2.4, e.g.), along with output from same, so other people can see what you're talking about and actually help.

Not sure if this belongs on -general instead of -hackers.

Regards,

Atri

--
Regards,

Atri
l'apprenant
On Sun, Jun 30, 2013 at 8:30 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
> I have been reading the recent discussion and was researching a bit, and I think that we should really go with the idea of randomising the input data (if it is not completely presorted), to ensure that we do not get quadratic complexity.

That doesn't ensure any such thing. It just makes it less likely. But we have existing guards that also make that unlikely, so I'm not sure what we'd be gaining.

> One easy way to do that could be to take a sample of the data set, and take a pivot out of it. Still, a better way could be to take multiple samples which are spread over the data set, select a value from each of them, and then take a cumulative pivot (the median, maybe).

We pretty much do that already.

> This shouldn't be too complex, and should give us a fixed n log n complexity even for wild data sets, without affecting existing normal data sets that are present in everyday transactions. I even believe that those data sets will also benefit from the above optimisation.

The only method of selecting a pivot for quicksort that obtains O(n lg n) run time with 100% certainty is to have a magical oracle inside the computer that tells you in fixed time and with perfect accuracy which pivot you should select.

If you want to get a useful response to your emails, consider including a statement of what you think the problem is and why you think your proposed changes will help. Consider offering a test case that performs badly and an analysis of the reason why.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
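For context on "We pretty much do that already": PostgreSQL's quicksort is derived from the Bentley and McIlroy implementation referenced later in this thread, which already draws its pivot from a small, spread-out sample. A simplified sketch of that style of pivot selection, in plain C on an int array with the classic size thresholds from the paper (illustrative only, not the actual qsort_arg code):

    #include <stddef.h>

    /* Median of three values. */
    static int
    med3(int a, int b, int c)
    {
        if (a < b)
            return b < c ? b : (a < c ? c : a);
        return b > c ? b : (a > c ? c : a);
    }

    static int
    choose_pivot_bm(const int *v, size_t n)
    {
        size_t  mid = n / 2;

        if (n <= 7)
            return v[mid];                          /* tiny: middle element */
        if (n <= 40)
            return med3(v[0], v[mid], v[n - 1]);    /* mid-sized: median of 3 */

        /* Large: the "ninther", median of the medians of three spaced triples. */
        {
            size_t  d = n / 8;

            return med3(med3(v[0], v[d], v[2 * d]),
                        med3(v[mid - d], v[mid], v[mid + d]),
                        med3(v[n - 1 - 2 * d], v[n - 1 - d], v[n - 1]));
        }
    }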
On Mon, Jul 1, 2013 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This shouldn't be too complex, and should give us a fixed n log n complexity even for wild data sets, without affecting existing normal data sets that are present in everyday transactions. I even believe that those data sets will also benefit from the above optimisation.
>
> The only method of selecting a pivot for quicksort that obtains O(n lg n) run time with 100% certainty is to have a magical oracle inside the computer that tells you in fixed time and with perfect accuracy which pivot you should select.

Doesn't a linear median algorithm (say median of medians) get you O(n lg n)?

Granted, with a huge constant (I think 4)... but it should still be O(n lg n).
On Mon, Jul 1, 2013 at 3:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Jul 1, 2013 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> This shouldn't be too complex, and should give us a fixed n log n complexity even for wild data sets, without affecting existing normal data sets that are present in everyday transactions. I even believe that those data sets will also benefit from the above optimisation.
>>
>> The only method of selecting a pivot for quicksort that obtains O(n lg n) run time with 100% certainty is to have a magical oracle inside the computer that tells you in fixed time and with perfect accuracy which pivot you should select.
>
> Doesn't a linear median algorithm (say median of medians) get you O(n lg n)?
>
> Granted, with a huge constant (I think 4)... but it should still be O(n lg n).

No. Thinking about this a little more, I believe the way it works out is that any algorithm for picking the median that guarantees that a certain *percentage* of the tuples will be in the smaller partition will have O(n lg n) complexity, but any algorithm that only guarantees a fixed *number* of tuples in the smaller partition is still quadratic in complexity. In the case of a median algorithm, you're only guaranteed to have 2 elements in the smaller partition, which is a constant. If you take a median of medians, you're guaranteed to have 8 elements in the smaller partition, which is bigger, but still a constant.

The reason this doesn't matter much in practice is that the data distribution which causes quadratic behavior for median-of-medians is not one that is likely to occur in the real world, and will probably only come up if chosen by an adversary who is attempting to make your life miserable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
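In recurrence terms, the distinction being drawn here works out roughly as follows. If the pivot choice only guarantees a constant number c of elements on the smaller side of the partition, the recurrence is T(n) = T(n - c) + O(n), which solves to O(n^2). If it guarantees a fixed fraction a of the elements (0 < a <= 1/2) on the smaller side, the recurrence is T(n) = T(a*n) + T((1 - a)*n) + O(n), which solves to O(n log n).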
On Mon, Jul 1, 2013 at 5:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 1, 2013 at 3:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Mon, Jul 1, 2013 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> This shouldn't be too complex, and should give us a fixed n log n complexity even for wild data sets, without affecting existing normal data sets that are present in everyday transactions. I even believe that those data sets will also benefit from the above optimisation.
>>>
>>> The only method of selecting a pivot for quicksort that obtains O(n lg n) run time with 100% certainty is to have a magical oracle inside the computer that tells you in fixed time and with perfect accuracy which pivot you should select.
>>
>> Doesn't a linear median algorithm (say median of medians) get you O(n lg n)?
>>
>> Granted, with a huge constant (I think 4)... but it should still be O(n lg n).
>
> No. Thinking about this a little more, I believe the way it works out is that any algorithm for picking the median that guarantees that a certain *percentage* of the tuples will be in the smaller partition will have O(n lg n) complexity, but any algorithm that only guarantees a fixed *number* of tuples in the smaller partition is still quadratic in complexity. In the case of a median algorithm, you're only guaranteed to have 2 elements in the smaller partition, which is a constant. If you take a median of medians, you're guaranteed to have 8 elements in the smaller partition, which is bigger, but still a constant.

What?

A median of medians algorithm will guarantee floor(N/2) elements on the smaller side. That's the definition of median.

Note that I'm referring to picking the actual median of all tuples, not just a sample. That's slow, but it guarantees O(n log n).
On Mon, Jul 1, 2013 at 12:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I have been reading the recent discussion and was researching a bit, and I think that we should really go with the idea of randomising the input data (if it is not completely presorted), to ensure that we do not get quadratic complexity.
>
> That doesn't ensure any such thing. It just makes it less likely. But we have existing guards that also make that unlikely, so I'm not sure what we'd be gaining.

+1

--
Peter Geoghegan
On Mon, Jul 1, 2013 at 4:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> What?
>
> A median of medians algorithm will guarantee floor(N/2) elements on the smaller side. That's the definition of median.
>
> Note that I'm referring to picking the actual median of all tuples, not just a sample. That's slow, but it guarantees O(n log n).

Ah, OK. Well, yes, that would guarantee O(n lg n). But, as you say, it would be slow. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
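For reference, the deterministic selection Claudio is describing (the classic BFPRT "median of medians" algorithm) finds the k-th smallest element in O(n) worst-case time, so using it to pick the true median as the pivot makes quicksort O(n log n) in the worst case, at a hefty constant cost per partition. A hypothetical sketch in plain C on an int array, purely illustrative and unrelated to the tuplesort code:

    /* Insertion sort for the tiny groups used below. */
    static void
    insertion_sort(int *a, int n)
    {
        for (int i = 1; i < n; i++)
        {
            int     v = a[i];
            int     j = i - 1;

            while (j >= 0 && a[j] > v)
            {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = v;
        }
    }

    /* Returns the k-th smallest (0-based) of a[0..n-1]; rearranges a. */
    static int
    select_kth(int *a, int n, int k)
    {
        int     ngroups, pivot, lt, gt, i;

        if (n <= 5)
        {
            insertion_sort(a, n);
            return a[k];
        }

        /* 1. Sort each group of five and gather the group medians up front. */
        ngroups = (n + 4) / 5;
        for (int g = 0; g < ngroups; g++)
        {
            int     lo = g * 5;
            int     len = (lo + 5 <= n) ? 5 : n - lo;
            int     tmp;

            insertion_sort(a + lo, len);
            tmp = a[g];
            a[g] = a[lo + len / 2];
            a[lo + len / 2] = tmp;
        }

        /* 2. The pivot is the true median of those group medians. */
        pivot = select_kth(a, ngroups, ngroups / 2);

        /* 3. Three-way partition around the pivot (handles duplicates). */
        lt = 0;
        gt = n;
        i = 0;
        while (i < gt)
        {
            if (a[i] < pivot)
            {
                int     t = a[lt];

                a[lt++] = a[i];
                a[i++] = t;
            }
            else if (a[i] > pivot)
            {
                int     t = a[--gt];

                a[gt] = a[i];
                a[i] = t;
            }
            else
                i++;
        }

        /* 4. Recurse only into the side that contains the k-th element. */
        if (k < lt)
            return select_kth(a, lt, k);
        if (k >= gt)
            return select_kth(a + gt, n - gt, k - gt);
        return pivot;
    }

The linear worst case comes from the chosen pivot being guaranteed to leave roughly at least 30% of the elements on each side of every partition.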
On Tue, Jul 2, 2013 at 1:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jun 30, 2013 at 8:30 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>> I have been reading the recent discussion and was researching a bit, and I think that we should really go with the idea of randomising the input data (if it is not completely presorted), to ensure that we do not get quadratic complexity.
>
> That doesn't ensure any such thing. It just makes it less likely. But we have existing guards that also make that unlikely, so I'm not sure what we'd be gaining.
>
>> One easy way to do that could be to take a sample of the data set, and take a pivot out of it. Still, a better way could be to take multiple samples which are spread over the data set, select a value from each of them, and then take a cumulative pivot (the median, maybe).
>
> We pretty much do that already.
>
>> This shouldn't be too complex, and should give us a fixed n log n complexity even for wild data sets, without affecting existing normal data sets that are present in everyday transactions. I even believe that those data sets will also benefit from the above optimisation.
>
> The only method of selecting a pivot for quicksort that obtains O(n lg n) run time with 100% certainty is to have a magical oracle inside the computer that tells you in fixed time and with perfect accuracy which pivot you should select.
>
> If you want to get a useful response to your emails, consider including a statement of what you think the problem is and why you think your proposed changes will help. Consider offering a test case that performs badly and an analysis of the reason why.

Right, thanks for that. I will keep that in mind.

I was thinking about *mostly sorted* datasets; consider the following:

10 11 12 4 5 6 1 2

(Just off the top of my head, sorry if I missed something.)

Now, the above data set is made up of a number of rotations of a sorted data set, so it is mostly sorted, albeit with some disorder. My point is that these kinds of data sets (not necessarily the above one) can lead to a bad choice of pivot, and hence give us a complexity worse than n log n. I know we have a check for presorted inputs, but I wasn't sure how we deal with mostly sorted inputs, as quicksort likes disorder in its input.

I agree with Claudio's idea. One thing to keep in mind is that we don't use quicksort for large data sets anyway, and move to external merge sort for them. So, we could think of using the median of medians algorithm for the purpose.

Another thing I would like to investigate is our quicksort implementation's performance (and maybe external merge sort's as well) on multi-word keys.

Regards,

Atri

--
Regards,

Atri
l'apprenant
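For anyone who wants to experiment, a small hypothetical generator for this kind of "several ascending runs whose ranges step downwards" input, in plain C (the function name and parameters are made up for the example, and nothing here claims the current code actually sorts such input slowly):

    #include <stdlib.h>

    /*
     * Builds n ints arranged as nruns ascending runs whose starting values
     * decrease, e.g. 9 10 11 6 7 8 3 4 5 for n = 9, nruns = 3.
     * Assumes nruns divides n; caller frees the result.
     */
    int *
    make_rotated_runs(size_t n, size_t nruns)
    {
        int    *v = malloc(n * sizeof(int));
        size_t  run_len = n / nruns;

        if (v == NULL)
            return NULL;

        for (size_t i = 0; i < n; i++)
        {
            size_t  run = i / run_len;      /* which run we are in   */
            size_t  pos = i % run_len;      /* offset inside the run */

            /* Later runs start at smaller values; each run ascends. */
            v[i] = (int) ((nruns - run) * run_len + pos);
        }
        return v;
    }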
On Tue, Jul 2, 2013 at 12:33 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>> If you want to get a useful response to your emails, consider including a statement of what you think the problem is and why you think your proposed changes will help. Consider offering a test case that performs badly and an analysis of the reason why.
>
> Right, thanks for that. I will keep that in mind.
>
> I was thinking about *mostly sorted* datasets; consider the following:
>
> 10 11 12 4 5 6 1 2

I think if you'll try it you'll find that we perform quite well on data sets of this kind - and if you read the code you'll see why.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 2, 2013 at 5:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 2, 2013 at 12:33 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>>> If you want to get a useful response to your emails, consider including a statement of what you think the problem is and why you think your proposed changes will help. Consider offering a test case that performs badly and an analysis of the reason why.
>>
>> Right, thanks for that. I will keep that in mind.
>>
>> I was thinking about *mostly sorted* datasets; consider the following:
>>
>> 10 11 12 4 5 6 1 2
>
> I think if you'll try it you'll find that we perform quite well on data sets of this kind - and if you read the code you'll see why.

Right, let me read the code again from that viewpoint.

Thanks a ton for your help!

Regards,

Atri

--
Regards,

Atri
l'apprenant
On Tue, Jul 2, 2013 at 5:04 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>> I think if you'll try it you'll find that we perform quite well on data sets of this kind - and if you read the code you'll see why.
>
> Right, let me read the code again from that viewpoint.

In my opinion, it would be worthwhile reading the original Bentley and McIlroy paper [1] and using what you learned to write a patch that adds comments throughout the canonical qsort_arg, and perhaps the other variants.

[1] http://www.enseignement.polytechnique.fr/informatique/profs/Luc.Maranget/421/09/bentley93engineering.pdf

--
Peter Geoghegan
On Tue, Jul 2, 2013 at 12:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jul 2, 2013 at 5:04 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>>> I think if you'll try it you'll find that we perform quite well on data sets of this kind - and if you read the code you'll see why.
>>
>> Right, let me read the code again from that viewpoint.
>
> In my opinion, it would be worthwhile reading the original Bentley and McIlroy paper [1] and using what you learned to write a patch that adds comments throughout the canonical qsort_arg, and perhaps the other variants.
>
> [1] http://www.enseignement.polytechnique.fr/informatique/profs/Luc.Maranget/421/09/bentley93engineering.pdf

That's weird; it doesn't seem as sophisticated as even libc's introsort.

Perhaps an introsort [1] approach wouldn't hurt: do the quick and dirty median selection pg is already doing (or a better one, if a better one is found), but check recursion depth/input size ratios. When across K recursive calls the input set hasn't been halved in size, switch to median of medians to guard against quadratic complexity.

[1] http://en.wikipedia.org/wiki/Selection_algorithm#Introselect
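A hypothetical sketch of that kind of guard, in plain C on an int array. It uses the textbook introsort form of the check (a recursion-depth cap with a heapsort fallback) rather than the halving test with a median-of-medians fallback described above, but the mechanism is the same; none of the names or thresholds come from the PostgreSQL sources:

    #include <stddef.h>

    static void
    swap_int(int *a, int *b)
    {
        int     t = *a;

        *a = *b;
        *b = t;
    }

    /* Heapsort fallback: guaranteed O(n log n), no recursion. */
    static void
    sift_down(int *v, size_t root, size_t end)
    {
        while (2 * root + 1 < end)
        {
            size_t  child = 2 * root + 1;

            if (child + 1 < end && v[child] < v[child + 1])
                child++;
            if (v[root] >= v[child])
                return;
            swap_int(&v[root], &v[child]);
            root = child;
        }
    }

    static void
    heapsort_int(int *v, size_t n)
    {
        for (size_t i = n / 2; i-- > 0;)
            sift_down(v, i, n);
        for (size_t i = n; i-- > 1;)
        {
            swap_int(&v[0], &v[i]);
            sift_down(v, 0, i);
        }
    }

    static void
    introsort_rec(int *v, size_t n, int depth_limit)
    {
        while (n > 1)
        {
            if (depth_limit-- <= 0)
            {
                /* Partitions are not shrinking fast enough; bail out to
                 * the worst-case-safe sort for the remaining range. */
                heapsort_int(v, n);
                return;
            }

            /* Cheap median-of-three pivot: sort v[0], v[mid], v[n-1],
             * then park the median at v[n-1] for a Lomuto partition. */
            size_t  mid = n / 2;

            if (v[mid] < v[0])
                swap_int(&v[0], &v[mid]);
            if (v[n - 1] < v[0])
                swap_int(&v[0], &v[n - 1]);
            if (v[n - 1] < v[mid])
                swap_int(&v[mid], &v[n - 1]);
            swap_int(&v[mid], &v[n - 1]);

            int     pivot = v[n - 1];
            size_t  store = 0;

            for (size_t i = 0; i + 1 < n; i++)
                if (v[i] < pivot)
                    swap_int(&v[i], &v[store++]);
            swap_int(&v[store], &v[n - 1]);     /* pivot into final position */

            /* Recurse into the left part, loop on the right part. */
            introsort_rec(v, store, depth_limit);
            v += store + 1;
            n -= store + 1;
        }
    }

    void
    introsort(int *v, size_t n)
    {
        int     depth_limit = 0;

        /* Allow roughly 2 * log2(n) quicksort levels before falling back. */
        for (size_t m = n; m > 1; m >>= 1)
            depth_limit += 2;

        introsort_rec(v, n, depth_limit);
    }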