Thread: Randomisation for ensuring nlogn complexity in quicksort

Randomisation for ensuring nlogn complexity in quicksort

From
Atri Sharma
Date:
Hi all,

I have been reading the recent discussion and was researching a bit,
and I think that we should really go with the idea of randomising the
input data (if it is not completely presorted), to ensure that we do
not get quadratic complexity.

One easy way to do that could be to take a sample of the data set and
take a pivot out of it. A still better way could be to take multiple
samples spread over the data set, select a value from each of them,
and then take a cumulative pivot (the median, maybe).
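Roughly, I mean something like the sketch below (illustrative only,
with made-up names, assuming int keys and n >= 3; not written against
our actual qsort code):

/* Sketch: pick the pivot as the median of three values sampled from
 * different regions of the input. */
#include <stdlib.h>

static int
median3(int a, int b, int c)
{
    if (a < b)
        return b < c ? b : (a < c ? c : a);
    return b > c ? b : (a > c ? c : a);
}

static int
sampled_pivot(const int *v, size_t n)
{
    /* One value from each third of the data set. */
    size_t third = n / 3;
    size_t i = rand() % third;
    size_t j = third + rand() % third;
    size_t k = 2 * third + rand() % (n - 2 * third);

    return median3(v[i], v[j], v[k]);
}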

Anyway, I really think that if we do not go with the above ideas, then
we should somehow factor in the degree of randomness of the input data
when deciding between quicksort and external merge sort for a set of
rows.

This shouldn't be too complex, and should give us a fixed n log n
complexity even for wild data sets, without affecting the existing
normal data sets that are present in everyday transactions. I even
believe that those data sets will also benefit from the above
optimisation.

Thoughts/Comments?

Regards,
Atri

Sent from my iPad


Re: Randomisation for ensuring nlogn complexity in quicksort

From
jasmine
Date:
My PostgreSQL (9.2) is crashing after certain load tests. Currently,
PostgreSQL crashes when 800 to 1000 threads are run simultaneously
against a schema with 10 million records. Not sure if we have to tweak
some more parameters of Postgres.



-----
jasmine



Re: Randomisation for ensuring nlogn complexity in quicksort

From
David Fetter
Date:
On Mon, Jul 01, 2013 at 04:42:04AM -0700, jasmine wrote:
> My PostgreSQL (9.2) is crashing after certain load tests. Currently,
> PostgreSQL crashes when 800 to 1000 threads are run simultaneously
> against a schema with 10 million records. Not sure if we have to tweak
> some more parameters of Postgres.

Jasmine,

Please start your own thread rather than replying to some unrelated
thread.

In the particular case above, please also provide reproducible test
cases including the exact version(s) of PostgreSQL to which they apply
(9.2.4, e.g.) along with output from same so other people can see what
you're talking about and actually help.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Atri Sharma
Date:
On Mon, Jul 1, 2013 at 10:02 PM, David Fetter <david@fetter.org> wrote:
> On Mon, Jul 01, 2013 at 04:42:04AM -0700, jasmine wrote:
>> My PostgreSQL (9.2) is crashing after certain load tests. Currently,
>> PostgreSQL crashes when 800 to 1000 threads are run simultaneously
>> against a schema with 10 million records. Not sure if we have to tweak
>> some more parameters of Postgres.
>
> Jasmine,
>
> Please start your own thread rather than replying to some unrelated
> thread.
>
> In the particular case above, please also provide reproducible test
> cases including the exact version(s) of PostgreSQL to which they apply
> (9.2.4, e.g.) along with output from same so other people can see what
> you're talking about and actually help.

Not sure if this belongs on -general instead of -hackers.

Regards,

Atri

--
Regards,

Atri
l'apprenant



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Robert Haas
Date:
On Sun, Jun 30, 2013 at 8:30 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
> I have been reading the recent discussion and was researching a bit,
> and I think that we should really go with the idea of randomising the
> input data (if it is not completely presorted), to ensure that we do
> not get quadratic complexity.

That doesn't ensure any such thing.  It just makes it less likely.
But we have existing guards that also make that unlikely, so I'm not
sure what we'd be gaining.

> One easy way to do that could be to take a sample of the data set and
> take a pivot out of it. A still better way could be to take multiple
> samples spread over the data set, select a value from each of them,
> and then take a cumulative pivot (the median, maybe).

We pretty much do that already.
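For the archives: what we do is essentially Bentley & McIlroy's scheme
- median of three for mid-sized inputs, and a "ninther" (the median of
three medians of three) sampled from across the input for large ones.
A simplified sketch of the shape of it (int keys; this is not the
actual qsort_arg code):

#include <stddef.h>

static int
med3(int a, int b, int c)
{
    if (a < b)
        return b < c ? b : (a < c ? c : a);
    return b > c ? b : (a > c ? c : a);
}

static int
choose_pivot(const int *v, size_t n)
{
    size_t mid = n / 2;

    if (n > 40)
    {
        /* "Ninther": median of the medians of three spread samples. */
        size_t d = n / 8;
        int lo = med3(v[0], v[d], v[2 * d]);
        int m  = med3(v[mid - d], v[mid], v[mid + d]);
        int hi = med3(v[n - 1 - 2 * d], v[n - 1 - d], v[n - 1]);

        return med3(lo, m, hi);
    }
    if (n > 7)
        return med3(v[0], v[mid], v[n - 1]);
    return v[mid];
}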

> This shouldn't be too complex, and should give us a fixed n log n
> complexity even for wild data sets, without affecting the existing
> normal data sets that are present in everyday transactions. I even
> believe that those data sets will also benefit from the above
> optimisation.

The only method of selecting a pivot for quicksort that obtains
O(n lg n) run time with 100% certainty is to have a magical oracle
inside the computer that tells you, in fixed time and with perfect
accuracy, which pivot you should select.

If you want to get a useful response to your emails, consider
including a statement of what you think the problem is and why you
think your proposed changes will help.  Consider offering a test case
that performs badly and an analysis of the reason why.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Claudio Freire
Date:
On Mon, Jul 1, 2013 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This shouldn't be too complex, and should give us a fixed n log n
>> complexity even for wild data sets, without affecting the existing
>> normal data sets that are present in everyday transactions. I even
>> believe that those data sets will also benefit from the above
>> optimisation.
>
> The only method of selecting a pivot for quicksort that obtains
> O(n lg n) run time with 100% certainty is to have a magical oracle
> inside the computer that tells you, in fixed time and with perfect
> accuracy, which pivot you should select.


Doesn't a linear median algorithm (say median of medians) get you O(n lg n)?

Granted, with a huge constant (I think 4)... but it should still be O(n lg n).



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Robert Haas
Date:
On Mon, Jul 1, 2013 at 3:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Jul 1, 2013 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> This shouldn't be too complex, and should give us a fixed n log n
>>> complexity even for wild data sets, without affecting the existing
>>> normal data sets that are present in everyday transactions. I even
>>> believe that those data sets will also benefit from the above
>>> optimisation.
>>
>> The only method of selecting a pivot for quicksort that obtains
>> O(n lg n) run time with 100% certainty is to have a magical oracle
>> inside the computer that tells you, in fixed time and with perfect
>> accuracy, which pivot you should select.
>
> Doesn't a linear median algorithm (say median of medians) get you O(n lg n)?
>
> Granted, with a huge constant (I think 4)... but it should still be O(n lg n).

No.  Thinking about this a little more, I believe the way it works
out is that any algorithm for picking the median that guarantees that
a certain *percentage* of the tuples will be in the smaller partition
will have O(n lg n) complexity, but any algorithm that only guarantees
that a fixed *number* of tuples will be in the smaller partition is
still quadratic in complexity.  In the case of a median algorithm,
you're only guaranteed to have 2 elements in the smaller partition,
which is a constant.  If you take a median of medians, you're
guaranteed to have 8 elements in the smaller partition, which is
bigger, but still a constant.
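In recurrence terms: a guaranteed *fraction* gives
T(n) = T(a*n) + T((1-a)*n) + O(n), which solves to O(n lg n) for any
constant 0 < a < 1/2, while a guaranteed *constant* c gives
T(n) = T(c) + T(n-c) + O(n), which solves to O(n^2).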

The reason this doesn't matter much in practice is that the data
distribution that causes quadratic behavior for median-of-medians is
not one which is likely to occur in the real world, and will probably
only come up if chosen by an adversary who is attempting to make your
life miserable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Claudio Freire
Date:
On Mon, Jul 1, 2013 at 5:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 1, 2013 at 3:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Mon, Jul 1, 2013 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> This shouldn't be too complex, and should give us a fixed n log n
>>>> complexity even for wild data sets, without affecting the existing
>>>> normal data sets that are present in everyday transactions. I even
>>>> believe that those data sets will also benefit from the above
>>>> optimisation.
>>>
>>> The only method of selecting a pivot for quicksort that obtains
>>> O(n lg n) run time with 100% certainty is to have a magical oracle
>>> inside the computer that tells you, in fixed time and with perfect
>>> accuracy, which pivot you should select.
>>
>> Doesn't a linear median algorithm (say median of medians) get you O(n lg n)?
>>
>> Granted, with a huge constant (I think 4)... but it should still be O(n lg n).
>
> No.  Thinking about this a little more, I believe the way it works
> out is that any algorithm for picking the median that guarantees that
> a certain *percentage* of the tuples will be in the smaller partition
> will have O(n lg n) complexity, but any algorithm that only guarantees
> that a fixed *number* of tuples will be in the smaller partition is
> still quadratic in complexity.  In the case of a median algorithm,
> you're only guaranteed to have 2 elements in the smaller partition,
> which is a constant.  If you take a median of medians, you're
> guaranteed to have 8 elements in the smaller partition, which is
> bigger, but still a constant.


What?

A median of medians algorithm will guarantee floor(N/2) elements on
the smaller side. That's the definition of the median.

Note that I'm referring to picking the actual median of all tuples,
not just a sample. That's slow, but it guarantees O(n log n).
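Something along these lines (a sketch of the classic BFPRT selection,
int keys only, illustrative and not tuned at all):

/* BFPRT median-of-medians selection: returns the k-th smallest
 * element (0-based) in worst-case linear time. */
#include <stddef.h>

static void
swap_int(int *a, int *b)
{
    int t = *a; *a = *b; *b = t;
}

/* Insertion-sort a group of at most 5 and return its median. */
static int
median5(int *v, size_t n)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = i; j > 0 && v[j - 1] > v[j]; j--)
            swap_int(&v[j - 1], &v[j]);
    return v[n / 2];
}

static int
select_kth(int *v, size_t n, size_t k)
{
    while (n > 5)
    {
        /* Move the median of each group of 5 to the array front. */
        size_t ngroups = 0;
        for (size_t i = 0; i < n; i += 5)
        {
            size_t len = (n - i < 5) ? n - i : 5;

            median5(v + i, len);
            swap_int(&v[i + len / 2], &v[ngroups++]);
        }

        /* The pivot is the median of the group medians, found by
         * recursing on the packed prefix. */
        int pivot = select_kth(v, ngroups, ngroups / 2);

        /* Three-way partition: [0,lt) < pivot, [lt,eq) == pivot. */
        size_t lt = 0, eq;
        for (size_t i = 0; i < n; i++)
            if (v[i] < pivot)
                swap_int(&v[i], &v[lt++]);
        eq = lt;
        for (size_t i = lt; i < n; i++)
            if (v[i] == pivot)
                swap_int(&v[i], &v[eq++]);

        if (k < lt)
            n = lt;                        /* k-th is left of pivot  */
        else if (k < eq)
            return pivot;                  /* k-th is the pivot      */
        else
        {
            v += eq; n -= eq; k -= eq;     /* k-th is right of pivot */
        }
    }
    median5(v, n);                         /* fully sorts n <= 5 */
    return v[k];
}

select_kth(a, n, n / 2) is the exact median, which is what gives the
guaranteed floor(N/2) split.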



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Peter Geoghegan
Date:
On Mon, Jul 1, 2013 at 12:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I have been reading the recent discussion and was researching a bit,
>> and I think that we should really go with the idea of randomising the
>> input data (if it is not completely presorted), to ensure that we do
>> not get quadratic complexity.
>
> That doesn't ensure any such thing.  It just makes it less likely.
> But we have existing guards that also make that unlikely, so I'm not
> sure what we'd be gaining.

+1

-- 
Peter Geoghegan



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Robert Haas
Date:
On Mon, Jul 1, 2013 at 4:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> What?
>
> A median of medians algorithm will guarantee floor(N/2) elements on
> the smaller side. That's the definition of the median.
>
> Note that I'm referring to picking the actual median of all tuples,
> not just a sample. That's slow, but it guarantees O(n log n).

Ah, OK.  Well, yes, that would guarantee O(n lg n).  But, as you say,
it would be slow.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Atri Sharma
Date:
On Tue, Jul 2, 2013 at 1:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jun 30, 2013 at 8:30 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>> I have been reading the recent discussion and was researching a bit,
>> and I think that we should really go with the idea of randomising the
>> input data (if it is not completely presorted), to ensure that we do
>> not get quadratic complexity.
>
> That doesn't ensure any such thing.  It just makes it less likely.
> But we have existing guards that also make that unlikely, so I'm not
> sure what we'd be gaining.
>
>> One easy way to do that could be to take a sample of the data set and
>> take a pivot out of it. A still better way could be to take multiple
>> samples spread over the data set, select a value from each of them,
>> and then take a cumulative pivot (the median, maybe).
>
> We pretty much do that already.
>
>> This shouldn't be too complex, and should give us a fixed n log n
>> complexity even for wild data sets, without affecting the existing
>> normal data sets that are present in everyday transactions. I even
>> believe that those data sets will also benefit from the above
>> optimisation.
>
> The only method of selecting a pivot for quicksort that obtains
> O(n lg n) run time with 100% certainty is to have a magical oracle
> inside the computer that tells you, in fixed time and with perfect
> accuracy, which pivot you should select.
>
> If you want to get a useful response to your emails, consider
> including a statement of what you think the problem is and why you
> think your proposed changes will help.  Consider offering a test case
> that performs badly and an analysis of the reason why.

Right, thanks for that. I will keep that in mind.

I was thinking about *mostly sorted* datasets; consider the following:

10 11 12 4 5 6 1 2

(Just off the top of my head; sorry if I missed something.)

Now, the above data set is made up of a number of rotations of a
sorted dataset, so it is mostly sorted, albeit with some disordering.
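(If it helps, here is a trivial way to generate larger inputs of this
shape for testing; plain C sketch, nothing pg-specific:)

/* Build a "mostly sorted" input by rotating a sorted sequence,
 * for feeding to whatever sort we want to time. */
#include <stdlib.h>

static int *
make_rotated(size_t n, size_t rot)
{
    int *v = malloc(n * sizeof(int));

    if (v == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        v[i] = (int) ((i + rot) % n);   /* sorted run, rotated */
    return v;
}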

My point is that these kinds of datasets (not necessarily the above
one) can lead to a bad choice of pivot, and hence give us a complexity
that is worse than n log n.

I know we have a check for presorted inputs, but I wasn't sure how we
deal with mostly sorted inputs, since quicksort likes disorder in its
input.

I agree with Claudio's idea. One thing to keep in mind is that we
don't use quicksort for large data sets anyway, and switch to external
merge sort instead. So we could think of using the median of medians
algorithm for this purpose.

Another thing I would like to investigate is the performance of our
quicksort implementation (and maybe of external merge sort as well) on
multiword keys.

Regards,

Atri


--
Regards,

Atri
l'apprenant



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Robert Haas
Date:
On Tue, Jul 2, 2013 at 12:33 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>> If you want to get a useful response to your emails, consider
>> including a statement of what you think the problem is and why you
>> think your proposed changes will help.  Consider offering a test case
>> that performs badly and an analysis of the reason why.
>
> Right, thanks for that. I will keep that in mind.
>
> I was thinking about *mostly sorted* datasets; consider the following:
>
> 10 11 12 4 5 6 1 2

I think if you'll try it you'll find that we perform quite well on
data sets of this kind - and if you read the code you'll see why.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Atri Sharma
Date:
On Tue, Jul 2, 2013 at 5:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 2, 2013 at 12:33 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>>> If you want to get a useful response to your emails, consider
>>> including a statement of what you think the problem is and why you
>>> think your proposed changes will help.  Consider offering a test case
>>> that performs badly and an analysis of the reason why.
>>
>> Right, thanks for that. I will keep that in mind.
>>
>> I was thinking about *mostly sorted* datasets; consider the following:
>>
>> 10 11 12 4 5 6 1 2
>
> I think if you'll try it you'll find that we perform quite well on
> data sets of this kind - and if you read the code you'll see why.

Right, let me read the code again from that viewpoint.

Thanks a ton for your help!

Regards,

Atri


--
Regards,

Atri
l'apprenant



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Peter Geoghegan
Date:
On Tue, Jul 2, 2013 at 5:04 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>> I think if you'll try it you'll find that we perform quite well on
>> data sets of this kind - and if you read the code you'll see why.
>
> Right, let me read the code again from that viewpoint.

In my opinion, it would be worthwhile reading the original Bentley and
McIlroy paper [1] and using what you learned to write a patch that
adds comments throughout the canonical qsort_arg, and perhaps the
other variants.

[1] http://www.enseignement.polytechnique.fr/informatique/profs/Luc.Maranget/421/09/bentley93engineering.pdf
-- 
Peter Geoghegan



Re: Randomisation for ensuring nlogn complexity in quicksort

From
Claudio Freire
Date:
On Tue, Jul 2, 2013 at 12:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jul 2, 2013 at 5:04 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>>> I think if you'll try it you'll find that we perform quite well on
>>> data sets of this kind - and if you read the code you'll see why.
>>
>> Right, let me read the code again from that viewpoint.
>
> In my opinion, it would be worthwhile reading the original Bentley and
> McIlroy paper [1] and using what you learned to write a patch that
> adds comments throughout the canonical qsort_arg, and perhaps the
> other variants.
>
> [1] http://www.enseignement.polytechnique.fr/informatique/profs/Luc.Maranget/421/09/bentley93engineering.pdf

That's weird; it doesn't seem as sophisticated as even libc's introsort.

Perhaps an introsort[1] approach wouldn't hurt: do the quick-and-dirty
median selection pg is already doing (or a better one, if a better one
is found), but track the ratio of recursion depth to input size.

When, across K recursive calls, the input set hasn't been halved in
size, switch to median of medians to guard against quadratic
complexity. A rough sketch follows the reference below.

[1] http://en.wikipedia.org/wiki/Selection_algorithm#Introselect
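Roughly this shape - a sketch only, using an overall depth budget
instead of the per-K-calls test, and borrowing choose_pivot(),
select_kth() and swap_int() from the sketches upthread (all the names
are made up; this is not qsort_arg):

/* Quicksort driven by the cheap pivot heuristic, switching to an
 * exact median-of-medians pivot once the depth budget (set by the
 * caller to about 2 * log2(n)) runs out. */
#include <stddef.h>

static void
guarded_qsort(int *v, size_t n, int depth_budget)
{
    while (n > 1)
    {
        int pivot = (depth_budget-- > 0)
            ? choose_pivot(v, n)         /* cheap heuristic */
            : select_kth(v, n, n / 2);   /* exact median    */

        /* Three-way partition: [0,lt) < pivot, [lt,eq) == pivot. */
        size_t lt = 0, eq;
        for (size_t i = 0; i < n; i++)
            if (v[i] < pivot)
                swap_int(&v[i], &v[lt++]);
        eq = lt;
        for (size_t i = lt; i < n; i++)
            if (v[i] == pivot)
                swap_int(&v[i], &v[eq++]);

        /* Recurse into the smaller side, iterate on the larger, so
         * the stack stays O(log n); the == pivot run is already in
         * its final place, so we always make progress. */
        if (lt < n - eq)
        {
            guarded_qsort(v, lt, depth_budget);
            v += eq;
            n -= eq;
        }
        else
        {
            guarded_qsort(v + eq, n - eq, depth_budget);
            n = lt;
        }
    }
}

Until the budget runs out this behaves exactly like the plain
heuristic quicksort; only a pathological input ever pays for the
exact median.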