Re: A worst case for qsort - Mailing list pgsql-hackers

From John Cochran
Subject Re: A worst case for qsort
Date
Msg-id CAGQU3n5YjS-iayG6Ah5_sBV5nhtw+zxptDNrdq8YZgP=7f5fiw@mail.gmail.com
Whole thread Raw
In response to Re: A worst case for qsort  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Aug 7, 2014 at 11:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 5, 2014 at 8:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
> "The adversarial method works for almost any polymorphic program
> recognizable as quicksort. The subject quicksort may copy values at
> will, or work with lists rather than arrays. It may even pick the
> pivot at random. The quicksort will be vulnerable provided only that
> it satisfies some mild assumptions that are met by every
> implementation I have seen".
>
> IMHO, while worst case performance is a very useful tool for analyzing
> algorithms (particularly their worst case time complexity), a worst
> case should be put in its practical context. For example, if we had
> reason to be concerned about *adversarial* inputs, I think that there
> is a good chance that our qsort() actually would be problematic to the
> point of driving us to prefer some generally slower alternative.

If one is concerned about adversarial inputs, then as mentioned earlier, two main steps need to be taken.
1. Don't permit potential adversaries define the comparison function used by qsort().
2. Perform random selection of pivot to prevent precomputed lists of data that invoke quadratic behavior in qsort.

And before the argument that a random pivot will be less efficient than the median of median that's currently being used, you need to look at what is currently being used.
1. If a "small" partition is being sorted, the element at n/2 is selected as the pivot with n being the size of the partition.
2. If a "medium" partition is being sorted, then pivot selected is the median of the elements at 0, n/2, and n.
3. And finally, if a "large" partition is being sorted, potential pivots are selected from 0, n/8, n/4, 3n/8, n/2, 5n/8, 3n/4, 7n/8, and n.  Of those 9 candidates, the median of medians is selected as the pivot.

There is nothing in the world that would prevent the following.
1. Short partition - select a pivot at random.
2. Medium partition - select the median of 3 randomly selected candidates.
3. Large partition - select the median of medians of 9 randomly selected candidates.
 
Using the above random pivot selection methods, the performance of a hardened qsort should be close to that of an unhardened qsort. The only real difference is the cost of generating random numbers to select the candidates for the median tests. However, if an malicious compare function is permitted to be defined, it would still be vulnerable to that attack, but it would be pretty much immune to a precomputed data attack.

Or perhaps it just might be simpler to detect an attack and fall back to a sort algorithm that doesn't have quadratic behavior.

What I mean by that is the qsort in the "Engineering a Sort Function" has a big O of approximately 1.1 n ln n comparisons. If upon starting qsort, it computes an upper bound of comparisons it's willing to perform. Say 3 n ln n or so (1.386 n log n is the estimate for a randomly selected pivot). Then if the upper bound is reached, the qsort function backs out leaving the array in a partially sorted order, and then calls a slower sort that doesn't exhibit quadratic behavior to actually finish the sort.

The result of doing that would be most, if not all, sorts being handled by qsort in a timely fashion. But if bad luck or an adversary strikes, the qsort bails out after things look bad and passes the task to a different sort. I'll be the first to admit that the hybrid approach would be slower in those bad cases than it would be if the alternate sort was selected at the beginning, but it would be faster than letting qsort continue with quadratic behavior. 

--

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. — C.A.R. Hoare

pgsql-hackers by date:

Previous
From: Abhijit Menon-Sen
Date:
Subject: Re: replication commands and log_statements
Next
From: Peter Geoghegan
Date:
Subject: Re: A worst case for qsort