Thread: Re: [HACKERS] qsort again (was Re: Strange Create Index behaviour)
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- > owner@postgresql.org] On Behalf Of Tom Lane > Sent: Wednesday, February 15, 2006 5:22 PM > To: Ron > Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index > behaviour) > > Ron <rjpeace@earthlink.net> writes: > > How are we choosing our pivots? > > See qsort.c: it looks like median of nine equally spaced inputs (ie, > the 1/8th points of the initial input array, plus the end points), > implemented as two rounds of median-of-three choices. With half of the > data inputs zero, it's not too improbable for two out of the three > samples to be zeroes in which case I think the med3 result will be zero > --- so choosing a pivot of zero is much more probable than one would > like, and doing so in many levels of recursion causes the problem. Adding some randomness to the selection of the pivot is a known technique to fix the oddball partitions problem. However, Bentley and Sedgewick proved that every quick sort algorithm has some input set that makes it go quadratic (hence the recent popularity of introspective sort, which switches to heapsort if quadratic behavior is detected. The C++ template I submitted was an example of introspective sort, but PostgreSQL does not use C++ so it was not helpful). > I think. I'm not too sure if the code isn't just being sloppy about the > case where many data values are equal to the pivot --- there's a special > case there to switch to insertion sort, and maybe that's getting invoked > too soon. Here are some cases known to make qsort go quadratic: 1. Data already sorted 2. Data reverse sorted 3. Data organ-pipe sorted or ramp 4. Almost all data of the same value There are probably other cases. Randomizing the pivot helps some, as does check for in-order or reverse order partitions. Imagine if 1/3 of the partitions fall into a category that causes quadratic behavior (have one of the above formats and have more than CUTOFF elements in them). It is doubtful that the switch to insertion sort is causing any sort of problems. It is only going to be invoked on tiny sets, for which it has a fixed cost that is probably less that qsort() function calls on sets of the same size. >It'd be useful to get a line-level profile of the behavior of > this code in the slow cases... I guess that my in-order or presorted tests [which often arise when there are very few distinct values] may solve the bad partition problems. Don't forget that the algorithm is called recursively. > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq
At 08:37 PM 2/15/2006, Dann Corbit wrote: >Adding some randomness to the selection of the pivot is a known >technique to fix the oddball partitions problem. True, but it makes QuickSort slower than say MergeSort because of the expense of the PRNG being called ~O(lgN) times during a sort. >However, Bentley and Sedgewick proved that every quick sort >algorithm has some input set that makes it go quadratic Yep. OTOH, that input set can be so specific and so unusual as to require astronomically unlikely bad luck or hostile hacking in order for it to actually occur. > (hence the recent popularity of introspective sort, which switches > to heapsort if quadratic behavior is detected. The C++ template I > submitted was an example of introspective sort, but PostgreSQL does > not use C++ so it was not helpful). ...and there are other QuickSort+Other hybrids that address the issue as well. MergeSort, RadixExchangeSort, and BucketSort all come to mind. See Gonnet and Baeza-Yates, etc. >Here are some cases known to make qsort go quadratic: >1. Data already sorted Only if one element is used to choose the pivot; _and_ only if the pivot is the first or last element of each pass. Even just always using the middle element as the pivot avoids this problem. See Sedgewick or Knuth. >2. Data reverse sorted Ditto above. >3. Data organ-pipe sorted or ramp Not sure what this means? Regardless, median of n partitioning that includes samples from each of the 1st 1/3, 2nd 1/3, and final 3rd of the data is usually enough to guarantee O(NlgN) behavior unless the _specific_ distribution known to be pessimal to that sampling algorithm is encountered. The only times I've ever seen it ITRW was as a result of hostile activity: purposely arranging the data in such a manner is essentially a DoS attack. >4. Almost all data of the same value Well known fixes to inner loop available to avoid this problem. >There are probably other cases. Randomizing the pivot helps some, >as does check for in-order or reverse order partitions. Randomizing the choice of pivot essentially guarantees O(NlgN) behavior no matter what the distribution of the data at the price of increasing the cost of each pass by a constant factor (the generation of a random number or numbers). In sum, QuickSort gets all sorts of bad press that is far more FUD than fact ITRW. Ron.
Added to TODO: * Improve port/qsort() to handle sorts with 50% unique and 50% duplicate value [qsort] This involves choosing better pivot points for the quicksort. --------------------------------------------------------------------------- Dann Corbit wrote: > > > > -----Original Message----- > > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- > > owner@postgresql.org] On Behalf Of Tom Lane > > Sent: Wednesday, February 15, 2006 5:22 PM > > To: Ron > > Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org > > Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create > Index > > behaviour) > > > > Ron <rjpeace@earthlink.net> writes: > > > How are we choosing our pivots? > > > > See qsort.c: it looks like median of nine equally spaced inputs (ie, > > the 1/8th points of the initial input array, plus the end points), > > implemented as two rounds of median-of-three choices. With half of > the > > data inputs zero, it's not too improbable for two out of the three > > samples to be zeroes in which case I think the med3 result will be > zero > > --- so choosing a pivot of zero is much more probable than one would > > like, and doing so in many levels of recursion causes the problem. > > Adding some randomness to the selection of the pivot is a known > technique to fix the oddball partitions problem. However, Bentley and > Sedgewick proved that every quick sort algorithm has some input set that > makes it go quadratic (hence the recent popularity of introspective > sort, which switches to heapsort if quadratic behavior is detected. The > C++ template I submitted was an example of introspective sort, but > PostgreSQL does not use C++ so it was not helpful). > > > I think. I'm not too sure if the code isn't just being sloppy about > the > > case where many data values are equal to the pivot --- there's a > special > > case there to switch to insertion sort, and maybe that's getting > invoked > > too soon. > > Here are some cases known to make qsort go quadratic: > 1. Data already sorted > 2. Data reverse sorted > 3. Data organ-pipe sorted or ramp > 4. Almost all data of the same value > > There are probably other cases. Randomizing the pivot helps some, as > does check for in-order or reverse order partitions. > > Imagine if 1/3 of the partitions fall into a category that causes > quadratic behavior (have one of the above formats and have more than > CUTOFF elements in them). > > It is doubtful that the switch to insertion sort is causing any sort of > problems. It is only going to be invoked on tiny sets, for which it has > a fixed cost that is probably less that qsort() function calls on sets > of the same size. > > >It'd be useful to get a line-level profile of the behavior of > > this code in the slow cases... > > I guess that my in-order or presorted tests [which often arise when > there are very few distinct values] may solve the bad partition > problems. Don't forget that the algorithm is called recursively. > > > regards, tom lane > > > > ---------------------------(end of > broadcast)--------------------------- > > TIP 3: Have you checked our extensive FAQ? > > > > http://www.postgresql.org/docs/faq > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. +