Thread: qsort again (was Re: [PERFORM] Strange Create Index behaviour)
Gary Doades <gpd@gpdnet.co.uk> writes: > If I run the script again, it is not always the first case that is slow, > it varies from run to run, which is why I repeated it quite a few times > for the test. For some reason I hadn't immediately twigged to the fact that your test script is just N repetitions of the exact same structure with random data. So it's not so surprising that you get random variations in behavior with different test data sets. I did some experimentation comparing the qsort from Fedora Core 4 (glibc-2.3.5-10.3) with our src/port/qsort.c. For those who weren't following the pgsql-performance thread, the test case is just this repeated a lot of times: create table atest(i int4, r int4); insert into atest (i,r) select generate_series(1,100000), 0; insert into atest (i,r) select generate_series(1,100000), random()*100000; \timing create index idx on atest(r); \timing drop table atest; I did this 100 times and sorted the reported runtimes. (Investigation with trace_sort = on confirms that the runtime is almost entirely spent in qsort() called from our performsort --- the Postgres overhead is about 100msec on this machine.) Results are below. It seems clear that our qsort.c is doing a pretty awful job of picking qsort pivots, while glibc is mostly managing not to make that mistake. I haven't looked at the glibc code yet to see what they are doing differently. I'd say this puts a considerable damper on my enthusiasm for using our qsort all the time, as was recently debated in this thread: http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php We need to fix our qsort.c before pushing ahead with that idea. regards, tom lane 100 runtimes for glibc qsort, sorted ascending: Time: 459.860 ms Time: 460.209 ms Time: 460.704 ms Time: 461.317 ms Time: 461.538 ms Time: 461.652 ms Time: 461.988 ms Time: 462.573 ms Time: 462.638 ms Time: 462.716 ms Time: 462.917 ms Time: 463.219 ms Time: 463.455 ms Time: 463.650 ms Time: 463.723 ms Time: 463.737 ms Time: 463.750 ms Time: 463.852 ms Time: 463.964 ms Time: 463.988 ms Time: 464.003 ms Time: 464.135 ms Time: 464.372 ms Time: 464.458 ms Time: 464.496 ms Time: 464.551 ms Time: 464.599 ms Time: 464.655 ms Time: 464.656 ms Time: 464.722 ms Time: 464.814 ms Time: 464.827 ms Time: 464.878 ms Time: 464.899 ms Time: 464.905 ms Time: 464.987 ms Time: 465.055 ms Time: 465.138 ms Time: 465.159 ms Time: 465.194 ms Time: 465.310 ms Time: 465.316 ms Time: 465.375 ms Time: 465.450 ms Time: 465.535 ms Time: 465.595 ms Time: 465.680 ms Time: 465.769 ms Time: 465.865 ms Time: 465.892 ms Time: 465.903 ms Time: 466.003 ms Time: 466.154 ms Time: 466.164 ms Time: 466.203 ms Time: 466.305 ms Time: 466.344 ms Time: 466.364 ms Time: 466.388 ms Time: 466.502 ms Time: 466.593 ms Time: 466.725 ms Time: 466.794 ms Time: 466.798 ms Time: 466.904 ms Time: 466.971 ms Time: 466.997 ms Time: 467.122 ms Time: 467.146 ms Time: 467.221 ms Time: 467.224 ms Time: 467.244 ms Time: 467.277 ms Time: 467.587 ms Time: 468.142 ms Time: 468.207 ms Time: 468.237 ms Time: 468.471 ms Time: 468.663 ms Time: 468.700 ms Time: 469.235 ms Time: 469.840 ms Time: 470.472 ms Time: 471.140 ms Time: 472.811 ms Time: 472.959 ms Time: 474.858 ms Time: 477.210 ms Time: 479.571 ms Time: 479.671 ms Time: 482.797 ms Time: 488.852 ms Time: 514.639 ms Time: 529.287 ms Time: 612.185 ms Time: 660.748 ms Time: 742.227 ms Time: 866.814 ms Time: 1234.848 ms Time: 1267.398 ms 100 runtimes for port/qsort.c, sorted ascending: Time: 418.905 ms Time: 420.611 ms Time: 420.764 ms Time: 420.904 ms Time: 421.706 ms Time: 422.466 ms Time: 422.627 ms Time: 423.189 ms Time: 423.302 ms Time: 425.096 ms Time: 425.731 ms Time: 425.851 ms Time: 427.253 ms Time: 430.113 ms Time: 432.756 ms Time: 432.963 ms Time: 440.502 ms Time: 440.640 ms Time: 450.452 ms Time: 458.143 ms Time: 459.212 ms Time: 467.706 ms Time: 468.006 ms Time: 468.574 ms Time: 470.003 ms Time: 472.313 ms Time: 483.622 ms Time: 492.395 ms Time: 509.564 ms Time: 531.037 ms Time: 533.366 ms Time: 535.610 ms Time: 575.523 ms Time: 582.688 ms Time: 593.545 ms Time: 647.364 ms Time: 660.612 ms Time: 677.312 ms Time: 680.288 ms Time: 697.626 ms Time: 833.066 ms Time: 834.511 ms Time: 851.819 ms Time: 920.443 ms Time: 926.731 ms Time: 954.289 ms Time: 1045.214 ms Time: 1059.200 ms Time: 1062.328 ms Time: 1136.018 ms Time: 1260.091 ms Time: 1276.883 ms Time: 1319.351 ms Time: 1438.854 ms Time: 1475.457 ms Time: 1538.211 ms Time: 1549.004 ms Time: 1744.642 ms Time: 1771.258 ms Time: 1959.530 ms Time: 2300.140 ms Time: 2589.641 ms Time: 2612.780 ms Time: 3100.024 ms Time: 3284.125 ms Time: 3379.792 ms Time: 3750.278 ms Time: 4302.278 ms Time: 4780.624 ms Time: 5000.056 ms Time: 5092.604 ms Time: 5168.722 ms Time: 5292.941 ms Time: 5895.964 ms Time: 7003.164 ms Time: 7099.449 ms Time: 7115.083 ms Time: 7384.940 ms Time: 8214.010 ms Time: 8700.771 ms Time: 9331.225 ms Time: 10503.360 ms Time: 12496.026 ms Time: 12982.474 ms Time: 15192.390 ms Time: 15392.161 ms Time: 15958.295 ms Time: 18375.693 ms Time: 18617.706 ms Time: 18927.515 ms Time: 19898.018 ms Time: 20865.979 ms Time: 21000.907 ms Time: 21297.585 ms Time: 21714.518 ms Time: 25423.235 ms Time: 27543.052 ms Time: 28314.182 ms Time: 29400.278 ms Time: 34142.534 ms
Tom Lane wrote: > For some reason I hadn't immediately twigged to the fact that your test > script is just N repetitions of the exact same structure with random data. > So it's not so surprising that you get random variations in behavior > with different test data sets. > > It seems clear that our qsort.c is doing a pretty awful job of picking > qsort pivots, while glibc is mostly managing not to make that mistake. > I haven't looked at the glibc code yet to see what they are doing > differently. > > I'd say this puts a considerable damper on my enthusiasm for using our > qsort all the time, as was recently debated in this thread: > http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php > We need to fix our qsort.c before pushing ahead with that idea. [snip] > Time: 28314.182 ms > Time: 29400.278 ms > Time: 34142.534 ms Ouch! That confirms my problem. I generated the random test case because it was easier than including the dump of my tables, but you can appreciate that tables 20 times the size are basically crippled when it comes to creating an index on them. Examining the dump and the associated times during restore it looks like I have 7 tables with this approximate distribution, thus the ridiculously long restore time. Better not re-index soon! Is this likely to hit me in a random fashion during normal operation, joins, sorts, order by for example? So the options are: 1) Fix the included qsort.c code and use that 2) Get FreeBSD to fix their qsort code 3) Both I guess that 1 is the real solution in case anyone else's qsort is broken in the same way. Then at least you *could* use it all the time :) Regards, Gary.
Gary Doades <gpd@gpdnet.co.uk> writes: > Is this likely to hit me in a random fashion during normal operation, > joins, sorts, order by for example? Yup, anytime you're passing data with that kind of distribution through a sort. > So the options are: > 1) Fix the included qsort.c code and use that > 2) Get FreeBSD to fix their qsort code > 3) Both > I guess that 1 is the real solution in case anyone else's qsort is > broken in the same way. Then at least you *could* use it all the time :) It's reasonable to assume that most of the *BSDen have basically the same qsort code. Ours claims to have come from NetBSD sources, but I don't doubt that they all trace back to a common ancestor. regards, tom lane
Gary Doades <gpd@gpdnet.co.uk> writes: > Ouch! That confirms my problem. I generated the random test case because > it was easier than including the dump of my tables, but you can > appreciate that tables 20 times the size are basically crippled when it > comes to creating an index on them. Actually... we only use qsort when we have a sorting problem that fits within the allowed sort memory. The external-sort logic doesn't go through that code at all. So all the analysis we just did on your test case doesn't necessarily apply to sort problems that are too large for the sort_mem setting. The test case would be sorting 200000 index entries, which'd probably occupy at least 24 bytes apiece of sort memory, so probably about 5 meg. A problem 20 times that size would definitely not fit in the default 16MB maintenance_work_mem. Were you using a large value of maintenance_work_mem for your restore? regards, tom lane
This behavior is consistent with the pivot choosing algorithm assuming certain distribution(s) for the data. For instance, median-of-three partitioning is known to be pessimal when the data is geometrically or hyper-geometrically distributed. Also, care must be taken that sometimes is not when there are many equal values in the data. Even pseudo random number generator based pivot choosing algorithms are not immune if the PRNG is flawed in some way. How are we choosing our pivots? At 06:28 PM 2/15/2006, Tom Lane wrote: >I did some experimentation comparing the qsort from Fedora Core 4 >(glibc-2.3.5-10.3) with our src/port/qsort.c. For those who weren't >following the pgsql-performance thread, the test case is just this >repeated a lot of times: > >create table atest(i int4, r int4); >insert into atest (i,r) select generate_series(1,100000), 0; >insert into atest (i,r) select generate_series(1,100000), random()*100000; >\timing >create index idx on atest(r); >\timing >drop table atest; > >I did this 100 times and sorted the reported runtimes. (Investigation >with trace_sort = on confirms that the runtime is almost entirely spent >in qsort() called from our performsort --- the Postgres overhead is >about 100msec on this machine.) Results are below. > >It seems clear that our qsort.c is doing a pretty awful job of picking >qsort pivots, while glibc is mostly managing not to make that mistake. >I haven't looked at the glibc code yet to see what they are doing >differently. > >I'd say this puts a considerable damper on my enthusiasm for using our >qsort all the time, as was recently debated in this thread: >http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php >We need to fix our qsort.c before pushing ahead with that idea. > > regards, tom lane > > >100 runtimes for glibc qsort, sorted ascending: > >Time: 459.860 ms ><snip> >Time: 488.852 ms >Time: 514.639 ms >Time: 529.287 ms >Time: 612.185 ms >Time: 660.748 ms >Time: 742.227 ms >Time: 866.814 ms >Time: 1234.848 ms >Time: 1267.398 ms > > >100 runtimes for port/qsort.c, sorted ascending: > >Time: 418.905 ms ><snip> >Time: 20865.979 ms >Time: 21000.907 ms >Time: 21297.585 ms >Time: 21714.518 ms >Time: 25423.235 ms >Time: 27543.052 ms >Time: 28314.182 ms >Time: 29400.278 ms >Time: 34142.534 ms
I wrote: > Gary Doades <gpd@gpdnet.co.uk> writes: >> Ouch! That confirms my problem. I generated the random test case because >> it was easier than including the dump of my tables, but you can >> appreciate that tables 20 times the size are basically crippled when it >> comes to creating an index on them. > Actually... we only use qsort when we have a sorting problem that fits > within the allowed sort memory. The external-sort logic doesn't go > through that code at all. So all the analysis we just did on your test > case doesn't necessarily apply to sort problems that are too large for > the sort_mem setting. I increased the size of the test case by 10x (basically s/100000/1000000/) which is enough to push it into the external-sort regime. I get amazingly stable runtimes now --- I didn't have the patience to run 100 trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec. So this code path is definitely not very sensitive to this data distribution. While these numbers aren't glittering in comparison to the best-case qsort times (~450 msec to sort 10% as much data), they are sure a lot better than the worst-case times. So maybe a workaround for you is to decrease maintenance_work_mem, counterintuitive though that be. (Now, if you *weren't* using maintenance_work_mem of 100MB or more for your problem restore, then I'm not sure I know what's going on...) We still ought to try to fix qsort of course. regards, tom lane
Ron <rjpeace@earthlink.net> writes: > How are we choosing our pivots? See qsort.c: it looks like median of nine equally spaced inputs (ie, the 1/8th points of the initial input array, plus the end points), implemented as two rounds of median-of-three choices. With half of the data inputs zero, it's not too improbable for two out of the three samples to be zeroes in which case I think the med3 result will be zero --- so choosing a pivot of zero is much more probable than one would like, and doing so in many levels of recursion causes the problem. I think. I'm not too sure if the code isn't just being sloppy about the case where many data values are equal to the pivot --- there's a special case there to switch to insertion sort, and maybe that's getting invoked too soon. It'd be useful to get a line-level profile of the behavior of this code in the slow cases... regards, tom lane
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- > owner@postgresql.org] On Behalf Of Tom Lane > Sent: Wednesday, February 15, 2006 5:22 PM > To: Ron > Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index > behaviour) > > Ron <rjpeace@earthlink.net> writes: > > How are we choosing our pivots? > > See qsort.c: it looks like median of nine equally spaced inputs (ie, > the 1/8th points of the initial input array, plus the end points), > implemented as two rounds of median-of-three choices. With half of the > data inputs zero, it's not too improbable for two out of the three > samples to be zeroes in which case I think the med3 result will be zero > --- so choosing a pivot of zero is much more probable than one would > like, and doing so in many levels of recursion causes the problem. Adding some randomness to the selection of the pivot is a known technique to fix the oddball partitions problem. However, Bentley and Sedgewick proved that every quick sort algorithm has some input set that makes it go quadratic (hence the recent popularity of introspective sort, which switches to heapsort if quadratic behavior is detected. The C++ template I submitted was an example of introspective sort, but PostgreSQL does not use C++ so it was not helpful). > I think. I'm not too sure if the code isn't just being sloppy about the > case where many data values are equal to the pivot --- there's a special > case there to switch to insertion sort, and maybe that's getting invoked > too soon. Here are some cases known to make qsort go quadratic: 1. Data already sorted 2. Data reverse sorted 3. Data organ-pipe sorted or ramp 4. Almost all data of the same value There are probably other cases. Randomizing the pivot helps some, as does check for in-order or reverse order partitions. Imagine if 1/3 of the partitions fall into a category that causes quadratic behavior (have one of the above formats and have more than CUTOFF elements in them). It is doubtful that the switch to insertion sort is causing any sort of problems. It is only going to be invoked on tiny sets, for which it has a fixed cost that is probably less that qsort() function calls on sets of the same size. >It'd be useful to get a line-level profile of the behavior of > this code in the slow cases... I guess that my in-order or presorted tests [which often arise when there are very few distinct values] may solve the bad partition problems. Don't forget that the algorithm is called recursively. > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq
Re: qsort again (was Re: [PERFORM] Strange Create Index behaviour)
From
Christopher Kings-Lynne
Date:
> Ouch! That confirms my problem. I generated the random test case because > it was easier than including the dump of my tables, but you can > appreciate that tables 20 times the size are basically crippled when it > comes to creating an index on them. I have to say that I restored a few gigabyte dump on freebsd the other day, and most of the restore time was in index creation - I didn't think too much of it though at the time. FreeBSD 4.x. Chris
On Wed, 2006-02-15 at 19:59 -0500, Tom Lane wrote: > I get > amazingly stable runtimes now --- I didn't have the patience to run 100 > trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec. > So this code path is definitely not very sensitive to this data > distribution. "The worst-case behavior of replacement-selection is very close to its average behavior, while the worst-case behavior of QuickSort is terrible (N2) – a strong argument in favor of replacement-selection. Despite this risk, QuickSort is widely used because, in practice, it has superior performance." p.8, "AlphaSort: A Cache-Sensitive Parallel External Sort", Nyberg et al, VLDB Journal 4(4): 603-627 (1995) I think your other comment about flipping to insertion sort too early (and not returning...) is a plausible cause for the poor pg qsort behaviour, but the overall spread of values seems as expected. Some test results I've seen seem consistent with the view that increasing memory also increases run-time for larger settings of work_mem/maintenance_work_mem. Certainly, as I observed a while back, having a large memory settings doesn't help you at all when you are doing final run merging on the external sort. Whatever we do, we should look at the value high memory settings bring to each phase of a sort separately from the other phases. There is work underway on improving external sorts, so I hear (not me). Plus my WIP on randomAccess requirements. Best Regards, Simon Riggs
On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote: > It seems clear that our qsort.c is doing a pretty awful job of picking > qsort pivots, while glibc is mostly managing not to make that mistake. > I haven't looked at the glibc code yet to see what they are doing > differently. glibc qsort is actually merge sort, so I'm not surprised it avoids this problem. -Neil
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > I did this 100 times and sorted the reported runtimes. > > I'd say this puts a considerable damper on my enthusiasm for using our > qsort all the time, as was recently debated in this thread: > http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php > > 100 runtimes for glibc qsort, sorted ascending: > > Time: 866.814 ms > Time: 1234.848 ms > Time: 1267.398 ms > > 100 runtimes for port/qsort.c, sorted ascending: > > Time: 28314.182 ms > Time: 29400.278 ms > Time: 34142.534 ms > By "did this 100 times" do you mean generate a sequence of at most 200000*100 numbers, and for every 200000 numbers, the first half are all zeros and the other half are uniform random numbers? I tried to confirm it by patching the program mentioned in the link, but seems BSDqsort is still a little bit leading. Regards, Qingqing --- Result sort#./sort [3] [glibc qsort]: nelem(20000000), range(4294901760) distr(halfhalf) ccost(2) : 18887.285000 ms [3] [BSD qsort]: nelem(20000000), range(4294901760) distr(halfhalf) ccost(2) : 18801.018000 ms [3] [qsortG]: nelem(20000000), range(4294901760) distr(halfhalf) ccost(2) : 22997.004000 ms --- Patch to sort.c sort#diff -c sort.c sort1.c *** sort.c Thu Dec 15 12:18:59 2005 --- sort1.c Wed Feb 15 22:21:15 2006 *************** *** 35,43 **** {"BSD qsort", qsortB}, {"qsortG", qsortG} }; ! static const size_t d_nelem[] = {1000, 10000, 100000, 1000000, 5000000}; ! static const size_t d_range[] = {2, 32, 1024, 0xFFFF0000L}; ! static const char *d_distr[] = {"uniform", "gaussian", "95sorted", "95reversed"}; static const size_t d_ccost[] = {2}; /* factor index */ --- 35,43 ---- {"BSD qsort", qsortB}, {"qsortG", qsortG} }; ! static const size_t d_nelem[] = {5000000, 10000000, 20000000}; ! static const size_t d_range[] = {0xFFFF0000L}; ! static const char *d_distr[] = {"halfhalf"}; static const size_t d_ccost[] = {2}; /* factor index */ *************** *** 180,185 **** --- 180,192 ---- swap(karray[i], karray[nelem-i-1]); } } + else if (!strcmp(distr, "halfhalf")) + { + int j; + for (i = 0; i < nelem/200000; i++) + for (j = 0; j < 100000; j++) + karray[i*200000 + j] = 0; + } return array; }
At 08:21 PM 2/15/2006, Tom Lane wrote: >Ron <rjpeace@earthlink.net> writes: > > How are we choosing our pivots? > >See qsort.c: it looks like median of nine equally spaced inputs (ie, >the 1/8th points of the initial input array, plus the end points), >implemented as two rounds of median-of-three choices. OK, this is a bad way to do median-of-n partitioning for a few reasons. See Sedgewick's PhD thesis for details. Basically, if one is using median-of-n partitioning to choose a pivot, one should do it in =one= pass, and n for that pass should be <= the numbers of registers in the CPU. Since the x86 ISA has 8 GPR's, n should be <= 8. 7 for instance. Special purposing the median-of-n code so that the minimal number of comparisons and moves is used to sort the sample and then "partitioning in place" is the best way to do it. In addition, care must be taken to deal with the possibility that many of the keys may be equal. The (pseudo) code looks something like this: qs(a[],L,R){ if((R-L) > SAMPLE_SIZE){ // Not worth using qs for too few elements SortSample(SAMPLE_SIZE,a[],L,R); // Sorts SAMPLE_SIZE= n elements and does median-of-n partitioning for small n // using the minimal number of comparisons and moves. // In the process it ends up partitioning the first n/2 and last n/2 elements // SAMPLE_SIZE is a constant chosen to work best for a given CPU. // #GPRs - 1 is a good initial guess. // For the x86 ISA, #GPRs - 1 = 7. For native x86-64, it's 15. // For most RISC CPUs it's 31 or 63. For Itanium, it's 127 (!) pivot= a[(L+R)>>1]; i= L+(SAMPLE_SIZE>>1); j= R-(SAMPLE_SIZE>>1); for(;;){ while(a[++i] < pivot); while(a[--j] > pivot); if(i >= j) break; if(a[i] > a[j]) swap(a[i],a[j]); } if((i-R) >= (j-L)){qs(a,L,i-1);} else{qs(a,i,R);} else{OofN^2_Sort(a,L,R);} // SelectSort may be better than InsertSort if KeySize in bits << RecordSize in bits } // End of qs Given that the most common CPU ISA in existence has 8 GPRs, SAMPLE_SIZE= 7 is probably optimal: t= (L+R); the set would be {L; t/8; t/4; t/2; 3*t/4; 7*t/8; R;} ==> {L; t>>3; t>>2; t>>1; (3*t)>>2; (7*t)>>3; R} as the locations. Even better (and more easily scaled as the number of GPR's in the CPU changes) is to use the set {L; L+1; L+2; t>>1; R-2; R-1; R} This means that instead of 7 random memory accesses, we have 3; two of which result in a burst access for three elements each. That's much faster; _and_ using a sample of 9, 15, 31, 63, etc (to max of ~GPRs -1) elements is more easily done. It also means that the work we do sorting the sample can be taken advantage of when starting inner loop of quicksort: items L..L+2, t, and R-2..R are already partitioned by SortSample(). Insuring that the minimum number of comparisons and moves is done in SortSample can be down by using a code generator to create a comparison tree that identifies which permutation(s) of n we are dealing with and then moving them into place with the minimal number of moves. SIDE NOTE: IIRC glibc's qsort is actually merge sort. Merge sort performance is insensitive to all inputs, and there are way to optimize it as well. I'll leave the actual coding to someone who knows the pg source better than I do. Ron
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > By "did this 100 times" do you mean generate a sequence of at most > 200000*100 numbers, and for every 200000 numbers, the first half are all > zeros and the other half are uniform random numbers? No, I mean I ran the bit of SQL script I gave 100 separate times. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > "Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > > By "did this 100 times" do you mean generate a sequence of at most > > 200000*100 numbers, and for every 200000 numbers, the first half are all > > zeros and the other half are uniform random numbers? > > No, I mean I ran the bit of SQL script I gave 100 separate times. > I must misunderstand something here -- I can't figure out that why the cost of the same procedure keep climbing? Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote > > I must misunderstand something here -- I can't figure out that why the cost > of the same procedure keep climbing? > Ooops, I mis-intepret the sentence -- you sorted the results ... Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> wrote >> No, I mean I ran the bit of SQL script I gave 100 separate times. > I must misunderstand something here -- I can't figure out that why the cost > of the same procedure keep climbing? No, the run cost varies randomly depending on the random data supplied by the test script. The reason the numbers are increasing is that I sorted them for ease of inspection. regards, tom lane
* Neil Conway: > On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote: >> It seems clear that our qsort.c is doing a pretty awful job of picking >> qsort pivots, while glibc is mostly managing not to make that mistake. >> I haven't looked at the glibc code yet to see what they are doing >> differently. > > glibc qsort is actually merge sort, so I'm not surprised it avoids this > problem. qsort also performs twice as many key comparisons as the theoretical minimum. If key comparison is not very cheap, other schemes (like heapsort, for example) are more attractive.
On Thu, Feb 16, 2006 at 01:10:48PM +0100, Florian Weimer wrote: > * Neil Conway: > > > On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote: > >> It seems clear that our qsort.c is doing a pretty awful job of picking > >> qsort pivots, while glibc is mostly managing not to make that mistake. > >> I haven't looked at the glibc code yet to see what they are doing > >> differently. > > > > glibc qsort is actually merge sort, so I'm not surprised it avoids this > > problem. > > qsort also performs twice as many key comparisons as the theoretical > minimum. If key comparison is not very cheap, other schemes (like > heapsort, for example) are more attractive. Last time around there were a number of different algorithms tested. Did anyone run those tests while getting it to count the number of actual comparisons (which could easily swamp the time taken to do the actual sort in some cases)? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Martijn van Oosterhout schrieb: > > Last time around there were a number of different algorithms tested. > Did anyone run those tests while getting it to count the number of > actual comparisons (which could easily swamp the time taken to do the > actual sort in some cases)? > The last time I did such tests is almost 10 years ago. I had used MetroWerks CodeWarrior C/C++, which had Quicksort as algorithm in the Lib C. Anyhow, I tested a few algorithms including merge sort and heapsort. I end up with heapsort because it was the fastest algorithm for our issue. We joined two arrays where each array was sorted and run qsort to sort the new array. Sven.
At 06:35 AM 2/16/2006, Steinar H. Gunderson wrote: >On Wed, Feb 15, 2006 at 11:30:54PM -0500, Ron wrote: > > Even better (and more easily scaled as the number of GPR's in the CPU > > changes) is to use > > the set {L; L+1; L+2; t>>1; R-2; R-1; R} > > This means that instead of 7 random memory accesses, we have 3; two > > of which result in a burst access for three elements each. > >Isn't that improvement going to disappear competely if you choose a bad >pivot? Only if you _consistently_ (read: "the vast majority of the time": quicksort is actually darn robust) choose a _pessimal_, not just "bad", pivot quicksort will degenerate to the O(N^2) behavior everyone worries about. See Corman & Rivest for a proof on this. Even then, doing things as above has benefits: 1= The worst case is less bad since the guaranteed O(lgs!) pivot choosing algorithm puts s elements into final position. Worst case becomes better than O(N^2/(s-1)). 2= The overhead of pivot choosing can overshadow the benefits using more traditional methods for even moderate values of s. See discussions on the quicksort variant known as "samplesort" and Sedgewick's PhD thesis for details. Using a pivot choosing algorithm that actually does some of the partitioning (and does it more efficiently than the "usual" partitioning algorithm does) plus using partition-in-place (rather then Lomuto's method) reduces overhead very effectively (at the "cost" of more complicated / delicate to get right partitioning code). The above reduces the number of moves used in a quicksort pass considerably regardless of the number of compares used. 3= Especially in modern systems where the gap between internal CPU bandwidth and memory bandwidth is so great, the overhead of memory accesses for comparisons and moves is the majority of the overhead for both the pivot choosing and the partitioning algorithms within quicksort. Particularly random memory accesses. The reason (#GPRs - 1) is a magic constant is that it's the most you can compare and move using only register-to-register operations. In addition, replacing as many of the memory accesses you must do with sequential rather than random memory accesses is a big deal: sequential memory access is measured in 10's of CPU cycles while random memory access is measured in hundreds of CPU cycles. It's no accident that the advances in Grey's sorting contest have involved algorithms that are both register and cache friendly, minimizing overall memory access and using sequential memory access as much as possible when said access can not be avoided. As caches grow larger and memory accesses more expensive, it's often worth it to use a BucketSort+QuickSort hybrid rather than just QuickSort. ...and of course if you know enough about the data to be sorted so as to constrain it appropriately, one should use a non comparison based O(N) sorting algorithm rather than any of the general comparison based O(NlgN) methods. > > SIDE NOTE: IIRC glibc's qsort is actually merge sort. Merge sort > > performance is insensitive to all inputs, and there are way to > > optimize it as well. > >glibc-2.3.5/stdlib/qsort.c: > > /* Order size using quicksort. This implementation incorporates > four optimizations discussed in Sedgewick: > >I can't see any references to merge sort in there at all. Well, then I'm not the only person on the lists whose memory is faulty ;-) The up side of MergeSort is that its performance is always O(NlgN). The down sides are that it is far more memory hungry than QuickSort and slower. Ron
At 07:10 AM 2/16/2006, Florian Weimer wrote: >* Neil Conway: > > > On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote: > >> It seems clear that our qsort.c is doing a pretty awful job of picking > >> qsort pivots, while glibc is mostly managing not to make that mistake. > >> I haven't looked at the glibc code yet to see what they are doing > >> differently. > > > > glibc qsort is actually merge sort, so I'm not surprised it avoids this > > problem. > >qsort also performs twice as many key comparisons as the theoretical >minimum. The theoretical minimum number of comparisons for a general purpose comparison based sort is O(lgN!). QuickSort uses 2NlnN ~= 1.38NlgN or ~1.38x the optimum without tuning (see Knuth, Sedgewick, Corman, ... etc) OTOH, QuickSort uses ~2x as many =moves= as the theoretical minimum unless tuned, and moves are more expensive than compares in modern systems. See my other posts for QuickSort tuning methods that attempt to directly address both issues. Ron
Hi, Ron, Ron wrote: > ...and of course if you know enough about the data to be sorted so as to > constrain it appropriately, one should use a non comparison based O(N) > sorting algorithm rather than any of the general comparison based > O(NlgN) methods. Sounds interesting, could you give us some pointers (names, URLs, papers) to such algorithms? Thanks a lot, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
Last night I implemented a non-recursive introsort in C... let me test it a bit more and then I'll post it here for everyone else to try out.
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
On 2/16/06, Markus Schaber <schabi@logix-tt.com> wrote:
Hi, Ron,
Ron wrote:
> ...and of course if you know enough about the data to be sorted so as to
> constrain it appropriately, one should use a non comparison based O(N)
> sorting algorithm rather than any of the general comparison based
> O(NlgN) methods.
Sounds interesting, could you give us some pointers (names, URLs,
papers) to such algorithms?
Thanks a lot,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS
Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote: > 3= Especially in modern systems where the gap between internal CPU > bandwidth and memory bandwidth is so great, the overhead of memory > accesses for comparisons and moves is the majority of the overhead > for both the pivot choosing and the partitioning algorithms within > quicksort. Particularly random memory accesses. The reason (#GPRs - > 1) is a magic constant is that it's the most you can compare and move > using only register-to-register operations. But how much of this applies to us? We're not sorting arrays of integers, we're sorting pointers to tuples. So while moves cost very little, a comparison costs hundreds, maybe thousands of cycles. A tuple can easily be two or three cachelines and you're probably going to access all of it, not to mention the Fmgr structures and the Datums themselves. None of this is cache friendly. The actual tuples themselves could be spread all over memory (I don't think any particular effort is expended trying to minimize fragmentation). Do these algorithms discuss the case where a comparison is more than 1000 times the cost of a move? Where this does become interesting is where we can convert a datum to an integer such that if f(A) > f(B) then A > B. Then we can sort on f(X) first with just integer comparisons and then do a full tuple comparison only if f(A) = f(B). This would be much more cache-coherent and make these algorithms much more applicable in my mind. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote: >On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote: > > 3= Especially in modern systems where the gap between internal CPU > > bandwidth and memory bandwidth is so great, the overhead of memory > > accesses for comparisons and moves is the majority of the overhead > > for both the pivot choosing and the partitioning algorithms within > > quicksort. Particularly random memory accesses. The reason (#GPRs - > > 1) is a magic constant is that it's the most you can compare and move > > using only register-to-register operations. > >But how much of this applies to us? We're not sorting arrays of >integers, we're sorting pointers to tuples. So while moves cost very >little, a comparison costs hundreds, maybe thousands of cycles. A tuple >can easily be two or three cachelines and you're probably going to >access all of it, not to mention the Fmgr structures and the Datums >themselves. Pointers are simply fixed size 32b or 64b quantities. They are essentially integers. Comparing and moving pointers or fixed size keys to those pointers is exactly the same problem as comparing and moving integers. Comparing =or= moving the actual data structures is a much more expensive and variable cost proposition. I'm sure that pg's sort functionality is written intelligently enough that the only real data moves are done in a final pass after the exact desired order has been found using pointer compares and (re)assignments during the sorting process. That's a standard technique for sorting data whose "key" or pointer is much smaller than a datum. Your cost comment basically agrees with mine regarding the cost of random memory accesses. The good news is that the number of datums to be examined during the pivot choosing process is small enough that the datums can fit into CPU cache while the pointers to them can be assigned to registers: making pivot choosing +very+ fast when done correctly. As you've noted, actual partitioning is going to be more expensive since it involves accessing enough actual datums that they can't all fit into CPU cache. The good news is that QuickSort has a very sequential access pattern within its inner loop. So while we must go to memory for compares, we are at least keeping the cost for it down it a minimum. In addition, said access is nice enough to be very prefetch and CPU cache hierarchy friendly. >None of this is cache friendly. The actual tuples themselves could be >spread all over memory (I don't think any particular effort is expended >trying to minimize fragmentation). It probably would be worth it to spend some effort on memory layout just as we do for HD layout. >Do these algorithms discuss the case where a comparison is more than >1000 times the cost of a move? A move is always more expensive than a compare when the datum is larger than its pointer or key. A move is always more expensive than a compare when it involves memory to memory movement rather than CPU location to CPU location movement. A move is especially more expensive than a compare when it involves both factors. Most moves do involve both. What I suspect you meant is that a key comparison that involves accessing the data in memory is more expensive than reassigning the pointers associated with those keys. That is certainly true. Yes. The problem has been extensively studied. ;-) >Where this does become interesting is where we can convert a datum to >an integer such that if f(A) > f(B) then A > B. Then we can sort on >f(X) first with just integer comparisons and then do a full tuple >comparison only if f(A) = f(B). This would be much more cache-coherent >and make these algorithms much more applicable in my mind. In fact we can do better. Using hash codes or what-not to map datums to keys and then sorting just the keys and the pointers to those datums followed by an optional final pass where we do the actual data movement is also a standard technique for handling large data structures. Regardless of what tweaks beyond the basic algorithms we use, the algorithms themselves have been well studied and their performance well established. QuickSort is the best performing of the O(nlgn) comparison based sorts and it uses less resources than HeapSort or MergeSort. Ron
Ron <rjpeace@earthlink.net> writes: > Your cost comment basically agrees with mine regarding the cost of > random memory accesses. The good news is that the number of datums > to be examined during the pivot choosing process is small enough that > the datums can fit into CPU cache while the pointers to them can be > assigned to registers: making pivot choosing +very+ fast when done correctly. This is more or less irrelevant given that comparing the pointers is not the operation we need to do. regards, tom lane
Markus Schaber wrote: > Ron wrote: >>...and of course if you know enough about the data to be sorted so as to >>constrain it appropriately, one should use a non comparison based O(N) >>sorting algorithm rather than any of the general comparison based >>O(NlgN) methods. > > Sounds interesting, could you give us some pointers (names, URLs, > papers) to such algorithms? Most of these techniques boil down to good ol' "bucket sort". A simple example: suppose you have a column of integer percentages,range zero to 100. You know there are only 101 distinct values. So create 101 "buckets" (e.g. linked lists),make a single pass through your data and drop each row's ID into the right bucket, then make a second pass throughthe buckets, and write the row ID's out in bucket order. This is an O(N) sort technique. Any time you have a restricted data range, you can do this. Say you have 100 million rows of scientific results known tobe good to only three digits -- it can have at most 1,000 distinct values (regardless of the magnitude of the values),so you can do this with 1,000 buckets and just two passes through the data. You can also use this trick when the optimizer is asked for "fastest first result." Say you have a cursor on a column ofnumbers with good distribution. If you do a bucket sort on the first two or three digits only, you know the first "page"of results will be in the first bucket. So you only need to apply qsort to that first bucket (which is very fast,since it's small), and you can deliver the first page of data to the application. This can be particularly effectivein interactive situations, where the user typically looks at a few pages of data and then abandons the search. I doubt this is very relevant to Postgres. A relational database has to be general purpose, and it's hard to give it "hints"that would tell it when to use this particular optimization. Craig
At 10:52 AM 2/16/2006, Ron wrote: >At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote: > >>Where this does become interesting is where we can convert a datum to >>an integer such that if f(A) > f(B) then A > B. Then we can sort on >>f(X) first with just integer comparisons and then do a full tuple >>comparison only if f(A) = f(B). This would be much more cache-coherent >>and make these algorithms much more applicable in my mind. >In fact we can do better. >Using hash codes or what-not to map datums to keys and then sorting >just the keys and the pointers to those datums followed by an >optional final pass where we do the actual data movement is also a >standard technique for handling large data structures. I thought some follow up might be in order here. Let's pretend that we have the typical DB table where rows are ~2-4KB apiece. 1TB of storage will let us have 256M-512M rows in such a table. A 32b hash code can be assigned to each row value such that only exactly equal rows will have the same hash code. A 32b pointer can locate any of the 256M-512M rows. Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b= 64b*(2^28 to 2^29)= 2-4GB of pointers+keys followed by an optional pass to rearrange the actual rows if we so wish. We get the same result while only examining and manipulating 1/50 to 1/25 as much data during the sort. If we want to spend more CPU time in order to save more space, we can compress the key+pointer representation. That usually reduces the amount of data to be manipulated to ~1/4 the original key+pointer representation, reducing things to ~512M-1GB worth of compressed pointers+keys. Or ~1/200 - ~1/100 the original amount of data we were discussing. Either representation is small enough to fit within RAM rather than requiring HD IO, so we solve the HD IO bottleneck in the best possible way: we avoid ever doing it. Ron
On Thu, Feb 16, 2006 at 11:32:55AM -0500, Ron wrote: > At 10:52 AM 2/16/2006, Ron wrote: > >In fact we can do better. > >Using hash codes or what-not to map datums to keys and then sorting > >just the keys and the pointers to those datums followed by an > >optional final pass where we do the actual data movement is also a > >standard technique for handling large data structures. Or in fact required if the Datums are not all the same size, which is the case in PostgreSQL. > I thought some follow up might be in order here. > > Let's pretend that we have the typical DB table where rows are ~2-4KB > apiece. 1TB of storage will let us have 256M-512M rows in such a table. > > A 32b hash code can be assigned to each row value such that only > exactly equal rows will have the same hash code. > A 32b pointer can locate any of the 256M-512M rows. That hash code is impossible the way you state it, since the set of strings is not mappable to a 32bit integer. You probably meant that a hash code can be assigned such that equal rows have equal hashes (drop the only). > Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b= > 64b*(2^28 to 2^29)= 2-4GB of pointers+keys followed by an optional > pass to rearrange the actual rows if we so wish. > > We get the same result while only examining and manipulating 1/50 to > 1/25 as much data during the sort. But this is what we do now. The tuples are loaded, we sort an array of pointers, then we write the output. Except we don't have the hash, so we require access to the 1TB of data to do the actual comparisons. Even if we did have the hash, we'd *still* need access to the data to handle tie-breaks. That's why your comment about moves always being more expensive than compares makes no sense. A move can be acheived simply by swapping two pointers in the array. A compare actually needs to call all sorts of functions. If and only if we have functions for every data type to produce an ordered hash, we can optimise sorts based on single integers. For reference, look at comparetup_heap(). It's just 20 lines, but each function call there expands to maybe a dozen lines of code. And it has a loop. I don't think we're anywhere near the stage where locality of reference makes much difference. We very rarely needs the tuples actualised in memory in the required order, just the pointers are enough. Have a ncie day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
"Craig A. James" <cjames@modgraph-usa.com> writes: > You can also use this trick when the optimizer is asked for "fastest first result." Say you have a cursor on a columnof numbers with good distribution. If you do a bucket sort on the first two or three digits only, you know the first"page" of results will be in the first bucket. So you only need to apply qsort to that first bucket (which is veryfast, since it's small), and you can deliver the first page of data to the application. This can be particularly effectivein interactive situations, where the user typically looks at a few pages of data and then abandons the search. > I doubt this is very relevant to Postgres. A relational database has to be general purpose, and it's hard to give it "hints"that would tell it when to use this particular optimization. Actually, LIMIT does nicely for that hint; the PG planner has definitely got a concept of preferring fast-start plans for limited queries. The real problem in applying bucket-sort ideas is the lack of any datatype-independent way of setting up the buckets. Once or twice we've kicked around the idea of having some datatype-specific sorting code paths alongside the general-purpose one, but I can't honestly see this as being workable from a code maintenance standpoint. regards, tom lane
On Feb 16, 2006, at 8:32 AM, Ron wrote: > Let's pretend that we have the typical DB table where rows are > ~2-4KB apiece. 1TB of storage will let us have 256M-512M rows in > such a table. > > A 32b hash code can be assigned to each row value such that only > exactly equal rows will have the same hash code. > A 32b pointer can locate any of the 256M-512M rows. > > Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b > +32b= 64b*(2^28 to 2^29)= 2-4GB of pointers+keys followed by an > optional pass to rearrange the actual rows if we so wish. I don't understand this. This is a true statement: (H(x) != H(y)) => (x != y) This is not: (H(x) < H(y)) => (x < y) Hash keys can tell you there's an inequality, but they can't tell you how the values compare. If you want 32-bit keys that compare in the same order as the original values, here's how you have to get them: (1) sort the values into an array (2) use each value's array index as its key It reduces to the problem you're trying to use it to solve. -- Scott Lamb <http://www.slamb.org/>
At 12:19 PM 2/16/2006, Scott Lamb wrote: >On Feb 16, 2006, at 8:32 AM, Ron wrote: >>Let's pretend that we have the typical DB table where rows are >>~2-4KB apiece. 1TB of storage will let us have 256M-512M rows in >>such a table. >> >>A 32b hash code can be assigned to each row value such that only >>exactly equal rows will have the same hash code. >>A 32b pointer can locate any of the 256M-512M rows. >> >>Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b >>+32b= 64b*(2^28 to 2^29)= 2-4GB of pointers+keys followed by an >>optional pass to rearrange the actual rows if we so wish. > >I don't understand this. > >This is a true statement: (H(x) != H(y)) => (x != y) >This is not: (H(x) < H(y)) => (x < y) > >Hash keys can tell you there's an inequality, but they can't tell you >how the values compare. If you want 32-bit keys that compare in the >same order as the original values, here's how you have to get them: For most hash codes, you are correct. There is a class of hash or hash-like codes that maintains the mapping to support that second statement. More later when I can get more time. Ron
On Thu, 2006-02-16 at 12:35 +0100, Steinar H. Gunderson wrote: > glibc-2.3.5/stdlib/qsort.c: > > /* Order size using quicksort. This implementation incorporates > four optimizations discussed in Sedgewick: > > I can't see any references to merge sort in there at all. stdlib/qsort.c defines _quicksort(), not qsort(), which is defined by msort.c. On looking closer, it seems glibc actually tries to determine the physical memory in the machine -- if it is sorting a single array that exceeds 1/4 of the machine's physical memory, it uses quick sort, otherwise it uses merge sort. -Neil
On Thu, 2006-02-16 at 12:15 -0500, Tom Lane wrote: > Once or twice we've kicked around the idea of having some > datatype-specific sorting code paths alongside the general-purpose one, > but I can't honestly see this as being workable from a code maintenance > standpoint. > > regards, tom lane It seems that instead of maintaining a different sorting code path for each data type, you could get away with one generic path and one (hopefully faster) path if you allowed data types to optionally support a 'sortKey' interface by providing a function f which maps inputs to 32- bit int outputs, such that the following two properties hold: f(a)>=f(b) iff a>=b if a==b then f(a)==f(b) So if a data type supports the sortKey interface you could perform the sort on f(value) and only refer back to the actual element comparison functions when two sortKeys have the same value. Data types which could probably provide a useful function for f would be int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII). Depending on the overhead, you might not even need to maintain 2 independent search code paths, since you could always use f(x)=0 as the default sortKey function which would degenerate to the exact same sort behavior in use today. -- Mark Lewis
Hi, Mark, Mark Lewis schrieb: > It seems that instead of maintaining a different sorting code path for > each data type, you could get away with one generic path and one > (hopefully faster) path if you allowed data types to optionally support > a 'sortKey' interface by providing a function f which maps inputs to 32- > bit int outputs, such that the following two properties hold: > > f(a)>=f(b) iff a>=b > if a==b then f(a)==f(b) Hmm, to remove redundancy, I'd change the <= to a < and define: if a==b then f(a)==f(b) if a<b then f(a)<=f(b) > Data types which could probably provide a useful function for f would be > int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII). With int2 or some restricted ranges of oid and int4, we could even implement a bucket sort. Markus
On Thu, Feb 16, 2006 at 02:17:36PM -0800, Mark Lewis wrote: > It seems that instead of maintaining a different sorting code path for > each data type, you could get away with one generic path and one > (hopefully faster) path if you allowed data types to optionally support > a 'sortKey' interface by providing a function f which maps inputs to 32- > bit int outputs, such that the following two properties hold: > > f(a)>=f(b) iff a>=b > if a==b then f(a)==f(b) Note this is a property of the collation, not the type. For example strings can be sorted in many ways and the sortKey must reflect that. So in postgres terms it's a property of the btree operator class. It's something I'd like to do if I get A Round Tuit. :) Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
Markus Schaber <schabi@logix-tt.com> writes: > Hmm, to remove redundancy, I'd change the <= to a < and define: > > if a==b then f(a)==f(b) > if a<b then f(a)<=f(b) > > > Data types which could probably provide a useful function for f would be > > int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII). How exactly do you imagine doing this for text? I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though. -- greg
On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote: > > > Data types which could probably provide a useful function for f would be > > > int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII). > > How exactly do you imagine doing this for text? > > I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though. In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit sortKey as elsewhere suggested). The sorting key doesn't need to be a one-to-one mapping. -- Mark Lewis
On Thu, 16 Feb 2006, Mark Lewis wrote: > On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote: >>>> Data types which could probably provide a useful function for f would be >>>> int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII). >> >> How exactly do you imagine doing this for text? >> >> I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though. > > > In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit > sortKey as elsewhere suggested). The sorting key doesn't need to be a > one-to-one mapping. that would violate your second contraint ( f(a)==f(b) iff (a==b) ) if you could drop that constraint (the cost of which would be extra 'real' compares within a bucket) then a helper function per datatype could work as you are talking. David Lang
At 01:47 PM 2/16/2006, Ron wrote: >At 12:19 PM 2/16/2006, Scott Lamb wrote: >>On Feb 16, 2006, at 8:32 AM, Ron wrote: >>>Let's pretend that we have the typical DB table where rows are >>>~2-4KB apiece. 1TB of storage will let us have 256M-512M rows in >>>such a table. >>> >>>A 32b hash code can be assigned to each row value such that only >>>exactly equal rows will have the same hash code. >>>A 32b pointer can locate any of the 256M-512M rows. >>> >>>Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b >>>+32b= 64b*(2^28 to 2^29)= 2-4GB of pointers+keys followed by an >>>optional pass to rearrange the actual rows if we so wish. >> >>I don't understand this. >> >>This is a true statement: (H(x) != H(y)) => (x != y) >>This is not: (H(x) < H(y)) => (x < y) >> >>Hash keys can tell you there's an inequality, but they can't tell you >>how the values compare. If you want 32-bit keys that compare in the >>same order as the original values, here's how you have to get them: >For most hash codes, you are correct. There is a class of hash or >hash-like codes that maintains the mapping to support that second statement. > >More later when I can get more time. >Ron OK, so here's _a_ way (there are others) to obtain a mapping such that if a < b then f(a) < f (b) and if a == b then f(a) == f(b) Pretend each row is a integer of row size (so a 2KB row becomes a 16Kb integer; a 4KB row becomes a 32Kb integer; etc) Since even a 1TB table made of such rows can only have 256M - 512M possible values even if each row is unique, a 28b or 29b key is large enough to represent each row's value and relative rank compared to all of the others even if all row values are unique. By scanning the table once, we can map say 0000001h (Hex used to ease typing) to the row with the minimum value and 1111111h to the row with the maximum value as well as mapping everything in between to their appropriate keys. That same scan can be used to assign a pointer to each record's location. We can now sort the key+pointer pairs instead of the actual data and use an optional final pass to rearrange the actual rows if we wish. That initial scan to set up the keys is expensive, but if we wish that cost can be amortized over the life of the table so we don't have to pay it all at once. In addition, once we have created those keys, then can be saved for later searches and sorts. Further space savings can be obtained whenever there are duplicate keys and/or when compression methods are used on the Key+pointer pairs. Ron
Hi, David, David Lang schrieb: >> In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit >> sortKey as elsewhere suggested). The sorting key doesn't need to be a >> one-to-one mapping. > that would violate your second contraint ( f(a)==f(b) iff (a==b) ) no, it doesn't. When both strings are equal, then the first characters are equal, too. If they are not equal, the constraint condition does not match. The first characters of the strings may be equal as f(a) may be equal to f(b) as to the other constraint. Markus
Hi, Ron, Ron schrieb: > OK, so here's _a_ way (there are others) to obtain a mapping such that > if a < b then f(a) < f (b) and > if a == b then f(a) == f(b) > > Pretend each row is a integer of row size (so a 2KB row becomes a 16Kb > integer; a 4KB row becomes a 32Kb integer; etc) > Since even a 1TB table made of such rows can only have 256M - 512M > possible values even if each row is unique, a 28b or 29b key is large > enough to represent each row's value and relative rank compared to all > of the others even if all row values are unique. > > By scanning the table once, we can map say 0000001h (Hex used to ease > typing) to the row with the minimum value and 1111111h to the row with > the maximum value as well as mapping everything in between to their > appropriate keys. That same scan can be used to assign a pointer to > each record's location. But with a single linear scan, this cannot be accomplished, as the table contents are neither sorted nor distributed linearly between the minimum and the maximum. For this mapping, you need a full table sort. > That initial scan to set up the keys is expensive, but if we wish that > cost can be amortized over the life of the table so we don't have to pay > it all at once. In addition, once we have created those keys, then can > be saved for later searches and sorts. But for every update or insert, you have to resort the keys, which is _very_ expensive as it basically needs to update a huge part of the table. Markus
At 04:24 AM 2/17/2006, Ragnar wrote: >On fös, 2006-02-17 at 01:20 -0500, Ron wrote: > > > > OK, so here's _a_ way (there are others) to obtain a mapping such that > > if a < b then f(a) < f (b) and > > if a == b then f(a) == f(b) > > > By scanning the table once, we can map say 0000001h (Hex used to ease > > typing) to the row with the minimum value and 1111111h to the row > > with the maximum value as well as mapping everything in between to > > their appropriate keys. That same scan can be used to assign a > > pointer to each record's location. > >This step is just as expensive as the original >sort you want to replace/improve. Why do you think that? External sorts involve the equivalent of multiple scans of the table to be sorted, sometimes more than lgN (where N is the number of items in the table to be sorted). Since this is physical IO we are talking about, each scan is very expensive, and therefore 1 scan is going to take considerably less time than >= lgN scans will be. >If you want to keep this mapping saved as a sort >of an index, or as part ot each row data, this >will make the cost of inserts and updates enormous. Not sure you've got this right either. Looks to me like we are adding a <= 32b quantity to each row. Once we know the mapping, incrementally updating it upon insert or update would seem to be simple matter of a fast search for the correct ranking [Interpolation search, which we have all the needed data for, is O(lglgN). Hash based search is O(1)]; plus an increment/decrement of the key values greater/less than the key value of the row being inserted / updated. Given than we are updating all the keys in a specific range within a tree structure, that update can be done in O(lgm) (where m is the number of records affected). > > We can now sort the key+pointer pairs instead of the actual data and > > use an optional final pass to rearrange the actual rows if we wish. > >How are you suggesting this mapping be accessed? >If the mapping is kept separate from the tuple >data, as in an index, then how will you look up the key? ??? We've effectively created a data set where each record is a pointer to a DB row plus its key. We can now sort the data set by key and then do an optional final pass to rearrange the actual DB rows if we so wish. Since that final pass is very expensive, it is good that not all use scenarios will need that final pass. The amount of storage required to sort this representation of the table rather than the actual table is so much less that it turns an external sorting problem into a internal sorting problem with an optional final pass that is =1= scan (albeit one scan with a lot of seeks and data movement). This is a big win. It is a variation of a well known technique. See Sedgewick, Knuth, etc. > > That initial scan to set up the keys is expensive, but if we wish > > that cost can be amortized over the life of the table so we don't > > have to pay it all at once. In addition, once we have created those > > keys, then can be saved for later searches and sorts. > >What is the use case where this would work better than a >regular btree index ? Again, ??? btree indexes address different issues. They do not in any way help create a compact data representation of the original data that saves enough space so as to turn an external ranking or sorting problem into an internal one. Ron
At 05:19 AM 2/17/2006, Markus Schaber wrote: >Hi, Ron, > >Ron schrieb: > > > OK, so here's _a_ way (there are others) to obtain a mapping such that > > if a < b then f(a) < f (b) and > > if a == b then f(a) == f(b) > > > > Pretend each row is a integer of row size (so a 2KB row becomes a 16Kb > > integer; a 4KB row becomes a 32Kb integer; etc) > > Since even a 1TB table made of such rows can only have 256M - 512M > > possible values even if each row is unique, a 28b or 29b key is large > > enough to represent each row's value and relative rank compared to all > > of the others even if all row values are unique. > > > > By scanning the table once, we can map say 0000001h (Hex used to ease > > typing) to the row with the minimum value and 1111111h to the row with > > the maximum value as well as mapping everything in between to their > > appropriate keys. That same scan can be used to assign a pointer to > > each record's location. > >But with a single linear scan, this cannot be accomplished, as the table >contents are neither sorted nor distributed linearly between the minimum >and the maximum. So what? We are talking about key assignment here, not anything that requires physically manipulating the actual DB rows. One physical IO pass should be all that's needed. >For this mapping, you need a full table sort. One physical IO pass should be all that's needed. However, let's pretend you are correct and that we do need to sort the table to get the key mapping. Even so, we would only need to do it =once= and then we would be able to use and incrementally update the results forever afterward. Even under this assumption, one external sort to save all subsequent such sorts seems well worth it. IOW, even if I'm wrong about the initial cost to do this; it is still worth doing ;-) > > That initial scan to set up the keys is expensive, but if we wish that > > cost can be amortized over the life of the table so we don't have to pay > > it all at once. In addition, once we have created those keys, then can > > be saved for later searches and sorts. > >But for every update or insert, you have to resort the keys, which is >_very_ expensive as it basically needs to update a huge part of the table. ??? You do not need to resort already ordered data to insert a new element into it such that the data stays ordered! Once we have done the key ordering operation once, we should not ever need to do it again on the original data. Else most sorting algorithms wouldn't work ;-) Ron
On Fri, Feb 17, 2006 at 08:23:40AM -0500, Ron wrote: > >For this mapping, you need a full table sort. > One physical IO pass should be all that's needed. However, let's > pretend you are correct and that we do need to sort the table to get > the key mapping. Even so, we would only need to do it =once= and > then we would be able to use and incrementally update the results > forever afterward. Even under this assumption, one external sort to > save all subsequent such sorts seems well worth it. > > IOW, even if I'm wrong about the initial cost to do this; it is still > worth doing ;-) I think you're talking about something different here. You're thinking of having the whole table sorted and when you add a new value you add it in such a way to keep it sorted. The problem is, what do you sort it by? If you've sorted the table by col1, then when the user does ORDER BY col2 it's useless. Indeed, this is what btrees do, you store the order of the table seperate from the data. And you can store multiple orders. But even then, when someone does ORDER BY lower(col1), it's still useless. And you're right, we still need to do the single mass sort in the beginning, which is precisely what we're trying to optimise here. > ??? You do not need to resort already ordered data to insert a new > element into it such that the data stays ordered! Once we have done > the key ordering operation once, we should not ever need to do it > again on the original data. Else most sorting algorithms wouldn't work ;-) We already do this with btree indexes. I'm not sure what you are proposing that improves on that. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
On Feb 16, 2006, at 2:17 PM, Mark Lewis wrote:
Data types which could probably provide a useful function for f would be
int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
...and with some work, floats (I think just the exponent would work, if nothing else). bytea. Probably just about anything.
Interesting. If you abandon the idea that collisions should be impossible (they're not indexes) or extremely rare (they're not hashes), it's pretty easy to come up with a decent hint to avoid a lot of dereferences.
--
Scott Lamb <http://www.slamb.org/>
On Fri, Feb 17, 2006 at 08:18:41AM -0800, Scott Lamb wrote: > On Feb 16, 2006, at 2:17 PM, Mark Lewis wrote: > >Data types which could probably provide a useful function for f > >would be > >int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII). > > ...and with some work, floats (I think just the exponent would work, > if nothing else). bytea. Probably just about anything. > > Interesting. If you abandon the idea that collisions should be > impossible (they're not indexes) or extremely rare (they're not > hashes), it's pretty easy to come up with a decent hint to avoid a > lot of dereferences. Yep, pretty much for any datatype you create a mapping function to map it to a signed int32. All you have to guarentee is that f(a) > f(b) implies that a > b. Only if f(a) == f(b) do you need to compare a and b. You then change the sorting code to have an array of (Datum,int32) (ouch, double the storage) where the int32 is the f(Datum). And in the comparison routines you first check the int32. If they give an order you're done. On match you do the full comparison. For integer types (int2,int4,int8,oid) the conversion is straight forward. For float you'd use the exponent and the first few bits of the mantissa. For strings you'd have to bail, or use a strxfrm equivalent. NULL would be INT_MAX pr INT_MIN depending on where you want it. Thing is, even if you don't have such a function and always return zero, the results will still be right. Not a new idea, but it would be very nice to implement. If would produce nice speedups for type where comparisons are expensive. And more importantly, the bulk of the comparisons can be moved inline and make the whole cache-friendlyness discussed here much more meaningful. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
At 10:53 AM 2/17/2006, Martijn van Oosterhout wrote: >On Fri, Feb 17, 2006 at 08:23:40AM -0500, Ron wrote: > > >For this mapping, you need a full table sort. > > One physical IO pass should be all that's needed. However, let's > > pretend you are correct and that we do need to sort the table to get > > the key mapping. Even so, we would only need to do it =once= and > > then we would be able to use and incrementally update the results > > forever afterward. Even under this assumption, one external sort to > > save all subsequent such sorts seems well worth it. > > > > IOW, even if I'm wrong about the initial cost to do this; it is still > > worth doing ;-) > >I think you're talking about something different here. You're thinking >of having the whole table sorted and when you add a new value you add >it in such a way to keep it sorted. The problem is, what do you sort it >by? If you've sorted the table by col1, then when the user does ORDER >BY col2 it's useless. No, I'm thinking about how to represent DB row data in such a way that a= we use a compact enough representation that we can sort internally rather than externally. b= we do the sort once and avoid most of the overhead upon subsequent similar requests. I used the example of sorting on the entire row to show that the approach works even when the original record being sorted by is very large. All my previous comments on this topic hold for the case where we are sorting on only part of a row as well. If all you are doing is sorting on a column or a few columns, what I'm discussing is even easier since treating the columns actually being used a sort criteria as integers rather than the whole row as an atomic unit eats less resources during the key creation and mapping process. If the row is 2-4KB in size, but we only care about some combination of columns that only takes on <= 2^8 or <= 2^16 different values, then what I've described will be even better than the original example I gave. Basically, the range of a key is going to be restricted by how a= big the field is that represents the key (columns and such are usually kept narrow for performance reasons) or b= big each row is (the more space each row takes, the fewer rows fit into any given amount of storage) c= many rows there are in the table Between the conditions, the range of a key tends to be severely restricted and therefore use much less space than sorting the actual DB records would. ...and that gives us something we can take advantage of. >Indeed, this is what btrees do, you store the order of the table >seperate from the data. And you can store multiple orders. But even >then, when someone does ORDER BY lower(col1), it's still useless. > >And you're right, we still need to do the single mass sort in the >beginning, which is precisely what we're trying to optimise here. Sigh. My points were: 1= we have information available to us that allows us to map the rows in such a way as to turn most external sorts into internal sorts, thereby avoiding the entire external sorting problem in those cases. This is a huge performance improvement. 2= that an external sort is =NOT= required for initial key assignment, but even if it was it would be worth it. 3= that this is a variation of a well known technique so I'm not suggesting heresy here. Ron
On Thu, 2006-02-16 at 21:33 -0800, David Lang wrote: > > In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit > > sortKey as elsewhere suggested). The sorting key doesn't need to be a > > one-to-one mapping. > > that would violate your second contraint ( f(a)==f(b) iff (a==b) ) > > if you could drop that constraint (the cost of which would be extra 'real' > compares within a bucket) then a helper function per datatype could work > as you are talking. I think we're actually on the same page here; you're right that the constraint above ( f(a)==f(b) iff a==b ) can't be extended to data types with more than 32 bits of value space. But the constraint I listed was actually: if a==b then f(a)==f(b) Which doesn't imply 'if and only if'. It's a similar constraint to hashcodes; the same value will always have the same hash, but you're not guaranteed that the hashcodes for two distinct values will be unique. -- Mark
On fös, 2006-02-17 at 08:01 -0500, Ron wrote: > At 04:24 AM 2/17/2006, Ragnar wrote: > >On fös, 2006-02-17 at 01:20 -0500, Ron wrote: > > > > > > OK, so here's _a_ way (there are others) to obtain a mapping such that > > > if a < b then f(a) < f (b) and > > > if a == b then f(a) == f(b) > > > > > By scanning the table once, we can map say 0000001h (Hex used to ease > > > typing) to the row with the minimum value and 1111111h to the row > > > with the maximum value as well as mapping everything in between to > > > their appropriate keys. That same scan can be used to assign a > > > pointer to each record's location. > > > >This step is just as expensive as the original > >sort you want to replace/improve. > > Why do you think that? External sorts involve > the equivalent of multiple scans of the table to > be sorted, sometimes more than lgN (where N is > the number of items in the table to be > sorted). Since this is physical IO we are > talking about, each scan is very expensive, and > therefore 1 scan is going to take considerably > less time than >= lgN scans will be. Call me dim, but please explain exactly how you are going to build this mapping in one scan. Are you assuming the map will fit in memory? > > > >If you want to keep this mapping saved as a sort > >of an index, or as part ot each row data, this > >will make the cost of inserts and updates enormous. > > Not sure you've got this right either. Looks to > me like we are adding a <= 32b quantity to each > row. Once we know the mapping, incrementally > updating it upon insert or update would seem to > be simple matter of a fast search for the correct > ranking [Interpolation search, which we have all > the needed data for, is O(lglgN). Hash based > search is O(1)]; plus an increment/decrement of > the key values greater/less than the key value of > the row being inserted / updated. Given than we > are updating all the keys in a specific range > within a tree structure, that update can be done > in O(lgm) (where m is the number of records affected). Say again ? Let us say you have 1 billion rows, where the column in question contains strings like baaaaaaaaaaaaaaa....aaa baaaaaaaaaaaaaaa....aab baaaaaaaaaaaaaaa....aac ... not necessarily in this order on disc of course The minimum value would be keyed as 00000001h, the next one as 00000002h and so on. Now insert new value 'aaaaa' Not only will you have to update 1 billion records, but also all the values in your map. please explain gnari
Mark Lewis <mark.lewis@mir3.com> writes: > I think we're actually on the same page here; you're right that the > constraint above ( f(a)==f(b) iff a==b ) can't be extended to data types > with more than 32 bits of value space. But the constraint I listed was > actually: > if a==b then f(a)==f(b) I believe Martijn had it right: the important constraint is f(a) > f(b) implies a > b which implies by commutativity f(a) < f(b) implies a < b and these two together imply a == b implies f(a) == f(b) Now you can't do any sorting if you only have the equality rule, you need the inequality rule. regards, tom lane
On 2/17/06, Ragnar <gnari@hive.is> wrote: > Say again ? > Let us say you have 1 billion rows, where the > column in question contains strings like > baaaaaaaaaaaaaaa....aaa > baaaaaaaaaaaaaaa....aab > baaaaaaaaaaaaaaa....aac > ... > not necessarily in this order on disc of course > > The minimum value would be keyed as 00000001h, > the next one as 00000002h and so on. > > Now insert new value 'aaaaa' > > Not only will you have to update 1 billion records, > but also all the values in your map. > > please explain No comment on the usefulness of the idea overall.. but the solution would be to insert with the colliding value of the existing one lesser than it.. It will falsly claim equal, which you then must fix with a second local sort which should be fast because you only need to sort the duplicates/false dupes. If you insert too much then this obviously becomes completely useless.
At 08:37 PM 2/15/2006, Dann Corbit wrote: >Adding some randomness to the selection of the pivot is a known >technique to fix the oddball partitions problem. True, but it makes QuickSort slower than say MergeSort because of the expense of the PRNG being called ~O(lgN) times during a sort. >However, Bentley and Sedgewick proved that every quick sort >algorithm has some input set that makes it go quadratic Yep. OTOH, that input set can be so specific and so unusual as to require astronomically unlikely bad luck or hostile hacking in order for it to actually occur. > (hence the recent popularity of introspective sort, which switches > to heapsort if quadratic behavior is detected. The C++ template I > submitted was an example of introspective sort, but PostgreSQL does > not use C++ so it was not helpful). ...and there are other QuickSort+Other hybrids that address the issue as well. MergeSort, RadixExchangeSort, and BucketSort all come to mind. See Gonnet and Baeza-Yates, etc. >Here are some cases known to make qsort go quadratic: >1. Data already sorted Only if one element is used to choose the pivot; _and_ only if the pivot is the first or last element of each pass. Even just always using the middle element as the pivot avoids this problem. See Sedgewick or Knuth. >2. Data reverse sorted Ditto above. >3. Data organ-pipe sorted or ramp Not sure what this means? Regardless, median of n partitioning that includes samples from each of the 1st 1/3, 2nd 1/3, and final 3rd of the data is usually enough to guarantee O(NlgN) behavior unless the _specific_ distribution known to be pessimal to that sampling algorithm is encountered. The only times I've ever seen it ITRW was as a result of hostile activity: purposely arranging the data in such a manner is essentially a DoS attack. >4. Almost all data of the same value Well known fixes to inner loop available to avoid this problem. >There are probably other cases. Randomizing the pivot helps some, >as does check for in-order or reverse order partitions. Randomizing the choice of pivot essentially guarantees O(NlgN) behavior no matter what the distribution of the data at the price of increasing the cost of each pass by a constant factor (the generation of a random number or numbers). In sum, QuickSort gets all sorts of bad press that is far more FUD than fact ITRW. Ron.
Added to TODO: * Improve port/qsort() to handle sorts with 50% unique and 50% duplicate value [qsort] This involves choosing better pivot points for the quicksort. --------------------------------------------------------------------------- Dann Corbit wrote: > > > > -----Original Message----- > > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- > > owner@postgresql.org] On Behalf Of Tom Lane > > Sent: Wednesday, February 15, 2006 5:22 PM > > To: Ron > > Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org > > Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create > Index > > behaviour) > > > > Ron <rjpeace@earthlink.net> writes: > > > How are we choosing our pivots? > > > > See qsort.c: it looks like median of nine equally spaced inputs (ie, > > the 1/8th points of the initial input array, plus the end points), > > implemented as two rounds of median-of-three choices. With half of > the > > data inputs zero, it's not too improbable for two out of the three > > samples to be zeroes in which case I think the med3 result will be > zero > > --- so choosing a pivot of zero is much more probable than one would > > like, and doing so in many levels of recursion causes the problem. > > Adding some randomness to the selection of the pivot is a known > technique to fix the oddball partitions problem. However, Bentley and > Sedgewick proved that every quick sort algorithm has some input set that > makes it go quadratic (hence the recent popularity of introspective > sort, which switches to heapsort if quadratic behavior is detected. The > C++ template I submitted was an example of introspective sort, but > PostgreSQL does not use C++ so it was not helpful). > > > I think. I'm not too sure if the code isn't just being sloppy about > the > > case where many data values are equal to the pivot --- there's a > special > > case there to switch to insertion sort, and maybe that's getting > invoked > > too soon. > > Here are some cases known to make qsort go quadratic: > 1. Data already sorted > 2. Data reverse sorted > 3. Data organ-pipe sorted or ramp > 4. Almost all data of the same value > > There are probably other cases. Randomizing the pivot helps some, as > does check for in-order or reverse order partitions. > > Imagine if 1/3 of the partitions fall into a category that causes > quadratic behavior (have one of the above formats and have more than > CUTOFF elements in them). > > It is doubtful that the switch to insertion sort is causing any sort of > problems. It is only going to be invoked on tiny sets, for which it has > a fixed cost that is probably less that qsort() function calls on sets > of the same size. > > >It'd be useful to get a line-level profile of the behavior of > > this code in the slow cases... > > I guess that my in-order or presorted tests [which often arise when > there are very few distinct values] may solve the bad partition > problems. Don't forget that the algorithm is called recursively. > > > regards, tom lane > > > > ---------------------------(end of > broadcast)--------------------------- > > TIP 3: Have you checked our extensive FAQ? > > > > http://www.postgresql.org/docs/faq > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. +
My introsort is almost complete and its the fastest variant of quicksort I can find, I'll submit it to -patches in the next couple days as-well.
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
On 3/2/06, Bruce Momjian <pgman@candle.pha.pa.us> wrote:
Added to TODO:
* Improve port/qsort() to handle sorts with 50% unique and 50% duplicate
value [qsort]
This involves choosing better pivot points for the quicksort.
---------------------------------------------------------------------------
Dann Corbit wrote:
>
>
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> > owner@postgresql.org] On Behalf Of Tom Lane
> > Sent: Wednesday, February 15, 2006 5:22 PM
> > To: Ron
> > Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
> Index
> > behaviour)
> >
> > Ron <rjpeace@earthlink.net> writes:
> > > How are we choosing our pivots?
> >
> > See qsort.c: it looks like median of nine equally spaced inputs (ie,
> > the 1/8th points of the initial input array, plus the end points),
> > implemented as two rounds of median-of-three choices. With half of
> the
> > data inputs zero, it's not too improbable for two out of the three
> > samples to be zeroes in which case I think the med3 result will be
> zero
> > --- so choosing a pivot of zero is much more probable than one would
> > like, and doing so in many levels of recursion causes the problem.
>
> Adding some randomness to the selection of the pivot is a known
> technique to fix the oddball partitions problem. However, Bentley and
> Sedgewick proved that every quick sort algorithm has some input set that
> makes it go quadratic (hence the recent popularity of introspective
> sort, which switches to heapsort if quadratic behavior is detected. The
> C++ template I submitted was an example of introspective sort, but
> PostgreSQL does not use C++ so it was not helpful).
>
> > I think. I'm not too sure if the code isn't just being sloppy about
> the
> > case where many data values are equal to the pivot --- there's a
> special
> > case there to switch to insertion sort, and maybe that's getting
> invoked
> > too soon.
>
> Here are some cases known to make qsort go quadratic:
> 1. Data already sorted
> 2. Data reverse sorted
> 3. Data organ-pipe sorted or ramp
> 4. Almost all data of the same value
>
> There are probably other cases. Randomizing the pivot helps some, as
> does check for in-order or reverse order partitions.
>
> Imagine if 1/3 of the partitions fall into a category that causes
> quadratic behavior (have one of the above formats and have more than
> CUTOFF elements in them).
>
> It is doubtful that the switch to insertion sort is causing any sort of
> problems. It is only going to be invoked on tiny sets, for which it has
> a fixed cost that is probably less that qsort() function calls on sets
> of the same size.
>
> >It'd be useful to get a line-level profile of the behavior of
> > this code in the slow cases...
>
> I guess that my in-order or presorted tests [which often arise when
> there are very few distinct values] may solve the bad partition
> problems. Don't forget that the algorithm is called recursively.
>
> > regards, tom lane
> >
> > ---------------------------(end of
> broadcast)---------------------------
> > TIP 3: Have you checked our extensive FAQ?
> >
> > http://www.postgresql.org/docs/faq
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
Last month I wrote: > It seems clear that our qsort.c is doing a pretty awful job of picking > qsort pivots, while glibc is mostly managing not to make that mistake. I re-ran Gary's test script using the just-committed improvements to qsort.c, and got pretty nice numbers (attached --- compare to http://archives.postgresql.org/pgsql-performance/2006-02/msg00227.php). So it was wrong to blame his problems on the pivot selection --- the culprit was that ill-considered switch to insertion sort. regards, tom lane 100 runtimes for latest port/qsort.c, sorted ascending: Time: 335.481 ms Time: 335.606 ms Time: 335.932 ms Time: 336.039 ms Time: 336.182 ms Time: 336.231 ms Time: 336.711 ms Time: 336.721 ms Time: 336.971 ms Time: 336.982 ms Time: 337.036 ms Time: 337.190 ms Time: 337.223 ms Time: 337.312 ms Time: 337.350 ms Time: 337.423 ms Time: 337.523 ms Time: 337.528 ms Time: 337.565 ms Time: 337.566 ms Time: 337.732 ms Time: 337.741 ms Time: 337.744 ms Time: 337.786 ms Time: 337.790 ms Time: 337.898 ms Time: 337.905 ms Time: 337.952 ms Time: 337.976 ms Time: 338.017 ms Time: 338.123 ms Time: 338.206 ms Time: 338.306 ms Time: 338.514 ms Time: 338.594 ms Time: 338.597 ms Time: 338.683 ms Time: 338.705 ms Time: 338.729 ms Time: 338.748 ms Time: 338.816 ms Time: 338.958 ms Time: 338.963 ms Time: 338.997 ms Time: 339.074 ms Time: 339.106 ms Time: 339.134 ms Time: 339.159 ms Time: 339.226 ms Time: 339.260 ms Time: 339.289 ms Time: 339.341 ms Time: 339.500 ms Time: 339.585 ms Time: 339.595 ms Time: 339.774 ms Time: 339.897 ms Time: 339.927 ms Time: 340.064 ms Time: 340.133 ms Time: 340.172 ms Time: 340.219 ms Time: 340.261 ms Time: 340.323 ms Time: 340.708 ms Time: 340.761 ms Time: 340.785 ms Time: 340.900 ms Time: 340.986 ms Time: 341.339 ms Time: 341.564 ms Time: 341.707 ms Time: 342.155 ms Time: 342.213 ms Time: 342.452 ms Time: 342.515 ms Time: 342.540 ms Time: 342.928 ms Time: 343.548 ms Time: 343.663 ms Time: 344.192 ms Time: 344.952 ms Time: 345.152 ms Time: 345.174 ms Time: 345.444 ms Time: 346.848 ms Time: 348.144 ms Time: 348.842 ms Time: 354.550 ms Time: 356.877 ms Time: 357.475 ms Time: 358.487 ms Time: 364.178 ms Time: 370.730 ms Time: 493.098 ms Time: 648.009 ms Time: 849.345 ms Time: 860.616 ms Time: 936.800 ms Time: 1727.085 ms