Thread: Bad n_distinct estimation; hacks suggested?
Folks,

Params: PostgreSQL 8.0.1 on Solaris 10, Statistics = 500
(table names have been changed to protect an NDA)

e1=# select tablename, null_frac, correlation, n_distinct
     from pg_stats
     where tablename = 'clickstream1' and attname = 'session_id';
      tablename       | null_frac | correlation | n_distinct
----------------------+-----------+-------------+------------
 clickstream1         |         0 |    0.412034 |     378174
(2 rows)

e1=# select count(distinct session_id) from clickstream1;
  count
---------
 3174813

As you can see, the n_distinct estimate is off by a factor of 10x and it's causing query planning problems. Any suggested hacks to improve the histogram on this? (BTW, increasing the stats target to 1000 only doubles n_distinct, and doesn't solve the problem)

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
> -----Original Message-----
> From: Josh Berkus [mailto:josh@agliodbs.com]
> Sent: Tuesday, April 19, 2005 2:09 PM
> To: pgsql-perform
> Subject: [PERFORM] Bad n_distinct estimation; hacks suggested?
>
> [...]
> (BTW, increasing the stats to 1000 only doubles n_distinct,
> and doesn't solve the problem)

Speaking of which, is there a reason why statistics are limited to 1000? Performance?

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
Josh Berkus <josh@agliodbs.com> writes: > As you can see, n_distinct estimation is off by a factor of 10x and it's > causing query planning problems. Any suggested hacks to improve the > histogram on this? What's the histogram itself look like? (I'd like to see the whole pg_stats row not just part of it ...) There's probably no point in showing the target=1000 version, but maybe target=100 would be informative. regards, tom lane
Tom, > What's the histogram itself look like? (I'd like to see the whole > pg_stats row not just part of it ...) There's probably no point in > showing the target=1000 version, but maybe target=100 would be > informative. Here is the stats = 100 version. Notice that n_distinct has gone down. schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+----------------------+------------+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------- public | web_site_activity_fa | session_id | 0 | 8 | 96107 | 
{4393922,6049228,6026260,4394034,60341,4393810,2562999,2573850,3006299,4705488,2561499,4705258,3007378,4705490,60327,60352,2560950,2567640,2569852,3006604,4394329,2570739,2406633,2407292,3006356,4393603,4394121,6449083,2565815,4387881,2406770,2407081,2564340,3007328,2406578,2407295,2562813,2567603,4387835,71014,2566253,2566900,6103079,2289424,2407597,2567627,2568333,3457448,23450,23670,60743,70739,2406818,2406852,2407511,2562816,3007446,6306095,60506,71902,591543,1169136,1447077,2285047,2406830,2573964,6222758,61393,70955,70986,71207,71530,262368,2289213,2406899,2567361,2775952,3006824,4387864,6239825,6244853,6422152,1739,58600,179293,278473,488407,1896390,2286976,2407020,2546720,2677019,2984333,3006133,3007497,3310286,3631413,3801909,4366116,4388025} | {0.00166667,0.00146667,0.0013,0.0011,0.000933333,0.0009,0.0008,0.0008,0.000733333,0.000733333,0.0007,0.000633333,0.0006,0.0006,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000533333,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.000466667,0.000466667,0.000433333,0.000433333,0.000433333,0.000433333,0.0004,0.0004,0.0004,0.0004,0.0004,0.000366667,0.000366667,0.000366667,0.000366667,0.000333333,0.000333333,0.000333333,0.000333333,0.000333333,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.000233333,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002} | {230,58907,88648,156764,216759,240405,264601,289047,312630,339947,364452,386486,409427,434075,455140,475759,500086,521530,544703,680376,981066,1313419,1712592,1860151,1882452,1905328,1927504,1948159,1970054,1990408,2014501,2038573,2062786,2087163,2110129,2132196,2155657,2181058,2204976,2228575,2256229,2283897,2352453,2407153,2457716,2542081,2572119,2624133,2699592,2771254,2832224,2908151,2951500,3005088,3032889,3137244,3158685,3179395,3203681,3261587,3304359,3325577,3566688,3621357,3645094,3718667,3740821,3762386,3783169,3804593,3826503,3904589,3931012,3957675,4141934,4265118,4288568,4316898,4365625,4473965,4535752,4559700,4691802,4749478,5977208,6000272,6021416,6045939,6078912,6111900,6145155,6176422,6206627,6238291,6271270,6303067,6334117,6365200,6395250,6424719,6888329} | 0.41744 -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Tom,

Any thoughts? This is really messing up query execution all across the database ...

--Josh

> Here is the stats = 100 version. Notice that n_distinct has gone down.
>
> [pg_stats output snipped]

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Hi.

Sometimes, if the random number generator that PostgreSQL uses isn't good enough, the randomly selected pages for the statistics might not be random enough. Solaris is unknown to me; maybe the random number generator used there isn't good enough?

Good statistics depend on good random numbers. So, for example, if you have one million pages but the upper bound for the random numbers is one hundred thousand pages, the statistics might get skewed. Or some random number generator might produce, for example, only 32000 different values.

Regards,
Marko Ristola

Josh Berkus wrote:

>Tom,
>
>Any thoughts? This is really messing up query execution all across the
>database ...
>
>--Josh
>
>>Here is the stats = 100 version. Notice that n_distinct has gone down.
>>
>>[pg_stats output snipped]
Marko, > Sometimes, if the random number generator, that PostgreSQL uses, > isn't good enough, the randomly selected pages for the statistics > might not be random enough. > > Solaris is unknown to me. Maybe the used random number generator there > isn't good enough? Hmmm. Good point. Will have to test on Linux. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
> > Solaris is unknown to me. Maybe the used random number generator there
> > isn't good enough?
>
> Hmmm. Good point. Will have to test on Linux.

Nope. Linux 2.4.20:

test=# select tablename, attname, n_distinct from pg_stats
       where tablename = 'web_site_activity_fa';
      tablename       |       attname       | n_distinct
----------------------+---------------------+------------
 web_site_activity_fa | session_id          |     626127

test=# select count(distinct session_id) from web_site_activity_fa;
  count
---------
 3174813
(1 row)

... I think the problem is in our heuristic sampling code. I'm not the first person to have this kind of a problem. Will be following up with tests ...

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:

> ... I think the problem is in our heuristic sampling code. I'm not the first
> person to have this kind of a problem. Will be following up with tests ...

I looked into this a while back when we were talking about changing the sampling method. The conclusions were discouraging. Fundamentally, using constant sized samples of data for n_distinct is bogus. Constant sized samples only work for things like the histograms that can be analyzed through standard statistics population sampling, which depends on the law of large numbers.

n_distinct requires you to estimate how frequently very rare things occur. You can't apply the law of large numbers because even a single instance of a value out of a large pool alters the results disproportionately.

To get a valid estimate for n_distinct you would need to sample a fixed percentage of the table, not a fixed size sample. That just isn't practical. Moreover, I think the percentage would have to be quite large. Even if you sampled half the table I think the confidence interval would be quite wide.

The situation is worsened because it's unclear how to interpolate values for subsets of the table. If the histogram says you have a million records and you're adding a clause that has a selectivity of 50%, then half a million is a good guess. But if what you care about is n_distinct and you start with a million records with 1,000 distinct values and then apply a clause that filters 50% of them, how do you estimate the resulting n_distinct? 500 is clearly wrong, but 1,000 is wrong too. You could end up with anywhere from 0 to 1,000 and you have no good way to figure out where the truth lies.

So I fear this is fundamentally a hopeless situation. It's going to be difficult to consistently get good plans for any queries that depend on good estimates for n_distinct.

--
greg
Greg,

> I looked into this a while back when we were talking about changing the
> sampling method. The conclusions were discouraging. Fundamentally, using
> constant sized samples of data for n_distinct is bogus. Constant sized
> samples only work for things like the histograms that can be analyzed
> through standard statistics population sampling which depends on the law of
> large numbers.

Well, unusual distributions are certainly tough. But I think the problem exists even for relatively well-distributed populations. Part of it is, I believe, the formula we are using:

    n*d / (n - f1 + f1*n/N)

This is an estimation formula from Haas and Stokes in IBM Research Report RJ 10025, and is called the DUJ1 formula. (http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf) It appears to suck.

For example, take my table:

rows: 26 million (N)
distinct values: 3.4 million

I took a random sample of 1000 rows (n) from that table. It contained:

968 values that occurred only once (f1)
981 distinct values (d)

Any human being looking at that sample would assume a large number of distinct values; after all, 96.8% of the values occurred only once. But the formula gives us:

    1000*981 / ( 1000 - 968 + ( 968 * 1000/26000000 ) ) = 30620

This is obviously dramatically wrong, by a factor of 100. The math gets worse as the sample size goes down. A sample of 250 rows, with 248 distinct values, 246 of them appearing only once:

    250*248 / ( 250 - 246 + ( 246 * 250 / 26000000 ) ) = 15490

Even in a case with an overwhelming majority of unique values, the formula gets it wrong. 999 distinct values in the sample, 998 of them appearing only once:

    1000*999 / ( 1000 - 998 + ( 998 * 1000 / 26000000 ) ) = 490093

This means that, with a sample size of 1000, this formula can never credit a table of 26 million rows with more than about half a million distinct values, unless the column is a unique column.

Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given sample, rather than the *median* one. This is contrary to the notes in analyze.c, which seem to think that we're *overestimating* n_distinct.

This formula appears broken everywhere. Another table:

Table: 969,000 rows
Distinct values: 374,000
Sample size: 1000
Distinct values in sample: 938
Values appearing only once: 918

    1000*938 / ( 1000 - 918 + ( 918 * 1000 / 969000 ) ) = 11308

Again, too small, this time by a factor of more than 30x.

This is so broken, in fact, that I'm wondering if we've read the paper right? I've perused the paper on the Almaden site, and the DUJ1 formula appears considerably more complex than the formula we're using.

Can someone whose math is more recent than calculus in 1989 take a look at that paper, and look at the formula toward the bottom of page 10, and see if we are correctly interpreting it? I'm particularly confused as to what "q" and "d-sub-n" represent. Thanks!

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
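For anyone who wants to replay the numbers above, the DUJ1 estimator is easy to reproduce outside the backend. The following standalone C sketch is not the analyze.c code itself; it simply plugs the sample figures quoted in this thread into n*d / (n - f1 + f1*n/N):

    #include <stdio.h>

    /* DUJ1 estimator from Haas & Stokes, IBM Research Report RJ 10025:
     *   n  = sample size
     *   N  = number of rows in the table
     *   d  = distinct values observed in the sample
     *   f1 = values observed exactly once in the sample
     */
    static double
    duj1(double n, double N, double d, double f1)
    {
        return (n * d) / (n - f1 + f1 * n / N);
    }

    int
    main(void)
    {
        /* 26M-row table, 1000-row sample: d = 981, f1 = 968  -> ~30620  */
        printf("%.0f\n", duj1(1000, 26000000, 981, 968));
        /* same table, 250-row sample: d = 248, f1 = 246      -> ~15490  */
        printf("%.0f\n", duj1(250, 26000000, 248, 246));
        /* near-unique 1000-row sample: d = 999, f1 = 998     -> ~490093 */
        printf("%.0f\n", duj1(1000, 26000000, 999, 998));
        /* 969K-row table, 1000-row sample: d = 938, f1 = 918 -> ~11308  */
        printf("%.0f\n", duj1(1000, 969000, 938, 918));
        return 0;
    }

Against the true distinct counts quoted above (3.4 million and 374,000 respectively), the underestimation is easy to see.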
People, > Can someone whose math is more recent than calculus in 1989 take a look at > that paper, and look at the formula toward the bottom of page 10, and see > if we are correctly interpreting it? I'm particularly confused as to > what "q" and "d-sub-n" represent. Thanks! Actually, I managed to solve for these and it appears we are using the formula correctly. It's just a bad formula. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus said: > > > Well, unusual distributions are certainly tough. But I think the > problem exists even for relatively well-distributed populations. > Part of it is, I believe, the formula we are using: > > n*d / (n - f1 + f1*n/N) > [snip] > > This is so broken, in fact, that I'm wondering if we've read the paper > right? I've perused the paper on almaden, and the DUJ1 formula > appears considerably more complex than the formula we're using. > > Can someone whose math is more recent than calculus in 1989 take a look > at that paper, and look at the formula toward the bottom of page 10, > and see if we are correctly interpreting it? I'm particularly > confused as to what "q" and "d-sub-n" represent. Thanks! > Math not too recent ... I quickly read the paper and independently came up with the same formula you say above we are applying. The formula is on the page that is numbered 6, although it's the tenth page in the PDF. q = n/N = ratio of sample size to population size d_sub_n = d = number of distinct classes in sample cheers andrew
Josh Berkus <josh@agliodbs.com> writes: > Overall, our formula is inherently conservative of n_distinct. That is, I > believe that it is actually computing the *smallest* number of distinct > values which would reasonably produce the given sample, rather than the > *median* one. This is contrary to the notes in analyze.c, which seem to > think that we're *overestimating* n_distinct. Well, the notes are there because the early tests I ran on that formula did show it overestimating n_distinct more often than not. Greg is correct that this is inherently a hard problem :-( I have nothing against adopting a different formula, if you can find something with a comparable amount of math behind it ... but I fear it'd only shift the failure cases around. regards, tom lane
Here is my opinion. I hope this helps.

Maybe there is no one good formula. On a boolean type, there are at most three distinct values. There is an upper bound for forenames in one country. There is an upper bound for last names in one country. There is a fixed number of states and postal codes in one country. On the other hand, with a timestamp, every value could be distinct. A single-column primary key has only distinct values. If an integer column refers with a foreign key to another table's primary key, we could take advantage of that knowledge. A column with a unique index has only distinct values.

The first kind are classifying columns, and the second kind measure continuous or discrete time or something like it. The upper bound for a classifying column might be 3 (boolean), or it might be one million. The properties of the distribution might be hard to guess.

Here is one way:
1. Find out the number of distinct values for 500 rows.
2. Try to guess how many distinct values there are for 1000 rows. Then find out the real number of distinct values for 1000 rows.
3. If the guess and the reality differ by more than 50%, do the iteration for 2x1000 rows. Iterate, using powers of two to increase the sample, until you trust the estimate enough.

So, in phase two, you could try to guess with two distinct formulas: one for the classifying targets (boolean columns hit there), another one for timestamp and numerical values.

If there are one million classifications in one column, how can you find that out by any means other than checking at least two million rows? This means that the user should have a way to tell the lower bound for the number of rows for sampling.

Regards,
Marko Ristola

Tom Lane wrote:

>Josh Berkus <josh@agliodbs.com> writes:
>
>>Overall, our formula is inherently conservative of n_distinct. That is, I
>>believe that it is actually computing the *smallest* number of distinct
>>values which would reasonably produce the given sample, rather than the
>>*median* one. This is contrary to the notes in analyze.c, which seem to
>>think that we're *overestimating* n_distinct.
>
>Well, the notes are there because the early tests I ran on that formula
>did show it overestimating n_distinct more often than not. Greg is
>correct that this is inherently a hard problem :-(
>
>I have nothing against adopting a different formula, if you can find
>something with a comparable amount of math behind it ... but I fear
>it'd only shift the failure cases around.
>
>			regards, tom lane
Andrew,

> The math in the paper does not seem to look at very low levels of q (=
> sample to pop ratio).

Yes, I think that's the failing. Mind you, I did more testing and found out that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy (which I would consider acceptable) with a sample size of 25% or more (which is infeasible for any large table). The formula does work for populations where D/N is much lower, say 0.01. So overall it seems to work for only 1/4 of cases: those where n/N is large and D/N is low. And, annoyingly, that's probably the population where accurate estimation is least crucial, as it consists mostly of small tables.

I've just developed (not original, probably, but original to *me*) a formula that works on populations where n/N is very small and D/N is moderate (i.e. 0.1 to 0.4):

    N * (d/n)^(sqrt(N/n))

However, I've tested it only on (n/N < 0.005 and D/N > 0.1 and D/N < 0.4) populations, and only 3 of them to boot. I'd appreciate other people trying it on their own data populations, particularly very different ones, like D/N > 0.7 or D/N < 0.01.

Further, as Andrew points out, we presumably do page sampling rather than purely random sampling, so I should probably read the paper he referenced. Working on it now ....

--
Josh Berkus
Aglio Database Solutions
San Francisco
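For anyone who wants to try Josh's ad-hoc formula against their own tables, here is a small C sketch. It is only an illustration of the formula as stated above; you supply N, n and d from your own sample, and the sample figures used below are simply the 26M-row example from earlier in the thread:

    #include <math.h>
    #include <stdio.h>

    /* Josh's proposed estimator: N * (d/n)^sqrt(N/n)
     *   N = rows in the table, n = sample size, d = distinct values in sample.
     * Only claimed to behave for n/N < 0.005 and D/N roughly 0.1 .. 0.4.
     */
    static double
    josh_estimate(double N, double n, double d)
    {
        return N * pow(d / n, sqrt(N / n));
    }

    int
    main(void)
    {
        /* the 26M-row example upthread: n = 1000, d = 981 */
        printf("%.0f\n", josh_estimate(26000000, 1000, 981));
        return 0;
    }

On that example it comes out around 1.2 million, versus the true 3.4 million and DUJ1's 30620: still low, but within the 5x band Josh calls acceptable.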
Folks, > I wonder if this paper has anything that might help: > http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I > were more of a statistician I might be able to answer :-) Actually, that paper looks *really* promising. Does anyone here have enough math to solve for D(sub)Md on page 6? I'd like to test it on samples of < 0.01%. Tom, how does our heuristic sampling work? Is it pure random sampling, or page sampling? -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > Tom, how does our heuristic sampling work? Is it pure random sampling, or > page sampling? Manfred probably remembers better than I do, but I think the idea is to approximate pure random sampling as best we can without actually examining every page of the table. regards, tom lane
On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote:

Greg Stark wrote
> > I looked into this a while back when we were talking about changing the
> > sampling method. The conclusions were discouraging. Fundamentally, using
> > constant sized samples of data for n_distinct is bogus. Constant sized
> > samples only work for things like the histograms that can be analyzed
> > through standard statistics population sampling which depends on the law of
> > large numbers.

ISTM Greg's comments are correct. There is no way to calculate this with consistent accuracy when using a constant sized sample. (If there were, then people wouldn't bother to hold large databases...)

> Overall, our formula is inherently conservative of n_distinct. That is, I
> believe that it is actually computing the *smallest* number of distinct
> values which would reasonably produce the given sample, rather than the
> *median* one. This is contrary to the notes in analyze.c, which seem to
> think that we're *overestimating* n_distinct.

The only information you can determine from a sample is the smallest number of distinct values that would reasonably produce the given sample. There is no meaningful concept of a median one. (You do have an upper bound: the number of rows in the table, but I cannot see any meaning in taking (Nrows + estimated n_distinct)/2.) Even if you use Zipf's Law to predict the frequency of occurrence, you'd still need to estimate the parameters for the distribution.

Most other RDBMSs make optimizer statistics collection an unsampled scan. Some offer this as one of their options, as well as the ability to define the sample size in terms of a fixed number of rows or a fixed proportion of the table.

My suggested hack for PostgreSQL is to have an option to *not* sample, just to scan the whole table and find n_distinct accurately. Then anybody who doesn't like the estimated statistics has a clear path to take.

The problem of poorly derived base statistics is a difficult one. When considering join estimates we already go to the trouble of including MFV comparisons to ensure an upper bound on join selectivity is known. If the statistics derived are themselves inaccurate, the error propagation touches every other calculation in the optimizer. GIGO.

What price a single scan of a table, however large, when incorrect statistics could force scans and sorts to occur when they aren't actually needed?

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > My suggested hack for PostgreSQL is to have an option to *not* sample, > just to scan the whole table and find n_distinct accurately. > ... > What price a single scan of a table, however large, when incorrect > statistics could force scans and sorts to occur when they aren't > actually needed ? It's not just the scan --- you also have to sort, or something like that, if you want to count distinct values. I doubt anyone is really going to consider this a feasible answer for large tables. regards, tom lane
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > My suggested hack for PostgreSQL is to have an option to *not* sample, > > just to scan the whole table and find n_distinct accurately. > > ... > > What price a single scan of a table, however large, when incorrect > > statistics could force scans and sorts to occur when they aren't > > actually needed ? > > It's not just the scan --- you also have to sort, or something like > that, if you want to count distinct values. I doubt anyone is really > going to consider this a feasible answer for large tables. Assuming you don't use the HashAgg plan, which seems very appropriate for the task? (...but I understand the plan otherwise). If that was the issue, then why not keep scanning until you've used up maintenance_work_mem with hash buckets, then stop and report the result. The problem is if you don't do the sort once for statistics collection you might accidentally choose plans that force sorts on that table. I'd rather do it once... The other alternative is to allow an ALTER TABLE command to set statistics manually, but I think I can guess what you'll say to that! Best Regards, Simon Riggs
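The "hash buckets until you run out of maintenance_work_mem, then stop and report" idea can be sketched in a few lines of C. The program below is only an illustration of the approach over a single int8 column such as session_id (nothing resembling the executor's HashAgg code, and the fixed NBUCKETS stands in for a real memory limit); it counts distinct values exactly until the table fills up, then stops and reports what it has seen, which is what Simon describes:

    #include <stdint.h>
    #include <stdio.h>

    #define NBUCKETS (1 << 20)           /* stand-in for maintenance_work_mem */

    static uint64_t buckets[NBUCKETS];   /* 0 marks an empty slot, so keys
                                          * are assumed to be nonzero        */

    /* Open-addressed insert; returns 1 if the key was new, 0 if already
     * seen, -1 if the table is full (i.e. we have "run out of memory").
     * A real implementation would give up at a fill factor well below 100%. */
    static int
    hash_insert(uint64_t key)
    {
        uint64_t h = key * 0x9e3779b97f4a7c15ULL;   /* cheap integer hash */
        for (size_t probes = 0; probes < NBUCKETS; probes++)
        {
            size_t slot = (size_t) ((h + probes) % NBUCKETS);
            if (buckets[slot] == key)
                return 0;
            if (buckets[slot] == 0)
            {
                buckets[slot] = key;
                return 1;
            }
        }
        return -1;
    }

    int
    main(void)
    {
        unsigned long long key;
        long ndistinct = 0;

        /* read one key per line, e.g. a dump of the column being analyzed */
        while (scanf("%llu", &key) == 1)
        {
            int r = hash_insert((uint64_t) key);
            if (r < 0)
                break;                   /* memory cap reached: stop here */
            ndistinct += r;
        }
        printf("distinct values seen: %ld\n", ndistinct);
        return 0;
    }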
Simon, Tom:

While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem.

Setting up these samples as a % of data pages, rather than a pure random sort, makes this more feasible; for example, a 70GB table would only need to sample about 9000 data pages (or 70MB). Of course, larger samples would lead to better accuracy, and this could be set through a revised GUC (i.e., maximum_sample_size, minimum_sample_size).

I just need a little help doing the math ... please?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Guys, > While it's not possible to get accurate estimates from a fixed size sample, > I think it would be possible from a small but scalable sample: say, 0.1% of > all data pages on large tables, up to the limit of maintenance_work_mem. BTW, when I say "accurate estimates" here, I'm talking about "accurate enough for planner purposes" which in my experience is a range between 0.2x to 5x. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Simon Riggs <simon@2ndquadrant.com> writes: > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: >> It's not just the scan --- you also have to sort, or something like >> that, if you want to count distinct values. I doubt anyone is really >> going to consider this a feasible answer for large tables. > Assuming you don't use the HashAgg plan, which seems very appropriate > for the task? (...but I understand the plan otherwise). The context here is a case with a very large number of distinct values... keep in mind also that we have to do this for *all* the columns of the table. A full-table scan for each column seems right out to me. regards, tom lane
On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Overall, our formula is inherently conservative of n_distinct. That is, I
> > believe that it is actually computing the *smallest* number of distinct
> > values which would reasonably produce the given sample, rather than the
> > *median* one. This is contrary to the notes in analyze.c, which seem to
> > think that we're *overestimating* n_distinct.
>
> Well, the notes are there because the early tests I ran on that formula
> did show it overestimating n_distinct more often than not. Greg is
> correct that this is inherently a hard problem :-(
>
> I have nothing against adopting a different formula, if you can find
> something with a comparable amount of math behind it ... but I fear
> it'd only shift the failure cases around.

Perhaps the formula is not actually being applied? The code looks like this...

    if (nmultiple == 0)
    {
        /* If we found no repeated values, assume it's a unique column */
        stats->stadistinct = -1.0;
    }
    else if (toowide_cnt == 0 && nmultiple == ndistinct)
    {
        /*
         * Every value in the sample appeared more than once.  Assume
         * the column has just these values.
         */
        stats->stadistinct = ndistinct;
    }
    else
    {
        /*----------
         * Estimate the number of distinct values using the estimator
         * proposed by Haas and Stokes in IBM Research Report RJ 10025:

The middle chunk of code looks to me like if we find a distribution where values all occur at least twice, then we won't bother to apply the Haas and Stokes equation. That type of frequency distribution would be very common in a set of values with very high ndistinct, especially when sampled.

The comment

    * Every value in the sample appeared more than once.  Assume
    * the column has just these values.

doesn't seem to apply when using larger samples, as Josh is using.

Looking at Josh's application it does seem likely that when taking a sample, all site visitors clicked more than once during their session, especially if they include home page, adverts, images etc for each page.

Could it be that we have overlooked this simple explanation and that the Haas and Stokes equation is actually quite good, but just not being applied?

Best Regards, Simon Riggs
Simon, > Could it be that we have overlooked this simple explanation and that the > Haas and Stokes equation is actually quite good, but just not being > applied? That's probably part of it, but I've tried Haas and Stokes on a pure random sample and it's still bad, or more specifically overly conservative. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: > >> It's not just the scan --- you also have to sort, or something like > >> that, if you want to count distinct values. I doubt anyone is really > >> going to consider this a feasible answer for large tables. > > > Assuming you don't use the HashAgg plan, which seems very appropriate > > for the task? (...but I understand the plan otherwise). > > The context here is a case with a very large number of distinct > values... Yes, but is there another way of doing this other than sampling a larger proportion of the table? I don't like that answer either, for the reasons you give. The manual doesn't actually say this, but you can already alter the sample size by setting one of the statistics targets higher, but all of those samples are fixed sample sizes, not a proportion of the table itself. It seems reasonable to allow an option to scan a higher proportion of the table. (It would be even better if you could say "keep going until you run out of memory, then stop", to avoid needing to have an external sort mode added to ANALYZE). Oracle and DB2 allow a proportion of the table to be specified as a sample size during statistics collection. IBM seem to be ignoring their own research note on estimating ndistinct... > keep in mind also that we have to do this for *all* the > columns of the table. You can collect stats for individual columns. You need only use an option to increase sample size when required. Also, if you have a large table and the performance of ANALYZE worries you, set some fields to 0. Perhaps that should be the default setting for very long text columns, since analyzing those doesn't help much (usually) and takes ages. (I'm aware we already don't analyze var length column values > 1024 bytes). > A full-table scan for each column seems > right out to me. Some systems analyze multiple columns simultaneously. Best Regards, Simon Riggs
Hi everybody!

Perhaps the following papers are relevant to the discussion here (their contact authors have been cc'd):

1. The following proposes effective algorithms for using block-level sampling for n_distinct estimation:

   "Effective use of block-level sampling in statistics estimation"
   by Chaudhuri, Das and Srivastava, SIGMOD 2004.
   http://www-db.stanford.edu/~usriv/papers/block-sampling.pdf

2. In a single scan, it is possible to estimate n_distinct by using a very simple algorithm:

   "Distinct sampling for highly-accurate answers to distinct value queries and event reports"
   by Gibbons, VLDB 2001.
   http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf

3. In fact, Gibbons' basic idea has been extended to "sliding windows" (this extension is useful in streaming systems like Aurora / Stream):

   "Distributed streams algorithms for sliding windows"
   by Gibbons and Tirthapura, SPAA 2002.
   http://home.eng.iastate.edu/~snt/research/tocs.pdf

Thanks,
Gurmeet

----------------------------------------------------
Gurmeet Singh Manku                  Google Inc.
http://www.cs.stanford.edu/~manku    (650) 967 1890
----------------------------------------------------
Tom Lane wrote: >Josh Berkus <josh@agliodbs.com> writes: > > >>Overall, our formula is inherently conservative of n_distinct. That is, I >>believe that it is actually computing the *smallest* number of distinct >>values which would reasonably produce the given sample, rather than the >>*median* one. This is contrary to the notes in analyze.c, which seem to >>think that we're *overestimating* n_distinct. >> >> > >Well, the notes are there because the early tests I ran on that formula >did show it overestimating n_distinct more often than not. Greg is >correct that this is inherently a hard problem :-( > >I have nothing against adopting a different formula, if you can find >something with a comparable amount of math behind it ... but I fear >it'd only shift the failure cases around. > > > > The math in the paper does not seem to look at very low levels of q (= sample to pop ratio). The formula has a range of [d,N]. It appears intuitively (i.e. I have not done any analysis) that at very low levels of q, as f1 moves down from n, the formula moves down from N towards d very rapidly. I did a test based on the l_comments field in a TPC lineitems table. The test set has N = 6001215, D = 2921877. In my random sample of 1000 I got d = 976 and f1 = 961, for a DUJ1 figure of 24923, which is too low by 2 orders of magnitude. I wonder if this paper has anything that might help: http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I were more of a statistician I might be able to answer :-) cheers andrew
Josh Berkus wrote: >Simon, Tom: > >While it's not possible to get accurate estimates from a fixed size sample, I >think it would be possible from a small but scalable sample: say, 0.1% of all >data pages on large tables, up to the limit of maintenance_work_mem. > >Setting up these samples as a % of data pages, rather than a pure random sort, >makes this more feasable; for example, a 70GB table would only need to sample >about 9000 data pages (or 70MB). Of course, larger samples would lead to >better accuracy, and this could be set through a revised GUC (i.e., >maximum_sample_size, minimum_sample_size). > >I just need a little help doing the math ... please? > > After some more experimentation, I'm wondering about some sort of adaptive algorithm, a bit along the lines suggested by Marko Ristola, but limited to 2 rounds. The idea would be that we take a sample (either of fixed size, or some small proportion of the table) , see how well it fits a larger sample (say a few times the size of the first sample), and then adjust the formula accordingly to project from the larger sample the estimate for the full population. Math not worked out yet - I think we want to ensure that the result remains bounded by [d,N]. cheers andrew
Simon Riggs wrote:

>The comment
>    * Every value in the sample appeared more than once.  Assume
>    * the column has just these values.
>doesn't seem to apply when using larger samples, as Josh is using.
>
>Looking at Josh's application it does seem likely that when taking a
>sample, all site visitors clicked more than once during their session,
>especially if they include home page, adverts, images etc for each page.
>
>Could it be that we have overlooked this simple explanation and that the
>Haas and Stokes equation is actually quite good, but just not being
>applied?

No, it is being applied. If every value in the sample appears more than once, then f1 in the formula is 0, and the result is then just d, the number of distinct values in the sample.

cheers

andrew
Quoting Andrew Dunstan <andrew@dunslane.net>:

> After some more experimentation, I'm wondering about some sort of
> adaptive algorithm, a bit along the lines suggested by Marko Ristola,
> but limited to 2 rounds.
>
> The idea would be that we take a sample (either of fixed size, or some
> small proportion of the table), see how well it fits a larger sample
> (say a few times the size of the first sample), and then adjust the
> formula accordingly to project from the larger sample the estimate for
> the full population. Math not worked out yet - I think we want to
> ensure that the result remains bounded by [d,N].

Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying to extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than just taking the larger sample on its own; GIGO. Whether they are consistent with one another has no relationship to whether the larger sample correlates with the whole population. You can think of the tiny sample like "anecdotal" evidence for wonderdrugs.

--
"Dreams come true, not free." -- S.Sondheim, ITW
On Tue, 2005-04-26 at 15:00 -0700, Gurmeet Manku wrote: > 2. In a single scan, it is possible to estimate n_distinct by using > a very simple algorithm: > > "Distinct sampling for highly-accurate answers to distinct value > queries and event reports" by Gibbons, VLDB 2001. > > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf That looks like the one... ...though it looks like some more complex changes to the current algorithm to use it, and we want the other stats as well... > 3. In fact, Gibbon's basic idea has been extended to "sliding windows" > (this extension is useful in streaming systems like Aurora / Stream): > > "Distributed streams algorithms for sliding windows" > by Gibbons and Tirthapura, SPAA 2002. > > http://home.eng.iastate.edu/~snt/research/tocs.pdf > ...and this offers the possibility of calculating statistics at load time, as part of the COPY command Best Regards, Simon Riggs
Mischa,

> > Perhaps I can save you some time (yes, I have a degree in Math). If I
> > understand correctly, you're trying to extrapolate from the correlation
> > between a tiny sample and a larger sample. Introducing the tiny sample
> > into any decision can only produce a less accurate result than just
> > taking the larger sample on its own; GIGO. Whether they are consistent
> > with one another has no relationship to whether the larger sample
> > correlates with the whole population. You can think of the tiny sample
> > like "anecdotal" evidence for wonderdrugs.

Actually, it's more to characterize how large a sample we need. For example, if we sample 0.005 of disk pages and get an estimate, and then sample another 0.005 of disk pages and get an estimate which is not even close to the first estimate, then we have an idea that this is a table which defies analysis based on small samples. Whereas if the two estimates are < 1.0 stdev apart, we can have good confidence that the table is easily estimated. Note that this doesn't require progressively larger samples; any two samples would work.

> I'm with Tom though in being very wary of solutions that require even
> one-off whole table scans. Maybe we need an additional per-table
> statistics setting which could specify the sample size, either as an
> absolute number or as a percentage of the table. It certainly seems that
> where D/N ~ 0.3, the estimates on very large tables at least are way way
> out.

Oh, I think there are several other cases where estimates are way out. Basically the estimation method we have doesn't work for samples smaller than 0.10.

> Or maybe we need to support more than one estimation method.

Yes, actually. We need 3 different estimation methods:
1 for tables where we can sample a large % of pages (say, >= 0.1)
1 for tables where we sample a small % of pages but are "easily estimated"
1 for tables which are not easily estimated but we can't afford to sample a large % of pages

If we're doing sampling-based estimation, I really don't want people to lose sight of the fact that page-based random sampling is much less expensive than row-based random sampling. We should really be focusing on methods which are page-based.

--
Josh Berkus
Aglio Database Solutions
San Francisco
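Josh's two-sample consistency check is easy to prototype outside the backend. The C sketch below is purely illustrative: estimate() just reuses the DUJ1 formula from earlier in the thread, the (n, d, f1) figures for the two samples are made-up placeholders rather than real measurements, and a simple relative tolerance stands in for Josh's "< 1.0 stdev apart" criterion:

    #include <math.h>
    #include <stdio.h>

    /* Stand-in estimator; any n_distinct estimator could be plugged in here.
     * This one is just DUJ1 again. */
    static double
    estimate(double n, double N, double d, double f1)
    {
        return (n * d) / (n - f1 + f1 * n / N);
    }

    /* Return 1 if two independent small samples give estimates that agree
     * within `tolerance` (as a fraction of their mean), i.e. the table
     * looks "easily estimated"; 0 if it defies small-sample analysis. */
    static int
    consistent(double est1, double est2, double tolerance)
    {
        double mean = (est1 + est2) / 2.0;
        return fabs(est1 - est2) <= tolerance * mean;
    }

    int
    main(void)
    {
        double N = 26000000;                    /* rows in the table */
        /* two hypothetical 0.005 samples reduced to (n, d, f1) */
        double est1 = estimate(130000, N, 93000, 72000);
        double est2 = estimate(130000, N, 61000, 35000);

        printf("est1=%.0f est2=%.0f -> %s\n", est1, est2,
               consistent(est1, est2, 0.25) ? "easily estimated"
                                            : "needs a bigger sample");
        return 0;
    }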
Quoting Josh Berkus <josh@agliodbs.com>: > > >Perhaps I can save you some time (yes, I have a degree in Math). If I > > >understand correctly, you're trying extrapolate from the correlation > > >between a tiny sample and a larger sample. Introducing the tiny sample > > >into any decision can only produce a less accurate result than just > > >taking the larger sample on its own; GIGO. Whether they are consistent > > >with one another has no relationship to whether the larger sample > > >correlates with the whole population. You can think of the tiny sample > > >like "anecdotal" evidence for wonderdrugs. > > Actually, it's more to characterize how large of a sample we need. For > example, if we sample 0.005 of disk pages, and get an estimate, and then > sample another 0.005 of disk pages and get an estimate which is not even > close to the first estimate, then we have an idea that this is a table which > defies analysis based on small samples. Wheras if the two estimates are < > 1.0 stdev apart, we can have good confidence that the table is easily > estimated. Note that this doesn't require progressively larger samples; any > two samples would work. We're sort of wandering away from the area where words are a good way to describe the problem. Lacking a common scratchpad to work with, could I suggest you talk to someone you consider has a background in stats, and have them draw for you why this doesn't work? About all you can get out of it is, if the two samples are disjunct by a stddev, yes, you've demonstrated that the union of the two populations has a larger stddev than either of them; but your two stddevs are less info than the stddev of the whole. Breaking your sample into two (or three, or four, ...) arbitrary pieces and looking at their stddevs just doesn't tell you any more than what you start with. -- "Dreams come true, not free." -- S.Sondheim, ITW
Mischa Sandberg wrote: > >Perhaps I can save you some time (yes, I have a degree in Math). If I >understand correctly, you're trying extrapolate from the correlation >between a tiny sample and a larger sample. Introducing the tiny sample >into any decision can only produce a less accurate result than just >taking the larger sample on its own; GIGO. Whether they are consistent >with one another has no relationship to whether the larger sample >correlates with the whole population. You can think of the tiny sample >like "anecdotal" evidence for wonderdrugs. > > > Ok, good point. I'm with Tom though in being very wary of solutions that require even one-off whole table scans. Maybe we need an additional per-table statistics setting which could specify the sample size, either as an absolute number or as a percentage of the table. It certainly seems that where D/N ~ 0.3, the estimates on very large tables at least are way way out. Or maybe we need to support more than one estimation method. Or both ;-) cheers andrew
Well, this guy has it nailed. He cites Flajolet and Martin, which was (I thought) as good as you could get with only a reasonable amount of memory per statistic. Unfortunately, their hash table is a one-shot deal; there's no way to maintain it once the table changes. His incremental update doesn't degrade as the table changes. If there isn't the same wrangle of patent as with the ARC algorithm, and if the existing stats collector process can stand the extra traffic, then this one is a winner. Many thanks to the person who posted this reference in the first place; so sorry I canned your posting and can't recall your name. Now, if we can come up with something better than the ARC algorithm ...
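For the curious, the core of the distinct-sampling idea fits in a few dozen lines. The C sketch below is a from-memory paraphrase of the general technique rather than the paper's exact algorithm: keep only the values whose hash passes a progressively stricter filter, and scale the sample count back up by how much the filter has tightened. The linear-scan lookup is just for brevity; a real implementation would hash the sample.

    #include <stdint.h>
    #include <stdio.h>

    #define CAPACITY 1024                 /* bounded memory for the sample */

    static uint64_t sample[CAPACITY];
    static int      nsample = 0;
    static int      level = 0;    /* a value is kept iff the low `level`
                                   * bits of its hash are all zero        */

    static uint64_t
    hash64(uint64_t x)                    /* any decent 64-bit mixer will do */
    {
        x ^= x >> 33;  x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33;  x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33;
        return x;
    }

    static int
    in_sample(uint64_t v)
    {
        for (int i = 0; i < nsample; i++)
            if (sample[i] == v)
                return 1;
        return 0;
    }

    static void
    add_value(uint64_t v)
    {
        uint64_t mask = (1ULL << level) - 1;

        if ((hash64(v) & mask) != 0 || in_sample(v))
            return;                       /* filtered out, or already sampled */
        sample[nsample++] = v;
        while (nsample >= CAPACITY)       /* overflow: raise the bar, evict */
        {
            int kept = 0;
            level++;
            mask = (1ULL << level) - 1;
            for (int i = 0; i < nsample; i++)
                if ((hash64(sample[i]) & mask) == 0)
                    sample[kept++] = sample[i];
            nsample = kept;
        }
    }

    int
    main(void)
    {
        unsigned long long v;

        while (scanf("%llu", &v) == 1)    /* one value per line on stdin */
            add_value((uint64_t) v);
        /* each sampled value stands for roughly 2^level distinct values */
        printf("estimated n_distinct: %.0f\n",
               nsample * (double) (1ULL << level));
        return 0;
    }

Each surviving value represents about 2^level distinct values seen so far, which is why a single pass with bounded memory is enough, and why the structure can keep being updated as new rows arrive.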
> Now, if we can come up with something better than the ARC algorithm ... Tom already did. His clock-sweep patch is already in the 8.1 source. -- Josh Berkus Aglio Database Solutions San Francisco
Actually, the earliest paper that solves the distinct_n estimation problem in 1 pass is the following: "Estimating simple functions on the union of data streams" by Gibbons and Tirthapura, SPAA 2001. http://home.eng.iastate.edu/~snt/research/streaming.pdf The above paper addresses a more difficult problem (1 pass _and_ a distributed setting). Gibbon's followup paper in VLDB 2001 limits the problem to a single machine and contains primarily experimental results (for a database audience). The algorithmic breakthrough had already been accomplished in the SPAA paper. Gurmeet -- ---------------------------------------------------- Gurmeet Singh Manku Google Inc. http://www.cs.stanford.edu/~manku (650) 967 1890 ----------------------------------------------------
Hi, Josh,

Josh Berkus wrote:

> Yes, actually. We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are "easily estimated"
> 1 for tables which are not easily estimated but we can't afford to sample a
> large % of pages
>
> If we're doing sampling-based estimation, I really don't want people to lose
> sight of the fact that page-based random sampling is much less expensive than
> row-based random sampling. We should really be focusing on methods which
> are page-based.

Would it make sense to have a sample method that scans indices? I think that, at least for tree-based indices (btree, gist), rather good estimates could be derived. And the presence of a unique index should lead to a 100% distinct values estimation without any scan at all.

Markus
Hello Everybody, We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the foreign key constraints compile noticeably faster. In 7.3 the constraints would typically take more than an hour to run on our production data. Now they take a minute or two. Can anybody explain such a major performance improvement ? Thanks -- Ashish Arte Open Sky Software
Ashish Arte <ashish@openskysoftware.com> writes: > We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the > foreign key constraints compile noticeably faster. In 7.3 the > constraints would typically take more than an hour to run on our > production data. Now they take a minute or two. > Can anybody explain such a major performance improvement ? Hey, we do do some work on this thing from time to time ;-) Probably you are talking about this: 2003-10-06 12:38 tgl * src/: backend/commands/tablecmds.c, backend/utils/adt/ri_triggers.c, include/commands/trigger.h: During ALTER TABLE ADD FOREIGN KEY, try to check the existing rows using a single LEFT JOIN query instead of firing the check trigger for each row individually. Stephan Szabo, with some kibitzing from Tom Lane and Jan Wieck. regards, tom lane
Quoting Markus Schaber <schabi@logix-tt.com>:

> Hi, Josh,
>
> Josh Berkus wrote:
>
> > Yes, actually. We need 3 different estimation methods:
> > 1 for tables where we can sample a large % of pages (say, >= 0.1)
> > 1 for tables where we sample a small % of pages but are "easily estimated"
> > 1 for tables which are not easily estimated but we can't afford to sample a
> > large % of pages
> >
> > If we're doing sampling-based estimation, I really don't want people to
> > lose sight of the fact that page-based random sampling is much less
> > expensive than row-based random sampling. We should really be focusing
> > on methods which are page-based.

Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-)

How about applying the distinct-sampling filter on a small extra data stream to the stats collector?

--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
Mischa,

> Okay, although given the track record of page-based sampling for
> n-distinct, it's a bit like looking for your keys under the streetlight,
> rather than in the alley where you dropped them :-)

Bad analogy, but funny.

The issue with page-based vs. pure random sampling is that sampling, for example, 10% of rows purely at random would actually mean loading 50% of the pages. With 20% of rows, you might as well scan the whole table.

Unless, of course, we use indexes for sampling, which seems like a *really good* idea to me ....

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
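The row-sample-versus-pages-touched relationship Josh is pointing at is easy to quantify. If rows are spread evenly with r rows per page, a random row sample of fraction f is expected to touch about a 1 - (1 - f)^r fraction of the pages. The C sketch below just tabulates that expectation; the rows-per-page figures are arbitrary examples, not anything measured from the tables in this thread:

    #include <math.h>
    #include <stdio.h>

    /* Expected fraction of pages touched when sampling a fraction `f` of
     * rows uniformly at random, assuming `r` rows per page, evenly spread. */
    static double
    pages_touched(double f, double r)
    {
        return 1.0 - pow(1.0 - f, r);
    }

    int
    main(void)
    {
        double fractions[]     = {0.01, 0.05, 0.10, 0.20};
        double rows_per_page[] = {5, 20, 100};

        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 3; j++)
                printf("sample %4.0f%% of rows, %3.0f rows/page -> ~%3.0f%% of pages\n",
                       fractions[i] * 100, rows_per_page[j],
                       pages_touched(fractions[i], rows_per_page[j]) * 100);
        return 0;
    }

With only a handful of rows per page, the 50% figure Josh quotes is about right; with the more typical dozens of rows per page, a 10% row sample already touches nearly every page, which is the point.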
Josh Berkus wrote:
> Mischa,
>
>> Okay, although given the track record of page-based sampling for
>> n-distinct, it's a bit like looking for your keys under the streetlight,
>> rather than in the alley where you dropped them :-)
>
> Bad analogy, but funny.
>
> The issue with page-based vs. pure random sampling is that sampling, for
> example, 10% of rows purely at random would actually mean loading 50% of
> the pages. With 20% of rows, you might as well scan the whole table.
>
> Unless, of course, we use indexes for sampling, which seems like a *really
> good* idea to me ....

But doesn't an index only sample one column at a time, whereas with page-based sampling, you can sample all of the columns at once? And not all columns would have indexes, though it could be assumed that if a column doesn't have an index, then it doesn't matter as much for calculations such as n_distinct. But if you had 5 indexed columns in your table, then doing it index-wise means you would have to make 5 passes instead of just one.

Though I agree that page-based sampling is important for performance reasons.

John
=:->
John, > But doesn't an index only sample one column at a time, whereas with > page-based sampling, you can sample all of the columns at once. Hmmm. Yeah, we're not currently doing that though. Another good idea ... -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Quoting Josh Berkus <josh@agliodbs.com>:

> Mischa,
>
> > Okay, although given the track record of page-based sampling for
> > n-distinct, it's a bit like looking for your keys under the streetlight,
> > rather than in the alley where you dropped them :-)
>
> Bad analogy, but funny.

Bad analogy? Page-sampling effort versus row-sampling effort, c'est moot. It's not good enough for stats to produce good behaviour on the average. Straight random sampling, page or row, is going to cause plenty of untrustworthy engine behaviour for any percentages small enough to allow sampling from scratch at any time.

I'm curious what the problem is with relying on a start-up plus incremental method, when the method in the distinct-sampling paper doesn't degenerate: you can start when the table is still empty. Constructing an index requires an initial full scan plus incremental update; what's the diff?

> Unless, of course, we use indexes for sampling, which seems like a *really
> good* idea to me ....

"Distinct-sampling" applies for indexes, too.

I started tracking this discussion a bit late. A smart method for this is in VLDB'92: Gennady Antoshenkov, "Random Sampling from Pseudo-ranked B+-trees". I don't think this is online anywhere, except if you have a DBLP membership. Does anybody else know better? Antoshenkov was the brains behind some of the really cool stuff in DEC Rdb (what eventually became Oracle Rdb): compressed bitmap indices, parallel competing query plans, and smart handling of keys with hyperbolic distributions.

--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.