Thread: benchmarking the query planner (was Re: Simple postgresql.conf wizard)
Sorry for top posting but we are getting a bit far afield from the original topic. I followed up on the tests I did last night:

http://archives.postgresql.org/pgsql-hackers/2008-12/msg00369.php

I benchmarked 100 iterations of EXPLAIN on the query Greg Stark put together as a synthetic benchmark for default_statistics_target with various values for "SET STATISTICS n". Testing was done on CVS HEAD on my laptop with no configure options other than --prefix. Then I did this, to disable compression on pg_statistic:

alter table pg_statistic alter column stanumbers1 set storage external;
alter table pg_statistic alter column stanumbers2 set storage external;
alter table pg_statistic alter column stanumbers3 set storage external;
alter table pg_statistic alter column stanumbers4 set storage external;
alter table pg_statistic alter column stavalues1 set storage external;
alter table pg_statistic alter column stavalues2 set storage external;
alter table pg_statistic alter column stavalues3 set storage external;
alter table pg_statistic alter column stavalues4 set storage external;

(Note that you'll need to put allow_system_table_mods=true in your postgresql.conf file if you want this to work.) Then I reran the tests. The results were pretty dramatic. In the table below, the first column is the value of "SET STATISTICS n" that was applied to the table column prior to analyzing it. The second column is the time required to plan the query 100x AFTER disabling compression on pg_statistic, and the third column is the time required to plan the query 100x BEFORE disabling compression on pg_statistic.

10 0.829202 0.8249
20 1.059976 1.06957
30 1.168727 1.143803
40 1.287189 1.263252
50 1.370167 1.363951
60 1.486589 1.460464
70 1.603899 1.571107
80 1.69402 1.689651
90 1.79068 1.804454
100 1.930877 2.803941
150 2.446471 4.833002
200 2.95323 6.217708
250 3.436741 7.507919
300 3.983568 8.895015
350 4.497475 10.201713
400 5.072471 11.576961
450 5.615272 12.933128
500 6.286358 14.408157
550 6.895951 15.745378
600 7.400134 17.192916
650 8.038159 18.568616
700 8.606704 20.025952
750 9.154889 21.45775
800 9.80953 22.74635
850 10.363471 24.057379
900 11.022348 25.559911
950 11.69732 27.021034
1000 12.266699 28.711027

As you can see, for default_statistics_target > 90, this is a HUGE win.

After doing this test, I rebuilt with --enable-profiling and profiled EXPLAIN 10x with SET STATISTICS 10, 70, 100, 200 with a vanilla configuration, and then 200 again with compression turned off as described above. The, ahem, ridiculously small limit on attachment size prevents me from attaching the full results, so please see the attached results, which are truncated after the first section. 10x doesn't seem to be quite enough to get the exact picture of where the bottlenecks are, but the overall picture is clear enough: decompression introduces a huge overhead.

Looking specifically at the 200-decompress output, the next biggest hit is AllocSetAlloc(), which, from the detailed results that I unfortunately can't include, is being called mostly by datumCopy(), which is being called mostly by get_attstatsslot(). There are 4000 calls to get_attstatsslot() which result in 701,500 calls to datumCopy().

I'm not too sure what any of this means in terms of optimization, other than that changing the storage type of pg_statistic columns to external looks like a huge win. Perhaps someone more knowledgeable than I has some thoughts.

...Robert
Attachment: gmon-summary.tbz
That might only be the case when the pg_statistic record is in shared buffers.

Also I wonder if eqjoinsel and company might need to be made more toast-aware by detoasting all the things it needs once rather than every time it accesses them.

greg

On 6 Dec 2008, at 06:19 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

> Sorry for top posting but we are getting a bit far afield from the
> original topic. I followed up on the tests I did last night:
>
> http://archives.postgresql.org/pgsql-hackers/2008-12/msg00369.php
>
> [...]
Greg Stark <greg.stark@enterprisedb.com> writes:
> That might only be the case when the pg_statistic record is in shared
> buffers.

Yeah, it seems unlikely that disabling compression is a good idea in the bigger scheme of things.

> Also I wonder if eqjoinsel and company might need to be made more
> toast-aware by detoasting all the things it needs once rather than
> every time it accesses them.

At least in this example, the same stats arrays are fetched multiple times per query. It might be worth teaching the planner to cache those results internally during a plan cycle --- this would avoid most of the complexity associated with caching (ie, figuring out when to clear the cache) and still get maybe a 2x or 3x win.

regards, tom lane
On Mon, Dec 8, 2008 at 11:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <greg.stark@enterprisedb.com> writes:
>> That might only be the case when the pg_statistic record is in shared
>> buffers.
>
> Yeah, it seems unlikely that disabling compression is a good idea in the
> bigger scheme of things.

Is there any way to figure out what sort of compression ratios we're typically getting? The only way compression can really be a win is if it's saving us having to read in additional pages. Even then, I'm not sure we should worry about needing to read in additional pages in this case. I would sure expect statistics to stay in shared buffers, or at least in the operating system cache. Sure, if the table is accessed VERY infrequently, that might not be the case, but is that really what we want to optimize for?

...Robert
On Mon, Dec 8, 2008 at 10:24 AM, Gregory Stark <stark@enterprisedb.com> wrote:
> I tried a different query, trying to get quadratic growth and again failed.

The profiling results I sent the other day show an exactly-linear increase in the number of times eqjoinsel invokes FunctionCall2. Reading through the eqjoinsel_inner loop in selfuncs.c beginning around line 2042, I think what is happening is this: since the two tables are really the same table, nvalues1 and nvalues2 are the same array, and therefore contain the same elements in the same order. As a result, for each i, we skip over the first i - 1 entries in nvalues2, which have already been matched, and then compare element i of nvalues1 to element i of nvalues2 and mark them both as matched. Although this is technically an O(n^2) algorithm, the constant is very low, because this code is Really Fast:

if (hasmatch[j])
    continue;

To get observable quadratic behavior, I think you might need to construct two MCV arrays where all the values are the same, but the arrays are not in the same order. I believe they are sorted by frequency, so it might be sufficient to arrange things so that the N'th most common value in one table is the (statistics_target + 1 - N)'th most common value in the other. It might also help to use a datatype with a really slow comparison function (perhaps one intentionally constructed to be slow).

...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: > On Mon, Dec 8, 2008 at 10:24 AM, Gregory Stark <stark@enterprisedb.com> wrote: >> I tried a different query, trying to get quadratic growth and again failed. It > The profiling results I sent the other day show an exactly-linear > increase in the number of times eqjoinsel invokes FunctionCall2. > Reading through the the eqjoinsel_inner loop in selfuncs.c beginning > around line 2042, I think what is happening is this: since the two > tables are really the same table, nvalues1 and nvalues2 are the same > array, and therefore contain the same elements in the same order. Yeah, that would be fast. To see a quadratic case you need MCV arrays that have little or no overlap of common values --- then each element of the first will be compared (in vain) to all or most of the elements in the second. regards, tom lane
> Yeah, that would be fast. To see a quadratic case you need MCV arrays
> that have little or no overlap of common values --- then each element of
> the first will be compared (in vain) to all or most of the elements in
> the second.

Ah, that makes sense. Here's a test case based on Greg's. This is definitely more than linear once you get above about n = 80, but it's not quadratic either. n = 1000 is only 43x n = 80, and while that's surely more than 1000/80 = 12.5, it's also a lot less than (1000/80)^2 = 156.25.

create table tk as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk (select * from tk);
insert into tk (select * from tk);
insert into tk (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk2 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk2 (select * from tk2);
insert into tk2 (select * from tk2);
insert into tk2 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk3 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk3 (select * from tk3);
insert into tk3 (select * from tk3);
insert into tk3 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk4 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk4 (select * from tk4);
insert into tk4 (select * from tk4);
insert into tk4 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk5 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk5 (select * from tk5);
insert into tk5 (select * from tk5);
insert into tk5 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk6 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk6 (select * from tk6);
insert into tk6 (select * from tk6);
insert into tk6 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

and then (after disabling autovacuum):

set default_statistics_target = XXX;
analyze;

repeat 100x:

explain select count(*) from (select * from tk as k, tk2 as l, tk3 as m, tk4 as n, tk5 as o, tk6 as p where k.r=l.r and k.r=m.r and k.r=n.r and k.r=o.r and k.r=p.r) as x;

Timings (for 100 iterations):

10 0.900309
20 1.189229
30 1.280892
40 1.447358
50 1.611779
60 1.795701
70 2.001245
80 2.286144
90 2.955732
100 3.925557
150 6.472436
200 9.010824
250 11.89753
300 15.109172
350 18.813514
400 22.901383
450 27.842019
500 32.02136
550 37.609196
600 42.894322
650 48.460327
700 55.169819
750 61.568125
800 68.222201
850 75.027591
900 82.918344
950 91.235267
1000 99.737802

...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: > Ah, that makes sense. Here's a test case based on Greg's. This is > definitely more than linear once you get above about n = 80, but it's > not quadratic either. n = 1000 is only 43x n = 80, and while that's > surely more than 1000/80 = 12.5, it's also a lot less than (1000/80)^2 > = 156.25. Yeah, that's not too surprising. There's a linear cost associated with fetching/deconstructing the stats array, and a quadratic cost in the comparison loop. What these results say is that the upper limit of 1000 keeps you from getting into the range where the linear component is completely negligible. If you plot the results and try to fit a curve like c1*x^2 + c2*x to them, you get a very good fit for all the points above about 80. Below that, the curve has a flatter slope, indicating a smaller linear cost component. The values in this test are about 100 bytes each, so the breakpoint at 80 seems to correspond to what happens when the stats array no longer fits in one page. I replicated your test and got timings like these, using CVS HEAD with cassert off: 10 1.587 20 1.997 30 2.208 40 2.499 50 2.734 60 3.048 70 3.368 80 3.706 90 4.686 100 6.418 150 10.016 200 13.598 250 17.905 300 22.777 400 33.471 500 46.394 600 61.185 700 77.664 800 96.304 900 116.615 1000 140.117 So this machine is a bit slower than yours, but otherwise it's pretty consistent with your numbers. I then modified the test to use array[random()::text,random()::text,random()::text,random()::text,random()::text,random()::text] ie, the same data except stuffed into an array. I did this because I know that array_eq is pretty durn expensive, and indeed: 10 1.662 20 2.478 30 3.119 40 3.885 50 4.636 60 5.437 70 6.403 80 7.427 90 8.473 100 9.597 150 16.542 200 24.919 250 35.225 300 47.423 400 76.509 500 114.076 600 157.535 700 211.189 800 269.731 900 335.427 1000 409.638 When looking at these numbers one might think the threshold of pain is about 50, rather than 100 which is where I'd put it for the text example. However, this is probably an extreme worst case. On the whole I think we have some evidence here to say that upping the default value of default_stats_target to 100 wouldn't be out of line, but 1000 definitely is. Comments? BTW, does anyone have an opinion about changing the upper limit for default_stats_target to, say, 10000? These tests suggest that you wouldn't want such a value for a column used as a join key, but I can see a possible argument for high values in text search and similar applications. regards, tom lane
> BTW, does anyone have an opinion about changing the upper limit for
> default_stats_target to, say, 10000? These tests suggest that you
> wouldn't want such a value for a column used as a join key, but
> I can see a possible argument for high values in text search and
> similar applications.
Do you consider using hash tables?
I am not sure a hash is a perfect match here; however, I guess some kind of data structure might improve the N^2 behaviour. It looks like that would improve both array_eq (a hash would narrow the list of candidate arrays to a single bucket) and large statistics targets (I guess that would improve N^2 to N).
Regards,
Vladimir Sitnikov
"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes: > Do you consider using hash tables? Doubt it's really worth it, unless there's some way to amortize the setup cost across multiple selectivity estimations; which would surely complicate life. One thing that just now occurred to me is that as long as we maintain the convention that MCV lists are in decreasing frequency order, one can take any prefix of the list and it's a perfectly good MCV list of less resolution. So one way to reduce the time taken in eqjoinsel is to set an upper limit on the number of entries considered *by that routine*, whereas other estimator functions could use larger lists. regards, tom lane
> When looking at these numbers one might think the threshold of pain
> is about 50, rather than 100 which is where I'd put it for the text
> example. However, this is probably an extreme worst case.
>
> On the whole I think we have some evidence here to say that upping the
> default value of default_stats_target to 100 wouldn't be out of line,
> but 1000 definitely is. Comments?

Do you think there's any value in making it scale based on the size of the table? How hard would it be? If you made it MIN(10 + 0.001 * estimated_rows, 100), you would probably get most of the benefit while avoiding unnecessary overhead for small tables.

Otherwise, I am a bit concerned that 10 -> 100 may be too big a jump for one release, especially since it may cause the statistics to get toasted in some cases, which comes with a significant performance hit. I would raise it to 30 or 50 and plan to consider raising it further down the road. (I realize I just made about a million enemies with that suggestion.)

> BTW, does anyone have an opinion about changing the upper limit for
> default_stats_target to, say, 10000? These tests suggest that you
> wouldn't want such a value for a column used as a join key, but
> I can see a possible argument for high values in text search and
> similar applications.

I think that's a good idea. Given that most people probably don't bother fiddling with this parameter at all, it doesn't strike me as much of a foot-gun. I think you'd need a heck of a big table to justify a value in that range, but some people may have them.

...Robert
>> Do you consider using hash tables?
>
> Doubt it's really worth it, unless there's some way to amortize the
> setup cost across multiple selectivity estimations; which would surely
> complicate life.
MCV lists are updated only during the "analyze" phase, aren't they? If the "setup cost" is the cost of maintaining those "hash tables", it is not going to hurt much.
> One thing that just now occurred to me is that as long as we maintain
> the convention that MCV lists are in decreasing frequency order, one can
> take any prefix of the list and it's a perfectly good MCV list of less
> resolution. So one way to reduce the time taken in eqjoinsel is to set
> an upper limit on the number of entries considered *by that routine*,
> whereas other estimator functions could use larger lists.
That makes sense; however, a linear search for a single item in a list of 10,000 elements could take a while. A hash lookup might be a better choice.
Regards,
Vladimir Sitnikov
On Thu, Dec 11, 2008 at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes:
>> Do you consider using hash tables?
>
> Doubt it's really worth it, unless there's some way to amortize the
> setup cost across multiple selectivity estimations; which would surely
> complicate life.
>
> One thing that just now occurred to me is that as long as we maintain
> the convention that MCV lists are in decreasing frequency order, one can
> take any prefix of the list and it's a perfectly good MCV list of less
> resolution. So one way to reduce the time taken in eqjoinsel is to set
> an upper limit on the number of entries considered *by that routine*,
> whereas other estimator functions could use larger lists.

To what extent will that negate the benefit of having those statistics in the first place?

Here's another idea. If you have a < operator, you could use a quicksort-type strategy to partition the search space. Pick an arbitrary element of either list and apply it to all elements of both lists to divide the initial problem into two problems that are each half as large. When the subproblems fall below some size threshold, then solve them according to the existing algorithm. This is O(n^2) in the worst case, just like quicksort, but the worst case is difficult to construct.

...Robert
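For illustration only, here is a self-contained toy version of that partitioning strategy, with integers standing in for MCV datums and a counter standing in for operator-function calls; it is a sketch of the idea, not proposed planner code. On two disjoint 1000-entry lists (the worst case for the existing nested loop) the brute-force matcher performs 1,000,000 comparisons, while the partitioned version needs only a small fraction of that:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static long comparisons = 0;    /* stands in for '=' / '<' operator calls */

/* Brute-force matcher, analogous to the existing nested-loop algorithm. */
static int
match_bruteforce(int *a, int na, int *b, int nb)
{
    bool   *hasmatch = calloc(nb, sizeof(bool));
    int     nmatches = 0;

    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
        {
            if (hasmatch[j])
                continue;
            comparisons++;
            if (a[i] == b[j])
            {
                hasmatch[j] = true;
                nmatches++;
                break;
            }
        }
    free(hasmatch);
    return nmatches;
}

/* Move elements less than pivot to the front of x[]; return their count. */
static int
partition(int *x, int n, int pivot)
{
    int     k = 0;

    for (int i = 0; i < n; i++)
    {
        comparisons++;          /* one '<' operator call per element */
        if (x[i] < pivot)
        {
            int     tmp = x[i];
            x[i] = x[k];
            x[k] = tmp;
            k++;
        }
    }
    return k;
}

/* Quicksort-style matcher: split both lists around one pivot and recurse. */
static int
match_partitioned(int *a, int na, int *b, int nb)
{
    int     pivot, ka, kb;

    if (na == 0 || nb == 0)
        return 0;
    if (na <= 8 || nb <= 8)     /* small subproblem: use the existing method */
        return match_bruteforce(a, na, b, nb);

    pivot = a[na / 2];          /* arbitrary element of one list */
    ka = partition(a, na, pivot);
    kb = partition(b, nb, pivot);
    if (ka == 0)                /* degenerate split: fall back */
        return match_bruteforce(a, na, b, nb);

    /* values < pivot can only match values < pivot, and likewise for >= */
    return match_partitioned(a, ka, b, kb) +
           match_partitioned(a + ka, na - ka, b + kb, nb - kb);
}

int
main(void)
{
    enum {N = 1000};
    static int a[N], b[N];

    for (int i = 0; i < N; i++)
    {
        a[i] = 2 * i;           /* evens */
        b[i] = 2 * i + 1;       /* odds: no overlap with a[] at all */
    }

    comparisons = 0;
    printf("brute force: %d matches, %ld comparisons\n",
           match_bruteforce(a, N, b, N), comparisons);
    comparisons = 0;
    printf("partitioned: %d matches, %ld comparisons\n",
           match_partitioned(a, N, b, N), comparisons);
    return 0;
}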
"Robert Haas" <robertmhaas@gmail.com> writes: >> On the whole I think we have some evidence here to say that upping the >> default value of default_stats_target to 100 wouldn't be out of line, >> but 1000 definitely is. Comments? > Do you think there's any value in making it scale based on the size of > the table? As far as the MCVs go, I think we already have a decent heuristic for determining the actual length of the array, based on discarding values that have too small an estimated frequency --- look into compute_scalar_stats. I don't see that explicitly considering table size would improve that. It might be worth limiting the size of the histogram, as opposed to the MCV list, for smaller tables. But that's not relevant to the speed of eqjoinsel, and AFAIK there aren't any estimators that are O(N^2) in the histogram length. (ineq_histogram_selectivity is actually O(log N), so it hardly cares at all.) > Otherwise, I am a bit concerned that 10 -> 100 may be too big a jump > for one release, especially since it may cause the statistics to get > toasted in some cases, which comes with a significant performance hit. > I would raise it to 30 or 50 and plan to consider raising it further > down the road. (I realize I just made about a million enemies with > that suggestion.) There's something in what you say, but consider that we have pretty much unanimous agreement that 10 is too small. I think we should try to fix the problem, not just gradually ratchet up the value until people start complaining in the other direction. (Also, we should have plenty of opportunity during beta to find out if we went too far.) regards, tom lane
> There's something in what you say, but consider that we have pretty
> much unanimous agreement that 10 is too small. I think we should
> try to fix the problem, not just gradually ratchet up the value until
> people start complaining in the other direction. (Also, we should have
> plenty of opportunity during beta to find out if we went too far.)
I am not sure whether the entity-attribute-value model gets used with Postgres databases, but that is one of the cases that requires a large MCV list (generally, for the attribute column).

You know, Oracle is not able to store more than 254 distinct values for histogram statistics. That really limits the use of histograms for the software product the company I work for creates.
One more direction could be implementing "MCV" for range of values (group values and interpolate in between). Consider statistics on timestamp column that says that for "2008-December" there are as many X rows, for "2008-November" as many as Y, etc. That could be used for rather accurate cardinality estimation of "between" cases, while keeping number of entries in "MCV" list small.
Regards,
Vladimir Sitnikov
"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes: > One more direction could be implementing "MCV" for range of values (group > values and interpolate in between). Consider statistics on timestamp column > that says that for "2008-December" there are as many X rows, for > "2008-November" as many as Y, etc. That could be used for rather accurate > cardinality estimation of "between" cases, while keeping number of entries > in "MCV" list small. I think that would likely be redundant with the histogram. regards, tom lane
> One more direction could be implementing "MCV" for range of values (group
> values and interpolate in between). Consider statistics on timestamp column
> that says that for "2008-December" there are as many X rows, for
> "2008-November" as many as Y, etc. That could be used for rather accurate
> cardinality estimation of "between" cases, while keeping number of entries
> in "MCV" list small.

I suggested this earlier ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ). If there's interest, I'd be happy to clean up my patch and submit it for 8.5.

-Nathan
> I think that would likely be redundant with the histogram.
I have no idea how to decide on an automatic switch between histograms and MCVs. It might sound crazy, but one could compute both histograms and MCVs and use them both (and pick an average of the two estimations :) )
Regards,
Vladimir Sitnikov
On Thu, 2008-12-11 at 13:09 -0500, Tom Lane wrote:
> On the whole I think we have some evidence here to say that upping the
> default value of default_stats_target to 100 wouldn't be out of line,
> but 1000 definitely is. Comments?

Sounds good to me.

I would like it even more if there was a data type specific default. Currently we have a special case for boolean, but that's it. TPC allows for different defaults for different types (but that's not my only reason). That would allow us to set it lower for long text strings and floats and potentially higher for things like smallint, which is less likely as a join target.

Would be great if we could set the default_stats_target for all columns in a table with a single ALTER TABLE statement, rather than setting it for every column individually.

And I would like it even more if the sample size increased according to table size, since that makes ndistinct values fairly random for large tables.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes: >> I think that would likely be redundant with the histogram. >> > I am afraid I do not get you. AFAICS what you're proposing *is* a histogram, just awkwardly described. What is the specific difference between what you are talking about and what scalarineqsel already implements? regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2008-12-11 at 13:09 -0500, Tom Lane wrote:
>> On the whole I think we have some evidence here to say that upping the
>> default value of default_stats_target to 100 wouldn't be out of line,
>> but 1000 definitely is. Comments?
>
> Sounds good to me.
>
> I would like it even more if there was a data type specific default.
> Currently we have a special case for boolean, but that's it. TPC allows
> for different defaults for different types (but that's not my only
> reason). That would allow us to set it lower for long text strings and
> floats and potentially higher for things like smallint, which is less
> likely as a join target.

In conjunction with a toast rethink it would be interesting to engineer things so that we keep one page's worth of data after toasting for each of the arrays. That would at least remove a lot of the dangers of larger arrays.

> Would be great if we could set the default_stats_target for all columns
> in a table with a single ALTER TABLE statement, rather than setting it
> for every column individually.

Hm, just because one column has a skewed data distribution doesn't tell you much about what the other columns' data distributions are. On the other hand I could see an argument that a given table might be consistently used in only large batch queries or quick OLTP queries.

> And I would like it even more if the sample size increased according to
> table size, since that makes ndistinct values fairly random for large
> tables.

Unfortunately _any_ ndistinct estimate based on a sample of the table is going to be pretty random.

-- 
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark <stark@enterprisedb.com> writes:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> And I would like it even more if the sample size increased according to
>> table size, since that makes ndistinct values fairly random for large
>> tables.
>
> Unfortunately _any_ ndistinct estimate based on a sample of the table is
> going to be pretty random.

Yeah, it's a hard problem. It's worth noting though that increasing the default stats target 10X is already going to result in a 10X increase in the default sample size, so we should see at least some improvement in the typical quality of ndistinct estimates.

regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> I would like it even more if there was a data type specific default.
> Currently we have a special case for boolean, but that's it.

No, we don't (or if we do I'd be interested to know where). I don't see much value in a data-type-dependent default anyway --- would you make different defaults for int4, int8, and float8, and on what grounds? The actual data contents of three such columns could easily be exactly equivalent.

regards, tom lane
> What is the specific difference between what you are talking about and
> what scalarineqsel already implements?
Hmm... Nothing new. Sorry for bothering you. I did not realize histograms were already implemented.
Regards,
Vladimir Sitnikov
Tom Lane <tgl@sss.pgh.pa.us> writes:
> 500 114.076
> 600 157.535
> 700 211.189
> 800 269.731
> 900 335.427
> 1000 409.638
> ...
> BTW, does anyone have an opinion about changing the upper limit for
> default_stats_target to, say, 10000? These tests suggest that you
> wouldn't want such a value for a column used as a join key, but
> I can see a possible argument for high values in text search and
> similar applications.

I don't like the existing arbitrary limit which it sounds like people are really bumping into. But that curve looks like it might be getting awfully steep. I wonder just how long 10,000 would take?

-- 
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> BTW, does anyone have an opinion about changing the upper limit for
>> default_stats_target to, say, 10000? These tests suggest that you
>> wouldn't want such a value for a column used as a join key, but
>> I can see a possible argument for high values in text search and
>> similar applications.
>
> I don't like the existing arbitrary limit which it sounds like people are
> really bumping into. But that curve looks like it might be getting awfully
> steep. I wonder just how long 10,000 would take?

Presumably, right about 100X longer than 1000 ... if we don't do anything about limiting the number of values eqjoinsel looks at.

I think though that the case for doing so is pretty good. "MCVs" that are beyond the K'th entry can't possibly have frequencies greater than 1/K, and in most cases it'll be a lot less. So the incremental contribution to the accuracy of the join selectivity estimate drops off pretty quickly, I should think. And it's not like we're ignoring the existence of those values entirely --- we'd just be treating them as if they are part of the undifferentiated collection of non-MCV values.

It might be best to stop when the frequency drops below some threshold, rather than taking a fixed number of entries.

regards, tom lane
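Because the stats arrays are already stored in decreasing frequency order, applying such a cutoff is cheap. A minimal sketch (hypothetical helper, not existing PostgreSQL code) of what eqjoinsel could call before entering its nested loops:

#include <stdio.h>

/*
 * Hypothetical helper: given MCV frequencies sorted in decreasing order
 * (as the stats arrays are), return how many leading entries have a
 * frequency of at least min_freq.
 */
static int
clamp_mcvs(const double *freqs, int nvalues, double min_freq)
{
    int     n = 0;

    while (n < nvalues && freqs[n] >= min_freq)
        n++;
    return n;
}

int
main(void)
{
    /* made-up decreasing frequencies, just to exercise the helper */
    double  freqs[] = {0.20, 0.10, 0.05, 0.002, 0.0008, 0.0001};

    printf("%d entries kept at threshold 0.001\n",
           clamp_mcvs(freqs, 6, 0.001));    /* prints 4 */
    return 0;
}

The join-selectivity code would then run its loops over only that many entries of each list and treat the remainder as part of the undifferentiated non-MCV population, as described above.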
On Thu, Dec 11, 2008 at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> I would like it even more if there was a data type specific default.
>> Currently we have a special case for boolean, but that's it.
>
> No, we don't (or if we do I'd be interested to know where).

utils/adt/selfuncs.c, get_variable_numdistinct

...Robert
On Thu, 2008-12-11 at 17:45 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> I would like it even more if there was a data type specific default.
>> Currently we have a special case for boolean, but that's it.
>
> No, we don't (or if we do I'd be interested to know where).

Your commit, selfuncs.c, 7 Jul.

> I don't see much value in a data-type-dependent default anyway --- would
> you make different defaults for int4, int8, and float8, and on what
> grounds? The actual data contents of three such columns could easily be
> exactly equivalent.

I would prefer distinct type or domain specific defaults, but until we have that I would settle for datatype specific defaults. Defaults, not only permissible settings.

Most people don't keep the same data in float8 as they do in int4, but neither of those were the ones I was thinking about. I see 3 main classes:

* data with a small number of distinct values (e.g. boolean, smallint)
* data with many distinct values
* data where every value is typically unique (e.g. text)

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
"Robert Haas" <robertmhaas@gmail.com> writes: > On Thu, Dec 11, 2008 at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >>> I would like it even more if there was a data type specific default. >>> Currently we have a special case for boolean, but that's it. >> >> No, we don't (or if we do I'd be interested to know where). > utils/adt/selfuncs.c, get_variable_numdistinct That's got nothing to do with stats collection; it's a fallback case used if no stats are available. regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2008-12-11 at 17:45 -0500, Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> I would like it even more if there was a data type specific default.
>>> Currently we have a special case for boolean, but that's it.
>>
>> No, we don't (or if we do I'd be interested to know where).

> Your commit, selfuncs.c, 7 Jul.

As with Robert's pointer, that's about coping with missing stats, not about determining what stats to collect.

> ... neither of those were the ones I was thinking about. I see 3 main classes:
> * data with a small number of distinct values (e.g. boolean, smallint)
> * data with many distinct values
> * data where every value is typically unique (e.g. text)

These three categories are already dealt with in an entirely type-independent fashion by the heuristics in compute_scalar_stats. I think it's quite appropriate to drive them off the number of observed values, not guesses about what a particular datatype is used for.

regards, tom lane
On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
>> And I would like it even more if the sample size increased according to
>> table size, since that makes ndistinct values fairly random for large
>> tables.
>
> Unfortunately _any_ ndistinct estimate based on a sample of the table
> is going to be pretty random.

We know that constructed data distributions can destroy the effectiveness of the ndistinct estimate and make sample size irrelevant. But typical real world data distributions do improve their estimations with increased sample size and so it is worthwhile.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
>>> Simon Riggs <simon@2ndQuadrant.com> wrote:
> On Thu, 2008-12-11 at 17:45 -0500, Tom Lane wrote:
>> I don't see much value in a data-type-dependent default anyway

I am dubious about the value there, too, at least for our environment.

> I would prefer distinct type or domain specific defaults

Now that would be really useful.

-Kevin
Tom Lane wrote:
> "Robert Haas" <robertmhaas@gmail.com> writes:
>>> On the whole I think we have some evidence here to say that upping the
>>> default value of default_stats_target to 100 wouldn't be out of line,
>>> but 1000 definitely is. Comments?
>
>> Do you think there's any value in making it scale based on the size of
>> the table?
>
> As far as the MCVs go, I think we already have a decent heuristic for
> determining the actual length of the array, based on discarding values
> that have too small an estimated frequency --- look into
> compute_scalar_stats. I don't see that explicitly considering table
> size would improve that. It might be worth limiting the size of the
> histogram, as opposed to the MCV list, for smaller tables. But that's
> not relevant to the speed of eqjoinsel, and AFAIK there aren't any
> estimators that are O(N^2) in the histogram length.
> (ineq_histogram_selectivity is actually O(log N), so it hardly cares at
> all.)

Why is selfuncs.c::var_eq_const() doing a linear scan over the MCV array instead of having the list sorted and doing a binary search on the array? We already do this for histogram lookups, as you mentioned. Does it not matter? It didn't for ten values but might for larger distinct lists.

>> Otherwise, I am a bit concerned that 10 -> 100 may be too big a jump
>> for one release, especially since it may cause the statistics to get
>> toasted in some cases, which comes with a significant performance hit.
>> I would raise it to 30 or 50 and plan to consider raising it further
>> down the road. (I realize I just made about a million enemies with
>> that suggestion.)
>
> There's something in what you say, but consider that we have pretty
> much unanimous agreement that 10 is too small. I think we should
> try to fix the problem, not just gradually ratchet up the value until
> people start complaining in the other direction. (Also, we should have
> plenty of opportunity during beta to find out if we went too far.)

I am excited we are addressing the low default statistics target value.

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
>>> And I would like it even more if the sample size increased according
>>> to table size, since that makes ndistinct values fairly random for
>>> large tables.
>>
>> Unfortunately _any_ ndistinct estimate based on a sample of the table
>> is going to be pretty random.

> We know that constructed data distributions can destroy the
> effectiveness of the ndistinct estimate and make sample size irrelevant.
> But typical real world data distributions do improve their estimations
> with increased sample size and so it is worthwhile.

This is handwaving unsupported by evidence. If you've got a specific proposal what to change the sample size to and some numbers about what it might gain us or cost us, I'm all ears.

regards, tom lane
Bruce Momjian <bruce@momjian.us> writes:
> Why is selfuncs.c::var_eq_const() doing a linear scan over the MCV array
> instead of having the list sorted and doing a binary search on the
> array?

1. Keeping the MCV array sorted by frequency allows cheap extraction of less-accurate MCV lists; you just take the first N entries.

2. Binary search presumes that you have a less-than operator, a concept not implicit in the notion of MCVs (but it *is* implicit in a histogram).

regards, tom lane
>> What is the specific difference between what you are talking about and
>> what scalarineqsel already implements?
>
> Hmm... Nothing new. Sorry for bothering you. I did not realize
> histograms were already implemented.

Well, ISTM there is a profound difference. For scalarineqsel we care about the total number of values in a bucket. For eqsel we care about the total number of *distinct* values in each bucket (which we don't track).

IMHO, the whole idea of increasing mcv's seems a mistake. Why not use the limited storage in pg_statistic to try and estimate the selectivity for ranges of values rather than a single value? That gives way better coverage of the distribution. If a value is too frequent to fit in a single bucket we put it in an mcv slot anyways. *That* should be the mechanism by which the number of mcv's increases.

I guess this is a bit off topic for the middle of a commit fest though.

-Nathan
"Nathan Boley" <npboley@gmail.com> writes: > Well, ISTM there is a profound difference. For scalarineqsel we care > about the total number of values in a bucket. For eqsel we care about > the total number of *distinct* values in each bucket Really? > IMHO, the whole idea of increasing mcv's seems a mistake. Why not use > the limited storage in pg_statistic to try and estimate the > selectivity for ranges of values rather than a single value? MCVs are useful for questions that are not related to ranges of values --- an example of something highly useful we do with them is to try to estimate the fraction of a column that satisfies a LIKE or regex pattern. In fact, as I was just pointing out to Bruce, we can compute them and do useful things with them for datatypes that don't have a defined sort order and so the whole concept of "range" is meaningless. Now, if your point is that it'd be more flexible to not tie the MCV list length to the histogram length, you're right. I'm not sure we can expect the average DBA to set two separate knobs very effectively, though. We do already have some smarts in there to try to set the MCV list length intelligently instead of slavishly following the stats target --- perhaps it wouldn't be a bad idea to develop similar heuristics for choosing an actual histogram length. regards, tom lane
Thanks for the response.

>> Well, ISTM there is a profound difference. For scalarineqsel we care
>> about the total number of values in a bucket. For eqsel we care about
>> the total number of *distinct* values in each bucket
>
> Really?

Well, we would obviously also care about the total number of values in the buckets if we were trying to use the histogram in eqsel. Isn't a selectivity estimate of x = v as ( the number of values in v's histogram bucket ) / ( number of distinct values in v's histogram bucket ) pretty rational? That's currently what we do for non-mcv values, except that we look at ndistinct over the whole table instead of individual histogram buckets.

>> IMHO, the whole idea of increasing mcv's seems a mistake. Why not use
>> the limited storage in pg_statistic to try and estimate the
>> selectivity for ranges of values rather than a single value?
>
> MCVs are useful for questions that are not related to ranges of values
> --- an example of something highly useful we do with them is to try to
> estimate the fraction of a column that satisfies a LIKE or regex
> pattern.

Good point. I guess I was responding to the eqsel benchmarks. I should remember that I don't necessarily know all the places that mcv's are used.

> In fact, as I was just pointing out to Bruce, we can compute them and
> do useful things with them for datatypes that don't have a defined sort
> order and so the whole concept of "range" is meaningless.

Another good point. But don't we have bigger stat problems for datatypes without an ordering relation? IIRC, analyze doesn't even fully compute the mcv list, as that would be N^2 in the sample size.

> Now, if your point is that it'd be more flexible to not tie the MCV list
> length to the histogram length, you're right.

No, my point is just the opposite. I think the two should be *more* tightly linked. It seems that ( at least for eqsel and scalarineqsel ) mcv's should be the values that the histogram can't deal with effectively. ie, there's no ordering relation, there are too many values to fit into a histogram bucket, the histogram eqsel estimate does an especially bad job for a relatively common value, etc.

Even now, if there are 100 histogram buckets then any values that occupy more than 1% of the table will be mcv's regardless - why force a value to be an mcv if it only occupies 0.1% of the table? That just bloats pg_statistic and slows down joins unnecessarily.

I'll submit a patch for 8.5 and then, hopefully, some simple benchmarks can make my point.

-Nathan
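Spelling the two estimates out, roughly (the actual var_eq_const logic has additional clamps and ignores nulls here): the current estimate for a non-MCV value v is something like

\mathrm{sel}(x = v) \approx \frac{1 - \sum_i f_{\mathrm{mcv},i}}{n_{\mathrm{distinct}} - n_{\mathrm{mcv}}}

while the per-bucket idea above would use, for the histogram bucket b containing v,

\mathrm{sel}(x = v) \approx \frac{\mathrm{rows}(b) / \mathrm{rows}_{\mathrm{total}}}{n_{\mathrm{distinct}}(b)}

i.e. the same form, but with the row fraction and distinct count taken per bucket rather than over the whole non-MCV population.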
"Nathan Boley" <npboley@gmail.com> writes: > Isn't a selectivity estimate of x = v as ( the number of values in v's > histogram bucket ) / ( number of distinct values in v's histogram > bucket ) pretty rational? Thats currently what we do for non-mcv > values, except that we look at ndistinct over the whole table instead > of individual histogram buckets. But the histogram buckets are (meant to be) equal-population, so it should come out the same either way. The removal of MCVs from the population will knock any possible variance in ndistinct down to the point where I seriously doubt that this could offer a win. An even bigger problem is that this requires estimation of ndistinct among fractions of the population, which will be proportionally less accurate than the overall estimate. Accurate ndistinct estimation is *hard*. > now, if there are 100 histogram buckets then any values that occupy > more than 1% of the table will be mcv's regardless - why force a value > to be an mcv if it only occupies 0.1% of the table? Didn't you just contradict yourself? The cutoff would be 1% not 0.1%. In any case there's already a heuristic to cut off the MCV list at some shorter length (ie, more than 1% in this example) if it seems not worthwhile to keep the last entries. See lines 2132ff (in CVS HEAD) in analyze.c. regards, tom lane
> I think though that the case for doing so is pretty good. "MCVs" that
> are beyond the K'th entry can't possibly have frequencies greater than
> 1/K, and in most cases it'll be a lot less. So the incremental
> contribution to the accuracy of the join selectivity estimate drops off
> pretty quickly, I should think. And it's not like we're ignoring the
> existence of those values entirely --- we'd just be treating them as if
> they are part of the undifferentiated collection of non-MCV values.
>
> It might be best to stop when the frequency drops below some threshold,
> rather than taking a fixed number of entries.

OK, I'll bite. How do we decide where to put the cutoff? If we make it too high, it will have a negative effect on join selectivity estimates; if it's too low, it won't really address the problem we're trying to fix. I randomly propose p = 0.001, which should limit eqjoinsel() to about a million equality tests in the worst case. In the synthetic example we were just benchmarking, that causes the entire MCV array to be tossed out the window (which feels about right).

...Robert
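The arithmetic behind that worst case: since the MCV frequencies on each side sum to at most 1, a cutoff of p leaves at most \lfloor 1/p \rfloor entries per list, so

\text{comparisons} \le \left(\frac{1}{p}\right)^{2} = 10^{6} \quad \text{at } p = 0.001.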
"Robert Haas" <robertmhaas@gmail.com> writes: >> It might be best to stop when the frequency drops below some threshold, >> rather than taking a fixed number of entries. > OK, I'll bite. How do we decide where to put the cutoff? If we make > it too high, it will have a negative effect on join selectivity > estimates; if it's too low, it won't really address the problem we're > trying to fix. I randomly propose p = 0.001, which should limit > eqjoinsel() to about a million equality tests in the worst case. In > the synthetic example we were just benchmarking, that causes the > entire MCV array to be tossed out the window (which feels about > right). Yeah. One idle thought I had was that maybe the cutoff needs to consider both probabilities: if the high-frequency MCVs on one side chance to match to not-so-high-frequency MCVs on the other, you would like to know about that. As long as we keep the lists in frequency order, it seems easy to implement this: for each MCV examined by the outer loop, you run the inner loop until the product of the outer and current inner frequency drops below whatever your threshold is. This doesn't immediately suggest what the threshold ought to be, but it seems like it ought to be possible to determine that given a desired maximum error in the overall estimate. I'm also not very clear on what the "total frequency" computations (matchfreq2 and unmatchfreq2 in the current code) ought to look like if we are using a variable subset of the inner list. regards, tom lane
On Thu, Dec 11, 2008 at 11:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
>>> And I would like it even more if the sample size increased according
>>> to table size, since that makes ndistinct values fairly random for
>>> large tables.
>>
>> Unfortunately _any_ ndistinct estimate based on a sample of the table
>> is going to be pretty random.
>
> We know that constructed data distributions can destroy the
> effectiveness of the ndistinct estimate and make sample size irrelevant.
> But typical real world data distributions do improve their estimations
> with increased sample size and so it is worthwhile.

Well that just means "more is always better" which puts us back in the same boat of always needing more.

The existing sampling mechanism is tied to solid statistics. It provides the correct sample size to get a consistent confidence range for range queries. This is the same mathematics which governs election polling and other surveys. The sample size you need to get +/- 5% 19 times out of 20 increases as the population increases, but not by very much.

However ndistinct is a different kind of question. Instead of needing the relatively small and slowly growing sample size, you need a percentage of the whole table. It would mean letting this one measure control the whole sample size decision. And the end result would be pretty unimpressive. The papers I read showed pretty poor results for sample sizes as large as 50% of the table.

Simply raising the statistics target for larger tables would not be very helpful. All it would do is raise the table size at which you would find the sample size too small. You do have to change to using a percentage of the whole table -- which would make ANALYZE a *lot* slower for larger tables.

Really the only way we'll get good ndistinct calculations is if we decide to scan the whole table. If we have to do that for some other reason then there are algorithms for gathering enough information to handle ndistinct properly.

-- 
greg
>> Isn't a selectivity estimate of x = v as ( the number of values in v's
>> histogram bucket ) / ( number of distinct values in v's histogram
>> bucket ) pretty rational? That's currently what we do for non-mcv
>> values, except that we look at ndistinct over the whole table instead
>> of individual histogram buckets.
>
> But the histogram buckets are (meant to be) equal-population, so it
> should come out the same either way.

Why is it the same either way? The histogram buckets have equal population in the number of total values, not the number of distinct values. Consider [0,0,0,0,1,2,3,4,5,6]. The histogram buckets would be [0,2) and [2,+oo), but they would have 2 and 5 distinct values respectively.

> The removal of MCVs from the
> population will knock any possible variance in ndistinct down to the
> point where I seriously doubt that this could offer a win.

Well, if the number of MCV's is large enough then it certainly won't matter. But isn't that pointlessly inefficient for most distributions? I provided some empirical evidence of this in an earlier post ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ) for normal distributions.

> An even
> bigger problem is that this requires estimation of ndistinct among
> fractions of the population, which will be proportionally less accurate
> than the overall estimate. Accurate ndistinct estimation is *hard*.

Agreed, but I'm not convinced that the overall error rate will go up for our current estimator. Ndistinct estimation is hardest for populations with a large ndistinct count variation. If the assumptions underlying this are founded ( namely that ndistinct distributions are related to the ordering relation in a predictable way ) then ndistinct estimation should be easier because the ndistinct distribution will be more uniform. But that's certainly something that should be tested on real data.

>> now, if there are 100 histogram buckets then any values that occupy
>> more than 1% of the table will be mcv's regardless - why force a value
>> to be an mcv if it only occupies 0.1% of the table?
>
> Didn't you just contradict yourself? The cutoff would be 1% not 0.1%.

No, I'm saying that if the number of histogram buckets is 100, then even if the mcv list is set to length 2, every value that occupies 1% or more of the table will be an mcv. However, if the size is fixed to 100 I think that it will be common to see values with relative frequencies much lower than 1% near the bottom of the list.

> In any case there's already a heuristic to cut off the MCV list at some
> shorter length (ie, more than 1% in this example) if it seems not
> worthwhile to keep the last entries. See lines 2132ff (in CVS HEAD)
> in analyze.c.

Given an increase in the default stats target, limits like the 25% are exactly what I think there should be more of.

Can anyone suggest a good data set to test this sort of question on?

-Nathan
>> OK, I'll bite. How do we decide where to put the cutoff? If we make
>> it too high, it will have a negative effect on join selectivity
>> estimates; if it's too low, it won't really address the problem we're
>> trying to fix. I randomly propose p = 0.001, which should limit
>> eqjoinsel() to about a million equality tests in the worst case. In
>> the synthetic example we were just benchmarking, that causes the
>> entire MCV array to be tossed out the window (which feels about
>> right).
>
> Yeah. One idle thought I had was that maybe the cutoff needs to
> consider both probabilities: if the high-frequency MCVs on one side
> chance to match to not-so-high-frequency MCVs on the other, you
> would like to know about that. As long as we keep the lists in
> frequency order, it seems easy to implement this: for each MCV
> examined by the outer loop, you run the inner loop until the product of
> the outer and current inner frequency drops below whatever your
> threshold is. This doesn't immediately suggest what the threshold

I had this idle thought too, but I didn't write it down because...

> ought to be, but it seems like it ought to be possible to determine
> that given a desired maximum error in the overall estimate. I'm also
> not very clear on what the "total frequency" computations (matchfreq2
> and unmatchfreq2 in the current code) ought to look like if we are using
> a variable subset of the inner list.

...of this exact concern, which I think is an insurmountable problem. If you don't consider some of the MCVs AT ALL, I think you can add their frequencies back in to otherfreq{1,2} and go home, but if you consider them for some elements of the other list but not all, I'm not sure there's an appropriate way to proceed.

...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: > I had this idle thought too, but I didn't write it down because... >> ought to be, but it seems like it ought to be possible to determine >> that given a desired maximum error in the overall estimate. I'm also >> not very clear on what the "total frequency" computations (matchfreq2 >> and unmatchfreq2 in the current code) ought to look like if we are using >> a variable subset of the inner list. > ...of this exact concern, which I think is an insurmountable problem. Maybe so. If we stick to the other design (end both lists at a preset frequency threshold) then the math clearly goes through the same as before, just with num_mcvs that are determined differently. But can we prove anything about the maximum error added from that? regards, tom lane
On Fri, 2008-12-12 at 02:23 +0000, Greg Stark wrote:
> The existing sampling mechanism is tied to solid statistics. It
> provides the correct sample size to get a consistent confidence range
> for range queries. This is the same mathematics which governs election
> polling and other surveys. The sample size you need to get +/- 5% 19
> times out of 20 increases as the population increases, but not by very
> much.

Sounds great, but it's not true. The sample size is not linked to data
volume, so how can it possibly give a consistent confidence range?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-12-11 at 18:52 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
> >>> And I would like it even more if the sample size increased according
> >>> to table size, since that makes ndistinct values fairly random for
> >>> large tables.
> >>
> >> Unfortunately _any_ ndistinct estimate based on a sample of the table
> >> is going to be pretty random.
>
> > We know that constructed data distributions can destroy the
> > effectiveness of the ndistinct estimate and make sample size irrelevant.
> > But typical real world data distributions do improve their estimations
> > with increased sample size and so it is worthwhile.
>
> This is handwaving unsupported by evidence.

Not at all. In the paper cited within the ANALYZE code, shown here:
ftp://ftp.research.microsoft.com/users/autoadmin/histogram_conf.pdf
we implement its recommended sample size for reliable histograms, but
ignore most of the rest of the paper. Section 4, on block-level
sampling, is ignored, yet the paper's conclusions (section 7.2) are that
it provides a larger and more effective sample without significantly
increasing the number of accessed disk blocks. Haas and Stokes [1998]
also indicate that the accuracy of ndistinct estimation is linked to
sample size.

In a previous post to hackers I looked at the case where values were
physically clustered together in the table, either naturally or via the
CLUSTER command. Section: ESTIMATES OF D FOR DEPENDENT TABLES
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00153.php

In that case the current ANALYZE algorithm fails badly because of the
fixed sample size. This is because "ANALYZE randomly samples rows, so
that the average gap between randomly selected rows increases as the
table size increases, because of the fixed sample size. Since the
clustered rows are typically close together, then the apparent number
of multiple instances of the same data value decreases as the sample
fraction decreases. Since the sample size is currently fixed, this means
that the D estimate decreases as the table size increases. (This is
proven in a test case below)."

> If you've got a specific
> proposal what to change the sample size to and some numbers about what
> it might gain us or cost us, I'm all ears.

So my specific proposal is: implement block-level sampling. It allows us
to
* increase sample size without increasing the number of I/Os
* account correctly for clustered data

I'm not trying to force this to happen now, I'm just bringing it into
the discussion because it's relevant and has not been mentioned.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> The existing sampling mechanism is tied to solid statistics. It >> provides the correct sample size to get a consistent confidence range >> for range queries. This is the same mathematics which governs election >> polling and other surveys. The sample size you need to get +/- 5% 19 >> times out of 20 increases as the population increases, but not by very >> much. > > Sounds great, but its not true. The sample size is not linked to data > volume, so how can it possibly give a consistent confidence range? I'm not 100% sure how relevant it is to this case, but I think what Greg is referring to is: http://en.wikipedia.org/wiki/Margin_of_error#Effect_of_population_size It is a pretty well-known mathematical fact that for something like an opinion poll your margin of error does not depend on the size of the population but only on the size of your sample. ...Robert
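As a back-of-the-envelope illustration of that point (my own arithmetic,
not anything from the thread's benchmarks), the worst-case
normal-approximation margin of error at 95% confidence depends only on
the sample size n, with no term for the population size:

#include <math.h>
#include <stdio.h>

/* Worst-case (p = 0.5) margin of error at 95% confidence for a simple
 * random sample of size n, ignoring the finite-population correction.
 * Illustrative arithmetic only. */
static double
margin_of_error(double n)
{
    return 1.96 * sqrt(0.25 / n);
}

int
main(void)
{
    double sizes[] = {300, 3000, 30000};

    for (int i = 0; i < 3; i++)
        printf("n = %6.0f  ->  +/- %.1f%%\n",
               sizes[i], 100.0 * margin_of_error(sizes[i]));
    /* roughly +/-5.7%, +/-1.8%, +/-0.6%, regardless of table size */
    return 0;
}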
On Thu, Dec 11, 2008 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Robert Haas" <robertmhaas@gmail.com> writes: >> I had this idle thought too, but I didn't write it down because... > >>> ought to be, but it seems like it ought to be possible to determine >>> that given a desired maximum error in the overall estimate. I'm also >>> not very clear on what the "total frequency" computations (matchfreq2 >>> and unmatchfreq2 in the current code) ought to look like if we are using >>> a variable subset of the inner list. > >> ...of this exact concern, which I think is an insurmountable problem. > > Maybe so. If we stick to the other design (end both lists at a preset > frequency threshold) then the math clearly goes through the same as > before, just with num_mcvs that are determined differently. But can > we prove anything about the maximum error added from that? I don't think so, because in that design, it's entirely possible that you'll throw away the entire MCV list if all of the entries are below the threshold (as in the example we were just benchmarking, supposing a threshold of 0.001). An alternative is to pick a threshold T for the maximum number of equality probes that you're willing to suffer through. Then given two MCV lists of lengths M1 and M2 such that M1 * M2 > T, pick the largest N such that MIN(M1, N) * MIN(M2, N) <= T. This guarantees that you always use at least T^(1/2) MCVs. If you compare this approach with T = 10^6 vs. simply chopping off the MCV list at p = 0.001, this approach will be more accurate at the cost of more comparisons. For example in our test case where all the comparisons fail, chopping off the MCV list at p = 0.001 results in ignoring it completely, whereas with this approach we use all 1000 entries just as before. So it might be appropriate to choose a lower threshold like, say, T = 10^5, since otherwise we're not preventing any computation. (I suppose you could even make this a GUC since any choice of value is going to be pretty arbitrary...) I'm not sure to what extent we can bound the amount of error with this approach, but it's definitely better. As we've seen, a frequency cutoff can throw away the entire MCV list; this approach always retains at least T^(1/2) entries, and more if the other list happens to be shorter than T^(1/2). So we can say that the result will never be worse than it would have been had you set the statistics target to T^(1/2). ...Robert
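A literal-minded sketch of the clamp Robert proposes, in C (illustrative
only; the linear search is obviously not how one would really do it):

/* Given MCV list lengths m1 and m2 and a budget of at most t equality
 * probes, find the largest n such that Min(m1, n) * Min(m2, n) <= t.
 * Sketch of the proposal above, not committed code. */
static int
clamp_mcv_lists(int m1, int m2, int t)
{
    int n = (m1 > m2) ? m1 : m2;    /* no point going higher than this */

    while (n > 1)
    {
        int eff1 = (m1 < n) ? m1 : n;
        int eff2 = (m2 < n) ? m2 : n;

        if (eff1 * eff2 <= t)
            break;
        n--;
    }
    return n;
}

With m1 = m2 = 1000 and t = 10^6 this returns 1000 (no clamping, as in
the benchmark case above); with t = 10^5 it returns 316, i.e. roughly
sqrt(t) entries from each list.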
"Robert Haas" <robertmhaas@gmail.com> writes: > On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> The existing sampling mechanism is tied to solid statistics. >> >> Sounds great, but its not true. The sample size is not linked to data >> volume, so how can it possibly give a consistent confidence range? > It is a pretty well-known mathematical fact that for something like an > opinion poll your margin of error does not depend on the size of the > population but only on the size of your sample. Right. The solid math that Greg referred to concerns how big a sample we need in order to have good confidence in the histogram results. It doesn't speak to whether we get good results for ndistinct (or for most-common-values, though in practice that seems to work fairly well). AFAICS, marginal enlargements in the sample size aren't going to help much for ndistinct --- you really need to look at most or all of the table to be guaranteed anything about that. But having said that, I have wondered whether we should consider allowing the sample to grow to fill maintenance_work_mem, rather than making it a predetermined number of rows. One difficulty is that the random-sampling code assumes it has a predetermined rowcount target; I haven't looked at whether that'd be easy to change or whether we'd need a whole new sampling algorithm. regards, tom lane
On Fri, Dec 12, 2008 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > AFAICS, marginal enlargements in the sample size aren't going to help > much for ndistinct --- you really need to look at most or all of the > table to be guaranteed anything about that. Well you only need to maintain a fixed percentage of the table if by "guaranteed anything" you mean guaranteed a consistent level of confidence. But even a small percentage like 1% means a very different behaviour than currently. For large tables it could mean sampling a *lot* more. However if by "guaranteed anything" you mean guaranteeing an actual useful result then it's true. Even samples as large as 50% give a pretty low confidence estimate. > But having said that, I have wondered whether we should consider > allowing the sample to grow to fill maintenance_work_mem Hm, so I wonder what this does to the time analyze takes. I think it would be the only thing where raising maintenance_work_mem would actually increase the amount of time an operation takes. Generally people raise it to speed up index builds and vacuums etc. -- greg
"Robert Haas" <robertmhaas@gmail.com> writes: > On Thu, Dec 11, 2008 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Maybe so. If we stick to the other design (end both lists at a preset >> frequency threshold) then the math clearly goes through the same as >> before, just with num_mcvs that are determined differently. But can >> we prove anything about the maximum error added from that? > I don't think so, because in that design, it's entirely possible that > you'll throw away the entire MCV list if all of the entries are below > the threshold (as in the example we were just benchmarking, supposing > a threshold of 0.001). Right, but the question is how much that really hurts. It's not like we are going to pick a completely clueless number for the ignored MCVs; rather, we are going to assume that they have the same stats as the remainder of the population. If the threshold frequency isn't very large then the error involved should be bounded. As an example, in the perfectly flat distribution set up by the speed tests we were just doing, there actually wouldn't be any error at all (assuming we got ndistinct right, which of course is a pretty big assumption). I haven't consumed enough caffeine yet to try to do the math, but I think that if you set the threshold as something a bit more than the assumed frequency of a non-MCV value then it could work. > An alternative is to pick a threshold T for the maximum number of > equality probes that you're willing to suffer through. I'd like to get there from the other direction, ie figure out what T has to be to get known maximum error. regards, tom lane
On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote: > On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> The existing sampling mechanism is tied to solid statistics. It > >> provides the correct sample size to get a consistent confidence range > >> for range queries. This is the same mathematics which governs election > >> polling and other surveys. The sample size you need to get +/- 5% 19 > >> times out of 20 increases as the population increases, but not by very > >> much. > > > > Sounds great, but its not true. The sample size is not linked to data > > volume, so how can it possibly give a consistent confidence range? > > I'm not 100% sure how relevant it is to this case, but I think what > Greg is referring to is: > > http://en.wikipedia.org/wiki/Margin_of_error#Effect_of_population_size > > It is a pretty well-known mathematical fact that for something like an > opinion poll your margin of error does not depend on the size of the > population but only on the size of your sample. Yes, I agree with that *but* that refers to population statistics and has nothing at all to do with the calculation of ndistinct, which is what we were discussing. You can't just switch topics and have the statement remain true. ndistinct estimation is improved by larger sample sizes, that's what the maths says and what experimentation shows also. Note that the estimator we use was shown to be stable in the range of sample size between 5-20%. http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf We currently use a sample size of 300*stats_target. With default=10 that means our sample size is 0.3% on a 1 million row table, and 0.003% on a 100 million row table (they're common, I find). That size of sample is OK for some kinds of statistics, but not much good for ndistinct estimation. These issues only show up in the field, they never show up on optimizer test platforms because they typically are several orders of magnitude too small. (Conversely, the stats system works very well indeed for smaller tables... so I'm not trying to debunk everything). -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2008-12-12 at 09:35 -0500, Tom Lane wrote: > AFAICS, marginal enlargements in the sample size aren't going to help > much for ndistinct --- you really need to look at most or all of the > table to be guaranteed anything about that. > > But having said that, I have wondered whether we should consider > allowing the sample to grow to fill maintenance_work_mem, rather than > making it a predetermined number of rows. One difficulty is that the > random-sampling code assumes it has a predetermined rowcount target; > I haven't looked at whether that'd be easy to change or whether we'd > need a whole new sampling algorithm. I think we need to do block sampling before we increase sample size. On large tables we currently do one I/O per sampled row, so the I/O cost of ANALYZE would just increase linearly. We need the increased sample size for ndistinct, not for other stats. So I would suggest we harvest a larger sample, use that for ndistinct estimation, but then sample-the-sample to minimise processing time for other stats that aren't as sensitive as ndistinct. Currently we fail badly on columns that have been CLUSTERed and we can improve that significantly by looking at adjacent groups of rows, i.e. block sampling. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
"Greg Stark" <stark@enterprisedb.com> writes: > On Fri, Dec 12, 2008 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But having said that, I have wondered whether we should consider >> allowing the sample to grow to fill maintenance_work_mem > Hm, so I wonder what this does to the time analyze takes. I think it > would be the only thing where raising maintenance_work_mem would > actually increase the amount of time an operation takes. Generally > people raise it to speed up index builds and vacuums etc. Yeah --- we might need to make it a separate GUC knob instead of tying it directly to maintenance_work_mem. But still, is *any* fixed-size sample really going to help much for large tables? regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote: >> It is a pretty well-known mathematical fact that for something like an >> opinion poll your margin of error does not depend on the size of the >> population but only on the size of your sample. > > Yes, I agree with that *but* that refers to population statistics and > has nothing at all to do with the calculation of ndistinct, which is > what we were discussing. You can't just switch topics and have the > statement remain true. If you go back to my email that was kind of my point. The existing sample size is on a solid foundation for the histograms and most use cases for the statistics. But entirely bogus for ndistinct. The ndistinct estimate is just piggy-backing on that data. However to fix it would require switching over to scanning a percentage of the whole table which would be a massive increase in work for that one calculation. You can't fix it by just adjusting the sample size slightly. > Note that the estimator we use was shown to be stable in the range of > sample size between 5-20%. > http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf Uhm, this is a survey of lots of different methods and does lots of analysis. I don't see any simple conclusions about stability. Perhaps I'm just missing it in the technical details. Could you point out exactly what part of the paper you're basing this on and what "stable" means? > We currently use a sample size of 300*stats_target. With default=10 that > means our sample size is 0.3% on a 1 million row table, and 0.003% on a > 100 million row table (they're common, I find). > > That size of sample is OK for some kinds of statistics, but not much > good for ndistinct estimation. Right, but increasing our sample size by a factor of 150 for a 100M row table doesn't seem like a reasonable solution to one metric being bogus. For that matter, if we do consider sampling 5% of the table we may as well just go ahead and scan the whole table. It wouldn't take much longer and it would actually produce good estimates. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark <stark@enterprisedb.com> writes: > For that matter, if we do consider sampling 5% of the table we may as well > just go ahead and scan the whole table. It wouldn't take much longer and it > would actually produce good estimates. Yeah. Anything over a small fraction of a percent is going to imply fetching every page anyway, for typical row widths. If you want ANALYZE to be cheap then you simply don't get to have a trustworthy value of ndistinct. Perhaps a better plan is to try to de-emphasize use of ndistinct, though I concede I have no idea how to do that. regards, tom lane
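To put rough numbers on "anything over a small fraction of a percent
implies fetching every page anyway": assuming about 100 rows per page
and uniformly scattered sample rows, the expected fraction of pages
touched is 1 - (1 - f)^100 for a row sampling fraction f. A quick
illustration (my own arithmetic, with the 100-rows-per-page figure
assumed):

#include <math.h>
#include <stdio.h>

/* Expected fraction of pages containing at least one sampled row, if
 * rows are scattered uniformly.  Illustrative only. */
int
main(void)
{
    double rows_per_page = 100.0;   /* assumed typical row width */
    double fractions[] = {0.001, 0.005, 0.01, 0.05};

    for (int i = 0; i < 4; i++)
        printf("sample %.1f%% of rows -> touch ~%.0f%% of pages\n",
               100.0 * fractions[i],
               100.0 * (1.0 - pow(1.0 - fractions[i], rows_per_page)));
    /* roughly 10%, 39%, 63% and 99% of pages, respectively */
    return 0;
}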
>>> "Nathan Boley" <npboley@gmail.com> wrote: > Can anyone suggest a good data set to test this sort of question on? Where we have the biggest problem with bad estimates is on complex searches involving many joins where the main criterion limiting the result set is a name. The estimate based on the histogram is often very low (e.g. 2) when the actual result set is several hundred. While several hundred is far short of 1% of the table, the best plan for a result set of that size is very different than the best plan for two rows. Some numbers follow to give an idea of the shape of data where current techniques sometimes do poorly. We use a "searchName" column which puts the name components from various columns into a canonical format; this is what is indexed and searched. The search is usually a LIKE with the high order portion being six to ten characters followed by the wild card. Total rows in table: 32,384,830 There are 9,958,969 distinct values. There is one value present in over 1% of the rows, with 433,578 rows. There are ten values present in over 0.1% of the rows:433578140398135489112088 64069 63158 44656 36499 35896 35819 The 100th most common value is present in 4847 rows. There are 186 rows with over 0.01% of the rows. Based on my experience, we would need better estimates for ranges with 200 to 300 rows to improve our plans for the problem cases. I'd be happy to have it scan the whole table during our nightly VACUUM ANALYZE if that would get me statistics which would improve the estimates to that degree without a huge increase in plan time. Which raises the issue, if we could get better statistics by passing the whole table, why not do that when VACUUM ANALYZE is run? -Kevin
> Which raises the issue, if we could get better statistics by passing
> the whole table, why not do that when VACUUM ANALYZE is run?

I think the reason is "because the next autovacuum would undo it".

Perhaps a table-level option to scan the whole table instead of
estimating would be appropriate?

...Robert
On Fri, 2008-12-12 at 16:10 +0000, Gregory Stark wrote: > Right, but increasing our sample size by a factor of 150 for a 100M > row table doesn't seem like a reasonable solution to one metric being > bogus. > > For that matter, if we do consider sampling 5% of the table we may as > well just go ahead and scan the whole table. It wouldn't take much > longer and it would actually produce good estimates. As I said, we would only increase sample for ndistinct, not for others. At the moment we completely and significantly fail to assess ndistinct correctly on clustered data for large tables. Using block level sampling would prevent that. Right now we may as well use a random number generator. The amount of I/O could stay the same, just sample all rows on block. Lifting the sample size will help large tables. Will it be perfect? No. But I'll take "better" over "not working at all". If we are going to quote literature we should believe all the literature. We can't just listen to some people that did a few tests with sample size, but then ignore the guy that designed the MS optimizer and many others. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2008-12-12 at 11:16 -0500, Tom Lane wrote:
> If you want ANALYZE to be cheap then you simply don't get to have
> a trustworthy value of ndistinct.

Understood, but a cheap ANALYZE isn't always a higher priority than all
other considerations. Solutions can also include

* allowing user to note that we would actually like to scan the whole
table (stats_target = -2?)
* manual mechanism for setting ndistinct that doesn't keep getting
overwritten by subsequent ANALYZEs
* have the executor do non-transactional update of the value of
ndistinct if it ever builds a hash table that is larger than expected
(simple learning optimizer action)

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-12-12 at 11:16 -0500, Tom Lane wrote:
> Perhaps a better plan is to try to de-emphasize use of ndistinct,
> though I concede I have no idea how to do that.

We don't actually care about the accuracy of the ndistinct much, just
the accuracy of our answer to the question "given work_mem = X, is it
better to use a hash plan". So we just need to scan the table until we
can answer that question accurately enough, i.e. a variable-sized
sample.

Perhaps we could store a probability distribution for various values of
work_mem, rather than a single ndistinct value.

Anyway, definitely handwaving now to stimulate ideas.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Robert Haas wrote:
> > Which raises the issue, if we could get better statistics by passing
> > the whole table, why not do that when VACUUM ANALYZE is run?
>
> I think the reason is "because the next autovacuum would undo it".

Is there any way to "merge" the statistics? i.e. if a full table scan
is done to compute precise statistics, and later a regular analyze scan
is done, then perhaps instead of clobbering the previous stats, you
merge them with the new ones, thus not completely losing those previous
ones.

--
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Tom Lane wrote:
> If you want ANALYZE to be cheap then you simply don't get to have a
> trustworthy value of ndistinct.

But then, maybe it's not all that critical that ANALYZE is cheap. For
example, if we were to rework VACUUM ANALYZE so that on the same pass
that VACUUM cleans each heap page, a callback is called on the page to
grab the needed stats.

Partial vacuum is a roadblock for this though :-(

--
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Simon Riggs <simon@2ndQuadrant.com> writes: > The amount of I/O could stay the same, just sample all rows on block. > Lifting the sample size will help large tables. Will it be perfect? No. > But I'll take "better" over "not working at all". That will just raise the table size at which the problems start. It'll still be a constant-sized sample. It will also introduce strange biases. For instance in a clustered table it'll think there are a lot more duplicates than there really are because it'll see lots of similar values. Incidentally we *do* do block sampling. We pick random blocks and then pick random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime around then. It dramatically reduced the i/o requirements but there were long discussions of how to do it without introducing biases. > If we are going to quote literature we should believe all the > literature. We can't just listen to some people that did a few tests > with sample size, but then ignore the guy that designed the MS optimizer > and many others. I'm not sure what you're talking about regarding "some people that did a few tests". I looked around for the paper I keep referencing and can't find it on my laptop. I'll look for it online. But it included a section which was a survey of past results from other papers and the best results required stupidly large sample sizes to get anything worthwhile. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
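For readers following along, here is a much-simplified sketch of that
two-stage idea (random blocks, then a reservoir over the rows found on
them). This is toy code with invented names; the real implementation is
acquire_sample_rows() in analyze.c and is far more careful about block
selection without replacement and unequal block sizes:

#include <stdlib.h>

/* Hypothetical row reference: block number plus line number. */
typedef struct { int block; int line; } RowRef;

/* Very loose sketch of two-stage sampling: visit a random subset of
 * blocks, and reservoir-sample the rows found on them down to targrows.
 * Illustrative only. */
static int
sample_rows(int nblocks, int rows_per_block, int targrows, RowRef *sample)
{
    int nseen = 0;              /* rows examined so far */
    int nkept = 0;              /* rows currently in the reservoir */

    for (int b = 0; b < nblocks; b++)
    {
        /* stage 1: visit each block with probability targrows/nblocks */
        if (rand() % nblocks >= targrows)
            continue;

        for (int line = 0; line < rows_per_block; line++)
        {
            nseen++;
            if (nkept < targrows)
            {
                sample[nkept].block = b;
                sample[nkept].line = line;
                nkept++;
            }
            else
            {
                /* stage 2: classic reservoir step, replacing a random
                 * slot with probability targrows / nseen */
                int k = rand() % nseen;

                if (k < targrows)
                {
                    sample[k].block = b;
                    sample[k].line = line;
                }
            }
        }
    }
    return nkept;
}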
On Fri, 2008-12-12 at 16:10 +0000, Gregory Stark wrote:
> Uhm, this is a survey of lots of different methods and does lots of
> analysis. I don't see any simple conclusions about stability. Perhaps
> I'm just missing it in the technical details. Could you point out
> exactly what part of the paper you're basing this on and what "stable"
> means?

I was echoing the comments in the ANALYZE code, which explain that we
use the Duj1 estimator because it is more stable across sample size, as
shown in table 5 on p.21 of the Haas-Stokes report.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
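For reference, the Duj1 estimator being discussed is, as far as I can
tell from the comments in analyze.c, the following (restated here as a
throwaway C helper for illustration, not the actual code):

/* Duj1 estimator from Haas & Stokes, as described in the analyze.c
 * comments:
 *   n  = number of sampled rows, N = total rows in the table,
 *   d  = distinct values seen in the sample,
 *   f1 = values seen exactly once in the sample. */
static double
duj1_ndistinct(double n, double N, double d, double f1)
{
    return (n * d) / (n - f1 + f1 * n / N);
}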
>>> "Robert Haas" <robertmhaas@gmail.com> wrote: >> Which raises the issue, if we could get better statistics by passing >> the whole table, why not do that when VACUUM ANALYZE is run? > > I think the reason is "because the next autovacuum would undo it". The table has 32.4 million rows. autovacuum_analyze_scale_factor is 0.1. autovacuum_vacuum_scale_factor is 0.2. We run a nightly VACUUM ANALYZE. Deletes are rare. Normal operations don't update more than a few thousand rows per day. I know that normal operations never cause an autovacuum of this table. Perhaps if there was a way to share this information with PostgreSQL.... -Kevin
On Fri, 2008-12-12 at 17:05 +0000, Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > > The amount of I/O could stay the same, just sample all rows on block. > > Lifting the sample size will help large tables. Will it be perfect? No. > > But I'll take "better" over "not working at all". > > That will just raise the table size at which the problems start. It'll still > be a constant-sized sample. Work with me here. I want to make the situation better. It still won't be perfect, but is that an argument against any action at all? > It will also introduce strange biases. For instance in a clustered table it'll > think there are a lot more duplicates than there really are because it'll see > lots of similar values. > > Incidentally we *do* do block sampling. We pick random blocks and then pick > random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime > around then. It dramatically reduced the i/o requirements but there were long > discussions of how to do it without introducing biases. No, we pick random rows. On bigger tables, they get further apart typically and so we miss any clustering. I mean that we should pick a random block and read all rows on it. > > If we are going to quote literature we should believe all the > > literature. We can't just listen to some people that did a few tests > > with sample size, but then ignore the guy that designed the MS optimizer > > and many others. > > I'm not sure what you're talking about regarding "some people that did a few > tests". I looked around for the paper I keep referencing and can't find it on > my laptop. I'll look for it online. But it included a section which was a > survey of past results from other papers and the best results required > stupidly large sample sizes to get anything worthwhile. Even if you find it, we still need to know why we would listen to the research in the absent paper yet ignore the conclusion in the paper by the man in charge of the MS optimizer who said that block level sampling is a good idea. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> If you want ANALYZE to be cheap then you simply don't get to have a
>> trustworthy value of ndistinct.

> But then, maybe it's not all that critical that ANALYZE is cheap. For
> example, if we were to rework VACUUM ANALYZE so that on the same pass
> that VACUUM cleans each heap page, a callback is called on the page to
> grab the needed stats.

> Partial vacuum is a roadblock for this though :-(

Yeah --- now that partial vacuum is in, any argument that we can make
ANALYZE piggyback on VACUUM cheaply is dead anyway.

It would be interesting to consider "partial analyze" processing, but I
don't see how you would combine per-page partial results without a huge
increase in stats-related state data.

regards, tom lane
On Fri, Dec 12, 2008 at 5:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Incidentally we *do* do block sampling. We pick random blocks and then pick
>> random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime
>> around then. It dramatically reduced the i/o requirements but there were long
>> discussions of how to do it without introducing biases.
>
> No, we pick random rows. On bigger tables, they get further apart
> typically and so we miss any clustering. I mean that we should pick a
> random block and read all rows on it.

I think what's happening here is that our sampling method is based on
the assumption that a record's location is not related to its value.

Consider a table which is clustered and has two copies of each value.
When we look at a block and see n/2 distinct values, and we know there
are 1000 blocks, then a human would conjecture that there are 1000*n/2
distinct values throughout the table and that every value is
represented twice throughout the whole table.

But if we're assuming the records are randomly distributed, then looking
at the whole block will actually throw us off completely. We'll deduce
from the fact that we saw every value twice that there must be hundreds
of copies spread throughout the database, and there must be a lot less
than 1000*n/2 distinct values.

I think you need to find two different formulas, one which represents a
clustered table and one which represents randomly distributed data.
Then you need a way to measure just how clustered the data is so you
know how much weight to give each formula. Perhaps comparing the number
of duplicates in whole-block samples versus overall random selections
would give that measure.

--
greg
On Fri, 2008-12-12 at 14:03 -0300, Alvaro Herrera wrote:
> Partial vacuum is a roadblock for this though :-(

Actually, perhaps it's an enabler for looking at changed stats?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2008-12-12 at 11:16 -0500, Tom Lane wrote: >> Perhaps a better plan is to try to de-emphasize use of ndistinct, >> though I concede I have no idea how to do that. > We don't actually care about the accuracy of the ndistinct much, just > the accuracy of our answer to the question "given work_mem = X, is it > better to use a hash plan". That's hardly the only thing we use ndistinct for. Also, it's a bit too simplistic to suppose that we only have to make the right binary choice between hash and something else at a particular plan level. If we don't have at least ballpark-correct figures for cost and number of output rows, we'll start making mistakes at higher plan levels. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > Is there any way to "merge" the statistics? i.e. if a full table scan > is done to compute precise statistics, and later a regular analyze scan > is done, then perhaps instead of clobbering the previous stats, you > merge them with the new ones, thus not completely losing those previous > ones. Seems like a pretty hard problem unless you store a whole lot more statistics state than we do now (which of course would create its own costs). How would you know which portion of the old stats to not believe anymore? regards, tom lane
Alvaro Herrera wrote:
> Robert Haas wrote:
> > > Which raises the issue, if we could get better statistics by passing
> > > the whole table, why not do that when VACUUM ANALYZE is run?
> >
> > I think the reason is "because the next autovacuum would undo it".
>
> Is there any way to "merge" the statistics? i.e. if a full table scan
> is done to compute precise statistics, and later a regular analyze scan
> is done, then perhaps instead of clobbering the previous stats, you
> merge them with the new ones, thus not completely losing those previous
> ones.

Crazy idea, but if a partial analyze finds that 5% of the table has
changed since the last full analyze, but 10% of the statistics are
different, we know something is wrong. ;-)

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Alvaro Herrera wrote:
> Tom Lane wrote:
> > If you want ANALYZE to be cheap then you simply don't get to have a
> > trustworthy value of ndistinct.
>
> But then, maybe it's not all that critical that ANALYZE is cheap. For
> example, if we were to rework VACUUM ANALYZE so that on the same pass
> that VACUUM cleans each heap page, a callback is called on the page to
> grab the needed stats.
>
> Partial vacuum is a roadblock for this though :-(

Perhaps it isn't, because partial vacuum is going to highlight the
_changed_ blocks, which fits into your idea of merging stats, somehow.
;-)

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Fri, 2008-12-12 at 18:01 +0000, Greg Stark wrote:
> I think you need to find two different formulas, one which represents
> a clustered table and one which represents randomly distributed data.
> Then you need a way to measure just how clustered the data is so you
> know how much weight to give each formula. Perhaps comparing the
> number of duplicates in whole-block samples versus overall random
> selections would give that measure.

Please read the Chaudhuri paper.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > As I said, we would only increase sample for ndistinct, not for others. How will you do that? Keep in mind that one of the things we have to do to compute ndistinct is to sort the sample. ISTM that the majority of the cost of a larger sample is going to get expended anyway --- certainly we could form the histogram using the more accurate data at precisely zero extra cost, and I think we have also pretty much done all the work for MCV collection by the time we finish counting the number of distinct values. I seem to recall Greg suggesting that there were ways to estimate ndistinct without sorting, but short of a fundamental algorithm change there's not going to be a win here. > Right now we may as well use a random number generator. Could we skip the hyperbole please? regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > Solutions can also include > * manual mechanism for setting ndistinct that doesn't keep getting > overwritten by subsequent ANALYZEs Hmm, that might actually be the most practical answer for large, reasonably-static tables. Especially if we expose the "negative stadistinct" feature to let people specify it as a fraction of table size. regards, tom lane
On Fri, 2008-12-12 at 13:10 -0500, Tom Lane wrote:
> If we don't
> have at least ballpark-correct figures for cost and number of output
> rows, we'll start making mistakes at higher plan levels.

That's definitely where we are now. Had a call with a customer today
where I said "it's OK, it's only out by 3 orders of magnitude, it gets
worse than that". A 9-hour query, with a substantially incorrect
EXPLAIN.

I don't claim to have any/all of the answers myself, but this needs
attention very badly.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-12-12 at 13:18 -0500, Tom Lane wrote:
> I seem to recall Greg suggesting that there were ways to estimate
> ndistinct without sorting, but short of a fundamental algorithm change
> there's not going to be a win here.

Hash table? Haas and Stokes suggest a Bloom filter.

Why not keep the random algorithm we have now, but scan the block into a
separate hash table for ndistinct estimation. That way we keep the
correct random rows for other purposes.

> > Right now we may as well use a random number generator.
>
> Could we skip the hyperbole please?

Some of the ndistinct values are very badly off, and in the common cases
I cited previously, consistently so.

Once I'm certain the rescue helicopter has seen me, I'll stop waving my
arms. (But yes, OK).

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2008-12-12 at 13:18 -0500, Tom Lane wrote: >> Could we skip the hyperbole please? > Some of the ndistinct values are very badly off, and in the common cases > I cited previously, consistently so. > Once I'm certain the rescue helicopter has seen me, I'll stop waving my > arms. (But yes, OK). Well, AFAICT we have agreed in this thread to kick up the default and maximum stats targets by a factor of 10 for 8.4. If there's anything to your thesis that a bigger sample size will help, that should already make a noticeable difference. I'm inclined to wait for some field experience with 8.4 before we start making fundamental changes to the ANALYZE code. regards, tom lane
Robert Haas wrote:
>> Which raises the issue, if we could get better statistics by passing
>> the whole table, why not do that when VACUUM ANALYZE is run?
>
> I think the reason is "because the next autovacuum would undo it".

OK, but even if autovacuum will undo it, an almost-static data set would
benefit from this feature because (i) autovacuum won't often trigger on
that table, or (ii) you have disabled autovacuum on that table.

> Perhaps a table-level option to scan the whole table instead of
> estimating would be appropriate?

ANALYZE FULL foo ?

--
Euler Taveira de Oliveira
http://www.timbira.com/
>> Perhaps a table-level option to scan the whole table instead of >> estimating would be appropriate? >> > ANALYZE FULL foo ? I was thinking more like a flag that you could set on the table itself, that would apply to all future analyzes. ...Robert
On Fri, Dec 12, 2008 at 6:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I seem to recall Greg suggesting that there were ways to estimate > ndistinct without sorting, but short of a fundamental algorithm change > there's not going to be a win here. Not sure if you meant me, but I can say the approach I saw documented before involved keeping a hash of all values seen in a full table scan. When the hash reached a maximum size you discard half the hash key space and from then on ignore any values which hash into that space. When you reach the end you have a hash table with a random sample of 1/2^n distinct values (where n is the number of times the hash table overflowed) and the counts for those values. If you're just interested in counting distinct values in the sample I suppose you could do it using a Bloom filter just as we've talked about for other hash tables. You scan through the list and increment an ndistinct counter for every value you find where the bits aren't all already set (then set those bits). That would undercount slightly and I'm not sure how much faster than qsort it would actually be. The log(n) factor in qsort's average case is pretty small when we're talking about small samples. -- greg
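A toy sketch of that second, Bloom-filter variant (filter size and hash
functions are invented for illustration; a real version would size the
filter from the sample size and use the datatype's own hash support):

#include <stdint.h>
#include <stdlib.h>

/* Toy sketch of counting (approximately) the number of distinct values
 * in a sample with a Bloom filter instead of sorting.  Increments the
 * counter only when a value's bits were not all already set, so it
 * undercounts slightly due to false positives. */
#define FILTER_BITS (1 << 20)

static int
bloom_ndistinct(const uint32_t *sample, int nrows)
{
    unsigned char *bits = calloc(FILTER_BITS / 8, 1);
    int            ndistinct = 0;

    for (int i = 0; i < nrows; i++)
    {
        /* two cheap, made-up hash positions derived from the value */
        uint32_t h1 = (sample[i] * 2654435761u) % FILTER_BITS;
        uint32_t h2 = (sample[i] * 40503u + 2166136261u) % FILTER_BITS;
        int      seen = (bits[h1 / 8] & (1 << (h1 % 8))) &&
                        (bits[h2 / 8] & (1 << (h2 % 8)));

        if (!seen)
            ndistinct++;    /* first time we (probably) saw this value */
        bits[h1 / 8] |= 1 << (h1 % 8);
        bits[h2 / 8] |= 1 << (h2 % 8);
    }
    free(bits);
    return ndistinct;
}

Whether the constant factor beats qsort on a sample of a few hundred
thousand rows is exactly the open question Greg raises.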
On Fri, Dec 12, 2008 at 6:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Why not keep the random algorithm we have now, but scan the block into a
> separate hash table for ndistinct estimation. That way we keep the
> correct random rows for other purposes.

It seems to me that what you have to do is look at a set of blocks and
judge a) how many duplicates are in the typical block and b) how much
overlap there is between blocks. Then extrapolate to other blocks based
on those two values.

So for example if you look at 1% of the blocks and find there are 27
distinct values on each of the blocks, then you extrapolate that there
are somewhere between 100*27 distinct values table-wide (if those blocks
have no intersections) and 27 distinct values total (if those blocks
intersect 100% -- ie, they all have the same 27 distinct values).

I haven't had a chance to read that paper, it looks extremely dense. Is
this the same idea?

--
greg
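A toy sketch of the two-extreme blend Greg describes, with an invented
linear weighting; finding the right clusteredness measure and weighting
rule is exactly the hard part:

/* d_sample is the number of distinct values actually seen across the
 * sampled blocks, sample_fraction is the fraction of the table's blocks
 * that were read, and clusteredness in [0,1] says how much the sampled
 * blocks' value sets overlapped (1 = every block holds the same values,
 * 0 = no overlap at all).  The linear blend is made up for
 * illustration. */
static double
blended_ndistinct(double d_sample, double sample_fraction,
                  double clusteredness)
{
    double no_overlap_est   = d_sample / sample_fraction; /* disjoint blocks */
    double full_overlap_est = d_sample;                   /* identical blocks */

    return clusteredness * full_overlap_est +
           (1.0 - clusteredness) * no_overlap_est;
}

With d_sample = 27 and sample_fraction = 0.01, the result ranges between
27 and 100*27 as clusteredness goes from 1 to 0, matching the two
extremes in the example above.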
On Fri, 2008-12-12 at 13:43 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Fri, 2008-12-12 at 13:18 -0500, Tom Lane wrote: > >> Could we skip the hyperbole please? > > > Some of the ndistinct values are very badly off, and in the common cases > > I cited previously, consistently so. > > > Once I'm certain the rescue helicopter has seen me, I'll stop waving my > > arms. (But yes, OK). > > Well, AFAICT we have agreed in this thread to kick up the default and > maximum stats targets by a factor of 10 for 8.4. If there's anything > to your thesis that a bigger sample size will help, that should already > make a noticeable difference. That only makes x10 sample size. Since we're using such a low sample size already, it won't make much difference to ndistinct. It will be great for histograms and MCVs though. Please review my detailed test results mentioned here http://archives.postgresql.org/pgsql-hackers/2006-01/msg00153.php If you reproduce those results you'll see that the ndistinct machinery is fundamentally broken for clustered data on large tables. In many cases those are join keys and so joins are badly handled on the very tables where good optimisation is most important. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> The amount of I/O could stay the same, just sample all rows on block. [....] > > It will also introduce strange biases. For instance in a clustered table it'll > think there are a lot more duplicates than there really are because it'll see > lots of similar values. But for ndistinct - it seems it could only help things. If the ndistinct guesser just picks max(the-current-one-row-per-block-guess, a-guess-based-on-all-the-rows-on-the-blocks) it seems we'd be no worse off for clustered tables; and much better off for randomly organized tables. In some ways I fear *not* sampling all rows on the block also introduces strange biases by largely overlooking the fact that the table's clustered. In my tables clustered on zip-code we don't notice info like "state='AZ' is present in well under 1% of blocks in the table", while if we did scan all rows on the blocks it might guess this. But I guess a histogram of blocks would be additional stat rather than an improved one.
On Fri, 2008-12-12 at 13:20 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > Solutions can also include
> > * manual mechanism for setting ndistinct that doesn't keep getting
> > overwritten by subsequent ANALYZEs
>
> Hmm, that might actually be the most practical answer for large,
> reasonably-static tables. Especially if we expose the "negative
> stadistinct" feature to let people specify it as a fraction of table
> size.

Works for me. Especially if you want to think more about ANALYZE before
changing that.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2008-12-12 at 13:20 -0500, Tom Lane wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >>> Solutions can also include >>> * manual mechanism for setting ndistinct that doesn't keep getting >>> overwritten by subsequent ANALYZEs >> >> Hmm, that might actually be the most practical answer for large, >> reasonably-static tables. Especially if we expose the "negative >> stadistinct" feature to let people specify it as a fraction of table >> size. > Works for me. Especially if you want to think more about ANALYZE before > changing that. Well, it's something that would be sane to contemplate adding in 8.4. It's way too late for any of this other stuff to happen in this release. regards, tom lane
On Thu, 11 Dec 2008, Tom Lane wrote:
> On the whole I think we have some evidence here to say that upping the
> default value of default_stats_target to 100 wouldn't be out of line,
> but 1000 definitely is. Comments?

Circling back to where this started from now that the discussion has
slowed down here, the original tuning suggestions Josh threw out were:

> Mixed:
> default_statistics_target = 100
> DW:
> default_statistics_target = 400

I think this data, suggesting that the serious quadratic behavior
doesn't kick in until much higher values than originally theorized,
supports increasing the mixed number to 100. And if the database default
goes there too, I think that's even better. Compared to the current
default of 10, that makes for a 4X increase in planning time for
Robert's original test case, and a 5.8X increase in the more difficult
array test Tom came up with.

The only question really open in my mind is whether 400 is still
considered too high for a DW application. Relative to 100, Tom's numbers
suggest that a further 4X increase ends up turning into as much as an 8X
increase in planning time. Given the context of a DW application, where
bad plans are really expensive, that suggestion of 400 seems confirmed
to be sane now to me.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Fri, Dec 12, 2008 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Fri, 2008-12-12 at 13:20 -0500, Tom Lane wrote: >>> Simon Riggs <simon@2ndQuadrant.com> writes: >>>> Solutions can also include >>>> * manual mechanism for setting ndistinct that doesn't keep getting >>>> overwritten by subsequent ANALYZEs >>> >>> Hmm, that might actually be the most practical answer for large, >>> reasonably-static tables. Especially if we expose the "negative >>> stadistinct" feature to let people specify it as a fraction of table >>> size. > >> Works for me. Especially if you want to think more about ANALYZE before >> changing that. > > Well, it's something that would be sane to contemplate adding in 8.4. > It's way too late for any of this other stuff to happen in this release. I'm thinking about trying to implement this, unless someone else is already planning to do it. I'm not sure it's practical to think about getting this into 8.4 at this point, but it's worth doing whether it does or not. The main problem I see here is that I don't know what the mechanism should be. My first thought was to use a reloption, but that won't work because reloptions are per-table and this setting is per-column. So my second thought is to add a column to pg_attribute and add a new DDL commands to modify it, maybe something like: ALTER TABLE name ALTER [COLUMN] column SET NDISTINCT 1672; ALTER TABLE name ALTER [COLUMN] column DROP NDISTINCT; Another option would be to invent a reloption-like syntax for columns: ALTER TABLE name ALTER [COLUMN] column SET (ndistinct = 1672); ALTER TABLE name ALTER [COLUMN] column RESET (ndistinct); This sort of seems cleaner and I can imagine it being useful for other purposes, but I'm not sure if I want to go to the trouble of building a completely general column-level reloptions mechanism ATM. It's also not entirely clear to me whether this option, wherever we put it, should be directly examined by the selectivity functions before consulting the statistics tuple, or whether it should merely override the value that ANALYZE puts into the statistics tuple. The latter seems simpler and more performant to me, but does lead to the possibly surprising result that changes don't take effect until the next ANALYZE. Thoughts? ...Robert
Robert Haas <robertmhaas@gmail.com> wrote:
> >> Works for me. Especially if you want to think more about ANALYZE before
> >> changing that.
> >
> > Well, it's something that would be sane to contemplate adding in 8.4.
> > It's way too late for any of this other stuff to happen in this release.
>
> I'm thinking about trying to implement this, unless someone else is
> already planning to do it. I'm not sure it's practical to think about
> getting this into 8.4 at this point, but it's worth doing whether it
> does or not.

Can we use get_relation_stats_hook on 8.4? The pg_statistic catalog
will be still modified by ANALYZEs, but we can rewrite the statistics
just before they are used. Something like this (sketch only):

your_relation_stats_hook(root, rte, attnum, vardata)
{
    /* call default implementation first to fill in vardata */

    if (rte->relid == YourRelation && attnum == YourColumn)
        ((Form_pg_statistic) GETSTRUCT(vardata->statsTuple))->stadistinct =
            YourNDistinct;
}

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
On Thu, Mar 19, 2009 at 4:04 AM, ITAGAKI Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> >> Works for me. Especially if you want to think more about ANALYZE before
>> >> changing that.
>> >
>> > Well, it's something that would be sane to contemplate adding in 8.4.
>> > It's way too late for any of this other stuff to happen in this release.
>>
>> I'm thinking about trying to implement this, unless someone else is
>> already planning to do it. I'm not sure it's practical to think about
>> getting this into 8.4 at this point, but it's worth doing whether it
>> does or not.
>
> Can we use get_relation_stats_hook on 8.4? The pg_statistic catalog
> will be still modified by ANALYZEs, but we can rewrite the statistics
> just before they are used. Something like this (sketch only):
>
> your_relation_stats_hook(root, rte, attnum, vardata)
> {
>     /* call default implementation first to fill in vardata */
>
>     if (rte->relid == YourRelation && attnum == YourColumn)
>         ((Form_pg_statistic) GETSTRUCT(vardata->statsTuple))->stadistinct =
>             YourNDistinct;
> }

I don't know, can you run a query from inside the stats hook? It sounds
like this could be made to work for a hard-coded relation and column,
but ideally you'd like to get this data out of a table somewhere.

I started implementing this by adding attdistinct to pg_attribute and
making it a float8, with 0 meaning "don't override the results of the
normal stats computation" and any other value meaning "override the
results of the normal stats computation with this value". I'm not sure,
however, whether I can count on the result of an equality test against a
floating-point zero to be reliable on every platform. It also seems like
something of a waste of space, since the only positive values that are
useful are integers (and presumably less than 2^31-1) and the only
negative values that are useful are > -1.

So I'm thinking about making it an integer, to be interpreted as
follows:

0 => compute ndistinct normally
positive value => use this value for ndistinct
negative value => use this value * 10^-6 for ndistinct

Any thoughts?

...Robert
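To make the proposed integer encoding concrete, here is a small sketch of
how it might be interpreted when overriding stadistinct. This is
hypothetical code for a proposal, not anything committed; it relies only
on the existing pg_statistic convention that a negative stadistinct means
minus-a-fraction-of-the-table's-rows:

/* Sketch of decoding the proposed integer attdistinct override.
 * computed_stadistinct is whatever ANALYZE would otherwise have stored. */
static double
attdistinct_to_stadistinct(int attdistinct, double computed_stadistinct)
{
    if (attdistinct == 0)
        return computed_stadistinct;    /* no override: use ANALYZE's result */
    if (attdistinct > 0)
        return (double) attdistinct;    /* absolute number of distinct values */
    return attdistinct * 1e-6;          /* negative: fraction of rows, scaled */
}

For example, an attdistinct of -500000 would decode to -0.5, meaning
"assume half the rows are distinct" under the existing negative
stadistinct convention.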