Thread: Query optimizer 8.0.1 (and 8.0)
Here's one: I have the USA census TIGER database loaded, the WHOLE THING, the whole country. It isn't the biggest database, but it is about 40G before indexes. Every table is over a few million rows. I can quite honestly say, a sequential scan is almost never the right thing to do. Info below.

Here is the query:

select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr = 2186 or zipl=2186);

Well, when you look at the explain, obviously it looks like the sequential scan is better, but trust me, it takes a good few minutes for the query to respond with sequential scan enabled. I've run analyze and everything. I suspect that analyze only samples a very small amount of the database and gets the wrong idea about it. Is there a way to force analyze to sample more rows?

tiger=# analyze verbose rt2 ;
INFO:  analyzing "public.rt2"
INFO:  "rt2": scanned 3000 of 1139825 pages, containing 84160 live rows and 0 dead rows; 3000 rows in sample, 31975891 estimated total rows
ANALYZE
tiger=# analyze verbose rt1 ;
INFO:  analyzing "public.rt1"
INFO:  "rt1": scanned 3000 of 1527360 pages, containing 90951 live rows and 0 dead rows; 3000 rows in sample, 46304973 estimated total rows
ANALYZE

tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr = 2186 or zipl=2186);
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Hash Join  (cost=121978.81..3996190.89 rows=21118 width=520)
   Hash Cond: ("outer".tlid = "inner".tlid)
   ->  Seq Scan on rt2  (cost=0.00..1459291.36 rows=31946636 width=218)
   ->  Hash  (cost=120662.36..120662.36 rows=30579 width=302)
         ->  Index Scan using rt1_zipr, rt1_zipl on rt1  (cost=0.00..120662.36 rows=30579 width=302)
               Index Cond: ((zipr = 2186) OR (zipl = 2186))
(6 rows)

tiger=# set enable_seqscan=no;
SET
tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr = 2186 or zipl=2186);
                                            QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..46868256.00 rows=21118 width=520)
   ->  Index Scan using rt1_zipr, rt1_zipl on rt1  (cost=0.00..120662.36 rows=30579 width=302)
         Index Cond: ((zipr = 2186) OR (zipl = 2186))
   ->  Index Scan using rt2_tlid on rt2  (cost=0.00..1523.80 rows=396 width=218)
         Index Cond: ("outer".tlid = rt2.tlid)

          Table "public.rt1"
  Column   |       Type        | Modifiers
-----------+-------------------+-----------
 tlid      | integer           |
 side1     | character varying |
 source    | character varying |
 fedirp    | character varying |
 fename    | character varying |
 fetype    | character varying |
 fedirs    | character varying |
 cfcc      | character varying |
 fraddl    | character varying |
 toaddl    | character varying |
 fraddr    | character varying |
 toaddr    | character varying |
 friaddl   | character varying |
 toiaddl   | character varying |
 friaddr   | character varying |
 toiaddr   | character varying |
 zipl      | integer           |
 zipr      | integer           |
 aianhhfpl | character varying |
 aianhhfpr | character varying |
 aihhtlil  | character varying |
 aihhtlir  | character varying |
 statel    | character varying |
 stater    | character varying |
 countyl   | character varying |
 countyr   | character varying |
 cousubl   | character varying |
 cousubr   | character varying |
 submcdl   | character varying |
 submcdr   | character varying |
 placel    | character varying |
 placer    | character varying |
 tractl    | character varying |
 tractr    | character varying |
 blockl    | character varying |
 blockr    | character varying |
 frlong    | numeric           |
 frlat     | numeric           |
 tolong    | numeric           |
 tolat     | numeric           |
Indexes:
    "rt1_fename" btree (fename)
    "rt1_frlat" btree (frlat)
    "rt1_frlong" btree (frlong)
    "rt1_tlid" btree (tlid)
    "rt1_tolat" btree (tolat)
    "rt1_tolong" btree (tolong)
    "rt1_zipl" btree (zipl)
    "rt1_zipr" btree (zipr)

      Table "public.rt2"
 Column |  Type   | Modifiers
--------+---------+-----------
 tlid   | integer |
 rtsq   | integer |
 long1  | numeric |
 lat1   | numeric |
 long2  | numeric |
 lat2   | numeric |
 long3  | numeric |
 lat3   | numeric |
 long4  | numeric |
 lat4   | numeric |
 long5  | numeric |
 lat5   | numeric |
 long6  | numeric |
 lat6   | numeric |
 long7  | numeric |
 lat7   | numeric |
 long8  | numeric |
 lat8   | numeric |
 long9  | numeric |
 lat9   | numeric |
 long10 | numeric |
 lat10  | numeric |
Indexes:
    "rt2_tlid" btree (tlid)
pgsql@mohawksoft.com writes: > I suspect that analyze only samples a very small amount of the database > and gets the wrong idea about it. Is there a way to force analyze to > sample more rows? default_statistics_target. But let's see the pg_stats rows for these columns before assuming that analyze is getting it wrong. regards, tom lane
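(For concreteness, a minimal sketch of both the knob and the check mentioned above; the value 200 and the column list are purely illustrative, taken from the schema posted earlier, not recommendations:

    set default_statistics_target = 200;
    analyze rt1;
    select attname, n_distinct, null_frac, correlation
      from pg_stats
     where tablename = 'rt1' and attname in ('tlid', 'zipl', 'zipr');
)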
* pgsql@mohawksoft.com (pgsql@mohawksoft.com) wrote:
> I have the USA census TIGER database loaded, the WHOLE THING, the whole
> country. It isn't the biggest database, but it is about 40G before
> indexes. Every table is over a few million rows. I can quite honestly say,
> a sequential scan is almost never the right thing to do.

Cool, I've got it loaded here too w/ PostGIS. It's good stuff. :) I'm curious what all you're doing with it, and especially whether you're using mapserver on it or have found a way to identify roads at a more street/city/state/country level; I've had problems doing that.

> I suspect that analyze only samples a very small amount of the database
> and gets the wrong idea about it. Is there a way to force analyze to
> sample more rows?

Yup:

alter table <blah> alter column <blah> set statistics <blah>

As I recall, the default is 100 for statistics; I'd say increase that number to like 200 or 300 or more and see if it helps.

Stephen
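(A concrete form of the suggestion above, written against the thread's tables; the target of 300 is just one of the example values mentioned, not a tuned recommendation:

    alter table rt1 alter column zipl set statistics 300;
    alter table rt1 alter column zipr set statistics 300;
    analyze rt1;
)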
> pgsql@mohawksoft.com writes:
>> I suspect that analyze only samples a very small amount of the database
>> and gets the wrong idea about it. Is there a way to force analyze to
>> sample more rows?
>
> default_statistics_target. But let's see the pg_stats rows for these
> columns before assuming that analyze is getting it wrong.

Some more info:

I did a select count(distinct(tlid)) from rt2, and updated the statistics with the result:

tiger=# update pg_statistic set stadistinct = 23656799 where starelid = 17236 and staattnum = 1;
UPDATE 1
tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr
tiger(# = 2186 or zipl=2186);
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..425517.95 rows=21315 width=520)
   ->  Index Scan using rt1_zipr, rt1_zipl on rt1  (cost=0.00..121893.93 rows=30835 width=302)
         Index Cond: ((zipr = 2186) OR (zipl = 2186))
   ->  Index Scan using rt2_tlid on rt2  (cost=0.00..9.82 rows=2 width=218)
         Index Cond: ("outer".tlid = rt2.tlid)
(5 rows)

tiger=# SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'rt1';
  attname  | n_distinct | most_common_vals
-----------+------------+---------------------------------------------------------------------------
 tlid      | -1         |
 side1     | 1          | {1}
 source    | 9          | {B,J,A,K,L,N,O,M,C}
 fedirp    | 8          | {N,E,S,W,SW,NE,NW,SE}
 fename    | 2590       | {Main,1st,Oak,2nd,9th,"Burlington Northern Santa Fe R",Park,4th,11th,8th}
 fetype    | 26         | {Rd,St,Ave,Dr}
 fedirs    | 9          | {NE,SE,N,E,NW,SW,W,S,O}
 cfcc      | 51         | {A41,H12,H11,F10,A74,A31,H01}
 fraddl    | 767        | {1,101,2,201,301,401,100,701,298,500}
 toaddl    | 805        | {199,399,299,499,98,198,99,1,100,2}
 fraddr    | 765        | {2,1,100,200,400,300,700,101,501,299}
 toaddr    | 772        | {198,398,298,498,99,98,199,101,1,1098}
 friaddl   | 3          | {0,1,2}
 toiaddl   | 3          | {0,1,2}
 friaddr   | 3          | {0}
 toiaddr   | 3          | {0}
 zipl      | 925        |
 zipr      | 899        |
 aianhhfpl | 42         |
 aianhhfpr | 43         |
 aihhtlil  | 2          |
 aihhtlir  | 2          |
 statel    | 55         | {48,06,12,37,29,42,17,36,13,39}
 stater    | 55         | {48,06,12,37,29,42,17,36,13,39}
 countyl   | 189        | {005,059,013,029,003,001,017,009,031,043}
 countyr   | 191        | {005,059,013,001,003,029,017,025,031,043}
 cousubl   | 2568       | {92601,91800,90270,90240,90468,90572,91508,91750,60000,90324}
 cousubr   | 2598       | {92601,91800,90270,90240,90468,90572,91248,91750,60000,90324}
 submcdl   | -1         |
 submcdr   | -1         |
 placel    | 778        | {51000,65000,44000,12000,38000,60000,63460,07000,04000,22000}
 placer    | 787        | {51000,65000,12000,44000,60000,07000,38000,55000,63460,04000}
 tractl    | 1370       | {950200,950100,950300,000100,970100,960100,980100,950700,970300,990100}
 tractr    | 1354       | {950200,950100,950300,000100,970100,960100,990100,950700,980100,970300}
 blockl    | 1050       | {1000,2000,1001,1003,1005,2001,1009,2006,1002,1004}
 blockr    | 1055       | {1000,2000,1001,1002,1003,1005,1004,1009,2004,2002}
 frlong    | 134476     | {-120.214657,-113.074100,-106.494480,-103.306945,-100.184470,-100.083614,-99.476994,-98.420248,-97.325498,-93.349865}
 frlat     | 143222     | {27.759896,29.251454,29.898585,30.093247,31.814071,31.950913,32.055726,32.377503,32.523607,32.607387}
 tolong    | 317744     | {-123.330861,-111.673035,-107.596898,-103.164000,-100.945693,-100.080307,-99.576886,-99.492719,-97.743722,-93.870222}
 tolat     | 278079     | {27.493816,27.904316,29.691644,32.731410,33.350429,34.490563,35.551053,35.868297,39.139185,40.068098}
(40 rows)
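(A footnote on the hand-edit above, for anyone repeating it: the next ANALYZE of rt2 recomputes and overwrites that pg_statistic row, so the forced stadistinct only lasts until then. The value actually stored can be inspected through the pg_stats view, e.g.:

    select attname, n_distinct from pg_stats where tablename = 'rt2' and attname = 'tlid';
)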
pgsql@mohawksoft.com writes: > -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93 > rows=30835 width=302) > Index Cond: ((zipr = 2186) OR (zipl = 2186)) > zipl | 925 | > zipr | 899 | Those n_distinct values for zipl and zipr seem aberrant --- too low compared to the estimated rowcount you're showing. What are the true figures? Also, how about some EXPLAIN ANALYZEs and not just EXPLAINs? regards, tom lane
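(That is, something along the lines of:

    explain analyze select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr = 2186 or zipl=2186);

which prints the actual row counts and times next to the planner's estimates, at the cost of really running the query.)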
On Fri, Feb 04, 2005 at 05:23:27PM -0500, Stephen Frost wrote: > As I recall, the default is 100 for statistics, I'd say increase that > number to like 200 or 300 or more and see if it helps. No, it's 10. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
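(Which matches the ANALYZE VERBOSE output earlier in the thread: the default target of 10 is what produces the 3000-row samples shown there. The current setting can be confirmed with:

    show default_statistics_target;
)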
> pgsql@mohawksoft.com writes:
>> -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
>> rows=30835 width=302)
>> Index Cond: ((zipr = 2186) OR (zipl = 2186))
>
>> zipl | 925 |
>> zipr | 899 |
>
> Those n_distinct values for zipl and zipr seem aberrant --- too low
> compared to the estimated rowcount you're showing. What are the
> true figures? Also, how about some EXPLAIN ANALYZEs and not just
> EXPLAINs?
>

tiger=# select count(distinct(zipr)), count(distinct(zipl)) from rt1 ;
 count | count
-------+-------
 35133 | 35393
(1 row)
Short summary: I had the same problem - since the sort orders of zip codes, counties, city names, and states don't match, the optimizer grossly overestimated the number of pages that would be read. I bet doing a CLUSTER by ZIP would solve that particular query (see the sketch just after this message), but it would break similar queries by county or by city or by state. I think

  select attname, correlation from pg_stats where tablename = 'rt1';

will show you the same problem I had. My pg_stats numbers are shown below. I had a couple of ugly hacks that worked around the problem for myself.

Longer. One interesting property of the TIGER data (and most geospatial databases) is that the data in most of the columns is "locally correlated" but not globally. What I mean by that is that even though the data for any given zip code probably resides in only a few disk pages, the data was probably loaded in order of TIGER file number (state, county), so the "correlation" in pg_stats for zip code was very low. With the low correlation the optimizer couldn't see the fact that any given zip code's data was all together on disk. Because of this, it probably vastly overestimated the number of pages that would be read.

Let me try a concrete example:

 Zip   | City          | State | County        | Street
-------+---------------+-------+---------------+---------
 99501 | Anchorage     | AK    | Anchorage     | 1st St.
 [...]
 94105 | San Francisco | CA    | San Francisco | 1st St
 94105 | San Francisco | CA    | San Francisco | 2nd St
 [... a few more disk pages for 94105 ...]
 [... tens more disk pages for San Francisco ...]
 [... thousands more disk pages for CA ...]
 94501 | Alameda       | CA    | Alameda       | 1st St
 94501 | Alameda       | CA    | Alameda       | 2nd St
 [...]
 02101 | Boston        | MA    | Suffolk       | 1st St.

Note that all the data for any geographical region (zip, or city, or county) is located close together on disk. If I do a query by city, or by state, or by zip, I should probably do an index scan. But the correlation statistic only looks at the total ordering: if we order the table by one column, that column's statistic will look very good, but since the other columns have a different sort order, their correlation statistics will be very poor.

In my copy of the TIGER data (loaded in order of TIGER file number, which is ordered by state and then by county), you can see the various relevant correlation values in my database.

fl=# select attname,correlation from pg_stats where tablename = 'tgr_rt1';
  attname  | correlation
-----------+--------------
 tigerfile |            1
 rt        |            1
 version   |     0.968349
 tlid      |     0.151139
[...]
 zipl      |     0.381979
 zipr      |     0.376332
[...]
 statel    |     0.998373
 stater    |      0.99855
 countyl   |     0.111207
 countyr   |     0.115345
 cousubl   |   -0.0450375
 cousubr   |   -0.0486589
[...]
 placel    |     0.304117
 placer    |     0.306714
 tractl    |     0.141538
 tractr    |     0.134357
 blockl    |   0.00599286
 blockr    | -8.73298e-05
 frlong    |    0.0857337
 frlat     |     0.174396
 tolong    |    0.0857399
 tolat     |     0.174434
(45 rows)

Note that even though the TIGER data is sorted by State/County, "countyl" and "countyr" have some of the worst correlations (in the stats table) because of the way the FIPS codes work. Every state re-cycles the county codes, starting with 1 and going up.

 STATE FIPS CODE    | County FIPS code
--------------------+------------------
 06 (California)    | 001 (Alameda)
 06 (California)    | 003 (Alpine)
 ...
 25 (Massachusetts) | 001 (Barnstable)

I have a hack for 7.4 that sets the numbers hidden in pg_statistic used for correlation to 0.75 for these columns, and the planner started making more reasonable guesses.
In the end, though, I just resorted to uglier hacks that make the planner favor index scans, like setting random_page_cost artificially low.

What I think I'd like to see is another statistic similar to "correlation", but rather than looking at the total ordering of the table, it would look at how correlated the values within any single page are. If someone pointed me in the right direction, I might try doing this.

Ron

PS: I think lots of other data has the same issues. A very large name database ordered by "lastname,firstname" will have all people of a given "firstname" on a relatively small set of pages, but the current correlation value wouldn't see that.

Firstname | Lastname | Middlename
Adam      | Brown    | Albert
Adam      | Brown    | Alex
Adam      | Brown    | Bob
Bill      | Brown    | ....
...
Adam      | Smith    | Albert
Adam      | Smith    | Alex
Adam      | Smith    | Bob
...

Tom Lane wrote:
> pgsql@mohawksoft.com writes:
>
>> -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
>> rows=30835 width=302)
>> Index Cond: ((zipr = 2186) OR (zipl = 2186))
>
>
>> zipl | 925 |
>> zipr | 899 |
>
>
> Those n_distinct values for zipl and zipr seem aberrant --- too low
> compared to the estimated rowcount you're showing. What are the
> true figures? Also, how about some EXPLAIN ANALYZEs and not just
> EXPLAINs?
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>
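(A rough sketch of the CLUSTER-by-ZIP experiment mentioned at the top of Ron's message, written against the rt1 index names posted earlier; CLUSTER physically rewrites the table in index order, takes an exclusive lock and a lot of I/O on a table this size, and, as Ron says, helps the zip-code queries at the expense of the state/county ordering. In the 8.0-era syntax:

    cluster rt1_zipl on rt1;
    analyze rt1;
    select attname, correlation from pg_stats
     where tablename = 'rt1' and attname in ('zipl', 'zipr', 'statel', 'countyl');
)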
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > What I think I'd like to see is for there to be > another statistic similar to "correlation" but rather > than looking at the total-ordering of the table, to > look how correlated values within any single page are. Our current approach to correlation is surely far too simplistic; it was just a quick hack to get *something* in there for this consideration. I don't personally know enough about statistics to do much better :-( regards, tom lane
pgsql@mohawksoft.com writes: > One of the things that is disturbing to me about the analyze settings is > that it wants to sample the same number of records from a table regardless > of the size of that table. The papers that I looked at say that this rule has a good solid statistical foundation, at least for the case of estimating histograms. See the references cited in analyze.c. regards, tom lane
> pgsql@mohawksoft.com writes:
>> One of the things that is disturbing to me about the analyze settings is
>> that it wants to sample the same number of records from a table regardless
>> of the size of that table.
>
> The papers that I looked at say that this rule has a good solid
> statistical foundation, at least for the case of estimating histograms.
> See the references cited in analyze.c.

Any and all random sampling assumes a degree of uniform distribution. This is the basis of the model. It assumes that chunks of the whole will be representative of the whole (to some degree). This works when normal variations are more or less distributed uniformly. As variations and trends become less uniformly distributed, more samples are required to characterize them.

Douglas Adams had a great device called the "Total Perspective Vortex" which inferred the whole of the universe from a piece of fairy cake. It was a subtle play on the absurd notion that a very small sample could lead to an understanding of an infinitely larger whole.

On a very basic level, why bother sampling the whole table at all? Why not check one block and infer all information from that? Because we know that isn't enough data. In a table of 4.6 million rows, can you say with any mathematical certainty that a sample of 100 points can be, in any way, representative?

Another problem with random sampling is trend analysis. Oftentimes there are minor trends in data. Ron pointed out the lastname/firstname trend. Although there seems to be no correlation between first names in the table, there are clearly groups or clusters of ordered data, an ordering that is missed by too small a sample.

I understand why you chose the Vitter algorithm, because it provides a basically sound methodology for sampling without knowledge of the size of the whole, but I think we can do better. I would suggest using the current algorithm the first time through, then adjusting the number of samples [n] based on the previous estimate of the size of the table [N]. Each successive ANALYZE will become more accurate. The Vitter algorithm is still useful as [N] will always be an estimate.
pgsql@mohawksoft.com writes:
> On a very basic level, why bother sampling the whole table at all? Why not
> check one block and infer all information from that? Because we know that
> isn't enough data. In a table of 4.6 million rows, can you say with any
> mathematical certainty that a sample of 100 points can be, in any way,
> representative?

This is a statistical argument, not a rhetorical one, and I'm not going to bother answering handwaving. Show me some mathematical arguments for a specific sampling rule and I'll listen.

regards, tom lane
> pgsql@mohawksoft.com writes:
>> On a very basic level, why bother sampling the whole table at all? Why not
>> check one block and infer all information from that? Because we know that
>> isn't enough data. In a table of 4.6 million rows, can you say with any
>> mathematical certainty that a sample of 100 points can be, in any way,
>> representative?
>
> This is a statistical argument, not a rhetorical one, and I'm not going
> to bother answering handwaving. Show me some mathematical arguments for
> a specific sampling rule and I'll listen.
>

Tom, I am floored by this response; I am shaking my head in disbelief.

It is inarguable that increasing the sample size increases the accuracy of a study, especially when the diversity of the subject is unknown. It is known that reducing a sample size increases the probability of error in any poll or study. The required sample size depends on the variance of the whole. It is mathematically unsound to ASSUME any sample size is valid without understanding the standard deviation of the set.

http://geographyfieldwork.com/MinimumSampleSize.htm

Again, I understand why you used the Vitter algorithm, but it has been proven insufficient (as used) with the US Census TIGER database. We understand this because we have seen that the random sampling as implemented has insufficient information to properly characterize the variance in the data.
On Mon, Feb 07, 2005 at 11:27:59 -0500, pgsql@mohawksoft.com wrote:
>
> It is inarguable that increasing the sample size increases the accuracy of
> a study, especially when the diversity of the subject is unknown. It is known
> that reducing a sample size increases the probability of error in any poll or
> study. The required sample size depends on the variance of the whole. It
> is mathematically unsound to ASSUME any sample size is valid without
> understanding the standard deviation of the set.

For large populations, the accuracy of estimates of statistics based on random samples from that population is not very sensitive to population size; it depends primarily on the sample size. So you would not expect to need to use larger sample sizes on larger data sets, once the data sets are over some minimum size.
> On Mon, Feb 07, 2005 at 11:27:59 -0500, pgsql@mohawksoft.com wrote:
>>
>> It is inarguable that increasing the sample size increases the accuracy of
>> a study, especially when the diversity of the subject is unknown. It is known
>> that reducing a sample size increases the probability of error in any poll or
>> study. The required sample size depends on the variance of the whole. It
>> is mathematically unsound to ASSUME any sample size is valid without
>> understanding the standard deviation of the set.
>
> For large populations, the accuracy of estimates of statistics based on random
> samples from that population is not very sensitive to population size; it
> depends primarily on the sample size. So you would not expect to need to use
> larger sample sizes on larger data sets, once the data sets are over some
> minimum size.

That assumes a fairly low standard deviation. If the standard deviation is low, then a minimal sample size works fine. If there was zero deviation in the data, then a sample of one works fine. If the standard deviation is high, then you need more samples. If you have a high standard deviation and a large data set, you need more samples than you would need for a smaller data set.

In the current implementation of analyze.c, the default is 100 samples. On a table of 10,000 rows, that is probably a good enough number to characterize the data for the query optimizer (a 1% sample). For a table with 4.6 million rows, that's less than 0.002%.

Think about an irregularly occurring event, unevenly distributed throughout the data set. A randomized sample strategy normalized across the whole data set with too few samples will mischaracterize the event or even miss it altogether.
On Mon, Feb 07, 2005 at 13:28:04 -0500,
> >
> > For large populations, the accuracy of estimates of statistics based on
> > random samples from that population is not very sensitive to population
> > size; it depends primarily on the sample size. So you would not expect to
> > need to use larger sample sizes on larger data sets, once the data sets
> > are over some minimum size.
>
> That assumes a fairly low standard deviation. If the standard deviation is
> low, then a minimal sample size works fine. If there was zero deviation in
> the data, then a sample of one works fine.

This doesn't assume a low standard deviation. That wouldn't make sense anyway, since the standard deviation depends on the units used for the measurements.

> In the current implementation of analyze.c, the default is 100 samples. On
> a table of 10,000 rows, that is probably a good enough number to characterize
> the data for the query optimizer (a 1% sample). For a table with 4.6
> million rows, that's less than 0.002%.

The fraction of rows isn't relevant unless it is a large fraction of the total, in which case you can use a smaller sample than you might otherwise.

> Think about an irregularly occurring event, unevenly distributed throughout
> the data set. A randomized sample strategy normalized across the whole
> data set with too few samples will mischaracterize the event or even miss
> it altogether.

What you are saying here is that if you want more accurate statistics, you need to sample more rows. That is true. However, the size of the sample is essentially only dependent on the accuracy you need and not the size of the population, for large populations.
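(To put a formula behind that claim: for a proportion p estimated from a simple random sample of n rows out of a table of N rows, the standard error is approximately

    sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

The second factor is the finite population correction; for n = 3000 and N = 46 million it is already about 0.99997, effectively 1, so making N larger changes the error hardly at all and only n matters. This is the standard textbook result for histogram/frequency-style estimates; as discussed later in the thread, it does not carry over to estimating the number of distinct values.)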
Maybe I am missing something - ISTM that you can increase your statistics target for those larger tables to obtain a larger (i.e. better) sample.

regards

Mark

pgsql@mohawksoft.com wrote:
>> pgsql@mohawksoft.com writes:
> Any and all random sampling assumes a degree of uniform distribution. This
> is the basis of the model. It assumes that chunks of the whole will be
> representative of the whole (to some degree). This works when normal
> variations are more or less distributed uniformly. As variations and
> trends become less uniformly distributed, more samples are required to
> characterize them.
>
> Douglas Adams had a great device called the "Total Perspective Vortex"
> which inferred the whole of the universe from a piece of fairy cake. It was
> a subtle play on the absurd notion that a very small sample could lead to
> an understanding of an infinitely larger whole.
>
> On a very basic level, why bother sampling the whole table at all? Why not
> check one block and infer all information from that? Because we know that
> isn't enough data. In a table of 4.6 million rows, can you say with any
> mathematical certainty that a sample of 100 points can be, in any way,
> representative?
>
> Another problem with random sampling is trend analysis. Oftentimes there
> are minor trends in data. Ron pointed out the lastname/firstname trend.
> Although there seems to be no correlation between first names in the table,
> there are clearly groups or clusters of ordered data, an ordering
> that is missed by too small a sample.
>
> I understand why you chose the Vitter algorithm, because it provides a
> basically sound methodology for sampling without knowledge of the size of
> the whole, but I think we can do better. I would suggest using the current
> algorithm the first time through, then adjusting the number of samples [n]
> based on the previous estimate of the size of the table [N]. Each
> successive ANALYZE will become more accurate. The Vitter algorithm is
> still useful as [N] will always be an estimate.
>
> On Mon, Feb 07, 2005 at 13:28:04 -0500,
>
> What you are saying here is that if you want more accurate statistics, you
> need to sample more rows. That is true. However, the size of the sample
> is essentially only dependent on the accuracy you need and not the size
> of the population, for large populations.
>

That's nonsense.

If your total data size is 100 elements in a set, then a sample size of 100 elements will cover 100% of your data. If your total data size is 10,000 elements in a set, then a sample size of 100 elements will cover 1% of your data.

In the case of the TIGER database, a base of 100 samples means about .002% of the data is sampled. Think about that: that is an average of 1 sample about every 50,000 records. You could have substantial but irregular trends in the data that may never get detected, and this is EXACTLY what we see.

If we increase the sample size (targrows), the statistics suddenly work better. For instance, look at the data below. The first analyze / select from pg_stats is with an analyze of 3000 samples. The zipl and zipr columns get calculated poorly and can cause the planner to use a table scan instead of an index scan. The second analyze / select from pg_stats is with an analyze of 10000 samples. The zipl and zipr n_distinct values are still off by a factor of 10, but close enough for the planner to deal.

If the premise is that sample size doesn't make a difference, I think we've proved that this is not true.

tiger=# analyze verbose rt1;
INFO:  analyzing "public.rt1"
INFO:  "rt1": scanned 3000 of 1527360 pages, containing 90978 live rows and 0 dead rows; 3000 rows in sample, 46318719 estimated total rows
ANALYZE
tiger=# select * from pg_stats where tablename = 'rt1' and attname like 'zip%';

 schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+-----------+---------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 public | rt1 | zipl | 0.672 | 4 | 960 | {76240,52601,55746,71730,74604,92705,93117,95818} | {0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333} | {1085,16652,28206,33412,43147,49428,58801,68110,77515,91340,99006} | -0.119519
 public | rt1 | zipr | 0.677 | 4 | 960 | {76240,52601,55746,71730,74604,78577,92705,93117,95818} | {0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333} | {962,15613,28572,33606,43545,49428,60423,68064,77040,91340,99006} | -0.104158
(2 rows)

Now this:

tiger=# analyze verbose rt1;
INFO:  analyzing "public.rt1"
INFO:  "rt1": scanned 10000 of 1527360 pages, containing 303419 live rows and 0 dead rows; 10000 rows in sample, 46343004 estimated total rows
ANALYZE
tiger=# select * from pg_stats where tablename = 'rt1' and attname like 'zip%';

 schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+-----------+---------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 public | rt1 | zipl | 0.6807 | 4 | 2942 | {61832,13090,17404,30907,31204,45342,47714,63050,80918,93726} | {0.0008,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} | {654,15018,28208,33870,43006,49008,59741,68803,78640,92105,99687} | -0.137744
 public | rt1 | zipr | 0.684 | 4 | 2921 | {13090,61832,30907,31204,45342,47714,63050,70122,80918,93726} | {0.0006,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} | {731,14824,27871,33324,42276,48895,58401,68338,78575,92105,99654} | -0.140663
(2 rows)
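(For reference, assuming the sample size scales with the statistics target the way the earlier output suggests, i.e. target 10 giving 3000 rows, a per-column target in the mid-30s should yield roughly the same 10000-row sample without patching targrows in analyze.c; the 34 below is just that illustrative arithmetic:

    alter table rt1 alter column zipl set statistics 34;
    alter table rt1 alter column zipr set statistics 34;
    analyze rt1;
)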
On Mon, Feb 07, 2005 at 05:16:56PM -0500, pgsql@mohawksoft.com wrote: > > On Mon, Feb 07, 2005 at 13:28:04 -0500, > > > > What you are saying here is that if you want more accurate statistics, you > > need to sample more rows. That is true. However, the size of the sample > > is essentially only dependent on the accuracy you need and not the size > > of the population, for large populations. > > > That's nonsense. Huh, have you studied any statistics? -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "At least to kernel hackers, who really are human, despite occasional rumors to the contrary" (LWN.net)
> Maybe I am missing something - ISTM that you can increase your
> statistics target for those larger tables to obtain a larger (i.e.
> better) sample.

No one is arguing that you can't manually do things, but I am not the first to notice this. I saw the query planner doing something completely stupid and set off to discover why.

Think about the person using PostgreSQL for the first time. He/she does not know about this stuff. Even if they've read the FAQs and the manual cover to cover, it will take them some time to figure out how it all works together. PostgreSQL is a big system, and this is exactly why MySQL gets better marks from newbies.

In this case, the behavior observed could be changed by altering the sample size for a table. I submit that an arbitrary fixed sample size is not a good base for the analyzer, but that the sample size should be based on the size of the table or some calculation of its deviation.

There is no reason why old stats can't be used to create more accurate stats. Using successive analyze operations, we could create better statistics for the planner. We can increase the sample size based on the table size. We could, I suppose, also calculate some sort of deviation statistic so that "n_distinct" can be calculated better with a smaller sample set.

The basic problem, though, is that PostgreSQL performed incorrectly on a simple query after indexes were created and analyze performed. Yes, it can be corrected, that's what led me to my conclusions, but shouldn't we try to devise a better system in the future to improve PostgreSQL so it does not need this sort of tuning?

>
> regards
>
> Mark
>
> pgsql@mohawksoft.com wrote:
>>> pgsql@mohawksoft.com writes:
>> Any and all random sampling assumes a degree of uniform distribution. This
>> is the basis of the model. It assumes that chunks of the whole will be
>> representative of the whole (to some degree). This works when normal
>> variations are more or less distributed uniformly. As variations and
>> trends become less uniformly distributed, more samples are required to
>> characterize them.
>>
>> Douglas Adams had a great device called the "Total Perspective Vortex"
>> which inferred the whole of the universe from a piece of fairy cake. It was
>> a subtle play on the absurd notion that a very small sample could lead to
>> an understanding of an infinitely larger whole.
>>
>> On a very basic level, why bother sampling the whole table at all? Why not
>> check one block and infer all information from that? Because we know that
>> isn't enough data. In a table of 4.6 million rows, can you say with any
>> mathematical certainty that a sample of 100 points can be, in any way,
>> representative?
>>
>> Another problem with random sampling is trend analysis. Oftentimes there
>> are minor trends in data. Ron pointed out the lastname/firstname trend.
>> Although there seems to be no correlation between first names in the table,
>> there are clearly groups or clusters of ordered data, an ordering
>> that is missed by too small a sample.
>>
>> I understand why you chose the Vitter algorithm, because it provides a
>> basically sound methodology for sampling without knowledge of the size of
>> the whole, but I think we can do better. I would suggest using the current
>> algorithm the first time through, then adjusting the number of samples [n]
>> based on the previous estimate of the size of the table [N]. Each
>> successive ANALYZE will become more accurate.
>> The Vitter algorithm is still useful as [N] will always be an estimate.
>>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>
Tom Lane <tgl@sss.pgh.pa.us> writes: > The papers that I looked at say that this rule has a good solid > statistical foundation, at least for the case of estimating histograms. Earlier I was discussing the issue of how to measure cross-column "correlation" with someone who works on precisely these types of problems. He immediately honed in on precisely this issue of what kinds of deductions can be extrapolated from a constant sample like this and what kind of deductions would require a linear sample, or worse: can't be reasonably extrapolated at all. After the discussion with him I'm left with a distinctly uneasy feeling about some of Postgres's stats. I was planning to investigate further before I wrote about this though, since I'm not sure exactly where they all stand. The histograms are indeed on solid footing, but I'm not so certain that's true for some of the other bits of statistics. The statistical foundation for histograms is the traditional sampling from a population that supports every opinion poll you see in the news. Those opinion polls only need a fixed size sample to obtain a given confidence interval for almost any population size. Postgres's histograms are just describing the distribution in a similar way. However for discrete values like the "top ten most common values" and the "total number of distinct values" it's not so clear at all that you can extrapolate from a sample at all. And it's certainly not clear that a fixed size sample gives you at all the same confidence for a large population as it does for a small population. If you sampled across the country and found your sample of 600 people had 4 different top choices for president, how do you extrapolate that to calculate the total number of top choices for president the 300 million+ people will have across the country? You could multiply by 500k but that's not going to be right. Hell you're probably better off just going with "4" but that's clearly not right either. The reality is that your sample hasn't really given you *any* information on how frequently you'll find outliers in the distribution. There are probably a couple hundred or a couple thousand values that occur only once or twice, but the sample size you would have to use to get a good estimate of how many there were will be proportional to the entire population, *unlike the histograms*. The essential point here is that to count "discrete values" you only care whether they occur or not at all, not how many times they occur. So you lose the benefit of being able to model the population as "large", the values as continuous, and the distribution as a normal curve. So you can't really talk about a 95% confidence interval any more and every outlier becomes as important as the more common values. -- greg
> On Mon, Feb 07, 2005 at 05:16:56PM -0500, pgsql@mohawksoft.com wrote:
>> > On Mon, Feb 07, 2005 at 13:28:04 -0500,
>> >
>> > What you are saying here is that if you want more accurate statistics,
>> > you need to sample more rows. That is true. However, the size of the
>> > sample is essentially only dependent on the accuracy you need and not
>> > the size of the population, for large populations.
>> >
>> That's nonsense.
>
> Huh, have you studied any statistics?

To what aspects of "statistics" are you referring? I was not a math major, no, but I did have my obligatory classes as well as algorithms and so on. I've only worked in the industry for over 20 years. I've worked with statistical analysis of data on multiple projects, ranging from medical instruments, compression, encryption, and web based recommendation systems. I assume "Huh, have you studied any statistics?" was a call for qualifications.

And yes, some real math major would be helpful in this discussion because clearly there is a disconnect.

The basic problem with a fixed sample is that it assumes a normal distribution. If data variation is evenly distributed across a set, then a sample of sufficient size would be valid for almost any data set. That isn't what I'm saying. If the data variation is NOT uniformly distributed across the data set, the sample size has to be larger because there is "more" data.

I think I can explain with a visual. I started my career as an electrical engineer and took an experimental class called "computer science." Sure, it was a long time ago, but bear with me.

When you look at a sine wave on an oscilloscope, you can see it clear as day. When you look at music on the scope, you know there are many waves there, but it is difficult to make heads or tails of it. (Use xmms or winamp to see for yourself.) The waves change in frequency, amplitude, and duration over a very large scale. That's why you use a spectrum analyzer to go from time domain to frequency domain. In frequency domain, you can see the trends better.

This is the problem we are having. Currently, the analyze.c code is assuming a very regular data set, in which case almost any size sample would work fine. What we see when we use it to analyze complex data sets is that it can't characterize the data well. The solution is to completely change the statistics model to handle complex and unpredictably changing trends, or to increase the sample size.
pgsql@mohawksoft.com wrote:
>
> In this case, the behavior observed could be changed by altering the
> sample size for a table. I submit that an arbitrary fixed sample size is
> not a good base for the analyzer, but that the sample size should be based
> on the size of the table or some calculation of its deviation.
>

I can see your point, however I wonder if the issue is that the default stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and maybe we should consider making a higher value (say '100') the default.

> There is no reason why old stats can't be used to create more accurate
> stats. Using successive analyze operations, we could create better
> statistics for the planner. We can increase the sample size based on the
> table size. We could, I suppose, also calculate some sort of deviation
> statistic so that "n_distinct" can be calculated better with a smaller
> sample set.

The idea of either automatically increasing the sample size for large tables, or doing a few more samplings with different sizes and examining the stability of the estimates, is rather nice, provided we can keep the runtime for ANALYZE to reasonable limits, I guess :-)

>
> The basic problem, though, is that PostgreSQL performed incorrectly on a
> simple query after indexes were created and analyze performed. Yes, it can
> be corrected, that's what led me to my conclusions, but shouldn't we try
> to devise a better system in the future to improve PostgreSQL so it does
> not need this sort of tuning?
>

Thanks for clarifying.

best wishes

Mark
pgsql@mohawksoft.com writes:

> The basic problem with a fixed sample is that it assumes a normal
> distribution.

That's sort of true, but not in the way you think it is. What's under discussion here, w.r.t. the histograms and the validity of sampling, is really basic statistics. You don't have to be a math major; in fact math majors tend to avoid the classes that teach this because it's full of social sciences students :)

It doesn't matter what distribution the variable you're sampling has. In fact if you think of most applications of sampling, whether pre-election opinion polls or a survey about how often people brush their teeth, you very rarely find a real normal distribution. Usually the whole point of the survey is to determine what the distribution is. That's exactly what Postgres is recording with its histograms.

Where the normal curve comes into play is in measuring the likelihood of the actual population distribution varying from the sample distribution. The most likely population distribution is exactly equal to the sample distribution. Distributions slightly different from the sample distribution are slightly less likely, and the further removed from the sample distribution, the less and less likely. This drops off in precisely the shape of a normal curve. For large populations this is *always* true, provably; it doesn't matter what distribution the raw data or your sample has. The likelihood of the sample diverging from the population will look like a normal curve.

The stddev of this normal curve represents how precisely you can describe the distribution of the population based on your sample. The smaller the sample, the larger the stddev of this normal curve, and the larger the probability that the population diverges from the sample you drew.

To calculate the stddev there's a complex formula that most scientific calculators have preprogrammed. This function does *not* depend on the distribution of the raw data (which you don't know in any case). There are two versions of this function though: one for "small" populations where the sample is a significant percentage of the population. As you point out, if you're sampling 50% of the entire population then yes, you can have more confidence in your results than with .002% of the entire population.

However, once you're talking about populations like an entire state or country then it's simpler to just calculate the confidence as if the population is "large". In that case it no longer matters whether the sample is .002% or .0002% of the population. If it's 600 people then you have the "+/- 3% 19 times out of 20" confidence interval that all those opinion polls you see put in their fine print. To achieve that they only have to poll 600 people, whether they're polling a city, a state, or the entire country.

> When you look at a sine wave on an oscilloscope, you can see it clear as
> day. When you look at music on the scope, you know there are many waves
> there, but it is difficult to make heads or tails of it. (Use xmms or
> winamp to see for yourself.) The waves change in frequency, amplitude, and
> duration over a very large scale. That's why you use a spectrum analyzer
> to go from time domain to frequency domain. In frequency domain, you can
> see the trends better.

That's not a bad analogy to many problems where you're measuring data that has non-randomness in it but that is not visible in the domain of the statistics being analyzed. This seems to happen a lot with geographic data, for instance.
If you find that increasing the stats targets improves things then this isn't true. If you find that it doesn't then what's really needed is a cleverer set of statistics to look for. -- greg
> pgsql@mohawksoft.com wrote:
>>
>> In this case, the behavior observed could be changed by altering the
>> sample size for a table. I submit that an arbitrary fixed sample size is
>> not a good base for the analyzer, but that the sample size should be based
>> on the size of the table or some calculation of its deviation.
>>
>
> Mark,
>
> Do you have any evidence that the Sample Size had anything to do
> with the performance problem you're seeing?

I have evidence: if you look through some of the messages in this thread, you'll see how a sample size of 10000 provides enough data points to create stats the planner can use.

>
> I also do a lot with the complete Census/TIGER database.

Cool, have any code for Mapserver?

>
> Every problem I have with the optimizer comes down to the
> fact that the data is loaded (and ordered on disk) by
> State/County FIPS codes, and then queried by zip-code
> or by city name. Like this:
>
> Alabama 36101 [hundreds of pages with zip's in 36***]
> Alaska  99686 [hundreds of pages with zip's in 9****]
> Arizona 85701 [hundreds of pages with zip's in 855**]
>
> Note that the zip codes are *NOT* sequential.
>
> The "correlation" statistic sees that the Zip codes are not
> sequential; so it makes the *HORRIBLE* assumption that they
> are scattered randomly across the disk.

It is my theory that this is because there are too few data points with which to properly characterize the nature of the data.

>
> In reality, even though there's no total ordering of the
> zip codes, any given zip code only exists on a couple of
> disk pages, so index scans would be the right choice.

I totally agree.

>
> But the single correlation parameter is not sufficient
> to let the optimizer know this.
>
> No matter how large a sample size you choose, ANALYZE
> will correctly see that Zip codes and State FIPS codes
> are non-correlated, and the optimizer will overestimate
> the # of pages an index scan will need.
>

I tried to create an analogy in another post, and TIGER is a perfect example of the analogy. Think of the difference between an oscilloscope and a spectrum analyzer. The current sampling code works more like an oscilloscope. It assumes a fairly normalized distribution of data. Given this, it works perfectly fine.

When a scope is presented with an audio signal, it looks more like gibberish, showing almost no correlation. When you view it in frequency domain, as with a spectrum analyzer, you can see clear patterns in the signal.

Now, fortunately, we don't need any sort of absolute visualization of the data in TIGER; we only need to see that the data has many subtle trends rather than one fairly evenly distributed one. That's why more samples work.

If we could do anything, I would add more statistics to the database: a standard deviation and maybe a sliding window deviation. A standard deviation might be pretty high, whereas a sliding window would show less localized deviation. Less localized deviation would favor index scans.

Anyway, like I said, I think the expectation that the data is fairly normalized or evenly distributed works very well for data acquired over time. It is data like TIGER, which is in a multiple field order, i.e. state, zipr, zipl, that has complex patterns for the secondary sorts that can't be detected with too small a sample.

> PS: I pointed out workarounds in my earlier posting
> in this thread. Yes, I'm using the same TIGER data
> you are.
>

Cool.
* Mark Kirkwood (markir@coretech.co.nz) wrote: > I can see your point, however I wonder if the issue is that the default > stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and > maybe we should consider making a higher value (say '100') the default. Personally, I think that'd be reasonable. > The idea of either automatically increasing sample size for large > tables, or doing a few more samplings with different sizes and examining > the stability of the estimates is rather nice, provided we can keep the > runtime for ANALYZE to reasonable limits, I guess :-) I also agree with this and personally don't mind *too* much if analyze takes a little while on a large table to get decent statistics for it. One thing I was wondering about though is if we use the index to get some of the statistics information? Would it be possible, or reasonable? Do we already? I dunno, just some thoughts there, I keep hearing about the number of rows that are sampled and I would have thought it'd make sense to scan the index for things like the number of distinct values... Stephen
> pgsql@mohawksoft.com writes:
>
>> The basic problem with a fixed sample is that it assumes a normal
>> distribution.
>
> That's sort of true, but not in the way you think it is.
> [snip]

Greg, I think you have an excellent ability to articulate stats, but I think the view that this is like election polling is incorrect. Election polling assumes a very simple outcome: some standard distribution of a limited number of options. I don't think it applies to this.

>
>> When you look at a sine wave on an oscilloscope, you can see it clear as
>> day. When you look at music on the scope, you know there are many waves
>> there, but it is difficult to make heads or tails of it. (Use xmms or
>> winamp to see for yourself.) The waves change in frequency, amplitude, and
>> duration over a very large scale. That's why you use a spectrum analyzer
>> to go from time domain to frequency domain. In frequency domain, you can
>> see the trends better.
>
> That's not a bad analogy to many problems where you're measuring data that
> has non-randomness in it but that is not visible in the domain of the
> statistics being analyzed. This seems to happen a lot with geographic
> data, for instance.

EXACTLY!!!

>
> If you find that increasing the stats targets improves things then this
> isn't true. If you find that it doesn't then what's really needed is a
> cleverer set of statistics to look for.

I will be the first one to say that increasing the samples is not perfect, but it is a methodology that will help without major changes in postgres. Simply increasing the samples to a percentage of the estimated number of rows (with some upper and lower limits, of course) will increase the accuracy of the "n_distinct" and "correlation" settings (at least a little bit), and that will make a huge impact with very little work.

If we want to discuss improved statistics, then we should include a standard deviation and a sliding window deviation, or something like that. Hell, maybe even FFT. The basic problem, I think, is that the sampling mechanism is more like an oscilloscope looking for large trends instead of a spectrum analyzer looking for the smaller ones. We have to be able to tell the planner that adjacent values are less random even though, as a whole, they are seemingly random.
On Mon, Feb 07, 2005 at 17:45:23 -0500, Greg Stark <gsstark@mit.edu> wrote:
>
> However for discrete values like the "top ten most common values" and the
> "total number of distinct values" it's not so clear at all that you can
> extrapolate from a sample at all. And it's certainly not clear that a fixed
> size sample gives you at all the same confidence for a large population as it
> does for a small population.

If you were to keep a complete histogram for the sample points (which may or may not be practical) you should be able to estimate the number of distinct values under some assumptions. Such as, that all values outside of the top N values have the same likelihood. I don't think this is unreasonable.

>
> If you sampled across the country and found your sample of 600 people had 4
> different top choices for president, how do you extrapolate that to calculate
> the total number of top choices for president the 300 million+ people will
> have across the country? You could multiply by 500k but that's not going to be
> right. Hell you're probably better off just going with "4" but that's clearly
> not right either.

Well, you can put some bounds on the number. Since no 5th candidate was picked by any of the 600 people, you would expect that the number of people preferring other candidates is on the order of 500000 or less, so the number of distinct values is going to be 500000 or less.

I think the histogram idea will work well for estimating the number of rows with particular values, but if you are interested in the number of unique values, you are going to have problems with some data sets. (Ones with a few very common values and lots of extremely rare items.) In this case there may be some way to use information from indexes on the data to get better results.
Mark, Stephen, etc:

> > I can see your point, however I wonder if the issue is that the default
> > stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and
> > maybe we should consider making a higher value (say '100') the default.
>
> Personally, I think that'd be reasonable.

I don't really think that we want to increase sampling for *all* columns. Each column that doesn't get queried but has increased stats adds to the size of the pg_stats table and the amount of vacuuming it needs, decreasing performance of stats lookups. Especially when very large text and BYTEA columns are involved.

Neil and I talked about setting a separate GUC, default_index_stats_target, for 8.0 that would dramatically increase the stats sampling for indexed columns only (say, to 250). Unfortunately, I was unable to come up with a good test case for setting this value.

Perhaps the Tiger database would allow us to set up a good test case. Does anyone have a well-indexed PG Tiger I could play with?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
> Mark, Stephen, etc:
>
>>> I can see your point, however I wonder if the issue is that the default
>>> stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and
>>> maybe we should consider making a higher value (say '100') the default.
>>
>> Personally, I think that'd be reasonable.
>
> I don't really think that we want to increase sampling for *all* columns.
> Each column that doesn't get queried but has increased stats adds to the
> size of the pg_stats table and the amount of vacuuming it needs, decreasing
> performance of stats lookups. Especially when very large text and BYTEA
> columns are involved.
>
> Neil and I talked about setting a separate GUC, default_index_stats_target,
> for 8.0 that would dramatically increase the stats sampling for indexed
> columns only (say, to 250). Unfortunately, I was unable to come up with a
> good test case for setting this value.
>
> Perhaps the Tiger database would allow us to set up a good test case. Does
> anyone have a well-indexed PG Tiger I could play with?

Hey, I can give you a copy of RT1 which is fine, but it is 1.1G compressed. I'd have to mail you a DVD.
pgsql@mohawksoft.com wrote:
>
> In this case, the behavior observed could be changed by altering the
> sample size for a table. I submit that an arbitrary fixed sample size is
> not a good base for the analyzer, but that the sample size should be based
> on the size of the table or some calculation of its deviation.
>

Mark,

Do you have any evidence that the Sample Size had anything to do with the performance problem you're seeing?

I also do a lot with the complete Census/TIGER database.

Every problem I have with the optimizer comes down to the fact that the data is loaded (and ordered on disk) by State/County FIPS codes, and then queried by zip-code or by city name. Like this:

 Alabama 36101 [hundreds of pages with zip's in 36***]
 Alaska  99686 [hundreds of pages with zip's in 9****]
 Arizona 85701 [hundreds of pages with zip's in 855**]

Note that the zip codes are *NOT* sequential.

The "correlation" statistic sees that the Zip codes are not sequential; so it makes the *HORRIBLE* assumption that they are scattered randomly across the disk. In reality, even though there's no total ordering of the zip codes, any given zip code only exists on a couple of disk pages, so index scans would be the right choice.

But the single correlation parameter is not sufficient to let the optimizer know this.

No matter how large a sample size you choose, ANALYZE will correctly see that Zip codes and State FIPS codes are non-correlated, and the optimizer will overestimate the # of pages an index scan will need.

Ron

PS: I pointed out workarounds in my earlier posting in this thread. Yes, I'm using the same TIGER data you are.
I wrote a message called "One Big trend vs multiple smaller trends in table
statistics" that, I think, explains what we've been seeing.

> pgsql@mohawksoft.com wrote:
>>
>> In this case, the behavior observed could be changed by altering the
>> sample size for a table. I submit that an arbitrary fixed sample size
>> is not a good base for the analyzer, but that the sample size should be
>> based on the size of the table or some calculation of its deviation.
>
> Mark,
>
> Do you have any evidence that the Sample Size had anything to do
> with the performance problem you're seeing?

Sample size is only a band-aid for the issue; however, more samples always
provide more information.

> I also do a lot with the complete Census/TIGER database.
>
> Every problem I have with the optimizer comes down to the
> fact that the data is loaded (and ordered on disk) by
> State/County FIPS codes, and then queried by zip-code
> or by city name.
>
> Note that the zip codes are *NOT* sequential.

Again, read "One Big Trend..." and let me know what you think. I think it
describes exactly the problem that we see.

For now, the solution that works for me is to seriously up the value of
"targrows" in analyze.c. It makes ANALYZE take longer, and while the stats
are not "correct" because they are not designed to detect these sorts of
patterns, a larger sample allows them to be "less wrong" enough to give a
better hint to the planner.
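A rebuild isn't strictly required to test the larger-sample theory: ANALYZE
samples roughly 300 times the statistics target rows per column, so raising
the target for a session has much the same effect as patching targrows. A
sketch using the table names from this thread (1000 is the maximum target
in 8.0):

    SET default_statistics_target = 1000;  -- ~300,000 sampled rows per column
    ANALYZE VERBOSE rt1;                   -- the INFO lines will report a much
    ANALYZE VERBOSE rt2;                   -- larger sample than the default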
Josh Berkus wrote:
> Neil and I talked about setting a separate GUC, default_index_stats_target,
> for 8.0 that would dramatically increase the stats sampling for indexed
> columns only (say, to 250). Unfortunately, I was unable to come up with a
> good test case for setting this value.

Is default_index_stats_target a TODO item? Seems we already have cases
where our bucket size isn't large enough, and this would help.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> Mark,
>
>> Hey, I can give you a copy of RT1, which is fine, but it is 1.1G
>> compressed. I'd have to mail you a DVD.
>
> Sure, cool.
> [address info snipped]

I would be willing to send a couple of DVDs (on a regular basis) to anyone
who is able to post this on a good mirror that anyone could get at.

I can send the US Census database, fully loaded and in compressed pg_dump
format. I also have the latest CD database from www.freedb.org.

I would love to keep these things current for PG development, but my
company's server is on a plan that gets 1G free and is billed after that.
Also, I am on a broadband line at my office, and uploading the data would
take days.
On Wed, Feb 09, 2005 at 07:30:16PM -0500, pgsql@mohawksoft.com wrote:
> I would love to keep these things current for PG development, but my
> company's server is on a plan that gets 1G free and is billed after that.
> Also, I am on a broadband line at my office, and uploading the data would
> take days.

Maybe post it as a Torrent?

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
> On Wed, Feb 09, 2005 at 07:30:16PM -0500, pgsql@mohawksoft.com wrote:
>> I would love to keep these things current for PG development, but my
>> company's server is on a plan that gets 1G free and is billed after
>> that. Also, I am on a broadband line at my office, and uploading the
>> data would take days.
>
> Maybe post it as a Torrent?

That still doesn't answer the question of the initial upload.
On Wed, Feb 09, 2005 at 07:30:16PM -0500, pgsql@mohawksoft.com wrote:
> I would be willing to send a couple of DVDs (on a regular basis) to
> anyone who is able to post this on a good mirror that anyone could
> get at.
>
> I can send the US Census database, fully loaded and in compressed
> pg_dump format. I also have the latest CD database from
> www.freedb.org.

How about putting up a Bittorrent? I'd be delighted to stick one up on
http://bt.postgresql.org/ if that's OK with you.

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100    mobile: +1 415 235 3778

Remember to vote!
Probably off-topic, but I think it's worth seeing what astronomers are
doing with their very big spatial databases. For example, we are working
with a catalog of more than 500,000,000 rows, and we use a special
transformation of coordinates to integer numbers that preserves object
closeness. I hope we can show that PostgreSQL is good enough to be used in
astronomy for very big catalogs; currently, MS SQL is in use. See
http://www.sdss.jhu.edu/htm/ for details. We use another technique.

    Oleg

On Wed, 9 Feb 2005 pgsql@mohawksoft.com wrote:
> For now, the solution that works for me is to seriously up the value of
> "targrows" in analyze.c. It makes ANALYZE take longer, and while the
> stats are not "correct" because they are not designed to detect these
> sorts of patterns, a larger sample allows them to be "less wrong" enough
> to give a better hint to the planner.

Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
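The HTM scheme Oleg links to is far more sophisticated than anything shown
here, but the basic idea of mapping coordinates to a single indexable
integer can be sketched in plain SQL against the TIGER rt1 columns
frlat/frlong. The index name rt1_gridcell and the example point are made up
for illustration, and this crude "grid cell" key only preserves locality
within one latitude band; HTM or a Morton/Z-order bit interleave preserves
two-dimensional closeness much better.

    -- hypothetical expression index on a 0.01-degree grid cell id
    CREATE INDEX rt1_gridcell ON rt1
        ((floor((frlat  +  90) * 100)::int * 36000
        + floor((frlong + 180) * 100)::int));

    -- rows whose start point falls in the same cell as (42.25, -71.10)
    SELECT count(*)
      FROM rt1
     WHERE floor((frlat  +  90) * 100)::int * 36000
         + floor((frlong + 180) * 100)::int
         = floor(( 42.25 +  90) * 100)::int * 36000
         + floor((-71.10 + 180) * 100)::int;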
What's the purpose of doing this transformation? Is it just a means to
sub-divide the dataset? It's very possible that PostGIS would do just as
good a job without using HTM. Granted, GIS is designed more for working in
LAT/LONG, but I suspect it should work just as well in whatever coordinate
system astronomers use.

Something else to consider is that it would be relatively easy to implement
an HTM type in postgresql, which would result in substantial space savings.
You need 3 bits for level 0; each additional level requires 2 bits. This
will be much smaller than storing the HTM in a varchar, and also smaller
than using a bit to indicate N vs. S plus an int (or using the sign to
indicate N/S with an int).

On Sun, Feb 13, 2005 at 08:14:58PM +0300, Oleg Bartunov wrote:
> Probably off-topic, but I think it's worth seeing what astronomers are
> doing with their very big spatial databases. For example, we are working
> with a catalog of more than 500,000,000 rows, and we use a special
> transformation of coordinates to integer numbers that preserves object
> closeness. I hope we can show that PostgreSQL is good enough to be used
> in astronomy for very big catalogs; currently, MS SQL is in use. See
> http://www.sdss.jhu.edu/htm/ for details. We use another technique.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
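The space argument above is easy to sanity-check: an HTM id at level N
needs 3 + 2*N bits (3 bits to pick one of the 8 base triangles, 2 bits per
subdivision level). A throwaway query, assuming nothing beyond that
arithmetic, shows which levels fit in the built-in integer types:

    SELECT lvl,
           3 + 2*lvl AS bits_needed,
           CASE WHEN 3 + 2*lvl <= 32 THEN 'int4'
                WHEN 3 + 2*lvl <= 64 THEN 'int8'
                ELSE 'varbit' END AS smallest_type_that_fits
      FROM generate_series(0, 31) AS lvl;

So anything up to level 14 packs into a plain int4, and up to level 30 into
an int8, versus dozens of bytes for the varchar representation.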
Just my 2 cents. I am not a super statistics guy, but besides increasing
the sample size and assuming things about the distribution, I understand
you want to get more info on what distribution the data represents.

Usually the problem with these things is that the data needs to be sorted
on the index key, and it can take a while, at least for the one time you
want to find out what the distribution is.

Example: for cumulative distributions you need to first sort the data (I am
talking scalars, but probably other keys can work) and run it sequentially
through a KS (Kolmogorov-Smirnov) test. (There are other tests, but this
one is good for general cases.) The test can be run against all kinds of
cumulative distributions, like normals, etc. You then get a feel for how
close the data is to the selected distribution, with a parameter that can
be rejected at 0.01, 0.05, 0.1, etc.

Anyway, it can be done; however, I am not sure how much better it is than
just plain histograms with random() and a uniform distribution, or what
happens if you just increase the sample size and be done with it. Again, I
am talking about the general/common cases.

Regards,
tzahi.

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org On Behalf Of pgsql@mohawksoft.com
> Subject: Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)
>
> For now, the solution that works for me is to seriously up the value of
> "targrows" in analyze.c. It makes ANALYZE take longer, and while the
> stats are not "correct" because they are not designed to detect these
> sorts of patterns, a larger sample allows them to be "less wrong" enough
> to give a better hint to the planner.
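For anyone who wants to play with the idea, the KS-style D statistic (the
maximum gap between the empirical CDF and a reference CDF) can be computed
in plain SQL, here against a uniform distribution over the column's range.
This is only an illustrative sketch: the zsample temp table is made up, the
column names assume the TIGER rt1 schema, it assumes max(zipl) > min(zipl),
and it is O(n^2) as written, so run it on a small sample rather than the
full table.

    CREATE TEMP TABLE zsample AS
        SELECT zipl FROM rt1
         WHERE zipl IS NOT NULL AND random() < 0.001;

    SELECT max(abs(ecdf - ucdf)) AS ks_d
      FROM (
        SELECT (SELECT count(*) FROM zsample b
                 WHERE b.zipl <= a.zipl)::float8
                 / (SELECT count(*) FROM zsample)        AS ecdf,
               (a.zipl - m.lo)::float8 / (m.hi - m.lo)   AS ucdf
          FROM zsample a,
               (SELECT min(zipl) AS lo, max(zipl) AS hi
                  FROM zsample) m
      ) AS d;

A large ks_d says the column is far from uniform; turning that into an
accept/reject decision at 0.01 or 0.05 would still need the usual critical
values for the sample size.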
> Probably off-topic, but I think it's worth seeing what astronomers are
> doing with their very big spatial databases. For example, we are working
> with a catalog of more than 500,000,000 rows, and we use a special
> transformation of coordinates to integer numbers that preserves object
> closeness. I hope we can show that PostgreSQL is good enough to be used
> in astronomy for very big catalogs; currently, MS SQL is in use. See
> http://www.sdss.jhu.edu/htm/ for details. We use another technique.

You know, I don't think a lot of people "get" the issues I was describing,
or maybe they don't believe it, I don't know. But I think it would be a
useful contrib project to create an 'analyze_special('table', 'column',
'method')' function that does a better job at calculating the stats for
tables that contain multiple trend waveforms. A separate function will
probably work well, as the trends within the data probably only apply to
specific rows.

It's interesting, because I don't think it needs to calculate a perfect
representation of the data so much as a better clue to its nature for the
optimizer.

When I get the time (or can get someone to pay me to do it) I'm going to
try it.
pgsql@mohawksoft.com wrote:
>
> You know, I don't think a lot of people "get" the issues I was describing,
> or maybe they don't believe it, I don't know. But I think it would be a
> useful contrib project to create an 'analyze_special('table', 'column',
> 'method')' function that does a better job at calculating the stats for
> tables that contain multiple trend waveforms. A separate function will
> probably work well, as the trends within the data probably only apply to
> specific rows.

I've done something similar, but simpler, for the Census/TIGER data.

If you loaded each TIGER file sequentially, like I did, the data was all
grouped by county when it was loaded - so basically all the geographical
columns (zip, county, state, census tract) are actually grouped tightly on
disk, though ANALYZE can't see this because they're not strictly ascending
or descending.

Since I merely observed that the geospatial columns were all clustered
pretty well, I simply set the correlation value to the same fairly large
value for all the geographic columns with a bunch of statements like this:

  update pg_statistic set stanumbers3[1] = 0.8
   where starelid = 31412043 and staattnum = 3;

Instead of a complicated analyze function, how about just letting the user
"tell" the optimizer that a column is clustered well, with a function like
force_correlation_stat('schema', 'table', 'column', 'value')? That would
work well for your data. Since you know your distinct values lie on a
relatively small number of pages, if you merely did

  force_correlation('tiger','rt1','zipl',0.8);
  force_correlation('tiger','rt1','statel',0.8);
  force_correlation('tiger','rt1','countyl',0.8);

the optimizer would then see that not many disk pages would need to be hit
for a single zip code.

> It's interesting, because I don't think it needs to calculate a perfect
> representation of the data so much as a better clue to its nature for the
> optimizer.

Indeed. Using the very arbitrary number "0.8" for the correlation of all
the geographic-related columns in the TIGER data, the optimizer guessed a
good plan almost every time on my company's 200GB geographical database.

> When I get the time (or can get someone to pay me to do it) I'm going to
> try it.

I still suspect that the correct way to do it would not be to use the
single "correlation", but 2 stats - one for estimating how
sequential/random accesses would be; and one for estimating the number of
pages that would be hit. I think the existing correlation does well for
the first estimate; but for many data sets, poorly for the second type.

If you want to start a contrib project that looks into additional stats
that may help, I might be interested.

  Ron
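Ron's force_correlation() idea can be prototyped today as a PL/pgSQL
function that pokes pg_statistic directly. The sketch below is only an
illustration of that proposal, not an official API: it assumes PL/pgSQL is
installed, that ANALYZE has already been run (so a correlation slot,
stakind = 3, exists for the column), that the caller is a superuser, and
that the forced value is clobbered again by the next ANALYZE.

    CREATE OR REPLACE FUNCTION force_correlation(nsp text, tbl text,
                                                 col text, corr real)
    RETURNS void AS $$
    DECLARE
        rel oid;
        att int2;
    BEGIN
        -- look up the table and column
        SELECT c.oid INTO rel
          FROM pg_class c
          JOIN pg_namespace n ON n.oid = c.relnamespace
         WHERE n.nspname = nsp AND c.relname = tbl;

        SELECT a.attnum INTO att
          FROM pg_attribute a
         WHERE a.attrelid = rel AND a.attname = col;

        -- the correlation statistic can sit in any of the four slots,
        -- so overwrite whichever one is marked with stakind = 3
        UPDATE pg_statistic SET stanumbers1 = ARRAY[corr]
         WHERE starelid = rel AND staattnum = att AND stakind1 = 3;
        UPDATE pg_statistic SET stanumbers2 = ARRAY[corr]
         WHERE starelid = rel AND staattnum = att AND stakind2 = 3;
        UPDATE pg_statistic SET stanumbers3 = ARRAY[corr]
         WHERE starelid = rel AND staattnum = att AND stakind3 = 3;
        UPDATE pg_statistic SET stanumbers4 = ARRAY[corr]
         WHERE starelid = rel AND staattnum = att AND stakind4 = 3;
    END;
    $$ LANGUAGE plpgsql;

    SELECT force_correlation('tiger', 'rt1', 'zipl', 0.8);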
On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:
> I still suspect that the correct way to do it would not be to use the
> single "correlation", but 2 stats - one for estimating how
> sequential/random accesses would be; and one for estimating the number of
> pages that would be hit. I think the existing correlation does well for
> the first estimate; but for many data sets, poorly for the second type.

Should this be made a TODO? Is there some way we can estimate how much
this would help without actually building it?

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Jim C. Nasby wrote:
> On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:
>
> > I still suspect that the correct way to do it would not be to use the
> > single "correlation", but 2 stats - one for estimating how
> > sequential/random accesses would be; and one for estimating the number
> > of pages that would be hit. I think the existing correlation does well
> > for the first estimate; but for many data sets, poorly for the second
> > type.
>
> Should this be made a TODO? Is there some way we can estimate how much
> this would help without actually building it?

I guess I am confused how we would actually do that or if it is possible.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
> Jim C. Nasby wrote:
>> Should this be made a TODO? Is there some way we can estimate how much
>> this would help without actually building it?
>
> I guess I am confused how we would actually do that or if it is possible.

I spent a while on the web looking for some known way to calculate "local"
correlation or "clumping" in some manner analogous to how we do correlation
currently. As yet I have only seen really specialized examples that were
tangentially relevant. We need a pet statistician to ask.

regards

Mark
> Jim C. Nasby wrote:
>> Should this be made a TODO? Is there some way we can estimate how much
>> this would help without actually building it?
>
> I guess I am confused how we would actually do that or if it is
> possible.

Bruce, we didn't get much time to talk at LinuxWorld (it was nice to meet
you). I'm not sure how you would go about it, but in my post "One big
trend..." (I don't even remember the exact title anymore), I talk about
tables that are physically sorted on multiple keys. Take these addresses
(streetname, typename, state, zip):

  maple, st,  ma, 02186
  maple, st,  ma, 02186
  maple, rd,  ma, 02186
  maple, ave, ma, 02186
  maple, st,  me, ??????

Assume the table is physically sorted by state, town (zip), streetname,
streettype. If one were to type:

  select * from locations where streetname = 'maple';

the analysis of that query improperly minimizes the correlation of the
street name for the table as a whole.
Sounds a bit like multi-dimensional clustering ...

  http://www.research.ibm.com/mdc/

After the ARC experience though ...

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh