Thread: ANALYZE sampling is too good
At multiple conferences I've heard about people trying all sorts of gymnastics to avoid ANALYZE, which they expect to take too long and consume too much I/O. This is an especially big complaint after upgrades, when their new database performs poorly until the new statistics are in: they did pg_upgrade to avoid an extended downtime, and then complain about ANALYZE taking hours. I always gave the party line that ANALYZE only takes a small constant-sized sample, so even very large tables should be very quick. But after hearing the same story again at Heroku I looked into it a bit further. I was kind of shocked by the numbers.

ANALYZE takes a sample of 300 * statistics_target rows. That sounds pretty reasonable, but with default_statistics_target set to 100 that's 30,000 rows. If I'm reading the code right, it takes this sample by sampling 30,000 blocks and then (if the table is large enough) taking an average of one row per block. Each block is 8192 bytes, so that means it's reading 240MB of each table. That's a lot more than I realized. It means if your table is anywhere up to 240MB you're effectively doing a full table scan and then throwing out nearly all the data read.

Worse, my experience with the posix_fadvise benchmarking is that on spinning media reading one out of every 16 blocks takes about the same time as reading them all. Presumably this is because the seek time between tracks dominates, and reading one out of every 16 blocks is still reading every track. So in fact if your table is up to about 3-4GB, ANALYZE is still effectively going to do a full table scan, at least as far as I/O time goes.

The current algorithm seems like it was designed with a 100GB+ table in mind, but the consequences on the more common 100MB-100GB tables weren't really considered. Consider what this means for partitioned tables: if they partition their terabyte table into 10 partitions, ANALYZE will suddenly want to use 10x as much I/O, which seems like a perverse consequence.

I'm not sure I have a prescription, but my general feeling is that we're spending an awful lot of resources going after a statistically valid sample when we could spend a lot less and get something 90% as good. Or, if we're really going to read that much data, we might as well use more of the rows we find.

-- greg
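For concreteness, the arithmetic above works out as follows. This is a back-of-the-envelope sketch only: the 300-rows-per-target-unit constant, the 8192-byte block size, and the roughly one-row-per-block behaviour are the ones described in the message, and the 1-in-16 figure is Greg's spinning-media observation, not anything measured here.

    /* Back-of-the-envelope numbers for the sampling cost described above.
     * Constants: 300 rows per statistics-target unit, 8192-byte blocks,
     * and (on a large table) roughly one sampled row per block. */
    #include <stdio.h>

    int main(void)
    {
        int    stats_target  = 100;                  /* default_statistics_target */
        long   sample_rows   = 300L * stats_target;  /* 30,000 rows */
        long   sample_blocks = sample_rows;          /* ~1 row per block on a big table */
        long   block_size    = 8192;                 /* bytes */
        double bytes_read    = (double) sample_blocks * block_size;

        printf("sample: %ld rows from %ld blocks = ~%.0f MB of I/O\n",
               sample_rows, sample_blocks, bytes_read / 1e6);

        /* Greg's observation: on spinning media, reading 1 block in 16 costs
         * about as much as reading them all, so tables up to ~16x the sample
         * size pay close to full-scan I/O time. */
        printf("rough full-scan-equivalent threshold on spinning media: ~%.1f GB\n",
               16 * bytes_read / 1e9);
        return 0;
    }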
On 12/03/2013 03:30 PM, Greg Stark wrote:
> It means if your table is anywhere up to 240MB you're effectively
> doing a full table scan and then throwing out nearly all the data
> read.

There are lots of issues with our random sampling approach for ANALYZE. This is why, back in our Greenplum days, Simon proposed changing to a block-based sampling approach, where we would sample random *pages* instead of random *rows*. That would allow us to do things like sample 5% of the table, but only read 5% of the table, although we might have to play some with OS-FS operations to make sure of that.

In addition to solving the issue you cite above, it would let us get MUCH more accurate estimates for very large tables, where currently we sample only about 0.1% of the table.

There are fairly well researched algorithms for block-based sampling which estimate for the skew introduced by looking at consecutive rows in a block. In general, a minimum sample size of 5% is required, and the error is no worse than our current system. However, the idea was shot down at the time, partly because I think other hackers didn't get the math.

I believe that both Oracle and MSSQL use block-based sampling, but of course, I don't know which specific algo they use.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Dec 3, 2013 at 8:30 PM, Greg Stark <stark@mit.edu> wrote:
> Worse, my experience with the posix_fadvise benchmarking is that on
> spinning media reading one out of every 16 blocks takes about the same
> time as reading them all. Presumably this is because the seek time
> between tracks dominates and reading one out of every 16 blocks is
> still reading every track. So in fact if your table is up to about
> 3-4G ANALYZE is still effectively going to do a full table scan, at
> least as far as I/O time goes.

Actually, it's rotational latency that's the dominant cost there.
On Thu, Dec 5, 2013 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote: > There are fairly well researched algorithms for block-based sampling > which estimate for the skew introduced by looking at consecutive rows in > a block. In general, a minimum sample size of 5% is required, and the > error is no worse than our current system. However, the idea was shot > down at the time, partly because I think other hackers didn't get the math. I think that this certainly warrants revisiting. The benefits would be considerable. Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. Opportunistically/unpredictably acquiring a ShareUpdateExclusiveLock would be kind of weird, for one thing, but if a full table scan really is very expensive, would it be so unreasonable to attempt to amortize that cost? -- Peter Geoghegan
On Fri, Dec 6, 2013 at 7:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Dec 5, 2013 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> There are fairly well researched algorithms for block-based sampling
>> which estimate for the skew introduced by looking at consecutive rows in
>> a block. In general, a minimum sample size of 5% is required, and the
>> error is no worse than our current system. However, the idea was shot
>> down at the time, partly because I think other hackers didn't get the math.
>
> I think that this certainly warrants revisiting. The benefits would be
> considerable.
>
> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> other full-table scans? That doesn't really help Greg, because his
> complaint is mostly that a fresh ANALYZE is too expensive, but it
> could be an interesting, albeit risky approach.

Is only a fresh ANALYZE costly, or are subsequent ones equally costly?

Doing it in some background operation might not be a bad idea, but doing it during backend query execution (a seq scan) might add overhead to query response time, especially if part or most of the table's data is in RAM. In that case the overhead of the actual reads might not be very high, but the computation for ANALYZE (like the sort) will still make it costly.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> other full-table scans? That doesn't really help Greg, because his
> complaint is mostly that a fresh ANALYZE is too expensive, but it
> could be an interesting, albeit risky approach.

What I've been thinking of is

a) making it piggy back on scans vacuum is doing instead of doing separate ones all the time (if possible, analyze needs to be more frequent). Currently with quite some likelihood the cache will be gone again when revisiting.

b) make analyze incremental. In lots of bigger tables most of the table is static - and we actually *do* know that, thanks to the vm. So keep a rawer form of what ends up in the catalogs around somewhere, chunked by the region of the table the statistic is from. Every time a part of the table changes, re-sample only that part. Then recompute the aggregate.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
It looks like this is a fairly well understood problem because in the real world it's also often cheaper to speak to people in a small geographic area or time interval too. These wikipedia pages sound interesting and have some external references: http://en.wikipedia.org/wiki/Cluster_sampling http://en.wikipedia.org/wiki/Multistage_sampling I suspect the hard part will be characterising the nature of the non-uniformity in the sample generated by taking a whole block. Some of it may come from how the rows were loaded (e.g. older rows were loaded by pg_restore but newer rows were inserted retail) or from the way Postgres works (e.g. hotter rows are on blocks with fewer rows in them and colder rows are more densely packed). I've felt for a long time that Postgres would make an excellent test bed for some aspiring statistics research group. -- greg
> http://en.wikipedia.org/wiki/Cluster_sampling > http://en.wikipedia.org/wiki/Multistage_sampling > > I suspect the hard part will be characterising the nature of the > non-uniformity in the sample generated by taking a whole block. Some > of it may come from how the rows were loaded (e.g. older rows were > loaded by pg_restore but newer rows were inserted retail) or from the > way Postgres works (e.g. hotter rows are on blocks with fewer rows in > them and colder rows are more densely packed). I would have thought that as VACUUM reclaims space it levels that issue in the long run and on average, so that it could be simply ignored? > I've felt for a long time that Postgres would make an excellent test > bed for some aspiring statistics research group. I would say "applied statistics" rather than "research". Nevertheless I can ask my research statistician colleagues next door about their opinion on this sampling question. -- Fabien.
On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark@mit.edu> wrote:
> I always gave the party line that ANALYZE only takes a small
> constant-sized sample so even very large tables should be very quick.
> But after hearing the same story again at Heroku I looked into it a
> bit further. I was kind of shocked by the numbers.
>
> ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> pretty reasonable but with default_statistics_target set to 100 that's
> 30,000 rows. If I'm reading the code right, it takes this sample by
> sampling 30,000 blocks and then (if the table is large enough) taking
> an average of one row per block. Each block is 8192 bytes so that
> means it's reading 240MB of each table. That's a lot more than I
> realized.

That is a lot. On the other hand, I question the subject line: sometimes, our ANALYZE sampling is not good enough. Before we raised the default statistics target from 10 to 100, complaints about bad plan choices due to insufficiently-precise statistics were legion -- and we still have people periodically proposing to sample a fixed percentage of the table instead of a fixed amount of the table, even on large tables, which is going the opposite direction. I think this is because they're getting really bad n_distinct estimates, and no fixed-size sample can reliably give a good one.

More generally, I think the basic problem that people are trying to solve by raising the statistics target is the planner avoiding index scans on gigantic tables. Obviously, there are a lot of other situations where inadequate statistics can cause problems, but that's a pretty easy-to-understand one that we do not always get right. We know that an equality search looking for some_column = 'some_constant', where some_constant is not an MCV, must be more selective than a search for the least-frequent MCV. If you store more and more MCVs for a table, eventually you'll have enough that the least-frequent one is pretty infrequent, and then things work a lot better.

This is more of a problem for big tables than for small tables. MCV #100 can't have a frequency of greater than 1/100 = 0.01, but that's a lot more rows on a big table than on a small one. On a table with 10 million rows we might estimate something close to 100,000 rows when the real number is just a handful; when the table has only 10,000 rows, we just can't be off by as many orders of magnitude. Things don't always work out that badly, but in the worst case they do.

Maybe there's some highly-principled statistical approach which could be taken here, and if so that's fine, but I suspect not. So what I think we should do is auto-tune the statistics target based on the table size. If, say, we think that the generally useful range for the statistics target is something like 10 to 400, then let's come up with a formula based on table size that outputs 10 for small tables, 400 for really big tables, and intermediate values for tables in the middle.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
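One hypothetical shape for the auto-tuning Robert describes might be a logarithmic interpolation between the two extremes. The breakpoints and the interpolation below are invented purely for illustration; nothing like this exists in PostgreSQL or was proposed in the thread.

    /* Hypothetical auto-tuned statistics target: 10 for tiny tables, 400
     * for huge ones, interpolated in log(table size) in between.
     * The breakpoints (10^4 and 10^9 rows) are invented for illustration. */
    #include <math.h>
    #include <stdio.h>

    static int auto_stats_target(double reltuples)
    {
        const double min_target = 10, max_target = 400;
        const double lo = 1e4, hi = 1e9;    /* table sizes pinned to min/max */

        if (reltuples <= lo)
            return (int) min_target;
        if (reltuples >= hi)
            return (int) max_target;
        double frac = (log10(reltuples) - log10(lo)) / (log10(hi) - log10(lo));
        return (int) (min_target + frac * (max_target - min_target));
    }

    int main(void)
    {
        double sizes[] = {1e3, 1e5, 1e7, 1e9, 1e11};
        for (int i = 0; i < 5; i++)
            printf("%.0e rows -> target %d\n", sizes[i], auto_stats_target(sizes[i]));
        return 0;
    }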
On 12/07/2013 11:46 AM, Robert Haas wrote:
> Maybe there's some highly-principled statistical approach which could
> be taken here, and if so that's fine, but I suspect not. So what I
> think we should do is auto-tune the statistics target based on the
> table size. If, say, we think that the generally useful range for the
> statistics target is something like 10 to 400, then let's come up with
> a formula based on table size that outputs 10 for small tables, 400
> for really big tables, and intermediate values for tables in the
> middle.

The only approach which makes sense is to base it on a % of the table. In fact, pretty much every paper which has examined statistics estimation for database tables has determined that any estimate based on a less-than-5% sample is going to be wildly inaccurate. Not that 5% samples are 100% accurate, but at least they fit the 80/20 rule.

This is the reason why implementing block-based sampling is critical; using our current "take one row out of every page" method, sampling 5% of the table means scanning the whole thing in most tables. We also need to decouple the number of MCVs we keep from the sample size. Certainly our existing sampling algo seems designed to maximize IO for the sample size.

There are other qualitative improvements we could make, which Nathan Boley has spoken on. For example, our stats code has no way to recognize a normal or exponential distribution -- it assumes that all columns are randomly distributed. If we could recognize common distribution patterns, then not only could we have better query estimates, those would require keeping *fewer* stats, since all you need for a normal distribution are the end points and the variance.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based on
> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.

This is nonsense. The math behind the deductions the planner makes is well understood and we know how to estimate the precision based on the sample size. There are some questions that really need a proportional sample such as n_distinct (and as Robert points out the MCV list) but the main motivation of the sample size is the histograms and those are used to answer questions which very clearly do not need a proportional sample. The statistics is very clear there. Using a proportional sample for the histograms would just be silly. It would be substituting a gut feeling for real statistics.

The problem with using a proportional sample for things like n_distinct or the MCV list is that you're talking about sorting or hashing an unboundedly large set of values and storing an unboundedly large array in the statistics table. I don't think that would be tenable without dramatically changing the way we process and store that data to be more scalable. Using a lossy counting algorithm and something more scalable than toasted arrays would be prerequisites I think.

And as Robert mentions even if we solved those problems it wouldn't help n_distinct. You really need to read all the values to deal with n_distinct.

> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables. We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.

This would be helpful if you could specify what you mean by "block-based sampling". The reason these previous pleas didn't go anywhere is not because we can't do math. It's because of the lack of specifics here. What we do *today* is called "block-based sampling" by the literature.

What I'm saying is we need to change the "block-based sampling" that we do to extract more rows per block. We currently extract the "correct" number of rows to get a strong guarantee of uniformity. If we could get a weaker guarantee of being "close enough" to uniform samples for the deductions we want to make and get many more rows per block then we could read a lot fewer blocks.

Or to put it another way people could increase the statistics target dramatically and still be reading the same number of blocks as today. In an ideal world perhaps we could have people reading 1% of the blocks they read today and get statistics targets 10x better than today.

-- greg
On 08/12/13 09:25, Josh Berkus wrote:
> On 12/07/2013 11:46 AM, Robert Haas wrote:
>> Maybe there's some highly-principled statistical approach which could
>> be taken here, and if so that's fine, but I suspect not. So what I
>> think we should do is auto-tune the statistics target based on the
>> table size. If, say, we think that the generally useful range for the
>> statistics target is something like 10 to 400, then let's come up with
>> a formula based on table size that outputs 10 for small tables, 400
>> for really big tables, and intermediate values for tables in the
>> middle.
>
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based on
> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.
>
> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables. We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.
>
> There are other qualitative improvements we could make, which Nathan Boley
> has spoken on. For example, our stats code has no way to recognize a
> normal or exponential distribution -- it assumes that all columns are
> randomly distributed. If we could recognize common distribution
> patterns, then not only could we have better query estimates, those
> would require keeping *fewer* stats, since all you need for a normal
> distribution are the end points and the variance.

From src/backend/commands/analyze.c:

     * As of May 2004 we use a new two-stage method:  Stage one selects up
     * to targrows random blocks (or all blocks, if there aren't so many).
     * Stage two scans these blocks and uses the Vitter algorithm to create
     * a random sample of targrows rows (or less, if there are less in the
     * sample of blocks).  The two stages are executed simultaneously: each
     * block is processed as soon as stage one returns its number and while
     * the rows are read stage two controls which ones are to be inserted
     * into the sample.

I don't think we always read every block (been a while since I looked at this code, so I'll recheck). From what I understand this stuff is based on reasonable research (the Vitter algorithm). Not to say it couldn't/shouldn't be looked at again to improve it - but it is not just dreamed up with no valid research!

Cheers

Mark
On 08/12/13 12:27, Mark Kirkwood wrote:
> On 08/12/13 09:25, Josh Berkus wrote:
>> On 12/07/2013 11:46 AM, Robert Haas wrote:
>>> Maybe there's some highly-principled statistical approach which could
>>> be taken here, and if so that's fine, but I suspect not. So what I
>>> think we should do is auto-tune the statistics target based on the
>>> table size. If, say, we think that the generally useful range for the
>>> statistics target is something like 10 to 400, then let's come up with
>>> a formula based on table size that outputs 10 for small tables, 400
>>> for really big tables, and intermediate values for tables in the
>>> middle.
>>
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
>>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables. We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
>>
>> There are other qualitative improvements we could make, which Nathan Boley
>> has spoken on. For example, our stats code has no way to recognize a
>> normal or exponential distribution -- it assumes that all columns are
>> randomly distributed. If we could recognize common distribution
>> patterns, then not only could we have better query estimates, those
>> would require keeping *fewer* stats, since all you need for a normal
>> distribution are the end points and the variance.
>
> From src/backend/commands/analyze.c:
>
>      * As of May 2004 we use a new two-stage method:  Stage one selects up
>      * to targrows random blocks (or all blocks, if there aren't so many).
>      * Stage two scans these blocks and uses the Vitter algorithm to create
>      * a random sample of targrows rows (or less, if there are less in the
>      * sample of blocks).  The two stages are executed simultaneously: each
>      * block is processed as soon as stage one returns its number and while
>      * the rows are read stage two controls which ones are to be inserted
>      * into the sample.
>
> I don't think we always read every block (been a while since I looked at
> this code, so I'll recheck). From what I understand this stuff is based
> on reasonable research (the Vitter algorithm). Not to say it
> couldn't/shouldn't be looked at again to improve it - but it is not just
> dreamed up with no valid research!

Since I volunteered to recheck :-) from analyze.c again:

    /*--------------------
     * The following choice of minrows is based on the paper
     * "Random sampling for histogram construction: how much is enough?"
     * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
     * Proceedings of ACM SIGMOD International Conference on Management
     * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
     * says that for table size n, histogram size k, maximum relative
     * error in bin size f, and error probability gamma, the minimum
     * random sample size is
     *      r = 4 * k * ln(2*n/gamma) / f^2
     * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
     *      r = 305.82 * k
     * Note that because of the log function, the dependence on n is
     * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
     * bin size error with probability 0.99.  So there's no real need to
     * scale for n, which is a good thing because we don't necessarily
     * know it at this point.
     *--------------------
     */
    stats->minrows = 300 * attr->attstattarget;

So for typical settings (default_statistics_target = 100), we try to get 30000 rows. This means we will sample about 30000 blocks. Indeed, quickly checking with a scale 100 pgbench db and a simple patch to make the block sampler say how many blocks it reads (note pgbench_accounts has 163935 blocks):

    bench=# ANALYZE pgbench_branches;
    NOTICE:  acquire sample will need 30000 blocks
    NOTICE:  sampled 1 blocks
    ANALYZE
    Time: 16.935 ms
    bench=# ANALYZE pgbench_accounts;
    NOTICE:  acquire sample will need 30000 blocks
    NOTICE:  sampled 30000 blocks
    ANALYZE
    Time: 10059.446 ms
    bench=# \q

Regards

Mark
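The bound quoted from analyze.c is easy to check directly. The following small program just evaluates r = 4 * k * ln(2*n/gamma) / f^2 for the parameters given in that comment; it is a verification aid only, not code from the PostgreSQL tree.

    /* Evaluate the Chaudhuri/Motwani/Narasayya bound quoted in analyze.c:
     *   r = 4 * k * ln(2*n / gamma) / f^2
     * with f = 0.5 and gamma = 0.01, for a couple of table sizes n.
     * For n = 10^6 this prints r = 305.82 * k, matching the comment. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double f = 0.5, gamma_prob = 0.01;
        double sizes[] = {1e6, 1e12};

        for (int i = 0; i < 2; i++)
        {
            double n = sizes[i];
            double r_per_k = 4.0 * log(2.0 * n / gamma_prob) / (f * f);
            printf("n = %.0e: r = %.2f * k\n", n, r_per_k);
        }
        return 0;
    }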
On 08/12/13 10:27, Greg Stark wrote:
> On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
>
> This is nonsense. The math behind the deductions the planner makes is
> well understood and we know how to estimate the precision based on the
> sample size. There are some questions that really need a proportional
> sample such as n_distinct (and as Robert points out the MCV list) but
> the main motivation of the sample size is the histograms and those are
> used to answer questions which very clearly do not need a proportional
> sample. The statistics is very clear there. Using a proportional
> sample for the histograms would just be silly. It would be
> substituting a gut feeling for real statistics.
>
> The problem with using a proportional sample for things like
> n_distinct or the MCV list is that you're talking about sorting or
> hashing an unboundedly large set of values and storing an unboundedly
> large array in the statistics table. I don't think that would be
> tenable without dramatically changing the way we process and store that
> data to be more scalable. Using a lossy counting algorithm and
> something more scalable than toasted arrays would be prerequisites I
> think.
>
> And as Robert mentions even if we solved those problems it wouldn't
> help n_distinct. You really need to read all the values to deal with
> n_distinct.
>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables. We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
>
> This would be helpful if you could specify what you mean by
> "block-based sampling". The reason these previous pleas didn't go
> anywhere is not because we can't do math. It's because of the lack of
> specifics here. What we do *today* is called "block-based sampling" by
> the literature.
>
> What I'm saying is we need to change the "block-based sampling" that
> we do to extract more rows per block. We currently extract the
> "correct" number of rows to get a strong guarantee of uniformity. If
> we could get a weaker guarantee of being "close enough" to uniform
> samples for the deductions we want to make and get many more rows per
> block then we could read a lot fewer blocks.
>
> Or to put it another way people could increase the statistics target
> dramatically and still be reading the same number of blocks as today.
> In an ideal world perhaps we could have people reading 1% of the
> blocks they read today and get statistics targets 10x better than
> today.

I suspect that the number of rows to sample should be something of the form (see the table below and the worked code after it):

        { N                    where N <= a * 100
    n = { (a * N) / 100        where a * 100 < N <= 10000
        { (a * SQRT(N)) / 100  where N > 10000

    n: the number of rows to sample
    N: total number of rows
    a: configurable coefficient (1 <= a <= 100)

For very large numbers of rows, like 10^8, taking a constant fraction will over-sample for most purposes, hence the square root.
      a      N        n       sampled%
      1      10000    100     1%
    100      10000    10000   100%
      1      10^8     10000   0.01%
    100      10^8     10^6    1%
      1      10^10    10^5    0.001%
    100      10^10    10^7    0.01%

For very large values of N, you can still get a representative sample with a very small fraction of rows sampled.

Hmm... Yes, I can see some issues, once I tried it out with test values! However, it might inspire a more useful and practical approach!

Cheers,
Gavin
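Gavin's rule, transcribed into code so the numbers are easy to reproduce. The coefficient a and the breakpoints are his, not anything in PostgreSQL. Note that, as written, the N > 10000 branch gives n = a*SQRT(N)/100, while the sampled% column in his table corresponds to a*SQRT(N) without the /100; that mismatch is presumably among the "issues" he mentions spotting with test values.

    /* Gavin's proposed sample-size rule, as written in the message above. */
    #include <math.h>
    #include <stdio.h>

    static double sample_rows(double N, double a)   /* 1 <= a <= 100 */
    {
        if (N <= a * 100)
            return N;
        if (N <= 10000)
            return (a * N) / 100;
        return (a * sqrt(N)) / 100;
    }

    int main(void)
    {
        double tests[][2] = { {10000, 1}, {10000, 100}, {1e8, 1},
                              {1e8, 100}, {1e10, 1},   {1e10, 100} };
        for (int i = 0; i < 6; i++)
        {
            double N = tests[i][0], a = tests[i][1];
            double n = sample_rows(N, a);
            printf("a=%3.0f N=%.0e -> n=%.0f (%g%%)\n", a, N, n, 100 * n / N);
        }
        return 0;
    }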
On Sun, Dec 8, 2013 at 12:06 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
> bench=# ANALYZE pgbench_accounts;
> NOTICE:  acquire sample will need 30000 blocks
> NOTICE:  sampled 30000 blocks
> ANALYZE
> Time: 10059.446 ms
> bench=# \q

I did some experimenting here as well. I hacked up a version of analyze.c that has a guc for rows_per_block to sample. Setting it to 1 doesn't modify the behaviour at all, whereas setting it to 4 divides the number of blocks to sample by 4, which causes it to do less I/O and use more rows from each block.

I then initialized pgbench with scale factor 100 but modified the code to run the actual pgbench with scale factor 1. In other words I ran a lot of updates on 1% of the database but left the other 99% untouched from the initial load. Then I ran "ANALYZE VERBOSE accounts" with rows_per_block set to 1, 4, 16, and 64. The latter is slightly larger than the average number of tuples per block so the resulting sample is actually slightly short.

The whole accounts table is 1.2GB and contains 10 million rows. As expected, with rows_per_block set to 1 it reads 240MB of that, containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s):

    stark=# analyze verbose pgbench_accounts;
    INFO:  analyzing "public.pgbench_accounts"
    INFO:  "pgbench_accounts": scanned 30000 of 158756 pages, containing 1889701 live rows and 0 dead rows; 30000 rows in sample, 10000036 estimated total rows
    ANALYZE
    Time: 19468.987 ms

With rows_per_block=4 it reads only a quarter as many blocks but it's not much faster:

    stark=# analyze verbose pgbench_accounts;
    INFO:  analyzing "public.pgbench_accounts"
    INFO:  "pgbench_accounts": scanned 7501 of 158756 pages, containing 472489 live rows and 0 dead rows; 30000 rows in sample, 10000037 estimated total rows
    ANALYZE
    Time: 17062.331 ms

But with rows_per_block=16 it's much faster, 6.7s:

    stark=# set statistics_rows_per_block = 16;
    SET
    Time: 1.583 ms
    stark=# analyze verbose pgbench_accounts;
    INFO:  analyzing "public.pgbench_accounts"
    INFO:  "pgbench_accounts": scanned 1876 of 158756 pages, containing 118163 live rows and 0 dead rows; 30000 rows in sample, 10000031 estimated total rows
    ANALYZE
    Time: 6694.331 ms

And with rows_per_block=64 it's under 2s:

    stark=# set statistics_rows_per_block = 64;
    SET
    Time: 0.693 ms
    stark=# analyze verbose pgbench_accounts;
    INFO:  analyzing "public.pgbench_accounts"
    INFO:  "pgbench_accounts": scanned 469 of 158756 pages, containing 29544 live rows and 0 dead rows; 29544 rows in sample, 10000033 estimated total rows
    ANALYZE
    Time: 1937.055 ms

The estimates for the total rows are just as accurate in every case. (They seem to be consistently slightly high though, which is a bit disconcerting.)

However, looking at the actual pg_stats entries, the stats are noticeably less accurate for the "blockier" samples. The "bid" column actually has 100 distinct values, so with a statistics_target of 100 each value should appear in the MCV list with a frequency of about .01.

With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213

I'm not really sure if this is due to the blocky sample combined with the skewed pgbench run or not.
It doesn't seem to be consistently biasing towards or against bid 1 which I believe are the only rows that would have been touched by pgbench. Still it's suspicious that they seem to be consistently getting less accurate as the blockiness increases. I've attached the results of pg_stats following the analyze with the various levels of "blockiness". -- greg
On 12/08/2013 10:14 AM, Greg Stark wrote: > With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123 > With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125 > With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164 > With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213 > > I'm not really sure if this is due to the blocky sample combined with > the skewed pgbench run or not. It doesn't seem to be consistently > biasing towards or against bid 1 which I believe are the only rows > that would have been touched by pgbench. Still it's suspicious that > they seem to be consistently getting less accurate as the blockiness > increases. They will certainly do so if you don't apply any statistical adjustments for selecting more rows from the same pages. So there's a set of math designed to calculate for the skew introduced by reading *all* of the rows in each block. That's what I meant by "block-based sampling"; you read, say, 400 pages, you compile statistics on *all* of the rows on those pages, you apply some algorithms to adjust for groupings of rows based on how grouped they are. And you have a pretty good estimate of how grouped they are, because you just looked a complete sets of rows on a bunch of nonadjacent pages. Obviously, you need to look at more rows than you would with a pure-random sample. Like I said, the 80%+ accurate point in the papers seemed to be at a 5% sample. However, since those rows come from the same pages, the cost of looking at more rows is quite small, compared to the cost of looking at 64 times as many disk pages. My ACM subscription has lapsed, though; someone with a current ACM subscription could search for this; there are several published papers, with math and pseudocode. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
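The adjustment Josh is alluding to is usually expressed in the survey-sampling literature as a "design effect": when rows within a block are correlated, a whole-block sample of n rows is only worth about n / deff independent rows, where deff = 1 + (b - 1) * rho for block size b and intra-block correlation rho. The toy program below just illustrates that standard formula; the rho values are invented and none of this is PostgreSQL code.

    /* Design effect for whole-block (cluster) sampling: with b rows per
     * block and intra-block correlation rho, the effective sample size of
     * n physically-sampled rows is n / (1 + (b - 1) * rho).
     * The rho values below are invented purely for illustration. */
    #include <stdio.h>

    int main(void)
    {
        int    rows_sampled   = 30000;
        int    rows_per_block = 64;        /* roughly one full 8K heap block */
        double rhos[] = {0.0, 0.01, 0.05, 0.2};

        for (int i = 0; i < 4; i++)
        {
            double deff = 1.0 + (rows_per_block - 1) * rhos[i];
            printf("rho = %.2f: deff = %5.2f, effective sample = %.0f rows\n",
                   rhos[i], deff, rows_sampled / deff);
        }
        return 0;
    }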
On 12/08/2013 08:14 PM, Greg Stark wrote: > The whole accounts table is 1.2GB and contains 10 million rows. As > expected with rows_per_block set to 1 it reads 240MB of that > containing nearly 2 million rows (and takes nearly 20s -- doing a full > table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. - Heikki
On Sun, Dec 8, 2013 at 7:03 PM, Josh Berkus <josh@agliodbs.com> wrote: > They will certainly do so if you don't apply any statistical adjustments > for selecting more rows from the same pages. > > So there's a set of math designed to calculate for the skew introduced > by reading *all* of the rows in each block. I just think this is an oversimplification. There's no skew introduced just by reading all the rows in each block unless there's some kind of dependency between the block a row is placed on and the data in it. So I don't believe there can be some single set of math that automatically removes any skew automatically. The math will depend on what the dependency is. Just to be clear, you have to think pretty hard about the way Postgres internals work to see what kinds of skew might be appearing here. Due to the way Postgres updates work and HOT cleanups work "hot" tuples will be weighted less than "cold" tuples. That's not going to be something someone in ACM knew to design into their maths. I do have access to ACM or other academic articles if you remember any author names or any keywords but if it's a database journal I would worry about patent issues. Do you remember if it was over 17 years old? > Obviously, you need to look at more rows than you would with a > pure-random sample. Like I said, the 80%+ accurate point in the papers > seemed to be at a 5% sample. I really don't believe the 5% thing. It's not enough for n_distinct and it's *far* too high a value for linear properties like histograms or nullfrac etc. From a computer point of view it's too high to be worth bothering. If we have to read 5% of the table we might as well do a full scan anyways, it'll be marginally slower but much better quality results. -- greg
On 09/12/13 08:03, Josh Berkus wrote: > > So there's a set of math designed to calculate for the skew introduced > by reading *all* of the rows in each block. That's what I meant by > "block-based sampling"; you read, say, 400 pages, you compile statistics > on *all* of the rows on those pages, you apply some algorithms to adjust > for groupings of rows based on how grouped they are. And you have a > pretty good estimate of how grouped they are, because you just looked a > complete sets of rows on a bunch of nonadjacent pages. > > Obviously, you need to look at more rows than you would with a > pure-random sample. Like I said, the 80%+ accurate point in the papers > seemed to be at a 5% sample. However, since those rows come from the > same pages, the cost of looking at more rows is quite small, compared to > the cost of looking at 64 times as many disk pages. > > My ACM subscription has lapsed, though; someone with a current ACM > subscription could search for this; there are several published papers, > with math and pseudocode. > This makes more sense! Unfortunately you had confused everyone (well me at least) by decreeing that we needed block based sampling - when we started using that in 2004 - there is more than one way to sample blocks it seems :-) What you are actually suggesting is a way to do block based sampling that requires reading fewer blocks, and that is a great idea! Of course as Greg has just suggested - the details are gonna be the clincher. Can you supply authors or titles of papers? Also with reference to Greenplum - their use of postgres is fairly special case, and an approach that works well for them is not always suitable for more general purpose use. Regards Mark
Greg, > I really don't believe the 5% thing. It's not enough for n_distinct > and it's *far* too high a value for linear properties like histograms > or nullfrac etc. Actually, it is enough for n_distinct, or more properly, 5% is as good as you can get for n_distinct unless you're going to jump to scanning 50% or more. It's also applicable for the other stats; histogram buckets constructed from a 5% sample are more likely to be accurate than those constructed from a 0.1% sample. Same with nullfrac. The degree of improved accuracy, would, of course, require some math to determine. > From a computer point of view it's too high to be > worth bothering. If we have to read 5% of the table we might as well > do a full scan anyways, it'll be marginally slower but much better > quality results. Reading 5% of a 200GB table is going to be considerably faster than reading the whole thing, if that 5% is being scanned in a way that the FS understands. Also, we can optimize this significantly by using the VM, as Robert (I think) suggested. In the advanced approaches section, there's also the idea of collecting analyze data from table pages while they're in memory anyway for other reasons. You do seem kind of hostile to the idea of full-page-sampling, going pretty far beyond the "I'd need to see the math". Why? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Dec 9, 2013 at 1:03 PM, Josh Berkus <josh@agliodbs.com> wrote: >> I really don't believe the 5% thing. It's not enough for n_distinct >> and it's *far* too high a value for linear properties like histograms >> or nullfrac etc. > > Actually, it is enough for n_distinct, or more properly, 5% is as good > as you can get for n_distinct unless you're going to jump to scanning > 50% or more. I'd like to see a proof of that result. Not because I'm hostile to changing the algorithm, but because you've made numerous mathematical claims on this thread that fly in the face of what Greg, myself, and others understand to be mathematically true - including this one. If our understanding is wrong, then by all means let's get that fixed. But you're not going to convince anyone here that we should rip out the existing algorithm and its peer-reviewed journal citations by making categorical assertions about the right way to do things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Josh Berkus <josh@agliodbs.com> writes: > Reading 5% of a 200GB table is going to be considerably faster than > reading the whole thing, if that 5% is being scanned in a way that the > FS understands. Really? See the upthread point that reading one sector from each track has just as much seek overhead as reading the whole thing. I will grant that if you think that reading a *contiguous* 5% of the table is good enough, you can make it faster --- but I don't believe offhand that you can make this better without seriously compromising the randomness of your sample. Too many tables are loaded roughly in time order, or in other ways that make contiguous subsets nonrandom. > You do seem kind of hostile to the idea of full-page-sampling, going > pretty far beyond the "I'd need to see the math". Why? I'm detecting a lot of hostility to assertions unsupported by any math. For good reason. regards, tom lane
On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
> It's also applicable for the other stats; histogram buckets constructed
> from a 5% sample are more likely to be accurate than those constructed
> from a 0.1% sample. Same with nullfrac. The degree of improved
> accuracy, would, of course, require some math to determine.

This "some math" is straightforward basic statistics. The 95% confidence interval for a sample of 300 rows from a population of 1 million would be +/- 5.66%. A sample of 1000 rows would have a 95% confidence interval of +/- 3.1%.

The histogram and nullfrac answer the same kind of question as a political poll: "what fraction of the population falls within this subset?" This is why pollsters don't need to sample 15 million Americans to have a decent poll result. That's just not how the math works for these kinds of questions.

n_distinct is an entirely different kettle of fish. It's a different kind of problem, and the error rate there *is* going to be dependent on the percentage of the total population that you sampled. Moreover, from the papers I read I'm convinced any sample less than 50-80% is nearly useless, so I'm convinced you can't get good results without reading the whole table.

-- greg
On Mon, Dec 9, 2013 at 6:54 PM, Greg Stark <stark@mit.edu> wrote: > > This "some math" is straightforward basic statistics. The 95th > percentile confidence interval for a sample consisting of 300 samples > from a population of a 1 million would be 5.66%. A sample consisting > of 1000 samples would have a 95th percentile confidence interval of > +/- 3.1%. Incidentally I got this using an online sample size calculator. Google turns up several but this one seems the easiest to use: http://www.raosoft.com/samplesize.html -- greg
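Greg's figures are the ordinary margin of error for an estimated proportion at 95% confidence, taken at the worst case p = 0.5 with the finite-population correction. The following check reproduces them; it is purely illustrative.

    /* Margin of error (95% confidence) for an estimated proportion at the
     * worst case p = 0.5, with the finite-population correction.
     * Reproduces the +/- 5.66% and +/- 3.1% figures quoted above. */
    #include <math.h>
    #include <stdio.h>

    static double margin(double n, double N)
    {
        double z = 1.96, p = 0.5;
        double fpc = sqrt((N - n) / (N - 1));   /* finite-population correction */
        return z * sqrt(p * (1 - p) / n) * fpc;
    }

    int main(void)
    {
        double N = 1e6;                          /* population of 1 million rows */
        printf("n = 300:  +/- %.2f%%\n", 100 * margin(300, N));
        printf("n = 1000: +/- %.2f%%\n", 100 * margin(1000, N));
        return 0;
    }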
On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark@mit.edu> wrote:
>> I always gave the party line that ANALYZE only takes a small
>> constant-sized sample so even very large tables should be very quick.
>> But after hearing the same story again at Heroku I looked into it a
>> bit further. I was kind of shocked by the numbers.
>>
>> ANALYZE takes a sample of 300 * statistics_target rows. That sounds
>> pretty reasonable but with default_statistics_target set to 100 that's
>> 30,000 rows. If I'm reading the code right, it takes this sample by
>> sampling 30,000 blocks and then (if the table is large enough) taking
>> an average of one row per block. Each block is 8192 bytes so that
>> means it's reading 240MB of each table. That's a lot more than I
>> realized.
>
> That is a lot. On the other hand, I question the subject line:
> sometimes, our ANALYZE sampling is not good enough. Before we raised
> the default statistics target from 10 to 100, complaints about bad
> plan choices due to insufficiently-precise statistics were legion --
> and we still have people periodically proposing to sample a fixed
> percentage of the table instead of a fixed amount of the table, even
> on large tables, which is going the opposite direction. I think this
> is because they're getting really bad n_distinct estimates, and no
> fixed-size sample can reliably give a good one.

I don't recall ever tracing a bad plan down to a bad n_distinct. I have seen several that were due to bad frequency estimates in the MCV list, because hash join planning is extremely sensitive to that.

Do we have some kind of catalog of generators of problematic data, so that changes can be tested on known problem sets? Perhaps a wiki page to accumulate them would be useful. For automated testing I guess the generator and query is the easy part; the hard part is the cost settings/caching/RAM needed to trigger the problem, and parsing and interpreting the results.

> More generally, I think the basic problem that people are trying to
> solve by raising the statistics target is the planner avoiding index
> scans on gigantic tables. Obviously, there are a lot of other situations
> where inadequate statistics can cause problems, but that's a pretty
> easy-to-understand one that we do not always get right. We know that
> an equality search looking for some_column = 'some_constant', where
> some_constant is not an MCV, must be more selective than a search for the
> least-frequent MCV. If you store more and more MCVs for a table,
> eventually you'll have enough that the least-frequent one is pretty
> infrequent, and then things work a lot better.

My reading of the code is that if it is not in the MCV list, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list). There is also a check that it is less than the least common of the MCVs, but I don't know why that situation would ever prevail -- that should always be higher or equal to the average selectivity.

> This is more of a problem for big tables than for small tables. MCV
> #100 can't have a frequency of greater than 1/100 = 0.01, but that's a
> lot more rows on a big table than on a small one. On a table with 10
> million rows we might estimate something close to 100,000 rows when
> the real number is just a handful; when the table has only 10,000
> rows, we just can't be off by as many orders of magnitude. Things
> don't always work out that badly, but in the worst case they do.
>
> Maybe there's some highly-principled statistical approach which could
> be taken here, and if so that's fine, but I suspect not. So what I
> think we should do is auto-tune the statistics target based on the
> table size. If, say, we think that the generally useful range for the
> statistics target is something like 10 to 400, then let's come up with
> a formula based on table size that outputs 10 for small tables, 400
> for really big tables, and intermediate values for tables in the
> middle.

I think that parts of the planner are N^2 in the size of the histogram (or was that the size of the MCV list?). So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size.

Cheers,

Jeff
On 12/6/13 3:21 AM, Andres Freund wrote:
> On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
>> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
>> other full-table scans? That doesn't really help Greg, because his
>> complaint is mostly that a fresh ANALYZE is too expensive, but it
>> could be an interesting, albeit risky approach.
>
> What I've been thinking of is
>
> a) making it piggy back on scans vacuum is doing instead of doing
> separate ones all the time (if possible, analyze needs to be more
> frequent). Currently with quite some likelihood the cache will be gone
> again when revisiting.

FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's fancy new shared memory stuff).

--
Jim C. Nasby, Data Architect   jim@nasby.net
512.569.9461 (cell)            http://jim.nasby.net
On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
> On 12/08/2013 08:14 PM, Greg Stark wrote:
>> The whole accounts table is 1.2GB and contains 10 million rows. As
>> expected with rows_per_block set to 1 it reads 240MB of that
>> containing nearly 2 million rows (and takes nearly 20s -- doing a full
>> table scan for select count(*) only takes about 5s):
>
> One simple thing we could do, without or in addition to changing the
> algorithm, is to issue posix_fadvise() calls for the blocks we're going
> to read. It should at least be possible to match the speed of a plain
> sequential scan that way.

Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine...

--
Jim C. Nasby, Data Architect   jim@nasby.net
512.569.9461 (cell)            http://jim.nasby.net
On Mon, Dec 9, 2013 at 1:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > I don't recall ever tracing a bad plan down to a bad n_distinct. It does happen. I've seen it several times. -- Peter Geoghegan
On 12/09/2013 11:35 PM, Jim Nasby wrote: > On 12/8/13 1:49 PM, Heikki Linnakangas wrote: >> On 12/08/2013 08:14 PM, Greg Stark wrote: >>> The whole accounts table is 1.2GB and contains 10 million rows. As >>> expected with rows_per_block set to 1 it reads 240MB of that >>> containing nearly 2 million rows (and takes nearly 20s -- doing a full >>> table scan for select count(*) only takes about 5s): >> >> One simple thing we could do, without or in addition to changing the >> algorithm, is to issue posix_fadvise() calls for the blocks we're >> going to read. It should at least be possible to match the speed of a >> plain sequential scan that way. > > Hrm... maybe it wouldn't be very hard to use async IO here either? I'm > thinking it wouldn't be very hard to do the stage 2 work in the callback > routine... Yeah, other than the fact we have no infrastructure to do asynchronous I/O anywhere in the backend. If we had that, then we could easily use it here. I doubt it would be much better than posix_fadvising the blocks, though. - Heikki
On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 12/09/2013 11:35 PM, Jim Nasby wrote: >> >> On 12/8/13 1:49 PM, Heikki Linnakangas wrote: >>> >>> On 12/08/2013 08:14 PM, Greg Stark wrote: >>>> >>>> The whole accounts table is 1.2GB and contains 10 million rows. As >>>> expected with rows_per_block set to 1 it reads 240MB of that >>>> containing nearly 2 million rows (and takes nearly 20s -- doing a full >>>> table scan for select count(*) only takes about 5s): >>> >>> >>> One simple thing we could do, without or in addition to changing the >>> algorithm, is to issue posix_fadvise() calls for the blocks we're >>> going to read. It should at least be possible to match the speed of a >>> plain sequential scan that way. >> >> >> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm >> thinking it wouldn't be very hard to do the stage 2 work in the callback >> routine... > > > Yeah, other than the fact we have no infrastructure to do asynchronous I/O > anywhere in the backend. If we had that, then we could easily use it here. I > doubt it would be much better than posix_fadvising the blocks, though. Without patches to the kernel, it is much better. posix_fadvise interferes with read-ahead, so posix_fadvise on, say, bitmap heap scans (or similarly sorted analyze block samples) run at 1 IO / block, ie horrible, whereas aio can do read coalescence and read-ahead when the kernel thinks it'll be profitable, significantly increasing IOPS. I've seen everything from a 2x to 10x difference.
On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > My reading of the code is that if it is not in the MCV, then it is assumed > to have the average selectivity (about 1/n_distinct, but deflating top and > bottom for the MCV list). There is also a check that it is less than the > least common of the MCV, but I don't know why that situation would ever > prevail--that should always be higher or equal to the average selectivity. I've never seen an n_distinct value of more than 5 digits, regardless of reality. Typically I've seen 20-50k, even if the real number is much higher. But the n_distinct value is only for non-MCVs, so if we estimate the selectivity of column = 'rarevalue' to be (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces the estimate, and making the MCV list longer naturally makes mcvfrac bigger. I'm not sure how important the less-frequent-than-the-least-common-MCV part is, but I'm very sure that raising the statistics target helps to solve the problem of overestimating the prevalence of uncommon values in a very big table. > I think that parts of the planner are N^2 in the size of histogram (or was > that the size of the MCV list?). So we would probably need a way to use a > larger sample size to get more accurate n_distinct and MCV frequencies, but > not save the entire histogram that goes with that sample size. I think the saving the histogram part is important. As you say, the MCVs are important for a variety of planning purposes, such as hash joins. More than that, in my experience, people with large tables are typically very willing to spend more planning time to get a better plan, because mistakes are expensive and the queries are likely to run for a while anyway. People with small tables care about planning time, because it makes no sense to spend an extra 1ms planning a query unless you improve the plan by enough to save at least 1ms when executing it, and when the tables are small and access is expected to be fast anyway that's often not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
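For readers following along, the estimate Robert and Jeff are discussing can be sketched as below. This is a deliberately simplified illustration of the idea -- an MCV constant is estimated from its stored frequency, and everything else shares the leftover non-NULL, non-MCV mass among the remaining distinct values -- and not the actual code in selfuncs.c.

    /* Simplified sketch of the equality-selectivity estimate discussed
     * above.  Roughly mirrors the planner's behaviour; see selfuncs.c for
     * the real thing. */
    #include <stdio.h>

    static double eq_selectivity(double nullfrac, const double *mcv_freq, int n_mcv,
                                 double n_distinct, int constant_is_mcv, int mcv_index)
    {
        double mcv_mass = 0.0;
        for (int i = 0; i < n_mcv; i++)
            mcv_mass += mcv_freq[i];

        if (constant_is_mcv)
            return mcv_freq[mcv_index];        /* known directly from the stats */

        /* otherwise: leftover frequency mass spread over remaining distinct values */
        double otherdistinct = n_distinct - n_mcv;
        return (1.0 - nullfrac - mcv_mass) / (otherdistinct > 1 ? otherdistinct : 1);
    }

    int main(void)
    {
        double mcv[] = {0.10, 0.08, 0.05};     /* invented example statistics */
        printf("MCV lookup:    %.4f\n", eq_selectivity(0.0, mcv, 3, 1000, 1, 0));
        printf("non-MCV value: %.6f\n", eq_selectivity(0.0, mcv, 3, 1000, 0, 0));
        return 0;
    }

Raising the statistics target lengthens the MCV list, which both grows mcv_mass and lowers the frequency of the least-common MCV, so the estimate for rare non-MCV values shrinks -- which is the effect Robert describes.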
On 12/09/2013 11:56 PM, Claudio Freire wrote: > On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> On 12/09/2013 11:35 PM, Jim Nasby wrote: >>> >>> On 12/8/13 1:49 PM, Heikki Linnakangas wrote: >>>> >>>> On 12/08/2013 08:14 PM, Greg Stark wrote: >>>>> >>>>> The whole accounts table is 1.2GB and contains 10 million rows. As >>>>> expected with rows_per_block set to 1 it reads 240MB of that >>>>> containing nearly 2 million rows (and takes nearly 20s -- doing a full >>>>> table scan for select count(*) only takes about 5s): >>>> >>>> One simple thing we could do, without or in addition to changing the >>>> algorithm, is to issue posix_fadvise() calls for the blocks we're >>>> going to read. It should at least be possible to match the speed of a >>>> plain sequential scan that way. >>> >>> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm >>> thinking it wouldn't be very hard to do the stage 2 work in the callback >>> routine... >> >> Yeah, other than the fact we have no infrastructure to do asynchronous I/O >> anywhere in the backend. If we had that, then we could easily use it here. I >> doubt it would be much better than posix_fadvising the blocks, though. > > Without patches to the kernel, it is much better. > > posix_fadvise interferes with read-ahead, so posix_fadvise on, say, > bitmap heap scans (or similarly sorted analyze block samples) run at 1 > IO / block, ie horrible, whereas aio can do read coalescence and > read-ahead when the kernel thinks it'll be profitable, significantly > increasing IOPS. I've seen everything from a 2x to 10x difference. How did you test that, given that we don't actually have an asynchronous I/O implementation? I don't recall any recent patches floating around either to do that. When Greg Stark investigated this back in 2007-2008 and implemented posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a significant speedup on the test data he used. What kind of a data distribution gives a slowdown like that? I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 30000 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. - Heikki
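Heikki's actual patch is in the message's attachment and is not reproduced here. As a standalone illustration of the prefetching technique being discussed, advising the kernel about a scattered set of blocks before reading them might look roughly like this (a hypothetical demo program, not the patch):

    /* Standalone illustration of the prefetch idea: tell the kernel we
     * will soon need a set of scattered blocks from a file, then read
     * them.  POSIX_FADV_WILLNEED is advisory only. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int main(int argc, char **argv)
    {
        if (argc < 2)
        {
            fprintf(stderr, "usage: %s datafile [blockno...]\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Pass 1: advise the kernel about every block we intend to sample. */
        for (int i = 2; i < argc; i++)
            (void) posix_fadvise(fd, (off_t) atol(argv[i]) * BLCKSZ,
                                 BLCKSZ, POSIX_FADV_WILLNEED);

        /* Pass 2: read the blocks; ideally the I/O is already in flight. */
        char buf[BLCKSZ];
        for (int i = 2; i < argc; i++)
            if (pread(fd, buf, BLCKSZ, (off_t) atol(argv[i]) * BLCKSZ) < 0)
                perror("pread");

        close(fd);
        return 0;
    }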
On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 12/09/2013 11:56 PM, Claudio Freire wrote: >> Without patches to the kernel, it is much better. >> >> posix_fadvise interferes with read-ahead, so posix_fadvise on, say, >> bitmap heap scans (or similarly sorted analyze block samples) run at 1 >> IO / block, ie horrible, whereas aio can do read coalescence and >> read-ahead when the kernel thinks it'll be profitable, significantly >> increasing IOPS. I've seen everything from a 2x to 10x difference. > > > How did you test that, given that we don't actually have an asynchronous I/O > implementation? I don't recall any recent patches floating around either to > do that. When Greg Stark investigated this back in 2007-2008 and implemented > posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a > significant speedup on the test data he used. What kind of a data > distribution gives a slowdown like that? That's basically my summarized experience from working on this[0] patch, and the feedback given there about competing AIO work. Sequential I/O was the biggest issue. I had to actively avoid fadvising on sequential I/O, or sequential-ish I/O, which was a huge burden on fadvise logic. > > I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very > easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from > this on my laptop. Taking a 30000 page sample of a table with 717717 pages > (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the > patch, and less than a second with the patch, with > effective_io_concurrency=10. If anyone with a good test data set loaded > would like to test this and post some numbers, that would be great. Kernel version? I raised this issue on LKML, and, while I got no news on this front, they might have silently fixed it. I'd have to check the sources again. [0] http://www.postgresql.org/message-id/COL116-W162AEAA64173E77D4597EEA3670@phx.gbl
Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very
>> easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from
>> this on my laptop. Taking a 30000 page sample of a table with 717717 pages
>> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the
>> patch, and less than a second with the patch, with effective_io_concurrency=10.
>> If anyone with a good test data set loaded would like to test this and post
>> some numbers, that would be great.
>
> Kernel version?

3.12, from Debian experimental. With an ssd drive and btrfs filesystem. Admittedly not your average database server setup, so it would be nice to get more reports from others.

> I raised this issue on LKML, and, while I got no news on this front,
> they might have silently fixed it. I'd have to check the sources
> again.

Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when you're prefetching with posix_fadvise anyway.

- Heikki
On Mon, Dec 9, 2013 at 8:45 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Claudio Freire <klaussfreire@gmail.com> wrote: >>On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas >><hlinnakangas@vmware.com> wrote: >>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to >>be very >>> easy, patch attached. Your mileage may vary, but I'm seeing a nice >>gain from >>> this on my laptop. Taking a 30000 page sample of a table with 717717 >>pages >>> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without >>the >>> patch, and less than a second with the patch, with >>> effective_io_concurrency=10. If anyone with a good test data set >>loaded >>> would like to test this and post some numbers, that would be great. >> >>Kernel version? > > 3.12, from Debian experimental. With an ssd drive and btrfs filesystem. Admittedly not your average database server setup,so it would be nice to get more reports from others. Yeah, read-ahead isn't relevant for SSD.
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when you're prefetching with posix_fadvise anyway. Yeah. If we're not reading consecutive blocks, readahead is unlikely to do anything anyhow. Claudio's comments do suggest that it might be a bad idea to issue a posix_fadvise when the next block to be examined *is* adjacent to the current one, though. regards, tom lane
On 10/12/13 12:14, Heikki Linnakangas wrote: > > > I took a stab at using posix_fadvise() in ANALYZE. It turned out to be > very easy, patch attached. Your mileage may vary, but I'm seeing a > nice gain from this on my laptop. Taking a 30000 page sample of a > table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes > about 6 seconds without the patch, and less than a second with the > patch, with effective_io_concurrency=10. If anyone with a good test > data set loaded would like to test this and post some numbers, that > would be great. > > I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4 i7 3.5Ghz Cpu 16GB Ram 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Regards Mark
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes: > I did a test run: > pgbench scale 2000 (pgbench_accounts approx 25GB). > postgres 9.4 > i7 3.5Ghz Cpu > 16GB Ram > 500 GB Velociraptor 10K > (cold os and pg cache both runs) > Without patch: ANALYZE pgbench_accounts 90s > With patch: ANALYZE pgbench_accounts 91s > So I'm essentially seeing no difference :-( What OS and filesystem? regards, tom lane
On 10/12/13 13:14, Mark Kirkwood wrote: > On 10/12/13 12:14, Heikki Linnakangas wrote: >> >> >> I took a stab at using posix_fadvise() in ANALYZE. It turned out to >> be very easy, patch attached. Your mileage may vary, but I'm seeing a >> nice gain from this on my laptop. Taking a 30000 page sample of a >> table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes >> about 6 seconds without the patch, and less than a second with the >> patch, with effective_io_concurrency=10. If anyone with a good test >> data set loaded would like to test this and post some numbers, that >> would be great. >> >> > > > > I did a test run: > > pgbench scale 2000 (pgbench_accounts approx 25GB). > postgres 9.4 > > i7 3.5Ghz Cpu > 16GB Ram > 500 GB Velociraptor 10K > > (cold os and pg cache both runs) > Without patch: ANALYZE pgbench_accounts 90s > With patch: ANALYZE pgbench_accounts 91s > > So I'm essentially seeing no difference :-( Arrg - sorry forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14) filesystem is ext4
On 10/12/13 13:20, Mark Kirkwood wrote:
> Arrg - sorry forgot the important bits:
>
> Ubuntu 13.10 (kernel 3.11.0-14)
> filesystem is ext4

Doing the same test as above, but on an 80GB Intel 520 (ext4 filesystem mounted with discard):

(cold os and pg cache both runs)
Without patch: ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts 5s
On 12/06/2013 09:52 AM, Peter Geoghegan wrote: > Has anyone ever thought about opportunistic ANALYZE piggy-backing on > other full-table scans? That doesn't really help Greg, because his > complaint is mostly that a fresh ANALYZE is too expensive, but it > could be an interesting, albeit risky approach. It'd be particularly interesting, IMO, if autovacuum was able to do it using the synchronized scans machinery, so the analyze still ran in a separate proc and didn't impact the user's query. Have an autovac worker or two waiting for new scans to start on tables they need to analyze, and if one doesn't start within a reasonable time frame they start their own scan. We've seen enough issues with hint-bit setting causing unpredictable query times for user queries that I wouldn't want to add another "might write during a read-only query" behaviour. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12/10/2013 05:20 AM, Jim Nasby wrote: >> > > FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to > fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's > fancy new shared memory stuff). Apologies for posting the same as a new idea before I saw your post. I need to finish the thread before posting. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 10/12/13 13:53, Mark Kirkwood wrote:
> Doing the same test as above, but on an 80GB Intel 520 (ext4 filesystem
> mounted with discard):
>
> (cold os and pg cache both runs)
> Without patch: ANALYZE pgbench_accounts 5s
> With patch: ANALYZE pgbench_accounts 5s

Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either:

(cold os and pg cache both runs)
Without patch: ANALYZE pgbench_accounts 6.4s
With patch: ANALYZE pgbench_accounts 6.4s
On 10/12/13 15:04, Mark Kirkwood wrote:
> Redoing the filesystem on the 520 as btrfs didn't seem to make any
> difference either:
>
> (cold os and pg cache both runs)
> Without patch: ANALYZE pgbench_accounts 6.4s
> With patch: ANALYZE pgbench_accounts 6.4s

Ah - I have just realized I was not setting effective_io_concurrency - so I'll redo the test. - Apologies.

Regards

Mark
On 10/12/13 15:11, Mark Kirkwood wrote:
> Ah - I have just realized I was not setting effective_io_concurrency -
> so I'll redo the test. - Apologies.

Redoing the test on the velociraptor gives me exactly the same numbers as before (effective_io_concurrency = 10 instead of 1).

Cheers

Mark
On 12/09/2013 02:37 PM, Robert Haas wrote:
> I've never seen an n_distinct value of more than 5 digits, regardless
> of reality. Typically I've seen 20-50k, even if the real number is
> much higher. But the n_distinct value is only for non-MCVs, so if we
> estimate the selectivity of column = 'rarevalue' to be
> (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
> the estimate, and making the MCV list longer naturally makes mcvfrac
> bigger. I'm not sure how important the
> less-frequent-than-the-least-common-MCV part is, but I'm very sure
> that raising the statistics target helps to solve the problem of
> overestimating the prevalence of uncommon values in a very big table.

I did an analysis of our ndistinct algorithm several years ago ( ~~ 8.1), and to sum up:

1. we take far too small of a sample to estimate ndistinct well for tables larger than 100,000 rows.

2. the estimation algo we have chosen is one which tends to be wrong in the downwards direction, rather strongly so. That is, if we could potentially have an ndistinct of 1000 to 100,000 based on the sample, our algo estimates 1500 to 3000.

3. Other algos exist. They tend to be wrong in other directions.

4. Nobody has done an analysis of whether it's worse, on average, to estimate low vs. high for ndistinct.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 10/12/13 15:17, Mark Kirkwood wrote:
> Redoing the test on the velociraptor gives me exactly the same numbers
> as before (effective_io_concurrency = 10 instead of 1).

Good grief - repeating the test gave:

Without patch: ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts 42s

pretty consistent *now*. No idea what was going on in the 1st run (maybe I happened to have it running at the same time as a checkpoint)? Anyway will stop now before creating more confusion.

Regards

Mark
On 10/12/13 15:32, Mark Kirkwood wrote:
> Good grief - repeating the test gave:
>
> Without patch: ANALYZE pgbench_accounts 90s
> With patch: ANALYZE pgbench_accounts 42s
>
> pretty consistent *now*. No idea what was going on in the 1st run
> (maybe I happened to have it running at the same time as a
> checkpoint)? Anyway will stop now before creating more confusion.

Just one more...

The Intel 520 with ext4:

Without patch: ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts 1s

And double checking -
With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts 5s

These results look more like Heikki's. Which suggests more benefit on SSD than spinning disks. Some more data points (apart from mine) would be good to see tho.

Cheers

Mark
On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > Just one more... > > The Intel 520 with ext4: > > > Without patch: ANALYZE pgbench_accounts 5s > With patch: ANALYZE pgbench_accounts 1s > > And double checking - > With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts 5s > > These results look more like Heikki's. Which suggests more benefit on SSD > than spinning disks. Some more data points (apart from mine) would be good > to see tho. Assuming ANALYZE is sampling less than 5% of the table, I'd say fadvising will always be a win. I'd also suggest higher e_i_c for rotating media. Rotating media has longer latencies, and e_i_c has to be computed in terms of latency, rather than "spindles", when doing prefetch. For backward index scans, I found the optimum number for a single spindle to be about 20.
On 10/12/13 17:18, Claudio Freire wrote:
> I'd also suggest higher e_i_c for rotating media. Rotating media has
> longer latencies, and e_i_c has to be computed in terms of latency,
> rather than "spindles", when doing prefetch.
>
> For backward index scans, I found the optimum number for a single
> spindle to be about 20.

Yeah - was just fooling about with this on the velociraptor, looks like somewhere in the 20's works well.

                   | pgbench_accounts
eff_io_concurrency | analyze time (s)
-------------------+-----------------
                 8 | 43
                16 | 40
                24 | 36
                32 | 35
                64 | 35

Cheers

Mark
Greg Stark wrote: >> It's also applicable for the other stats; histogram buckets constructed >> from a 5% sample are more likely to be accurate than those constructed >> from a 0.1% sample. Same with nullfrac. The degree of improved >> accuracy, would, of course, require some math to determine. > > This "some math" is straightforward basic statistics. The 95th > percentile confidence interval for a sample consisting of 300 samples > from a population of a 1 million would be 5.66%. A sample consisting > of 1000 samples would have a 95th percentile confidence interval of > +/- 3.1%. Doesn't all that assume a normally distributed random variable? I don't think it can be applied to database table contents without further analysis. Yours, Laurenz Albe
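For reference, the ±5.66% and ±3.1% figures quoted above are the usual worst-case (p = 0.5) margin of error for estimating a single proportion; the normal approximation enters through the sampling distribution of the estimate, not through the data itself:

E = z_{0.975}\sqrt{\tfrac{p(1-p)}{n}}, \qquad
1.96\sqrt{\tfrac{0.25}{300}} \approx 0.0566, \qquad
1.96\sqrt{\tfrac{0.25}{1000}} \approx 0.031

As the follow-ups note, this covers proportion-like statistics such as nullfrac or an MCV frequency; it says nothing about ndistinct or histogram bounds.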
On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote:
>
> Doesn't all that assume a normally distributed random variable?

I don't think so because of the law of large numbers. If you have a large population and sample it the sample behaves like a normal distribution when if the distribution of the population isn't.
Greg Stark wrote: >> Doesn't all that assume a normally distributed random variable? > I don't think so because of the law of large numbers. If you have a large population and sample it the > sample behaves like a normal distribution when if the distribution of the population isn't. Statistics is the part of mathematics I know least of, but aren't you saying that in a large enough sample of people there will always be some with age < 0 (which is what a normal distribution would imply)? Yours, Laurenz Albe
On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark <stark@mit.edu> wrote: > > On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote: >> >> >> Doesn't all that assume a normally distributed random variable? > > I don't think so because of the law of large numbers. If you have a large > population and sample it the sample behaves like a normal distribution when > if the distribution of the population isn't. No, the large population says that if you have an AVERAGE of many samples of a random variable, the random variable that is the AVERAGE behaves like a normal. The variable itself doesn't. And for n_distinct, you need to know the variable itself.
On Tue, Dec 10, 2013 at 11:32 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark <stark@mit.edu> wrote: >> >> On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote: >>> >>> >>> Doesn't all that assume a normally distributed random variable? >> >> I don't think so because of the law of large numbers. If you have a large >> population and sample it the sample behaves like a normal distribution when >> if the distribution of the population isn't. > > > No, the large population says that if you have an AVERAGE of many > samples of a random variable, the random variable that is the AVERAGE > behaves like a normal. Sorry, that should be "No, the *law of large numbers* says..."
On 6 December 2013 09:21, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
>> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
>> other full-table scans? That doesn't really help Greg, because his
>> complaint is mostly that a fresh ANALYZE is too expensive, but it
>> could be an interesting, albeit risky approach.
>
> What I've been thinking of is
>
> a) making it piggy back on scans vacuum is doing instead of doing
> separate ones all the time (if possible, analyze needs to be more
> frequent). Currently with quite some likelihood the cache will be gone
> again when revisiting.
>
> b) make analyze incremental. In lots of bigger tables most of the table
> is static - and we actually *do* know that, thanks to the vm. So keep a
> rawer form of what ends in the catalogs around somewhere, chunked by the
> region of the table the statistic is from. Everytime a part of the table
> changes, re-sample only that part. Then recompute the aggregate.

Piggy-backing sounds like a bad thing. If I run a query, I don't want to be given some extra task, thanks! Especially if we might need to run data type code I'm not authorised to run, or to sample data I may not be authorised to see. The only way that could work is to kick off an autovacuum worker to run the ANALYZE as a separate process and then use synchronous scan behaviour to derive benefit indirectly.

However, these things presume that we need to continue scanning most of the blocks of the table, which I don't think needs to be the case. There is a better way.

Back in 2005/6, I advocated a block sampling method, as described by Chaudri et al (ref?). That has two advantages:

* We don't need to visit all of the blocks, reducing I/O
* It would also give better analysis of clustered data (it's all in that paper...)

The recent patch on TABLESAMPLE contained a rewrite of the guts of ANALYZE into a generic sample scan, which would then be used as the basis for a TABLESAMPLE BERNOULLI. A TABLESAMPLE SYSTEM could use a block sample, as it does in some other DBMS (DB2, SQLServer).

While I was unimpressed with the TABLESAMPLE patch, all it needs is some committer-love, so Greg...

-- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-10 19:23:37 +0000, Simon Riggs wrote:
> On 6 December 2013 09:21, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
> >> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> >> other full-table scans? That doesn't really help Greg, because his
> >> complaint is mostly that a fresh ANALYZE is too expensive, but it
> >> could be an interesting, albeit risky approach.
> >
> > What I've been thinking of is
> >
> > a) making it piggy back on scans vacuum is doing instead of doing
> > separate ones all the time (if possible, analyze needs to be more
> > frequent). Currently with quite some likelihood the cache will be gone
> > again when revisiting.
> >
> > b) make analyze incremental. In lots of bigger tables most of the table
> > is static - and we actually *do* know that, thanks to the vm. So keep a
> > rawer form of what ends in the catalogs around somewhere, chunked by the
> > region of the table the statistic is from. Everytime a part of the table
> > changes, re-sample only that part. Then recompute the aggregate.
>
> Piggy-backing sounds like a bad thing. If I run a query, I don't want
> to be given some extra task thanks! Especially if we might need to run
> data type code I'm not authorised to run, or to sample data I may not
> be authorised to see. The only way that could work is to kick off an
> autovacuum worker to run the ANALYZE as a separate process and then
> use synchronous scan behaviour to derive benefit indirectly.

I was suggesting to piggyback on VACUUM, not user queries. The latter suggestion was somebody else's.

In combination with incremental or chunk-wise building of statistics, doing more frequent partial vacuums which re-compute the changing part of the stats would be great. All those blocks have to be read anyway, and we should be more aggressive about vacuuming anyway.

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > However, these things presume that we need to continue scanning most > of the blocks of the table, which I don't think needs to be the case. > There is a better way. Do they? I think it's one opportunistic way of ameliorating the cost. > Back in 2005/6, I advocated a block sampling method, as described by > Chaudri et al (ref?) I don't think that anyone believes that not doing block sampling is tenable, fwiw. Clearly some type of block sampling would be preferable for most or all purposes. -- Peter Geoghegan
On 10 December 2013 19:49, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> However, these things presume that we need to continue scanning most >> of the blocks of the table, which I don't think needs to be the case. >> There is a better way. > > Do they? I think it's one opportunistic way of ameliorating the cost. > >> Back in 2005/6, I advocated a block sampling method, as described by >> Chaudri et al (ref?) > > I don't think that anyone believes that not doing block sampling is > tenable, fwiw. Clearly some type of block sampling would be preferable > for most or all purposes. If we have one way of reducing cost of ANALYZE, I'd suggest we don't need 2 ways - especially if the second way involves the interaction of otherwise not fully related parts of the code. Or to put it clearly, lets go with block sampling and then see if that needs even more work. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12/10/2013 11:49 AM, Peter Geoghegan wrote: > On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I don't think that anyone believes that not doing block sampling is > tenable, fwiw. Clearly some type of block sampling would be preferable > for most or all purposes. As discussed, we need math though. Does anyone have an ACM subscription and time to do a search? Someone must. We can buy one with community funds, but no reason to do so if we don't have to. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Dec 10, 2013 at 7:54 PM, Josh Berkus <josh@agliodbs.com> wrote: > As discussed, we need math though. Does anyone have an ACM subscription > and time to do a search? Someone must. We can buy one with community > funds, but no reason to do so if we don't have to. Anyone in a university likely has access through their library. But I don't really think this is the right way to go about this. Research papers are going to turn up pretty specialized solutions that are probably patented. We don't even have the basic understanding we need. I suspect a basic textbook chapter on multistage sampling will discuss at least the standard techniques. Once we have a handle on the standard multistage sampling techniques that would be safe from patents then we might want to go look at research papers to find how they've been applied to databases in the past but we would have to do that fairly carefully. -- greg
On 10 December 2013 19:54, Josh Berkus <josh@agliodbs.com> wrote: > On 12/10/2013 11:49 AM, Peter Geoghegan wrote: >> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> I don't think that anyone believes that not doing block sampling is >> tenable, fwiw. Clearly some type of block sampling would be preferable >> for most or all purposes. > > As discussed, we need math though. Does anyone have an ACM subscription > and time to do a search? Someone must. We can buy one with community > funds, but no reason to do so if we don't have to. We already have that, just use Vitter's algorithm at the block level rather than the row level. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
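For readers unfamiliar with the reference: "Vitter's algorithm" here means reservoir sampling, which keeps a uniform sample of k items from a stream whose length is not known in advance. Below is a minimal sketch of the simplest variant (often called Algorithm R); as I understand it, Vitter's own Algorithm Z yields the same distribution while skipping ahead so it consumes far fewer random numbers. This is illustrative only, not the analyze.c code.

/* Illustrative reservoir sampling (the simple "Algorithm R" variant):
 * maintains a uniform random sample of k items from a stream whose
 * length is not known in advance. */
#include <stdlib.h>

static void
reservoir_sample(const int *stream, int n, int *reservoir, int k)
{
    for (int i = 0; i < n; i++)
    {
        if (i < k)
            reservoir[i] = stream[i];      /* fill the reservoir first */
        else
        {
            int j = rand() % (i + 1);      /* uniform in [0, i]; modulo bias ignored */

            if (j < k)
                reservoir[j] = stream[i];  /* keep item i with probability k/(i+1) */
        }
    }
}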
On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark <stark@mit.edu> wrote: > But I don't really think this is the right way to go about this. > Research papers are going to turn up pretty specialized solutions that > are probably patented. We don't even have the basic understanding we > need. I suspect a basic textbook chapter on multistage sampling will > discuss at least the standard techniques. I agree that looking for information on block level sampling specifically, and its impact on estimation quality is likely to not turn up very much, and whatever it does turn up will have patent issues. -- Peter Geoghegan
On 12/10/2013 10:00 PM, Simon Riggs wrote: > On 10 December 2013 19:54, Josh Berkus <josh@agliodbs.com> wrote: >> On 12/10/2013 11:49 AM, Peter Geoghegan wrote: >>> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> I don't think that anyone believes that not doing block sampling is >>> tenable, fwiw. Clearly some type of block sampling would be preferable >>> for most or all purposes. >> >> As discussed, we need math though. Does anyone have an ACM subscription >> and time to do a search? Someone must. We can buy one with community >> funds, but no reason to do so if we don't have to. > > We already have that, just use Vitter's algorithm at the block level > rather than the row level. And what do you do with the blocks? How many blocks do you choose? Details, please. - Heikki
On 11/12/13 09:19, Heikki Linnakangas wrote: > On 12/10/2013 10:00 PM, Simon Riggs wrote: >> On 10 December 2013 19:54, Josh Berkus <josh@agliodbs.com> wrote: >>> On 12/10/2013 11:49 AM, Peter Geoghegan wrote: >>>> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs >>>> <simon@2ndquadrant.com> wrote: >>>> I don't think that anyone believes that not doing block sampling is >>>> tenable, fwiw. Clearly some type of block sampling would be preferable >>>> for most or all purposes. >>> >>> As discussed, we need math though. Does anyone have an ACM >>> subscription >>> and time to do a search? Someone must. We can buy one with community >>> funds, but no reason to do so if we don't have to. >> >> We already have that, just use Vitter's algorithm at the block level >> rather than the row level. > > And what do you do with the blocks? How many blocks do you choose? > Details, please. > > Yeah - and we seem to be back to Josh's point about needing 'some math' to cope with the rows within a block not being a purely random selection. Regards Mark
On 12/10/2013 01:33 PM, Mark Kirkwood wrote: > Yeah - and we seem to be back to Josh's point about needing 'some math' > to cope with the rows within a block not being a purely random selection. Well, sometimes they are effectively random. But sometimes they are not. The Chaudri et al paper had a formula for estimating randomness based on the grouping of rows in each block, assuming that the sampled blocks were widely spaced (if they aren't there's not much you can do). This is where you get up to needing a 5% sample; you need to take enough blocks that you're confident that the blocks you sampled are representative of the population. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Dec 10, 2013 at 7:49 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Back in 2005/6, I advocated a block sampling method, as described by >> Chaudri et al (ref?) > > I don't think that anyone believes that not doing block sampling is > tenable, fwiw. Clearly some type of block sampling would be preferable > for most or all purposes. We do block sampling now. But then we select rows from those blocks uniformly. Incidentally, if you mean Surajit Chaudhuri, he's a Microsoft Research lead so I would be nervous about anything he's published being patented by Microsoft. -- greg
On Mon, Dec 9, 2013 at 2:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> My reading of the code is that if it is not in the MCV, then it is assumed
>> to have the average selectivity (about 1/n_distinct, but deflating top and
>> bottom for the MCV list). There is also a check that it is less than the
>> least common of the MCV, but I don't know why that situation would ever
>> prevail--that should always be higher or equal to the average selectivity.
>
> I've never seen an n_distinct value of more than 5 digits, regardless
> of reality. Typically I've seen 20-50k, even if the real number is
> much higher.

I don't recall seeing an n_distinct that is literally above 100,000 in the wild, but I've seen negative ones that multiply with reltuples to give values more than that. In test cases it is easy enough to get values in the millions by creating tables using floor(random()*$something).

create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
create table baz2 as select * from baz order by floor;
create table baz3 as select * from baz order by md5(floor::text);

baz is unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.

After analyzing all of them:

select tablename, n_distinct,correlation from pg_stats where tablename like 'baz%' and attname='floor' ;
 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      |      333774 |           1
 baz3      |      361048 |  -0.0118147

So baz is pretty close, while the other two are way off. But the n_distinct computation doesn't depend on the order of the rows, as far as I can tell, but only the composition of the sample. So this can only mean that our "random" sampling method is desperately non-random, right?

> But the n_distinct value is only for non-MCVs, so if we
> estimate the selectivity of column = 'rarevalue' to be
> (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
> the estimate, and making the MCV list longer naturally makes mcvfrac
> bigger.

Ah, I see. By including more things into MCV, we crowd out the rest of the space implicitly.

> I'm not sure how important the
> less-frequent-than-the-least-common-MCV part is, but I'm very sure
> that raising the statistics target helps to solve the problem of
> overestimating the prevalence of uncommon values in a very big table.
>
>> I think that parts of the planner are N^2 in the size of histogram (or was
>> that the size of the MCV list?). So we would probably need a way to use a
>> larger sample size to get more accurate n_distinct and MCV frequencies, but
>> not save the entire histogram that goes with that sample size.
>
> I think the saving the histogram part is important. As you say, the
> MCVs are important for a variety of planning purposes, such as hash
> joins. More than that, in my experience, people with large tables are
> typically very willing to spend more planning time to get a better
> plan, because mistakes are expensive and the queries are likely to run
> for a while anyway. People with small tables care about planning
> time, because it makes no sense to spend an extra 1ms planning a query
> unless you improve the plan by enough to save at least 1ms when
> executing it, and when the tables are small and access is expected to
> be fast anyway that's often not the case.

I would think that the dichotomy is more about the size of the query-plan than of the tables. I think a lot of people with huge tables end up doing mostly indexed lookups in unique or highly selective indexes, once all the planning is done.

Does anyone have generators for examples of cases where increasing the sample size to get better histograms (as opposed to more accurate n_distinct or more accurate MCV) was the key to fixing bad plans? I wonder if it is more bins, or more accurate boundaries, that makes the difference.

Cheers,

Jeff
On 12/10/13 2:17 PM, Peter Geoghegan wrote: > On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark <stark@mit.edu> wrote: >> But I don't really think this is the right way to go about this. >> Research papers are going to turn up pretty specialized solutions that >> are probably patented. We don't even have the basic understanding we >> need. I suspect a basic textbook chapter on multistage sampling will >> discuss at least the standard techniques. > > I agree that looking for information on block level sampling > specifically, and its impact on estimation quality is likely to not > turn up very much, and whatever it does turn up will have patent > issues. We have an entire analytics dept. at work that specializes in finding patterns in our data. I might be able to get some time from them to at least provide some guidance here, if the community is interested. They could really only serve in a consulting role though. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby <jim@nasby.net> wrote: >> I agree that looking for information on block level sampling >> specifically, and its impact on estimation quality is likely to not >> turn up very much, and whatever it does turn up will have patent >> issues. > > > We have an entire analytics dept. at work that specializes in finding > patterns in our data. I might be able to get some time from them to at least > provide some guidance here, if the community is interested. They could > really only serve in a consulting role though. I think that Greg had this right several years ago: it would probably be very useful to have the input of someone with a strong background in statistics. It doesn't seem that important that they already know a lot about databases, provided they can understand what our constraints are, and what is important to us. It might just be a matter of having them point us in the right direction. -- Peter Geoghegan
On 10 December 2013 23:43, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby <jim@nasby.net> wrote: >>> I agree that looking for information on block level sampling >>> specifically, and its impact on estimation quality is likely to not >>> turn up very much, and whatever it does turn up will have patent >>> issues. >> >> >> We have an entire analytics dept. at work that specializes in finding >> patterns in our data. I might be able to get some time from them to at least >> provide some guidance here, if the community is interested. They could >> really only serve in a consulting role though. > > I think that Greg had this right several years ago: it would probably > be very useful to have the input of someone with a strong background > in statistics. It doesn't seem that important that they already know a > lot about databases, provided they can understand what our constraints > are, and what is important to us. It might just be a matter of having > them point us in the right direction. err, so what does stats target mean exactly in statistical theory? Waiting for a statistician, and confirming his credentials before you believe him above others here, seems like wasted time. What your statistician will tell you is that YMMV, depending on the data. So we'll still need a parameter to fine tune things when the default is off. We can argue about the default later, in various levels of rigour. Block sampling, with parameter to specify sample size. +1 -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Block sampling, with parameter to specify sample size. +1 Simon this is very frustrating. Can you define "block sampling"? -- greg
On 11 December 2013 00:28, Greg Stark <stark@mit.edu> wrote: > On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Block sampling, with parameter to specify sample size. +1 > > Simon this is very frustrating. Can you define "block sampling"? Blocks selected using Vitter's algorithm, using a parameterised fraction of the total. When we select a block we should read all rows on that block, to help identify the extent of clustering within the data. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > When we select a block we should read all rows on that block, to help > identify the extent of clustering within the data. So how do you interpret the results of the sample read that way that doesn't introduce bias? -- greg
On Tue, Dec 10, 2013 at 4:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > err, so what does stats target mean exactly in statistical theory? Why would I even mention that to a statistician? We want guidance. But yes, I bet I could give a statistician an explanation of statistics target that they'd understand without too much trouble. > Waiting for a statistician, and confirming his credentials before you > believe him above others here, seems like wasted time. > > What your statistician will tell you is it that YMMV, depending on the data. I'm reasonably confident that they'd give me more than that. > So we'll still need a parameter to fine tune things when the default > is off. We can argue about the default later, in various level of > rigour. > > Block sampling, with parameter to specify sample size. +1 Again, it isn't as if the likely efficacy of *some* block sampling approach is in question. I'm sure analyze.c is currently naive about many things. Everyone knows that there are big gains to be had. Balancing those gains against the possible downsides in terms of impact on the quality of statistics generated is pretty nuanced. I do know enough to know that a lot of thought goes into mitigating and/or detecting the downsides of block-based sampling. -- Peter Geoghegan
On 11 December 2013 00:44, Greg Stark <stark@mit.edu> wrote:
> On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> When we select a block we should read all rows on that block, to help
>> identify the extent of clustering within the data.
>
> So how do you interpret the results of the sample read that way that
> doesn't introduce bias?

Yes, it is not a perfect statistical sample. All sampling is subject to an error that is data dependent. I'm happy that we have an option to select this or not, and a default that maintains current behaviour, since otherwise we might expect some plan instability.

I would like to be able to

* allow ANALYZE to run faster in some cases
* increase/decrease sample size when it matters
* have the default sample size vary according to the size of the table, i.e. a proportional sample

-- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
For what it's worth. I'll quote Chaudhuri et al.'s first line from the abstract about the block sampling. "Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics." And after briefly glancing through the paper, my opinion on why it works is that after making one version of statistics they cross-validate, see how well it goes and then collect more if the cross-validation error is large (for example because the data is clustered). Without this bit, as far as I can tell a simple block-based sampler will be bound to make catastrophic mistakes depending on the distribution. Also, just another point about targets (e.g. X%) for estimating stuff from the samples (as it was discussed in the thread). Basically, there is a point in talking about sampling a fixed target (5%) of the data ONLY if you fix the actual distribution of your data in the table, and decide what statistic you are trying to find, e.g. average, std. dev., a 90% percentile, ndistinct or a histogram and so forth. There won't be a general answer as the percentages will be distribution dependent and statistic dependent. Cheers, Sergey PS I'm not a statistician, but I use statistics a lot ******************************************************************* Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Peter Geoghegan <pg@heroku.com> writes: > Again, it isn't as if the likely efficacy of *some* block sampling > approach is in question. I'm sure analyze.c is currently naive about > many things. It's not *that* naive; this is already about a third-generation algorithm. The last major revision (commit 9d6570b8a4) was to address problems with misestimating the number of live tuples due to nonuniform tuple density in a table. IIRC, the previous code could be seriously misled if the first few pages in the table were significantly non-representative of the live-tuple density further on. I'm not sure how we can significantly reduce the number of blocks examined without re-introducing that hazard in some form. In particular, given that you want to see at least N tuples, how many blocks will you read if you don't have an a-priori estimate of tuple density? You have to decide that before you start sampling blocks, if you want all blocks to have the same probability of being selected and you want to read them in sequence. regards, tom lane
On Tuesday, December 10, 2013, Simon Riggs wrote:
> On 11 December 2013 00:28, Greg Stark <stark@mit.edu> wrote:
>> On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Block sampling, with parameter to specify sample size. +1
>>
>> Simon this is very frustrating. Can you define "block sampling"?
>
> Blocks selected using Vitter's algorithm, using a parameterised
> fraction of the total.

OK, thanks for defining that.

We only need Vitter's algorithm when we don't know in advance how many items we are sampling from (such as for tuples--unless we want to rely on the previous estimate for the current round of analysis). But for blocks, we do know how many there are, so there are simpler ways to pick them.

> When we select a block we should read all rows on that block, to help
> identify the extent of clustering within the data.

But we have no mechanism to store such information (or to use it if it were stored), nor even ways to prevent the resulting skew in the sample from seriously messing up the estimates which we do have ways of storing and using.

Cheers,

Jeff
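A sketch of the kind of simpler selection Jeff alludes to when the total block count is known up front - sequential selection sampling in the style of Knuth's Algorithm S. This is illustrative only; the function name and PRNG are stand-ins, and it makes no claim about what analyze.c currently does.

/* Sketch of selection sampling (Knuth's Algorithm S): choose targblocks of
 * nblocks block numbers uniformly at random when nblocks is known up front.
 * The chosen numbers come out in ascending order, which suits a single
 * sequential pass over the relation. */
#include <stdlib.h>

static unsigned int
sample_known_blocks(unsigned int nblocks, unsigned int targblocks,
                    unsigned int *chosen)
{
    unsigned int seen = 0;      /* block numbers considered so far */
    unsigned int picked = 0;    /* block numbers selected so far */

    while (picked < targblocks && seen < nblocks)
    {
        double  u = rand() / ((double) RAND_MAX + 1.0);   /* uniform in [0, 1) */

        /* Select block "seen" with probability (still wanted) / (still left). */
        if ((double) (nblocks - seen) * u < (double) (targblocks - picked))
            chosen[picked++] = seen;
        seen++;
    }
    return picked;
}

Under Simon's variant upthread, every row on each chosen block would then go into the sample, rather than drawing roughly one row per block.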
On 11 December 2013 01:27, Sergey E. Koposov <math@sai.msu.ru> wrote: > For what it's worth. > > I'll quote Chaudhuri et al. first line from the abstract about the block > sampling. > "Block-level sampling is far more efficient than true uniform-random > sampling over a large database, but prone to significant errors if used to > create database statistics." This glosses over the point that both SQLServer and Oracle use this technique. > And after briefly glancing through the paper, my opinion on why it works is > that after making one version of statistics they cross-validate, see how > well it goes and then collect more if the cross-validation error is large > (for example because the data is clustered). Without this bit, as far as I > can tell a simple block-based sampler will be bound to make catastrophic mistakes > depending on the distribution I don't think it's true that a block based sampler will be *bound* to make "catastrophic mistakes". They can clearly happen, just as they can with random samples, hence the need for a parameter to control the sample. Realistically, I never heard of an Oracle DBA doing advanced statistical mathematics before setting the sample size on ANALYZE. You use the default and bump it up if the sample is insufficient for the data. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 10, 2013 at 10:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 11 December 2013 01:27, Sergey E. Koposov <math@sai.msu.ru> wrote: >> For what it's worth. >> >> I'll quote Chaudhuri et al. first line from the abstract about the block >> sampling. >> "Block-level sampling is far more efficient than true uniform-random >> sampling over a large database, but prone to significant errors if used to >> create database statistics." > > This glosses over the point that both SQLServer and Oracle use this technique. That seems like an unusual omission for Microsoft Research to have made. I didn't read that paper, because undoubtedly it's all patented. But before I figured that out, after finding it on Google randomly, I did read the first couple of paragraphs, which more or less said "what follows - the entire paper - is an explanation as to why it's okay that we do block sampling". -- Peter Geoghegan
On 12/11/2013 01:44 AM, Greg Stark wrote: > On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> When we select a block we should read all rows on that block, to help >> identify the extent of clustering within the data. > So how do you interpret the results of the sample read that way that > doesn't introduce bias? > Initially/experimentally we could just compare it to our current approach :) That is, implement *some* block sampling and then check it against what we currently have. Then figure out the bad differences. Rinse. Repeat. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 11/12/13 19:34, Simon Riggs wrote: > > Realistically, I never heard of an Oracle DBA doing advanced > statistical mathematics before setting the sample size on ANALYZE. You > use the default and bump it up if the sample is insufficient for the > data. > I'm not sure that Oracle's stats and optimizer design is an example to be envied - pretty much all Oracle DBAs I've encountered will apply hints to all queries to get the plan they want... Regards Mark
On Wed, Dec 11, 2013 at 12:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Yes, it is not a perfect statistical sample. All sampling is subject > to an error that is data dependent. Well there's random variation due to the limitations of dealing with a sample. And then there are systemic biases due to incorrect algorithms. You wouldn't be happy if the samples discarded every row with NULLs or every row older than some date etc. These things would not be corrected by larger samples. That's the kind of "error" we're talking about here. But the more I think about things the less convinced I am that there is a systemic bias introduced by reading the entire block. I had assumed larger rows would be selected against but that's not really true, they're just selected against relative to the number of bytes they occupy, which is the correct frequency to sample. Even blocks that are mostly empty don't really bias things. Picture a table that consists of 100 blocks with 100 rows each (value A) and another 100 blocks with only 1 row each (value B). The rows with value B have a 50% chance of being in any given block, which is grossly inflated; however, each block selected with value A will produce 100 rows. So if you sample 10 blocks you'll get 100x10xA and 1x10xB which will be the correct proportion. I'm not actually sure there is any systemic bias here. The larger number of rows per block generate less precise results but from my thought experiments they seem to still be accurate?
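Spelling out the expectation in that 200-block example, for a uniform sample of b blocks:

\mathrm{E}[\text{A rows}] = b \cdot \tfrac{100}{200} \cdot 100 = 50\,b,
\qquad
\mathrm{E}[\text{B rows}] = b \cdot \tfrac{100}{200} \cdot 1 = 0.5\,b

so the expected ratio is 100:1, matching the true ratio of 10,000 A rows to 100 B rows - consistent with Greg's conclusion that the value frequencies come out unbiased, even though far fewer B rows are seen and their estimates are correspondingly noisier.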
On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark@mit.edu> wrote:
> I'm not actually sure there is any systemic bias here. The larger
> number of rows per block generate less precise results but from my
> thought experiments they seem to still be accurate?

So I've done some empirical tests for a table generated by:

create table sizeskew as (select i,j,repeat('i',i) from generate_series(1,1000) as i, generate_series(1,1000) as j);

I find that using the whole block doesn't cause any problem with the avg_width field for the "repeat" column. That does reinforce my belief that we might not need any particularly black magic here.

It does however cause a systemic error in the histogram bounds. It seems the median is systematically overestimated by more and more the larger the number of rows per block are used:

1: 524
4: 549
8: 571
12: 596
16: 602
20: 618 (total sample slightly smaller than normal)
30: 703 (substantially smaller sample)

So there is something clearly wonky in the histogram stats that's affected by the distribution of the sample. The only thing I can think of is maybe the most common elements are being selected preferentially from the early part of the sample which is removing a substantial part of the lower end of the range. But even removing 100 from the beginning shouldn't be enough to push the median above 550.

-- greg
On Wed, Dec 11, 2013 at 12:08 PM, Greg Stark <stark@mit.edu> wrote: > The only thing I can think > of is maybe the most common elements are being selected preferentially > from the early part of the sample which is removing a substantial part > of the lower end of the range. But even removing 100 from the > beginning shouldn't be enough to push the median above 550. Just to follow up here. I think what's going on is that not only are the most_common_vals being preferentially taken from the beginning of the sample but also their frequency is being massively overestimated. All values have a frequency of about .001 but the head of the MCV list has a frequency as high as .10 in some of my tests. -- greg
On 12/11/2013 02:08 PM, Greg Stark wrote: > On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark@mit.edu> wrote: >> I'm not actually sure there is any systemic bias here. The larger >> number of rows per block generates less precise results but from my >> thought experiments they seem to still be accurate? > > So I've done some empirical tests for a table generated by: > create table sizeskew as (select i,j,repeat('i',i) from > generate_series(1,1000) as i, generate_series(1,1000) as j); > > I find that using the whole block doesn't cause any problem with the > avg_width field for the "repeat" column. That does reinforce my belief > that we might not need any particularly black magic here. How large a sample did you use? Remember that the point of doing block-level sampling instead of the current approach would be to allow using a significantly smaller sample (in # of blocks), and still achieve the same sampling error. If the sample is "large enough", it will mask any systemic bias caused by block-sampling, but the point is to reduce the number of sampled blocks. The practical question here is this: What happens to the quality of the statistics if you only read 1/2 the number of blocks you normally would, but included all the rows in the blocks we read in the sample? How about 1/10 ? Or to put it another way: could we achieve more accurate statistics by including all rows from the sampled blocks, while reading the same number of blocks? In particular, I wonder if it would help with estimating ndistinct. It generally helps to have a larger sample for ndistinct estimation, so it might be beneficial. - Heikki
On Dec10, 2013, at 15:32 , Claudio Freire <klaussfreire@gmail.com> wrote: > On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark <stark@mit.edu> wrote: >> >> On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote: >>> >>> >>> Doesn't all that assume a normally distributed random variable? >> >> I don't think so because of the law of large numbers. If you have a large >> population and sample it the sample behaves like a normal distribution even >> if the distribution of the population isn't. > > No, the law of large numbers says that if you have an AVERAGE of many > samples of a random variable, the random variable that is the AVERAGE > behaves like a normal. Actually, that's the central limit theorem, and it doesn't hold for all random variables, only for those with finite expected value and variance. The law of large numbers, in contrast, only tells you that the AVERAGE of n samples of a random variable will converge to the random variable's expected value as n goes to infinity (there are different versions of the law which guarantee different kinds of convergence, weak or strong). best regards, Florian Pflug
On 11 December 2013 12:08, Greg Stark <stark@mit.edu> wrote: > So there is something clearly wonky in the histogram stats that's > affected by the distribution of the sample. ...in the case where the avg width changes in a consistent manner across the table. Well spotted. ISTM we can have a specific cross check for bias in the sample of that nature. We just calculate the avg width per block and then check for correlation of the avg width against block number. If we find bias we can calculate how many extra blocks to sample and from where. There may be other biases also, so we can check for them and respond accordingly. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Greg Stark <stark@mit.edu> writes: > So I've done some empirical tests for a table generated by: > create table sizeskew as (select i,j,repeat('i',i) from > generate_series(1,1000) as i, generate_series(1,1000) as j); > I find that using the whole block doesn't cause any problem with the > avg_width field for the "repeat" column.That does reinforce my belief > that we might not need any particularly black magic here. > It does however cause a systemic error in the histogram bounds. It > seems the median is systematically overestimated by more and more the > larger the number of rows per block are used: Hm. You can only take N rows from a block if there actually are at least N rows in the block. So the sampling rule I suppose you are using is "select up to N rows from each sampled block" --- and that is going to favor the contents of blocks containing narrower-than-average rows. Now in this case, it looks like that ought to favor rows with *smaller* i values, but you say the median goes up not down. So I'm not sure what's going on. I thought at first that TOAST compression might be part of the explanation, but TOAST shouldn't kick in on rows with raw representation narrower than 2KB. Did you do a run with no upper limit on the number of rows per block? Because I'm not sure that tests with a limit in place are a good guide to what happens without it. regards, tom lane
I wrote: > Hm. You can only take N rows from a block if there actually are at least > N rows in the block. So the sampling rule I suppose you are using is > "select up to N rows from each sampled block" --- and that is going to > favor the contents of blocks containing narrower-than-average rows. Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of "i" being undersampled. Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom lane
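To illustrate the difference between the two rules, here is a small simulation (a Python sketch with made-up block shapes chosen to exaggerate the effect; it is not derived from the actual sampling code): capping the number of rows taken per block undersamples the values that live in row-dense blocks, while taking every row in each sampled block does not.

    import random

    # 'narrow' blocks hold 100 small rows each; 'wide' blocks hold 10 large rows each
    blocks = [('narrow', 100)] * 100 + [('wide', 10)] * 100
    true_narrow_fraction = 100 * 100 / (100 * 100 + 100 * 10)    # ~0.909

    def sampled_narrow_fraction(cap=None, blocks_per_sample=10, trials=20000):
        narrow = total = 0
        for _ in range(trials):
            for kind, nrows in random.sample(blocks, blocks_per_sample):
                take = nrows if cap is None else min(nrows, cap)
                total += take
                if kind == 'narrow':
                    narrow += take
        return narrow / total

    print(true_narrow_fraction)            # 0.909...
    print(sampled_narrow_fraction())       # ~0.91  with all rows per sampled block
    print(sampled_narrow_fraction(10))     # ~0.50  with at most 10 rows per sampled block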
On 12/12/13 06:22, Tom Lane wrote: > I wrote: >> Hm. You can only take N rows from a block if there actually are at least >> N rows in the block. So the sampling rule I suppose you are using is >> "select up to N rows from each sampled block" --- and that is going to >> favor the contents of blocks containing narrower-than-average rows. > Oh, no, wait: that's backwards. (I plead insufficient caffeine.) > Actually, this sampling rule discriminates *against* blocks with > narrower rows. You previously argued, correctly I think, that > sampling all rows on each page introduces no new bias because row > width cancels out across all sampled pages. However, if you just > include up to N rows from each page, then rows on pages with more > than N rows have a lower probability of being selected, but there's > no such bias against wider rows. This explains why you saw smaller > values of "i" being undersampled. > > Had you run the test series all the way up to the max number of > tuples per block, which is probably a couple hundred in this test, > I think you'd have seen the bias go away again. But the takeaway > point is that we have to sample all tuples per page, not just a > limited number of them, if we want to change it like this. > > regards, tom lane > > Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus one extra row with a probability equal to the fraction (here 0.25). Either way, if it is determined that you need N rows, then keep selecting pages at random (but never use the same page more than once) until you have at least N rows. Cheers, Gavin
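For concreteness, the fractional-row rule could look something like this (a hypothetical Python sketch of the "integral number of rows plus one extra with probability equal to the fraction" idea; the function name and parameters are invented):

    import math, random

    def rows_to_take(rows_on_page, fraction=0.25):
        # Take floor(fraction * rows_on_page) rows, plus one extra row with
        # probability equal to the fractional remainder, so the expected
        # number taken per page is exactly fraction * rows_on_page.
        target = fraction * rows_on_page
        n = math.floor(target)
        if random.random() < target - n:
            n += 1
        return n

    # A page with 37 rows at fraction 0.25 yields 9 rows, or 10 with probability 0.25.
    print(rows_to_take(37))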
On 12/12/13 06:22, Tom Lane wrote: > I wrote: >> Hm. You can only take N rows from a block if there actually are at least >> N rows in the block. So the sampling rule I suppose you are using is >> "select up to N rows from each sampled block" --- and that is going to >> favor the contents of blocks containing narrower-than-average rows. > Oh, no, wait: that's backwards. (I plead insufficient caffeine.) > Actually, this sampling rule discriminates *against* blocks with > narrower rows. You previously argued, correctly I think, that > sampling all rows on each page introduces no new bias because row > width cancels out across all sampled pages. However, if you just > include up to N rows from each page, then rows on pages with more > than N rows have a lower probability of being selected, but there's > no such bias against wider rows. This explains why you saw smaller > values of "i" being undersampled. > > Had you run the test series all the way up to the max number of > tuples per block, which is probably a couple hundred in this test, > I think you'd have seen the bias go away again. But the takeaway > point is that we have to sample all tuples per page, not just a > limited number of them, if we want to change it like this. > > regards, tom lane > > Hmm... In my previous reply, which hasn't shown up yet! I realized I made a mistake! The fraction/probability could be any of 0.25, 0.50, and 0.75. Cheers, Gavin
On 12/12/13 07:22, Gavin Flower wrote: > On 12/12/13 06:22, Tom Lane wrote: >> I wrote: >>> Hm. You can only take N rows from a block if there actually are at >>> least >>> N rows in the block. So the sampling rule I suppose you are using is >>> "select up to N rows from each sampled block" --- and that is going to >>> favor the contents of blocks containing narrower-than-average rows. >> Oh, no, wait: that's backwards. (I plead insufficient caffeine.) >> Actually, this sampling rule discriminates *against* blocks with >> narrower rows. You previously argued, correctly I think, that >> sampling all rows on each page introduces no new bias because row >> width cancels out across all sampled pages. However, if you just >> include up to N rows from each page, then rows on pages with more >> than N rows have a lower probability of being selected, but there's >> no such bias against wider rows. This explains why you saw smaller >> values of "i" being undersampled. >> >> Had you run the test series all the way up to the max number of >> tuples per block, which is probably a couple hundred in this test, >> I think you'd have seen the bias go away again. But the takeaway >> point is that we have to sample all tuples per page, not just a >> limited number of them, if we want to change it like this. >> >> regards, tom lane >> >> > Surely we want to sample a 'constant fraction' (obviously, in practice > you have to sample an integral number of rows in a page!) of rows per > page? The simplest way, as Tom suggests, is to use all the rows in a > page. > > However, if you wanted the same number of rows from a greater number > of pages, you could (for example) select a quarter of the rows from > each page. In which case, when this is a fractional number: take the > integral number of rows, plus one extra row with a probability equal to > the fraction (here 0.25). > > Either way, if it is determined that you need N rows, then keep > selecting pages at random (but never use the same page more than once) > until you have at least N rows. > > > Cheers, > Gavin > > > Yes, the fraction/probability could actually be one of: 0.25, 0.50, 0.75. But there is a bias introduced by the arithmetic average size of the rows in a page. This results in block sampling favouring large rows, as they are in a larger proportion of pages. For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, using 400 byte pages. In the pathologically worst case, assuming maximum packing density and no page has both types: the large rows would occupy 500 pages and the smaller rows 50 pages. So if one selected 11 pages at random, you get about 10 pages of large rows and about one for small rows! In practice, it would be much less extreme - for a start, not all blocks will be fully packed, most blocks would have both types of rows, and there is usually greater variation in row size - but still a bias towards sampling larger rows. So somehow, this bias needs to be counteracted. Cheers, Gavin
Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: > For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, > using 400 byte pages. In the pathologically worst case, assuming > maximum packing density and no page has both types: the large rows would > occupy 500 pages and the smaller rows 50 pages. So if one selected 11 > pages at random, you get about 10 pages of large rows and about one for > small rows! With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/12/13 08:14, Gavin Flower wrote: > [...] But there is a bias introduced by the arithmetic average size of the rows in a page. This results in block sampling favouring large rows, as they are in a larger proportion of pages. [...] So somehow, this bias needs to be counteracted. Actually, I just thought of a possible way to overcome the bias towards large rows. 1. Calculate (a rough estimate may be sufficient, if not too 'rough') the size of the smallest row. 2. Select a page at random (never selecting the same page twice). 3. Then select rows at random within the page (never selecting the same row twice). For each row selected, accept it with a probability equal to (size of smallest row)/(size of selected row). I think you'll find that will almost completely offset the bias towards larger rows! 4. If you do not have sufficient rows, and you still have pages not yet selected, go to 2. Note that it will be normal for some pages not to have any rows selected, especially for large tables! Cheers, Gavin P.S. I really need to stop thinking about this problem, and get on with my assigned project!!!
On Tue, Dec 10, 2013 at 4:48 PM, Peter Geoghegan <pg@heroku.com> wrote: > Why would I even mention that to a statistician? We want guidance. But > yes, I bet I could give a statistician an explanation of statistics > target that they'd understand without too much trouble. Actually, I think that if we told a statistician about the statistics target, his or her response would be: why would you presume to know ahead of time what statistics target is going to be effective? I suspect that the basic problem is that it isn't adaptive. I think that if we could somehow characterize the quality of our sample as we took it, and then cease sampling when we reached a certain degree of confidence in its quality, that would be helpful. It might not even matter that the sample was clustered from various blocks. -- Peter Geoghegan
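To make the "adaptive" idea concrete, here is one possible shape for such a stop rule (purely a hypothetical Python sketch - the batch size, tolerance, and the read_block() callback are all invented, and a real version would track the statistics we actually care about rather than a simple mean):

    import random, statistics

    def adaptive_sample(read_block, n_blocks, batch=50, rel_tol=0.01, min_batches=4):
        # Sample blocks in random order, a batch at a time, and stop once the
        # running estimate stops moving by more than rel_tol between batches.
        order = list(range(n_blocks))
        random.shuffle(order)
        values, last_est, est, batches = [], None, None, 0
        for start in range(0, n_blocks, batch):
            for blk in order[start:start + batch]:
                values.extend(read_block(blk))   # read_block returns the sampled column's values
            batches += 1
            est = statistics.fmean(values)
            if (last_est is not None and batches >= min_batches
                    and abs(est - last_est) <= rel_tol * abs(last_est)):
                break
            last_est = est
        return est, len(values)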
On 12/12/13 08:31, Kevin Grittner wrote: > Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: > >> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, >> using 400 byte pages. In the pathologically worst case, assuming >> maximum packing density and no page has both types: the large rows would >> occupy 500 pages and the smaller rows 50 pages. So if one selected 11 >> pages at random, you get about 10 pages of large rows and about one for >> small rows! > With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. > > -- > Kevin Grittner > EDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company Sorry, I've simply come up with well argued nonsense! Kevin, you're dead right. Cheers, Gavin
On 12/12/13 08:39, Gavin Flower wrote: > On 12/12/13 08:31, Kevin Grittner wrote: >> Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: >> >>> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, >>> using 400 byte pages. In the pathologically worst case, assuming >>> maximum packing density and no page has both types: the large rows >>> would >>> occupy 500 pages and the smaller rows 50 pages. So if one selected 11 >>> pages at random, you get about 10 pages of large rows and about one for >>> small rows! >> With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. >> >> -- >> Kevin Grittner >> EDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company > Sorry, I've simply come up with well argued nonsense! > > Kevin, you're dead right. > > > Cheers, > Gavin > > I looked at: http://www.postgresql.org/docs/current/interactive/storage-page-layout.html - this says that each row has an overhead, which suggests there should be a bias towards small rows. There must be a lot of things going on, that I'm simply not aware of, that affect sampling bias... Cheers, Gavin
On 12/12/13 09:12, Gavin Flower wrote: > On 12/12/13 08:39, Gavin Flower wrote: >> On 12/12/13 08:31, Kevin Grittner wrote: >>> Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: >>> >>>> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, >>>> using 400 byte pages. In the pathologically worst case, assuming >>>> maximum packing density and no page has both types: the large rows >>>> would >>>> occupy 500 pages and the smaller rows 50 pages. So if one selected 11 >>>> pages at random, you get about 10 pages of large rows and about one >>>> for >>>> small rows! >>> With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows. >>> >>> -- >>> Kevin Grittner >>> EDB: http://www.enterprisedb.com >>> The Enterprise PostgreSQL Company >> Sorry, I've simply come up with well argued nonsense! >> >> Kevin, you're dead right. >> >> >> Cheers, >> Gavin >> >> > I looked at: > http://www.postgresql.org/docs/current/interactive/storage-page-layout.html > > this says that each row has an overhead, which suggests there should > be a bias towards small rows. > > There must be a lot of things going on, that I'm simply not aware of, > that affect sampling bias... > > > Cheers, > Gavin > > Ignoring overheads per row and other things... There will be a biasing effect when the distribution of sizes is not symmetric. For example: when the majority of rows have sizes greater than the arithmetic mean, then most samples will be biased towards larger rows. Similarly there could be a bias towards smaller rows when most rows are smaller than the arithmetic mean. Yes, I did think about this in depth - but it is way too complicated to attempt to quantify the bias, because it depends on too many factors (even just limiting it to the distribution of row sizes). So apart from the nature and volatility of the table, and the pattern of insertions/updates/deletes, there will be a bias depending on the distribution of values in the table. So I despair that a simple, elegant & practical algorithm will ever be found. Therefore, I expect the best answer is probably some kind of empirical adaptive approach - which I think has already been suggested. Cheers, Gavin
I think we're all wet here. I don't see any bias towards larger or smaller rows. Larger values will be on a larger number of pages but there will be fewer of them on any one page. The average effect should be the same. Smaller values might have a higher variance with block based sampling than larger values. But that actually *is* the kind of thing that Simon's approach of just compensating with later samples can deal with. -- greg
On Thu, Dec 12, 2013 at 07:22:59AM +1300, Gavin Flower wrote: > Surely we want to sample a 'constant fraction' (obviously, in > practice you have to sample an integral number of rows in a page!) > of rows per page? The simplest way, as Tom suggests, is to use all > the rows in a page. > > However, if you wanted the same number of rows from a greater number > of pages, you could (for example) select a quarter of the rows from > each page. In which case, when this is a fractional number: take > the integral number of rows, plus one extra row with a probability > equal to the fraction (here 0.25). In this discussion we've mostly used block = 1 postgresql block of 8k. But when reading from a disk, once you've read one block you can basically read the following ones practically for free. So I wonder if you could make your sampling always read 16 consecutive blocks, but then use 25-50% of the tuples. That way you get many more tuples for the same amount of disk I/O seeks. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
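A sketch of what that block selection might look like (hypothetical Python, not the ANALYZE code; the chunk length and counts are arbitrary):

    import random

    def chunked_sample_blocks(total_blocks, n_chunks, chunk_len=16):
        # Pick n_chunks random starting positions and read chunk_len consecutive
        # blocks from each, so most of the I/O is sequential rather than seeks.
        starts = random.sample(range(total_blocks - chunk_len + 1), n_chunks)
        blocks = set()
        for s in starts:
            blocks.update(range(s, s + chunk_len))
        return sorted(blocks)

    # ~100 seeks but ~1600 blocks read; then keep only a fraction of the tuples per block
    print(len(chunked_sample_blocks(1_000_000, 100)))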
On 12/11/2013 02:39 PM, Martijn van Oosterhout wrote: > In this discussion we've mostly used block = 1 postgresql block of 8k. > But when reading from a disk once you've read one block you can > basically read the following ones practically for free. > > So I wonder if you could make your sampling read always 16 consecutive > blocks, but then use 25-50% of the tuples. That way you get many more > tuples for the same amount of disk I/O seeks.. Yeah, that's what I meant by "tune this for the FS". We'll probably have to test a lot of different "block sizes" on different FSes before we arrive at a reasonable size, and even then I'll bet we have to offer a GUC. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Here's an analysis of Jeff Janes' simple example of a table where our n_distinct estimate is way off. On Dec11, 2013, at 00:03 , Jeff Janes <jeff.janes@gmail.com> wrote:
> create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
> create table baz2 as select * from baz order by floor;
> create table baz3 as select * from baz order by md5(floor::text);
>
> baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.
>
> After analyzing all of them:
>
> select tablename, n_distinct,correlation from pg_stats where tablename like 'baz%' and attname='floor' ;
> tablename | n_distinct | correlation
> -----------+-------------+-------------
> baz | 8.56006e+06 | 0.00497713
> baz2 | 333774 | 1
> baz3 | 361048 | -0.0118147

I think I understand what's going on here. I worked with a reduced test case of 1e7 rows containing random values between 0 and 5e5 and instrumented analyze to print the raw ndistinct and nmultiple values of the sample population (i.e. the number of distinct values in the sample, and the number of distinct values which appeared more than once). I've also considered only baz and baz2, and thus removed the then unnecessary md5 column. To account for the reduced table sizes, I adjusted default_statistics_target to 10 instead of 100. The resulting estimates are then

tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
baz | 391685 | 500000
baz2 | 36001 | 500000

ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of those. The sample of baz contains 2989 distinct values, 11 of which appear more than once. The sample of baz2 contains 2878 distinct values, 117 (!) of which appear more than once. The very different results stem from the Duj1 estimator we use. It estimates n_distinct by computing n*d/(n - f1 + f1*n/N) where n is the number of samples, N the number of rows, d the number of distinct samples, and f1 the number of distinct samples occurring exactly once. If all samples are unique (i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly 4% - and the estimate is already less than 1/10 of what it would be if f1 were 0. Which leaves us with the question of why sampling baz2 produces more duplicate values than sampling baz does. Now, if we used block sampling, that behaviour would be unsurprising - baz2 is sorted, so each block very likely contains each value more than once, since the row count exceeds the number of possible values by more than an order of magnitude. In fact, with block sampling, we'd probably see f1 values close to 0 and thus our estimate of n_distinct would be roughly equal to the number of distinct values in the *sample* population, i.e. about 300 or so. So why does Vitter's algorithm fail here? Given that we see inflated numbers of duplicate values in our sample, yet still far fewer than block-based sampling would yield, my guess is that it's the initial reservoir filling that bites us here. After that initial filling step, the reservoir contains a lot of consecutive rows, i.e. a block-based sample taken from just a few blocks. If the replacement phase that follows somehow uses a slightly smaller replacement probability than it should, more of these rows will survive replacement, resulting in exactly the kind of inflated numbers of duplicate values we're seeing.
I've yet to validate this by looking at the reservoir before and after the replacement stage, though. So at least for the purpose of estimating n_distinct, our current sampling method seems to exhibit the worst rather than the best properties of block- and row-based sampling. What conclusions to draw from that, I'm not sure yet - other than that if we move to block-based sampling, we'll certainly have to change the way we estimate n_distinct. best regards, Florian Pflug
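Just to restate the arithmetic: plugging the sampled counts above into that formula reproduces both estimates (a quick Python check, nothing more):

    def duj1(n, N, d, f1):
        # Duj1 n_distinct estimate: n*d / (n - f1 + f1*n/N)
        return n * d / (n - f1 + f1 * n / N)

    n, N = 3000, 10000048
    print(duj1(n, N, 2989, 2989 - 11))     # baz:  ~391685
    print(duj1(n, N, 2878, 2878 - 117))    # baz2: ~36001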
On Wed, Dec 11, 2013 at 2:33 PM, Greg Stark <stark@mit.edu> wrote:
I think we're all wet here. I don't see any bias towards larger or smaller rows. Larger values will be on a larger number of pages but there will be fewer of them on any one page. The average effect should be the same. Smaller values might have a higher variance with block based sampling than larger values. But that actually *is* the kind of thing that Simon's approach of just compensating with later samples can deal with.
I think that looking at all rows in randomly-chosen blocks will not bias size or histograms. But it will bias n_distinct and MCV for some data distributions, unless we find some way to compensate for it.
But even for avg size and histograms, what does block sampling get us? We get larger sample sizes for the same IO, but those samples are less independent (assuming data is not randomly scattered over the table), so the "effective sample size" is less than the true sample size. So we can't just sample 100 times fewer blocks because there are about 100 rows per block--doing so would not bias our avg size or histogram boundaries, but it would certainly make them noisier.
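One textbook way to put a number on that "less independent" point is the cluster-sampling design effect, deff = 1 + (b - 1) * rho, where b is the number of rows taken per block and rho is the within-block correlation of whatever we're estimating (this is standard survey-sampling math, not anything currently in ANALYZE; the rho values below are invented for illustration):

    def effective_sample_size(n_rows, rows_per_block, intraclass_corr):
        # Kish design effect: deff = 1 + (b - 1) * rho;  n_eff = n / deff
        deff = 1 + (rows_per_block - 1) * intraclass_corr
        return n_rows / deff

    # 30,000 rows taken as whole blocks of ~100 rows each:
    print(effective_sample_size(30000, 100, 0.0))   # 30000: perfectly shuffled table
    print(effective_sample_size(30000, 100, 0.1))   # ~2752: mild clustering
    print(effective_sample_size(30000, 100, 1.0))   # 300:   fully clustered (ORDER BY case)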
Cheers,
Jeff
On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug <fgp@phlo.org> wrote:
Here's an analysis of Jeff Janes' simple example of a table where our
n_distinct estimate is way off.
On Dec11, 2013, at 00:03 , Jeff Janes <jeff.janes@gmail.com> wrote:
> create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
> create table baz2 as select * from baz order by floor;
> create table baz3 as select * from baz order by md5(floor::text);
>
> baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.
>
> After analyzing all of them:
>
> select tablename, n_distinct,correlation from pg_stats where tablename like 'baz%' and attname='floor' ;
> tablename | n_distinct | correlation
> -----------+-------------+-------------
> baz | 8.56006e+06 | 0.00497713
> baz2 | 333774 | 1
> baz3 | 361048 | -0.0118147
I think I understand what's going on here. I worked with a reduced test cases
of 1e7 rows containing random values between 0 and 5e5 and instrumented
analyze to print the raw ndistinct and nmultiple values of the sample
population (i.e. the number of distinct values in the sample, and the number
of distinct values which appeared more than once). I've also considered only
baz and baz2, and thus removed the than unnecessary md5 column. To account for
the reduced table sizes, I adjusted default_statistics_target to 10 instead of
100. The resulting estimates are then
tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
baz | 391685 | 500000
baz2 | 36001 | 500000
ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of
those.
The sample of baz contains 2989 distinct values, 11 of which appear more than
once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
appear more than once.
The very different results stem from the Duj1 estimator we use. It estimates
n_distinct by computing n*d/(n - f1 + f1*n/N) where n is the number of
samples, N the number of rows, d the number of distinct samples, and f1 the
number of distinct samples occurring exactly once. If all samples are unique
(i.e. n=d=f1) this yields N. But if f1 is less than d, the results drops very
quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly
0.03% - and the estimate already less than a 1/10 of what it would be if f1
where 0.
Which leaves us with the question why sampling baz2 produces more duplicate
values than sampling baz does. Now, if we used block sampling, that behaviour
would be unsurprising - baz2 is sorted, so each block very likely contains
each value more than since, since the row count exceeds the number of possible
values by more than a magnitude. In fact, with block sampling, we'd probably
see f1 values close to 0 and thus our estimate of n_distinct would be roughly
equal to the number of distinct values in the *sample* population, i.e. about
300 or so.
So why does vitter's algorithm fail here? Given that we see inflated numbers
of duplicate values in our sample, yet still far fewer than block-based
sampling would yield, my guess is that it's the initial reservoir filling that
bites us here. After that initial filling step, the reservoir contains a lot of
consecutive rows, i.e. a block-based sample taken from just a few blocks. If
the replacement phase that follows somehow uses a slightly smaller replacement
probability than it should, more of these rows will survive replacement,
resulting in exactly the kind of inflated numbers of duplicate values we're
seeing. I've yet to validate this by looking at the reservoir before and after
the replacement stage, though.
I think the problem is more subtle than that. It is easier to visualize it if you think of every block having the same number of rows, with that number being fairly large. If you pick 30,000 rows at random from 1,000,000 blocks, the number of rows chosen from any given block should be close to following a Poisson distribution with an average of 0.03, which means about 29113 blocks should have exactly 1 row chosen from them while 441 would have two or more rows chosen from them.
But if you instead select 30,000 rows from 30,000 blocks, which is what we ask Vitter's algorithm to do, you get about a Poisson distribution with an average of 1.0. Then about 11036 blocks have exactly one row chosen from them, and 7927 blocks have two or more rows sampled from them. Another 11,036 blocks get zero rows selected from them due to Vitter, in addition to the 970,000 that didn't even get submitted to Vitter in the first place. That is why you see too many duplicates for clustered data, as too many blocks are sampled multiple times.
The Poisson argument doesn't apply cleanly when blocks have a variable number of rows, but the general principle still applies. Too few blocks have exactly one row sampled from them, and too many blocks have either zero rows or more than one row.
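The arithmetic behind those counts, for anyone who wants to check it (a plain Poisson approximation in Python, ignoring the without-replacement correction):

    from math import exp, factorial

    def poisson_pmf(k, lam):
        return lam ** k * exp(-lam) / factorial(k)

    # 30,000 rows spread over 1,000,000 blocks: lambda = 0.03 rows per block
    blocks, lam = 1_000_000, 0.03
    print(round(blocks * poisson_pmf(1, lam)))                              # ~29113 blocks with exactly 1 row
    print(round(blocks * (1 - poisson_pmf(0, lam) - poisson_pmf(1, lam))))  # ~441 blocks with 2+ rows

    # 30,000 rows taken from only 30,000 candidate blocks: lambda = 1.0
    blocks, lam = 30_000, 1.0
    print(round(blocks * poisson_pmf(0, lam)))                              # ~11036 blocks with no rows
    print(round(blocks * poisson_pmf(1, lam)))                              # ~11036 blocks with exactly 1 row
    print(round(blocks * (1 - poisson_pmf(0, lam) - poisson_pmf(1, lam))))  # ~7927 blocks with 2+ rows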
It would be relatively easy to fix this if we trusted the number of visible rows in each block to be fairly constant. But without that assumption, I don't see a way to fix the sample selection process without reading the entire table.
So at least for the purpose of estimating n_distinct, our current sampling
method seems to exhibit the worst rather than the best properties of block-
and row- based sampling. What conclusions to draw of that, I'm not sure yet -
other that if we move to block-based sampling, we'll certainly have to change
the way we estimate n_distinct.
Perhaps we should consider changing that even if we don't change to block based sampling--but what would the other candidates be? I guess the same paper that presented Duj1 would be a good place to start, but are there others? It sounds like this was discussed several years ago, but I didn't find it in the archives with a casual search.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > It would be relatively easy to fix this if we trusted the number of visible > rows in each block to be fairly constant. But without that assumption, I > don't see a way to fix the sample selection process without reading the > entire table. Yeah, varying tuple density is the weak spot in every algorithm we've looked at. The current code is better than what was there before, but as you say, not perfect. You might be entertained to look at the threads referenced by the patch that created the current sampling method: http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at particularly http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at However ... where this thread started was not about trying to reduce the remaining statistical imperfections in our existing sampling method. It was about whether we could reduce the number of pages read for an acceptable cost in increased statistical imperfection. regards, tom lane
On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jeff Janes <jeff.janes@gmail.com> writes: >> It would be relatively easy to fix this if we trusted the number of visible >> rows in each block to be fairly constant. But without that assumption, I >> don't see a way to fix the sample selection process without reading the >> entire table. > > Yeah, varying tuple density is the weak spot in every algorithm we've > looked at. The current code is better than what was there before, but as > you say, not perfect. You might be entertained to look at the threads > referenced by the patch that created the current sampling method: > http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at > > particularly > http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at > > > However ... where this thread started was not about trying to reduce > the remaining statistical imperfections in our existing sampling method. > It was about whether we could reduce the number of pages read for an > acceptable cost in increased statistical imperfection. Well, why not take a supersample containing all visible tuples from N selected blocks, and do bootstrapping over it, with subsamples of M independent rows each?
On 12/12/2013 10:33 AM, Claudio Freire wrote: > Well, why not take a supersample containing all visible tuples from N > selected blocks, and do bootstrapping over it, with subsamples of M > independent rows each? Well, we still need to look at each individual block to determine grouping correlation. Let's take a worst case example: imagine a table has *just* been created by: CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item; If "category" is fairly low cardinality, then grouping will be severe; we can reasonably expect that if we sample 100 blocks, many of them will have only one category value present. The answer to this is to make our block samples fairly widely spaced and compare them. In this simplified example, if the table had 1000 blocks, we would take blocks 1, 101, 201, 301, 401, etc. Then we would compare the number and content of values found on each block with the number and content found on each other block. For example, if we see that block 101 is entirely the category "cats", and block 701 is entirely the category "shopping" and block 901 is split 60/40 between the categories "transportation" and "voting", then we can assume that the level of grouping is very high, and the number of unknown values we haven't seen is also high. Whereas if 101 is "cats" and 201 is "cats" and 301 through 501 are "cats" with 2% other stuff, then we assume that the level of grouping is moderate and it's just the case that most of the dataset is "cats". Which means that the number of unknown values we haven't seen is low. Whereas if 101, 201, 501, and 901 have near-identical distributions of values, we assume that the level of grouping is very low, and that there are very few values we haven't seen. As someone else pointed out, full-block (the proposal) vs. random-row (our current style) doesn't have a very significant effect on estimates of Histograms and nullfrac, as long as the sampled blocks are widely spaced. Well, nullfrac is affected in the extreme example of a totally ordered table where the nulls are all in one block, but I'll point out that we can (and do) also miss that using our current algo. Estimated grouping should, however, affect MCVs. In cases where we estimate that grouping levels are high, the expected % of observed values should be "discounted" somehow. That is, with total random distribution you have a 1:1 ratio between observed frequency of a value and assumed frequency. However, with highly grouped values, you might have a 2:1 ratio. Again, more math (backed by statistical analysis) is needed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
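A very rough sketch of the detection half of that idea (hypothetical Python - the spacing rule and the "grouping score" are purely illustrative, not a worked-out estimator):

    from collections import Counter

    def spaced_blocks(total_blocks, n_samples):
        # Evenly spaced block numbers, e.g. 1000 blocks / 10 samples -> 0, 100, 200, ...
        stride = max(1, total_blocks // n_samples)
        return list(range(0, total_blocks, stride))[:n_samples]

    def grouping_score(per_block_values):
        # Fraction of each sampled block occupied by that block's single most
        # common value, averaged over blocks: near 1.0 means heavily grouped
        # (the ORDER BY category case), much lower means well shuffled.
        scores = []
        for values in per_block_values:
            counts = Counter(values)
            scores.append(max(counts.values()) / len(values))
        return sum(scores) / len(scores)

Something like that score could then drive the "discount" applied to observed MCV frequencies, but as noted above, the actual math would need real statistical backing.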
On Thu, Dec 12, 2013 at 3:56 PM, Josh Berkus <josh@agliodbs.com> wrote: > > Estimated grouping should, however, affect MCVs. In cases where we > estimate that grouping levels are high, the expected % of observed > values should be "discounted" somehow. That is, with total random > distribution you have a 1:1 ratio between observed frequency of a value > and assumed frequency. However, with highly grouped values, you might > have a 2:1 ratio. Cross validation can help there. But it's costly.
On Thu, Dec 12, 2013 at 10:33 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> It would be relatively easy to fix this if we trusted the number of visible
>> rows in each block to be fairly constant. But without that assumption, I
>> don't see a way to fix the sample selection process without reading the
>> entire table.
>
> Yeah, varying tuple density is the weak spot in every algorithm we've
> looked at. The current code is better than what was there before, but as
> you say, not perfect. You might be entertained to look at the threads
> referenced by the patch that created the current sampling method:
> http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at
>
> particularly
> http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at
Thanks, I will read those.
>
>
> However ... where this thread started was not about trying to reduce
> the remaining statistical imperfections in our existing sampling method.
> It was about whether we could reduce the number of pages read for an
> acceptable cost in increased statistical imperfection.
I think it is pretty clear that n_distinct at least, and probably MCV, would be a catastrophe under some common data distribution patterns if we sample all rows in each block without changing our current computation method. If we come up with a computation that works for that type of sampling, it would probably be an improvement under our current sampling as well. If we find such a thing, I wouldn't want it to get rejected just because the larger block-sampling method change did not make it in.
Well, why not take a supersample containing all visible tuples from N
selected blocks, and do bootstrapping over it, with subsamples of M
independent rows each?
Bootstrapping methods generally do not work well when ties are significant events, i.e. when two values being identical is meaningfully different from them being very close but not identical.
Cheers,
Jeff
On Thu, Dec 12, 2013 at 4:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> Well, why not take a supersample containing all visible tuples from N >> selected blocks, and do bootstrapping over it, with subsamples of M >> independent rows each? > > > Bootstrapping methods generally do not work well when ties are significant > events, i.e. when two values being identical is meaningfully different from > them being very close but not identical. Yes, that's why I meant to say (but I see now that I didn't) that it wouldn't do much for n_distinct, just the histogram.
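One way to read that suggestion, as a sketch (hypothetical Python; the subsample sizes and counts are arbitrary): resample M rows at a time from the block-based supersample and look at how stable a histogram bound such as the median is.

    import random, statistics

    def bootstrap_median(supersample, m, n_subsamples=200):
        # supersample: all visible rows' values from the N sampled blocks
        medians = []
        for _ in range(n_subsamples):
            sub = random.choices(supersample, k=m)   # resample with replacement
            medians.append(statistics.median(sub))
        return statistics.median(medians), statistics.pstdev(medians)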
On Tue, Dec 3, 2013 at 3:30 PM, Greg Stark <stark@mit.edu> wrote:
At multiple conferences I've heard about people trying all sorts of
gymnastics to avoid ANALYZE which they expect to take too long and
consume too much I/O. This is especially a big complain after upgrades
when their new database performs poorly until the new statistics are
in and they did pg_upgrade to avoid an extended downtime and complain
about ANALYZE taking hours.
Out of curiosity, are they using the 3 stage script "analyze_new_cluster.sh"? If so, is the complaint that even the first rounds are too slow, or that the database remains unusable until the last round is complete?
Cheers,
Jeff
On Dec12, 2013, at 19:29 , Tom Lane <tgl@sss.pgh.pa.us> wrote: > However ... where this thread started was not about trying to reduce > the remaining statistical imperfections in our existing sampling method. > It was about whether we could reduce the number of pages read for an > acceptable cost in increased statistical imperfection. True, but Jeff's case shows that even the imperfections of the current sampling method are larger than what the n_distinct estimator expects. Making it even more biased will thus require us to rethink how we obtain an n_distinct estimate or something equivalent. I don't mean that as an argument against changing the sampling method, just as something to watch out for. best regards, Florian Pflug
On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 30000 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great.
Performance is often chaotic near transition points, so I try to avoid data sets that are slightly bigger or slightly smaller than RAM (or some other limit).
Do you know how many io channels your SSD has (or whatever the term of art is for SSD drives)?
On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB) with 4 GB of RAM goes from ~106 seconds to ~19 seconds.
However, I'm not sure what problem we want to solve here. I certainly would not wish to give a background maintenance process permission to confiscate my entire RAID throughput for its own operation. Perhaps this could only be active for explicit analyze, and only if vacuum_cost_delay=0?
Perhaps there should be something like "alter background role autovac set ...". Otherwise we are going to end up with an "autovacuum_*" shadow parameter for many of our parameters, see "autovacuum_work_mem" discussions.
Cheers,
Jeff
On 12/17/2013 12:06 AM, Jeff Janes wrote: > On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas > <hlinnakangas@vmware.com>wrote: > >> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be >> very easy, patch attached. Your mileage may vary, but I'm seeing a nice >> gain from this on my laptop. Taking a 30000 page sample of a table with >> 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds >> without the patch, and less than a second with the patch, with >> effective_io_concurrency=10. If anyone with a good test data set loaded >> would like to test this and post some numbers, that would be great. > > Performance is often chaotic near transition points, so I try to avoid data > sets that are slightly bigger or slightly smaller than RAM (or some other > limit). > > Do you know how many io channels your SSD has (or whatever the term of art > is for SSD drives)? No idea. It's an Intel 335. > On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB) > with 4 GB of RAM goes from ~106 seconds to ~19 seconds. > > However, I'm not sure what problem we want to solve here. The case that Greg Stark mentioned in the email starting this thread is doing a database-wide ANALYZE after an upgrade. In that use case, you certainly want to get it done as quickly as possible, using all the available resources. > I certainly would not wish to give a background maintenance process > permission to confiscate my entire RAID throughput for its own > operation. Then don't set effective_io_concurrency. If you're worried about that, you probably wouldn't want any other process to monopolize the RAID array either. > Perhaps this could only be active for explicit analyze, and only if > vacuum_cost_delay=0? That would be a bit weird, because ANALYZE in general doesn't obey vacuum_cost_delay. Maybe it should, though... > Perhaps there should be something like "alter background role autovac set > ...". Otherwise we are going to end up with an "autovacuum_*" shadow > parameter for many of our parameters, see "autovacuum_work_mem" discussions. Yeah, so it seems. - Heikki
I assume we never came up with a TODO from this thread: --------------------------------------------------------------------------- On Tue, Dec 3, 2013 at 11:30:44PM +0000, Greg Stark wrote: > At multiple conferences I've heard about people trying all sorts of > gymnastics to avoid ANALYZE which they expect to take too long and > consume too much I/O. This is especially a big complain after upgrades > when their new database performs poorly until the new statistics are > in and they did pg_upgrade to avoid an extended downtime and complain > about ANALYZE taking hours. > > I always gave the party line that ANALYZE only takes a small > constant-sized sample so even very large tables should be very quick. > But after hearing the same story again in Heroku I looked into it a > bit further. I was kind of shocked but the numbers. > > ANALYZE takes a sample of 300 * statistics_target rows. That sounds > pretty reasonable but with default_statistics_target set to 100 that's > 30,000 rows. If I'm reading the code right It takes this sample by > sampling 30,000 blocks and then (if the table is large enough) taking > an average of one row per block. Each block is 8192 bytes so that > means it's reading 240MB of each table.That's a lot more than I > realized. > > It means if your table is anywhere up to 240MB you're effectively > doing a full table scan and then throwing out nearly all the data > read. > > Worse, my experience with the posix_fadvise benchmarking is that on > spinning media reading one out of every 16 blocks takes about the same > time as reading them all. Presumably this is because the seek time > between tracks dominates and reading one out of every 16 blocks is > still reading every track. So in fact if your table is up to about > 3-4G ANALYZE is still effectively going to do a full table scan, at > least as far as I/O time goes. > > The current algorithm seems like it was designed with a 100G+ table in > mind but the consequences on the more common 100M-100G tables weren't > really considered. Consider what this means for partitioned tables. If > they partition their terabyte table into 10 partitions ANALYZE will > suddenly want to use 10x as much I/O which seems like a perverse > consequence. > > I'm not sure I have a prescription but my general feeling is that > we're spending an awful lot of resources going after a statistically > valid sample when we can spend a lot less resources and get something > 90% as good. Or if we're really going to read that much data that we > might as well use more of the rows we find. > > -- > greg > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +