Thread: Re: [GENERAL] Large DB
[time to move this to -hackers] On Fri, 02 Apr 2004 11:16:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Manfred Koizar <mkoi-pg@aon.at> writes: >> The first step, however, (acquire_sample_rows() in analyze.c) has to >> read more rows than finally end up in the sample. It visits less than >> O(nblocks) pages but certainly more than O(1). > >> A vague feeling tries to tell me that the number of page reads is >> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which >> grow like O(ln(n)). > >Good guess. Vitter's paper says the expected time to sample n rows from >a table of size N is O(n * (1 + log(N/n))). Well, for what I tried to find out my wild guess seems to be wrong. I don't doubt that Vitter's formula is correct, but it assumes that access to any tuple has the same cost. This does not apply to our problem, however. With 100 tuples per page, we access the first sample_size tuples at a cost of 0.01 sequential page reads per tuple. Later we use less and less tuples per page which results in higher per-tuple-cost. Near the end of a large relation we can expect to access only one tuple per page and more and more pages are skipped, so that prefetching doesn't help any more. Playing around with some real numbers (for 100 tuples/page and a sample size of 3000) I got: rel | page size | reads ------+------------- 30 | 30 300 | 300 expectation is something like 299.9995 500 | 499 1K | 990 3K | 2.6K 30K | 8K 100K | 12K 1M | 19K 10M | 26K 100M | 33K This growth rate is steeper than O(log(nblocks)). >> I have an idea how this could be done with O(1) page reads. What I have in mind is a kind of "Double Vitter" algorithm. Whatever we do to get our sample of rows, in the end the sampled rows come from no more than sample_size different blocks. So my idea is to first create a random sample of sample_size block numbers, and then to sample the rows out of this pool of blocks. I have to think harder though, what to do about those 400 pages that are not accessed when the sample size is 3000 ... >The hard part is getting a genuinely random sample when we don't know N >in advance. We do however know the table size in blocks, so if you're >willing to make assumptions about constant tuple density you could do >something different. (But the tuple density assumption is exactly the >weak spot of what we've got, so I'm unconvinced that would be a big step >forward.) Starting the scan at some random blocks should help against the common case of unusual distribution of dead tuples near the start of the relation. And I plan to factor information about dead tuple hits into an increasingly better estimation of dead/live tuple ratio. Servus Manfred
Manfred Koizar <mkoi-pg@aon.at> writes: > What I have in mind is a kind of "Double Vitter" algorithm. Whatever we > do to get our sample of rows, in the end the sampled rows come from no > more than sample_size different blocks. So my idea is to first create a > random sample of sample_size block numbers, and then to sample the rows > out of this pool of blocks. That assumption is faulty, though --- consider wholly-empty pages. A bigger problem is that this makes the sampling quite nonuniform, because rows that are on relatively low-density pages would be more likely to become part of the final sample than rows that are on pages with lots of tuples. Thus for example your sample would tend to favor rows with wide values of variable-width columns and exclude narrower values. (I am not certain that the existing algorithm completely avoids this trap, but at least it tries.) regards, tom lane
On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Manfred Koizar <mkoi-pg@aon.at> writes: >> What I have in mind is a kind of "Double Vitter" algorithm. [...] >> random sample of sample_size block numbers, and then to sample the rows >> out of this pool of blocks. > >That assumption is faulty, though --- consider wholly-empty pages. > >A bigger problem is that this makes the sampling quite nonuniform, >because rows that are on relatively low-density pages would be more >likely to become part of the final sample than rows that are on pages >with lots of tuples. This sounds like you are assuming that I want to take exactly one tuple out of each block of the block sample. This is not the case. In the second round I plan to apply the same (or a better) Vitter method as it is done now. The main difference is that blocks will be adressed indirectly through the array of block numbers obtained in the first round. > Thus for example your sample would tend to favor >rows with wide values of variable-width columns and exclude narrower >values. (I am not certain that the existing algorithm completely avoids >this trap, but at least it tries.) I'm reading 7.4 source code and I fail to see how it does this. If the relation starts with an atypical distribution of wide/narrow or dead/alive tuples, a wrong value for tuplesperpage is used for the rest of the sampling. Tuples immediately following one or more dead tuples have a better chance of being selected. This may be called as random as anything else and not favouring a special property. OTOH after long runs of dead tuples consecutive tuples are likely to be selected. Your comment about nonuniformity above exactly describes the current algorithm: Once the initial sample is fetched and tuplesperpage is determined, targpos is computed without any further feedback. If targpos points to a sparsely populated area (with wide tuples or with many dead tuples) tuples in this area are more likely to get into the sample than tuples in densely populated areas (with many small active tuples). I think that cutting down the number of blocks to be looked at does not affect these problems. ServusManfred
Manfred Koizar <mkoi-pg@aon.at> writes: > On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> A bigger problem is that this makes the sampling quite nonuniform, >> because rows that are on relatively low-density pages would be more >> likely to become part of the final sample than rows that are on pages >> with lots of tuples. > This sounds like you are assuming that I want to take exactly one tuple > out of each block of the block sample. This is not the case. No, I understood that you wanted to resample, but [ ... thinks for awhile ... ] hmm, now I can't construct a failure case either. I must have done the math wrong before. There's still a risk of not being able to collect N rows out of N blocks, if you are unfortunate enough to select a lot of wholly-empty pages. But that seems like a low-probability scenario; besides such a table would be so desperately in need of VACUUM FULL that the possible low quality of the stats hardly matters. You should not need to use the Vitter algorithm for the block-level selection, since you can know the number of blocks in the table in advance. You can just use the traditional method of choosing each block or not with probability (k/K), where k = number of sample blocks still needed, K = number of blocks from here to the end. You'd run the Vitter algorithm separately to decide whether to keep or discard each live row you find in the blocks you read. I do like this, since it eliminates the current method's bias towards estimating the number of live rows from the density found near the start of the table only. At the end you'd know the total number of live rows on all the pages you read, and it's reasonable to extrapolate that total to the full table size. Question: if the table size is less than N blocks, are you going to read every block or try to reduce the number of blocks sampled? If you don't adjust the sample size then I think this would perform worse for intermediate-size tables than the current method does ... perhaps not so much at sample size = 3000, but at larger sizes it would hurt. A lot of people are setting the stats target to 100 which means a sample size of 30000 --- how do the page-access counts look in that case? regards, tom lane
On Fri, 02 Apr 2004 18:06:12 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: >You should not need to use the Vitter algorithm for the block-level >selection, since you can know the number of blocks in the table in >advance. You can just use the traditional method of choosing each block >or not with probability (k/K), where k = number of sample blocks still >needed, K = number of blocks from here to the end. Sounds reasonable. I have to play around a bit more to get a feeling where the Vitter method gets more efficient. > You'd run the Vitter >algorithm separately to decide whether to keep or discard each live row >you find in the blocks you read. You mean once a block is sampled we inspect it in any case? This was not the way I had planned to do it, but I'll keep this idea in mind. >Question: if the table size is less than N blocks, are you going to read >every block or try to reduce the number of blocks sampled? Don't know yet. >people are setting the stats target to 100 which means a sample size of >30000 --- how do the page-access counts look in that case? rel | page size | reads ------+------------- 300 | 300 3000 | 3000 5000 | 4999 10K | 9.9K 30K | 25.8K 300K | 85K 1M | 120K 10M | 190K 100M | 260K 1G | 330K This is exactly the table I posted before (for sample size 3000) with every entry multiplied by 10. Well, not quite exactly, but the differences are far behind the decimal point. So for our purposes, for a given relation size the number of pages accessed is proportional to the sample size. ServusManfred
Manfred Koizar <mkoi-pg@aon.at> writes: >> You'd run the Vitter >> algorithm separately to decide whether to keep or discard each live row >> you find in the blocks you read. > You mean once a block is sampled we inspect it in any case? This was > not the way I had planned to do it, but I'll keep this idea in mind. Well, once we've gone to the trouble of reading in a block we definitely want to count the tuples in it, for the purposes of extrapolating the total number of tuples in the relation. Given that, I think the most painless route is simply to use the Vitter algorithm with the number-of-tuples-scanned as the count variable. You could dump the logic in acquire_sample_rows that tries to estimate where to read the N'th tuple from. If you like I can send you the Vitter paper off-list (I have a PDF of it). The comments in the code are not really intended to teach someone what it's good for ... regards, tom lane
On Fri, 02 Apr 2004 19:57:47 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: >If you like I can send you the Vitter paper off-list (I have a PDF of >it). The comments in the code are not really intended to teach someone >what it's good for ... Yes, please. [Would have sent this off-list. But I'm blacklisted.] ServusManfred