Manfred Koizar <mkoi-pg@aon.at> writes:
> On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A bigger problem is that this makes the sampling quite nonuniform,
>> because rows that are on relatively low-density pages would be more
>> likely to become part of the final sample than rows that are on pages
>> with lots of tuples.
> This sounds like you are assuming that I want to take exactly one tuple
> out of each block of the block sample. This is not the case.

No, I understood that you wanted to resample, but [ ... thinks for
a while ... ] hmm, now I can't construct a failure case either.  I must
have done the math wrong before.

There's still a risk of not being able to collect N rows out of N
blocks, if you are unfortunate enough to select a lot of wholly-empty
pages.  But that seems like a low-probability scenario; besides, such a
table would be so desperately in need of VACUUM FULL that the possible
low quality of the stats hardly matters.

You should not need to use the Vitter algorithm for the block-level
selection, since you can know the number of blocks in the table in
advance. You can just use the traditional method of choosing each block
or not with probability (k/K), where k = number of sample blocks still
needed, K = number of blocks from here to the end. You'd run the Vitter
algorithm separately to decide whether to keep or discard each live row
you find in the blocks you read.
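
Just to make the bookkeeping concrete, here's a rough standalone sketch
of that two-stage scheme (this is not PostgreSQL code: the constants and
the fake page reader are invented for illustration, and it uses the
plain reservoir method, Algorithm R, where the real thing would use
Vitter's algorithm):

#include <stdio.h>
#include <stdlib.h>

#define TOTAL_BLOCKS    10000   /* pretend table size in pages */
#define SAMPLE_BLOCKS   300     /* number of blocks to read */
#define SAMPLE_ROWS     300     /* target sample size N */

/* stand-in for reading a heap page and counting its live rows */
static int
fake_live_rows_on_page(int blkno)
{
    (void) blkno;
    return rand() % 60;
}

int
main(void)
{
    long    reservoir[SAMPLE_ROWS]; /* row ordinals standing in for tuples */
    int     rows_kept = 0;
    long    rows_seen = 0;          /* live rows found in the blocks read */
    int     k = SAMPLE_BLOCKS;      /* sample blocks still needed */
    int     blocks_read = 0;
    int     blkno;

    srand(12345);

    for (blkno = 0; blkno < TOTAL_BLOCKS && k > 0; blkno++)
    {
        int     K = TOTAL_BLOCKS - blkno;   /* blocks from here to the end */
        int     nrows,
                i;

        /* block stage: take this block with probability k/K */
        if ((double) rand() / ((double) RAND_MAX + 1) >= (double) k / K)
            continue;
        k--;
        blocks_read++;

        nrows = fake_live_rows_on_page(blkno);

        /* row stage: reservoir over rows whose total count is unknown */
        for (i = 0; i < nrows; i++)
        {
            rows_seen++;
            if (rows_kept < SAMPLE_ROWS)
                reservoir[rows_kept++] = rows_seen;     /* still filling up */
            else
            {
                /* keep this row with probability SAMPLE_ROWS / rows_seen */
                long    victim = rand() % rows_seen;

                if (victim < SAMPLE_ROWS)
                    reservoir[victim] = rows_seen;
            }
        }
    }

    printf("read %d blocks, saw %ld live rows, kept %d in the sample\n",
           blocks_read, rows_seen, rows_kept);
    return 0;
}

The point to notice is that the block stage needs no reservoir at all,
since K is known up front; only the row stage is sampling from a
population of unknown size, which is where Vitter's algorithm earns its
keep.
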
I do like this, since it eliminates the current method's bias towards
estimating the number of live rows from the density found only near the
start of the table.  At the end you'd know the total number of live rows
on all the pages you read, and it's reasonable to extrapolate that total
to the full table size.
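
(Illustrative numbers only: if the 300 pages actually read turned out to
hold 22800 live rows, and the table has 10000 pages in total, the natural
estimate would be 22800 * 10000/300 = 760000 live rows for the whole
table.)
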
Question: if the table size is less than N blocks, are you going to read
every block, or try to reduce the number of blocks sampled?  If you don't
adjust the sample size, then I think this would perform worse for
intermediate-size tables than the current method does ... perhaps not so
much at sample size = 3000, but at larger sizes it would hurt.  A lot of
people are setting the stats target to 100, which means a sample size of
30000 --- how do the page-access counts look in that case?

regards, tom lane