Greg Stark <gsstark@MIT.EDU> writes:
> Perhaps what this indicates is that the real meat is in track sampling, not
> block sampling.
Fwiw, I've done a little benchmarking and I'm starting to think this isn't a
bad idea. I see a dramatic speed improvement for samples of 1-10% as the block
size increases. Presumably this is what Hannu described: larger blocks reduce
the number of tracks that have to be read to cover the sample.
I see improvements up to block sizes of around 256M or so, but my data is
pretty questionable since I'm busy watching TV in MythTV in another window.
The recording is on another drive, but it still seems to make the numbers
jump around a bit.
I expect there's a trade-off between keeping enough blocks in the sample for
it to be representative on the one hand, and larger blocks being much faster
to read in on the other.
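Just to put rough numbers on that trade-off (purely illustrative, not from
the benchmark above): for a fixed number of sampled bytes, the number of
separate reads, and hence seeks, drops in proportion to the chunk size, but
so does the number of independent spots in the table we look at. Something
like:

    #include <stdio.h>

    int
    main(void)
    {
        /* Purely illustrative: a fixed sample worth 30000 8k blocks of
         * bytes, read in chunks of various sizes.  Bigger chunks mean
         * fewer seeks, but also fewer independent spots in the table. */
        const long sample_bytes = 30000L * 8192;
        const long chunk_sizes[] = {8192, 65536, 262144, 1048576};

        for (int i = 0; i < 4; i++)
            printf("%7ld-byte chunks -> %ld separate reads\n",
                   chunk_sizes[i], sample_bytes / chunk_sizes[i]);
        return 0;
    }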
I would suggest setting the block size in the block sampling algorithm to
something like max(8k, sqrt(table size)). That gives 8k blocks for anything
up to 64M but takes better advantage of the speed increase available from
sequential I/O on larger tables; in my experiments that was about a 50%
increase in speed.
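To be concrete, here's a rough sketch in C of what I have in mind (the
function name is made up for illustration, this isn't actual analyze.c code):

    #include <math.h>
    #include <stdint.h>

    #define BLCKSZ 8192                     /* the standard 8k page size */

    /*
     * Illustrative sketch only: choose the chunk size for block sampling
     * as max(8k, sqrt(table size)), rounded down to a whole number of 8k
     * pages so chunks stay page-aligned.
     */
    static uint64_t
    sample_chunk_size(uint64_t table_bytes)
    {
        uint64_t chunk = (uint64_t) sqrt((double) table_bytes);

        if (chunk < BLCKSZ)
            chunk = BLCKSZ;

        return (chunk / BLCKSZ) * BLCKSZ;   /* round to a page boundary */
    }

So a 64M table still gets 8k chunks, a 1G table gets 32k chunks, and a 100G
table gets 320k chunks.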
Actually, maybe something even more aggressive would be better, say
(table size)^0.75, so it kicks in sooner than at 64M tables and reaches
larger block sizes on reasonably sized tables.
Note, this doesn't mean anything like changing page sizes; it just means
selecting runs of adjacent blocks that hopefully lie on the same track when
possible.
--
greg