On Fri, 2008-12-12 at 17:05 +0000, Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>
> > The amount of I/O could stay the same, just sample all rows on block.
> > Lifting the sample size will help large tables. Will it be perfect? No.
> > But I'll take "better" over "not working at all".
>
> That will just raise the table size at which the problems start. It'll still
> be a constant-sized sample.
Work with me here. I want to make the situation better. It still won't
be perfect, but is that an argument against any action at all?
> It will also introduce strange biases. For instance in a clustered table it'll
> think there are a lot more duplicates than there really are because it'll see
> lots of similar values.
>
> Incidentally we *do* do block sampling. We pick random blocks and then pick
> random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime
> around then. It dramatically reduced the i/o requirements but there were long
> discussions of how to do it without introducing biases.
No, we pick random rows. On bigger tables, they get further apart
typically and so we miss any clustering. I mean that we should pick a
random block and read all rows on it.
> > If we are going to quote literature we should believe all the
> > literature. We can't just listen to some people that did a few tests
> > with sample size, but then ignore the guy that designed the MS optimizer
> > and many others.
>
> I'm not sure what you're talking about regarding "some people that did a few
> tests". I looked around for the paper I keep referencing and can't find it on
> my laptop. I'll look for it online. But it included a section which was a
> survey of past results from other papers and the best results required
> stupidly large sample sizes to get anything worthwhile.
Even if you find it, we still need to know why we would listen to the
research in the absent paper yet ignore the conclusion in the paper by
the man in charge of the MS optimizer who said that block level sampling
is a good idea.
-- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support