On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Manfred Koizar <mkoi-pg@aon.at> writes:
>> What I have in mind is a kind of "Double Vitter" algorithm. [...]
>> random sample of sample_size block numbers, and then to sample the rows
>> out of this pool of blocks.
>
>That assumption is faulty, though --- consider wholly-empty pages.
>
>A bigger problem is that this makes the sampling quite nonuniform,
>because rows that are on relatively low-density pages would be more
>likely to become part of the final sample than rows that are on pages
>with lots of tuples.
This sounds like you are assuming that I want to take exactly one tuple
out of each block of the block sample. This is not the case. In the
second round I plan to apply the same (or a better) Vitter method as it
is done now. The main difference is that blocks will be adressed
indirectly through the array of block numbers obtained in the first
round.
> Thus for example your sample would tend to favor
>rows with wide values of variable-width columns and exclude narrower
>values. (I am not certain that the existing algorithm completely avoids
>this trap, but at least it tries.)
I'm reading 7.4 source code and I fail to see how it does this. If the
relation starts with an atypical distribution of wide/narrow or
dead/alive tuples, a wrong value for tuplesperpage is used for the rest
of the sampling.
Tuples immediately following one or more dead tuples have a better
chance of being selected. This may be called as random as anything else
and not favouring a special property. OTOH after long runs of dead
tuples consecutive tuples are likely to be selected.
Your comment about nonuniformity above exactly describes the current
algorithm: Once the initial sample is fetched and tuplesperpage is
determined, targpos is computed without any further feedback. If
targpos points to a sparsely populated area (with wide tuples or with
many dead tuples) tuples in this area are more likely to get into the
sample than tuples in densely populated areas (with many small active
tuples).
I think that cutting down the number of blocks to be looked at does not
affect these problems.
ServusManfred