Re: O(samplesize) tuple sampling, proof of concept - Mailing list pgsql-patches

From Manfred Koizar
Subject Re: O(samplesize) tuple sampling, proof of concept
Date
Msg-id s4j370l6ra56tvodcbg9baaf682q6cu3pn@email.aon.at
Whole thread Raw
In response to Re: O(samplesize) tuple sampling, proof of concept  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
On Mon, 05 Apr 2004 15:37:07 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>I wouldn't bother with a GUC variable for the production patch.

Among other things the GUC variable will be thrown out for the final
version.

>> Once a block is selected for inspection, all tuples of this
>> block are accessed to get a good estimation of the live : dead row
>> ratio.
>
>Why are you bothering to compute the live:dead ratio?

That was basically bad wording.  It should have been "... to get a good
estimation of live rows per page."  Counting dead rows turned out to be
trivial, so I just did it and included the number in the debug messages.
Then it happened to be useful for method 2.

>> Because I was afraid that method 1 might be too expensive in terms of
>> CPU cycles, I implemented a small variation that skips tuples without
>> checking them for visibility; this is sampling_method 2.
>
>And?  Does it matter?

There's a clearly visible difference for mid-size relations.  I'm not
sure whether this can be attributed to visibility bit updating or other
noise-contributing factors.

Method 2 gives a row count estimation error between 10 and 17% where
method 1 is less than 1% off.  (My updates generated dead tuples at very
evenly distributed locations by using conditions like WHERE mod(twenty,
7) = 0).

>If that's as bad as it gets I think we are OK.  You should redo the test
>with larger sample size though (try stats target = 100) to see if the
>answer changes.

Will do.

>I find -u diffs close to unreadable for reviewing purposes.  Please
>submit diffs in -c format in future.

De gustibus non est disputandum :-)
Fortunately this patch wouldn't look much different.  There is just a
bunch of "+" lines.

>AFAICS the rows will *always* be sorted already, and so this qsort is an
>utter waste of cycles.  No?

I don't think so.  We get the *blocks* in the correct order.  But tuples
are still sampled by the Vitter method which starts to replace random
tuples after the pool is filled.

BTW, thanks for the paper!

Servus
 Manfred

pgsql-patches by date:

Previous
From: Tom Lane
Date:
Subject: Re: O(samplesize) tuple sampling, proof of concept
Next
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] logging statement levels