Re: O(samplesize) tuple sampling, proof of concept - Mailing list pgsql-patches

From Tom Lane
Subject Re: O(samplesize) tuple sampling, proof of concept
Date
Msg-id 14645.1081193827@sss.pgh.pa.us
Whole thread Raw
In response to O(samplesize) tuple sampling, proof of concept  (Manfred Koizar <mkoi-pg@aon.at>)
Responses Re: O(samplesize) tuple sampling, proof of concept  (Manfred Koizar <mkoi-pg@aon.at>)
List pgsql-patches
Manfred Koizar <mkoi-pg@aon.at> writes:
> The old sampling method is still there.  I have added a GUC variable
> sampling_method to make testing and benchmarking easier.

I wouldn't bother with a GUC variable for the production patch.

> Once a block is selected for inspection, all tuples of this
> block are accessed to get a good estimation of the live : dead row
> ratio.

Why are you bothering to compute the live:dead ratio?  The statistics
model doesn't have any place for a dead-tuples count, so there's no need
to think about anything but live tuples.  (This is a historical
artifact, perhaps, arising from the fact that originally analysis
happened after VACUUM FULL and so the number of dead tuples was
guaranteed to be zero at the time.  But still, I'm not sure how we'd
factor a dead-tuples count into the estimates if we had it.)

> Because I was afraid that method 1 might be too expensive in terms of
> CPU cycles, I implemented a small variation that skips tuples without
> checking them for visibility; this is sampling_method 2.

And?  Does it matter?  (I'd guess not in practice, as checking a tuple
for visibility is cheap if someone's already marked it good.)

> At 1000 pages there is only a difference of approximately 1%,
> at 3000 pages the difference has its maximum of ca. 15%.  We can sell
> this by pointing to the better quality of the statistics.

If that's as bad as it gets I think we are OK.  You should redo the test
with larger sample size though (try stats target = 100) to see if the
answer changes.

> Are there any spots in the documentation that
> should be updated?

AFAIR there is noplace that specifically needs to be changed.

> diff -ruN ../base/src/backend/commands/analyze.c src/backend/commands/analyze.c

I find -u diffs close to unreadable for reviewing purposes.  Please
submit diffs in -c format in future.

> +    /*
> +     * If we ran out of tuples then we're done.  No sort is needed,
> +     * since they're already in order.
> +     *
> +     * Otherwise we need to sort the collected tuples by position
> +     * (itempointer).  I don't care for corner cases where the tuples
> +     * are already sorted.
> +     */
> +    if (numrows == targrows)
> +        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

AFAICS the rows will *always* be sorted already, and so this qsort is an
utter waste of cycles.  No?

            regards, tom lane

pgsql-patches by date:

Previous
From: Fabien COELHO
Date:
Subject: Re: hint infrastructure setup (v3)
Next
From: Tom Lane
Date:
Subject: Re: O(samplesize) tuple sampling, proof of concept