Re: benchmarking the query planner - Mailing list pgsql-hackers

From Greg Stark
Subject Re: benchmarking the query planner
Date
Msg-id 4136ffa0812121001x3fde1871t558fd8c58070b7d@mail.gmail.com
Whole thread Raw
In response to Re: benchmarking the query planner  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: benchmarking the query planner  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Fri, Dec 12, 2008 at 5:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Incidentally we *do* do block sampling. We pick random blocks and then pick
>> random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime
>> around then. It dramatically reduced the i/o requirements but there were long
>> discussions of how to do it without introducing biases.
>
> No, we pick random rows. On bigger tables, they get further apart
> typically and so we miss any clustering. I mean that we should pick a
> random block and read all rows on it.

I think what's happening here is that our sampling method is based on
the assumption that a records location is not related to its value.

Consider a table which is clustered and has two copies of each value.
When we look at a block and see n/2 values and we know there are 1000
blocks then a human would conjecture that there are 1000*n/2 distinct
values throughout the table and every value is represented twice
throughout the whole table. But if we're assuming the records are
randomly distributed then looking at the whole block will actually
throw us off completely. We'll deduce from the fact that we saw every
value twice that there must be hundreds of copies spread throughout
the database and there must be a lot less than 1000*n/2 distinct
values.

I think you need to find two different formulas, one which represents
a clustered table and one which represents randomly distributed data.
Then you need a way to measure just how clustered the data is so you
know how much weight to give each formula. Perhaps comparing the
number of duplicates in whole-block samples versus overall random
selections would give that measure.

-- 
greg


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: benchmarking the query planner
Next
From: Tom Lane
Date:
Subject: Re: WIP: default values for function parameters