Home > mailing lists

Re: benchmarking the query planner - Mailing list pgsql-hackers

From	Greg Stark
Subject	Re: benchmarking the query planner
Date	December 12, 2008 17:04:30
Msg-id	4136ffa0812121001x3fde1871t558fd8c58070b7d@mail.gmail.com Whole thread Raw
In response to	Re: benchmarking the query planner (Simon Riggs <simon@2ndQuadrant.com>)
Responses	Re: benchmarking the query planner (Simon Riggs <simon@2ndQuadrant.com>)
List	pgsql-hackers

Tree view

On Fri, Dec 12, 2008 at 5:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Incidentally we *do* do block sampling. We pick random blocks and then pick
>> random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime
>> around then. It dramatically reduced the i/o requirements but there were long
>> discussions of how to do it without introducing biases.
>
> No, we pick random rows. On bigger tables, they get further apart
> typically and so we miss any clustering. I mean that we should pick a
> random block and read all rows on it.

I think what's happening here is that our sampling method is based on
the assumption that a records location is not related to its value.

Consider a table which is clustered and has two copies of each value.
When we look at a block and see n/2 values and we know there are 1000
blocks then a human would conjecture that there are 1000*n/2 distinct
values throughout the table and every value is represented twice
throughout the whole table. But if we're assuming the records are
randomly distributed then looking at the whole block will actually
throw us off completely. We'll deduce from the fact that we saw every
value twice that there must be hundreds of copies spread throughout
the database and there must be a lot less than 1000*n/2 distinct
values.

I think you need to find two different formulas, one which represents
a clustered table and one which represents randomly distributed data.
Then you need a way to measure just how clustered the data is so you
know how much weight to give each formula. Perhaps comparing the
number of duplicates in whole-block samples versus overall random
selections would give that measure.

-- 
greg

pgsql-hackers by date:

From: Tom Lane
Date: 12 December 2008, 17:03:42
Subject: Re: benchmarking the query planner

From: Tom Lane
Date: 12 December 2008, 17:07:09
Subject: Re: WIP: default values for function parameters

Re: benchmarking the query planner - Mailing list pgsql-hackers

Previous

Next