Greg,
> The analogous case in our situation is not having 300 million distinct
> values, since we're not gathering info on specific values, only the
> buckets. We need, for example, 600 samples *for each bucket*. Each bucket
> is chosen to have the same number of samples in it. So that means that we
> always need the same number of samples for a given number of buckets.
I think that's plausible. The issue is that in advance of the sampling we
don't know how many buckets there *are*. So we first need a proportional
sample to determine the number of buckets, then we need to retain a histogram
sample proportional to the number of buckets. I'd like to see someone with a
PhD in this weigh in, though.
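To make the two-phase idea concrete, here's a rough sketch. All names and
numbers here are illustrative assumptions, not anything PostgreSQL actually
does: a pilot sample size of 1000, 600 samples per bucket (the figure from
the quoted text), and a plain list standing in for the table.

```python
import random

SAMPLES_PER_BUCKET = 600  # per-bucket figure from the quoted message
PILOT_SIZE = 1000         # assumed size for the initial proportional sample

def two_phase_sample(table, stats_target):
    # Phase 1: take a small proportional sample just to estimate how
    # many equal-depth buckets are actually worth building.
    pilot = random.sample(table, min(len(table), PILOT_SIZE))
    n_buckets = min(stats_target, len(set(pilot)))

    # Phase 2: retain a histogram sample sized proportionally to the
    # estimated bucket count, capped at the table size.
    needed = n_buckets * SAMPLES_PER_BUCKET
    return random.sample(table, min(len(table), needed)), n_buckets
```

The point of the sketch is just the control flow: the bucket count isn't
known up front, so the sample size for the histogram pass can only be fixed
after the pilot pass has run.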
> Really? Could you send references? The paper I read surveyed previous work
> and found that you needed to scan up to 50% of the table to get good
> results. 50-250% is considerably looser than what I recall it considering
> "good" results so these aren't entirely inconsistent but I thought previous
> results were much worse than that.
Actually, based on my several years selling performance tuning, I generally
found that as long as estimates were correct within a factor of 3 (33% to
300%) the correct plan was generally chosen.
There are papers on block-based sampling which were already cited on -hackers;
I'll hunt through the archives later.
--
Josh Berkus
PostgreSQL @ Sun
San Francisco