Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Improving N-Distinct estimation by ANALYZE
Date
Msg-id 1136802512.21025.386.camel@localhost.localdomain
Whole thread Raw
In response to Re: Improving N-Distinct estimation by ANALYZE  (Greg Stark <gsstark@mit.edu>)
Responses Re: Improving N-Distinct estimation by ANALYZE
Re: Improving N-Distinct estimation by ANALYZE
Re: Improving N-Distinct estimation by ANALYZE
List pgsql-hackers
On Fri, 2006-01-06 at 16:13 -0500, Greg Stark wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> 
> > Before we start debating merits of proposals based on random reads, can
> > someone confirm that the sampling code actually does read randomly? I
> > looked at it yesterday; there is a comment that states that blocks to be
> > scanned are passed to the analyze function in physical order, and AFAICT
> > the function that chooses blocks does so based strictly on applying a
> > probability function to block numbers as it increments a counter. It
> > seems that any reading is actually sequential and not random, which
> > makes all the random_page_cost hand-waving null and void.
> 
> Hm. I'm curious just how much that behaves like a sequential scan actually. I
> think I'll do some experiments. 
> 
> Reading 1% (1267 read, 126733 skipped):      7748264us
> Reading 2% (2609 read, 125391 skipped):     12672025us
> Reading 5% (6502 read, 121498 skipped):     19005678us
> Reading 5% (6246 read, 121754 skipped):     18509770us
> Reading 10% (12975 read, 115025 skipped):     19305446us
> Reading 20% (25716 read, 102284 skipped):     18147151us
> Reading 50% (63656 read, 64344 skipped):     18089229us
> Reading 100% (128000 read, 0 skipped):         18173003us
> 
> These numbers don't make much sense to me. It seems like 5% is about as slow
> as reading the whole file which is even worse than I expected. I thought I was
> being a bit pessimistic to think reading 5% would be as slow as reading 20% of
> the table.

Just to put a few rumours to bed:

- the current code does *not* use block sampling, it uses random row
sampling. (There is a part of the code that selects the "next block" but
that should not confuse you into thinking that the whole block is
sampled).

- yes, the random sampling is random - please read the code and comments

- yes, I would expect the results you get. If you sample 5% of rows and
each block has on average at least 20 rows, then we should expect the
majority of blocks to be hit.

Best Regards, Simon Riggs



pgsql-hackers by date:

Previous
From: Stefan Kaltenbrunner
Date:
Subject: Re: Fw: Is anyone interested in getting PostgreSQL working
Next
From: Lukas Smith
Date:
Subject: Re: Improving N-Distinct estimation by ANALYZE