
From Greg Stark
Subject Re: More thoughts about planner's cost estimates
Date
Msg-id 87k67zdymk.fsf@stark.xeocode.com
In response to Re: More thoughts about planner's cost estimates  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: More thoughts about planner's cost estimates  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > And a 5% sample is pretty big. In fact, my tests earlier showed the I/O from
> > 5% block sampling took just as long as reading all the blocks. Even if we
> > figure out what's causing that (IMHO surprising) result and improve matters, I
> > would only expect it to be 3-4x faster than a full scan.
> 
> One way to reduce the I/O pain from extensive sampling would be to turn
> VACUUM ANALYZE into a genuine combined operation instead of a mere
> notational shorthand for two separate scans.

Except that we're also looking for every way we can to avoid having vacuum
touch every page, so that large tables with narrow hot spots can still afford
to be vacuumed regularly during peak hours.

But for most users, analyze doesn't really have to run as often as vacuum. One
sequential scan per night doesn't seem like that big a deal to me.

> I'd still be worried about the CPU pain though.  ANALYZE can afford to
> expend a pretty fair number of cycles per sampled tuple, but with a
> whole-table sample that's going to add up.

That is a concern. The algorithm is pretty efficient, though: it basically
amounts to hashing all the tuples, keeping only a limited number of hash
buckets, and throwing the rest of the data away.
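
To make that concrete, here's a toy sketch of that style of estimator
(Flajolet/Wegman-style adaptive sampling). This is illustrative only, not the
actual ANALYZE code, and the names and constants are made up: hash every
value, retain only hashes that fall in a shrinking fraction of the hash
space, and scale the retained count back up.

#include <stdint.h>
#include <stdio.h>

#define CAPACITY 64             /* max hashes retained at any time */

static uint64_t sample[CAPACITY];
static int nsample = 0;
static int level = 0;           /* keep hashes whose top `level` bits are zero */

/* 64-bit finalizer from MurmurHash3; any decent mixing function would do */
static uint64_t
hash64(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

static int
kept_at_level(uint64_t h, int lvl)
{
    return lvl == 0 || (h >> (64 - lvl)) == 0;
}

static void
add_value(uint64_t key)
{
    uint64_t h = hash64(key);
    int      i;

    if (!kept_at_level(h, level))
        return;                 /* most tuples are thrown away right here */

    for (i = 0; i < nsample; i++)
        if (sample[i] == h)
            return;             /* duplicate value, already counted */

    sample[nsample++] = h;

    /* overflow: tighten the filter and evict hashes that no longer pass */
    while (nsample >= CAPACITY)
    {
        int j = 0;

        level++;
        for (i = 0; i < nsample; i++)
            if (kept_at_level(sample[i], level))
                sample[j++] = sample[i];
        nsample = j;
    }
}

static double
estimate_ndistinct(void)
{
    /* each retained hash stands for about 2^level distinct values */
    return (double) nsample * (double) ((uint64_t) 1 << level);
}

int
main(void)
{
    uint64_t i;

    for (i = 0; i < 100000; i++)
        add_value(i % 5000);    /* 100000 rows, only 5000 distinct keys */

    printf("estimated n_distinct = %.0f (true value 5000)\n",
           estimate_ndistinct());
    return 0;
}

The per-tuple work is just a hash, a filter test, and occasionally a scan of
a small fixed-size array, but over a whole-table scan even that adds up.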

Things could get more complicated if we move beyond simple n_distinct
estimates to more complex stats, such as performing dependence tests on the
data for multi-column statistics.
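
For instance, a dependence test for a pair of columns could be as simple as a
chi-square test over a contingency table built from the sample. Again, this
is purely an illustrative sketch, not anything that exists today:

#include <stdio.h>

#define NA 3                    /* distinct values seen in column a */
#define NB 3                    /* distinct values seen in column b */

/*
 * Pearson chi-square statistic for a contingency table; a large value
 * relative to the critical value at (NA-1)*(NB-1) degrees of freedom
 * suggests the two columns are dependent.
 */
static double
chi_square(int counts[NA][NB])
{
    double row[NA] = {0}, col[NB] = {0}, total = 0, stat = 0;
    int    i, j;

    for (i = 0; i < NA; i++)
        for (j = 0; j < NB; j++)
        {
            row[i] += counts[i][j];
            col[j] += counts[i][j];
            total += counts[i][j];
        }

    for (i = 0; i < NA; i++)
        for (j = 0; j < NB; j++)
        {
            double expected = row[i] * col[j] / total;
            double diff = counts[i][j] - expected;

            if (expected > 0)
                stat += diff * diff / expected;
        }
    return stat;
}

int
main(void)
{
    /* counts[i][j] = sampled rows with a = i and b = j */
    int counts[NA][NB] = {
        {30, 2, 1},
        {3, 40, 2},
        {1, 4, 17},
    };

    printf("chi-square = %.2f (df = %d)\n",
           chi_square(counts), (NA - 1) * (NB - 1));
    return 0;
}

Note that's quadratic in the number of column pairs you test, which is
another reason the CPU cost of a whole-table sample worries me.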

-- 
greg


