Re: Collect frequency statistics for arrays - Mailing list pgsql-hackers

From: Noah Misch
Subject: Re: Collect frequency statistics for arrays
Date: 2012-03-08
Msg-id: 20120308192246.GB20680@tornado.leadboat.com
In response to: Re: Collect frequency statistics for arrays (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Thu, Mar 08, 2012 at 11:30:52AM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
> >> On reflection my idea above is wrong; for example assume that we have a
> >> column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
> >> what I said, we'd reduce the histogram to {1,2}, which might accurately
> >> capture the set of lengths present but fails to show that 1 is much more
> >> common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> >> would capture the situation perfectly in one-tenth the space that the
> >> current logic does.
> 
> > Granted.  When the next sample finds 899/101 instead, though, the optimization
> > vanishes.
> 
> No, you missed my next point.  That example shows that sometimes a
> smaller histogram can represent the situation with zero error, but in
> all cases a smaller histogram can represent the situation with perhaps
> more error than a larger one.  Since we already have a defined error
> tolerance, we should try to generate a histogram that is as small as
> possible while still not exceeding the error tolerance.
> 
> Now, it might be that doing that is computationally impractical, or
> too complicated to be reasonable.  But it seems to me to be worth
> looking into.

Yes, I did miss your point.

One characteristic favoring this approach is its equal applicability to both
STATISTIC_KIND_HISTOGRAM and STATISTIC_KIND_DECHIST.
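
To make the idea concrete, here is a rough standalone sketch (illustrative
only, not from any posted patch) of the search Tom describes: build
equi-depth histograms of increasing size over a sorted sample of array
lengths and keep the smallest one whose implied cumulative frequencies stay
within a stated tolerance.  The estimate used here (the fraction of
histogram entries <= v) is a deliberate simplification of the planner's
actual interpolation logic.

/*
 * Illustrative sketch: find the smallest equi-depth histogram over a
 * sorted sample whose implied cumulative frequencies stay within "tol"
 * of the true ones.  Brute force; speed is beside the point here.
 */
#include <stdio.h>
#include <stdlib.h>

/* Entry i of a k-entry equi-depth histogram is the sample value at
 * quantile i/(k-1). */
static void
build_histogram(const int *sample, int n, int *hist, int k)
{
	int			i;

	for (i = 0; i < k; i++)
		hist[i] = sample[(int) ((double) i * (n - 1) / (k - 1))];
}

/* Estimated P(value <= v): fraction of histogram entries <= v. */
static double
est_cdf(const int *hist, int k, int v)
{
	int			i,
				le = 0;

	for (i = 0; i < k; i++)
		if (hist[i] <= v)
			le++;
	return (double) le / k;
}

/* True P(value <= v) from the full sorted sample. */
static double
true_cdf(const int *sample, int n, int v)
{
	int			i,
				le = 0;

	for (i = 0; i < n; i++)
		if (sample[i] <= v)
			le++;
	return (double) le / n;
}

/* Smallest k >= 2 whose worst-case CDF error is within tol. */
static int
min_histogram_size(const int *sample, int n, double tol, int kmax)
{
	int		   *hist = malloc(kmax * sizeof(int));
	int			k;

	for (k = 2; k <= kmax; k++)
	{
		double		maxerr = 0.0;
		int			i;

		build_histogram(sample, n, hist, k);
		for (i = 0; i < n; i++)
		{
			double		e = est_cdf(hist, k, sample[i]) -
						true_cdf(sample, n, sample[i]);

			if (e < 0)
				e = -e;
			if (e > maxerr)
				maxerr = e;
		}
		if (maxerr <= tol)
			break;
	}
	free(hist);
	return k > kmax ? kmax : k;
}

int
main(void)
{
	/* the example upthread: 900 arrays of length 1, 100 of length 2 */
	int			sample[1000];
	int			i;

	for (i = 0; i < 1000; i++)
		sample[i] = (i < 900) ? 1 : 2;

	printf("smallest histogram within 1%% error: %d entries\n",
		   min_histogram_size(sample, 1000, 0.01, 100));
	return 0;
}

On the 900/100 example this settles on ten entries, matching Tom's hand
reduction.  With the 899/101 sample I mentioned, the same ten-entry
histogram still lands within a 1% tolerance (estimated 0.900 versus true
0.899 for length 1), so the reduction survives sampling drift rather than
vanishing.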

> > If you want to materially narrow typical statistics, Alexander's
> > proposal looks like the way to go.  I'd guess array columns always
> > having DEC <= default_statistics_target are common enough to make that
> > representation the dominant representation, if not the only necessary
> > representation.
> 
> Well, I don't want to have two representations; I don't think it's worth
> the complexity.  But certainly we could consider switching to a
> different representation if it seems likely to usually be smaller.

Perhaps some heavy array users could provide input: what are some typical
length ranges among arrays in your applications on which you use "arr &&
const", "arr @> const" or "arr <@ const" searches?

