Re: Collect frequency statistics for arrays - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Collect frequency statistics for arrays
Date
Msg-id 20120308155014.GD13139@tornado.leadboat.com
Whole thread Raw
In response to Re: Collect frequency statistics for arrays  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Collect frequency statistics for arrays  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > True. If (max count - min count + 1) is small, enumerating of frequencies
> > is both more compact and more precise representation. Simultaneously,
> > if (max count - min count + 1) is large, we can run out of
> > statistics_target with such representation. We can use same representation
> > of count distribution as for scalar column value: MCV and HISTOGRAM, but it
> > would require additional statkind and statistics slot. Probably, you've
> > better ideas?
> 
> I wasn't thinking of introducing two different representations,
> but just trimming the histogram length when it's larger than necessary.
> 
> On reflection my idea above is wrong; for example assume that we have a
> column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
> what I said, we'd reduce the histogram to {1,2}, which might accurately
> capture the set of lengths present but fails to show that 1 is much more
> common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> would capture the situation perfectly in one-tenth the space that the
> current logic does.

Granted.  When the next sample finds 899/101 instead, though, the optimization
vanishes.  You save 90% of the space, perhaps 10% of the time.  If you want to
materially narrow typical statistics, Alexander's proposal looks like the way
to go.  I'd guess array columns always having DEC <= default_statistics_target
are common enough to make that representation the dominant representation, if
not the only necessary representation.


pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: sortsupport for text
Next
From: Pavel Stehule
Date:
Subject: Re: check function patch