Re: Collect frequency statistics for arrays - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Collect frequency statistics for arrays
Date
Msg-id 12427.1331224252@sss.pgh.pa.us
Whole thread Raw
In response to Re: Collect frequency statistics for arrays  (Noah Misch <noah@leadboat.com>)
Responses Re: Collect frequency statistics for arrays  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
Noah Misch <noah@leadboat.com> writes:
> On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
>> On reflection my idea above is wrong; for example assume that we have a
>> column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
>> what I said, we'd reduce the histogram to {1,2}, which might accurately
>> capture the set of lengths present but fails to show that 1 is much more
>> common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
>> would capture the situation perfectly in one-tenth the space that the
>> current logic does.

> Granted.  When the next sample finds 899/101 instead, though, the optimization
> vanishes.

No, you missed my next point.  That example shows that sometimes a
smaller histogram can represent the situation with zero error, but in
all cases a smaller histogram can represent the situation with perhaps
more error than a larger one.  Since we already have a defined error
tolerance, we should try to generate a histogram that is as small as
possible while still not exceeding the error tolerance.

Now, it might be that doing that is computationally impractical, or
too complicated to be reasonable.  But it seems to me to be worth
looking into.

> If you want to materially narrow typical statistics, Alexander's
> proposal looks like the way to go.  I'd guess array columns always
> having DEC <= default_statistics_target are common enough to make that
> representation the dominant representation, if not the only necessary
> representation.

Well, I don't want to have two representations; I don't think it's worth
the complexity.  But certainly we could consider switching to a
different representation if it seems likely to usually be smaller.
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: sortsupport for text
Next
From: Peter Geoghegan
Date:
Subject: Re: pg_stat_statements and planning time