Re: Collect frequency statistics for arrays - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Collect frequency statistics for arrays
Date
Msg-id CAPpHfdtwyn7QFy-1mo+fHHof=Bf8BTdbOO84a_aH9gGLXTmb0w@mail.gmail.com
Whole thread Raw
In response to Re: Collect frequency statistics for arrays  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Collect frequency statistics for arrays
List pgsql-hackers
On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, one other thing about the count histogram: seems like we are
frequently generating uselessly large ones.  For instance, do ANALYZE
in the regression database and then run

select tablename,attname,elem_count_histogram from pg_stats
 where elem_count_histogram is not null;

You get lots of entries that look like this:

 pg_proc             | proallargtypes | {1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,6,6,2.80556}
 pg_proc             | proargmodes    | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.61111}
 pg_proc             | proargnames    | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,8,8,8,14,14,15,16,3.8806}
 pg_proc             | proconfig      | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
 pg_class            | reloptions     | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}

which seems to me to be a rather useless expenditure of space.
Couldn't we reduce the histogram size when there aren't many
different counts?

It seems fairly obvious to me that we could bound the histogram
size with (max count - min count + 1), but maybe something even
tighter would work; or maybe I'm missing something and this would
sacrifice accuracy.

True. If (max count - min count + 1) is small, enumerating of frequencies is both more compact and more precise representation. Simultaneously, if (max count - min count + 1) is large, we can run out of statistics_target with such representation. We can use same representation of count distribution as for scalar column value: MCV and HISTOGRAM, but it would require additional statkind and statistics slot. Probably, you've better ideas?

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: review: CHECK FUNCTION statement
Next
From: "Kevin Grittner"
Date:
Subject: Re: autovacuum locks