Re: Collect frequency statistics for arrays - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Collect frequency statistics for arrays
Msg-id CAPpHfdusvZH6zi1iRp627iEE6V4YD5pU5GSbqtx_995sJg7w4w@mail.gmail.com
In response to Re: Collect frequency statistics for arrays  (Noah Misch <noah@leadboat.com>)
Hi!

A patch fixing most of these issues is attached.

On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch <noah@leadboat.com> wrote:
> I find distressing the thought of having two copies of the lossy sampling
> code, each implementing the algorithm with different variable names and levels
> of generality.  We might someday extend this to hstore, and then we'd have yet
> another copy.  Tom commented[1] that ts_typanalyze() and array_typanalyze()
> should remain distinct, and I agree.  However, they could call a shared
> counting module.  Is that practical?  Possible API:
>
> typedef struct LossyCountCtl;
> LossyCountCtl *LossyCountStart(float s,
>                                                           float epsilon,
>                                                           int2 typlen,
>                                                           bool typbyval,
>                                                           Oid eqfunc); /* + hash func, a few others */
> void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
> TrackItem **LossyCountGetAll(LossyCountCtl *ctl);
>

I'm not sure about a shared lossy counting module, because the shared part of the code would be relatively small. The part of the compute_array_stats function that takes care of array decompression, distinct occurrence counting, the distinct element count histogram, packing statistics slots, etc. is much larger than the lossy counting algorithm itself. Maybe there are other opinions in the community?
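For reference, here is a minimal standalone sketch of the Manku-Motwani lossy counting algorithm that such a shared module would encapsulate. The names echo Noah's proposed API but this is an illustration only: int elements stand in for Datums, and a linear scan stands in for the hash table a real module would use.

#include <stdlib.h>
#include <math.h>

typedef struct
{
    int     value;  /* element value (stands in for Datum) */
    int     count;  /* occurrences counted since tracking began */
    int     delta;  /* maximum possible undercount at insertion time */
} TrackItem;

typedef struct
{
    int         bucket_width;   /* ceil(1/epsilon) */
    long        n;              /* total elements seen so far */
    int         nitems;         /* currently tracked items */
    int         maxitems;
    TrackItem  *items;
} LossyCountCtl;

static LossyCountCtl *
LossyCountStart(double epsilon)
{
    LossyCountCtl *ctl = malloc(sizeof(LossyCountCtl));

    ctl->bucket_width = (int) ceil(1.0 / epsilon);
    ctl->n = 0;
    ctl->nitems = 0;
    ctl->maxitems = 16;
    ctl->items = malloc(ctl->maxitems * sizeof(TrackItem));
    return ctl;
}

static void
LossyCountAdd(LossyCountCtl *ctl, int value)
{
    int     bucket;
    int     i;

    ctl->n++;
    /* current bucket number: ceil(n / bucket_width) */
    bucket = (int) ((ctl->n + ctl->bucket_width - 1) / ctl->bucket_width);

    /* linear scan for brevity; a real module would use a hash table */
    for (i = 0; i < ctl->nitems; i++)
    {
        if (ctl->items[i].value == value)
        {
            ctl->items[i].count++;
            return;
        }
    }

    /* not tracked yet: new counter, undercounted by at most bucket - 1 */
    if (ctl->nitems == ctl->maxitems)
    {
        ctl->maxitems *= 2;
        ctl->items = realloc(ctl->items, ctl->maxitems * sizeof(TrackItem));
    }
    ctl->items[ctl->nitems].value = value;
    ctl->items[ctl->nitems].count = 1;
    ctl->items[ctl->nitems].delta = bucket - 1;
    ctl->nitems++;

    /* at each bucket boundary, evict counters that may be infrequent */
    if (ctl->n % ctl->bucket_width == 0)
    {
        int     j = 0;

        for (i = 0; i < ctl->nitems; i++)
        {
            if (ctl->items[i].count + ctl->items[i].delta > bucket)
                ctl->items[j++] = ctl->items[i];
        }
        ctl->nitems = j;
    }
}

Usage would be ctl = LossyCountStart(epsilon), then LossyCountAdd(ctl, elem) per element; a LossyCountGetAll would return the tracked items, where every element whose true frequency exceeds epsilon is guaranteed to still be tracked and each count underestimates the true count by at most epsilon * n.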

> I think this is an improvement, but some code out there may rely on the
> ability to get stakind = 4 data from the most_common_vals column.  We'll need
> to mention this in the release notes as an incompatibility.

I'm not sure I understand the release notes mechanism. Does it require anything in the patch itself?

> > + /*
> > +  * Let there be n independent events with probabilities p[]. This function
> > +  * calculates the probabilities that exactly k of those events occur, for
> > +  * each k in [0;m]. Imagine a matrix M of size (n + 1) x (m + 1), where
> > +  * element M[i,j] denotes the probability that exactly j of the first i
> > +  * events occur. Obviously M[0,0] = 1. Each subsequent event increases the
> > +  * total number of occurred events if it occurs, and leaves that number
> > +  * unchanged if it doesn't. So, by the law of total probability:
> > +  * M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
> > +  * for i > 0, j > 0, and M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
> > +  * There can also be events with low probabilities; their summed
> > +  * probability is passed in the "rest" parameter.
> > +  */
> > + static float *
> > + calc_distr(float *p, int n, int m, float rest)
> > + {
>
> > +     /* Take care of events with low probabilities. */
> > +     if (rest > 0.0f)
> > +     {
> > +             /*
> > +              * The probability that none of the events forming the "rest"
> > +              * probability occur tends to exp(-rest) as the number of such
> > +              * events goes to infinity. A further simplification is to
> > +              * replace those events with a single event of probability
> > +              * (1 - exp(-rest)).
> > +              */
> > +             rest = 1.0f - exp(-rest);
>
> What is the name of the underlying concept in probability theory?

The closest concept to the calculated distribution is the multinomial distribution. But it's not exactly the same, because the multinomial distribution gives the probability of a particular number of occurrences of each event, not the probability of the total number of occurrences. Actually, the distribution is calculated just from the assumption that the events are independent. The closest concept for the "rest" probability is the Poisson approximation: exp(-rest) is the Poisson probability of zero occurrences with mean rest. It's a rather rough approximation, but I can't invent anything better with low computational complexity.
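As a self-contained illustration of that recurrence (a simplified rewrite, not the patch's calc_distr; the in-place row update is a choice of this sketch):

#include <stdlib.h>
#include <math.h>

/*
 * Given n independent events with probabilities p[0..n-1], return an array
 * dist[0..m] where dist[k] is the probability that exactly k events occur.
 * Low-probability events are folded into one extra pseudo-event with
 * probability 1 - exp(-rest), as in the quoted comment.
 */
static float *
calc_distr_sketch(const float *p, int n, int m, float rest)
{
    float  *dist = calloc(m + 1, sizeof(float));
    int     i,
            j;

    dist[0] = 1.0f;             /* M[0,0] = 1: no events, zero occurrences */

    for (i = 0; i <= n; i++)
    {
        float   pi;

        if (i < n)
            pi = p[i];
        else if (rest > 0.0f)
            pi = 1.0f - expf(-rest);    /* the collapsed "rest" pseudo-event */
        else
            break;

        /*
         * One step of M[i,j] = M[i-1,j] * (1 - pi) + M[i-1,j-1] * pi.
         * Walking j downward lets us update the row in place, because
         * dist[j-1] still holds the previous row's value when dist[j]
         * is recomputed.
         */
        for (j = m; j >= 0; j--)
        {
            float   v = dist[j] * (1.0f - pi);

            if (j > 0)
                v += dist[j - 1] * pi;
            dist[j] = v;
        }
    }
    return dist;
}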

> > + /*
> > +  * Array selectivity estimation based on most common elements statistics
> > +  * for the "column <@ const" case. The assumption that element occurrences
> > +  * are independent implies a certain distribution of array lengths, and
> > +  * typically the real distribution of lengths differs from it significantly.
> > +  * For example, even if we have a set of arrays each holding 1 integer
> > +  * element in the range [0;10], element occurrences are not independent,
> > +  * because in the case of independence we
>
> Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> '{6,19,4}' cannot appear?

I refer to a column where each array contains exactly one element, i.e. the only possible values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}', '{10}'. That is a corner case. But a similar situation occurs when, for example, the distinct element count is distributed between 1 and 3. That differs significantly from the distribution implied by independent occurrences.
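A back-of-the-envelope illustration of that corner case: each of the 11 elements appears in 1/11 of the rows, so under the independence assumption the distinct element count would be roughly Binomial(11, 1/11):

    P(length = 0) = (10/11)^11 ~= 0.35
    P(length = 1) = 11 * (1/11) * (10/11)^10 ~= 0.39

while in the actual column P(length = 1) = 1.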

> > +  * would have nonzero probabilities for lengths 0, 1, 2, etc. In the
> > +  * "column @> const" and "column && const" cases we usually have a "const"
> > +  * with a low summary frequency of elements (otherwise the selectivity is
> > +  * close to 0 or 1 correspondingly). That's why the effect of the
> > +  * length-distribution dependence is negligible there. In the
> > +  * "column <@ const" case the summary frequency of elements is high
> > +  * (otherwise the selectivity is close to 0).
>
> What does the term "summary frequency" denote?

I meant the sum of the frequencies of the "const" array's elements.
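For example (illustrative notation, not from the patch): for const = '{1,2,3}' the summary frequency is freq(1) + freq(2) + freq(3), where freq(x) is x's frequency taken from the most common elements statistics.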

> > +     /*
> > +      * "rest" is the average number of array elements which aren't present
> > +      * in mcelem.
> > +      */
> > +     rest = avg_length;
>
> You define "rest" here as an array length ...
>
> > +
> > +     default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2);
> > +
> > +     mcelem_index = 0;
> > +
> > +     /*
> > +      * mult is the multiplier estimating the probability that none of the
> > +      * mcelems which are not present in the constant occur.
> > +      */
> > +     mult = 1.0f;
> > +
> > +     for (i = 0; i < nitems; i++)
> > +     {
> > +             bool            found = false;
> > +
> > +             /* Compare with the previous value to guarantee uniqueness */
> > +             if (i > 0)
> > +             {
> > +                     if (!element_compare(&array_data[i - 1], &array_data[i]))
> > +                             continue;
> > +             }
> > +
> > +             /*
> > +              * Advance through mcelem until reaching an entry greater than
> > +              * or equal to the constant's element, updating rest and mult
> > +              * for the entries we pass. If an equal entry is found, fill in
> > +              * the corresponding elem_selec.
> > +              */
> > +             while (mcelem_index < nmcelem)
> > +             {
> > +                     int                     cmp = element_compare(&mcelem[mcelem_index], &array_data[i]);
> > +
> > +                     if (cmp < 0)
> > +                     {
> > +                             mult *= (1.0f - numbers[mcelem_index]);
> > +                             rest -= numbers[mcelem_index];
>
> ... But here, you're subtracting a frequency from an array length?
>

Yes, because the average distinct element count is the sum of the frequencies of all elements. Subtracting the mcelem frequencies from avg_length leaves the sum of the frequencies of the non-mcelem elements.
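Spelled out (illustrative notation): by linearity of expectation,

    avg_length = E[distinct elements per array]
               = sum over all e of P(e appears)
               = sum over all e of freq(e)

so subtracting every mcelem's frequency leaves

    rest = avg_length - sum over e in mcelem of freq(e)
         = sum over e not in mcelem of freq(e)

which is why a frequency can legitimately be subtracted from a value initialized to an average length.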

------
With best regards,
Alexander Korotkov.