From: Alexander Korotkov
Subject: Re: Collect frequency statistics for arrays
Msg-id: CAPpHfdvw8oi7f7_AmsNvopQf_2MGhFft0WV_Q+PrQmB0qswOeg@mail.gmail.com
In response to: Re: Collect frequency statistics for arrays (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers
Hi!

Thanks for your fixes to the patch. They look correct to me. I've made some further fixes of my own. Proofs of some of the concepts are still needed; I'm going to provide them in a few days.

On Thu, Jan 12, 2012 at 3:06 PM, Noah Misch <noah@leadboat.com> wrote:
> > I'm not sure about a shared lossy counting module, because the shared
> > code would be relatively small. The part of the compute_array_stats
> > function which takes care of array decompression, distinct occurrence
> > calculation, the distinct element count histogram, packing statistics
> > slots etc. is much larger than the lossy counting algorithm itself.
> > Maybe there are other opinions in the community?

> True; it would probably increase total lines of code.  The benefit, if any,
> lies in separation of concerns; the business of implementing this algorithm is
> quite different from the other roles of these typanalyze functions.  I won't
> insist that you try it, though.

I'd prefer to try it as a separate patch.
 
> > > > +             /*
> > > > +              * The probability of no occurrence of the events which form the
> > > > +              * "rest" probability has a limit of exp(-rest) when the number of
> > > > +              * events goes to infinity. Another simplification is to replace
> > > > +              * those events with one event of probability (1 - exp(-rest)).
> > > > +              */
> > > > +             rest = 1.0f - exp(-rest);
> > >
> > > What is the name of the underlying concept in probability theory?
> >
> > The closest concept to the calculated distribution is the multinomial
> > distribution. But it's not exactly the same, because the multinomial
> > distribution gives the probability of a particular count of each event
> > occurrence, not the probability of a summary occurrence. Actually, the
> > distribution is calculated just from the assumption of event independence.
> > The closest concept for the "rest" probability is approximation by an
> > exponential distribution. It's quite a rough approximation, but I can't
> > invent something better with low calculation complexity.

> Do you have a URL of a tutorial or paper that explains the method in more
> detail?  If, rather, this is a novel synthesis, could you write a proof to
> include in the comments?

Unfortunately I don't have a relevant paper for it. I'll try to find one; otherwise I'll try to write a proof.
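
For what it's worth, the limit the patch comment appeals to is elementary:
ln(prod(1 - p_i)) = sum(ln(1 - p_i)) ~= -sum(p_i) when every p_i is small, so
P(no occurrence) ~= exp(-rest) with rest = sum(p_i). A minimal standalone C
check of the approximation (an illustration only, not patch code):

    #include <math.h>
    #include <stdio.h>

    /*
     * Compare the exact probability that none of n independent rare events
     * occurs, prod(1 - p[i]), against the approximation exp(-sum(p[i])).
     * The smaller the individual probabilities, the closer the two get,
     * which is the limit the patch comment relies on.
     */
    int
    main(void)
    {
        double  p[] = {0.01, 0.02, 0.005, 0.03};
        int     n = sizeof(p) / sizeof(p[0]);
        double  exact = 1.0;
        double  rest = 0.0;
        int     i;

        for (i = 0; i < n; i++)
        {
            exact *= 1.0 - p[i];
            rest += p[i];
        }
        printf("exact P(no event) = %.6f\n", exact);
        printf("exp(-rest)        = %.6f\n", exp(-rest));
        printf("1 - exp(-rest)    = %.6f\n", 1.0 - exp(-rest));
        return 0;
    }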
 
> > > > + /*
> > > > +  * Array selectivity estimation based on most common elements statistics
> > > > +  * for the "column <@ const" case. The assumption that element occurrences
> > > > +  * are independent implies a certain distribution of array lengths.
> > > > +  * Typically the real distribution of lengths differs from it
> > > > +  * significantly. For example, even if we have a set of arrays with 1
> > > > +  * integer element in range [0;10] each, element occurrences are not
> > > > +  * independent. Because in the case of independence we
> > >
> > > Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> > > '{6,19,4}' cannot appear?
> >
> > I refer to a column where only one element exists, i.e. the only possible
> > values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}',
> > '{9}', '{10}'. That is a corner case. But a similar situation occurs when,
> > for example, we have a distribution of distinct element counts between 1
> > and 3. It significantly differs from the distribution under independent
> > occurrence.

> Oh, I think I see now.  If each element 1..10 had frequency 0.1 independently,
> column values would have exactly one distinct element just 39% of the time?

Yes, that's right.
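
For the record, the 39% is just the binomial term: with each of the 10 elements
present independently with probability 0.1,

    P(exactly one distinct element) = C(10,1) * 0.1 * 0.9^9 ~= 0.387

so roughly 39% of rows would be single-element arrays under the independence
model, instead of the 100% the column actually has.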
 
> If probability theory has a prototypical problem resembling this, it would be
> nice to include a URL to a thorough discussion thereof.  I could not think of
> the search terms to find one, though.

Actually, using both the distinct element count histogram and the element
occurrence frequencies produces some probability distribution (one more complex
than just independent element occurrence). If the real distribution is close to
this distribution, then the estimate is accurate. I haven't met such
distributions in papers; actually, I just invented it in my attempts to do an
accurate "column <@ const" estimation at least in simple cases. I'll try to
search for similar things in papers, but I doubt I'll succeed. Otherwise I'll
try to write a more detailed proof.
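
Roughly, the idea is to combine the two statistics along these lines (an
outline of the model, not the exact patch code):

    P(column <@ const) =
        sum over k of  P(distinct count = k)
                     * P(k distinct elements all fall within const)

where the first factor is read from the distinct element count histogram and
the second is computed from the element frequencies under the independence
assumption.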


> > *** /dev/null
> > --- b/src/backend/utils/adt/array_selfuncs.c
> >
> > + Selectivity
> > + calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause,
> > +                                     Oid operator)
> > + {
> > +     Oid                     elemtype;
> > +     Selectivity selec;
> > +     TypeCacheEntry *typentry;
> > +     Datum      *hist;
> > +     int                     nhist;
> > +     FunctionCallInfoData cmpfunc;
> > +
> > +     elemtype = get_base_element_type(vardata->vartype);
> > +
> > +
> > +     /* Get default comparison function */
> > +     typentry = lookup_type_cache(elemtype,
> > +                TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR);
> > +
> > +     /* Handle only "=" operator. Return default selectivity in other cases. */
> > +     if (operator != typentry->eq_opr)
> > +             return (Selectivity) 0.5;

> Punting on other operators this way creates a plan quality regression for
> operations like "const < ANY (column)".  Please do it some way that falls
> back on the somewhat-better existing scalararraysel() treatment for this.

I've made calc_scalararraysel return -1 in this case, and also when there is no
comparison function. scalararraysel continues its work when calc_scalararraysel
returns a negative value.
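
To illustrate the convention with stub names (a standalone sketch, not the
patch code itself): a negative selectivity acts as a "no estimate" sentinel,
and the caller keeps its existing path:

    #include <stdio.h>

    /*
     * Stand-in for the patch's calc_scalararraysel(): returns a negative
     * value when element statistics are not applicable, so that the caller
     * can fall back to the generic scalararraysel() estimate.
     */
    static double
    element_stats_estimate(int operator_is_eq)
    {
        if (!operator_is_eq)
            return -1.0;        /* no estimate available */
        return 0.125;           /* pretend MCE-based estimate */
    }

    int
    main(void)
    {
        double  selec = element_stats_estimate(0);

        if (selec < 0)
            selec = 0.5;        /* existing generic treatment */
        printf("selectivity = %g\n", selec);
        return 0;
    }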
 
> > + /*
> > +  * Calculate the probabilities of the first n distinct element counts from
> > +  * the histogram. We assume that any interval between histogram values a
> > +  * and b gives probability 1 / ((b - a + 1) * (nhist - 1)) to the values
> > +  * between a and b, and half of that to a and b themselves. Returns the
> > +  * total probability that the distinct element count is less than or equal
> > +  * to n.
> > +  */
> > + static float
> > + calc_hist(Datum *hist, int nhist, float *hist_part, int n)

> To test this function, I ran the following test case:
>
> set default_statistics_target = 4;
> create table t3 as select array(select * from generate_series(1, v)) as arr
> from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100);
> analyze t3; -- length_histogram_bounds = {2,2,5,5}
> select * from t3 where arr <@ array[6,7,8,9,10,11];
>
> Using gdb to observe calc_hist()'s result during the last command:
>
> (gdb) p calc_hist(hist, nhist, hist_part, unique_nitems)
> $23 = 0.666666687
> (gdb) x/6f hist_part
> 0xcd4bc8:       0       0       0.333333343     0
> 0xcd4bd8:       0       0.333333343
>
> I expected an equal, nonzero probability in hist_part[3] and hist_part[4] and
> a total probability of 1.0.
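
Under one reading of the histogram rule that makes the masses sum to 1 (my
interpretation, not the patch's code: each of the nhist - 1 intervals carries
frac = 1 / (nhist - 1); within [a, b] with b > a, interior counts get
frac / (b - a) and the two endpoints half of that; a degenerate [a, a] puts all
of frac on a), hist = {2,2,5,5} gives P(2) = P(5) = 7/18 and P(3) = P(4) = 1/9:
equal nonzero values at 3 and 4, and a total of 1.0. A standalone check:

    #include <stdio.h>

    /*
     * Distribute the length histogram {2,2,5,5} into per-count
     * probabilities: each of the nhist - 1 intervals carries mass
     * frac = 1 / (nhist - 1); within [a, b] with b > a, interior values
     * receive frac / (b - a), the endpoints half of that; a degenerate
     * interval [a, a] gives all of frac to a.
     */
    int
    main(void)
    {
        int     hist[] = {2, 2, 5, 5};
        int     nhist = 4;
        double  part[6] = {0};
        double  frac = 1.0 / (nhist - 1);
        int     i, v;

        for (i = 0; i < nhist - 1; i++)
        {
            int     a = hist[i];
            int     b = hist[i + 1];

            if (a == b)
                part[a] += frac;
            else
            {
                part[a] += frac / (2 * (b - a));
                part[b] += frac / (2 * (b - a));
                for (v = a + 1; v < b; v++)
                    part[v] += frac / (b - a);
            }
        }
        for (v = 0; v <= 5; v++)
            printf("part[%d] = %.6f\n", v, part[v]);
        return 0;
    }

This prints 0.388889 for counts 2 and 5 and 0.111111 for counts 3 and 4.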

> > + {
> > +     int                     k,
> > +                             i = 0,
> > +                             prev_interval = 0,
> > +                             next_interval = 0;
> > +     float           frac,
> > +                             total = 0.0f;
> > +
> > +     /*
> > +      * frac is the probability contribution of each interval between
> > +      * histogram values. We have nhist - 1 intervals, so each one
> > +      * contributes 1 / (nhist - 1).
> > +      */
> > +     frac = 1.0f / ((float) (nhist - 1));
> > +     for (k = 0; k <= n; k++)
> > +     {
> > +             int                     count = 0;
> > +
> > +             /* Count occurrences of k distinct element counts in histogram. */
> > +             while (i < nhist && DatumGetInt32(hist[i]) <= k)
> > +             {
> > +                     if (DatumGetInt32(hist[i]) == k)
> > +                             count++;
> > +                     i++;
> > +             }
> > +
> > +             if (count > 0)
> > +             {
> > +                     float           val;
> > +
> > +                     /* Find the length between the current histogram value and the next one */
> > +                     if (i < nhist)
> > +                             next_interval = DatumGetInt32(hist[i + 1]) -
> Doesn't this read past the array end when i == nhist - 1?

It was a bug; it also caused the wrong histogram calculation you noted above. Fixed.
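
For clarity, the overrun comes from testing i < nhist while reading
hist[i + 1]; a guard along these lines avoids it (a sketch in the context of
the quoted hunk, assuming the truncated line subtracts the current histogram
value):

    /* Only look ahead while hist[i + 1] exists, i.e. i + 1 < nhist. */
    if (i < nhist - 1)
        next_interval = DatumGetInt32(hist[i + 1]) - DatumGetInt32(hist[i]);
    else
        next_interval = 0;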
 
> > +     stats->extra_data = extra_data->std_extra_data;
> > +     old_context = CurrentMemoryContext;
> > +     extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows);
> > +     MemoryContextSwitchTo(old_context);

> Is the callee known to change CurrentMemoryContext and not restore it?
> Offhand, I'm not seeing how it could do so.

Right. Saving the memory context is not needed. Removed.

> > *** a/src/include/catalog/pg_type.h
> > --- b/src/include/catalog/pg_type.h

> This now updates all array types except record[].  I don't know offhand how
> to even make a non-empty value of type record[], let alone get it into a
> context where ANALYZE would see it.  However, is there a particular reason to
> make that one different?

Oh, I didn't manage to update all the array types even in two tries :) Fixed.

------
With best regards,
Alexander Korotkov.