Thread: Collect frequency statistics for arrays

Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
Hi!

Here is an updated version of the patch. General list of changes since the reviewed version:
1) A distinct slot is used for the length histogram.
2) Standard statistics are also collected for arrays.
3) Most common values and most common elements are mapped to distinct columns of the pg_stats view, because both of them are now calculated for arrays.
4) The description of the lossy counting algorithm was copied from compute_tsvector_stats with the corresponding adjustments.
5) Comments about the assumptions were added to the estimation functions.

Accuracy testing

The following files are attached:
datasets.sql - SQL script which generates the test datasets
arrayanalyze.php - PHP script which performs the accuracy testing
results.sql - dump of the table with the test results

As the test results show, the estimates are quite accurate in most of the test cases. When the length of the constant array exceeds 30, the estimate for "column <@ const" is very inaccurate for the array_test3 table. That is because use of the length histogram is skipped there to avoid high CPU usage during estimation (see array_sel.c:888).

------
With best regards,
Alexander Korotkov.
Attachment

Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
Rebased with head.

------
With best regards,
Alexander Korotkov.
Attachment

Re: Collect frequency statistics for arrays

From
Nathan Boley
Date:
> Rebased with head.

FYI, I've added myself as the reviewer for the current commitfest.

Best,
Nathan Boley


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
Hi!

On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley <npboley@gmail.com> wrote:
> FYI, I've added myself as the reviewer for the current commitfest.
How is the review going now?
 
------
With best regards,
Alexander Korotkov.

Re: Collect frequency statistics for arrays

From
Noah Misch
Date:
On Tue, Dec 20, 2011 at 04:37:37PM +0400, Alexander Korotkov wrote:
> On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley <npboley@gmail.com> wrote:
> 
> > FYI, I've added myself as the reviewer for the current commitfest.
> >
> How is going review now?

I will examine this patch within the week.


Re: Collect frequency statistics for arrays

From
Noah Misch
Date:
On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote:
> Rebased with head.

I took a look at this patch.  The basic approach seems sensible.  For array
columns, ANALYZE will now determine the elements most often present in the
column's arrays.  It will also calculate a histogram from the counts of
distinct elements in each array.  That is to say, both newly-collected
statistics treat {1,1,2,3,1} and {3,1,2} identically.  New selectivity
machinery uses the new statistics to optimize these operations:

column @> const
column <@ const
column && const
const = ANY (column)
const = ALL (column)

Concrete estimates look mostly sane, with a few corner cases I note below.


We have many implementation details to nail down.

With the patch applied, I get the attached regression.diffs from "make check".
The palloc errors indicate bugs, but rules.out just needs a refresh.

During ANALYZE, we'll now detoast all array column values regardless of size,
just as we already do for tsvector columns.  That may be reasonable enough for
tsvector table/index columns, whose very presence is a hint that the user has
planned to use the value for searching.  Since arrays make no such
implication, should we skip large arrays to constrain TOAST I/O?  Say, skip
arrays larger than 100 KiB or 1 MiB?
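
If we go that route, the guard could be as simple as this sketch (the cutoff and the counter name are only illustrative; toast_raw_datum_size() can report the full size without fetching the value):

if (toast_raw_datum_size(value) > 100 * 1024)
{
    /* skip huge arrays to bound TOAST I/O during ANALYZE */
    toowide_cnt++;
    continue;
}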

I find distressing the thought of having two copies of the lossy sampling
code, each implementing the algorithm with different variable names and levels
of generality.  We might someday extend this to hstore, and then we'd have yet
another copy.  Tom commented[1] that ts_typanalyze() and array_typanalyze()
should remain distinct, and I agree.  However, they could call a shared
counting module.  Is that practical?  Possible API:

typedef struct LossyCountCtl;
LossyCountCtl *LossyCountStart(float s,
                               float epsilon,
                               int2 typlen,
                               bool typbyval,
                               Oid eqfunc); /* + hash func, a few others */
void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
TrackItem **LossyCountGetAll(LossyCountCtl *ctl);

[1] http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us
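
For illustration, a caller might then look roughly like this (all of the above names are proposals rather than existing functions, and s, epsilon, elements[] and nelements below are placeholders):

LossyCountCtl *ctl;
TrackItem  **items;
int          i;

ctl = LossyCountStart(s, epsilon,
                      typentry->typlen, typentry->typbyval,
                      typentry->eq_opr);
for (i = 0; i < nelements; i++)
    LossyCountAdd(ctl, elements[i]);    /* once per distinct element per array */
items = LossyCountGetAll(ctl);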

> *** a/doc/src/sgml/catalogs.sgml
> --- b/doc/src/sgml/catalogs.sgml
> ***************
> *** 8253,8260 ****
>         <entry>
>          A list of the most common values in the column. (Null if
>          no values seem to be more common than any others.)
> !        For some data types such as <type>tsvector</>, this is a list of
> !        the most common element values rather than values of the type itself.
>         </entry>
>        </row>
>
> --- 8253,8261 ----
>         <entry>
>          A list of the most common values in the column. (Null if
>          no values seem to be more common than any others.)
> !        For some data types such as <type>arrays</> and <type>tsvector</>,
> !        this is a list of the most common element values rather than values of
> !        the type itself.
>         </entry>
>        </row>
>
> ***************
> *** 8266,8274 ****
>          A list of the frequencies of the most common values or elements,
>          i.e., number of occurrences of each divided by total number of rows.
>          (Null when <structfield>most_common_vals</structfield> is.)
> !        For some data types such as <type>tsvector</>, it can also store some
> !        additional information, making it longer than the
> !        <structfield>most_common_vals</> array.
>         </entry>
>        </row>
>
> --- 8267,8275 ----
>          A list of the frequencies of the most common values or elements,
>          i.e., number of occurrences of each divided by total number of rows.
>          (Null when <structfield>most_common_vals</structfield> is.)
> !        For some data types such as <type>arrays</> and <type>tsvector</>,
> !        it can also store some additional information, making it longer than
> !        the <structfield>most_common_vals</> array.
>         </entry>
>        </row>

We're falsifying the above by splitting out that case into new columns
most_common_elems and most_common_elem_freqs.

>
> ***************
> *** 8284,8289 ****
> --- 8285,8291 ----
>          does not have a <literal><</> operator or if the
>          <structfield>most_common_vals</> list accounts for the entire
>          population.)
> +        For <type>arrays</>, it holds histogram bounds of array lengths.
>         </entry>
>        </row>

Likewise: that's now in the new column length_histogram_bounds.

We need documentation for the new pg_stats columns.  Also, in particular,
let's document the special entries at the end of most_common_freqs.

> *** a/src/backend/catalog/system_views.sql
> --- b/src/backend/catalog/system_views.sql
> ***************
> *** 117,145 **** CREATE VIEW pg_stats AS
>           stawidth AS avg_width,
>           stadistinct AS n_distinct,
>           CASE
> !             WHEN stakind1 IN (1, 4) THEN stavalues1
> !             WHEN stakind2 IN (1, 4) THEN stavalues2
> !             WHEN stakind3 IN (1, 4) THEN stavalues3
> !             WHEN stakind4 IN (1, 4) THEN stavalues4
>           END AS most_common_vals,
>           CASE
> !             WHEN stakind1 IN (1, 4) THEN stanumbers1
> !             WHEN stakind2 IN (1, 4) THEN stanumbers2
> !             WHEN stakind3 IN (1, 4) THEN stanumbers3
> !             WHEN stakind4 IN (1, 4) THEN stanumbers4
>           END AS most_common_freqs,
>           CASE
>               WHEN stakind1 = 2 THEN stavalues1
>               WHEN stakind2 = 2 THEN stavalues2
>               WHEN stakind3 = 2 THEN stavalues3
>               WHEN stakind4 = 2 THEN stavalues4
>           END AS histogram_bounds,
>           CASE
>               WHEN stakind1 = 3 THEN stanumbers1[1]
>               WHEN stakind2 = 3 THEN stanumbers2[1]
>               WHEN stakind3 = 3 THEN stanumbers3[1]
>               WHEN stakind4 = 3 THEN stanumbers4[1]
> !         END AS correlation
>       FROM pg_statistic s JOIN pg_class c ON (c.oid = s.starelid)
>            JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum)
>            LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace)
> --- 117,170 ----
>           stawidth AS avg_width,
>           stadistinct AS n_distinct,
>           CASE
> !             WHEN stakind1 = 1 THEN stavalues1
> !             WHEN stakind2 = 1 THEN stavalues2
> !             WHEN stakind3 = 1 THEN stavalues3
> !             WHEN stakind4 = 1 THEN stavalues4
> !             WHEN stakind5 = 1 THEN stavalues5
>           END AS most_common_vals,
>           CASE
> !             WHEN stakind1 = 1 THEN stanumbers1
> !             WHEN stakind2 = 1 THEN stanumbers2
> !             WHEN stakind3 = 1 THEN stanumbers3
> !             WHEN stakind4 = 1 THEN stanumbers4
> !             WHEN stakind5 = 1 THEN stanumbers5
>           END AS most_common_freqs,
>           CASE
>               WHEN stakind1 = 2 THEN stavalues1
>               WHEN stakind2 = 2 THEN stavalues2
>               WHEN stakind3 = 2 THEN stavalues3
>               WHEN stakind4 = 2 THEN stavalues4
> +             WHEN stakind5 = 2 THEN stavalues5
>           END AS histogram_bounds,
>           CASE
>               WHEN stakind1 = 3 THEN stanumbers1[1]
>               WHEN stakind2 = 3 THEN stanumbers2[1]
>               WHEN stakind3 = 3 THEN stanumbers3[1]
>               WHEN stakind4 = 3 THEN stanumbers4[1]
> !             WHEN stakind5 = 3 THEN stanumbers5[1]
> !         END AS correlation,
> !         CASE
> !             WHEN stakind1 = 4 THEN stavalues1
> !             WHEN stakind2 = 4 THEN stavalues2
> !             WHEN stakind3 = 4 THEN stavalues3
> !             WHEN stakind4 = 4 THEN stavalues4
> !             WHEN stakind5 = 4 THEN stavalues5
> !         END AS most_common_elems,
> !         CASE
> !             WHEN stakind1 = 4 THEN stanumbers1
> !             WHEN stakind2 = 4 THEN stanumbers2
> !             WHEN stakind3 = 4 THEN stanumbers3
> !             WHEN stakind4 = 4 THEN stanumbers4
> !             WHEN stakind5 = 4 THEN stanumbers5
> !         END AS most_common_elem_freqs,

I think this is an improvement, but some code out there may rely on the
ability to get stakind = 4 data from the most_common_vals column.  We'll need
to mention this in the release notes as an incompatibility.

> !         CASE
> !             WHEN stakind1 = 5 THEN stavalues1
> !             WHEN stakind2 = 5 THEN stavalues2
> !             WHEN stakind3 = 5 THEN stavalues3
> !             WHEN stakind4 = 5 THEN stavalues4
> !             WHEN stakind5 = 5 THEN stavalues5
> !         END AS length_histogram_bounds
>       FROM pg_statistic s JOIN pg_class c ON (c.oid = s.starelid)
>            JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum)
>            LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace)

> *** a/src/backend/commands/typecmds.c
> --- b/src/backend/commands/typecmds.c
> ***************
> *** 610,616 **** DefineType(List *names, List *parameters)
>                  F_ARRAY_SEND,    /* send procedure */
>                  typmodinOid,        /* typmodin procedure */
>                  typmodoutOid,    /* typmodout procedure */
> !                InvalidOid,        /* analyze procedure - default */
>                  typoid,            /* element type ID */
>                  true,            /* yes this is an array type */
>                  InvalidOid,        /* no further array type */
> --- 610,616 ----
>                  F_ARRAY_SEND,    /* send procedure */
>                  typmodinOid,        /* typmodin procedure */
>                  typmodoutOid,    /* typmodout procedure */
> !                ArrayTypanalyzeOid,        /* special analyze procedure for arrays */
>                  typoid,            /* element type ID */
>                  true,            /* yes this is an array type */
>                  InvalidOid,        /* no further array type */

The recently-added function DefineRange() needs the same change.

> *** /dev/null
> --- b/src/backend/utils/adt/array_sel.c

"array_selfuncs.c" would better match our informal convention.

> ***************
> *** 0 ****
> --- 1,948 ----
> + /*-------------------------------------------------------------------------
> +  *
> +  * array_sel.c
> +  *      Functions for selectivity estimation of array operators.
> +  *
> +  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
> +  *
> +  *
> +  * IDENTIFICATION
> +  *      src/backend/tsearch/array_sel.c

Fix file location.

> +  *
> +  *-------------------------------------------------------------------------
> +  */
> +
> + #include "postgres.h"
> + #include "access/hash.h"
> + #include "catalog/pg_operator.h"
> + #include "commands/vacuum.h"
> + #include "utils/builtins.h"
> + #include "utils/typcache.h"
> + #include "utils/array.h"
> + #include "catalog/pg_am.h"
> + #include "catalog/pg_collation.h"
> + #include "commands/defrem.h"
> + #include "utils/lsyscache.h"
> + #include "utils/selfuncs.h"

Sort the includes alphabetically, except for postgres.h coming first.

> +
> + /* Default selectivity constant */
> + #define DEFAULT_CONT_SEL 0.005
> +
> + /* Macro for selectivity estimation to be used if we have no statistics */
> + #define array_selec_no_stats(array,nitems,op) \
> +     mcelem_array_selec(array, nitems, typentry, NULL, 0, NULL, 0, NULL, 0, op)
> +
> + Datum        arraysel(PG_FUNCTION_ARGS);

extern prototypes go in header files, even when it's not strictly necessary.

> +
> + static Selectivity calc_arraysel(VariableStatData *vardata, Datum constval,
> +               Oid operator);
> + static Selectivity mcelem_array_selec(ArrayType *array, int nitems, TypeCacheEntry *typentry,
> +                    Datum *mcelem, int nmcelem, float4 *numbers, int nnumbers,
> +                    Datum *hist, int nhist, Oid operator);
> + static int    element_compare(const void *key1, const void *key2);
> + bool        find_next_mcelem(Datum *mcelem, int nmcelem, Datum value, int *index);
> + static Selectivity mcelem_array_contain_overlap_selec(Datum *mcelem,
> +   int nmcelem, float4 *numbers, Datum *array_data, int nitems, Oid operator);
> + static float calc_hist(Datum *hist, int nhist, float *hist_part, int n);
> + static Selectivity mcelem_array_contained_selec(Datum *mcelem, int nmcelem,
> +       float4 *numbers, Datum *array_data, int nitems, Datum *hist, int nhist,
> +                              Oid operator);
> + static float *calc_distr(float *p, int n, int m, float rest);
> +
> + /* Compare function of element data type */
> + static FunctionCallInfoData cmpfunc;
> +
> + /*
> +  * selectivity for "const = ANY(column)" and "const = ALL(column)"
> +  */
> + Selectivity
> + calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause)

You have scalararraysel() calling this function for any operator (consider
"const < ANY(column)"), but it only handles a single operator: the "=" of the
default btree opclass used at ANALYZE time.  We could start by just returning
a constant selectivity for other operators, but we may be able to do better.
If the actual operator is the equality operator we used at ANALYZE time
(staop), use binary search on the mcelem array.  Otherwise, use linear search,
applying the operator to each MCE.  (If the benefits justify the trouble, you
could also use this strategy to support types with no default btree opclass.)
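
A rough sketch of that dispatch (variable names are illustrative; find_next_mcelem() is the patch's binary-search helper):

if (operator == staop)
{
    /* ANALYZE-time equality operator: binary search the sorted MCE array */
    if (find_next_mcelem(mcelem, nmcelem, constval, &idx))
        elem_selec = numbers[idx];
    else
        elem_selec = default_elem_freq;     /* not among the MCEs */
}
else
{
    /* arbitrary operator: apply it to each MCE and sum matching frequencies */
    elem_selec = 0.0;
    for (i = 0; i < nmcelem; i++)
        if (DatumGetBool(FunctionCall2Coll(&oprproc, collation,
                                           constval, mcelem[i])))
            elem_selec += numbers[i];
}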

> + {
> +     Oid            elemtype;
> +     Selectivity selec;
> +     TypeCacheEntry *typentry;
> +     Datum       *hist;
> +     int            nhist;
> +
> +     elemtype = get_base_element_type(vardata->vartype);
> +
> +     /* Get default comparison function */
> +     typentry = lookup_type_cache(elemtype,
> +                               TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO);
> +     InitFunctionCallInfoData(cmpfunc, &typentry->cmp_proc_finfo, 2,
> +                              DEFAULT_COLLATION_OID, NULL, NULL);

The default btree operator class that existed at ANALYZE time may no longer
exist.  If we don't find a cmp function here, punt to avoid a crash.

> + /*
> +  * arraysel -- restriction selectivity for "column @> const", "column && const"
> +  * and "column <@ const"
> +  */
> + Datum
> + arraysel(PG_FUNCTION_ARGS)
> + {
> +     PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
> +
> +     Oid            operator = PG_GETARG_OID(1);
> +     List       *args = (List *) PG_GETARG_POINTER(2);
> +     int            varRelid = PG_GETARG_INT32(3);
> +     VariableStatData vardata;
> +     Node       *other;
> +     bool        varonleft;
> +     Selectivity selec;
> +     Oid            element_typeid;
> +
> +     /*
> +      * If expression is not variable = something or something = variable, then
> +      * punt and return a default estimate.
> +      */

The operator is never "=".

> +     if (!get_restriction_variable(root, args, varRelid,
> +                                   &vardata, &other, &varonleft))
> +         PG_RETURN_FLOAT8(DEFAULT_CONT_SEL);
> +
> +     /*
> +      * Can't do anything useful if the something is not a constant, either.
> +      */
> +     if (!IsA(other, Const))
> +     {
> +         ReleaseVariableStats(vardata);
> +         PG_RETURN_FLOAT8(DEFAULT_CONT_SEL);
> +     }

Previously, we defaulted to 0.005 for operator "&&" (via areasel()) and 0.001
for operators "@>" and "<@" (via contsel()).  Surely some difference between
those cases remains appropriate.
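
When punting, something along these lines would preserve that distinction (constants per the historical behavior above):

if (operator == OID_ARRAY_OVERLAP_OP)
    PG_RETURN_FLOAT8(0.005);    /* what areasel() returned */
else
    PG_RETURN_FLOAT8(0.001);    /* what contsel() returned for @> and <@ */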

> +         if (res == 0)
> +         {
> +             *index = i;
> +             return true;
> +         }
> +         else if (res < 0)
> +         {
> +             l = i + 1;
> +         }
> +         else
> +         {
> +             r = i - 1;
> +         }

Throughout this patch, omit braces around single statements.

> +     }
> +     *index = l;
> +     return false;
> + }
> +
> + /*
> +  * Array selectivity estimation based on most common elements statistics.
> +  */
> + static Selectivity
> + mcelem_array_selec(ArrayType *array, int nitems, TypeCacheEntry *typentry,
> +       Datum *mcelem, int nmcelem, float4 *numbers, int nnumbers, Datum *hist,
> +                    int nhist, Oid operator)
> + {

> +     /* "column @> const" and "column && const" cases */
> +     if (operator == OID_ARRAY_CONTAIN_OP || operator == OID_ARRAY_OVERLAP_OP)
> +         return mcelem_array_contain_overlap_selec(mcelem, nmcelem, numbers,
> +                                        array_data, nonnull_nitems, operator);
> +
> +     /* "column <@ const" case */
> +     if (operator == OID_ARRAY_CONTAINED_OP)
> +         return mcelem_array_contained_selec(mcelem, nmcelem, numbers,
> +                           array_data, nonnull_nitems, hist, nhist, operator);
> +     return DEFAULT_CONT_SEL;

Returning a fixed selectivity when this gets attached to an unexpected
operator seems counterproductive.  Let's just elog(ERROR) in that case.

> + /*
> +  * Array selectivity estimation based on most common elements statistics for
> +  * "column @> const" and "column && const" cases. This estimation assumes
> +  * element occurences to be independent.
> +  */
> + static Selectivity
> + mcelem_array_contain_overlap_selec(Datum *mcelem, int nmcelem,
> +                 float4 *numbers, Datum *array_data, int nitems, Oid operator)

We could take some advantage of the unique element count histogram for "@>".
Any column value with fewer distinct elements than the constant array cannot
match.  We perhaps can't use both that fact and element frequency stats at the
same time, but we could use the lesser of the two probabilities.

For operator "&&" or operator "@>" with a nonempty constant array, no rows
having empty arrays will match.  Excepting the special case of "@>" with an
empty constant array, we can multiply the MCE-derived selectivity by the
fraction, based on the histogram, of nonempty arrays in the column.  (For
"<@", rows having empty arrays always match.)

I don't think it's mandatory that the initial commit implement the above, but
it's something to mention in the comments as a future direction.

> + /*
> +  * Let be n independent events with probabilities p. This function calculates
> +  * probabilities of exact k of events occurence for k in [0;m].
> +  * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes
> +  * probability that exact j of first i events occurs. Obviously M[0,0] = 1.
> +  * Each next event increase total number of occured events if it occurs and
> +  * leave last value of that number if it doesn't occur. So, by the law of
> +  * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
> +  * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
> +  * Also there could be some events with low probabilities. Their summary
> +  * probability passed in the rest parameter.
> +  */
> + static float *
> + calc_distr(float *p, int n, int m, float rest)
> + {

> +     /* Take care about events with low probabilities. */
> +     if (rest > 0.0f)
> +     {
> +         /*
> +          * The probability of no occurence of events which forms "rest"
> +          * probability have a limit of exp(-rest) when number of events fo to
> +          * infinity. Another simplification is to replace that events with one
> +          * event with (1 - exp(-rest)) probability.
> +          */
> +         rest = 1.0f - exp(-rest);

What is the name of the underlying concept in probability theory?
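
Setting the exp(-rest) simplification aside, the recurrence quoted at the top of this function is an ordinary dynamic program; for reference, a standalone restatement (plain C plus <stdlib.h>, illustrative only):

/* result[j] = probability that exactly j of the n independent events occur */
static float *
exact_occurrence_distr(const float *p, int n, int m)
{
    float  *row = (float *) calloc(m + 1, sizeof(float));
    int     i, j;

    row[0] = 1.0f;                      /* M[0,0] = 1 */
    for (i = 1; i <= n; i++)
        for (j = m; j >= 0; j--)        /* right to left: row i-1 is reused in place */
            row[j] = row[j] * (1.0f - p[i - 1]) +
                     (j > 0 ? row[j - 1] * p[i - 1] : 0.0f);
    return row;
}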

> + /*
> +  * Array selectivity estimation based on most common elements statistics for
> +  * "column <@ const" case. Assumption that element occurences are independent
> +  * means certain distribution of array lengths. Typically real distribution
> +  * of lengths is significantly different from it. For example, if even we
> +  * have set of arrays with 1 integer element in range [0;10] each, element
> +  * occurences are not independent. Because in the case of independence we

Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
'{6,19,4}' cannot appear?

> +  * have probabilities of length of 0, 1, 2 etc. In the "column @> const"
> +  * and "column && const" cases we usually have "const" with low summary
> +  * frequency of elements (otherwise we have selectivity close to 0 or 1
> +  * correspondingly). That's why effect of dependence related to lengths
> +  * distribution id negligible there. In the "column <@ const" case summary
> +  * frequency of elements is high (otherwise we have selectivity close to 0).

What does the term "summary frequency" denote?

> +  * That's why we should do correction due to array lengths distribution.
> +  */
> + static Selectivity
> + mcelem_array_contained_selec(Datum *mcelem, int nmcelem, float4 *numbers,
> +          Datum *array_data, int nitems, Datum *hist, int nhist, Oid operator)

Break up the parameter list to avoid pgindent reverse-indenting the line.


When I tried a constant array with duplicate elements, I saw an inappropriate
report of 100% selectivity:

[local] test=# create table t1 as select array[n % 2, n] as arr from generate_series(1,100000) t(n);
SELECT 100000
[local] test=# analyze t1;
WARNING:  problem in alloc set Analyze: detected write past chunk end in block 0x7f189cbbf010, chunk 0x7f189cc1d0d8
ANALYZE
[local] test=# explain select * from t1 where arr <@ '{0,45}';
                       QUERY PLAN
--------------------------------------------------------
 Seq Scan on t1  (cost=0.00..1986.00 rows=186 width=29)
   Filter: (arr <@ '{0,45}'::integer[])
(2 rows)

[local] test=# explain select * from t1 where arr <@ '{0,45,45}';
                        QUERY PLAN
-----------------------------------------------------------
 Seq Scan on t1  (cost=0.00..1986.00 rows=100000 width=29)
   Filter: (arr <@ '{0,45,45}'::integer[])
(2 rows)

By contrast, the estimate in the non-duplicate case looks sane considering
these estimates for the individual elements:

[local] test=# explain select * from t1 where 0 = any (arr);
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on t1  (cost=0.00..2986.00 rows=49967 width=29)
   Filter: (0 = ANY (arr))
(2 rows)

[local] test=# explain select * from t1 where 45 = any (arr);
                       QUERY PLAN
--------------------------------------------------------
 Seq Scan on t1  (cost=0.00..2986.00 rows=500 width=29)
   Filter: (45 = ANY (arr))
(2 rows)


Incidentally, "const = ALL (column)" should be equivalent to "column <@
array[const]".  (Assuming the "=" operator chosen in the first statement is
the "=" operator of the array type's default btree opclass).  However, I get
significantly different estimates, with the latter getting a better estimate:

[local] test=# explain select * from t1 where 1 = all (arr);
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on t1  (cost=0.00..2986.00 rows=18407 width=29)
   Filter: (1 = ALL (arr))
(2 rows)

[local] test=# explain select * from t1 where arr <@ array[1];
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on t1  (cost=0.00..1986.00 rows=1 width=29)
   Filter: (arr <@ '{1}'::integer[])
(2 rows)

> +     /*
> +      * Rest is a average length of elements which aren't present in mcelem.
> +      */
> +     rest = avg_length;

You define "rest" here as an array length ...

> +
> +     default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2);
> +
> +     mcelem_index = 0;
> +
> +     /*
> +      * mult is the multiplier that presents estimate of probability that each
> +      * mcelem which is not present in constant doesn't occur.
> +      */
> +     mult = 1.0f;
> +
> +     for (i = 0; i < nitems; i++)
> +     {
> +         bool        found = false;
> +
> +         /* Comparison with previous value in order to guarantee uniquness */
> +         if (i > 0)
> +         {
> +             if (!element_compare(&array_data[i - 1], &array_data[i]))
> +                 continue;
> +         }
> +
> +         /*
> +          * Iterate over mcelem until find mcelem that is greater or equal to
> +          * element of constant. Simultaneously taking care about rest and
> +          * mult. If that mcelem is found then fill corresponding elem_selec.
> +          */
> +         while (mcelem_index < nmcelem)
> +         {
> +             int            cmp = element_compare(&mcelem[mcelem_index], &array_data[i]);
> +
> +             if (cmp < 0)
> +             {
> +                 mult *= (1.0f - numbers[mcelem_index]);
> +                 rest -= numbers[mcelem_index];

... But here, you're subtracting a frequency from an array length?

> + /*
> +  * Comparison function for elements. Based on default comparison function for
> +  * array element data type.
> +  */
> + static int
> + element_compare(const void *key1, const void *key2)
> + {
> +     const Datum *d1 = (const Datum *) key1;
> +     const Datum *d2 = (const Datum *) key2;
> +
> +     cmpfunc.    arg[0] = *d1;
> +     cmpfunc.    arg[1] = *d2;
> +     cmpfunc.    argnull[0] = false;
> +     cmpfunc.    argnull[1] = false;
> +     cmpfunc.    isnull = false;
> +
> +     return DatumGetInt32(FunctionCallInvoke(&cmpfunc));
> + }

We could easily make this reentrant by passing the fcinfo through an argument
and using qsort_arg().  Please do so.
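
For instance (qsort_arg() and its three-argument comparator type already exist in src/port; the change is just threading the fcinfo through as the extra argument):

static int
element_compare(const void *key1, const void *key2, void *arg)
{
    FunctionCallInfoData *cmpfunc = (FunctionCallInfoData *) arg;

    cmpfunc->arg[0] = *(const Datum *) key1;
    cmpfunc->arg[1] = *(const Datum *) key2;
    cmpfunc->argnull[0] = false;
    cmpfunc->argnull[1] = false;
    cmpfunc->isnull = false;

    return DatumGetInt32(FunctionCallInvoke(cmpfunc));
}

Callers then do qsort_arg(values, nvalues, sizeof(Datum), element_compare, &cmpfunc) instead of relying on the file-level static.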

> *** /dev/null
> --- b/src/backend/utils/adt/array_typanalyze.c
> ***************
> *** 0 ****
> --- 1,834 ----
> + /*-------------------------------------------------------------------------
> +  *
> +  * array_typanalyze.c
> +  *      functions for gathering statistics from array columns
> +  *
> +  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
> +  *
> +  *
> +  * IDENTIFICATION
> +  *      src/backend/tsearch/array_typanalyze.c

Fix file location.

> +  *
> +  *-------------------------------------------------------------------------
> +  */
> +
> + #include "postgres.h"
> + #include "access/hash.h"
> + #include "catalog/pg_operator.h"
> + #include "commands/vacuum.h"
> + #include "commands/defrem.h"
> + #include "parser/parse_oper.h"
> + #include "utils/builtins.h"
> + #include "utils/hsearch.h"
> + #include "utils/typcache.h"
> + #include "utils/array.h"
> + #include "catalog/pg_am.h"
> + #include "catalog/pg_collation.h"
> + #include "utils/lsyscache.h"
> + #include "utils/selfuncs.h"

Alphabetize the includes after "postgres.h".

> +
> + #define ARRAY_ANALYZE_CHECK_OID(x) \
> +     if (!OidIsValid(x)) \
> +     { \
> +         elog(INFO, "Can't collect statistics on %d array type. Array \
> +                     statistics collection requires default hash and btree \
> +                     opclasses for element type.", stats->attrtypid); \
> +         stats->stats_valid = false; \
> +         return; \
> +     }

No message is necessary: the absence of a btree opclass or of both opclasses
degrades normal statistics, and we make no message about it then.

The decision to abandon the stats at this point causes a minor regression for
types having only a default hash opclass.  They currently get scalar minimal
stats; now they'll have none.  See below for one way to avoid that.

This macro is used in just one function, and the return statement means one
wouldn't lightly use it anywhere else.  Therefore, as a minor style point, I'd
place its definition right with the function that uses it rather than at the
head of the file.

> + Datum        array_typanalyze(PG_FUNCTION_ARGS);

extern prototypes go in header files, even when it's not strictly necessary.

> + /*
> +  *    array_typanalyze -- a custom typanalyze function for array columns
> +  */
> + Datum
> + array_typanalyze(PG_FUNCTION_ARGS)
> + {
> +     VacAttrStats *stats = (VacAttrStats *) PG_GETARG_POINTER(0);
> +     Form_pg_attribute attr = stats->attr;
> +     Oid            ltopr;
> +     Oid            eqopr;
> +     StdAnalyzeData *mystats;
> +
> +     /* If the attstattarget column is negative, use the default value */
> +     /* NB: it is okay to scribble on stats->attr since it's a copy */
> +     if (attr->attstattarget < 0)
> +         attr->attstattarget = default_statistics_target;
> +
> +     /* Look for default "<" and "=" operators for column's type */
> +     get_sort_group_operators(stats->attrtypid,
> +                              false, false, false,
> +                              <opr, &eqopr, NULL,
> +                              NULL);
> +
> +     /* If column has no "=" operator, we can't do much of anything */
> +     if (!OidIsValid(eqopr))
> +         return false;
> +
> +     /* Save the operator info for compute_stats routines */
> +     mystats = (StdAnalyzeData *) palloc(sizeof(StdAnalyzeData));
> +     mystats->eqopr = eqopr;
> +     mystats->eqfunc = get_opcode(eqopr);
> +     mystats->ltopr = ltopr;
> +     stats->extra_data = mystats;

Instead of duplicating this initialization and exporting StdAnalyzeData,
compute_scalar_stats() and compute_minimal_stats() from analyze.c, I suggest
structuring things as follows.  Export only std_typanalyze() and call it here.
If it returns false, return false here, too.  Otherwise, proceed with the
additional lookups you currently do in compute_array_stats().  If you don't
find everything you need (default btree operator class, for example), just
return true; the analysis will continue with the standard scalar approach as
setup by std_typanalyze().  Otherwise, replace compute_stats and extra_data
with your own materials.
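
In outline (struct and helper names here are placeholders, not from the patch):

Datum
array_typanalyze(PG_FUNCTION_ARGS)
{
    VacAttrStats *stats = (VacAttrStats *) PG_GETARG_POINTER(0);
    ArrayAnalyzeExtraData *extra;

    if (!std_typanalyze(stats))
        PG_RETURN_BOOL(false);

    /* look up element type, default btree/hash support, etc. */
    extra = examine_element_type(stats);
    if (extra == NULL)
        PG_RETURN_BOOL(true);           /* fall back to standard scalar stats */

    extra->std_extra_data = stats->extra_data;
    stats->extra_data = extra;
    stats->compute_stats = compute_array_stats;
    PG_RETURN_BOOL(true);
}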

> +
> +     stats->compute_stats = compute_array_stats;
> +     /* see comment about the choice of minrows in commands/analyze.c */
> +     stats->minrows = 300 * attr->attstattarget;
> +
> +     PG_RETURN_BOOL(true);
> + }
> +
> + /*
> +  *    compute_array_stats() -- compute statistics for a array column
> +  *
> +  *    This functions computes statistics that are useful for determining <@, &&,
> +  *    @> operations selectivity, along with the fraction of non-null rows and
> +  *    average width.
> +  *
> +  *    As an addition to the the most common values, as we do for most datatypes,
> +  *    we're looking for the most common elements and length histogram. In the
> +  *    case of relatively long arrays it might be more useful, because there most
> +  *    probably won't be any two rows with the same array and thus MCV has no
> +  *    much sense. With a list of the most common elements we can do a better job
> +  *    at figuring out <@, &&, @> selectivity. Arrays length histogram helps to
> +  *    "column <@ const" to be more precise.

The histogram addresses array distinct element counts, not array lengths.
That's exactly what we need for the selectivity calculations in question.
Please update the terminology throughout the patch, though.

> +  *
> +  *    The algorithm used is Lossy Counting, as proposed in the paper "Approximate
> +  *    frequency counts over data streams" by G. S. Manku and R. Motwani, in
> +  *    Proceedings of the 28th International Conference on Very Large Data Bases,
> +  *    Hong Kong, China, August 2002, section 4.2. The paper is available at
> +  *    http://www.vldb.org/conf/2002/S10P03.pdf
> +  *
> +  *    The Lossy Counting (aka LC) algorithm goes like this:
> +  *    Let s be the threshold frequency for an item (the minimum frequency we
> +  *    are interested in) and epsilon the error margin for the frequency. Let D
> +  *    be a set of triples (e, f, delta), where e is an element value, f is that
> +  *    element's frequency (actually, its current occurrence count) and delta is
> +  *    the maximum error in f. We start with D empty and process the elements in
> +  *    batches of size w. (The batch size is also known as "bucket size" and is
> +  *    equal to 1/epsilon.) Let the current batch number be b_current, starting
> +  *    with 1. For each element e we either increment its f count, if it's
> +  *    already in D, or insert a new triple into D with values (e, 1, b_current
> +  *    - 1). After processing each batch we prune D, by removing from it all
> +  *    elements with f + delta <= b_current.  After the algorithm finishes we
> +  *    suppress all elements from D that do not satisfy f >= (s - epsilon) * N,
> +  *    where N is the total number of elements in the input.  We emit the
> +  *    remaining elements with estimated frequency f/N.  The LC paper proves
> +  *    that this algorithm finds all elements with true frequency at least s,
> +  *    and that no frequency is overestimated or is underestimated by more than
> +  *    epsilon.  Furthermore, given reasonable assumptions about the input
> +  *    distribution, the required table size is no more than about 7 times w.
> +  *
> +  *    We set s to be the estimated frequency of the K'th element in a natural
> +  *    language's frequency table, where K is the target number of entries in
> +  *    the MCELEM array. We assume that the distribution of element frequencies
> +  *    follows Zipf's law with an exponent of 1.
> +  *
> +  *    Assuming Zipfian distribution, the frequency of the K'th element is equal
> +  *    to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
> +  *    elements in the language.    Putting W as one million, we get roughly
> +  *    0.07/K. This gives s = 0.07/K.    We set epsilon = s/10, which gives bucket
> +  *    width w = K/0.007 and maximum expected hashtable size of about 1000 * K.

These last two paragraphs, adapted from ts_typanalyze.c, assume natural
language documents.  To what extent do these parameter choices remain sensible
for arbitrary data such as users may place in arrays?  In any event, we need a
different justification, even if it's just a hand-wavy justification.

If I'm following this correctly, this choice of "s" makes the algorithm
guaranteed to find only elements constituting >= 7% of the input elements.
Incidentally, isn't that far too high for natural language documents?  If the
English word "the" only constitutes 7% of typical documents, then this "s"
value would permit us to discard practically every word; we'd be left with
words read while filling the last bucket and words that happened to repeat
exceedingly often in the column.  I haven't tried to make a test case to
observe this problem; am I missing something?  (This question is largely
orthogonal to your patch.)

> +  *
> +  *    Note: in the above discussion, s, epsilon, and f/N are in terms of a
> +  *    element's frequency as a fraction of all elements seen in the input.
> +  *    However, what we actually want to store in the finished pg_statistic
> +  *    entry is each element's frequency as a fraction of all rows that it occurs
> +  *    in. Elements might be repeated in the same array. Since operators
> +  *    <@, &&, @> takes care only about element occurence itself and not about
> +  *    occurence count, function takes additional care about uniqueness of
> +  *    counting. Also we need to change the divisor from N to nonnull_cnt to get
> +  *    the number we want.

On the same tangent, why does ts_typanalyze() not deduplicate the same way?
The @@ operator has the same property.

> +  */
> + static void
> + compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
> +                     int samplerows, double totalrows)
> + {
> +     int            num_mcelem;
> +     int            null_cnt = 0;
> +
> +     /*
> +      * We should count not only null array values, but also null array
> +      * elements
> +      */
> +     int            null_elem_cnt = 0.0;
> +
> +     double        total_width = 0.0;
> +     double        total_length = 0.0;
> +
> +     /* This is D from the LC algorithm. */
> +     HTAB       *elements_tab;
> +     HASHCTL        elem_hash_ctl;
> +     HASH_SEQ_STATUS scan_status;
> +
> +     /* This is the current bucket number from the LC algorithm */
> +     int            b_current;
> +
> +     /* This is 'w' from the LC algorithm */
> +     int            bucket_width;
> +     int            array_no,
> +                 element_no;
> +     Datum        hash_key;
> +     TrackItem  *item;
> +
> +     int            lengths_count;
> +     int            length_index;
> +     int            slot_idx = 0;
> +     HTAB       *length_tab;
> +     HASHCTL        length_hash_ctl;
> +     LengthItem *length_item;
> +     LengthItem *sorted_length_tab;
> +
> +     /*
> +      * Most part of array operators, which selectivity is estimated by this
> +      * statistics, takes care only one occurence of element in array. That's
> +      * why we should take care about count element occurence only once per
> +      * array. To clean occurence flag for each array by iterating over all
> +      * hash table can be too expensive. That's why we store pointers to hash
> +      * items contain elements which occur in last array.
> +      */
> +     TrackItem **occurences = NULL;
> +     int            occurences_count = 0,
> +                 occurence_index;
> +
> +     TypeCacheEntry *typentry;
> +     Oid            hash_opclass,
> +                 hash_opfamily,
> +                 element_typeid,
> +                 hash_oroc;
> +     FmgrInfo    hash_func_info;
> +
> +     StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
> +
> +     /* Compute standard statistics */
> +     if (OidIsValid(mystats->ltopr))
> +         compute_scalar_stats(stats, fetchfunc, samplerows, totalrows);
> +     else
> +         compute_minimal_stats(stats, fetchfunc, samplerows, totalrows);
> +
> +
> +     /* Gathering all necessary information about element data type. */
> +
> +     element_typeid = stats->attrtype->typelem;
> +
> +     if (!OidIsValid(element_typeid))
> +         elog(ERROR, "array_typanalyze was invoked with %d non-array type",
> +              stats->attrtypid);
> +
> +     typentry = lookup_type_cache(element_typeid, TYPECACHE_EQ_OPR |
> +      TYPECACHE_CMP_PROC | TYPECACHE_EQ_OPR_FINFO | TYPECACHE_CMP_PROC_FINFO);
> +     ARRAY_ANALYZE_CHECK_OID(typentry->cmp_proc);
> +     ARRAY_ANALYZE_CHECK_OID(typentry->eq_opr);
> +
> +     hash_opclass = GetDefaultOpClass(element_typeid, HASH_AM_OID);
> +     ARRAY_ANALYZE_CHECK_OID(hash_opclass);
> +
> +     hash_opfamily = get_opclass_family(hash_opclass);
> +     ARRAY_ANALYZE_CHECK_OID(hash_opfamily);
> +
> +     hash_oroc = get_opfamily_proc(hash_opfamily, element_typeid,
> +                                   element_typeid, HASHPROC);
> +     ARRAY_ANALYZE_CHECK_OID(hash_oroc);
> +
> +     fmgr_info(hash_oroc, &hash_func_info);
> +
> +     InitFunctionCallInfoData(element_type_info.cmp, &typentry->cmp_proc_finfo,
> +                              2, DEFAULT_COLLATION_OID, NULL, NULL);
> +     InitFunctionCallInfoData(element_type_info.eq, &typentry->eq_opr_finfo,
> +                              2, DEFAULT_COLLATION_OID, NULL, NULL);
> +     InitFunctionCallInfoData(element_type_info.hash, &hash_func_info,
> +                              1, DEFAULT_COLLATION_OID, NULL, NULL);
> +     element_type_info.typbyval = typentry->typbyval;

As I mentioned above in passing, all of the above setup only needs to happen
once per analyzed column, not once per tuple.  It belongs in array_typanalyze;
define a struct to hold all the looked-up state, including a pointer to any
state for std_typanalyze(), and store that in stats->extra_data.
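
For example, something like this (field list illustrative only):

typedef struct ArrayAnalyzeExtraData
{
    TypeCacheEntry *typentry;           /* element type: eq/cmp lookups */
    FmgrInfo        hash;               /* default hash function for elements */
    void           *std_extra_data;     /* whatever std_typanalyze() set up */
} ArrayAnalyzeExtraData;

array_typanalyze() would fill this in once per column and leave it in stats->extra_data for compute_array_stats() to use.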

This code should probably get around to using the new SortSupport
infrastructure like analyze.c now uses.  Not sure whether that's mandatory for
initial commit.

typanalyze functions that call arbitrary user code must be reentrant.  It only
matters in corner cases, but crashing in those corner cases is not acceptable.
See our use of CurTupleHashTable in execGrouping.c and defend this global
state in a similar fashion.

> +
> +     /*
> +      * We want statistics_target * 10 elements in the MCELEM array. This
> +      * multiplier is pretty arbitrary, but is meant to reflect the fact that
> +      * the number of individual elements tracked in pg_statistic ought to be
> +      * more than the number of values for a simple scalar column.
> +      */
> +     num_mcelem = stats->attr->attstattarget * 10;
> +
> +     /*
> +      * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
> +      * comment above.
> +      */
> +     bucket_width = num_mcelem * 1000 / 7;

The addend mentioned is not present in the code or discussed in "the comment
above".  (I see the comment is copied verbatim from ts_typanalyze(), where the
addend *is* present, though again the preceding comment says nothing of it.)

> +
> +     /*
> +      * Create the hashtable. It will be in local memory, so we don't need to
> +      * worry about overflowing the initial size. Also we don't need to pay any
> +      * attention to locking and memory management.
> +      */
> +     MemSet(&elem_hash_ctl, 0, sizeof(elem_hash_ctl));
> +     elem_hash_ctl.keysize = sizeof(Datum);
> +     elem_hash_ctl.entrysize = sizeof(TrackItem);
> +     elem_hash_ctl.hash = element_hash;
> +     elem_hash_ctl.match = element_match;
> +     elem_hash_ctl.hcxt = CurrentMemoryContext;
> +     elements_tab = hash_create("Analyzed elements table",
> +                                bucket_width * 7,

Though it's copied from compute_tsvector_stats(), why "7"?

> +                                &elem_hash_ctl,
> +                     HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
> +
> +     /*
> +      * hashtable for arrays lengths.
> +      */
> +     MemSet(&length_hash_ctl, 0, sizeof(length_hash_ctl));
> +     length_hash_ctl.keysize = sizeof(int);
> +     length_hash_ctl.entrysize = sizeof(LengthItem);
> +     length_hash_ctl.hash = tag_hash;

You need to pass the HASH_FUNCTION flag for this setting to take effect.

> +     length_hash_ctl.match = memcmp;

Omit this.  You would need to pass HASH_COMPARE for it to take effect, but
it's also implicit once you override the hash function.
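
That is, the setup would become roughly:

length_hash_ctl.hash = tag_hash;
/* no .match assignment */
length_tab = hash_create("Array length table", 64, &length_hash_ctl,
                         HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);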

> +     length_hash_ctl.hcxt = CurrentMemoryContext;
> +     length_tab = hash_create("Array length table",
> +                              64,
> +                              &length_hash_ctl,
> +                              HASH_ELEM | HASH_CONTEXT);
> +
> +     /* Initialize counters. */
> +     b_current = 1;
> +     element_no = 0;
> +
> +     /* Loop over the arrays. */
> +     for (array_no = 0; array_no < samplerows; array_no++)
> +     {
> +         Datum        value;
> +         bool        isnull;
> +         bool        null_present;
> +         ArrayType  *array;
> +         char       *ptr;
> +         bits8       *bitmap;
> +         int            bitmask;
> +         int            j;
> +         int            ndims;
> +         int           *dims;
> +         int            nitems;
> +         bool        length_found;
> +
> +         vacuum_delay_point();
> +
> +         value = fetchfunc(stats, array_no, &isnull);
> +
> +         /*
> +          * Check for null/nonnull.
> +          */
> +         if (isnull)
> +         {
> +             null_cnt++;
> +             continue;
> +         }
> +
> +         /*
> +          * Add up widths for average-width calculation.  Since it's a array,
> +          * we know it's varlena.  As in the regular compute_minimal_stats
> +          * function, we use the toasted width for this calculation.
> +          */
> +         total_width += VARSIZE_ANY(DatumGetPointer(value));
> +
> +         /*
> +          * Now detoast the array if needed.
> +          */
> +         array = DatumGetArrayTypeP(value);
> +         ptr = ARR_DATA_PTR(array);
> +         bitmap = ARR_NULLBITMAP(array);
> +         bitmask = 1;
> +         ndims = ARR_NDIM(array);
> +         dims = ARR_DIMS(array);
> +         nitems = ArrayGetNItems(ndims, dims);
> +
> +         /*
> +          * Check if we have enough of memory to store element occurences in
> +          * one array.
> +          */
> +         if (nitems > occurences_count)
> +         {
> +             occurences_count = 2 * nitems;
> +             if (occurences)
> +                 occurences = (TrackItem **) repalloc(occurences,
> +                                      sizeof(TrackItem *) * occurences_count);
> +             else
> +                 occurences = (TrackItem **) palloc(
> +                                      sizeof(TrackItem *) * occurences_count);
> +         }
> +         occurence_index = 0;
> +
> +         null_present = false;
> +
> +         /*
> +          * We loop through the elements in the array and add them to our
> +          * tracking hashtable.    Note: the hashtable entries will point into
> +          * the (detoasted) array value, therefore we cannot free that storage
> +          * until we're done.
> +          */

The second sentence of this comment is obsolete.

> +         for (j = 0; j < nitems; j++)
> +         {
> +             bool        found;
> +             bool        isnull;
> +
> +             /* Get elements, checking for NULL */
> +             if (bitmap && (*bitmap & bitmask) == 0)
> +             {
> +                 hash_key = (Datum) 0;
> +                 isnull = true;
> +                 null_present = true;
> +             }
> +             else
> +             {
> +                 /* Get element value */
> +                 hash_key = fetch_att(ptr, typentry->typbyval, typentry->typlen);
> +                 isnull = false;
> +
> +                 /*
> +                  * We should allocate memory for element if it isn't passing
> +                  * by value, because array will be freed after that loop.
> +                  */
> +                 if (!typentry->typbyval)
> +                 {
> +                     Datum        tmp;
> +                     int            length;
> +
> +                     length = (typentry->typlen == -1) ?
> +                         VARSIZE(hash_key) : typentry->typlen;
> +                     tmp = (Datum) MemoryContextAlloc(stats->anl_context, length);
> +                     memcpy((void *) tmp, (void *) hash_key, length);
> +                     hash_key = tmp;
> +                 }

Please use datumCopy() or add a comment about why it's insufficient.
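
A sketch of that (datumCopy() allocates in CurrentMemoryContext, so switch to anl_context around it when the copy must outlive the current context):

MemoryContext oldcxt = MemoryContextSwitchTo(stats->anl_context);

hash_key = datumCopy(hash_key, typentry->typbyval, typentry->typlen);
MemoryContextSwitchTo(oldcxt);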

> +                 ptr = att_addlength_pointer(ptr, typentry->typlen, ptr);
> +                 ptr = (char *) att_align_nominal(ptr, typentry->typalign);
> +             }
> +
> +             /* Advance bitmap pointers if any */
> +             bitmask <<= 1;
> +             if (bitmask == 0x100)
> +             {
> +                 if (bitmap)
> +                     bitmap++;
> +                 bitmask = 1;
> +             }
> +
> +             /* No null element processing other then flag setting here */
> +             if (isnull)
> +                 continue;
> +
> +             /* Lookup current element in hashtable, adding it if new */
> +             item = (TrackItem *) hash_search(elements_tab,
> +                                              (const void *) &hash_key,
> +                                              HASH_ENTER, &found);
> +
> +             if (found)
> +             {
> +                 /*
> +                  * The element is already on the tracking list. If it is the
> +                  * first occurence in array then update element frequency.
> +                  */
> +                 if (!item->occurence)
> +                 {
> +                     item->frequency++;
> +                     item->occurence = true;
> +                     occurences[occurence_index++] = item;
> +                 }
> +             }
> +             else
> +             {
> +                 /* Initialize new tracking list element */
> +                 item->frequency = 1;
> +                 item->delta = b_current - 1;
> +                 item->occurence = true;
> +                 occurences[occurence_index++] = item;
> +             }
> +
> +             /* element_no is the number of elements processed (ie N) */
> +             element_no++;

We should not bump element_no when we skipped the element as a duplicate
within the same array.

> +
> +             /* We prune the D structure after processing each bucket */
> +             if (element_no % bucket_width == 0)
> +             {
> +                 prune_element_hashtable(elements_tab, b_current);
> +                 b_current++;
> +             }
> +         }
> +         /* Count null element only once per array */
> +         if (null_present)
> +             null_elem_cnt++;
> +
> +         /* Update frequency of particular array length. */
> +         length_item = (LengthItem *) hash_search(length_tab,
> +                                                  &occurence_index,
> +                                                  HASH_ENTER, &length_found);
> +         if (length_found)
> +         {
> +             length_item->frequency++;
> +         }
> +         else
> +         {
> +             length_item->length = occurence_index;
> +             length_item->frequency = 1;
> +         }
> +         total_length += occurence_index;
> +
> +         /*
> +          * When we end processing of particular array we should clean the
> +          * occurence flag.
> +          */
> +         for (j = 0; j < occurence_index; j++)
> +             occurences[j]->occurence = false;

Could you usefully simplify this away by storing the last-containing array_no,
instead of a bool, in each hash entry?
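
That is, something like this (the field name is hypothetical):

if (item->last_array_no != array_no)
{
    item->frequency++;
    item->last_array_no = array_no;
}

which would remove the need for the reset loop quoted above.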

> +
> +         /* We should free memory from array if it was copied during detoast. */
> +         if ((Datum) array != value)
> +             pfree((void *) array);

No need for the casts to "void *".

> +     }
> +
> +     /* Skip slots occupied by standard statistics */
> +     while (OidIsValid(stats->stakind[slot_idx]))
> +         slot_idx++;
> +
> +     /* Fill histogram of arrays lengths. */
> +     lengths_count = hash_get_num_entries(length_tab);
> +     if (lengths_count > 0)
> +     {
> +         int            num_hist = stats->attr->attstattarget;
> +         int            delta;
> +         int            frac;
> +         int            i;
> +         Datum       *hist_values;
> +
> +         /* Copy lengths statistics from hashtab to array and sort them. */
> +         length_index = 0;
> +         sorted_length_tab = (LengthItem *) palloc(sizeof(LengthItem) * lengths_count);
> +         hash_seq_init(&scan_status, length_tab);
> +         while ((length_item = (LengthItem *) hash_seq_search(&scan_status)) != NULL)
> +         {
> +             memcpy(&sorted_length_tab[length_index], length_item,
> +                    sizeof(LengthItem));
> +             length_index++;
> +         }
> +         qsort(sorted_length_tab, lengths_count, sizeof(LengthItem),
> +               lengthitem_compare_element);
> +
> +         /* Histogram should be stored in anl_context. */
> +         hist_values = (Datum *) MemoryContextAlloc(stats->anl_context,
> +                                                    sizeof(Datum) * num_hist);
> +         /* Fill histogram by hashtab. */
> +         delta = samplerows - null_cnt - 1;
> +         length_index = 0;
> +         frac = sorted_length_tab[0].frequency * (num_hist - 1);
> +         for (i = 0; i < num_hist; i++)
> +         {
> +             hist_values[i] =
> +                 Int32GetDatum(sorted_length_tab[length_index].length);
> +             frac -= delta;
> +             while (frac <= 0)
> +             {
> +                 length_index++;
> +                 frac += sorted_length_tab[length_index].frequency *
> +                     (num_hist - 1);
> +             }
> +         }
> +
> +         stats->stakind[slot_idx] = STATISTIC_KIND_LENGTH_HISTOGRAM;
> +         stats->staop[slot_idx] = Int4EqualOperator;
> +         stats->stavalues[slot_idx] = hist_values;
> +         stats->numvalues[slot_idx] = num_hist;
> +         /* We are storing values of element type */
> +         stats->statypid[slot_idx] = INT4OID;
> +         stats->statyplen[slot_idx] = 4;
> +         stats->statypbyval[slot_idx] = true;
> +         stats->statypalign[slot_idx] = 'i';
> +         slot_idx++;
> +     }
> +
> +     /* We can only compute real stats if we found some non-null values. */
> +     if (null_cnt < samplerows)
> +     {
> +         int            nonnull_cnt = samplerows - null_cnt;
> +         int            i;
> +         TrackItem **sort_table;
> +         int            track_len;
> +         int            cutoff_freq;
> +         int            minfreq,
> +                     maxfreq;
> +
> +         stats->stats_valid = true;
> +         /* Do the simple null-frac and average width stats */
> +         stats->stanullfrac = (double) null_cnt / (double) samplerows;
> +         stats->stawidth = total_width / (double) nonnull_cnt;

Isn't this redundant with the calculations made in compute_scalar_stats()?

> +
> +         /* Assume it's a unique column (see notes above) */
> +         stats->stadistinct = -1.0;

The comment, copied from ts_typanalyze(), refers to notes not likewise copied.
We should probably instead leave whatever compute_scalar_stats() calculated.

> *** a/src/backend/utils/adt/selfuncs.c
> --- b/src/backend/utils/adt/selfuncs.c
> ***************
> *** 1705,1710 **** scalararraysel(PlannerInfo *root,
> --- 1705,1736 ----
>       RegProcedure oprsel;
>       FmgrInfo    oprselproc;
>       Selectivity s1;
> +     bool        varonleft;
> +     Node       *other;
> +     VariableStatData vardata;
> +
> +     /*
> +      * Handle "const = qual(column)" case using array column statistics.
> +      */
> +     if (get_restriction_variable(root, clause->args, varRelid,
> +                                   &vardata, &other, &varonleft))
> +     {
> +         Oid elemtype;
> +         elemtype = get_base_element_type(vardata.vartype);
> +         if (elemtype != InvalidOid && IsA(other, Const))
> +         {
> +             if (((Const *) other)->constisnull)
> +             {
> +                 /* qual can't succeed if null array */
> +                 ReleaseVariableStats(vardata);
> +                 return (Selectivity) 0.0;
> +             }
> +             s1 = calc_scalararraysel(&vardata, ((Const *) other)->constvalue, useOr);
> +             ReleaseVariableStats(vardata);
> +             return s1;
> +         }
> +         ReleaseVariableStats(vardata);
> +     }

If we're going to add a new file for array selectivity functions (which seems
reasonable), scalararraysel() should also move there.  (To do this and still
keep the patch easy to read, you could do the move in a pre-patch.)

> *** a/src/include/catalog/pg_proc.h
> --- b/src/include/catalog/pg_proc.h
> ***************
> *** 849,854 **** DATA(insert OID = 2334 (  array_agg_finalfn   PGNSP PGUID 12 1 0 0 0 f f f f f i
> --- 849,859 ----
>   DESCR("aggregate final function");
>   DATA(insert OID = 2335 (  array_agg           PGNSP PGUID 12 1 0 0 0 t f f f f i 1 0 2277 "2283" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
>   DESCR("concatenate aggregate input into an array");
> + DATA(insert OID = 3816 (  array_typanalyze PGNSP PGUID 12 1 0 0 0 f f f t f s 1 0 16 "2281" _null_ _null_ _null_ _null_ array_typanalyze _null_ _null_ _null_ ));
> + DESCR("array statistics collector");
> + #define ArrayTypanalyzeOid 3816

Use the fmgroids.h symbol, F_ARRAY_TYPANALYZE, instead.

> + DATA(insert OID = 3817 (  arraysel           PGNSP PGUID 12 1 0 0 0 f f f t f s 4 0 701 "2281 26 2281 23" _null_ _null_ _null_ _null_ arraysel _null_ _null_ _null_ ));
> + DESCR("array selectivity estimation functions");
>
>   DATA(insert OID = 760 (  smgrin               PGNSP PGUID 12 1 0 0 0 f f f t f s 1 0 210 "2275" _null_ _null_ _null_ _null_ smgrin _null_ _null_ _null_ ));
>   DESCR("I/O");
> *** a/src/include/catalog/pg_statistic.h
> --- b/src/include/catalog/pg_statistic.h

>   /*
>    * Currently, three statistical slot "kinds" are defined: most common values,

Here's a larger quote of the comment starting here:

/*
 * Currently, three statistical slot "kinds" are defined: most common values,
 * histogram, and correlation.    Additional "kinds" will probably appear in
 * future to help cope with non-scalar datatypes.  Also, custom data types
 * can define their own "kind" codes by mutual agreement between a custom
 * typanalyze routine and the selectivity estimation functions of the type's
 * operators.

That needs an update.  (It already needs an update for STATISTIC_KIND_MCELEM,
but this patch would invalidate it yet again.)

> ***************
> *** 260,263 **** typedef FormData_pg_statistic *Form_pg_statistic;
> --- 268,274 ----
>    */
>   #define STATISTIC_KIND_MCELEM  4
>
> +
> + #define STATISTIC_KIND_LENGTH_HISTOGRAM  5

The other kinds have long explanatory comments; this one should, too.

> *** a/src/include/catalog/pg_type.h
> --- b/src/include/catalog/pg_type.h

> ! DATA(insert OID = 1021 (  _float4     PGNSP PGUID -1 f b A f t \054 0 700 0 array_in array_out array_recv array_send - - array_typanalyze i x f 0 -1 0 0 _null_ _null_ ));

With this patch, a fresh database sets typanalyze = array_typanalyze for 27
array types and leaves typanalyze = NULL for the other 38 array types.  What
is the rationale for the split?  For example, why does real[] use the new
typanalyze but numeric[] does not?

Thanks,
nm

Attachment

Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
Hi!

Thanks for your great work reviewing this patch. I'm now trying to find the memory corruption bug; unfortunately, it doesn't appear on my system. Can you check whether the bug remains in the attached version of the patch? If so, please send me information about the system you're running (processor, OS, etc.).

------
With best regards,
Alexander Korotkov.
Attachment

Re: Collect frequency statistics for arrays

From
Noah Misch
Date:
On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote:
> Thanks for your great work on reviewing this patch. Now I'm trying to find
> memory corruption bug. Unfortunately it doesn't appears on my system. Can
> you check if this bug remains in attached version of patch. If so, please
> provide me information about system you're running (processor, OS etc.).

I get the same diagnostic from this version.  Opteron processor, operating
system is Ubuntu 8.04 (64-bit).  You're using --enable-cassert, right?


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
On Wed, Jan 4, 2012 at 12:33 AM, Noah Misch <noah@leadboat.com> wrote:
On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote:
> Thanks for your great work on reviewing this patch. Now I'm trying to find
> memory corruption bug. Unfortunately it doesn't appears on my system. Can
> you check if this bug remains in attached version of patch. If so, please
> provide me information about system you're running (processor, OS etc.).

I get the same diagnostic from this version.  Opteron processor, operating
system is Ubuntu 8.04 (64-bit).  You're using --enable-cassert, right?
Oh, actually no. Thanks for pointing that out.

------
With best regards,
Alexander Korotkov.

Re: Collect frequency statistics for arrays

From
Noah Misch
Date:
Corrections:

On Thu, Dec 29, 2011 at 11:35:00AM -0500, Noah Misch wrote:
> On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote:
> > +  *    We set s to be the estimated frequency of the K'th element in a natural
> > +  *    language's frequency table, where K is the target number of entries in
> > +  *    the MCELEM array. We assume that the distribution of element frequencies
> > +  *    follows Zipf's law with an exponent of 1.
> > +  *
> > +  *    Assuming Zipfian distribution, the frequency of the K'th element is equal
> > +  *    to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
> > +  *    elements in the language.    Putting W as one million, we get roughly
> > +  *    0.07/K. This gives s = 0.07/K.    We set epsilon = s/10, which gives bucket
> > +  *    width w = K/0.007 and maximum expected hashtable size of about 1000 * K.
> 
> These last two paragraphs, adapted from ts_typanalyze.c, assume natural
> language documents.  To what extent do these parameter choices remain sensible
> for arbitrary data such as users may place in arrays?  In any event, we need a
> different justification, even if it's just a hand-wavy justification.
> 
> If I'm following this correctly, this choice of "s" makes the algorithm
> guaranteed to find only elements constituting >= 7% of the input elements.
> Incidentally, isn't that far too high for natural language documents?  If the
> English word "the" only constitutes 7% of typical documents, then this "s"
> value would permit us to discard practically every word; we'd be left with
> words read while filling the last bucket and words that happened to repeat
> exceedingly often in the column.  I haven't tried to make a test case to
> observe this problem; am I missing something?  (This question is largely
> orthogonal to your patch.)

No, we'll find elements of frequency at least 0.07/(default_statistics_target
* 10) -- in the default configuration, 0.007%.  Also, ts_typanalyze() counts
the number of documents that contain one or more instances of each lexeme,
ignoring the number of appearances within each document.  The word "the" may
constitute 7% of a typical document, but it will appear at least once in
nearly 100% of documents.  Therefore, this "s" value is adequate even for the
pathological case of each "document" containing just one lexeme.
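
For concreteness, here is the arithmetic with the default settings (illustrative numbers only, assuming the target number of MCELEM entries K is 10 * statistics target, as in ts_typanalyze):

    int     stats_target = 100;                     /* default_statistics_target */
    int     num_mcelem = stats_target * 10;         /* K = 1000 */
    double  s = 0.07 / num_mcelem;                  /* 7e-5, i.e. 0.007% */
    double  epsilon = s / 10;                       /* 7e-6 */
    int     bucket_width = num_mcelem * 1000 / 7;   /* 1/epsilon, ~142857 elements */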

> > +  *
> > +  *    Note: in the above discussion, s, epsilon, and f/N are in terms of a
> > +  *    element's frequency as a fraction of all elements seen in the input.
> > +  *    However, what we actually want to store in the finished pg_statistic
> > +  *    entry is each element's frequency as a fraction of all rows that it occurs
> > +  *    in. Elements might be repeated in the same array. Since operators
> > +  *    <@, &&, @> takes care only about element occurence itself and not about
> > +  *    occurence count, function takes additional care about uniqueness of
> > +  *    counting. Also we need to change the divisor from N to nonnull_cnt to get
> > +  *    the number we want.
> 
> On the same tangent, why does ts_typanalyze() not deduplicate the same way?
> The @@ operator has the same property.

Answer: to_tsvector() will have already done so.

> > +     /*
> > +      * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
> > +      * comment above.
> > +      */
> > +     bucket_width = num_mcelem * 1000 / 7;
> 
> The addend mentioned is not present in the code or discussed in "the comment
> above".  (I see the comment is copied verbatim from ts_typanalyze(), where the
> addend *is* present, though again the preceding comment says nothing of it.)

The addend rationale in fact does appear in the ts_typanalyze() comment.

Thanks,
nm


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
Hi!

A patch where most of the issues are fixed is attached.

On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch <noah@leadboat.com> wrote:
> I find distressing the thought of having two copies of the lossy sampling
> code, each implementing the algorithm with different variable names and levels
> of generality.  We might someday extend this to hstore, and then we'd have yet
> another copy.  Tom commented[1] that ts_typanalyze() and array_typanalyze()
> should remain distinct, and I agree.  However, they could call a shared
> counting module.  Is that practical?  Possible API:
>
> typedef struct LossyCountCtl;
> LossyCountCtl *LossyCountStart(float s,
>                                                           float epsilon,
>                                                           int2 typlen,
>                                                           bool typbyval,
>                                                           Oid eqfunc); /* + hash func, a few others */
> void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
> TrackItem **LossyCountGetAll(LossyCountCtl *ctl);
>

I'm not sure about a shared lossy counting module, because the shared part of the code would be relatively small. The part of the compute_array_stats function that takes care of array decompression, distinct occurrence counting, the distinct element count histogram, packing the statistics slots, etc. is much larger than the lossy counting algorithm itself. Maybe there are other opinions in the community?
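
To illustrate how small the shared part would be, here is a rough, hypothetical sketch against the API proposed above (untested; variable names are placeholders):

    LossyCountCtl *lc = LossyCountStart(s, epsilon, typlen, typbyval, eq_opr);

    for (array_no = 0; array_no < samplerows; array_no++)
    {
        /* decompressing the array and deduplicating its elements would
         * stay in compute_array_stats */
        for (j = 0; j < n_distinct_elems; j++)
            LossyCountAdd(lc, distinct_elems[j]);
    }
    track_items = LossyCountGetAll(lc);     /* candidates for the MCELEM slot */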

> I think this is an improvement, but some code out there may rely on the
> ability to get stakind = 4 data from the most_common_vals column.  We'll need
> to mention this in the release notes as an incompatibility.

I'm not sure I understand the mechanism of release notes. Does it require anything in the patch itself?

> > + /*
> > +  * Let be n independent events with probabilities p. This function calculates
> > +  * probabilities of exact k of events occurence for k in [0;m].
> > +  * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes
> > +  * probability that exact j of first i events occurs. Obviously M[0,0] = 1.
> > +  * Each next event increase total number of occured events if it occurs and
> > +  * leave last value of that number if it doesn't occur. So, by the law of
> > +  * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
> > +  * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
> > +  * Also there could be some events with low probabilities. Their summary
> > +  * probability passed in the rest parameter.
> > +  */
> > + static float *
> > + calc_distr(float *p, int n, int m, float rest)
> > + {
>
> > +     /* Take care about events with low probabilities. */
> > +     if (rest > 0.0f)
> > +     {
> > +             /*
> > +              * The probability of no occurence of events which forms "rest"
> > +              * probability have a limit of exp(-rest) when number of events fo to
> > +              * infinity. Another simplification is to replace that events with one
> > +              * event with (1 - exp(-rest)) probability.
> > +              */
> > +             rest = 1.0f - exp(-rest);
>
> What is the name of the underlying concept in probability theory?

The closest concept to the calculated distribution is the multinomial distribution, but it's not exactly the same, because the multinomial distribution gives the probability of a particular count of each event's occurrence, not the probability of the total number of occurrences. Actually, the distribution is calculated just from the assumption that the events are independent. The closest concept for the "rest" probability is approximation by an exponential distribution. It's quite a rough approximation, but I can't invent something better with low calculation complexity.
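
To make the recurrence above concrete, here is a compact sketch of it (not the patch code itself; the "rest" handling is omitted and the row is updated in place):

    static float *
    calc_distr_sketch(const float *p, int n, int m)
    {
        float      *row = (float *) palloc0((m + 1) * sizeof(float));
        int         i,
                    j;

        row[0] = 1.0f;              /* M[0,0] = 1: no events, zero occurrences */
        for (i = 1; i <= n; i++)
        {
            /* Walk j downwards so row[j - 1] still holds M[i - 1, j - 1]. */
            for (j = m; j > 0; j--)
                row[j] = row[j] * (1.0f - p[i - 1]) + row[j - 1] * p[i - 1];
            row[0] *= (1.0f - p[i - 1]);
        }
        return row;                 /* row[j] = M[n, j] = P(exactly j events occur) */
    }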

> > + /*
> > +  * Array selectivity estimation based on most common elements statistics for
> > +  * "column <@ const" case. Assumption that element occurences are independent
> > +  * means certain distribution of array lengths. Typically real distribution
> > +  * of lengths is significantly different from it. For example, if even we
> > +  * have set of arrays with 1 integer element in range [0;10] each, element
> > +  * occurences are not independent. Because in the case of independence we
>
> Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> '{6,19,4}' cannot appear?

I'm referring to a column where each array contains exactly one element, i.e. the only possible values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}', '{10}'. That is a corner case. But a similar situation occurs when, for example, the distinct element count is distributed between 1 and 3. That significantly differs from the distribution implied by independent occurrence.

> > +  * have probabilities of length of 0, 1, 2 etc. In the "column @> const"
> > +  * and "column && const" cases we usually have "const" with low summary
> > +  * frequency of elements (otherwise we have selectivity close to 0 or 1
> > +  * correspondingly). That's why effect of dependence related to lengths
> > +  * distribution id negligible there. In the "column <@ const" case summary
> > +  * frequency of elements is high (otherwise we have selectivity close to 0).
>
> What does the term "summary frequency" denote?

I meant the sum of the frequencies of the "const" array elements.

> > +     /*
> > +      * Rest is a average length of elements which aren't present in mcelem.
> > +      */
> > +     rest = avg_length;
>
> You define "rest" here as an array length ...
>
> > +
> > +     default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2);
> > +
> > +     mcelem_index = 0;
> > +
> > +     /*
> > +      * mult is the multiplier that presents estimate of probability that each
> > +      * mcelem which is not present in constant doesn't occur.
> > +      */
> > +     mult = 1.0f;
> > +
> > +     for (i = 0; i < nitems; i++)
> > +     {
> > +             bool            found = false;
> > +
> > +             /* Comparison with previous value in order to guarantee uniquness */
> > +             if (i > 0)
> > +             {
> > +                     if (!element_compare(&array_data[i - 1], &array_data[i]))
> > +                             continue;
> > +             }
> > +
> > +             /*
> > +              * Iterate over mcelem until find mcelem that is greater or equal to
> > +              * element of constant. Simultaneously taking care about rest and
> > +              * mult. If that mcelem is found then fill corresponding elem_selec.
> > +              */
> > +             while (mcelem_index < nmcelem)
> > +             {
> > +                     int                     cmp = element_compare(&mcelem[mcelem_index], &array_data[i]);
> > +
> > +                     if (cmp < 0)
> > +                     {
> > +                             mult *= (1.0f - numbers[mcelem_index]);
> > +                             rest -= numbers[mcelem_index];
>
> ... But here, you're subtracting a frequency from an array length?
>

Yes, because the average distinct element count is the sum of the frequencies of the elements. Subtracting the mcelem frequencies from avg_length leaves the sum of the frequencies of the non-mcelem elements.

------
With best regards,
Alexander Korotkov.
Attachment

Re: Collect frequency statistics for arrays

From
Noah Misch
Date:
On Sat, Jan 07, 2012 at 09:36:42PM +0400, Alexander Korotkov wrote:
> Patch where most part of issues are fixed is attached.

Thanks.  I've made several, largely cosmetic, edits.  See attached version
0.10.  Please use it as the basis for your next version, and feel free to
revert any changes you deem inappropriate.  Where I made non-cosmetic edits, I
attempt to point that out below.  I've left unfixed a few more-substantive
problems, also described below.

When you post another update, could you add it to the open CF?  Given the
timing, I think we might as well consider any further activity to have
happened under the aegis of the 2012-01 CF.  I'm marking the current entry
Returned with Feedback.

> On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch <noah@leadboat.com> wrote:
> > I find distressing the thought of having two copies of the lossy sampling
> > code, each implementing the algorithm with different variable names and
> levels
> > of generality.  We might someday extend this to hstore, and then we'd
> have yet
> > another copy.  Tom commented[1] that ts_typanalyze() and
> array_typanalyze()
> > should remain distinct, and I agree.  However, they could call a shared
> > counting module.  Is that practical?  Possible API:
> >
> > typedef struct LossyCountCtl;
> > LossyCountCtl *LossyCountStart(float s,
> >                                                           float epsilon,
> >                                                           int2 typlen,
> >                                                           bool typbyval,
> >                                                           Oid eqfunc); /*
> + hash func, a few others */
> > void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
> > TrackItem **LossyCountGetAll(LossyCountCtl *ctl);
> >
> > [1]
> http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us
>
> I'm not sure about shared lossy counting module, because part of shared
> code would be relatively small. Part of compute_array_stats function which
> is taking care about array decompression, distinct occurence calculation,
> disting element count histogram, packing statistics slots etc is much
> larger than lossy counting algorithm itself. May be, there is some other
> opinions in community?

True; it would probably increase total lines of code.  The benefit, if any,
lies in separation of concerns; the business of implementing this algorithm is
quite different from the other roles of these typanalyze functions.  I won't
insist that you try it, though.

> > I think this is an improvement, but some code out there may rely on the
> > ability to get stakind = 4 data from the most_common_vals column.  We'll
> need
> > to mention this in the release notes as an incompatibility.
>
> I'm not sure I understand mechanism of release notes. Does it require
> something in a patch itself?

No.  I just wanted to call attention to the fact in the hope that someone
remembers as the release notes get drafted.

> > > +             /*
> > > +              * The probability of no occurence of events which forms
> "rest"
> > > +              * probability have a limit of exp(-rest) when number of
> events fo to
> > > +              * infinity. Another simplification is to replace that
> events with one
> > > +              * event with (1 - exp(-rest)) probability.
> > > +              */
> > > +             rest = 1.0f - exp(-rest);
> >
> > What is the name of the underlying concept in probability theory?
>
> The most closest concept to caculated distribution is multinomial
> distribution. But it's not exactly same, because multinomial distribution
> gives probability of particular count of each event occurece, not
> probability of summary occurence. Actually, distribution is caclulated just
> from assumption of events independence. The most closest concept of rest
> probability is approximation by exponential distribution. It's quite rough
> approximation, but I can't invent something better with low calculation
> complexity.

Do you have a URL of a tutorial or paper that explains the method in more
detail?  If, rather, this is a novel synthesis, could you write a proof to
include in the comments?

> > > + /*
> > > +  * Array selectivity estimation based on most common elements
> statistics for
> > > +  * "column <@ const" case. Assumption that element occurences are
> independent
> > > +  * means certain distribution of array lengths. Typically real
> distribution
> > > +  * of lengths is significantly different from it. For example, if
> even we
> > > +  * have set of arrays with 1 integer element in range [0;10] each,
> element
> > > +  * occurences are not independent. Because in the case of
> independence we
> >
> > Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> > '{6,19,4}' cannot appear?
>
> I refer column where only one element exists, i.e. only possible values are
> '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}',
> '{10}'. That is a corner case. But similar situation occurs when, for
> example, we've distribution of distinct element count between 1 and 3. It
> significantly differs from distribution from independent occurence.

Oh, I think I see now.  If each element 1..10 had frequency 0.1 independently,
column values would have exactly one distinct element just 39% of the time
(10 * 0.1 * 0.9^9 ~= 0.39)?

If probability theory has a prototypical problem resembling this, it would be
nice to include a URL to a thorough discussion thereof.  I could not think of
the search terms to find one, though.

> > > +  * have probabilities of length of 0, 1, 2 etc. In the "column @>
> const"
> > > +  * and "column && const" cases we usually have "const" with low
> summary
> > > +  * frequency of elements (otherwise we have selectivity close to 0 or
> 1
> > > +  * correspondingly). That's why effect of dependence related to
> lengths
> > > +  * distribution id negligible there. In the "column <@ const" case
> summary
> > > +  * frequency of elements is high (otherwise we have selectivity close
> to 0).
> >
> > What does the term "summary frequency" denote?
>
> I meant summ of frequences of "const" array elements.

Do you mean literally P_0 + P_1 ... + P_N?  If so, I can follow the above
argument for "column && const" and "column <@ const", but not for "column @>
const".  For "column @> const", selectivity cannot exceed the smallest
frequency among elements of "const".  Several high-frequency elements together
will drive up the sum of frequencies without increasing the true selectivity.

> > > +     /*
> > > +      * Rest is a average length of elements which aren't present in
> mcelem.
> > > +      */
> > > +     rest = avg_length;
> >
> > You define "rest" here as an array length ...

> > > +                             rest -= numbers[mcelem_index];
> >
> > ... But here, you're subtracting a frequency from an array length?
> >
>
> Yes, because average distinct element count is summ of frequencies of
> elements. Substracting mcelem frequencies from avg_length we have summ of
> frequencies of non-mcelem elements.

I see now; thanks.  I updated the comments so that this would have been
clearer to me.

> *** /dev/null
> --- b/src/backend/utils/adt/array_selfuncs.c

> + Selectivity
> + calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause,
> +                     Oid operator)
> + {
> +     Oid            elemtype;
> +     Selectivity selec;
> +     TypeCacheEntry *typentry;
> +     Datum       *hist;
> +     int            nhist;
> +     FunctionCallInfoData cmpfunc;
> +
> +     elemtype = get_base_element_type(vardata->vartype);
> +
> +
> +     /* Get default comparison function */
> +     typentry = lookup_type_cache(elemtype,
> +            TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR);
> +
> +     /* Handle only "=" operator. Return default selectivity in other cases. */
> +     if (operator != typentry->eq_opr)
> +         return (Selectivity) 0.5;

Punting on other operators this way creates a plan quality regression for
operations like "const < ANY (column)".  Please do it some way that falls
back on the somewhat-better existing scalararraysel() treatment for this.

> +
> +     /* Without comparison function return default selectivity estimation */
> +     if (!OidIsValid(typentry->cmp_proc))
> +     {
> +         if (orClause)
> +             return DEFAULT_OVERLAP_SEL;
> +         else
> +             return DEFAULT_CONTAIN_SEL;
> +     }

Since "const = ANY (column)" is equivalent to "column @> array[const]" and
"const = ALL (column)" is equivalent to "column <@ array[const]",
DEFAULT_CONTAIN_SEL is always correct here.  I've made that change.

> + /*
> +  * Calculate first n distinct element counts probabilities by histogram. We
> +  * assume that any interval between a and b histogram values gives
> +  * 1 / ((b - a + 1) * (nhist - 1)) probability to values between a and b and
> +  * half of that to a and b. Returns total probability that distinct element
> +  * count is less of equal to n.
> +  */
> + static float
> + calc_hist(Datum *hist, int nhist, float *hist_part, int n)

To test this function, I ran the following test case:

set default_statistics_target = 4;
create table t3 as select array(select * from generate_series(1, v)) as arr
from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100);
analyze t3; -- length_histogram_bounds = {2,2,5,5}
select * from t3 where arr <@ array[6,7,8,9,10,11];

Using gdb to observe calc_hist()'s result during the last command:

(gdb) p calc_hist(hist, nhist, hist_part, unique_nitems)
$23 = 0.666666687
(gdb) x/6f hist_part
0xcd4bc8:       0       0       0.333333343     0
0xcd4bd8:       0       0.333333343

I expected an equal, nonzero probability in hist_part[3] and hist_part[4] and
a total probability of 1.0.

> + {
> +     int            k,
> +                 i = 0,
> +                 prev_interval = 0,
> +                 next_interval = 0;
> +     float        frac,
> +                 total = 0.0f;
> +
> +     /*
> +      * frac is a probability contribution by each interval between histogram
> +      * values. We have nhist - 1 intervals. Contribution of one will be 1 /
> +      * (nhist - 1).
> +      */
> +     frac = 1.0f / ((float) (nhist - 1));
> +     for (k = 0; k <= n; k++)
> +     {
> +         int            count = 0;
> +
> +         /* Count occurences of k distinct element counts in histogram. */
> +         while (i < nhist && DatumGetInt32(hist[i]) <= k)
> +         {
> +             if (DatumGetInt32(hist[i]) == k)
> +                 count++;
> +             i++;
> +         }
> +
> +         if (count > 0)
> +         {
> +             float        val;
> +
> +             /* Find length between current histogram value and the next one */
> +             if (i < nhist)
> +                 next_interval = DatumGetInt32(hist[i + 1]) -

Doesn't this read past the array end when i == nhist - 1?

> + /*
> +  * Let be n independent events with probabilities p. This function calculates
> +  * probabilities of exact k of events occurence for k in [0;m].
> +  * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes
> +  * probability that exact j of first i events occurs. Obviously M[0,0] = 1.
> +  * Each next event increase total number of occured events if it occurs and
> +  * leave last value of that number if it doesn't occur. So, by the law of
> +  * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
> +  * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
> +  * Also there could be some events with low probabilities. Their summary
> +  * probability passed in the rest parameter.
> +  */
> + static float *
> + calc_distr(float *p, int n, int m, float rest)

I attempted to clarify this comment; please see if I preserved its accuracy.

> +     /*
> +      * Using of distinct element counts histogram requires O(nitems * (nmcelem
> +      * + nitems)) operations. It's reasonable to limit the number of required
> +      * operation and give less accurate answer when this limit exceed.
> +      */
> +     if (nhist > 0 && unique_nitems <=
> +         300 * default_statistics_target / (nmcelem + unique_nitems))

I benchmarked the quadratic complexity here.  With default settings, this
cutoff skips the algorithm beginning around a 170-element constant array,
which would nonetheless take single-digit milliseconds to plan.  When I
temporarily removed the cutoff, I needed much larger scales to get poor plan
times.  5000 elements took 180ms to plan, and 10000 elements took 620 ms.
Bottom line, the cutoff you've chosen is plenty conservative.

> + static int
> + element_compare(const void *key1, const void *key2, void *arg)
> + {
> +     const Datum *d1 = (const Datum *) key1;
> +     const Datum *d2 = (const Datum *) key2;
> +     FunctionCallInfo cmpfunc = (FunctionCallInfo) arg;
> +
> +     cmpfunc   ->arg[0] = *d1;
> +     cmpfunc   ->arg[1] = *d2;
> +     cmpfunc   ->argnull[0] = false;
> +     cmpfunc   ->argnull[1] = false;
> +     cmpfunc   ->isnull = false;

This indented poorly due to "cmpfunc" having a place in our typedefs list.  I
changed the identifier.

> *** /dev/null
> --- b/src/backend/utils/adt/array_typanalyze.c

> +  *    We set s to be the estimated frequency of the K'th element in a natural
> +  *    language's frequency table, where K is the target number of entries in
> +  *    the MCELEM array. We assume that the distribution of element frequencies
> +  *    follows Zipf's law with an exponent of 1.
> +  *
> +  *    Assuming Zipfian distribution, the frequency of the K'th element is equal
> +  *    to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
> +  *    elements in the language.    Putting W as one million, we get roughly
> +  *    0.07/K. This gives s = 0.07/K.    We set epsilon = s/10, which gives bucket
> +  *    width w = K/0.007 and maximum expected hashtable size of about 1000 * K.

Given the lack of applicability to arrays, I replaced these last two
paragraphs with some weasel words.  My gut feeling is that we're priming the
algorithm to deliver answers far more precise than needed.  However, I haven't
attempted a principled replacement.

> +     /* This is 'w' from the LC algorithm */
> +     int            bucket_width;
> +     int            array_no,
> +                 element_no;

I think it's possible for element_no to overflow.  Consider rows with 2000
distinct elements apiece at a statistics target of 10000 (3M sample rows):
2000 * 3,000,000 = 6e9, which exceeds the range of a signed 32-bit counter.
So, I made it a uint64.

> +     extra_data = (ArrayAnalyzeExtraData *) stats->extra_data;

This still isn't reentrant; you'd need to save the existing static extra_data
and restore it on function exit.  However, it turns out that do_analyze_rel()
itself isn't reentrant on account of its similar management of "anl_context";
any nested ANALYZE crashes the backend.  So, I don't think we need further
change here.  It will be easy to make reentrant later if necessary, though I'd
probably fix do_analyze_rel() by just throwing an error on recursive ANALYZE.

> +     stats->extra_data = extra_data->std_extra_data;
> +     old_context = CurrentMemoryContext;
> +     extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows);
> +     MemoryContextSwitchTo(old_context);

Is the callee known to change CurrentMemoryContext and not restore it?
Offhand, I'm not seeing how it could do so.

> +     /*
> +      * hashtable for arrays distinct element count.
> +      */
> +     MemSet(&count_hash_ctl, 0, sizeof(count_hash_ctl));
> +     count_hash_ctl.keysize = sizeof(int);
> +     count_hash_ctl.entrysize = sizeof(DistinctElementCountItem);
> +     count_hash_ctl.hash = tag_hash;
> +     count_hash_ctl.match = memcmp;
> +     count_hash_ctl.hcxt = CurrentMemoryContext;
> +     count_tab = hash_create("Array distinct element count table",
> +                             64,
> +                             &count_hash_ctl,
> +                     HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);

This HASH_COMPARE setting is redundant, so I've removed it.

> +         /* Skip too large values. */
> +         if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)

Fixed this warning:

array_typanalyze.c: In function `compute_array_stats':
array_typanalyze.c:361: warning: implicit declaration of function `toast_raw_datum_size'

> +             continue;
> +         else
> +             analyzed_rows++;
> +
> +         /*
> +          * Add up widths for average-width calculation.  Since it's a array,
> +          * we know it's varlena.  As in the regular compute_minimal_stats
> +          * function, we use the toasted width for this calculation.
> +          */
> +         total_width += VARSIZE_ANY(DatumGetPointer(value));

Since this is now unused, I removed it.

> +             /* Lookup current element in hashtable, adding it if new */
> +             item = (TrackItem *) hash_search(elements_tab,
> +                                              (const void *) &hash_key,
> +                                              HASH_ENTER, &found);
> +
> +             if (found)
> +             {

I added a pfree(hash_key) here.  In one of my default_statistics_target=3000
tests on a table with few possible elements, this saved hundreds of megabytes
of memory.

> +                 int            i;
> +
> +                 /*
> +                  * The element is already on the tracking list. Check if it's
> +                  * first occurence of this element in array.
> +                  */
> +                 for (i = 0; i < occurence_index; i++)
> +                 {
> +                     if (occurences[i] == item)
> +                         break;
> +                 }

This wasn't what I had in mind when I suggested the different approach last
time.  See how I changed it in this version, and let me know if you see any
essential disadvantages.
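
One way to avoid the per-element scan over the occurrences array is to stamp each TrackItem with the array it was last counted in, making the duplicate check O(1).  A minimal sketch of that pattern (field and variable names here are illustrative; the attached version may differ in details):

    item = (TrackItem *) hash_search(elements_tab,
                                     (const void *) &hash_key,
                                     HASH_ENTER, &found);
    if (found)
    {
        /* Seen before; count it at most once per array. */
        if (item->last_container == array_no)
            continue;
        item->frequency++;
        item->last_container = array_no;
    }
    else
    {
        /* New element value: initialize its tracking entry. */
        item->frequency = 1;
        item->last_container = array_no;
    }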

> +         /* Update frequency of particular array distinct element count. */
> +         count_item = (DistinctElementCountItem *) hash_search(count_tab,
> +                                                             &occurence_index,
> +                                               HASH_ENTER, &count_item_found);
> +         if (count_item_found)
> +             count_item->frequency++;
> +         else
> +         {
> +             count_item->count = occurence_index;

The key gets initialized automatically, so I removed this line.

> +             count_item->frequency = 1;
> +         }
> +         total_distinct_count += occurence_index;

total_distinct_count seemed to follow element_no exactly, so I removed it.

> *** a/src/include/catalog/pg_statistic.h
> --- b/src/include/catalog/pg_statistic.h

> ***************
> *** 260,263 **** typedef FormData_pg_statistic *Form_pg_statistic;
> --- 268,285 ----
>    */
>   #define STATISTIC_KIND_MCELEM  4
>
> + /*
> +  * A "length histogram" slot describes the distribution of lengths of data for
> +  * datatypes where length is important for selectivity estimation. stavalues
> +  * contains M (>=2) non-null values that divide the non-null column data values
> +  * into M-1 bins of approximately equal population. The first stavalues item
> +  * is the minimum length and the last is the maximum length. In dependence on
> +  * datatype this slot can hold distribution of not exactly length, but of
> +  * similar value. For instance, it hold distribution of distinct elements count
> +  * for arrays, because multiple occurences of array elements are ignored by
> +  * array comparison operators.
> +  *
> +  */
> + #define STATISTIC_KIND_LENGTH_HISTOGRAM  5

I changed this text to say that we always store distinct element counts.  We
can always update the comment later if we diversify its applications.

> *** a/src/include/catalog/pg_type.h
> --- b/src/include/catalog/pg_type.h

This now updates all array types except record[].  I don't know offhand how
to even make a non-empty value of type record[], let alone get it into a
context where ANALYZE would see it.  However, is there a particular reason to
make that one different?

Thanks,
nm

Attachment

Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
Hi!

Thanks for your fixes to the patch. They look correct to me. I made some further fixes in the patch. Proofs of some of the concepts are still needed; I'm going to provide them in a few days.

On Thu, Jan 12, 2012 at 3:06 PM, Noah Misch <noah@leadboat.com> wrote:
> I'm not sure about shared lossy counting module, because part of shared
> code would be relatively small. Part of compute_array_stats function which
> is taking care about array decompression, distinct occurence calculation,
> disting element count histogram, packing statistics slots etc is much
> larger than lossy counting algorithm itself. May be, there is some other
> opinions in community?

True; it would probably increase total lines of code.  The benefit, if any,
lies in separation of concerns; the business of implementing this algorithm is
quite different from the other roles of these typanalyze functions.  I won't
insist that you try it, though.
I'd prefer to try it as a separate patch.
 
> > > +             /*
> > > +              * The probability of no occurence of events which forms
> "rest"
> > > +              * probability have a limit of exp(-rest) when number of
> events fo to
> > > +              * infinity. Another simplification is to replace that
> events with one
> > > +              * event with (1 - exp(-rest)) probability.
> > > +              */
> > > +             rest = 1.0f - exp(-rest);
> >
> > What is the name of the underlying concept in probability theory?
>
> The most closest concept to caculated distribution is multinomial
> distribution. But it's not exactly same, because multinomial distribution
> gives probability of particular count of each event occurece, not
> probability of summary occurence. Actually, distribution is caclulated just
> from assumption of events independence. The most closest concept of rest
> probability is approximation by exponential distribution. It's quite rough
> approximation, but I can't invent something better with low calculation
> complexity.

Do you have a URL of a tutorial or paper that explains the method in more
detail?  If, rather, this is a novel synthesis, could you write a proof to
include in the comments?
Unfortunately I don't have a relevant paper for it. I'll try to find one; otherwise I'll try to write a proof.
 
> > > + /*
> > > +  * Array selectivity estimation based on most common elements
> statistics for
> > > +  * "column <@ const" case. Assumption that element occurences are
> independent
> > > +  * means certain distribution of array lengths. Typically real
> distribution
> > > +  * of lengths is significantly different from it. For example, if
> even we
> > > +  * have set of arrays with 1 integer element in range [0;10] each,
> element
> > > +  * occurences are not independent. Because in the case of
> independence we
> >
> > Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> > '{6,19,4}' cannot appear?
>
> I refer column where only one element exists, i.e. only possible values are
> '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}',
> '{10}'. That is a corner case. But similar situation occurs when, for
> example, we've distribution of distinct element count between 1 and 3. It
> significantly differs from distribution from independent occurence.

Oh, I think I see now.  If each element 1..10 had frequency 0.1 independently,
column values would have exactly one distinct element just 39% of the time?
Yes, it's right.
 
If probability theory has a prototypical problem resembling this, it would be
nice to include a URL to a thorough discussion thereof.  I could not think of
the search terms to find one, though.
Actually, using both the distinct element count histogram and the element occurrence frequencies produces a certain probability distribution (which is more complex than simple independent element occurrence). If the real distribution is close to this one, then the estimate is accurate. I haven't met such distributions in papers; I just invented it while trying to make the "column <@ const" estimate accurate at least in simple cases. I'll try to search for similar things in papers, but I doubt I'll succeed. Otherwise I'll try to write a more detailed proof.


> *** /dev/null
> --- b/src/backend/utils/adt/array_selfuncs.c

> + Selectivity
> + calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause,
> +                                     Oid operator)
> + {
> +     Oid                     elemtype;
> +     Selectivity selec;
> +     TypeCacheEntry *typentry;
> +     Datum      *hist;
> +     int                     nhist;
> +     FunctionCallInfoData cmpfunc;
> +
> +     elemtype = get_base_element_type(vardata->vartype);
> +
> +
> +     /* Get default comparison function */
> +     typentry = lookup_type_cache(elemtype,
> +                TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR);
> +
> +     /* Handle only "=" operator. Return default selectivity in other cases. */
> +     if (operator != typentry->eq_opr)
> +             return (Selectivity) 0.5;

Punting on other operators this way creates a plan quality regression for
operations like "const < ANY (column)".  Please do it some way that falls
back on the somewhat-better existing scalararraysel() treatment for this.
I've made calc_scalararraysel return -1 in this case, and also when there is no comparison function; scalararraysel continues with its existing estimation when calc_scalararraysel returns a negative value.
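Roughly, the caller side now works like this (a sketch of the idea, not the exact patch text):

    s1 = calc_scalararraysel(&vardata, ((Const *) other)->constvalue,
                             useOr, operator);
    ReleaseVariableStats(vardata);
    if (s1 >= 0.0)
        return s1;          /* statistics-based estimate succeeded */
    /* otherwise fall through to the generic per-element estimation */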
 
> + /*
> +  * Calculate first n distinct element counts probabilities by histogram. We
> +  * assume that any interval between a and b histogram values gives
> +  * 1 / ((b - a + 1) * (nhist - 1)) probability to values between a and b and
> +  * half of that to a and b. Returns total probability that distinct element
> +  * count is less of equal to n.
> +  */
> + static float
> + calc_hist(Datum *hist, int nhist, float *hist_part, int n)

To test this function, I ran the following test case:

set default_statistics_target = 4;
create table t3 as select array(select * from generate_series(1, v)) as arr
from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100);
analyze t3; -- length_histogram_bounds = {2,2,5,5}
select * from t3 where arr <@ array[6,7,8,9,10,11];

Using gdb to observe calc_hist()'s result during the last command:

(gdb) p calc_hist(hist, nhist, hist_part, unique_nitems)
$23 = 0.666666687
(gdb) x/6f hist_part
0xcd4bc8:       0       0       0.333333343     0
0xcd4bd8:       0       0.333333343

I expected an equal, nonzero probability in hist_part[3] and hist_part[4] and
a total probability of 1.0.

> + {
> +     int                     k,
> +                             i = 0,
> +                             prev_interval = 0,
> +                             next_interval = 0;
> +     float           frac,
> +                             total = 0.0f;
> +
> +     /*
> +      * frac is a probability contribution by each interval between histogram
> +      * values. We have nhist - 1 intervals. Contribution of one will be 1 /
> +      * (nhist - 1).
> +      */
> +     frac = 1.0f / ((float) (nhist - 1));
> +     for (k = 0; k <= n; k++)
> +     {
> +             int                     count = 0;
> +
> +             /* Count occurences of k distinct element counts in histogram. */
> +             while (i < nhist && DatumGetInt32(hist[i]) <= k)
> +             {
> +                     if (DatumGetInt32(hist[i]) == k)
> +                             count++;
> +                     i++;
> +             }
> +
> +             if (count > 0)
> +             {
> +                     float           val;
> +
> +                     /* Find length between current histogram value and the next one */
> +                     if (i < nhist)
> +                             next_interval = DatumGetInt32(hist[i + 1]) -

Doesn't this read past the array end when i == nhist - 1?
That was a bug; it also caused the incorrect histogram calculation you noted above. Fixed.
 
> +     stats->extra_data = extra_data->std_extra_data;
> +     old_context = CurrentMemoryContext;
> +     extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows);
> +     MemoryContextSwitchTo(old_context);

Is the callee known to change CurrentMemoryContext and not restore it?
Offhand, I'm not seeing how it could do so.
Right. Saving the memory context is not needed. Removed.

> *** a/src/include/catalog/pg_type.h
> --- b/src/include/catalog/pg_type.h

This now updates all array types except record[].  I'm don't know offhand how
to even make a non-empty value of type record[], let alone get it into a
context where ANALYZE would see it.  However, is there a particular reason to
make that one different?
Oh, I didn't manage to update all the array types in two tries :) Fixed.

------
With best regards,
Alexander Korotkov.
Attachment

Re: Collect frequency statistics for arrays

From
Noah Misch
Date:
On Tue, Jan 17, 2012 at 12:04:06PM +0400, Alexander Korotkov wrote:
> Thanks for your fixes to the patch. Them looks correct to me. I did some
> fixes in the patch. The proof of some concepts is still needed. I'm going
> to provide it in a few days.

Your further fixes look good.  Could you also answer my question about the
header comment of mcelem_array_contained_selec()?

/*
 * Estimate selectivity of "column <@ const" based on most common element
 * statistics.  Independent element occurrence would imply a particular
 * distribution of distinct element counts among matching rows.  Real data
 * usually falsifies that assumption.  For example, in a set of 1-element
 * integer arrays having elements in the range [0;10], element occurrences
 * are not independent.  If they were, a sufficiently-large set would
 * include all distinct element counts 0 through 11.  We correct for this
 * using the histogram of distinct element counts.
 *
 * In the "column @> const" and "column && const" cases, we usually have
 * "const" with low summary frequency of elements (otherwise we have
 * selectivity close to 0 or 1 correspondingly).  That's why the effect of
 * dependence related to distinct element counts distribution is negligible
 * there.  In the "column <@ const" case, summary frequency of elements is
 * high (otherwise we have selectivity close to 0).  That's why we should
 * do correction due to array distinct element counts distribution.
 */
 

By "summary frequency of elements", do you mean literally P_0 + P_1 ... + P_N?
If so, I can follow the above argument for "column && const" and "column <@
const", but not for "column @> const".  For "column @> const", selectivity
cannot exceed the smallest frequency among const elements.  A number of
high-frequency elements will drive up the sum of the frequencies without
changing the true selectivity much at all.

Thanks,
nm


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
Hi!

An updated patch is attached. I've updated the comment of mcelem_array_contained_selec with a more detailed description of the probability distribution assumption. Also, I found that the "rest" behaviour is better described by a Poisson distribution (intuitively, the count of occurring low-frequency elements is approximately Poisson, so the probability that none of them appears tends to exp(-rest)); relevant changes were made.

On Tue, Jan 17, 2012 at 2:33 PM, Noah Misch <noah@leadboat.com> wrote:
By "summary frequency of elements", do you mean literally P_0 + P_1 ... + P_N?
If so, I can follow the above argument for "column && const" and "column <@
const", but not for "column @> const".  For "column @> const", selectivity
cannot exceed the smallest frequency among const elements.  A number of
high-frequency elements will drive up the sum of the frequencies without
changing the true selectivity much at all.
Referring to summary frequency is not really correct. It would be more correct to refer to the number of elements in "const". When there are many elements in "const", "column @> const" selectivity tends to be close to 0 and "column <@ const" selectivity tends to be close to 1. Of course, that holds when the elements have middling frequencies (not very close to 0 and not very close to 1). I've replaced "summary frequency of elements" with "number of elements".

------
With best regards,
Alexander Korotkov.
Attachment

Re: Collect frequency statistics for arrays

From
Noah Misch
Date:
On Mon, Jan 23, 2012 at 01:21:20AM +0400, Alexander Korotkov wrote:
> Updated patch is attached. I've updated comment
> of mcelem_array_contained_selec with more detailed description of
> probability distribution assumption. Also, I found that "rest" behavious
> should be better described by Poisson distribution, relevant changes were
> made.

Thanks.  That makes more of the math clear to me.  I do not follow all of it,
but I feel that the comments now have enough information that I could go about
doing so.

> +     /* Take care about events with low probabilities. */
> +     if (rest > DEFAULT_CONTAIN_SEL)
> +     {

Why the change from "rest > 0" to this in the latest version?

> +         /* emit some statistics for debug purposes */
> +         elog(DEBUG3, "array: target # mces = %d, bucket width = %d, "
> +              "# elements = %llu, hashtable size = %d, usable entries = %d",
> +              num_mcelem, bucket_width, element_no, i, track_len);

That should be UINT64_FMT.  (I introduced that error in v0.10.)


I've attached a new version that includes the UINT64_FMT fix, some edits of
your newest comments, and a rerun of pgindent on the new files.  I see no
other issues precluding commit, so I am marking the patch Ready for Committer.
If I made any of the comments worse, please post another update.

Thanks,
nm

Attachment

Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch <noah@leadboat.com> wrote:
> +     /* Take care about events with low probabilities. */
> +     if (rest > DEFAULT_CONTAIN_SEL)
> +     {

Why the change from "rest > 0" to this in the latest version?
Earlier, the addition of the "rest" distribution required O(m) time. Now there is a more accurate and proven estimate, but it takes O(m^2) time. That doesn't make the overall asymptotic complexity worse, but it is significant. That's why I decided to skip it for low values of "rest", which don't change the distribution significantly.
 

> +             /* emit some statistics for debug purposes */
> +             elog(DEBUG3, "array: target # mces = %d, bucket width = %d, "
> +                      "# elements = %llu, hashtable size = %d, usable entries = %d",
> +                      num_mcelem, bucket_width, element_no, i, track_len);

That should be UINT64_FMT.  (I introduced that error in v0.10.)


I've attached a new version that includes the UINT64_FMT fix, some edits of
your newest comments, and a rerun of pgindent on the new files.  I see no
other issues precluding commit, so I am marking the patch Ready for Committer.
Great!
 
If I made any of the comments worse, please post another update.
The changes look reasonable to me. Thanks!

------
With best regards,
Alexander Korotkov.

Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch <noah@leadboat.com> wrote:
>> I've attached a new version that includes the UINT64_FMT fix, some edits of
>> your newest comments, and a rerun of pgindent on the new files.  I see no
>> other issues precluding commit, so I am marking the patch Ready for
>> Committer.
>> If I made any of the comments worse, please post another update.

> Changes looks reasonable for me. Thanks!

I am starting to look at this patch now.  I'm wondering exactly why the
decision was made to continue storing btree-style statistics for arrays,
in addition to the new stuff.  The pg_statistic rows for array columns
tend to be unreasonably wide already, and as-is this patch will make
them quite a lot wider.  I think it requires more than a little bit of
evidence to continue storing stats that seem to have only small
probability of usefulness.

In particular, if we didn't store that stuff, we'd not need to widen the
number of columns in pg_statistic, which would noticeably reduce both
the footprint of the patch and the probability of breaking external
code.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I am starting to look at this patch now.  I'm wondering exactly why the
decision was made to continue storing btree-style statistics for arrays,
in addition to the new stuff.  The pg_statistic rows for array columns
tend to be unreasonably wide already, and as-is this patch will make
them quite a lot wider.  I think it requires more than a little bit of
evidence to continue storing stats that seem to have only small
probability of usefulness.

In particular, if we didn't store that stuff, we'd not need to widen the
number of columns in pg_statistic, which would noticeably reduce both
the footprint of the patch and the probability of breaking external
code.

Initially, I used existing slots for the new statistics, but I changed this after the first review.

Perhaps btree statistics really do matter for some kinds of arrays? For example, arrays representing paths in a tree: we could request a subtree with a range query on such arrays.

------
With best regards,
Alexander Korotkov.

Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I am starting to look at this patch now.  I'm wondering exactly why the
>> decision was made to continue storing btree-style statistics for arrays,

> Probably, btree statistics really does matter for some sort of arrays? For
> example, arrays representing paths in the tree. We could request a subtree
> in a range query on such arrays.

That seems like a pretty narrow, uncommon use-case.  Also, to get
accurate stats for such queries that way, you'd need really enormous
histograms.  I doubt that the existing parameters for histogram size
will permit meaningful estimation of more than the first array entry
(since we don't make the histogram any larger than we do for a scalar
column).

The real point here is that the fact that we're storing btree-style
stats for arrays is an accident, backed into by having added btree
comparators for arrays plus analyze.c's habit of applying default
scalar-oriented analysis functions to any type without an explicit
typanalyze entry.  I don't recall that we ever thought hard about
it or showed that those stats were worth anything.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I am starting to look at this patch now.  I'm wondering exactly why the
>> decision was made to continue storing btree-style statistics for arrays,

> Probably, btree statistics really does matter for some sort of arrays? For
> example, arrays representing paths in the tree. We could request a subtree
> in a range query on such arrays.

That seems like a pretty narrow, uncommon use-case.  Also, to get
accurate stats for such queries that way, you'd need really enormous
histograms.  I doubt that the existing parameters for histogram size
will permit meaningful estimation of more than the first array entry
(since we don't make the histogram any larger than we do for a scalar
column).

The real point here is that the fact that we're storing btree-style
stats for arrays is an accident, backed into by having added btree
comparators for arrays plus analyze.c's habit of applying default
scalar-oriented analysis functions to any type without an explicit
typanalyze entry.  I don't recall that we ever thought hard about
it or showed that those stats were worth anything.

OK, I don't object to removing btree stats from arrays.
What do you think about the pg_stats view in this case? Should it combine the values histogram and the array length histogram in a single column, as is done for MCV and MCELEM?

------
With best regards,
Alexander Korotkov.

Re: Collect frequency statistics for arrays

From
Nathan Boley
Date:
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
>> On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch <noah@leadboat.com> wrote:
>>> I've attached a new version that includes the UINT64_FMT fix, some edits of
>>> your newest comments, and a rerun of pgindent on the new files.  I see no
>>> other issues precluding commit, so I am marking the patch Ready for
>>> Committer.
>>> If I made any of the comments worse, please post another update.
>
>> Changes looks reasonable for me. Thanks!
>
> I am starting to look at this patch now.  I'm wondering exactly why the
> decision was made to continue storing btree-style statistics for arrays,
> in addition to the new stuff.

If I understand your suggestion, queries of the form

SELECT * FROM rel
WHERE ARRAY[ 1,2,3,4 ] <= x    AND x <=ARRAY[ 1, 2, 3, 1000];

would no longer use an index. Is that correct?

Are you suggesting removing MCV's in lieu of MCE's as well?

-Nathan


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Nathan Boley <npboley@gmail.com> writes:
> On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I am starting to look at this patch now.  I'm wondering exactly why the
>> decision was made to continue storing btree-style statistics for arrays,
>> in addition to the new stuff.

> If I understand your suggestion

> SELECT * FROM rel
> WHERE ARRAY[ 1,2,3,4 ] <= x
>      AND x <=ARRAY[ 1, 2, 3, 1000];

> would no longer use an index. Is that correct?

No, just that we'd no longer have statistics relevant to that, and would
have to fall back on default selectivity assumptions.  Do you think that
such applications are so common as to justify bloating pg_statistic for
everybody that uses arrays?
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Nathan Boley
Date:
On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Nathan Boley <npboley@gmail.com> writes:
>> On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I am starting to look at this patch now.  I'm wondering exactly why the
>>> decision was made to continue storing btree-style statistics for arrays,
>>> in addition to the new stuff.
>
>> If I understand your suggestion
>
>> SELECT * FROM rel
>> WHERE ARRAY[ 1,2,3,4 ] <= x
>>      AND x <=ARRAY[ 1, 2, 3, 1000];
>
>> would no longer use an index. Is that correct?
>
> No, just that we'd no longer have statistics relevant to that, and would
> have to fall back on default selectivity assumptions.

Which, currently, would mean queries of that form would typically use
a table scan, right?

> Do you think that
> such applications are so common as to justify bloating pg_statistic for
> everybody that uses arrays?

I have no idea, but it seems like it will be a substantial regression
for the people that are.

What about MCV's? Will those be removed as well?

Best,
Nathan


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Nathan Boley <npboley@gmail.com> writes:
> On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Nathan Boley <npboley@gmail.com> writes:
>>> If I understand your suggestion
>>> SELECT * FROM rel
>>> WHERE ARRAY[ 1,2,3,4 ] <= x
>>>      AND x <= ARRAY[ 1, 2, 3, 1000];
>>> would no longer use an index. Is that correct?

>> No, just that we'd no longer have statistics relevant to that, and would
>> have to fall back on default selectivity assumptions.

> Which, currently, would mean queries of that form would typically use
> a table scan, right?

No, it doesn't.

> What about MCV's? Will those be removed as well?

Sure.  Those seem even less useful.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
On Thu, Mar 1, 2012 at 1:19 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
That seems like a pretty narrow, uncommon use-case.  Also, to get
accurate stats for such queries that way, you'd need really enormous
histograms.  I doubt that the existing parameters for histogram size
will permit meaningful estimation of more than the first array entry
(since we don't make the histogram any larger than we do for a scalar
column).

The real point here is that the fact that we're storing btree-style
stats for arrays is an accident, backed into by having added btree
comparators for arrays plus analyze.c's habit of applying default
scalar-oriented analysis functions to any type without an explicit
typanalyze entry.  I don't recall that we ever thought hard about
it or showed that those stats were worth anything.

OK. I don't object to removing btree stats from arrays.
What do you think about the pg_stats view in this case? Should it combine the values histogram and the array length histogram in a single column, as is done for MCV and MCELEM?

Btree statistics for arrays and the additional statistics slot are removed in the attached version of the patch. The pg_stats view is untouched for now.

------
With best regards,
Alexander Korotkov.
Attachment

Re: Collect frequency statistics for arrays

From
Robert Haas
Date:
On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Nathan Boley <npboley@gmail.com> writes:
>> On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I am starting to look at this patch now.  I'm wondering exactly why the
>>> decision was made to continue storing btree-style statistics for arrays,
>>> in addition to the new stuff.
>
>> If I understand your suggestion
>
>> SELECT * FROM rel
>> WHERE ARRAY[ 1,2,3,4 ] <= x
>>      AND x <=ARRAY[ 1, 2, 3, 1000];
>
>> would no longer use an index. Is that correct?
>
> No, just that we'd no longer have statistics relevant to that, and would
> have to fall back on default selectivity assumptions.  Do you think that
> such applications are so common as to justify bloating pg_statistic for
> everybody that uses arrays?

I confess I am nervous about ripping this out.  I am pretty sure we
will get complaints about it.  Performance optimizations that benefit
group A at the expense of group B are always iffy, and I'm not sure
the case of using an array as a path indicator is as uncommon as you
seem to think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Collect frequency statistics for arrays

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of jue mar 01 12:00:08 -0300 2012:
> On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > No, just that we'd no longer have statistics relevant to that, and would
> > have to fall back on default selectivity assumptions.  Do you think that
> > such applications are so common as to justify bloating pg_statistic for
> > everybody that uses arrays?
>
> I confess I am nervous about ripping this out.  I am pretty sure we
> will get complaints about it.  Performance optimizations that benefit
> group A at the expense of group B are always iffy, and I'm not sure
> the case of using an array as a path indicator is as uncommon as you
> seem to think.

Maybe we should keep it as an option.  I do think it's quite uncommon,
but for those rare users, it'd be good to provide the capability while
not bloating everyone else's stat catalog.  The thing is, people using
arrays as path indicators and such are likely using relatively small
arrays; people storing real data are likely to store much bigger arrays.
Just a hunch though.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Robert Haas's message of jue mar 01 12:00:08 -0300 2012:
>> On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I confess I am nervous about ripping this out.  I am pretty sure we
>> will get complaints about it.  Performance optimizations that benefit
>> group A at the expense of group B are always iffy, and I'm not sure
>> the case of using an array as a path indicator is as uncommon as you
>> seem to think.

> Maybe we should keep it as an option.

How would we make it optional?  There's noplace I can think of to stick
such a knob ...
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of jue mar 01 18:51:38 -0300 2012:
>
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Excerpts from Robert Haas's message of jue mar 01 12:00:08 -0300 2012:
> >> On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I confess I am nervous about ripping this out.  I am pretty sure we
> >> will get complaints about it.  Performance optimizations that benefit
> >> group A at the expense of group B are always iffy, and I'm not sure
> >> the case of using an array as a path indicator is as uncommon as you
> >> seem to think.
>
> > Maybe we should keep it as an option.
>
> How would we make it optional?  There's noplace I can think of to stick
> such a knob ...

Uhm, attoptions?

"alter table foo alter column bar set extended_array_stats to on"
or something like that?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Collect frequency statistics for arrays

From
Nathan Boley
Date:
>> What about MCV's? Will those be removed as well?
>
> Sure.  Those seem even less useful.

Ya, this will destroy the performance of several queries without some
heavy tweaking.

Maybe this is bad design, but I've gotten in the habit of storing
sequences as arrays and I commonly join on them. I looked through my
code this morning, and I only have one 'range' query ( of the form
described up-thread ), but there are tons of the form

SELECT att1, attb2 FROM rela, relb where rela.seq_array_1 = relb.seq_array;

I can provide some examples if that would make my argument more compelling.

Sorry to be difficult,
Nathan


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Nathan Boley <npboley@gmail.com> writes:
> Maybe this is bad design, but I've gotten in the habit of storing
> sequences as arrays and I commonly join on them. I looked through my
> code this morning, and I only have one 'range' query ( of the form
> described up-thread ), but there are tons of the form

> SELECT att1, attb2 FROM rela, relb where rela.seq_array_1 = relb.seq_array;

What do you mean by "storing sequences as arrays"?  Can you demonstrate
that the existing stats are relevant at all to the query you're worried
about?
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Tom Lane's message of jue mar 01 18:51:38 -0300 2012:
>> How would we make it optional?  There's noplace I can think of to stick
>> such a knob ...

> Uhm, attoptions?

Oh, I had forgotten we had that mechanism already.  Yeah, that might
work.  I'm a bit tempted to design the option setting so that you can
select whether to keep the btree stats, the new stats, or both or
neither --- after all, there are probably plenty of databases where
nobody cares about the array-containment operators either.

That leaves the question of which setting should be the default ...
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Nathan Boley
Date:
[ sorry Tom, reply all this time... ]

> What do you mean by "storing sequences as arrays"?

So, a simple example is, for transcripts ( sequences of DNA that are
turned into proteins ), we store each of the connected components as
an array of the form:

exon_type in [1,6]
splice_type = [1,3]

and then the array elements are

[ exon_type, splice_type, exon_type ]

~ 99% of the elements are of the form [ [1,3], 1, [1,3] ],

so I almost always get a hash or merge join ( correctly ) but for the
rare junction types ( which are usually more interesting as well ) I
correctly get nest loops with an index scan.

> Can you demonstrate
> that the existing stats are relevant at all to the query you're worried
> about?

Well, if we didn't have mcv's and just relied on ndistinct to estimate
the '=' selectivities, either my low selectivity quals would use the
index, or my high selectivity quals would use a table scan, either of
which would be wrong.

I guess I could wipe out the stats and get some real numbers tonight,
but I can't see how the planner would be able to distinguish *without*
mcv's...

Best,
Nathan


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Still working through this patch ... there are some things that bother
me about the entries being made in pg_statistic:

1. You re-used STATISTIC_KIND_MCELEM for something that, while similar
to tsvector's usage, is not the same.  In particular, tsvector adds two
extra elements to the stanumbers array, but this patch adds four (and
doesn't bother documenting that in pg_statistic.h).  I don't think it's
acceptable to re-use the same stakind value for something that's not
following the same specification.  I see Nathan complained of this way,
way upthread, but nothing was done about it.

I think we should either assign a different stakind code for this
definition, or change things around so that tsvector actually is using
the same stats kind definition as arrays are.  (We could get away with
redefining the tsvector stats format, because pg_upgrade doesn't try to
copy pg_statistic rows to the new database.)  Now, of the two new
elements added by the patch, it seems to me to be perfectly reasonable
to add a null-element frequency to the kind specification; the fact that
it wasn't there to start with is kind of an oversight born of the fact
that tsvectors don't contain any null lexemes.  But the other new
element is the average distinct element count, which really does not
belong here at all, as it is *entirely* unrelated to element
frequencies.  It seems to me that that more nearly belongs in the
element-count histogram slot.  So my preference is to align the two
definitions of STATISTIC_KIND_MCELEM by adding a null-element frequency
to tsvector's usage (where it'll always be zero) and getting rid of the
average distinct element count here.

2. I think STATISTIC_KIND_LENGTH_HISTOGRAM is badly named and
confusingly documented.  The stats are not about anything I would call a
"length" --- rather we're considering the counts of numbers of distinct
element values present in each array value.  An ideal name perhaps would
be STATISTIC_KIND_DISTINCT_ELEMENTS_COUNT_HISTOGRAM, but of course
that's unreasonably long.  Considering the way that the existing stats
kind names are abbreviated, maybe STATISTIC_KIND_DECHIST would do.
Anybody have a better idea?

3. I also find it a bit odd that you chose to store the length (count)
histogram as an integer array in stavalues.  Usually we've put such data
in stanumbers.  That would make the entries float4 not integer, but that
doesn't seem unreasonable to me --- values would still be exact up to
2^24 or so on typical machines, and if we ever do have values larger
than that, it seems to me that having headroom to go above 2^32 would
be a good thing.  In any case, if we're going to move the average
distinct-element count over here, that would have to go into stanumbers.

Comments?
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
I wrote:
> ... So my preference is to align the two
> definitions of STATISTIC_KIND_MCELEM by adding a null-element frequency
> to tsvector's usage (where it'll always be zero) and getting rid of the
> average distinct element count here.

Actually, there's a way we can do this without code changes in the
tsvector stuff.  Since the number of MCELEM stanumber items that provide
frequencies of stavalue items is obviously equal to the length of
stavalues, we could define stanumbers as containing those matching
entries, then two min/max entries, then an *optional* entry for the
frequency of null elements (with the frequency presumed to be zero if
omitted).  This'd be non-ambiguous given access to stavalues.  I'm not
sure though if making the null frequency optional wouldn't introduce
complexity elsewhere that outweighs not having to touch the tsvector
code.
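
For illustration, a minimal sketch of how a consumer could decode that layout (an editorial example; this helper and its names are assumptions, not a PostgreSQL API):

    /*
     * stanumbers holds one frequency per stavalues entry, then the minimum
     * and maximum element frequencies, then an optional frequency of NULL
     * elements that is taken as zero when omitted.
     */
    static float
    mcelem_null_element_freq(const float *stanumbers, int nnumbers, int nvalues)
    {
        if (nnumbers == nvalues + 3)
            return stanumbers[nvalues + 2]; /* explicit null-element frequency */
        return 0.0f;                        /* omitted: presumed zero */
    }
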
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
... BTW, could you explain exactly how that "Fill histogram by hashtab"
loop works?  It's way too magic for my taste, and does in fact have bugs
in the currently submitted patch.  I've reworked it to this:
            /* Fill histogram by hashtab. */
            delta = analyzed_rows - 1;
            count_item_index = 0;
            frac = sorted_count_items[0]->frequency * (num_hist - 1);
            for (i = 0; i < num_hist; i++)
            {
                while (frac <= 0)
                {
                    count_item_index++;
                    Assert(count_item_index < count_items_count);
                    frac += sorted_count_items[count_item_index]->frequency * (num_hist - 1);
                }
                hist[i] = sorted_count_items[count_item_index]->count;
                frac -= delta;
            }
            Assert(count_item_index == count_items_count - 1);

The asserts don't fire in any test case I've tried, which seems to
indicate that it *does* work in the sense that the first histogram entry
is always the smallest count and the last histogram entry is always
the largest one.  But it's extremely unclear why it manages to stop
exactly at the last count_items array entry, or for that matter why it's
generating a representative histogram at all.  I'm suspicious that the
"-1" bits represent off-by-one bugs.

I also don't especially like the fact that "frac" is capable of
overflowing (since worst case frequency is 300 * 10000 and worst case
num_hist is 10000, with the current limits on statistics_target).
We could work around that by doing the frac arithmetic in int64, but I
wonder whether that couldn't be avoided.  In any case, first I'd like
an explanation why this code works at all.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> [ array statistics patch ]

I've committed this after a fair amount of editorialization.  There are
still some loose ends to deal with, but I felt it was ready to go into
the tree for wider testing.

The main thing I changed that wasn't in the nature of cleanup/bugfixing
was that I revised the effort-limiting logic in
mcelem_array_contained_selec.  The submitted code basically just punted
if the estimated work was too much, but as was already noted in
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01349.php
that can result in really bad estimates.  What I did instead is
something closer to Robert's original suggestion: trim the number of
element values taken into consideration from the array constant to a
value that fits within the desired effort limit.  If we consider just
the N most common values from the array constant, we still get a pretty
good estimate (since the trimmed N will still be close to 100 for the
values we're talking about).
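
A rough sketch of that trimming idea, in case it helps visualize it (an editorial example; EFFORT_BUDGET, the array names and the work estimate are assumptions, not the committed code):

    #include <stdlib.h>

    #define EFFORT_BUDGET 100000        /* illustrative work limit */

    static int
    cmp_freq_desc(const void *a, const void *b)
    {
        float   fa = *(const float *) a;
        float   fb = *(const float *) b;

        if (fa > fb)
            return -1;
        if (fa < fb)
            return 1;
        return 0;
    }

    /*
     * elem_freqs[] holds the estimated frequency of each distinct element of
     * the array constant.  Keep only the most common ones when considering
     * all of them would exceed the effort budget.
     */
    static int
    trim_const_elements(float *elem_freqs, int nelems, int num_hist)
    {
        int     max_elems = (num_hist > 0) ? EFFORT_BUDGET / num_hist : nelems;

        if (nelems <= max_elems)
            return nelems;
        qsort(elem_freqs, nelems, sizeof(float), cmp_freq_desc);
        return max_elems;               /* caller uses only the first max_elems entries */
    }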

I redid the tests in the above-mentioned message and see no cases where
the estimate is off by more than a factor of 2, and very few where it's
off by more than 20%, so this seems to work pretty well now.

The remaining loose ends IMO are:

1. I'm still unhappy about the loop that fills the count histogram,
as I noted earlier today.  It at least needs a decent comment and some
overflow protection, and I'm not entirely convinced that it doesn't have
more bugs than the overflow issue.

2. The tests in the above-mentioned message show that in most cases
where mcelem_array_contained_selec falls through to the "rough
estimate", the resulting rowcount estimate is just 1, ie we are coming
out with very small selectivities.  Although that path will now only be
taken when there are no stats, it seems like we'd be better off to
return DEFAULT_CONTAIN_SEL instead of what it's doing.  I think there
must be something wrong with the "rough estimate" logic.  Could you
recheck that?

3. As I mentioned yesterday, I think it'd be a good idea to make some
provisions to reduce the width of pg_statistic rows for array columns
by not storing the scalar-style and/or array-style stats, if the DBA
knows that they're not going to be useful for a particular column.
I have not done anything about that.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
1. I'm still unhappy about the loop that fills the count histogram,
as I noted earlier today.  It at least needs a decent comment and some
overflow protection, and I'm not entirely convinced that it doesn't have
more bugs than the overflow issue.

The attached patch is focused on fixing this. The "frac" variable overflow is avoided by making it int64. I hope the comments clarify things. In general, this loop copies the behaviour of the histogram-constructing loop in compute_scalar_stats. But instead of an array of values, we have an array of unique DECs and their frequencies.
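
For reference, a standalone sketch of the same bin-boundary selection with "frac" widened to int64 (an editorial example; the sample data and surrounding scaffolding are invented, only the loop mirrors the patch):

    #include <stdio.h>
    #include <stdint.h>
    #include <assert.h>

    int
    main(void)
    {
        /* 900 sample rows with 1 distinct element, 100 rows with 2 */
        int     counts[] = {1, 2};
        int     freqs[] = {900, 100};
        int     count_items_count = 2;
        int     analyzed_rows = 1000;
        int     num_hist = 10;
        int     hist[10];
        int     i, count_item_index = 0;
        int64_t frac, delta;

        delta = analyzed_rows - 1;
        frac = (int64_t) freqs[0] * (num_hist - 1);
        for (i = 0; i < num_hist; i++)
        {
            /* advance to the count value covering histogram position i */
            while (frac <= 0)
            {
                count_item_index++;
                assert(count_item_index < count_items_count);
                frac += (int64_t) freqs[count_item_index] * (num_hist - 1);
            }
            hist[i] = counts[count_item_index];
            frac -= delta;
        }
        assert(count_item_index == count_items_count - 1);

        for (i = 0; i < num_hist; i++)
            printf("%d ", hist[i]);     /* prints: 1 1 1 1 1 1 1 1 1 2 */
        printf("\n");
        return 0;
    }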

------
With best regards,
Alexander Korotkov.
Attachment

Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
2. The tests in the above-mentioned message show that in most cases
where mcelem_array_contained_selec falls through to the "rough
estimate", the resulting rowcount estimate is just 1, ie we are coming
out with very small selectivities.  Although that path will now only be
taken when there are no stats, it seems like we'd be better off to
return DEFAULT_CONTAIN_SEL instead of what it's doing.  I think there
must be something wrong with the "rough estimate" logic.  Could you
recheck that?

I think the problem with the "rough estimate" is that the assumption of independent item occurrences is unsuitable even for a rough estimate. The following example shows that the "rough estimate" really does work when item occurrences are independent.

Generate test table where item occurrences are really independent.

test=# create table test as select ('{'||(select string_agg(s,',') from (select case when (t*0 + random()) < 0.1 then i::text else null end from generate_series(1,100) i) as x(s))||'}')::int[] AS val  from generate_series(1,10000) t;

SELECT 10000

test=# analyze test;
ANALYZE

Do some test.

test=# explain analyze select * from test where val <@ array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..239.00 rows=151 width=61) (actual time=0.325..32.556 rows=163 loops=1)
   Filter: (val <@ '{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
   Rows Removed by Filter: 9837
 Total runtime: 32.806 ms
(4 rows)

Delete DECHIST statistics.

test=# update pg_statistic set stakind1 = 0, staop1 = 0, stanumbers1 = null, stavalues1 = null where starelid = (select oid from pg_class where relname = 'test') and stakind1 = 5;
UPDATE 0
test=# update pg_statistic set stakind2 = 0, staop2 = 0, stanumbers2 = null, stavalues2 = null where starelid = (select oid from pg_class where relname = 'test') and stakind2 = 5;
UPDATE 0
test=# update pg_statistic set stakind3 = 0, staop3 = 0, stanumbers3 = null, stavalues3 = null where starelid = (select oid from pg_class where relname = 'test') and stakind3 = 5;
UPDATE 0
test=# update pg_statistic set stakind4 = 0, staop4 = 0, stanumbers4 = null, stavalues4 = null where starelid = (select oid from pg_class where relname = 'test') and stakind4 = 5;
UPDATE 1
test=# update pg_statistic set stakind5 = 0, staop5 = 0, stanumbers5 = null, stavalues5 = null where starelid = (select oid from pg_class where relname = 'test') and stakind5 = 5;
UPDATE 0

Do another test.

test=# explain analyze select * from test where val <@ array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..239.00 rows=148 width=61) (actual time=0.332..32.952 rows=163 loops=1)
   Filter: (val <@ '{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
   Rows Removed by Filter: 9837
 Total runtime: 33.225 ms
(4 rows)

In this particular case the "rough estimate" is quite accurate. But in most cases it behaves really badly, which is why I started to invent calc_distr etc. So I think returning DEFAULT_CONTAIN_SEL is OK unless we have some better ideas.
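
As a back-of-the-envelope check of why the estimate is good here (an editorial sketch, assuming each of the 100 candidate elements appears in a row independently with probability 0.1): a row satisfies "val <@ {1..60}" exactly when none of the 40 elements outside the constant appear. A tiny program makes the arithmetic explicit:

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        double  p_absent = 0.9;             /* probability one element is absent from a row */
        double  sel = pow(p_absent, 40);    /* all 40 out-of-constant elements absent */

        printf("selectivity = %.4f, expected rows = %.0f\n",
               sel, sel * 10000);           /* ~0.0148 and ~148, close to the estimates above */
        return 0;
    }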

------
With best regards,
Alexander Korotkov.

Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. The tests in the above-mentioned message show that in most cases
>> where mcelem_array_contained_selec falls through to the "rough
>> estimate", the resulting rowcount estimate is just 1, ie we are coming
>> out with very small selectivities.  Although that path will now only be
>> taken when there are no stats, it seems like we'd be better off to
>> return DEFAULT_CONTAIN_SEL instead of what it's doing.  I think there
>> must be something wrong with the "rough estimate" logic.  Could you
>> recheck that?

> I think the wrong think with "rough estimate" is that assumption about
> independent occurrences of items is very unsuitable even for "rough
> estimate". The following example shows that "rough estimate" really works
> in the case of independent occurrences of items. ...
> It this particular case "rough estimate" is quite accurate. But in most
> part of cases it behaves really bad. It is why I started to invent
> calc_distr and etc. So, I think return DEFAULT_CONTAIN_SEL is OK unless
> we've some better ideas.

OK.  Looking again at that code, I notice that it also punts and returns
DEFAULT_CONTAIN_SEL if it's not given MCELEM stats, which it more or
less has to because without even a minfreq the whole calculation is just
hot air.  And there are no plausible scenarios where compute_array_stats
would produce an MCELEM slot but no count histogram.  So that says there
is no point in sweating over this case, unless you have an idea how to
produce useful results without MCELEM.

So I think it's sufficient to punt at the top of the function if no
histogram, and take out the various attempts to cope with the case.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 1. I'm still unhappy about the loop that fills the count histogram,
>> as I noted earlier today.  It at least needs a decent comment and some
>> overflow protection, and I'm not entirely convinced that it doesn't have
>> more bugs than the overflow issue.

> Attached patch is focused on fixing this. The "frac" variable overflow is
> evaded by making it int64. I hope comments is clarifying something. In
> general this loop copies behaviour of histogram constructing loop of
> compute_scalar_stats function. But instead of values array we've array of
> unique DEC and it's frequency.

OK, I reworked this a bit and committed it.  Thanks.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
BTW, one other thing about the count histogram: seems like we are
frequently generating uselessly large ones.  For instance, do ANALYZE
in the regression database and then run

select tablename,attname,elem_count_histogram from pg_stats where elem_count_histogram is not null;

You get lots of entries that look like this:
 pg_proc             | proallargtypes | {1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,6,6,2.80556}
 pg_proc             | proargmodes    | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.61111}
 pg_proc             | proargnames    | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,8,8,8,14,14,15,16,3.8806}
 pg_proc             | proconfig      | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
 pg_class            | reloptions     | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}

which seems to me to be a rather useless expenditure of space.
Couldn't we reduce the histogram size when there aren't many
different counts?

It seems fairly obvious to me that we could bound the histogram
size with (max count - min count + 1), but maybe something even
tighter would work; or maybe I'm missing something and this would
sacrifice accuracy.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
<div class="gmail_quote">On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane <span dir="ltr"><<a
href="mailto:tgl@sss.pgh.pa.us"target="_blank">tgl@sss.pgh.pa.us</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> BTW, one other thing about
thecount histogram: seems like we are<br /> frequently generating uselessly large ones.  For instance, do ANALYZE<br />
inthe regression database and then run<br /><br /> select tablename,attname,elem_count_histogram from pg_stats<br />
 whereelem_count_histogram is not null;<br /><br /> You get lots of entries that look like this:<br /><br />  pg_proc  
         | proallargtypes |
{1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,6,6,2.80556}<br
/> pg_proc             | proargmodes    |
{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.61111}<br
/> pg_proc             | proargnames    |
{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,8,8,8,14,14,15,16,3.8806}<br
/> pg_proc             | proconfig      |
{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}<br
/> pg_class            | reloptions     |
{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}<br
/><br/> which seems to me to be a rather useless expenditure of space.<br /> Couldn't we reduce the histogram size when
therearen't many<br /> different counts?<br /><br /> It seems fairly obvious to me that we could bound the histogram<br
/>size with (max count - min count + 1), but maybe something even<br /> tighter would work; or maybe I'm missing
somethingand this would<br /> sacrifice accuracy.<br /></blockquote><div class="gmail_quote"><br /></div><div
class="gmail_quote">True.If (max count - min count + 1) is small, enumerating of frequencies is both more compact and
moreprecise representation. Simultaneously, if (max count - min count + 1) is large, we can run out of
statistics_targetwith such representation. We can use same representation of count distribution as for scalar column
value: MCVand HISTOGRAM, but it would require additional statkind and statistics slot. Probably, you've better
ideas?</div><br/>------<br />With best regards,<br />Alexander Korotkov.</div> 

Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Couldn't we reduce the histogram size when there aren't many
>> different counts?
>> 
>> It seems fairly obvious to me that we could bound the histogram
>> size with (max count - min count + 1), but maybe something even
>> tighter would work; or maybe I'm missing something and this would
>> sacrifice accuracy.

> True. If (max count - min count + 1) is small, enumerating of frequencies
> is both more compact and more precise representation. Simultaneously,
> if (max count - min count + 1) is large, we can run out of
> statistics_target with such representation. We can use same representation
> of count distribution as for scalar column value: MCV and HISTOGRAM, but it
> would require additional statkind and statistics slot. Probably, you've
> better ideas?

I wasn't thinking of introducing two different representations,
but just trimming the histogram length when it's larger than necessary.

On reflection my idea above is wrong; for example assume that we have a
column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
what I said, we'd reduce the histogram to {1,2}, which might accurately
capture the set of lengths present but fails to show that 1 is much more
common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
would capture the situation perfectly in one-tenth the space that the
current logic does.

More generally, by limiting the histogram to statistics_target entries,
we are already accepting errors of up to 1/(2*statistics_target) in the
accuracy of the bin-boundary representation.  What the above example
shows is that sometimes we could meet the same accuracy requirement with
fewer entries.  I'm not sure how this could be mechanized but it seems
worth thinking about.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Noah Misch
Date:
On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > True. If (max count - min count + 1) is small, enumerating of frequencies
> > is both more compact and more precise representation. Simultaneously,
> > if (max count - min count + 1) is large, we can run out of
> > statistics_target with such representation. We can use same representation
> > of count distribution as for scalar column value: MCV and HISTOGRAM, but it
> > would require additional statkind and statistics slot. Probably, you've
> > better ideas?
> 
> I wasn't thinking of introducing two different representations,
> but just trimming the histogram length when it's larger than necessary.
> 
> On reflection my idea above is wrong; for example assume that we have a
> column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
> what I said, we'd reduce the histogram to {1,2}, which might accurately
> capture the set of lengths present but fails to show that 1 is much more
> common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> would capture the situation perfectly in one-tenth the space that the
> current logic does.

Granted.  When the next sample finds 899/101 instead, though, the optimization
vanishes.  You save 90% of the space, perhaps 10% of the time.  If you want to
materially narrow typical statistics, Alexander's proposal looks like the way
to go.  I'd guess array columns always having DEC <= default_statistics_target
are common enough to make that representation the dominant representation, if
not the only necessary representation.


Re: Collect frequency statistics for arrays

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
>> On reflection my idea above is wrong; for example assume that we have a
>> column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
>> what I said, we'd reduce the histogram to {1,2}, which might accurately
>> capture the set of lengths present but fails to show that 1 is much more
>> common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
>> would capture the situation perfectly in one-tenth the space that the
>> current logic does.

> Granted.  When the next sample finds 899/101 instead, though, the optimization
> vanishes.

No, you missed my next point.  That example shows that sometimes a
smaller histogram can represent the situation with zero error, but in
all cases a smaller histogram can represent the situation with perhaps
more error than a larger one.  Since we already have a defined error
tolerance, we should try to generate a histogram that is as small as
possible while still not exceeding the error tolerance.

Now, it might be that doing that is computationally impractical, or
too complicated to be reasonable.  But it seems to me to be worth
looking into.

> If you want to materially narrow typical statistics, Alexander's
> proposal looks like the way to go.  I'd guess array columns always
> having DEC <= default_statistics_target are common enough to make that
> representation the dominant representation, if not the only necessary
> representation.

Well, I don't want to have two representations; I don't think it's worth
the complexity.  But certainly we could consider switching to a
different representation if it seems likely to usually be smaller.
        regards, tom lane


Re: Collect frequency statistics for arrays

From
Noah Misch
Date:
On Thu, Mar 08, 2012 at 11:30:52AM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
> >> On reflection my idea above is wrong; for example assume that we have a
> >> column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
> >> what I said, we'd reduce the histogram to {1,2}, which might accurately
> >> capture the set of lengths present but fails to show that 1 is much more
> >> common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> >> would capture the situation perfectly in one-tenth the space that the
> >> current logic does.
> 
> > Granted.  When the next sample finds 899/101 instead, though, the optimization
> > vanishes.
> 
> No, you missed my next point.  That example shows that sometimes a
> smaller histogram can represent the situation with zero error, but in
> all cases a smaller histogram can represent the situation with perhaps
> more error than a larger one.  Since we already have a defined error
> tolerance, we should try to generate a histogram that is as small as
> possible while still not exceeding the error tolerance.
> 
> Now, it might be that doing that is computationally impractical, or
> too complicated to be reasonable.  But it seems to me to be worth
> looking into.

Yes, I did miss your point.

One characteristic favoring this approach is its equal applicability to both
STATISTIC_KIND_HISTOGRAM and STATISTIC_KIND_DECHIST.

> > If you want to materially narrow typical statistics, Alexander's
> > proposal looks like the way to go.  I'd guess array columns always
> > having DEC <= default_statistics_target are common enough to make that
> > representation the dominant representation, if not the only necessary
> > representation.
> 
> Well, I don't want to have two representations; I don't think it's worth
> the complexity.  But certainly we could consider switching to a
> different representation if it seems likely to usually be smaller.

Perhaps some heavy array users could provide input: what are some typical
length ranges among arrays in your applications on which you use "arr &&
const", "arr @> const" or "arr <@ const" searches?


Re: Collect frequency statistics for arrays

From
Alexander Korotkov
Date:
On Thu, Mar 8, 2012 at 4:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> True. If (max count - min count + 1) is small, enumerating of frequencies
> is both more compact and more precise representation. Simultaneously,
> if (max count - min count + 1) is large, we can run out of
> statistics_target with such representation. We can use same representation
> of count distribution as for scalar column value: MCV and HISTOGRAM, but it
> would require additional statkind and statistics slot. Probably, you've
> better ideas?

I wasn't thinking of introducing two different representations,
but just trimming the histogram length when it's larger than necessary.

On reflection my idea above is wrong; for example assume that we have a
column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
what I said, we'd reduce the histogram to {1,2}, which might accurately
capture the set of lengths present but fails to show that 1 is much more
common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
would capture the situation perfectly in one-tenth the space that the
current logic does.

More generally, by limiting the histogram to statistics_target entries,
we are already accepting errors of up to 1/(2*statistics_target) in the
accuracy of the bin-boundary representation.  What the above example
shows is that sometimes we could meet the same accuracy requirement with
fewer entries.  I'm not sure how this could be mechanized but it seems
worth thinking about.

I can propose the following representation of the histogram.

If (max_count - min_count + 1) <= statistics_target then
1) store max_count and min_count in stavalues
2) store the frequencies from min_count to max_count in stanumbers

If (max_count - min_count + 1) > statistics_target then
store the histogram in the current manner. I think in this case it's unlikely that there are many repeating values.

I can propose a patch which changes the histogram representation to this.
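
A rough sketch of what such an encoding could look like (an editorial illustration; the struct and names are assumptions, not from any patch):

    #include <stdbool.h>

    typedef struct CountDistribution
    {
        bool    enumerated;     /* true: one frequency per count value; false: histogram */
        int     min_count;      /* smallest distinct-element count seen */
        int     max_count;      /* largest distinct-element count seen */
        float  *freqs;          /* (max_count - min_count + 1) entries when enumerated */
    } CountDistribution;

    /* Use the enumerated form only when it fits within statistics_target. */
    static bool
    use_enumerated_form(int min_count, int max_count, int statistics_target)
    {
        return (max_count - min_count + 1) <= statistics_target;
    }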

Comments?

------
With best regards,
Alexander Korotkov.