Thread: Collect frequency statistics for arrays
Hi!
Here is an updated version of the patch. General list of changes since the reviewed version:
1) A distinct slot is used for the length histogram.
2) Standard statistics are now collected for arrays as well.
3) Most common values and most common elements are mapped to distinct columns of the pg_stats view, because both of them are calculated for arrays.
4) The description of the lossy counting algorithm was copied from compute_tsvector_stats() with corresponding changes.
5) Comments about the assumptions were added to the estimation functions.
Accuracy testing

The following files are attached:
datasets.sql - SQL script that generates the test datasets
arrayanalyze.php - PHP script that performs the accuracy testing
results.sql - dump of the table with the test results
As the test results show, the estimates are quite accurate in most test cases. When the length of the constant array exceeds 30, the estimate for "column <@ const" becomes very inaccurate for the array_test3 table. This happens because use of the length histogram is skipped there to avoid high CPU cost during estimation (see array_sel.c:888).
------
With best regards,
Alexander Korotkov.
Rebased with head.
With best regards,
Alexander Korotkov.
> Rebased with head.

FYI, I've added myself as the reviewer for the current commitfest.

Best,
Nathan Boley
Hi!
On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley <npboley@gmail.com> wrote:
> FYI, I've added myself as the reviewer for the current commitfest.

How is the review going now?
------
With best regards,
Alexander Korotkov.
On Tue, Dec 20, 2011 at 04:37:37PM +0400, Alexander Korotkov wrote:
> On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley <npboley@gmail.com> wrote:
> > FYI, I've added myself as the reviewer for the current commitfest.
>
> How is the review going now?

I will examine this patch within the week.
On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote: > Rebased with head. I took a look at this patch. The basic approach seems sensible. For array columns, ANALYZE will now determine the elements most often present in the column's arrays. It will also calculate a histogram from the counts of distinct elements in each array. That is to say, both newly-collected statistics treat {1,1,2,3,1} and {3,1,2} identically. New selectivity machinery uses the new statistics to optimize these operations: column @> const column <@ const column && const const = ANY (column) const = ALL (column) Concrete estimates look mostly sane, with a few corner cases I note below. We have many implementation details to nail down. With the patch applied, I get the attached regression.diffs from "make check". The palloc errors indicate bugs, but rules.out just needs a refresh. During ANALYZE, we'll now detoast all array column values regardless of size, just as we already do for tsvector columns. That may be reasonable enough for tsvector table/index columns, whose very presence is a hint that the user has planned to use the value for searching. Since arrays make no such implication, should we skip large arrays to constrain TOAST I/O? Say, skip arrays larger than 100 KiB or 1 MiB? I find distressing the thought of having two copies of the lossy sampling code, each implementing the algorithm with different variable names and levels of generality. We might someday extend this to hstore, and then we'd have yet another copy. Tom commented[1] that ts_typanalyze() and array_typanalyze() should remain distinct, and I agree. However, they could call a shared counting module. Is that practical? Possible API: typedef struct LossyCountCtl; LossyCountCtl *LossyCountStart(float s, float epsilon, int2 typlen, bool typbyval, Oid eqfunc); /* + hash func, a few others */ void LossyCountAdd(LossyCountCtl *ctl, Datum elem); TrackItem **LossyCountGetAll(LossyCountCtl *ctl); [1] http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us > *** a/doc/src/sgml/catalogs.sgml > --- b/doc/src/sgml/catalogs.sgml > *************** > *** 8253,8260 **** > <entry> > A list of the most common values in the column. (Null if > no values seem to be more common than any others.) > ! For some data types such as <type>tsvector</>, this is a list of > ! the most common element values rather than values of the type itself. > </entry> > </row> > > --- 8253,8261 ---- > <entry> > A list of the most common values in the column. (Null if > no values seem to be more common than any others.) > ! For some data types such as <type>arrays</> and <type>tsvector</>, > ! this is a list of the most common element values rather than values of > ! the type itself. > </entry> > </row> > > *************** > *** 8266,8274 **** > A list of the frequencies of the most common values or elements, > i.e., number of occurrences of each divided by total number of rows. > (Null when <structfield>most_common_vals</structfield> is.) > ! For some data types such as <type>tsvector</>, it can also store some > ! additional information, making it longer than the > ! <structfield>most_common_vals</> array. > </entry> > </row> > > --- 8267,8275 ---- > A list of the frequencies of the most common values or elements, > i.e., number of occurrences of each divided by total number of rows. > (Null when <structfield>most_common_vals</structfield> is.) > ! For some data types such as <type>arrays</> and <type>tsvector</>, > ! 
it can also store some additional information, making it longer than > ! the <structfield>most_common_vals</> array. > </entry> > </row> We're falsifying the above by splitting out that case into new columns most_common_elems and most_common_elem_freqs. > > *************** > *** 8284,8289 **** > --- 8285,8291 ---- > does not have a <literal><</> operator or if the > <structfield>most_common_vals</> list accounts for the entire > population.) > + For <type>arrays</>, it holds histogram bounds of array lengths. > </entry> > </row> Likewise: that's now in the new column length_histogram_bounds. We need documentation for the new pg_stats columns. Also, in particular, let's document the special entries at the end of most_common_freqs. > *** a/src/backend/catalog/system_views.sql > --- b/src/backend/catalog/system_views.sql > *************** > *** 117,145 **** CREATE VIEW pg_stats AS > stawidth AS avg_width, > stadistinct AS n_distinct, > CASE > ! WHEN stakind1 IN (1, 4) THEN stavalues1 > ! WHEN stakind2 IN (1, 4) THEN stavalues2 > ! WHEN stakind3 IN (1, 4) THEN stavalues3 > ! WHEN stakind4 IN (1, 4) THEN stavalues4 > END AS most_common_vals, > CASE > ! WHEN stakind1 IN (1, 4) THEN stanumbers1 > ! WHEN stakind2 IN (1, 4) THEN stanumbers2 > ! WHEN stakind3 IN (1, 4) THEN stanumbers3 > ! WHEN stakind4 IN (1, 4) THEN stanumbers4 > END AS most_common_freqs, > CASE > WHEN stakind1 = 2 THEN stavalues1 > WHEN stakind2 = 2 THEN stavalues2 > WHEN stakind3 = 2 THEN stavalues3 > WHEN stakind4 = 2 THEN stavalues4 > END AS histogram_bounds, > CASE > WHEN stakind1 = 3 THEN stanumbers1[1] > WHEN stakind2 = 3 THEN stanumbers2[1] > WHEN stakind3 = 3 THEN stanumbers3[1] > WHEN stakind4 = 3 THEN stanumbers4[1] > ! END AS correlation > FROM pg_statistic s JOIN pg_class c ON (c.oid = s.starelid) > JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum) > LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace) > --- 117,170 ---- > stawidth AS avg_width, > stadistinct AS n_distinct, > CASE > ! WHEN stakind1 = 1 THEN stavalues1 > ! WHEN stakind2 = 1 THEN stavalues2 > ! WHEN stakind3 = 1 THEN stavalues3 > ! WHEN stakind4 = 1 THEN stavalues4 > ! WHEN stakind5 = 1 THEN stavalues5 > END AS most_common_vals, > CASE > ! WHEN stakind1 = 1 THEN stanumbers1 > ! WHEN stakind2 = 1 THEN stanumbers2 > ! WHEN stakind3 = 1 THEN stanumbers3 > ! WHEN stakind4 = 1 THEN stanumbers4 > ! WHEN stakind5 = 1 THEN stanumbers5 > END AS most_common_freqs, > CASE > WHEN stakind1 = 2 THEN stavalues1 > WHEN stakind2 = 2 THEN stavalues2 > WHEN stakind3 = 2 THEN stavalues3 > WHEN stakind4 = 2 THEN stavalues4 > + WHEN stakind5 = 2 THEN stavalues5 > END AS histogram_bounds, > CASE > WHEN stakind1 = 3 THEN stanumbers1[1] > WHEN stakind2 = 3 THEN stanumbers2[1] > WHEN stakind3 = 3 THEN stanumbers3[1] > WHEN stakind4 = 3 THEN stanumbers4[1] > ! WHEN stakind5 = 3 THEN stanumbers5[1] > ! END AS correlation, > ! CASE > ! WHEN stakind1 = 4 THEN stavalues1 > ! WHEN stakind2 = 4 THEN stavalues2 > ! WHEN stakind3 = 4 THEN stavalues3 > ! WHEN stakind4 = 4 THEN stavalues4 > ! WHEN stakind5 = 4 THEN stavalues5 > ! END AS most_common_elems, > ! CASE > ! WHEN stakind1 = 4 THEN stanumbers1 > ! WHEN stakind2 = 4 THEN stanumbers2 > ! WHEN stakind3 = 4 THEN stanumbers3 > ! WHEN stakind4 = 4 THEN stanumbers4 > ! WHEN stakind5 = 4 THEN stanumbers5 > ! END AS most_common_elem_freqs, I think this is an improvement, but some code out there may rely on the ability to get stakind = 4 data from the most_common_vals column. 
We'll need to mention this in the release notes as an incompatibility. > ! CASE > ! WHEN stakind1 = 5 THEN stavalues1 > ! WHEN stakind2 = 5 THEN stavalues2 > ! WHEN stakind3 = 5 THEN stavalues3 > ! WHEN stakind4 = 5 THEN stavalues4 > ! WHEN stakind5 = 5 THEN stavalues5 > ! END AS length_histogram_bounds > FROM pg_statistic s JOIN pg_class c ON (c.oid = s.starelid) > JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum) > LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace) > *** a/src/backend/commands/typecmds.c > --- b/src/backend/commands/typecmds.c > *************** > *** 610,616 **** DefineType(List *names, List *parameters) > F_ARRAY_SEND, /* send procedure */ > typmodinOid, /* typmodin procedure */ > typmodoutOid, /* typmodout procedure */ > ! InvalidOid, /* analyze procedure - default */ > typoid, /* element type ID */ > true, /* yes this is an array type */ > InvalidOid, /* no further array type */ > --- 610,616 ---- > F_ARRAY_SEND, /* send procedure */ > typmodinOid, /* typmodin procedure */ > typmodoutOid, /* typmodout procedure */ > ! ArrayTypanalyzeOid, /* special analyze procedure for arrays */ > typoid, /* element type ID */ > true, /* yes this is an array type */ > InvalidOid, /* no further array type */ The recently-added function DefineRange() needs the same change. > *** /dev/null > --- b/src/backend/utils/adt/array_sel.c "array_selfuncs.c" would better match our informal convention. > *************** > *** 0 **** > --- 1,948 ---- > + /*------------------------------------------------------------------------- > + * > + * array_sel.c > + * Functions for selectivity estimation of array operators. > + * > + * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group > + * > + * > + * IDENTIFICATION > + * src/backend/tsearch/array_sel.c Fix file location. > + * > + *------------------------------------------------------------------------- > + */ > + > + #include "postgres.h" > + #include "access/hash.h" > + #include "catalog/pg_operator.h" > + #include "commands/vacuum.h" > + #include "utils/builtins.h" > + #include "utils/typcache.h" > + #include "utils/array.h" > + #include "catalog/pg_am.h" > + #include "catalog/pg_collation.h" > + #include "commands/defrem.h" > + #include "utils/lsyscache.h" > + #include "utils/selfuncs.h" Sort the includes alphabetically, except for postgres.h coming first. > + > + /* Default selectivity constant */ > + #define DEFAULT_CONT_SEL 0.005 > + > + /* Macro for selectivity estimation to be used if we have no statistics */ > + #define array_selec_no_stats(array,nitems,op) \ > + mcelem_array_selec(array, nitems, typentry, NULL, 0, NULL, 0, NULL, 0, op) > + > + Datum arraysel(PG_FUNCTION_ARGS); extern prototypes go in header files, even when it's not strictly necessary. 
> + > + static Selectivity calc_arraysel(VariableStatData *vardata, Datum constval, > + Oid operator); > + static Selectivity mcelem_array_selec(ArrayType *array, int nitems, TypeCacheEntry *typentry, > + Datum *mcelem, int nmcelem, float4 *numbers, int nnumbers, > + Datum *hist, int nhist, Oid operator); > + static int element_compare(const void *key1, const void *key2); > + bool find_next_mcelem(Datum *mcelem, int nmcelem, Datum value, int *index); > + static Selectivity mcelem_array_contain_overlap_selec(Datum *mcelem, > + int nmcelem, float4 *numbers, Datum *array_data, int nitems, Oid operator); > + static float calc_hist(Datum *hist, int nhist, float *hist_part, int n); > + static Selectivity mcelem_array_contained_selec(Datum *mcelem, int nmcelem, > + float4 *numbers, Datum *array_data, int nitems, Datum *hist, int nhist, > + Oid operator); > + static float *calc_distr(float *p, int n, int m, float rest); > + > + /* Compare function of element data type */ > + static FunctionCallInfoData cmpfunc; > + > + /* > + * selectivity for "const = ANY(column)" and "const = ALL(column)" > + */ > + Selectivity > + calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause) You have scalararraysel() calling this function for any operator (consider "const < ANY(column)"), but it only handles a single operator: the "=" of the default btree opclass used at ANALYZE time. We could start by just returning a constant selectivity for other operators, but we may be able to do better. If the actual operator is the equality operator we used at ANALYZE time (staop), use binary search on the mcelem array. Otherwise, use linear search, applying the operator to each MCE. (If the benefits justify the trouble, you could also use this strategy to support types with no default btree opclass.) > + { > + Oid elemtype; > + Selectivity selec; > + TypeCacheEntry *typentry; > + Datum *hist; > + int nhist; > + > + elemtype = get_base_element_type(vardata->vartype); > + > + /* Get default comparison function */ > + typentry = lookup_type_cache(elemtype, > + TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO); > + InitFunctionCallInfoData(cmpfunc, &typentry->cmp_proc_finfo, 2, > + DEFAULT_COLLATION_OID, NULL, NULL); The default btree operator class that existed at ANALYZE time may no longer exist. If we don't find a cmp function here, punt to avoid a crash. > + /* > + * arraysel -- restriction selectivity for "column @> const", "column && const" > + * and "column <@ const" > + */ > + Datum > + arraysel(PG_FUNCTION_ARGS) > + { > + PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0); > + > + Oid operator = PG_GETARG_OID(1); > + List *args = (List *) PG_GETARG_POINTER(2); > + int varRelid = PG_GETARG_INT32(3); > + VariableStatData vardata; > + Node *other; > + bool varonleft; > + Selectivity selec; > + Oid element_typeid; > + > + /* > + * If expression is not variable = something or something = variable, then > + * punt and return a default estimate. > + */ The operator is never "=". > + if (!get_restriction_variable(root, args, varRelid, > + &vardata, &other, &varonleft)) > + PG_RETURN_FLOAT8(DEFAULT_CONT_SEL); > + > + /* > + * Can't do anything useful if the something is not a constant, either. > + */ > + if (!IsA(other, Const)) > + { > + ReleaseVariableStats(vardata); > + PG_RETURN_FLOAT8(DEFAULT_CONT_SEL); > + } Previously, we defaulted to 0.005 for operator "&&" (via areasel()) and 0.001 for operators "@>" and "<@" (via contsel()). Surely some difference between those cases remains appropriate. 
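For instance, something along these lines (only a sketch; the helper name is hypothetical, and the operator OID macros are the ones your patch already uses):

#include "postgres.h"
#include "catalog/pg_operator.h"

/*
 * Hypothetical helper: fall back to the historical per-operator defaults
 * rather than a single DEFAULT_CONT_SEL for every array operator.
 */
static double
array_default_selectivity(Oid operator)
{
	switch (operator)
	{
		case OID_ARRAY_OVERLAP_OP:
			return 0.005;		/* what areasel() gave for "&&" */
		case OID_ARRAY_CONTAIN_OP:
		case OID_ARRAY_CONTAINED_OP:
			return 0.001;		/* what contsel() gave for "@>" and "<@" */
		default:
			return 0.005;
	}
}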
> + if (res == 0) > + { > + *index = i; > + return true; > + } > + else if (res < 0) > + { > + l = i + 1; > + } > + else > + { > + r = i - 1; > + } Throughout this patch, omit braces around single statements. > + } > + *index = l; > + return false; > + } > + > + /* > + * Array selectivity estimation based on most common elements statistics. > + */ > + static Selectivity > + mcelem_array_selec(ArrayType *array, int nitems, TypeCacheEntry *typentry, > + Datum *mcelem, int nmcelem, float4 *numbers, int nnumbers, Datum *hist, > + int nhist, Oid operator) > + { > + /* "column @> const" and "column && const" cases */ > + if (operator == OID_ARRAY_CONTAIN_OP || operator == OID_ARRAY_OVERLAP_OP) > + return mcelem_array_contain_overlap_selec(mcelem, nmcelem, numbers, > + array_data, nonnull_nitems, operator); > + > + /* "column <@ const" case */ > + if (operator == OID_ARRAY_CONTAINED_OP) > + return mcelem_array_contained_selec(mcelem, nmcelem, numbers, > + array_data, nonnull_nitems, hist, nhist, operator); > + return DEFAULT_CONT_SEL; Returning a fixed selectivity when this gets attached to an unexpected operator seems counterproductive. Let's just elog(ERROR) in that case. > + /* > + * Array selectivity estimation based on most common elements statistics for > + * "column @> const" and "column && const" cases. This estimation assumes > + * element occurences to be independent. > + */ > + static Selectivity > + mcelem_array_contain_overlap_selec(Datum *mcelem, int nmcelem, > + float4 *numbers, Datum *array_data, int nitems, Oid operator) We could take some advantage of the unique element count histogram for "@>". Any column value with fewer distinct elements than the constant array cannot match. We perhaps can't use both that fact and element frequency stats at the same time, but we could use the lesser of the two probabilities. For operator "&&" or operator "@>" with a nonempty constant array, no rows having empty arrays will match. Excepting the special case of "@>" with an empty constant array, we can multiply the MCE-derived selectivity by the fraction, based on the histogram, of nonempty arrays in the column. (For "<@", rows having empty arrays always match.) I don't think it's mandatory that the initial commit implement the above, but it's something to mention in the comments as a future direction. > + /* > + * Let be n independent events with probabilities p. This function calculates > + * probabilities of exact k of events occurence for k in [0;m]. > + * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes > + * probability that exact j of first i events occurs. Obviously M[0,0] = 1. > + * Each next event increase total number of occured events if it occurs and > + * leave last value of that number if it doesn't occur. So, by the law of > + * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i] > + * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0. > + * Also there could be some events with low probabilities. Their summary > + * probability passed in the rest parameter. > + */ > + static float * > + calc_distr(float *p, int n, int m, float rest) > + { > + /* Take care about events with low probabilities. */ > + if (rest > 0.0f) > + { > + /* > + * The probability of no occurence of events which forms "rest" > + * probability have a limit of exp(-rest) when number of events fo to > + * infinity. Another simplification is to replace that events with one > + * event with (1 - exp(-rest)) probability. 
> + */ > + rest = 1.0f - exp(-rest); What is the name of the underlying concept in probability theory? > + /* > + * Array selectivity estimation based on most common elements statistics for > + * "column <@ const" case. Assumption that element occurences are independent > + * means certain distribution of array lengths. Typically real distribution > + * of lengths is significantly different from it. For example, if even we > + * have set of arrays with 1 integer element in range [0;10] each, element > + * occurences are not independent. Because in the case of independence we Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but '{6,19,4}' cannot appear? > + * have probabilities of length of 0, 1, 2 etc. In the "column @> const" > + * and "column && const" cases we usually have "const" with low summary > + * frequency of elements (otherwise we have selectivity close to 0 or 1 > + * correspondingly). That's why effect of dependence related to lengths > + * distribution id negligible there. In the "column <@ const" case summary > + * frequency of elements is high (otherwise we have selectivity close to 0). What does the term "summary frequency" denote? > + * That's why we should do correction due to array lengths distribution. > + */ > + static Selectivity > + mcelem_array_contained_selec(Datum *mcelem, int nmcelem, float4 *numbers, > + Datum *array_data, int nitems, Datum *hist, int nhist, Oid operator) Break up the parameter list to avoid pgindent reverse-indenting the line. When I tried a constant array with duplicate elements, I saw an inappropriate report of 100% selectivity: [local] test=# create table t1 as select array[n % 2, n] as arr from generate_series(1,100000) t(n); SELECT 100000 [local] test=# analyze t1; WARNING: problem in alloc set Analyze: detected write past chunk end in block 0x7f189cbbf010, chunk 0x7f189cc1d0d8 ANALYZE [local] test=# explain select * from t1 where arr <@ '{0,45}'; QUERY PLAN -------------------------------------------------------- Seq Scan on t1 (cost=0.00..1986.00 rows=186 width=29) Filter: (arr <@ '{0,45}'::integer[]) (2 rows) [local] test=# explain select * from t1 where arr <@ '{0,45,45}'; QUERY PLAN ----------------------------------------------------------- Seq Scan on t1 (cost=0.00..1986.00 rows=100000 width=29) Filter: (arr <@ '{0,45,45}'::integer[]) (2 rows) By contrast, the estimate in the non-duplicate case looks sane considering these estimates for the individual elements: [local] test=# explain select * from t1 where 0 = any (arr); QUERY PLAN ---------------------------------------------------------- Seq Scan on t1 (cost=0.00..2986.00 rows=49967 width=29) Filter: (0 = ANY (arr)) (2 rows) [local] test=# explain select * from t1 where 45 = any (arr); QUERY PLAN -------------------------------------------------------- Seq Scan on t1 (cost=0.00..2986.00 rows=500 width=29) Filter: (45 = ANY (arr)) (2 rows) Incidentally, "const = ALL (column)" should be equivalent to "column <@ array[const]". (Assuming the "=" operator chosen in the first statement is the "=" operator of the array type's default btree opclass). 
However, I get significantly different estimates, with the latter getting a better estimate: [local] test=# explain select * from t1 where 1 = all (arr); QUERY PLAN ---------------------------------------------------------- Seq Scan on t1 (cost=0.00..2986.00 rows=18407 width=29) Filter: (1 = ALL (arr)) (2 rows) [local] test=# explain select * from t1 where arr <@ array[1]; QUERY PLAN ------------------------------------------------------ Seq Scan on t1 (cost=0.00..1986.00 rows=1 width=29) Filter: (arr <@ '{1}'::integer[]) (2 rows) > + /* > + * Rest is a average length of elements which aren't present in mcelem. > + */ > + rest = avg_length; You define "rest" here as an array length ... > + > + default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2); > + > + mcelem_index = 0; > + > + /* > + * mult is the multiplier that presents estimate of probability that each > + * mcelem which is not present in constant doesn't occur. > + */ > + mult = 1.0f; > + > + for (i = 0; i < nitems; i++) > + { > + bool found = false; > + > + /* Comparison with previous value in order to guarantee uniquness */ > + if (i > 0) > + { > + if (!element_compare(&array_data[i - 1], &array_data[i])) > + continue; > + } > + > + /* > + * Iterate over mcelem until find mcelem that is greater or equal to > + * element of constant. Simultaneously taking care about rest and > + * mult. If that mcelem is found then fill corresponding elem_selec. > + */ > + while (mcelem_index < nmcelem) > + { > + int cmp = element_compare(&mcelem[mcelem_index], &array_data[i]); > + > + if (cmp < 0) > + { > + mult *= (1.0f - numbers[mcelem_index]); > + rest -= numbers[mcelem_index]; ... But here, you're subtracting a frequency from an array length? > + /* > + * Comparison function for elements. Based on default comparison function for > + * array element data type. > + */ > + static int > + element_compare(const void *key1, const void *key2) > + { > + const Datum *d1 = (const Datum *) key1; > + const Datum *d2 = (const Datum *) key2; > + > + cmpfunc. arg[0] = *d1; > + cmpfunc. arg[1] = *d2; > + cmpfunc. argnull[0] = false; > + cmpfunc. argnull[1] = false; > + cmpfunc. isnull = false; > + > + return DatumGetInt32(FunctionCallInvoke(&cmpfunc)); > + } We could easily make this reentrant by passing the fcinfo through an argument and using qsort_arg(). Please do so. > *** /dev/null > --- b/src/backend/utils/adt/array_typanalyze.c > *************** > *** 0 **** > --- 1,834 ---- > + /*------------------------------------------------------------------------- > + * > + * array_typanalyze.c > + * functions for gathering statistics from array columns > + * > + * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group > + * > + * > + * IDENTIFICATION > + * src/backend/tsearch/array_typanalyze.c Fix file location. > + * > + *------------------------------------------------------------------------- > + */ > + > + #include "postgres.h" > + #include "access/hash.h" > + #include "catalog/pg_operator.h" > + #include "commands/vacuum.h" > + #include "commands/defrem.h" > + #include "parser/parse_oper.h" > + #include "utils/builtins.h" > + #include "utils/hsearch.h" > + #include "utils/typcache.h" > + #include "utils/array.h" > + #include "catalog/pg_am.h" > + #include "catalog/pg_collation.h" > + #include "utils/lsyscache.h" > + #include "utils/selfuncs.h" Alphabetize the includes after "postgres.h". > + > + #define ARRAY_ANALYZE_CHECK_OID(x) \ > + if (!OidIsValid(x)) \ > + { \ > + elog(INFO, "Can't collect statistics on %d array type. 
Array \ > + statistics collection requires default hash and btree \ > + opclasses for element type.", stats->attrtypid); \ > + stats->stats_valid = false; \ > + return; \ > + } No message is necessary: the absence of a btree opclass or of both opclasses degrades normal statistics, and we make no message about it then. The decision to abandon the stats at this point causes a minor regression for types having only a default hash opclass. They currently get scalar minimal stats; now they'll have none. See below for one way to avoid that. This macro is used in just one function, and the return statement means one wouldn't lightly use it anywhere else. Therefore, as a minor style point, I'd place its definition right with the function that uses it rather than at the head of the file. > + Datum array_typanalyze(PG_FUNCTION_ARGS); extern prototypes go in header files, even when it's not strictly necessary. > + /* > + * array_typanalyze -- a custom typanalyze function for array columns > + */ > + Datum > + array_typanalyze(PG_FUNCTION_ARGS) > + { > + VacAttrStats *stats = (VacAttrStats *) PG_GETARG_POINTER(0); > + Form_pg_attribute attr = stats->attr; > + Oid ltopr; > + Oid eqopr; > + StdAnalyzeData *mystats; > + > + /* If the attstattarget column is negative, use the default value */ > + /* NB: it is okay to scribble on stats->attr since it's a copy */ > + if (attr->attstattarget < 0) > + attr->attstattarget = default_statistics_target; > + > + /* Look for default "<" and "=" operators for column's type */ > + get_sort_group_operators(stats->attrtypid, > + false, false, false, > + <opr, &eqopr, NULL, > + NULL); > + > + /* If column has no "=" operator, we can't do much of anything */ > + if (!OidIsValid(eqopr)) > + return false; > + > + /* Save the operator info for compute_stats routines */ > + mystats = (StdAnalyzeData *) palloc(sizeof(StdAnalyzeData)); > + mystats->eqopr = eqopr; > + mystats->eqfunc = get_opcode(eqopr); > + mystats->ltopr = ltopr; > + stats->extra_data = mystats; Instead of duplicating this initialization and exporting StdAnalyzeData, compute_scalar_stats() and compute_minimal_stats() from analyze.c, I suggest structuring things as follows. Export only std_typanalyze() and call it here. If it returns false, return false here, too. Otherwise, proceed with the additional lookups you currently do in compute_array_stats(). If you don't find everything you need (default btree operator class, for example), just return true; the analysis will continue with the standard scalar approach as setup by std_typanalyze(). Otherwise, replace compute_stats and extra_data with your own materials. > + > + stats->compute_stats = compute_array_stats; > + /* see comment about the choice of minrows in commands/analyze.c */ > + stats->minrows = 300 * attr->attstattarget; > + > + PG_RETURN_BOOL(true); > + } > + > + /* > + * compute_array_stats() -- compute statistics for a array column > + * > + * This functions computes statistics that are useful for determining <@, &&, > + * @> operations selectivity, along with the fraction of non-null rows and > + * average width. > + * > + * As an addition to the the most common values, as we do for most datatypes, > + * we're looking for the most common elements and length histogram. In the > + * case of relatively long arrays it might be more useful, because there most > + * probably won't be any two rows with the same array and thus MCV has no > + * much sense. 
With a list of the most common elements we can do a better job > + * at figuring out <@, &&, @> selectivity. Arrays length histogram helps to > + * "column <@ const" to be more precise. The histogram addresses array distinct element counts, not array lengths. That's exactly what we need for the selectivity calculations in question. Please update the terminology throughout the patch, though. > + * > + * The algorithm used is Lossy Counting, as proposed in the paper "Approximate > + * frequency counts over data streams" by G. S. Manku and R. Motwani, in > + * Proceedings of the 28th International Conference on Very Large Data Bases, > + * Hong Kong, China, August 2002, section 4.2. The paper is available at > + * http://www.vldb.org/conf/2002/S10P03.pdf > + * > + * The Lossy Counting (aka LC) algorithm goes like this: > + * Let s be the threshold frequency for an item (the minimum frequency we > + * are interested in) and epsilon the error margin for the frequency. Let D > + * be a set of triples (e, f, delta), where e is an element value, f is that > + * element's frequency (actually, its current occurrence count) and delta is > + * the maximum error in f. We start with D empty and process the elements in > + * batches of size w. (The batch size is also known as "bucket size" and is > + * equal to 1/epsilon.) Let the current batch number be b_current, starting > + * with 1. For each element e we either increment its f count, if it's > + * already in D, or insert a new triple into D with values (e, 1, b_current > + * - 1). After processing each batch we prune D, by removing from it all > + * elements with f + delta <= b_current. After the algorithm finishes we > + * suppress all elements from D that do not satisfy f >= (s - epsilon) * N, > + * where N is the total number of elements in the input. We emit the > + * remaining elements with estimated frequency f/N. The LC paper proves > + * that this algorithm finds all elements with true frequency at least s, > + * and that no frequency is overestimated or is underestimated by more than > + * epsilon. Furthermore, given reasonable assumptions about the input > + * distribution, the required table size is no more than about 7 times w. > + * > + * We set s to be the estimated frequency of the K'th element in a natural > + * language's frequency table, where K is the target number of entries in > + * the MCELEM array. We assume that the distribution of element frequencies > + * follows Zipf's law with an exponent of 1. > + * > + * Assuming Zipfian distribution, the frequency of the K'th element is equal > + * to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of > + * elements in the language. Putting W as one million, we get roughly > + * 0.07/K. This gives s = 0.07/K. We set epsilon = s/10, which gives bucket > + * width w = K/0.007 and maximum expected hashtable size of about 1000 * K. These last two paragraphs, adapted from ts_typanalyze.c, assume natural language documents. To what extent do these parameter choices remain sensible for arbitrary data such as users may place in arrays? In any event, we need a different justification, even if it's just a hand-wavy justification. If I'm following this correctly, this choice of "s" makes the algorithm guaranteed to find only elements constituting >= 7% of the input elements. Incidentally, isn't that far too high for natural language documents? 
If the English word "the" only constitutes 7% of typical documents, then this "s" value would permit us to discard practically every word; we'd be left with words read while filling the last bucket and words that happened to repeat exceedingly often in the column. I haven't tried to make a test case to observe this problem; am I missing something? (This question is largely orthogonal to your patch.) > + * > + * Note: in the above discussion, s, epsilon, and f/N are in terms of a > + * element's frequency as a fraction of all elements seen in the input. > + * However, what we actually want to store in the finished pg_statistic > + * entry is each element's frequency as a fraction of all rows that it occurs > + * in. Elements might be repeated in the same array. Since operators > + * <@, &&, @> takes care only about element occurence itself and not about > + * occurence count, function takes additional care about uniqueness of > + * counting. Also we need to change the divisor from N to nonnull_cnt to get > + * the number we want. On the same tangent, why does ts_typanalyze() not deduplicate the same way? The @@ operator has the same property. > + */ > + static void > + compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc, > + int samplerows, double totalrows) > + { > + int num_mcelem; > + int null_cnt = 0; > + > + /* > + * We should count not only null array values, but also null array > + * elements > + */ > + int null_elem_cnt = 0.0; > + > + double total_width = 0.0; > + double total_length = 0.0; > + > + /* This is D from the LC algorithm. */ > + HTAB *elements_tab; > + HASHCTL elem_hash_ctl; > + HASH_SEQ_STATUS scan_status; > + > + /* This is the current bucket number from the LC algorithm */ > + int b_current; > + > + /* This is 'w' from the LC algorithm */ > + int bucket_width; > + int array_no, > + element_no; > + Datum hash_key; > + TrackItem *item; > + > + int lengths_count; > + int length_index; > + int slot_idx = 0; > + HTAB *length_tab; > + HASHCTL length_hash_ctl; > + LengthItem *length_item; > + LengthItem *sorted_length_tab; > + > + /* > + * Most part of array operators, which selectivity is estimated by this > + * statistics, takes care only one occurence of element in array. That's > + * why we should take care about count element occurence only once per > + * array. To clean occurence flag for each array by iterating over all > + * hash table can be too expensive. That's why we store pointers to hash > + * items contain elements which occur in last array. > + */ > + TrackItem **occurences = NULL; > + int occurences_count = 0, > + occurence_index; > + > + TypeCacheEntry *typentry; > + Oid hash_opclass, > + hash_opfamily, > + element_typeid, > + hash_oroc; > + FmgrInfo hash_func_info; > + > + StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data; > + > + /* Compute standard statistics */ > + if (OidIsValid(mystats->ltopr)) > + compute_scalar_stats(stats, fetchfunc, samplerows, totalrows); > + else > + compute_minimal_stats(stats, fetchfunc, samplerows, totalrows); > + > + > + /* Gathering all necessary information about element data type. 
*/ > + > + element_typeid = stats->attrtype->typelem; > + > + if (!OidIsValid(element_typeid)) > + elog(ERROR, "array_typanalyze was invoked with %d non-array type", > + stats->attrtypid); > + > + typentry = lookup_type_cache(element_typeid, TYPECACHE_EQ_OPR | > + TYPECACHE_CMP_PROC | TYPECACHE_EQ_OPR_FINFO | TYPECACHE_CMP_PROC_FINFO); > + ARRAY_ANALYZE_CHECK_OID(typentry->cmp_proc); > + ARRAY_ANALYZE_CHECK_OID(typentry->eq_opr); > + > + hash_opclass = GetDefaultOpClass(element_typeid, HASH_AM_OID); > + ARRAY_ANALYZE_CHECK_OID(hash_opclass); > + > + hash_opfamily = get_opclass_family(hash_opclass); > + ARRAY_ANALYZE_CHECK_OID(hash_opfamily); > + > + hash_oroc = get_opfamily_proc(hash_opfamily, element_typeid, > + element_typeid, HASHPROC); > + ARRAY_ANALYZE_CHECK_OID(hash_oroc); > + > + fmgr_info(hash_oroc, &hash_func_info); > + > + InitFunctionCallInfoData(element_type_info.cmp, &typentry->cmp_proc_finfo, > + 2, DEFAULT_COLLATION_OID, NULL, NULL); > + InitFunctionCallInfoData(element_type_info.eq, &typentry->eq_opr_finfo, > + 2, DEFAULT_COLLATION_OID, NULL, NULL); > + InitFunctionCallInfoData(element_type_info.hash, &hash_func_info, > + 1, DEFAULT_COLLATION_OID, NULL, NULL); > + element_type_info.typbyval = typentry->typbyval; As I mentioned above in passing, all of the above setup only needs to happen once per analyzed column, not once per tuple. It belongs in array_typanalyze; define a struct to hold all the looked-up state, including a pointer to any state for std_typanalyze(), and store that in stats->extra_data. This code should probably get around to using the new SortSupport infrastructure like analyze.c now uses. Not sure whether that's mandatory for initial commit. typanalyze functions that call arbitrary user code must be reentrant. It only matters in corner cases, but crashing in those corner cases is not acceptable. See our use of CurTupleHashTable in execGrouping.c and defend this global state in a similar fashion. > + > + /* > + * We want statistics_target * 10 elements in the MCELEM array. This > + * multiplier is pretty arbitrary, but is meant to reflect the fact that > + * the number of individual elements tracked in pg_statistic ought to be > + * more than the number of values for a simple scalar column. > + */ > + num_mcelem = stats->attr->attstattarget * 10; > + > + /* > + * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the > + * comment above. > + */ > + bucket_width = num_mcelem * 1000 / 7; The addend mentioned is not present in the code or discussed in "the comment above". (I see the comment is copied verbatim from ts_typanalyze(), where the addend *is* present, though again the preceding comment says nothing of it.) > + > + /* > + * Create the hashtable. It will be in local memory, so we don't need to > + * worry about overflowing the initial size. Also we don't need to pay any > + * attention to locking and memory management. > + */ > + MemSet(&elem_hash_ctl, 0, sizeof(elem_hash_ctl)); > + elem_hash_ctl.keysize = sizeof(Datum); > + elem_hash_ctl.entrysize = sizeof(TrackItem); > + elem_hash_ctl.hash = element_hash; > + elem_hash_ctl.match = element_match; > + elem_hash_ctl.hcxt = CurrentMemoryContext; > + elements_tab = hash_create("Analyzed elements table", > + bucket_width * 7, Though it's copied from compute_tsvector_stats(), why "7"? > + &elem_hash_ctl, > + HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT); > + > + /* > + * hashtable for arrays lengths. 
> + */ > + MemSet(&length_hash_ctl, 0, sizeof(length_hash_ctl)); > + length_hash_ctl.keysize = sizeof(int); > + length_hash_ctl.entrysize = sizeof(LengthItem); > + length_hash_ctl.hash = tag_hash; You need to pass the HASH_FUNCTION flag for this setting to take effect. > + length_hash_ctl.match = memcmp; Omit this. You would need to pass HASH_COMPARE for it to take effect, but it's also implicit once you override the hash function. > + length_hash_ctl.hcxt = CurrentMemoryContext; > + length_tab = hash_create("Array length table", > + 64, > + &length_hash_ctl, > + HASH_ELEM | HASH_CONTEXT); > + > + /* Initialize counters. */ > + b_current = 1; > + element_no = 0; > + > + /* Loop over the arrays. */ > + for (array_no = 0; array_no < samplerows; array_no++) > + { > + Datum value; > + bool isnull; > + bool null_present; > + ArrayType *array; > + char *ptr; > + bits8 *bitmap; > + int bitmask; > + int j; > + int ndims; > + int *dims; > + int nitems; > + bool length_found; > + > + vacuum_delay_point(); > + > + value = fetchfunc(stats, array_no, &isnull); > + > + /* > + * Check for null/nonnull. > + */ > + if (isnull) > + { > + null_cnt++; > + continue; > + } > + > + /* > + * Add up widths for average-width calculation. Since it's a array, > + * we know it's varlena. As in the regular compute_minimal_stats > + * function, we use the toasted width for this calculation. > + */ > + total_width += VARSIZE_ANY(DatumGetPointer(value)); > + > + /* > + * Now detoast the array if needed. > + */ > + array = DatumGetArrayTypeP(value); > + ptr = ARR_DATA_PTR(array); > + bitmap = ARR_NULLBITMAP(array); > + bitmask = 1; > + ndims = ARR_NDIM(array); > + dims = ARR_DIMS(array); > + nitems = ArrayGetNItems(ndims, dims); > + > + /* > + * Check if we have enough of memory to store element occurences in > + * one array. > + */ > + if (nitems > occurences_count) > + { > + occurences_count = 2 * nitems; > + if (occurences) > + occurences = (TrackItem **) repalloc(occurences, > + sizeof(TrackItem *) * occurences_count); > + else > + occurences = (TrackItem **) palloc( > + sizeof(TrackItem *) * occurences_count); > + } > + occurence_index = 0; > + > + null_present = false; > + > + /* > + * We loop through the elements in the array and add them to our > + * tracking hashtable. Note: the hashtable entries will point into > + * the (detoasted) array value, therefore we cannot free that storage > + * until we're done. > + */ The second sentence of this comment is obsolete. > + for (j = 0; j < nitems; j++) > + { > + bool found; > + bool isnull; > + > + /* Get elements, checking for NULL */ > + if (bitmap && (*bitmap & bitmask) == 0) > + { > + hash_key = (Datum) 0; > + isnull = true; > + null_present = true; > + } > + else > + { > + /* Get element value */ > + hash_key = fetch_att(ptr, typentry->typbyval, typentry->typlen); > + isnull = false; > + > + /* > + * We should allocate memory for element if it isn't passing > + * by value, because array will be freed after that loop. > + */ > + if (!typentry->typbyval) > + { > + Datum tmp; > + int length; > + > + length = (typentry->typlen == -1) ? > + VARSIZE(hash_key) : typentry->typlen; > + tmp = (Datum) MemoryContextAlloc(stats->anl_context, length); > + memcpy((void *) tmp, (void *) hash_key, length); > + hash_key = tmp; > + } Please use datumCopy() or add a comment about why it's insufficient. 
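For example, a sketch of what I have in mind (it assumes the surrounding variables from your loop plus utils/datum.h, and is not meant verbatim):

/*
 * Let datumCopy() handle by-reference elements, allocating in the analyze
 * context so the copy outlives the detoasted array.
 */
if (!typentry->typbyval)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(stats->anl_context);

	hash_key = datumCopy(hash_key, typentry->typbyval, typentry->typlen);
	MemoryContextSwitchTo(oldcxt);
}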
> + ptr = att_addlength_pointer(ptr, typentry->typlen, ptr); > + ptr = (char *) att_align_nominal(ptr, typentry->typalign); > + } > + > + /* Advance bitmap pointers if any */ > + bitmask <<= 1; > + if (bitmask == 0x100) > + { > + if (bitmap) > + bitmap++; > + bitmask = 1; > + } > + > + /* No null element processing other then flag setting here */ > + if (isnull) > + continue; > + > + /* Lookup current element in hashtable, adding it if new */ > + item = (TrackItem *) hash_search(elements_tab, > + (const void *) &hash_key, > + HASH_ENTER, &found); > + > + if (found) > + { > + /* > + * The element is already on the tracking list. If it is the > + * first occurence in array then update element frequency. > + */ > + if (!item->occurence) > + { > + item->frequency++; > + item->occurence = true; > + occurences[occurence_index++] = item; > + } > + } > + else > + { > + /* Initialize new tracking list element */ > + item->frequency = 1; > + item->delta = b_current - 1; > + item->occurence = true; > + occurences[occurence_index++] = item; > + } > + > + /* element_no is the number of elements processed (ie N) */ > + element_no++; We should not bump element_no when we skipped the element as a duplicate within the same array. > + > + /* We prune the D structure after processing each bucket */ > + if (element_no % bucket_width == 0) > + { > + prune_element_hashtable(elements_tab, b_current); > + b_current++; > + } > + } > + /* Count null element only once per array */ > + if (null_present) > + null_elem_cnt++; > + > + /* Update frequency of particular array length. */ > + length_item = (LengthItem *) hash_search(length_tab, > + &occurence_index, > + HASH_ENTER, &length_found); > + if (length_found) > + { > + length_item->frequency++; > + } > + else > + { > + length_item->length = occurence_index; > + length_item->frequency = 1; > + } > + total_length += occurence_index; > + > + /* > + * When we end processing of particular array we should clean the > + * occurence flag. > + */ > + for (j = 0; j < occurence_index; j++) > + occurences[j]->occurence = false; Could you usefully simplify this away by storing the last-containing array_no, instead of a bool, in each hash entry? > + > + /* We should free memory from array if it was copied during detoast. */ > + if ((Datum) array != value) > + pfree((void *) array); No need to for casts to "void *". > + } > + > + /* Skip slots occupied by standard statistics */ > + while (OidIsValid(stats->stakind[slot_idx])) > + slot_idx++; > + > + /* Fill histogram of arrays lengths. */ > + lengths_count = hash_get_num_entries(length_tab); > + if (lengths_count > 0) > + { > + int num_hist = stats->attr->attstattarget; > + int delta; > + int frac; > + int i; > + Datum *hist_values; > + > + /* Copy lengths statistics from hashtab to array and sort them. */ > + length_index = 0; > + sorted_length_tab = (LengthItem *) palloc(sizeof(LengthItem) * lengths_count); > + hash_seq_init(&scan_status, length_tab); > + while ((length_item = (LengthItem *) hash_seq_search(&scan_status)) != NULL) > + { > + memcpy(&sorted_length_tab[length_index], length_item, > + sizeof(LengthItem)); > + length_index++; > + } > + qsort(sorted_length_tab, lengths_count, sizeof(LengthItem), > + lengthitem_compare_element); > + > + /* Histogram should be stored in anl_context. */ > + hist_values = (Datum *) MemoryContextAlloc(stats->anl_context, > + sizeof(Datum) * num_hist); > + /* Fill histogram by hashtab. 
*/ > + delta = samplerows - null_cnt - 1; > + length_index = 0; > + frac = sorted_length_tab[0].frequency * (num_hist - 1); > + for (i = 0; i < num_hist; i++) > + { > + hist_values[i] = > + Int32GetDatum(sorted_length_tab[length_index].length); > + frac -= delta; > + while (frac <= 0) > + { > + length_index++; > + frac += sorted_length_tab[length_index].frequency * > + (num_hist - 1); > + } > + } > + > + stats->stakind[slot_idx] = STATISTIC_KIND_LENGTH_HISTOGRAM; > + stats->staop[slot_idx] = Int4EqualOperator; > + stats->stavalues[slot_idx] = hist_values; > + stats->numvalues[slot_idx] = num_hist; > + /* We are storing values of element type */ > + stats->statypid[slot_idx] = INT4OID; > + stats->statyplen[slot_idx] = 4; > + stats->statypbyval[slot_idx] = true; > + stats->statypalign[slot_idx] = 'i'; > + slot_idx++; > + } > + > + /* We can only compute real stats if we found some non-null values. */ > + if (null_cnt < samplerows) > + { > + int nonnull_cnt = samplerows - null_cnt; > + int i; > + TrackItem **sort_table; > + int track_len; > + int cutoff_freq; > + int minfreq, > + maxfreq; > + > + stats->stats_valid = true; > + /* Do the simple null-frac and average width stats */ > + stats->stanullfrac = (double) null_cnt / (double) samplerows; > + stats->stawidth = total_width / (double) nonnull_cnt; Isn't this redundant with the calculations made in compute_scalar_stats()? > + > + /* Assume it's a unique column (see notes above) */ > + stats->stadistinct = -1.0; The comment, copied from ts_typanalyze(), refers to notes not likewise copied. We should probably instead leave whatever compute_scalar_stats() calculated. > *** a/src/backend/utils/adt/selfuncs.c > --- b/src/backend/utils/adt/selfuncs.c > *************** > *** 1705,1710 **** scalararraysel(PlannerInfo *root, > --- 1705,1736 ---- > RegProcedure oprsel; > FmgrInfo oprselproc; > Selectivity s1; > + bool varonleft; > + Node *other; > + VariableStatData vardata; > + > + /* > + * Handle "const = qual(column)" case using array column statistics. > + */ > + if (get_restriction_variable(root, clause->args, varRelid, > + &vardata, &other, &varonleft)) > + { > + Oid elemtype; > + elemtype = get_base_element_type(vardata.vartype); > + if (elemtype != InvalidOid && IsA(other, Const)) > + { > + if (((Const *) other)->constisnull) > + { > + /* qual can't succeed if null array */ > + ReleaseVariableStats(vardata); > + return (Selectivity) 0.0; > + } > + s1 = calc_scalararraysel(&vardata, ((Const *) other)->constvalue, useOr); > + ReleaseVariableStats(vardata); > + return s1; > + } > + ReleaseVariableStats(vardata); > + } If we're going to add a new file for array selectivity functions (which seems reasonable), scalararraysel() should also move there. (To do this and still keep the patch easy to read, you could do the move in a pre-patch.) 
> *** a/src/include/catalog/pg_proc.h > --- b/src/include/catalog/pg_proc.h > *************** > *** 849,854 **** DATA(insert OID = 2334 ( array_agg_finalfn PGNSP PGUID 12 1 0 0 0 f f f f f i > --- 849,859 ---- > DESCR("aggregate final function"); > DATA(insert OID = 2335 ( array_agg PGNSP PGUID 12 1 0 0 0 t f f f f i 1 0 2277 "2283" _null_ _null_ _null__null_ aggregate_dummy _null_ _null_ _null_ )); > DESCR("concatenate aggregate input into an array"); > + DATA(insert OID = 3816 ( array_typanalyze PGNSP PGUID 12 1 0 0 0 f f f t f s 1 0 16 "2281" _null_ _null_ _null_ _null_array_typanalyze _null_ _null_ _null_ )); > + DESCR("array statistics collector"); > + #define ArrayTypanalyzeOid 3816 Use the fmgroids.h symbol, F_ARRAY_TYPANALYZE, instead. > + DATA(insert OID = 3817 ( arraysel PGNSP PGUID 12 1 0 0 0 f f f t f s 4 0 701 "2281 26 2281 23" _null_ _null__null_ _null_ arraysel _null_ _null_ _null_ )); > + DESCR("array selectivity estimation functions"); > > DATA(insert OID = 760 ( smgrin PGNSP PGUID 12 1 0 0 0 f f f t f s 1 0 210 "2275" _null_ _null_ _null__null_ smgrin _null_ _null_ _null_ )); > DESCR("I/O"); > *** a/src/include/catalog/pg_statistic.h > --- b/src/include/catalog/pg_statistic.h > /* > * Currently, three statistical slot "kinds" are defined: most common values, Here's a larger quote of the comment starting here: /* * Currently, three statistical slot "kinds" are defined: most common values, * histogram, and correlation. Additional "kinds" will probably appear in * future to help cope with non-scalar datatypes. Also, custom data types * can define their own "kind" codes by mutual agreement between a custom * typanalyze routine and the selectivity estimation functions of the type's * operators. That needs an update. (It already needs an update for STATISTIC_KIND_MCELEM, but this patch would invalidate it yet again.) > *************** > *** 260,263 **** typedef FormData_pg_statistic *Form_pg_statistic; > --- 268,274 ---- > */ > #define STATISTIC_KIND_MCELEM 4 > > + > + #define STATISTIC_KIND_LENGTH_HISTOGRAM 5 The other kinds have long explanatory comments; this one should, too. > *** a/src/include/catalog/pg_type.h > --- b/src/include/catalog/pg_type.h > ! DATA(insert OID = 1021 ( _float4 PGNSP PGUID -1 f b A f t \054 0 700 0 array_in array_out array_recv array_send- - array_typanalyze i x f 0 -1 0 0 _null_ _null_ )); With this patch, a fresh database sets typanalyze = array_typanalyze for 27 array types and leaves typanalyze = NULL for the other 38 array types. What is the rationale for the split? For example, why does real[] use the new typanalyze but numeric[] does not? Thanks, nm
Hi!
Thanks for your great work on reviewing this patch. Now I'm trying to find the memory corruption bug. Unfortunately, it doesn't appear on my system. Can you check whether the bug remains in the attached version of the patch? If so, please send me information about the system you're running (processor, OS, etc.).
------
With best regards,
Alexander Korotkov.
On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote:
> Thanks for your great work on reviewing this patch. Now I'm trying to find the
> memory corruption bug. Unfortunately, it doesn't appear on my system. Can you
> check whether the bug remains in the attached version of the patch? If so,
> please send me information about the system you're running (processor, OS, etc.).

I get the same diagnostic from this version. Opteron processor, operating system is Ubuntu 8.04 (64-bit). You're using --enable-cassert, right?
On Wed, Jan 4, 2012 at 12:33 AM, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote:
> > Thanks for your great work on reviewing this patch. Now I'm trying to find the
> > memory corruption bug. Unfortunately, it doesn't appear on my system. Can you
> > check whether the bug remains in the attached version of the patch? If so,
> > please send me information about the system you're running (processor, OS, etc.).
>
> I get the same diagnostic from this version. Opteron processor, operating
> system is Ubuntu 8.04 (64-bit). You're using --enable-cassert, right?
Oh, actually no. Thanks for pointing that out.
------
With best regards,
Alexander Korotkov.
Corrections: On Thu, Dec 29, 2011 at 11:35:00AM -0500, Noah Misch wrote: > On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote: > > + * We set s to be the estimated frequency of the K'th element in a natural > > + * language's frequency table, where K is the target number of entries in > > + * the MCELEM array. We assume that the distribution of element frequencies > > + * follows Zipf's law with an exponent of 1. > > + * > > + * Assuming Zipfian distribution, the frequency of the K'th element is equal > > + * to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of > > + * elements in the language. Putting W as one million, we get roughly > > + * 0.07/K. This gives s = 0.07/K. We set epsilon = s/10, which gives bucket > > + * width w = K/0.007 and maximum expected hashtable size of about 1000 * K. > > These last two paragraphs, adapted from ts_typanalyze.c, assume natural > language documents. To what extent do these parameter choices remain sensible > for arbitrary data such as users may place in arrays? In any event, we need a > different justification, even if it's just a hand-wavy justification. > > If I'm following this correctly, this choice of "s" makes the algorithm > guaranteed to find only elements constituting >= 7% of the input elements. > Incidentally, isn't that far too high for natural language documents? If the > English word "the" only constitutes 7% of typical documents, then this "s" > value would permit us to discard practically every word; we'd be left with > words read while filling the last bucket and words that happened to repeat > exceedingly often in the column. I haven't tried to make a test case to > observe this problem; am I missing something? (This question is largely > orthogonal to your patch.) No, we'll find elements of frequency at least 0.07/(default_statistics_target * 10) -- in the default configuration, 0.007%. Also, ts_typanalyze() counts the number of documents that contain one or more instances of each lexeme, ignoring the number of appearances within each document. The word "the" may constitute 7% of a typical document, but it will appear at least once in nearly 100% of documents. Therefore, this "s" value is adequate even for the pathological case of each "document" containing just one lexeme. > > + * > > + * Note: in the above discussion, s, epsilon, and f/N are in terms of a > > + * element's frequency as a fraction of all elements seen in the input. > > + * However, what we actually want to store in the finished pg_statistic > > + * entry is each element's frequency as a fraction of all rows that it occurs > > + * in. Elements might be repeated in the same array. Since operators > > + * <@, &&, @> takes care only about element occurence itself and not about > > + * occurence count, function takes additional care about uniqueness of > > + * counting. Also we need to change the divisor from N to nonnull_cnt to get > > + * the number we want. > > On the same tangent, why does ts_typanalyze() not deduplicate the same way? > The @@ operator has the same property. Answer: to_tsvector() will have already done so. > > + /* > > + * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the > > + * comment above. > > + */ > > + bucket_width = num_mcelem * 1000 / 7; > > The addend mentioned is not present in the code or discussed in "the comment > above". (I see the comment is copied verbatim from ts_typanalyze(), where the > addend *is* present, though again the preceding comment says nothing of it.) 
The addend rationale in fact does appear in the ts_typanalyze() comment. Thanks, nm
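To put concrete numbers on that, a throwaway check of the arithmetic (illustrative only; 100 is simply the shipped default_statistics_target):

#include <stdio.h>

int
main(void)
{
	int		K = 100 * 10;				/* default_statistics_target * 10 = num_mcelem */
	double	s = 0.07 / K;				/* minimum tracked frequency: 0.00007, i.e. 0.007% */
	double	epsilon = s / 10;			/* allowed error in an estimated frequency */
	int		w = (int) (1.0 / epsilon);	/* bucket width, ~142857, matching num_mcelem * 1000 / 7 */

	printf("s = %g, epsilon = %g, bucket width w = %d\n", s, epsilon, w);
	return 0;
}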
Hi!
A patch in which most of the issues are fixed is attached.
On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch <noah@leadboat.com> wrote:
> I find distressing the thought of having two copies of the lossy sampling
> code, each implementing the algorithm with different variable names and levels
> of generality. We might someday extend this to hstore, and then we'd have yet
> another copy. Tom commented[1] that ts_typanalyze() and array_typanalyze()
> should remain distinct, and I agree. However, they could call a shared
> counting module. Is that practical? Possible API:
>
> typedef struct LossyCountCtl;
> LossyCountCtl *LossyCountStart(float s,
> float epsilon,
> int2 typlen,
> bool typbyval,
> Oid eqfunc); /* + hash func, a few others */
> void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
> TrackItem **LossyCountGetAll(LossyCountCtl *ctl);
>
I'm not sure about a shared lossy counting module, because the shared part of the code would be relatively small. The part of the compute_array_stats function that takes care of array decompression, distinct-occurrence calculation, the distinct element count histogram, packing statistics slots, etc. is much larger than the lossy counting algorithm itself. Maybe there are other opinions in the community?
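For anyone who wants to see the counting algorithm in isolation, here is a minimal, self-contained sketch of the Lossy Counting algorithm (Manku & Motwani) over a toy integer stream. It is purely illustrative: all names are mine, the fixed-size table stands in for a hashtable, and it is not the code from ts_typanalyze or from this patch.

    #include <stdio.h>

    #define MAXTRACK 1024               /* toy bound; a real module uses a hashtable */

    typedef struct
    {
        int         value;              /* element being tracked */
        int         freq;               /* occurrences counted since insertion */
        int         delta;              /* maximum possible undercount */
        int         used;
    } Entry;

    static Entry track[MAXTRACK];

    static Entry *
    find_or_add(int value, int delta)
    {
        int         i,
                    freeslot = -1;

        for (i = 0; i < MAXTRACK; i++)
        {
            if (track[i].used && track[i].value == value)
                return &track[i];
            if (!track[i].used && freeslot < 0)
                freeslot = i;
        }
        if (freeslot < 0)
            return NULL;                /* table full in this toy version */
        track[freeslot].used = 1;
        track[freeslot].value = value;
        track[freeslot].freq = 0;
        track[freeslot].delta = delta;
        return &track[freeslot];
    }

    int
    main(void)
    {
        int         stream[] = {1, 2, 1, 3, 1, 2, 1, 4, 1, 5, 1, 2};
        int         n = sizeof(stream) / sizeof(stream[0]);
        int         w = 4;              /* bucket width, i.e. 1/epsilon */
        int         b_current = 1;      /* current bucket number */
        int         i,
                    j;

        for (i = 0; i < n; i++)
        {
            Entry      *e = find_or_add(stream[i], b_current - 1);

            if (e)
                e->freq++;

            /* At each bucket boundary, prune entries that cannot be frequent. */
            if ((i + 1) % w == 0)
            {
                for (j = 0; j < MAXTRACK; j++)
                    if (track[j].used && track[j].freq + track[j].delta <= b_current)
                        track[j].used = 0;
                b_current++;
            }
        }

        for (i = 0; i < MAXTRACK; i++)
            if (track[i].used)
                printf("value %d: counted frequency %d\n", track[i].value, track[i].freq);
        return 0;
    }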
> I think this is an improvement, but some code out there may rely on the
> ability to get stakind = 4 data from the most_common_vals column. We'll need
> to mention this in the release notes as an incompatibility.
I'm not sure I understand the release-notes mechanism. Does it require anything in the patch itself?
> > + /*
> > + * Let be n independent events with probabilities p. This function calculates
> > + * probabilities of exact k of events occurence for k in [0;m].
> > + * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes
> > + * probability that exact j of first i events occurs. Obviously M[0,0] = 1.
> > + * Each next event increase total number of occured events if it occurs and
> > + * leave last value of that number if it doesn't occur. So, by the law of
> > + * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
> > + * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
> > + * Also there could be some events with low probabilities. Their summary
> > + * probability passed in the rest parameter.
> > + */
> > + static float *
> > + calc_distr(float *p, int n, int m, float rest)
> > + {
>
> > + /* Take care about events with low probabilities. */
> > + if (rest > 0.0f)
> > + {
> > + /*
> > + * The probability of no occurence of events which forms "rest"
> > + * probability have a limit of exp(-rest) when number of events fo to
> > + * infinity. Another simplification is to replace that events with one
> > + * event with (1 - exp(-rest)) probability.
> > + */
> > + rest = 1.0f - exp(-rest);
>
> What is the name of the underlying concept in probability theory?
The closest concept to the calculated distribution is the multinomial distribution. But it's not exactly the same, because the multinomial distribution gives the probability of a particular count of each event's occurrence, not the probability of the total number of occurrences. Actually, the distribution is calculated just from the assumption that the events are independent. The closest concept to the "rest" probability is approximation by an exponential distribution. It's quite a rough approximation, but I can't invent something better with low computational complexity.
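To make the recurrence quoted above concrete, here is a minimal standalone sketch of that dynamic programming step (my own code, not the patch's calc_distr, and it omits the "rest" handling discussed here):

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Sketch of the recurrence quoted above: returns an array d[0..m] where
     * d[j] is the probability that exactly j of the n independent events occur.
     */
    static float *
    calc_distr_sketch(const float *p, int n, int m)
    {
        float      *prev = calloc(m + 1, sizeof(float));
        float      *cur = calloc(m + 1, sizeof(float));
        float      *tmp;
        int         i,
                    j;

        prev[0] = 1.0f;                 /* M[0,0] = 1 */
        for (i = 1; i <= n; i++)
        {
            for (j = 0; j <= m; j++)
            {
                /* law of total probability, as in the quoted comment */
                cur[j] = prev[j] * (1.0f - p[i - 1]);
                if (j > 0)
                    cur[j] += prev[j - 1] * p[i - 1];
            }
            tmp = prev;                 /* keep only the previous row */
            prev = cur;
            cur = tmp;
        }
        free(cur);
        return prev;                    /* prev[j] = M[n,j] */
    }

    int
    main(void)
    {
        float       p[3] = {0.1f, 0.1f, 0.1f};
        float      *d = calc_distr_sketch(p, 3, 3);
        int         j;

        for (j = 0; j <= 3; j++)
            printf("P(exactly %d events) = %.4f\n", j, d[j]);
        free(d);
        return 0;
    }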
> > + /*
> > + * Array selectivity estimation based on most common elements statistics for
> > + * "column <@ const" case. Assumption that element occurences are independent
> > + * means certain distribution of array lengths. Typically real distribution
> > + * of lengths is significantly different from it. For example, if even we
> > + * have set of arrays with 1 integer element in range [0;10] each, element
> > + * occurences are not independent. Because in the case of independence we
>
> Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> '{6,19,4}' cannot appear?
I mean a column where each array contains exactly one element, i.e. the only possible values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}', '{10}'. That is a corner case. But a similar situation occurs when, for example, the distinct element count is distributed between 1 and 3. That differs significantly from the distribution implied by independent occurrence.
> > + * have probabilities of length of 0, 1, 2 etc. In the "column @> const"
> > + * and "column && const" cases we usually have "const" with low summary
> > + * frequency of elements (otherwise we have selectivity close to 0 or 1
> > + * correspondingly). That's why effect of dependence related to lengths
> > + * distribution id negligible there. In the "column <@ const" case summary
> > + * frequency of elements is high (otherwise we have selectivity close to 0).
>
> What does the term "summary frequency" denote?
I meant the sum of the frequencies of the "const" array elements.
> > + /*
> > + * Rest is a average length of elements which aren't present in mcelem.
> > + */
> > + rest = avg_length;
>
> You define "rest" here as an array length ...
>
> > +
> > + default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2);
> > +
> > + mcelem_index = 0;
> > +
> > + /*
> > + * mult is the multiplier that presents estimate of probability that each
> > + * mcelem which is not present in constant doesn't occur.
> > + */
> > + mult = 1.0f;
> > +
> > + for (i = 0; i < nitems; i++)
> > + {
> > + bool found = false;
> > +
> > + /* Comparison with previous value in order to guarantee uniquness */
> > + if (i > 0)
> > + {
> > + if (!element_compare(&array_data[i - 1], &array_data[i]))
> > + continue;
> > + }
> > +
> > + /*
> > + * Iterate over mcelem until find mcelem that is greater or equal to
> > + * element of constant. Simultaneously taking care about rest and
> > + * mult. If that mcelem is found then fill corresponding elem_selec.
> > + */
> > + while (mcelem_index < nmcelem)
> > + {
> > + int cmp = element_compare(&mcelem[mcelem_index], &array_data[i]);
> > +
> > + if (cmp < 0)
> > + {
> > + mult *= (1.0f - numbers[mcelem_index]);
> > + rest -= numbers[mcelem_index];
>
> ... But here, you're subtracting a frequency from an array length?
>
Yes, because the average distinct element count is the sum of the element frequencies. Subtracting the mcelem frequencies from avg_length, we get the sum of the frequencies of the non-mcelem elements.
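Put another way (my notation, not from the patch; $p_e$ is the fraction of rows whose array contains element $e$):

\[
\mathrm{avg\_length} \;=\; E[\#\,\text{distinct elements per row}] \;=\; \sum_{e} p_e
\;=\; \sum_{e \in \mathrm{MCELEM}} p_e \;+\; \underbrace{\sum_{e \notin \mathrm{MCELEM}} p_e}_{\text{what "rest" ends up holding}}
\]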
------
With best regards,
Alexander Korotkov.
Attachment
On Sat, Jan 07, 2012 at 09:36:42PM +0400, Alexander Korotkov wrote: > Patch where most part of issues are fixed is attached. Thanks. I've made several, largely cosmetic, edits. See attached version 0.10. Please use it as the basis for your next version, and feel free to revert any changes you deem inappropriate. Where I made non-cosmetic edits, I attempt to point that out below. I've left unfixed a few more-substantive problems, also described below. When you post another update, could you add it to the open CF? Given the timing, I think we might as well consider any further activity to have happened under the aegis of the 2012-01 CF. I'm marking the current entry Returned with Feedback. > On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch <noah@leadboat.com> wrote: > > I find distressing the thought of having two copies of the lossy sampling > > code, each implementing the algorithm with different variable names and > levels > > of generality. We might someday extend this to hstore, and then we'd > have yet > > another copy. Tom commented[1] that ts_typanalyze() and > array_typanalyze() > > should remain distinct, and I agree. However, they could call a shared > > counting module. Is that practical? Possible API: > > > > typedef struct LossyCountCtl; > > LossyCountCtl *LossyCountStart(float s, > > float epsilon, > > int2 typlen, > > bool typbyval, > > Oid eqfunc); /* > + hash func, a few others */ > > void LossyCountAdd(LossyCountCtl *ctl, Datum elem); > > TrackItem **LossyCountGetAll(LossyCountCtl *ctl); > > > > [1] > http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us > > I'm not sure about shared lossy counting module, because part of shared > code would be relatively small. Part of compute_array_stats function which > is taking care about array decompression, distinct occurence calculation, > disting element count histogram, packing statistics slots etc is much > larger than lossy counting algorithm itself. May be, there is some other > opinions in community? True; it would probably increase total lines of code. The benefit, if any, lies in separation of concerns; the business of implementing this algorithm is quite different from the other roles of these typanalyze functions. I won't insist that you try it, though. > > I think this is an improvement, but some code out there may rely on the > > ability to get stakind = 4 data from the most_common_vals column. We'll > need > > to mention this in the release notes as an incompatibility. > > I'm not sure I understand mechanism of release notes. Does it require > something in a patch itself? No. I just wanted to call attention to the fact in the hope that someone remembers as the release notes get drafted. > > > + /* > > > + * The probability of no occurence of events which forms > "rest" > > > + * probability have a limit of exp(-rest) when number of > events fo to > > > + * infinity. Another simplification is to replace that > events with one > > > + * event with (1 - exp(-rest)) probability. > > > + */ > > > + rest = 1.0f - exp(-rest); > > > > What is the name of the underlying concept in probability theory? > > The most closest concept to caculated distribution is multinomial > distribution. But it's not exactly same, because multinomial distribution > gives probability of particular count of each event occurece, not > probability of summary occurence. Actually, distribution is caclulated just > from assumption of events independence. 
The most closest concept of rest > probability is approximation by exponential distribution. It's quite rough > approximation, but I can't invent something better with low calculation > complexity. Do you have a URL of a tutorial or paper that explains the method in more detail? If, rather, this is a novel synthesis, could you write a proof to include in the comments? > > > + /* > > > + * Array selectivity estimation based on most common elements > statistics for > > > + * "column <@ const" case. Assumption that element occurences are > independent > > > + * means certain distribution of array lengths. Typically real > distribution > > > + * of lengths is significantly different from it. For example, if > even we > > > + * have set of arrays with 1 integer element in range [0;10] each, > element > > > + * occurences are not independent. Because in the case of > independence we > > > > Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but > > '{6,19,4}' cannot appear? > > I refer column where only one element exists, i.e. only possible values are > '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}', > '{10}'. That is a corner case. But similar situation occurs when, for > example, we've distribution of distinct element count between 1 and 3. It > significantly differs from distribution from independent occurence. Oh, I think I see now. If each element 1..10 had frequency 0.1 independently, column values would have exactly one distinct element just 39% of the time? If probability theory has a prototypical problem resembling this, it would be nice to include a URL to a thorough discussion thereof. I could not think of the search terms to find one, though. > > > + * have probabilities of length of 0, 1, 2 etc. In the "column @> > const" > > > + * and "column && const" cases we usually have "const" with low > summary > > > + * frequency of elements (otherwise we have selectivity close to 0 or > 1 > > > + * correspondingly). That's why effect of dependence related to > lengths > > > + * distribution id negligible there. In the "column <@ const" case > summary > > > + * frequency of elements is high (otherwise we have selectivity close > to 0). > > > > What does the term "summary frequency" denote? > > I meant summ of frequences of "const" array elements. Do you mean literally P_0 + P_1 ... + P_N? If so, I can follow the above argument for "column && const" and "column <@ const", but not for "column @> const". For "column @> const", selectivity cannot exceed the smallest frequency among elements of "const". Several high-frequency elements together will drive up the sum of frequencies without increasing the true selectivity. > > > + /* > > > + * Rest is a average length of elements which aren't present in > mcelem. > > > + */ > > > + rest = avg_length; > > > > You define "rest" here as an array length ... > > > + rest -= numbers[mcelem_index]; > > > > ... But here, you're subtracting a frequency from an array length? > > > > Yes, because average distinct element count is summ of frequencies of > elements. Substracting mcelem frequencies from avg_length we have summ of > frequencies of non-mcelem elements. I see now; thanks. I updated the comments so that this would have been clearer to me. 
> *** /dev/null > --- b/src/backend/utils/adt/array_selfuncs.c > + Selectivity > + calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause, > + Oid operator) > + { > + Oid elemtype; > + Selectivity selec; > + TypeCacheEntry *typentry; > + Datum *hist; > + int nhist; > + FunctionCallInfoData cmpfunc; > + > + elemtype = get_base_element_type(vardata->vartype); > + > + > + /* Get default comparison function */ > + typentry = lookup_type_cache(elemtype, > + TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR); > + > + /* Handle only "=" operator. Return default selectivity in other cases. */ > + if (operator != typentry->eq_opr) > + return (Selectivity) 0.5; Punting on other operators this way creates a plan quality regression for operations like "const < ANY (column)". Please do it some way that falls back on the somewhat-better existing scalararraysel() treatment for this. > + > + /* Without comparison function return default selectivity estimation */ > + if (!OidIsValid(typentry->cmp_proc)) > + { > + if (orClause) > + return DEFAULT_OVERLAP_SEL; > + else > + return DEFAULT_CONTAIN_SEL; > + } Since "const = ANY (column)" is equivalent to "column @> array[const]" and "const = ALL (column)" is equivalent to "column <@ array[const]", DEFAULT_CONTAIN_SEL is always correct here. I've made that change. > + /* > + * Calculate first n distinct element counts probabilities by histogram. We > + * assume that any interval between a and b histogram values gives > + * 1 / ((b - a + 1) * (nhist - 1)) probability to values between a and b and > + * half of that to a and b. Returns total probability that distinct element > + * count is less of equal to n. > + */ > + static float > + calc_hist(Datum *hist, int nhist, float *hist_part, int n) To test this function, I ran the following test case: set default_statistics_target = 4; create table t3 as select array(select * from generate_series(1, v)) as arr from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100); analyze t3; -- length_histogram_bounds = {2,2,5,5} select * from t3 where arr <@ array[6,7,8,9,10,11]; Using gdb to observe calc_hist()'s result during the last command: (gdb) p calc_hist(hist, nhist, hist_part, unique_nitems) $23 = 0.666666687 (gdb) x/6f hist_part 0xcd4bc8: 0 0 0.333333343 0 0xcd4bd8: 0 0.333333343 I expected an equal, nonzero probability in hist_part[3] and hist_part[4] and a total probability of 1.0. > + { > + int k, > + i = 0, > + prev_interval = 0, > + next_interval = 0; > + float frac, > + total = 0.0f; > + > + /* > + * frac is a probability contribution by each interval between histogram > + * values. We have nhist - 1 intervals. Contribution of one will be 1 / > + * (nhist - 1). > + */ > + frac = 1.0f / ((float) (nhist - 1)); > + for (k = 0; k <= n; k++) > + { > + int count = 0; > + > + /* Count occurences of k distinct element counts in histogram. */ > + while (i < nhist && DatumGetInt32(hist[i]) <= k) > + { > + if (DatumGetInt32(hist[i]) == k) > + count++; > + i++; > + } > + > + if (count > 0) > + { > + float val; > + > + /* Find length between current histogram value and the next one */ > + if (i < nhist) > + next_interval = DatumGetInt32(hist[i + 1]) - Doesn't this read past the array end when i == nhist - 1? > + /* > + * Let be n independent events with probabilities p. This function calculates > + * probabilities of exact k of events occurence for k in [0;m]. > + * Imagine matrix M of (n + 1) x (m + 1) size. 
Element M[i,j] denotes > + * probability that exact j of first i events occurs. Obviously M[0,0] = 1. > + * Each next event increase total number of occured events if it occurs and > + * leave last value of that number if it doesn't occur. So, by the law of > + * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i] > + * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0. > + * Also there could be some events with low probabilities. Their summary > + * probability passed in the rest parameter. > + */ > + static float * > + calc_distr(float *p, int n, int m, float rest) I attempted to clarify this comment; please see if I preserved its accuracy. > + /* > + * Using of distinct element counts histogram requires O(nitems * (nmcelem > + * + nitems)) operations. It's reasonable to limit the number of required > + * operation and give less accurate answer when this limit exceed. > + */ > + if (nhist > 0 && unique_nitems <= > + 300 * default_statistics_target / (nmcelem + unique_nitems)) I benchmarked the quadratic complexity here. With default settings, this cutoff skips the algorithm beginning around a 170-element constant array, which would nonetheless take single-digit milliseconds to plan. When I temporarily removed the cutoff, I needed much larger scales to get poor plan times. 5000 elements took 180ms to plan, and 10000 elements took 620 ms. Bottom line, the cutoff you've chosen is plenty conservative. > + static int > + element_compare(const void *key1, const void *key2, void *arg) > + { > + const Datum *d1 = (const Datum *) key1; > + const Datum *d2 = (const Datum *) key2; > + FunctionCallInfo cmpfunc = (FunctionCallInfo) arg; > + > + cmpfunc ->arg[0] = *d1; > + cmpfunc ->arg[1] = *d2; > + cmpfunc ->argnull[0] = false; > + cmpfunc ->argnull[1] = false; > + cmpfunc ->isnull = false; This indented poorly due to "cmpfunc" having a place in our typedefs list. I changed the identifier. > *** /dev/null > --- b/src/backend/utils/adt/array_typanalyze.c > + * We set s to be the estimated frequency of the K'th element in a natural > + * language's frequency table, where K is the target number of entries in > + * the MCELEM array. We assume that the distribution of element frequencies > + * follows Zipf's law with an exponent of 1. > + * > + * Assuming Zipfian distribution, the frequency of the K'th element is equal > + * to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of > + * elements in the language. Putting W as one million, we get roughly > + * 0.07/K. This gives s = 0.07/K. We set epsilon = s/10, which gives bucket > + * width w = K/0.007 and maximum expected hashtable size of about 1000 * K. Given the lack of applicability to arrays, I replaced these last two paragraphs with some weasel words. My gut feeling is that we're priming the algorithm to deliver answers far more precise than needed. However, I haven't attempted a principled replacement. > + /* This is 'w' from the LC algorithm */ > + int bucket_width; > + int array_no, > + element_no; I think it's possible for element_no to overflow. Consider rows with 2000 distinct elements apiece at a statistics target of 10000 (3M sample rows). So, I made it a uint64. > + extra_data = (ArrayAnalyzeExtraData *) stats->extra_data; This still isn't reentrant; you'd need to save the existing static extra_data and restore it on function exit. 
However, it turns out that do_analyze_rel() itself isn't reentrant on account of its similar management of "anl_context"; any nested ANALYZE crashes the backend. So, I don't think we need further change here. It will be easy to make reentrant later if necessary, though I'd probably fix do_analyze_rel() by just throwing an error on recursive ANALYZE. > + stats->extra_data = extra_data->std_extra_data; > + old_context = CurrentMemoryContext; > + extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows); > + MemoryContextSwitchTo(old_context); Is the callee known to change CurrentMemoryContext and not restore it? Offhand, I'm not seeing how it could do so. > + /* > + * hashtable for arrays distinct element count. > + */ > + MemSet(&count_hash_ctl, 0, sizeof(count_hash_ctl)); > + count_hash_ctl.keysize = sizeof(int); > + count_hash_ctl.entrysize = sizeof(DistinctElementCountItem); > + count_hash_ctl.hash = tag_hash; > + count_hash_ctl.match = memcmp; > + count_hash_ctl.hcxt = CurrentMemoryContext; > + count_tab = hash_create("Array distinct element count table", > + 64, > + &count_hash_ctl, > + HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT); This HASH_COMPARE setting is redundant, so I've removed it. > + /* Skip too large values. */ > + if (toast_raw_datum_size(value) > WIDTH_THRESHOLD) Fixed this warning: array_typanalyze.c: In function `compute_array_stats': array_typanalyze.c:361: warning: implicit declaration of function `toast_raw_datum_size' > + continue; > + else > + analyzed_rows++; > + > + /* > + * Add up widths for average-width calculation. Since it's a array, > + * we know it's varlena. As in the regular compute_minimal_stats > + * function, we use the toasted width for this calculation. > + */ > + total_width += VARSIZE_ANY(DatumGetPointer(value)); Since this is now unused, I removed it. > + /* Lookup current element in hashtable, adding it if new */ > + item = (TrackItem *) hash_search(elements_tab, > + (const void *) &hash_key, > + HASH_ENTER, &found); > + > + if (found) > + { I added a pfree(hash_key) here. In one of my default_statistics_target=3000 tests on a table with few possible elements, this saved hundreds of megabytes of memory. > + int i; > + > + /* > + * The element is already on the tracking list. Check if it's > + * first occurence of this element in array. > + */ > + for (i = 0; i < occurence_index; i++) > + { > + if (occurences[i] == item) > + break; > + } This wasn't what I had in mind when I suggested the different approach last time. See how I changed it in this version, and let me know if you see any essential disadvantages. > + /* Update frequency of particular array distinct element count. */ > + count_item = (DistinctElementCountItem *) hash_search(count_tab, > + &occurence_index, > + HASH_ENTER, &count_item_found); > + if (count_item_found) > + count_item->frequency++; > + else > + { > + count_item->count = occurence_index; The key gets initialized automatically, so I removed this line. > + count_item->frequency = 1; > + } > + total_distinct_count += occurence_index; total_distinct_count seemed to follow element_no exactly, so I removed it. > *** a/src/include/catalog/pg_statistic.h > --- b/src/include/catalog/pg_statistic.h > *************** > *** 260,263 **** typedef FormData_pg_statistic *Form_pg_statistic; > --- 268,285 ---- > */ > #define STATISTIC_KIND_MCELEM 4 > > + /* > + * A "length histogram" slot describes the distribution of lengths of data for > + * datatypes where length is important for selectivity estimation. 
stavalues > + * contains M (>=2) non-null values that divide the non-null column data values > + * into M-1 bins of approximately equal population. The first stavalues item > + * is the minimum length and the last is the maximum length. In dependence on > + * datatype this slot can hold distribution of not exactly length, but of > + * similar value. For instance, it hold distribution of distinct elements count > + * for arrays, because multiple occurences of array elements are ignored by > + * array comparison operators. > + * > + */ > + #define STATISTIC_KIND_LENGTH_HISTOGRAM 5 I changed this text to say that we always store distinct element counts. We can always update the comment later if we diversify its applications. > *** a/src/include/catalog/pg_type.h > --- b/src/include/catalog/pg_type.h This now updates all array types except record[]. I'm don't know offhand how to even make a non-empty value of type record[], let alone get it into a context where ANALYZE would see it. However, is there a particular reason to make that one different? Thanks, nm
Attachment
Hi!
Thanks for your fixes to the patch. They look correct to me. I made some further fixes in the patch. Proofs of some of the concepts are still needed; I'm going to provide them in a few days.
On Thu, Jan 12, 2012 at 3:06 PM, Noah Misch <noah@leadboat.com> wrote:
> I'm not sure about shared lossy counting module, because part of shared
> code would be relatively small. Part of compute_array_stats function which
> is taking care about array decompression, distinct occurence calculation,
> disting element count histogram, packing statistics slots etc is much
> larger than lossy counting algorithm itself. May be, there is some other
> opinions in community?

True; it would probably increase total lines of code. The benefit, if any,
lies in separation of concerns; the business of implementing this algorithm is
quite different from the other roles of these typanalyze functions. I won't
insist that you try it, though.
I'd prefer to try it as a separate patch.
> > > + /*
> > > + * The probability of no occurence of events which forms "rest"
> > > + * probability have a limit of exp(-rest) when number of events fo to
> > > + * infinity. Another simplification is to replace that events with one
> > > + * event with (1 - exp(-rest)) probability.
> > > + */
> > > + rest = 1.0f - exp(-rest);
> >
> > What is the name of the underlying concept in probability theory?
>
> The most closest concept to caculated distribution is multinomial
> distribution. But it's not exactly same, because multinomial distribution
> gives probability of particular count of each event occurece, not
> probability of summary occurence. Actually, distribution is caclulated just
> from assumption of events independence. The most closest concept of rest
> probability is approximation by exponential distribution. It's quite rough
> approximation, but I can't invent something better with low calculation
> complexity.
Do you have a URL of a tutorial or paper that explains the method in more
detail? If, rather, this is a novel synthesis, could you write a proof to
include in the comments?
Unfortunately, I don't have a relevant paper for it. I'll try to find one; otherwise I'll try to write a proof.
> > > + /*
> > > + * Array selectivity estimation based on most common elements statistics for
> > > + * "column <@ const" case. Assumption that element occurences are independent
> > > + * means certain distribution of array lengths. Typically real distribution
> > > + * of lengths is significantly different from it. For example, if even we
> > > + * have set of arrays with 1 integer element in range [0;10] each, element
> > > + * occurences are not independent. Because in the case of independence we
> >
> > Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> > '{6,19,4}' cannot appear?
>
> I refer column where only one element exists, i.e. only possible values are
> '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}',
> '{10}'. That is a corner case. But similar situation occurs when, for
> example, we've distribution of distinct element count between 1 and 3. It
> significantly differs from distribution from independent occurence.
Oh, I think I see now. If each element 1..10 had frequency 0.1 independently,
column values would have exactly one distinct element just 39% of the time?
Yes, it's right.
If probability theory has a prototypical problem resembling this, it would be
nice to include a URL to a thorough discussion thereof. I could not think of
the search terms to find one, though.
Actually, using both the distinct element count histogram and the element occurrence frequencies produces a certain probability distribution (which is more complex than simple independent element occurrence). If the real distribution is close to this one, then the estimate is accurate. I didn't come across such distributions in papers; actually, I just invented this while trying to get accurate "column <@ const" estimation, at least in simple cases. I'll try to search for similar things in papers, but I doubt I'll succeed. Otherwise I'll try to write a more detailed proof.
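As a quick sanity check of the 39% figure mentioned above, under the independence model with ten possible elements each present in a row with probability 0.1:

\[
P(\text{exactly one distinct element}) \;=\; \binom{10}{1} \cdot 0.1 \cdot 0.9^{9} \;\approx\; 0.387
\]

which is indeed far from the 100% that a column of strictly one-element arrays would show.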
> *** /dev/null
> --- b/src/backend/utils/adt/array_selfuncs.c
> + Selectivity
> + calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause,
> + Oid operator)
> + {
> + Oid elemtype;
> + Selectivity selec;
> + TypeCacheEntry *typentry;
> + Datum *hist;
> + int nhist;
> + FunctionCallInfoData cmpfunc;
> +
> + elemtype = get_base_element_type(vardata->vartype);
> +
> +
> + /* Get default comparison function */
> + typentry = lookup_type_cache(elemtype,
> + TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR);
> +
> + /* Handle only "=" operator. Return default selectivity in other cases. */
> + if (operator != typentry->eq_opr)
> + return (Selectivity) 0.5;
Punting on other operators this way creates a plan quality regression for
operations like "const < ANY (column)". Please do it some way that falls
back on the somewhat-better existing scalararraysel() treatment for this.
I've made calc_scalararraysel return -1 in this case, and also when there is no comparison function. scalararraysel continues with its own logic when calc_scalararraysel returns a negative value.
> + /*
> + * Calculate first n distinct element counts probabilities by histogram. We
> + * assume that any interval between a and b histogram values gives
> + * 1 / ((b - a + 1) * (nhist - 1)) probability to values between a and b and
> + * half of that to a and b. Returns total probability that distinct element
> + * count is less of equal to n.
> + */
> + static float
> + calc_hist(Datum *hist, int nhist, float *hist_part, int n)

To test this function, I ran the following test case:
set default_statistics_target = 4;
create table t3 as select array(select * from generate_series(1, v)) as arr
from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100);
analyze t3; -- length_histogram_bounds = {2,2,5,5}
select * from t3 where arr <@ array[6,7,8,9,10,11];
Using gdb to observe calc_hist()'s result during the last command:
(gdb) p calc_hist(hist, nhist, hist_part, unique_nitems)
$23 = 0.666666687
(gdb) x/6f hist_part
0xcd4bc8: 0 0 0.333333343 0
0xcd4bd8: 0 0.333333343
I expected an equal, nonzero probability in hist_part[3] and hist_part[4] and
a total probability of 1.0.
> + {
> + int k,
> + i = 0,
> + prev_interval = 0,
> + next_interval = 0;
> + float frac,
> + total = 0.0f;
> +
> + /*
> + * frac is a probability contribution by each interval between histogram
> + * values. We have nhist - 1 intervals. Contribution of one will be 1 /
> + * (nhist - 1).
> + */
> + frac = 1.0f / ((float) (nhist - 1));
> + for (k = 0; k <= n; k++)
> + {
> + int count = 0;
> +
> + /* Count occurences of k distinct element counts in histogram. */
> + while (i < nhist && DatumGetInt32(hist[i]) <= k)
> + {
> + if (DatumGetInt32(hist[i]) == k)
> + count++;
> + i++;
> + }
> +
> + if (count > 0)
> + {
> + float val;
> +
> + /* Find length between current histogram value and the next one */
> + if (i < nhist)
> + next_interval = DatumGetInt32(hist[i + 1]) -
Doesn't this read past the array end when i == nhist - 1?
It was a bug. It also caused the incorrect histogram calculation you noted above. Fixed.
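For what it's worth, the shape of the fix is just a bounds guard; here is a toy standalone illustration (not the patch code, which may differ in detail):

    #include <stdio.h>

    int
    main(void)
    {
        int         hist[] = {2, 2, 5, 5};  /* toy "length histogram" */
        int         nhist = 4;
        int         i;

        for (i = 0; i < nhist; i++)
        {
            int         next_interval;

            /* hist[i + 1] exists only while i + 1 < nhist */
            if (i + 1 < nhist)
                next_interval = hist[i + 1] - hist[i];
            else
                next_interval = 0;
            printf("i=%d next_interval=%d\n", i, next_interval);
        }
        return 0;
    }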
> + stats->extra_data = extra_data->std_extra_data;
> + old_context = CurrentMemoryContext;
> + extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows);
> + MemoryContextSwitchTo(old_context);
Is the callee known to change CurrentMemoryContext and not restore it?
Offhand, I'm not seeing how it could do so.
Right. Saving the memory context is not needed. Removed.
> *** a/src/include/catalog/pg_type.h
> --- b/src/include/catalog/pg_type.h
This now updates all array types except record[]. I'm don't know offhand how
to even make a non-empty value of type record[], let alone get it into a
context where ANALYZE would see it. However, is there a particular reason to
make that one different?
Oh, I didn't manage to update all the array types even in two tries :) Fixed.
With best regards,
Alexander Korotkov.
Attachment
On Tue, Jan 17, 2012 at 12:04:06PM +0400, Alexander Korotkov wrote:
> Thanks for your fixes to the patch. Them looks correct to me. I did some
> fixes in the patch. The proof of some concepts is still needed. I'm going
> to provide it in a few days.

Your further fixes look good. Could you also answer my question about the
header comment of mcelem_array_contained_selec()?

    /*
     * Estimate selectivity of "column <@ const" based on most common element
     * statistics. Independent element occurrence would imply a particular
     * distribution of distinct element counts among matching rows. Real data
     * usually falsifies that assumption. For example, in a set of 1-element
     * integer arrays having elements in the range [0;10], element occurrences are
     * not independent. If they were, a sufficiently-large set would include all
     * distinct element counts 0 through 11. We correct for this using the
     * histogram of distinct element counts.
     *
     * In the "column @> const" and "column && const" cases, we usually have
     * "const" with low summary frequency of elements (otherwise we have
     * selectivity close to 0 or 1 correspondingly). That's why the effect of
     * dependence related to distinct element counts distribution is negligible
     * there. In the "column <@ const" case, summary frequency of elements is
     * high (otherwise we have selectivity close to 0). That's why we should do
     * correction due to array distinct element counts distribution.
     */

By "summary frequency of elements", do you mean literally P_0 + P_1 ... + P_N? If so, I can follow the above argument for "column && const" and "column <@ const", but not for "column @> const". For "column @> const", selectivity cannot exceed the smallest frequency among const elements. A number of high-frequency elements will drive up the sum of the frequencies without changing the true selectivity much at all.

Thanks,
nm
Hi!
The updated patch is attached. I've updated the comment of mcelem_array_contained_selec with a more detailed description of the probability distribution assumption. Also, I found that the "rest" behaviour is better described by a Poisson distribution; relevant changes were made.
On Tue, Jan 17, 2012 at 2:33 PM, Noah Misch <noah@leadboat.com> wrote:
By "summary frequency of elements", do you mean literally P_0 + P_1 ... + P_N?If so, I can follow the above argument for "column && const" and "column <@cannot exceed the smallest frequency among const elements. A number of
const", but not for "column @> const". For "column @> const", selectivity
high-frequency elements will drive up the sum of the frequencies without
changing the true selectivity much at all.
Referring to summary frequency is not really correct. It would be more correct to refer to the number of elements in "const". When there are many elements in "const", "column @> const" selectivity tends to be close to 0 and "column <@ const" selectivity tends to be close to 1. Of course, this holds when the elements have middling frequencies (not very close to 0 and not very close to 1). I've replaced "summary frequency of elements" with "number of elements".
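A rough formalization of this point under the independence model (my notation; $p_e$ is the per-row frequency of element $e$, and the distinct-count correction discussed earlier is ignored):

\[
P(\text{column} \supseteq \text{const}) \;=\; \prod_{e \in \text{const}} p_e \;\le\; \min_{e \in \text{const}} p_e,
\qquad
P(\text{column} \subseteq \text{const}) \;=\; \prod_{e \notin \text{const}} (1 - p_e)
\]

As the number of elements in "const" grows, the first product shrinks toward 0 while the second grows toward 1, which is the asymmetry described above.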
With best regards,
Alexander Korotkov.
Attachment
On Mon, Jan 23, 2012 at 01:21:20AM +0400, Alexander Korotkov wrote: > Updated patch is attached. I've updated comment > of mcelem_array_contained_selec with more detailed description of > probability distribution assumption. Also, I found that "rest" behavious > should be better described by Poisson distribution, relevant changes were > made. Thanks. That makes more of the math clear to me. I do not follow all of it, but I feel that the comments now have enough information that I could go about doing so. > + /* Take care about events with low probabilities. */ > + if (rest > DEFAULT_CONTAIN_SEL) > + { Why the change from "rest > 0" to this in the latest version? > + /* emit some statistics for debug purposes */ > + elog(DEBUG3, "array: target # mces = %d, bucket width = %d, " > + "# elements = %llu, hashtable size = %d, usable entries = %d", > + num_mcelem, bucket_width, element_no, i, track_len); That should be UINT64_FMT. (I introduced that error in v0.10.) I've attached a new version that includes the UINT64_FMT fix, some edits of your newest comments, and a rerun of pgindent on the new files. I see no other issues precluding commit, so I am marking the patch Ready for Committer. If I made any of the comments worse, please post another update. Thanks, nm
Attachment
On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch <noah@leadboat.com> wrote:
> + /* Take care about events with low probabilities. */
> + if (rest > DEFAULT_CONTAIN_SEL)
> + {
Why the change from "rest > 0" to this in the latest version?
Earlier, adding the "rest" distribution required O(m) time. Now there is a more accurate and proven estimate, but it takes O(m^2) time. It doesn't make the overall asymptotic complexity worse, but it is significant. That's why I decided to skip it for low values of "rest", which don't change the distribution significantly.
> + /* emit some statistics for debug purposes */
> + elog(DEBUG3, "array: target # mces = %d, bucket width = %d, "
> + "# elements = %llu, hashtable size = %d, usable entries = %d",
> + num_mcelem, bucket_width, element_no, i, track_len);
That should be UINT64_FMT. (I introduced that error in v0.10.)
I've attached a new version that includes the UINT64_FMT fix, some edits of
your newest comments, and a rerun of pgindent on the new files. I see no
other issues precluding commit, so I am marking the patch Ready for Committer.
Great!
If I made any of the comments worse, please post another update.
The changes look reasonable to me. Thanks!
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch <noah@leadboat.com> wrote: >> I've attached a new version that includes the UINT64_FMT fix, some edits of >> your newest comments, and a rerun of pgindent on the new files. I see no >> other issues precluding commit, so I am marking the patch Ready for >> Committer. >> If I made any of the comments worse, please post another update. > Changes looks reasonable for me. Thanks! I am starting to look at this patch now. I'm wondering exactly why the decision was made to continue storing btree-style statistics for arrays, in addition to the new stuff. The pg_statistic rows for array columns tend to be unreasonably wide already, and as-is this patch will make them quite a lot wider. I think it requires more than a little bit of evidence to continue storing stats that seem to have only small probability of usefulness. In particular, if we didn't store that stuff, we'd not need to widen the number of columns in pg_statistic, which would noticeably reduce both the footprint of the patch and the probability of breaking external code. regards, tom lane
On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I am starting to look at this patch now. I'm wondering exactly why the
decision was made to continue storing btree-style statistics for arrays,
in addition to the new stuff. The pg_statistic rows for array columns
tend to be unreasonably wide already, and as-is this patch will make
them quite a lot wider. I think it requires more than a little bit of
evidence to continue storing stats that seem to have only small
probability of usefulness.
In particular, if we didn't store that stuff, we'd not need to widen the
number of columns in pg_statistic, which would noticeably reduce both
the footprint of the patch and the probability of breaking external
code.
Initially, I used existing slots for new statistics, but I've changed this after the first review:
Perhaps btree statistics really do matter for some kinds of arrays? For example, arrays representing paths in a tree: we could request a subtree with a range query on such arrays.
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I am starting to look at this patch now. I'm wondering exactly why the >> decision was made to continue storing btree-style statistics for arrays, > Probably, btree statistics really does matter for some sort of arrays? For > example, arrays representing paths in the tree. We could request a subtree > in a range query on such arrays. That seems like a pretty narrow, uncommon use-case. Also, to get accurate stats for such queries that way, you'd need really enormous histograms. I doubt that the existing parameters for histogram size will permit meaningful estimation of more than the first array entry (since we don't make the histogram any larger than we do for a scalar column). The real point here is that the fact that we're storing btree-style stats for arrays is an accident, backed into by having added btree comparators for arrays plus analyze.c's habit of applying default scalar-oriented analysis functions to any type without an explicit typanalyze entry. I don't recall that we ever thought hard about it or showed that those stats were worth anything. regards, tom lane
On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I am starting to look at this patch now. I'm wondering exactly why the
>> decision was made to continue storing btree-style statistics for arrays,

> Probably, btree statistics really does matter for some sort of arrays? For
> example, arrays representing paths in the tree. We could request a subtree
> in a range query on such arrays.

That seems like a pretty narrow, uncommon use-case. Also, to get
accurate stats for such queries that way, you'd need really enormous
histograms. I doubt that the existing parameters for histogram size
will permit meaningful estimation of more than the first array entry
(since we don't make the histogram any larger than we do for a scalar
column).
The real point here is that the fact that we're storing btree-style
stats for arrays is an accident, backed into by having added btree
comparators for arrays plus analyze.c's habit of applying default
scalar-oriented analysis functions to any type without an explicit
typanalyze entry. I don't recall that we ever thought hard about
it or showed that those stats were worth anything.
OK. I don't object to removing btree stats from arrays.
What do you think about the pg_stats view in this case? Should it combine the values histogram and the array length histogram in a single column, as is done for MCV and MCELEM?
With best regards,
Alexander Korotkov.
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alexander Korotkov <aekorotkov@gmail.com> writes: >> On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch <noah@leadboat.com> wrote: >>> I've attached a new version that includes the UINT64_FMT fix, some edits of >>> your newest comments, and a rerun of pgindent on the new files. I see no >>> other issues precluding commit, so I am marking the patch Ready for >>> Committer. >>> If I made any of the comments worse, please post another update. > >> Changes looks reasonable for me. Thanks! > > I am starting to look at this patch now. I'm wondering exactly why the > decision was made to continue storing btree-style statistics for arrays, > in addition to the new stuff. If I understand you're suggestion, queries of the form SELECT * FROM rel WHERE ARRAY[ 1,2,3,4 ] <= x AND x <=ARRAY[ 1, 2, 3, 1000]; would no longer use an index. Is that correct? Are you suggesting removing MCV's in lieu of MCE's as well? -Nathan
Nathan Boley <npboley@gmail.com> writes: > On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I am starting to look at this patch now. I'm wondering exactly why the >> decision was made to continue storing btree-style statistics for arrays, >> in addition to the new stuff. > If I understand you're suggestion, queries of the form > SELECT * FROM rel > WHERE ARRAY[ 1,2,3,4 ] <= x > AND x <=ARRAY[ 1, 2, 3, 1000]; > would no longer use an index. Is that correct? No, just that we'd no longer have statistics relevant to that, and would have to fall back on default selectivity assumptions. Do you think that such applications are so common as to justify bloating pg_statistic for everybody that uses arrays? regards, tom lane
On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Nathan Boley <npboley@gmail.com> writes: >> On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I am starting to look at this patch now. I'm wondering exactly why the >>> decision was made to continue storing btree-style statistics for arrays, >>> in addition to the new stuff. > >> If I understand you're suggestion, queries of the form > >> SELECT * FROM rel >> WHERE ARRAY[ 1,2,3,4 ] <= x >> AND x <=ARRAY[ 1, 2, 3, 1000]; > >> would no longer use an index. Is that correct? > > No, just that we'd no longer have statistics relevant to that, and would > have to fall back on default selectivity assumptions. Which, currently, would mean queries of that form would typically use a table scan, right? > Do you think that > such applications are so common as to justify bloating pg_statistic for > everybody that uses arrays? I have no idea, but it seems like it will be a substantial regression for the people that are. What about MCV's? Will those be removed as well? Best, Nathan
Nathan Boley <npboley@gmail.com> writes: > On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Nathan Boley <npboley@gmail.com> writes: >>> If I understand you're suggestion, queries of the form >>> SELECT * FROM rel >>> WHERE ARRAY[ 1,2,3,4 ] <= x >>> AND x <=ARRAY[ 1, 2, 3, 1000]; >>> would no longer use an index. Is that correct? >> No, just that we'd no longer have statistics relevant to that, and would >> have to fall back on default selectivity assumptions. > Which, currently, would mean queries of that form would typically use > a table scan, right? No, it doesn't. > What about MCV's? Will those be removed as well? Sure. Those seem even less useful. regards, tom lane
On Thu, Mar 1, 2012 at 1:19 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That seems like a pretty narrow, uncommon use-case. Also, to get
accurate stats for such queries that way, you'd need really enormous
histograms. I doubt that the existing parameters for histogram size
will permit meaningful estimation of more than the first array entry
(since we don't make the histogram any larger than we do for a scalar
column).
The real point here is that the fact that we're storing btree-style
stats for arrays is an accident, backed into by having added btree
comparators for arrays plus analyze.c's habit of applying default
scalar-oriented analysis functions to any type without an explicit
typanalyze entry. I don't recall that we ever thought hard about
it or showed that those stats were worth anything.

OK. I don't object to removing btree stats from arrays.

What do you think about the pg_stats view in this case? Should it combine the values histogram and the array length histogram in a single column, as is done for MCV and MCELEM?
Btree statistics for arrays and the additional statistics slot are removed in the attached version of the patch. The pg_stats view is untouched for now.
With best regards,
Alexander Korotkov.
Attachment
On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Nathan Boley <npboley@gmail.com> writes: >> On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I am starting to look at this patch now. I'm wondering exactly why the >>> decision was made to continue storing btree-style statistics for arrays, >>> in addition to the new stuff. > >> If I understand you're suggestion, queries of the form > >> SELECT * FROM rel >> WHERE ARRAY[ 1,2,3,4 ] <= x >> AND x <=ARRAY[ 1, 2, 3, 1000]; > >> would no longer use an index. Is that correct? > > No, just that we'd no longer have statistics relevant to that, and would > have to fall back on default selectivity assumptions. Do you think that > such applications are so common as to justify bloating pg_statistic for > everybody that uses arrays? I confess I am nervous about ripping this out. I am pretty sure we will get complaints about it. Performance optimizations that benefit group A at the expense of group B are always iffy, and I'm not sure the case of using an array as a path indicator is as uncommon as you seem to think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of jue mar 01 12:00:08 -0300 2012: > On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > No, just that we'd no longer have statistics relevant to that, and would > > have to fall back on default selectivity assumptions. Do you think that > > such applications are so common as to justify bloating pg_statistic for > > everybody that uses arrays? > > I confess I am nervous about ripping this out. I am pretty sure we > will get complaints about it. Performance optimizations that benefit > group A at the expense of group B are always iffy, and I'm not sure > the case of using an array as a path indicator is as uncommon as you > seem to think. Maybe we should keep it as an option. I do think it's quite uncommon, but for those rare users, it'd be good to provide the capability while not bloating everyone else's stat catalog. The thing is, people using arrays as path indicators and such are likely using relatively small arrays; people storing real data are likely to store much bigger arrays. Just a hunch though. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Robert Haas's message of jue mar 01 12:00:08 -0300 2012: >> On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I confess I am nervous about ripping this out. I am pretty sure we >> will get complaints about it. Performance optimizations that benefit >> group A at the expense of group B are always iffy, and I'm not sure >> the case of using an array as a path indicator is as uncommon as you >> seem to think. > Maybe we should keep it as an option. How would we make it optional? There's noplace I can think of to stick such a knob ... regards, tom lane
Excerpts from Tom Lane's message of jue mar 01 18:51:38 -0300 2012: > > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Excerpts from Robert Haas's message of jue mar 01 12:00:08 -0300 2012: > >> On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I confess I am nervous about ripping this out. I am pretty sure we > >> will get complaints about it. Performance optimizations that benefit > >> group A at the expense of group B are always iffy, and I'm not sure > >> the case of using an array as a path indicator is as uncommon as you > >> seem to think. > > > Maybe we should keep it as an option. > > How would we make it optional? There's noplace I can think of to stick > such a knob ... Uhm, attoptions? "alter table foo alter column bar set extended_array_stats to on" or something like that? -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>> What about MCV's? Will those be removed as well? > > Sure. Those seem even less useful. Ya, this will destroy the performance of several queries without some heavy tweaking. Maybe this is bad design, but I've gotten in the habit of storing sequences as arrays and I commonly join on them. I looked through my code this morning, and I only have one 'range' query ( of the form described up-thread ), but there are tons of the form SELECT att1, attb2 FROM rela, relb where rela.seq_array_1 = relb.seq_array; I can provide some examples if that would make my argument more compelling. Sorry to be difficult, Nathan
Nathan Boley <npboley@gmail.com> writes: > Maybe this is bad design, but I've gotten in the habit of storing > sequences as arrays and I commonly join on them. I looked through my > code this morning, and I only have one 'range' query ( of the form > described up-thread ), but there are tons of the form > SELECT att1, attb2 FROM rela, relb where rela.seq_array_1 = relb.seq_array; What do you mean by "storing sequences as arrays"? Can you demonstrate that the existing stats are relevant at all to the query you're worried about? regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Tom Lane's message of jue mar 01 18:51:38 -0300 2012: >> How would we make it optional? There's noplace I can think of to stick >> such a knob ... > Uhm, attoptions? Oh, I had forgotten we had that mechanism already. Yeah, that might work. I'm a bit tempted to design the option setting so that you can select whether to keep the btree stats, the new stats, or both or neither --- after all, there are probably plenty of databases where nobody cares about the array-containment operators either. That leaves the question of which setting should be the default ... regards, tom lane
[ sorry Tom, reply all this time... ] > What do you mean by "storing sequences as arrays"? So, a simple example is, for transcripts ( sequences of DNA that are turned into proteins ), we store each of the connected components as an array of the form: exon_type in [1,6] splice_type = [1,3] and then the array elements are [ exon_type, splice_type, exon_type ] ~ 99% of the elements are of the form [ [1,3], 1, [1,3] ], so I almost always get a hash or merge join ( correctly ) but for the rare junction types ( which are usually more interesting as well ) I correctly get nest loops with an index scan. > Can you demonstrate > that the existing stats are relevant at all to the query you're worried > about? Well, if we didn't have mcv's and just relied on ndistinct to estimate the '=' selectivities, either my low selectivity quals would use the index, or my high selectivity quals would use a table scan, either of which would be wrong. I guess I could wipe out the stats and get some real numbers tonight, but I can't see how the planner would be able to distinguish *without* mcv's... Best, Nathan
Still working through this patch ... there are some things that bother me about the entries being made in pg_statistic: 1. You re-used STATISTIC_KIND_MCELEM for something that, while similar to tsvector's usage, is not the same. In particular, tsvector adds two extra elements to the stanumbers array, but this patch adds four (and doesn't bother documenting that in pg_statistic.h). I don't think it's acceptable to re-use the same stakind value for something that's not following the same specification. I see Nathan complained of this way, way upthread, but nothing was done about it. I think we should either assign a different stakind code for this definition, or change things around so that tsvector actually is using the same stats kind definition as arrays are. (We could get away with redefining the tsvector stats format, because pg_upgrade doesn't try to copy pg_statistic rows to the new database.) Now, of the two new elements added by the patch, it seems to me to be perfectly reasonable to add a null-element frequency to the kind specification; the fact that it wasn't there to start with is kind of an oversight born of the fact that tsvectors don't contain any null lexemes. But the other new element is the average distinct element count, which really does not belong here at all, as it is *entirely* unrelated to element frequencies. It seems to me that that more nearly belongs in the element-count histogram slot. So my preference is to align the two definitions of STATISTIC_KIND_MCELEM by adding a null-element frequency to tsvector's usage (where it'll always be zero) and getting rid of the average distinct element count here. 2. I think STATISTIC_KIND_LENGTH_HISTOGRAM is badly named and confusingly documented. The stats are not about anything I would call a "length" --- rather we're considering the counts of numbers of distinct element values present in each array value. An ideal name perhaps would be STATISTIC_KIND_DISTINCT_ELEMENTS_COUNT_HISTOGRAM, but of course that's unreasonably long. Considering the way that the existing stats kind names are abbreviated, maybe STATISTIC_KIND_DECHIST would do. Anybody have a better idea? 3. I also find it a bit odd that you chose to store the length (count) histogram as an integer array in stavalues. Usually we've put such data in stanumbers. That would make the entries float4 not integer, but that doesn't seem unreasonable to me --- values would still be exact up to 2^24 or so on typical machines, and if we ever do have values larger than that, it seems to me that having headroom to go above 2^32 would be a good thing. In any case, if we're going to move the average distinct-element count over here, that would have to go into stanumbers. Comments? regards, tom lane
I wrote:
> ... So my preference is to align the two
> definitions of STATISTIC_KIND_MCELEM by adding a null-element frequency
> to tsvector's usage (where it'll always be zero) and getting rid of the
> average distinct element count here.

Actually, there's a way we can do this without code changes in the tsvector stuff. Since the number of MCELEM stanumber items that provide frequencies of stavalue items is obviously equal to the length of stavalues, we could define stanumbers as containing those matching entries, then two min/max entries, then an *optional* entry for the frequency of null elements (with the frequency presumed to be zero if omitted). This'd be non-ambiguous given access to stavalues.

I'm not sure though if making the null frequency optional wouldn't introduce complexity elsewhere that outweighs not having to touch the tsvector code.

regards, tom lane
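For illustration only, here is a minimal sketch of how a consumer of such a slot might decode the stanumbers layout described above. The function and parameter names are invented for this sketch; this is not the actual tsvector or array statistics code.

    #include <stddef.h>

    typedef float float4;           /* stands in for PostgreSQL's float4 */

    /*
     * Hypothetical decoding of the MCELEM stanumbers layout sketched above:
     *   stanumbers[0 .. nvalues-1]  frequencies matching the stavalues entries
     *   stanumbers[nvalues]         minimum element frequency
     *   stanumbers[nvalues + 1]     maximum element frequency
     *   stanumbers[nvalues + 2]     frequency of null elements (optional;
     *                               presumed zero when the entry is omitted)
     */
    static float4
    mcelem_null_frequency(const float4 *stanumbers, int nnumbers, int nvalues)
    {
        if (nnumbers >= nvalues + 3)
            return stanumbers[nvalues + 2];
        return 0.0f;
    }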
... BTW, could you explain exactly how that "Fill histogram by hashtab" loop works? It's way too magic for my taste, and does in fact have bugs in the currently submitted patch. I've reworked it to this:

    /* Fill histogram by hashtab. */
    delta = analyzed_rows - 1;
    count_item_index = 0;
    frac = sorted_count_items[0]->frequency * (num_hist - 1);
    for (i = 0; i < num_hist; i++)
    {
        while (frac <= 0)
        {
            count_item_index++;
            Assert(count_item_index < count_items_count);
            frac += sorted_count_items[count_item_index]->frequency * (num_hist - 1);
        }
        hist[i] = sorted_count_items[count_item_index]->count;
        frac -= delta;
    }
    Assert(count_item_index == count_items_count - 1);

The asserts don't fire in any test case I've tried, which seems to indicate that it *does* work in the sense that the first histogram entry is always the smallest count and the last histogram entry is always the largest one. But it's extremely unclear why it manages to stop exactly at the last count_items array entry, or for that matter why it's generating a representative histogram at all. I'm suspicious that the "-1" bits represent off-by-one bugs.

I also don't especially like the fact that "frac" is capable of overflowing (since worst case frequency is 300 * 10000 and worst case num_hist is 10000, with the current limits on statistics_target). We could work around that by doing the frac arithmetic in int64, but I wonder whether that couldn't be avoided.

In any case, first I'd like an explanation why this code works at all.

regards, tom lane
Alexander Korotkov <aekorotkov@gmail.com> writes:
> [ array statistics patch ]

I've committed this after a fair amount of editorialization. There are still some loose ends to deal with, but I felt it was ready to go into the tree for wider testing.

The main thing I changed that wasn't in the nature of cleanup/bugfixing was that I revised the effort-limiting logic in mcelem_array_contained_selec. The submitted code basically just punted if the estimated work was too much, but as was already noted in http://archives.postgresql.org/pgsql-hackers/2011-10/msg01349.php that can result in really bad estimates. What I did instead is something closer to Robert's original suggestion: trim the number of element values taken into consideration from the array constant to a value that fits within the desired effort limit. If we consider just the N most common values from the array constant, we still get a pretty good estimate (since the trimmed N will still be close to 100 for the values we're talking about). I redid the tests in the above-mentioned message and see no cases where the estimate is off by more than a factor of 2, and very few where it's off by more than 20%, so this seems to work pretty well now.

The remaining loose ends IMO are:

1. I'm still unhappy about the loop that fills the count histogram, as I noted earlier today. It at least needs a decent comment and some overflow protection, and I'm not entirely convinced that it doesn't have more bugs than the overflow issue.

2. The tests in the above-mentioned message show that in most cases where mcelem_array_contained_selec falls through to the "rough estimate", the resulting rowcount estimate is just 1, ie we are coming out with very small selectivities. Although that path will now only be taken when there are no stats, it seems like we'd be better off to return DEFAULT_CONTAIN_SEL instead of what it's doing. I think there must be something wrong with the "rough estimate" logic. Could you recheck that?

3. As I mentioned yesterday, I think it'd be a good idea to make some provisions to reduce the width of pg_statistic rows for array columns by not storing the scalar-style and/or array-style stats, if the DBA knows that they're not going to be useful for a particular column. I have not done anything about that.

regards, tom lane
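For readers following along, a rough sketch of the trimming idea described above: sort the constant's distinct elements by their estimated frequency and consider only the most common ones. Every name here is invented, and the shape of the effort limit is simplified; the real logic lives in mcelem_array_contained_selec and differs in detail.

    #include <stdlib.h>

    /* Invented stand-in for one distinct element of the array constant. */
    typedef struct
    {
        double  freq;       /* frequency looked up in (or defaulted from) MCELEM */
    } ConstElem;

    static int
    cmp_freq_desc(const void *a, const void *b)
    {
        double  fa = ((const ConstElem *) a)->freq;
        double  fb = ((const ConstElem *) b)->freq;

        return (fa < fb) - (fa > fb);   /* sort by descending frequency */
    }

    /*
     * Illustrative only: cap the number of constant elements considered so the
     * estimation work stays within an effort budget, keeping the most common
     * ones (they dominate the estimate).  Returns how many leading entries of
     * elems[] the caller should use after the sort.
     */
    static int
    trim_const_elements(ConstElem *elems, int nelems, int effort_limit)
    {
        int     num_considered = (nelems > effort_limit) ? effort_limit : nelems;

        qsort(elems, nelems, sizeof(ConstElem), cmp_freq_desc);
        return num_considered;
    }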
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 1. I'm still unhappy about the loop that fills the count histogram,
> as I noted earlier today. It at least needs a decent comment and some
> overflow protection, and I'm not entirely convinced that it doesn't have
> more bugs than the overflow issue.

The attached patch is focused on fixing this. The "frac" variable overflow is avoided by making it int64. I hope the comments clarify things. In general this loop copies the behaviour of the histogram-constructing loop in compute_scalar_stats; but instead of an array of values we have an array of unique distinct-element counts (DECs) and their frequencies.

------
With best regards,
Alexander Korotkov.

Attachment
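To make the parallel with compute_scalar_stats easier to see, here is an annotated sketch of the same loop. This is not the committed code: the parallel counts[]/freqs[] arrays stand in for the sorted_count_items[] structs in the patch, bounds checks are omitted, and num_hist >= 2 is assumed.

    #include <stdint.h>

    /*
     * Conceptually we walk a "virtual" sorted array in which each distinct
     * element count appears "frequency" times (so it has analyzed_rows entries
     * in total), and take num_hist samples at evenly spaced positions
     * 0, (analyzed_rows-1)/(num_hist-1), 2*(analyzed_rows-1)/(num_hist-1), ...
     * -- the same scheme compute_scalar_stats applies to its values[] array.
     * All positions are kept scaled by (num_hist - 1) so the arithmetic stays
     * integral; frac is 64-bit to avoid the overflow noted upthread.
     */
    static void
    fill_count_histogram_sketch(const int *counts, const int *freqs,
                                int analyzed_rows, int *hist, int num_hist)
    {
        int64_t delta = analyzed_rows - 1;  /* spacing between samples, scaled */
        int64_t frac;
        int     item = 0;
        int     i;

        frac = (int64_t) freqs[0] * (num_hist - 1);
        for (i = 0; i < num_hist; i++)
        {
            /* Advance until the current item covers the i'th sample position. */
            while (frac <= 0)
            {
                item++;
                frac += (int64_t) freqs[item] * (num_hist - 1);
            }
            hist[i] = counts[item];
            /* Move on by (analyzed_rows-1)/(num_hist-1) virtual positions. */
            frac -= delta;
        }
    }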
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 2. The tests in the above-mentioned message show that in most cases
> where mcelem_array_contained_selec falls through to the "rough
> estimate", the resulting rowcount estimate is just 1, ie we are coming
> out with very small selectivities. Although that path will now only be
> taken when there are no stats, it seems like we'd be better off to
> return DEFAULT_CONTAIN_SEL instead of what it's doing. I think there
> must be something wrong with the "rough estimate" logic. Could you
> recheck that?
I think the wrong thing with the "rough estimate" is that the assumption of independent occurrences of items is very unsuitable even for a rough estimate. The following example shows that the "rough estimate" really does work when item occurrences are independent.

Generate a test table where item occurrences are really independent.
test=# create table test as select ('{'||(select string_agg(s,',') from (select case when (t*0 + random()) < 0.1 then i::text else null end from generate_series(1,100) i) as x(s))||'}')::int[] AS val from generate_series(1,10000) t;
SELECT 10000
test=# analyze test;
ANALYZE
Run a test query.
test=# explain analyze select * from test where val <@ array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..239.00 rows=151 width=61) (actual time=0.325..32.556 rows=163 loops=1)
Filter: (val <@ '{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
Rows Removed by Filter: 9837
Total runtime: 32.806 ms
(4 rows)
Delete DECHIST statistics.
test=# update pg_statistic set stakind1 = 0, staop1 = 0, stanumbers1 = null, stavalues1 = null where starelid = (select oid from pg_class where relname = 'test') and stakind1 = 5;
UPDATE 0
test=# update pg_statistic set stakind2 = 0, staop2 = 0, stanumbers2 = null, stavalues2 = null where starelid = (select oid from pg_class where relname = 'test') and stakind2 = 5;
UPDATE 0
test=# update pg_statistic set stakind3 = 0, staop3 = 0, stanumbers3 = null, stavalues3 = null where starelid = (select oid from pg_class where relname = 'test') and stakind3 = 5;
UPDATE 0
test=# update pg_statistic set stakind4 = 0, staop4 = 0, stanumbers4 = null, stavalues4 = null where starelid = (select oid from pg_class where relname = 'test') and stakind4 = 5;
UPDATE 1
test=# update pg_statistic set stakind5 = 0, staop5 = 0, stanumbers5 = null, stavalues5 = null where starelid = (select oid from pg_class where relname = 'test') and stakind5 = 5;
UPDATE 0
Run the same query again.
test=# explain analyze select * from test where val <@ array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..239.00 rows=148 width=61) (actual time=0.332..32.952 rows=163 loops=1)
Filter: (val <@ '{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
Rows Removed by Filter: 9837
Total runtime: 33.225 ms
(4 rows)
In this particular case the "rough estimate" is quite accurate. But in most cases it behaves really badly. That is why I started to invent calc_distr and so on. So, I think returning DEFAULT_CONTAIN_SEL is OK unless we have some better ideas.
With best regards,
Alexander Korotkov.
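For the record, the arithmetic behind why this example comes out right under the independence assumption (a back-of-the-envelope reading, not the exact code path): each of the 100 candidate items appears in a row with probability 0.1, so a row satisfies val <@ const exactly when none of the 40 items outside the constant appears, i.e. with probability (1 - 0.1)^40, roughly 0.0148, or about 148 of 10000 rows. That lines up with the 151 and 148 row estimates and the 163 actual rows above. A tiny check of that arithmetic:

    #include <math.h>
    #include <stdio.h>

    /*
     * Back-of-the-envelope check of the independence-based estimate for the
     * example above: 100 candidate items, each present independently with
     * probability 0.1, constant array covering 60 of them, 10000 rows.
     */
    int
    main(void)
    {
        double  item_freq = 0.1;
        int     items_outside_const = 100 - 60;
        double  sel = pow(1.0 - item_freq, items_outside_const);

        printf("selectivity = %.4f, expected rows = %.0f\n",
               sel, sel * 10000);       /* prints ~0.0148 and ~148 */
        return 0;
    }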
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. The tests in the above-mentioned message show that in most cases
>> where mcelem_array_contained_selec falls through to the "rough
>> estimate", the resulting rowcount estimate is just 1, ie we are coming
>> out with very small selectivities. Although that path will now only be
>> taken when there are no stats, it seems like we'd be better off to
>> return DEFAULT_CONTAIN_SEL instead of what it's doing. I think there
>> must be something wrong with the "rough estimate" logic. Could you
>> recheck that?

> I think the wrong thing with the "rough estimate" is that the assumption
> of independent occurrences of items is very unsuitable even for a rough
> estimate. The following example shows that the "rough estimate" really
> does work when item occurrences are independent.
> ...
> In this particular case the "rough estimate" is quite accurate. But in
> most cases it behaves really badly. That is why I started to invent
> calc_distr and so on. So, I think returning DEFAULT_CONTAIN_SEL is OK
> unless we have some better ideas.

OK. Looking again at that code, I notice that it also punts and returns DEFAULT_CONTAIN_SEL if it's not given MCELEM stats, which it more or less has to because without even a minfreq the whole calculation is just hot air. And there are no plausible scenarios where compute_array_stats would produce an MCELEM slot but no count histogram. So that says there is no point in sweating over this case, unless you have an idea how to produce useful results without MCELEM. So I think it's sufficient to punt at the top of the function if no histogram, and take out the various attempts to cope with the case.

regards, tom lane
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 1. I'm still unhappy about the loop that fills the count histogram,
>> as I noted earlier today. It at least needs a decent comment and some
>> overflow protection, and I'm not entirely convinced that it doesn't have
>> more bugs than the overflow issue.

> The attached patch is focused on fixing this. The "frac" variable overflow
> is avoided by making it int64. I hope the comments clarify things. In
> general this loop copies the behaviour of the histogram-constructing loop
> in compute_scalar_stats; but instead of an array of values we have an
> array of unique distinct-element counts (DECs) and their frequencies.

OK, I reworked this a bit and committed it. Thanks.

regards, tom lane
BTW, one other thing about the count histogram: seems like we are frequently generating uselessly large ones. For instance, do ANALYZE in the regression database and then run

    select tablename,attname,elem_count_histogram from pg_stats
      where elem_count_histogram is not null;

You get lots of entries that look like this:

 pg_proc  | proallargtypes | {1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,6,6,2.80556}
 pg_proc  | proargmodes    | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.61111}
 pg_proc  | proargnames    | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,8,8,8,14,14,15,16,3.8806}
 pg_proc  | proconfig      | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
 pg_class | reloptions     | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}

which seems to me to be a rather useless expenditure of space. Couldn't we reduce the histogram size when there aren't many different counts?

It seems fairly obvious to me that we could bound the histogram size with (max count - min count + 1), but maybe something even tighter would work; or maybe I'm missing something and this would sacrifice accuracy.

regards, tom lane
On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, one other thing about the count histogram: seems like we are
> frequently generating uselessly large ones. For instance, do ANALYZE
> in the regression database and then run
>
> select tablename,attname,elem_count_histogram from pg_stats
>   where elem_count_histogram is not null;
>
> You get lots of entries that look like this:
> [ ... examples elided ... ]
>
> which seems to me to be a rather useless expenditure of space.
> Couldn't we reduce the histogram size when there aren't many
> different counts?
>
> It seems fairly obvious to me that we could bound the histogram
> size with (max count - min count + 1), but maybe something even
> tighter would work; or maybe I'm missing something and this would
> sacrifice accuracy.

True. If (max count - min count + 1) is small, enumerating the frequencies is both a more compact and a more precise representation. Simultaneously, if (max count - min count + 1) is large, we can run out of statistics_target with such a representation. We could use the same representation of the count distribution as for scalar column values: MCV and HISTOGRAM, but it would require an additional statkind and statistics slot. Probably you have better ideas?

------
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Couldn't we reduce the histogram size when there aren't many
>> different counts?
>>
>> It seems fairly obvious to me that we could bound the histogram
>> size with (max count - min count + 1), but maybe something even
>> tighter would work; or maybe I'm missing something and this would
>> sacrifice accuracy.

> True. If (max count - min count + 1) is small, enumerating the frequencies
> is both a more compact and a more precise representation. Simultaneously,
> if (max count - min count + 1) is large, we can run out of
> statistics_target with such a representation. We could use the same
> representation of the count distribution as for scalar column values: MCV
> and HISTOGRAM, but it would require an additional statkind and statistics
> slot. Probably you have better ideas?

I wasn't thinking of introducing two different representations, but just trimming the histogram length when it's larger than necessary.

On reflection my idea above is wrong; for example assume that we have a column with 900 arrays of length 1 and 100 arrays of length 2. Going by what I said, we'd reduce the histogram to {1,2}, which might accurately capture the set of lengths present but fails to show that 1 is much more common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries) would capture the situation perfectly in one-tenth the space that the current logic does.

More generally, by limiting the histogram to statistics_target entries, we are already accepting errors of up to 1/(2*statistics_target) in the accuracy of the bin-boundary representation. What the above example shows is that sometimes we could meet the same accuracy requirement with fewer entries. I'm not sure how this could be mechanized but it seems worth thinking about.

regards, tom lane
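To spell out the arithmetic of that example (purely an illustration of the quantile sampling, with made-up variable names, not code from the patch):

    #include <stdio.h>

    /*
     * Tom's example, worked numerically: 900 rows with distinct-element
     * count 1 and 100 rows with count 2.  Sampling num_hist quantile
     * positions from the (conceptually sorted) 1000 counts gives the
     * histogram bounds.  With num_hist = 10 we get {1,1,1,1,1,1,1,1,1,2};
     * with num_hist = 101 (statistics_target + 1) we would get ninety-one
     * 1s and ten 2s, which is no more accurate, just ten times larger.
     */
    int
    main(void)
    {
        int     total_rows = 1000;
        int     rows_with_count_1 = 900;
        int     num_hist = 10;
        int     i;

        for (i = 0; i < num_hist; i++)
        {
            /* position of the i'th sample in the sorted virtual array */
            int pos = i * (total_rows - 1) / (num_hist - 1);
            int bound = (pos < rows_with_count_1) ? 1 : 2;

            printf("%d%s", bound, (i == num_hist - 1) ? "\n" : ",");
        }
        return 0;
    }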
On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > True. If (max count - min count + 1) is small, enumerating the frequencies
> > is both a more compact and a more precise representation. Simultaneously,
> > if (max count - min count + 1) is large, we can run out of
> > statistics_target with such a representation. We could use the same
> > representation of the count distribution as for scalar column values: MCV
> > and HISTOGRAM, but it would require an additional statkind and statistics
> > slot. Probably you have better ideas?
>
> I wasn't thinking of introducing two different representations,
> but just trimming the histogram length when it's larger than necessary.
>
> On reflection my idea above is wrong; for example assume that we have a
> column with 900 arrays of length 1 and 100 arrays of length 2. Going by
> what I said, we'd reduce the histogram to {1,2}, which might accurately
> capture the set of lengths present but fails to show that 1 is much more
> common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> would capture the situation perfectly in one-tenth the space that the
> current logic does.

Granted. When the next sample finds 899/101 instead, though, the optimization vanishes. You save 90% of the space, perhaps 10% of the time.

If you want to materially narrow typical statistics, Alexander's proposal looks like the way to go. I'd guess array columns always having DEC <= default_statistics_target are common enough to make that representation the dominant representation, if not the only necessary representation.
Noah Misch <noah@leadboat.com> writes:
> On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
>> On reflection my idea above is wrong; for example assume that we have a
>> column with 900 arrays of length 1 and 100 arrays of length 2. Going by
>> what I said, we'd reduce the histogram to {1,2}, which might accurately
>> capture the set of lengths present but fails to show that 1 is much more
>> common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
>> would capture the situation perfectly in one-tenth the space that the
>> current logic does.

> Granted. When the next sample finds 899/101 instead, though, the optimization
> vanishes.

No, you missed my next point. That example shows that sometimes a smaller histogram can represent the situation with zero error, but in all cases a smaller histogram can represent the situation with perhaps more error than a larger one. Since we already have a defined error tolerance, we should try to generate a histogram that is as small as possible while still not exceeding the error tolerance.

Now, it might be that doing that is computationally impractical, or too complicated to be reasonable. But it seems to me to be worth looking into.

> If you want to materially narrow typical statistics, Alexander's
> proposal looks like the way to go. I'd guess array columns always
> having DEC <= default_statistics_target are common enough to make that
> representation the dominant representation, if not the only necessary
> representation.

Well, I don't want to have two representations; I don't think it's worth the complexity. But certainly we could consider switching to a different representation if it seems likely to usually be smaller.

regards, tom lane
On Thu, Mar 08, 2012 at 11:30:52AM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
> >> On reflection my idea above is wrong; for example assume that we have a
> >> column with 900 arrays of length 1 and 100 arrays of length 2. Going by
> >> what I said, we'd reduce the histogram to {1,2}, which might accurately
> >> capture the set of lengths present but fails to show that 1 is much more
> >> common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> >> would capture the situation perfectly in one-tenth the space that the
> >> current logic does.
>
> > Granted. When the next sample finds 899/101 instead, though, the optimization
> > vanishes.
>
> No, you missed my next point. That example shows that sometimes a
> smaller histogram can represent the situation with zero error, but in
> all cases a smaller histogram can represent the situation with perhaps
> more error than a larger one. Since we already have a defined error
> tolerance, we should try to generate a histogram that is as small as
> possible while still not exceeding the error tolerance.
>
> Now, it might be that doing that is computationally impractical, or
> too complicated to be reasonable. But it seems to me to be worth
> looking into.

Yes, I did miss your point. One characteristic favoring this approach is its equal applicability to both STATISTIC_KIND_HISTOGRAM and STATISTIC_KIND_DECHIST.

> > If you want to materially narrow typical statistics, Alexander's
> > proposal looks like the way to go. I'd guess array columns always
> > having DEC <= default_statistics_target are common enough to make that
> > representation the dominant representation, if not the only necessary
> > representation.
>
> Well, I don't want to have two representations; I don't think it's worth
> the complexity. But certainly we could consider switching to a
> different representation if it seems likely to usually be smaller.

Perhaps some heavy array users could provide input: what are some typical length ranges among arrays in your applications on which you use "arr && const", "arr @> const" or "arr <@ const" searches?
On Thu, Mar 8, 2012 at 4:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
>> True. If (max count - min count + 1) is small, enumerating the frequencies
>> is both a more compact and a more precise representation. Simultaneously,
>> if (max count - min count + 1) is large, we can run out of
>> statistics_target with such a representation. We could use the same
>> representation of the count distribution as for scalar column values: MCV
>> and HISTOGRAM, but it would require an additional statkind and statistics
>> slot. Probably you have better ideas?
>
> I wasn't thinking of introducing two different representations,
> but just trimming the histogram length when it's larger than necessary.
>
> On reflection my idea above is wrong; for example assume that we have a
> column with 900 arrays of length 1 and 100 arrays of length 2. Going by
> what I said, we'd reduce the histogram to {1,2}, which might accurately
> capture the set of lengths present but fails to show that 1 is much more
> common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> would capture the situation perfectly in one-tenth the space that the
> current logic does.
>
> More generally, by limiting the histogram to statistics_target entries,
> we are already accepting errors of up to 1/(2*statistics_target) in the
> accuracy of the bin-boundary representation. What the above example
> shows is that sometimes we could meet the same accuracy requirement with
> fewer entries. I'm not sure how this could be mechanized but it seems
> worth thinking about.

I can propose the following representation of the histogram.

If (max_count - min_count + 1) <= statistics_target then
1) store max_count and min_count in stavalues
2) store the frequencies from min_count to max_count in stanumbers

If (max_count - min_count + 1) > statistics_target then store the histogram in the current manner. I think in this case it's unlikely that there will be many repeating values.

I can propose a patch which changes the histogram representation to this. Comments?

------
With best regards,
Alexander Korotkov.
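A sketch of what that alternative encoding could look like. Nothing here is from an actual patch; the names and layout are invented purely to illustrate the proposal.

    #include <stdbool.h>

    /*
     * Illustrative encoder for the proposed DECHIST alternative: when the
     * range of distinct-element counts is narrow, store the two boundary
     * counts plus one frequency per count value; otherwise the caller keeps
     * the existing quantile histogram (not shown).  Invented layout:
     *   stavalues  = {min_count, max_count}
     *   stanumbers = {freq_of_min_count, ..., freq_of_max_count}
     */
    static bool
    encode_count_frequencies(const int *counts,       /* distinct counts, sorted ascending */
                             const double *freqs,     /* frequency of each distinct count */
                             int n_distinct,
                             int statistics_target,
                             int stavalues_out[2],
                             double *stanumbers_out   /* max_count-min_count+1 slots */)
    {
        int     min_count = counts[0];
        int     max_count = counts[n_distinct - 1];
        int     width = max_count - min_count + 1;
        int     i;

        if (width > statistics_target)
            return false;               /* fall back to the current histogram */

        stavalues_out[0] = min_count;
        stavalues_out[1] = max_count;
        for (i = 0; i < width; i++)
            stanumbers_out[i] = 0.0;    /* counts absent from the sample */
        for (i = 0; i < n_distinct; i++)
            stanumbers_out[counts[i] - min_count] = freqs[i];
        return true;
    }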