Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column - Mailing list pgsql-performance

From Tom Lane
Subject Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column
Date
Msg-id 1751511.1749259794@sss.pgh.pa.us
Whole thread Raw
In response to RE: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column  (Mark Frost <FROSTMAR@uk.ibm.com>)
List pgsql-performance
Mark Frost <FROSTMAR@uk.ibm.com> writes:
> Actually *any* most_common_elems stats would be fine, because the reasoning is:
>   *   If the searched element is in most_common_elems we know it's frequency
>   *   If it's not, it's less frequent than the least most_common_elems
> So in our case when every row is unique, we'd only actually need stats to record a single most_common_elems (if only
itwould record one) 

Well, we don't have a most common element in this scenario --- the
whole point is that the occurrence counts resulting from the lossy
counting algorithm are too low to be trustworthy.  However, what we
do have is the cutoff frequency, and it seems to me that we could use
that as the estimate of the maximum frequency of the non-MCEs.

Attached is an extremely crude prototype patch for that.  What it
does is to create an MCELEM stats entry with zero MCEs, but containing
min and max frequencies equal to the cutoff frequency (plus the nulls
frequency, which we know accurately in any case).  Mark, this fixes
your example case, but I wonder if it fixes your original problem ---
are you in a position to test it?

Assuming we like this direction, the main thing that makes this a hack
not a polished patch is that I had to strongarm the code into storing
a zero-length values array.  What update_attstats would normally do
is leave the values column of the MCELEM stats slot NULL, which then
causes get_attstatsslot to throw a that-shouldn't-be-null error.
An alternative answer is to change get_attstatsslot to allow a null,
but I'm not sure that that's any cleaner.  Either way it seems like
there's a hazard of breaking some code that isn't expecting the case.

An alternative that feels cleaner but a good bit more invasive
is to get rid of the convention of storing the min/max/nulls
frequencies as extra entries in the MCELEM numbers entry ---
which surely is a hack too --- and put them into some new slot
type.  I'm not sure if that's enough nicer to be worth the
conversion pain.

Thoughts?

            regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..c312de7d593 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1733,7 +1733,10 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
         i = Anum_pg_statistic_stavalues1 - 1;
         for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
         {
-            if (stats->numvalues[k] > 0)
+            /* XXX ugly hack to allow zero-length MCELEM values arrays */
+            if (stats->numvalues[k] > 0 ||
+                (stats->numvalues[k] == 0 &&
+                 stats->stakind[k] == STATISTIC_KIND_MCELEM))
             {
                 ArrayType  *arry;

diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 6f61629b977..9c6dd2852ab 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -497,6 +497,15 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
          * If we obtained more elements than we really want, get rid of those
          * with least frequencies.  The easiest way is to qsort the array into
          * descending frequency order and truncate the array.
+         *
+         * On the other extreme, we might have found no elements with
+         * frequency above the cutoff.  Then there's nothing to store so far
+         * as element values go, but we still want to record the cutoff
+         * frequency, since array_selfuncs.c can use that as an upper bound on
+         * the frequency of non-MCVs.  (Note: what we store is the minimum
+         * frequency that would have been accepted as a valid MCE.  In
+         * particular, this ensures we don't create a bogus stats entry with
+         * frequency zero.)
          */
         if (num_mcelem < track_len)
         {
@@ -506,10 +515,14 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
             minfreq = sort_table[num_mcelem - 1]->frequency;
         }
         else
+        {
             num_mcelem = track_len;
+            if (track_len == 0)
+                minfreq = maxfreq = cutoff_freq + 1;
+        }

         /* Generate MCELEM slot entry */
-        if (num_mcelem > 0)
+        if (num_mcelem >= 0)
         {
             MemoryContext old_context;
             Datum       *mcelem_values;

pgsql-performance by date:

Previous
From: Mark Frost
Date:
Subject: RE: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column