Thread: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column

Hello all,

 

We’re intermittently seeing very poor performance of a query, on the occasions when a poor query plan is chosen. We’re using Postgres 16.9.

One suspicious factor in the EXPLAIN ANALYZE output is a badly wrong estimate of the number of rows to be returned from a text[] column queried with ‘&&’.

 

After playing around with a simple recreate (details below), it seems ANALYZE of the table is affected by the number of rows in the table. The `most_common_elems` statistic is [null] when there are 15,873 or more rows in the table at the time it is analyzed. With fewer rows it’s analyzed correctly.

 

Is there any good explanation for this behaviour? Preferably we’d like some way to get proper `most_common_elems` statistics collected in our production database, in the hope that this influences the planner to always select a good plan.

 

In our production system there are ~150,000 rows in a table with a `text[]` column, where each row’s array contains a single ~19-character string that is unique within the table. The full query joins against a couple more tables, and there is a GIN index on the text[] column (sketched below the recreate). If necessary, I can get into details of the real system, but hope the simple recreate will be sufficient to understand the problem:

 

 

 

CREATE TABLE IF NOT EXISTS public.test(

    id SERIAL PRIMARY KEY,

    tags text[]

);

 

INSERT INTO public.test (tags)

        SELECT ARRAY[TO_CHAR(n,'fm00000000')] FROM ( SELECT generate_series(1,15_873) AS n );

 

ANALYZE public.test;

 

SELECT * FROM pg_stat_user_tables WHERE relname = 'test';
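
The most_common_elems / most_common_elem_freqs values quoted in the results below come from the pg_stats view; they can be checked with something like:

SELECT most_common_elems, most_common_elem_freqs
FROM pg_stats
WHERE tablename = 'test' AND attname = 'tags';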

 

EXPLAIN (ANALYZE,BUFFERS,VERBOSE)

        SELECT * FROM test WHERE tags && ARRAY['00000002'];
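
For completeness, the GIN index in the production system is essentially the following (index name made up here); it isn't needed to reproduce the misestimate, which shows up on a plain seq scan:

CREATE INDEX test_tags_gin ON public.test USING gin (tags);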

 

 

 

 

Results

-------

table with 15_000 rows has most_common_elems after ANALYZE (most_common_elem_freqs : 6.666667e-05)

table with 15_872 rows has most_common_elems after ANALYZE (most_common_elem_freqs : 6.300403e-05)

table with 15_873 rows has [null] most_common_elems after ANALYZE

table with 100_000 rows has [null] most_common_elems after ANALYZE 

 

 

 

Query plans estimate 1 row when `most_common_elems` statistics are available, or fall back to the hardcoded default of 1/200 of the estimated table size when most_common_elems is null.

Here, 79 rows are estimated because the table contained 15,873 rows and the stats weren’t available.
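(That matches the 1/200 default: 15,873 / 200 ≈ 79.)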

 

Query plan

-----------

Seq Scan on public.test  (cost=0.00..463.41 rows=79 width=37) (actual time=9.934..17.190 rows=1 loops=1)

  Output: id, tags

  Filter: (test.tags && '{00000002}'::text[])

  Rows Removed by Filter: 15872

  Buffers: shared hit=268

Planning:

  Buffers: shared hit=75

Planning Time: 2.060 ms

Execution Time: 17.205 ms

 

 

Full version

------------

"PostgreSQL 16.9 (Debian 16.9-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit"

 

 

Regards,

Mark Frost

IBM


On 6/5/25 17:42, Mark Frost wrote:
> Is there any good explanation for this behaviour? Preferably we’d like
> some way to get proper `most_common_elems` statistics collected in our
> production database, in the hope that this influences the planner to
> always select a good plan.


most_common_elems has a limited size, and if all the elements have the 
same freq, there's nothing we can do.

You could do: alter table test alter column tags set statistics X;

However, X is capped at 10000, which means that the size of 
most_common_elems will be less than 100k, and it would probably be 
stupid to go beyond that anyway.
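
For example (remembering to re-run ANALYZE so the new target takes effect):

alter table test alter column tags set statistics 10000;
analyze test;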

It seems that postgres lacks some kind of "n_distinct_elems" for that 
kind of case, but let's wait and see what the statistics gurus think.



Mark Frost <FROSTMAR@uk.ibm.com> writes:
> We're intermittently seeing very poor performance of a query, on the
> occasions when a poor query plan is chosen. We're using Postgres 16.9.
> One suspicious factor in the EXPLAIN ANALYZE output is a badly wrong
> estimate of the number of rows to be returned from a text[] column
> queried with '&&'.
> After playing around with a simple recreate (details below), it seems
> ANALYZE of the table is affected by the number of rows in the table.
> The `most_common_elems` statistic is [null] when there are 15,873 or
> more rows in the table at the time it is analyzed. With fewer rows it's
> analyzed correctly.

Thanks for the report.  Poking through the code, it seems like
there are two distinct problems here:

1. ANALYZE uses a "lossy counting" algorithm (dating to commit
0e5e167aa) to estimate the frequencies of array element values.
The part of that that seems to be going off the rails is
this selection of a cutoff frequency below which element values
will be dropped:

        /*
         * Construct an array of the interesting hashtable items, that is,
         * those meeting the cutoff frequency (s - epsilon)*N.  Also identify
         * the minimum and maximum frequencies among these items.
         *
         * Since epsilon = s/10 and bucket_width = 1/epsilon, the cutoff
         * frequency is 9*N / bucket_width.
         */
        cutoff_freq = 9 * element_no / bucket_width;
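
(For concreteness, if I've got the constants right: with the default
attstattarget of 100, num_mcelem = 100 * 10 = 1000 and
bucket_width = num_mcelem * 1000 / 7 = 142857, so this cutoff first
becomes nonzero at element_no = 142857 / 9 = 15873.)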

The first thing I find suspicious here is that the calculation is
based on element_no (the total number of array elements processed)
and not nonnull_cnt (the maximum possible frequency).  Is that
really right?  It wouldn't change the results in your reproducer
with just one element per array, but it seems bogus.

More relevant to your immediate problem, this creates a behavioral
cliff at the point where cutoff_freq rises from 0 to 1, which with the
default attstattarget turns out to be, you guessed it, 15873 elements.
In your example, all the element values have frequency 1, so that
switches us from being willing to record all the values to being
willing to record none of them.  That doesn't feel right either.
By analogy to our treatment of regular MCVs, it seems like we should
never be willing to store values that we didn't see at least twice.
Of course, doing that would make this example worse, because then
we'd not store any "most common" elements at smaller rowcounts either.
Which brings us to the other major problem this reveals:

2. In array_selfuncs.c, we fall back to default selectivity
estimates if there's no MCELEM statistics field.  What we
should be doing, if there are other stats for the column but
not MCELEM, is realizing that compute_array_stats did not find
any elements that are common enough to be worth recording.
Then we'd use some much lower-than-default estimate for the
selectivity.

I don't have any immediate patch to offer, but clearly this
area could use another look.

compute_array_stats seems to have borrowed the lossy-counting
algorithm from ts_typanalyze, so we'd better take a look at
that too.

            regards, tom lane



I wrote:
> The part of that that seems to be going off the rails is
> this selection of a cutoff frequency below which element values
> will be dropped:

>         cutoff_freq = 9 * element_no / bucket_width;

> The first thing I find suspicious here is that the calculation is
> based on element_no (the total number of array elements processed)
> and not nonnull_cnt (the maximum possible frequency).  Is that
> really right?

I did some more digging and found that that calculation was introduced
(in the older tsvector code) in bc0f08092, which traces to this
discussion:

https://www.postgresql.org/message-id/flat/4BF4357E.6000505%40krogh.cc

So the use of element_no is correct, because what we need to consider
here is the total number of values fed to the LC algorithm.

Also, my thought that maybe we should reject entries with f < 2
is bogus, because at the end of the algorithm f is not necessarily
the true count of occurrences of the value: some early occurrences
could have been forgotten via pruning.  The "behavioral cliff" is
annoying but I'm not sure there is much to be done about it: having
a single (still-remembered) occurrence gets less and less significant
as the total input size increases, so sooner or later you are going
to hit a point where such values should be thrown away.

So at this point I'm thinking that there is nothing wrong with
ANALYZE's algorithm, although I now see that there are some relevant
comments in ts_typanalyze.c that probably ought to be transposed into
array_typanalyze.c.

The idea of treating lack of MCELEM differently from complete
lack of stats still seems to have merit, though.

            regards, tom lane




On 6/5/25 23:52, Tom Lane wrote:
> The idea of treating lack of MCELEM differently from complete
> lack of stats still seems to have merit, though.


Couldn't we count or estimate the number of distinct elements, and use 
that instead of the default selectivity estimate?




> On 6/5/25 17:42, Mark Frost wrote:
> > Is there any good explanation for this behaviour? Preferably we’d like
> > some way to get proper `most_common_elems` statistics collected in our
> > production database, in the hope that this influences the planner to
> > always select a good plan.


> most_common_elems has a limited size, and if all the elements have the
> same freq, there's nothing we can do.
>
> You could do: alter table test alter column tags set statistics X;
> However, X is capped at 10000…

Actually *any* most_common_elems stats would be fine, because the reasoning is:

  • If the searched element is in most_common_elems, we know its frequency
  • If it’s not, it’s less frequent than the least frequent most_common_elems entry

 

So in our case, where every row is unique, we’d only actually need the stats to record a single most_common_elems entry (if only it would record one).

Mark Frost <FROSTMAR@uk.ibm.com> writes:
> Actually *any* most_common_elems stats would be fine, because the reasoning is:
>   *   If the searched element is in most_common_elems, we know its frequency
>   *   If it's not, it's less frequent than the least frequent most_common_elems entry
> So in our case, where every row is unique, we'd only actually need the stats to
> record a single most_common_elems entry (if only it would record one)

Well, we don't have a most common element in this scenario --- the
whole point is that the occurrence counts resulting from the lossy
counting algorithm are too low to be trustworthy.  However, what we
do have is the cutoff frequency, and it seems to me that we could use
that as the estimate of the maximum frequency of the non-MCEs.

Attached is an extremely crude prototype patch for that.  What it
does is to create an MCELEM stats entry with zero MCEs, but containing
min and max frequencies equal to the cutoff frequency (plus the nulls
frequency, which we know accurately in any case).  Mark, this fixes
your example case, but I wonder if it fixes your original problem ---
are you in a position to test it?

Assuming we like this direction, the main thing that makes this a hack
not a polished patch is that I had to strongarm the code into storing
a zero-length values array.  What update_attstats would normally do
is leave the values column of the MCELEM stats slot NULL, which then
causes get_attstatsslot to throw a that-shouldn't-be-null error.
An alternative answer is to change get_attstatsslot to allow a null,
but I'm not sure that that's any cleaner.  Either way it seems like
there's a hazard of breaking some code that isn't expecting the case.

An alternative that feels cleaner but a good bit more invasive
is to get rid of the convention of storing the min/max/nulls
frequencies as extra entries in the MCELEM numbers entry ---
which surely is a hack too --- and put them into some new slot
type.  I'm not sure if that's enough nicer to be worth the
conversion pain.

Thoughts?

            regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..c312de7d593 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1733,7 +1733,10 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
         i = Anum_pg_statistic_stavalues1 - 1;
         for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
         {
-            if (stats->numvalues[k] > 0)
+            /* XXX ugly hack to allow zero-length MCELEM values arrays */
+            if (stats->numvalues[k] > 0 ||
+                (stats->numvalues[k] == 0 &&
+                 stats->stakind[k] == STATISTIC_KIND_MCELEM))
             {
                 ArrayType  *arry;

diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 6f61629b977..9c6dd2852ab 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -497,6 +497,15 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
          * If we obtained more elements than we really want, get rid of those
          * with least frequencies.  The easiest way is to qsort the array into
          * descending frequency order and truncate the array.
+         *
+         * On the other extreme, we might have found no elements with
+         * frequency above the cutoff.  Then there's nothing to store so far
+         * as element values go, but we still want to record the cutoff
+         * frequency, since array_selfuncs.c can use that as an upper bound on
+         * the frequency of non-MCVs.  (Note: what we store is the minimum
+         * frequency that would have been accepted as a valid MCE.  In
+         * particular, this ensures we don't create a bogus stats entry with
+         * frequency zero.)
          */
         if (num_mcelem < track_len)
         {
@@ -506,10 +515,14 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
             minfreq = sort_table[num_mcelem - 1]->frequency;
         }
         else
+        {
             num_mcelem = track_len;
+            if (track_len == 0)
+                minfreq = maxfreq = cutoff_freq + 1;
+        }

         /* Generate MCELEM slot entry */
-        if (num_mcelem > 0)
+        if (num_mcelem >= 0)
         {
             MemoryContext old_context;
             Datum       *mcelem_values;