Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column - Mailing list pgsql-performance

From: Matt Long
Subject: Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column
Msg-id: CAB2=oSHo0cjK-BktYgFL-mDHCov125my4XhvQRJzZo2VBBz=eA@mail.gmail.com
In response to: Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance
Not to let perfect be the enemy of better, but we're facing a variant of this issue that would not be addressed by the proposed patch. If you're interested, the real world use case is described in https://github.com/medplum/medplum/issues/7310. Rows have an array consisting of one common element present in every row and one unique element. With a statistics target at the default 100, this pushes the threshold for vastly overestimated row counts for the unique elements down to 7,937 rows.
In this case, the effects of the proposed patch are not applied, since the most_common_elems array is not empty. I'm not a statistician, so maybe this wouldn't be valid, but it seems like using the highest frequency of an element that did not qualify for the mce list, instead of the 0.5% default frequency, could be an elegant, if more invasive, solution.
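For context, here is a sketch (in Python, paraphrasing the non-MCE fallback logic and the DEFAULT_CONTAIN_SEL = 0.005 constant from PostgreSQL's array_selfuncs.c; not a drop-in reimplementation) of why the estimate degrades once only {common} survives in most_common_elems:

```python
# Sketch of how the planner estimates selectivity for an array element
# that is absent from most_common_elems (paraphrased from PostgreSQL's
# array_selfuncs.c; illustrative only).
DEFAULT_CONTAIN_SEL = 0.005  # the 0.5% default frequency

def non_mce_selectivity(min_mce_freq):
    # An element not in the MCE list is assumed to be no more frequent
    # than half the smallest recorded MCE frequency, capped at the default.
    return min(DEFAULT_CONTAIN_SEL, min_mce_freq / 2)

n_rows = 7937
# Below the cliff: 1,000 MCEs survive, min MCE freq ~= 0.000126,
# so a unique tag is estimated at ~1 row.
est_good = non_mce_selectivity(0.000126) * n_rows
# At the cliff: only 'common' (freq 1.0) survives, the cap kicks in,
# and every unique tag falls back to 0.5%, ~40 rows here.
est_bad = non_mce_selectivity(1.0) * n_rows
```

With a well-populated MCE list the minfreq/2 bound keeps unique-tag estimates near 1 row; once the list collapses to {common}, minfreq is 1.0, the bound is useless, and the 0.5% default produces the overestimates described above.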
Simple repro of the problem:
CREATE TABLE public.test(
id integer,
tags text[]
);
INSERT INTO public.test (id, tags)
SELECT n, ARRAY['common', TO_CHAR(n,'fm00000000')] FROM ( SELECT generate_series(1,7_937) AS n );
ANALYZE public.test;
SELECT most_common_elems, most_common_elem_freqs FROM pg_stats WHERE schemaname = 'public' AND tablename = 'test' AND attname = 'tags';
EXPLAIN (ANALYZE,BUFFERS,VERBOSE)
SELECT * FROM test WHERE tags && ARRAY['common'];
EXPLAIN (ANALYZE,BUFFERS,VERBOSE)
SELECT * FROM test WHERE tags && ARRAY['00000001'];
With 7,936 rows:
most_common_elems has 1,000 entries and looks like {00000001,00000003,00000009,...,common}
most_common_elem_freqs has 1,003 entries and looks like {0.00012600806,0.00012600806,0.00012600806,...,1,0.00012600806,1,0}
With 7,937+ rows:
most_common_elems is {common}
most_common_elem_freqs is {1,1,1,0}
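The cliff comes from the pruning passes of the lossy counting algorithm that ANALYZE uses to track candidate most-common elements. A generic sketch (Manku-Motwani lossy counting; PostgreSQL's array_typanalyze.c uses a variant with a different bucket width and cutoff handling, so the exact numbers are illustrative only):

```python
# Generic lossy counting: track (count, delta) per element, and at each
# bucket boundary prune entries whose count + delta no longer exceeds
# the current bucket number.
def lossy_count(stream, bucket_width):
    counts, bucket = {}, 1
    for i, elem in enumerate(stream, start=1):
        cnt, delta = counts.get(elem, (0, bucket - 1))
        counts[elem] = (cnt + 1, delta)
        if i % bucket_width == 0:  # end of bucket: prune low counters
            counts = {e: (c, d) for e, (c, d) in counts.items()
                      if c + d > bucket}
            bucket += 1
    return counts

# Mimic the repro: every row contributes 'common' plus one unique tag.
stream = []
for n in range(1, 7938):
    stream += ['common', f'{n:08d}']
result = lossy_count(stream, bucket_width=1000)
```

Each unique tag is seen once, so its count plus insertion delta never exceeds the current bucket number and it is discarded at the next prune; only the tags seen after the final pruning pass survive, while 'common' retains its exact count. That is why, past a size threshold tied to the statistics target, the unique elements' counts are "too low to be trustworthy" and the MCE list collapses to {common}.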
On Mon, Sep 8, 2025 at 11:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> Well, we don't have a most common element in this scenario --- the
> whole point is that the occurrence counts resulting from the lossy
> counting algorithm are too low to be trustworthy. However, what we
> do have is the cutoff frequency, and it seems to me that we could use
> that as the estimate of the maximum frequency of the non-MCEs.
Here's a less crude patch for that. The array_typanalyze.c changes
are the same as before, but I reconsidered what to do about this
stumbling block:
> Assuming we like this direction, the main thing that makes this a hack
> not a polished patch is that I had to strongarm the code into storing
> a zero-length values array. What update_attstats would normally do
> is leave the values column of the MCELEM stats slot NULL, which then
> causes get_attstatsslot to throw a that-shouldn't-be-null error.
> An alternative answer is to change get_attstatsslot to allow a null,
> but I'm not sure that that's any cleaner. Either way it seems like
> there's a hazard of breaking some code that isn't expecting the case.
I concluded that letting get_attstatsslot accept nulls is too risky;
there is probably code that assumes that get_attstatsslot would've
rejected that. Instead, this version changes update_attstats' rule
for when to store an array rather than a null. Now it will do so
if the passed-in stavalues pointer isn't null, even if numvalues
is zero. I think that this doesn't change the outcome for any
existing analyze routines because they wouldn't pass a data pointer
if they have no values; and even if it does, storing an empty array
not a null seems like it should be pretty harmless.
> An alternative that feels cleaner but a good bit more invasive
> is to get rid of the convention of storing the min/max/nulls
> frequencies as extra entries in the MCELEM numbers entry ---
> which surely is a hack too --- and put them into some new slot
> type. I'm not sure if that's enough nicer to be worth the
> conversion pain.
A year ago I would have seriously considered doing it that way.
But now that we have code to dump-n-restore stats, that code would
have to be adjusted to convert the old representation. It's not
worth it for this case.
Hence, v1 attached, now with a commit message.
regards, tom lane