Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column - Mailing list pgsql-performance
From | Tom Lane |
---|---|
Subject | Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column |
Date | |
Msg-id | 1751511.1749259794@sss.pgh.pa.us Whole thread Raw |
In response to | RE: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column (Mark Frost <FROSTMAR@uk.ibm.com>) |
List | pgsql-performance |
Mark Frost <FROSTMAR@uk.ibm.com> writes: > Actually *any* most_common_elems stats would be fine, because the reasoning is: > * If the searched element is in most_common_elems we know it's frequency > * If it's not, it's less frequent than the least most_common_elems > So in our case when every row is unique, we'd only actually need stats to record a single most_common_elems (if only itwould record one) Well, we don't have a most common element in this scenario --- the whole point is that the occurrence counts resulting from the lossy counting algorithm are too low to be trustworthy. However, what we do have is the cutoff frequency, and it seems to me that we could use that as the estimate of the maximum frequency of the non-MCEs. Attached is an extremely crude prototype patch for that. What it does is to create an MCELEM stats entry with zero MCEs, but containing min and max frequencies equal to the cutoff frequency (plus the nulls frequency, which we know accurately in any case). Mark, this fixes your example case, but I wonder if it fixes your original problem --- are you in a position to test it? Assuming we like this direction, the main thing that makes this a hack not a polished patch is that I had to strongarm the code into storing a zero-length values array. What update_attstats would normally do is leave the values column of the MCELEM stats slot NULL, which then causes get_attstatsslot to throw a that-shouldn't-be-null error. An alternative answer is to change get_attstatsslot to allow a null, but I'm not sure that that's any cleaner. Either way it seems like there's a hazard of breaking some code that isn't expecting the case. An alternative that feels cleaner but a good bit more invasive is to get rid of the convention of storing the min/max/nulls frequencies as extra entries in the MCELEM numbers entry --- which surely is a hack too --- and put them into some new slot type. I'm not sure if that's enough nicer to be worth the conversion pain. Thoughts? regards, tom lane diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 4fffb76e557..c312de7d593 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -1733,7 +1733,10 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats) i = Anum_pg_statistic_stavalues1 - 1; for (k = 0; k < STATISTIC_NUM_SLOTS; k++) { - if (stats->numvalues[k] > 0) + /* XXX ugly hack to allow zero-length MCELEM values arrays */ + if (stats->numvalues[k] > 0 || + (stats->numvalues[k] == 0 && + stats->stakind[k] == STATISTIC_KIND_MCELEM)) { ArrayType *arry; diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c index 6f61629b977..9c6dd2852ab 100644 --- a/src/backend/utils/adt/array_typanalyze.c +++ b/src/backend/utils/adt/array_typanalyze.c @@ -497,6 +497,15 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc, * If we obtained more elements than we really want, get rid of those * with least frequencies. The easiest way is to qsort the array into * descending frequency order and truncate the array. + * + * On the other extreme, we might have found no elements with + * frequency above the cutoff. Then there's nothing to store so far + * as element values go, but we still want to record the cutoff + * frequency, since array_selfuncs.c can use that as an upper bound on + * the frequency of non-MCVs. (Note: what we store is the minimum + * frequency that would have been accepted as a valid MCE. In + * particular, this ensures we don't create a bogus stats entry with + * frequency zero.) */ if (num_mcelem < track_len) { @@ -506,10 +515,14 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc, minfreq = sort_table[num_mcelem - 1]->frequency; } else + { num_mcelem = track_len; + if (track_len == 0) + minfreq = maxfreq = cutoff_freq + 1; + } /* Generate MCELEM slot entry */ - if (num_mcelem > 0) + if (num_mcelem >= 0) { MemoryContext old_context; Datum *mcelem_values;
pgsql-performance by date:
Previous
From: Mark FrostDate:
Subject: RE: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column