Re: More stable query plans via more predictable column statistics - Mailing list pgsql-hackers

From: Alex Shulgin
Subject: Re: More stable query plans via more predictable column statistics
Date:
Msg-id: CAM-UEKQ6YtvtszT4BGP6aroNb99oW8M=b5GziMU9iPAHRbUMKQ@mail.gmail.com
In response to: Re: More stable query plans via more predictable column statistics ("Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de>)
Responses: Re: More stable query plans via more predictable column statistics (Alex Shulgin <alex.shulgin@gmail.com>)
List: pgsql-hackers
On Sat, Apr 2, 2016 at 8:57 PM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
> On Apr 2, 2016 18:38, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>
>> I did not like the fact that the compute_scalar_stats logic
>> would allow absolutely anything into the MCV list once num_hist falls
>> below 2. I think it's important that we continue to reject values that
>> are only seen once in the sample, because there's no very good reason to
>> think that they are MCVs and not just infrequent values that by luck
>> appeared in the sample.
>
> In my understanding we only put a value in the track list if we've seen it
> at least twice, no?

This is actually the case for compute_scalar_stats, but not for compute_distinct_stats.  In the latter we can still have track[i].count == 1, but we break out of the loop as soon as we encounter the first such tracked item.
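
For illustration, here is roughly the shape of that cutoff loop (a simplified sketch, not the literal analyze.c code; names are approximate):

    /* Simplified sketch of the MCV cutoff in compute_distinct_stats:
     * truncate the track list at the first entry whose count falls
     * below the cutoff.  Requiring a count of at least 2 is what
     * rejects values seen only once in the sample. */
    for (i = 0; i < num_mcv; i++)
    {
        if (track[i].count < 2)
        {
            num_mcv = i;        /* keep only the entries before this one */
            break;
        }
    }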

>> Before I noticed the regression failure, I'd been thinking that maybe it'd
>> be better if the decision rule were not "at least 100+x% of the average
>> frequency of this value and later ones", but "at least 100+x% of the
>> average frequency of values after this one".
>
> Hm, sounds pretty similar to what I wanted to achieve, but better
> formalized.
>
>> With that formulation, we're
>> not constrained as to the range of x.  Now, if there are *no* values after
>> this one, then this way needs an explicit special case in order not to
>> compute 0/0; but the preceding para shows that we need a special case for
>> the last value anyway.
>>
>> So, attached is a patch rewritten along these lines.  I used 50% rather
>> than 25% as the new cutoff percentage --- obviously it should be higher
>> in this formulation than before, but I have no idea if that particular
>> number is good or we should use something else.  Also, the rule for the
>> last value is "at least 1% of the non-null samples".  That's a pure guess
>> as well.
>>
>> I do not have any good corpuses of data to try this on.  Can folks who
>> have been following this thread try it on their data and see how it
>> does?  Also please try some other multipliers besides 1.5, so we can
>> get a feeling for where that cutoff should be placed.
>
> Expect me to run it on my pet db early next week. :-)

I was trying to come up with some examples where 50% could be a good or a bad choice, and then I noticed that we might be able to turn it the other way round: instead of inventing an arbitrary limit on the MCVs' frequency, we could use the histogram as the criterion for a candidate MCV being "common enough".  If we can prove that the value would produce duplicates in the histogram, we should rather put it in the MCV list (unless the list is already full, in which case we can't do anything).

A value is guaranteed to produce a duplicate if it appears at least 2*hfreq+1 times in the sample (hfreq as in your version of the patch, recalculated on every loop iteration).  I could produce an updated patch on Monday, or anyone else following this discussion should be able to do the same.
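
To spell out the reasoning: histogram bounds are taken roughly every hfreq sample rows, so a run of 2*hfreq+1 equal values must cover at least two bound positions and would therefore show up as a duplicate bound.  As a sketch (hypothetical names; hfreq as in the patch under discussion):

    /* Sketch of the proposed check, with hypothetical names: hfreq is
     * the average number of sample rows per histogram bin, recalculated
     * as candidate values are moved out of the histogram. */
    hfreq = (double) (nonnull_cnt - mcv_rows) / (double) num_hist;
    if ((double) track[i].count >= 2.0 * hfreq + 1.0)
    {
        /* Guaranteed to produce a duplicate histogram bound, so
         * promote the value to the MCV list instead, if there is room. */
        keep_as_mcv = true;
    }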

This approach would be a huge win in my opinion, because it avoids all the arbitrariness of that .25/.50 multiplier.  Otherwise there might be (valid) complaints that for some data .40 (or .60) is a better fit, but we would have already hard-coded something, with no easy way to improve the situation for some users without breaking it for the rest (unless we introduce a per-attribute configurable parameter for this multiplier, along the lines of statistics_target, which I'd rather not even think about ;-)).

While we don't (well, can't) build a histogram in the compute_distinct_stats variant, we could apply the above trick there as well, for the same reason and to make the output of the two functions more consistent (and to reduce the maintenance burden between the variants).  Anyway, it would be rather surprising if, depending on the presence of an ordering operator for the type, the resulting MCV lists after ANALYZE were different (I mean beyond differences due to the random nature of the sample).
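
A hedged sketch of what that could look like in compute_distinct_stats, which never materializes a histogram (all names here are hypothetical; num_bins would come from the statistics target, as elsewhere in analyze.c):

    /* Sketch: derive the same cutoff as if a histogram were to be
     * built, even though compute_distinct_stats never builds one. */
    num_hist = Min(num_bins, ndistinct);
    hfreq = (double) nonnull_cnt / (double) num_hist;
    mincount = 2.0 * hfreq + 1.0;   /* the same "common enough" threshold */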

I'm not sure yet about the 1% rule for the last value, but I would also love to see whether we can avoid an arbitrary limit here.  What happens in the current code with a last value that is less than 1% popular, anyway?

Cheers!
--
Alex
