Re: Choosing values for multivariate MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Choosing values for multivariate MCV lists
Msg-id 20190629130126.53rshkfu2go6atkl@development
In response to Re: Choosing values for multivariate MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Choosing values for multivariate MCV lists
List pgsql-hackers
On Tue, Jun 25, 2019 at 11:18:19AM +0200, Tomas Vondra wrote:
>On Mon, Jun 24, 2019 at 02:54:01PM +0100, Dean Rasheed wrote:
>>On Mon, 24 Jun 2019 at 00:42, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
>>>>On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>>>>>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>>>>One annoying thing I noticed is that the base_frequency tends to end up
>>>>>>being 0, most likely due to getting too small. It's a bit strange, though,
>>>>>>because with statistics target set to 10k the smallest frequency for a
>>>>>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>>>>>>something the float8 can represent).
>>>>>>
>>>>>
>>>>>Yeah, it should be impossible for the base frequency to underflow to
>>>>>0. However, it looks like the problem is with mcv_list_items()'s use
>>>>>of %f to convert to text, which is pretty ugly.
>>>>>
>>>>
>>>>Yeah, I realized that too, eventually. One way to fix that would be
>>>>adding %.15f to the sprintf() call, but that just adds ugliness. It's
>>>>probably time to rewrite the function to build the tuple from datums,
>>>>instead of relying on BuildTupleFromCStrings.
>>>>
>>>
>>>OK, attached is a patch doing this. It's pretty simple, and it does
>>>resolve the issue with frequency precision.
>>>
>>>There's one issue with the signature, though - currently the function
>>>returns null flags as a bool array, but values are returned as a simple
>>>text value (formatted in an array-like way, but still just text).
>>>
>>>In the attached patch I've reworked both to proper arrays, but obviously
>>>that'd require a CATVERSION bump - and there's not much appetite for that
>>>past beta2, I suppose. So I'll just undo this bit.
>>>
>>
>>Hmm, I didn't spot that the old code was using a single text value
>>rather than a text array. That's clearly broken, especially since it
>>wasn't even necessarily constructing a valid textual representation of
>>an array (e.g., if an individual value's textual representation
>>included the array markers "{" or "}").
>>
>>IMO fixing this to return a text array is worth doing, even though it
>>means a catversion bump.
>>
>
>Yeah :-(
>
>It used to be just a "debugging" function, but now that we're using it
>e.g. in pg_stats_ext definition, we need to be more careful about the
>output. Presumably we could keep the text output and make sure it's
>escaped properly etc. We could even build an array internally and then
>run it through an output function. That would not require a catversion bump.
>
>I'll clean up the patch changing the function signature. If others think
>the catversion bump would be too significant an annoyance at this point, I
>will switch back to the text output (with proper formatting).
>
>Opinions?
>
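
To illustrate the %f issue discussed in the quoted text, here is a minimal
standalone sketch. It is not taken from mcv_list_items() or from the patch;
the helper name roundtrip() is made up for this example. It formats a value
with a given printf format and parses the text back, which is roughly what
happens when a float8 is pushed through a C string on its way into a tuple.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: format "freq" using the printf format "fmt",
 * then parse the resulting text back into a double.
 */
static double
roundtrip(const char *fmt, double freq)
{
    char        buf[64];

    assert(fmt != NULL);

    /* "%f" keeps only 6 digits after the decimal point */
    snprintf(buf, sizeof(buf), fmt, freq);

    return strtod(buf, NULL);
}
```

With a base frequency of about 1/9e12, "%f" produces "0.000000", so the value
comes back as exactly 0. "%.15f" keeps it non-zero but retains only a few
significant digits, while "%.17g" is enough to get the same float8 back.
Building the tuple from datums avoids the text conversion altogether.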

Attached is a cleaned-up version of that patch. The main difference is
that instead of using construct_md_array() this uses ArrayBuildState to
construct the arrays, which is much easier. The docs don't need any
update because they were already using text[] for the parameter; it was
the code that was inconsistent with them.

This does require a catversion bump, but as annoying as it is, I think
it's worth it (and there's also the thread discussing the serialization
issues). Barring objections, I'll get it committed later next week, once
I get back from PostgresLondon.

As I mentioned before, if we don't want any additional catversion bumps,
it's possible to pass the arrays through output functions - that would
allow us to keep the text output (but correct, unlike what we have now).
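
As a rough illustration of what "correct" text output means here, the sketch
below quotes a single element the way an array literal requires when the
value contains markers such as "{" or "}". This is a simplified standalone
approximation, not the backend's array output code and not part of the
patch; the function name quote_array_element() is hypothetical. It always
double-quotes the element, which the array input syntax accepts for any
value, and backslash-escapes embedded double quotes and backslashes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: write "src" into "dst" as a double-quoted array
 * element. Returns the number of bytes written (excluding the trailing
 * NUL), or 0 if "dst" cannot hold the worst-case result.
 */
static size_t
quote_array_element(const char *src, char *dst, size_t dstlen)
{
    size_t      n = 0;

    assert(src != NULL && dst != NULL);

    /* worst case: every byte escaped, plus two quotes and the NUL */
    if (dstlen < 2 * strlen(src) + 3)
        return 0;

    dst[n++] = '"';
    for (; *src != '\0'; src++)
    {
        /* double quotes and backslashes need a backslash in front */
        if (*src == '"' || *src == '\\')
            dst[n++] = '\\';
        dst[n++] = *src;
    }
    dst[n++] = '"';
    dst[n] = '\0';

    return n;
}
```

For example, the value a}b becomes "a}b", so the closing brace is no longer
mistaken for the end of the array. Running a real array through its output
function gives this kind of escaping for free.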


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
