Thread: Choosing values for multivariate MCV lists
Hi,

The current implementation of multi-column MCV lists (added in this
cycle) uses a fairly simple algorithm to pick combinations to include in
the MCV list. We just compute a minimum number of occurrences, and then
include all entries sampled more often. See get_mincount_for_mcv_list().

By coincidence I received a real-world data set where this does not work
particularly well, so I'm wondering if we can improve this somehow. It
does not make the estimates worse than without the MCV lists, so I don't
think we have an issue, but I wonder if we could do better.

The data set is very simple - it's a single table of valid addresses
(city/street name and street number):

  CREATE TABLE addresses (
      city_name    text,
      street_name  text,
      street_no    int
  );

Data with Czech addresses are available here (it's ~9.3MB, so I'm not
attaching it directly)

  https://drive.google.com/file/d/1EiZGsk6s5hqzZrL7t5KiaqByJnBsndfA

and attached is a SQL script I used to compute a "virtual" MCV list.

There are about 3M records, so let's query for a street name in one of
the cities:

  EXPLAIN ANALYZE SELECT * FROM addresses
   WHERE city_name = 'Most' AND street_name = 'Pionýrů'

                                 QUERY PLAN
  ----------------------------------------------------------------------------
   Seq Scan on addresses  (cost=0.00..62645.38 rows=1 width=25) (actual time=117.225..204.291 rows=779 loops=1)
     Filter: ((city_name = 'Most'::text) AND (street_name = 'Pionýrů'::text))
     Rows Removed by Filter: 2923846
   Planning Time: 0.065 ms
   Execution Time: 204.525 ms
  (5 rows)

It's true 779 rows is only a tiny fraction of the data set (~0.025%),
but this data set is a bit weird in one other aspect - about half of the
records have an empty (NULL) street_name, it's just city + number. Small
villages simply don't have streets at all, and large cities probably
started as small villages, so they have such addresses too.

This however means that by choosing the MCV entries solely based on the
number of occurrences in the sample, we end up with MCV lists where the
vast majority of entries have a NULL street name.

That's why we got such a poor estimate in the example query, despite the
fact that the city/street combination is the most frequent in the data
set (with a non-NULL street name).

The other weird thing is that the frequency of NULL street names is
fairly uniform in the whole data set. In total about 50% of addresses
match that, and for individual cities it's generally between 25% and
100%, so the estimate is less than 2x off in those cases.

But for addresses with non-NULL street names, the situation is very
different. Some street names are unique to a single city, etc.

Overall, this means we end up with an MCV list with entries representing
the mostly-uniform part of the data set, instead of preferring the
entries that are truly skewed.

So I'm wondering if we could/should rethink the algorithm, to look at
the frequency and base_frequency, and maybe pick entries based on their
ratio, or something like that.

For example, we might sort the entries by

  abs(freq - base_freq) / freq

which seems like a reasonable "measure of mis-estimate". Or maybe we
might look just at abs(freq - base_freq)? I think the first option would
be better, because (1 row vs. 100,000 rows) is probably worse than
(1M rows vs. 2M rows).

Of course, this is a challenging problem, for a couple of reasons.

Firstly, picking simply the most frequent groups is very simple and it
gives us additional information about the largest group (which may be
useful elsewhere, e.g. the incremental sort patch).
Secondly, if the correlated combinations are less frequent, how reliably
can we even estimate the frequency from a sample? The combination in the
example query was ~0.02% of the data, so how likely is it to be sampled?

Alternatively, it's possible to increase the statistics target to make
the sample larger, but that also keeps more entries in the MCV list,
including the correlated ones. So maybe the best thing we can do is to
just rely on that?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
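To make the freq/base_freq terminology used throughout this thread concrete: the base frequency of a combination is the product of the per-column frequencies, i.e. what independent columns would predict. Below is a small standalone C sketch (hypothetical struct and made-up numbers loosely based on the example above, not code from PostgreSQL) of that quantity and of the proposed mis-estimate measure abs(freq - base_freq) / freq; a strongly correlated combination scores high, a mostly-independent one scores near zero.

/*
 * Illustrative sketch only - hypothetical names and numbers, not the
 * PostgreSQL source. Shows the base frequency (product of per-column
 * frequencies, i.e. the independence assumption) and the proposed
 * "measure of mis-estimate" abs(freq - base_freq) / freq.
 */
#include <math.h>
#include <stdio.h>

typedef struct SampleGroup
{
    double  freq;           /* observed frequency of the combination */
    double  col_freq[2];    /* per-column frequencies of its values */
} SampleGroup;

/* base frequency = what we'd estimate if the columns were independent */
static double
base_frequency(const SampleGroup *g)
{
    return g->col_freq[0] * g->col_freq[1];
}

/* proposed ranking measure from the post above */
static double
misestimate(const SampleGroup *g)
{
    double  base = base_frequency(g);

    return fabs(g->freq - base) / g->freq;
}

int
main(void)
{
    /* (city = 'Most', street = 'Pionýrů'): rare, but strongly correlated */
    SampleGroup correlated = {0.00025, {0.02, 0.0005}};

    /* (city = 'Most', street = NULL): frequent, but close to independent */
    SampleGroup uniform = {0.01, {0.02, 0.5}};

    printf("correlated: %g\n", misestimate(&correlated));   /* ~0.96 */
    printf("uniform:    %g\n", misestimate(&uniform));      /* 0 */

    return 0;
}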
On Tue, 18 Jun 2019 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> The current implementation of multi-column MCV lists (added in this
> cycle) uses a fairly simple algorithm to pick combinations to include in
> the MCV list. We just compute a minimum number of occurrences, and then
> include all entries sampled more often. See get_mincount_for_mcv_list().
>
> [snip]
>
> This however means that by choosing the MCV entries solely based on the
> number of occurrences in the sample, we end up with MCV lists where the
> vast majority of entries have a NULL street name.
>
> That's why we got such a poor estimate in the example query, despite the
> fact that the city/street combination is the most frequent in the data
> set (with a non-NULL street name).
>

I think the fact that they're NULL is a bit of a red herring because
we're treating NULL just like any other value. The same thing would
happen if there were some other very common non-NULL value that
dominated the dataset.

> The other weird thing is that the frequency of NULL street names is
> fairly uniform in the whole data set. In total about 50% of addresses
> match that, and for individual cities it's generally between 25% and
> 100%, so the estimate is less than 2x off in those cases.
>
> But for addresses with non-NULL street names, the situation is very
> different. Some street names are unique to a single city, etc.
>
> Overall, this means we end up with an MCV list with entries representing
> the mostly-uniform part of the data set, instead of preferring the
> entries that are truly skewed.
>
> So I'm wondering if we could/should rethink the algorithm, to look at
> the frequency and base_frequency, and maybe pick entries based on their
> ratio, or something like that.
>

Hmm, interesting. I think I would like to see a more rigorous
justification for changing the algorithm deciding which values to
keep.

If I've understood correctly, I think the problem is this: The
mincount calculation is a good way of identifying MCV candidates to
keep, because it ensures that we don't keep values that don't appear
sufficiently often to produce accurate estimates, and ideally we'd
keep everything with count >= mincount. However, in the case where
there are more than stats_target items with count >= mincount, simply
ordering by count and keeping the most commonly seen values isn't
necessarily the best strategy in the case of multivariate statistics.

To identify what the best strategy might be, I think you need to
examine the errors that would occur as a result of *not* keeping a
value in the multivariate MCV list. Given a value that appears with
count >= mincount, N*freq ought to be a reasonable estimate for the
actual number of occurrences of that value in the table, and
N*base_freq ought to be a reasonable estimate for the
univariate-stats-based estimate that it would be given if it weren't
kept in the multivariate MCV list. So the absolute error resulting
from not keeping that value would be

  N * Abs(freq - base_freq)

But then I think we ought to take into account how often we're likely
to get that error. If we're simply picking values at random, the
likelihood of getting that value is just its frequency, so the average
absolute error would be

  Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )

which suggests that, if we wanted to reduce the average absolute error
of the estimates, we should order by freq*Abs(freq-base_freq) and keep
the top n in the MCV list.
On the other hand, if we wanted to reduce the average *relative* error
of the estimates, we might instead order by Abs(freq-base_freq).

> For example, we might sort the entries by
>
>   abs(freq - base_freq) / freq
>

I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
though, because that would seem likely to put too much weight on the
least commonly occurring values.

> Of course, this is a challenging problem, for a couple of reasons.
>
> Firstly, picking simply the most frequent groups is very simple and it
> gives us additional information about the largest group (which may be
> useful elsewhere, e.g. the incremental sort patch).
>

Yes, you would have to keep in mind that changing the algorithm would
mean that the MCV list no longer represented all the most common
values. For example, it would no longer be valid to assume that no
value appeared more often than the first value in the MCV list. I'm
not sure that we currently do that though.

> Secondly, if the correlated combinations are less frequent, how reliably
> can we even estimate the frequency from a sample? The combination in the
> example query was ~0.02% of the data, so how likely is it to be sampled?
>

I think that's OK as long as we keep the mincount filter in the new
algorithm. I think some experimentation is definitely worthwhile here,
but it looks plausible that a decent approach might be:

1). Discard all values with count < mincount.
2). Order by freq*Abs(freq-base_freq) (or possibly just
Abs(freq-base_freq)) and keep the top n, where n is the stats target.
3). Perhaps re-sort by count, so the final list is in frequency order
again? Not sure if that's a desirable property to keep.

Regards,
Dean
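A compact standalone sketch of that three-step strategy (hypothetical struct and function names, not the actual ANALYZE code), assuming freq and base_freq have already been computed for each sampled group:

#include <math.h>
#include <stdlib.h>

typedef struct McvCandidate
{
    int     count;          /* occurrences in the sample */
    double  freq;           /* count / sample size */
    double  base_freq;      /* product of per-column frequencies */
} McvCandidate;

/* descending by freq * Abs(freq - base_freq), i.e. expected absolute error */
static int
cmp_by_error(const void *a, const void *b)
{
    const McvCandidate *ca = (const McvCandidate *) a;
    const McvCandidate *cb = (const McvCandidate *) b;
    double  ea = ca->freq * fabs(ca->freq - ca->base_freq);
    double  eb = cb->freq * fabs(cb->freq - cb->base_freq);

    return (ea < eb) ? 1 : (ea > eb) ? -1 : 0;
}

/* descending by count, to restore frequency order */
static int
cmp_by_count(const void *a, const void *b)
{
    return ((const McvCandidate *) b)->count - ((const McvCandidate *) a)->count;
}

/* Returns the number of items kept; candidates[0 .. result-1] is the list. */
static int
choose_mcv_items(McvCandidate *candidates, int ngroups,
                 int mincount, int stats_target)
{
    int     nkeep = 0;

    /* 1) discard everything below mincount */
    for (int i = 0; i < ngroups; i++)
    {
        if (candidates[i].count >= mincount)
            candidates[nkeep++] = candidates[i];
    }

    /* 2) keep the stats_target groups with the largest expected error */
    qsort(candidates, nkeep, sizeof(McvCandidate), cmp_by_error);
    if (nkeep > stats_target)
        nkeep = stats_target;

    /* 3) optionally re-sort by count, so the final list is in frequency order */
    qsort(candidates, nkeep, sizeof(McvCandidate), cmp_by_count);

    return nkeep;
}

Whether step 3 is worth keeping is exactly the open question raised above.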
On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>On Tue, 18 Jun 2019 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> The current implementation of multi-column MCV lists (added in this
>> cycle) uses a fairly simple algorithm to pick combinations to include in
>> the MCV list. We just compute a minimum number of occurrences, and then
>> include all entries sampled more often. See get_mincount_for_mcv_list().
>>
>> [snip]
>>
>> This however means that by choosing the MCV entries solely based on the
>> number of occurrences in the sample, we end up with MCV lists where the
>> vast majority of entries have a NULL street name.
>>
>> That's why we got such a poor estimate in the example query, despite the
>> fact that the city/street combination is the most frequent in the data
>> set (with a non-NULL street name).
>>
>
>I think the fact that they're NULL is a bit of a red herring because
>we're treating NULL just like any other value. The same thing would
>happen if there were some other very common non-NULL value that
>dominated the dataset.
>

I wasn't really suggesting the NULL itself is an issue - sorry if that
wasn't clear. It might be any other value, as long as it's very common
(and roughly uniform) in all cities. So yes, I agree with you here.

>> The other weird thing is that the frequency of NULL street names is
>> fairly uniform in the whole data set. In total about 50% of addresses
>> match that, and for individual cities it's generally between 25% and
>> 100%, so the estimate is less than 2x off in those cases.
>>
>> But for addresses with non-NULL street names, the situation is very
>> different. Some street names are unique to a single city, etc.
>>
>> Overall, this means we end up with an MCV list with entries representing
>> the mostly-uniform part of the data set, instead of preferring the
>> entries that are truly skewed.
>>
>> So I'm wondering if we could/should rethink the algorithm, to look at
>> the frequency and base_frequency, and maybe pick entries based on their
>> ratio, or something like that.
>>
>
>Hmm, interesting. I think I would like to see a more rigorous
>justification for changing the algorithm deciding which values to
>keep.
>

Sure, I'm not going to pretend my proposals were particularly rigorous -
it was more a collection of random ideas.

>If I've understood correctly, I think the problem is this: The
>mincount calculation is a good way of identifying MCV candidates to
>keep, because it ensures that we don't keep values that don't appear
>sufficiently often to produce accurate estimates, and ideally we'd
>keep everything with count >= mincount. However, in the case where
>there are more than stats_target items with count >= mincount, simply
>ordering by count and keeping the most commonly seen values isn't
>necessarily the best strategy in the case of multivariate statistics.
>

Yes.

>To identify what the best strategy might be, I think you need to
>examine the errors that would occur as a result of *not* keeping a
>value in the multivariate MCV list. Given a value that appears with
>count >= mincount, N*freq ought to be a reasonable estimate for the
>actual number of occurrences of that value in the table, and
>N*base_freq ought to be a reasonable estimate for the
>univariate-stats-based estimate that it would be given if it weren't
>kept in the multivariate MCV list. So the absolute error resulting
>from not keeping that value would be
>
>  N * Abs(freq - base_freq)
>

Yeah.
Considering N is the same for all groups in the sample, this would mean
the same thing as Abs(freq - base_freq).

>But then I think we ought to take into account how often we're likely
>to get that error. If we're simply picking values at random, the
>likelihood of getting that value is just its frequency, so the average
>absolute error would be
>
>  Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )
>
>which suggests that, if we wanted to reduce the average absolute error
>of the estimates, we should order by freq*Abs(freq-base_freq) and keep
>the top n in the MCV list.
>

Interesting idea. But I'm not sure it makes much sense to assume the
rows are picked randomly - OTOH we don't really know anything about how
the data will be queried, so we may just as well do that.

>On the other hand, if we wanted to reduce the average *relative* error
>of the estimates, we might instead order by Abs(freq-base_freq).
>

Hmmm, yeah. I don't know what the right choice is here, TBH.

>> For example, we might sort the entries by
>>
>>   abs(freq - base_freq) / freq
>>
>
>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>though, because that would seem likely to put too much weight on the
>least commonly occurring values.
>

But would that be an issue, or a good thing? I mean, as long as the item
is above mincount, we take the counts as reliable. As I explained, my
motivation for proposing that was that both

   ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)

and

   ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)

have exactly the same Abs(freq - base_freq), but I think we both agree
that the first misestimate is much more dangerous, because it's off by
six orders of magnitude. I think the MCV algorithm should reflect this.

>> Of course, this is a challenging problem, for a couple of reasons.
>>
>> Firstly, picking simply the most frequent groups is very simple and it
>> gives us additional information about the largest group (which may be
>> useful elsewhere, e.g. the incremental sort patch).
>>
>
>Yes, you would have to keep in mind that changing the algorithm would
>mean that the MCV list no longer represented all the most common
>values. For example, it would no longer be valid to assume that no
>value appeared more often than the first value in the MCV list. I'm
>not sure that we currently do that though.
>

We don't, but I've proposed using some of this knowledge in the
incremental sort patch. But I think we might actually extend the MCV
list to store some extra information (e.g. the frequency of the largest
groups, even if we don't store those groups themselves).

>> Secondly, if the correlated combinations are less frequent, how reliably
>> can we even estimate the frequency from a sample? The combination in the
>> example query was ~0.02% of the data, so how likely is it to be sampled?
>>
>
>I think that's OK as long as we keep the mincount filter in the new
>algorithm. I think some experimentation is definitely worthwhile here,
>but it looks plausible that a decent approach might be:
>
>1). Discard all values with count < mincount.
>2). Order by freq*Abs(freq-base_freq) (or possibly just
>Abs(freq-base_freq)) and keep the top n, where n is the stats target.
>3). Perhaps re-sort by count, so the final list is in frequency order
>again? Not sure if that's a desirable property to keep.
>

Will try. Thanks for the feedback.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
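For reference, a trivial standalone program working through the two misestimates mentioned above: both have the same absolute error (one million rows), but the ratio differs by six orders of magnitude, which is the argument for a relative measure.

#include <math.h>
#include <stdio.h>

/* Compare estimated and actual row counts by absolute difference and ratio. */
static void
report(double est_rows, double act_rows)
{
    printf("est=%.0f act=%.0f  abs diff=%.0f  ratio=%.0f\n",
           est_rows, act_rows,
           fabs(act_rows - est_rows),
           fmax(act_rows / est_rows, est_rows / act_rows));
}

int
main(void)
{
    report(1, 1000001);         /* abs diff=1000000  ratio=1000001 */
    report(1000000, 2000000);   /* abs diff=1000000  ratio=2 */

    return 0;
}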
On Thu, 20 Jun 2019 at 23:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>
> >I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
> >though, because that would seem likely to put too much weight on the
> >least commonly occurring values.
>
> But would that be an issue, or a good thing? I mean, as long as the item
> is above mincount, we take the counts as reliable. As I explained, my
> motivation for proposing that was that both
>
>    ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)
>
> and
>
>    ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)
>
> have exactly the same Abs(freq - base_freq), but I think we both agree
> that the first misestimate is much more dangerous, because it's off by
> six orders of magnitude.
>

Hmm, that's a good example. That definitely suggests that we should be
trying to minimise the relative error, but also perhaps that what we
should be looking at is actually just the ratio freq / base_freq,
rather than their difference.

Regards,
Dean
On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:
>On Thu, 20 Jun 2019 at 23:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>>
>> >I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>> >though, because that would seem likely to put too much weight on the
>> >least commonly occurring values.
>>
>> But would that be an issue, or a good thing? I mean, as long as the item
>> is above mincount, we take the counts as reliable. As I explained, my
>> motivation for proposing that was that both
>>
>>    ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)
>>
>> and
>>
>>    ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)
>>
>> have exactly the same Abs(freq - base_freq), but I think we both agree
>> that the first misestimate is much more dangerous, because it's off by
>> six orders of magnitude.
>>
>
>Hmm, that's a good example. That definitely suggests that we should be
>trying to minimise the relative error, but also perhaps that what we
>should be looking at is actually just the ratio freq / base_freq,
>rather than their difference.
>

Attached are patches that implement this (well, the first one optimizes
how the base frequency is computed, which helps for high statistic
target values). The 0002 part picks the items based on

   Max(freq/base_freq, base_freq/freq)

It did help with the addresses data set quite a bit, but I'm sure it
needs more testing. I've also tried using

   freq * abs(freq - base_freq)

but the estimates were not as good.

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange,
though, because with statistic target set to 10k the smallest frequency
for a single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I
think is something the float8 can represent).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
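The item-picking rule described above, paraphrased as a small helper (a restatement of the idea, not code taken from the 0002 patch); it could replace the freq * fabs(freq - base_freq) score in the selection sketch shown earlier in the thread:

#include <math.h>

/*
 * Symmetric ratio between the multivariate frequency and the frequency
 * implied by per-column statistics; penalizes over- and under-estimates
 * alike. Assumes both freq and base_freq are greater than zero.
 */
static double
mcv_item_score(double freq, double base_freq)
{
    return fmax(freq / base_freq, base_freq / freq);
}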
On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> One annoying thing I noticed is that the base_frequency tends to end up
> being 0, most likely due to getting too small. It's a bit strange, though,
> because with statistic target set to 10k the smallest frequency for a
> single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
> something the float8 can represent).
>

Yeah, it should be impossible for the base frequency to underflow to 0.
However, it looks like the problem is with mcv_list_items()'s use of %f
to convert to text, which is pretty ugly.

Regards,
Dean
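A standalone illustration of the %f problem (the 1/9e12 value is borrowed from the earlier back-of-the-envelope estimate of the smallest possible two-column base frequency): the default six-digit precision rounds the value to zero in the text output even though the underlying float8 is fine, whereas %.15f keeps it.

#include <stdio.h>

int
main(void)
{
    double  base_freq = 1.0 / 9e12;   /* ~1.1e-13, well within float8 range */
    char    buf[64];

    snprintf(buf, sizeof(buf), "%f", base_freq);
    printf("%%f    -> %s\n", buf);    /* prints 0.000000 */

    snprintf(buf, sizeof(buf), "%.15f", base_freq);
    printf("%%.15f -> %s\n", buf);    /* prints 0.000000000000111 */

    return 0;
}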
On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> One annoying thing I noticed is that the base_frequency tends to end up
>> being 0, most likely due to getting too small. It's a bit strange, though,
>> because with statistic target set to 10k the smallest frequency for a
>> single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>> something the float8 can represent).
>>
>
>Yeah, it should be impossible for the base frequency to underflow to
>0. However, it looks like the problem is with mcv_list_items()'s use
>of %f to convert to text, which is pretty ugly.
>

Yeah, I realized that too, eventually. One way to fix that would be
adding %.15f to the sprintf() call, but that just adds ugliness. It's
probably time to rewrite the function to build the tuple from datums,
instead of relying on BuildTupleFromCStrings.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
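A rough sketch of that direction - building the result tuple from datums so the frequencies never pass through a text representation. The column layout and helper name here are hypothetical, intended only to illustrate the idea; this is not the actual patch.

#include "postgres.h"

#include "access/htup_details.h"
#include "funcapi.h"

/*
 * Build one result row for an MCV item directly from datums, instead of
 * snprintf()ing the values and calling BuildTupleFromCStrings(). The
 * frequencies stay exact float8 values all the way to the output.
 */
static Datum
build_mcv_item_tuple(TupleDesc tupdesc, int index,
                     Datum values_arr, Datum nulls_arr,
                     double frequency, double base_frequency)
{
    Datum       values[5];
    bool        nulls[5] = {false, false, false, false, false};
    HeapTuple   tuple;

    values[0] = Int32GetDatum(index);
    values[1] = values_arr;     /* item values, built elsewhere */
    values[2] = nulls_arr;      /* null flags, built elsewhere */
    values[3] = Float8GetDatum(frequency);       /* no %f round-trip */
    values[4] = Float8GetDatum(base_frequency);  /* no %f round-trip */

    tuple = heap_form_tuple(tupdesc, values, nulls);

    return HeapTupleGetDatum(tuple);
}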
On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
>On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>One annoying thing I noticed is that the base_frequency tends to end up
>>>being 0, most likely due to getting too small. It's a bit strange, though,
>>>because with statistic target set to 10k the smallest frequency for a
>>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>>>something the float8 can represent).
>>>
>>
>>Yeah, it should be impossible for the base frequency to underflow to
>>0. However, it looks like the problem is with mcv_list_items()'s use
>>of %f to convert to text, which is pretty ugly.
>>
>
>Yeah, I realized that too, eventually. One way to fix that would be
>adding %.15f to the sprintf() call, but that just adds ugliness. It's
>probably time to rewrite the function to build the tuple from datums,
>instead of relying on BuildTupleFromCStrings.
>

OK, attached is a patch doing this. It's pretty simple, and it does
resolve the issue with frequency precision.

There's one issue with the signature, though - currently the function
returns null flags as bool array, but values are returned as simple
text value (formatted in array-like way, but still just a text).

In the attached patch I've reworked both to proper arrays, but obviously
that'd require a CATVERSION bump - and there's not much appetite for that
past beta2, I suppose. So I'll just undo this bit.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 22, 2019 at 04:10:52PM +0200, Tomas Vondra wrote:
>On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:
>>On Thu, 20 Jun 2019 at 23:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>>>
>>>>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>>>>though, because that would seem likely to put too much weight on the
>>>>least commonly occurring values.
>>>
>>>But would that be an issue, or a good thing? I mean, as long as the item
>>>is above mincount, we take the counts as reliable. As I explained, my
>>>motivation for proposing that was that both
>>>
>>>   ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)
>>>
>>>and
>>>
>>>   ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)
>>>
>>>have exactly the same Abs(freq - base_freq), but I think we both agree
>>>that the first misestimate is much more dangerous, because it's off by
>>>six orders of magnitude.
>>>
>>
>>Hmm, that's a good example. That definitely suggests that we should be
>>trying to minimise the relative error, but also perhaps that what we
>>should be looking at is actually just the ratio freq / base_freq,
>>rather than their difference.
>>
>
>Attached are patches that implement this (well, the first one optimizes
>how the base frequency is computed, which helps for high statistic
>target values). The 0002 part picks the items based on
>
>   Max(freq/base_freq, base_freq/freq)
>
>It did help with the addresses data set quite a bit, but I'm sure it
>needs more testing. I've also tried using
>
>   freq * abs(freq - base_freq)
>
>but the estimates were not as good.
>
>One annoying thing I noticed is that the base_frequency tends to end up
>being 0, most likely due to getting too small. It's a bit strange,
>though, because with statistic target set to 10k the smallest frequency
>for a single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I
>think is something the float8 can represent).
>

FWIW while doing more tests on this, I've noticed a rather annoying
behavior when increasing the statistics target.

With the current algorithm, picking values merely based on frequency,
the MCV list is expanded in a stable way. Increasing the statistics
target means the MCV list may grow, but the larger MCV list will contain
the smaller one (ignoring changes due to differences in the sample).

After switching to the two-phase algorithm (first picking candidates
based on mincount, then picking the items based on error) that's no
longer true. I've repeatedly seen cases where increasing the target
lowered mincount, adding candidates with larger frequency errors and
removing some of the items from the "smaller" MCV list.

In practice this means that increasing the statistics target may easily
make the estimates much worse (for some of the values).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 24, 2019 at 01:42:32AM +0200, Tomas Vondra wrote:
>On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
>>On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>>>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>>One annoying thing I noticed is that the base_frequency tends to end up
>>>>being 0, most likely due to getting too small. It's a bit strange, though,
>>>>because with statistic target set to 10k the smallest frequency for a
>>>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>>>>something the float8 can represent).
>>>>
>>>
>>>Yeah, it should be impossible for the base frequency to underflow to
>>>0. However, it looks like the problem is with mcv_list_items()'s use
>>>of %f to convert to text, which is pretty ugly.
>>>
>>
>>Yeah, I realized that too, eventually. One way to fix that would be
>>adding %.15f to the sprintf() call, but that just adds ugliness. It's
>>probably time to rewrite the function to build the tuple from datums,
>>instead of relying on BuildTupleFromCStrings.
>>
>
>OK, attached is a patch doing this. It's pretty simple, and it does
>resolve the issue with frequency precision.
>
>There's one issue with the signature, though - currently the function
>returns null flags as bool array, but values are returned as simple
>text value (formatted in array-like way, but still just a text).
>
>In the attached patch I've reworked both to proper arrays, but obviously
>that'd require a CATVERSION bump - and there's not much appetite for that
>past beta2, I suppose. So I'll just undo this bit.
>

Meh, I forgot to attach the patch, of course ...

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Mon, 24 Jun 2019 at 00:42, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
> >On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
> >>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >>>One annoying thing I noticed is that the base_frequency tends to end up
> >>>being 0, most likely due to getting too small. It's a bit strange, though,
> >>>because with statistic target set to 10k the smallest frequency for a
> >>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
> >>>something the float8 can represent).
> >>>
> >>
> >>Yeah, it should be impossible for the base frequency to underflow to
> >>0. However, it looks like the problem is with mcv_list_items()'s use
> >>of %f to convert to text, which is pretty ugly.
> >>
> >
> >Yeah, I realized that too, eventually. One way to fix that would be
> >adding %.15f to the sprintf() call, but that just adds ugliness. It's
> >probably time to rewrite the function to build the tuple from datums,
> >instead of relying on BuildTupleFromCStrings.
> >
>
> OK, attached is a patch doing this. It's pretty simple, and it does
> resolve the issue with frequency precision.
>
> There's one issue with the signature, though - currently the function
> returns null flags as bool array, but values are returned as simple
> text value (formatted in array-like way, but still just a text).
>
> In the attached patch I've reworked both to proper arrays, but obviously
> that'd require a CATVERSION bump - and there's not much appetite for that
> past beta2, I suppose. So I'll just undo this bit.
>

Hmm, I didn't spot that the old code was using a single text value
rather than a text array. That's clearly broken, especially since it
wasn't even necessarily constructing a valid textual representation of
an array (e.g., if an individual value's textual representation
included the array markers "{" or "}").

IMO fixing this to return a text array is worth doing, even though it
means a catversion bump.

Regards,
Dean
On Mon, Jun 24, 2019 at 02:54:01PM +0100, Dean Rasheed wrote:
>On Mon, 24 Jun 2019 at 00:42, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
>> >On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>> >>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> >>>One annoying thing I noticed is that the base_frequency tends to end up
>> >>>being 0, most likely due to getting too small. It's a bit strange, though,
>> >>>because with statistic target set to 10k the smallest frequency for a
>> >>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>> >>>something the float8 can represent).
>> >>>
>> >>
>> >>Yeah, it should be impossible for the base frequency to underflow to
>> >>0. However, it looks like the problem is with mcv_list_items()'s use
>> >>of %f to convert to text, which is pretty ugly.
>> >>
>> >
>> >Yeah, I realized that too, eventually. One way to fix that would be
>> >adding %.15f to the sprintf() call, but that just adds ugliness. It's
>> >probably time to rewrite the function to build the tuple from datums,
>> >instead of relying on BuildTupleFromCStrings.
>> >
>>
>> OK, attached is a patch doing this. It's pretty simple, and it does
>> resolve the issue with frequency precision.
>>
>> There's one issue with the signature, though - currently the function
>> returns null flags as bool array, but values are returned as simple
>> text value (formatted in array-like way, but still just a text).
>>
>> In the attached patch I've reworked both to proper arrays, but obviously
>> that'd require a CATVERSION bump - and there's not much appetite for that
>> past beta2, I suppose. So I'll just undo this bit.
>>
>
>Hmm, I didn't spot that the old code was using a single text value
>rather than a text array. That's clearly broken, especially since it
>wasn't even necessarily constructing a valid textual representation of
>an array (e.g., if an individual value's textual representation
>included the array markers "{" or "}").
>
>IMO fixing this to return a text array is worth doing, even though it
>means a catversion bump.
>

Yeah :-(

It used to be just a "debugging" function, but now that we're using it
e.g. in the pg_stats_ext definition, we need to be more careful about
the output. Presumably we could keep the text output and make sure it's
escaped properly etc. We could even build an array internally and then
run it through an output function. That would not require a catversion
bump.

I'll clean up the patch changing the function signature. If others think
the catversion bump would be too significant an annoyance at this point,
I will switch back to the text output (with proper formatting).

Opinions?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jun 25, 2019 at 11:18:19AM +0200, Tomas Vondra wrote:
>On Mon, Jun 24, 2019 at 02:54:01PM +0100, Dean Rasheed wrote:
>>On Mon, 24 Jun 2019 at 00:42, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
>>>>On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>>>>>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>>>>One annoying thing I noticed is that the base_frequency tends to end up
>>>>>>being 0, most likely due to getting too small. It's a bit strange, though,
>>>>>>because with statistic target set to 10k the smallest frequency for a
>>>>>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>>>>>>something the float8 can represent).
>>>>>>
>>>>>
>>>>>Yeah, it should be impossible for the base frequency to underflow to
>>>>>0. However, it looks like the problem is with mcv_list_items()'s use
>>>>>of %f to convert to text, which is pretty ugly.
>>>>>
>>>>
>>>>Yeah, I realized that too, eventually. One way to fix that would be
>>>>adding %.15f to the sprintf() call, but that just adds ugliness. It's
>>>>probably time to rewrite the function to build the tuple from datums,
>>>>instead of relying on BuildTupleFromCStrings.
>>>>
>>>
>>>OK, attached is a patch doing this. It's pretty simple, and it does
>>>resolve the issue with frequency precision.
>>>
>>>There's one issue with the signature, though - currently the function
>>>returns null flags as bool array, but values are returned as simple
>>>text value (formatted in array-like way, but still just a text).
>>>
>>>In the attached patch I've reworked both to proper arrays, but obviously
>>>that'd require a CATVERSION bump - and there's not much appetite for that
>>>past beta2, I suppose. So I'll just undo this bit.
>>>
>>
>>Hmm, I didn't spot that the old code was using a single text value
>>rather than a text array. That's clearly broken, especially since it
>>wasn't even necessarily constructing a valid textual representation of
>>an array (e.g., if an individual value's textual representation
>>included the array markers "{" or "}").
>>
>>IMO fixing this to return a text array is worth doing, even though it
>>means a catversion bump.
>>
>
>Yeah :-(
>
>It used to be just a "debugging" function, but now that we're using it
>e.g. in the pg_stats_ext definition, we need to be more careful about
>the output. Presumably we could keep the text output and make sure it's
>escaped properly etc. We could even build an array internally and then
>run it through an output function. That would not require a catversion
>bump.
>
>I'll clean up the patch changing the function signature. If others think
>the catversion bump would be too significant an annoyance at this point,
>I will switch back to the text output (with proper formatting).
>
>Opinions?
>

Attached is a cleaned-up version of that patch. The main difference is
that instead of using construct_md_array() this uses ArrayBuildState to
construct the arrays, which is much easier. The docs don't need any
update because those were already using text[] for the parameter - the
code was inconsistent with it.

This does require a catversion bump, but as annoying as it is, I think
it's worth it (and there's also the thread discussing the serialization
issues). Barring objections, I'll get it committed later next week, once
I get back from PostgresLondon.
As I mentioned before, if we don't want any additional catversion bumps,
it's possible to pass the arrays through output functions - that would
allow us to keep the text output (but correct, unlike what we have now).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
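For reference, a simplified sketch of the ArrayBuildState approach described above (hypothetical helper, not the committed code): each non-null value is converted with its column's output function and accumulated into a text[] datum, while NULLs become proper array NULLs instead of being mangled into a hand-built string.

#include "postgres.h"

#include "catalog/pg_type.h"
#include "fmgr.h"
#include "utils/array.h"
#include "utils/builtins.h"

/* Build a text[] datum holding the textual form of one MCV item's values. */
static Datum
mcv_item_values_to_text_array(Datum *values, bool *isnull,
                              FmgrInfo *outfuncs, int natts)
{
    ArrayBuildState *astate = NULL;

    for (int i = 0; i < natts; i++)
    {
        Datum   txt = (Datum) 0;

        if (!isnull[i])
        {
            char   *str = OutputFunctionCall(&outfuncs[i], values[i]);

            txt = CStringGetTextDatum(str);
        }

        /* NULLs are stored as array NULLs, not as a literal string */
        astate = accumArrayResult(astate, txt, isnull[i],
                                  TEXTOID, CurrentMemoryContext);
    }

    return makeArrayResult(astate, CurrentMemoryContext);
}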
On Sat, 29 Jun 2019 at 14:01, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> >>>>>However, it looks like the problem is with mcv_list_items()'s use
> >>>>>of %f to convert to text, which is pretty ugly.
> >>>>
> >>>There's one issue with the signature, though - currently the function
> >>>returns null flags as bool array, but values are returned as simple
> >>>text value (formatted in array-like way, but still just a text).
> >>>
> >>IMO fixing this to return a text array is worth doing, even though it
> >>means a catversion bump.
> >>
> Attached is a cleaned-up version of that patch. The main difference is
> that instead of using construct_md_array() this uses ArrayBuildState to
> construct the arrays, which is much easier. The docs don't need any
> update because those were already using text[] for the parameter - the
> code was inconsistent with it.
>

Cool, this looks a lot neater and fixes both of the issues discussed:
floating point values are no longer converted to text, and the values
are returned as a proper text array.

One minor nitpick -- it doesn't seem to be necessary to build the
arrays *outfuncs and *fmgrinfo. You may as well just fetch the
individual column output function information at the point where it's
used (in the "if (!item->isnull[i])" block) rather than building those
arrays.

> This does require a catversion bump, but as annoying as it is, I think
> it's worth it (and there's also the thread discussing the serialization
> issues). Barring objections, I'll get it committed later next week, once
> I get back from PostgresLondon.
>
> As I mentioned before, if we don't want any additional catversion bumps,
> it's possible to pass the arrays through output functions - that would
> allow us to keep the text output (but correct, unlike what we have now).
>

I think this is a clear bug fix, so I'd vote for fixing it properly
now, with a catversion bump.

Regards,
Dean
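The alternative suggested above - looking up the column's output function at the point of use instead of from prebuilt outfuncs/fmgrinfo arrays - might look roughly like this (hypothetical helper, shown only to illustrate the suggestion):

#include "postgres.h"

#include "utils/builtins.h"
#include "utils/lsyscache.h"

/*
 * Convert one non-null column value to a text datum, fetching the output
 * function information right where it is needed.
 */
static Datum
column_value_as_text(Datum value, Oid typid)
{
    Oid     outfunc;
    bool    isvarlena;
    char   *str;

    getTypeOutputInfo(typid, &outfunc, &isvarlena);
    str = OidOutputFunctionCall(outfunc, value);

    return CStringGetTextDatum(str);
}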
On Mon, Jul 01, 2019 at 12:02:28PM +0100, Dean Rasheed wrote:
>On Sat, 29 Jun 2019 at 14:01, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> >>>>>However, it looks like the problem is with mcv_list_items()'s use
>> >>>>>of %f to convert to text, which is pretty ugly.
>> >>>>
>> >>>There's one issue with the signature, though - currently the function
>> >>>returns null flags as bool array, but values are returned as simple
>> >>>text value (formatted in array-like way, but still just a text).
>> >>>
>> >>IMO fixing this to return a text array is worth doing, even though it
>> >>means a catversion bump.
>> >>
>> Attached is a cleaned-up version of that patch. The main difference is
>> that instead of using construct_md_array() this uses ArrayBuildState to
>> construct the arrays, which is much easier. The docs don't need any
>> update because those were already using text[] for the parameter - the
>> code was inconsistent with it.
>>
>
>Cool, this looks a lot neater and fixes both of the issues discussed:
>floating point values are no longer converted to text, and the values
>are returned as a proper text array.
>
>One minor nitpick -- it doesn't seem to be necessary to build the
>arrays *outfuncs and *fmgrinfo. You may as well just fetch the
>individual column output function information at the point where it's
>used (in the "if (!item->isnull[i])" block) rather than building those
>arrays.
>

OK, thanks for the feedback. I'll clean up the function lookup.

>
>> This does require a catversion bump, but as annoying as it is, I think
>> it's worth it (and there's also the thread discussing the serialization
>> issues). Barring objections, I'll get it committed later next week, once
>> I get back from PostgresLondon.
>>
>> As I mentioned before, if we don't want any additional catversion bumps,
>> it's possible to pass the arrays through output functions - that would
>> allow us to keep the text output (but correct, unlike what we have now).
>>
>
>I think this is a clear bug fix, so I'd vote for fixing it properly
>now, with a catversion bump.
>

I agree.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services