Thread: statistics for array types

statistics for array types

From
Jeff Janes
Date:
When reviewing some recent patches, I decided the statistics gathered for arrays had some pre-existing shortcomings.

The main one is that when the arrays contain rare elements there is no histogram to fall back upon when the MCE array is empty, the way there is for scalar stats.  So it has to punt completely and fall back on a flat 0.5% selectivity, without recourse to any data at all.
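
For the example table further down, that means both MCE columns in pg_stats simply come back NULL after ANALYZE, and the flat default is exactly what produces the rows=500 estimate there (100000 * 0.005 = 500).  A quick way to check, using the table and column names from that example:

select most_common_elems, most_common_elem_freqs
from pg_stats
where tablename = 'foobar' and attname = 'foo';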

The rationale for applying the threshold before things are eligible for inclusion in the MCE array seems to be that this puts some theoretical bound on the amount of error we are likely to have in that element.  But I think it is better to exceed that theoretical bound than it is to have no data at all.

The attached patch forces there to be at least one element in MCE, keeping the one element with the highest predicted frequency if the MCE would otherwise be empty.  Then any other element queried for is assumed to be no more common than this most common element.

I'd also briefly considered just having the part of the code that pulls the stats out of pg_stats interpret a NULL MCE array as meaning that nothing is more frequent than the threshold, but that would mean that that part of the code needs to know about how the threshold is chosen, which just seems wrong.  And it would need to know the difference between NULL MCE because no stats were gathered, versus because stats were gathered but nothing met the threshold.

I'm only marginally interested in this area, but since I did the investigation into it already I wanted to present it here.  I'd be quite happy if someone else wanted to take over the patch (or the concept behind it).

Here is an example where the estimate drops from 500 rows to 3 rows; the correct number is zero:

create table foobar as
select (
  select
    array_agg(floor((random()*1000000000+g.y/1000000+f.x/10000000))::int)
    from generate_series(1,10) g(y)
  ) foo
from generate_series(1,100000) f(x);

Unpatched:

analyze;

explain (analyze) select * from foobar where foo @> '{567}';
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=500 width=61) (actual time=76.448..76.448 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.211 ms
 Execution time: 76.492 ms


With patch:

analyze;

explain (analyze) select * from foobar where foo @> '{567}';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=3 width=61) (actual time=78.604..78.604 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.199 ms
 Execution time: 78.665 ms

Cheers,

Jeff
Attachment

Re: statistics for array types

From
Tomas Vondra
Date:
Hi,

On 08/11/2015 04:38 PM, Jeff Janes wrote:
> When reviewing some recent patches, I decided the statistics gathered
>  for arrays had some pre-existing shortcomings.
>
> The main one is that when the arrays contain rare elements there is
> no histogram to fall back upon when the MCE array is empty, the way
> there is for scalar stats.  So it has to punt completely and resort
> to saying that it is 0.5% selectivity without recourse to any data at
> all.
>
> The rationale for applying the threshold before things are eligible
> for inclusion in the MCE array seems to be that this puts some
> theoretical bound on the amount of error we are likely to have in
> that element.  But I think it is better to exceed that theoretical
> bound than it is to have no data at all.
>
> The attached patch forces there to be at least one element in MCE,
> keeping the one element with the highest predicted frequency if the
> MCE would otherwise be empty.  Then any other element queried for is
> assumed to be no more common than this most common element.

We only really need the frequency, right? So do we really need to keep
the actual MCV element? I.e. most_common_elem_freqs does not have the
same number of values as most_common_elems anyway:
  A list of the frequencies of the most common element values, i.e., the
  fraction of rows containing at least one instance of the given value.
  Two or three additional values follow the per-element frequencies;
  these are the minimum and maximum of the preceding per-element
  frequencies, and optionally the frequency of null elements.
  (Null when most_common_elems is.)

So we might modify it so that it's always defined - either it tracks the
same values as today (when most_common_elems is defined), or the
frequency of the most common element (when most_common_elems is NULL).

This way we can keep the current theoretical error-bound on the MCE
frequencies, and if that's not possible we can have at least the new
value without confusing existing code.
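
Just to illustrate the shape of the data, on a column where MCEs do get collected the trailing entries are already there today; something like this shows it (the table and column names here are only placeholders):

select most_common_elems,
       most_common_elem_freqs,
       array_length(most_common_elem_freqs, 1) as n_freqs  -- number of elements plus 2 (or 3)
from pg_stats
where tablename = 'some_table' and attname = 'some_array_column';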

> I'd also briefly considered just having the part of the code that
> pulls the stats out of pg_stats interpret a NULL MCE array as meaning
> that nothing is more frequent than the threshold, but that would mean
> that that part of the code needs to know about how the threshold is
> chosen, which just seems wrong. And it would need to know the
> difference between NULL MCE because no stats were gathered, versus
> because stats were gathered but nothing met the threshold.

I'm not sure whether this is the same thing I just proposed ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: statistics for array types

From
Jeff Janes
Date:
On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

On 08/11/2015 04:38 PM, Jeff Janes wrote:
When reviewing some recent patches, I decided the statistics gathered
 for arrays had some pre-existing shortcomings.

The main one is that when the arrays contain rare elements there is
no histogram to fall back upon when the MCE array is empty, the way
there is for scalar stats.  So it has to punt completely and resort
to saying that it is 0.5% selectivity without recourse to any data at
all.

The rationale for applying the threshold before things are eligible
for inclusion in the MCE array seems to be that this puts some
theoretical bound on the amount of error we are likely to have in
that element.  But I think it is better to exceed that theoretical
bound than it is to have no data at all.

The attached patch forces there to be at least one element in MCE,
keeping the one element with the highest predicted frequency if the
MCE would otherwise be empty.  Then any other element queried for is
assumed to be no more common than this most common element.

We only really need the frequency, right? So do we really need to keep
the actual MCV element? I.e. most_common_elem_freqs does not have the
same number of values as most_common_elems anyway:

  A list of the frequencies of the most common element values, i.e., the
  fraction of rows containing at least one instance of the given value.
  Two or three additional values follow the per-element frequencies;
  these are the minimum and maximum of the preceding per-element
  frequencies, and optionally the frequency of null elements.
  (Null when most_common_elems is.)

So we might modify it so that it's always defined - either it tracks the
same values as today (when most_common_elems is defined), or the
frequency of the most common element (when most_common_elems is NULL).

I had also considered that.  It requires more changes to make it happen, and it seems to create a more complex contract on what those columns mean, but without giving a corresponding benefit.
 

This way we can keep the current theoretical error-bound on the MCE
frequencies, and if that's not possible we can have at least the new
value without confusing existing code.

But if the frequency of the most common element was grossly wrong, then whatever value we stick in there is still going to be grossly wrong.  Removing the value associated with it isn't going to stop it from being wrong.  When we do query with the (incorrectly identified) first most common element, either it will find and use the wrong value from slot 1, or it will find nothing and fall back on the same wrong value from slot 3.
 

I'd also briefly considered just having the part of the code that
pulls the stats out of pg_stats interpret a NULL MCE array as meaning
that nothing is more frequent than the threshold, but that would mean
that that part of the code needs to know about how the threshold is
chosen, which just seems wrong. And it would need to know the
difference between NULL MCE because no stats were gathered, versus
because stats were gathered but nothing met the threshold.

I'm not sure whether this is the same thing I just proposed ...


No, that was yet another option.  "The only way this slot can be null is if all values were present less than this number of times".  Or if analyze had never been run.

Cheers,

Jeff

Re: statistics for array types

From
Alvaro Herrera
Date:
Jeff Janes wrote:

> The attached patch forces there to be at least one element in MCE, keeping
> the one element with the highest predicted frequency if the MCE would
> otherwise be empty.  Then any other element queried for is assumed to be no
> more common than this most common element.

Hmm, what happens if a common-but-not-an-MCE element is pruned out of
the array when a bucket is filled?  I imagine it's going to mis-estimate
the selectivity (though I imagine the effect is going to be pretty
benign anyway; it's still going to be better than the stock 0.5%).

> I'd also briefly considered just having the part of the code that pulls the
> stats out of pg_stats interpret a NULL MCE array as meaning that nothing is more
> frequent than the threshold, but that would mean that that part of the code
> needs to know about how the threshold is chosen, which just seems wrong.

I wonder if we shouldn't add a separate stats STATISTIC_KIND for this,
instead of trying to transfer knowledge.
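
For reference, each statistics kind just occupies one of the numbered slots in pg_statistic, so a new kind would simply show up as another stakind value; a rough way to peek at what ANALYZE stored for Jeff's example table:

select staattnum, stakind1, stakind2, stakind3, stakind4, stakind5
from pg_statistic
where starelid = 'foobar'::regclass;
-- STATISTIC_KIND_MCELEM is kind 4; a new kind for the threshold frequency
-- would just claim one of the free slots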


Given how simple this patch is, I am tempted to apply it anyway.  It
needs a few additional comments to explain what is going on, though.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: statistics for array types

From
Alexander Korotkov
Date:
On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

On 08/11/2015 04:38 PM, Jeff Janes wrote:
When reviewing some recent patches, I decided the statistics gathered
 for arrays had some pre-existing shortcomings.

The main one is that when the arrays contain rare elements there is
no histogram to fall back upon when the MCE array is empty, the way
there is for scalar stats.  So it has to punt completely and resort
to saying that it is 0.5% selectivity without recourse to any data at
all.

The rationale for applying the threshold before things are eligible
for inclusion in the MCE array seems to be that this puts some
theoretical bound on the amount of error we are likely to have in
that element.  But I think it is better to exceed that theoretical
bound than it is to have no data at all.

The attached patch forces there to be at least one element in MCE,
keeping the one element with the highest predicted frequency if the
MCE would otherwise be empty.  Then any other element queried for is
assumed to be no more common than this most common element.

We only really need the frequency, right? So do we really need to keep
the actual MCV element? I.e. most_common_elem_freqs does not have the
same number of values as most_common_elems anyway:

  A list of the frequencies of the most common element values, i.e., the
  fraction of rows containing at least one instance of the given value.
  Two or three additional values follow the per-element frequencies;
  these are the minimum and maximum of the preceding per-element
  frequencies, and optionally the frequency of null elements.
  (Null when most_common_elems is.)

So we might modify it so that it's always defined - either it tracks the
same values as today (when most_common_elems is defined), or the
frequency of the most common element (when most_common_elems is NULL).

I had also considered that.  It requires more changes to make it happen, and it seems to create a more complex contract on what those columns mean, but without giving a corresponding benefit.
 

This way we can keep the current theoretical error-bound on the MCE
frequencies, and if that's not possible we can have at least the new
value without confusing existing code.

But if the frequency of the most common element was grossly wrong, then whatever value we stick in there is still going to be grossly wrong.  Removing the value associated with it isn't going to stop it from being wrong.  When we do query with the (incorrectly identified) first most common element, either it will find and use the wrong value from slot 1, or it will find nothing and fall back on the same wrong value from slot 3.

Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we collect no MCEs. Moreover, I think we should store it even when num_mcelem >= track_len and we haven't cut any of the MCEs we found. In this case we can get a more precise estimate for a rare element using the knowledge that all MCEs which exceed the threshold are present (assuming their frequencies could be much higher than the threshold).

When there are no MCEs, we should use the assumption that there are no elements more frequent than cutoff_freq / nonnull_cnt. Using lower values wouldn't be statistically correct.
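
To put rough numbers on it (purely illustrative figures; with the default statistics target of 100, ANALYZE samples 30000 rows):

-- hypothetical: a cutoff count of 63 over 30000 sampled non-null arrays
select 63::float4 / 30000 as implied_minfreq;  -- ~0.0021
-- so an element that never made it into the MCE list would be assumed to
-- occur in at most ~0.21% of rows, instead of the flat 0.5% default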

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: statistics for array types

From
Alexander Korotkov
Date:
On Wed, Sep 16, 2015 at 8:01 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

On 08/11/2015 04:38 PM, Jeff Janes wrote:
When reviewing some recent patches, I decided the statistics gathered
 for arrays had some pre-existing shortcomings.

The main one is that when the arrays contain rare elements there is
no histogram to fall back upon when the MCE array is empty, the way
there is for scalar stats.  So it has to punt completely and resort
to saying that it is 0.5% selectivity without recourse to any data at
all.

The rationale for applying the threshold before things are eligible
for inclusion in the MCE array seems to be that this puts some
theoretical bound on the amount of error we are likely to have in
that element.  But I think it is better to exceed that theoretical
bound than it is to have no data at all.

The attached patch forces there to be at least one element in MCE,
keeping the one element with the highest predicted frequency if the
MCE would otherwise be empty.  Then any other element queried for is
assumed to be no more common than this most common element.

We only really need the frequency, right? So do we really need to keep
the actual MCV element? I.e. most_common_elem_freqs does not have the
same number of values as most_common_elems anyway:

  A list of the frequencies of the most common element values, i.e., the
  fraction of rows containing at least one instance of the given value.
  Two or three additional values follow the per-element frequencies;
  these are the minimum and maximum of the preceding per-element
  frequencies, and optionally the frequency of null elements.
  (Null when most_common_elems is.)

So we might modify it so that it's always defined - either it tracks the
same values as today (when most_common_elems is defined), or the
frequency of the most common element (when most_common_elems is NULL).

I had also considered that.  It requires more changes to make it happen, and it seems to create a more complex contract on what those columns mean, but without giving a corresponding benefit.
 

This way we can keep the current theoretical error-bound on the MCE
frequencies, and if that's not possible we can have at least the new
value without confusing existing code.

But if the frequency of the most common element was grossly wrong, then whatever value we stick in there is still going to be grossly wrong.  Removing the value associated with it isn't going to stop it from being wrong.  When we do query with the (incorrectly identified) first most common element, either it will find and use the wrong value from slot 1, or it will find nothing and fall back on the same wrong value from slot 3.

Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we collect no MCEs. Moreover, I think we should store it even when num_mcelem >= track_len and we haven't cut any of the MCEs we found. In this case we can get a more precise estimate for a rare element using the knowledge that all MCEs which exceed the threshold are present (assuming their frequencies could be much higher than the threshold).

When there are no MCEs, we should use the assumption that there are no elements more frequent than cutoff_freq / nonnull_cnt. Using lower values wouldn't be statistically correct.

The patch implementing my idea above is attached. In your example it gives the following result.

# explain (analyze) select * from foobar where foo @> '{567}';
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=30 width=61) (actual time=28.691..28.691 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.044 ms
 Execution time: 28.707 ms
(5 rows)

In this particular example it gives a less accurate estimate (30 rows instead of 3, where the correct answer is 0). However, I believe it would be a safer estimate in general.

I've faced difficulties with storing an empty mcelements array. update_attstats turns an empty array into NULL, and get_attstatsslot throws an error when trying to read a NULL array. I've changed get_attstatsslot so that it returns an empty array when it encounters NULL. I'm not sure about this solution.
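
The distinction matters because at the SQL level an empty array and NULL are different things, while update_attstats collapses the former into the latter; a two-line illustration:

select '{}'::int[] is null as empty_is_null,      -- false: an empty array is not NULL
       array_length('{}'::int[], 1) as num_elems; -- NULL: but it has no elements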

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: statistics for array types

From
Alvaro Herrera
Date:
Alexander Korotkov wrote:

> The patch implementing my idea above is attached.

What's the status here?  Jeff, did you have a look at Alexander's
version of your patch?  Tomas, does this patch satisfy your concerns?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: statistics for array types

From
David Steele
Date:
On 1/18/16 11:24 AM, Alvaro Herrera wrote:
> Alexander Korotkov wrote:
>
>> The patch implementing my idea above is attached.
> What's the status here?  Jeff, did you have a look at Alexander's
> version of your patch?  Tomas, does this patch satisfy your concerns?

This was marked as "needs review" but I have changed it to "waiting for 
author".

Thanks,
-David



Re: statistics for array types

From
David Steele
Date:
On 3/9/16 7:42 PM, David Steele wrote:
> On 1/18/16 11:24 AM, Alvaro Herrera wrote:
>> Alexander Korotkov wrote:
>>
>>> The patch implementing my idea above is attached.
>> What's the status here?  Jeff, did you have a look at Alexander's
>> version of your patch?  Tomas, does this patch satisfy your concerns?
> 
> This was marked as "needs review" but I have changed it to "waiting for
> author".

There has been no activity on this thread in quite a while and no
updated patch during this CF.  It has been marked "returned with
feedback". Please feel free to resubmit for 9.7!

-- 
-David
david@pgmasters.net