Thread: pg_stats and range statistics

pg_stats and range statistics

From
Egor Rogov
Date:
Hi,

Statistics for range types are not currently exposed in pg_stats view 
(i.e. STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and 
STATISTIC_KIND_BOUNDS_HISTOGRAM).

Shouldn't they? If so, here is a patch for adding them.

The following is a simple example of what it looks like:

CREATE TABLE test(r int4range);
INSERT INTO test
     SELECT int4range((random()*10)::integer,(10+random()*10)::integer)
     FROM generate_series(1,10000);
SET default_statistics_target = 10;
ANALYZE test;

SELECT range_length_histogram, range_length_empty_frac, 
range_bounds_histogram
FROM pg_stats
WHERE tablename = 'test' \gx

-[ RECORD 1 ]-----------+------------------------------------------------------
range_length_histogram  | {1,4,6,8,9,10,11,12,14,16,20}
range_length_empty_frac | {0.0036666666}
range_bounds_histogram  | {"[0,10)","[1,11)","[2,12)","[3,13)","[4,14)","[5,15)","[6,16)","[7,17)","[8,18)","[9,19)","[10,20)"}


Regards,
Egor Rogov.


Attachment

Re: pg_stats and range statistics

From
Tomas Vondra
Date:
On 6/18/21 6:22 PM, Egor Rogov wrote:
> Hi,
> 
> Statistics for range types are not currently exposed in pg_stats view 
> (i.e. STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and 
> STATISTIC_KIND_BOUNDS_HISTOGRAM).
> 
> Shouldn't they? If so, here is a patch for adding them.
> 

I think they should be exposed - I don't see why not to do that. I 
noticed this when working on the count-min sketch experiment too, so 
thanks for this patch.

FWIW I've added the patch to the next CF:

https://commitfest.postgresql.org/33/3184/


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_stats and range statistics

From
Soumyadeep Chakraborty
Date:
Hello,

This should have been added with [1].

Excerpt from the documentation:
"pg_stats is also designed to present the information in a more readable
format than the underlying catalog — at the cost that its schema must
be extended whenever new slot types are defined for pg_statistic." [2]

So, I added a reminder in pg_statistic.h.

Attached is v2 of this patch with some cosmetic changes. Renamed the columns a
bit and updated the docs to be a bit more descriptive.
(range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
range_bounds_histograms)

One question:

We do have the option of representing the histogram of lower bounds separately
from the histogram of upper bounds, as two separate view columns. Don't know if
there is much utility though and there is a fair bit of added complexity: see
below. Thoughts?

My attempts via SQL (unnest -> lower|upper -> array_agg) were futile given
unnest does not play nice with anyarray. For instance:

select unnest(stavalues1) from pg_statistic;
ERROR:  cannot determine element type of "anyarray" argument

Maybe the only option is to write a UDF pg_get_{lower|upper}_bounds_histogram
which can do something similar to what calc_hist_selectivity does:

/*
 * Convert histogram of ranges into histograms of its lower and upper
 * bounds.
 */
nhist = hslot.nvalues;
hist_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
hist_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
for (i = 0; i < nhist; i++)
{
    bool        empty;

    range_deserialize(rng_typcache, DatumGetRangeTypeP(hslot.values[i]),
                      &hist_lower[i], &hist_upper[i], &empty);
    /* The histogram should not contain any empty ranges */
    if (empty)
        elog(ERROR, "bounds histogram contains an empty range");
}

This is looking good and ready.

[1] https://github.com/postgres/postgres/commit/918eee0c497c88260a2e107318843c9b1947bc6f
[2] https://www.postgresql.org/docs/devel/view-pg-stats.html

Regards,
Soumyadeep (VMware)

Attachment

Re: pg_stats and range statistics

From
Egor Rogov
Date:
Hi,

thanks for the review and corrections.

On 11.07.2021 21:54, Soumyadeep Chakraborty wrote:
> Hello,
>
> This should have been added with [1].
>
> Excerpt from the documentation:
> "pg_stats is also designed to present the information in a more readable
> format than the underlying catalog — at the cost that its schema must
> be extended whenever new slot types are defined for pg_statistic." [2]
>
> So, I added a reminder in pg_statistic.h.

Good point.


> Attached is v2 of this patch with some cosmetic changes.

I wonder why "TODO: catalog version bump"? This patch doesn't change the 
catalog structure, or am I missing something?


> Renamed the columns a
> bit and updated the docs to be a bit more descriptive.
> (range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
> range_bounds_histograms)

I intended to make the same prefix ("range_") for all columns concerned 
with range types, although I'm fine with the proposed naming.


> One question:
>
> We do have the option of representing the histogram of lower bounds separately
> from the histogram of upper bounds, as two separate view columns. Don't know if
> there is much utility though and there is a fair bit of added complexity: see
> below. Thoughts?

I thought about it too, and decided not to transform the underlying data 
structure. As far as I can see, pg_stats never employed such 
transformations. For example, STATISTIC_KIND_DECHIST is an array 
containing the histogram followed by the average in its last element. It 
is shown in pg_stats.elem_count_histogram as is, although it could arguably 
be split into two fields. All in all, I believe pg_stats's job is 
to "unpack" stavalues and stanumbers into meaningful fields, and not to 
try to go deeper than that.


>
> My attempts via SQL (unnest -> lower|upper -> array_agg) were futile given
> unnest does not play nice with anyarray. For instance:
>
> select unnest(stavalues1) from pg_statistic;
> ERROR:  cannot determine element type of "anyarray" argument
>
> Maybe the only option is to write a UDF pg_get_{lower|upper}_bounds_histogram
> which can do something similar to what calc_hist_selectivity does:
>
> /*
>  * Convert histogram of ranges into histograms of its lower and upper
>  * bounds.
>  */
> nhist = hslot.nvalues;
> hist_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
> hist_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
> for (i = 0; i < nhist; i++)
> {
>     bool        empty;
>
>     range_deserialize(rng_typcache, DatumGetRangeTypeP(hslot.values[i]),
>                       &hist_lower[i], &hist_upper[i], &empty);
>     /* The histogram should not contain any empty ranges */
>     if (empty)
>         elog(ERROR, "bounds histogram contains an empty range");
> }
>
> This is looking good and ready.
>
> [1] https://github.com/postgres/postgres/commit/918eee0c497c88260a2e107318843c9b1947bc6f
> [2] https://www.postgresql.org/docs/devel/view-pg-stats.html
>
> Regards,
> Soumyadeep (VMware)



Re: pg_stats and range statistics

From
Tomas Vondra
Date:
On 7/12/21 1:10 PM, Egor Rogov wrote:
> Hi,
> 
> thanks for the review and corrections.
> 
> On 11.07.2021 21:54, Soumyadeep Chakraborty wrote:
>> Hello,
>>
>> This should have been added with [1].
>>
>> Excerpt from the documentation:
>> "pg_stats is also designed to present the information in a more readable
>> format than the underlying catalog — at the cost that its schema must
>> be extended whenever new slot types are defined for pg_statistic." [2]
>>
>> So, I added a reminder in pg_statistic.h.
> 
> Good point.
> 
> 
>> Attached is v2 of this patch with some cosmetic changes.
> 
> I wonder why "TODO: catalog version bump"? This patch doesn't change
> the catalog structure, or am I missing something?
> 

It changes system_views.sql, which is a catalog change, as it redefines
the pg_stats system view (it adds 3 more columns). So it changes what
you get after initdb, hence catversion has to be bumped.

> 
>> Renamed the columns a
>> bit and updated the docs to be a bit more descriptive.
>> (range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
>> range_bounds_histograms)
> 
> I intended to make the same prefix ("range_") for all columns concerned
> with range types, although I'm fine with the proposed naming.
> 

Yeah, I'd vote to change empty_range_frac -> range_empty_frac.

> 
>> One question:
>>
>> We do have the option of representing the histogram of lower bounds
>> separately
>> from the histogram of upper bounds, as two separate view columns.
>> Don't know if
>> there is much utility though and there is a fair bit of added
>> complexity: see
>> below. Thoughts?
> 
> I thought about it too, and decided not to transform the underlying data
> structure. As far as I can see, pg_stats never employed such
> transformations. For example, STATISTIC_KIND_DECHIST is an array
> containing the histogram followed by the average in its last element. It
> is shown in pg_stats.elem_count_histogram as is, although it could arguably
> be split into two fields. All in all, I believe pg_stats's job is
> to "unpack" stavalues and stanumbers into meaningful fields, and not to
> try to go deeper than that.
> 

Not a firm opinion, but pg_stats is meant to be easier for humans to
read and understand. So far the transformations were simple because all
the data was fairly simple, but the range stuff may need a more complex
transformation.

For example, we do quite a bit more in the pg_stats_ext views, because
they deal with multi-column stats.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_stats and range statistics

From
Egor Rogov
Date:
Hi Tomas,

On 12.07.2021 16:04, Tomas Vondra wrote:
> On 7/12/21 1:10 PM, Egor Rogov wrote:
>> Hi,
>>
>> thanks for the review and corrections.
>>
>> On 11.07.2021 21:54, Soumyadeep Chakraborty wrote:
>>> Hello,
>>>
>>> This should have been added with [1].
>>>
>>> Excerpt from the documentation:
>>> "pg_stats is also designed to present the information in a more readable
>>> format than the underlying catalog — at the cost that its schema must
>>> be extended whenever new slot types are defined for pg_statistic." [2]
>>>
>>> So, I added a reminder in pg_statistic.h.
>> Good point.
>>
>>
>>> Attached is v2 of this patch with some cosmetic changes.
>> I wonder why "TODO: catalog version bump"? This patch doesn't change
>> the catalog structure, or am I missing something?
>>
> It changes system_views.sql, which is a catalog change, as it redefines
> the pg_stats system view (it adds 3 more columns). So it changes what
> you get after initdb, hence catversion has to be bumped.
>
>>> Renamed the columns a
>>> bit and updated the docs to be a bit more descriptive.
>>> (range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
>>> range_bounds_histograms)
>> I intended to make the same prefix ("range_") for all columns concerned
>> with range types, although I'm fine with the proposed naming.
>>
> Yeah, I'd vote to change empty_range_frac -> range_empty_frac.
>
>>> One question:
>>>
>>> We do have the option of representing the histogram of lower bounds
>>> separately
>>> from the histogram of upper bounds, as two separate view columns.
>>> Don't know if
>>> there is much utility though and there is a fair bit of added
>>> complexity: see
>>> below. Thoughts?
>> I thought about it too, and decided not to transform the underlying data
>> structure. As far as I can see, pg_stats never employed such
>> transformations. For example, STATISTIC_KIND_DECHIST is an array
>> containing the histogram followed by the average in its last element. It
>> is shown in pg_stats.elem_count_histogram as is, although it could arguably
>> be split into two fields. All in all, I believe pg_stats's job is
>> to "unpack" stavalues and stanumbers into meaningful fields, and not to
>> try to go deeper than that.
>>
> Not a firm opinion, but pg_stats is meant to be easier for humans to
> read and understand. So far the transformations were simple because all
> the data was fairly simple, but the range stuff may need a more complex
> transformation.
>
> For example, we do quite a bit more in the pg_stats_ext views, because
> they deal with multi-column stats.


In pg_stats_ext, yes, but not in pg_stats (at least until now).

Since no one has expressed a strong desire for a more complex 
transformation, should we proceed with the proposed approach (with 
further renaming empty_range_frac -> range_empty_frac as you suggested)? 
Or should we wait a bit longer for someone else to weigh in?


>
>
> regards
>



Re: pg_stats and range statistics

From
Tomas Vondra
Date:
Hi Egor,

While reviewing a patch improving join estimates for ranges [1] I
realized we don't show stats for ranges in pg_stats, and I recalled we
had this patch.

I rebased the v2, and I decided to take a stab at showing separate
histograms for the lower and upper bounds. I believe it makes it way
more readable, which is what pg_stats is about IMHO.

This simply adds two functions, accepting/producing anyarray - one for
lower bounds, one for upper bounds. I don't think it can be done with a
plain subquery (or at least I don't know how).

Finally, it renames the empty_range_frac to start with range_, per the
earlier discussion. I wonder if the new column names for lower/upper
bounds (range_lower_bounds_histograms/range_upper_bounds_histograms) are
too long ...

regards

[1] https://commitfest.postgresql.org/41/3821/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: pg_stats and range statistics

From
Egor Rogov
Date:
Hi Tomas,

On 21.01.2023 00:50, Tomas Vondra wrote:
> Hi Egor,
>
> While reviewing a patch improving join estimates for ranges [1] I
> realized we don't show stats for ranges in pg_stats, and I recalled we
> had this patch.
>
> I rebased the v2, and I decided to take a stab at showing separate
> histograms for the lower and upper bounds. I believe it makes it way
> more readable, which is what pg_stats is about IMHO.


Thanks for looking into this.

I have to admit it looks much better this way, so +1.


> This simply adds two functions, accepting/producing anyarray - one for
> lower bounds, one for upper bounds. I don't think it can be done with a
> plain subquery (or at least I don't know how).


Anyarray is an alien to SQL, so functions are well justified here. What 
makes me a bit uneasy is two almost identical functions. Should we 
consider other options like a function with an additional parameter or a 
function returning an array of bounds arrays (which is somewhat 
wasteful, but probably it doesn't matter much here)?


> Finally, it renames the empty_range_frac to start with range_, per the
> earlier discussion. I wonder if the new column names for lower/upper
> bounds (range_lower_bounds_histograms/range_upper_bounds_histograms) are
> too long ...


It seems so. The ending -s should be left out since it's a single 
histogram now. And I think that 
range_lower_histogram/range_upper_histogram are descriptive enough.

I'm adding one more patch to shorten the column names, refresh the docs, 
and make 'make check' happy (unfortunately, we have to edit 
src/regress/expected/rules.out every time the pg_stats definition changes).


>
> regards
>
> [1] https://commitfest.postgresql.org/41/3821/
>
Attachment

Re: pg_stats and range statistics

From
Tomas Vondra
Date:
On 1/21/23 19:53, Egor Rogov wrote:
> Hi Tomas,
> 
> On 21.01.2023 00:50, Tomas Vondra wrote:
>> Hi Egor,
>>
>> While reviewing a patch improving join estimates for ranges [1] I
>> realized we don't show stats for ranges in pg_stats, and I recalled we
>> had this patch.
>>
>> I rebased the v2, and I decided to take a stab at showing separate
>> histograms for the lower and upper bounds. I believe it makes it way
>> more readable, which is what pg_stats is about IMHO.
> 
> 
> Thanks for looking into this.
> 
> I have to admit it looks much better this way, so +1.
> 

OK, good to hear.

> 
>> This simply adds two functions, accepting/producing anyarray - one for
>> lower bounds, one for upper bounds. I don't think it can be done with a
>> plain subquery (or at least I don't know how).
> 
> 
> Anyarray is an alien to SQL, so functions are well justified here. What
> makes me a bit uneasy is two almost identical functions. Should we
> consider other options like a function with an additional parameter or a
> function returning an array of bounds arrays (which is somewhat
> wasteful, but probably it doesn't matter much here)?
> 

I thought about that, including the alternatives (e.g. a single
function with a parameter determining which boundary to return), but I
don't think they're better.

Moreover, I think this is pretty similar to lower/upper, which already
work on range values. So if we have separate functions for that, we
should do the same thing here.

I renamed the functions to ranges_lower/ranges_upper, but then again,
why not simply call the functions lower/upper too?

The main trouble with the functions I can think of is that we only have
the anyarray type, not anyrangearray. So the functions will get called for
an arbitrary array, and the check that it's an array of ranges happens
inside. I'm not sure if that's a good or bad idea, or what it would take
to add a new polymorphic type ...

For now I at least kept the "ranges_" prefix to make that less likely.

> 
>> Finally, it renames the empty_range_frac to start with range_, per the
>> earlier discussion. I wonder if the new column names for lower/upper
>> bounds (range_lower_bounds_histograms/range_upper_bounds_histograms) are
>> too long ...
> 
> 
> It seems so. The ending -s should be left out since it's a single
> histogram now. And I think that
> range_lower_histogram/range_upper_histogram are descriptive enough.
> 
> I'm adding one more patch to shorten the column names, refresh the docs,
> and make 'make check' happy (unfortunately, we have to edit
> src/regress/expected/rules.out every time pg_stats definition changes).
> 

Thanks. I noticed the docs were added to pg_user_mapping by mistake, not
to pg_stats. So I fixed that, and I also added the new functions.

Finally, I reordered the fields a bit - moved range_empty_frac to keep
the histogram fields together.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: pg_stats and range statistics

From
Justin Pryzby
Date:
On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote:
> On 1/21/23 19:53, Egor Rogov wrote:
> > Hi Tomas,
> > On 21.01.2023 00:50, Tomas Vondra wrote:
> >> This simply adds two functions, accepting/producing anyarray - one for
> >> lower bounds, one for upper bounds. I don't think it can be done with a
> >> plain subquery (or at least I don't know how).
> > 
> > Anyarray is an alien to SQL, so functions are well justified here. What
> > makes me a bit uneasy is two almost identical functions. Should we
> > consider other options like a function with an additional parameter or a
> > function returning an array of bounds arrays (which is somewhat
> > wasteful, but probably it doesn't matter much here)?
> > 
> 
> I thought about that, including the alternatives (e.g. a single
> function with a parameter determining which boundary to return), but I
> don't think they're better.

What about a common function, maybe called like:

ranges_upper_bounds(PG_FUNCTION_ARGS)
{
    AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0);
    Oid         element_type = AARR_ELEMTYPE(array);
    TypeCacheEntry *typentry;

    /* Get information about range type; note column might be a domain */
    typentry = range_get_typcache(fcinfo, getBaseType(element_type));

    return ranges_bounds_common(typentry, array, false);
}

That saves 40 LOC.
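
For concreteness, here is a rough sketch of how ranges_bounds_common() and one
of the thin wrappers might be fleshed out. The names follow the snippet above;
everything else is only an illustration, not the posted patch. For simplicity
it takes a plain ArrayType via deconstruct_array() rather than the AnyArrayType
accessors, and it already maps empty ranges and infinite bounds to NULLs, which
anticipates the lower()/upper() consistency point raised below:

#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"
#include "utils/lsyscache.h"
#include "utils/rangetypes.h"
#include "utils/typcache.h"

/* Build an array of lower or upper bounds from an array of ranges. */
static Datum
ranges_bounds_common(TypeCacheEntry *typcache, ArrayType *array, bool use_lower)
{
    TypeCacheEntry *elemtypcache = typcache->rngelemtype;
    Datum      *ranges;
    bool       *rangenulls;
    int         nranges;
    Datum      *bounds;
    bool       *boundnulls;
    int         dims[1];
    int         lbs[1];

    deconstruct_array(array, ARR_ELEMTYPE(array),
                      typcache->typlen, typcache->typbyval, typcache->typalign,
                      &ranges, &rangenulls, &nranges);

    bounds = palloc(sizeof(Datum) * nranges);
    boundnulls = palloc(sizeof(bool) * nranges);

    for (int i = 0; i < nranges; i++)
    {
        RangeBound  lower;
        RangeBound  upper;
        bool        empty;

        /* default to NULL; overwritten below if we find a finite bound */
        boundnulls[i] = true;
        bounds[i] = (Datum) 0;

        if (rangenulls[i])
            continue;

        range_deserialize(typcache, DatumGetRangeTypeP(ranges[i]),
                          &lower, &upper, &empty);

        /* mirror lower()/upper(): empty ranges and infinite bounds give NULL */
        if (empty || (use_lower ? lower.infinite : upper.infinite))
            continue;

        boundnulls[i] = false;
        bounds[i] = use_lower ? lower.val : upper.val;
    }

    dims[0] = nranges;
    lbs[0] = 1;

    return PointerGetDatum(construct_md_array(bounds, boundnulls, 1, dims, lbs,
                                              elemtypcache->type_id,
                                              elemtypcache->typlen,
                                              elemtypcache->typbyval,
                                              elemtypcache->typalign));
}

Datum
ranges_lower_bounds(PG_FUNCTION_ARGS)
{
    ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
    TypeCacheEntry *typcache;

    /* Get information about range type; note column might be a domain */
    typcache = range_get_typcache(fcinfo, getBaseType(ARR_ELEMTYPE(array)));

    return ranges_bounds_common(typcache, array, true);
}

/* ranges_upper_bounds() would be identical, passing false instead. */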

Shouldn't this add some sql tests ?

-- 
Justin



Re: pg_stats and range statistics

From
Tomas Vondra
Date:

On 1/22/23 22:33, Justin Pryzby wrote:
> On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote:
>> On 1/21/23 19:53, Egor Rogov wrote:
>>> Hi Tomas,
>>> On 21.01.2023 00:50, Tomas Vondra wrote:
>>>> This simply adds two functions, accepting/producing anyarray - one for
>>>> lower bounds, one for upper bounds. I don't think it can be done with a
>>>> plain subquery (or at least I don't know how).
>>>
>>> Anyarray is an alien to SQL, so functions are well justified here. What
>>> makes me a bit uneasy is two almost identical functions. Should we
>>> consider other options like a function with an additional parameter or a
>>> function returning an array of bounds arrays (which is somewhat
>>> wasteful, but probably it doesn't matter much here)?
>>>
>>
>> I thought about that, including the alternatives (e.g. a single
>> function with a parameter determining which boundary to return), but I
>> don't think they're better.
> 
> What about a common function, maybe called like:
> 
> ranges_upper_bounds(PG_FUNCTION_ARGS)
> {
>     AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0);
>     Oid         element_type = AARR_ELEMTYPE(array);
>     TypeCacheEntry *typentry;
> 
>     /* Get information about range type; note column might be a domain */
>     typentry = range_get_typcache(fcinfo, getBaseType(element_type));
> 
>     return ranges_bounds_common(typentry, array, false);
> }
> 
> That saves 40 LOC.
> 

Thanks, that's better. But I'm still not sure it's a good idea to add
a function with an anyarray argument, when we need it to be an array of
ranges ...

I wonder if we have other functions doing something similar, i.e.
accepting a polymorphic type and then imposing additional restrictions
on it.

> Shouldn't this add some sql tests ?
> 

Yeah, I guess we should have a couple tests calling these functions on
different range arrays.

This reminds me lower()/upper() have some extra rules about handling
empty ranges / infinite boundaries etc. These functions should behave
consistently (as if we called lower() in a loop) and I'm pretty sure
that's not the current state.
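
(For reference, the rule in question, roughly paraphrasing range_lower() in
rangetypes.c from memory rather than quoting it exactly: an empty range or an
infinite bound yields NULL, and upper() is symmetric.)

#include "postgres.h"
#include "fmgr.h"
#include "utils/rangetypes.h"
#include "utils/typcache.h"

Datum
range_lower(PG_FUNCTION_ARGS)
{
    RangeType  *r1 = PG_GETARG_RANGE_P(0);
    TypeCacheEntry *typcache;
    RangeBound  lower;
    RangeBound  upper;
    bool        empty;

    typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));

    range_deserialize(typcache, r1, &lower, &upper, &empty);

    /* an empty range, or an infinite lower bound, yields NULL */
    if (empty || lower.infinite)
        PG_RETURN_NULL();

    PG_RETURN_DATUM(lower.val);
}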


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_stats and range statistics

From
Egor Rogov
Date:
Hi,

On 23.01.2023 02:21, Tomas Vondra wrote:
>
> On 1/22/23 22:33, Justin Pryzby wrote:
>> On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote:
>>> On 1/21/23 19:53, Egor Rogov wrote:
>>>> Hi Tomas,
>>>> On 21.01.2023 00:50, Tomas Vondra wrote:
>>>>> This simply adds two functions, accepting/producing anyarray - one for
>>>>> lower bounds, one for upper bounds. I don't think it can be done with a
>>>>> plain subquery (or at least I don't know how).
>>>> Anyarray is an alien to SQL, so functions are well justified here. What
>>>> makes me a bit uneasy is two almost identical functions. Should we
>>>> consider other options like a function with an additional parameter or a
>>>> function returning an array of bounds arrays (which is somewhat
>>>> wasteful, but probably it doesn't matter much here)?
>>>>
>>> I thought about that, including the alternatives (e.g. a single
>>> function with a parameter determining which boundary to return), but I
>>> don't think they're better.
>> What about a common function, maybe called like:
>>
>> ranges_upper_bounds(PG_FUNCTION_ARGS)
>> {
>>      AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0);
>>      Oid         element_type = AARR_ELEMTYPE(array);
>>      TypeCacheEntry *typentry;
>>
>>      /* Get information about range type; note column might be a domain */
>>      typentry = range_get_typcache(fcinfo, getBaseType(element_type));
>>
>>      return ranges_bounds_common(typentry, array, false);
>> }
>>
>> That saves 40 LOC.
>>
> Thanks, that's better. But I'm still not sure it's a good idea to add
> a function with an anyarray argument, when we need it to be an array of
> ranges ...
>
> I wonder if we have other functions doing something similar, i.e.
> accepting a polymorphic type and then imposing additional restrictions
> on it.


I couldn't find such examples, but adding an ad hoc polymorphic type just 
doesn't look right to me. Besides, you'll end up adding not just an 
anyrangearray type, but also anymultirangearray, 
anycompatiblerangearray, anycompatiblemultirangearray, and maybe their 
"non"-counterparts like anynonrangearray, and none of these are of 
much use. And one day you may need an array of arrays or something...

I wonder if it's possible to teach SQL to work with the anyarray type - at 
runtime the actual type of the anyarray elements is known, right? In fact, 
unnest() alone would be enough to eliminate the need for C functions altogether.


>> Shouldn't this add some sql tests ?
>>
> Yeah, I guess we should have a couple tests calling these functions on
> different range arrays.
>
> This reminds me lower()/upper() have some extra rules about handling
> empty ranges / infinite boundaries etc. These functions should behave
> consistently (as if we called lower() in a loop) and I'm pretty sure
> that's not the current state.


I can try to tidy things up, but first we need to decide on the general 
approach.


>
>
> regards
>



Re: pg_stats and range statistics

From
Egor Rogov
Date:
On 23.01.2023 13:01, Egor Rogov wrote:

> On 23.01.2023 02:21, Tomas Vondra wrote:
>> On 1/22/23 22:33, Justin Pryzby wrote:
>>> On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote:
>>>> On 1/21/23 19:53, Egor Rogov wrote:
>>>>> Hi Tomas,
>>>>> On 21.01.2023 00:50, Tomas Vondra wrote:
>>>>>> This simply adds two functions, accepting/producing anyarray - 
>>>>>> one for
>>>>>> lower bounds, one for upper bounds. I don't think it can be done 
>>>>>> with a
>>>>>> plain subquery (or at least I don't know how).
>>>>> Anyarray is an alien to SQL, so functions are well justified here. 
>>>>> What
>>>>> makes me a bit uneasy is two almost identical functions. Should we
>>>>> consider other options like a function with an additional 
>>>>> parameter or a
>>>>> function returning an array of bounds arrays (which is somewhat
>>>>> wasteful, but probably it doesn't matter much here)?
>>>>>
>>>> I thought about that, including the alternatives (e.g. a single
>>>> function with a parameter determining which boundary to return), but I
>>>> don't think they're better.
>>> What about a common function, maybe called like:
>>>
>>> ranges_upper_bounds(PG_FUNCTION_ARGS)
>>> {
>>>      AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0);
>>>      Oid         element_type = AARR_ELEMTYPE(array);
>>>      TypeCacheEntry *typentry;
>>>
>>>      /* Get information about range type; note column might be a 
>>> domain */
>>>      typentry = range_get_typcache(fcinfo, getBaseType(element_type));
>>>
>>>      return ranges_bounds_common(typentry, array, false);
>>> }
>>>
>>> That saves 40 LOC.
>>>
>> Thanks, that's better. But I'm still not sure it's a good idea to add
>> a function with an anyarray argument, when we need it to be an array of
>> ranges ...
>>
>> I wonder if we have other functions doing something similar, i.e.
>> accepting a polymorphic type and then imposing additional restrictions
>> on it.
>
>
> I couldn't find such examples, but adding an adhoc polymorphic type 
> just doesn't look right for me. Besides, you'll end up adding not just 
> anyrangearray type, but also anymultirangearray, 
> anycompatiblerangearray, anycompatiblemultirangearray, and maybe their 
> "non"-counterparts like anynonrangearray, and all of these are not of 
> much use. And one day you may need an array of arrays or something...
>
> I wonder if it's possible to teach SQL to work with anyarray type - at 
> runtime the actual type of anyarray elements is known, right? In fact, 
> unnest() alone is enough to eliminate the need of C functions altogether.


When I started to look at how we deal with anyarray columns, I came across 
the following comment in parse_coerce.c for 
enforce_generic_type_consistency():

 * A special case is that we could see ANYARRAY as an actual_arg_type even
 * when allow_poly is false (this is possible only because pg_statistic has
 * columns shown as anyarray in the catalogs).

It makes me realize how specific anyarray-as-a-real-type is to 
pg_statistic. Even if it's possible to somehow postpone type inference 
for this case from parse time to execution time, it's clearly not worth 
the effort.

So, I am for the simplest possible approach, that is, the two proposed 
functions ranges_upper(anyarray) and ranges_lower(anyarray). I am not 
even sure if it's worth documenting them, as they are very 
pg_statistic-specific and likely won't be useful for end users.


>
>
>>> Shouldn't this add some sql tests ?
>>>
>> Yeah, I guess we should have a couple tests calling these functions on
>> different range arrays.
>>
>> This reminds me lower()/upper() have some extra rules about handling
>> empty ranges / infinite boundaries etc. These functions should behave
>> consistently (as if we called lower() in a loop) and I'm pretty sure
>> that's not the current state.
>
>
> I can try to tidy things up, but first we need to decide on the 
> general approach.
>
>



Re: pg_stats and range statistics

From
"Gregory Stark (as CFM)"
Date:
On Sun, 22 Jan 2023 at 18:22, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I wonder if we have other functions doing something similar, i.e.
> accepting a polymorphic type and then imposing additional restrictions
> on it.

Meh, there's things like array comparison functions that require both
arguments to be the same kind of arrays. And array_agg that requires
the elements to be the same type as the state array (ie, same type as
the first element). Not sure there are any taking just one specific
type though.

> > Shouldn't this add some sql tests ?
>
> Yeah, I guess we should have a couple tests calling these functions on
> different range arrays.
>
> This reminds me lower()/upper() have some extra rules about handling
> empty ranges / infinite boundaries etc. These functions should behave
> consistently (as if we called lower() in a loop) and I'm pretty sure
> that's not the current state.

Are we still waiting on these two items? Egor, do you think you'll
have a chance to work on it this month?

-- 
Gregory Stark
As Commitfest Manager



Re: pg_stats and range statistics

From
Egor Rogov
Date:
On 20.03.2023 22:27, Gregory Stark (as CFM) wrote:
> On Sun, 22 Jan 2023 at 18:22, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I wonder if we have other functions doing something similar, i.e.
>> accepting a polymorphic type and then imposing additional restrictions
>> on it.
> Meh, there's things like array comparison functions that require both
> arguments to be the same kind of arrays. And array_agg that requires
> the elements to be the same type as the state array (ie, same type as
> the first element). Not sure there are any taking just one specific
> type though.
>
>>> Shouldn't this add some sql tests ?
>> Yeah, I guess we should have a couple tests calling these functions on
>> different range arrays.
>>
>> This reminds me lower()/upper() have some extra rules about handling
>> empty ranges / infinite boundaries etc. These functions should behave
>> consistently (as if we called lower() in a loop) and I'm pretty sure
>> that's not the current state.
> Are we still waiting on these two items? Egor, do you think you'll
> have a chance to work on it this month?


I can try to tidy things up, but I'm not sure if we reached a consensus.

Do we stick with the ranges_upper(anyarray) and ranges_lower(anyarray) 
functions? This approach is okay with me. Tomas, have you made up your mind?

Do we want to document these functions? They are very 
pg_statistic-specific and won't be useful for end users imo.





Re: pg_stats and range statistics

From
Tomas Vondra
Date:

On 3/20/23 20:54, Egor Rogov wrote:
> On 20.03.2023 22:27, Gregory Stark (as CFM) wrote:
>> On Sun, 22 Jan 2023 at 18:22, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>> I wonder if we have other functions doing something similar, i.e.
>>> accepting a polymorphic type and then imposing additional restrictions
>>> on it.
>> Meh, there's things like array comparison functions that require both
>> arguments to be the same kind of arrays. And array_agg that requires
>> the elements to be the same type as the state array (ie, same type as
>> the first element). Not sure there are any taking just one specific
>> type though.
>>
>>>> Shouldn't this add some sql tests ?
>>> Yeah, I guess we should have a couple tests calling these functions on
>>> different range arrays.
>>>
>>> This reminds me lower()/upper() have some extra rules about handling
>>> empty ranges / infinite boundaries etc. These functions should behave
>>> consistently (as if we called lower() in a loop) and I'm pretty sure
>>> that's not the current state.
>> Are we still waiting on these two items? Egor, do you think you'll
>> have a chance to work on it this month?
> 
> 
> I can try to tidy things up, but I'm not sure if we reached a consensus.
> 

We don't have any objections, and that's probably the best consensus we
can get here, I guess ...

So if you could clean it up a bit, and do something about the two open
items I mentioned (a bunch of tests on different arrays, and behavior
consistent with lower/upper), that'd be great.

> Do we stick with the ranges_upper(anyarray) and ranges_lower(anyarray)
> functions? This approach is okay with me. Tomas, have you made up your
> mind?
> 

I think the function approach is fine, but in my January 22 message I
was wondering why we're not actually naming them simply lower/upper.

> Do we want to document these functions? They are very
> pg_statistic-specific and won't be useful for end users imo.
> 

I don't see why not to document them. Sure, we're using them in a fairly
specific context, but I don't see why not to let people use them too
(which would be hard without docs).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_stats and range statistics

From
Egor Rogov
Date:
On 24.03.2023 01:46, Tomas Vondra wrote:

>
> So if you could clean it up a bit, and do something about the two open
> items I mentioned (a bunch of tests on different array,


I've added some tests to regress/sql/rangetypes.sql, based on the same 
dataset that is used to test lower() and upper().


> and behavior
> consistent with lower/upper),


Done. This required switching from construct_array(), which doesn't 
support NULLs, to construct_md_array(), which does. A nice side effect 
is that now we also support multidimensional arrays.

I've moved a common part of ranges_lower_bounds() and 
ranges_upper_bounds() to ranges_bounds_common(), following Justin's advice.


There is one thing I'm not sure what to do about. This check:

      if (typentry->typtype != TYPTYPE_RANGE)
          ereport(ERROR,
                  (errcode(ERRCODE_DATATYPE_MISMATCH),
                   errmsg("expected array of ranges")));

doesn't work, because the range_get_typcache() call errors out first 
("type %u is not a range type"). The message doesn't look friendly 
enough for a user-facing SQL function. Should we duplicate 
range_get_typcache's logic and replace the error message?
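
For illustration, one way to avoid duplicating much of range_get_typcache()
would be to duplicate only the pre-check, e.g. with type_is_range() from
lsyscache. The helper name below is made up, and this is only a sketch, not
part of the attached patch:

#include "postgres.h"
#include "fmgr.h"
#include "utils/lsyscache.h"
#include "utils/rangetypes.h"
#include "utils/typcache.h"

/*
 * Sketch only: reject non-range element types with a friendlier message
 * before range_get_typcache() gets a chance to complain.
 */
static TypeCacheEntry *
ranges_get_elem_typcache(FunctionCallInfo fcinfo, Oid element_type)
{
    Oid         base_type = getBaseType(element_type);

    if (!type_is_range(base_type))
        ereport(ERROR,
                (errcode(ERRCODE_DATATYPE_MISMATCH),
                 errmsg("expected array of ranges")));

    return range_get_typcache(fcinfo, base_type);
}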


>   that'd be great.
>
>> Do we stick with the ranges_upper(anyarray) and ranges_lower(anyarray)
>> functions? This approach is okay with me. Tomas, have you made up your
>> mind?
>>
> I think the function approach is fine, but in my January 22 message I
> was wondering why we're not actually naming them simply lower/upper.


I'd expect a lower(anyarray) function to return the lowest element in 
the array. That name doesn't hint that the function takes an array of 
ranges. So the ranges_ prefix seems justified to me.


>
>> Do we want to document these functions? They are very
>> pg_statistic-specific and won't be useful for end users imo.
>>
> I don't see why not to document them. Sure, we're using them in a fairly
> specific context, but I don't see why not to let people use them too
> (which would be hard without docs).


Okay. I've corrected the examples a bit.

The patch is attached.


Thanks,
Egor

Attachment

Re: pg_stats and range statistics

From
"Gregory Stark (as CFM)"
Date:
On Fri, 24 Mar 2023 at 14:48, Egor Rogov <e.rogov@postgrespro.ru> wrote:
>
> Done.

> There is one thing I'm not sure what to do about. This check:
>
>       if (typentry->typtype != TYPTYPE_RANGE)
>           ereport(ERROR,
>                   (errcode(ERRCODE_DATATYPE_MISMATCH),
>                    errmsg("expected array of ranges")));
>
> doesn't work, because the range_get_typcache() call errors out first
> ("type %u is not a range type"). The message doesn't look friendly
> enough for a user-facing SQL function. Should we duplicate
> range_get_typcache's logic and replace the error message?

> Okay. I've corrected the examples a bit.

It sounds like you've addressed Tomas's feedback and still have one
open question.

Fwiw I rebased it; it seemed to merge fine automatically.

I've updated the CF entry to Needs Review. But at this late date it
may have to wait until the next release.




-- 
Gregory Stark
As Commitfest Manager

Attachment

Re: pg_stats and range statistics

From
jian he
Date:
Hi. I played around with the latest (2023-Apr-4) patch.

+        <literal>lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])</literal>
should be
+        <literal>ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])</literal>

+        <literal>upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])</literal>
should be
+        <literal>ranges_upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])</literal>

https://www.postgresql.org/docs/current/catalog-pg-type.html
There is no association between numrange and its base type numeric.
So, for the template anyarray ranges_lower(anyarray), I don't think we can
take a numrange array as input and return a numeric array.

https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC
>> When the return value of a function is declared as a polymorphic type, there must be at least one argument position that is also polymorphic, and the actual data type(s) supplied for the polymorphic arguments determine the actual result type for that call.


regression=# select
ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4),
numrange(5.5,6.6)]);
 ranges_lower
---------------
 {1.1,3.3,5.5}
(1 row)
regression=# \gdesc
    Column    |    Type
--------------+------------
 ranges_lower | numrange[]
(1 row)

I don't think you can cast the literal '{1.1,3.3,5.5}' to numrange[].



Re: pg_stats and range statistics

From
Alexander Korotkov
Date:
Hi!

On Wed, Sep 6, 2023 at 6:18 PM jian he <jian.universality@gmail.com> wrote:
> +        <literal>lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])</literal>
> should be
> +        <literal>ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])</literal>
>
> +        <literal>upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])</literal>
> should be
> +        <literal>ranges_upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])</literal>
>
> https://www.postgresql.org/docs/current/catalog-pg-type.html
> there is no association between numrange and their base type numeric.
> so for template: anyarray ranges_lower(anyarray). I don't think we can
> input numrange array and return a numeric array.
>
> https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC
> >> When the return value of a function is declared as a polymorphic type, there must be at least one argument position that is also polymorphic, and the actual data type(s) supplied for the polymorphic arguments determine the actual result type for that call.
>
>
> regression=# select
> ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4),
> numrange(5.5,6.6)]);
>  ranges_lower
> ---------------
>  {1.1,3.3,5.5}
> (1 row)
> regression=# \gdesc
>     Column    |    Type
> --------------+------------
>  ranges_lower | numrange[]
> (1 row)
>
> I don't think you can cast literal ' {1.1,3.3,5.5}' to numrange[].

Thank you for noticing this.  Indeed, our polymorphic type system
doesn't support this case.  In order to support this, we need
something like "anyrangearray" pseudo-type.  However, it seems
overkill to introduce a new pseudo-type just to update pg_stats.

Additionally, I found that the current patch can't handle infinite
range bounds and discards information about the inclusiveness of range
bounds.  Infinite bounds could be represented as NULL (though I'm
not sure how good that representation is).  Regarding inclusiveness, I
don't see a way to represent it reasonably within an array of base
types.  I also don't feel good about discarding that accuracy in the
pg_stats view.

In conclusion of all of the above, I decided to revise the patch and
show the bounds histogram as it's stored in pg_statistic.  I revised
the docs correspondingly.

Also, for some reason, the patch added the description of the new columns to
the documentation of the pg_user_mapping table.  I've fixed that by moving
them to the documentation of the pg_stats view.

Also, I've extracted the new comment in pg_statistic.h into a separate patch.

I'm going to push this if there are no objections.

------
Regards,
Alexander Korotkov

Attachment

Re: pg_stats and range statistics

From
jian he
Date:
On Sat, Nov 25, 2023 at 7:06 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi!
>
> I'm going to push this if there are no objections.
>
> ------
> Regards,
> Alexander Korotkov

src/include/catalog/pg_statistic.h
268:  * range type's subdiff function. Only non-null rows are considered.

should it be:  * range type's subdiff function. Only non-null,
non-empty rows are considered.

Other than that, it looks fine to me.



Re: pg_stats and range statistics

From
jian he
Date:
On Sat, Nov 25, 2023 at 7:06 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi!
> Additionally, I found that the current patch can't handle infinite
> range bounds and discards information about inclusiveness of range
> bounds.  The infinite bounds could be represented as NULL (while I'm
> not sure how good this representation is).  Regarding inclusiveness, I
> don't see the possibility to represent them in a reasonable way within
> an array of base types.  I also don't feel good about discarding the
> accuracy in the pg_stats view.
>

in range_length_histogram, maybe we can document that when calculating
the length of a range, inclusiveness will be true.



Re: pg_stats and range statistics

From
Egor Rogov
Date:
Hi Alexander,

On 25.11.2023 02:06, Alexander Korotkov wrote:
>
> In conclusion of all of the above, I decided to revise the patch and
> show the bounds histogram as it's stored in pg_statistic.  I revised
> the docs correspondingly.


So basically we've returned to where it all started? I guess it's 
better than nothing, although I have to admit that the two-array 
representation is much more readable. Unfortunately it brings in a 
surprising amount of complexity.

Anyway, thanks for looking into this!





Re: pg_stats and range statistics

From
Alexander Korotkov
Date:
On Sat, Nov 25, 2023 at 10:58 AM jian he <jian.universality@gmail.com> wrote:
> On Sat, Nov 25, 2023 at 7:06 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > Hi!
> > Additionally, I found that the current patch can't handle infinite
> > range bounds and discards information about inclusiveness of range
> > bounds.  The infinite bounds could be represented as NULL (while I'm
> > not sure how good this representation is).  Regarding inclusiveness, I
> > don't see the possibility to represent them in a reasonable way within
> > an array of base types.  I also don't feel good about discarding the
> > accuracy in the pg_stats view.
> >
>
> in range_length_histogram, maybe we can document that when calculating
> the length of a range, inclusiveness will be true.

I've revised the patchset.  I edited the comment in pg_statistic.h as you
proposed, and added a short note to the documentation on how the
range length histogram is calculated.

------
Regards,
Alexander Korotkov

Attachment

Re: pg_stats and range statistics

From
Alexander Korotkov
Date:
On Sat, Nov 25, 2023 at 11:14 AM Egor Rogov <e.rogov@postgrespro.ru> wrote:
>
> Hi Alexander,
>
> On 25.11.2023 02:06, Alexander Korotkov wrote:
> >
> > In conclusion of all of the above, I decided to revise the patch and
> > show the bounds histogram as it's stored in pg_statistic.  I revised
> > the docs correspondingly.
>
>
> So basically we've returned to where it all started? I guess it's
> better than nothing, although I have to admit that the two-array
> representation is much more readable. Unfortunately it brings in a
> surprising amount of complexity.

Yep, it is.

> Anyway, thanks for looking into this!

And thank you for the feedback!

------
Regards,
Alexander Korotkov