Re: Specifying attribute slot for storing/reading statistics - Mailing list pgsql-hackers
From | Esteban Zimanyi |
---|---|
Subject | Re: Specifying attribute slot for storing/reading statistics |
Date | |
Msg-id | CAPqRbE5OnjZdOUKMdbqv3x3W7nrmjKRTKQO9oKbbNE8U_pigZA@mail.gmail.com Whole thread Raw |
In response to | Re: Specifying attribute slot for storing/reading statistics (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Specifying attribute slot for storing/reading statistics
|
List | pgsql-hackers |
Dear Tom
Many thanks for your quick reply. Indeed both solutions you proposed can be combined together in order to solve all the problems. However changes in the code are needed. Let me now elaborate on the solution concerning the combination of stakind/staop first and I will elaborate on adding a new kind identifier after.
In order to understand the setting, let me explain a little more about the different kinds of temporal types. As explained in my previous email these are types whose values are composed of elements v@t where v is a PostgreSQL/PostGIS type (float or geometry) and t is a TimestampTz. There are four kinds of temporal types, depending on the their duration
* Instant: Values of the form v@t. These are used for example to represent car accidents as in Point(0 0)@2000-01-01 08:30
* InstantSet: A set of values {v1@t1, ...., vn@tn} where the values between the points are unknown. These are used for example to represent checkins in FourSquare or RFID readings
* Sequence: A sequence of values [v1@t1, ...., vn@tn] where the values between two successive instants vi@ti vj@tj are (linearly) interpolated. These are used to represent for example GPS tracks.
* SequenceSet: A set of sequences {s1, ... , sn} where there is a temporal gap between them. These are used to represent for example GPS tracks where the signal was lost during a time period.
To compute the selectivity of temporal types we assume that time and space dimensions are independent and thus we can reuse all existing analyze and selectivity infrastructure in PostgreSQL/PostGIS. For the various durations this amounts to
* Instant: Use the functions in analyze.c and selfuncs.c independently for the value and time dimensions
* InstantSet: Use the functions in array_typanalyze.c, array_selfuncs.c independently for the value and time dimensions
* Sequence and SequenceSet: To simplify, we do not take into account the gaps, and thus use the functions in rangetypes_typanalyze.c, rangetypes_selfuncs.c independently for the value and time dimensions
However, this requires that the analyze and selectivity functions in all the above files satisfy the following
* Set the staop when computing statistics. For example in rangetypes_typanalyze.c the staop is set for STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM but not for STATISTIC_KIND_BOUNDS_HISTOGRAM
* Always call get_attstatsslot with the operator Oid not with InvalidOid. For example, from the 17 times this function is called in selfuncs.c only two are passed with an operator. This also requires to pass the operator as an additional parameter to several functions. For example, the operator should be passed to the function ineq_histogram_selectivity in selfuncs.c
* Export several top-level functions which are currently static. For example, var_eq_const, ineq_histogram_selectivity, eqjoinsel_inner and several others in the file selfuncs.c should be exported.
That would solve all the problems excepted for STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM, since in this case the staop will always be Float8LessOperator, independently of whether we are computing lengths of value ranges or of tstzranges. This could be solved by using a different stakind for the value and time dimensions.
If you want I can prepare a PR in order to understand the implications of these changes. Please let me know.
Esteban
Many thanks for your quick reply. Indeed both solutions you proposed can be combined together in order to solve all the problems. However changes in the code are needed. Let me now elaborate on the solution concerning the combination of stakind/staop first and I will elaborate on adding a new kind identifier after.
In order to understand the setting, let me explain a little more about the different kinds of temporal types. As explained in my previous email these are types whose values are composed of elements v@t where v is a PostgreSQL/PostGIS type (float or geometry) and t is a TimestampTz. There are four kinds of temporal types, depending on the their duration
* Instant: Values of the form v@t. These are used for example to represent car accidents as in Point(0 0)@2000-01-01 08:30
* InstantSet: A set of values {v1@t1, ...., vn@tn} where the values between the points are unknown. These are used for example to represent checkins in FourSquare or RFID readings
* Sequence: A sequence of values [v1@t1, ...., vn@tn] where the values between two successive instants vi@ti vj@tj are (linearly) interpolated. These are used to represent for example GPS tracks.
* SequenceSet: A set of sequences {s1, ... , sn} where there is a temporal gap between them. These are used to represent for example GPS tracks where the signal was lost during a time period.
To compute the selectivity of temporal types we assume that time and space dimensions are independent and thus we can reuse all existing analyze and selectivity infrastructure in PostgreSQL/PostGIS. For the various durations this amounts to
* Instant: Use the functions in analyze.c and selfuncs.c independently for the value and time dimensions
* InstantSet: Use the functions in array_typanalyze.c, array_selfuncs.c independently for the value and time dimensions
* Sequence and SequenceSet: To simplify, we do not take into account the gaps, and thus use the functions in rangetypes_typanalyze.c, rangetypes_selfuncs.c independently for the value and time dimensions
However, this requires that the analyze and selectivity functions in all the above files satisfy the following
* Set the staop when computing statistics. For example in rangetypes_typanalyze.c the staop is set for STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM but not for STATISTIC_KIND_BOUNDS_HISTOGRAM
* Always call get_attstatsslot with the operator Oid not with InvalidOid. For example, from the 17 times this function is called in selfuncs.c only two are passed with an operator. This also requires to pass the operator as an additional parameter to several functions. For example, the operator should be passed to the function ineq_histogram_selectivity in selfuncs.c
* Export several top-level functions which are currently static. For example, var_eq_const, ineq_histogram_selectivity, eqjoinsel_inner and several others in the file selfuncs.c should be exported.
That would solve all the problems excepted for STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM, since in this case the staop will always be Float8LessOperator, independently of whether we are computing lengths of value ranges or of tstzranges. This could be solved by using a different stakind for the value and time dimensions.
If you want I can prepare a PR in order to understand the implications of these changes. Please let me know.
Esteban
On Thu, Sep 5, 2019 at 5:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Esteban Zimanyi <ezimanyi@ulb.ac.be> writes:
> We are developing the analyze/selectivity functions for those types. Our
> approach is to use the standard PostgreSQL/PostGIS functions for the value
> and the time dimensions where the slots starting from 0 will be used for
> the value dimension, and the slots starting from 2 will be used for the
> time dimension. For example, for tfloat we use range_typanalyze and related
> functions for
> * collecting in slots 0 and 1, STATISTIC_KIND_BOUNDS_HISTOGRAM
> and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the float ranges of the value
> dimension
> * collecting in slots 2 and 3, STATISTIC_KIND_BOUNDS_HISTOGRAM
> and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the periods (similar to
> tstzranges) of the time dimension
IMO this is fundamentally wrong, or at least contrary to the design
of pg_statistic. It is not supposed to matter which "slot" a given
statistic type is actually stored in; rather, readers are supposed to
search for the desired statistic type using the stakindN, staopN and
(if relevant) stacollN fields.
In this case it seems like it'd be reasonable to rely on the staop
fields to distinguish between the value and time dimensions, since
(IIUC) they're of different types.
Another idea is to invent your own slot kind identifiers instead of
using built-in ones. I'm not sure that there's any point in using
the built-in kind values, since (a) none of the core selectivity code
is likely to get called on your data and (b) even if it were, it'd
likely do the wrong thing. See the comments in pg_statistic.h,
starting about line 150, about assignment of non-built-in slot kinds.
> Is there any chance that the API for accessing the typanalyze and
> selectivity functions will be enhanced in a future release ?
Well, maybe you could convince us that the stakind/staop scheme for
identifying statistics is inadequate so we need another identification
field (corresponding to a component of the column being described,
perhaps). I'd be strongly against assigning any semantic meaning
to the slot numbers, though. That's likely to break code that's
written according to existing conventions.
regards, tom lane
pgsql-hackers by date: