Thread: Specifying attribute slot for storing/reading statistics

Specifying attribute slot for storing/reading statistics

From
Esteban Zimanyi
Date:
Dear all

We are developing MobilityDB, an open source PostgreSQL/PostGIS extension that provides temporal and spatio-temporal types. The source code, manuals, and related publications are available at the address
https://github.com/ULB-CoDE-WIT/MobilityDB/

In MobilityDB temporal types are types derived from PostgreSQL/PostGIS types to which a time dimension is added. MobilityDB provides the following temporal types: tbool (temporal boolean), tint (temporal int), tfloat (temporal float), text (temporal text), tgeompoint (temporal geometric points) and tgeogpoint (temporal geographic points). For example, we can define a tfloat and a tgeompoint as follows

SELECT tfloat '[1.5@2000-01-01, 2.5@2000-01-02, 1.5@2000-01-03]';
SELECT tgeompoint '[Point(0 0)@2000-01-01 08:00, Point(1 0)@2000-01-02 08:05, Point(1 1)@2000-01-03 08:10]';

We are developing the analyze/selectivity functions for those types. Our approach is to use the standard PostgreSQL/PostGIS functions for the value and the time dimensions where the slots starting from 0 will be used for the value dimension, and the slots starting from 2 will be used for the time dimension. For example, for tfloat we use range_typanalyze and related functions for
* collecting in slots 0 and 1, STATISTIC_KIND_BOUNDS_HISTOGRAM and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the float ranges of the value dimension
*  collecting in slots 2 and 3, STATISTIC_KIND_BOUNDS_HISTOGRAM and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the periods (similar to tstzranges) of the time dimension

However, we end up copying several PostgreSQL functions to which we only add an additional parameter stating the slot number from which the specific statistic kind should be found (either 0 or 2)

bool
get_attstatsslot_mobdb(AttStatsSlot *sslot, HeapTuple statstuple,
int reqkind, Oid reqop, int flags, int startslot)
{
    [...]
    for (i = startslot; i < STATISTIC_NUM_SLOTS; i++)
    {
        if ((&stats->stakind1)[i] == reqkind &&
            (reqop == InvalidOid || (&stats->staop1)[i] == reqop))
            break;
    }
    [...]
}

double
var_eq_const_mobdb(VariableStatData *vardata, Oid operator, Datum constval,
    bool negate, int startslot)
{
    [...]
}
Selectivity
scalarineqsel_mobdb(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
VariableStatData *vardata, Datum constval, Oid consttype,
    int startslot)
{
    [...]
}

static Selectivity
mcv_selectivity_mobdb(VariableStatData *vardata, FmgrInfo *opproc,
Datum constval, Oid atttype, bool varonleft,
    double *sumcommonp, int startslot)
{
    [...]
}
static double
ineq_histogram_selectivity_mobdb(PlannerInfo *root, VariableStatData *vardata,
FmgrInfo *opproc, bool isgt, bool iseq, Datum constval,
    Oid consttype, int startslot)
{
    [...]
}

in addition to copying other functions needed by the above functions since they are not exported (defined as static)

static bool
get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
Oid sortop, Datum *min, Datum *max)

static bool
get_actual_variable_endpoint(Relation heapRel,
Relation indexRel, ScanDirection indexscandir,
ScanKey scankeys, int16 typLen,
bool typByVal, MemoryContext outercontext,
Datum *endpointDatum)

[...]

Is there a better way to do this ?

Is there any chance that the API for accessing the typanalyze and selectivity functions will be enhanced in a future release ?

Regards

Esteban

-- 
------------------------------------------------------------
Prof. Esteban Zimanyi
Department of Computer & Decision Engineering  (CoDE) CP 165/15   
Universite Libre de Bruxelles           
Avenue F. D. Roosevelt 50               
B-1050 Brussels, Belgium                
fax: + 32.2.650.47.13
tel: + 32.2.650.31.85
e-mail: ezimanyi@ulb.ac.be
Internet: http://code.ulb.ac.be/
------------------------------------------------------------

Re: Specifying attribute slot for storing/reading statistics

From
Tom Lane
Date:
Esteban Zimanyi <ezimanyi@ulb.ac.be> writes:
> We are developing the analyze/selectivity functions for those types. Our
> approach is to use the standard PostgreSQL/PostGIS functions for the value
> and the time dimensions where the slots starting from 0 will be used for
> the value dimension, and the slots starting from 2 will be used for the
> time dimension. For example, for tfloat we use range_typanalyze and related
> functions for
> * collecting in slots 0 and 1, STATISTIC_KIND_BOUNDS_HISTOGRAM
> and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the float ranges of the value
> dimension
> *  collecting in slots 2 and 3, STATISTIC_KIND_BOUNDS_HISTOGRAM
> and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the periods (similar to
> tstzranges) of the time dimension

IMO this is fundamentally wrong, or at least contrary to the design
of pg_statistic.  It is not supposed to matter which "slot" a given
statistic type is actually stored in; rather, readers are supposed to
search for the desired statistic type using the stakindN, staopN and
(if relevant) stacollN fields.

In this case it seems like it'd be reasonable to rely on the staop
fields to distinguish between the value and time dimensions, since
(IIUC) they're of different types.

Another idea is to invent your own slot kind identifiers instead of
using built-in ones.  I'm not sure that there's any point in using
the built-in kind values, since (a) none of the core selectivity code
is likely to get called on your data and (b) even if it were, it'd
likely do the wrong thing.  See the comments in pg_statistic.h,
starting about line 150, about assignment of non-built-in slot kinds.

> Is there any chance that the API for accessing the typanalyze and
> selectivity functions will be enhanced in a future release ?

Well, maybe you could convince us that the stakind/staop scheme for
identifying statistics is inadequate so we need another identification
field (corresponding to a component of the column being described,
perhaps).  I'd be strongly against assigning any semantic meaning
to the slot numbers, though.  That's likely to break code that's
written according to existing conventions.

            regards, tom lane



Re: Specifying attribute slot for storing/reading statistics

From
Esteban Zimanyi
Date:
Dear Tom

Many thanks for your quick reply. Indeed both solutions you proposed can be combined together in order to solve all the problems. However changes in the code are needed. Let me now elaborate on the solution concerning the combination of stakind/staop first and I will elaborate on adding a new kind identifier after.

In order to understand the setting, let me explain a little more about the different kinds of temporal types. As explained in my previous email these are types whose values are composed of elements v@t where v is a PostgreSQL/PostGIS type (float or geometry) and t is a TimestampTz. There are four kinds of temporal types, depending on the their duration
* Instant: Values of the form v@t. These are used for example to represent car accidents as in Point(0 0)@2000-01-01 08:30
* InstantSet: A set of values {v1@t1, ...., vn@tn} where the values between the points are unknown. These are used for example to represent checkins in FourSquare or RFID readings
* Sequence: A sequence of values [v1@t1, ...., vn@tn] where the values between two successive instants vi@ti vj@tj are (linearly) interpolated. These are used to represent for example GPS tracks.
* SequenceSet: A set of sequences {s1, ... , sn} where there is a temporal gap between them. These are used to represent for example GPS tracks where the signal was lost during a time period.

To compute the selectivity of temporal types we assume that time and space dimensions are independent and thus we can reuse all existing analyze and selectivity infrastructure in PostgreSQL/PostGIS. For the various durations this amounts to
* Instant: Use the functions in analyze.c and selfuncs.c independently for the value and time dimensions
* InstantSet: Use the functions in array_typanalyze.c, array_selfuncs.c independently for the value and time dimensions
* Sequence and SequenceSet: To simplify, we do not take into account the gaps, and thus use the functions in rangetypes_typanalyze.c, rangetypes_selfuncs.c independently for the value and time dimensions

However, this requires that the analyze and selectivity functions in all the above files satisfy the following
* Set the staop when computing statistics. For example in rangetypes_typanalyze.c the staop is set for STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM but not for STATISTIC_KIND_BOUNDS_HISTOGRAM
* Always call get_attstatsslot with the operator Oid not with InvalidOid. For example, from the 17 times this function is called in selfuncs.c only two are passed with an operator. This also requires to pass the operator as an additional parameter to several functions. For example, the operator should be passed to the function ineq_histogram_selectivity in selfuncs.c
* Export several top-level functions which are currently static. For example, var_eq_const, ineq_histogram_selectivity, eqjoinsel_inner and several others in the file selfuncs.c should be exported.

That would solve all the problems excepted for STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM, since in this case the staop will always be Float8LessOperator, independently of whether we are computing lengths of value ranges or of tstzranges. This could be solved by using a different stakind for the value and time dimensions.

If you want I can prepare a PR in order to understand the implications of these changes. Please let me know.

Esteban


On Thu, Sep 5, 2019 at 5:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Esteban Zimanyi <ezimanyi@ulb.ac.be> writes:
> We are developing the analyze/selectivity functions for those types. Our
> approach is to use the standard PostgreSQL/PostGIS functions for the value
> and the time dimensions where the slots starting from 0 will be used for
> the value dimension, and the slots starting from 2 will be used for the
> time dimension. For example, for tfloat we use range_typanalyze and related
> functions for
> * collecting in slots 0 and 1, STATISTIC_KIND_BOUNDS_HISTOGRAM
> and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the float ranges of the value
> dimension
> *  collecting in slots 2 and 3, STATISTIC_KIND_BOUNDS_HISTOGRAM
> and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the periods (similar to
> tstzranges) of the time dimension

IMO this is fundamentally wrong, or at least contrary to the design
of pg_statistic.  It is not supposed to matter which "slot" a given
statistic type is actually stored in; rather, readers are supposed to
search for the desired statistic type using the stakindN, staopN and
(if relevant) stacollN fields.

In this case it seems like it'd be reasonable to rely on the staop
fields to distinguish between the value and time dimensions, since
(IIUC) they're of different types.

Another idea is to invent your own slot kind identifiers instead of
using built-in ones.  I'm not sure that there's any point in using
the built-in kind values, since (a) none of the core selectivity code
is likely to get called on your data and (b) even if it were, it'd
likely do the wrong thing.  See the comments in pg_statistic.h,
starting about line 150, about assignment of non-built-in slot kinds.

> Is there any chance that the API for accessing the typanalyze and
> selectivity functions will be enhanced in a future release ?

Well, maybe you could convince us that the stakind/staop scheme for
identifying statistics is inadequate so we need another identification
field (corresponding to a component of the column being described,
perhaps).  I'd be strongly against assigning any semantic meaning
to the slot numbers, though.  That's likely to break code that's
written according to existing conventions.

                        regards, tom lane

Re: Specifying attribute slot for storing/reading statistics

From
Tomas Vondra
Date:
Hi,

Please don't top-post. If you're not responding to parts of the e-mail,
then don't quote it.

On Fri, Sep 06, 2019 at 12:50:33PM +0200, Esteban Zimanyi wrote:
>Dear Tom
>
>Many thanks for your quick reply. Indeed both solutions you proposed can be
>combined together in order to solve all the problems. However changes in
>the code are needed. Let me now elaborate on the solution concerning the
>combination of stakind/staop first and I will elaborate on adding a new
>kind identifier after.
>
>In order to understand the setting, let me explain a little more about the
>different kinds of temporal types. As explained in my previous email these
>are types whose values are composed of elements v@t where v is a
>PostgreSQL/PostGIS type (float or geometry) and t is a TimestampTz. There
>are four kinds of temporal types, depending on the their duration
>* Instant: Values of the form v@t. These are used for example to represent
>car accidents as in Point(0 0)@2000-01-01 08:30
>* InstantSet: A set of values {v1@t1, ...., vn@tn} where the values between
>the points are unknown. These are used for example to represent checkins in
>FourSquare or RFID readings
>* Sequence: A sequence of values [v1@t1, ...., vn@tn] where the values
>between two successive instants vi@ti vj@tj are (linearly) interpolated.
>These are used to represent for example GPS tracks.
>* SequenceSet: A set of sequences {s1, ... , sn} where there is a temporal
>gap between them. These are used to represent for example GPS tracks where
>the signal was lost during a time period.
>

So these are 4 different data types (or classes of data types) that you
introduce in your extension? Or is that just a conceptual view and it's
stored in some other way (e.g. normalized in some way)?

>To compute the selectivity of temporal types we assume that time and space
>dimensions are independent and thus we can reuse all existing analyze and
>selectivity infrastructure in PostgreSQL/PostGIS. For the various durations
>this amounts to
>* Instant: Use the functions in analyze.c and selfuncs.c independently for
>the value and time dimensions
>* InstantSet: Use the functions in array_typanalyze.c, array_selfuncs.c
>independently for the value and time dimensions
>* Sequence and SequenceSet: To simplify, we do not take into account the
>gaps, and thus use the functions in rangetypes_typanalyze.c,
>rangetypes_selfuncs.c independently for the value and time dimensions
>

OK.

>However, this requires that the analyze and selectivity functions in all
>the above files satisfy the following
>* Set the staop when computing statistics. For example in
>rangetypes_typanalyze.c the staop is set for
>STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM but not for
>STATISTIC_KIND_BOUNDS_HISTOGRAM
>* Always call get_attstatsslot with the operator Oid not with InvalidOid.
>For example, from the 17 times this function is called in selfuncs.c only
>two are passed with an operator. This also requires to pass the operator as
>an additional parameter to several functions. For example, the operator
>should be passed to the function ineq_histogram_selectivity in selfuncs.c
>* Export several top-level functions which are currently static. For
>example, var_eq_const, ineq_histogram_selectivity, eqjoinsel_inner and
>several others in the file selfuncs.c should be exported.
>
>That would solve all the problems excepted for
>STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM, since in this case the staop will
>always be Float8LessOperator, independently of whether we are computing
>lengths of value ranges or of tstzranges. This could be solved by using a
>different stakind for the value and time dimensions.
>

I don't think we're strongly against changing the code to allow this, as 
long as it does not break existing extensions/code (unnecessarily).

>If you want I can prepare a PR in order to understand the implications of
>these changes. Please let me know.
>

I think having an actual patch to look at would be helpful.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Specifying attribute slot for storing/reading statistics

From
Esteban Zimanyi
Date:
So these are 4 different data types (or classes of data types) that you
introduce in your extension? Or is that just a conceptual view and it's
stored in some other way (e.g. normalized in some way)?

At the SQL level these 4 durations are not distinguishable. For example for a tfloat (temporal float) we can have

select tfloat '1@2000-01-01' -- Instant duration
select tfloat '{1@2000-01-01 , 2@2000-01-02 , 1@2000-01-03}' -- Instant set duration
select tfloat '[1@2000-01-01, 2@2000-01-02 , 1@2000-01-03)' -- Sequence duration, left-inclusive and right-exclusive bound,
select tfloat {'[1@2000-01-01, 2@2000-01-02 , 1@2000-01-03], '[1@2000-01-04, 1@2000-01-05]}  ' -- Sequence set duration

Nevertheless it is possible to restrict a column to a specific duration with a typymod specifier as in

create table test ( ..., measure tfloat(Instant) -- only Instant durations accepted, ...)

At the C level these 4 durations are distinguished and implement in something equivalent to a template abstract class Temporal with four subclasses TemporalInst, TemporalI, TemporalSeq, and TemporalS. Indeed the algorithms for manipulating these 4 durations are completely different.  They are called template classes since they keep the Oid of the base type (float for tfloat or geometry for tgeompoint) in the same way array or ranges do.
 
For more information please refer to the manual at github
 
I don't think we're strongly against changing the code to allow this, as
long as it does not break existing extensions/code (unnecessarily).

>If you want I can prepare a PR in order to understand the implications of
>these changes. Please let me know.
>

I think having an actual patch to look at would be helpful.

I am preparing a first patch for the files selfuncs.h and selfunc.c and thus for instant duration selectivity. It basically 
1) Moves some prototypes of the static functions from the .c to the .h file so that the functions are exported.
2) Passes the operator from the top level functions to the inner functions such as mcv_selectivity or ineq_histogram_selectivity.

This allows me to call the functions twice, once for the value component and another for the time component, e.g. as follows.

    else if (cachedOp == CONTAINED_OP || cachedOp == OVERLAPS_OP)
    {
        /* Enable the addition of the selectivity of the value and time 
         * dimensions since either may be missing */
        int selec_value = 1.0, selec_time = 1.0

        /* Selectivity for the value dimension */
        if (MOBDB_FLAGS_GET_X(box->flags))
        {
            operator = oper_oid(LT_OP, valuetypid, valuetypid);
            selec_value = scalarineqsel(root, operator, falsefalse, vardata, 
                Float8GetDatum(box->xmin), valuetypid);
            operator = oper_oid(GT_OP, valuetypid, valuetypid);
            selec_value += scalarineqsel(root, operator, truefalse, vardata, 
                Float8GetDatum(box->xmax), valuetypid);
            selec_value = 1 - selec_value;
        }
        /* Selectivity for the time dimension */
        if (MOBDB_FLAGS_GET_T(box->flags))
        {
            operator = oper_oid(LT_OP, T_TIMESTAMPTZ, T_TIMESTAMPTZ);
            selec_time = scalarineqsel(root, operator, falsefalse, vardata, 
                TimestampTzGetDatum(box->tmin), TIMESTAMPTZOID);
            operator = oper_oid(GT_OP, T_TIMESTAMPTZ, T_TIMESTAMPTZ);
            selec_time += scalarineqsel(root, operator, truefalse, vardata, 
                TimestampTzGetDatum(box->tmax), TIMESTAMPTZOID);
            selec_time = 1 - selec_time;
        }
        selec = selec_value * selec_time;
    }

Regards

Esteban