Re: Functionscan estimates - Mailing list pgsql-performance

From PFC
Subject Re: Functionscan estimates
Date
Msg-id op.soyp090rth1vuj@localhost
Whole thread Raw
In response to Re: Functionscan estimates  (Josh Berkus <josh@agliodbs.com>)
List pgsql-performance
> My solution would be a lot simpler, since we could simply populate
> pg_proc.proestrows with "1000" by default if not changed by the DBA.  In
> an
> even better world, we could tie it to a table, saying that, for example,
> proestrows = my_table*0.02.

    What if the estimated row is a function of a parameter ?
    Say a function takes as a parameter :
    - a number to use in a LIMIT
    - it's a function to generate a certain number of values from a
predetermined set (like, array -> set returning function)

    In all those cases it's no use to have just a fixed number.

    Id suggest two solutions :
    - The ideal solution which is impossible to do :
    The function tells the planner about its stats, looking at its parameters

    - A solution that would be possible to do
     pg_proc.proestrows is... the name of another function, defined by the
user, which takes the exact same parameters as the set returning function
we're talking about, and which returns estimates.

    For instance, in pseudo-sql :

CREATE FUNCTION int_array_srf( INTEGER[] ) RETURNS SETOF INTEGER LANGUAGE
plpgsql AS $$
BEGIN
    FOR _i IN 1..icount($1)
        RETURN NEXT $1[_i];
    END
END    In the two cases above, this would give :

CREATE FUNCTION array_srf_estimator( INTEGER[] ) RETURNS INTEGER LANGUAGE
plpgsql AS $$
BEGIN
    RETURN icount( $1 );
END;

ALTER FUNCTION array_srf SET ESTIMATOR array_srf_estimator;

    Another interesting case would be the famous "Top 5 by category" case
where we use a SRF to emulate an index skip scan. Say we have a table
Categories and a table Users, each User having columns "categories" and
"score" and we want the N users with best score in each category :

CREATE FUNCTION top_n_by_category( INTEGER ) RETURN SETOF users%ROWTYPE
LANGUAGE plpgsql AS $$
DECLARE
    _cat_id    INTEGER;
    _n ALIAS FOR $1;
    _user    users%ROWTYPE;
BEGIN
    FOR _cat_id IN SELECT category_id FROM categories DO
        FOR _user IN SELECT * FROM users WHERE category_id = _cat_id ORDER BY
score DESC LIMIT _n DO
            RETURN NEXT _user;
        END
    END
END

CREATE FUNCTION top_n_by_category_estimator( INTEGER ) RETURN INTEGER
LANGUAGE plpgsql AS $$
BEGIN
    RETURN $1 * (the estimated number of rows for the categories table taken
 from the table statistics);
END;

ALTER FUNCTION top_n_by_category SET ESTIMATOR top_n_by_category_estimator;

    Got it ?

    The array_srf case would be extremely useful as this type of function is
generally used to join against other tables, and having estimates is
useful for that.
    The top_n case would be useless if we're just returning the rows from the
function directly, but would be useful if we'll join them to other tables.

    This sounds pretty simple, powerful, and versatile.

    Additionally, in some cases (most notably the array case) it's easy to
estimate the statistics on the returned values because they're all in the
array already, so the mechanism could be extended to have a way of
returning a pseudo pg_stats for a Set Returning function.

    For instance, say you have a SRF which returns N random rows from a
table. It could have an estimator which would return a rowcount of N, and
a statistics estimator which would return the sats rows for the source
table, appropriately modified.

    This sounds harder to do.

    WHat do you think ?


















pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Functionscan estimates
Next
From: PFC
Date:
Subject: Re: Functionscan estimates