Re: Functionscan estimates - Mailing list pgsql-performance
From | PFC |
---|---|
Subject | Re: Functionscan estimates |
Date | |
Msg-id | op.soyp090rth1vuj@localhost Whole thread Raw |
In response to | Re: Functionscan estimates (Josh Berkus <josh@agliodbs.com>) |
List | pgsql-performance |
> My solution would be a lot simpler, since we could simply populate > pg_proc.proestrows with "1000" by default if not changed by the DBA. In > an > even better world, we could tie it to a table, saying that, for example, > proestrows = my_table*0.02. What if the estimated row is a function of a parameter ? Say a function takes as a parameter : - a number to use in a LIMIT - it's a function to generate a certain number of values from a predetermined set (like, array -> set returning function) In all those cases it's no use to have just a fixed number. Id suggest two solutions : - The ideal solution which is impossible to do : The function tells the planner about its stats, looking at its parameters - A solution that would be possible to do pg_proc.proestrows is... the name of another function, defined by the user, which takes the exact same parameters as the set returning function we're talking about, and which returns estimates. For instance, in pseudo-sql : CREATE FUNCTION int_array_srf( INTEGER[] ) RETURNS SETOF INTEGER LANGUAGE plpgsql AS $$ BEGIN FOR _i IN 1..icount($1) RETURN NEXT $1[_i]; END END In the two cases above, this would give : CREATE FUNCTION array_srf_estimator( INTEGER[] ) RETURNS INTEGER LANGUAGE plpgsql AS $$ BEGIN RETURN icount( $1 ); END; ALTER FUNCTION array_srf SET ESTIMATOR array_srf_estimator; Another interesting case would be the famous "Top 5 by category" case where we use a SRF to emulate an index skip scan. Say we have a table Categories and a table Users, each User having columns "categories" and "score" and we want the N users with best score in each category : CREATE FUNCTION top_n_by_category( INTEGER ) RETURN SETOF users%ROWTYPE LANGUAGE plpgsql AS $$ DECLARE _cat_id INTEGER; _n ALIAS FOR $1; _user users%ROWTYPE; BEGIN FOR _cat_id IN SELECT category_id FROM categories DO FOR _user IN SELECT * FROM users WHERE category_id = _cat_id ORDER BY score DESC LIMIT _n DO RETURN NEXT _user; END END END CREATE FUNCTION top_n_by_category_estimator( INTEGER ) RETURN INTEGER LANGUAGE plpgsql AS $$ BEGIN RETURN $1 * (the estimated number of rows for the categories table taken from the table statistics); END; ALTER FUNCTION top_n_by_category SET ESTIMATOR top_n_by_category_estimator; Got it ? The array_srf case would be extremely useful as this type of function is generally used to join against other tables, and having estimates is useful for that. The top_n case would be useless if we're just returning the rows from the function directly, but would be useful if we'll join them to other tables. This sounds pretty simple, powerful, and versatile. Additionally, in some cases (most notably the array case) it's easy to estimate the statistics on the returned values because they're all in the array already, so the mechanism could be extended to have a way of returning a pseudo pg_stats for a Set Returning function. For instance, say you have a SRF which returns N random rows from a table. It could have an estimator which would return a rowcount of N, and a statistics estimator which would return the sats rows for the source table, appropriately modified. This sounds harder to do. WHat do you think ?
pgsql-performance by date: