Thread: (Mis)using the PostgreSQL planner to get estimated row counts

(Mis)using the PostgreSQL planner to get estimated row counts

From
Nick Johnson
Date:
I'm trying to write a PostgreSQL extension to estimate the number of  
rows returned by a SELECT statement. Ideally, it'd be invoked along  
the lines of "SELECT estimate_row_count('SELECT foo FROM bar INNER  
JOIN baz ON (id) WHERE a=b');", and would be useful for estimating  
the number of pages in a search result, for example.

I've got as far as figuring out how to get an estimated row count  
from a Node object for a query (mostly by looking at the code for  
EXPLAIN), but there I'm a bit mired - I can't figure out how to take  
a string representing an SQL statement and parse it into a Node  
object I can feed to the planner.

So, a couple of questions:
1) Can anyone suggest where I should look in the source for the  
requisite functions for parsing an SQL string into a Node I can feed  
to the planner so I can get an estimated row count?
2) Is this something that's reasonable to do in the first place? I'm  
not sure if extensions are supposed to be allowed to delve into  
PostgreSQL's internals this much.
3) Are there any other gotchas around this area? For example, there's  
a lot going on with Snapshots and ActiveSnapshot that I really don't  
have any idea about.

-Nick Johnson




Re: (Mis)using the PostgreSQL planner to get estimated row counts

From
Tom Lane
Date:
Nick Johnson <arachnid@notdot.net> writes:
> I'm trying to write a PostgreSQL extension to estimate the number of  
> rows returned by a SELECT statement.

Instead of fooling around at the C level, why don't you just do an
EXPLAIN and parse out the first row of the result?  For instance

regression=# create function estimate_row_count(text) returns text as $$
regression$# declare x record;
regression$# begin
regression$#   for x in execute 'EXPLAIN ' || $1 loop
regression$#     return substring(x."QUERY PLAN" from 'rows=([0-9]+) width=');
regression$#   end loop;
regression$# end$$ language plpgsql strict;
CREATE FUNCTION
regression=# select estimate_row_count('select * from tenk1');estimate_row_count
--------------------10000
(1 row)

Of course this is subject to future breakage due to changes in the
output textual format, etc etc, but it's surely less fragile than
anything written in C will be.

Depending on what you want the results for, it might be best to ignore
any top-level LIMIT node.
        regards, tom lane