Hello,
I've taken part in some interesting research involving an estimation of the execution cost of a query by properly updating the statistics stored in pg_class and pg_statistic. The idea is to perform range partitioning of tables in a schema and then update the statistics in the two system catalogs mentioned above without transferring any data into the new schema. This provides fast algorithm execution times since we use some global optimization methods.
Using EXPLAIN we have successfully managed to get very similar execution plans and costs by the planner as if there were data in the newly created partitions. This method works just fine as long as the queries to be estimated consist of a join of no more than 2 tables.
When we try to estimate a query's cost that contains a join between 3 or more tables we get huge costs and wrong plans (in the rank of millions of cost units). When data is actually loaded into all of the partitions, the cost does not exceed a few thousand cost units.
The strategy to determine the partitions that the query should be executed against is using a simple interval tree to find overlapping intervals of the range values. For each partitioned table, a new master table is created and statistics are set to a "zero" value, meaning that the master (parent) table contains no data, while the statistics of the child tables that inherit the masters are updated properly.
We use the Star Schema Benchmark (a modification of TPC-H) and assume uniformity of data.
The question: Is the following list of updated statistics enough to fool the planner into generating accurate execution plans as if there were data in those partitions?
I've been searching around for a few weeks but I'm getting nowhere. Here's what we update:
- pg_statistic
- stawidth
- staop
- stanumbersN
- stakindN
- stavaluesN
- pg_class
- relfilenode (we create an empty file and fool the file system about its size, so the planner actually sees that there is data physically stored on disk, although the relation file is empty)
- reltuples
- relpages
This probably is a both tough and demanding question, but I guess this is the right place to ask.
Best regards,
Nino Arsov