Re: Ideas about a better API for postgres_fdw remote estimates - Mailing list pgsql-hackers
From: Ashutosh Bapat
Subject: Re: Ideas about a better API for postgres_fdw remote estimates
Msg-id: CAG-ACPVQCoLaUEGrkKUV6Jd=AGVScab1MXMCwxGzOYQAGE22zQ@mail.gmail.com
In response to: Re: Ideas about a better API for postgres_fdw remote estimates (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Ideas about a better API for postgres_fdw remote estimates
List: pgsql-hackers
On Fri, 4 Sep 2020 at 20:27, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> 4) I wonder if we actually want/need to simply output pg_statistic data
> verbatim like this. Is postgres_fdw actually going to benefit from it? I
> kinda doubt that, and my assumption was that we'd return only a small
> subset of the data, needed by get_remote_estimate.
>
> This has a couple of issues. Firstly, it requires the knowledge of what
> the stakind constants in pg_statistic mean and how to interpret them - but
> OK, it's true that does not change very often (or at all). Secondly, it
> entirely ignores extended statistics - OK, we might extract those too,
> but it's going to be much more complex. And finally it entirely ignores
> costing on the remote node. Surely we can't just apply local
> random_page_cost or whatever, because those may be entirely different.
> And we don't know if the remote is going to use an index etc.
>
> So is extracting data from pg_statistic the right approach?
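For context, the per-slot layout in question: each pg_statistic row carries up to five (stakind, stanumbers, stavalues) slots, where stakindN identifies the kind of statistic the slot holds (1 = MCV, 2 = histogram, 3 = correlation, and so on). A quick way to eyeball them - just a sketch, where 'some_table' is a placeholder and reading pg_statistic needs superuser:

    -- Inspect the first two statistic slots for one relation.
    -- stakindN says what the slot holds (1 = MCV, 2 = histogram,
    -- 3 = correlation, ...); stanumbersN/stavaluesN hold the data.
    SELECT staattnum,
           stakind1, stanumbers1, stavalues1,
           stakind2, stanumbers2, stavalues2
    FROM pg_statistic
    WHERE starelid = 'some_table'::regclass;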
There are two different problems, which ultimately might converge.
1. If use_remote_estimate = false, or more generally if querying the foreign server for costs while building paths is impractical, we want to cost paths using local estimates. For that purpose we keep some statistics locally, and the user is expected to refresh them periodically by running ANALYZE on the foreign table. This patch is about (a) doing that efficiently, without fetching every row from the foreign server, and (b) doing it automatically through autovacuum, without the user having to fire ANALYZE. I think this also answers your question about vacuum_rel() above.
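To make mode 1 concrete, this is the existing knob and the manual refresh step the patch aims to automate (a sketch; 'remote_srv' and 'foreign_tab' are placeholder names):

    -- Cost foreign paths from locally kept statistics rather than
    -- asking the remote server during planning.
    ALTER SERVER remote_srv OPTIONS (ADD use_remote_estimate 'false');

    -- Today the user refreshes the local statistics by hand, which
    -- (without this patch) fetches every row from the remote table
    -- to build the sample.
    ANALYZE foreign_tab;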
2. How to efficiently extract costs from an EXPLAIN plan when use_remote_estimate is true. That's the subject of a nearby thread, and I think that's the problem you are referring to here; hence your next point.
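For reference, this is roughly the round trip that use_remote_estimate = true triggers for each candidate remote query during planning; the costs and row count are parsed out of the first line of the remote plan ('remote_tab' and the output line are illustrative):

    -- Sent to the remote server by postgres_fdw while planning:
    EXPLAIN SELECT col1 FROM remote_tab WHERE col2 = 42;
    -- Illustrative remote output, from which startup cost, total
    -- cost, rows and width are parsed:
    --   Seq Scan on remote_tab  (cost=0.00..35.50 rows=10 width=4)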
Using EXPLAIN to get costs from the foreign server isn't efficient. It increases planning time a lot; sometimes planning time exceeds execution time. If usage of foreign tables becomes more and more common, that isn't ideal. I think we should move towards a model in which the optimizer can decide whether a subtree involving a foreign server should be evaluated locally or on the foreign server without the foreign server's help. One way to do that (I am not saying it is the only or the best way) is to estimate the cost of the foreign query locally, based on information about the foreign server and foreign table that is available locally. This might mean fetching that information from the foreign server (the indexes on the foreign table, the values of various cost parameters, and so on), caching it locally, and reusing it many times. Though this approach doesn't solve all of those problems, it is one step forward, and it also makes the current scenario more efficient.
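One hypothetical shape for that caching: pull the remote cost parameters once per connection (or at ANALYZE time) and reuse them when costing remote paths locally. This is only a sketch of what could be fetched; postgres_fdw has no such facility today:

    -- Hypothetical one-time fetch of remote cost GUCs, to be cached
    -- locally and reused instead of per-query remote EXPLAINs:
    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('seq_page_cost', 'random_page_cost',
                   'cpu_tuple_cost', 'cpu_index_tuple_cost',
                   'cpu_operator_cost');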
I agree that the patch needs some work though, especially the code dealing with serialization and deserialization of statistics.
Best Wishes,
Ashutosh