Re: [HACKERS] postgres_fdw cost estimation defaults and documentation - Mailing list pgsql-hackers
From: Ashutosh Bapat
Subject: Re: [HACKERS] postgres_fdw cost estimation defaults and documentation
Msg-id: CAFjFpRcr=dBzE_tL_ekyEGLa+=8MNX2STHnH_qxS9mGEpGg_zg@mail.gmail.com
In response to: [HACKERS] postgres_fdw cost estimation defaults and documentation (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers
On Tue, Jun 6, 2017 at 12:07 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> The default value for fdw_tuple_cost is 0.01, which seems way too low. If I
> set up a loop-back foreign server with a large fetch_size, then tests like:
>
> select * from pgbench_accounts except select * from
> loopback.pgbench_accounts
>
> vs
>
> select * from pgbench_accounts except select * from pgbench_accounts
>
> indicate that 0.1 is about the lowest value for fdw_tuple_cost that could
> make sense, and a reasonable default would probably be 0.25. Yes, it is
> only a default, but the default should make sense for at least some
> situation, and I can't imagine any situation in which 0.01 makes sense.

The per-tuple cost of transferring 10 rows is going to differ from that of
transferring 1M rows, and it will also differ as row widths vary. It depends
on the packet sizes and on how many rows a packet can contain, so a single
value is insufficient to represent that cost. Until the FDW gets smarter and
adds some logic to amortize the cost across rows, the current value looks
better. 0.01 probably keeps our regression tests stable, and users will
mostly have to tune those values for each server based on the type of network
and topology anyway. That said, I am not objecting to changing it as long as
the regression tests stay stable across multiple platforms and machines.
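For anyone who wants to reproduce the experiment, a minimal sketch of such a
loopback setup might look like the following (the server name, schema name,
dbname, and fetch_size value are illustrative, not taken from Jeff's actual
test):

    CREATE EXTENSION postgres_fdw;

    -- Foreign server pointing back at the local cluster; fetch_size controls
    -- how many rows postgres_fdw retrieves per round trip.
    CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (dbname 'pgbench', fetch_size '10000');
    CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;

    -- Expose the remote table under a "loopback" schema, matching the
    -- queries quoted above.
    CREATE SCHEMA loopback;
    IMPORT FOREIGN SCHEMA public LIMIT TO (pgbench_accounts)
        FROM SERVER loopback INTO loopback;

    -- Try a candidate per-tuple cost (use SET instead of ADD if the option
    -- has already been set on the server).
    ALTER SERVER loopback OPTIONS (ADD fdw_tuple_cost '0.25');

    EXPLAIN ANALYZE
    SELECT * FROM pgbench_accounts
    EXCEPT SELECT * FROM loopback.pgbench_accounts;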
> In the documentation for fdw_startup_cost, it says "This represents the
> additional overhead of establishing a connection, parsing and planning the
> query on the remote side, etc.". I think that "establishing a connection"
> should be stricken. Either you need a connection or you don't; there is
> nothing the optimizer can do about this. And if you do need one, you only
> establish it once (at most), not once per query sent to the remote side. I
> think the implementation correctly doesn't try to account for the overhead
> of establishing a connection, so the docs should remove that claim.

Yes. Instead of "establishing a connection", I think we should mention the
protocol overhead of running a query.

> In regards to use_remote_estimate, the documentation says "Running ANALYZE
> on the foreign table is the way to update the local statistics; this will
> perform a scan of the remote table and then calculate and store statistics
> just as though the table were local. Keeping local statistics can be a
> useful way to reduce per-query planning overhead for a remote table — but
> if the remote table is frequently updated, the local statistics will soon
> be obsolete." This makes it sound like local stats are basically equivalent
> to use_remote_estimate, other than the staleness issue. But they are far
> from equivalent. use_remote_estimate has implicit knowledge of the indexes
> on the foreign server (implicit via the reduced cost estimates derived from
> the foreign side for parameterized queries), whereas local stats of foreign
> tables just assume there are no indexes for planning purposes. Perhaps add
> something like "Also, local statistics do not contain information on the
> available indexes on the remote side, while use_remote_estimate does take
> these into account"?

That's not the impression I got when I read the complete paragraph at [1]:

--
When use_remote_estimate is true, postgres_fdw obtains row count and cost
estimates from the remote server and then adds fdw_startup_cost and
fdw_tuple_cost to the cost estimates. When use_remote_estimate is false,
postgres_fdw performs local row count and cost estimation and then adds
fdw_startup_cost and fdw_tuple_cost to the cost estimates. This local
estimation is unlikely to be very accurate unless local copies of the remote
table's statistics are available. Running ANALYZE on the foreign table is the
way to update the local statistics; this will perform a scan of the remote
table and then calculate and store statistics just as though the table were
local. Keeping local statistics can be a useful way to reduce per-query
planning overhead for a remote table — but if the remote table is frequently
updated, the local statistics will soon be obsolete.
--

The paragraph you quoted should be read in the context of "When
use_remote_estimate is false". It just says what happens when
use_remote_estimate is turned off; it doesn't imply that local statistics are
a substitute for 'use_remote_estimate = true'.
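To make the distinction concrete, both behaviors are driven by postgres_fdw
options; a rough sketch, reusing the illustrative loopback names from the
setup above:

    -- Ask the remote server to plan and estimate; this implicitly reflects
    -- remote indexes through the costs the remote planner returns.
    ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'true');

    -- With use_remote_estimate off, refresh the local statistics instead.
    -- This scans the remote table but records nothing about its indexes.
    ANALYZE loopback.pgbench_accounts;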
[1] https://www.postgresql.org/docs/10/static/postgres-fdw.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company