Re: [HACKERS] postgres_fdw cost estimation defaults and documentation - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: [HACKERS] postgres_fdw cost estimation defaults and documentation
Date
Msg-id CAFjFpRcr=dBzE_tL_ekyEGLa+=8MNX2STHnH_qxS9mGEpGg_zg@mail.gmail.com
Whole thread Raw
In response to [HACKERS] postgres_fdw cost estimation defaults and documentation  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Tue, Jun 6, 2017 at 12:07 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> The default value for fdw_tuple_cost is 0.01, which seems way too low.  If I
> set up a loop-back foreign server with a large fetch_size, then tests like:
>
> select * from pgbench_accounts except select * from
> loopback.pgbench_accounts
>
> vs
>
> select * from pgbench_accounts except select * from pgbench_accounts
>
> indicate that 0.1 is about the lowest value for fdw_tuple_cost that could
> make sense, and a reasonable default would probably be 0.25.  Yes, it is
> only a default, but the default should make sense for at least some
> situation, and I can't imagine any situation in which 0.01 makes sense

The per tuple cost to transfer 10 rows and 1M rows is going to be
different and so it's going to different when the widths of rows vary.
It depends upon the package sizes and how many rows a packet can
contains. So having a single value to represent that cost is
insufficient. Till the time FDW gets smarter and adds some logic to
amortize the cost across rows, this looks better. 0.01 probably keeps
our regressions stable and mostly the users will have to change those
values for each server based on the type of network and topology
anyway. Said, that I am not objecting to changing it as long as
regression tests are stable across multiple platforms and machines.

>
> In the documentation for fdw_startup_cost, it says "This represents the
> additional overhead of establishing a connection, parsing and planning the
> query on the remote side, etc.".  I think that "establishing a connection"
> should be stricken. Either you need a connection or you don't, there is
> nothing the optimizer can do about this.  And if do need one, you only
> establish one once (at most), not once per query sent to the remote side.  I
> think the implementation correctly doesn't try to account for the overhead
> of establishing a connection, so the docs should remove that claim.

Yes. Instead of "establishing a connection", I think we should mention
the protocol overhead to run a query.

>
> In regards to use_remote_estimate, the documentation says "Running ANALYZE
> on the foreign table is the way to update the local statistics; this will
> perform a scan of the remote table and then calculate and store statistics
> just as though the table were local. Keeping local statistics can be a
> useful way to reduce per-query planning overhead for a remote table — but if
> the remote table is frequently updated, the local statistics will soon be
> obsolete."  This makes it send like local stats is basically equivalent to
> use_remote_estimate, other than the staleness issue.  But they are far from
> equivalent.  use_remote_estimate has implicit knowledge of the indexes on
> the foreign server (implicit via the reduced cost estimates derived from the
> foreign side for parameterized queries), whereas local stats of foreign
> tables just assumes there are no indexes for planning purposes. Perhaps
> adding something like "Also, local statistics do not contain information on
> the available indexes on the remote side, while use_remote_estimate does
> take these into account"?

That's not the impression I got when I read the complete paragraph at [1]
--
When use_remote_estimate is true, postgres_fdw obtains row count and
cost estimates from the remote server and then adds fdw_startup_cost
and fdw_tuple_cost to the cost estimates. When use_remote_estimate is
false, postgres_fdw performs local row count and cost estimation and
then adds fdw_startup_cost and fdw_tuple_cost to the cost estimates.
This local estimation is unlikely to be very accurate unless local
copies of the remote table's statistics are available. Running ANALYZE
on the foreign table is the way to update the local statistics; this
will perform a scan of the remote table and then calculate and store
statistics just as though the table were local. Keeping local
statistics can be a useful way to reduce per-query planning overhead
for a remote table — but if the remote table is frequently updated,
the local statistics will soon be obsolete.
--

The paragraph you quoted should be read in the context of "When
use_remote_estimate is false, ". It just says what happens when
use_remote_estimate is turned off. It doesn't imply that local
statistics is substitute for 'use_remote_estimate = true'.

[1] https://www.postgresql.org/docs/10/static/postgres-fdw.html
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: [HACKERS] Challenges preventing us moving to 64 bit transactionid (XID)?
Next
From: Michael Paquier
Date:
Subject: Re: [JDBC] [HACKERS] Channel binding support for SCRAM-SHA-256