5.4. Distributed Query Diagnostics #

Shardman enhances the EXPLAIN command so that it can provide additional information about a query if it is distributed. The work with the distributed tables is based on the plan nodes with the ForeignScan type. A query to each remote partition is determined by a single plan node of this type, with Shardman submitting additional information to the EXPLAIN blocks with the node description.

When executing a distributed query, the part of the plan (a subtree) that relates to a specific remote partition is serialized into an SQL statement. This process is known as deparsing. Then, this statement is sent to a remote server. The result of this query is the output of a ForeignScan node. It is used to gather the final results of the distributed query execution.

When the VERBOSE option of the EXPLAIN command is set to on, the Remote SQL field of the ForeignScan node block shows the statement sent to the remote server. Also, the Server field indicates the name of the server as it was specified during the cluster configuration and as it is displayed in pg_foreign_server, along with the transport method used to send this statement. The transport field can take two values: silk for the the enhanced interconnect Shardman mechanism, or libpq for sending via the standard PostgreSQL protocol.

5.4.1. Displaying Plans from the Remote Server #

To see the execution plan that will be used on the remote server under the EXPLAIN block of the ForeignScan node, use the postgres_fdw.foreign_explain configuration parameter. The possible values are: none to exclude the EXPLAIN output from the remote servers, full to include the EXPLAIN output from the remote servers, collapsed to include the EXPLAIN output only for the first ForeignScan node under its Append/MergeAppend.

In production, it is recommended to disable this parameter (set it to none) or set it to collapsed, because obtaining any EXPLAIN information results in an additional implicit request to the server. Moreover, this request is executed in a synchronous mode, meaning the overall EXPLAIN output is built only once all the servers are sequentially queried. It can be a costly operationin case of a table with a large number of partitions.

Note that in case of the internal request for obtaining the EXPLAIN blocks for a remote plan, certain parameters are forcibly disabled, regardless of the parameters specified by a user when requesting EXPLAIN from the coordinator: ANALYZE OFF, TIMING OFF, SUMMARY OFF, SETTINGS OFF, NETWORK OFF. In this case, the EXPLAIN block of a remote plan will lack the corresponding metrics. Other EXPLAIN parameters (FORMAT, VERBOSE, COSTS, BUFFERS, WAL) are inherited from the coordinator.

If the subplan deparsing forms a statement that includes parameters (in the statement using symbols $1, $2, etc.), such a statement generally cannot be sent to the remote server to obtain EXPLAIN results. Therefore, the ForeignExplain blocks are not formed for the SQL statements with parameters.

5.4.2. Network Metrics and Latency #

Setting the NETWORK option of the EXPLAIN command to on shows the network operation metrics for the plan nodes, including individual ForeignScan nodes and general nodes Append or MergeAppend.

For each plan node, the FDW bytes, sent, and received parameters are displayed for the outgoing and incoming traffic when the node is executed (regardless of the transport type). Note that these metrics are only output when the ANALYZE option of the EXPLAIN command is set to on.

When the track_fdw_wait_timing configuration parameter is enabled, the wait_time metric is also output. This metric summarizes all stages of the plan node execution, starting from the time the request is sent to the remote server, including the time spent on the execution itself and all the time until the complete set of results for that plan node is received.

Note that the ForeignScan node can operate in both synchronous and asynchronous modes. For the asynchronous execution, the node's execution function sends a request to the remote server and completes its execution without waiting for the result. The result is considered and processed later, upon receipt. In this scenario, the wait_time metric may not accurately reflect the actual execution time.

5.4.3. Query Tracing for Silk Transport #

For the Silk transport, there is an option to output the extended debug information about tracing of a query passing from the coordinator to the remote server and back, including the results from the remote server. This information is only available if the ANALYZE option of the EXPLAIN command is set to on, and the shardman.silk_tracepoints configuration parameter is enabled.

When these parameters are enabled, each message transferred through the Silk transport (sending the SQL query, delivering it to the recipient, executing the query, and returning the execution result) is accompanied by an array of the timestamps measured at certain points in the pipeline. Once the query is executed, this information is displayed in the EXPLAIN block as rows starting with the word Trace. Each metric represents the difference between the timestamps at different points, in milliseconds:

Table 5.1. Query Tracing for Silk Transport Metrics

IntervalDescription
bk shm->mp1 (qry)The time taken to transfer an SQL query from the coordinator to its multiplexer via the shared memory.
mp1 shm->net (qry)The time between receiving a query within the multiplexer from the shared memory and transferring it over the network.
net (qry)The time spent by an SQL query to transfer over the network between the multiplexers.
mp2 recv->shm (qry)The time between receiving an SQL query from the network and placing it in the queue in the shared memory on a remote multiplexer.
wk exec (1st tup)The time spent to execute a query in Silkworm until the first row of the result is received.
wk exec (all tups)The time spent to execute a query on Silkworm until the complete result is received.
wk->shm (1st tup)The time taken to place the first row of the result into the Silkworm queue.
wk->shm (last tup)The time taken to place the last row of the result into the Silkworm queue.
mp2 shm->net (1st tup)The time between reading the first row of the result from the queue by the remote multiplexer and transferring it over the network.
net (1st tup)The time spent to transfer the first row of the result over the network between the multiplexers.
mp1 recv->shm (1st tup)The time between receiving the first row of the result from the network and placing it in the queue by the local multiplexer.
mp1 shm->bk (1st tup)The time spent to retrieve the first row of the result from the queue by the coordinator.
mp2 shm->net (last tup)The time between reading of the last row of the result from the queue by the remote multiplexer and transferring it over the network.
net (last tup)The time spent to transfer the last row of the result over the network between the multiplexers.
mp1 recv->shm (last tup)The time between receiving the last row of the result from the network and placing it in the queue by the local multiplexer.
mp1 shm->bk (last tup)The time taken by the coordinator to retrieve the last row o the result from the queue.
END-TO-ENDThe total time from sending the query to receiving the last row of the result. This approximately corresponds to the wait_time.

For the metrics net (qry), net (1st tup), and net (last tup), the interval value is calculated as the difference between timestamps on different servers. Therefore, negative values may appear in these lines. This difference includes both time spent on a message transfer and the clock drift (positive or negative) between these servers. Thus, even with a slight drift, the values will be negative if its absolute value exceeds the duration of network transfer. Although it is not a bug, you should pay close attention to whether the cluster clocks are synchronized. For more information, see Section 5.3.