We have successfully reproduced this issue and gained a clearer understanding of its root cause. The application uses a cursor to fetch partial results in batches, with a delay between consecutive fetch operations. When the interval between two batches exceeds the tcp_user_timeout threshold, the connection is terminated unexpectedly.
In my analysis, during cursor-based queries, applications typically retrieve results in partial batches. If the number of rows fetched in a single batch is smaller than the number of rows scanned from the local table, the executor is unable to proceed with fetching rows from the foreign table. While we have attempted workarounds such as adjusting the fetch size, tuning TCP buffer parameters, and modifying the tcp_user_timeout value, these measures only mitigate the symptoms without addressing the underlying problem.
To achieve a fundamental resolution, I propose two potential solutions:
- Alternate Row Fetching: Modify the executor to alternately retrieve rows from the local table and the foreign table, ensuring balanced data flow between the two data sources.
- Asynchronous Tuple Storage: Implement a tuple storage mechanism to asynchronously cache results from the foreign table. This would allow the executor to fetch foreign table results into the storage buffer independently, preventing TCP window exhaustion and decoupling the dependency between local and foreign data retrieval.