Home > mailing lists

Re: Bypassing cursors in postgres_fdw to enable parallel plans - Mailing list pgsql-hackers

From	KENAN YILMAZ
Subject	Re: Bypassing cursors in postgres_fdw to enable parallel plans
Date	November 25 17:24:14
Msg-id	CAEV8nunoatq-bGLo=mD=WNAFers+f4hqnxN-7gw-M=Cg9qOqqA@mail.gmail.com Whole thread Raw
In response to	Re: Bypassing cursors in postgres_fdw to enable parallel plans (Rafia Sabih <rafia.pghackers@gmail.com>)
List	pgsql-hackers

Tree view

Hello Hackers,

I have executed use_cursor = true/false with quick tests.

The test table 't1' has 20,000,000 rows (~2.8 GB in size).

The test query is `EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;`

The result sums are;

** `postgres_fdw.use_cursor = true` --> 20090.782 ms
** `postgres_fdw.use_cursor = false` --> 24103.994 ms

If you wish to run more advanced and complex test queries, feel free to do so.

All execution test scenarios are listed below;

--
-- restart with flush caches;
$ pg_ctl -D $PGDATA stop && sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches && rm $PGDATA/log/*.log && pg_ctl -D $PGDATA start
-- Data prep stage
DROP DATABASE IF EXISTS testdb;
CREATE DATABASE testdb;
\c testdb;
DROP TABLE IF EXISTS t1 CASCADE;
CREATE UNLOGGED TABLE t1 (a int, b int, c text, d timestamp);

-- Insert test datas
INSERT INTO t1 SELECT 10 + mod(i, 30), i, md5(i::text) || md5((i+1000000)::text) || md5((i+2000000)::text), '2025-01-01 10:00:00'::timestamp + (random() * 31536000) * INTERVAL '1 second' FROM generate_series(1, 20000000) i;

-- Table maintenance
ALTER TABLE t1 SET LOGGED;
VACUUM (ANALYZE, FULL) t1;

-- FDW prep stage
DROP EXTENSION IF EXISTS postgres_fdw CASCADE;
CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', port '5432', dbname 'testdb');
CREATE USER MAPPING FOR postgres SERVER foreign_server OPTIONS (user 'postgres');
CREATE FOREIGN TABLE t1fdw (a int, b int, c text, d timestamp) SERVER foreign_server OPTIONS (table_name 't1');

-- restart with flush caches are applied before all stage executions.

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=0.00..467803.85 rows=1 width=105) (actual time=23260.306..23261.591 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=363637
I/O Timings: shared read=64590.910
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1 width=105) (actual time=23242.279..23242.280 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
Buffers: shared read=363637
I/O Timings: shared read=64590.910
Planning:
Buffers: shared hit=54 read=14 dirtied=1
I/O Timings: shared read=23.281
Planning Time: 38.734 ms
Execution Time: 23269.677 ms
(15 rows)

Time: 23347.716 ms (00:23.348)

--
-- * use_cursor = true (Default) EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual time=20074.746..20074.796 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17 dirtied=1
I/O Timings: shared read=10.696
Planning Time: 43.852 ms
Execution Time: 20090.782 ms
(6 rows)

Time: 20169.081 ms (00:20.169)

--> From Log Files
[153045]: line=11 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153045]: line=12 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
[153045]: line=13 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: execute <unnamed>: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153045]: line=14 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_execute_message, postgres.c:2245
[153045]: line=15 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: FETCH 100 FROM c1
[153045]: line=16 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
[153044]: line=5 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG: 00000: duration: 20074.799 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual time=20074.746..20074.796 rows=0.00 loops=1)
[153044]: line=6 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[153045]: line=17 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: CLOSE c1
[153045]: line=18 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
[153045]: line=19 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: duration: 20057.543 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Seq Scan on t1 (cost=0.00..613637.45 rows=1 width=105) (actual time=20057.541..20057.541 rows=0.00 loops=1)
Filter: (a > 1000)
Rows Removed by Filter: 20000000
[153045]: line=20 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual time=24080.945..24080.956 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17
I/O Timings: shared read=30.074
Planning Time: 53.678 ms
Execution Time: 24103.994 ms
(6 rows)

Time: 24230.548 ms (00:24.231)

--> From Log Files
[153121]: line=11 sid=6923fc16.25621 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153121]: line=12 sid=6923fc16.25621 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
[153121]: line=13 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: execute <unnamed>: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153121]: line=14 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_execute_message, postgres.c:2245
[153113]: line=7 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG: 00000: duration: 24080.958 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual time=24080.945..24080.956 rows=0.00 loops=1)
[153113]: line=8 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[153121]: line=15 sid=6923fc16.25621 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
[153121]: line=16 sid=6923fc16.25621 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
[153121]: line=17 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: duration: 24059.372 ms plan:
Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Gather (cost=0.00..467803.85 rows=1 width=105) (actual time=24058.076..24059.367 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1 width=105) (actual time=24051.406..24051.407 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437

---

Kenan YILMAZ

Rafia Sabih <rafia.pghackers@gmail.com>, 25 Kas 2025 Sal, 17:01 tarihinde şunu yazdı:

On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> > I am back at this work with a rebased and revised patch. The new version is rebased and has a change in approach.
> > Whenever we are using non-cursor mode, for the first cursor we are always saving the tuples
> > in the tuplestore, this is because we do not have any means to know beforehand how many cursors are required for the query.
>
> This might have the advantage of being simpler, but it's definitely
> worse. If we're only fetching one result set, which will be common,
> we'll buffer the whole thing in a tuplestore where that could
> otherwise be avoided. Maybe it's still best to go with this approach;
> not sure.
>
> > And when we switch to the next query then we do not have a way to fetch the tuples for the previous query.
> > So, the tuples retrieved earlier for the first query were lost if not saved.
> > I would highly appreciate your time and feedback for this.
>
> My suggestions are to work on the following areas:
>
> 1. Automated testing. The patch has no regression tests, and won't get
> committed without those.
>
> 2. Manual testing. How does the performance with this new option
> compare to the existing method? The answer might be different for
> small result sets that fit in memory and large ones that spill to
> disk, and will certainly also depend on how productive parallel query
> can be on the remote side; but I think you want to do and post on this
> list some measurements showing the best and worst case for the patch.
>
> 3. Patch clean-up. There are plenty of typos and whitespace-only
> changes in the patch. It's best to clean those up. Running pgindent is
> a good idea too. Some places could use comments.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

Thank you Robert, for reviewing this patch. On going through the patch more, I realised this was not equipped to handle the cases when there are more than two active cursors. So to accommodate such a case, I now modified the new struct for saving the previous query to a list of such structs. Also, it turns out we need not to save the tuples in case this is an active cursor, so we only populate the associated tuplestore only when we need to create a new cursor when the old cursor is not completely done.

In this version I have added the regression tests as well. I wanted to test this patch for all the cases of postgres_fdw, the only way I could figure out how to do this was to test the select statements with the new GUC.

I also did some tests for performance. I used the contrib_regression database and populated the table "S1"."T1" with more tuples to understand the impact of patch on higher scale. I also used auto_explain to get the foreign plans.

contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG: duration: 61.336 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG: duration: 60.575 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT count(*) FROM "S 1"."T 1"
Aggregate (cost=21888.22..21888.23 rows=1 width=8)
-> Seq Scan on "T 1" (cost=0.00..19956.98 rows=772498 width=0)
count
--------
990821
(1 row)

Time: 62.728 ms
contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
Time: 1.515 ms
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:46.260 CET [44862] LOG: duration: 21.875 ms plan:
Query Text: SELECT count(*) FROM "S 1"."T 1"
Finalize Aggregate (cost=17255.64..17255.65 rows=1 width=8)
-> Gather (cost=17255.43..17255.64 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=16255.43..16255.44 rows=1 width=8)
-> Parallel Seq Scan on "T 1" (cost=0.00..15450.74 rows=321874 width=0)
2025-11-14 14:40:46.260 CET [44338] LOG: duration: 22.623 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
count
--------
990821
(1 row)

Time: 24.862 ms

So for this query, the advantage is coming from parallel query which was otherwise not possible in this scenario.
To study more performance with this patch, I found another interesting query and here are the results,

contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");2025-11-14 15:57:46.893 CET [1946]
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual time=112311.578..112804.516 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp read=12754 written=12754
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.039..48.808 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual time=112310.386..112310.387 rows=990821.00 loops=1)
Buckets: 262144 (originally 1024) Batches: 8 (originally 1) Memory Usage: 6408kB
Buffers: temp written=2537
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4) (actual time=0.728..112030.241 rows=990821.00 loops=1)
-> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000 width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1 width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
Planning:
Buffers: shared hit=5
Planning Time: 2.211 ms
Execution Time: 112825.428 ms
(15 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual time=261.416..354.520 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp written=2660
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.021..35.531 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual time=261.381..261.383 rows=100.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
Buffers: temp written=2660
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4) (actual time=255.563..261.356 rows=100.00 loops=1)
Buffers: temp written=2660
-> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000 width=4) (actual time=0.433..0.443 rows=100.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1 width=4) (actual time=2.609..2.609 rows=1.00 loops=100)
Buffers: temp written=2660
Planning:
Buffers: shared hit=5
Planning Time: 2.284 ms
Execution Time: 389.358 ms
(17 rows)

So even in the case without a parallel plan, it is performing significantly better. I investigated a bit more to find out why the query was so slow with the cursors,
and came to understand that it is repeatedly abandoning and recreating the cursor via the code path of postgresReScanForeignScan.
So, it looks like cursor management itself costs this much more time.

To understand the impact of the patch in case of tuples spilled to disk, I tried following example,
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual time=4537.437..4598.483 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual time=3748.488..4090.547 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual time=1818.521..1865.792 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000 width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual time=1929.955..1981.104 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000 width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
Planning Time: 0.479 ms
Execution Time: 4661.872 ms
(19 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual time=3376.385..3435.406 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual time=2565.517..2916.814 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual time=1249.517..1300.132 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000 width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual time=1315.990..1369.576 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000 width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
Planning Time: 0.491 ms
Execution Time: 3527.457 ms
(19 rows)

So it doesn't hurt in this case either. But that is because not the tuplestore which is added in this patch is spilling to disk here.
I am working on finding a case when there are two or more active cursors and then when storing the tuples of one cursor,
they are spilled onto the disk. That might give a picture of the worst case scenario of this patch.

Also, thanks to Kenan, a fellow hacker who finds this work interesting and offered to do some performance analysis for this patch,
maybe he can also post more results here.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

pgsql-hackers by date:

From: Maxim Orlov
Date: 25 November, 17:15:50
Subject: Re: Add 64-bit XIDs into PostgreSQL 15

From: Dilip Kumar
Date: 25 November, 17:34:50
Subject: Re: Proposal: Conflict log history table for Logical Replication

Re: Bypassing cursors in postgres_fdw to enable parallel plans - Mailing list pgsql-hackers

Previous

Next