Re: Append with naive multiplexing of FDWs - Mailing list pgsql-hackers

From Etsuro Fujita
Subject Re: Append with naive multiplexing of FDWs
Date
Msg-id CAPmGK14ems1oXwAMAxECM4NydioxrDOJ+pkt8tG71KedHh52gg@mail.gmail.com
Whole thread Raw
In response to Re: Append with naive multiplexing of FDWs  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Append with naive multiplexing of FDWs  (Etsuro Fujita <etsuro.fujita@gmail.com>)
List pgsql-hackers
On Thu, Dec 5, 2019 at 1:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Dec 5, 2019 at 4:26 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > There's my pending (somewhat stale) patch, which allows to run local
> > scans while waiting for remote servers.
> >
> > https://www.postgresql.org/message-id/20180515.202945.69332784.horiguchi.kyotaro@lab.ntt.co.jp

I think it’s great to execute local scans while waiting for the
results of remote scans, but looking at your patch (the 0002 patch of
your patch set in [1]), I’m not sure that the 0002 patch does it much
efficiently, because it modifies nodeAppend.c so that all the work is
done by a single process.  Rather than doing so, I’m wondering if it
would be better to modify Parallel Append so that some processes
execute remote scans and others execute local scans.  I’m not sure
that we need to have this improvement as well in the first cut of this
feature, though.

> After rereading some threads to remind myself what happened here...
> right, my little patch began life in March 2016[1] when I wanted a
> test case to test Andres's work on WaitEventSets, and your patch set
> started a couple of months later and is vastly more ambitious[2][3].
> It wants to escape from the volcano give-me-one-tuple-or-give-me-EOF
> model.  And I totally agree that there are lots of reason to want to
> do that (including yielding to other parts of the plan instead of
> waiting for I/O, locks and some parallelism primitives enabling new
> kinds of parallelism), and I'm hoping to help with some small pieces
> of that if I can.
>
> My patch set (rebased upthread) was extremely primitive, with no new
> planner concepts, and added only a very simple new executor node
> method: ExecReady().  Append used that to try to ask its children if
> they'd like some time to warm up.  By default, ExecReady() says "I
> don't know what you're talking about, go away", but FDWs can provide
> an implementation that says "yes, please call me again when this fd is
> ready" or "yes, I am ready, please call ExecProc() now".  It doesn't
> deal with anything more complicated than that, and in particular it
> doesn't work if there are extra planner nodes in between Append and
> the foreign scan.  (It also doesn't mix particularly well with
> parallelism, as mentioned.)
>
> The reason I reposted this unambitious work is because Stephen keeps
> asking me why we don't consider the stupidly simple thing that would
> help with simple foreign partition-based queries today, instead of
> waiting for someone to redesign the entire executor, because that's
> ... really hard.

Yeah, I think your patch is much simpler, compared to Horiguchi-san’s
patch set, which I think is a good thing, considering this would be
rather an interim solution until executor rewrite is done.  Here are a
few comments that I have for now:

* I know your patch is a POC one, but one concern about it (and
Horiguchi-san's patch set) is concurrent data fetches by multiple
foreign scan nodes using the same connection in the case of
postgres_fdw.  Here is an example causing an error:

create or replace function slow_data_ext(name text, secs float)
returns setof t as
$$
begin
 perform pg_sleep(secs);
 return query select name, generate_series(1, 100)::text as i;
end;
$$
language plpgsql;

create view t11 as select * from slow_data_ext('t11', 1.0);
create view t12 as select * from slow_data_ext('t12', 2.0);
create view t13 as select * from slow_data_ext('t13', 3.0);
create foreign table ft11 (a text, b text) server server1 options
(table_name 't11');
create foreign table ft12 (a text, b text) server server2 options
(table_name 't12');
create foreign table ft13 (a text, b text) server server3 options
(table_name 't13');
create table pt1 (a text, b text) partition by list (a);
alter table pt1 attach partition ft11 for values in ('t11');
alter table pt1 attach partition ft12 for values in ('t12');
alter table pt1 attach partition ft13 for values in ('t13');

create view t21 as select * from slow_data_ext('t21', 1.0);
create view t22 as select * from slow_data_ext('t22', 2.0);
create view t23 as select * from slow_data_ext('t23', 3.0);
create foreign table ft21 (a text, b text) server server1 options
(table_name 't21');
create foreign table ft22 (a text, b text) server server2 options
(table_name 't22');
create foreign table ft23 (a text, b text) server server3 options
(table_name 't23');
create table pt2 (a text, b text) partition by list (a);
alter table pt2 attach partition ft21 for values in ('t21');
alter table pt2 attach partition ft22 for values in ('t22');
alter table pt2 attach partition ft23 for values in ('t23');

explain verbose select * from pt1, pt2 where pt2.a = 't22' or pt2.a = 't23';
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=200.00..1303.80 rows=50220 width=128)
   Output: pt1.a, pt1.b, pt2.a, pt2.b
   ->  Append  (cost=100.00..427.65 rows=2790 width=64)
         ->  Foreign Scan on public.ft11 pt1_1  (cost=100.00..137.90
rows=930 width=64)
               Output: pt1_1.a, pt1_1.b
               Remote SQL: SELECT a, b FROM public.t11
         ->  Foreign Scan on public.ft12 pt1_2  (cost=100.00..137.90
rows=930 width=64)
               Output: pt1_2.a, pt1_2.b
               Remote SQL: SELECT a, b FROM public.t12
         ->  Foreign Scan on public.ft13 pt1_3  (cost=100.00..137.90
rows=930 width=64)
               Output: pt1_3.a, pt1_3.b
               Remote SQL: SELECT a, b FROM public.t13
   ->  Materialize  (cost=100.00..248.44 rows=18 width=64)
         Output: pt2.a, pt2.b
         ->  Append  (cost=100.00..248.35 rows=18 width=64)
               ->  Foreign Scan on public.ft22 pt2_1
(cost=100.00..124.13 rows=9 width=64)
                     Output: pt2_1.a, pt2_1.b
                     Remote SQL: SELECT a, b FROM public.t22 WHERE
(((a = 't22'::text) OR (a = 't23'::text)))
               ->  Foreign Scan on public.ft23 pt2_2
(cost=100.00..124.13 rows=9 width=64)
                     Output: pt2_2.a, pt2_2.b
                     Remote SQL: SELECT a, b FROM public.t23 WHERE
(((a = 't22'::text) OR (a = 't23'::text)))
(21 rows)

select * from pt1, pt2 where pt2.a = 't22' or pt2.a = 't23';
ERROR:  another command is already in progress
CONTEXT:  remote SQL command: DECLARE c4 CURSOR FOR
SELECT a, b FROM public.t22 WHERE (((a = 't22'::text) OR (a = 't23'::text)))

I think the cause of this error is that an asynchronous data fetch for
ft22 is blocked by that for ft12 that is in progress.
(Horiguchi-san’s patch set doesn't work for this query either, causing
the same error.  Though, it looks like he intended to handle cases
like this by a queuing system added to postgres_fdw to process such
concurrent data fetches.)  I think a simple solution for this issue
would be to just disable asynchrony optimization for such cases.  I
think we could do so by tracking foreign scan nodes across the entire
final plan tree that use the same connection at plan or execution
time.

* Another one is “no new planner concepts”.  I think it leads to this:

@@ -239,9 +242,207 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
    /* For parallel query, this will be overridden later. */
    appendstate->choose_next_subplan = choose_next_subplan_locally;

+   /*
+    * Initially we consider all subplans to be potentially
asynchronous.
+    */
+   appendstate->asyncplans = (PlanState **) palloc(nplans *
sizeof(PlanState *));
+   appendstate->asyncfds = (int *) palloc0(nplans * sizeof(int));
+   appendstate->nasyncplans = nplans;
+   memcpy(appendstate->asyncplans, appendstate->appendplans, nplans *
sizeof(PlanState *));
+   appendstate->lastreadyplan = 0;

I’m not sure that this would cause performance degradation in the case
of an plain Append that involves no FDWs supporting asynchrony
optimization, but I think it would be better to determine subplans
with that optimization at plan time and save cycles at execution time
as done in Horiguchi-san’s patch set.  (I’m not sure that we need a
new ExecAppend function proposed there, though.)  Also, consider this
ordered Append example:

create table t31 (a int check (a >= 10 and a < 20), b text);
create table t32 (a int check (a >= 20 and a < 30), b text);
create table t33 (a int check (a >= 30 and a < 40), b text);
create foreign table ft31 (a int check (a >= 10 and a < 20), b text)
server server1 options (table_name 't31');
create foreign table ft32 (a int check (a >= 20 and a < 30), b text)
server server2 options (table_name 't32');
create foreign table ft33 (a int check (a >= 30 and a < 40), b text)
server server3 options (table_name 't33');
create table pt3 (a int, b text) partition by range (a);
alter table pt3 attach partition ft31 for values from (10) to (20);
alter table pt3 attach partition ft32 for values from (20) to (30);
alter table pt3 attach partition ft33 for values from (30) to (40);

explain verbose select * from pt3 order by a;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Append  (cost=300.00..487.52 rows=4095 width=36)
   ->  Foreign Scan on public.ft31 pt3_1  (cost=100.00..155.68
rows=1365 width=36)
         Output: pt3_1.a, pt3_1.b
         Remote SQL: SELECT a, b FROM public.t31 ORDER BY a ASC NULLS LAST
   ->  Foreign Scan on public.ft32 pt3_2  (cost=100.00..155.68
rows=1365 width=36)
         Output: pt3_2.a, pt3_2.b
         Remote SQL: SELECT a, b FROM public.t32 ORDER BY a ASC NULLS LAST
   ->  Foreign Scan on public.ft33 pt3_3  (cost=100.00..155.68
rows=1365 width=36)
         Output: pt3_3.a, pt3_3.b
         Remote SQL: SELECT a, b FROM public.t33 ORDER BY a ASC NULLS LAST
(10 rows)

For this query, we can’t apply asynchrony optimization.  To disable it
for such cases I think it would be better to do something at plan time
as well as done in his patch set.

I haven’t finished reviewing your patch, but before doing so, I’ll
review Horiguchi-san's patch set in more detail for further
comparison.  Attached is a rebased version of your patch, in which I
added the same changes to the postgres_fdw regression tests as
Horiguchi-san so that the tests run successfully.

Thank you for working on this, Thomas and Horiguchi-san!  Sorry for the delay.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/20200820.163608.1893015081639298019.horikyota.ntt%40gmail.com

Attachment

pgsql-hackers by date:

Previous
From: Etsuro Fujita
Date:
Subject: Re: Append with naive multiplexing of FDWs
Next
From: "Andrey V. Lepikhov"
Date:
Subject: Re: Ideas about a better API for postgres_fdw remote estimates