Re: Append with naive multiplexing of FDWs - Mailing list pgsql-hackers

From Ahsan Hadi
Subject Re: Append with naive multiplexing of FDWs
Date
Msg-id CA+9bhCK7chd0qx+mny+U9xaOs2FDNJ7RaxG4=9rpgT6oAKBgWA@mail.gmail.com
In response to Re: Append with naive multiplexing of FDWs  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Append with naive multiplexing of FDWs  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
Hi Hackers,

Sharing the email below from Movead Li. I believe he wanted to share his benchmarking results as a response to this thread, but it started a new thread. Here it is...

"
Hello

I have tested the patch with a partition table that has several foreign
partitions living on separate data nodes. The initial test was done with
a partition table having 3 foreign partitions, run with a variety of
scale factors. The second test used a fixed amount of data per data node,
but the number of data nodes was increased incrementally to see the
performance impact as more nodes are added to the cluster. The third test
is similar to the initial test but with much larger data and 4 nodes.

The results summary is given below and the test script is attached:

Test ENV
Parent node: 2 cores, 8 GB
Child nodes: 2 cores, 4 GB


Test one:

1.1 The partition structure is as below:

 [ ptf: (a int, b int, c varchar) ]
           (Parent node)
        |        |        |
    [ptf1]    [ptf2]    [ptf3]
    (Node1)   (Node2)   (Node3)

The table data is partitioned across the nodes. The test uses a simple
select query and a count aggregate, as shown below. Each result is an
average over multiple executions of the query, to get reliable and
consistent results.

①select * from ptf where b = 100;
②select count(*) from ptf;
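
As a minimal illustrative sketch (not taken from the attached test script;
server names, the partitioning key, and bounds are assumptions), such a
setup could look like this with postgres_fdw:

    -- On the parent node: one foreign server per data node.
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER node1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'node1.example', dbname 'postgres');
    CREATE USER MAPPING FOR CURRENT_USER SERVER node1;
    -- node2 / node3 are defined the same way.

    -- The parent is a regular partitioned table; each partition is a
    -- foreign table living on its own data node.
    CREATE TABLE ptf (a int, b int, c varchar) PARTITION BY RANGE (a);

    CREATE FOREIGN TABLE ptf1 PARTITION OF ptf
        FOR VALUES FROM (0) TO (10000000)
        SERVER node1 OPTIONS (table_name 'ptf1');
    -- ptf2 / ptf3 follow the same pattern against node2 / node3.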

1.2. Test Results

 For ① result:
     scale per node    master    patched    performance
     2G                7s        2s         350%
     5G                173s      63s        275%
     10G               462s      156s       296%
     20G               968s      327s       296%
     30G               1472s     494s       297%

 For ② result:
     scale per node    master    patched    performance
     2G                1079s     291s       370%
     5G                2688s     741s       362%
     10G               4473s     1493s      299%

Testing the aggregate takes too long, so that test was done with smaller
data sizes.


1.3. Summary

With the table partitioned over 3 nodes, the average performance gain
across a variety of scale factors is almost 300%.


Test Two
2.1 The partition structure is as below:

 [ ptf: (a int, b int, c varchar) ]
           (Parent node)
        |        |        |
    [ptf1]     ...      [ptfN]
    (Node1)    (...)    (NodeN)

①select * from ptf
②select * from ptf where b = 100;

This test is done with the same amount of data per node, but the table is
partitioned across N nodes. Each variation (master or patched) is tested
at least 3 times to get reliable and consistent results. The purpose of
the test is to see the impact on performance as the number of data nodes
is increased.
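
As a hedged sketch of how one more data node might be attached for this
test (names and partition bounds are illustrative; the real steps are in
the attached script):

    -- Register data node N and attach its partition to ptf.
    CREATE SERVER nodeN FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'nodeN.example', dbname 'postgres');
    CREATE USER MAPPING FOR CURRENT_USER SERVER nodeN;

    CREATE FOREIGN TABLE ptfN PARTITION OF ptf
        FOR VALUES FROM (40000000) TO (50000000)  -- next free range
        SERVER nodeN OPTIONS (table_name 'ptfN');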

2.2 The results

For ① result (scale per node = 2G):
    node number    master    patched    performance
    2              432s      180s       240%
    3              636s      223s       285%
    4              830s      283s       293%
    5              1065s     361s       295%

For ② result (scale per node = 10G):
    node number    master    patched    performance
    2              281s      140s       201%
    3              421s      140s       300%
    4              562s      141s       398%
    5              702s      141s       497%
    6              833s      139s       599%
    7              986s      141s       699%
    8              1125s     140s       803%


Test Three

This test is similar to Test One but with much larger data and 4 nodes.

For ① result:
    scale per node    master    patched    performance
    100G              6592s     1649s      399%

For ② result:
    scale per node    master    patched    performance
    100G              35383     12363      286%

The results show that it also works well with much larger data.


Summary
The patch is pretty good; it works well when little data is returned to
the parent node. The patch does not provide a parallel FDW scan: it lets
the child nodes send data to the parent in parallel, but the parent can
only process the data from the data nodes sequentially.

Provided there is no performance degradation for non-FDW append queries,
I would recommend considering this patch as an interim solution while we
are waiting for parallel FDW scan.
"



On Thu, Dec 12, 2019 at 5:41 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Hello.

I think I can say that this patch doesn't slow non-AsyncAppend,
non-postgres_fdw scans.


At Mon, 9 Dec 2019 12:18:44 -0500, Bruce Momjian <bruce@momjian.us> wrote in
> Certainly any overhead on normal queries would be unacceptable.

I took performance numbers on the current shape of the async execution
patch for the following scan cases.

t0   : single local table (parallel disabled)
pll  : local partitioning (local Append, parallel disabled)
ft0  : single foreign table
pf0  : inheritance on 4 foreign tables, single connection
pf1  : inheritance on 4 foreign tables, 4 connections
ptf0 : partition on 4 foreign tables, single connection
ptf1 : partition on 4 foreign tables, 4 connections

The benchmarking system is configured as follows, on a single machine.

          [ benchmark client   ]
           |                  |
    (localhost:5433)    (localhost:5432)
           |                  |
   +----+  |   +------+       |
   |    V  V   V      |       V
   | [master server]  |  [async server]
   |       V          |       V
   +--fdw--+          +--fdw--+
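
As a rough sketch of how the single-connection and multi-connection cases
above could be set up (postgres_fdw keeps one connection per foreign
server and user mapping, so four foreign tables on one server share a
connection, while four servers give four connections; server names are
assumptions and the port is taken from the diagram):

    -- pf0 / ptf0: four foreign tables on ONE server -> one shared connection.
    CREATE SERVER async_srv FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'localhost', port '5432', dbname 'postgres');
    CREATE USER MAPPING FOR CURRENT_USER SERVER async_srv;

    -- pf1 / ptf1: four servers pointing at the same host -> four connections.
    CREATE SERVER async_srv1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'localhost', port '5432', dbname 'postgres');
    -- async_srv2 .. async_srv4 are defined identically, and each child
    -- foreign table uses a different one of them.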


The patch works roughly in the following steps.

1. The planner decides how many children of an Append can run
  asynchronously (such children are called async-capable).

2. At ExecInit time, if an Append has no async-capable children, ExecAppend
  (exactly the same function as before) is set as ExecProcNode; otherwise
  ExecAppendAsync is used.

If the infrastructure part of the patch causes any degradation, the "t0"
test (scan on a single local table) and/or the "pll" test (scan on a local
partitioned table) gets slower.

3. postgres_fdw always runs the async-capable code path.

If the postgres_fdw part causes degradation, ft0 reflects that.
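
For instance, one quick way to see which path a case exercises is its plan
(the table names here simply reuse the case labels above, for illustration):

    -- Local-only partitioning: a plain Append, i.e. the non-async path.
    EXPLAIN (COSTS OFF) SELECT sum(a) FROM pll;

    -- Foreign cases: an Append over postgres_fdw Foreign Scans, which is
    -- the path the patch makes async-capable.
    EXPLAIN (COSTS OFF) SELECT sum(a) FROM ptf1;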


The tables have two integer columns and the query does sum(a) over all
tuples.

With the default fetch_size = 100, the numbers below are run times in ms.
Each number is the average of 14 runs.

     master  patched   gain
t0   7325    7130     +2.7%
pll  4558    4484     +1.7%
ft0  3670    3675     -0.1%
pf0  2322    1550    +33.3%
pf1  2367    1475    +37.7%
ptf0 2517    1624    +35.5%
ptf1 2343    1497    +36.2%

With a larger fetch_size (200) the gain mysteriously decreases for the
cases sharing a single connection (pf0, ptf0), but the others don't seem
to change much.

     master  patched   gain
t0   7212    7252     -0.6%
pll  4546    4397     +3.3%
ft0  3712    3731     -0.5%
pf0  2131    1570    +26.4%
pf1  1926    1189    +38.3%
ptf0 2001    1557    +22.2%
ptf1 1903    1193    +37.4%
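
For completeness, a hedged sketch of how fetch_size could be raised to 200
(postgres_fdw accepts the option at the server or foreign-table level; how
the attached script actually does it may differ, and the names are
illustrative):

    -- For every foreign table on this server (use SET instead of ADD if
    -- the option was already set explicitly):
    ALTER SERVER async_srv OPTIONS (ADD fetch_size '200');

    -- Or for a single foreign table:
    ALTER FOREIGN TABLE ft0 OPTIONS (ADD fetch_size '200');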

FWIW, attached are the test scripts.

gentblr2.sql: Table creation script.
testrun.sh  : Benchmarking script.


regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
