Thread: Speeding up INSERTs and UPDATEs to partitioned tables

Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
Hi,

As part of my efforts to make partitioning scale better for larger
numbers of partitions, I've been looking at primarily INSERT VALUES
performance.  Here the overheads are almost completely in the
executor. Planning of this type of statement is very simple since
there is no FROM clause to process.

My benchmarks have been around a RANGE partitioned table with 10k leaf
partitions and no sub-partitioned tables. The partition key is a
timestamp column.

I've found that ExecSetupPartitionTupleRouting() is very slow indeed
and there are a number of things slow about it.  The biggest culprit
for the slowness is the locking of each partition inside of
find_all_inheritors().  For now, this needs to remain as we must hold
locks on each partition while performing RelationBuildPartitionDesc(),
otherwise, one of the partitions may get dropped out from under us.
There might be other valid reasons too, but please see my note at the
bottom of this email.

The locking is not the only slow thing. I found the following to also be slow:

1. RelationGetPartitionDispatchInfo uses a List and lappend() must
perform a palloc() each time a partition is added to the list.
2. A foreach loop is performed over leaf_parts to search for subplans
belonging to this partition. This seems pointless to do for INSERTs as
there's never any to find.
3. ExecCleanupTupleRouting() loops through the entire partitions
array. If a single tuple was inserted then all but one of the elements
will be NULL.
4. Tuple conversion map allocates an empty array thinking there might
be something to put into it. This is costly when the array is large
and pointless when there are no maps to store.
5. During get_partition_dispatch_recurse(), get_rel_relkind() is
called to determine if the partition is a partitioned table or a leaf
partition. This results in a slow relcache hashtable lookup.
6. get_partition_dispatch_recurse() also ends up just building the
indexes array with a sequence of numbers from 0 to nparts - 1 if there
are no sub-partitioned tables. Doing this is slow when there are many
partitions.

Besides the locking, the only thing that remains slow now is the
palloc0() for the 'partitions' array. In my test, it takes 0.6% of
execution time. I don't see any pretty ways to fix that.

I've written fixes for items 1-6 above.

I did:

1. Use an array instead of a List.
2. Don't do this loop. palloc0() the partitions array instead. Let
UPDATE add whatever subplans exist to the zeroed array.
3. Track what we initialize in a gapless array and cleanup just those
ones. Make this array small and increase it only when we need more
space.
4. Only allocate the map array when we need to store a map.
5. Work that out in relcache beforehand.
6. ditto

The only questionable thing I see is what I did for 6.  In partcache.c
I'm basically building an array of nparts containing 0 to nparts -1.
It seems a bit pointless, so perhaps there's a better way. I was also
a bit too tight to memcpy() that out of relcache, and just pointed
directly to it. That might be a no-go area.

I've attached 2 patches:

0001: implements items 1-6
0002: Is not intended for commit. It just a demo of where we could get
the performance if we were smarter about locking partitions. I've just
included this to show 0001's worth.

Performance

AWS: m5d.large fsync=off

Unpatched:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 2836
latency average = 21.162 ms
tps = 47.254409 (including connections establishing)
tps = 47.255756 (excluding connections establishing)

(yes, it's bad)

0001:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 3235
latency average = 18.548 ms
tps = 53.913121 (including connections establishing)
tps = 53.914629 (excluding connections establishing)

(a small improvement from 0001)

0001+0002:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 660079
latency average = 0.091 ms
tps = 11001.303764 (including connections establishing)
tps = 11001.602377 (excluding connections establishing)

(something to aspire towards)

0002 (only):

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 27682
latency average = 2.168 ms
tps = 461.350885 (including connections establishing)
tps = 461.363327 (excluding connections establishing)

(shows that doing 0002 alone does not fix all our problems)

Unpartitioned table (control test):

$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 801260
latency average = 0.075 ms
tps = 13354.311397 (including connections establishing)
tps = 13354.656163 (excluding connections establishing)

Test setup:

CREATE TABLE partbench_ (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL);
CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
PARTITION BY RANGE (date);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
|| ' hours')::interval || ''');'
from generate_Series(0,9999) x;
\gexec
\o

partbench_insert.sql contains:
insert into partbench values('2018-04-26 15:00:00',1,2,3,4,5);

partbench__insert.sql contains:
insert into partbench_ values('2018-04-26 15:00:00',1,2,3,4,5);

I don't want to discuss the locking on this thread. That discussion
will detract from discussing what I'm proposing here... Which is not
to change anything relating to locks. I'm still working on that and
will post elsewhere. Please start another thread if you'd like to
discuss that in the meantime. Feel free to link it in here so others
can follow.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 22 June 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

I've added this to the July 'fest:

https://commitfest.postgresql.org/18/1690/

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: Speeding up INSERTs and UPDATEs to partitioned tables

From
"Kato, Sho"
Date:
Hi,

I tried to benchmark with v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch, but when I create the second
partition,server process get segmentation fault.
 

I don't know the cause, but it seems that an incorrect value is set to partdesc->boundinfo.

(gdb) p partdesc->boundinfo[0]
$6 = {strategy = 0 '\000', ndatums = 2139062142, datums = 0x7f7f7f7f7f7f7f7f, kind = 0x7f7f7f7f7f7f7f7f, indexes =
0x7f7f7f7f7f7f7f7f,null_index = 2139062143, default_index = 2139062143}
 


$ psql postgres
psql (11beta2)
Type "help" for help.

postgres=# create table a(i int) partition by range(i);
CREATE TABLE
postgres=# create table a_1 partition of a for values from(1) to (200);
CREATE TABLE
postgres=# create table a_2 partition of a for values from(200) to (400);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 

2018-07-05 14:02:52.405 JST [60250] LOG:  server process (PID 60272) was terminated by signal 11: Segmentation fault
2018-07-05 14:02:52.405 JST [60250] DETAIL:  Failed process was running: create table a_2 partition of a for values
from(200)to (400);
 

(gdb) bt
#0  0x0000000000596e52 in get_default_oid_from_partdesc (partdesc=0x259e928) at partition.c:269
#1  0x0000000000677355 in DefineRelation (stmt=0x259e610, relkind=114 'r', ownerId=10, typaddress=0x0,
queryString=0x24d58b8"create table a_2 partition of a for values from(200) to (400);") at tablecmds.c:832
 
#2  0x00000000008b6893 in ProcessUtilitySlow (pstate=0x259e4f8, pstmt=0x24d67d8, queryString=0x24d58b8 "create table
a_2partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, 
 
    params=0x0, queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:1000
#3  0x00000000008b66c2 in standard_ProcessUtility (pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition
ofa for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 
 
    queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:920
#4  0x00000000008b583b in ProcessUtility (pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for
valuesfrom(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
 
    dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:360
#5  0x00000000008b482c in PortalRunUtility (portal=0x253af38, pstmt=0x24d67d8, isTopLevel=true, setHoldSnapshot=false,
dest=0x24d6ac8,completionTag=0x7ffc05932330 "") at pquery.c:1178
 
#6  0x00000000008b4a45 in PortalRunMulti (portal=0x253af38, isTopLevel=true, setHoldSnapshot=false, dest=0x24d6ac8,
altdest=0x24d6ac8,completionTag=0x7ffc05932330 "") at pquery.c:1324
 
#7  0x00000000008b3f7d in PortalRun (portal=0x253af38, count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x24d6ac8,altdest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:799
 
#8  0x00000000008adf16 in exec_simple_query (query_string=0x24d58b8 "create table a_2 partition of a for values
from(200)to (400);") at postgres.c:1122
 
#9  0x00000000008b21a5 in PostgresMain (argc=1, argv=0x24ff5b0, dbname=0x24ff410 "postgres", username=0x24d2358
"symfo")at postgres.c:4153
 
#10 0x00000000008113f4 in BackendRun (port=0x24f73f0) at postmaster.c:4361
#11 0x0000000000810b67 in BackendStartup (port=0x24f73f0) at postmaster.c:4033
#12 0x000000000080d0ed in ServerLoop () at postmaster.c:1706
#13 0x000000000080c9a3 in PostmasterMain (argc=1, argv=0x24d0310) at postmaster.c:1379
#14 0x0000000000737392 in main (argc=1, argv=0x24d0310) at main.c:228

(gdb) disassemble 
Dump of assembler code for function get_default_oid_from_partdesc:
   0x0000000000596e0a <+0>:     push   %rbp
   0x0000000000596e0b <+1>:     mov    %rsp,%rbp
   0x0000000000596e0e <+4>:     mov    %rdi,-0x8(%rbp)
   0x0000000000596e12 <+8>:     cmpq   $0x0,-0x8(%rbp)
   0x0000000000596e17 <+13>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x0000000000596e19 <+15>:    mov    -0x8(%rbp),%rax
   0x0000000000596e1d <+19>:    mov    0x10(%rax),%rax
   0x0000000000596e21 <+23>:    test   %rax,%rax
   0x0000000000596e24 <+26>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x0000000000596e26 <+28>:    mov    -0x8(%rbp),%rax
   0x0000000000596e2a <+32>:    mov    0x10(%rax),%rax
   0x0000000000596e2e <+36>:    mov    0x24(%rax),%eax
   0x0000000000596e31 <+39>:    cmp    $0xffffffff,%eax
   0x0000000000596e34 <+42>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x0000000000596e36 <+44>:    mov    -0x8(%rbp),%rax
   0x0000000000596e3a <+48>:    mov    0x8(%rax),%rdx
   0x0000000000596e3e <+52>:    mov    -0x8(%rbp),%rax
   0x0000000000596e42 <+56>:    mov    0x10(%rax),%rax
   0x0000000000596e46 <+60>:    mov    0x24(%rax),%eax
   0x0000000000596e49 <+63>:    cltq   
   0x0000000000596e4b <+65>:    shl    $0x2,%rax
   0x0000000000596e4f <+69>:    add    %rdx,%rax
=> 0x0000000000596e52 <+72>:    mov    (%rax),%eax
   0x0000000000596e54 <+74>:    jmp    0x596e5b <get_default_oid_from_partdesc+81>
   0x0000000000596e56 <+76>:    mov    $0x0,%eax
   0x0000000000596e5b <+81>:    pop    %rbp
   0x0000000000596e5c <+82>:    retq   
End of assembler dump.

(gdb) i r
rax            0x20057e77c      8595695484
rbx            0x72     114
rcx            0x7f50ce90e0e8   139985039712488
rdx            0x259e980        39446912
rsi            0x7f50ce90e0a8   139985039712424
rdi            0x259e928        39446824
rbp            0x7ffc05931890   0x7ffc05931890
rsp            0x7ffc05931890   0x7ffc05931890
r8             0x7ffc059317bf   140720402012095
r9             0x0      0
r10            0x6b     107
r11            0x7f50cdbc3f10   139985025777424
r12            0x70     112
r13            0x0      0
r14            0x0      0
r15            0x0      0
rip            0x596e52 0x596e52 <get_default_oid_from_partdesc+72>
eflags         0x10202  [ IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

(gdb) list *0x596e52
0x596e52 is in get_default_oid_from_partdesc (partition.c:269).
264     Oid
265     get_default_oid_from_partdesc(PartitionDesc partdesc)
266     {
267             if (partdesc && partdesc->boundinfo &&
268                     partition_bound_has_default(partdesc->boundinfo))
269                     return partdesc->oids[partdesc->boundinfo->default_index];
270 
271             return InvalidOid;
272     }
273

regards,
-----Original Message-----
From: David Rowley [mailto:david.rowley@2ndquadrant.com] 
Sent: Saturday, June 23, 2018 7:19 AM
To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 22 June 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let 
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those 
> ones. Make this array small and increase it only when we need more 
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

I've added this to the July 'fest:

https://commitfest.postgresql.org/18/1690/

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 5 July 2018 at 18:39, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> postgres=# create table a(i int) partition by range(i);
> CREATE TABLE
> postgres=# create table a_1 partition of a for values from(1) to (200);
> CREATE TABLE
> postgres=# create table a_2 partition of a for values from(200) to (400);
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Hi,

Thanks for testing. I'm unable to reproduce this on beta2 or master as
of f61988d16.

Did you try make clean then building again?  The 0001 patch does
change PartitionDescData, so if you've not rebuilt all .c files which
use that then that might explain your crash.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Jesper Pedersen
Date:
Hi David,

On 06/22/2018 02:28 AM, David Rowley wrote:
> I've attached 2 patches:
> 
> 0001: implements items 1-6
> 0002: Is not intended for commit. It just a demo of where we could get
> the performance if we were smarter about locking partitions. I've just
> included this to show 0001's worth.
> 

I did some tests with a 64 hash partition setup, and see a speedup for 
INSERT / UPDATE scenarios.

> I don't want to discuss the locking on this thread. That discussion
> will detract from discussing what I'm proposing here... Which is not
> to change anything relating to locks. I'm still working on that and
> will post elsewhere.

With 0002 INSERTs get close to the TPS of the non-partitioned case. 
However, UPDATEs doesn't see the same speedup. But, as you said, a 
discussion for another thread.

Best regards,
  Jesper


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 6 July 2018 at 01:18, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> With 0002 INSERTs get close to the TPS of the non-partitioned case. However,
> UPDATEs doesn't see the same speedup. But, as you said, a discussion for
> another thread.

Hi Jesper,

Thanks for testing this out.  It was only really the locking I didn't
want to discuss here due to the risk of the discussion of removing the
other overheads getting lost in discussions about locking.

It's most likely that the UPDATE is slower due to the planner still
being quite slow when dealing with partitioned tables. It still builds
RangeTblEntry and RelOptInfo objects for each partition even if the
partition is pruned. With INSERT with a VALUES clause, the planner
does not build these objects, in fact, the planner bearly does any
work at all, so this can be speeded up just by removing the executor
overheads.

(I do have other WIP patches to speed up the planner. After delaying
building the RelOptInfo and RangeTblEntry, with my 10k partition
setup, the planner SELECT became 1600 times faster. UPDATE did not
finish in the unpatched version, so gains there are harder to measure.
There's still much work to do on these patches, and there's still more
performance to squeeze out too. Hopefully, I'll be discussing this on
another thread soon.)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: Speeding up INSERTs and UPDATEs to partitioned tables

From
"Kato, Sho"
Date:
Thanks David!

I did benchmark with pgbench, and see a speedup for INSERT / UPDATE scenarios.
I used range partition.

Benchmark results are as follows.

1. 11beta2 result

 part_num |   tps_ex   | latency_avg | update_latency | select_latency | insert_latency 
----------+------------+-------------+----------------+----------------+----------------
      100 | 479.456278 |       2.086 |          1.382 |          0.365 |          0.168
      200 | 169.155411 |       5.912 |          4.628 |          0.737 |          0.299
      400 |  24.857495 |       40.23 |         36.606 |          2.252 |          0.881
      800 |   6.718104 |     148.853 |        141.471 |          5.253 |          1.433
     1600 |   1.934908 |     516.825 |        489.982 |         21.102 |          3.701
     3200 |   0.456967 |    2188.362 |       2101.247 |         72.784 |          8.833
     6400 |   0.116643 |    8573.224 |        8286.79 |        257.904 |         14.949


2. 11beta2 + patch1 + patch2

patch1: Allow direct lookups of AppendRelInfo by child relid
        commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch

 part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency 
----------+-------------+-------------+----------------+----------------+----------------
      100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
      200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
      400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
      800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
     1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
     3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
     6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264


benchmark script:

\set aid random(1, 100 * 1)
\set delta random(-5000, 5000)
BEGIN;
UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM test.accounts WHERE aid = :aid;
INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
END;

partition key is aid.

-----Original Message-----
From: David Rowley [mailto:david.rowley@2ndquadrant.com] 
Sent: Thursday, July 05, 2018 6:19 PM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 5 July 2018 at 18:39, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> postgres=# create table a(i int) partition by range(i); CREATE TABLE 
> postgres=# create table a_1 partition of a for values from(1) to 
> (200); CREATE TABLE postgres=# create table a_2 partition of a for 
> values from(200) to (400); server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Hi,

Thanks for testing. I'm unable to reproduce this on beta2 or master as of f61988d16.

Did you try make clean then building again?  The 0001 patch does change PartitionDescData, so if you've not rebuilt all
.cfiles which use that then that might explain your crash.
 

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 6 July 2018 at 21:25, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> 2. 11beta2 + patch1 + patch2
>
> patch1: Allow direct lookups of AppendRelInfo by child relid
>         commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
>
>  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> ----------+-------------+-------------+----------------+----------------+----------------
>       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
>       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
>       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
>       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
>      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
>      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
>      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

Hi,

Thanks a lot for benchmarking this.

Just a note to say that the "Allow direct lookups of AppendRelInfo by
child relid" patch is already in master. It's much more relevant to be
testing with master than pg11. This patch is not intended for pg11.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Alvaro Herrera
Date:
On 2018-Jul-11, David Rowley wrote:

> On 6 July 2018 at 21:25, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> > 2. 11beta2 + patch1 + patch2
> >
> > patch1: Allow direct lookups of AppendRelInfo by child relid
> >         commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
> >
> >  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> > ----------+-------------+-------------+----------------+----------------+----------------
> >       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
> >       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
> >       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
> >       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
> >      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
> >      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
> >      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

> Just a note to say that the "Allow direct lookups of AppendRelInfo by
> child relid" patch is already in master. It's much more relevant to be
> testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2.  So we still
don't know how much of an improvement patch2 is by itself :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
Hi David.

On 2018/06/22 15:28, David Rowley wrote:
> Hi,
> 
> As part of my efforts to make partitioning scale better for larger
> numbers of partitions, I've been looking at primarily INSERT VALUES
> performance.  Here the overheads are almost completely in the
> executor. Planning of this type of statement is very simple since
> there is no FROM clause to process.

Thanks for this effort.

> My benchmarks have been around a RANGE partitioned table with 10k leaf
> partitions and no sub-partitioned tables. The partition key is a
> timestamp column.
> 
> I've found that ExecSetupPartitionTupleRouting() is very slow indeed
> and there are a number of things slow about it.  The biggest culprit
> for the slowness is the locking of each partition inside of
> find_all_inheritors().

Yes. :-(

> For now, this needs to remain as we must hold
> locks on each partition while performing RelationBuildPartitionDesc(),
> otherwise, one of the partitions may get dropped out from under us.

We lock all partitions using find_all_inheritors to keep locking order
consistent with other sites that may want to lock tables in the same
partition tree but with a possibly conflicting lock mode.  If we remove
the find_all_inheritors call in ExecSetupPartitionPruneState (like your
0002 does), we may end up locking partitions in arbitrary order in a given
transaction, because input tuples will be routed to various partitions in
an order that's not predetermined.

But, maybe it's not necessary to be that paranoid.  If we've locked on the
parent, any concurrent lockers would have to wait for the lock on the
parent anyway, so it doesn't matter which order tuple routing locks the
partitions.

> The locking is not the only slow thing. I found the following to also be slow:
> 
> 1. RelationGetPartitionDispatchInfo uses a List and lappend() must
> perform a palloc() each time a partition is added to the list.
> 2. A foreach loop is performed over leaf_parts to search for subplans
> belonging to this partition. This seems pointless to do for INSERTs as
> there's never any to find.
> 3. ExecCleanupTupleRouting() loops through the entire partitions
> array. If a single tuple was inserted then all but one of the elements
> will be NULL.
> 4. Tuple conversion map allocates an empty array thinking there might
> be something to put into it. This is costly when the array is large
> and pointless when there are no maps to store.
> 5. During get_partition_dispatch_recurse(), get_rel_relkind() is
> called to determine if the partition is a partitioned table or a leaf
> partition. This results in a slow relcache hashtable lookup.
> 6. get_partition_dispatch_recurse() also ends up just building the
> indexes array with a sequence of numbers from 0 to nparts - 1 if there
> are no sub-partitioned tables. Doing this is slow when there are many
> partitions.
> 
> Besides the locking, the only thing that remains slow now is the
> palloc0() for the 'partitions' array. In my test, it takes 0.6% of
> execution time. I don't see any pretty ways to fix that.
> 
> I've written fixes for items 1-6 above.
> 
> I did:
> 
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

The issues you list seem all legitimate to me and also your proposed fixes
for each, except I think we could go a bit further.

Why don't we abandon the notion altogether that
ExecSetupPartitionTupleRouting *has to* process the whole partition tree?
ISTM, there is no need to determine the exact number of leaf partitions
and partitioned tables in the partition tree and allocate the arrays in
PartitionTupleRouting based on that.  I know that the indexes array in
PartitionDispatchData contains mapping from local partition indexes (0 to
partdesc->nparts - 1) to those that span *all* leaf partitions and *all*
partitioned tables (0 to proute->num_partitions or proute->num_dispatch)
in a partition tree, but we can change that.

The idea I had was inspired by looking at partitions_init stuff in your
patch.  We could allocate proute->partition_dispatch_info and
proute->partitions arrays to be of a predetermined size, which doesn't
require us to calculate the exact number of leaf partitions and
partitioned tables beforehand.  So, RelationGetPartitionDispatchInfo need
not recursively go over all of the partition tree.  Instead we create just
one PartitionDispatch object of the root parent table, whose indexes array
is initialized with -1 meaning none of the partitions has not been
encountered yet.  In ExecFindPartition, once tuple routing chooses a
partition, we create either a ResultRelInfo (if leaf) or a
PartitionDispatch for it and store it in the 0th slot in
proute->partitions or proute->partition_dispatch_info, respectively.
Also, we update the indexes array in the parent's PartitionDispatch to
replace -1 with 0 so that future tuples routing to that partition don't
allocate it again.  The process is repeated if the tuple needs to be
routed one more level down.  If the query needs to allocate more
ResultRelInfos and/or PartitionDispatch objects than we initially
allocated space for, we expand those arrays.  Finally, during
ExecCleanupTupleRouting, we only "clean up" the partitions that we
allocated ResultRelInfos and PartitionDispatch objects for, which is very
similar to the partitions_init idea in your patch.

I implemented that idea in the attached patch, which applies on top of
your 0001 patch, but I'd say it's too big to be just called a delta.  I
was able to get following performance numbers using the following pgbench
test:

pgbench -n -T 180 -f insert-ht.sql
cat insert-ht.sql
\set b random(1, 1000)
\set a random(1, 1000)
insert into ht values (:b, :a);

Note that pgbench is run 3 times and every tps result is listed below.

HEAD - 0 parts (unpartitioned table)
tps = 2519.603076 (including connections establishing)
tps = 2486.903189 (including connections establishing)
tps = 2518.771182 (including connections establishing)

HEAD - 2500 hash parts (no subpart)
tps = 13.158224 (including connections establishing)
tps = 12.940713 (including connections establishing)
tps = 12.882465 (including connections establishing)

David - 2500 hash parts (no subpart)
tps = 18.717628 (including connections establishing)
tps = 18.602280 (including connections establishing)
tps = 18.945527 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 18.576858 (including connections establishing)
tps = 18.431902 (including connections establishing)
tps = 18.797023 (including connections establishing)

HEAD - 2500 hash parts (4 hash subparts each)
tps = 2.339332 (including connections establishing)
tps = 2.339582 (including connections establishing)
tps = 2.317037 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 3.225044 (including connections establishing)
tps = 3.214053 (including connections establishing)
tps = 3.239591 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 3.321918 (including connections establishing)
tps = 3.305952 (including connections establishing)
tps = 3.301036 (including connections establishing)

Applying the lazy locking patch on top of David's and my patch,
respectively, produces the following results.

David - 2500 hash parts (no subpart)
tps = 1577.854360 (including connections establishing)
tps = 1532.681499 (including connections establishing)
tps = 1464.254096 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 1532.475751 (including connections establishing)
tps = 1534.650325 (including connections establishing)
tps = 1527.840837 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 78.845916 (including connections establishing)
tps = 79.167079 (including connections establishing)
tps = 79.621686 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
9:tps = 329.887008 (including connections establishing)
9:tps = 327.428103 (including connections establishing)
9:tps = 326.863248 (including connections establishing)

About the last two results: after getting rid of the time-hog that is
find_all_inheritors() call in ExecSetupPartitionTupleRouting for locking
all partitions, it seems that we'll end up spending most of the time in
RelationGetPartitionDispatchInfo() without my patch, because it will call
get_partition_dispatch_recurse() for each of the 2500 partitions of first
level that are themselves partitioned.  With my patch, we won't do that
and won't end up generating 2499 PartitionDispatch objects that would not
be needed for a single-row insert statement.

Thanks,
Amit

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 13 July 2018 at 20:20, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Why don't we abandon the notion altogether that
> ExecSetupPartitionTupleRouting *has to* process the whole partition tree?

[...]

> I implemented that idea in the attached patch, which applies on top of
> your 0001 patch, but I'd say it's too big to be just called a delta.  I
> was able to get following performance numbers using the following pgbench
> test:

Thanks for looking at this.  I like that your idea gets rid of the
indexes cache in syscache. I was not very happy with that part.

I've looked over the code and the ExecUseUpdateResultRelForRouting()
function is broken.  Your while loop only skips partitions for the
current partitioned table, it does not skip ModifyTable subnodes that
belong to other partitioned tables.

You can use the following. The code does not find the t1_a2 subnode.

create table t1 (a int, b int) partition by list(a);
create table t1_a1 partition of t1 for values in(1) partition by list(b);
create table t1_a2 partition of t1 for values in(2);
create table t1_a1_b1 partition of t1_a1 for values in(1);
create table t1_a1_b2 partition of t1_a1 for values in(2);
insert into t1 values(2,2);

update t1 set a = a;

I think there might not be enough information to make this work
correctly, as if you change the loop to skip subnodes, then it won't
work in cases where the partition[0] was pruned.

I've another patch sitting here, partly done, that changes
pg_class.relispartition into pg_class.relpartitionparent.  If we had
that then we could code your loop to work correctly.   Alternatively,
I guess we could just ignore the UPDATE's ResultRelInfos and just
build new ones. Unsure if there's actually a reason we need to reuse
the existing ones, is there?

I think you'd need to know the owning partition and skip subnodes that
don't belong to pd->reldesc. Alternatively, a hashtable could be built
with all the oids belonging to pd->reldesc, then we could loop over
the update_rris finding subnodes that can be found in the hashtable.
Likely this will be much slower than the sort of merge lookup that the
previous code did.

Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code.
The code seems to assume that there can only be at the most 65536
partitions, but I don't think there's any code which restricts us to
that. There is code in the planner that will bork when trying to
create a RangeTblEntry up that high, but as far as I know that won't
be noticed on the INSERT path.  I don't think this code has any
business knowing what the special varnos are set to either.  It would
be better to just remove the limit and suffer the small wasted array
space.  I understand you've probably coded it like this due to the
similar code that was in my patch, but with mine I knew the total
number of partitions. Your patch does not.

Other thoughts on the patch:

I wonder if it's worth having syscache keep a count on the number of
sub-partitioned tables a partition has. If there are none in the root
partition then the partition_dispatch_info can be initialized with
just 1 element to store the root details. Although, maybe it's not
worth it to reduce the array size by 7 elements.

Also, I'm a bit confused why you change the comments in
execPartition.h for PartitionTupleRouting to be inline again. I
brought those out of line as I thought the complexity of the code
warranted that. You're inlining them again goes against what all the
other structs do in that file.

Apart from that, I think the idea is promising. We'll just need to
find a way to make ExecUseUpdateResultRelForRouting work correctly.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
Hi David,

Thanks for taking a look.

On 2018/07/15 17:34, David Rowley wrote:
> I've looked over the code and the ExecUseUpdateResultRelForRouting()
> function is broken.  Your while loop only skips partitions for the
> current partitioned table, it does not skip ModifyTable subnodes that
> belong to other partitioned tables.
> 
> You can use the following. The code does not find the t1_a2 subnode.
> 
> create table t1 (a int, b int) partition by list(a);
> create table t1_a1 partition of t1 for values in(1) partition by list(b);
> create table t1_a2 partition of t1 for values in(2);
> create table t1_a1_b1 partition of t1_a1 for values in(1);
> create table t1_a1_b2 partition of t1_a1 for values in(2);
> insert into t1 values(2,2);
> 
> update t1 set a = a;

Hmm, it indeed is broken.

> I think there might not be enough information to make this work
> correctly, as if you change the loop to skip subnodes, then it won't
> work in cases where the partition[0] was pruned.
> 
> I've another patch sitting here, partly done, that changes
> pg_class.relispartition into pg_class.relpartitionparent.  If we had
> that then we could code your loop to work correctly.> Alternatively,
> I guess we could just ignore the UPDATE's ResultRelInfos and just
> build new ones. Unsure if there's actually a reason we need to reuse
> the existing ones, is there?

We try to use the existing ones because we thought back when the patch was
written (not by me though) that redoing all the work that
InitResultRelInfo does for each partition, for which we could have instead
used an existing one, would cumulatively end up being more expensive than
figuring out which ones we could reuse by a linear scan of partition and
result rels arrays in parallel.  I don't remember seeing a benchmark to
demonstrate the benefit of doing this though.  Maybe it was posted, but I
don't remember having looked at it closely.

> I think you'd need to know the owning partition and skip subnodes that
> don't belong to pd->reldesc. Alternatively, a hashtable could be built
> with all the oids belonging to pd->reldesc, then we could loop over
> the update_rris finding subnodes that can be found in the hashtable.
> Likely this will be much slower than the sort of merge lookup that the
> previous code did.

I think one option is to simply give up on the idea of matching *all*
UPDATE result rels that belong to a given partitioned table (pd->reldesc)
in one call of ExecUseUpdateResultRelForRouting.  Instead, pass the index
of the partition (in pd->partdesc->oids) to find the ResultRelInfo for,
loop over all UPDATE result rels looking for one, and return immediately
on finding one after having stored its pointer in proute->partitions.  In
the worst case, we'll end up scanning UPDATE result rels array for every
partition that gets touched, but maybe such an UPDATE query is less common
or even if such a query occurs, tuple routing might be the last of its
bottlenecks.

I have implemented that approach in the updated patch.

That means I also needed to change things so that
ExecUseUpdateResultRelsForRouting is now only called by ExecFindPartition,
because with the new arrangement, it's useless to call it from
ExecSetupPartitionTupleRouting.  Moreover, an UPDATE may not use tuple
routing at all, even if the fact that partition key is being updated
results in calling ExecSetupPartitionTupleRouting.

> Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code.
> The code seems to assume that there can only be at the most 65536
> partitions, but I don't think there's any code which restricts us to
> that. There is code in the planner that will bork when trying to
> create a RangeTblEntry up that high, but as far as I know that won't
> be noticed on the INSERT path.  I don't think this code has any
> business knowing what the special varnos are set to either.  It would
> be better to just remove the limit and suffer the small wasted array
> space.  I understand you've probably coded it like this due to the
> similar code that was in my patch, but with mine I knew the total
> number of partitions. Your patch does not.

OK, I changed it to UINT_MAX.

> Other thoughts on the patch:
> 
> I wonder if it's worth having syscache keep a count on the number of
> sub-partitioned tables a partition has. If there are none in the root
> partition then the partition_dispatch_info can be initialized with
> just 1 element to store the root details. Although, maybe it's not
> worth it to reduce the array size by 7 elements.

Hmm yes.  Allocating space for 8 pointers when we really need 1 is not too
bad, if the alternative is to modify partcache.c.

> Also, I'm a bit confused why you change the comments in
> execPartition.h for PartitionTupleRouting to be inline again. I
> brought those out of line as I thought the complexity of the code
> warranted that. You're inlining them again goes against what all the
> other structs do in that file.

It was out-of-line to begin with but it started to become distracting when
updating the comments.  But I agree about being consistent and hence I
have moved them back to where they were.  I have significantly rewritten
those comments though to be clearer.

> Apart from that, I think the idea is promising. We'll just need to
> find a way to make ExecUseUpdateResultRelForRouting work correctly.

Let me know what you think of the code in the updated patch.

Thanks,
Amit

Attachment

RE: Speeding up INSERTs and UPDATEs to partitioned tables

From
"Kato, Sho"
Date:
On 2018-Jul-11, Alvaro Herrera wrote:
> That commit is also in pg11, though -- just not in beta2.  So we still don't know how much of an improvement patch2
isby itself :-)
 

Oops! I benchmarked with 11beta2 + 0001-Speed​​-up-INSERT-and-UPDATE-on-partitioned-tables.patch.
Results are as follows.

Performance seems to be improved.

part_num | latency_avg |   tps_ex   | update_latency | select_latency | insert_latency 
----------+-------------+------------+----------------+----------------+----------------
      100 |        2.09 | 478.379516 |          1.407 |           0.36 |          0.159
      200 |       5.871 | 170.322179 |          4.621 |          0.732 |          0.285
      400 |      39.029 |  25.622384 |         35.542 |          2.273 |          0.758
      800 |     142.624 |   7.011494 |        135.447 |           5.04 |          1.388
     1600 |     559.872 |   1.786138 |        534.301 |         20.318 |          3.122
     3200 |    2161.834 |   0.462574 |       2077.737 |         72.804 |          7.037
     6400 |     8282.38 |   0.120739 |       7996.212 |        259.406 |         14.514

Thanks

Kato Sho
-----Original Message-----
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com] 
Sent: Wednesday, July 11, 2018 10:30 PM
To: David Rowley <david.rowley@2ndquadrant.com>
Cc: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Jul-11, David Rowley wrote:

> On 6 July 2018 at 21:25, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> > 2. 11beta2 + patch1 + patch2
> >
> > patch1: Allow direct lookups of AppendRelInfo by child relid
> >         commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
> >
> >  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> > ----------+-------------+-------------+----------------+----------------+----------------
> >       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
> >       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
> >       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
> >       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
> >      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
> >      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
> >      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

> Just a note to say that the "Allow direct lookups of AppendRelInfo by 
> child relid" patch is already in master. It's much more relevant to be 
> testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2.  So we still don't know how much of an improvement patch2 is
byitself :-)
 

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 18 July 2018 at 21:44, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> part_num | latency_avg |   tps_ex   | update_latency | select_latency | insert_latency
> ----------+-------------+------------+----------------+----------------+----------------
>       100 |        2.09 | 478.379516 |          1.407 |           0.36 |          0.159
>       200 |       5.871 | 170.322179 |          4.621 |          0.732 |          0.285
>       400 |      39.029 |  25.622384 |         35.542 |          2.273 |          0.758
>       800 |     142.624 |   7.011494 |        135.447 |           5.04 |          1.388
>      1600 |     559.872 |   1.786138 |        534.301 |         20.318 |          3.122
>      3200 |    2161.834 |   0.462574 |       2077.737 |         72.804 |          7.037
>      6400 |     8282.38 |   0.120739 |       7996.212 |        259.406 |         14.514

Thanks for testing. It's fairly customary to include before/after,
unpatched/patched results.  I don't think your patched results really
mean much by themselves. It's pretty well known that adding more
partitions slows down the planner and the executor, to a lesser
extent. This patch only aims to reduce some of the executor startup
overheads for INSERT and UPDATE.

Also, the 0001 patch is not really aiming to break any performance
records. I posted results already and there is only a very small
improvement. The main aim with the 0001 patch is to remove the
bottlenecks so that the performance drop between partitioned and
non-partitioned is primarily due to the partition locking.  I'd like
to fix that too, but it's more work and I see no reason that we
shouldn't fix up the other slow parts first. I imagine this will
increase the motivation to resolve the locking all partitions issue
too.

I'd also recommend that if you're testing this, that you do so with a
recent master. The patch is not intended for pg11.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 18 July 2018 at 20:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Let me know what you think of the code in the updated patch.

Thanks for sending the updated patch.

I looked over it tonight and made a number of changes:

1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
useless since the int would have wrapped long before it reached
UINT_MAX. There's no shortage of other code doubling the size of an
array by multiplying it by 2 unconditionally without considering
overflowing an int. Unsure why you considered this more risky.
2) Fixed a series of bugs regarding the size of the arrays in
PartitionTupleRouting. The map arrays and the partitions array could
differ in size despite your comment that claimed
child_parent_tupconv_maps was the same size as 'partitions' when
non-NULL. The map arrays being a different size than the partitions
array caused the following two cases to segfault. I've included two
cases as it was two seperate bugs that caused them.

-- case 1
drop table listp;
create table listp (a int, b int) partition by list (a);
create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);
create table listp3 partition of listp for values in (3);
create table listp4 partition of listp for values in (4);
create table listp5 partition of listp for values in (5);
create table listp6 partition of listp for values in (6);
create table listp7 partition of listp for values in (7);
create table listp8 partition of listp for values in (8);
create table listp9 (b int, a int);

alter table listp attach partition listp9 for values in(9);

insert into listp select generate_series(1,9);

-- case 2
drop table listp;
create table listp (a int, b int) partition by list (a);
create table listp1 (b int, a int);

alter table listp attach partition listp1 for values in(1);

create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);
create table listp3 partition of listp for values in (3);
create table listp4 partition of listp for values in (4);
create table listp5 partition of listp for values in (5);
create table listp6 partition of listp for values in (6);
create table listp7 partition of listp for values in (7);
create table listp8 partition of listp for values in (8);
create table listp9 partition of listp for values in (9);

insert into listp select generate_series(1,9);

3) Got rid of ExecUseUpdateResultRelForRouting. I started to change
this to remove references to UPDATE in order to make it more friendly
towards other possible future node types that it would get used for
(aka MERGE). In the end, I found that performance could regress when
in cases like:

drop table listp;
create table listp (a int) partition by list(a);
\o /dev/null
\timing off
select 'create table listp'||x::Text||' partition of listp for values
in('||x::Text||');' from generate_series(1,1000) x;
\gexec
\o
insert into listp select x from generate_series(1,999) x;
\timing on
update listp set a = a+1;

It's true that UPDATEs with a large number of subplans performance is
quite terrible today in the planner, but this code made the
performance of planning+execution a bit worse. If we get around to
fixing the inheritance planner then I think
ExecUseUpdateResultRelForRouting() could easily appear in profiles.

I ended up rewriting it to just get called once and build a hash table
by Oid storing a ResultRelInfo pointer.  This also gets rid of the
slow nested loop in the cleanup operation inside
ExecCleanupTupleRouting().

4) Did some tuning work in ExecFindPartition() getting rid of a
redundant check after the loop completion. Also added some likely()
and unlikely() decorations around some conditions.

5) Updated some newly out-dated comments since your patch in execPartition.h.

6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a
palloc() updating the few fields that were not initialised. This might
save a few TPS (at least once we get rid of the all partition locking)
in the single-row INSERT case, but I've not tested the performance of
this yet.

7) Also moved and edited some comments above
ExecSetupPartitionTupleRouting() that I felt explained a little too
much about some internal implementation details.

One thing that I thought of, but didn't do was just having
ExecFindPartition() return the ResultRelInfo. I think it would be much
nicer in both call sites to not have to check the ->partitions array
to get that.  The copy.c call site would need a few modifications
around the detection code to see if the partition has changed, but it
all looks quite possible to change. I left this for now as I have
another patch which touches all that code that I feel is closer to
commit than this patch is.

I've attached a delta of the changes I made since your v2 delta and
also a complete updated patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
> useless since the int would have wrapped long before it reached
> UINT_MAX. There's no shortage of other code doubling the size of an
> array by multiplying it by 2 unconditionally without considering
> overflowing an int. Unsure why you considered this more risky.

As long as you're re-palloc'ing the array each time, and not increasing
its size more than 2X, this is perfectly safe because of the 1GB size
limit on palloc requests.  You'll fail because of that in the iteration
where the request is between 1GB and 2GB, just before integer overflow
can occur.

(Yes, this is intentional.)

            regards, tom lane


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 27 July 2018 at 04:19, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached a delta of the changes I made since your v2 delta and
> also a complete updated patch.

I did a very quick performance test of this patch on an AWS m5d.large
instance with fsync=off.

The test setup is the same as is described in my initial email on this thread.

The test compares the performance of INSERTs into a partitioned table
with 10k partitions compared to a non-partitioned table.

Patched with v2 patch on master@39d51fe87

-- partitioned
$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1063764
latency average = 0.056 ms
tps = 17729.375930 (including connections establishing)
tps = 17729.855215 (excluding connections establishing)

-- non-partitioned
$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1147273
latency average = 0.052 ms
tps = 19121.194366 (including connections establishing)
tps = 19121.695469 (excluding connections establishing)

Here we're within 92% of the non-partitioned performance.

Looking back at the first email in this thread where I tested the v1
patch, we were within 82% with:

-- partitioned
tps = 11001.602377 (excluding connections establishing)

-- non-partitioned
tps = 13354.656163 (excluding connections establishing)

Again, same as with the v1 test, the v2 test was done with the locking
of all partitions removed with:

diff --git a/src/backend/executor/execPartition.c
b/src/backend/executor/execPartition.c
index d7b18f52ed..6223c62094 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -80,9 +80,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState
*mtstate, Relation rel)
  PartitionTupleRouting *proute;
  ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;

- /* Lock all the partitions. */
- (void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-
  /*
  * Here we attempt to expend as little effort as possible in setting up
  * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
@@ -442,7 +439,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
  * We locked all the partitions in ExecSetupPartitionTupleRouting
  * including the leaf partitions.
  */
- partrel = heap_open(partoid, NoLock);
+ partrel = heap_open(partoid, RowExclusiveLock);

  /*
  * Keep ResultRelInfo and other information for this partition in the

Again, the reduce locking is not meant for commit for this patch.
Changing the locking will require a discussion on its own thread.

And just for fun, the unpatched performance on the partitioned table:

ubuntu@ip-10-0-0-33:~$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 5751
latency average = 10.434 ms
tps = 95.836052 (including connections establishing)
tps = 95.838490 (excluding connections establishing)

(185x increase)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/07/27 1:19, David Rowley wrote:
> On 18 July 2018 at 20:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Let me know what you think of the code in the updated patch.
> 
> Thanks for sending the updated patch.
> 
> I looked over it tonight and made a number of changes:
> 
> 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
> useless since the int would have wrapped long before it reached
> UINT_MAX.

Oops, you're right.

> There's no shortage of other code doubling the size of an
> array by multiplying it by 2 unconditionally without considering
> overflowing an int. Unsure why you considered this more risky.

Just ill-informed paranoia on my part.  Let's just drop it as you say,
given also the Tom's comment that repalloc would fail anyway for requests
over 1GB.

> 2) Fixed a series of bugs regarding the size of the arrays in
> PartitionTupleRouting. The map arrays and the partitions array could
> differ in size despite your comment that claimed
> child_parent_tupconv_maps was the same size as 'partitions' when
> non-NULL. The map arrays being a different size than the partitions
> array caused the following two cases to segfault. I've included two
> cases as it was two seperate bugs that caused them.
> 
> -- case 1

[ .... ]

> -- case 2

Indeed, there were some holes in the logic that led to me to come up with
that code.

> 3) Got rid of ExecUseUpdateResultRelForRouting. I started to change
> this to remove references to UPDATE in order to make it more friendly
> towards other possible future node types that it would get used for
> (aka MERGE). In the end, I found that performance could regress when
> in cases like:
> 
> drop table listp;
> create table listp (a int) partition by list(a);
> \o /dev/null
> \timing off
> select 'create table listp'||x::Text||' partition of listp for values
> in('||x::Text||');' from generate_series(1,1000) x;
> \gexec
> \o
> insert into listp select x from generate_series(1,999) x;
> \timing on
> update listp set a = a+1;
> 
> It's true that UPDATEs with a large number of subplans performance is
> quite terrible today in the planner, but this code made the
> performance of planning+execution a bit worse. If we get around to
> fixing the inheritance planner then I think
> ExecUseUpdateResultRelForRouting() could easily appear in profiles.
>
> I ended up rewriting it to just get called once and build a hash table
> by Oid storing a ResultRelInfo pointer.  This also gets rid of the
> slow nested loop in the cleanup operation inside
> ExecCleanupTupleRouting().

OK, looks neat, although I'd name the hash table subplan_resultrel_hash
(like join_rel_hash in PlannerInfo), instead of subplan_partition_table.

> 4) Did some tuning work in ExecFindPartition() getting rid of a
> redundant check after the loop completion. Also added some likely()
> and unlikely() decorations around some conditions.

All changes seem good.

> 5) Updated some newly out-dated comments since your patch in execPartition.h.
> 
> 6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a
> palloc() updating the few fields that were not initialised. This might
> save a few TPS (at least once we get rid of the all partition locking)
> in the single-row INSERT case, but I've not tested the performance of
> this yet.
> 
> 7) Also moved and edited some comments above
> ExecSetupPartitionTupleRouting() that I felt explained a little too
> much about some internal implementation details.

Thanks, changes look good.

> One thing that I thought of, but didn't do was just having
> ExecFindPartition() return the ResultRelInfo. I think it would be much
> nicer in both call sites to not have to check the ->partitions array
> to get that.  The copy.c call site would need a few modifications
> around the detection code to see if the partition has changed, but it
> all looks quite possible to change. I left this for now as I have
> another patch which touches all that code that I feel is closer to
> commit than this patch is.

I had wondered about that too, but gave up on doing anything about it
because the callers of ExecFindPartition want to access other fields of
PartitionTupleRouting using the returned index.  Maybe, we could make it
return a ResultRelInfo * and also return the index itself using a separate
output argument.  Seems like a cosmetic improvement that can be made later.

> I've attached a delta of the changes I made since your v2 delta and
> also a complete updated patch.

Thanks.  Here are some other minor comments on the complete v2 patch.

-            tuple =
ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
+            tuple =
ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
+
proute->parent_child_tupconv_maps[leaf_part_index] :
+                                                NULL,

This piece of code that's present in both ExecPrepareTupleRouting and
CopyFrom can be written as:

if (proute->parent_child_tupconv_maps)
    ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
                              tuple,
                              proute->partition_tuple_slot,
                              &slot);

+    /*
+     * If UPDATE needs to do tuple routing, we'll need a slot that will
+     * transiently store the tuple being routed using the root parent's
+     * rowtype.  We must set up at least this slot, because it's needed even
+     * before tuple routing begins.  Other necessary information is
+     * initialized when  tuple routing code calls
+     * ExecUseUpdateResultRelForRouting.
+     */
     if (node && node->operation == CMD_UPDATE)

This comment needs to be updated, because you changed the if block's body as:

+        ExecHashSubPlanResultRelsByOid(mtstate, proute);
         proute->root_tuple_slot = MakeTupleTableSlot(NULL);

So, we don't just set up the slot here, we also now set up the hash table
to store sub-plan result rels.  Also, ExecUseUpdateResultRelForRouting no
longer exist.

+            /*
+             * Get the index for PartitionTupleRouting->partitions array
index
+             * for this leaf partition.  This may require building a new
+             * ResultRelInfo.
+             */


1st sentence reads a bit strange.  Did you mean to write the following?

            /*
             * Get this leaf partition's index in the
             * PartitionTupleRouting->partitions array.
             * This may require building a new ResultRelInfo.
             */

The following block of code could use a one-line comment describing what's
going on (although, what's going on might be pretty clear to some eyes
just by looking at the code):

            else
            {
                if (proute->subplan_partition_table)
                {
                    ResultRelInfo *rri;
                    Oid         partoid = partdesc->oids[partidx];

                    rri = hash_search(proute->subplan_partition_table,
                                      &partoid, HASH_FIND, NULL);

 /*
+ * ExecInitPartitionDispatchInfo
+ *      Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('dispatchidx'), possibly expanding the array if there
+ * isn't space left in it.
+ */

You renamed 'dispatchidx' to 'partidx' in the function's signature but
forgot to update this comment.

I've attached a delta patch to make the above changes.  I'm leaving the
hash table rename up to you though.

Thanks
Amit

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 27 July 2018 at 19:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've attached a delta patch to make the above changes.  I'm leaving the
> hash table rename up to you though.

Thanks for the delta patch. I took all of it, just rewrote a comment slightly.

I also renamed the hash table to your suggestion and changed a few more things.

Attached a delta based on v2 and the full v3 patch.

This includes another small change to make
PartitionDispatchData->indexes an array that's allocated in the same
memory as the PartitionDispatchData. This will save a palloc() call
and also should be a bit more cache friendly.

This also required a rebase on master again due to 3e32109049.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/07/28 10:54, David Rowley wrote:
> On 27 July 2018 at 19:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I've attached a delta patch to make the above changes.  I'm leaving the
>> hash table rename up to you though.
> 
> Thanks for the delta patch. I took all of it, just rewrote a comment slightly.
> 
> I also renamed the hash table to your suggestion and changed a few more things.
> 
> Attached a delta based on v2 and the full v3 patch.
> 
> This includes another small change to make
> PartitionDispatchData->indexes an array that's allocated in the same
> memory as the PartitionDispatchData. This will save a palloc() call
> and also should be a bit more cache friendly.
> 
> This also required a rebase on master again due to 3e32109049.

Thanks for the updated patch.

I couldn't find much to complain about in the latest v3, except I noticed
a few instances of the word "setup" where I think what's really meant is
"set up".

+     * must be setup, but any sub-partitioned tables can be setup lazily as

+                 * A ResultRelInfo has not been setup for this partition yet,


By the way, when going over the updated code, I noticed that the code
around child_parent_tupconv_maps could use some refactoring too.
Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates
child-to-parent map array needed for transition tuple capture even if not
needed by any of the leaf partitions.  I'm attaching here a patch that
applies on top of your v3 to show what I'm thinking we could do.

Thanks,
Amit

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 30 July 2018 at 20:26, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I couldn't find much to complain about in the latest v3, except I noticed
> a few instances of the word "setup" where I think what's really meant is
> "set up".
>
> +     * must be setup, but any sub-partitioned tables can be setup lazily as
>
> +                 * A ResultRelInfo has not been setup for this partition yet,
>

Great. I've fixed those and also fixed a few other comments.  I found
the comments on PartitionTupleRouting didn't really explain how the
arrays were indexed. I've made an attempt to make that clear.

I've attached a complete v4 patch.

> By the way, when going over the updated code, I noticed that the code
> around child_parent_tupconv_maps could use some refactoring too.
> Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates
> child-to-parent map array needed for transition tuple capture even if not
> needed by any of the leaf partitions.  I'm attaching here a patch that
> applies on top of your v3 to show what I'm thinking we could do.

Maybe we can do that as a follow-on patch. I think what we have so far
is already ended up quite complex to review. What do you think?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 31 July 2018 at 19:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached a complete v4 patch.

I've attached v5 of the patch which is based on top of today's master
(@ 579b985b22)

A couple of recent patches conflict with v4. I've also made another
tidy up pass, which was mostly just rewording comments in
execPartition.h

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
(looking at the v5 patch but replying to an older email)

On 2018/07/31 16:03, David Rowley wrote:
> I've attached a complete v4 patch.
> 
>> By the way, when going over the updated code, I noticed that the code
>> around child_parent_tupconv_maps could use some refactoring too.
>> Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates
>> child-to-parent map array needed for transition tuple capture even if not
>> needed by any of the leaf partitions.  I'm attaching here a patch that
>> applies on top of your v3 to show what I'm thinking we could do.
> 
> Maybe we can do that as a follow-on patch.

We probably could, but I think it would be a good idea get rid of *all*
redundant allocations due to tuple routing in one patch, if that's the
mission of this thread and the patch anyway.

> I think what we have so far
> is already ended up quite complex to review. What do you think?

Yeah, it's kind of complex, but at least it seems that we're clear on the
point that what we're trying to do here is to try to get rid of redundant
allocations.

Parts of the patch that appear complex seems to be around the allocation
of various maps.  Especially the child-to-parent maps, which as things
stand today, come from two arrays -- a per-update-subplan array that's
needed by update tuple routing proper and per-leaf partition array (one in
PartitionTupleRouting) that's needed by transition capture machinery.  The
original coding was such the update tuple routing handling code would try
to avoid allocating the per-update-subplan array if it saw that per-leaf
partition array was already set up in PartitionTupleRouting, because
transition capture is active in the query.  For update-tuple-routing code
to be able to use maps from the per-leaf array, it would have to know
which update-subplans mapped to which tuple-routing-initialized
partitions.  That was maintained in the subplan_partition_offset array
that's now gone with this patch, because we no longer want to fix the
tuple-routing-initialized-partition offsets in advance.  So, it's better
to dissociate per-subplan maps which are initialized during
ExecInitModifyTable from per-leaf maps which are initialized lazily when
tuple routing initializes a partition, which is what my portion of the
patch did.

As mentioned in my last email, I still think it would be a good idea to
simplify the handling of child-to-parent maps in PartitionTupleRouting
even further, while we're at improving the code in this area.  I revised
the patch such that it makes the handling of maps in PartitionTupleRouting
even more uniform.  With that patch, we no longer have two completely
unrelated places in the code managing parent-to-child and child-to-parent
maps, even though both arrays are in the same PartitionTupleRouting.
Please find the updated patch attached with this email.

Thanks,
Amit

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 3 August 2018 at 17:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/07/31 16:03, David Rowley wrote:
>> Maybe we can do that as a follow-on patch.
>
> We probably could, but I think it would be a good idea get rid of *all*
> redundant allocations due to tuple routing in one patch, if that's the
> mission of this thread and the patch anyway.

I started looking at this patch today and I now agree that it should
be included in the main patch.

I changed a few things with the patch. For example, the map access
macros you'd defined were not in CamelCase. I also fixed a bug where
the child to parent map was not being initialised when on conflict
transition capture was required. I added a test which was crashing the
backend but fixed the code so it works correctly. I also got rid of
the child_parent_map_not_required array since we now no longer need
it.  The code now always initialises the maps in cases where they're
going to be required.

I've attached a v3 version of your patch and also v6 of the main patch
which includes the v3 patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/08/21 14:44, David Rowley wrote:
> On 3 August 2018 at 17:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/07/31 16:03, David Rowley wrote:
>>> Maybe we can do that as a follow-on patch.
>>
>> We probably could, but I think it would be a good idea get rid of *all*
>> redundant allocations due to tuple routing in one patch, if that's the
>> mission of this thread and the patch anyway.
> 
> I started looking at this patch today and I now agree that it should
> be included in the main patch.

Great, thanks.

> I changed a few things with the patch. For example, the map access
> macros you'd defined were not in CamelCase.

In the updated patch:

+#define PartitionTupRoutingGetToParentMap(p, i) \
+#define PartitionTupRoutingGetToChildMap(p, i) \

If the "Get" could be replaced by "Child" and "Parent", respectively,
they'd sound more meaningful, imho.

> I also fixed a bug where
> the child to parent map was not being initialised when on conflict
> transition capture was required. I added a test which was crashing the
> backend but fixed the code so it works correctly.

Oops, I guess you mean my omission of checking if
mtstate->mt_oc_transition_capture is non-NULL in ExecInitRoutingInfo.

Thanks for fixing it and adding the test case.

> I also got rid of
> the child_parent_map_not_required array since we now no longer need
> it.  The code now always initialises the maps in cases where they're
> going to be required.

Yes, thought I had removed the field in my patch, but looks like I had
just removed the comment about it.

> I've attached a v3 version of your patch and also v6 of the main patch
> which includes the v3 patch.

I've looked at v6 and spotted some minor typos.

+     * ResultRelInfo for, before we go making one, we check for a
pre-made one

s/making/make/g

+    /* If nobody else set the per-subplan array of maps, do so ouselves. */

I guess I'm the one to blame here for misspelling "ourselves".


Since the above two are minor issues, fixed them myself in the attached
updated version; didn't touch the macro though.

Do you agree to setting this patch to "Ready for Committer" in the
September CF?

Thanks,
Amit


Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> +#define PartitionTupRoutingGetToParentMap(p, i) \
> +#define PartitionTupRoutingGetToChildMap(p, i) \
>
> If the "Get" could be replaced by "Child" and "Parent", respectively,
> they'd sound more meaningful, imho.

I did that to save 3 chars.  I think putting the additional
Child/Parent in the name is not really required. It's not as if we're
going to have a ParentToParent or a ChildToChild map, so I thought it
might be okay to assume that if it's "ToParent", that it's being
converted from the child and "ToChild" seems safe to assume it's being
converted from the parent. I can change it though if you feel very
strongly that what I've got is no good.

>> I also fixed a bug where
>> the child to parent map was not being initialised when on conflict
>> transition capture was required. I added a test which was crashing the
>> backend but fixed the code so it works correctly.
>
> Oops, I guess you mean my omission of checking if
> mtstate->mt_oc_transition_capture is non-NULL in ExecInitRoutingInfo.

Yeah.

> I've looked at v6 and spotted some minor typos.
>
> +     * ResultRelInfo for, before we go making one, we check for a
> pre-made one
>
> s/making/make/g

I disagree, but perhaps we can just reword it a bit. I've now got:

+ * Every time a tuple is routed to a partition that we've yet to set the
+ * ResultRelInfo for, before we go to the trouble of making one, we check
+ * for a pre-made one in the hash table.

> +    /* If nobody else set the per-subplan array of maps, do so ouselves. */
>
> I guess I'm the one to blame here for misspelling "ourselves".

Thanks for noticing.

> Since the above two are minor issues, fixed them myself in the attached
> updated version; didn't touch the macro though.

I've attached a v8. The only change from your v7 is in the "go making" comment.

> Do you agree to setting this patch to "Ready for Committer" in the
> September CF?

I read through the entire patch a couple of times yesterday and saw
nothing else, so yeah, I think now is a good time for someone with
more authority to have a look at it.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/08/22 21:30, David Rowley wrote:
> On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> +#define PartitionTupRoutingGetToParentMap(p, i) \
>> +#define PartitionTupRoutingGetToChildMap(p, i) \
>>
>> If the "Get" could be replaced by "Child" and "Parent", respectively,
>> they'd sound more meaningful, imho.
> 
> I did that to save 3 chars.  I think putting the additional
> Child/Parent in the name is not really required. It's not as if we're
> going to have a ParentToParent or a ChildToChild map, so I thought it
> might be okay to assume that if it's "ToParent", that it's being
> converted from the child and "ToChild" seems safe to assume it's being
> converted from the parent. I can change it though if you feel very
> strongly that what I've got is no good.

No strong preference as such. Maybe, let's defer to committer.

>> I've looked at v6 and spotted some minor typos.
>>
>> +     * ResultRelInfo for, before we go making one, we check for a
>> pre-made one
>>
>> s/making/make/g
> 
> I disagree, but perhaps we can just reword it a bit. I've now got:
> 
> + * Every time a tuple is routed to a partition that we've yet to set the
> + * ResultRelInfo for, before we go to the trouble of making one, we check
> + * for a pre-made one in the hash table.

Sure.  I guess "to the trouble of" was missing then. :)

>> +    /* If nobody else set the per-subplan array of maps, do so ouselves. */
>>
>> I guess I'm the one to blame here for misspelling "ourselves".
> 
> Thanks for noticing.
> 
>> Since the above two are minor issues, fixed them myself in the attached
>> updated version; didn't touch the macro though.
> 
> I've attached a v8. The only change from your v7 is in the "go making" comment.

Thanks.

>> Do you agree to setting this patch to "Ready for Committer" in the
>> September CF?
> 
> I read through the entire patch a couple of times yesterday and saw
> nothing else, so yeah, I think now is a good time for someone with
> more authority to have a look at it.

Okay, doing it now.

Thanks,
Amit



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 23 August 2018 at 00:30, David Rowley <david.rowley@2ndquadrant.com> wrote:
I've attached a v8. The only change from your v7 is in the "go making" comment.

v9 patch attached. Fixes conflict with 6b78231d.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 17 September 2018 at 21:15, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> v9 patch attached. Fixes conflict with 6b78231d.

v10 patch attached. Fixes conflict with cc2905e9.

I'm not so sure we need to zero the partition_tuple_slots[] array at
all since we always set a value there is there's a corresponding map
stored. I considered pulling that out, but in the end, I didn't as I
saw some Asserts checking it's been properly set by checking the
element  != NULL in nodeModifyTable.c and copy.c.  Perhaps I should
have just gotten rid of those Asserts along with the palloc0 and
subsequent memset after the expansion of the array. I'm undecided.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
Hi David,

On 2018/10/05 21:55, David Rowley wrote:
> On 17 September 2018 at 21:15, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> v9 patch attached. Fixes conflict with 6b78231d.
> 
> v10 patch attached. Fixes conflict with cc2905e9.

Thanks for rebasing.

> I'm not so sure we need to zero the partition_tuple_slots[] array at
> all since we always set a value there is there's a corresponding map
> stored. I considered pulling that out, but in the end, I didn't as I
> saw some Asserts checking it's been properly set by checking the
> element  != NULL in nodeModifyTable.c and copy.c.  Perhaps I should
> have just gotten rid of those Asserts along with the palloc0 and
> subsequent memset after the expansion of the array. I'm undecided.

Maybe it's a good thing that it's doing the same thing as with the
child_to_parent_maps array, which is to zero-init it when allocated.

Thanks,
Amit




Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
Hi David,

On 2018/10/05 21:55, David Rowley wrote:
> On 17 September 2018 at 21:15, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> v9 patch attached. Fixes conflict with 6b78231d.
> 
> v10 patch attached. Fixes conflict with cc2905e9.

Thanks for rebasing.

> I'm not so sure we need to zero the partition_tuple_slots[] array at
> all since we always set a value there is there's a corresponding map
> stored. I considered pulling that out, but in the end, I didn't as I
> saw some Asserts checking it's been properly set by checking the
> element  != NULL in nodeModifyTable.c and copy.c.  Perhaps I should
> have just gotten rid of those Asserts along with the palloc0 and
> subsequent memset after the expansion of the array. I'm undecided.

Maybe it's a good thing that it's doing the same thing as with the
child_to_parent_maps array, which is to zero-init it when allocated.

Thanks,
Amit



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 9 October 2018 at 15:49, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/10/05 21:55, David Rowley wrote:
>> I'm not so sure we need to zero the partition_tuple_slots[] array at
>> all since we always set a value there is there's a corresponding map
>> stored. I considered pulling that out, but in the end, I didn't as I
>> saw some Asserts checking it's been properly set by checking the
>> element  != NULL in nodeModifyTable.c and copy.c.  Perhaps I should
>> have just gotten rid of those Asserts along with the palloc0 and
>> subsequent memset after the expansion of the array. I'm undecided.
>
> Maybe it's a good thing that it's doing the same thing as with the
> child_to_parent_maps array, which is to zero-init it when allocated.

Perhaps, but the maps do need to be zeroed, the partition_tuple_slots
array does not since we only access it when the parent to child map is
set.

In any case, since PARTITION_ROUTING_INITSIZE is just 8, it'll likely
not save much since it's really just saving a memset(..., 0, 64), for
single-row INSERTs on a 64-bit machine. So likely it won't save more
than a bunch of nanoseconds.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Krzysztof Nienartowicz
Date:
We see quite prohibitive 5-6x slowdown with native partitioning on in
comparison to trigger based in PG9.5.
This is clearly visible with highly parallel inserts (Can share
flamediagrams comparing the two). 

This basically excludes native partitioning from being used by us. Do you
think your changes could be backported to PG10? - we checked and this would
need quite a number of changes but given the weight of this change maybe it
could be considered?

Thanks
Krzysztof



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
Hi,

On 2018/10/15 19:04, Krzysztof Nienartowicz wrote:
> 
> We see quite prohibitive 5-6x slowdown with native partitioning on in
> comparison to trigger based in PG9.5.
> This is clearly visible with highly parallel inserts (Can share
> flamediagrams comparing the two). 
> 
> This basically excludes native partitioning from being used by us. Do you
> think your changes could be backported to PG10? - we checked and this would
> need quite a number of changes but given the weight of this change maybe it
> could be considered?

It's unfortunate that PG 10's partitioning cannot be used for your use
case, but I don't think such a major refactoring will be back-ported to 10
or 11. :-(

Thanks,
Amit



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 15 October 2018 at 23:04, Krzysztof Nienartowicz
<krzysztof.nienartowicz@gmail.com> wrote:
> We see quite prohibitive 5-6x slowdown with native partitioning on in
> comparison to trigger based in PG9.5.
> This is clearly visible with highly parallel inserts (Can share
> flamediagrams comparing the two).

Does the 0001 patch here fix the problem? I imagined that it would be
the locking of all partitions that would have killed the performance.

> This basically excludes native partitioning from being used by us. Do you
> think your changes could be backported to PG10? - we checked and this would
> need quite a number of changes but given the weight of this change maybe it
> could be considered?

It's very unlikely to happen, especially so with the 0002 patch, which
I've so far just attached as a demonstration of where the performance
could end up.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Krzysztof Nienartowicz
Date:
In the end we hacked the code to re-enable triggers on partitioned
tables and switch off native insert code on partitioned tables. Quite
hackish and would be nice to have it fixed in a more natural manner.
Yes, it looked like locking but not only -
ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors
looked like dominant and also convert_tuples_by_name but not sure if
the last one was not an artifact of perf sampling.
Will check the patch 0001, thanks.


On Tue, Oct 23, 2018 at 12:36 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On 15 October 2018 at 23:04, Krzysztof Nienartowicz
> <krzysztof.nienartowicz@gmail.com> wrote:
> > We see quite prohibitive 5-6x slowdown with native partitioning on in
> > comparison to trigger based in PG9.5.
> > This is clearly visible with highly parallel inserts (Can share
> > flamediagrams comparing the two).
>
> Does the 0001 patch here fix the problem? I imagined that it would be
> the locking of all partitions that would have killed the performance.
>
> > This basically excludes native partitioning from being used by us. Do you
> > think your changes could be backported to PG10? - we checked and this would
> > need quite a number of changes but given the weight of this change maybe it
> > could be considered?
>
> It's very unlikely to happen, especially so with the 0002 patch, which
> I've so far just attached as a demonstration of where the performance
> could end up.
>
> --
>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 23 October 2018 at 11:55, Krzysztof Nienartowicz
<krzysztof.nienartowicz@gmail.com> wrote:
> In the end we hacked the code to re-enable triggers on partitioned
> tables and switch off native insert code on partitioned tables. Quite
> hackish and would be nice to have it fixed in a more natural manner.
> Yes, it looked like locking but not only -
> ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors
> looked like dominant and also convert_tuples_by_name but not sure if
> the last one was not an artifact of perf sampling.

The ExecOpenIndices was likely fixed in edd44738bc8 (PG11).
find_all_inheritors does obtain the partition locks during the call,
so the slowness there most likely down to the locking rather than the
scanning of pg_inherits.

42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12).

> Will check the patch 0001, thanks.

I more meant that it might be 0002 that fixes the issue for you. I
just wanted to check if you'd tried 0001 and found that the problem
was fixed with that alone.

Do you mind sharing how many partitions you have and how many columns
the partitioned table has?


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Krzysztof Nienartowicz
Date:
On Tue, Oct 23, 2018 at 4:02 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On 23 October 2018 at 11:55, Krzysztof Nienartowicz
> <krzysztof.nienartowicz@gmail.com> wrote:
> > In the end we hacked the code to re-enable triggers on partitioned
> > tables and switch off native insert code on partitioned tables. Quite
> > hackish and would be nice to have it fixed in a more natural manner.
> > Yes, it looked like locking but not only -
> > ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors
> > looked like dominant and also convert_tuples_by_name but not sure if
> > the last one was not an artifact of perf sampling.
>
> The ExecOpenIndices was likely fixed in edd44738bc8 (PG11).
> find_all_inheritors does obtain the partition locks during the call,
> so the slowness there most likely down to the locking rather than the
> scanning of pg_inherits.
>
> 42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12).
>
> > Will check the patch 0001, thanks.
>
> I more meant that it might be 0002 that fixes the issue for you. I
> just wanted to check if you'd tried 0001 and found that the problem
> was fixed with that alone.

Will it apply on PG10? (In fact the code base is PG XL10 but
src/backend/executor/nodeModifyTable.c is pure PG)

>
> Do you mind sharing how many partitions you have and how many columns
> the partitioned table has?

We have 2 level partitioning: 10 (possibly changing, up to say 20-30)
range partitions at first level and 20 range at the second level. We
have potentially hundreds processes inserting at the same time.

>
>
> --
>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Krzysztof Nienartowicz
Date:
To complement the info: number of columns varies from 20 to 100 but
some of the columns are composite types or arrays of composite types.

The flamegraph after applying changes from patch 0002 can be seen
here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP
shows most of the time is spent in the

convert_tuples_by_name (PG10 version).

Thanks
On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz
<krzysztof.nienartowicz@gmail.com> wrote:
>
> On Tue, Oct 23, 2018 at 4:02 AM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> >
> > On 23 October 2018 at 11:55, Krzysztof Nienartowicz
> > <krzysztof.nienartowicz@gmail.com> wrote:
> > > In the end we hacked the code to re-enable triggers on partitioned
> > > tables and switch off native insert code on partitioned tables. Quite
> > > hackish and would be nice to have it fixed in a more natural manner.
> > > Yes, it looked like locking but not only -
> > > ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors
> > > looked like dominant and also convert_tuples_by_name but not sure if
> > > the last one was not an artifact of perf sampling.
> >
> > The ExecOpenIndices was likely fixed in edd44738bc8 (PG11).
> > find_all_inheritors does obtain the partition locks during the call,
> > so the slowness there most likely down to the locking rather than the
> > scanning of pg_inherits.
> >
> > 42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12).
> >
> > > Will check the patch 0001, thanks.
> >
> > I more meant that it might be 0002 that fixes the issue for you. I
> > just wanted to check if you'd tried 0001 and found that the problem
> > was fixed with that alone.
>
> Will it apply on PG10? (In fact the code base is PG XL10 but
> src/backend/executor/nodeModifyTable.c is pure PG)
>
> >
> > Do you mind sharing how many partitions you have and how many columns
> > the partitioned table has?
>
> We have 2 level partitioning: 10 (possibly changing, up to say 20-30)
> range partitions at first level and 20 range at the second level. We
> have potentially hundreds processes inserting at the same time.
>
> >
> >
> > --
> >  David Rowley                   http://www.2ndQuadrant.com/
> >  PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/10/30 8:41, Krzysztof Nienartowicz wrote:
> On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz
> <krzysztof.nienartowicz@gmail.com> wrote:
>> On Tue, Oct 23, 2018 at 4:02 AM David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>>>
>>> I more meant that it might be 0002 that fixes the issue for you. I
>>> just wanted to check if you'd tried 0001 and found that the problem
>>> was fixed with that alone.
>>
>> Will it apply on PG10? (In fact the code base is PG XL10 but
>> src/backend/executor/nodeModifyTable.c is pure PG)
>
> To complement the info: number of columns varies from 20 to 100 but
> some of the columns are composite types or arrays of composite types.
> 
> The flamegraph after applying changes from patch 0002 can be seen
> here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP
> shows most of the time is spent in the
> 
> convert_tuples_by_name (PG10 version).

As David mentioned, the patches on this thread are meant to be applied
against latest PG 12 HEAD.  The insert tuple routing code has undergone
quite a bit of refactoring in PG 11, which itself should have gotten rid
of at least some of the hot-spots that are seen in the flame graph you shared.

What happens in PG 10 (as seen in the flame graph) is that
ExecSetupPartitionTupleRouting initializes information for *all*
partitions, which happens even before the 1st tuple processed.  So if
there are many partitions and with many columns, a lot of processing
happens in ExecSetupPartitionTupleRouting.  PG 11 changes it such that the
partition info is only initialized after the 1st tuple is processed and
only of the partition that's targeted, but some overheads still remain in
that code.  The patches on this thread are meant to address those overheads.

Unfortunately, I don't think the community will agree to back-porting the
changes in PG 11 and the patches being discussed here to PG 10.

Thanks,
Amit



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Krzysztof Nienartowicz
Date:
Thanks for both clarifications!
I skimmed through the commits related to Inserts with partitioning
since 10 and indeed - while not impossible it seems like quite some
work to merge them into PG 10 codebase.
We might consider preparing the patch in-house as otherwise PG 10
based partitioning is a major regression and we'd have to go back to
inheritance based one - which seems the best option for now.
Regards,
Krzysztof


On Tue, Oct 30, 2018 at 1:54 AM Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2018/10/30 8:41, Krzysztof Nienartowicz wrote:
> > On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz
> > <krzysztof.nienartowicz@gmail.com> wrote:
> >> On Tue, Oct 23, 2018 at 4:02 AM David Rowley
> >> <david.rowley@2ndquadrant.com> wrote:
> >>>
> >>> I more meant that it might be 0002 that fixes the issue for you. I
> >>> just wanted to check if you'd tried 0001 and found that the problem
> >>> was fixed with that alone.
> >>
> >> Will it apply on PG10? (In fact the code base is PG XL10 but
> >> src/backend/executor/nodeModifyTable.c is pure PG)
> >
> > To complement the info: number of columns varies from 20 to 100 but
> > some of the columns are composite types or arrays of composite types.
> >
> > The flamegraph after applying changes from patch 0002 can be seen
> > here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP
> > shows most of the time is spent in the
> >
> > convert_tuples_by_name (PG10 version).
>
> As David mentioned, the patches on this thread are meant to be applied
> against latest PG 12 HEAD.  The insert tuple routing code has undergone
> quite a bit of refactoring in PG 11, which itself should have gotten rid
> of at least some of the hot-spots that are seen in the flame graph you shared.
>
> What happens in PG 10 (as seen in the flame graph) is that
> ExecSetupPartitionTupleRouting initializes information for *all*
> partitions, which happens even before the 1st tuple processed.  So if
> there are many partitions and with many columns, a lot of processing
> happens in ExecSetupPartitionTupleRouting.  PG 11 changes it such that the
> partition info is only initialized after the 1st tuple is processed and
> only of the partition that's targeted, but some overheads still remain in
> that code.  The patches on this thread are meant to address those overheads.
>
> Unfortunately, I don't think the community will agree to back-porting the
> changes in PG 11 and the patches being discussed here to PG 10.
>
> Thanks,
> Amit
>


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Robert Haas
Date:
On Wed, Aug 22, 2018 at 8:30 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > +#define PartitionTupRoutingGetToParentMap(p, i) \
> > +#define PartitionTupRoutingGetToChildMap(p, i) \
> >
> > If the "Get" could be replaced by "Child" and "Parent", respectively,
> > they'd sound more meaningful, imho.
>
> I did that to save 3 chars.  I think putting the additional
> Child/Parent in the name is not really required. It's not as if we're
> going to have a ParentToParent or a ChildToChild map, so I thought it
> might be okay to assume that if it's "ToParent", that it's being
> converted from the child and "ToChild" seems safe to assume it's being
> converted from the parent. I can change it though if you feel very
> strongly that what I've got is no good.

I'm not sure exactly what is best here, but it seems to unlikely to me
that somebody is going to read that macro name and think, oh, that
means "get the to-parent map".  They are more like be confused by the
juxtaposition of "get" and "to".

I think a better way to shorten the name would be to truncate the
PartitionTupRouting() prefix in some way, e.g. dropping TupRouting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 1 November 2018 at 06:45, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 22, 2018 at 8:30 AM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> > +#define PartitionTupRoutingGetToParentMap(p, i) \
>> > +#define PartitionTupRoutingGetToChildMap(p, i) \
>> >
>> > If the "Get" could be replaced by "Child" and "Parent", respectively,
>> > they'd sound more meaningful, imho.
>>
>> I did that to save 3 chars.  I think putting the additional
>> Child/Parent in the name is not really required. It's not as if we're
>> going to have a ParentToParent or a ChildToChild map, so I thought it
>> might be okay to assume that if it's "ToParent", that it's being
>> converted from the child and "ToChild" seems safe to assume it's being
>> converted from the parent. I can change it though if you feel very
>> strongly that what I've got is no good.
>
> I'm not sure exactly what is best here, but it seems to unlikely to me
> that somebody is going to read that macro name and think, oh, that
> means "get the to-parent map".  They are more like be confused by the
> juxtaposition of "get" and "to".
>
> I think a better way to shorten the name would be to truncate the
> PartitionTupRouting() prefix in some way, e.g. dropping TupRouting.

Thanks for chipping in on this.

I agree. I don't think "TupRouting" really needs to be in the name.
Probably "To" can also just become "2" and we can put back the
Parent/Child before that.

I've attached v11, which does this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/11/01 8:58, David Rowley wrote:
> On 1 November 2018 at 06:45, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Aug 22, 2018 at 8:30 AM David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>>> On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> +#define PartitionTupRoutingGetToParentMap(p, i) \
>>>> +#define PartitionTupRoutingGetToChildMap(p, i) \
>>>>
>>>> If the "Get" could be replaced by "Child" and "Parent", respectively,
>>>> they'd sound more meaningful, imho.
>>>
>>> I did that to save 3 chars.  I think putting the additional
>>> Child/Parent in the name is not really required. It's not as if we're
>>> going to have a ParentToParent or a ChildToChild map, so I thought it
>>> might be okay to assume that if it's "ToParent", that it's being
>>> converted from the child and "ToChild" seems safe to assume it's being
>>> converted from the parent. I can change it though if you feel very
>>> strongly that what I've got is no good.
>>
>> I'm not sure exactly what is best here, but it seems to unlikely to me
>> that somebody is going to read that macro name and think, oh, that
>> means "get the to-parent map".  They are more like be confused by the
>> juxtaposition of "get" and "to".
>>
>> I think a better way to shorten the name would be to truncate the
>> PartitionTupRouting() prefix in some way, e.g. dropping TupRouting.
> 
> Thanks for chipping in on this.
> 
> I agree. I don't think "TupRouting" really needs to be in the name.
> Probably "To" can also just become "2" and we can put back the
> Parent/Child before that.

Agree that "TupRouting" can go, but "To" is not too long for using "2"
instead of it.

Thanks,
Amit



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 1 November 2018 at 13:35, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/11/01 8:58, David Rowley wrote:
>> On 1 November 2018 at 06:45, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think a better way to shorten the name would be to truncate the
>>> PartitionTupRouting() prefix in some way, e.g. dropping TupRouting.
>>
>> Thanks for chipping in on this.
>>
>> I agree. I don't think "TupRouting" really needs to be in the name.
>> Probably "To" can also just become "2" and we can put back the
>> Parent/Child before that.
>
> Agree that "TupRouting" can go, but "To" is not too long for using "2"
> instead of it.

Okay. Here's a version with "2" put back to "To"...

It's great to know the patch is now so perfect that we've only the
macro naming left to debate ;-)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Gavin Flower
Date:
On 01/11/2018 14:30, David Rowley wrote:
> On 1 November 2018 at 13:35, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/11/01 8:58, David Rowley wrote:
>>>
[...]
>>> I agree. I don't think "TupRouting" really needs to be in the name.
>>> Probably "To" can also just become "2" and we can put back the
>>> Parent/Child before that.
>> Agree that "TupRouting" can go, but "To" is not too long for using "2"
>> instead of it.
I think that while '2' may only be one character less than 'to', the 
character '2' stands out more.  However, can't say I could claim this 
was of the utmost importance!
> Okay. Here's a version with "2" put back to "To"...

[...]



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/11/01 10:30, David Rowley wrote:
> It's great to know the patch is now so perfect that we've only the
> macro naming left to debate ;-)

I looked over v12 again and noticed a couple minor issues.

+ *              table then we store the index into parenting
+ *              PartitionTupleRouting 'partition_dispatch_info' array.  An

s/PartitionTupleRouting/PartitionTupleRouting's/g

Also, I got a bit concerned about "parenting".  Does it mean something
like "enclosing", because the PartitionDispatch is a member of
PartitionTupleRouting?  I got concerned because using "parent" like this
may be confusing as this is the partitioning code we're talking about,
where "parent" is generally used to mean "parent" table.

+     * the partitioned table that's the target of the command.  If we must
+     * route tuple via some sub-partitioned table, then the PartitionDispatch
+     * for those is only built the first time it's required.

... via some sub-partitioned table"s"

Or perhaps rewrite a bit as:

If we must route the tuple via some sub-partitioned table, then its
PartitionDispatch is built the first time it's required.


The macro naming discussion got me thinking today about the macro itself.
It encapsulates access to the various PartitionTupleRouting arrays
containing the maps, but maybe we've got the interface of tuple routing a
bit (maybe a lot given this thread!) wrong to begin with.  Instead of
ExecFindPartition returning indexes into various PartitionTupleRouting
arrays and its callers then using those indexes to fetch various objects
from those arrays, why doesn't it return those objects itself?  Although
we made sure that the callers don't need to worry about the meaning of
these indexes changing with this patch, it still seems a bit odd for them
to have to go back to those arrays to get various objects.

How about we change ExecFindPartition's interface so that it returns the
ResultRelInfo, the two maps, and the partition slot?  So, the arrays
simply become a cache for ExecFindPartition et al and are no longer
accessed outside execPartition.c.  Although that makes the interface of
ExecFindPartition longer, I think it reduces overall complexity.

I've implemented that in the attached patch
1-revise-ExecFindPartition-interface.patch.

Also, since all members of PartitionTupleRouting are only accessed within
execPartition.c save root_tuple_slot, we can move it to execPartition.c to
make its internals private, after doing something about root_tuple_slot.
Looking at the code related to root_tuple_slot, it seems the field really
belongs in ModifyTableState, because it got nothing to do with routing.
Attached 2-make-PartitionTupleRouting-private.patch does that.

These patches 1 and 2 apply on top of v12-0001.. patch.

Thanks,
Amit


Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 1 November 2018 at 22:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/11/01 10:30, David Rowley wrote:
>> It's great to know the patch is now so perfect that we've only the
>> macro naming left to debate ;-)
>
> I looked over v12 again and noticed a couple minor issues.
>
> + *              table then we store the index into parenting
> + *              PartitionTupleRouting 'partition_dispatch_info' array.  An
>
> s/PartitionTupleRouting/PartitionTupleRouting's/g
>
> Also, I got a bit concerned about "parenting".  Does it mean something
> like "enclosing", because the PartitionDispatch is a member of
> PartitionTupleRouting?  I got concerned because using "parent" like this
> may be confusing as this is the partitioning code we're talking about,
> where "parent" is generally used to mean "parent" table.
>
> +     * the partitioned table that's the target of the command.  If we must
> +     * route tuple via some sub-partitioned table, then the PartitionDispatch
> +     * for those is only built the first time it's required.
>
> ... via some sub-partitioned table"s"
>
> Or perhaps rewrite a bit as:
>
> If we must route the tuple via some sub-partitioned table, then its
> PartitionDispatch is built the first time it's required.

I've attached v13 which hopefully addresses these.

> The macro naming discussion got me thinking today about the macro itself.
> It encapsulates access to the various PartitionTupleRouting arrays
> containing the maps, but maybe we've got the interface of tuple routing a
> bit (maybe a lot given this thread!) wrong to begin with.  Instead of
> ExecFindPartition returning indexes into various PartitionTupleRouting
> arrays and its callers then using those indexes to fetch various objects
> from those arrays, why doesn't it return those objects itself?  Although
> we made sure that the callers don't need to worry about the meaning of
> these indexes changing with this patch, it still seems a bit odd for them
> to have to go back to those arrays to get various objects.
>
> How about we change ExecFindPartition's interface so that it returns the
> ResultRelInfo, the two maps, and the partition slot?  So, the arrays
> simply become a cache for ExecFindPartition et al and are no longer
> accessed outside execPartition.c.  Although that makes the interface of
> ExecFindPartition longer, I think it reduces overall complexity.

I don't really think stuffing values into a bunch of output parameters
is much of an improvement. I'd rather allow callers to fetch what they
need using the index we return. Most callers don't need to know about
the child to parent maps, so it seems nicer for those places not to
have to invent a dummy variable to pass along to ExecFindPartition()
so it can needlessly populate it for them.

Perhaps a better design would be to instead of having random special
partitioned-table-only fields in ResultRelInfo, just have an extra
struct there that contains the extra partition information which would
include the translation maps and then just return the ResultRelInfo
and allow callers to lookup any extra details they require. I've not
looked into this in detail, but I think the committer work that's
required for the patch as it is today is already quite significant.
I'm not keen on warding any willing one off by making the commit job
any harder.  I agree that it would be good to stabilise the API for
all this partitioning code sometime, but I don't believe it needs to
be done all in one commit. My intent here is to improve performance or
INSERT and UPDATE on partitioned tables. Perhaps we can shape some API
redesign later in the release cycle. What do you think?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/11/04 19:07, David Rowley wrote:
> On 1 November 2018 at 22:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've attached v13 which hopefully addresses these.

Thank you for updating the patch.

>> The macro naming discussion got me thinking today about the macro itself.
>> It encapsulates access to the various PartitionTupleRouting arrays
>> containing the maps, but maybe we've got the interface of tuple routing a
>> bit (maybe a lot given this thread!) wrong to begin with.  Instead of
>> ExecFindPartition returning indexes into various PartitionTupleRouting
>> arrays and its callers then using those indexes to fetch various objects
>> from those arrays, why doesn't it return those objects itself?  Although
>> we made sure that the callers don't need to worry about the meaning of
>> these indexes changing with this patch, it still seems a bit odd for them
>> to have to go back to those arrays to get various objects.
>>
>> How about we change ExecFindPartition's interface so that it returns the
>> ResultRelInfo, the two maps, and the partition slot?  So, the arrays
>> simply become a cache for ExecFindPartition et al and are no longer
>> accessed outside execPartition.c.  Although that makes the interface of
>> ExecFindPartition longer, I think it reduces overall complexity.
> 
> I don't really think stuffing values into a bunch of output parameters
> is much of an improvement. I'd rather allow callers to fetch what they
> need using the index we return. Most callers don't need to know about
> the child to parent maps, so it seems nicer for those places not to
> have to invent a dummy variable to pass along to ExecFindPartition()
> so it can needlessly populate it for them.

Well, if a caller finds a partition using ExecFindPartition, it's going to
need to fetch those other objects anyway.  Both of its callers that exist
today, CopyFrom and ExecPrepareTupleRouting, fetch both maps and the slot
in the same code block as ExecFindPartition.

> Perhaps a better design would be to instead of having random special
> partitioned-table-only fields in ResultRelInfo, just have an extra
> struct there that contains the extra partition information which would
> include the translation maps and then just return the ResultRelInfo
> and allow callers to lookup any extra details they require.

IIUC, you're saying that we could introduce a new struct that contains
auxiliary information needed by partition pruning (maps, slot, etc. for
tuple conversion) and add a new ResultRelInfo member of that struct type.
That way, there is no need to return them separately or return an index to
access them from their arrays.  I guess we won't even need the arrays we
have now.  I think that might be a good idea and simplifies things
significantly.

It reminds me of how ResultRelInfo grew a ri_onConflict member of type
OnConflictSetState [1].  We decided to go that way, as opposed to the
earlier approach of having arrays of num_partitions length in
ModifyTableState or PartitionTupleRouting that contained ON CONFLICT
related objects for individual partitions.

> I've not
> looked into this in detail, but I think the committer work that's
> required for the patch as it is today is already quite significant.
> I'm not keen on warding any willing one off by making the commit job
> any harder.  I agree that it would be good to stabilise the API for
> all this partitioning code sometime, but I don't believe it needs to
> be done all in one commit. My intent here is to improve performance or
> INSERT and UPDATE on partitioned tables. Perhaps we can shape some API
> redesign later in the release cycle. What do you think?

I do suspect that simplifying ExecFindPartition's interface as part of
patch will make a committer's life easier, as the resulting interface is
simpler, especially if we revise it like you suggest above.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=555ee77a9



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Jesper Pedersen
Date:
On 11/4/18 5:07 AM, David Rowley wrote:
> I've attached v13 which hopefully addresses these.
> 

I ran a test against the INSERT case using a 64 hash partition.

Non-partitioned table: ~73k TPS
Master: ~25k TPS
0001: ~26k TPS
0001 + 0002: ~68k TPS

The profile of 0001 shows that almost all of 
ExecSetupPartitionTupleRouting() is find_all_inheritors(), hence the 
last test with a rebase of the original v1-0002 patch.

Best regards,
  Jesper


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 5 November 2018 at 20:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/11/04 19:07, David Rowley wrote:
>> Perhaps a better design would be to instead of having random special
>> partitioned-table-only fields in ResultRelInfo, just have an extra
>> struct there that contains the extra partition information which would
>> include the translation maps and then just return the ResultRelInfo
>> and allow callers to lookup any extra details they require.
>
> IIUC, you're saying that we could introduce a new struct that contains
> auxiliary information needed by partition pruning (maps, slot, etc. for
> tuple conversion) and add a new ResultRelInfo member of that struct type.
> That way, there is no need to return them separately or return an index to
> access them from their arrays.  I guess we won't even need the arrays we
> have now.  I think that might be a good idea and simplifies things
> significantly.

I've attached a patch which does this.  It adds a new struct named
PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays
out of PartitionTupleRouting. There are fields for each of what these
arrays used to store inside the PartitionRoutingInfo struct.

While doing this I kept glancing back over at ModifyTableState and at
the mt_per_subplan_tupconv_maps array. I wondered if it would be
better to make the PartitionRoutingInfo a bit more generic, perhaps
call it TupleConversionInfo and have fields like ti_ToGeneric and
ti_FromGeneric, with the idea that "generic" would be the root
partition or the first subplan for inheritance updates. This would
allow us to get rid of a good chunk of code inside nodeModifyTable.c.
However, I ended up not doing this and left PartitionRoutingInfo to be
specifically for Root to Partition conversion.

Also, on the topic about what to name the conversion maps from a few
days ago; After looking at this a bit more I decided that having them
named ParentToChild and ChildToParent is misleading.  If the child is
the child of some sub-partitioned table then the parent that the map
is talking about is not the partition's parent, but the root
partitioned table. So really RootToPartition and PartitionToRoot seem
like much more accurate names for the maps.

I made a few other changes along the way; I thought that
ExecFindPartition() would be a good candidate to take on the
responsibility of validating the partition is valid for INSERTs when
it uses a partition out of the subplan_resultrel_hash. I thought it
was better to check this once when we're in the code path of grabbing
the ResultRelInfo out that hash table rather than in a code path that
must check if it's been done already each time we route a tuple into a
partition. It also allowed me to get rid of
ri_PartitionReadyForRouting. I also moved the call to
ExecInitRoutingInfo() into ExecFindPartition() which allowed me to
make that function static, which could result in the generation of
slightly more optimal compiled code.

Please find attached the v14 patch.

Rather nicely git --stat reports a net negative additional code (with
the 48 lines of new tests included)

 11 files changed, 632 insertions(+), 660 deletions(-)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Jesper Pedersen
Date:
Hi David,

On 11/7/18 6:46 AM, David Rowley wrote:
> I've attached a patch which does this.  It adds a new struct named
> PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays
> out of PartitionTupleRouting. There are fields for each of what these
> arrays used to store inside the PartitionRoutingInfo struct.
> 
> While doing this I kept glancing back over at ModifyTableState and at
> the mt_per_subplan_tupconv_maps array. I wondered if it would be
> better to make the PartitionRoutingInfo a bit more generic, perhaps
> call it TupleConversionInfo and have fields like ti_ToGeneric and
> ti_FromGeneric, with the idea that "generic" would be the root
> partition or the first subplan for inheritance updates. This would
> allow us to get rid of a good chunk of code inside nodeModifyTable.c.
> However, I ended up not doing this and left PartitionRoutingInfo to be
> specifically for Root to Partition conversion.
> 

Yeah, it doesn't necessarily have to be part of this patch.

> Also, on the topic about what to name the conversion maps from a few
> days ago; After looking at this a bit more I decided that having them
> named ParentToChild and ChildToParent is misleading.  If the child is
> the child of some sub-partitioned table then the parent that the map
> is talking about is not the partition's parent, but the root
> partitioned table. So really RootToPartition and PartitionToRoot seem
> like much more accurate names for the maps.
> 

Agreed.

> I made a few other changes along the way; I thought that
> ExecFindPartition() would be a good candidate to take on the
> responsibility of validating the partition is valid for INSERTs when
> it uses a partition out of the subplan_resultrel_hash. I thought it
> was better to check this once when we're in the code path of grabbing
> the ResultRelInfo out that hash table rather than in a code path that
> must check if it's been done already each time we route a tuple into a
> partition. It also allowed me to get rid of
> ri_PartitionReadyForRouting. I also moved the call to
> ExecInitRoutingInfo() into ExecFindPartition() which allowed me to
> make that function static, which could result in the generation of
> slightly more optimal compiled code.
> 
> Please find attached the v14 patch.
> 

Passes check-world, and has detailed documentation about the changes :)

Best regards,
  Jesper


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/11/07 20:46, David Rowley wrote:
> On 5 November 2018 at 20:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/11/04 19:07, David Rowley wrote:
>>> Perhaps a better design would be to instead of having random special
>>> partitioned-table-only fields in ResultRelInfo, just have an extra
>>> struct there that contains the extra partition information which would
>>> include the translation maps and then just return the ResultRelInfo
>>> and allow callers to lookup any extra details they require.
>>
>> IIUC, you're saying that we could introduce a new struct that contains
>> auxiliary information needed by partition pruning (maps, slot, etc. for
>> tuple conversion) and add a new ResultRelInfo member of that struct type.
>> That way, there is no need to return them separately or return an index to
>> access them from their arrays.  I guess we won't even need the arrays we
>> have now.  I think that might be a good idea and simplifies things
>> significantly.
> 
> I've attached a patch which does this.

Thank you for updating the patch this way.

> It adds a new struct named
> PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays
> out of PartitionTupleRouting. There are fields for each of what these
> arrays used to store inside the PartitionRoutingInfo struct.
> 
> While doing this I kept glancing back over at ModifyTableState and at
> the mt_per_subplan_tupconv_maps array. I wondered if it would be
> better to make the PartitionRoutingInfo a bit more generic, perhaps
> call it TupleConversionInfo and have fields like ti_ToGeneric and
> ti_FromGeneric, with the idea that "generic" would be the root
> partition or the first subplan for inheritance updates. This would
> allow us to get rid of a good chunk of code inside nodeModifyTable.c.
> However, I ended up not doing this and left PartitionRoutingInfo to be
> specifically for Root to Partition conversion.

I think it's good that you left mt_per_subplan_tupconv_maps out of this.
UPDATE tuple routing can be said to have two steps: first step, a tiny
one, converts the tuple that needs to be routed from the source
partition's rowtype to the root's rowtype so that tuple routing proper can
begin, and second is the actual tuple routing carried out using
PartitionTupleRouting.  The first step is handled by nodeModifyTable.c and
so any data structures related to it should be in ModifyTableState.

Actually, as I also proposed upthread, we should move root_tuple_slot from
PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because
it's part of the first step described above that has nothing to do with
partition tuple routing proper.  We can make PartitionTupleRouting private
to execPartition.c if we do that.

> Also, on the topic about what to name the conversion maps from a few
> days ago; After looking at this a bit more I decided that having them
> named ParentToChild and ChildToParent is misleading.  If the child is
> the child of some sub-partitioned table then the parent that the map
> is talking about is not the partition's parent, but the root
> partitioned table. So really RootToPartition and PartitionToRoot seem
> like much more accurate names for the maps.

Makes sense. :)

> I made a few other changes along the way; I thought that
> ExecFindPartition() would be a good candidate to take on the
> responsibility of validating the partition is valid for INSERTs when
> it uses a partition out of the subplan_resultrel_hash. I thought it
> was better to check this once when we're in the code path of grabbing
> the ResultRelInfo out that hash table rather than in a code path that
> must check if it's been done already each time we route a tuple into a
> partition. It also allowed me to get rid of
> ri_PartitionReadyForRouting.

Ah, I too had tried to remove ri_PartitionReadyForRouting, but had to give
up on that idea because I didn't think of moving steps that are needed to
be performed before setting it to true to that block of code in
ExecFindPartition.

> I also moved the call to
> ExecInitRoutingInfo() into ExecFindPartition() which allowed me to
> make that function static, which could result in the generation of
> slightly more optimal compiled code.
> 
> Please find attached the v14 patch.
> 
> Rather nicely git --stat reports a net negative additional code (with
> the 48 lines of new tests included)
> 
>  11 files changed, 632 insertions(+), 660 deletions(-)

That's nice!

I didn't find anything quite significant to complain about, except just
one line:

+     * Initially we must only setup 1 PartitionDispatch object; the one for

setup -> set up

Thanks,
Amit



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 8 November 2018 at 20:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Actually, as I also proposed upthread, we should move root_tuple_slot from
> PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because
> it's part of the first step described above that has nothing to do with
> partition tuple routing proper.  We can make PartitionTupleRouting private
> to execPartition.c if we do that.

okay. Makes sense. I've changed things around to PartitionTupleRouting
is now private to execPartition.c

> I didn't find anything quite significant to complain about, except just
> one line:
>
> +     * Initially we must only setup 1 PartitionDispatch object; the one for
>
> setup -> set up

Changed too.

I've attached v15 and a delta from v14 to ease re-review.

I also ran pgindent on this version. That's not part of the delta but
is in the main patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Robert Haas
Date:
On Thu, Nov 8, 2018 at 6:28 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I've attached v15 and a delta from v14 to ease re-review.
>
> I also ran pgindent on this version. That's not part of the delta but
> is in the main patch.

Did you notice http://postgr.es/m/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03
?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 9 November 2018 at 00:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached v15 and a delta from v14 to ease re-review.

I just revived the 0002 patch, which is still just for testing at this
stage. There was mention over on [1] about removing the
find_all_inheritors() call.

Also some benchmarks from v14 with 0001+0002.

Setup:

DROP TABLE hashp;
CREATE TABLE hashp (a INT) PARTITION BY HASH (a);
SELECT 'CREATE TABLE hashp'||x::Text || ' PARTITION OF hashp FOR
VALUES WITH (modulus 10000, remainder ' || x::text || ');' from
generate_Series(0,9999) x;
\gexec

(0 partitions is a non-partitioned table)

fsync=off

Partitions Patched Unpatched
0 23672 23785
10 22794 18385
100 22392 8541
1000 22419 1159
10000 22195 101

[1] https://www.postgresql.org/message-id/CA+TgmoZGJsy-nRFnzurhZQJtHdDh5fzJKvbvhS0byN6_46pB9Q@mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/11/08 20:28, David Rowley wrote:
> On 8 November 2018 at 20:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Actually, as I also proposed upthread, we should move root_tuple_slot from
>> PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because
>> it's part of the first step described above that has nothing to do with
>> partition tuple routing proper.  We can make PartitionTupleRouting private
>> to execPartition.c if we do that.
> 
> okay. Makes sense. I've changed things around to PartitionTupleRouting
> is now private to execPartition.c

Thank you.  I have a comment regarding how you chose to make
PartitionTupleRouting private.

Using the v14_to_v15 diff, I could quickly see that there are many diffs
changing PartitionTupleRouting to struct PartitionTupleRouting, but they
would be unnecessary if you had added the following in execPartition.h, as
my upthread had done.

-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
 typedef struct PartitionDispatchData *PartitionDispatch;
+typedef struct PartitionTupleRouting PartitionTupleRouting;

Thanks,
Amit




Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On 9 November 2018 at 19:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I have a comment regarding how you chose to make
> PartitionTupleRouting private.
>
> Using the v14_to_v15 diff, I could quickly see that there are many diffs
> changing PartitionTupleRouting to struct PartitionTupleRouting, but they
> would be unnecessary if you had added the following in execPartition.h, as
> my upthread had done.
>
> -/* See execPartition.c for the definition. */
> +/* See execPartition.c for the definitions. */
>  typedef struct PartitionDispatchData *PartitionDispatch;
> +typedef struct PartitionTupleRouting PartitionTupleRouting;

Okay, done that way. v16 attached.

The 0002 patch is included again, this time with a new proposed commit
message.  There was some discussion over on [1] where nobody seemed to
have any concerns about delaying the locking until we route the first
tuple to the partition.

[1] https://www.postgresql.org/message-id/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Jesper Pedersen
Date:
Hi,

On 11/12/18 6:17 PM, David Rowley wrote:
> On 9 November 2018 at 19:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I have a comment regarding how you chose to make
>> PartitionTupleRouting private.
>>
>> Using the v14_to_v15 diff, I could quickly see that there are many diffs
>> changing PartitionTupleRouting to struct PartitionTupleRouting, but they
>> would be unnecessary if you had added the following in execPartition.h, as
>> my upthread had done.
>>
>> -/* See execPartition.c for the definition. */
>> +/* See execPartition.c for the definitions. */
>>   typedef struct PartitionDispatchData *PartitionDispatch;
>> +typedef struct PartitionTupleRouting PartitionTupleRouting;
> 
> Okay, done that way. v16 attached.
> 

Still passes check-world.

Best regards,
  Jesper



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/11/14 0:32, Jesper Pedersen wrote:
> Hi,
> 
> On 11/12/18 6:17 PM, David Rowley wrote:
>> On 9 November 2018 at 19:18, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I have a comment regarding how you chose to make
>>> PartitionTupleRouting private.
>>>
>>> Using the v14_to_v15 diff, I could quickly see that there are many diffs
>>> changing PartitionTupleRouting to struct PartitionTupleRouting, but they
>>> would be unnecessary if you had added the following in execPartition.h, as
>>> my upthread had done.
>>>
>>> -/* See execPartition.c for the definition. */
>>> +/* See execPartition.c for the definitions. */
>>>   typedef struct PartitionDispatchData *PartitionDispatch;
>>> +typedef struct PartitionTupleRouting PartitionTupleRouting;
>>
>> Okay, done that way. v16 attached.

Thank you.

> Still passes check-world.

I looked at v16 and noticed a few typos:

+  * partition_dispatch_info Array of 'dispatch_allocsize' elements containing
+  *                         a pointer to a PartitionDispatch objects for

a PartitionDispatch objects -> a PartitionDispatch object

+  * partitions              Array of 'partitions_allocsize' elements
+  *                         containing pointers to a ResultRelInfos of all
+  *                         leaf partitions touched by tuple routing.

a ResultRelInfos -> ResultRelInfos

+     * PartitionDispatch and ResultRelInfo pointers the 'partitions' array.

"in" the 'partitions' array.

+                        /* Setup the PartitionRoutingInfo for it */

Setup -> Set up

+ *      Ensure there's enough space in the 'partitions' array of 'proute'

+ *      and store it in the next empty slot in proute's partitions array.


Not a typo, but maybe just write proute->partitions instead of "partitions
array of proute" and "proute's partition array".

+ *      the next available slot in the 'proute' partition_dispatch_info[]
+ *      array.  Also, record the index into this array in the 'parent_pd'

Similarly, here: proute->partition_dipatch_info array

+ *      array.  Also, record the index into this array in the 'parent_pd'
+ *      indexes[] array in the partidx element so that we can properly

Similarly, parent_pd->indexes array

+    if (dispatchidx >= proute->dispatch_allocsize)
+    {
+        /* Expand allocated space. */
+        proute->dispatch_allocsize *= 2;
+        proute->partition_dispatch_info = (PartitionDispatchData **)
+            repalloc(proute->partition_dispatch_info,
+                     sizeof(PartitionDispatchData *) *
+                     proute->dispatch_allocsize);
+    }

Sorry, I forgot to point this out before, but can this code in
ExecInitPartitionDispatchInfo be accommodated in
ExecCheckPartitionArraySpace() for consistency?

Thanks,
Amit



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
Thanks for looking at this again.

On 14 November 2018 at 13:47, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> +    if (dispatchidx >= proute->dispatch_allocsize)
> +    {
> +        /* Expand allocated space. */
> +        proute->dispatch_allocsize *= 2;
> +        proute->partition_dispatch_info = (PartitionDispatchData **)
> +            repalloc(proute->partition_dispatch_info,
> +                     sizeof(PartitionDispatchData *) *
> +                     proute->dispatch_allocsize);
> +    }
>
> Sorry, I forgot to point this out before, but can this code in
> ExecInitPartitionDispatchInfo be accommodated in
> ExecCheckPartitionArraySpace() for consistency?

I don't really want to put that code in ExecCheckPartitionArraySpace()
as the way the function is now, it makes quite a lot of sense for the
compiler to inline it. If we add redundant work in there, then it
makes less sense.  There's never any need to check both arrays at once
as we're only adding the new item to one array at a time.

Instead, I've written a new function named
ExecCheckDispatchArraySpace() and put the resize code inside that.

I've fixed the typos you mentioned. The only other thing I changed was
to only allocate the PartitionDispatch->tupslot if a conversion is
required.  The previous code allocated this regardless if it was going
to be used or not.  This saves both the redundant allocation and also
very slightly reduces the cost of the if test in ExecFindPartition().
There's now no need to check if the map ! NULL as if the slot is there
then the map must exist too.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
Thanks for updating the patch.

On 2018/11/14 13:16, David Rowley wrote:
> Thanks for looking at this again.
> 
> On 14 November 2018 at 13:47, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> +    if (dispatchidx >= proute->dispatch_allocsize)
>> +    {
>> +        /* Expand allocated space. */
>> +        proute->dispatch_allocsize *= 2;
>> +        proute->partition_dispatch_info = (PartitionDispatchData **)
>> +            repalloc(proute->partition_dispatch_info,
>> +                     sizeof(PartitionDispatchData *) *
>> +                     proute->dispatch_allocsize);
>> +    }
>>
>> Sorry, I forgot to point this out before, but can this code in
>> ExecInitPartitionDispatchInfo be accommodated in
>> ExecCheckPartitionArraySpace() for consistency?
> 
> I don't really want to put that code in ExecCheckPartitionArraySpace()
> as the way the function is now, it makes quite a lot of sense for the
> compiler to inline it. If we add redundant work in there, then it
> makes less sense.  There's never any need to check both arrays at once
> as we're only adding the new item to one array at a time.
> 
> Instead, I've written a new function named
> ExecCheckDispatchArraySpace() and put the resize code inside that.

Okay, seems fine.

> I've fixed the typos you mentioned. The only other thing I changed was
> to only allocate the PartitionDispatch->tupslot if a conversion is
> required.  The previous code allocated this regardless if it was going
> to be used or not.  This saves both the redundant allocation and also
> very slightly reduces the cost of the if test in ExecFindPartition().
> There's now no need to check if the map ! NULL as if the slot is there

Also makes sense.

Although it seems that Alvaro has already started at looking at this, I'll
mark the CF entry as Ready for Committer anyway, because I don't have any
more comments. :)

Thanks,
Amit



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Alvaro Herrera
Date:
What's with this comment?

     * Initially we must only set up 1 PartitionDispatch object; the one for
     * the partitioned table that's the target of the command.  If we must
     * route a tuple via some sub-partitioned table, then its
     * PartitionDispatch is only built the first time it's required.

You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
odds with the '1' mentioned in the comment.  Which is wrong?

(I have a few edits on the patch, so please don't send a full v18 -- a
delta patch would be welcome, if you have further changes to propose.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
Thanks for picking this up.

On 15 November 2018 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> What's with this comment?
>
>          * Initially we must only set up 1 PartitionDispatch object; the one for
>          * the partitioned table that's the target of the command.  If we must
>          * route a tuple via some sub-partitioned table, then its
>          * PartitionDispatch is only built the first time it's required.
>
> You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
> odds with the '1' mentioned in the comment.  Which is wrong?

I don't think either is wrong, but I guess something must be
misleading, so could perhaps be improved.

We're simply allocating enough space for PARTITION_ROUTING_INITSIZE
but we're only initialising 1 item. That leaves space for
PARTITION_ROUTING_INITSIZE - 1 more items before we'd need to
reallocate the array.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Alvaro Herrera
Date:
On 2018-Nov-15, David Rowley wrote:

> On 15 November 2018 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > What's with this comment?
> >
> >          * Initially we must only set up 1 PartitionDispatch object; the one for
> >          * the partitioned table that's the target of the command.  If we must
> >          * route a tuple via some sub-partitioned table, then its
> >          * PartitionDispatch is only built the first time it's required.
> >
> > You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
> > odds with the '1' mentioned in the comment.  Which is wrong?
> 
> I don't think either is wrong, but I guess something must be
> misleading, so could perhaps be improved.

Ah, that makes sense.  Yeah, it seems a bit misleading to me.  No
worries.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On 2018/11/15 8:58, Alvaro Herrera wrote:
> On 2018-Nov-15, David Rowley wrote:
> 
>> On 15 November 2018 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> What's with this comment?
>>>
>>>          * Initially we must only set up 1 PartitionDispatch object; the one for
>>>          * the partitioned table that's the target of the command.  If we must
>>>          * route a tuple via some sub-partitioned table, then its
>>>          * PartitionDispatch is only built the first time it's required.
>>>
>>> You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
>>> odds with the '1' mentioned in the comment.  Which is wrong?
>>
>> I don't think either is wrong, but I guess something must be
>> misleading, so could perhaps be improved.
> 
> Ah, that makes sense.  Yeah, it seems a bit misleading to me.  No
> worries.

Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or
PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size?

Thanks,
Amit



Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Alvaro Herrera
Date:
On 2018-Nov-15, Amit Langote wrote:

> Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or
> PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size?

Here's a proposed delta on v17 0001.  Most importantly, I noticed that
the hashed subplans stuff didn't actually work, because the hash API was
not being used correctly.  So the search in the hash would never return
a hit, and we'd create RRIs for those partitions again.  To fix this, I
added a new struct to hold hash entries.

I think this merits that the performance tests be redone.  (Unless I
misunderstand, this shouldn't change the performance of INSERT, only
that of UPDATE.)

On the subject of the ArraySpace routines, I decided to drop them and
instead do the re-allocations on the places where they were needed.
In the original code there were two places for the partitions array, but
both did the same thing so it made sense to create a separate routine to
do it instead (ExecRememberPartitionRel), and do the allocation there.
Just out of custom I moved the palloc to appear at the same place as the
repalloc, and after doing so it became obvious that we were
over-allocating memory for the PartitionDispatchData pointer --
allocating the size for the whole struct instead of just the pointer.

(I renamed the "allocsize" struct members to "max", as is customary.)

I added CHECK_FOR_INTERRUPTS to the ExecFindPartition loop.  It
shouldn't be useful if the code is correct, but if there are bugs it's
better to be able to interrupt infinite loops :-)

I reindented the comment atop PartitionTupleRouting.  The other way was
just too unwieldy.

Let me know what you think.  Regression tests still pass for me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
On Fri, Nov 16, 2018 at 11:40 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> On 2018-Nov-15, Amit Langote wrote:
>
> > Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or
> > PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size?
>
> Here's a proposed delta on v17 0001.  Most importantly, I noticed that
> the hashed subplans stuff didn't actually work, because the hash API was
> not being used correctly.  So the search in the hash would never return
> a hit, and we'd create RRIs for those partitions again.  To fix this, I
> added a new struct to hold hash entries.

I'm a bit surprised that you found that the hash table didn't work,
because I remember having checked by attaching gdb that it works when
I was hacking on my own delta patch, but I may have been looking at
too many things.

> I think this merits that the performance tests be redone.  (Unless I
> misunderstand, this shouldn't change the performance of INSERT, only
> that of UPDATE.)

Actually, I don't remember seeing performance tests done with UPDATEs
on this thread.

Since we don't needlessly scan *all* subplan result rels anymore,
maybe this removes a good deal of overhead for UPDATEs that update
partition key.

> On the subject of the ArraySpace routines, I decided to drop them and
> instead do the re-allocations on the places where they were needed.
> In the original code there were two places for the partitions array, but
> both did the same thing so it made sense to create a separate routine to
> do it instead (ExecRememberPartitionRel), and do the allocation there.
> Just out of custom I moved the palloc to appear at the same place as the
> repalloc, and after doing so it became obvious that we were
> over-allocating memory for the PartitionDispatchData pointer --
> allocating the size for the whole struct instead of just the pointer.
>
> (I renamed the "allocsize" struct members to "max", as is customary.)

These changes look good to me.

> I added CHECK_FOR_INTERRUPTS to the ExecFindPartition loop.  It
> shouldn't be useful if the code is correct, but if there are bugs it's
> better to be able to interrupt infinite loops :-)

Good measure. :)

> I reindented the comment atop PartitionTupleRouting.  The other way was
> just too unwieldy.
>
> Let me know what you think.  Regression tests still pass for me.

Overall, it looks good to me.

Thanks,
Amit


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Alvaro Herrera
Date:
One thing I don't quite like is the inconsistency in handling memory
context switches in the various function allocating stuff.  It seems
rather haphazard.  I'd rather have a memcxt member in
PartitionTupleRouting, which is set when the struct is created, and then
have all the other functions allocating stuff use that one.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Alvaro Herrera
Date:
On 2018-Nov-16, Alvaro Herrera wrote:

> One thing I don't quite like is the inconsistency in handling memory
> context switches in the various function allocating stuff.  It seems
> rather haphazard.  I'd rather have a memcxt member in
> PartitionTupleRouting, which is set when the struct is created, and then
> have all the other functions allocating stuff use that one.

So while researching this I finally realized that there was a "lexical
disconnect" between setting a ResultRelInfo's ri_PartitionInfo
struct/pointer and adding it to the PartitionTupleRoute arrays.
However, if you think about it, these things are one and the same, so we
don't need to do them separately; just merge the new function I wrote
into the existing ExecInitRoutingInfo().  Patch attached.

(This version also rebases across Andres' recent conflicting
TupleTableSlot changes.)

I'll now see about the commit message and push shortly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Alvaro Herrera
Date:
I repeated David's original tests not terribly rigorously[*] and got
these numbers:

* Unpatched:    72.396196
* 0001     :    77.279404
* 0001+0002: 20409.415094
*      0002:   816.606613
* control  : 22969.140596 (insertion into unpartitioned table)

So while this patch by itself gives a pretty lame increase in tps, it
removes bottlenecks that will appear once we change the locking scheme.

[*] On my laptop, running each test only once for 60s.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Alvaro Herrera
Date:
On 2018-Nov-13, David Rowley wrote:

> The 0002 patch is included again, this time with a new proposed commit
> message.  There was some discussion over on [1] where nobody seemed to
> have any concerns about delaying the locking until we route the first
> tuple to the partition.

Please get a new commitfest entry for this patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On Sat, 17 Nov 2018 at 04:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I'll now see about the commit message and push shortly.

Many thanks for making the required adjustments and pushing this.

If I wasn't on leave late last week and early this week then the only
thing I'd have mentioned was the lack of empty comment line in the
header comment for PartitionDispatchData. It looks a bit messy
without.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Alvaro Herrera
Date:
On 2018-Nov-21, David Rowley wrote:

> If I wasn't on leave late last week and early this week then the only
> thing I'd have mentioned was the lack of empty comment line in the
> header comment for PartitionDispatchData. It looks a bit messy
> without.

Absolutely.  Pushed a few newlines -- I hope I understood you correctly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On Thu, 22 Nov 2018 at 07:06, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2018-Nov-21, David Rowley wrote:
> > If I wasn't on leave late last week and early this week then the only
> > thing I'd have mentioned was the lack of empty comment line in the
> > header comment for PartitionDispatchData. It looks a bit messy
> > without.
>
> Absolutely.  Pushed a few newlines -- I hope I understood you correctly.

Thanks, you did. That looks better now.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Amit Langote
Date:
Hi,

On Thu, Nov 22, 2018 at 7:25 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Thu, 22 Nov 2018 at 07:06, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > On 2018-Nov-21, David Rowley wrote:
> > > If I wasn't on leave late last week and early this week then the only
> > > thing I'd have mentioned was the lack of empty comment line in the
> > > header comment for PartitionDispatchData. It looks a bit messy
> > > without.
> >
> > Absolutely.  Pushed a few newlines -- I hope I understood you correctly.
>
> Thanks, you did. That looks better now.

I noticed that there's a "be" missing in the comment above
ExecFindPartition.  Fixed in the attached.

Thanks,
Amit

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
Michael Paquier
Date:
On Thu, Nov 22, 2018 at 11:32:04AM +0900, Amit Langote wrote:
> I noticed that there's a "be" missing in the comment above
> ExecFindPartition.  Fixed in the attached.

Thanks Amit, I have committed this one.
--
Michael

Attachment

Re: Speeding up INSERTs and UPDATEs to partitioned tables

From
David Rowley
Date:
On Sat, 17 Nov 2018 at 07:28, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > The 0002 patch is included again, this time with a new proposed commit
> > message.  There was some discussion over on [1] where nobody seemed to
> > have any concerns about delaying the locking until we route the first
> > tuple to the partition.
>
> Please get a new commitfest entry for this patch.

Added to Jan-fest in: https://commitfest.postgresql.org/21/1887/

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services