Thread: Speeding up INSERTs and UPDATEs to partitioned tables
Hi,

As part of my efforts to make partitioning scale better for larger numbers of partitions, I've been looking primarily at INSERT VALUES performance. Here the overheads are almost completely in the executor. Planning of this type of statement is very simple since there is no FROM clause to process.

My benchmarks have been around a RANGE partitioned table with 10k leaf partitions and no sub-partitioned tables. The partition key is a timestamp column.

I've found that ExecSetupPartitionTupleRouting() is very slow indeed and there are a number of things slow about it. The biggest culprit for the slowness is the locking of each partition inside of find_all_inheritors(). For now, this needs to remain as we must hold locks on each partition while performing RelationBuildPartitionDesc(), otherwise, one of the partitions may get dropped out from under us. There might be other valid reasons too, but please see my note at the bottom of this email.

The locking is not the only slow thing. I found the following to also be slow:

1. RelationGetPartitionDispatchInfo uses a List and lappend() must perform a palloc() each time a partition is added to the list.
2. A foreach loop is performed over leaf_parts to search for subplans belonging to this partition. This seems pointless to do for INSERTs as there's never any to find.
3. ExecCleanupTupleRouting() loops through the entire partitions array. If a single tuple was inserted then all but one of the elements will be NULL.
4. Tuple conversion map allocates an empty array thinking there might be something to put into it. This is costly when the array is large and pointless when there are no maps to store.
5. During get_partition_dispatch_recurse(), get_rel_relkind() is called to determine if the partition is a partitioned table or a leaf partition. This results in a slow relcache hashtable lookup.
6. get_partition_dispatch_recurse() also ends up just building the indexes array with a sequence of numbers from 0 to nparts - 1 if there are no sub-partitioned tables. Doing this is slow when there are many partitions.

Besides the locking, the only thing that remains slow now is the palloc0() for the 'partitions' array. In my test, it takes 0.6% of execution time. I don't see any pretty ways to fix that.

I've written fixes for items 1-6 above. I did:

1. Use an array instead of a List.
2. Don't do this loop. palloc0() the partitions array instead. Let UPDATE add whatever subplans exist to the zeroed array.
3. Track what we initialize in a gapless array and clean up just those ones. Make this array small and increase it only when we need more space.
4. Only allocate the map array when we need to store a map.
5. Work that out in relcache beforehand.
6. Ditto.

The only questionable thing I see is what I did for 6. In partcache.c I'm basically building an array of nparts containing 0 to nparts - 1. It seems a bit pointless, so perhaps there's a better way. I was also a bit too tight to memcpy() that out of relcache, and just pointed directly to it. That might be a no-go area.

I've attached 2 patches:

0001: implements items 1-6
0002: Is not intended for commit. It's just a demo of where we could get the performance if we were smarter about locking partitions. I've just included this to show 0001's worth.
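As a rough, standalone illustration of the idea behind fix 3 (schematic C only; all names here are invented and this is not the patch's actual code): alongside the sparse per-partition array, keep a second, gapless list of the entries that were actually initialized, so cleanup visits only those rather than scanning all 10k slots.

#include <stdlib.h>

typedef struct PartInfo { int partidx; } PartInfo;  /* stand-in for ResultRelInfo */

typedef struct Routing
{
    PartInfo **partitions;      /* sparse: palloc0()-style, NULL until used */
    PartInfo **initialized;     /* gapless: only the entries actually built */
    int        num_initialized;
} Routing;

Routing *
routing_create(int nparts)
{
    Routing *r = calloc(1, sizeof(Routing));

    r->partitions = calloc(nparts, sizeof(PartInfo *));
    /* capacity nparts here for simplicity; the patch starts small and grows */
    r->initialized = malloc(nparts * sizeof(PartInfo *));
    return r;
}

PartInfo *
get_partition(Routing *r, int partidx)
{
    if (r->partitions[partidx] == NULL)
    {
        PartInfo *pi = calloc(1, sizeof(PartInfo));

        pi->partidx = partidx;
        r->partitions[partidx] = pi;
        /* remember it so cleanup doesn't have to scan partitions[] */
        r->initialized[r->num_initialized++] = pi;
    }
    return r->partitions[partidx];
}

void
routing_cleanup(Routing *r)
{
    /* visit only what was used, not every element of partitions[] */
    for (int i = 0; i < r->num_initialized; i++)
        free(r->initialized[i]);
    free(r->initialized);
    free(r->partitions);
    free(r);
}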
Performance
AWS: m5d.large
fsync=off

Unpatched:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 2836
latency average = 21.162 ms
tps = 47.254409 (including connections establishing)
tps = 47.255756 (excluding connections establishing)

(yes, it's bad)

0001:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 3235
latency average = 18.548 ms
tps = 53.913121 (including connections establishing)
tps = 53.914629 (excluding connections establishing)

(a small improvement from 0001)

0001+0002:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 660079
latency average = 0.091 ms
tps = 11001.303764 (including connections establishing)
tps = 11001.602377 (excluding connections establishing)

(something to aspire towards)

0002 (only):

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 27682
latency average = 2.168 ms
tps = 461.350885 (including connections establishing)
tps = 461.363327 (excluding connections establishing)

(shows that doing 0002 alone does not fix all our problems)

Unpartitioned table (control test):

$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 801260
latency average = 0.075 ms
tps = 13354.311397 (including connections establishing)
tps = 13354.656163 (excluding connections establishing)

Test setup:

CREATE TABLE partbench_ (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2 INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL);

CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2 INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION BY RANGE (date);

\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || ' hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text || ' hours')::interval || ''');' from generate_Series(0,9999) x;
\gexec
\o

partbench_insert.sql contains:
insert into partbench values('2018-04-26 15:00:00',1,2,3,4,5);

partbench__insert.sql contains:
insert into partbench_ values('2018-04-26 15:00:00',1,2,3,4,5);

I don't want to discuss the locking on this thread. That discussion will detract from discussing what I'm proposing here... Which is not to change anything relating to locks. I'm still working on that and will post elsewhere. Please start another thread if you'd like to discuss that in the meantime. Feel free to link it in here so others can follow.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 22 June 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

I've added this to the July 'fest: https://commitfest.postgresql.org/18/1690/

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

I tried to benchmark with v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch, but when I create the second partition, the server process gets a segmentation fault. I don't know the cause, but it seems that an incorrect value is set in partdesc->boundinfo.

(gdb) p partdesc->boundinfo[0]
$6 = {strategy = 0 '\000', ndatums = 2139062142, datums = 0x7f7f7f7f7f7f7f7f, kind = 0x7f7f7f7f7f7f7f7f, indexes = 0x7f7f7f7f7f7f7f7f, null_index = 2139062143, default_index = 2139062143}

$ psql postgres
psql (11beta2)
Type "help" for help.

postgres=# create table a(i int) partition by range(i);
CREATE TABLE
postgres=# create table a_1 partition of a for values from(1) to (200);
CREATE TABLE
postgres=# create table a_2 partition of a for values from(200) to (400);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

2018-07-05 14:02:52.405 JST [60250] LOG:  server process (PID 60272) was terminated by signal 11: Segmentation fault
2018-07-05 14:02:52.405 JST [60250] DETAIL:  Failed process was running: create table a_2 partition of a for values from(200) to (400);

(gdb) bt
#0  0x0000000000596e52 in get_default_oid_from_partdesc (partdesc=0x259e928) at partition.c:269
#1  0x0000000000677355 in DefineRelation (stmt=0x259e610, relkind=114 'r', ownerId=10, typaddress=0x0, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);") at tablecmds.c:832
#2  0x00000000008b6893 in ProcessUtilitySlow (pstate=0x259e4f8, pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:1000
#3  0x00000000008b66c2 in standard_ProcessUtility (pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:920
#4  0x00000000008b583b in ProcessUtility (pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:360
#5  0x00000000008b482c in PortalRunUtility (portal=0x253af38, pstmt=0x24d67d8, isTopLevel=true, setHoldSnapshot=false, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:1178
#6  0x00000000008b4a45 in PortalRunMulti (portal=0x253af38, isTopLevel=true, setHoldSnapshot=false, dest=0x24d6ac8, altdest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:1324
#7  0x00000000008b3f7d in PortalRun (portal=0x253af38, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x24d6ac8, altdest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:799
#8  0x00000000008adf16 in exec_simple_query (query_string=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);") at postgres.c:1122
#9  0x00000000008b21a5 in PostgresMain (argc=1, argv=0x24ff5b0, dbname=0x24ff410 "postgres", username=0x24d2358 "symfo") at postgres.c:4153
#10 0x00000000008113f4 in BackendRun (port=0x24f73f0) at postmaster.c:4361
#11 0x0000000000810b67 in BackendStartup (port=0x24f73f0) at postmaster.c:4033
#12 0x000000000080d0ed in ServerLoop () at postmaster.c:1706
#13 0x000000000080c9a3 in PostmasterMain (argc=1, argv=0x24d0310) at postmaster.c:1379
#14 0x0000000000737392 in main (argc=1, argv=0x24d0310) at main.c:228

(gdb) disassemble
Dump of assembler code for function get_default_oid_from_partdesc:
   0x0000000000596e0a <+0>:     push   %rbp
   0x0000000000596e0b <+1>:     mov    %rsp,%rbp
   0x0000000000596e0e <+4>:     mov    %rdi,-0x8(%rbp)
   0x0000000000596e12 <+8>:     cmpq   $0x0,-0x8(%rbp)
   0x0000000000596e17 <+13>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x0000000000596e19 <+15>:    mov    -0x8(%rbp),%rax
   0x0000000000596e1d <+19>:    mov    0x10(%rax),%rax
   0x0000000000596e21 <+23>:    test   %rax,%rax
   0x0000000000596e24 <+26>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x0000000000596e26 <+28>:    mov    -0x8(%rbp),%rax
   0x0000000000596e2a <+32>:    mov    0x10(%rax),%rax
   0x0000000000596e2e <+36>:    mov    0x24(%rax),%eax
   0x0000000000596e31 <+39>:    cmp    $0xffffffff,%eax
   0x0000000000596e34 <+42>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x0000000000596e36 <+44>:    mov    -0x8(%rbp),%rax
   0x0000000000596e3a <+48>:    mov    0x8(%rax),%rdx
   0x0000000000596e3e <+52>:    mov    -0x8(%rbp),%rax
   0x0000000000596e42 <+56>:    mov    0x10(%rax),%rax
   0x0000000000596e46 <+60>:    mov    0x24(%rax),%eax
   0x0000000000596e49 <+63>:    cltq
   0x0000000000596e4b <+65>:    shl    $0x2,%rax
   0x0000000000596e4f <+69>:    add    %rdx,%rax
=> 0x0000000000596e52 <+72>:    mov    (%rax),%eax
   0x0000000000596e54 <+74>:    jmp    0x596e5b <get_default_oid_from_partdesc+81>
   0x0000000000596e56 <+76>:    mov    $0x0,%eax
   0x0000000000596e5b <+81>:    pop    %rbp
   0x0000000000596e5c <+82>:    retq
End of assembler dump.

(gdb) i r
rax            0x20057e77c      8595695484
rbx            0x72     114
rcx            0x7f50ce90e0e8   139985039712488
rdx            0x259e980        39446912
rsi            0x7f50ce90e0a8   139985039712424
rdi            0x259e928        39446824
rbp            0x7ffc05931890   0x7ffc05931890
rsp            0x7ffc05931890   0x7ffc05931890
r8             0x7ffc059317bf   140720402012095
r9             0x0      0
r10            0x6b     107
r11            0x7f50cdbc3f10   139985025777424
r12            0x70     112
r13            0x0      0
r14            0x0      0
r15            0x0      0
rip            0x596e52 0x596e52 <get_default_oid_from_partdesc+72>
eflags         0x10202  [ IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

(gdb) list *0x596e52
0x596e52 is in get_default_oid_from_partdesc (partition.c:269).
264     Oid
265     get_default_oid_from_partdesc(PartitionDesc partdesc)
266     {
267         if (partdesc && partdesc->boundinfo &&
268             partition_bound_has_default(partdesc->boundinfo))
269             return partdesc->oids[partdesc->boundinfo->default_index];
270
271         return InvalidOid;
272     }
273

regards,

-----Original Message-----
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
Sent: Saturday, June 23, 2018 7:19 AM
To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 22 June 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

I've added this to the July 'fest: https://commitfest.postgresql.org/18/1690/

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 5 July 2018 at 18:39, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> postgres=# create table a(i int) partition by range(i);
> CREATE TABLE
> postgres=# create table a_1 partition of a for values from(1) to (200);
> CREATE TABLE
> postgres=# create table a_2 partition of a for values from(200) to (400);
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Hi,

Thanks for testing. I'm unable to reproduce this on beta2 or master as of f61988d16.

Did you try make clean then building again? The 0001 patch does change PartitionDescData, so if you've not rebuilt all .c files which use that then that might explain your crash.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi David,

On 06/22/2018 02:28 AM, David Rowley wrote:
> I've attached 2 patches:
>
> 0001: implements items 1-6
> 0002: Is not intended for commit. It's just a demo of where we could get
> the performance if we were smarter about locking partitions. I've just
> included this to show 0001's worth.
>

I did some tests with a 64 hash partition setup, and see a speedup for INSERT / UPDATE scenarios.

> I don't want to discuss the locking on this thread. That discussion
> will detract from discussing what I'm proposing here... Which is not
> to change anything relating to locks. I'm still working on that and
> will post elsewhere.

With 0002 INSERTs get close to the TPS of the non-partitioned case. However, UPDATEs don't see the same speedup. But, as you said, a discussion for another thread.

Best regards,
Jesper
On 6 July 2018 at 01:18, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> With 0002 INSERTs get close to the TPS of the non-partitioned case. However,
> UPDATEs don't see the same speedup. But, as you said, a discussion for
> another thread.

Hi Jesper,

Thanks for testing this out. It was only really the locking I didn't want to discuss here, due to the risk of the discussion of removing the other overheads getting lost in discussions about locking.

It's most likely that the UPDATE is slower due to the planner still being quite slow when dealing with partitioned tables. It still builds RangeTblEntry and RelOptInfo objects for each partition even if the partition is pruned. With INSERT with a VALUES clause, the planner does not build these objects; in fact, the planner barely does any work at all, so this can be sped up just by removing the executor overheads.

(I do have other WIP patches to speed up the planner. After delaying building the RelOptInfo and RangeTblEntry, with my 10k partition setup, the planner SELECT became 1600 times faster. UPDATE did not finish in the unpatched version, so gains there are harder to measure. There's still much work to do on these patches, and there's still more performance to squeeze out too. Hopefully, I'll be discussing this on another thread soon.)

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Thanks David!

I ran a benchmark with pgbench, and see a speedup for INSERT / UPDATE scenarios. I used range partitioning. Benchmark results are as follows.

1. 11beta2 result

 part_num |   tps_ex   | latency_avg | update_latency | select_latency | insert_latency
----------+------------+-------------+----------------+----------------+----------------
      100 | 479.456278 |       2.086 |          1.382 |          0.365 |          0.168
      200 | 169.155411 |       5.912 |          4.628 |          0.737 |          0.299
      400 |  24.857495 |       40.23 |         36.606 |          2.252 |          0.881
      800 |   6.718104 |     148.853 |        141.471 |          5.253 |          1.433
     1600 |   1.934908 |     516.825 |        489.982 |         21.102 |          3.701
     3200 |   0.456967 |    2188.362 |       2101.247 |         72.784 |          8.833
     6400 |   0.116643 |    8573.224 |        8286.79 |        257.904 |         14.949

2. 11beta2 + patch1 + patch2

patch1: Allow direct lookups of AppendRelInfo by child relid
        commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch

 part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
      200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
      400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
      800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
     1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
     3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
     6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

benchmark script:

\set aid random(1, 100 * 1)
\set delta random(-5000, 5000)
BEGIN;
UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM test.accounts WHERE aid = :aid;
INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
END;

The partition key is aid.

-----Original Message-----
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
Sent: Thursday, July 05, 2018 6:19 PM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 5 July 2018 at 18:39, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> postgres=# create table a(i int) partition by range(i); CREATE TABLE
> postgres=# create table a_1 partition of a for values from(1) to
> (200); CREATE TABLE postgres=# create table a_2 partition of a for
> values from(200) to (400); server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Hi,

Thanks for testing. I'm unable to reproduce this on beta2 or master as of f61988d16.

Did you try make clean then building again? The 0001 patch does change PartitionDescData, so if you've not rebuilt all .c files which use that then that might explain your crash.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6 July 2018 at 21:25, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> 2. 11beta2 + patch1 + patch2
>
> patch1: Allow direct lookups of AppendRelInfo by child relid
>         commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
>
> part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> ----------+-------------+-------------+----------------+----------------+----------------
>      100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
>      200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
>      400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
>      800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
>     1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
>     3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
>     6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

Hi,

Thanks a lot for benchmarking this.

Just a note to say that the "Allow direct lookups of AppendRelInfo by child relid" patch is already in master. It's much more relevant to be testing with master than pg11. This patch is not intended for pg11.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018-Jul-11, David Rowley wrote:

> On 6 July 2018 at 21:25, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> > 2. 11beta2 + patch1 + patch2
> >
> > patch1: Allow direct lookups of AppendRelInfo by child relid
> >         commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
> >
> > part_num | tps_ex | latency_avg | update_latency | select_latency | insert_latency
> > ----------+-------------+-------------+----------------+----------------+----------------
> > 100 | 1224.430344 | 0.817 | 0.551 | 0.085 | 0.048
> > 200 | 689.567511 | 1.45 | 1.12 | 0.119 | 0.05
> > 400 | 347.876616 | 2.875 | 2.419 | 0.185 | 0.052
> > 800 | 140.489269 | 7.118 | 6.393 | 0.329 | 0.059
> > 1600 | 29.681672 | 33.691 | 31.272 | 1.517 | 0.147
> > 3200 | 7.021957 | 142.412 | 136.4 | 4.033 | 0.214
> > 6400 | 1.462949 | 683.557 | 669.187 | 7.677 | 0.264

> Just a note to say that the "Allow direct lookups of AppendRelInfo by
> child relid" patch is already in master. It's much more relevant to be
> testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2. So we still don't know how much of an improvement patch2 is by itself :-)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi David.

On 2018/06/22 15:28, David Rowley wrote:
> Hi,
>
> As part of my efforts to make partitioning scale better for larger
> numbers of partitions, I've been looking primarily at INSERT VALUES
> performance. Here the overheads are almost completely in the
> executor. Planning of this type of statement is very simple since
> there is no FROM clause to process.

Thanks for this effort.

> My benchmarks have been around a RANGE partitioned table with 10k leaf
> partitions and no sub-partitioned tables. The partition key is a
> timestamp column.
>
> I've found that ExecSetupPartitionTupleRouting() is very slow indeed
> and there are a number of things slow about it. The biggest culprit
> for the slowness is the locking of each partition inside of
> find_all_inheritors().

Yes. :-(

> For now, this needs to remain as we must hold
> locks on each partition while performing RelationBuildPartitionDesc(),
> otherwise, one of the partitions may get dropped out from under us.

We lock all partitions using find_all_inheritors to keep locking order consistent with other sites that may want to lock tables in the same partition tree but with a possibly conflicting lock mode. If we remove the find_all_inheritors call in ExecSetupPartitionPruneState (like your 0002 does), we may end up locking partitions in arbitrary order in a given transaction, because input tuples will be routed to various partitions in an order that's not predetermined.

But, maybe it's not necessary to be that paranoid. If we've locked the parent, any concurrent lockers would have to wait for the lock on the parent anyway, so it doesn't matter in which order tuple routing locks the partitions.

> The locking is not the only slow thing. I found the following to also be slow:
>
> 1. RelationGetPartitionDispatchInfo uses a List and lappend() must
> perform a palloc() each time a partition is added to the list.
> 2. A foreach loop is performed over leaf_parts to search for subplans
> belonging to this partition. This seems pointless to do for INSERTs as
> there's never any to find.
> 3. ExecCleanupTupleRouting() loops through the entire partitions
> array. If a single tuple was inserted then all but one of the elements
> will be NULL.
> 4. Tuple conversion map allocates an empty array thinking there might
> be something to put into it. This is costly when the array is large
> and pointless when there are no maps to store.
> 5. During get_partition_dispatch_recurse(), get_rel_relkind() is
> called to determine if the partition is a partitioned table or a leaf
> partition. This results in a slow relcache hashtable lookup.
> 6. get_partition_dispatch_recurse() also ends up just building the
> indexes array with a sequence of numbers from 0 to nparts - 1 if there
> are no sub-partitioned tables. Doing this is slow when there are many
> partitions.
>
> Besides the locking, the only thing that remains slow now is the
> palloc0() for the 'partitions' array. In my test, it takes 0.6% of
> execution time. I don't see any pretty ways to fix that.
>
> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

The issues you list all seem legitimate to me, and so do your proposed fixes for each, except I think we could go a bit further.

Why don't we abandon the notion altogether that ExecSetupPartitionTupleRouting *has to* process the whole partition tree? ISTM there is no need to determine the exact number of leaf partitions and partitioned tables in the partition tree and allocate the arrays in PartitionTupleRouting based on that. I know that the indexes array in PartitionDispatchData contains a mapping from local partition indexes (0 to partdesc->nparts - 1) to those that span *all* leaf partitions and *all* partitioned tables (0 to proute->num_partitions or proute->num_dispatch) in a partition tree, but we can change that.

The idea I had was inspired by looking at the partitions_init stuff in your patch. We could allocate the proute->partition_dispatch_info and proute->partitions arrays to be of a predetermined size, which doesn't require us to calculate the exact number of leaf partitions and partitioned tables beforehand. So, RelationGetPartitionDispatchInfo need not recursively go over all of the partition tree. Instead we create just one PartitionDispatch object for the root parent table, whose indexes array is initialized with -1, meaning none of the partitions has been encountered yet.

In ExecFindPartition, once tuple routing chooses a partition, we create either a ResultRelInfo (if leaf) or a PartitionDispatch for it and store it in the 0th slot of proute->partitions or proute->partition_dispatch_info, respectively. Also, we update the indexes array in the parent's PartitionDispatch to replace -1 with 0 so that future tuples routing to that partition don't allocate it again. The process is repeated if the tuple needs to be routed one more level down. If the query needs to allocate more ResultRelInfos and/or PartitionDispatch objects than we initially allocated space for, we expand those arrays. Finally, during ExecCleanupTupleRouting, we only "clean up" the partitions that we allocated ResultRelInfos and PartitionDispatch objects for, which is very similar to the partitions_init idea in your patch.

I implemented that idea in the attached patch, which applies on top of your 0001 patch, but I'd say it's too big to be just called a delta. I was able to get the following performance numbers using the following pgbench test:

pgbench -n -T 180 -f insert-ht.sql

cat insert-ht.sql
\set b random(1, 1000)
\set a random(1, 1000)
insert into ht values (:b, :a);

Note that pgbench is run 3 times and every tps result is listed below.

HEAD - 0 parts (unpartitioned table)
tps = 2519.603076 (including connections establishing)
tps = 2486.903189 (including connections establishing)
tps = 2518.771182 (including connections establishing)

HEAD - 2500 hash parts (no subpart)
tps = 13.158224 (including connections establishing)
tps = 12.940713 (including connections establishing)
tps = 12.882465 (including connections establishing)

David - 2500 hash parts (no subpart)
tps = 18.717628 (including connections establishing)
tps = 18.602280 (including connections establishing)
tps = 18.945527 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 18.576858 (including connections establishing)
tps = 18.431902 (including connections establishing)
tps = 18.797023 (including connections establishing)

HEAD - 2500 hash parts (4 hash subparts each)
tps = 2.339332 (including connections establishing)
tps = 2.339582 (including connections establishing)
tps = 2.317037 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 3.225044 (including connections establishing)
tps = 3.214053 (including connections establishing)
tps = 3.239591 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 3.321918 (including connections establishing)
tps = 3.305952 (including connections establishing)
tps = 3.301036 (including connections establishing)

Applying the lazy locking patch on top of David's and my patch, respectively, produces the following results.

David - 2500 hash parts (no subpart)
tps = 1577.854360 (including connections establishing)
tps = 1532.681499 (including connections establishing)
tps = 1464.254096 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 1532.475751 (including connections establishing)
tps = 1534.650325 (including connections establishing)
tps = 1527.840837 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 78.845916 (including connections establishing)
tps = 79.167079 (including connections establishing)
tps = 79.621686 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 329.887008 (including connections establishing)
tps = 327.428103 (including connections establishing)
tps = 326.863248 (including connections establishing)

About the last two results: after getting rid of the time-hog that is the find_all_inheritors() call in ExecSetupPartitionTupleRouting for locking all partitions, it seems that we'll end up spending most of the time in RelationGetPartitionDispatchInfo() without my patch, because it will call get_partition_dispatch_recurse() for each of the 2500 first-level partitions that are themselves partitioned. With my patch, we won't do that and won't end up generating 2499 PartitionDispatch objects that would not be needed for a single-row insert statement.

Thanks,
Amit
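To make the shape of that lazy scheme concrete, here is a small, self-contained C sketch (illustrative only; the names and structures are simplified stand-ins, not the code in the attached patch): a parent's indexes array starts out all -1, and a slot in the routing array is claimed, and its per-partition state built, only the first time a tuple actually lands in that partition.

#include <stdlib.h>

typedef struct Dispatch
{
    int  nparts;     /* partitions directly under this parent */
    int *indexes;    /* local partition index -> routing slot, or -1;
                      * allocated with the parent and filled with -1
                      * (e.g. memset(indexes, -1, nparts * sizeof(int))) */
} Dispatch;

typedef struct RouteState
{
    void **partitions;      /* stand-ins for per-partition ResultRelInfos */
    int    num_partitions;
    int    max_partitions;  /* small to begin with; doubled when full */
} RouteState;

RouteState *
route_create(void)
{
    RouteState *rs = calloc(1, sizeof(RouteState));

    rs->max_partitions = 8;     /* deliberately small initial guess */
    rs->partitions = malloc(rs->max_partitions * sizeof(void *));
    return rs;
}

/* Return the routing slot for the partition chosen by tuple routing,
 * creating its per-partition state only on first use. */
int
claim_slot(RouteState *rs, Dispatch *parent, int partidx)
{
    if (parent->indexes[partidx] < 0)
    {
        if (rs->num_partitions == rs->max_partitions)
        {
            rs->max_partitions *= 2;
            rs->partitions = realloc(rs->partitions,
                                     rs->max_partitions * sizeof(void *));
        }
        rs->partitions[rs->num_partitions] = calloc(1, 1);  /* "build" it here */
        parent->indexes[partidx] = rs->num_partitions++;
    }
    return parent->indexes[partidx];
}

For a single-row INSERT into a tree of 2500 sub-partitioned parents, only the root Dispatch and the one partition the tuple lands in ever get built, which is the effect the numbers above are measuring.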
On 13 July 2018 at 20:20, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Why don't we abandon the notion altogether that
> ExecSetupPartitionTupleRouting *has to* process the whole partition tree?

[...]

> I implemented that idea in the attached patch, which applies on top of
> your 0001 patch, but I'd say it's too big to be just called a delta. I
> was able to get the following performance numbers using the following pgbench
> test:

Thanks for looking at this. I like that your idea gets rid of the indexes cache in syscache. I was not very happy with that part.

I've looked over the code and the ExecUseUpdateResultRelForRouting() function is broken. Your while loop only skips partitions for the current partitioned table; it does not skip ModifyTable subnodes that belong to other partitioned tables. You can use the following. The code does not find the t1_a2 subnode.

create table t1 (a int, b int) partition by list(a);
create table t1_a1 partition of t1 for values in(1) partition by list(b);
create table t1_a2 partition of t1 for values in(2);
create table t1_a1_b1 partition of t1_a1 for values in(1);
create table t1_a1_b2 partition of t1_a1 for values in(2);
insert into t1 values(2,2);

update t1 set a = a;

I think there might not be enough information to make this work correctly, as if you change the loop to skip subnodes, then it won't work in cases where partition[0] was pruned.

I've another patch sitting here, partly done, that changes pg_class.relispartition into pg_class.relpartitionparent. If we had that then we could code your loop to work correctly. Alternatively, I guess we could just ignore the UPDATE's ResultRelInfos and just build new ones. Unsure if there's actually a reason we need to reuse the existing ones, is there?

I think you'd need to know the owning partition and skip subnodes that don't belong to pd->reldesc. Alternatively, a hashtable could be built with all the oids belonging to pd->reldesc, then we could loop over the update_rris finding subnodes that can be found in the hashtable. Likely this will be much slower than the sort of merge lookup that the previous code did.

Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code. The code seems to assume that there can only be at the most 65536 partitions, but I don't think there's any code which restricts us to that. There is code in the planner that will bork when trying to create a RangeTblEntry up that high, but as far as I know that won't be noticed on the INSERT path. I don't think this code has any business knowing what the special varnos are set to either. It would be better to just remove the limit and suffer the small wasted array space. I understand you've probably coded it like this due to the similar code that was in my patch, but with mine I knew the total number of partitions. Your patch does not.

Other thoughts on the patch:

I wonder if it's worth having syscache keep a count on the number of sub-partitioned tables a partition has. If there are none in the root partition then partition_dispatch_info can be initialized with just 1 element to store the root details. Although, maybe it's not worth it just to reduce the array size by 7 elements.

Also, I'm a bit confused why you changed the comments in execPartition.h for PartitionTupleRouting to be inline again. I brought those out of line as I thought the complexity of the code warranted that. Inlining them again goes against what all the other structs do in that file.

Apart from that, I think the idea is promising. We'll just need to find a way to make ExecUseUpdateResultRelForRouting work correctly.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi David,

Thanks for taking a look.

On 2018/07/15 17:34, David Rowley wrote:
> I've looked over the code and the ExecUseUpdateResultRelForRouting()
> function is broken. Your while loop only skips partitions for the
> current partitioned table; it does not skip ModifyTable subnodes that
> belong to other partitioned tables.
>
> You can use the following. The code does not find the t1_a2 subnode.
>
> create table t1 (a int, b int) partition by list(a);
> create table t1_a1 partition of t1 for values in(1) partition by list(b);
> create table t1_a2 partition of t1 for values in(2);
> create table t1_a1_b1 partition of t1_a1 for values in(1);
> create table t1_a1_b2 partition of t1_a1 for values in(2);
> insert into t1 values(2,2);
>
> update t1 set a = a;

Hmm, it indeed is broken.

> I think there might not be enough information to make this work
> correctly, as if you change the loop to skip subnodes, then it won't
> work in cases where partition[0] was pruned.
>
> I've another patch sitting here, partly done, that changes
> pg_class.relispartition into pg_class.relpartitionparent. If we had
> that then we could code your loop to work correctly.
> Alternatively, I guess we could just ignore the UPDATE's ResultRelInfos
> and just build new ones. Unsure if there's actually a reason we need to
> reuse the existing ones, is there?

We try to use the existing ones because we thought, back when the patch was written (not by me though), that redoing all the work that InitResultRelInfo does for each partition, for which we could have instead used an existing one, would cumulatively end up being more expensive than figuring out which ones we could reuse by a linear scan of the partition and result rels arrays in parallel. I don't remember seeing a benchmark to demonstrate the benefit of doing this though. Maybe it was posted, but I don't remember having looked at it closely.

> I think you'd need to know the owning partition and skip subnodes that
> don't belong to pd->reldesc. Alternatively, a hashtable could be built
> with all the oids belonging to pd->reldesc, then we could loop over
> the update_rris finding subnodes that can be found in the hashtable.
> Likely this will be much slower than the sort of merge lookup that the
> previous code did.

I think one option is to simply give up on the idea of matching *all* UPDATE result rels that belong to a given partitioned table (pd->reldesc) in one call of ExecUseUpdateResultRelForRouting. Instead, pass the index of the partition (in pd->partdesc->oids) to find the ResultRelInfo for, loop over all UPDATE result rels looking for one, and return immediately on finding one after having stored its pointer in proute->partitions. In the worst case, we'll end up scanning the UPDATE result rels array for every partition that gets touched, but maybe such an UPDATE query is less common, or even if such a query occurs, tuple routing might be the last of its bottlenecks.

I have implemented that approach in the updated patch. That means I also needed to change things so that ExecUseUpdateResultRelsForRouting is now only called by ExecFindPartition, because with the new arrangement, it's useless to call it from ExecSetupPartitionTupleRouting. Moreover, an UPDATE may not use tuple routing at all, even if the fact that the partition key is being updated results in calling ExecSetupPartitionTupleRouting.

> Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code.
> The code seems to assume that there can only be at the most 65536
> partitions, but I don't think there's any code which restricts us to
> that. There is code in the planner that will bork when trying to
> create a RangeTblEntry up that high, but as far as I know that won't
> be noticed on the INSERT path. I don't think this code has any
> business knowing what the special varnos are set to either. It would
> be better to just remove the limit and suffer the small wasted array
> space. I understand you've probably coded it like this due to the
> similar code that was in my patch, but with mine I knew the total
> number of partitions. Your patch does not.

OK, I changed it to UINT_MAX.

> Other thoughts on the patch:
>
> I wonder if it's worth having syscache keep a count on the number of
> sub-partitioned tables a partition has. If there are none in the root
> partition then partition_dispatch_info can be initialized with
> just 1 element to store the root details. Although, maybe it's not
> worth it just to reduce the array size by 7 elements.

Hmm yes. Allocating space for 8 pointers when we really need 1 is not too bad, if the alternative is to modify partcache.c.

> Also, I'm a bit confused why you changed the comments in
> execPartition.h for PartitionTupleRouting to be inline again. I
> brought those out of line as I thought the complexity of the code
> warranted that. Inlining them again goes against what all the
> other structs do in that file.

It was out-of-line to begin with, but it started to become distracting when updating the comments. But I agree about being consistent and hence I have moved them back to where they were. I have significantly rewritten those comments though to be clearer.

> Apart from that, I think the idea is promising. We'll just need to
> find a way to make ExecUseUpdateResultRelForRouting work correctly.

Let me know what you think of the code in the updated patch.

Thanks,
Amit
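A compact sketch of the per-partition lookup described above (illustrative C only; the names are invented and the real code works with ResultRelInfo and the proute fields): given the OID of the partition a tuple routed to, scan the UPDATE result rels once and cache the match, so later tuples routed to the same partition pay nothing.

#include <stddef.h>

typedef unsigned int Oid;          /* stand-in for PostgreSQL's Oid */

typedef struct UpdateRel
{
    Oid   relid;
    void *rri;                     /* stand-in for the subplan's ResultRelInfo */
} UpdateRel;

/* Find (and cache) the UPDATE result rel for the routed-to partition,
 * or return NULL so the caller knows it must build a fresh one. */
void *
lookup_update_rel(Oid partoid, UpdateRel *update_rels, int n_update_rels,
                  void **partition_slot /* think proute->partitions[partidx] */)
{
    if (*partition_slot != NULL)
        return *partition_slot;    /* matched by an earlier tuple */

    for (int i = 0; i < n_update_rels; i++)
    {
        if (update_rels[i].relid == partoid)
        {
            *partition_slot = update_rels[i].rri;
            return *partition_slot;
        }
    }
    return NULL;
}

The worst case mentioned above is exactly this loop running once per touched partition, which is why a query updating the partition key across many partitions could still end up scanning the array repeatedly.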
On 2018-Jul-11, Alvaro Herrera wrote:
> That commit is also in pg11, though -- just not in beta2. So we still don't know how much of an improvement patch2 is by itself :-)

Oops!

I benchmarked with 11beta2 + 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch. Results are as follows. Performance seems to be improved.

 part_num | latency_avg |   tps_ex   | update_latency | select_latency | insert_latency
----------+-------------+------------+----------------+----------------+----------------
      100 |        2.09 | 478.379516 |          1.407 |           0.36 |          0.159
      200 |       5.871 | 170.322179 |          4.621 |          0.732 |          0.285
      400 |      39.029 |  25.622384 |         35.542 |          2.273 |          0.758
      800 |     142.624 |   7.011494 |        135.447 |           5.04 |          1.388
     1600 |     559.872 |   1.786138 |        534.301 |         20.318 |          3.122
     3200 |    2161.834 |   0.462574 |       2077.737 |         72.804 |          7.037
     6400 |     8282.38 |   0.120739 |       7996.212 |        259.406 |         14.514

Thanks
Kato Sho

-----Original Message-----
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
Sent: Wednesday, July 11, 2018 10:30 PM
To: David Rowley <david.rowley@2ndquadrant.com>
Cc: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Jul-11, David Rowley wrote:

> On 6 July 2018 at 21:25, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> > 2. 11beta2 + patch1 + patch2
> >
> > patch1: Allow direct lookups of AppendRelInfo by child relid
> > commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
> >
> > part_num | tps_ex | latency_avg | update_latency | select_latency | insert_latency
> > ----------+-------------+-------------+----------------+----------------+----------------
> > 100 | 1224.430344 | 0.817 | 0.551 | 0.085 | 0.048
> > 200 | 689.567511 | 1.45 | 1.12 | 0.119 | 0.05
> > 400 | 347.876616 | 2.875 | 2.419 | 0.185 | 0.052
> > 800 | 140.489269 | 7.118 | 6.393 | 0.329 | 0.059
> > 1600 | 29.681672 | 33.691 | 31.272 | 1.517 | 0.147
> > 3200 | 7.021957 | 142.412 | 136.4 | 4.033 | 0.214
> > 6400 | 1.462949 | 683.557 | 669.187 | 7.677 | 0.264

> Just a note to say that the "Allow direct lookups of AppendRelInfo by
> child relid" patch is already in master. It's much more relevant to be
> testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2. So we still don't know how much of an improvement patch2 is by itself :-)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 18 July 2018 at 21:44, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> part_num | latency_avg | tps_ex | update_latency | select_latency | insert_latency
> ----------+-------------+------------+----------------+----------------+----------------
> 100 | 2.09 | 478.379516 | 1.407 | 0.36 | 0.159
> 200 | 5.871 | 170.322179 | 4.621 | 0.732 | 0.285
> 400 | 39.029 | 25.622384 | 35.542 | 2.273 | 0.758
> 800 | 142.624 | 7.011494 | 135.447 | 5.04 | 1.388
> 1600 | 559.872 | 1.786138 | 534.301 | 20.318 | 3.122
> 3200 | 2161.834 | 0.462574 | 2077.737 | 72.804 | 7.037
> 6400 | 8282.38 | 0.120739 | 7996.212 | 259.406 | 14.514

Thanks for testing. It's fairly customary to include before/after, unpatched/patched results. I don't think your patched results really mean much by themselves.

It's pretty well known that adding more partitions slows down the planner and, to a lesser extent, the executor. This patch only aims to reduce some of the executor startup overheads for INSERT and UPDATE.

Also, the 0001 patch is not really aiming to break any performance records. I posted results already and there is only a very small improvement. The main aim with the 0001 patch is to remove the bottlenecks so that the performance drop between partitioned and non-partitioned is primarily due to the partition locking. I'd like to fix that too, but it's more work and I see no reason that we shouldn't fix up the other slow parts first. I imagine this will increase the motivation to resolve the locking-all-partitions issue too.

I'd also recommend that if you're testing this, you do so with a recent master. The patch is not intended for pg11.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 18 July 2018 at 20:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Let me know what you think of the code in the updated patch.

Thanks for sending the updated patch. I looked over it tonight and made a number of changes:

1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was useless since the int would have wrapped long before it reached UINT_MAX. There's no shortage of other code doubling the size of an array by multiplying it by 2 unconditionally without considering overflowing an int. Unsure why you considered this more risky.

2) Fixed a series of bugs regarding the size of the arrays in PartitionTupleRouting. The map arrays and the partitions array could differ in size despite your comment that claimed child_parent_tupconv_maps was the same size as 'partitions' when non-NULL. The map arrays being a different size than the partitions array caused the following two cases to segfault. I've included two cases as it was two separate bugs that caused them.

-- case 1
drop table listp;
create table listp (a int, b int) partition by list (a);
create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);
create table listp3 partition of listp for values in (3);
create table listp4 partition of listp for values in (4);
create table listp5 partition of listp for values in (5);
create table listp6 partition of listp for values in (6);
create table listp7 partition of listp for values in (7);
create table listp8 partition of listp for values in (8);
create table listp9 (b int, a int);
alter table listp attach partition listp9 for values in(9);

insert into listp select generate_series(1,9);

-- case 2
drop table listp;
create table listp (a int, b int) partition by list (a);
create table listp1 (b int, a int);
alter table listp attach partition listp1 for values in(1);
create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);
create table listp3 partition of listp for values in (3);
create table listp4 partition of listp for values in (4);
create table listp5 partition of listp for values in (5);
create table listp6 partition of listp for values in (6);
create table listp7 partition of listp for values in (7);
create table listp8 partition of listp for values in (8);
create table listp9 partition of listp for values in (9);

insert into listp select generate_series(1,9);

3) Got rid of ExecUseUpdateResultRelForRouting. I started to change this to remove references to UPDATE in order to make it more friendly towards other possible future node types that it would get used for (aka MERGE). In the end, I found that performance could regress in cases like:

drop table listp;
create table listp (a int) partition by list(a);
\o /dev/null
\timing off
select 'create table listp'||x::Text||' partition of listp for values in('||x::Text||');' from generate_series(1,1000) x;
\gexec
\o
insert into listp select x from generate_series(1,999) x;
\timing on
update listp set a = a+1;

It's true that UPDATEs with a large number of subplans have quite terrible performance in the planner today, but this code made the performance of planning+execution a bit worse. If we get around to fixing the inheritance planner then I think ExecUseUpdateResultRelForRouting() could easily appear in profiles. I ended up rewriting it to just get called once and build a hash table keyed by Oid storing a ResultRelInfo pointer. This also gets rid of the slow nested loop in the cleanup operation inside ExecCleanupTupleRouting().

4) Did some tuning work in ExecFindPartition(), getting rid of a redundant check after the loop completion. Also added some likely() and unlikely() decorations around some conditions.

5) Updated some comments in execPartition.h that your patch had made out-of-date.

6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a palloc(), updating the few fields that were not initialised. This might save a few TPS (at least once we get rid of the all-partition locking) in the single-row INSERT case, but I've not tested the performance of this yet.

7) Also moved and edited some comments above ExecSetupPartitionTupleRouting() that I felt explained a little too much about some internal implementation details.

One thing that I thought of, but didn't do, was just having ExecFindPartition() return the ResultRelInfo. I think it would be much nicer in both call sites to not have to check the ->partitions array to get that. The copy.c call site would need a few modifications around the detection code to see if the partition has changed, but it all looks quite possible to change. I left this for now as I have another patch which touches all that code that I feel is closer to commit than this patch is.

I've attached a delta of the changes I made since your v2 delta and also a complete updated patch.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
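To illustrate the shape of the rewrite described in item 3 (a sketch only, using PostgreSQL's dynahash and invented struct and function names; the attached patch may differ in detail — it assumes the backend environment, roughly postgres.h, utils/hsearch.h, nodes/execnodes.h and utils/rel.h), the subplan result rels can be hashed by relation OID once, then probed during routing:

/* Sketch: build the hash table once from the ModifyTable subplan result rels. */
typedef struct SubplanResultRelEntry
{
    Oid            relid;       /* hash key; must be the first field */
    ResultRelInfo *rri;
} SubplanResultRelEntry;

static HTAB *
build_subplan_resultrel_hash(ResultRelInfo *update_rris, int num_update_rris)
{
    HASHCTL     ctl;
    HTAB       *htab;
    int         i;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(SubplanResultRelEntry);
    ctl.hcxt = CurrentMemoryContext;

    htab = hash_create("subplan result rels", num_update_rris, &ctl,
                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

    for (i = 0; i < num_update_rris; i++)
    {
        Oid         relid = RelationGetRelid(update_rris[i].ri_RelationDesc);
        SubplanResultRelEntry *entry;
        bool        found;

        entry = hash_search(htab, &relid, HASH_ENTER, &found);
        entry->rri = &update_rris[i];
    }
    return htab;
}

/* During routing: reuse the subplan's ResultRelInfo if the partition has one. */
static ResultRelInfo *
find_subplan_resultrel(HTAB *htab, Oid partoid)
{
    SubplanResultRelEntry *entry = hash_search(htab, &partoid, HASH_FIND, NULL);

    return entry ? entry->rri : NULL;
}

Building the table once makes both routing lookups and the cleanup pass O(1) per partition, instead of the per-partition linear scan over the subplan array.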
David Rowley <david.rowley@2ndquadrant.com> writes:
> 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
> useless since the int would have wrapped long before it reached
> UINT_MAX. There's no shortage of other code doubling the size of an
> array by multiplying it by 2 unconditionally without considering
> overflowing an int. Unsure why you considered this more risky.

As long as you're re-palloc'ing the array each time, and not increasing its size more than 2X, this is perfectly safe because of the 1GB size limit on palloc requests. You'll fail because of that in the iteration where the request is between 1GB and 2GB, just before integer overflow can occur. (Yes, this is intentional.)

regards, tom lane
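A small worked example of that guarantee (illustrative standalone C, not patch code; it assumes palloc's limit is MaxAllocSize, just under 1GB): with 8-byte pointers, the doubling sequence of element counts reaches an over-limit request at roughly 134 million elements, far below the point where an int counter could overflow.

#include <stdio.h>
#include <limits.h>

#define MaxAllocSize ((size_t) 0x3fffffff)   /* palloc's cap: 1 gigabyte - 1 */

int
main(void)
{
    int    nalloc = 8;                /* initial array size */
    size_t elemsz = sizeof(void *);

    /* keep doubling until the request would exceed the palloc limit */
    while ((size_t) nalloc * elemsz <= MaxAllocSize)
        nalloc *= 2;

    /* prints nalloc = 134217728 vs INT_MAX = 2147483647 on an LP64 system */
    printf("allocation fails at nalloc = %d (INT_MAX = %d)\n", nalloc, INT_MAX);
    return 0;
}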
On 27 July 2018 at 04:19, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached a delta of the changes I made since your v2 delta and
> also a complete updated patch.

I did a very quick performance test of this patch on an AWS m5d.large instance with fsync=off. The test setup is the same as is described in my initial email on this thread. The test compares the performance of INSERTs into a partitioned table with 10k partitions against a non-partitioned table.

Patched with v2 patch on master@39d51fe87

-- partitioned
$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1063764
latency average = 0.056 ms
tps = 17729.375930 (including connections establishing)
tps = 17729.855215 (excluding connections establishing)

-- non-partitioned
$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1147273
latency average = 0.052 ms
tps = 19121.194366 (including connections establishing)
tps = 19121.695469 (excluding connections establishing)

Here we're within 92% of the non-partitioned performance. Looking back at the first email in this thread where I tested the v1 patch, we were within 82% with:

-- partitioned
tps = 11001.602377 (excluding connections establishing)

-- non-partitioned
tps = 13354.656163 (excluding connections establishing)

Again, same as with the v1 test, the v2 test was done with the locking of all partitions removed with:

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d7b18f52ed..6223c62094 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -80,9 +80,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	PartitionTupleRouting *proute;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/* Lock all the partitions. */
-	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-
 	/*
 	 * Here we attempt to expend as little effort as possible in setting up
 	 * the PartitionTupleRouting. Each partition's ResultRelInfo is built
@@ -442,7 +439,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(partoid, NoLock);
+	partrel = heap_open(partoid, RowExclusiveLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the

Again, the reduced locking is not meant for commit with this patch. Changing the locking will require a discussion on its own thread.

And just for fun, the unpatched performance on the partitioned table:

ubuntu@ip-10-0-0-33:~$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 5751
latency average = 10.434 ms
tps = 95.836052 (including connections establishing)
tps = 95.838490 (excluding connections establishing)

(185x increase)

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/07/27 1:19, David Rowley wrote: > On 18 July 2018 at 20:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Let me know what you think of the code in the updated patch. > > Thanks for sending the updated patch. > > I looked over it tonight and made a number of changes: > > 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was > useless since the int would have wrapped long before it reached > UINT_MAX. Oops, you're right. > There's no shortage of other code doubling the size of an > array by multiplying it by 2 unconditionally without considering > overflowing an int. Unsure why you considered this more risky. Just ill-informed paranoia on my part. Let's just drop it as you say, given also the Tom's comment that repalloc would fail anyway for requests over 1GB. > 2) Fixed a series of bugs regarding the size of the arrays in > PartitionTupleRouting. The map arrays and the partitions array could > differ in size despite your comment that claimed > child_parent_tupconv_maps was the same size as 'partitions' when > non-NULL. The map arrays being a different size than the partitions > array caused the following two cases to segfault. I've included two > cases as it was two seperate bugs that caused them. > > -- case 1 [ .... ] > -- case 2 Indeed, there were some holes in the logic that led to me to come up with that code. > 3) Got rid of ExecUseUpdateResultRelForRouting. I started to change > this to remove references to UPDATE in order to make it more friendly > towards other possible future node types that it would get used for > (aka MERGE). In the end, I found that performance could regress when > in cases like: > > drop table listp; > create table listp (a int) partition by list(a); > \o /dev/null > \timing off > select 'create table listp'||x::Text||' partition of listp for values > in('||x::Text||');' from generate_series(1,1000) x; > \gexec > \o > insert into listp select x from generate_series(1,999) x; > \timing on > update listp set a = a+1; > > It's true that UPDATEs with a large number of subplans performance is > quite terrible today in the planner, but this code made the > performance of planning+execution a bit worse. If we get around to > fixing the inheritance planner then I think > ExecUseUpdateResultRelForRouting() could easily appear in profiles. > > I ended up rewriting it to just get called once and build a hash table > by Oid storing a ResultRelInfo pointer. This also gets rid of the > slow nested loop in the cleanup operation inside > ExecCleanupTupleRouting(). OK, looks neat, although I'd name the hash table subplan_resultrel_hash (like join_rel_hash in PlannerInfo), instead of subplan_partition_table. > 4) Did some tuning work in ExecFindPartition() getting rid of a > redundant check after the loop completion. Also added some likely() > and unlikely() decorations around some conditions. All changes seem good. > 5) Updated some newly out-dated comments since your patch in execPartition.h. > > 6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a > palloc() updating the few fields that were not initialised. This might > save a few TPS (at least once we get rid of the all partition locking) > in the single-row INSERT case, but I've not tested the performance of > this yet. > > 7) Also moved and edited some comments above > ExecSetupPartitionTupleRouting() that I felt explained a little too > much about some internal implementation details. Thanks, changes look good. 
> One thing that I thought of, but didn't do was just having > ExecFindPartition() return the ResultRelInfo. I think it would be much > nicer in both call sites to not have to check the ->partitions array > to get that. The copy.c call site would need a few modifications > around the detection code to see if the partition has changed, but it > all looks quite possible to change. I left this for now as I have > another patch which touches all that code that I feel is closer to > commit than this patch is. I had wondered about that too, but gave up on doing anything about it because the callers of ExecFindPartition want to access other fields of PartitionTupleRouting using the returned index. Maybe we could make it return a ResultRelInfo * and also return the index itself using a separate output argument. Seems like a cosmetic improvement that can be made later. > I've attached a delta of the changes I made since your v2 delta and > also a complete updated patch. Thanks. Here are some other minor comments on the complete v2 patch. - tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index], + tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ? + proute->parent_child_tupconv_maps[leaf_part_index] : + NULL, This piece of code that's present in both ExecPrepareTupleRouting and CopyFrom can be written as: if (proute->parent_child_tupconv_maps) ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx], tuple, proute->partition_tuple_slot, &slot); + /* + * If UPDATE needs to do tuple routing, we'll need a slot that will + * transiently store the tuple being routed using the root parent's + * rowtype. We must set up at least this slot, because it's needed even + * before tuple routing begins. Other necessary information is + * initialized when tuple routing code calls + * ExecUseUpdateResultRelForRouting. + */ if (node && node->operation == CMD_UPDATE) This comment needs to be updated, because you changed the if block's body to: + ExecHashSubPlanResultRelsByOid(mtstate, proute); proute->root_tuple_slot = MakeTupleTableSlot(NULL); So, we don't just set up the slot here, we also now set up the hash table to store sub-plan result rels. Also, ExecUseUpdateResultRelForRouting no longer exists. + /* + * Get the index for PartitionTupleRouting->partitions array index + * for this leaf partition. This may require building a new + * ResultRelInfo. + */ The 1st sentence reads a bit strangely. Did you mean to write the following? /* * Get this leaf partition's index in the * PartitionTupleRouting->partitions array. * This may require building a new ResultRelInfo. */ The following block of code could use a one-line comment describing what's going on (although, what's going on might be pretty clear to some eyes just by looking at the code): else { if (proute->subplan_partition_table) { ResultRelInfo *rri; Oid partoid = partdesc->oids[partidx]; rri = hash_search(proute->subplan_partition_table, &partoid, HASH_FIND, NULL); /* + * ExecInitPartitionDispatchInfo + * Initialize PartitionDispatch for a partitioned table + * + * This also stores it in the proute->partition_dispatch_info array at the + * specified index ('dispatchidx'), possibly expanding the array if there + * isn't space left in it. + */ You renamed 'dispatchidx' to 'partidx' in the function's signature but forgot to update this comment. I've attached a delta patch to make the above changes. I'm leaving the hash table rename up to you though. Thanks, Amit
Attachment
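For illustration, the hash table of subplan result rels described above can be sketched with PostgreSQL's dynahash API roughly as below. The struct and function names here are made up for the sketch and are not taken from the patch; only hash_create()/hash_search() and the HASH_ELEM | HASH_BLOBS usage reflect the real API (assumes utils/hsearch.h, utils/rel.h and nodes/execnodes.h are included).

/* Sketch only: map a partition's Oid to the matching subplan's ResultRelInfo. */
typedef struct SubplanResultRelHashElem
{
	Oid			relid;		/* hash key; must be the first field */
	ResultRelInfo *rri;		/* ResultRelInfo of the matching subplan */
} SubplanResultRelHashElem;

static HTAB *
build_subplan_resultrel_hash(ResultRelInfo *resultRelInfo, int nplans)
{
	HASHCTL		ctl;
	HTAB	   *htab;
	int			i;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(SubplanResultRelHashElem);
	htab = hash_create("subplan result rels", nplans, &ctl,
					   HASH_ELEM | HASH_BLOBS);

	for (i = 0; i < nplans; i++)
	{
		Oid			partoid = RelationGetRelid(resultRelInfo[i].ri_RelationDesc);
		SubplanResultRelHashElem *elem;
		bool		found;

		/* HASH_ENTER fills in the key; we only need to set the payload. */
		elem = hash_search(htab, &partoid, HASH_ENTER, &found);
		Assert(!found);
		elem->rri = &resultRelInfo[i];
	}

	return htab;
}

Tuple routing can then find a pre-existing result rel with a single hash_search(htab, &partoid, HASH_FIND, NULL) instead of looping over the subplans, which is presumably also what removes the slow nested loop from ExecCleanupTupleRouting() mentioned above.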
On 27 July 2018 at 19:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I've attached a delta patch to make the above changes. I'm leaving the > hash table rename up to you though. Thanks for the delta patch. I took all of it, just rewrote a comment slightly. I also renamed the hash table to your suggestion and changed a few more things. Attached a delta based on v2 and the full v3 patch. This includes another small change to make PartitionDispatchData->indexes an array that's allocated in the same memory as the PartitionDispatchData. This will save a palloc() call and also should be a bit more cache friendly. This also required a rebase on master again due to 3e32109049. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
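For readers unfamiliar with the trick, allocating the indexes array in the same memory as its parent struct is the usual C flexible-array-member pattern. A simplified stand-in (not the real PartitionDispatchData, which has more fields) might look like this:

/* Sketch only: struct with a trailing flexible array allocated in one chunk. */
typedef struct ExampleDispatchData
{
	Relation	reldesc;		/* the partitioned table */
	PartitionDesc partdesc;		/* its partition descriptor */
	int			nparts;
	int			indexes[FLEXIBLE_ARRAY_MEMBER]; /* must be the last field */
} ExampleDispatchData;

/*
 * A single palloc() covers both the struct and its trailing array, saving
 * one palloc() call and keeping the array adjacent to the struct in memory.
 */
pd = (ExampleDispatchData *)
	palloc(offsetof(ExampleDispatchData, indexes) +
		   partdesc->nparts * sizeof(int));

The element count (the number of partitions) is known up front, which is what makes the single allocation possible.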
On 2018/07/28 10:54, David Rowley wrote: > On 27 July 2018 at 19:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I've attached a delta patch to make the above changes. I'm leaving the >> hash table rename up to you though. > > Thanks for the delta patch. I took all of it, just rewrote a comment slightly. > > I also renamed the hash table to your suggestion and changed a few more things. > > Attached a delta based on v2 and the full v3 patch. > > This includes another small change to make > PartitionDispatchData->indexes an array that's allocated in the same > memory as the PartitionDispatchData. This will save a palloc() call > and also should be a bit more cache friendly. > > This also required a rebase on master again due to 3e32109049. Thanks for the updated patch. I couldn't find much to complain about in the latest v3, except that I noticed a few instances of the word "setup" where I think what's really meant is "set up". + * must be setup, but any sub-partitioned tables can be setup lazily as + * A ResultRelInfo has not been setup for this partition yet, By the way, when going over the updated code, I noticed that the code around child_parent_tupconv_maps could use some refactoring too. Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates the child-to-parent map array needed for transition tuple capture even if it's not needed by any of the leaf partitions. I'm attaching here a patch that applies on top of your v3 to show what I'm thinking we could do. Thanks, Amit
Attachment
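To illustrate the kind of refactoring being proposed (allocating the child-to-parent map array only when some partition actually needs a map), a sketch might look like the following. The proute field names other than child_parent_tupconv_maps are invented here for the example, and it relies on the fact that convert_tuples_by_name() returns NULL when the tuple descriptors already match, so the array can stay unallocated in the common case:

/* Sketch only: build this partition's child-to-parent map lazily. */
if (mtstate->mt_transition_capture != NULL)
{
	TupleConversionMap *map;

	map = convert_tuples_by_name(RelationGetDescr(partrel),
								 RelationGetDescr(rootrel),
								 gettext_noop("could not convert row type"));
	if (map != NULL)
	{
		/* First map needed anywhere?  Only now allocate the array. */
		if (proute->child_parent_tupconv_maps == NULL)
			proute->child_parent_tupconv_maps = (TupleConversionMap **)
				palloc0(sizeof(TupleConversionMap *) *
						proute->partitions_allocsize);
		proute->child_parent_tupconv_maps[partidx] = map;
	}
}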
On 30 July 2018 at 20:26, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I couldn't find much to complain about in the latest v3, except I noticed > a few instances of the word "setup" where I think what's really meant is > "set up". > > + * must be setup, but any sub-partitioned tables can be setup lazily as > > + * A ResultRelInfo has not been setup for this partition yet, > Great. I've fixed those and also fixed a few other comments. I found the comments on PartitionTupleRouting didn't really explain how the arrays were indexed. I've made an attempt to make that clear. I've attached a complete v4 patch. > By the way, when going over the updated code, I noticed that the code > around child_parent_tupconv_maps could use some refactoring too. > Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates > child-to-parent map array needed for transition tuple capture even if not > needed by any of the leaf partitions. I'm attaching here a patch that > applies on top of your v3 to show what I'm thinking we could do. Maybe we can do that as a follow-on patch. I think what we have so far has already ended up quite complex to review. What do you think? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 31 July 2018 at 19:03, David Rowley <david.rowley@2ndquadrant.com> wrote: > I've attached a complete v4 patch. I've attached v5 of the patch, which is based on top of today's master (@ 579b985b22). A couple of recent patches conflict with v4. I've also made another tidy-up pass, which was mostly just rewording comments in execPartition.h. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
(looking at the v5 patch but replying to an older email) On 2018/07/31 16:03, David Rowley wrote: > I've attached a complete v4 patch. > >> By the way, when going over the updated code, I noticed that the code >> around child_parent_tupconv_maps could use some refactoring too. >> Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates >> child-to-parent map array needed for transition tuple capture even if not >> needed by any of the leaf partitions. I'm attaching here a patch that >> applies on top of your v3 to show what I'm thinking we could do. > > Maybe we can do that as a follow-on patch. We probably could, but I think it would be a good idea to get rid of *all* redundant allocations due to tuple routing in one patch, if that's the mission of this thread and the patch anyway. > I think what we have so far > is already ended up quite complex to review. What do you think? Yeah, it's kind of complex, but at least it seems that we're clear on the point that what we're trying to do here is to try to get rid of redundant allocations. Parts of the patch that appear complex seem to be around the allocation of various maps. Especially the child-to-parent maps, which, as things stand today, come from two arrays -- a per-update-subplan array that's needed by update tuple routing proper and a per-leaf-partition array (one in PartitionTupleRouting) that's needed by the transition capture machinery. The original coding was such that the update tuple routing handling code would try to avoid allocating the per-update-subplan array if it saw that the per-leaf-partition array was already set up in PartitionTupleRouting, because transition capture is active in the query. For update-tuple-routing code to be able to use maps from the per-leaf array, it would have to know which update-subplans mapped to which tuple-routing-initialized partitions. That was maintained in the subplan_partition_offset array that's now gone with this patch, because we no longer want to fix the tuple-routing-initialized-partition offsets in advance. So, it's better to dissociate per-subplan maps which are initialized during ExecInitModifyTable from per-leaf maps which are initialized lazily when tuple routing initializes a partition, which is what my portion of the patch did. As mentioned in my last email, I still think it would be a good idea to simplify the handling of child-to-parent maps in PartitionTupleRouting even further while we're improving the code in this area. I revised the patch such that it makes the handling of maps in PartitionTupleRouting even more uniform. With that patch, we no longer have two completely unrelated places in the code managing parent-to-child and child-to-parent maps, even though both arrays are in the same PartitionTupleRouting. Please find the updated patch attached with this email. Thanks, Amit
Attachment
On 3 August 2018 at 17:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/07/31 16:03, David Rowley wrote: >> Maybe we can do that as a follow-on patch. > > We probably could, but I think it would be a good idea get rid of *all* > redundant allocations due to tuple routing in one patch, if that's the > mission of this thread and the patch anyway. I started looking at this patch today and I now agree that it should be included in the main patch. I changed a few things with the patch. For example, the map access macros you'd defined were not in CamelCase. I also fixed a bug where the child to parent map was not being initialised when on conflict transition capture was required. I added a test which was crashing the backend, and fixed the code so it works correctly. I also got rid of the child_parent_map_not_required array since we no longer need it. The code now always initialises the maps in cases where they're going to be required. I've attached a v3 version of your patch and also v6 of the main patch which includes the v3 patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2018/08/21 14:44, David Rowley wrote: > On 3 August 2018 at 17:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2018/07/31 16:03, David Rowley wrote: >>> Maybe we can do that as a follow-on patch. >> >> We probably could, but I think it would be a good idea get rid of *all* >> redundant allocations due to tuple routing in one patch, if that's the >> mission of this thread and the patch anyway. > > I started looking at this patch today and I now agree that it should > be included in the main patch. Great, thanks. > I changed a few things with the patch. For example, the map access > macros you'd defined were not in CamelCase. In the updated patch: +#define PartitionTupRoutingGetToParentMap(p, i) \ +#define PartitionTupRoutingGetToChildMap(p, i) \ If the "Get" could be replaced by "Child" and "Parent", respectively, they'd sound more meaningful, imho. > I also fixed a bug where > the child to parent map was not being initialised when on conflict > transition capture was required. I added a test which was crashing the > backend but fixed the code so it works correctly. Oops, I guess you mean my omission of checking if mtstate->mt_oc_transition_capture is non-NULL in ExecInitRoutingInfo. Thanks for fixing it and adding the test case. > I also got rid of > the child_parent_map_not_required array since we now no longer need > it. The code now always initialises the maps in cases where they're > going to be required. Yes, I thought I had removed the field in my patch, but it looks like I had just removed the comment about it. > I've attached a v3 version of your patch and also v6 of the main patch > which includes the v3 patch. I've looked at v6 and spotted some minor typos. + * ResultRelInfo for, before we go making one, we check for a pre-made one s/making/make/g + /* If nobody else set the per-subplan array of maps, do so ouselves. */ I guess I'm the one to blame here for misspelling "ourselves". Since the above two are minor issues, I've fixed them myself in the attached updated version; didn't touch the macro though. Do you agree to setting this patch to "Ready for Committer" in the September CF? Thanks, Amit
Attachment
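To make the naming debate easier to follow for anyone reading along without the patch, the accessors in question presumably just hide the "the array may never have been allocated" check, along the lines of the sketch below; treat the definitions in the patch as authoritative, not this approximation:

/*
 * Approximate shape only: return partition i's conversion map, or NULL if
 * the containing array was never allocated because no map was needed.
 */
#define PartitionTupRoutingGetToParentMap(p, i) \
	((p)->child_parent_tupconv_maps != NULL ? \
	 (p)->child_parent_tupconv_maps[(i)] : NULL)

#define PartitionTupRoutingGetToChildMap(p, i) \
	((p)->parent_child_tupconv_maps != NULL ? \
	 (p)->parent_child_tupconv_maps[(i)] : NULL)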
On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > +#define PartitionTupRoutingGetToParentMap(p, i) \ > +#define PartitionTupRoutingGetToChildMap(p, i) \ > > If the "Get" could be replaced by "Child" and "Parent", respectively, > they'd sound more meaningful, imho. I did that to save 3 chars. I think putting the additional Child/Parent in the name is not really required. It's not as if we're going to have a ParentToParent or a ChildToChild map, so I thought it might be okay to assume that if it's "ToParent", that it's being converted from the child and "ToChild" seems safe to assume it's being converted from the parent. I can change it though if you feel very strongly that what I've got is no good. >> I also fixed a bug where >> the child to parent map was not being initialised when on conflict >> transition capture was required. I added a test which was crashing the >> backend but fixed the code so it works correctly. > > Oops, I guess you mean my omission of checking if > mtstate->mt_oc_transition_capture is non-NULL in ExecInitRoutingInfo. Yeah. > I've looked at v6 and spotted some minor typos. > > + * ResultRelInfo for, before we go making one, we check for a > pre-made one > > s/making/make/g I disagree, but perhaps we can just reword it a bit. I've now got: + * Every time a tuple is routed to a partition that we've yet to set the + * ResultRelInfo for, before we go to the trouble of making one, we check + * for a pre-made one in the hash table. > + /* If nobody else set the per-subplan array of maps, do so ouselves. */ > > I guess I'm the one to blame here for misspelling "ourselves". Thanks for noticing. > Since the above two are minor issues, fixed them myself in the attached > updated version; didn't touch the macro though. I've attached a v8. The only change from your v7 is in the "go making" comment. > Do you agree to setting this patch to "Ready for Committer" in the > September CF? I read through the entire patch a couple of times yesterday and saw nothing else, so yeah, I think now is a good time for someone with more authority to have a look at it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2018/08/22 21:30, David Rowley wrote: > On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> +#define PartitionTupRoutingGetToParentMap(p, i) \ >> +#define PartitionTupRoutingGetToChildMap(p, i) \ >> >> If the "Get" could be replaced by "Child" and "Parent", respectively, >> they'd sound more meaningful, imho. > > I did that to save 3 chars. I think putting the additional > Child/Parent in the name is not really required. It's not as if we're > going to have a ParentToParent or a ChildToChild map, so I thought it > might be okay to assume that if it's "ToParent", that it's being > converted from the child and "ToChild" seems safe to assume it's being > converted from the parent. I can change it though if you feel very > strongly that what I've got is no good. No strong preference as such. Maybe, let's defer to committer. >> I've looked at v6 and spotted some minor typos. >> >> + * ResultRelInfo for, before we go making one, we check for a >> pre-made one >> >> s/making/make/g > > I disagree, but perhaps we can just reword it a bit. I've now got: > > + * Every time a tuple is routed to a partition that we've yet to set the > + * ResultRelInfo for, before we go to the trouble of making one, we check > + * for a pre-made one in the hash table. Sure. I guess "to the trouble of" was missing then. :) >> + /* If nobody else set the per-subplan array of maps, do so ouselves. */ >> >> I guess I'm the one to blame here for misspelling "ourselves". > > Thanks for noticing. > >> Since the above two are minor issues, fixed them myself in the attached >> updated version; didn't touch the macro though. > > I've attached a v8. The only change from your v7 is in the "go making" comment. Thanks. >> Do you agree to setting this patch to "Ready for Committer" in the >> September CF? > > I read through the entire patch a couple of times yesterday and saw > nothing else, so yeah, I think now is a good time for someone with > more authority to have a look at it. Okay, doing it now. Thanks, Amit
On 23 August 2018 at 00:30, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached a v8. The only change from your v7 is in the "go making" comment.
v9 patch attached. Fixes conflict with 6b78231d.
-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 17 September 2018 at 21:15, David Rowley <david.rowley@2ndquadrant.com> wrote: > v9 patch attached. Fixes conflict with 6b78231d. v10 patch attached. Fixes conflict with cc2905e9. I'm not so sure we need to zero the partition_tuple_slots[] array at all since we always set a value there if there's a corresponding map stored. I considered pulling that out, but in the end, I didn't as I saw some Asserts checking it's been properly set by checking the element != NULL in nodeModifyTable.c and copy.c. Perhaps I should have just gotten rid of those Asserts along with the palloc0 and subsequent memset after the expansion of the array. I'm undecided. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
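For context on the palloc0/memset being weighed up above, growing the routing arrays on demand looks roughly like the sketch below (the allocsize/num_partitions field names are illustrative, not the patch's); the question raised is only whether the newly added tail of partition_tuple_slots really needs the zeroing:

/* Sketch only: double the per-partition arrays when they fill up. */
if (proute->num_partitions >= proute->partitions_allocsize)
{
	int		oldsize = proute->partitions_allocsize;
	int		newsize = oldsize * 2;

	proute->partitions = (ResultRelInfo **)
		repalloc(proute->partitions, sizeof(ResultRelInfo *) * newsize);
	proute->partition_tuple_slots = (TupleTableSlot **)
		repalloc(proute->partition_tuple_slots,
				 sizeof(TupleTableSlot *) * newsize);

	/* Zero just the newly added tail, since repalloc doesn't. */
	memset(&proute->partition_tuple_slots[oldsize], 0,
		   sizeof(TupleTableSlot *) * (newsize - oldsize));

	proute->partitions_allocsize = newsize;
}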
Hi David, On 2018/10/05 21:55, David Rowley wrote: > On 17 September 2018 at 21:15, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> v9 patch attached. Fixes conflict with 6b78231d. > > v10 patch attached. Fixes conflict with cc2905e9. Thanks for rebasing. > I'm not so sure we need to zero the partition_tuple_slots[] array at > all since we always set a value there is there's a corresponding map > stored. I considered pulling that out, but in the end, I didn't as I > saw some Asserts checking it's been properly set by checking the > element != NULL in nodeModifyTable.c and copy.c. Perhaps I should > have just gotten rid of those Asserts along with the palloc0 and > subsequent memset after the expansion of the array. I'm undecided. Maybe it's a good thing that it's doing the same thing as with the child_to_parent_maps array, which is to zero-init it when allocated. Thanks, Amit
On 9 October 2018 at 15:49, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/10/05 21:55, David Rowley wrote: >> I'm not so sure we need to zero the partition_tuple_slots[] array at >> all since we always set a value there is there's a corresponding map >> stored. I considered pulling that out, but in the end, I didn't as I >> saw some Asserts checking it's been properly set by checking the >> element != NULL in nodeModifyTable.c and copy.c. Perhaps I should >> have just gotten rid of those Asserts along with the palloc0 and >> subsequent memset after the expansion of the array. I'm undecided. > > Maybe it's a good thing that it's doing the same thing as with the > child_to_parent_maps array, which is to zero-init it when allocated. Perhaps, but the maps do need to be zeroed; the partition_tuple_slots array does not, since we only access it when the parent-to-child map is set. In any case, since PARTITION_ROUTING_INITSIZE is just 8, it'll likely not save much since it's really just saving a memset(..., 0, 64) for single-row INSERTs on a 64-bit machine. So likely it won't save more than a bunch of nanoseconds. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
We see a quite prohibitive 5-6x slowdown with native partitioning in comparison to trigger-based partitioning in PG9.5. This is clearly visible with highly parallel inserts (I can share flame diagrams comparing the two). This basically excludes native partitioning from being used by us. Do you think your changes could be backported to PG10? We checked, and this would need quite a number of changes, but given the weight of this change maybe it could be considered? Thanks Krzysztof -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Hi, On 2018/10/15 19:04, Krzysztof Nienartowicz wrote: > > We see quite prohibitive 5-6x slowdown with native partitioning on in > comparison to trigger based in PG9.5. > This is clearly visible with highly parallel inserts (Can share > flamediagrams comparing the two). > > This basically excludes native partitioning from being used by us. Do you > think your changes could be backported to PG10? - we checked and this would > need quite a number of changes but given the weight of this change maybe it > could be considered? It's unfortunate that PG 10's partitioning cannot be used for your use case, but I don't think such a major refactoring will be back-ported to 10 or 11. :-( Thanks, Amit
On 15 October 2018 at 23:04, Krzysztof Nienartowicz <krzysztof.nienartowicz@gmail.com> wrote: > We see quite prohibitive 5-6x slowdown with native partitioning on in > comparison to trigger based in PG9.5. > This is clearly visible with highly parallel inserts (Can share > flamediagrams comparing the two). Does the 0001 patch here fix the problem? I imagined that it would be the locking of all partitions that would have killed the performance. > This basically excludes native partitioning from being used by us. Do you > think your changes could be backported to PG10? - we checked and this would > need quite a number of changes but given the weight of this change maybe it > could be considered? It's very unlikely to happen, especially so with the 0002 patch, which I've so far just attached as a demonstration of where the performance could end up. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
In the end we hacked the code to re-enable triggers on partitioned tables and switch off the native insert code on partitioned tables. Quite hackish, and it would be nice to have it fixed in a more natural manner. Yes, it looked like locking, but not only that: in ExecSetupPartitionTupleRouting, ExecOpenIndices/find_all_inheritors looked dominant, and also convert_tuples_by_name, though I'm not sure whether the last one was an artifact of perf sampling. Will check the patch 0001, thanks. On Tue, Oct 23, 2018 at 12:36 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On 15 October 2018 at 23:04, Krzysztof Nienartowicz > <krzysztof.nienartowicz@gmail.com> wrote: > > We see quite prohibitive 5-6x slowdown with native partitioning on in > > comparison to trigger based in PG9.5. > > This is clearly visible with highly parallel inserts (Can share > > flamediagrams comparing the two). > > Does the 0001 patch here fix the problem? I imagined that it would be > the locking of all partitions that would have killed the performance. > > > This basically excludes native partitioning from being used by us. Do you > > think your changes could be backported to PG10? - we checked and this would > > need quite a number of changes but given the weight of this change maybe it > > could be considered? > > It's very unlikely to happen, especially so with the 0002 patch, which > I've so far just attached as a demonstration of where the performance > could end up. > > -- > David Rowley http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
On 23 October 2018 at 11:55, Krzysztof Nienartowicz <krzysztof.nienartowicz@gmail.com> wrote: > In the end we hacked the code to re-enable triggers on partitioned > tables and switch off native insert code on partitioned tables. Quite > hackish and would be nice to have it fixed in a more natural manner. > Yes, it looked like locking but not only - > ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors > looked like dominant and also convert_tuples_by_name but not sure if > the last one was not an artifact of perf sampling. The ExecOpenIndices overhead was likely fixed in edd44738bc8 (PG11). find_all_inheritors does obtain the partition locks during the call, so the slowness there is most likely down to the locking rather than the scanning of pg_inherits. 42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12). > Will check the patch 0001, thanks. I more meant that it might be 0002 that fixes the issue for you. I just wanted to check if you'd tried 0001 and found that the problem was fixed with that alone. Do you mind sharing how many partitions you have and how many columns the partitioned table has? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 23, 2018 at 4:02 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On 23 October 2018 at 11:55, Krzysztof Nienartowicz > <krzysztof.nienartowicz@gmail.com> wrote: > > In the end we hacked the code to re-enable triggers on partitioned > > tables and switch off native insert code on partitioned tables. Quite > > hackish and would be nice to have it fixed in a more natural manner. > > Yes, it looked like locking but not only - > > ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors > > looked like dominant and also convert_tuples_by_name but not sure if > > the last one was not an artifact of perf sampling. > > The ExecOpenIndices was likely fixed in edd44738bc8 (PG11). > find_all_inheritors does obtain the partition locks during the call, > so the slowness there most likely down to the locking rather than the > scanning of pg_inherits. > > 42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12). > > > Will check the patch 0001, thanks. > > I more meant that it might be 0002 that fixes the issue for you. I > just wanted to check if you'd tried 0001 and found that the problem > was fixed with that alone. Will it apply on PG10? (In fact the code base is PG XL10 but src/backend/executor/nodeModifyTable.c is pure PG) > > Do you mind sharing how many partitions you have and how many columns > the partitioned table has? We have 2 level partitioning: 10 (possibly changing, up to say 20-30) range partitions at first level and 20 range at the second level. We have potentially hundreds processes inserting at the same time. > > > -- > David Rowley http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
To complement the info: number of columns varies from 20 to 100 but some of the columns are composite types or arrays of composite types. The flamegraph after applying changes from patch 0002 can be seen here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP shows most of the time is spent in the convert_tuples_by_name (PG10 version). Thanks On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz <krzysztof.nienartowicz@gmail.com> wrote: > > On Tue, Oct 23, 2018 at 4:02 AM David Rowley > <david.rowley@2ndquadrant.com> wrote: > > > > On 23 October 2018 at 11:55, Krzysztof Nienartowicz > > <krzysztof.nienartowicz@gmail.com> wrote: > > > In the end we hacked the code to re-enable triggers on partitioned > > > tables and switch off native insert code on partitioned tables. Quite > > > hackish and would be nice to have it fixed in a more natural manner. > > > Yes, it looked like locking but not only - > > > ExecSetupPartitionTupleRouting: ExecOpenIndices/find_all_inheritors > > > looked like dominant and also convert_tuples_by_name but not sure if > > > the last one was not an artifact of perf sampling. > > > > The ExecOpenIndices was likely fixed in edd44738bc8 (PG11). > > find_all_inheritors does obtain the partition locks during the call, > > so the slowness there most likely down to the locking rather than the > > scanning of pg_inherits. > > > > 42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12). > > > > > Will check the patch 0001, thanks. > > > > I more meant that it might be 0002 that fixes the issue for you. I > > just wanted to check if you'd tried 0001 and found that the problem > > was fixed with that alone. > > Will it apply on PG10? (In fact the code base is PG XL10 but > src/backend/executor/nodeModifyTable.c is pure PG) > > > > > Do you mind sharing how many partitions you have and how many columns > > the partitioned table has? > > We have 2 level partitioning: 10 (possibly changing, up to say 20-30) > range partitions at first level and 20 range at the second level. We > have potentially hundreds processes inserting at the same time. > > > > > > > -- > > David Rowley http://www.2ndQuadrant.com/ > > PostgreSQL Development, 24x7 Support, Training & Services
On 2018/10/30 8:41, Krzysztof Nienartowicz wrote: > On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz > <krzysztof.nienartowicz@gmail.com> wrote: >> On Tue, Oct 23, 2018 at 4:02 AM David Rowley >> <david.rowley@2ndquadrant.com> wrote: >>> >>> I more meant that it might be 0002 that fixes the issue for you. I >>> just wanted to check if you'd tried 0001 and found that the problem >>> was fixed with that alone. >> >> Will it apply on PG10? (In fact the code base is PG XL10 but >> src/backend/executor/nodeModifyTable.c is pure PG) > > To complement the info: number of columns varies from 20 to 100 but > some of the columns are composite types or arrays of composite types. > > The flamegraph after applying changes from patch 0002 can be seen > here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP > shows most of the time is spent in the > > convert_tuples_by_name (PG10 version). As David mentioned, the patches on this thread are meant to be applied against the latest PG 12 HEAD. The insert tuple routing code has undergone quite a bit of refactoring in PG 11, which itself should have gotten rid of at least some of the hot-spots that are seen in the flame graph you shared. What happens in PG 10 (as seen in the flame graph) is that ExecSetupPartitionTupleRouting initializes information for *all* partitions, which happens even before the 1st tuple is processed. So if there are many partitions with many columns, a lot of processing happens in ExecSetupPartitionTupleRouting. PG 11 changes it such that the partition info is only initialized after the 1st tuple is processed, and only for the partition that's targeted, but some overheads still remain in that code. The patches on this thread are meant to address those overheads. Unfortunately, I don't think the community will agree to back-porting the changes in PG 11 and the patches being discussed here to PG 10. Thanks, Amit
Thanks for both clarifications! I skimmed through the commits related to Inserts with partitioning since 10 and indeed - while not impossible it seems like quite some work to merge them into PG 10 codebase. We might consider preparing the patch in-house as otherwise PG 10 based partitioning is a major regression and we'd have to go back to inheritance based one - which seems the best option for now. Regards, Krzysztof On Tue, Oct 30, 2018 at 1:54 AM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > On 2018/10/30 8:41, Krzysztof Nienartowicz wrote: > > On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz > > <krzysztof.nienartowicz@gmail.com> wrote: > >> On Tue, Oct 23, 2018 at 4:02 AM David Rowley > >> <david.rowley@2ndquadrant.com> wrote: > >>> > >>> I more meant that it might be 0002 that fixes the issue for you. I > >>> just wanted to check if you'd tried 0001 and found that the problem > >>> was fixed with that alone. > >> > >> Will it apply on PG10? (In fact the code base is PG XL10 but > >> src/backend/executor/nodeModifyTable.c is pure PG) > > > > To complement the info: number of columns varies from 20 to 100 but > > some of the columns are composite types or arrays of composite types. > > > > The flamegraph after applying changes from patch 0002 can be seen > > here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP > > shows most of the time is spent in the > > > > convert_tuples_by_name (PG10 version). > > As David mentioned, the patches on this thread are meant to be applied > against latest PG 12 HEAD. The insert tuple routing code has undergone > quite a bit of refactoring in PG 11, which itself should have gotten rid > of at least some of the hot-spots that are seen in the flame graph you shared. > > What happens in PG 10 (as seen in the flame graph) is that > ExecSetupPartitionTupleRouting initializes information for *all* > partitions, which happens even before the 1st tuple processed. So if > there are many partitions and with many columns, a lot of processing > happens in ExecSetupPartitionTupleRouting. PG 11 changes it such that the > partition info is only initialized after the 1st tuple is processed and > only of the partition that's targeted, but some overheads still remain in > that code. The patches on this thread are meant to address those overheads. > > Unfortunately, I don't think the community will agree to back-porting the > changes in PG 11 and the patches being discussed here to PG 10. > > Thanks, > Amit >
On Wed, Aug 22, 2018 at 8:30 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > +#define PartitionTupRoutingGetToParentMap(p, i) \ > > +#define PartitionTupRoutingGetToChildMap(p, i) \ > > > > If the "Get" could be replaced by "Child" and "Parent", respectively, > > they'd sound more meaningful, imho. > > I did that to save 3 chars. I think putting the additional > Child/Parent in the name is not really required. It's not as if we're > going to have a ParentToParent or a ChildToChild map, so I thought it > might be okay to assume that if it's "ToParent", that it's being > converted from the child and "ToChild" seems safe to assume it's being > converted from the parent. I can change it though if you feel very > strongly that what I've got is no good. I'm not sure exactly what is best here, but it seems unlikely to me that somebody is going to read that macro name and think, oh, that means "get the to-parent map". They are more likely to be confused by the juxtaposition of "get" and "to". I think a better way to shorten the name would be to truncate the PartitionTupRouting() prefix in some way, e.g. dropping TupRouting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1 November 2018 at 06:45, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Aug 22, 2018 at 8:30 AM David Rowley > <david.rowley@2ndquadrant.com> wrote: >> On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> > +#define PartitionTupRoutingGetToParentMap(p, i) \ >> > +#define PartitionTupRoutingGetToChildMap(p, i) \ >> > >> > If the "Get" could be replaced by "Child" and "Parent", respectively, >> > they'd sound more meaningful, imho. >> >> I did that to save 3 chars. I think putting the additional >> Child/Parent in the name is not really required. It's not as if we're >> going to have a ParentToParent or a ChildToChild map, so I thought it >> might be okay to assume that if it's "ToParent", that it's being >> converted from the child and "ToChild" seems safe to assume it's being >> converted from the parent. I can change it though if you feel very >> strongly that what I've got is no good. > > I'm not sure exactly what is best here, but it seems to unlikely to me > that somebody is going to read that macro name and think, oh, that > means "get the to-parent map". They are more like be confused by the > juxtaposition of "get" and "to". > > I think a better way to shorten the name would be to truncate the > PartitionTupRouting() prefix in some way, e.g. dropping TupRouting. Thanks for chipping in on this. I agree. I don't think "TupRouting" really needs to be in the name. Probably "To" can also just become "2" and we can put back the Parent/Child before that. I've attached v11, which does this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2018/11/01 8:58, David Rowley wrote: > On 1 November 2018 at 06:45, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Aug 22, 2018 at 8:30 AM David Rowley >> <david.rowley@2ndquadrant.com> wrote: >>> On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>>> +#define PartitionTupRoutingGetToParentMap(p, i) \ >>>> +#define PartitionTupRoutingGetToChildMap(p, i) \ >>>> >>>> If the "Get" could be replaced by "Child" and "Parent", respectively, >>>> they'd sound more meaningful, imho. >>> >>> I did that to save 3 chars. I think putting the additional >>> Child/Parent in the name is not really required. It's not as if we're >>> going to have a ParentToParent or a ChildToChild map, so I thought it >>> might be okay to assume that if it's "ToParent", that it's being >>> converted from the child and "ToChild" seems safe to assume it's being >>> converted from the parent. I can change it though if you feel very >>> strongly that what I've got is no good. >> >> I'm not sure exactly what is best here, but it seems to unlikely to me >> that somebody is going to read that macro name and think, oh, that >> means "get the to-parent map". They are more like be confused by the >> juxtaposition of "get" and "to". >> >> I think a better way to shorten the name would be to truncate the >> PartitionTupRouting() prefix in some way, e.g. dropping TupRouting. > > Thanks for chipping in on this. > > I agree. I don't think "TupRouting" really needs to be in the name. > Probably "To" can also just become "2" and we can put back the > Parent/Child before that. Agree that "TupRouting" can go, but "To" is not so long that it needs to become "2". Thanks, Amit
On 1 November 2018 at 13:35, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/11/01 8:58, David Rowley wrote: >> On 1 November 2018 at 06:45, Robert Haas <robertmhaas@gmail.com> wrote: >>> I think a better way to shorten the name would be to truncate the >>> PartitionTupRouting() prefix in some way, e.g. dropping TupRouting. >> >> Thanks for chipping in on this. >> >> I agree. I don't think "TupRouting" really needs to be in the name. >> Probably "To" can also just become "2" and we can put back the >> Parent/Child before that. > > Agree that "TupRouting" can go, but "To" is not too long for using "2" > instead of it. Okay. Here's a version with "2" put back to "To"... It's great to know the patch is now so perfect that we've only the macro naming left to debate ;-) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 01/11/2018 14:30, David Rowley wrote: > On 1 November 2018 at 13:35, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2018/11/01 8:58, David Rowley wrote: >>> [...] >>> I agree. I don't think "TupRouting" really needs to be in the name. >>> Probably "To" can also just become "2" and we can put back the >>> Parent/Child before that. >> Agree that "TupRouting" can go, but "To" is not too long for using "2" >> instead of it. I think that while '2' may only be one character less than 'to', the character '2' stands out more. However, can't say I could claim this was of the utmost importance! > Okay. Here's a version with "2" put back to "To"... [...]
On 2018/11/01 10:30, David Rowley wrote: > It's great to know the patch is now so perfect that we've only the > macro naming left to debate ;-) I looked over v12 again and noticed a couple minor issues. + * table then we store the index into parenting + * PartitionTupleRouting 'partition_dispatch_info' array. An s/PartitionTupleRouting/PartitionTupleRouting's/g Also, I got a bit concerned about "parenting". Does it mean something like "enclosing", because the PartitionDispatch is a member of PartitionTupleRouting? I got concerned because using "parent" like this may be confusing as this is the partitioning code we're talking about, where "parent" is generally used to mean "parent" table. + * the partitioned table that's the target of the command. If we must + * route tuple via some sub-partitioned table, then the PartitionDispatch + * for those is only built the first time it's required. ... via some sub-partitioned table"s" Or perhaps rewrite a bit as: If we must route the tuple via some sub-partitioned table, then its PartitionDispatch is built the first time it's required. The macro naming discussion got me thinking today about the macro itself. It encapsulates access to the various PartitionTupleRouting arrays containing the maps, but maybe we've got the interface of tuple routing a bit (maybe a lot given this thread!) wrong to begin with. Instead of ExecFindPartition returning indexes into various PartitionTupleRouting arrays and its callers then using those indexes to fetch various objects from those arrays, why doesn't it return those objects itself? Although we made sure that the callers don't need to worry about the meaning of these indexes changing with this patch, it still seems a bit odd for them to have to go back to those arrays to get various objects. How about we change ExecFindPartition's interface so that it returns the ResultRelInfo, the two maps, and the partition slot? So, the arrays simply become a cache for ExecFindPartition et al and are no longer accessed outside execPartition.c. Although that makes the interface of ExecFindPartition longer, I think it reduces overall complexity. I've implemented that in the attached patch 1-revise-ExecFindPartition-interface.patch. Also, since all members of PartitionTupleRouting are only accessed within execPartition.c save root_tuple_slot, we can move it to execPartition.c to make its internals private, after doing something about root_tuple_slot. Looking at the code related to root_tuple_slot, it seems the field really belongs in ModifyTableState, because it has nothing to do with routing. Attached 2-make-PartitionTupleRouting-private.patch does that. These patches 1 and 2 apply on top of v12-0001.. patch. Thanks, Amit
Attachment
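As a rough sketch of the interface change being floated above (the parameter list here is invented for illustration and is not the signature from the attached 1-revise-ExecFindPartition-interface.patch), ExecFindPartition would hand back the objects its callers need instead of an array index:

/* Illustrative signature only */
extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
										ResultRelInfo *rootResultRelInfo,
										PartitionTupleRouting *proute,
										TupleTableSlot *slot,
										EState *estate,
										TupleConversionMap **root_to_part_map,	/* out */
										TupleConversionMap **part_to_root_map,	/* out */
										TupleTableSlot **partition_slot);		/* out */

CopyFrom and ExecPrepareTupleRouting, the two existing callers, would then receive in one call everything they currently re-fetch from the PartitionTupleRouting arrays.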
On 1 November 2018 at 22:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/11/01 10:30, David Rowley wrote: >> It's great to know the patch is now so perfect that we've only the >> macro naming left to debate ;-) > > I looked over v12 again and noticed a couple minor issues. > > + * table then we store the index into parenting > + * PartitionTupleRouting 'partition_dispatch_info' array. An > > s/PartitionTupleRouting/PartitionTupleRouting's/g > > Also, I got a bit concerned about "parenting". Does it mean something > like "enclosing", because the PartitionDispatch is a member of > PartitionTupleRouting? I got concerned because using "parent" like this > may be confusing as this is the partitioning code we're talking about, > where "parent" is generally used to mean "parent" table. > > + * the partitioned table that's the target of the command. If we must > + * route tuple via some sub-partitioned table, then the PartitionDispatch > + * for those is only built the first time it's required. > > ... via some sub-partitioned table"s" > > Or perhaps rewrite a bit as: > > If we must route the tuple via some sub-partitioned table, then its > PartitionDispatch is built the first time it's required. I've attached v13 which hopefully addresses these. > The macro naming discussion got me thinking today about the macro itself. > It encapsulates access to the various PartitionTupleRouting arrays > containing the maps, but maybe we've got the interface of tuple routing a > bit (maybe a lot given this thread!) wrong to begin with. Instead of > ExecFindPartition returning indexes into various PartitionTupleRouting > arrays and its callers then using those indexes to fetch various objects > from those arrays, why doesn't it return those objects itself? Although > we made sure that the callers don't need to worry about the meaning of > these indexes changing with this patch, it still seems a bit odd for them > to have to go back to those arrays to get various objects. > > How about we change ExecFindPartition's interface so that it returns the > ResultRelInfo, the two maps, and the partition slot? So, the arrays > simply become a cache for ExecFindPartition et al and are no longer > accessed outside execPartition.c. Although that makes the interface of > ExecFindPartition longer, I think it reduces overall complexity. I don't really think stuffing values into a bunch of output parameters is much of an improvement. I'd rather allow callers to fetch what they need using the index we return. Most callers don't need to know about the child to parent maps, so it seems nicer for those places not to have to invent a dummy variable to pass along to ExecFindPartition() so it can needlessly populate it for them. Perhaps a better design would be to, instead of having random special partitioned-table-only fields in ResultRelInfo, just have an extra struct there that contains the extra partition information which would include the translation maps and then just return the ResultRelInfo and allow callers to look up any extra details they require. I've not looked into this in detail, but I think the committer work that's required for the patch as it is today is already quite significant. I'm not keen on warding any willing one off by making the commit job any harder. I agree that it would be good to stabilise the API for all this partitioning code sometime, but I don't believe it needs to be done all in one commit. My intent here is to improve performance of INSERT and UPDATE on partitioned tables. 
Perhaps we can shape some API redesign later in the release cycle. What do you think? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2018/11/04 19:07, David Rowley wrote: > On 1 November 2018 at 22:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I've attached v13 which hopefully addresses these. Thank you for updating the patch. >> The macro naming discussion got me thinking today about the macro itself. >> It encapsulates access to the various PartitionTupleRouting arrays >> containing the maps, but maybe we've got the interface of tuple routing a >> bit (maybe a lot given this thread!) wrong to begin with. Instead of >> ExecFindPartition returning indexes into various PartitionTupleRouting >> arrays and its callers then using those indexes to fetch various objects >> from those arrays, why doesn't it return those objects itself? Although >> we made sure that the callers don't need to worry about the meaning of >> these indexes changing with this patch, it still seems a bit odd for them >> to have to go back to those arrays to get various objects. >> >> How about we change ExecFindPartition's interface so that it returns the >> ResultRelInfo, the two maps, and the partition slot? So, the arrays >> simply become a cache for ExecFindPartition et al and are no longer >> accessed outside execPartition.c. Although that makes the interface of >> ExecFindPartition longer, I think it reduces overall complexity. > > I don't really think stuffing values into a bunch of output parameters > is much of an improvement. I'd rather allow callers to fetch what they > need using the index we return. Most callers don't need to know about > the child to parent maps, so it seems nicer for those places not to > have to invent a dummy variable to pass along to ExecFindPartition() > so it can needlessly populate it for them. Well, if a caller finds a partition using ExecFindPartition, it's going to need to fetch those other objects anyway. Both of its callers that exist today, CopyFrom and ExecPrepareTupleRouting, fetch both maps and the slot in the same code block as ExecFindPartition. > Perhaps a better design would be to instead of having random special > partitioned-table-only fields in ResultRelInfo, just have an extra > struct there that contains the extra partition information which would > include the translation maps and then just return the ResultRelInfo > and allow callers to lookup any extra details they require. IIUC, you're saying that we could introduce a new struct that contains auxiliary information needed by partition tuple routing (maps, slot, etc. for tuple conversion) and add a new ResultRelInfo member of that struct type. That way, there is no need to return them separately or return an index to access them from their arrays. I guess we won't even need the arrays we have now. I think that might be a good idea and simplifies things significantly. It reminds me of how ResultRelInfo grew a ri_onConflict member of type OnConflictSetState [1]. We decided to go that way, as opposed to the earlier approach of having arrays of num_partitions length in ModifyTableState or PartitionTupleRouting that contained ON CONFLICT related objects for individual partitions. > I've not > looked into this in detail, but I think the committer work that's > required for the patch as it is today is already quite significant. > I'm not keen on warding any willing one off by making the commit job > any harder. I agree that it would be good to stabilise the API for > all this partitioning code sometime, but I don't believe it needs to > be done all in one commit. 
> My intent here is to improve performance of > INSERT and UPDATE on partitioned tables. Perhaps we can shape some API > redesign later in the release cycle. What do you think? I do suspect that simplifying ExecFindPartition's interface as part of this patch will make a committer's life easier, as the resulting interface is simpler, especially if we revise it like you suggest above. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=555ee77a9
On 11/4/18 5:07 AM, David Rowley wrote: > I've attached v13 which hopefully addresses these. I ran a test against the INSERT case using a table with 64 hash partitions. Non-partitioned table: ~73k TPS Master: ~25k TPS 0001: ~26k TPS 0001 + 0002: ~68k TPS The profile of 0001 shows that almost all of ExecSetupPartitionTupleRouting() is find_all_inheritors(), hence the last test with a rebase of the original v1-0002 patch. Best regards, Jesper
On 5 November 2018 at 20:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/11/04 19:07, David Rowley wrote: >> Perhaps a better design would be to instead of having random special >> partitioned-table-only fields in ResultRelInfo, just have an extra >> struct there that contains the extra partition information which would >> include the translation maps and then just return the ResultRelInfo >> and allow callers to lookup any extra details they require. > > IIUC, you're saying that we could introduce a new struct that contains > auxiliary information needed by partition tuple routing (maps, slot, etc. for > tuple conversion) and add a new ResultRelInfo member of that struct type. > That way, there is no need to return them separately or return an index to > access them from their arrays. I guess we won't even need the arrays we > have now. I think that might be a good idea and simplifies things > significantly. I've attached a patch which does this. It adds a new struct named PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays out of PartitionTupleRouting. There are fields for each of what these arrays used to store inside the PartitionRoutingInfo struct. While doing this I kept glancing back over at ModifyTableState and at the mt_per_subplan_tupconv_maps array. I wondered if it would be better to make the PartitionRoutingInfo a bit more generic, perhaps call it TupleConversionInfo and have fields like ti_ToGeneric and ti_FromGeneric, with the idea that "generic" would be the root partition or the first subplan for inheritance updates. This would allow us to get rid of a good chunk of code inside nodeModifyTable.c. However, I ended up not doing this and left PartitionRoutingInfo to be specifically for Root to Partition conversion. Also, on the topic about what to name the conversion maps from a few days ago: after looking at this a bit more I decided that having them named ParentToChild and ChildToParent is misleading. If the child is the child of some sub-partitioned table then the parent that the map is talking about is not the partition's parent, but the root partitioned table. So really RootToPartition and PartitionToRoot seem like much more accurate names for the maps. I made a few other changes along the way; I thought that ExecFindPartition() would be a good candidate to take on the responsibility of validating that the partition is valid for INSERTs when it uses a partition out of the subplan_resultrel_hash. I thought it was better to check this once when we're in the code path of grabbing the ResultRelInfo out of that hash table rather than in a code path that must check if it's been done already each time we route a tuple into a partition. It also allowed me to get rid of ri_PartitionReadyForRouting. I also moved the call to ExecInitRoutingInfo() into ExecFindPartition() which allowed me to make that function static, which could result in the generation of slightly more optimal compiled code. Please find attached the v14 patch. Rather nicely, git --stat reports a net reduction in code (with the 48 lines of new tests included): 11 files changed, 632 insertions(+), 660 deletions(-) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
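For readers skimming without the v14 patch, the grouping described above amounts to something like the struct below, hung off ResultRelInfo and left NULL for result relations that are not tuple-routing targets. The field names are approximations for illustration, not quotations from the patch:

/* Sketch only: per-partition routing state kept with the ResultRelInfo */
typedef struct PartitionRoutingInfo
{
	/* Convert tuples from the root table's rowtype to this partition's
	 * rowtype, or NULL if no conversion is needed. */
	TupleConversionMap *pi_RootToPartitionMap;

	/* The reverse map, used for RETURNING and for transition tuple
	 * capture; NULL if not needed. */
	TupleConversionMap *pi_PartitionToRootMap;

	/* Slot to hold a tuple after conversion to this partition's rowtype. */
	TupleTableSlot *pi_PartitionTupleSlot;
} PartitionRoutingInfo;

/* and, in ResultRelInfo (again only as illustration):
 *		PartitionRoutingInfo *ri_PartitionInfo;
 */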
Hi David, On 11/7/18 6:46 AM, David Rowley wrote: > I've attached a patch which does this. It adds a new struct named > PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays > out of PartitionTupleRouting. There are fields for each of what these > arrays used to store inside the PartitionRoutingInfo struct. > > While doing this I kept glancing back over at ModifyTableState and at > the mt_per_subplan_tupconv_maps array. I wondered if it would be > better to make the PartitionRoutingInfo a bit more generic, perhaps > call it TupleConversionInfo and have fields like ti_ToGeneric and > ti_FromGeneric, with the idea that "generic" would be the root > partition or the first subplan for inheritance updates. This would > allow us to get rid of a good chunk of code inside nodeModifyTable.c. > However, I ended up not doing this and left PartitionRoutingInfo to be > specifically for Root to Partition conversion. > Yeah, it doesn't necessarily have to be part of this patch. > Also, on the topic about what to name the conversion maps from a few > days ago; After looking at this a bit more I decided that having them > named ParentToChild and ChildToParent is misleading. If the child is > the child of some sub-partitioned table then the parent that the map > is talking about is not the partition's parent, but the root > partitioned table. So really RootToPartition and PartitionToRoot seem > like much more accurate names for the maps. > Agreed. > I made a few other changes along the way; I thought that > ExecFindPartition() would be a good candidate to take on the > responsibility of validating the partition is valid for INSERTs when > it uses a partition out of the subplan_resultrel_hash. I thought it > was better to check this once when we're in the code path of grabbing > the ResultRelInfo out that hash table rather than in a code path that > must check if it's been done already each time we route a tuple into a > partition. It also allowed me to get rid of > ri_PartitionReadyForRouting. I also moved the call to > ExecInitRoutingInfo() into ExecFindPartition() which allowed me to > make that function static, which could result in the generation of > slightly more optimal compiled code. > > Please find attached the v14 patch. > Passes check-world, and has detailed documentation about the changes :) Best regards, Jesper
On 2018/11/07 20:46, David Rowley wrote: > On 5 November 2018 at 20:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2018/11/04 19:07, David Rowley wrote: >>> Perhaps a better design would be to instead of having random special >>> partitioned-table-only fields in ResultRelInfo, just have an extra >>> struct there that contains the extra partition information which would >>> include the translation maps and then just return the ResultRelInfo >>> and allow callers to lookup any extra details they require. >> >> IIUC, you're saying that we could introduce a new struct that contains >> auxiliary information needed by partition pruning (maps, slot, etc. for >> tuple conversion) and add a new ResultRelInfo member of that struct type. >> That way, there is no need to return them separately or return an index to >> access them from their arrays. I guess we won't even need the arrays we >> have now. I think that might be a good idea and simplifies things >> significantly. > > I've attached a patch which does this. Thank you for updating the patch this way. > It adds a new struct named > PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays > out of PartitionTupleRouting. There are fields for each of what these > arrays used to store inside the PartitionRoutingInfo struct. > > While doing this I kept glancing back over at ModifyTableState and at > the mt_per_subplan_tupconv_maps array. I wondered if it would be > better to make the PartitionRoutingInfo a bit more generic, perhaps > call it TupleConversionInfo and have fields like ti_ToGeneric and > ti_FromGeneric, with the idea that "generic" would be the root > partition or the first subplan for inheritance updates. This would > allow us to get rid of a good chunk of code inside nodeModifyTable.c. > However, I ended up not doing this and left PartitionRoutingInfo to be > specifically for Root to Partition conversion. I think it's good that you left mt_per_subplan_tupconv_maps out of this. UPDATE tuple routing can be said to have two steps: the first step, a tiny one, converts the tuple that needs to be routed from the source partition's rowtype to the root's rowtype so that tuple routing proper can begin, and the second is the actual tuple routing carried out using PartitionTupleRouting. The first step is handled by nodeModifyTable.c and so any data structures related to it should be in ModifyTableState. Actually, as I also proposed upthread, we should move root_tuple_slot from PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because it's part of the first step described above that has nothing to do with partition tuple routing proper. We can make PartitionTupleRouting private to execPartition.c if we do that. > Also, on the topic about what to name the conversion maps from a few > days ago; After looking at this a bit more I decided that having them > named ParentToChild and ChildToParent is misleading. If the child is > the child of some sub-partitioned table then the parent that the map > is talking about is not the partition's parent, but the root > partitioned table. So really RootToPartition and PartitionToRoot seem > like much more accurate names for the maps. Makes sense. :) > I made a few other changes along the way; I thought that > ExecFindPartition() would be a good candidate to take on the > responsibility of checking that the partition is valid for INSERTs when > it uses a partition out of the subplan_resultrel_hash. 
I thought it > was better to check this once when we're in the code path of grabbing > the ResultRelInfo out of that hash table rather than in a code path that > must check if it's been done already each time we route a tuple into a > partition. It also allowed me to get rid of > ri_PartitionReadyForRouting. Ah, I too had tried to remove ri_PartitionReadyForRouting, but had to give up on that idea because I didn't think of moving the steps that need to be performed before setting it to true into that block of code in ExecFindPartition. > I also moved the call to > ExecInitRoutingInfo() into ExecFindPartition() which allowed me to > make that function static, which could result in the generation of > slightly more optimal compiled code. > > Please find attached the v14 patch. > > Rather nicely, git --stat reports a net reduction in code (even with > the 48 lines of new tests included) > > 11 files changed, 632 insertions(+), 660 deletions(-) That's nice! I didn't find anything quite significant to complain about, except just one line: + * Initially we must only setup 1 PartitionDispatch object; the one for setup -> set up Thanks, Amit
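To illustrate the two steps Amit describes for UPDATE tuple routing, here is a hedged pseudo-C sketch; ConvertTupleToRoot is a placeholder name, the ExecFindPartition() signature is simplified, and the surrounding ModifyTable variables are assumed to be in scope, so this is not the patch's actual code.

    /*
     * Step 1 (nodeModifyTable.c): convert the tuple coming out of the source
     * partition into the root table's rowtype, using the per-subplan map and
     * a slot owned by ModifyTableState.
     */
    slot = ConvertTupleToRoot(slot,
                              mtstate->mt_per_subplan_tupconv_maps[map_index],
                              mtstate->mt_root_tuple_slot);

    /*
     * Step 2 (execPartition.c): route the root-rowtype tuple to the target
     * leaf partition; the remaining conversion state hangs off the chosen
     * partition's ResultRelInfo.
     */
    partRelInfo = ExecFindPartition(mtstate, rootResultRelInfo, proute, slot, estate);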
On 8 November 2018 at 20:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Actually, as I also proposed upthread, we should move root_tuple_slot from > PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because > it's part of the first step described above that has nothing to do with > partition tuple routing proper. We can make PartitionTupleRouting private > to execPartition.c if we do that. Okay, makes sense. I've changed things around so that PartitionTupleRouting is now private to execPartition.c > I didn't find anything quite significant to complain about, except just > one line: > > + * Initially we must only setup 1 PartitionDispatch object; the one for > > setup -> set up Changed too. I've attached v15 and a delta from v14 to ease re-review. I also ran pgindent on this version. That's not part of the delta but is in the main patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Nov 8, 2018 at 6:28 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > I've attached v15 and a delta from v14 to ease re-review. > > I also ran pgindent on this version. That's not part of the delta but > is in the main patch. Did you notice http://postgr.es/m/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03 ? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9 November 2018 at 00:28, David Rowley <david.rowley@2ndquadrant.com> wrote: > I've attached v15 and a delta from v14 to ease re-review. I just revived the 0002 patch, which is still just for testing at this stage. There was mention over on [1] about removing the find_all_inheritors() call. Also some benchmarks from v14 with 0001+0002. Setup:

DROP TABLE hashp;
CREATE TABLE hashp (a INT) PARTITION BY HASH (a);
SELECT 'CREATE TABLE hashp'||x::Text || ' PARTITION OF hashp FOR VALUES WITH (modulus 10000, remainder ' || x::text || ');' from generate_Series(0,9999) x;
\gexec

(0 partitions is a non-partitioned table) fsync=off

Partitions  Patched  Unpatched
         0    23672      23785
        10    22794      18385
       100    22392       8541
      1000    22419       1159
     10000    22195        101

[1] https://www.postgresql.org/message-id/CA+TgmoZGJsy-nRFnzurhZQJtHdDh5fzJKvbvhS0byN6_46pB9Q@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2018/11/08 20:28, David Rowley wrote: > On 8 November 2018 at 20:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Actually, as I also proposed upthread, we should move root_tuple_slot from >> PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because >> it's part of the first step described above that has nothing to do with >> partition tuple routing proper. We can make PartitionTupleRouting private >> to execPartition.c if we do that. > > Okay, makes sense. I've changed things around so that PartitionTupleRouting > is now private to execPartition.c Thank you. I have a comment regarding how you chose to make PartitionTupleRouting private. Using the v14_to_v15 diff, I could quickly see that there are many diffs changing PartitionTupleRouting to struct PartitionTupleRouting, but they would be unnecessary if you had added the following in execPartition.h, as my upthread had done. -/* See execPartition.c for the definition. */ +/* See execPartition.c for the definitions. */ typedef struct PartitionDispatchData *PartitionDispatch; +typedef struct PartitionTupleRouting PartitionTupleRouting; Thanks, Amit
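For anyone unfamiliar with the idiom being suggested here: exposing only the typedef in the header makes PartitionTupleRouting an opaque type, so code outside execPartition.c can hold and pass pointers to it without being able to (or needing to) look inside. A minimal sketch, with a purely illustrative member:

    /* execPartition.h: callers see only the name */
    typedef struct PartitionTupleRouting PartitionTupleRouting;

    /* execPartition.c: the full definition stays private to this file */
    struct PartitionTupleRouting
    {
        int     num_total_parts;    /* illustrative member, not the real layout */
        /* ... the remaining members are known only to execPartition.c ... */
    };

It also avoids spelling 'struct PartitionTupleRouting' at every use site, which is exactly the churn Amit noticed in the v14_to_v15 diff.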
On 9 November 2018 at 19:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I have a comment regarding how you chose to make > PartitionTupleRouting private. > > Using the v14_to_v15 diff, I could quickly see that there are many diffs > changing PartitionTupleRouting to struct PartitionTupleRouting, but they > would be unnecessary if you had added the following in execPartition.h, as > my upthread had done. > > -/* See execPartition.c for the definition. */ > +/* See execPartition.c for the definitions. */ > typedef struct PartitionDispatchData *PartitionDispatch; > +typedef struct PartitionTupleRouting PartitionTupleRouting; Okay, done that way. v16 attached. The 0002 patch is included again, this time with a new proposed commit message. There was some discussion over on [1] where nobody seemed to have any concerns about delaying the locking until we route the first tuple to the partition. [1] https://www.postgresql.org/message-id/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03 -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, On 11/12/18 6:17 PM, David Rowley wrote: > On 9 November 2018 at 19:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I have a comment regarding how you chose to make >> PartitionTupleRouting private. >> >> Using the v14_to_v15 diff, I could quickly see that there are many diffs >> changing PartitionTupleRouting to struct PartitionTupleRouting, but they >> would be unnecessary if you had added the following in execPartition.h, as >> my upthread had done. >> >> -/* See execPartition.c for the definition. */ >> +/* See execPartition.c for the definitions. */ >> typedef struct PartitionDispatchData *PartitionDispatch; >> +typedef struct PartitionTupleRouting PartitionTupleRouting; > > Okay, done that way. v16 attached. > Still passes check-world. Best regards, Jesper
On 2018/11/14 0:32, Jesper Pedersen wrote: > Hi, > > On 11/12/18 6:17 PM, David Rowley wrote: >> On 9 November 2018 at 19:18, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> I have a comment regarding how you chose to make >>> PartitionTupleRouting private. >>> >>> Using the v14_to_v15 diff, I could quickly see that there are many diffs >>> changing PartitionTupleRouting to struct PartitionTupleRouting, but they >>> would be unnecessary if you had added the following in execPartition.h, as >>> my upthread had done. >>> >>> -/* See execPartition.c for the definition. */ >>> +/* See execPartition.c for the definitions. */ >>> typedef struct PartitionDispatchData *PartitionDispatch; >>> +typedef struct PartitionTupleRouting PartitionTupleRouting; >> >> Okay, done that way. v16 attached. Thank you. > Still passes check-world. I looked at v16 and noticed a few typos: + * partition_dispatch_info Array of 'dispatch_allocsize' elements containing + * a pointer to a PartitionDispatch objects for a PartitionDispatch objects -> a PartitionDispatch object + * partitions Array of 'partitions_allocsize' elements + * containing pointers to a ResultRelInfos of all + * leaf partitions touched by tuple routing. a ResultRelInfos -> ResultRelInfos + * PartitionDispatch and ResultRelInfo pointers the 'partitions' array. "in" the 'partitions' array. + /* Setup the PartitionRoutingInfo for it */ Setup -> Set up + * Ensure there's enough space in the 'partitions' array of 'proute' + * and store it in the next empty slot in proute's partitions array. Not a typo, but maybe just write proute->partitions instead of "partitions array of proute" and "proute's partition array". + * the next available slot in the 'proute' partition_dispatch_info[] + * array. Also, record the index into this array in the 'parent_pd' Similarly, here: proute->partition_dispatch_info array + * array. Also, record the index into this array in the 'parent_pd' + * indexes[] array in the partidx element so that we can properly Similarly, parent_pd->indexes array + if (dispatchidx >= proute->dispatch_allocsize) + { + /* Expand allocated space. */ + proute->dispatch_allocsize *= 2; + proute->partition_dispatch_info = (PartitionDispatchData **) + repalloc(proute->partition_dispatch_info, + sizeof(PartitionDispatchData *) * + proute->dispatch_allocsize); + } Sorry, I forgot to point this out before, but can this code in ExecInitPartitionDispatchInfo be accommodated in ExecCheckPartitionArraySpace() for consistency? Thanks, Amit
Thanks for looking at this again. On 14 November 2018 at 13:47, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > + if (dispatchidx >= proute->dispatch_allocsize) > + { > + /* Expand allocated space. */ > + proute->dispatch_allocsize *= 2; > + proute->partition_dispatch_info = (PartitionDispatchData **) > + repalloc(proute->partition_dispatch_info, > + sizeof(PartitionDispatchData *) * > + proute->dispatch_allocsize); > + } > > Sorry, I forgot to point this out before, but can this code in > ExecInitPartitionDispatchInfo be accommodated in > ExecCheckPartitionArraySpace() for consistency? I don't really want to put that code in ExecCheckPartitionArraySpace() as the way the function is now, it makes quite a lot of sense for the compiler to inline it. If we add redundant work in there, then it makes less sense. There's never any need to check both arrays at once as we're only adding the new item to one array at a time. Instead, I've written a new function named ExecCheckDispatchArraySpace() and put the resize code inside that. I've fixed the typos you mentioned. The only other thing I changed was to only allocate the PartitionDispatch->tupslot if a conversion is required. The previous code allocated this regardless of whether it was going to be used or not. This saves both the redundant allocation and also very slightly reduces the cost of the if test in ExecFindPartition(). There's now no need to check if the map != NULL, as if the slot is there then the map must exist too. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
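Going by the snippet Amit quoted and David's description, the new helper presumably looks something like the sketch below; the real ExecCheckDispatchArraySpace() in the patch may differ in detail.

    static inline void
    ExecCheckDispatchArraySpace(PartitionTupleRouting *proute, int newsize)
    {
        if (newsize > proute->dispatch_allocsize)
        {
            /* Double the allocation until it is large enough. */
            proute->dispatch_allocsize = Max(proute->dispatch_allocsize * 2, newsize);
            proute->partition_dispatch_info = (PartitionDispatchData **)
                repalloc(proute->partition_dispatch_info,
                         sizeof(PartitionDispatchData *) *
                         proute->dispatch_allocsize);
        }
    }

Keeping the check this small is what makes it attractive for the compiler to inline, per David's point above.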
Thanks for updating the patch. On 2018/11/14 13:16, David Rowley wrote: > Thanks for looking at this again. > > On 14 November 2018 at 13:47, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> + if (dispatchidx >= proute->dispatch_allocsize) >> + { >> + /* Expand allocated space. */ >> + proute->dispatch_allocsize *= 2; >> + proute->partition_dispatch_info = (PartitionDispatchData **) >> + repalloc(proute->partition_dispatch_info, >> + sizeof(PartitionDispatchData *) * >> + proute->dispatch_allocsize); >> + } >> >> Sorry, I forgot to point this out before, but can this code in >> ExecInitPartitionDispatchInfo be accommodated in >> ExecCheckPartitionArraySpace() for consistency? > > I don't really want to put that code in ExecCheckPartitionArraySpace() > as the way the function is now, it makes quite a lot of sense for the > compiler to inline it. If we add redundant work in there, then it > makes less sense. There's never any need to check both arrays at once > as we're only adding the new item to one array at a time. > > Instead, I've written a new function named > ExecCheckDispatchArraySpace() and put the resize code inside that. Okay, seems fine. > I've fixed the typos you mentioned. The only other thing I changed was > to only allocate the PartitionDispatch->tupslot if a conversion is > required. The previous code allocated this regardless of whether it was going > to be used or not. This saves both the redundant allocation and also > very slightly reduces the cost of the if test in ExecFindPartition(). > There's now no need to check if the map != NULL, as if the slot is there Also makes sense. Although it seems that Alvaro has already started looking at this, I'll mark the CF entry as Ready for Committer anyway, because I don't have any more comments. :) Thanks, Amit
What's with this comment? * Initially we must only set up 1 PartitionDispatch object; the one for * the partitioned table that's the target of the command. If we must * route a tuple via some sub-partitioned table, then its * PartitionDispatch is only built the first time it's required. You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at odds with the '1' mentioned in the comment. Which is wrong? (I have a few edits on the patch, so please don't send a full v18 -- a delta patch would be welcome, if you have further changes to propose.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks for picking this up. On 15 November 2018 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > What's with this comment? > > * Initially we must only set up 1 PartitionDispatch object; the one for > * the partitioned table that's the target of the command. If we must > * route a tuple via some sub-partitioned table, then its > * PartitionDispatch is only built the first time it's required. > > You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at > odds with the '1' mentioned in the comment. Which is wrong? I don't think either is wrong, but I guess something must be misleading, so could perhaps be improved. We're simply allocating enough space for PARTITION_ROUTING_INITSIZE items, but we're only initialising 1 of them. That leaves space for PARTITION_ROUTING_INITSIZE - 1 more items before we'd need to reallocate the array. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018-Nov-15, David Rowley wrote: > On 15 November 2018 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > What's with this comment? > > > > * Initially we must only set up 1 PartitionDispatch object; the one for > > * the partitioned table that's the target of the command. If we must > > * route a tuple via some sub-partitioned table, then its > > * PartitionDispatch is only built the first time it's required. > > > > You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at > > odds with the '1' mentioned in the comment. Which is wrong? > > I don't think either is wrong, but I guess something must be > misleading, so could perhaps be improved. Ah, that makes sense. Yeah, it seems a bit misleading to me. No worries. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/11/15 8:58, Alvaro Herrera wrote: > On 2018-Nov-15, David Rowley wrote: > >> On 15 November 2018 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>> What's with this comment? >>> >>> * Initially we must only set up 1 PartitionDispatch object; the one for >>> * the partitioned table that's the target of the command. If we must >>> * route a tuple via some sub-partitioned table, then its >>> * PartitionDispatch is only built the first time it's required. >>> >>> You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at >>> odds with the '1' mentioned in the comment. Which is wrong? >> >> I don't think either is wrong, but I guess something must be >> misleading, so could perhaps be improved. > > Ah, that makes sense. Yeah, it seems a bit misleading to me. No > worries. Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size? Thanks, Amit
On 2018-Nov-15, Amit Langote wrote: > Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or > PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size? Here's a proposed delta on v17 0001. Most importantly, I noticed that the hashed subplans stuff didn't actually work, because the hash API was not being used correctly. So the search in the hash would never return a hit, and we'd create RRIs for those partitions again. To fix this, I added a new struct to hold hash entries. I think this merits that the performance tests be redone. (Unless I misunderstand, this shouldn't change the performance of INSERT, only that of UPDATE.) On the subject of the ArraySpace routines, I decided to drop them and instead do the re-allocations on the places where they were needed. In the original code there were two places for the partitions array, but both did the same thing so it made sense to create a separate routine to do it instead (ExecRememberPartitionRel), and do the allocation there. Just out of custom I moved the palloc to appear at the same place as the repalloc, and after doing so it became obvious that we were over-allocating memory for the PartitionDispatchData pointer -- allocating the size for the whole struct instead of just the pointer. (I renamed the "allocsize" struct members to "max", as is customary.) I added CHECK_FOR_INTERRUPTS to the ExecFindPartition loop. It shouldn't be useful if the code is correct, but if there are bugs it's better to be able to interrupt infinite loops :-) I reindented the comment atop PartitionTupleRouting. The other way was just too unwieldy. Let me know what you think. Regression tests still pass for me. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
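Some context on the hash fix Alvaro describes: with PostgreSQL's dynahash, the lookup key must be the leading field of each entry, and hash_search() returns a pointer to the whole entry. A hedged sketch of the kind of entry struct involved (the names here are illustrative, not copied from the delta patch):

    typedef struct SubplanResultRelHashElem
    {
        Oid             relid;      /* hash key; must be the first field */
        ResultRelInfo  *rri;        /* the UPDATE subplan's result relation */
    } SubplanResultRelHashElem;

    /* Lookup, assuming 'partoid' (the partition's OID) and 'proute' are in scope: */
    SubplanResultRelHashElem *elem;
    ResultRelInfo *rri = NULL;
    bool        found;

    elem = (SubplanResultRelHashElem *)
        hash_search(proute->subplan_resultrel_hash, &partoid, HASH_FIND, &found);
    if (elem != NULL)
        rri = elem->rri;

If the entries stored in the table don't begin with the key, lookups will simply never match, which fits the symptom described above.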
On Fri, Nov 16, 2018 at 11:40 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2018-Nov-15, Amit Langote wrote: > > > Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or > > PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size? > > Here's a proposed delta on v17 0001. Most importantly, I noticed that > the hashed subplans stuff didn't actually work, because the hash API was > not being used correctly. So the search in the hash would never return > a hit, and we'd create RRIs for those partitions again. To fix this, I > added a new struct to hold hash entries. I'm a bit surprised that you found that the hash table didn't work, because I remember having checked by attaching gdb that it works when I was hacking on my own delta patch, but I may have been looking at too many things. > I think this merits that the performance tests be redone. (Unless I > misunderstand, this shouldn't change the performance of INSERT, only > that of UPDATE.) Actually, I don't remember seeing performance tests done with UPDATEs on this thread. Since we don't needlessly scan *all* subplan result rels anymore, maybe this removes a good deal of overhead for UPDATEs that update partition key. > On the subject of the ArraySpace routines, I decided to drop them and > instead do the re-allocations on the places where they were needed. > In the original code there were two places for the partitions array, but > both did the same thing so it made sense to create a separate routine to > do it instead (ExecRememberPartitionRel), and do the allocation there. > Just out of custom I moved the palloc to appear at the same place as the > repalloc, and after doing so it became obvious that we were > over-allocating memory for the PartitionDispatchData pointer -- > allocating the size for the whole struct instead of just the pointer. > > (I renamed the "allocsize" struct members to "max", as is customary.) These changes look good to me. > I added CHECK_FOR_INTERRUPTS to the ExecFindPartition loop. It > shouldn't be useful if the code is correct, but if there are bugs it's > better to be able to interrupt infinite loops :-) Good measure. :) > I reindented the comment atop PartitionTupleRouting. The other way was > just too unwieldy. > > Let me know what you think. Regression tests still pass for me. Overall, it looks good to me. Thanks, Amit
One thing I don't quite like is the inconsistency in handling memory context switches in the various functions allocating stuff. It seems rather haphazard. I'd rather have a memcxt member in PartitionTupleRouting, which is set when the struct is created, and then have all the other functions allocating stuff use that one. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
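A quick sketch of the arrangement being suggested; 'memcxt' is an illustrative field name rather than something taken from the patch:

    MemoryContext oldcxt;
    PartitionRoutingInfo *partrouteinfo;

    /* When the PartitionTupleRouting struct is built, remember the context: */
    proute->memcxt = CurrentMemoryContext;

    /* Every routine that later allocates routing state then switches to it: */
    oldcxt = MemoryContextSwitchTo(proute->memcxt);
    partrouteinfo = (PartitionRoutingInfo *) palloc0(sizeof(PartitionRoutingInfo));
    /* ... fill in the maps and the slot ... */
    MemoryContextSwitchTo(oldcxt);

That ties the lifetime of everything hanging off PartitionTupleRouting to one explicit context, instead of whatever CurrentMemoryContext happens to be at each call site.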
On 2018-Nov-16, Alvaro Herrera wrote: > One thing I don't quite like is the inconsistency in handling memory > context switches in the various functions allocating stuff. It seems > rather haphazard. I'd rather have a memcxt member in > PartitionTupleRouting, which is set when the struct is created, and then > have all the other functions allocating stuff use that one. So while researching this I finally realized that there was a "lexical disconnect" between setting a ResultRelInfo's ri_PartitionInfo struct/pointer and adding it to the PartitionTupleRouting arrays. However, if you think about it, these things are one and the same, so we don't need to do them separately; just merge the new function I wrote into the existing ExecInitRoutingInfo(). Patch attached. (This version also rebases across Andres' recent conflicting TupleTableSlot changes.) I'll now see about the commit message and push shortly. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
I repeated David's original tests not terribly rigorously[*] and got these numbers (tps):

* Unpatched : 72.396196
* 0001      : 77.279404
* 0001+0002 : 20409.415094
* 0002      : 816.606613
* control   : 22969.140596 (insertion into unpartitioned table)

So while this patch by itself gives a pretty lame increase in tps, it removes bottlenecks that will appear once we change the locking scheme. [*] On my laptop, running each test only once for 60s. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-Nov-13, David Rowley wrote: > The 0002 patch is included again, this time with a new proposed commit > message. There was some discussion over on [1] where nobody seemed to > have any concerns about delaying the locking until we route the first > tuple to the partition. Please get a new commitfest entry for this patch. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, 17 Nov 2018 at 04:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I'll now see about the commit message and push shortly. Many thanks for making the required adjustments and pushing this. If I wasn't on leave late last week and early this week then the only thing I'd have mentioned was the lack of empty comment line in the header comment for PartitionDispatchData. It looks a bit messy without. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018-Nov-21, David Rowley wrote: > If I wasn't on leave late last week and early this week then the only > thing I'd have mentioned was the lack of empty comment line in the > header comment for PartitionDispatchData. It looks a bit messy > without. Absolutely. Pushed a few newlines -- I hope I understood you correctly. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 22 Nov 2018 at 07:06, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2018-Nov-21, David Rowley wrote: > > If I wasn't on leave late last week and early this week then the only > > thing I'd have mentioned was the lack of empty comment line in the > > header comment for PartitionDispatchData. It looks a bit messy > > without. > > Absolutely. Pushed a few newlines -- I hope I understood you correctly. Thanks, you did. That looks better now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On Thu, Nov 22, 2018 at 7:25 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Thu, 22 Nov 2018 at 07:06, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2018-Nov-21, David Rowley wrote: > > > If I wasn't on leave late last week and early this week then the only > > > thing I'd have mentioned was the lack of empty comment line in the > > > header comment for PartitionDispatchData. It looks a bit messy > > > without. > > > > Absolutely. Pushed a few newlines -- I hope I understood you correctly. > > Thanks, you did. That looks better now. I noticed that there's a "be" missing in the comment above ExecFindPartition. Fixed in the attached. Thanks, Amit
Attachment
On Thu, Nov 22, 2018 at 11:32:04AM +0900, Amit Langote wrote: > I noticed that there's a "be" missing in the comment above > ExecFindPartition. Fixed in the attached. Thanks Amit, I have committed this one. -- Michael
Attachment
On Sat, 17 Nov 2018 at 07:28, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > The 0002 patch is included again, this time with a new proposed commit > > message. There was some discussion over on [1] where nobody seemed to > > have any concerns about delaying the locking until we route the first > > tuple to the partition. > > Please get a new commitfest entry for this patch. Added to Jan-fest in: https://commitfest.postgresql.org/21/1887/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services