Thread: How to make partitioning scale better for larger numbers of partitions
Hi,
I benchmarked a RANGE-partitioned table with 1.1k leaf partitions and no sub-partitioned tables.
Statement latencies on the partitioned table are much higher than on a non-partitioned table:

UPDATE latency is 210 times higher than on a non-partitioned table.
SELECT latency is 36 times higher than on a non-partitioned table.
Surprisingly, INSERT latency is almost the same.

Of course, I'm aware that table partitioning works well with up to about a hundred partitions, as the PostgreSQL documentation says.
But my customer will use a partitioned table with 1.1k leaf partitions, so we need to improve performance.
Any ideas?
The results of pgbench and perf are listed below.
pgbench results
---------------
transaction type: update.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 648
latency average = 278.202 ms
tps = 3.594512 (including connections establishing)
tps = 3.594545 (excluding connections establishing)
statement latencies in milliseconds:
0.011 \set aid random(1, 1100 * 1)
0.004 \set delta random(-5000, 5000)
0.038 BEGIN;
277.005 UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
1.140 END;
transaction type: select.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 19415
latency average = 9.281 ms
tps = 107.745068 (including connections establishing)
tps = 107.746067 (excluding connections establishing)
statement latencies in milliseconds:
0.800 \set aid random(1, 1100 * 1)
0.137 \set delta random(-5000, 5000)
1.351 BEGIN;
4.941 SELECT abalance FROM test.accounts WHERE aid = :aid;
2.052 END;
transaction type: insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 31895
latency average = 5.654 ms
tps = 176.865541 (including connections establishing)
tps = 176.867086 (excluding connections establishing)
statement latencies in milliseconds:
2.083 \set aid random(1, 1100 * 1)
0.003 \set delta random(-5000, 5000)
0.029 BEGIN;
3.222 INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
0.317 END;
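For reference, the three custom pgbench script files named in the output above can be reconstructed from the per-statement latencies; the exact files aren't attached to the thread, so this sketch infers their contents from the reported statements:

```python
# Reconstruct the pgbench custom scripts implied by the per-statement
# latencies above. Contents are inferred from the reported output.
scripts = {
    "update.sql": """\
\\set aid random(1, 1100 * 1)
\\set delta random(-5000, 5000)
BEGIN;
UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
END;
""",
    "select.sql": """\
\\set aid random(1, 1100 * 1)
\\set delta random(-5000, 5000)
BEGIN;
SELECT abalance FROM test.accounts WHERE aid = :aid;
END;
""",
    "insert.sql": """\
\\set aid random(1, 1100 * 1)
\\set delta random(-5000, 5000)
BEGIN;
INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
END;
""",
}

for name, body in scripts.items():
    with open(name, "w") as f:
        f.write(body)
```

These would then be run along the lines of `pgbench -n -f update.sql -c 1 -j 1 -T 180` (client count, thread count, and duration come from the output above; the `-n` flag is an assumption, since the standard pgbench tables aren't used here).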
Top 10 of perf report
---------------------
UPDATE:
21.33% postgres postgres [.] range_table_mutator
12.57% postgres postgres [.] AllocSetAlloc
4.97% postgres postgres [.] palloc
4.48% postgres postgres [.] make_one_rel
3.96% postgres postgres [.] lappend
2.74% postgres [kernel.kallsyms] [k] get_page_from_freelist
1.87% postgres postgres [.] setup_append_rel_array
1.68% postgres [kernel.kallsyms] [k] list_del
1.64% postgres [kernel.kallsyms] [k] __alloc_pages_nodemask
1.62% postgres [kernel.kallsyms] [k] unmap_vmas
SELECT:
14.72% postgres postgres [.] AllocSetAlloc
5.14% postgres postgres [.] hash_search_with_hash_value
4.23% postgres postgres [.] palloc
4.06% postgres postgres [.] MemoryContextAllocZeroAligned
2.61% postgres postgres [.] copyObjectImpl
2.34% postgres postgres [.] expression_tree_mutator
2.13% postgres [kernel.kallsyms] [k] _spin_lock
1.91% postgres postgres [.] lappend
1.59% postgres [kernel.kallsyms] [k] __link_path_walk
1.50% postgres postgres [.] set_rel_size
INSERT:
20.75% postgres postgres [.] hash_search_with_hash_value
6.03% postgres postgres [.] hash_any
4.88% postgres postgres [.] AllocSetAlloc
4.05% postgres postgres [.] LWLockRelease
4.05% postgres postgres [.] LWLockAcquire
3.27% postgres postgres [.] oid_cmp
3.06% postgres postgres [.] SearchCatCache
2.97% postgres postgres [.] LockReleaseAll
2.57% postgres postgres [.] pg_qsort
2.37% postgres postgres [.] hash_seq_search
The following is information on the environment used for the benchmark.
Server spec
-----------
The server has 16 CPUs.
Memory size is 264 GB.
The database directory is on an SSD.
database tuning
---------------
shared_buffers = 102GB
max_locks_per_transaction = 1000000
postgresql version
------------------
11beta2 + patch1 + patch2
patch1: Allow direct lookups of AppendRelInfo by child relid
commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
https://commitfest.postgresql.org/18/1690/
table definition
----------------
create table test.accounts(aid serial, abalance int) partition by range(aid);
create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zone) partition by range(aid);
create table test.account_part_1 partition of test.accounts for values from (1) to (2);
create table test.account_part_2 partition of test.accounts for values from (2) to (3);
.
.
create table test.account_part_1100 partition of test.accounts for values from (1100) to (1101);
accounts_history is also partitioned in the same way.
There is only one row in each leaf partition for the UPDATE/SELECT benchmark.
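Since writing out 1,100 nearly identical CREATE TABLE statements by hand is impractical, the DDL above was presumably produced by a generator along these lines (a sketch; the original generation method isn't shown in the thread):

```python
# Generate the CREATE TABLE statements for the 1100 single-value range
# partitions shown above (partition i holds only aid = i).
def partition_ddl(parent, prefix, n):
    stmts = []
    for i in range(1, n + 1):
        stmts.append(
            f"create table test.{prefix}_{i} partition of test.{parent} "
            f"for values from ({i}) to ({i + 1});"
        )
    return stmts

ddl = partition_ddl("accounts", "account_part", 1100)
print(ddl[0])   # create table test.account_part_1 partition of test.accounts for values from (1) to (2);
print(ddl[-1])  # create table test.account_part_1100 partition of test.accounts for values from (1100) to (1101);
```

The same function with a different parent/prefix would generate the accounts_history partitions.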
regards,
> I wondered if you compared to PG10 or to inheritance-partitioning (parent with relkind='r' and either trigger or rule or INSERT/UPDATE directly into child)?

Thank you for your reply.

I compared to PG11beta2 with a non-partitioned table.
The non-partitioned table has 1100 records in one table.
The partitioned table has one record in each leaf partition.

Regards,

-----Original Message-----
From: Justin Pryzby [mailto:pryzby@telsasoft.com]
Sent: Friday, July 13, 2018 12:11 PM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>
Subject: Re: How to make partitioning scale better for larger numbers of partitions

Hi,

On Fri, Jul 13, 2018 at 02:58:53AM +0000, Kato, Sho wrote:
> I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no sub-partitioned tables.
> But, statement latencies on a partitioned table is much slower than on a non-partitioned table.
>
> UPDATE latency is 210 times slower than a non-partitioned table.
> SELECT latency is 36 times slower than a non-partitioned table.
> Surprisingly INSERT latency is almost same.

I wondered if you compared to PG10 or to inheritance-partitioning (parent with relkind='r' and either trigger or rule or INSERT/UPDATE directly into child)?

Justin
Kato-san,

On 2018/07/13 11:58, Kato, Sho wrote:
> Hi,
>
> I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no sub-partitioned tables.

Thanks for sharing the results.

> But, statement latencies on a partitioned table is much slower than on a non-partitioned table.
>
> UPDATE latency is 210 times slower than a non-partitioned table.
> SELECT latency is 36 times slower than a non-partitioned table.
> Surprisingly INSERT latency is almost same.

Yes, INSERT comes out ahead because there is no overhead of partitioning in the planning phase. As David Rowley reported on the nearby thread ("Speeding up INSERTs and UPDATEs to partitioned tables"), there is still significant overhead during its execution, so we're still a fair bit away from the best possible performance.

For SELECT/UPDATE/DELETE, the overhead of partitioning in the planning phase is pretty significant and gets worse as the number of partitions grows. I had intended to fix that in PG 11, but we could only manage to get part of that work into PG 11, with significant help from David and others. So, while PG 11's overhead of partitioning during planning is less than PG 10's due to the improved pruning algorithm, it's still pretty far from ideal, because it isn't just the pruning algorithm that had overheads. In fact, PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still carry the overhead that was in PG 10.

The overheads I mention stem from the fact that for partitioning we still rely on the old planner code used to perform inheritance planning, which requires locking and opening *all* partitions. We have so far been able to refactor just enough to use the new code for partition pruning, but there is much refactoring work left to avoid needlessly locking and opening all partitions.

Thanks,
Amit
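The cost asymmetry Amit describes can be illustrated with a toy model (this is not PostgreSQL code, and the partition layout mirrors the benchmark's single-value ranges): constraint exclusion must examine every partition's constraint, O(n) per query, while PG 11's range-partition pruning binary-searches the ordered bound list, O(log n):

```python
import bisect

# Range partition bounds: partition i covers [bounds[i], bounds[i+1]).
bounds = list(range(1, 1102))   # 1100 single-value partitions, aid 1..1100

def prune_constraint_exclusion(aid):
    """Old approach: check every partition's constraint individually."""
    checks = 0
    match = None
    for i in range(len(bounds) - 1):
        checks += 1
        if bounds[i] <= aid < bounds[i + 1]:
            match = i
    return match, checks          # O(n) constraint checks

def prune_binary_search(aid):
    """PG 11-style pruning for range partitioning: binary-search the bounds."""
    i = bisect.bisect_right(bounds, aid) - 1
    return i if 0 <= i < len(bounds) - 1 else None

m1, checks = prune_constraint_exclusion(550)
m2 = prune_binary_search(550)
assert m1 == m2
print(checks)   # 1100 checks, vs ~11 comparisons for the binary search
```

This is only the pruning step; as the thread explains, PG 11 still pays the O(n) lock/open cost for inheritance planning even when pruning itself is fast.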
RE: How to make partitioning scale better for larger numbers of partitions
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
> is pretty significant and gets worse as the number of partitions grows. I
> had intended to fix that in PG 11, but we could only manage to get part of
> that work into PG 11, with significant help from David and others. So,
> while PG 11's overhead of partitioning during planning is less than PG
> 10's due to improved pruning algorithm, it's still pretty far from ideal,
> because it isn't just the pruning algorithm that had overheads. In fact,
> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
> carry the overhead that was in PG 10.

David has submitted multiple patches for PG 12, one of which speeds up pruning for UPDATE/DELETE (I couldn't find it in the current CF, though.) What challenges are there for future versions? Which of them are being addressed by patches in progress for PG 12, and which issues are untouched?

> The overheads I mention stem from
> the fact that for partitioning we still rely on the old planner code
> that's used to perform inheritance planning, which requires to lock and
> open *all* partitions. We have so far been able to refactor just enough
> to use the new code for partition pruning, but there is much refactoring
> work left to avoid needlessly locking and opening all partitions.

I remember the issue of opening and locking all partitions was discussed before. Does this relate only to the planning phase?

Kato-san, could you try pgbench -M prepared?

Regards
Takayuki Tsunakawa
On Fri, Jul 13, 2018 at 05:49:20AM +0000, Tsunakawa, Takayuki wrote:
> David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) What challenges are there for future versions, and which of them are being addressed by patches in progress for PG 12, and which issues are untouched?

Here's a known issue which I'm not sure is on anybody's list:
https://www.postgresql.org/message-id/flat/4F989CD8.8020804%40strategicdata.com.au#4F989CD8.8020804@strategicdata.com.au
=> planning of UPDATE/DELETE (but not SELECT) takes a huge amount of RAM when run on a parent with many children.

We don't typically have UPDATEs, and those we do are against the child tables; but we ran into this last month with a manual query on the parent causing an OOM. I worked around it, but keep meaning to revisit it to see if this changed at all in v11 (very brief testing suggests it has not).

Justin
On 13 July 2018 at 17:49, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) What challenges are there for future versions, and which of them are being addressed by patches in progress for PG 12, and which issues are untouched?

I've not submitted that for PG12 yet. I had other ideas about just getting rid of the inheritance planner altogether, but so far I don't have a patch for that. I'm still uncertain whether there are any huge blockers to that either. Join search would need to be performed multiple times, but a good advantage will be that we can take advantage of partition pruning to join-search only the non-pruned partitions.

The reason I changed my mind about the patch you're talking about is that it meant RangeTblEntries needed to keep the same relation id in the inheritance planner as they did in the grouping planner, and another patch I have semi-baked delays building both RelOptInfo and RangeTblEntry records for partitions until after pruning. Without the RangeTblEntry it was difficult to ensure the relids were in lock-step between the two planners. There are ways to do it, it just didn't look pretty.

Hoping to have something later in the cycle.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 13 July 2018 at 14:58, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> Of course I'm sure table partitioning work well with up to a hundred
> partitions as written on the postgresql document.
>
> But, my customer will use partitioned table with 1.1k leaf partitions.
>
> So, we need to improve performance.
>
> Any ideas?

It would be really good if you could review and test my partitioning patches in the current commitfest. These are the ones authored by me with the word "partition" in the title. These 4 patches don't solve all the problems, but they do make some good gains in some areas. I've still more work to do, so the earlier I can be finished with the 4 that are there now, the more I can focus on the rest.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/07/13 14:49, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
>> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
>> is pretty significant and gets worse as the number of partitions grows. I
>> had intended to fix that in PG 11, but we could only manage to get part of
>> that work into PG 11, with significant help from David and others. So,
>> while PG 11's overhead of partitioning during planning is less than PG
>> 10's due to improved pruning algorithm, it's still pretty far from ideal,
>> because it isn't just the pruning algorithm that had overheads. In fact,
>> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
>> carry the overhead that was in PG 10.
>
> David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)

I don't think he has posted a new patch for improving the speed of UPDATE/DELETE planning yet, although I remember he had posted a PoC patch back in February or March.

> What challenges are there for future versions, and which of them are being addressed by patches in progress for PG 12, and which issues are untouched?

The immediate one, I think, is to refactor the planner such that the new pruning code, which we were able to utilize for SELECT in PG 11, can also be used for UPDATE/DELETE. The refactoring needed to replace the pruning algorithm was minimal for SELECT, but not so for UPDATE/DELETE.

Once we're able to utilize the new pruning algorithm for all the cases, we can move on to refactoring to avoid locking and opening all partitions. As long as we're relying on constraint exclusion for partition pruning, which we still do for UPDATE/DELETE, we cannot do that, because constraint exclusion has to look at each partition individually.

The UPDATE/DELETE planning for partitioning using huge amounts of memory and CPU is a pretty big issue, and refactoring the planner to avoid that may be the hardest of all the problems to be solved here.

>> The overheads I mention stem from
>> the fact that for partitioning we still rely on the old planner code
>> that's used to perform inheritance planning, which requires to lock and
>> open *all* partitions. We have so far been able to refactor just enough
>> to use the new code for partition pruning, but there is much refactoring
>> work left to avoid needlessly locking and opening all partitions.
>
> I remember the issue of opening and locking all partitions was discussed before.

Quite a few times, I suppose.

> Does this relate only to the planning phase?

If the benchmark contains queries that will need to access just one partition, then yes, the planning part is the biggest overhead. Execution-time overhead is limited to having an extra, possibly needless, Append node, but I know David has a patch for that too.

Thanks,
Amit
RE: How to make partitioning scale better for larger numbers of partitions
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
>> David has submitted multiple patches for PG 12, one of which speeds up
>> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)
>> What challenges are there for future versions, and which of them are being
>> addressed by patches in progress for PG 12, and which issues are untouched?
>
> I've not submitted that for PG12 yet. I had other ideas about just
> getting rid of the inheritance planner altogether, but so far don't
> have a patch for that. Still uncertain if there are any huge blockers
> to that either.

Sorry, I seem to have misunderstood something.

By the way, what do you think is the "ideal and should-be-feasible" goal, and the "realistic" goal we can reach in the near future (e.g. PG 12)? Say:

* Planning and execution time is O(log n), where n is the number of partitions
* Planning time is O(log n), execution time is O(1)
* Planning and execution time is O(1)

Regards
Takayuki Tsunakawa
On 13 July 2018 at 18:53, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> By the way, what do you think is the "ideal and should-be-feasible" goal and the "realistic" goal we can reach in the near future (e.g. PG 12)? Say,

Depends. Patches don't move that fast without review, and nothing gets committed without a committer.

> * Planning and execution time is O(log n), where n is the number of partitions
> * Planning time is O(log n), execution time is O(1)
> * Planning and execution time is O(1)

It's going to depend on how many partitions are pruned. We still need to generate paths for all non-pruned partitions, which is going to be slow when there are many partitions. I think we can get pretty close to non-partitioned planning performance with SELECT/UPDATE/DELETE when all but 1 partition survives pruning. There is always going to be some additional overhead we can't get rid of, but hopefully that will be small.

Please feel free to review what I have in the July 'fest.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
RE: How to make partitioning scale better for larger numbers of partitions
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> The immediate one I think is to refactor the planner such that the new
> pruning code, that we were able to utilize for SELECT in PG 11, can also
> be used for UPDATE/DELETE. Refactoring needed to replace the pruning
> algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.
>
> Once we're able to utilize the new pruning algorithm for all the cases, we
> can move on to refactoring to avoid locking and opening of all partitions.
> As long as we're relying on constraint exclusion for partition pruning,
> which we still do for UPDATE/DELETE, we cannot do that because constraint
> exclusion has to look at each partition individually.
>
> The UPDATE/DELETE planning for partitioning using huge memory and CPU is a
> pretty big issue and refactoring planner to avoid that may be what's
> hardest of all the problems to be solved here.

Thank you. There seem to be many challenges to address... As a user and PG developer, I'd be happy to see a wiki page that describes the current performance characteristics in terms of the number of partitions, the ideal and reachable performance, and the challenges to overcome to reach that ideal goal.

> If the benchmark contains queries that will need to access just one
> partition, then yes the planning part is the biggest overhead.
>
> Execution-time overhead is limited to having an extra, possibly needless,
> Append node, but I know David has a patch for that too.

That's good news, thanks. Our user will perform around a hundred single-row INSERT/SELECT/UPDATE/DELETE statements in each transaction, and those are PREPAREd. I hope PG 11 (with David's patches) will work well for that workload. I'll wait for Kato-san's pgbench -M prepared results.

Regards
Takayuki Tsunakawa
Hi Amit,

Thank you for letting me know the challenges of partitioning.

> So, while PG 11's overhead of partitioning during planning is less than PG 10's due to improved pruning algorithm, it's still pretty far from ideal, because it isn't just the pruning algorithm that had overheads.

That makes sense. I also benchmarked PG10. Actually, SELECT latency on PG11beta2 + patch1 is lower than on PG10.

SELECT latency with 800 leaf partitions
---------------------------------------
PG10  5.62 ms
PG11  3.869 ms

But even on PG11, a SELECT statement takes 21.102 ms in the benchmark with 1600 leaf partitions. It takes a long time even though PG 11's partition pruning algorithm is a binary search.

> The overheads I mention stem from the fact that for partitioning we still rely on the old planner code that's used to perform inheritance planning, which requires to lock and open *all* partitions.

I debugged UPDATE statement execution on a partitioned table. range_table_mutator seems to process all leaf partitions.
On 2018/07/13 16:29, Kato, Sho wrote:
> I also benchmark PG10.
> Actually, SELECT latency on PG11beta2 + patch1 is faster than PG10.
>
> SELECT latency with 800 leaf partition
> --------------------------------------
> PG10 5.62 ms
> PG11 3.869 ms
>
> But, even PG11, SELECT statement takes 21.102ms on benchmark with 1600 leaf partitions.
> It takes a long time though partition pruning algorithm of PG11 is binary search.

Yeah, the pruning algorithm change evidently removed only a small percentage of the overhead.

>> The overheads I mention stem from the fact that for partitioning we still rely on the old planner code that's used to perform inheritance planning, which requires to lock and open *all* partitions.
>
> I debug update statement execution on partitioned table.
> range_table_mutator seems process all leaf partitions.

That's one of the symptoms of it, yes. With the existing code for UPDATE/DELETE planning of partitioned tables, the whole Query structure is modified to translate the parent's attribute numbers to partition attribute numbers and planned freshly *for every partition*. range_table_mutator() is invoked sometime during that translation process and itself loops over all partitions, so I'd think it is susceptible to being prominently visible in perf profiles.

Thanks,
Amit
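What Amit describes can be put in a toy cost model (again, not PostgreSQL code): for an UPDATE, the planner re-plans once per partition, and each per-partition translation pass (cf. range_table_mutator) walks a range table containing all partitions, so total planning work grows quadratically with the partition count, while a PG 11 SELECT pays roughly one pass plus a binary-search pruning step:

```python
import math

def update_planning_cost(n_partitions):
    """Model inheritance planning for UPDATE: the Query is translated and
    planned freshly for each partition, and every translation pass visits
    all range-table entries."""
    work = 0
    for _partition in range(n_partitions):   # re-plan per partition
        work += n_partitions                 # mutator visits every RTE
    return work                              # O(n^2)

def select_planning_cost(n_partitions):
    """Model a PG 11 SELECT: one planning pass over the partitions plus a
    binary-search pruning step."""
    return n_partitions + math.ceil(math.log2(n_partitions))

print(update_planning_cost(1100))   # 1210000
print(select_planning_cost(1100))   # 1111
```

The units are arbitrary; the point is the roughly thousand-fold gap between UPDATE and SELECT planning work at 1100 partitions, which is in the ballpark of the 210x vs 36x latency ratios reported at the top of the thread.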
Tsunakawa-san,

> Kato-san, could you try pgbench -M prepared?

I did pgbench -M prepared and perf record.

UPDATE latency in prepared mode is 95% shorter than in simple mode.
SELECT latency in prepared mode is 54% shorter than in simple mode.
INSERT latency in prepared mode is 8% shorter than in simple mode.

In the perf reports, AllocSetAlloc, hash_search_with_hash_value, and hash_any appeared for all SQL. Details are as follows.

pgbench results
---------------
transaction type: update.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 12295
latency average = 14.641 ms
tps = 68.302806 (including connections establishing)
tps = 68.303430 (excluding connections establishing)
statement latencies in milliseconds:
0.009 \set aid random(1, 1100 * 1)
0.004 \set delta random(-5000, 5000)
0.026 BEGIN;
14.089 UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.513 END;
transaction type: select.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 45145
latency average = 3.996 ms
tps = 250.272094 (including connections establishing)
tps = 250.274404 (excluding connections establishing)
statement latencies in milliseconds:
0.750 \set aid random(1, 1100 * 1)
0.714 \set delta random(-5000, 5000)
0.023 BEGIN;
2.262 SELECT abalance FROM test.accounts WHERE aid = :aid;
0.247 END;
transaction type: insert.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 34894
latency average = 5.158 ms
tps = 193.855216 (including connections establishing)
tps = 193.857020 (excluding connections establishing)
statement latencies in milliseconds:
1.007 \set aid random(1, 1100 * 1)
0.802 \set delta random(-5000, 5000)
0.025 BEGIN;
2.963 INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
0.360 END;

Top 10 of perf report
---------------------
UPDATE:
11.86% postgres postgres [.] list_nth
10.23% postgres postgres [.] ExecOpenScanRelation
7.47% postgres postgres [.] list_member_int
4.78% postgres postgres [.] AllocSetAlloc
3.61% postgres postgres [.] palloc0
3.09% postgres postgres [.] hash_search_with_hash_value
2.33% postgres postgres [.] ResourceArrayAdd
1.99% postgres postgres [.] hash_any
1.83% postgres postgres [.] copyObjectImpl
1.64% postgres postgres [.] SearchCatCache1
SELECT:
33.60% postgres postgres [.] hash_search_with_hash_value
13.02% postgres postgres [.] hash_any
5.30% postgres postgres [.] LockAcquireExtended
5.16% postgres postgres [.] LockReleaseAll
4.00% postgres postgres [.] hash_seq_search
3.84% postgres postgres [.] LWLockRelease
3.81% postgres postgres [.] AllocSetAlloc
3.23% postgres postgres [.] LWLockAcquire
2.55% postgres postgres [.] SetupLockInTable
1.90% postgres postgres [.] AcquireExecutorLocks
INSERT:
21.86% postgres postgres [.] hash_search_with_hash_value
6.36% postgres postgres [.] hash_any
4.95% postgres postgres [.] AllocSetAlloc
4.08% postgres postgres [.] LWLockRelease
3.83% postgres postgres [.] LWLockAcquire
3.26% postgres postgres [.] SearchCatCache
3.15% postgres postgres [.] oid_cmp
2.78% postgres postgres [.] LockReleaseAll
2.76% postgres postgres [.] pg_qsort
2.41% postgres postgres [.] SearchCatCache1
Re: How to make partitioning scale better for larger numbers of partitions
On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
>> I wondered if you compared to PG10 or to inheritance-partitioning (parent with relkind='r' and either trigger or rule or INSERT/UPDATE directly into child)?
>
> Thank you for your reply.
>
> I compared to PG11beta2 with non-partitioned table.
>
> Non-partitioned table has 1100 records in one table.
> Partitioned table has one record on each leaf partitions.

I don't think partitioning should be employed this way, even for the sake of comparison. Depending upon the size of each tuple, if 1100 tuples are inserted into a single table, they will probably occupy a few hundred pages. In a partitioned table with one tuple per partition they will occupy at least 1100 pages. There are also other space and locking overheads to maintaining 1100 tables. I think the right way to compare is to have a data set large enough that it really requires 1100 partitions, and then compare performance by putting that data in 1100 partitions and in an unpartitioned table. Even with that kind of data you will see some difference in performance, but it won't be as dramatic as you report.

I might be missing something, though.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
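Ashutosh's space argument in rough numbers (a sketch; the 8 kB page size is PostgreSQL's default, while the ~40-byte tuple size is an assumption for a two-column row including its header):

```python
PAGE_SIZE = 8192   # default PostgreSQL heap page size, in bytes
TUPLE_SIZE = 40    # rough per-row size incl. tuple header; an assumption
n_rows = 1100

# Unpartitioned: rows pack densely into heap pages (ceiling division).
unpartitioned_pages = -(-n_rows * TUPLE_SIZE // PAGE_SIZE)

# One row per partition: each partition needs at least one heap page.
partitioned_pages = n_rows

print(unpartitioned_pages, partitioned_pages)   # 6 1100
```

So the one-row-per-partition layout inflates the heap footprint by a couple of orders of magnitude before any planning overhead is measured, which is Ashutosh's point about the benchmark design.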
RE: How to make partitioning scale better for larger numbers of partitions
From: Kato, Sho [mailto:kato-sho@jp.fujitsu.com]
> I did pgbench -M prepared and perf record.
>
> UPDATE latency in prepared mode is 95% shorter than in simple mode.
> SELECT latency in prepared mode is 54% shorter than in simple mode.
> INSERT latency in prepared mode is 8% shorter than in simple mode.

Thanks. And what does the comparison look like between the unpartitioned case and various partition counts? What are the performance characteristics in terms of latency versus partition count? I thought that's what you tried to reveal first?

Regards
Takayuki Tsunakawa
On 2018/07/13 22:10, Ashutosh Bapat wrote:
> On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
>> I compared to PG11beta2 with non-partitioned table.
>>
>> Non-partitioned table has 1100 records in one table.
>> Partitioned table has one record on each leaf partitions.
>
> I don't think partitioning should be employed this way even for the
> sake of comparison. [...] I think the right way to compare is
> to have a data set large enough that it really requires 1100 partitions and
> then compare performance by putting that data in 1100 partitions and
> in an unpartitioned table. Even with that kind of data, you will see
> some difference in performance, but that won't be as dramatic as you
> report.
>
> I might be missing something though.

Perhaps Kato-san only intended to report that the time the planner spends on a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table. It is and has been clear that that's because the planning time explodes as the number of partitions increases.

If there's lots of data in it, then the result will look completely different, as you say, because scanning a single partition (of the 1100 total) will take less time than scanning a non-partitioned table containing 1100 partitions' worth of data. But the planning time would still be higher for the partitioned table, which seems to be the point of this benchmark.

Thanks,
Amit
On 2018/07/17 10:49, Amit Langote wrote:
> Perhaps, Kato-san only intended to report that the time the planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table.

Yes, that is what this comparison is intended to show.

The purpose of this comparison is to find where the partitioning bottleneck is. Using the bottleneck as a hint for improvement, I'd like to bring the performance of a partitioned table as close to the performance of an unpartitioned table as possible.
On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> On 2018/07/17 10:49, Amit Langote wrote:
>> Perhaps, Kato-san only intended to report that the time the planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table.
>
> Yes, that is what this comparison is intended to show.
>
> The purpose of this comparison is to find where the partitioning bottleneck is.
> Using the bottleneck as a hint for improvement, I'd like to bring the performance of a partitioned table as close to the performance of an unpartitioned table as possible.

That's a good thing, but may not turn out to be realistic. We should compare performance where partitioning matters and try to improve in those contexts. Otherwise we might improve performance in scenarios which are never used.

In this case, even if we improve the planning time by 100%, it hardly matters, since planning time is negligible compared to the execution time on the huge data sets for which partitioning is useful.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2018/07/17 12:14, Ashutosh Bapat wrote:
> On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
>> On 2018/07/17 10:49, Amit Langote wrote:
>>> Perhaps, Kato-san only intended to report that the time the planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table.
>>
>> Yes, that is what this comparison is intended to show.
>>
>> The purpose of this comparison is to find where the partitioning bottleneck is.
>> Using the bottleneck as a hint for improvement, I'd like to bring the performance of a partitioned table as close to the performance of an unpartitioned table as possible.
>
> That's a good thing, but may not turn out to be realistic. We should
> compare performance where partitioning matters and try to improve in
> those contexts. Otherwise we might improve performance in scenarios which
> are never used.
>
> In this case, even if we improve the planning time by 100%, it hardly
> matters, since planning time is negligible compared to the execution
> time on the huge data sets for which partitioning is useful.

While I agree that it's a good idea to tell users to use partitioning only if the overhead of having the partitioning in the first place is bearable, especially the planner overhead, this benchmark shows us that even for what I assume might be fairly commonly occurring queries (select .. from / update .. / delete from partitioned_table where partkey = ?), the planner spends way too many redundant cycles. Some amount of that overhead will always remain, and planning with partitioning will always be a bit slower than without, but it's *too* slow right now for a non-trivial number of partitions.

Thanks,
Amit
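For a "where partkey = ?" query of the kind Amit mentions, the pruning step itself need not be linear in the partition count. The following is a conceptual sketch only, not PostgreSQL's planner code: it shows that locating the single matching RANGE partition can be a binary search over the sorted lower bounds, i.e. O(log n) rather than O(n).

```python
# Conceptual sketch -- NOT PostgreSQL's actual implementation. Illustrates
# pruning "WHERE partkey = ?" to one partition via binary search over the
# sorted lower bounds of the RANGE partitions.
import bisect

def prune_range_partition(bounds, key):
    """bounds[i] is the inclusive lower bound of partition i; partition i
    covers the half-open interval [bounds[i], bounds[i+1]).
    Returns the index of the matching partition, or None if the key falls
    below the first bound."""
    idx = bisect.bisect_right(bounds, key) - 1
    return idx if idx >= 0 else None

# 1100 single-key partitions: [1,2), [2,3), ..., as in the benchmark setup.
bounds = list(range(1, 1101))
print(prune_range_partition(bounds, 542))  # -> 541, the partition for aid=542
```

The point is not this toy function but the complexity class: the O(log n) lookup matches the sub-400-partition SELECT behavior reported later in the thread, while the super-linear growth beyond that suggests other per-partition work (locking, building paths for every partition) dominates.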
On 2018/07/16 13:16, Tsunakawa-san wrote:
> Thanks. And what does the comparison look like between the unpartitioned case and various partition counts? What are the performance characteristics in terms of latency versus partition count? I thought that's what you tried to reveal first?

For the unpartitioned table, latency of the SELECT/UPDATE statements is close to O(n), where n is the number of records; latency of the INSERT statement is close to O(1).

For the partitioned table, up to 400 partitions, latency of the SELECT/INSERT statements is close to O(log n), where n is the number of partitions; between 400 and 6400 partitions, it is closer to O(n). Up to 400 partitions, latency of the UPDATE statement is close to O(n); between 400 and 6400 partitions, it appears closer to O(n^2).

Details are as follows.

unpartitioned table result (prepared mode)
------------------------------------------

 scale | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
-------+-------------+-------------+----------------+----------------+----------------
   100 |       0.24  | 4174.395738 |          0.056 |          0.051 |          0.04
   200 |       0.258 | 3873.099014 |          0.065 |          0.059 |          0.04
   400 |       0.29  | 3453.171112 |          0.081 |          0.072 |          0.041
   800 |       0.355 | 2814.936942 |          0.112 |          0.105 |          0.041
  1600 |       0.493 | 2027.702689 |          0.18  |          0.174 |          0.042
  3200 |       0.761 | 1313.532458 |          0.314 |          0.307 |          0.043
  6400 |       1.214 |  824.001431 |          0.54  |          0.531 |          0.043

partitioned table result (prepared mode)
----------------------------------------

 num_part | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 |       0.937 | 1067.473258 |          0.557 |          0.087 |          0.123
      200 |       1.65  |  606.244552 |          1.115 |          0.121 |          0.188
      400 |       3.295 |  303.491681 |          2.446 |          0.19  |          0.322
      800 |       8.102 |  123.422456 |          6.553 |          0.337 |          0.637
     1600 |      36.996 |   27.030027 |         31.528 |          1.621 |          2.455
     3200 |     147.998 |    6.756922 |        136.136 |          4.21  |          4.94
     6400 |     666.947 |    1.499383 |        640.879 |          7.631 |          9.642

regards,
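The quadratic claim for UPDATE can be sanity-checked directly from the reported numbers: if latency grew linearly with partition count, each doubling of num_part would roughly double the latency, while a factor near 4 per doubling indicates quadratic growth. A few lines suffice (latencies copied from the partitioned-table table above):

```python
# Growth factor per doubling of num_part, using the reported update_latency
# values for the partitioned table. ~2x per doubling = linear; ~4x = quadratic.
update_latency = {100: 0.557, 200: 1.115, 400: 2.446,
                  800: 6.553, 1600: 31.528, 3200: 136.136, 6400: 640.879}

parts = sorted(update_latency)
for a, b in zip(parts, parts[1:]):
    factor = update_latency[b] / update_latency[a]
    print(f"{a:>5} -> {b:>5} partitions: x{factor:.2f}")
```

The factors are roughly 2.0-2.7 up to 800 partitions and 4.3-4.8 beyond 1600, consistent with the shift from near-linear to near-quadratic behavior described above.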