Thread: Performance improvements of INSERTs to a partitioned table
Hi,
I want to discuss performance improvements of INSERTs to a partitioned table.
When an application inserts records into a table partitioned into thousands of partitions, find_all_inheritors() locks all of the partitions.
Updating the partition key takes locks in the same way.
So execution time becomes longer as the number of partitions increases.
* nparts 8
testdb=# explain analyze insert into test.accounts_history(aid, delta, mtime) values(8, 5000, current_timestamp);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Insert on accounts_history (cost=0.00..0.02 rows=1 width=20) (actual time=0.281..0.281 rows=0 loops=1)
-> Result (cost=0.00..0.02 rows=1 width=20) (actual time=0.079..0.080 rows=1 loops=1)
Planning Time: 0.080 ms
Execution Time: 0.362 ms
(4 rows)
* nparts 8192
testdb=# explain analyze insert into test.accounts_history(aid, delta, mtime) values(8192, 5000, current_timestamp);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Insert on accounts_history (cost=0.00..0.02 rows=1 width=20) (actual time=0.058..0.059 rows=0 loops=1)
-> Result (cost=0.00..0.02 rows=1 width=20) (actual time=0.032..0.034 rows=1 loops=1)
Planning Time: 0.032 ms
Execution Time: 12.508 ms
(4 rows)
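For anyone who wants to reproduce the comparison, a setup along these lines should work at any partition count. The schema below is my guess at the benchmark tables (the actual DDL wasn't shown), so column types and partition ranges are assumptions:

```sql
-- Hypothetical reproduction of the benchmark schema; names and ranges
-- are assumptions, not taken from the actual test.
CREATE SCHEMA IF NOT EXISTS test;
CREATE TABLE test.accounts_history (
    aid    int NOT NULL,
    delta  int NOT NULL,
    mtime  timestamptz NOT NULL
) PARTITION BY RANGE (aid);

-- Generate 8192 single-value range partitions (run in psql; \gexec
-- executes each generated CREATE TABLE statement).
SELECT format(
    'CREATE TABLE test.accounts_history_%s PARTITION OF test.accounts_history
     FOR VALUES FROM (%s) TO (%s)', i, i, i + 1)
FROM generate_series(1, 8192) AS i \gexec

-- Then compare Execution Time at different partition counts:
-- EXPLAIN ANALYZE INSERT INTO test.accounts_history ... ;
```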
If we lock only the target partitions, as in the patch previously proposed by David[1], performance improves greatly.
* nparts 8192 (patched)
testdb=# explain analyze insert into test.accounts_history(aid, delta, mtime) values(8192, 5000, current_timestamp);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Insert on accounts_history (cost=0.00..0.02 rows=1 width=20) (actual time=0.415..0.416 rows=0 loops=1)
-> Result (cost=0.00..0.02 rows=1 width=20) (actual time=0.140..0.141 rows=1 loops=1)
Planning Time: 0.120 ms
Execution Time: 1.694 ms
(4 rows)
However, I am concerned that "unsafe" is included in the name of this patch.
If locking only the target partitions and locking all of the partitions are executed at the same time, a problem may occur.
But I'm not sure what kind of problem will occur.
Is it enough to lock only the target partitions?
If a problem occurs in the above case, I think it is safer to divide lock acquisition into two steps.
In the first step, lock only the parent table in share or exclusive mode.
In the second step, lock only the target partitions, after the parent table is locked.
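A sketch of the two-step idea (the lock mode is illustrative; the proposal only says "share or exclusive mode", and this is not an actual patch):

```sql
-- Step 1: every operation on the hierarchy locks the parent first,
-- making the parent a single, well-defined ordering point.
BEGIN;
LOCK TABLE test.accounts_history IN SHARE UPDATE EXCLUSIVE MODE;

-- Step 2: only after the parent lock is held, lock the target
-- partition(s) that the tuples actually route to.
INSERT INTO test.accounts_history (aid, delta, mtime)
VALUES (8192, 5000, current_timestamp);
COMMIT;
```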
Thoughts?
[1]: https://www.postgresql.org/message-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
regards,
Sho Kato
AFAIK, when CREATE INDEX on a partition and INSERT into the parent table are executed at the same time, this patch causes a deadlock.
* partitions information
Partition key: RANGE (a)
Partitions: a_1 FOR VALUES FROM (1) TO (100),
a_2 FOR VALUES FROM (100) TO (200)
T1: create index a_1_ix on a_1(a);
T2: insert into a values(101),(1);  -- locks a_2, then waits for a_1's lock to be released
T1: create index a_2_ix on a_2(a);  ERROR: deadlock detected
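For reference, the interleaving can be reproduced by hand in two psql sessions; this is my sketch of the scenario against the patched server, using the partitions above:

```sql
-- Session 1 (T1):
BEGIN;
CREATE INDEX a_1_ix ON a_1 (a);   -- takes AccessExclusiveLock on a_1 only

-- Session 2 (T2):
INSERT INTO a VALUES (101), (1);  -- tuple routing locks a_2 first, then
                                  -- blocks waiting for T1's lock on a_1

-- Session 1 (T1) again:
CREATE INDEX a_2_ix ON a_2 (a);   -- needs a_2, held by T2: deadlock
```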
I don't think this situation means the patch is unsafe, because a similar situation can occur with non-partitioned tables and the DBMS cannot prevent it.
But I'm not sure this behavior is correct.
Does a similar deadlock occur in other DBMSs, like Oracle?
From: Kato, Sho [mailto:kato-sho@jp.fujitsu.com]
Sent: Tuesday, November 6, 2018 6:36 PM
To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Performance improvements of INSERTs to a partitioned table
On 7 November 2018 at 21:31, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> AFAIK, when CREATE INDEX on a partition and INSERT into the parent table
> are executed at the same time, this patch causes a deadlock.
>
> * partitions information
>
> Partition key: RANGE (a)
> Partitions: a_1 FOR VALUES FROM (1) TO (100),
>             a_2 FOR VALUES FROM (100) TO (200)
>
> T1: create index a_1_ix on a_1(a);
> T2: insert into a values(101),(1);  -- locks a_2, then waits for a_1's lock to be released
> T1: create index a_2_ix on a_2(a);  ERROR: deadlock detected
>
> I don't think this situation means the patch is unsafe, because a similar
> situation can occur with non-partitioned tables and the DBMS cannot
> prevent it.
>
> But I'm not sure this behavior is correct.

That's one case that will deadlock with the 0002 patch in [1]. Another command that takes a conflicting lock on the partition only is TRUNCATE. Many other commands that normally take an AEL can't be run on a partition, for example, ALTER TABLE ADD COLUMN.

The same would have deadlocked on master if you'd created the indexes in the reverse order, but I guess the point is that today there's some defined order that you can create the indexes in that won't cause a deadlock, but with [1] there is no defined order, since the tables are locked in the order that tuples are routed to them.

Robert pointed out something interesting in [2] about UPDATEs of a partition key perhaps routing a tuple to a partition that's been excluded by constraint exclusion, so the lock is not taken initially, but would be taken later in ExecSetupPartitionTupleRouting(). I ran through that scenario today and discovered it can't happen, as excluded partitions are still in the range table, so are still locked either in the planner or by AcquireExecutorLocks(), and the order those are locked in is well defined from the planner.
However, I believe that's something Amit plans to change in [3], where he proposes to only lock partitions that survive partition pruning (among many other things).

There are probably a few things we could do here:

a. Have all DDL on partitions obtain the same lock level on all ancestor partitioned tables, taking a lock on the root first and working down to the leaf.
b. Just lock what we touch and increase the likelihood of deadlocks occurring.
c. Reject [1] and [3] because we don't want to increase the chances of deadlocks.

I'm highly against doing 'c' since both [1] and [3] are very useful patches which scale partitioning to work well with very large numbers of partitions. At the moment we can barely do half a dozen partitions before the performance starts going south. 'a' is not that ideal a solution either, as it means that anyone doing CREATE INDEX or TRUNCATE on a partition would conflict with queries that are querying an unrelated part of the partitioned table.

While I don't really like 'b' either, I wonder how many people are going to hit deadlocks here. To hit that, I believe you'd have to be doing the TRUNCATE or CREATE INDEX inside a transaction, and to hit it where you couldn't hit it before, you'd have had to carefully write your CREATE INDEX / TRUNCATE statements to ensure they're executed in partition order. I'm unsure how likely that is, but I certainly can't rule out that doing 'b' won't upset anyone.

Perhaps a GUC is needed to choose between 'a' and 'b'?

I'm adding Amit to this email because [3] is his, and Robert because he's recently been talking about partition locking too.
[1] https://www.postgresql.org/message-id/CAKJS1f-vYAHqqaU878Q4cdZYHwPcv1J_C-mG%3DCs2UwhsD6cqwg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BTgmoZGJsy-nRFnzurhZQJtHdDh5fzJKvbvhS0byN6_46pB9Q%40mail.gmail.com
[3] https://commitfest.postgresql.org/20/1778/

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/11/09 11:19, David Rowley wrote:
> On 7 November 2018 at 21:31, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
>> AFAIK, when CREATE INDEX on a partition and INSERT into the parent table
>> are executed at the same time, this patch causes a deadlock.
>>
>> * partitions information
>>
>> Partition key: RANGE (a)
>> Partitions: a_1 FOR VALUES FROM (1) TO (100),
>>             a_2 FOR VALUES FROM (100) TO (200)
>>
>> T1: create index a_1_ix on a_1(a);
>> T2: insert into a values(101),(1);  -- locks a_2, then waits for a_1's lock to be released
>> T1: create index a_2_ix on a_2(a);  ERROR: deadlock detected
>>
>> I don't think this situation means the patch is unsafe, because a similar
>> situation can occur with non-partitioned tables and the DBMS cannot
>> prevent it.
>>
>> But I'm not sure this behavior is correct.
>
> That's one case that will deadlock with the 0002 patch in [1]. Another
> command that takes a conflicting lock on the partition only is
> TRUNCATE. Many other commands that normally take an AEL can't be run
> on a partition, for example, ALTER TABLE ADD COLUMN.
>
> The same would have deadlocked on master if you'd created the indexes
> in the reverse order, but I guess the point is that today there's some
> defined order that you can create the indexes in that won't cause a
> deadlock, but with [1] there is no defined order, since the tables are
> locked in the order that tuples are routed to them.

We cannot control the order in which a user performs operations directly on partitions. What we do have to worry about is the order in which partitions are locked by a particular piece of backend code when a certain operation is applied to the parent table. However, if the user had applied the operation to the parent table instead, then it would've blocked for any concurrent queries that already had a lock on the parent, or vice versa, because obviously the table named in the command is locked before any of its children.
> Robert pointed out something interesting in [2] about UPDATEs of a
> partition key perhaps routing a tuple to a partition that's been
> excluded by constraint exclusion, so the lock is not taken initially,
> but would be taken later in ExecSetupPartitionTupleRouting(). I ran
> through that scenario today and discovered it can't happen as excluded
> partitions are still in the range table, so are still locked either in
> the planner or by AcquireExecutorLocks(), and the order those are
> locked in is well defined from the planner.

Yes, expand_inherited_rtentry() would lock *all* partitions even if some of them might not end up in the final plan due to some being excluded.

> However, I believe that's
> something Amit plans to change in [3], where he proposes to only lock
> partitions that survive partition pruning (among many other things).

With my patch, while the planner would take the locks in deterministic order on the partitions that it adds to the plan, the executor-time locking order is non-deterministic, considering that tuple routing may encounter multiple, not yet seen, partitions.

> There are probably a few things we could do here:
>
> a. Have all DDL on partitions obtain the same lock level on all
> ancestor partitioned tables, taking a lock on the root first and
> working down to the leaf.
> b. Just lock what we touch and increase the likelihood of deadlocks occurring.
> c. Reject [1] and [3] because we don't want to increase the chances of
> deadlocks.
>
> I'm highly against doing 'c' since both [1] and [3] are very useful
> patches which scale partitioning to work well with very large numbers
> of partitions. At the moment we can barely do half a dozen
> partitions before the performance starts going south.

Agreed.

> 'a' is not that
> ideal a solution either, as it means that anyone doing CREATE INDEX or
> TRUNCATE on a partition would conflict with queries that are querying
> an unrelated part of the partitioned table.
As you pointed out, many of the commands that require locks that would conflict with concurrent queries are not allowed to run directly on partitions anyway. For operations that *are* allowed, doing 'a' is the same as asking the user to perform the operation on the root parent instead, at least as far as concurrency is concerned (the user may not want to *truncate* all partitions, only some).

> While I don't really like
> 'b' either, I wonder how many people are going to hit deadlocks here.
> To hit that, I believe you'd have to be doing the TRUNCATE or
> CREATE INDEX inside a transaction, and to hit it where you couldn't hit
> it before, you'd have had to carefully write your CREATE INDEX /
> TRUNCATE statements to ensure they're executed in partition order. I'm
> unsure how likely that is, but I certainly can't rule out that doing
> 'b' won't upset anyone.

As long as queries involve tuple routing that touches multiple not-yet-seen partitions, someone doing conflicting operations directly on multiple partitions in a transaction will have to be ready to see deadlocks. Maybe we can document that.

> Perhaps a GUC is needed to choose between 'a' and 'b'?

I guess we'll have a hard time explaining such a configuration option to users, though.

Thanks,
Amit
Thanks David.

> I'm highly against doing 'c' since both [1] and [3] are very useful
> patches which scale partitioning to work well with very large numbers
> of partitions.

Agreed.

> 'a' is not that ideal a solution either, as it means that anyone doing
> CREATE INDEX or TRUNCATE on a partition would conflict with queries
> that are querying an unrelated part of the partitioned table.

If only deadlock is the problem, I think an intention lock is needed. All SQL on a partitioned table would take an intention exclusive or shared lock on the root first, and then take a conventional lock on the target leaf.

> While I don't really like 'b' either, I wonder how many people are
> going to hit deadlocks here. To hit that, I believe you'd have to be
> doing the TRUNCATE or CREATE INDEX inside a transaction, and to hit it
> where you couldn't hit it before, you'd have had to carefully write
> your CREATE INDEX / TRUNCATE statements to ensure they're executed in
> partition order.

In the case of CREATE INDEX on a leaf, users have to ensure statements are executed in partition order, since a similar case could occur with a non-partitioned table. But the DBMS has to prevent deadlocks caused by CREATE INDEX/TRUNCATE on the root.

> Perhaps a GUC is needed to choose between 'a' and 'b'?

I agree with Amit. It would be difficult for users to choose.

regards,
Sho Kato
On 9 November 2018 at 18:45, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> As long as queries involve tuple routing that touches multiple not-yet-seen
> partitions, someone doing conflicting operations directly on multiple
> partitions in a transaction will have to be ready to see deadlocks.
> Maybe we can document that.

Perhaps that's good enough. I guess there's at least a workaround of doing LOCK TABLE <root_partitioned_table> in the CREATE INDEX / TRUNCATE session.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
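A sketch of that workaround, using the earlier two-partition example (the lock mode is my choice; the thread doesn't specify one):

```sql
BEGIN;
-- Lock the root before touching any partition, so this transaction
-- orders itself against concurrent inserts routed through the parent
-- instead of racing them partition by partition.
LOCK TABLE a IN SHARE MODE;
CREATE INDEX a_1_ix ON a_1 (a);
CREATE INDEX a_2_ix ON a_2 (a);
COMMIT;
```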