Thread: Guideline To Resolve LWLock:SubtransControlLock
Thank you
On 2018-Aug-16, Fred Habash wrote: > One of our database API's is run concurrently by near 40 sessions. We see > all of them waiting back and forth on this wait state. What version are you running? > Why is it called Subtrans Control Lock? It controls access to the pg_subtrans structure, which is used to record parent/child transaction relationships (as you say, savepoints and EXCEPTIONs in plpgsql are the most common uses, but not the only ones). Normally lookup of these is optimized away, but once you cross a threshold it cannot any longer. > What are the common user session scenarios causing this wait? > - I have read some describe the use of SQL savepoints or PL/pgSQL > exception handling. > What are known resolution measures? Are you in a position to recompile Postgres? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-Aug-16, Fred Habash wrote:
> One of our database API's is run concurrently by near 40 sessions. We see
> all of them waiting back and forth on this wait state.
What version are you running?
> Why is it called Subtrans Control Lock?
It controls access to the pg_subtrans structure, which is used to record
parent/child transaction relationships (as you say, savepoints and
EXCEPTIONs in plpgsql are the most common uses, but not the only ones).
Normally lookup of these is optimized away, but once you cross a
threshold it cannot any longer.
> What are the common user session scenarios causing this wait?
> - I have read some describe the use of SQL savepoints or PL/pgSQL
> exception handling.
> What are known resolution measures?
Are you in a position to recompile Postgres?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you
On 2018-Aug-17, Fred Habash wrote: > Aurora Postgres 9.6.3 Oh, okay, I don't know this one. Did you contact Amazon support? > So, no chance to recompile (AFAIK). > Is there a design anti-pattern at the schema or data access level that we > should look for and correct? Maybe ... > And as for the recompile, are you thinking 'NUM_SUBTRANS_BUFFERS'? Yes, that's one option, but there's also TOTAL_MAX_CACHED_SUBXIDS. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
'NUM_SUBTRANS_BUFFERS'?TOTAL_MAX_CACHED_SUBXIDS
And as for the recompile, are you thinking 'NUM_SUBTRANS_BUFFERS'?
Yes, that's one option, but there's also TOTAL_MAX_CACHED_SUBXIDS.
On 2018-Aug-20, Fred Habash wrote: > How do we go about calculating appropriate values for these two parameters ... I don't know a lot about your system, so don't have anything to go on. Also, Aurora is mostly unknown to me. What did Amazon say? > > 'NUM_SUBTRANS_BUFFERS'? > TOTAL_MAX_CACHED_SUBXIDS > > And do both require a recompile? Yes. But maybe they'll just move the contention point a little bit backwards without actually fixing anything. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8/17/18 11:07, Fred Habash wrote: > Aurora Postgres 9.6.3 Hi Fred! The Amazon team does watch the AWS forums and that's the place to raise questions that are specific to PostgreSQL on RDS or questions specific to Aurora. In fact we would love to see this question over there since it might be something other people see as well. https://forums.aws.amazon.com/forum.jspa?forumID=227 That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child transaction relationships pretty much the same way that community PostgreSQL 9.6.3 does. The uses you pointed out (savepoints and exceptions in plpgsql) are the most common causes of contention I've seen - similar to what Alvaro said his experience is. I have seen applications grind to a halt on SubtransControlLock when they make heavy use of exception blocks in plpgsql code; in fact it's pretty straightforward to demonstrate this behavior with pgbench on community PostgreSQL. On 8/20/18 14:00, Alvaro Herrera wrote: >> And do both require a recompile? > > Yes. But maybe they'll just move the contention point a little bit > backwards without actually fixing anything. When it comes to resolution, I agree with Alvaro's assessment here; unfortunately, I don't know of a great solution on community PostgreSQL outside of trying to reduce the use of exception blocks in your plpgsql code. Increasing the cache size can give a little more head room but doesn't move the contention point significantly. That single global control lock is hard to get around when you try to use subtransactions at scale. -Jeremy P.S. This applies on the Aurora PostgreSQL 9.6.3 build too but I'm discussing here in the context of community PostgreSQL code and we can put further Aurora-specific discussion on the AWS forums. -- Jeremy Schneider Database Engineer Amazon Web Services
Thanks, Jeremy …
“ That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child transaction relationships pretty much the same way that community PostgreSQL 9.6.3 does …”
This is why I posted here first. This particular wait state did not appear to be Aurora specific and was not listed as part of https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraPostgreSQL.Reference.html#AuroraPostgreSQL.Reference.Waitevents
I go back and forth posting issues between the two forums depending on the nature of it.
----------------
Thank you
From: Jeremy Schneider
Sent: Monday, August 20, 2018 6:19 PM
To: Fred Habash
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: Guideline To Resolve LWLock:SubtransControlLock
On 8/17/18 11:07, Fred Habash wrote:
> Aurora Postgres 9.6.3
Hi Fred! The Amazon team does watch the AWS forums and that's the place
to raise questions that are specific to PostgreSQL on RDS or questions
specific to Aurora. In fact we would love to see this question over
there since it might be something other people see as well.
https://forums.aws.amazon.com/forum.jspa?forumID=227
That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child
transaction relationships pretty much the same way that community
PostgreSQL 9.6.3 does. The uses you pointed out (savepoints and
exceptions in plpgsql) are the most common causes of contention I've
seen - similar to what Alvaro said his experience is. I have seen
applications grind to a halt on SubtransControlLock when they make heavy
use of exception blocks in plpgsql code; in fact it's pretty
straightforward to demonstrate this behavior with pgbench on community
PostgreSQL.
On 8/20/18 14:00, Alvaro Herrera wrote:
>> And do both require a recompile?
>
> Yes. But maybe they'll just move the contention point a little bit
> backwards without actually fixing anything.
When it comes to resolution, I agree with Alvaro's assessment here;
unfortunately, I don't know of a great solution on community PostgreSQL
outside of trying to reduce the use of exception blocks in your plpgsql
code. Increasing the cache size can give a little more head room but
doesn't move the contention point significantly. That single global
control lock is hard to get around when you try to use subtransactions
at scale.
-Jeremy
P.S. This applies on the Aurora PostgreSQL 9.6.3 build too but I'm
discussing here in the context of community PostgreSQL code and we can
put further Aurora-specific discussion on the AWS forums.
--
Jeremy Schneider
Database Engineer
Amazon Web Services
Jeremy …
In your statement, what constitutes ‘heavy use of exception blocks’?
Thanks
I have seen
applications grind to a halt on SubtransControlLock when they make heavy
use of exception blocks in plpgsql code; in fact it's pretty
straightforward to demonstrate this behavior with pgbench on community
PostgreSQL.
----------------
Thank you
From: Jeremy Schneider
Sent: Monday, August 20, 2018 6:19 PM
To: Fred Habash
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: Guideline To Resolve LWLock:SubtransControlLock
On 8/17/18 11:07, Fred Habash wrote:
> Aurora Postgres 9.6.3
Hi Fred! The Amazon team does watch the AWS forums and that's the place
to raise questions that are specific to PostgreSQL on RDS or questions
specific to Aurora. In fact we would love to see this question over
there since it might be something other people see as well.
https://forums.aws.amazon.com/forum.jspa?forumID=227
That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child
transaction relationships pretty much the same way that community
PostgreSQL 9.6.3 does. The uses you pointed out (savepoints and
exceptions in plpgsql) are the most common causes of contention I've
seen - similar to what Alvaro said his experience is. I have seen
applications grind to a halt on SubtransControlLock when they make heavy
use of exception blocks in plpgsql code; in fact it's pretty
straightforward to demonstrate this behavior with pgbench on community
PostgreSQL.
On 8/20/18 14:00, Alvaro Herrera wrote:
>> And do both require a recompile?
>
> Yes. But maybe they'll just move the contention point a little bit
> backwards without actually fixing anything.
When it comes to resolution, I agree with Alvaro's assessment here;
unfortunately, I don't know of a great solution on community PostgreSQL
outside of trying to reduce the use of exception blocks in your plpgsql
code. Increasing the cache size can give a little more head room but
doesn't move the contention point significantly. That single global
control lock is hard to get around when you try to use subtransactions
at scale.
-Jeremy
P.S. This applies on the Aurora PostgreSQL 9.6.3 build too but I'm
discussing here in the context of community PostgreSQL code and we can
put further Aurora-specific discussion on the AWS forums.
--
Jeremy Schneider
Database Engineer
Amazon Web Services
On 8/22/18 13:07, Fd Habash wrote: > In your statement, what constitutes ‘heavy use of exception blocks’? > > "I have seen > applications grind to a halt on SubtransControlLock when they make heavy > use of exception blocks in plpgsql code; in fact it's pretty > straightforward to demonstrate this behavior with pgbench on community > PostgreSQL." In one of the most dramatic cases I saw, the customer was migrating from another database system and had a very large workload running on the largest instance class we currently offer. They were quite savvy and had already gone through all of the procedural code they migrated and removed all of the exception blocks. Nonetheless, when they hit their peak workload, we observed this wait event. It was finally discovered that the framework/ORM they were using had a capability to automatically use savepoints for partial rollback. They had not explicitly configured it (afaik) - but their framework was using savepoints. In some complex code paths we were seeing several hundred subtransactions within one master transaction. I haven't thoroughly tested yet, but anecdotally I don't think that you'll have a problem with contention on this lock until you get to a sufficiently large database server. The machine I described above was a 32-core box; I suspect that a box with 2 cores is going to be waiting on something else before it gets stuck here. If you want to see a system choke on this lock, just spin up a 32-core box and run two separate pgbenchs in parallel (needs to be two)... the first as select-only and the second modified to create some savepoints while updating pgbench_accounts. To directly answer the question "what constitutes heavy use": if folks are building high-throughput applications that they expect to scale nicely on PostgreSQL up to 32-core boxes and beyond, I'd suggest avoiding savepoints in any key codepaths that are part of the primary transactional workload (low-latency operations that are executed many times per second). On 8/22/18 08:48, Fd Habash wrote: > “ That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child > transaction relationships pretty much the same way that community > PostgreSQL 9.6.3 does …” > > This is why I posted here first. This particular wait state did not > appear to be Aurora specific and was not listed as part of > https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraPostgreSQL.Reference.html#AuroraPostgreSQL.Reference.Waitevents// > > I go back and forth posting issues between the two forums depending on > the nature of it. Just added it to the aforementioned Aurora docs, hopefully heading off a few future questions. -Jeremy -- Jeremy Schneider Database Engineer Amazon Web Services