Thread: Guideline To Resolve LWLock:SubtransControlLock

Guideline To Resolve LWLock:SubtransControlLock

From
Fred Habash
Date:
One of our database API's is run concurrently by near 40 sessions. We see all of them waiting back and forth on this wait state.

There is one scenario described in some forum where sessions connected a read-only replica are affected. This does not apply to our use case.

Why is it called Subtrans Control Lock?
What are the common user session scenarios causing this wait? 
  - I have read some describe the use of SQL savepoints or PL/pgSQL exception handling. 
What are known resolution measures?  


----------------------------------------
Thank you


Re: Guideline To Resolve LWLock:SubtransControlLock

From
Alvaro Herrera
Date:
On 2018-Aug-16, Fred Habash wrote:

> One of our database API's is run concurrently by near 40 sessions. We see
> all of them waiting back and forth on this wait state.

What version are you running?

> Why is it called Subtrans Control Lock?

It controls access to the pg_subtrans structure, which is used to record
parent/child transaction relationships (as you say, savepoints and
EXCEPTIONs in plpgsql are the most common uses, but not the only ones).
Normally lookup of these is optimized away, but once you cross a
threshold it cannot any longer.

> What are the common user session scenarios causing this wait?
>   - I have read some describe the use of SQL savepoints or PL/pgSQL
> exception handling.
> What are known resolution measures?

Are you in a position to recompile Postgres?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Guideline To Resolve LWLock:SubtransControlLock

From
Fred Habash
Date:
Aurora Postgres 9.6.3
So, no chance to recompile (AFAIK).

Is there a design anti-pattern at the schema or data access level that we should look for and correct? 

And as for the recompile, are you thinking 'NUM_SUBTRANS_BUFFERS'?

Thanks 



On Thu, Aug 16, 2018 at 2:36 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2018-Aug-16, Fred Habash wrote:

> One of our database API's is run concurrently by near 40 sessions. We see
> all of them waiting back and forth on this wait state.

What version are you running?

> Why is it called Subtrans Control Lock?

It controls access to the pg_subtrans structure, which is used to record
parent/child transaction relationships (as you say, savepoints and
EXCEPTIONs in plpgsql are the most common uses, but not the only ones).
Normally lookup of these is optimized away, but once you cross a
threshold it cannot any longer.

> What are the common user session scenarios causing this wait?
>   - I have read some describe the use of SQL savepoints or PL/pgSQL
> exception handling.
> What are known resolution measures?

Are you in a position to recompile Postgres?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--

----------------------------------------
Thank you


Re: Guideline To Resolve LWLock:SubtransControlLock

From
Alvaro Herrera
Date:
On 2018-Aug-17, Fred Habash wrote:

> Aurora Postgres 9.6.3

Oh, okay, I don't know this one.  Did you contact Amazon support?

> So, no chance to recompile (AFAIK).
> Is there a design anti-pattern at the schema or data access level that we
> should look for and correct?

Maybe ...

> And as for the recompile, are you thinking 'NUM_SUBTRANS_BUFFERS'?

Yes, that's one option, but there's also TOTAL_MAX_CACHED_SUBXIDS.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Guideline To Resolve LWLock:SubtransControlLock

From
Fred Habash
Date:
Thanks.  
How do we go about calculating appropriate values for these two parameters ...

'NUM_SUBTRANS_BUFFERS'?
TOTAL_MAX_CACHED_SUBXIDS

And do both require a recompile?


————-
Thank you. 

On Aug 17, 2018, at 2:26 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

And as for the recompile, are you thinking 'NUM_SUBTRANS_BUFFERS'?

Yes, that's one option, but there's also TOTAL_MAX_CACHED_SUBXIDS.

Re: Guideline To Resolve LWLock:SubtransControlLock

From
Alvaro Herrera
Date:
On 2018-Aug-20, Fred Habash wrote:

> How do we go about calculating appropriate values for these two parameters ...

I don't know a lot about your system, so don't have anything to go on.
Also, Aurora is mostly unknown to me.  What did Amazon say?

> > 'NUM_SUBTRANS_BUFFERS'?
> TOTAL_MAX_CACHED_SUBXIDS
> 
> And do both require a recompile?

Yes.  But maybe they'll just move the contention point a little bit
backwards without actually fixing anything.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Guideline To Resolve LWLock:SubtransControlLock

From
Jeremy Schneider
Date:
On 8/17/18 11:07, Fred Habash wrote:
> Aurora Postgres 9.6.3

Hi Fred! The Amazon team does watch the AWS forums and that's the place
to raise questions that are specific to PostgreSQL on RDS or questions
specific to Aurora. In fact we would love to see this question over
there since it might be something other people see as well.

https://forums.aws.amazon.com/forum.jspa?forumID=227

That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child
transaction relationships pretty much the same way that community
PostgreSQL 9.6.3 does. The uses you pointed out (savepoints and
exceptions in plpgsql) are the most common causes of contention I've
seen - similar to what Alvaro said his experience is. I have seen
applications grind to a halt on SubtransControlLock when they make heavy
use of exception blocks in plpgsql code; in fact it's pretty
straightforward to demonstrate this behavior with pgbench on community
PostgreSQL.

On 8/20/18 14:00, Alvaro Herrera wrote:
>> And do both require a recompile?
>
> Yes.  But maybe they'll just move the contention point a little bit
> backwards without actually fixing anything.

When it comes to resolution, I agree with Alvaro's assessment here;
unfortunately, I don't know of a great solution on community PostgreSQL
outside of trying to reduce the use of exception blocks in your plpgsql
code. Increasing the cache size can give a little more head room but
doesn't move the contention point significantly. That single global
control lock is hard to get around when you try to use subtransactions
at scale.

-Jeremy

P.S. This applies on the Aurora PostgreSQL 9.6.3 build too but I'm
discussing here in the context of community PostgreSQL code and we can
put further Aurora-specific discussion on the AWS forums.

-- 
Jeremy Schneider
Database Engineer
Amazon Web Services


RE: Guideline To Resolve LWLock:SubtransControlLock

From
Fd Habash
Date:

Thanks, Jeremy …

 

“ That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child transaction relationships pretty much the same way that community PostgreSQL 9.6.3 does …”

 

This is why I posted here first. This particular wait state did not appear to be Aurora specific and was not listed as part of https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraPostgreSQL.Reference.html#AuroraPostgreSQL.Reference.Waitevents

 

I go back and forth posting issues between the two forums depending on the nature of it.

 

 

----------------
Thank you

 

From: Jeremy Schneider
Sent: Monday, August 20, 2018 6:19 PM
To: Fred Habash
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: Guideline To Resolve LWLock:SubtransControlLock

 

On 8/17/18 11:07, Fred Habash wrote:

> Aurora Postgres 9.6.3

 

Hi Fred! The Amazon team does watch the AWS forums and that's the place

to raise questions that are specific to PostgreSQL on RDS or questions

specific to Aurora. In fact we would love to see this question over

there since it might be something other people see as well.

 

https://forums.aws.amazon.com/forum.jspa?forumID=227

 

That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child

transaction relationships pretty much the same way that community

PostgreSQL 9.6.3 does. The uses you pointed out (savepoints and

exceptions in plpgsql) are the most common causes of contention I've

seen - similar to what Alvaro said his experience is. I have seen

applications grind to a halt on SubtransControlLock when they make heavy

use of exception blocks in plpgsql code; in fact it's pretty

straightforward to demonstrate this behavior with pgbench on community

PostgreSQL.

 

On 8/20/18 14:00, Alvaro Herrera wrote:

>> And do both require a recompile?

> Yes.  But maybe they'll just move the contention point a little bit

> backwards without actually fixing anything.

 

When it comes to resolution, I agree with Alvaro's assessment here;

unfortunately, I don't know of a great solution on community PostgreSQL

outside of trying to reduce the use of exception blocks in your plpgsql

code. Increasing the cache size can give a little more head room but

doesn't move the contention point significantly. That single global

control lock is hard to get around when you try to use subtransactions

at scale.

 

-Jeremy

 

P.S. This applies on the Aurora PostgreSQL 9.6.3 build too but I'm

discussing here in the context of community PostgreSQL code and we can

put further Aurora-specific discussion on the AWS forums.

 

--

Jeremy Schneider

Database Engineer

Amazon Web Services

 

RE: Guideline To Resolve LWLock:SubtransControlLock

From
Fd Habash
Date:

Jeremy …

 

In your statement, what constitutes ‘heavy use of exception blocks’?  

 

Thanks

 

 

 

I have seen

applications grind to a halt on SubtransControlLock when they make heavy

use of exception blocks in plpgsql code; in fact it's pretty

straightforward to demonstrate this behavior with pgbench on community

PostgreSQL.

 

----------------
Thank you

 

From: Jeremy Schneider
Sent: Monday, August 20, 2018 6:19 PM
To: Fred Habash
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: Guideline To Resolve LWLock:SubtransControlLock

 

On 8/17/18 11:07, Fred Habash wrote:

> Aurora Postgres 9.6.3

 

Hi Fred! The Amazon team does watch the AWS forums and that's the place

to raise questions that are specific to PostgreSQL on RDS or questions

specific to Aurora. In fact we would love to see this question over

there since it might be something other people see as well.

 

https://forums.aws.amazon.com/forum.jspa?forumID=227

 

That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child

transaction relationships pretty much the same way that community

PostgreSQL 9.6.3 does. The uses you pointed out (savepoints and

exceptions in plpgsql) are the most common causes of contention I've

seen - similar to what Alvaro said his experience is. I have seen

applications grind to a halt on SubtransControlLock when they make heavy

use of exception blocks in plpgsql code; in fact it's pretty

straightforward to demonstrate this behavior with pgbench on community

PostgreSQL.

 

On 8/20/18 14:00, Alvaro Herrera wrote:

>> And do both require a recompile?

> Yes.  But maybe they'll just move the contention point a little bit

> backwards without actually fixing anything.

 

When it comes to resolution, I agree with Alvaro's assessment here;

unfortunately, I don't know of a great solution on community PostgreSQL

outside of trying to reduce the use of exception blocks in your plpgsql

code. Increasing the cache size can give a little more head room but

doesn't move the contention point significantly. That single global

control lock is hard to get around when you try to use subtransactions

at scale.

 

-Jeremy

 

P.S. This applies on the Aurora PostgreSQL 9.6.3 build too but I'm

discussing here in the context of community PostgreSQL code and we can

put further Aurora-specific discussion on the AWS forums.

 

--

Jeremy Schneider

Database Engineer

Amazon Web Services

 

Re: Guideline To Resolve LWLock:SubtransControlLock

From
Jeremy Schneider
Date:
On 8/22/18 13:07, Fd Habash wrote:
> In your statement, what constitutes ‘heavy use of exception blocks’?  
> 
> "I have seen
> applications grind to a halt on SubtransControlLock when they make heavy
> use of exception blocks in plpgsql code; in fact it's pretty
> straightforward to demonstrate this behavior with pgbench on community
> PostgreSQL."

In one of the most dramatic cases I saw, the customer was migrating from
another database system and had a very large workload running on the
largest instance class we currently offer. They were quite savvy and had
already gone through all of the procedural code they migrated and
removed all of the exception blocks. Nonetheless, when they hit their
peak workload, we observed this wait event.

It was finally discovered that the framework/ORM they were using had a
capability to automatically use savepoints for partial rollback. They
had not explicitly configured it (afaik) - but their framework was using
savepoints. In some complex code paths we were seeing several hundred
subtransactions within one master transaction.

I haven't thoroughly tested yet, but anecdotally I don't think that
you'll have a problem with contention on this lock until you get to a
sufficiently large database server.  The machine I described above was a
32-core box; I suspect that a box with 2 cores is going to be waiting on
something else before it gets stuck here.  If you want to see a system
choke on this lock, just spin up a 32-core box and run two separate
pgbenchs in parallel (needs to be two)... the first as select-only and
the second modified to create some savepoints while updating
pgbench_accounts.

To directly answer the question "what constitutes heavy use": if folks
are building high-throughput applications that they expect to scale
nicely on PostgreSQL up to 32-core boxes and beyond, I'd suggest
avoiding savepoints in any key codepaths that are part of the primary
transactional workload (low-latency operations that are executed many
times per second).


On 8/22/18 08:48, Fd Habash wrote:
> “ That said... FWIW, Aurora PostgreSQL version 9.6.3 uses parent/child
> transaction relationships pretty much the same way that community
> PostgreSQL 9.6.3 does …”
>
> This is why I posted here first. This particular wait state did not
> appear to be Aurora specific and was not listed as part of
>

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraPostgreSQL.Reference.html#AuroraPostgreSQL.Reference.Waitevents//
>
> I go back and forth posting issues between the two forums depending on
> the nature of it.

Just added it to the aforementioned Aurora docs, hopefully heading off a
few future questions.

-Jeremy


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services