Thread: Configurable FP_LOCK_SLOTS_PER_BACKEND
We're observing lockmanager wait-event spikes in a few quite heavily loaded systems.
The cases differ: the queries are different, and the Postgres versions are 12, 13, and 14.
But in all cases, the servers are quite beefy (96-128 vCPUs, ~600-800 GiB) and receive a lot of TPS (a few dozen thousand). Most queries that suffer from wait_event=lockmanager involve a substantial number of tables/indexes, often with partitioning.
FP_LOCK_SLOTS_PER_BACKEND is hard-coded to 16 in storage/proc.h – and it is now very easy to hit this threshold in a loaded system, especially, for example, if a table with a dozen indexes was partitioned. It seems any system with healthy growth hits it sooner or later.
I wonder, would it make sense to:
1) either consider increasing this hard-coded value, taking into account that 16 seems to be very low for modern workloads, schemas, and hardware – say, it could be 64,
2) or even make it configurable – a new GUC.
Nikolay Samokhvalov
Founder, Postgres.ai
On 7/13/23 07:02, Nikolay Samokhvalov wrote:
> We're observing a few cases with lockmanager spikes in a few quite
> loaded systems.
>
> These cases are different; queries are different, Postgres versions are
> 12, 13, and 14.
>
> But in all cases, servers are quite beefy (96-128 vCPUs, ~600-800 GiB)
> receiving a lot of TPS (a few dozens of thousands). Most queries that
> struggle from wait_event=lockmanager involve a substantial number of
> tables/indexes, often with partitioning.
>
> FP_LOCK_SLOTS_PER_BACKEND is now hard-coded 16 in storage/proc.h – and
> it is now very easy to hit this threshold in a loaded system,
> especially, for example, if a table with a dozen of indexes was
> partitioned. It seems any system with good growth hits it sooner or later.
>
> I wonder, would it make sense to:
> 1) either consider increasing this hard-coded value, taking into account
> that 16 seems to be very low for modern workloads, schemas, and hardware
> – say, it could be 64,

Well, that has a cost too, as it makes PGPROC larger, right? At the moment that struct is already ~880B / 14 cachelines, adding 48 XIDs would make it +192B / +3 cachelines. I doubt that won't impact other common workloads ...

However, the lmgr/README says this is meant to alleviate contention on the lmgr partition locks. Wouldn't it be better to increase the number of those locks, without touching the PGPROC stuff?

Could this be tuned using some heuristics based on the number of connections?

> 2) or even make it configurable – a new GUC.

I'm rather strongly against adding yet another GUC for this low-level stuff. We already have enough of such knobs that it's almost impossible for regular users to actually tune the system without doing something wrong. I'd even say it's actively harmful, especially if it's aimed at very common setups/workloads (like here).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I thought it might be helpful to share some more details from one of the case studies behind Nik's suggestion.
Bursty contention on lock_manager lwlocks recently became a recurring cause of query throughput drops for GitLab.com, and we got to study the behavior via USDT and uprobe instrumentation along with more conventional observations (see https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301). This turned up some interesting finds, and I thought sharing some of that research might be helpful.
Results so far suggest that increasing FP_LOCK_SLOTS_PER_BACKEND would have a much larger positive impact than any other mitigation strategy we have evaluated. Rather than reducing hold duration or collision rate, adding fastpath slots reduces the frequency of even having to acquire those lock_manager lwlocks. I suspect this would be helpful for many other workloads, particularly those having high-frequency queries whose tables and indexes collectively number more than about 16.
Lowering the lock_manager lwlock acquisition rate means lowering its contention rate (and probably also its contention duration, since exclusive mode forces concurrent lockers to queue).
I'm confident this would help our workload, and I strongly suspect it would be generally helpful by letting queries use fastpath locking more often.
> However, the lmgr/README says this is meant to alleviate contention on
> the lmgr partition locks. Wouldn't it be better to increase the number
> of those locks, without touching the PGPROC stuff?
That was my first thought too, but growing the lock_manager lwlock tranche isn't nearly as helpful.
On the slowpath, each relation's lock tag deterministically hashes onto a specific lock_manager lwlock, so growing the number of lock_manager lwlocks just makes it less likely for two or more frequently locked relations to hash onto the same lock_manager.
In contrast, growing the number of fastpath slots completely avoids calls to the slowpath (i.e. no need to acquire a lock_manager lwlock).
The saturation condition we'd like to solve is heavy contention on one or more of the lock_manager lwlocks. Since that is driven by the slowpath acquisition rate of heavyweight locks, avoiding the slowpath is better than just moderately reducing the contention on the slowpath.
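To make that concrete, here is a toy standalone C model of the behavior described above. It is an illustration only, not the actual lock.c code: it ignores the eligibility checks the real fastpath does (weak lock modes only, no strong lock present), and it stands in a simple modulo for the real lock-tag hash. The point it shows is that once a query touches more relations than there are fastpath slots, the remainder always take the slowpath and always land on the same partition lwlocks, no matter how many relations are involved.

    #include <stdio.h>

    #define FP_SLOTS        16   /* FP_LOCK_SLOTS_PER_BACKEND */
    #define LOCK_PARTITIONS 16   /* lock_manager partition lwlocks */

    int main(void)
    {
        unsigned int first_oid = 100000;
        unsigned int nrels = 28;   /* e.g. one table plus its indexes/partitions */

        for (unsigned int i = 0; i < nrels; i++)
        {
            unsigned int oid = first_oid + i;

            if (i < FP_SLOTS)
                printf("rel %u: fastpath slot %u, no lwlock taken\n", oid, i);
            else
                /* simplified stand-in for hashing the full lock tag */
                printf("rel %u: slowpath, lock_manager partition %u\n",
                       oid, oid % LOCK_PARTITIONS);
        }
        return 0;
    }

Every backend running the same query takes the slowpath for the same relations and therefore queues on the same handful of partition lwlocks, which is exactly the contention pattern described above.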
To be fair, increasing the number of lock_manager locks definitely can help to a certain extent, but it doesn't cover an important general case. As a thought experiment, suppose we increase the lock_manager tranche to some arbitrarily large size, larger than the number of relations in the db. This unrealistically large size means we have the best case for avoiding collisions -- each relation maps uniquely onto its own lock_manager lwlock. That helps a lot in the case where the workload is spread among many non-overlapping sets of relations. But it doesn't help a workload where any one table is accessed frequently via slowpath locking.
Continuing the thought experiment, if that frequently queried table has 16 or more indexes, or if it is joined to other tables that collectively add up to over 16 relations, then each of those queries is guaranteed to have to use the slowpath and acquire the deterministically associated lock_manager lwlocks.
So growing the tranche of lock_manager lwlocks would help for some workloads, while other workloads would not be helped much at all. (As a concrete example, a workload at GitLab has several frequently queried tables with over 16 indexes that consequently always use at least some slowpath locks.)
For additional context:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate
Summarizes the pathology and its current mitigations.
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678
Documents the supporting research methodology.
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510
What code paths require an exclusive mode lwlock for lock_manager?
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142
Comparison of fastpath vs. slowpath locking, including quantifying the rate difference.
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726
Confirms the acquisition rate of lock_manager locks is not uniform. The sampled workload has a 3x difference in the most vs. least frequently acquired lock_manager lock, corresponding to the workload's most frequently accessed relations.
> Well, that has a cost too, as it makes PGPROC larger, right? At the
> moment that struct is already ~880B / 14 cachelines, adding 48 XIDs
> would make it +192B / +3 cachelines. I doubt that won't impact other
> common workloads ...
That's true; growing the data structure may affect L2/L3 cache hit rates when touching PGPROC. Is that cost worth the benefit of using fastpath for a higher percentage of table locks? The answer may be workload- and platform-specific. Exposing this as a GUC gives the admin a way to make a different choice if our default (currently 16) is bad for them.
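For reference, the arithmetic behind the size estimate quoted above, assuming 4-byte slot entries (both Oid and XID are 4 bytes) and 64-byte cachelines:

    #include <stdio.h>

    int main(void)
    {
        int extra_slots = 64 - 16;                    /* growing from 16 to 64 slots */
        int entry_size  = 4;                          /* one 4-byte entry per slot */
        int extra_bytes = extra_slots * entry_size;   /* 48 * 4 = 192 bytes */

        printf("+%d bytes = +%d cachelines of 64 bytes\n",
               extra_bytes, extra_bytes / 64);
        return 0;
    }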
I share your reluctance to add another low-level tunable, but like many other GUCs, having a generally reasonable default that can be adjusted is better than forcing folks to fork postgres to adjust a compile-time constant. And unfortunately I don't see a better way to solve this problem. Growing the lock_manager lwlock tranche isn't as effective, because it doesn't help workloads where one or more relations are locked frequently enough to hit this saturation point.
Handling a larger percentage of heavyweight lock acquisitions via fastpath instead of slowpath seems likely to help many high-throughput workloads, since it avoids having to exclusively acquire an lwlock. It seemed like the least intrusive general-purpose solution we've come up with so far. That's why we wanted to solicit feedback or new ideas from the community. Currently, the only options folks have to solve this class of saturation are some combination of schema changes, application changes, vertical scaling, and spreading the query rate among more postgres instances. Those options are often neither feasible nor efficient. Lacking a better solution, exposing a GUC that rarely needs tuning seems reasonable to me.
Anyway, hopefully the extra context is helpful! Please do share your thoughts.
Matt Smiley | Staff Site Reliability Engineer at GitLab
On 8/3/23 01:51, Matt Smiley wrote: > I thought it might be helpful to share some more details from one of the > case studies behind Nik's suggestion. > > Bursty contention on lock_manager lwlocks recently became a recurring > cause of query throughput drops for GitLab.com, and we got to study the > behavior via USDT and uprobe instrumentation along with more > conventional observations (see > https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301 > <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301>). > This turned up some interesting finds, and I thought sharing some of > that research might be helpful. > The analysis in the linked gitlab issue is pretty amazing. I wasn't planning to argue against the findings anyway, but plenty of data supporting the conclusions is good. I'm not an expert on locking, so some of the stuff I say may be trivially obvious - it's just me thinking about ... I wonder what's the rough configuration of those systems, though. Both the hardware and PostgreSQL side. How many cores / connections, etc.? > Results so far suggest that increasing FP_LOCK_SLOTS_PER_BACKEND would > have a much larger positive impact than any other mitigation strategy we > have evaluated. Rather than reducing hold duration or collision rate, > adding fastpath slots reduces the frequency of even having to acquire > those lock_manager lwlocks. I suspect this would be helpful for many > other workloads, particularly those having high frequency queries whose > tables collectively have more than about 16 or indexes. > Yes, I agree with that. Partitioning makes this issue works, I guess. Schemas with indexes on every column are disturbingly common these days too, which hits the issue too ... > Lowering the lock_manager lwlock acquisition rate means lowering its > contention rate (and probably also its contention duration, since > exclusive mode forces concurrent lockers to queue). > > I'm confident this would help our workload, and I strongly suspect it > would be generally helpful by letting queries use fastpath locking more > often. > OK >> However, the lmgr/README says this is meant to alleviate contention on >> the lmgr partition locks. Wouldn't it be better to increase the number >> of those locks, without touching the PGPROC stuff? > > That was my first thought too, but growing the lock_manager lwlock > tranche isn't nearly as helpful. > > On the slowpath, each relation's lock tag deterministically hashes onto > a specific lock_manager lwlock, so growing the number of lock_manager > lwlocks just makes it less likely for two or more frequently locked > relations to hash onto the same lock_manager. > Hmmm, so if we have a query that joins 16 tables, or a couple tables with indexes, all backends running this will acquire exactly the same partition locks. And we're likely acquiring them in exactly the same order (to lock objects in the same order because of deadlocks), making the issue worse. > In contrast, growing the number of fastpath slots completely avoids > calls to the slowpath (i.e. no need to acquire a lock_manager lwlock). > > The saturation condition we'd like to solve is heavy contention on one > or more of the lock_manager lwlocks. Since that is driven by the > slowpath acquisition rate of heavyweight locks, avoiding the slowpath is > better than just moderately reducing the contention on the slowpath. > > To be fair, increasing the number of lock_manager locks definitely can > help to a certain extent, but it doesn't cover an important general > case. 
As a thought experiment, suppose we increase the lock_manager > tranche to some arbitrarily large size, larger than the number of > relations in the db. This unrealistically large size means we have the > best case for avoiding collisions -- each relation maps uniquely onto > its own lock_manager lwlock. That helps a lot in the case where the > workload is spread among many non-overlapping sets of relations. But it > doesn't help a workload where any one table is accessed frequently via > slowpath locking. > Understood. > Continuing the thought experiment, if that frequently queried table has > 16 or more indexes, or if it is joined to other tables that collectively > add up to over 16 relations, then each of those queries is guaranteed to > have to use the slowpath and acquire the deterministically associated > lock_manager lwlocks. > > So growing the tranche of lock_manager lwlocks would help for some > workloads, while other workloads would not be helped much at all. (As a > concrete example, a workload at GitLab has several frequently queried > tables with over 16 indexes that consequently always use at least some > slowpath locks.) > > For additional context: > > https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate> > Summarizes the pathology and its current mitigations. > > https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678 <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678> > Documents the supporting research methodology. > > https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510 <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510> > What code paths require an exclusive mode lwlock for lock_manager? > > https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142 <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142> > Comparison of fastpath vs. slowpath locking, including quantifying the > rate difference. > > https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726 <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726> > Confirms the acquisition rate of lock_manager locks is not uniform. The > sampled workload has a 3x difference in the most vs. least frequently > acquired lock_manager lock, corresponding to the workload's most > frequently accessed relations. > Those are pretty great pieces of information. I wonder if some of the measurements may be affecting the observation (by consuming too much CPU, making the contention worse), but overall it seems convincing. Would it be difficult to sample just a small fraction of the calls? Say, 1%, to get good histograms/estimated with acceptable CPU usage. In any case, it's a great source of information to reproduce the issue and evaluate possible fixes. >> Well, that has a cost too, as it makes PGPROC larger, right? At the >> moment that struct is already ~880B / 14 cachelines, adding 48 XIDs >> would make it +192B / +3 cachelines. I doubt that won't impact other >> common workloads ... > > That's true; growing the data structure may affect L2/L3 cache hit rates > when touching PGPROC. Is that cost worth the benefit of using fastpath > for a higher percentage of table locks? The answer may be workload- and > platform-specific. 
Exposing this as a GUC gives the admin a way to make > a different choice if our default (currently 16) is bad for them. > After looking at the code etc. I think the main trade-off here is going to be the cost of searching the fpRelId array. At the moment it's searched linearly, which is cheap for 16 locks. But at some point it'll become as expensive as updating the slowpath, and the question is when. I wonder if we could switch to a more elaborate strategy if the number of locks is high enough. Say, a hash table, or some hybrid approach. > I share your reluctance to add another low-level tunable, but like many > other GUCs, having a generally reasonable default that can be adjusted > is better than forcing folks to fork postgres to adjust a compile-time > constant. And unfortunately I don't see a better way to solve this > problem. Growing the lock_manager lwlock tranche isn't as effective, > because it doesn't help workloads where one or more relations are locked > frequently enough to hit this saturation point. > I understand. I have two concerns: 1) How would the users know they need to tune this / determine what's the right value, and what's the right value for their system. 2) Having to deal with misconfigured systems as people tend to blindly tune everything to 100x the default, because more is better :-( > Handling a larger percentage of heavyweight lock acquisitions via > fastpath instead of slowpath seems likely to help many high-throughput > workloads, since it avoids having to exclusively acquire an lwlock. It > seemed like the least intrusive general-purpose solution we've come up > with so far. That's why we wanted to solicit feedback or new ideas from > the community. Currently, the only options folks have to solve this > class of saturation are through some combination of schema changes, > application changes, vertical scaling, and spreading the query rate > among more postgres instances. Those are not feasible and efficient > options. Lacking a better solution, exposing a GUC that rarely needs > tuning seems reasonable to me. > > Anyway, hopefully the extra context is helpful! Please do share your > thoughts. > Absolutely! I think the next step for me is to go through the analysis again, and try to construct a couple of different workloads hitting this in some way. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 8/3/23 22:39, Tomas Vondra wrote: > On 8/3/23 01:51, Matt Smiley wrote: >> I thought it might be helpful to share some more details from one of the >> case studies behind Nik's suggestion. >> >> Bursty contention on lock_manager lwlocks recently became a recurring >> cause of query throughput drops for GitLab.com, and we got to study the >> behavior via USDT and uprobe instrumentation along with more >> conventional observations (see >> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301 >> <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301>). >> This turned up some interesting finds, and I thought sharing some of >> that research might be helpful. >> > > The analysis in the linked gitlab issue is pretty amazing. I wasn't > planning to argue against the findings anyway, but plenty of data > supporting the conclusions is good. > > I'm not an expert on locking, so some of the stuff I say may be > trivially obvious - it's just me thinking about ... > > I wonder what's the rough configuration of those systems, though. Both > the hardware and PostgreSQL side. How many cores / connections, etc.? > > >> Results so far suggest that increasing FP_LOCK_SLOTS_PER_BACKEND would >> have a much larger positive impact than any other mitigation strategy we >> have evaluated. Rather than reducing hold duration or collision rate, >> adding fastpath slots reduces the frequency of even having to acquire >> those lock_manager lwlocks. I suspect this would be helpful for many >> other workloads, particularly those having high frequency queries whose >> tables collectively have more than about 16 or indexes. >> > > Yes, I agree with that. Partitioning makes this issue works, I guess. > Schemas with indexes on every column are disturbingly common these days > too, which hits the issue too ... > >> Lowering the lock_manager lwlock acquisition rate means lowering its >> contention rate (and probably also its contention duration, since >> exclusive mode forces concurrent lockers to queue). >> >> I'm confident this would help our workload, and I strongly suspect it >> would be generally helpful by letting queries use fastpath locking more >> often. >> > > OK > >>> However, the lmgr/README says this is meant to alleviate contention on >>> the lmgr partition locks. Wouldn't it be better to increase the number >>> of those locks, without touching the PGPROC stuff? >> >> That was my first thought too, but growing the lock_manager lwlock >> tranche isn't nearly as helpful. >> >> On the slowpath, each relation's lock tag deterministically hashes onto >> a specific lock_manager lwlock, so growing the number of lock_manager >> lwlocks just makes it less likely for two or more frequently locked >> relations to hash onto the same lock_manager. >> > > Hmmm, so if we have a query that joins 16 tables, or a couple tables > with indexes, all backends running this will acquire exactly the same > partition locks. And we're likely acquiring them in exactly the same > order (to lock objects in the same order because of deadlocks), making > the issue worse. > >> In contrast, growing the number of fastpath slots completely avoids >> calls to the slowpath (i.e. no need to acquire a lock_manager lwlock). >> >> The saturation condition we'd like to solve is heavy contention on one >> or more of the lock_manager lwlocks. Since that is driven by the >> slowpath acquisition rate of heavyweight locks, avoiding the slowpath is >> better than just moderately reducing the contention on the slowpath. 
>> >> To be fair, increasing the number of lock_manager locks definitely can >> help to a certain extent, but it doesn't cover an important general >> case. As a thought experiment, suppose we increase the lock_manager >> tranche to some arbitrarily large size, larger than the number of >> relations in the db. This unrealistically large size means we have the >> best case for avoiding collisions -- each relation maps uniquely onto >> its own lock_manager lwlock. That helps a lot in the case where the >> workload is spread among many non-overlapping sets of relations. But it >> doesn't help a workload where any one table is accessed frequently via >> slowpath locking. >> > > Understood. > >> Continuing the thought experiment, if that frequently queried table has >> 16 or more indexes, or if it is joined to other tables that collectively >> add up to over 16 relations, then each of those queries is guaranteed to >> have to use the slowpath and acquire the deterministically associated >> lock_manager lwlocks. >> >> So growing the tranche of lock_manager lwlocks would help for some >> workloads, while other workloads would not be helped much at all. (As a >> concrete example, a workload at GitLab has several frequently queried >> tables with over 16 indexes that consequently always use at least some >> slowpath locks.) >> >> For additional context: >> >> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate> >> Summarizes the pathology and its current mitigations. >> >> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678 <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678> >> Documents the supporting research methodology. >> >> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510 <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510> >> What code paths require an exclusive mode lwlock for lock_manager? >> >> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142 <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142> >> Comparison of fastpath vs. slowpath locking, including quantifying the >> rate difference. >> >> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726 <https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726> >> Confirms the acquisition rate of lock_manager locks is not uniform. The >> sampled workload has a 3x difference in the most vs. least frequently >> acquired lock_manager lock, corresponding to the workload's most >> frequently accessed relations. >> > > Those are pretty great pieces of information. I wonder if some of the > measurements may be affecting the observation (by consuming too much > CPU, making the contention worse), but overall it seems convincing. > > Would it be difficult to sample just a small fraction of the calls? Say, > 1%, to get good histograms/estimated with acceptable CPU usage. > > In any case, it's a great source of information to reproduce the issue > and evaluate possible fixes. > >>> Well, that has a cost too, as it makes PGPROC larger, right? At the >>> moment that struct is already ~880B / 14 cachelines, adding 48 XIDs >>> would make it +192B / +3 cachelines. I doubt that won't impact other >>> common workloads ... 
>> >> That's true; growing the data structure may affect L2/L3 cache hit rates >> when touching PGPROC. Is that cost worth the benefit of using fastpath >> for a higher percentage of table locks? The answer may be workload- and >> platform-specific. Exposing this as a GUC gives the admin a way to make >> a different choice if our default (currently 16) is bad for them. >> > > After looking at the code etc. I think the main trade-off here is going > to be the cost of searching the fpRelId array. At the moment it's > searched linearly, which is cheap for 16 locks. But at some point it'll > become as expensive as updating the slowpath, and the question is when. > > I wonder if we could switch to a more elaborate strategy if the number > of locks is high enough. Say, a hash table, or some hybrid approach. > >> I share your reluctance to add another low-level tunable, but like many >> other GUCs, having a generally reasonable default that can be adjusted >> is better than forcing folks to fork postgres to adjust a compile-time >> constant. And unfortunately I don't see a better way to solve this >> problem. Growing the lock_manager lwlock tranche isn't as effective, >> because it doesn't help workloads where one or more relations are locked >> frequently enough to hit this saturation point. >> > > I understand. I have two concerns: > > 1) How would the users know they need to tune this / determine what's > the right value, and what's the right value for their system. > > 2) Having to deal with misconfigured systems as people tend to blindly > tune everything to 100x the default, because more is better :-( > > >> Handling a larger percentage of heavyweight lock acquisitions via >> fastpath instead of slowpath seems likely to help many high-throughput >> workloads, since it avoids having to exclusively acquire an lwlock. It >> seemed like the least intrusive general-purpose solution we've come up >> with so far. That's why we wanted to solicit feedback or new ideas from >> the community. Currently, the only options folks have to solve this >> class of saturation are through some combination of schema changes, >> application changes, vertical scaling, and spreading the query rate >> among more postgres instances. Those are not feasible and efficient >> options. Lacking a better solution, exposing a GUC that rarely needs >> tuning seems reasonable to me. >> >> Anyway, hopefully the extra context is helpful! Please do share your >> thoughts. >> > > Absolutely! I think the next step for me is to go through the analysis > again, and try to construct a couple of different workloads hitting this > in some way. > FWIW I did some progress on this - I think I managed to reproduce the issue on a synthetic workload (with a partitioned table, using variable number of partitions / indexes). It's hard to say for sure how serious the reproduced cases are, but I do see spikes of lock_manager wait events, and so on. That however brings me to the second step - I was planning to increase the FP_LOCK_SLOTS_PER_BACKEND value and see how much it helps, and also measure some of the negative impact, to get a better idea what the trade offs are. But I quickly realized it's far more complicated than just increasing the define. The thing is, it's not enough to make fpRelId larger, there's also fpLockBits, tracking additional info about the locks (lock mode etc.). But FP_LOCK_SLOTS_PER_BACKEND does not affect that, it's just that int64 is enough to store the bits for 16 fastpath locks. 
Note: This means one of the "mitigations" in the analysis (just rebuild postgres with custom FP_LOCK_SLOTS_PER_BACKEND value) won't work. I tried to fix that in a naive way (replacing it with an int64 array, with one value for 16 locks), but I must be missing something as there are locking failures. I'm not sure I'll have time to hack on this soon, but if someone else wants to take a stab at it and produce a minimal patch, I might be able to run more tests on it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
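To illustrate the fpLockBits sizing issue Tomas describes: each fastpath slot needs a few lock-mode bits (3 per slot in current releases, if I recall the FAST_PATH_BITS_PER_SLOT define correctly), so 16 slots fit comfortably in a single 64-bit word, but any larger slot count forces fpLockBits to become an array of words — which is exactly what a rebuild with only a bigger FP_LOCK_SLOTS_PER_BACKEND misses. A rough standalone sketch of the arithmetic (not the actual proc.h/lock.c code):

    #include <stdio.h>

    #define BITS_PER_SLOT 3   /* lock-mode bits per fastpath slot (assumed) */

    int main(void)
    {
        for (int slots = 16; slots <= 128; slots *= 2)
        {
            int bits  = slots * BITS_PER_SLOT;
            int words = (bits + 63) / 64;     /* 64-bit words needed */

            printf("%3d slots -> %3d mode bits -> %d uint64 word(s)\n",
                   slots, bits, words);
        }
        return 0;
    }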
Hi,

On 2023-08-02 16:51:29 -0700, Matt Smiley wrote:
> I thought it might be helpful to share some more details from one of the
> case studies behind Nik's suggestion.
>
> Bursty contention on lock_manager lwlocks recently became a recurring cause
> of query throughput drops for GitLab.com, and we got to study the behavior
> via USDT and uprobe instrumentation along with more conventional
> observations (see
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301). This
> turned up some interesting finds, and I thought sharing some of that
> research might be helpful.

Hm, I'm curious whether you have a way to trigger the issue outside of your prod environment. Mainly because I'm wondering if you're potentially hitting the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so you'd not see the benefit unless you reproduced the load in 16+.

I'm also wondering whether the throughput drops are possibly correlated with heavyweight lock contention or higher-frequency access to the pg_locks view. Deadlock checking and the locks view acquire locks on all lock manager partitions... So if there's a bout of real lock contention (for longer than deadlock_timeout)...

Given that most of your lock manager traffic comes from query planning - have you evaluated using prepared statements more heavily?

Greetings,

Andres Freund
On 8/6/23 16:44, Tomas Vondra wrote:
> ...
>
> I'm not sure I'll have time to hack on this soon, but if someone else
> wants to take a stab at it and produce a minimal patch, I might be able
> to run more tests on it.
>

Nah, I gave it another try, handling the bitmap in a different way, and that happened to work sufficiently well. So here are some results and also the bench script.

Note: I haven't reproduced the terrible regressions described in this thread. Either I don't have a system large enough, the workload may not be exactly right, or maybe it's due to the commit missing in older branches (mentioned by Andres). Still, the findings seem interesting.

The workload is very simple - create a table "t" with a certain number of partitions and indexes, add a certain number of rows (100, 10k or 1M), and do "select count(*) from t". And measure throughput. There's also a script collecting wait-event/lock info, but I haven't looked at that.

I did this for current master (17dev), with two patch versions.

master - current master, with 16 fast-path slots

v1 increases the number of slots to 64, but switches to a single array combining the bitmap and OIDs. I'm sure the original approach (separate bitmap) can be made to work, and it's possible this is responsible for a small regression in some runs (more about it in a minute).

v2 was an attempt to address the small regressions in v1, which may be due to having to search larger arrays. The core always walks the whole array, even if we know there have never been that many entries. So this tries to track the last used slot, and stop the loops earlier.

The attached PDF visualizes the results, and differences between master and the two patches. It plots throughput against number of rows / tables and indexes, and also concurrent clients. The last two columns show throughput vs. master, with a simple color scale: green - speedup (good), red - regression (bad).

Let's talk about the smallest data set (100 rows). The 10k test has the same behavior, with smaller differences (as the locking accounts for a smaller part of the total duration). On the 1M data set the patches make almost no difference.

There's a pretty clear flip once we reach 16 partitions - on master the throughput drops from 310k tps to 210k tps (for 32 clients, the machine has 32 cores). With both patches, the drop is to only about 240k tps, so ~20% improvement compared to master.

The other interesting thing is behavior with many clients:

    clients      1     16     32     64     96    128
    master   17603 169132 204870 199390 199662 196774
    v1       17062 180475 234644 266372 267105 265885
    v2       18972 187294 242838 275064 275931 274332

So the master "stagnates" or maybe even drops off, while with both patches the throughput continues to grow beyond 32 clients. This is even more obvious for 32 or 64 partitions - for 32, the results are

    clients      1     16     32     64     96    128
    master   11292  93783 111223 102580  95197  87800
    v1       12025 123503 168382 179952 180191 179846
    v2       12501 126438 174255 185435 185332 184938

That's a pretty massive improvement, IMO. Would help OLTP scalability.

For 10k rows the pattern is the same, although the differences are less significant. For 1M rows there's no speedup.

The bad news is this seems to have a negative impact on cases with few partitions, that'd fit into 16 slots. Which is not surprising, as the code has to walk longer arrays, and that probably affects caching etc. So this would hurt the systems that don't use that many relations - not much, but still.

The regression appears to be consistently ~3%, and v2 aimed to improve that - at least for the case with just 100 rows. It even gains ~5% in a couple cases. It's however a bit strange that v2 doesn't really help the two larger cases.

Overall, I think this seems interesting - it's hard to not like doubling the throughput in some cases. Yes, it's 100 rows only, and the real improvements are bound to be smaller, but it would help short OLTP queries that only process a couple of rows.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
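For readers trying to picture the v2 change, here is a minimal sketch of the "track the last used slot" idea - illustrative only, not the patch itself, and the struct and field names are made up. The search stops at a per-backend high-water mark instead of always walking the full array:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define FP_SLOTS 64

    typedef uint32_t Oid;

    typedef struct
    {
        Oid fpRelId[FP_SLOTS];
        int fpUsed;               /* high-water mark: slots ever occupied */
    } FakeProc;

    static bool
    fp_lookup(const FakeProc *proc, Oid relid)
    {
        /* only scan up to the high-water mark, not all FP_SLOTS entries */
        for (int i = 0; i < proc->fpUsed; i++)
            if (proc->fpRelId[i] == relid)
                return true;
        return false;
    }

    static bool
    fp_insert(FakeProc *proc, Oid relid)
    {
        if (proc->fpUsed >= FP_SLOTS)
            return false;          /* no free slot: caller uses the slowpath */
        proc->fpRelId[proc->fpUsed++] = relid;
        return true;
    }

    int main(void)
    {
        FakeProc proc = {0};

        fp_insert(&proc, 16384);
        fp_insert(&proc, 16385);
        printf("16385 found: %d (scanned %d of %d slots)\n",
               fp_lookup(&proc, 16385), proc.fpUsed, FP_SLOTS);
        return 0;
    }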
On Mon, Aug 07, 2023 at 12:51:24PM +0200, Tomas Vondra wrote:
> The bad news is this seems to have negative impact on cases with few
> partitions, that'd fit into 16 slots. Which is not surprising, as the
> code has to walk longer arrays, it probably affects caching etc. So this
> would hurt the systems that don't use that many relations - not much,
> but still.
>
> The regression appears to be consistently ~3%, and v2 aimed to improve
> that - at least for the case with just 100 rows. It even gains ~5% in a
> couple cases. It's however a bit strange v2 doesn't really help the two
> larger cases.
>
> Overall, I think this seems interesting - it's hard to not like doubling
> the throughput in some cases. Yes, it's 100 rows only, and the real
> improvements are bound to be smaller, it would help short OLTP queries
> that only process a couple rows.

Indeed. I wonder whether we could mitigate the regressions by using SIMD intrinsics in the loops. Or auto-vectorization, if that is possible.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Aug 7, 2023 at 6:51 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> The regression appears to be consistently ~3%, and v2 aimed to improve
> that - at least for the case with just 100 rows. It even gains ~5% in a
> couple cases. It's however a bit strange v2 doesn't really help the two
> larger cases.

To me, that's an absolutely monumental regression for a change like this. Sure, lots of people have partitioned tables. But also, lots of people don't. Penalizing very simple queries by 2-3% seems completely over the top to me. I can't help wondering whether there's actually something wrong with the test, or the coding, because that seems huge to me.

I would also argue that the results are actually not that great, because once you get past 64 partitions you're right back where you started, or maybe worse off. To me, there's nothing magical about cases between 16 and 64 relations that makes them deserve special treatment - plenty of people are going to want to use hundreds of partitions, and even if you only use a few dozen, this isn't going to help as soon as you join two or three partitioned tables, and I suspect it hurts whenever it doesn't help.

I think we need a design that scales better. I don't really know what that would look like, exactly, but your idea of a hybrid approach seems like it might be worth further consideration. We don't have to store an infinite number of fast-path locks in an array that we search linearly, and it might be that if we moved to some other approach we could avoid some of the regression. You mentioned a hash table; a partially associative cache might also be worth considering, like having an array of 1k slots but dividing it logically into 128 bins of 16 slots each and only allowing an OID to be recorded in the bin whose low 7 bits match the low 7 bits of the OID. But maybe first we need to understand where all the CPU cycles are going, because maybe that's optimizing completely the wrong thing and, again, it seems like an awfully large hit.

Of course, another thing we could do is try to improve the main lock manager somehow. I confess that I don't have a great idea for that at the moment, but the current locking scheme there is from a very, very long time ago and clearly wasn't designed with modern hardware in mind.

--
Robert Haas
EDB: http://www.enterprisedb.com
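To sketch the partially associative layout Robert describes (purely illustrative, with the constants taken from his example, and all names made up): 1024 slots split into 128 bins of 16, where a relation is only eligible for the bin selected by the low 7 bits of its OID, so any lookup or insert still scans at most 16 entries.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BINS          128
    #define SLOTS_PER_BIN 16          /* 128 * 16 = 1024 slots total */

    typedef uint32_t Oid;

    static Oid fp_slots[BINS][SLOTS_PER_BIN];   /* 0 = empty slot */

    static bool
    fp_insert(Oid relid)
    {
        Oid *bin = fp_slots[relid & (BINS - 1)];   /* low 7 bits pick the bin */

        for (int i = 0; i < SLOTS_PER_BIN; i++)
        {
            if (bin[i] == 0)
            {
                bin[i] = relid;
                return true;
            }
        }
        return false;   /* bin full: this relation must use the slowpath */
    }

    int main(void)
    {
        printf("oid 16384 inserted: %d\n", fp_insert(16384));
        printf("oid 16512 inserted: %d\n", fp_insert(16512));  /* same bin: low 7 bits match */
        return 0;
    }

The appeal of this layout is that the per-acquisition search cost stays bounded regardless of how large the overall slot array grows; the downside is that a cluster of hot relations whose OIDs share low bits can still overflow a single bin.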
On 8/7/23 19:05, Robert Haas wrote: > On Mon, Aug 7, 2023 at 6:51 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> The regression appears to be consistently ~3%, and v2 aimed to improve >> that - at least for the case with just 100 rows. It even gains ~5% in a >> couple cases. It's however a bit strange v2 doesn't really help the two >> larger cases. > > To me, that's an absolutely monumental regression for a change like > this. Sure, lots of people have partitioned tables. But also, lots of > people don't. Penalizing very simple queries by 2-3% seems completely > over the top to me. I can't help wondering whether there's actually > something wrong with the test, or the coding, because that seems huge > to me. > I'm the first to admit the coding (in my patches) is far from perfect, and this may easily be a consequence of that. My motivation was to get some quick measurements for the "bad case". > I would also argue that the results are actually not that great, > because once you get past 64 partitions you're right back where you > started, or maybe worse off. To me, there's nothing magical about > cases between 16 and 64 relations that makes them deserve special > treatment - plenty of people are going to want to use hundreds of > partitions, and even if you only use a few dozen, this isn't going to > help as soon as you join two or three partitioned tables, and I > suspect it hurts whenever it doesn't help. > That's true, but doesn't that apply to any cache that can overflow? You could make the same argument about the default value of 16 slots - why not to have just 8? FWIW I wasn't really suggesting we should increase the value to 64, I was just trying to get a better idea of the costs at play (fast-path cache maintenance and regular locking). > I think we need a design that scales better. I don't really know what > that would look like, exactly, but your idea of a hybrid approach > seems like it might be worth further consideration. We don't have to > store an infinite number of fast-path locks in an array that we search > linearly, and it might be better that if we moved to some other > approach we could avoid some of the regression. You mentioned a hash > table; a partially associative cache might also be worth considering, > like having an array of 1k slots but dividing it logically into 128 > bins of 16 slots each and only allowing an OID to be recorded in the > bin whose low 7 bits match the low 7 bits of the OID. Yes, I agree. I don't know if this particular design would be the right one (1000 elements seems a bit too much for something included right in PGPROC). But yeah, something that flips from linear search to something else would be reasonable. > > But maybe first we need to understand where all the CPU cycles are > going, because maybe that's optimizing completely the wrong thing and, > again, it seems like an awfully large hit. > Right. We're mostly just guessing what the issue is. > Of course, another thing we could do is try to improve the main lock > manager somehow. I confess that I don't have a great idea for that at > the moment, but the current locking scheme there is from a very, very > long time ago and clearly wasn't designed with modern hardware in > mind. > No idea, but I'd bet some of the trade offs may need re-evaluation. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 8/7/23 18:56, Nathan Bossart wrote:
> On Mon, Aug 07, 2023 at 12:51:24PM +0200, Tomas Vondra wrote:
>> The bad news is this seems to have negative impact on cases with few
>> partitions, that'd fit into 16 slots. Which is not surprising, as the
>> code has to walk longer arrays, it probably affects caching etc. So this
>> would hurt the systems that don't use that many relations - not much,
>> but still.
>>
>> The regression appears to be consistently ~3%, and v2 aimed to improve
>> that - at least for the case with just 100 rows. It even gains ~5% in a
>> couple cases. It's however a bit strange v2 doesn't really help the two
>> larger cases.
>>
>> Overall, I think this seems interesting - it's hard to not like doubling
>> the throughput in some cases. Yes, it's 100 rows only, and the real
>> improvements are bound to be smaller, it would help short OLTP queries
>> that only process a couple rows.
>
> Indeed. I wonder whether we could mitigate the regressions by using SIMD
> intrinsics in the loops. Or auto-vectorization, if that is possible.

Maybe, but from what I know about SIMD it would require a lot of changes to the design, so that the loops don't mix accesses to different PGPROC fields (fpLockBits, fpRelId) and so on. But I think it'd be better to just stop walking the whole array regularly.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 7, 2023 at 3:02 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > I would also argue that the results are actually not that great,
> > because once you get past 64 partitions you're right back where you
> > started, or maybe worse off. To me, there's nothing magical about
> > cases between 16 and 64 relations that makes them deserve special
> > treatment - plenty of people are going to want to use hundreds of
> > partitions, and even if you only use a few dozen, this isn't going to
> > help as soon as you join two or three partitioned tables, and I
> > suspect it hurts whenever it doesn't help.
>
> That's true, but doesn't that apply to any cache that can overflow? You
> could make the same argument about the default value of 16 slots - why
> not to have just 8?

Yes and no. I mean, there are situations where when the cache overflows, you still get a lot of benefit out of the entries that you are able to cache, as when the frequency of access follows some kind of non-uniform distribution, Zipfian or decreasing geometrically or whatever. There are also situations where you can just make the cache big enough that as a practical matter it's never going to overflow. I can't think of a PostgreSQL-specific example right now, but if you find that a 10-entry cache of other people living in your house isn't good enough, a 200-entry cache should solve the problem for nearly everyone alive. If that doesn't cause a resource crunch, crank up the cache size and forget about it. But here we have neither of those situations. The access frequency is basically uniform, and the cache size needed to avoid overflows seems to be unrealistically large, at least given the current design. So I think that in this case upping the cache size figures to be much less effective than in some other cases.

It's also a bit questionable whether "cache" is even the right word here. I'd say it isn't, because it's not like the information in the fast-path locking structures is a subset of the full information stored elsewhere. Whatever information is stored there is canonical for those entries.

> Yes, I agree. I don't know if this particular design would be the right
> one (1000 elements seems a bit too much for something included right in
> PGPROC). But yeah, something that flips from linear search to something
> else would be reasonable.

Yeah ... or there could be a few slots in the PGPROC and then a bit indicating whether to jump to a larger shared memory structure located in a separate array. Not sure exactly.

--
Robert Haas
EDB: http://www.enterprisedb.com
Thank you Tomas! I really appreciate your willingness to dig in here and help us out! The rest of my replies are inline below.
On Thu, Aug 3, 2023 at 1:39 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> The analysis in the linked gitlab issue is pretty amazing. I wasn't
> planning to argue against the findings anyway, but plenty of data
> supporting the conclusions is good.
Thank you! I totally agree, having supporting data is so helpful.
> I'm not an expert on locking, so some of the stuff I say may be
> trivially obvious - it's just me thinking about ...
Absolutely makes sense to check assumptions, etc. Thanks for being open! For what it's worth, I've also been working with Postgres for many years, and I love that it keeps teaching me new things, this topic being just the latest.
> I wonder what's the rough configuration of those systems, though. Both
> the hardware and PostgreSQL side. How many cores / connections, etc.?
Each of the postgres hosts had 96 vCPUs and at peak handled roughly 80 concurrently active connections.
For purposes of reproducing the pathology, I think we can do so with a single postgres instance. We will need a high enough query rate to push the bottleneck to lock_manager lwlock contention. The simplest way to do so is probably to give it a small dataset that fits easily in cache and run several concurrent client connections doing cheap single-row queries, each in its own transaction, against a target table that has either many indexes or partitions or both.
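As a rough illustration of that recipe, a minimal libpq driver for one client might look like the sketch below. Everything here is an assumption for illustration: the table name "t", its "id" column, and the idea that many copies of this loop would run concurrently (one per client process) against a table that already has many indexes or partitions.

    #include <stdio.h>
    #include <libpq-fe.h>

    int main(void)
    {
        /* connection parameters come from the usual PG* environment variables */
        PGconn *conn = PQconnectdb("");

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* cheap single-row query against a heavily indexed/partitioned table;
         * each iteration is its own implicit transaction, so the relation and
         * index locks are re-acquired every time */
        for (long i = 0; i < 1000000; i++)
        {
            PGresult *res = PQexec(conn, "SELECT count(*) FROM t WHERE id = 1");

            if (PQresultStatus(res) != PGRES_TUPLES_OK)
            {
                fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
                PQclear(res);
                break;
            }
            PQclear(res);
        }

        PQfinish(conn);
        return 0;
    }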
For context, here's a brief summary of the production environment where we first observed this pathology:
The writable primary postgres instance has several streaming replicas, used for read-only portions of the workload. All of them run on equivalent hardware. Most of the research focuses on the streaming replica postgres instances, although the same pathology has been observed in the writable primary node as well. The general topology is thousands of client connections fanning down into several pgbouncer instances per Postgres instance. From each Postgres instance's perspective, its workload generally has a daily peak of roughly 80 concurrently active backends supporting a throughput of 75K transactions per second, where most transactions run a single query.
> Yes, I agree with that. Partitioning makes this issue worse, I guess.
> Schemas with indexes on every column are disturbingly common these days
> too, which hits the issue too ...
Agreed.
> Those are pretty great pieces of information. I wonder if some of the
> measurements may be affecting the observation (by consuming too much
> CPU, making the contention worse), but overall it seems convincing.
Yes, definitely overhead is a concern, glad you asked!
Here are my notes on the overhead for each bpftrace script: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678
Here is a summary of where that overhead comes from: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365310956
Here are more generic benchmark results for uprobe overhead: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1383
Briefly, we generally expect the instrumentation overhead to be roughly 1-2 microseconds per call to the instrumented instruction. It partly depends on what we're doing in the instrumentation, but most of that overhead is just the interrupt-handling to transfer control flow to/from the BPF code.
> Would it be difficult to sample just a small fraction of the calls? Say,
> 1%, to get good histograms/estimates with acceptable CPU usage.
That would be great, but since the overhead comes mostly from the control transfer, it wouldn't help to put sampling logic in the tracer itself. The main way to mitigate that overhead is to choose instrumentation points where the call rate is tolerably low. That's why the only instrumentation I used for more than a few seconds at a time were the "low overhead" scripts that instrument only the stalled call to LWLockAcquire.
> In any case, it's a great source of information to reproduce the issue
> and evaluate possible fixes.
Thanks, that's my hope!
> After looking at the code etc. I think the main trade-off here is going
> to be the cost of searching the fpRelId array. At the moment it's
> searched linearly, which is cheap for 16 locks. But at some point it'll
> become as expensive as updating the slowpath, and the question is when.
>
> I wonder if we could switch to a more elaborate strategy if the number
> of locks is high enough. Say, a hash table, or some hybrid approach.
Interesting idea! I was hoping a linear search would stay cheap enough but you're right, it's going to become too inefficient at some point. It might make sense to start with just blackbox timing or throughput measurements, because directly measuring that search duration may not be cheap. To observe durations via BPF, we have to instrument 2 points (e.g. function entry and return, or more generally the instructions before and after the critical section we're observing). For code called as frequently as LWLockAcquire, that overhead would be prohibitively expensive, so we might need to measure it natively with counters for each histogram bucket we care about. Just thinking ahead; we don't need to deal with this yet, I guess.
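If it comes to that, the native measurement could be as simple as bumping a per-bucket counter around the code being measured. The sketch below is illustrative only: the bucket layout, counter names, and use of clock_gettime() are arbitrary choices, not a proposal for how it would actually be wired into the backend.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define NBUCKETS 20
    static uint64_t search_ns_hist[NBUCKETS];   /* power-of-two duration buckets */

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t) ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static void record(uint64_t ns)
    {
        int b = 0;
        while (ns > 1 && b < NBUCKETS - 1)   /* roughly floor(log2(ns)) */
        {
            ns >>= 1;
            b++;
        }
        search_ns_hist[b]++;
    }

    int main(void)
    {
        uint64_t start = now_ns();
        /* ... the code being measured, e.g. the fpRelId search ... */
        record(now_ns() - start);

        for (int b = 0; b < NBUCKETS; b++)
            if (search_ns_hist[b] > 0)
                printf("bucket %2d (~2^%d ns): %llu\n",
                       b, b, (unsigned long long) search_ns_hist[b]);
        return 0;
    }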
I understand. I have two concerns:
1) How would users know they need to tune this, and how would they
determine the right value for their system?
2) Having to deal with misconfigured systems as people tend to blindly
tune everything to 100x the default, because more is better :-(
These are very valid concerns. Thanks for articulating them!
For point 1:
The advice might be to only increase the number of slots for fastpath locks per backend if sampling pg_stat_activity frequently shows "lock_manager" wait_events affecting a significant percentage of your non-idle backends. And increase it as little as possible, due to the extra overhead incurred when checking locks. For calibration purposes, I polled pg_locks periodically, counting the number of slowpath locks where the lock mode was weak enough to qualify for fastpath if there had been enough fastpath slots available. See the graph and SQL at the bottom of this note: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142.
For point 2:
Maybe we should put a ceiling on the allowed value. Maybe 64 or 128 or 256? It should probably depend on the cost of searching the fpRelId array, as you described earlier.
On 8/7/23 21:21, Robert Haas wrote: > On Mon, Aug 7, 2023 at 3:02 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >>> I would also argue that the results are actually not that great, >>> because once you get past 64 partitions you're right back where you >>> started, or maybe worse off. To me, there's nothing magical about >>> cases between 16 and 64 relations that makes them deserve special >>> treatment - plenty of people are going to want to use hundreds of >>> partitions, and even if you only use a few dozen, this isn't going to >>> help as soon as you join two or three partitioned tables, and I >>> suspect it hurts whenever it doesn't help. >> >> That's true, but doesn't that apply to any cache that can overflow? You >> could make the same argument about the default value of 16 slots - why >> not to have just 8? > > Yes and no. I mean, there are situations where when the cache > overflows, you still get a lot of benefit out of the entries that you > are able to cache, as when the frequency of access follows some kind > of non-uniform distribution, Zipfian or decreasing geometrically or > whatever. There are also situations where you can just make the cache > big enough that as a practical matter it's never going to overflow. I > can't think of a PostgreSQL-specific example right now, but if you > find that a 10-entry cache of other people living in your house isn't > good enough, a 200-entry cache should solve the problem for nearly > everyone alive. If that doesn't cause a resource crunch, crank up the > cache size and forget about it. But here we have neither of those > situations. The access frequency is basically uniform, and the cache > size needed to avoid overflows seems to be unrealistically large, at > least given the current design. So I think that in this case upping > the cache size figures to be much less effective than in some other > cases. > Why would the access frequency be uniform? In particular, there's a huge variability in how long the locks need to exist - IIRC we may be keeping locks for tables for a long time, but not for indexes. From this POV it might be better to do fast-path locking for indexes, no? > It's also a bit questionable whether "cache" is even the right word > here. I'd say it isn't, because it's not like the information in the > fast-path locking structures is a subset of the full information > stored elsewhere. Whatever information is stored there is canonical > for those entries. > Right. Calling this a cache might be a bit misleading. >> Yes, I agree. I don't know if this particular design would be the right >> one (1000 elements seems a bit too much for something included right in >> PGPROC). But yeah, something that flips from linear search to something >> else would be reasonable. > > Yeah ... or there could be a few slots in the PGPROC and then a bit > indicating whether to jump to a larger shared memory structure located > in a separate array. Not sure exactly. > Maybe, but isn't that mostly what the regular non-fast-path locking does? Wouldn't that defeat the whole purpose of fast-path locking? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 7, 2023 at 3:48 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Why would the access frequency be uniform? In particular, there's a huge > variability in how long the locks need to exist - IIRC we may be keeping > locks for tables for a long time, but not for indexes. From this POV it > might be better to do fast-path locking for indexes, no? If you're not using explicit transactions, you take a bunch of locks at the start of a statement and then release all of them at the end. None of the locks stick around so fast-path locking structure goes through cycles where it starts out empty, fills up to N items, and then goes back to empty. If you visualize it as a cache, we're flushing the entire cache at the end of every operation. If you run multiple statements in a transaction, the locks will be kept until the end of the transaction, once acquired. So then you could start with a small number and gradually accumulate more. But then you're going to release them all at once at the end. The main thing that matters here seems to be whether or not all of the locks can go through the fast-path mechanism, or how many have to go through the regular mechanism. It shouldn't matter, AFAICS, *which ones* go through the fast-path mechanism. If you think it does, I'd like to hear why - it's possible I'm missing something here. > Maybe, but isn't that mostly what the regular non-fast-path locking > does? Wouldn't that defeat the whole purpose of fast-path locking? I don't think so. The main lock manager has two flaws that hinder performance in comparison with the fast-path mechanism. The first, but less important, one is that the data structures are just a lot simpler. For access to a small number of fixed-size elements, a C array is hard to beat, and the main lock manager data structures are a lot more complex. The second one, which I think is more important, is that we've essentially flipped the ordering of the primary key. In the main lock manager, you start by hashing the locked object and that gives you a partition number and you then take that partition lock. Then, you iterate through a list of backends that have that object locked. This means that if a lot of people are taking locks on the same object, even if there's no actual conflict between the lock modes, you still get a lot of contention. But in the fast-path mechanism, it's reversed: first, you go to the shared memory *for your backend* and then you search through it for the particular locked object at issue. So basically the main lock manager treats the primary key as (what, who) while the fast-path mechanism treats it as (who, what). And that gets rid of a ton of contention because then different backends locking the same object (in sufficiently weak lock modes) never touch the same cache lines, so there's actually zero contention. That is, I believe, the most important thing about the fast-path locking system. What I've just said is slightly complicated by the existence of FastPathStrongRelationLockData, which is concurrently accessed by all backends when using fast-path locking, but it's only read-only as nobody actually takes a strong lock (like an AccessExclusiveLock on a table). So you probably do get some cache line effects there, but because it's read-only, they don't cause too much of a headache. We do have to be careful that the overhead of checking multiple locking data structures doesn't add up to a problem, for sure. 
But there can still, I believe, be a lot of benefit in dividing up access first by "who" and then by "what" for weak relation locks even if the per-backend data structures become more complex. Or at least I hope so. -- Robert Haas EDB: http://www.enterprisedb.com
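For readers following along, here is a tiny standalone illustration of the two key orders described above (all structures here are invented, not PostgreSQL's): with the object-first layout every weak locker of the same relation writes the same shared entry, while with the backend-first layout each backend only ever writes its own slots.

    /* Toy contrast of the two indexing orders.  "what, who": one shared
     * entry per locked object, updated by every backend that locks it.
     * "who, what": a small private table per backend, never written by
     * anyone else, so weak lockers don't share cache lines. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NBACKENDS 4
    #define FP_SLOTS  16

    /* (what, who): keyed by object; everyone who locks relid writes here */
    typedef struct ObjectFirst
    {
        uint32_t relid;
        uint32_t nholders;
        uint8_t  holder[NBACKENDS];
    } ObjectFirst;

    /* (who, what): keyed by backend; only its owner ever writes it */
    typedef struct BackendFirst
    {
        uint32_t relid[FP_SLOTS];
    } BackendFirst;

    int
    main(void)
    {
        ObjectFirst  shared = { .relid = 16384 };
        BackendFirst mine[NBACKENDS];

        memset(mine, 0, sizeof(mine));
        for (int b = 0; b < NBACKENDS; b++)
        {
            /* object-first: all backends dirty the same shared entry */
            shared.holder[shared.nholders++] = (uint8_t) b;

            /* backend-first: each backend dirties only its own slot array */
            mine[b].relid[0] = 16384;
        }
        printf("object-first entry written by %u backends; "
               "backend-first entries written only by their owners\n",
               shared.nholders);
        return 0;
    }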
Hi Andres, thanks for helping! Great questions, replies are inline below.
On Sun, Aug 6, 2023 at 1:00 PM Andres Freund <andres@anarazel.de> wrote:
Hm, I'm curious whether you have a way to trigger the issue outside of your
prod environment. Mainly because I'm wondering if you're potentially hitting
the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so
you'd not see the benefit unless you reproduced the load in 16+.
Thanks for sharing this!
I have not yet written a reproducer since we see this daily in production. I have a sketch of a few ways that I think will reproduce the behavior we're observing, but haven't had time to implement it.
I'm not sure if we're seeing this behavior in production, but it's definitely an interesting find. Currently we are running postgres 12.11, with an upcoming upgrade to 15 planned. Good to know there's a potential improvement waiting in 16. I noticed that in LWLockAcquire the call to LWLockDequeueSelf occurs (https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1218) directly between the unsuccessful attempt to immediately acquire the lock and reporting the backend's wait event. The distinctive indicators we have been using for this pathology are that "lock_manager" wait_event and its associated USDT probe (https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1236-L1237), both of which occur after whatever overhead is incurred by LWLockDequeueSelf. As you mentioned in your commit message, that overhead is hard to detect. My first impression is that whatever overhead it incurs is in addition to what we are investigating.
I'm also wondering if it's possible that the reason for the throughput drops
are possibly correlated with heavyweight contention or higher frequency access
to the pg_locks view. Deadlock checking and the locks view acquire locks on
all lock manager partitions... So if there's a bout of real lock contention
(for longer than deadlock_timeout)...
Great questions, but we ruled that out. The deadlock_timeout is 5 seconds, so frequently hitting that would massively violate SLO and would alert the on-call engineers. The pg_locks view is scraped a couple times per minute for metrics collection, but the lock_manager lwlock contention can be observed thousands of times every second, typically with very short durations. The following example (captured just now) shows the number of times per second over a 10-second window that any 1 of the 16 "lock_manager" lwlocks was contended:
msmiley@patroni-main-2004-103-db-gprd.c.gitlab-production.internal:~$ sudo ./bpftrace -e 'usdt:/usr/lib/postgresql/12/bin/postgres:lwlock__wait__start /str(arg0) == "lock_manager"/ { @[arg1] = count(); } interval:s:1 { print(@); clear(@); } interval:s:10 { exit(); }'
Attaching 5 probes...
@[0]: 12122
@[0]: 12888
@[0]: 13011
@[0]: 13348
@[0]: 11461
@[0]: 10637
@[0]: 10892
@[0]: 12334
@[0]: 11565
@[0]: 11596
Typically that contention only lasts a couple microseconds. But the long tail can sometimes be much slower. Details here: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365159507.
Given that most of your lock manager traffic comes from query planning - have
you evaluated using prepared statements more heavily?
Yes, there are unrelated obstacles to doing so -- that's a separate can of worms, unfortunately. But in this pathology, even if we used prepared statements, the backend would still need to reacquire the same locks during each executing transaction. So in terms of lock acquisition rate, whether it's via the planner or executor doing it, the same relations have to be locked.
Hi, On 2023-08-07 13:59:26 -0700, Matt Smiley wrote: > I have not yet written a reproducer since we see this daily in production. > I have a sketch of a few ways that I think will reproduce the behavior > we're observing, but haven't had time to implement it. > > I'm not sure if we're seeing this behavior in production It might be worth for you to backpatch commit 92daeca45df Author: Andres Freund <andres@anarazel.de> Date: 2022-11-21 20:34:17 -0800 Add wait event for pg_usleep() in perform_spin_delay() into 12. That should be low risk and have only trivially resolvable conflicts. Alternatively, you could use bpftrace et al to set a userspace probe on perform_spin_delay(). > , but it's definitely an interesting find. Currently we are running > postgres 12.11, with an upcoming upgrade to 15 planned. Good to know > there's a potential improvement waiting in 16. I noticed that in > LWLockAcquire the call to LWLockDequeueSelf occurs ( > https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1218) > directly between the unsuccessful attempt to immediately acquire the lock > and reporting the backend's wait event. That's normal. > > I'm also wondering if it's possible that the reason for the throughput > > drops > > are possibly correlated with heavyweight contention or higher frequency > > access > > to the pg_locks view. Deadlock checking and the locks view acquire locks on > > all lock manager partitions... So if there's a bout of real lock contention > > (for longer than deadlock_timeout)... > > > > Great questions, but we ruled that out. The deadlock_timeout is 5 seconds, > so frequently hitting that would massively violate SLO and would alert the > on-call engineers. The pg_locks view is scraped a couple times per minute > for metrics collection, but the lock_manager lwlock contention can be > observed thousands of times every second, typically with very short > durations. The following example (captured just now) shows the number of > times per second over a 10-second window that any 1 of the 16 > "lock_manager" lwlocks was contended: Some short-lived contention is fine and expected - the question is how long the waits are... Unfortunately my experience is that the overhead of bpftrace means that analyzing things like this with bpftrace is very hard... :(. > > Given that most of your lock manager traffic comes from query planning - > > have you evaluated using prepared statements more heavily? > > > > Yes, there are unrelated obstacles to doing so -- that's a separate can of > worms, unfortunately. But in this pathology, even if we used prepared > statements, the backend would still need to reacquire the same locks during > each executing transaction. So in terms of lock acquisition rate, whether > it's via the planner or executor doing it, the same relations have to be > locked. Planning will often lock more database objects than query execution. Which can keep you using fastpath locks for longer. Greetings, Andres Freund
Why would the access frequency be uniform? In particular, there's a huge
variability in how long the locks need to exist
As a supporting data point, our example production workload shows a 3x difference between the most and least frequently contended lock_manager locks.
Since we deterministically distribute relations among those 16 lock_manager lwlocks by hashing their lock tag, we can probably assume a roughly uniform number of relations are being managed by each lock_manager lock, but the demand (and contention) for them is non-uniform. This 3x spread corroborates the intuition that some relations are locked more frequently than others (that being both a schema- and workload-specific property).
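A small standalone toy (made-up hash and workload, not PostgreSQL code) of that partition mapping shows the same shape: relations land on the 16 partitions roughly evenly, but the acquisition traffic does not, because a few hot relations dominate.

    /* Toy model of the partition mapping: hash each relation and take the
     * result modulo 16 (the number of lock_manager lwlock partitions).
     * Objects spread evenly while traffic does not. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LOCK_PARTITIONS 16

    static uint32_t
    tag_hash(uint32_t dboid, uint32_t reloid)
    {
        uint32_t h = dboid * 40503u ^ reloid * 2654435761u;

        h ^= h >> 16;
        return h;
    }

    int
    main(void)
    {
        uint64_t nrels[NUM_LOCK_PARTITIONS] = {0};
        uint64_t traffic[NUM_LOCK_PARTITIONS] = {0};

        /* 1000 relations; a handful of "hot" ones are locked 500x more often */
        for (uint32_t i = 0; i < 1000; i++)
        {
            uint32_t part = tag_hash(5, 16384 + i) % NUM_LOCK_PARTITIONS;

            nrels[part] += 1;
            traffic[part] += (i < 5) ? 500 : 1;
        }
        for (int p = 0; p < NUM_LOCK_PARTITIONS; p++)
            printf("partition %2d: %3llu relations, %5llu acquisitions\n",
                   p,
                   (unsigned long long) nrels[p],
                   (unsigned long long) traffic[p]);
        return 0;
    }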
Since we're contemplating a new hashing scheme, I wonder how we could accommodate that kind of asymmetry, where some relations are locked more frequently than others.
Hi, On 2023-08-07 13:05:32 -0400, Robert Haas wrote: > I would also argue that the results are actually not that great, > because once you get past 64 partitions you're right back where you > started, or maybe worse off. To me, there's nothing magical about > cases between 16 and 64 relations that makes them deserve special > treatment - plenty of people are going to want to use hundreds of > partitions, and even if you only use a few dozen, this isn't going to > help as soon as you join two or three partitioned tables, and I > suspect it hurts whenever it doesn't help. > > I think we need a design that scales better. I don't really know what > that would look like, exactly, but your idea of a hybrid approach > seems like it might be worth further consideration. We don't have to > store an infinite number of fast-path locks in an array that we search > linearly, and it might be better that if we moved to some other > approach we could avoid some of the regression. My gut feeling is that the state for fast path locks doesn't live in quite the right place. What if fast path locks entered PROCLOCK into the shared hashtable, just like with normal locks, the first time a lock is acquired by a backend. Except that we'd set a flag indicating the lock is a fastpath lock. When the lock is released, neither the LOCALLOCK nor the PROCLOCK entry would be removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the lock is not held anymore. That itself wouldn't buy us much - we'd still need to do a lookup in the shared hashtable. But, by the time we decide whether to use fast path locks, we've already done a hash lookup in the LOCALLOCK hashtable. Because the PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get the PROCLOCK entry without a shared hash table lookup. Acquiring a strong lock on a fastpath lock would basically entail modifying all the relevant PROCLOCKs atomically to indicate that fast path locks aren't possible anymore. Acquiring a fast path lock would just require atomically modifying the PROCLOCK to indicate that the lock is held. On a first blush, this sounds like it could end up being fairly clean and generic? Greetings, Andres Freund
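To make the proposal above a bit more tangible, here is a deliberately over-simplified standalone toy (invented names; it ignores the hard parts a real design would need, such as handling already-held fast-path entries when a strong locker arrives): a retained per-backend entry whose "held" and "fastpath allowed" bits are flipped atomically, instead of the entry being created and destroyed on every acquire/release.

    /* Very rough toy of the retained-entry idea, not PostgreSQL's PROCLOCK. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_BACKENDS 8

    typedef struct ToyProcLock
    {
        _Atomic unsigned flags;         /* bit 0: held, bit 1: fastpath allowed */
    } ToyProcLock;

    #define TPL_HELD        1u
    #define TPL_FP_ALLOWED  2u

    /* one retained entry per backend for a single locked object */
    static ToyProcLock entries[MAX_BACKENDS];

    static bool
    fastpath_acquire(int backend)
    {
        unsigned old = atomic_load(&entries[backend].flags);

        /* only take the fast path while no strong locker has disabled it */
        if (!(old & TPL_FP_ALLOWED))
            return false;
        return atomic_compare_exchange_strong(&entries[backend].flags,
                                              &old, old | TPL_HELD);
    }

    static void
    fastpath_release(int backend)
    {
        atomic_fetch_and(&entries[backend].flags, ~TPL_HELD);
    }

    static void
    strong_lock_prepare(void)
    {
        /* a would-be strong locker turns the fast path off for everyone,
         * then falls back to the regular mechanism; a real design would
         * also have to deal with entries that are already held */
        for (int b = 0; b < MAX_BACKENDS; b++)
            atomic_fetch_and(&entries[b].flags, ~TPL_FP_ALLOWED);
    }

    int
    main(void)
    {
        for (int b = 0; b < MAX_BACKENDS; b++)
            atomic_init(&entries[b].flags, TPL_FP_ALLOWED);

        printf("fastpath acquire: %d\n", fastpath_acquire(0));
        fastpath_release(0);
        strong_lock_prepare();
        printf("fastpath acquire after strong locker: %d\n", fastpath_acquire(0));
        return 0;
    }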
Hi, On 2023-08-07 14:36:48 -0700, Andres Freund wrote: > What if fast path locks entered PROCLOCK into the shared hashtable, just like > with normal locks, the first time a lock is acquired by a backend. Except that > we'd set a flag indicating the lock is a fastpath lock. When the lock is > released, neither the LOCALLOCK nor the PROCLOCK entry would be > removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the > lock is not held anymore. > > That itself wouldn't buy us much - we'd still need to do a lookup in the > shared hashtable. But, by the time we decide whether to use fast path locks, > we've already done a hash lookup in the LOCALLOCK hashtable. Because the > PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get > the PROCLOCK entry without a shared hash table lookup. > > Acquiring a strong lock on a fastpath lock would basically entail modifying > all the relevant PROCLOCKs atomically to indicate that fast path locks aren't > possible anymore. Acquiring a fast path lock would just require atomically > modifying the PROCLOCK to indicate that the lock is held. > > On a first blush, this sounds like it could end up being fairly clean and > generic? On 2023-08-07 13:05:32 -0400, Robert Haas wrote: > Of course, another thing we could do is try to improve the main lock > manager somehow. I confess that I don't have a great idea for that at > the moment, but the current locking scheme there is from a very, very > long time ago and clearly wasn't designed with modern hardware in > mind. I think the biggest flaw of the locking scheme is that the LockHash locks protect two, somewhat independent, things: 1) the set of currently lockable objects, i.e. the entries in the hash table [partition] 2) the state of all the locks [in a partition] It'd not be that hard to avoid the shared hashtable lookup in a number of cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest above. But we can't, in general, avoid the lock on the partition anyway, as the each lock's state is also protected by the partition lock. The amount of work to do a lookup in the shared hashtable and/or create a new entry therein, is quite bound. But the work for acquiring a lock is much less so. We'll e.g. often have to iterate over the set of lock holders etc. I think we ought to investigate whether pushing down the locking for the "lock state" into the individual locks is worth it. That way the partitioned lock would just protect the hashtable. The biggest issue I see is deadlock checking. Right now acquiring all lock partitions gives you a consistent view of all the non-fastpath locks - and fastpath locks can't participate in deadlocks. Any scheme that makes "lock state" locking in general more granular, will make it next to impossible to have a similarly consistent view of all locks. I'm not sure the current degree of consistency is required however - the lockers participating in a lock cycle, pretty much by definition, are blocked. A secondary issue is that making the locks more granular could affect the happy path measurably - we'd need two atomics for each heavyweight lock acquisition, not one. But if we cached the lookup in the shared hashtable, we'd commonly be able to skip the hashtable lookup... Greetings, Andres Freund
On Mon, Aug 7, 2023 at 6:05 PM Andres Freund <andres@anarazel.de> wrote: > I think the biggest flaw of the locking scheme is that the LockHash locks > protect two, somewhat independent, things: > 1) the set of currently lockable objects, i.e. the entries in the hash table [partition] > 2) the state of all the locks [in a partition] > > It'd not be that hard to avoid the shared hashtable lookup in a number of > cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest > above. But we can't, in general, avoid the lock on the partition anyway, as > the each lock's state is also protected by the partition lock. Yes, and that's a huge problem. The main selling point of the whole fast-path mechanism is to ease the pressure on the lock manager partition locks, and if we did something like what you described in the previous email without changing the locking regimen, we'd bring all of that contention back. I'm pretty sure that would suck. > The amount of work to do a lookup in the shared hashtable and/or create a new > entry therein, is quite bound. But the work for acquiring a lock is much less > so. We'll e.g. often have to iterate over the set of lock holders etc. > > I think we ought to investigate whether pushing down the locking for the "lock > state" into the individual locks is worth it. That way the partitioned lock > would just protect the hashtable. I think this would still suck. Suppose you put an LWLock or slock_t in each LOCK. If you now run a lot of select queries against the same table (e.g. pgbench -S -c 64 -j 64), everyone is going to fight over the lock counts for that table. Here again, the value of the fast-path system is that it spreads out the contention in ways that approaches like this can't do. Or, hmm, maybe what you're really suggesting is pushing the state down into each PROCLOCK rather than each LOCK. That would be more promising if we could do it, because that is per-lock *and also per-backend*. But you can't decide from looking at a single PROCLOCK whether a new lock at some given lock mode is grantable or not, at least not with the current PROCLOCK representation. I think any workable solution here has to allow a backend to take a weak relation lock without contending with other backends trying to take the same weak relation lock (provided there are no strong lockers). Maybe backends should be able to allocate PROCLOCKs and record weak relation locks there without actually linking them up to LOCK objects, or something like that. Anyone who wants a strong lock must first go and find all of those objects for the LOCK they want and connect them up to that LOCK. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-08-08 16:44:37 -0400, Robert Haas wrote: > On Mon, Aug 7, 2023 at 6:05 PM Andres Freund <andres@anarazel.de> wrote: > > I think the biggest flaw of the locking scheme is that the LockHash locks > > protect two, somewhat independent, things: > > 1) the set of currently lockable objects, i.e. the entries in the hash table [partition] > > 2) the state of all the locks [in a partition] > > > > It'd not be that hard to avoid the shared hashtable lookup in a number of > > cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest > > above. But we can't, in general, avoid the lock on the partition anyway, as > > the each lock's state is also protected by the partition lock. > > Yes, and that's a huge problem. The main selling point of the whole > fast-path mechanism is to ease the pressure on the lock manager > partition locks, and if we did something like what you described in > the previous email without changing the locking regimen, we'd bring > all of that contention back. I'm pretty sure that would suck. Yea - I tried to outline how I think we could implement the fastpath locking scheme in a less limited way in the earlier email, that I had quoted above this bit. Here I was pontificating on what we possibly should do in addition to that. I think even if we had "unlimited" fastpath locking, there's still enough pressure on the lock manager locks that it's worth improving the overall locking scheme. Greetings, Andres Freund
On 8/8/23 3:04 PM, Andres Freund wrote: > On 2023-08-08 16:44:37 -0400, Robert Haas wrote: >> On Mon, Aug 7, 2023 at 6:05 PM Andres Freund <andres@anarazel.de> wrote: >>> I think the biggest flaw of the locking scheme is that the LockHash locks >>> protect two, somewhat independent, things: >>> 1) the set of currently lockable objects, i.e. the entries in the hash table [partition] >>> 2) the state of all the locks [in a partition] >>> >>> It'd not be that hard to avoid the shared hashtable lookup in a number of >>> cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest >>> above. But we can't, in general, avoid the lock on the partition anyway, as >>> the each lock's state is also protected by the partition lock. >> >> Yes, and that's a huge problem. The main selling point of the whole >> fast-path mechanism is to ease the pressure on the lock manager >> partition locks, and if we did something like what you described in >> the previous email without changing the locking regimen, we'd bring >> all of that contention back. I'm pretty sure that would suck. > > Yea - I tried to outline how I think we could implement the fastpath locking > scheme in a less limited way in the earlier email, that I had quoted above > this bit. Here I was pontificating on what we possibly should do in addition > to that. I think even if we had "unlimited" fastpath locking, there's still > enough pressure on the lock manager locks that it's worth improving the > overall locking scheme. Has anyone considered whether increasing NUM_LOCK_PARTITIONS to something bigger than 16 might offer cheap/easy/small short-term improvements while folks continue to think about the bigger long-term ideas? cf. https://www.postgresql.org/message-id/flat/VI1PR05MB620666631A41186ACC3FC91ACFC70%40VI1PR05MB6206.eurprd05.prod.outlook.com I haven't looked deeply into it myself yet. Didn't see a mention in this thread or in Matt's gitlab research ticket. Maybe it doesn't actually help. Anyway Alexander Pyhalov's email about LWLock optimization and NUM_LOCK_PARTITIONS is out there, and I wondered about this. -Jeremy -- Jeremy Schneider Performance Engineer Amazon Web Services
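As a very rough way to reason about that question, a standalone toy (made-up hash and workload, not PostgreSQL code) can compare the busiest partition's share of a skewed lock workload at 16 versus 64 partitions; it also shows why more partitions cannot help once a single hot relation dominates its partition.

    /* Toy comparison of how a skewed set of relation locks spreads across
     * 16 vs. 64 partitions.  Purely illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t
    tag_hash(uint32_t reloid)
    {
        uint32_t h = reloid * 2654435761u;

        h ^= h >> 16;
        return h;
    }

    static void
    simulate(int nparts)
    {
        uint64_t traffic[64] = {0};
        uint64_t busiest = 0, total = 0;

        /* 1000 relations; the first 5 are locked 500x more often than the rest */
        for (uint32_t i = 0; i < 1000; i++)
        {
            uint64_t weight = (i < 5) ? 500 : 1;

            traffic[tag_hash(16384 + i) % nparts] += weight;
            total += weight;
        }
        for (int p = 0; p < nparts; p++)
            if (traffic[p] > busiest)
                busiest = traffic[p];

        printf("%2d partitions: busiest one sees %.1f%% of all acquisitions\n",
               nparts, 100.0 * (double) busiest / (double) total);
    }

    int
    main(void)
    {
        simulate(16);
        simulate(64);
        return 0;
    }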
Good morning! FYI: I know many people are/were tracking this email thread rather than the newer one "scalability bottlenecks with (many) partitions (and more)", but please see [1] [2], where Tomas committed enhanced fast-path locking to master (18). Thanks, Tomas, for your persistence on this! -J. [1] - https://www.postgresql.org/message-id/E1ss4gX-000IvX-63%40gemulon.postgresql.org [2] - https://www.postgresql.org/message-id/7c1eeafb-2375-4ff6-8469-0640d52d44ed%40vondra.me