Thread: Getting rid of "tuple concurrently updated" elog()s with concurrent DDLs (at least ALTER TABLE)
Getting rid of "tuple concurrently updated" elog()s with concurrent DDLs (at least ALTER TABLE)
From
Michael Paquier
Date:
Hi all,

Triggering "tuple concurrently updated" from heapam.c, which is an elog(), is not that difficult for some DDL commands. For example with ALTER ROLE, just create a role:

psql --no-psqlrc -c 'create role popo'

And then run that in two sessions and the error will show up, triggered from CatalogTupleUpdate():

while true; do psql --no-psqlrc -c "alter role popo password 'foo'"; done

In this case, upgrading the lock taken on pg_authid from RowExclusiveLock to ShareRowExclusiveLock prevents concurrent sessions from updating each other, which is what the attached patch does.

I think that we should get rid of all those errors: for many applications doing concurrent DDLs, getting this error is surprising, and could be taken as a sign of a corrupted instance. I am just touching the tip of the iceberg here, as I have the gut feeling that there are other objects where it is possible to trigger the failure, and an analysis effort is going to be costly. So I'd like to first take the temperature on such an analysis.

So, opinions?
--
Michael
Attachment
Re: Getting rid of "tuple concurrently updated" elog()s with concurrent DDLs (at least ALTER TABLE)
From
Andres Freund
Date:
On December 26, 2017 5:55:03 AM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote:
> Hi all,
>
> Triggering "tuple concurrently updated" from heapam.c, which is an
> elog(), is not that difficult for some DDL commands. For example with
> ALTER ROLE, just create a role:
> psql --no-psqlrc -c 'create role popo'
>
> And then run that in two sessions and the error will show up, triggered
> from CatalogTupleUpdate():
> while true; do psql --no-psqlrc -c "alter role popo password 'foo'"; done
>
> In this case, upgrading the lock taken on pg_authid from
> RowExclusiveLock to ShareRowExclusiveLock prevents concurrent sessions
> from updating each other, which is what the attached patch does.
>
> I think that we should get rid of all those errors: for many
> applications doing concurrent DDLs, getting this error is surprising,
> and could be taken as a sign of a corrupted instance. I am just
> touching the tip of the iceberg here, as I have the gut feeling that
> there are other objects where it is possible to trigger the failure,
> and an analysis effort is going to be costly. So I'd like to first take
> the temperature on such an analysis.
>
> So, opinions?

You're proposing to lock the entire relation against many forms of concurrent DDL, just to get rid of that error? That seems unacceptable.

Isn't the canonical way to solve this to take object locks?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Getting rid of "tuple concurrently updated" elog()s with concurrent DDLs (at least ALTER TABLE)
From
Michael Paquier
Date:
On Tue, Dec 26, 2017 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote:
> You're proposing to lock the entire relation against many forms of
> concurrent DDL, just to get rid of that error? That seems unacceptable.
> Isn't the canonical way to solve this to take object locks?

Sure. That's where things in lmgr.c come into play, like LockSharedObject(): you could hold an exclusive lock on a given object until the end of the transaction before opening the catalog relation with heap_open(). However, with those you need to know the object OID before taking a lock on the parent relation, right? So you face problems with lock upgrades, or race conditions show up more easily. I have to admit that I have not dug much into the problem yet; it is easy enough to have isolation tests, by the way, and I just noticed that ALTER DATABASE SET can equally trigger the error.
--
Michael
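To illustrate the shape being described, a rough C-style sketch of the per-object approach for ALTER ROLE, using the existing LockSharedObject(), heap_open() and get_role_oid() backend APIs; this is a sketch of the idea, not a working patch, and the comments mark where the remaining race window sits:

```c
/* Naive shape: look up the OID, then lock it, then open the catalog. */
Oid      roleid;
Relation rel;

roleid = get_role_oid(rolename, false);     /* unlocked name -> OID lookup */

/*
 * Serialize concurrent DDL on this one role only, instead of locking
 * all of pg_authid.  But note the window: between the lookup above and
 * this lock, the role can be dropped and recreated, so the OID we lock
 * may no longer match the name we resolved.
 */
LockSharedObject(AuthIdRelationId, roleid, 0, AccessExclusiveLock);

rel = heap_open(AuthIdRelationId, RowExclusiveLock);
/* ... CatalogTupleUpdate() on the role's tuple ... */
```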
Re: Getting rid of "tuple concurrently updated" elog()s with concurrent DDLs (at least ALTER TABLE)
From
Robert Haas
Date:
On Tue, Dec 26, 2017 at 4:14 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Dec 26, 2017 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote:
>> You're proposing to lock the entire relation against many forms of
>> concurrent DDL, just to get rid of that error? That seems unacceptable.
>> Isn't the canonical way to solve this to take object locks?
>
> Sure. That's where things in lmgr.c come into play, like
> LockSharedObject(), and you could hold an exclusive lock on a given
> object until the end of a transaction before opening the catalog
> relation with heap_open(), however with those you need to know the
> object OID before taking a lock on the parent relation, right? So you
> face problems with lock upgrades, or race conditions show up more
> easily. I have to admit that I have not dug much into the problem yet,
> it is easy enough to have isolation tests by the way, and I just
> noticed that ALTER DATABASE SET can equally trigger the error.

I don't understand what you mean by "you need to know the object OID before taking a lock on the parent relation, right?". That seems wrong.

I think you might need something like what was done in b3ad5d02c9cd8a4c884cd78480f221afe8ce5590; if, after we look up the name and before we acquire a lock on the OID, we accept any invalidation messages, recheck that the object we've locked is still the one associated with that name.

I think Andres is certainly right that locking all of pg_authid is a nonstarter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
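Transposed to shared objects, the lookup-lock-recheck pattern Robert describes would look roughly like this. A sketch only: lookup_object_oid() stands in for whatever syscache lookup applies to the object type, while LockSharedObject(), UnlockSharedObject() and AcceptInvalidationMessages() are real backend functions:

```c
for (;;)
{
    Oid objid = lookup_object_oid(name);    /* hypothetical helper */

    if (!OidIsValid(objid))
        break;                  /* gone; let the caller report that */

    LockSharedObject(classid, objid, 0, lockmode);

    /* Absorb sinval traffic queued while we slept on the lock. */
    AcceptInvalidationMessages();

    if (lookup_object_oid(name) == objid)
        return objid;           /* name still maps to what we locked */

    /* Dropped and recreated concurrently: release and retry. */
    UnlockSharedObject(classid, objid, 0, lockmode);
}
```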
Re: Getting rid of "tuple concurrently updated" elog()s with concurrent DDLs (at least ALTER TABLE)
From
Michael Paquier
Date:
On Tue, Dec 26, 2017 at 10:47:59PM -0800, Robert Haas wrote:
> On Tue, Dec 26, 2017 at 4:14 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Dec 26, 2017 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote:
>>> You're proposing to lock the entire relation against many forms of
>>> concurrent DDL, just to get rid of that error? That seems unacceptable.
>>> Isn't the canonical way to solve this to take object locks?
>>
>> Sure. That's where things in lmgr.c come into play, like
>> LockSharedObject(), and you could hold an exclusive lock on a given
>> object until the end of a transaction before opening the catalog
>> relation with heap_open(), however with those you need to know the
>> object OID before taking a lock on the parent relation, right? So you
>> face problems with lock upgrades, or race conditions show up more
>> easily. I have to admit that I have not dug much into the problem yet,
>> it is easy enough to have isolation tests by the way, and I just
>> noticed that ALTER DATABASE SET can equally trigger the error.
>
> I don't understand what you mean by "you need to know the object OID
> before taking a lock on the parent relation, right?". That seems
> wrong.

Well, the point I am trying to outline here is that it is essential to take a lock on the object's OID before opening pg_authid with heap_open() and a low-level lock, so as to avoid any concurrency issues when working on the catalog relation once opened.

> I think you might need something like what was done in
> b3ad5d02c9cd8a4c884cd78480f221afe8ce5590; if, after we look up the
> name and before we acquire a lock on the OID, we accept any
> invalidation messages, recheck that the object we've locked is still
> the one associated with that name.

Indeed, this bit is important, and RangeVarGet does so. Looking at get_object_address(), it also handles locking of the object. Using that as an interface to address all the concurrency problems could be appealing, and less invasive as a final result, as many DDLs are visibly impacted (still need to make a list here). What do you think?
--
Michael
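For reference, a rough sketch of what going through get_object_address() might look like for a role, as that function resolves the name, takes the object lock, and loops if the name-to-OID mapping changes underneath. The exact Node built for the name varies by object type, so treat the call details here as illustrative rather than authoritative:

```c
ObjectAddress addr;
Relation      rel;

/*
 * Resolves the role name to an address and locks it; internally the
 * function retries until the name -> OID mapping is stable under the
 * lock, which is the invalidation-recheck dance described above.
 */
addr = get_object_address(OBJECT_ROLE,
                          (Node *) makeString(rolename),
                          &rel,                  /* only set for relations */
                          AccessExclusiveLock,
                          false);                /* missing_ok */
```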
Attachment
Re: Getting rid of "tuple concurrently updated" elog()s with concurrent DDLs (at least ALTER TABLE)
From
Michael Paquier
Date:
On Wed, Dec 27, 2017 at 04:53:42PM +0900, Michael Paquier wrote:
> Indeed, this bit is important, and RangeVarGet does so. Looking at
> get_object_address(), it also handles locking of the object. Using
> that as an interface to address all the concurrency problems could be
> appealing, and less invasive as a final result, as many DDLs are
> visibly impacted (still need to make a list here). What do you think?

So, I have looked at all the object types to spot concurrency problems, and I have found some more. And the winners are:
- database
- role
- domain
- event trigger
- FDW
- server
- user mapping
- type
- tablespace

Most of the failures show "tuple concurrently updated" using a given set of ALTER commands running concurrently, but I have been able to trigger as well one "cache lookup failure" for domains, and a duplicate key value violation for databases.

I have done some extra checks with parallel ALTER commands for the following objects as well, without spotting issues:
- access method
- cast
- attribute
- extension
- view
- policy
- publication
- subscription
- rule
- transform
- text search stuff
- statistics
- operator [family], etc.

As looking for those failures manually is no fun, I am attaching a patch which includes a set of isolation regression tests able to trigger them. You just need to look for unwelcome ERROR messages and you are good to go. The goal moving forward will be to take care of all those failures.

Please note that isolation tests don't allow dynamic input, so those have no tests, but you can reproduce an error easily with ALTER TABLESPACE SET SCHEMA between two sessions. Note as well that the domain test will fail because the cache lookup error on domains exposes the domain OID, but that's nothing to care about.

Opinions or thoughts?
--
Michael
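For context, the specs in the attached patch presumably have the usual isolation-tester shape; the sketch below shows what one such test for roles might look like (file, role, and step names invented here, not taken from the attachment):

```
# alter-role.spec: two sessions altering the same role concurrently
setup
{
    CREATE ROLE regress_popo;
}

teardown
{
    DROP ROLE regress_popo;
}

session "s1"
step "s1_begin"  { BEGIN; }
step "s1_alter"  { ALTER ROLE regress_popo PASSWORD 'foo'; }
step "s1_commit" { COMMIT; }

session "s2"
step "s2_alter"  { ALTER ROLE regress_popo PASSWORD 'bar'; }

# s2 fires while s1 holds its uncommitted catalog update;
# today this reports "tuple concurrently updated".
permutation "s1_begin" "s1_alter" "s2_alter" "s1_commit"
```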