Thread: Parallel tests publication and subscription might fail due to concurrent tuple update
Parallel tests publication and subscription might fail due to concurrent tuple update
From
Alexander Lakhin
Date:
Hello hackers,

A recent desman failure [1] with the following diagnostics:
# parallel group (2 tests): subscription publication
not ok 157 + publication 2251 ms
ok 158 + subscription 415 ms

--- /home/fedora/17-desman/buildroot/REL_16_STABLE/pgsql.build/src/test/regress/expected/publication.out 2024-12-09 18:34:02.939762233 +0000
+++ /home/fedora/17-desman/buildroot/REL_16_STABLE/pgsql.build/src/test/regress/results/publication.out 2024-12-09 18:44:48.582958859 +0000
@@ -1193,23 +1193,29 @@
 ERROR: permission denied for database regression
 SET ROLE regress_publication_user;
 GRANT CREATE ON DATABASE regression TO regress_publication_user2;
+ERROR: tuple concurrently updated
 SET ROLE regress_publication_user2;
 SET client_min_messages = 'ERROR';
 CREATE PUBLICATION testpub2; -- ok
+ERROR: permission denied for database regression

and postmaster.log containing:
2024-12-09 18:44:46.753 UTC [1345157:903] pg_regress/publication STATEMENT: CREATE PUBLICATION testpub2;
2024-12-09 18:44:46.753 UTC [1345158:287] pg_regress/subscription LOG: statement: REVOKE CREATE ON DATABASE REGRESSION FROM regress_subscription_user3;
2024-12-09 18:44:46.754 UTC [1345157:904] pg_regress/publication LOG: statement: SET ROLE regress_publication_user;
2024-12-09 18:44:46.754 UTC [1345157:905] pg_regress/publication LOG: statement: GRANT CREATE ON DATABASE regression TO regress_publication_user2;
2024-12-09 18:44:46.754 UTC [1345157:906] pg_regress/publication ERROR: tuple concurrently updated
2024-12-09 18:44:46.754 UTC [1345157:907] pg_regress/publication STATEMENT: GRANT CREATE ON DATABASE regression TO regress_publication_user2;

shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.
I've managed to reproduce the error with:
sed -E "s/(REVOKE CREATE ON DATABASE REGRESSION FROM regress_subscription_user3;)/$(printf '\\1%.0s' {1..2000})/" -i.bak \
  src/test/regress/sql/subscription.sql src/test/regress/expected/subscription.out
sed -E "s/(GRANT CREATE ON DATABASE regression TO regress_publication_user2;)/$(printf '\\1%.0s' {1..1000})/" -i.bak \
  src/test/regress/sql/publication.sql src/test/regress/expected/publication.out
sed -E "s/(test: publication subscription$)/$(printf '\\1\\n%.0s' {1..10})/" -i.bak src/test/regress/parallel_schedule

This makes `make check` fail like below:
# parallel group (2 tests): subscription publication
ok 170 + publication 202 ms
ok 171 + subscription 100 ms
# parallel group (2 tests): subscription publication
ok 172 + publication 198 ms
not ok 173 + subscription 107 ms
# parallel group (2 tests): subscription publication
ok 174 + publication 204 ms
ok 175 + subscription 100 ms

src/test/regress/regression.diffs contains:
+ERROR: tuple concurrently updated

This issue is reproduced starting from commit c3afe8cf5 (dated 2023-03-30),
which added "REVOKE CREATE ON DATABASE REGRESSION ..." into the subscription
test.

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=desman&dt=2024-12-09%2018%3A33%3A49&stg=check

Best regards,
Alexander
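The sed commands in the message above rely on a printf idiom to build a replacement made of many copies of the backreference \1, so that each matched statement is repeated and the window for the race widens. A minimal sketch of just that idiom follows. The helper name repeat_backref and the sample statement are hypothetical and not from the thread, and seq stands in for the {1..N} brace expansion so the snippet does not depend on bash.

```shell
# Hypothetical helper (not from the thread): emit the sed backreference \1
# COUNT times, mirroring printf '\\1%.0s' {1..COUNT} from the message above.
repeat_backref() {
    # '\\1' prints a literal \1; '%.0s' consumes one argument and prints
    # nothing, so the format is reused once per number produced by seq.
    printf '\\1%.0s' $(seq 1 "$1")
}

# Build a small replacement and apply it to a sample line.
replacement=$(repeat_backref 3)
printf '%s\n' 'SELECT 1;' | sed -E "s/(SELECT 1;)/${replacement}/"
# prints: SELECT 1;SELECT 1;SELECT 1;
```

The third command in the message appends \n after each \1, which GNU sed expands to a newline in the replacement, so every copy of the schedule entry lands on its own line and becomes a separate parallel group.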
Re: Parallel tests publication and subscription might fail due to concurrent tuple update
From
Tom Lane
Date:
Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>> shows that the subscription and publication tests are not concurrent-safe,
>> because modifying the same pg_database entry might fail with the "tuple
>> concurrently updated" error.

> This seems related to this thread about concurrency issues in
> ALTER/DROP SUBSCRIPTION[1], except that this is for GRANT/REVOKE it
> seems.

> The easiest way to address the flakiness of this test though is
> probably to just not run these tests in parallel. See attached.

I grepped through the buildfarm logs and discovered that desman's
run of 2024-12-09 18:33:49 is the *only* such failure recorded
in the last year. What's more, that run was on v16 not master.

So now I'm inclined to think that "do nothing" is the right answer.
It would be kind of sad to lose all parallelism for these two
tests, and one-failure-per-year is surely below our noise threshold.
(Mind you, I'd love to be in a position where that sort of failure
rate does make it onto our radar. But we're not there today.)
The fact that it's only been seen on v16 may well mean that subsequent
changes in one or the other test have further reduced the failure
probability, too.

Also, we'd be unlikely to remember to undo this change if anyone
ever fixes the GRANT/REVOKE race condition. It seems possible that
someone will get annoyed enough with that to make it happen, because
we've seen related field complaints.

So on the whole I want to reject this. We can reconsider if we
see more such failures, of course.

regards, tom lane
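The attachment referred to in the quoted text is not reproduced in this archive. As a rough sketch of what running the two tests serially could look like, assuming the schedule contains the line "test: publication subscription" (as the sed command earlier in the thread implies), each test would move to its own "test:" line, because pg_regress runs all tests listed on one line as a single parallel group. The helper name serialize_pub_sub is hypothetical, and this is not the actual proposed patch.

```shell
# Hypothetical helper (not the attached patch): read a schedule on stdin and
# split the publication/subscription group into two single-test groups.
serialize_pub_sub() {
    while IFS= read -r line; do
        if [ "$line" = "test: publication subscription" ]; then
            # One test per "test:" line means the tests run one after another.
            printf '%s\n' "test: publication" "test: subscription"
        else
            # Every other schedule line passes through unchanged.
            printf '%s\n' "$line"
        fi
    done
}

# Example: filter a two-line sample schedule.
printf '%s\n' "test: publication subscription" "test: select_views" | serialize_pub_sub
```

This costs exactly the parallelism that the reply above is reluctant to give up, which is why the change was rejected.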
Re: Parallel tests publication and subscription might fail due to concurrent tuple update
From
vignesh C
Date:
On Mon, 17 Mar 2025 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> > On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> >> shows that the subscription and publication tests are not concurrent-safe,
> >> because modifying the same pg_database entry might fail with the "tuple
> >> concurrently updated" error.
>
> > This seems related to this thread about concurrency issues in
> > ALTER/DROP SUBSCRIPTION[1], except that this is for GRANT/REVOKE it
> > seems.
>
> > The easiest way to address the flakiness of this test though is
> > probably to just not run these tests in parallel. See attached.
>
> I grepped through the buildfarm logs and discovered that desman's
> run of 2024-12-09 18:33:49 is the *only* such failure recorded
> in the last year. What's more, that run was on v16 not master.
>
> So now I'm inclined to think that "do nothing" is the right answer.
> It would be kind of sad to lose all parallelism for these two
> tests, and one-failure-per-year is surely below our noise threshold.
> (Mind you, I'd love to be in a position where that sort of failure
> rate does make it onto our radar. But we're not there today.)
> The fact that it's only been seen on v16 may well mean that subsequent
> changes in one or the other test have further reduced the failure
> probability, too.
>
> Also, we'd be unlikely to remember to undo this change if anyone
> ever fixes the GRANT/REVOKE race condition. It seems possible that
> someone will get annoyed enough with that to make it happen, because
> we've seen related field complaints.
>
> So on the whole I want to reject this. We can reconsider if we
> see more such failures, of course.

I suggest we close the commitfest entry at [1] and create a new one if
we encounter this buildfarm failure again.

[1] - https://commitfest.postgresql.org/patch/5459/

Regards,
Vignesh