Thread: Parallel tests publication and subscription might fail due to concurrent tuple update

Parallel tests publication and subscription might fail due to concurrent tuple update

From
Alexander Lakhin
Date:
Hello hackers,

A recent desman failure [1] with the following diagnostics:
# parallel group (2 tests):  subscription publication
not ok 157   + publication                              2251 ms
ok 158       + subscription                              415 ms

--- /home/fedora/17-desman/buildroot/REL_16_STABLE/pgsql.build/src/test/regress/expected/publication.out 2024-12-09 
18:34:02.939762233 +0000
+++ /home/fedora/17-desman/buildroot/REL_16_STABLE/pgsql.build/src/test/regress/results/publication.out 2024-12-09 
18:44:48.582958859 +0000
@@ -1193,23 +1193,29 @@
  ERROR:  permission denied for database regression
  SET ROLE regress_publication_user;
  GRANT CREATE ON DATABASE regression TO regress_publication_user2;
+ERROR:  tuple concurrently updated
  SET ROLE regress_publication_user2;
  SET client_min_messages = 'ERROR';
  CREATE PUBLICATION testpub2;  -- ok
+ERROR:  permission denied for database regression

and postmaster.log containing:
2024-12-09 18:44:46.753 UTC [1345157:903] pg_regress/publication STATEMENT:  CREATE PUBLICATION testpub2;
2024-12-09 18:44:46.753 UTC [1345158:287] pg_regress/subscription LOG:  statement: REVOKE CREATE ON DATABASE REGRESSION

FROM regress_subscription_user3;
2024-12-09 18:44:46.754 UTC [1345157:904] pg_regress/publication LOG:  statement: SET ROLE regress_publication_user;
2024-12-09 18:44:46.754 UTC [1345157:905] pg_regress/publication LOG:  statement: GRANT CREATE ON DATABASE regression
TO
 
regress_publication_user2;
2024-12-09 18:44:46.754 UTC [1345157:906] pg_regress/publication ERROR:  tuple concurrently updated
2024-12-09 18:44:46.754 UTC [1345157:907] pg_regress/publication STATEMENT:  GRANT CREATE ON DATABASE regression TO 
regress_publication_user2;

shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.

I've managed to reproduce the error with:
sed -E "s/(REVOKE CREATE ON DATABASE REGRESSION FROM regress_subscription_user3;)/$(printf '\\1%.0s' {1..2000})/"
-i.bak\
 
   src/test/regress/sql/subscription.sql src/test/regress/expected/subscription.out
sed -E "s/(GRANT CREATE ON DATABASE regression TO regress_publication_user2;)/$(printf '\\1%.0s' {1..1000})/" -i.bak \
   src/test/regress/sql/publication.sql src/test/regress/expected/publication.out
sed -E "s/(test: publication subscription$)/$(printf '\\1\\n%.0s' {1..10})/" -i.bak src/test/regress/parallel_schedule

This makes `make check` fail like below:
# parallel group (2 tests):  subscription publication
ok 170       + publication                               202 ms
ok 171       + subscription                              100 ms
# parallel group (2 tests):  subscription publication
ok 172       + publication                               198 ms
not ok 173   + subscription                              107 ms
# parallel group (2 tests):  subscription publication
ok 174       + publication                               204 ms
ok 175       + subscription                              100 ms

src/test/regress/regression.diffs contains:
+ERROR:  tuple concurrently updated

This issue is reproduced starting from commit c3afe8cf5 (dated 2023-03-30),
which added "REVOKE CREATE ON DATABASE REGRESSION ..." into the subscription
test.

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=desman&dt=2024-12-09%2018%3A33%3A49&stg=check

Best regards,
Alexander



Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>> shows that the subscription and publication tests are not concurrent-safe,
>> because modifying the same pg_database entry might fail with the "tuple
>> concurrently updated" error.

> This seems related to this thread about concurrency issues in
> ALTER/DROP SUBSCRIPTION[1], except that this is for GRANT/REVOKE it
> seems.

> The easiest way to address the flakiness of this test though is
> probably to just don't run these tests in in parallel. See attached.

I grepped through the buildfarm logs and discovered that desman's
run of 2024-12-09 18:33:49 is the *only* such failure recorded
in the last year.  What's more, that run was on v16 not master.

So now I'm inclined to think that "do nothing" is the right answer.
It would be kind of sad to lose all parallelism for these two
tests, and one-failure-per-year is surely below our noise threshold.
(Mind you, I'd love to be in a position where that sort of failure
rate does make it onto our radar.  But we're not there today.)
The fact that it's only been seen on v16 may well mean that subsequent
changes in one or the other test have further reduced the failure
probability, too.

Also, we'd be unlikely to remember to undo this change if anyone
ever fixes the GRANT/REVOKE race condition.  It seems possible that
someone will get annoyed enough with that to make it happen, because
we've seen related field complaints.

So on the whole I want to reject this.  We can reconsider if we
see more such failures, of course.

            regards, tom lane



On Mon, 17 Mar 2025 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> > On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> >> shows that the subscription and publication tests are not concurrent-safe,
> >> because modifying the same pg_database entry might fail with the "tuple
> >> concurrently updated" error.
>
> > This seems related to this thread about concurrency issues in
> > ALTER/DROP SUBSCRIPTION[1], except that this is for GRANT/REVOKE it
> > seems.
>
> > The easiest way to address the flakiness of this test though is
> > probably to just don't run these tests in in parallel. See attached.
>
> I grepped through the buildfarm logs and discovered that desman's
> run of 2024-12-09 18:33:49 is the *only* such failure recorded
> in the last year.  What's more, that run was on v16 not master.
>
> So now I'm inclined to think that "do nothing" is the right answer.
> It would be kind of sad to lose all parallelism for these two
> tests, and one-failure-per-year is surely below our noise threshold.
> (Mind you, I'd love to be in a position where that sort of failure
> rate does make it onto our radar.  But we're not there today.)
> The fact that it's only been seen on v16 may well mean that subsequent
> changes in one or the other test have further reduced the failure
> probability, too.
>
> Also, we'd be unlikely to remember to undo this change if anyone
> ever fixes the GRANT/REVOKE race condition.  It seems possible that
> someone will get annoyed enough with that to make it happen, because
> we've seen related field complaints.
>
> So on the whole I want to reject this.  We can reconsider if we
> see more such failures, of course.

I suggest we close the commitfest entry at [1] and create a new one if
we encounter this buildfarm failure again.
[1] - https://commitfest.postgresql.org/patch/5459/

Regards,
Vignesh