xact_rollback spikes when logical walsender exits - Mailing list pgsql-hackers

From Nikolay Samokhvalov
Subject xact_rollback spikes when logical walsender exits
Msg-id CAM527d_EbU5Li4a5FdKQjYsdF-4Lqr_i3jXmZOm7Wbb=Q2KzTw@mail.gmail.com
List pgsql-hackers
Hi hackers,

There is a bug on logical-replication publishers where every decoded
committed transaction bumps pg_stat_database.xact_rollback.
ReorderBufferProcessTXN() ends each decoded transaction with
AbortCurrentTransaction() for catalog cleanup; in the walsender that
is a top-level abort, so AtEOXact_PgStat_Database(isCommit=false)
increments the backend-local pgStatXactRollback.

The accumulated counts are flushed to shared stats on walsender exit,
producing an acute spike in xact_rollback. On production systems with
on-call SREs and tight alerting on xact_rollback, this turns routine
logical-replication operations (disabling a subscription, dropping a
slot, restarting a walsender) into false-positive pages.
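
To make the accumulate-then-flush behavior described above concrete, here
is a small standalone toy model. It is not PostgreSQL code: SpikeModel and
the spike_* functions are hypothetical names, and the two fields only stand
in for the backend-local pgStatXactRollback and for
pg_stat_database.xact_rollback.

```c
#include <assert.h>

/* Toy model only; none of this is PostgreSQL source code. */
typedef struct SpikeModel
{
    long local_rollback;   /* stands in for backend-local pgStatXactRollback */
    long shared_rollback;  /* stands in for pg_stat_database.xact_rollback */
} SpikeModel;

/* One decoded committed transaction: its cleanup abort is counted locally. */
static void
spike_decode_committed_txn(SpikeModel *m)
{
    m->local_rollback++;
}

/* Walsender exit: all pending counts reach shared stats at once.
 * Returns the size of the jump seen in the shared counter. */
static long
spike_walsender_exit(SpikeModel *m)
{
    long jump = m->local_rollback;

    m->shared_rollback += jump;
    m->local_rollback = 0;
    return jump;
}

/* Decode n_decoded transactions, record what a monitor would see just
 * before exit, then return the jump produced by the exit itself. */
static long
spike_simulate(int n_decoded, long *seen_before_exit)
{
    SpikeModel m = {0, 0};

    assert(n_decoded >= 0);
    for (int i = 0; i < n_decoded; i++)
        spike_decode_committed_txn(&m);
    *seen_before_exit = m.shared_rollback;
    return spike_walsender_exit(&m);
}
```

In this model the shared counter stays flat while the walsender runs and
then jumps by the full number of decoded transactions at exit, which is the
shape that trips rate-based alerts.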

Reported in [1]; also experienced at GitLab [2][3][4].

Attached is a simple patch that adds a backend-local flag,
pgStatXactSkipCounters, in pgstat_database.c. When the flag is set,
AtEOXact_PgStat_Database() skips the counter bump.
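
A rough standalone sketch of the idea, for readers who want the shape
without opening the patch. This is a toy model, not the patch itself: the
skip_* names are hypothetical, and it assumes the flag is set only around
the cleanup abort of a decoded transaction and cleared right after.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of this is PostgreSQL source code. */
static long skip_local_commit = 0;       /* models pending commit count */
static long skip_local_rollback = 0;     /* models pgStatXactRollback */
static bool skip_counters_flag = false;  /* models pgStatXactSkipCounters */

/* Models end-of-transaction accounting: honor the flag, else count. */
static void
skip_at_eoxact(bool is_commit)
{
    if (skip_counters_flag)
        return;
    if (is_commit)
        skip_local_commit++;
    else
        skip_local_rollback++;
}

/* Models the end of one decoded committed transaction. Assumption: the
 * patched path raises the flag just for the cleanup abort. */
static void
skip_decode_committed_txn(bool patched)
{
    if (patched)
        skip_counters_flag = true;
    skip_at_eoxact(false);          /* the cleanup abort */
    skip_counters_flag = false;     /* later aborts are counted normally */
}

/* Reset the model, decode n_decoded transactions, return rollbacks counted. */
static long
skip_count_rollbacks(int n_decoded, bool patched)
{
    assert(n_decoded >= 0);
    skip_local_commit = 0;
    skip_local_rollback = 0;
    for (int i = 0; i < n_decoded; i++)
        skip_decode_committed_txn(patched);
    return skip_local_rollback;
}
```

Because the flag is cleared after each cleanup abort in this sketch, an
ordinary abort in the same backend would still be counted.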

I also added a TAP test that fails on master (5/0) and passes with the patch.

If there is agreement on this shape, happy to send patches for all
supported branches. Let me know what you think.

[1] https://postgr.es/m/CAG0ozMo_xWQn%2BAvv8jzbbhePGp5OnhdO%2BYWTkdg4faWSXz0Jzg%40mail.gmail.com
[2] https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/8290
[3] https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/work_items/39
[4] https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/work_items/406

Nik
