Re: long-standing data loss bug in initial sync of logical replication - Mailing list pgsql-hackers

From Andres Freund
Subject Re: long-standing data loss bug in initial sync of logical replication
Date
Msg-id 20231118025445.crhaeeuvoe2g5dv6@awork3.anarazel.de
In response to Re: long-standing data loss bug in initial sync of logical replication  (Andres Freund <andres@anarazel.de>)
Responses Re: long-standing data loss bug in initial sync of logical replication
List pgsql-hackers
Hi,

On 2023-11-17 17:54:43 -0800, Andres Freund wrote:
> On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:
> > Overall, this looks, walks and quacks like a cache invalidation issue,
> > likely a missing invalidation somewhere in the ALTER PUBLICATION code.

I can confirm that something is broken with invalidation handling.

To test this I just used pg_recvlogical writing to stdout. All that matters is
whether anything arrives at all, which is easy to discern even with binary
output.

CREATE PUBLICATION pb;
src/bin/pg_basebackup/pg_recvlogical --plugin=pgoutput --start --slot test -d postgres \
  -o proto_version=4 -o publication_names=pb -o messages=true -f -

S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN; INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: COMMIT;
S2: INSERT INTO d VALUES('d3');
S1: INSERT INTO d VALUES('d4');
RL: <nothing>

Without the 'd2' insert in an in-progress transaction, pgoutput *does* react
to the ALTER PUBLICATION.

I think the problem here is insufficient locking. ALTER PUBLICATION pb ADD
TABLE d modifies the catalog state of 'd' without taking a lock that would
prevent other sessions from continuing to use a still-valid cache entry. Due
to this, decoding S2's transaction, which began before the ALTER PUBLICATION
committed, populates the cache entry with the pre-ALTER state, i.e. 'd' is not
part of the publication and there is no need to output the change.

The reason this can happen is because OpenTableList() uses
ShareUpdateExclusiveLock. That allows the ALTER PUBLICATION to happen while
there's an ongoing INSERT.

I think this isn't just a logical decoding issue. S2's cache state just after
the ALTER PUBLICATION is going to be wrong - the table is already locked,
therefore further operations on the table don't trigger cache invalidation
processing - but the catalog state *has* changed.  It's a bigger problem for
logical decoding though, as it is a bit lazier about invalidation processing
than normal transactions, allowing the problem to persist for longer.


I guess it's not really feasible to just increase the lock level here though
:(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
would perhaps lead to new deadlocks and such? But it also seems quite wrong.


We could brute force this in the logical decoding infrastructure, by
distributing invalidations from catalog-modifying transactions to all
concurrent in-progress transactions (as is already done for the historic
catalog snapshot, cf. SnapBuildDistributeNewCatalogSnapshot()).  But I think
that'd be a fairly significant increase in overhead.
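A rough sketch of that distribution idea, again as a toy Python model rather
than PostgreSQL code (ReorderBufferTxn, SnapBuild and distribute_invalidations
are invented names, loosely modeled on how
SnapBuildDistributeNewCatalogSnapshot() queues snapshots into concurrent
transactions):

```python
# When a catalog-modifying transaction commits, queue its invalidations
# into every other in-progress transaction, so that decoding each of
# them processes the invalidations (rebuilding caches) first.

class ReorderBufferTxn:
    def __init__(self, xid):
        self.xid = xid
        self.invalidations = []   # executed when this txn is decoded

class SnapBuild:
    def __init__(self):
        self.in_progress = {}     # xid -> ReorderBufferTxn

    def distribute_invalidations(self, invals):
        # Brute force: every concurrent transaction gets a copy,
        # regardless of whether it touches the affected relation.
        for txn in self.in_progress.values():
            txn.invalidations.extend(invals)

sb = SnapBuild()
s2 = ReorderBufferTxn(xid=1001)       # S2's still-open transaction
sb.in_progress[s2.xid] = s2

# S1's ALTER PUBLICATION commits with a relcache invalidation for 'd';
# it is distributed to S2's in-progress transaction:
sb.distribute_invalidations(["relcache:d"])
assert s2.invalidations == ["relcache:d"]
```

The overhead concern is visible in the sketch: the invalidations are copied
into every concurrent transaction, whether or not it touches the table.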



Greetings,

Andres Freund


