Thread: long-standing data loss bug in initial sync of logical replication
Hi,

It seems there's a long-standing data loss issue related to the initial sync of tables in the built-in logical replication (publications etc.). I can reproduce it fairly reliably, but I haven't figured out all the details yet and I'm a bit out of ideas, so I'm sharing what I know with the hope someone takes a look and either spots the issue or has some other insight ...

On pgsql-bugs, Depesz reported [1] cases where tables are added to a publication but end up missing rows on the subscriber. I didn't know what the issue might be, but given his experience I decided to do some blind attempts to reproduce it. I'm not going to repeat all the details from the pgsql-bugs thread, but I ended up writing a script that does a randomized stress test of tablesync under concurrent load.

Attached are two scripts: crash-test.sh does the main work, while run.sh drives the test - it executes crash-test.sh in a loop and generates random parameters for it. run.sh generates the number of tables, the refresh interval (after how many tables we refresh the subscription) and how long to sleep between steps (to allow pgbench to do more work).

crash-test.sh then does this:

1) initializes two clusters (expects $PATH to have pg_ctl etc.)
2) configures them for logical replication (wal_level, ...)
3) creates publication and subscription on the nodes
4) creates a bunch of tables
5) starts a pgbench that inserts data into the tables
6) adds the tables to the publication one by one, occasionally refreshing the subscription
7) waits for tablesync of all the tables to complete (so that the tables get into the 'r' state, thus replicating normally)
8) stops the pgbench
9) waits for the subscriber to fully catch up
10) compares the tables on the publisher/subscriber nodes

To run this, just make sure PATH includes pg, and do e.g.

  ./run.sh 10

which does 10 runs of crash-test.sh with random parameters. Each run can take a couple minutes, depending on the parameters, hardware etc.

Obviously, we expect the tables to match on the two nodes, but the script regularly detects cases where the subscriber is missing some of the rows. The script dumps those tables, and the rows contain timestamps and LSNs to allow "rough correlation" (imperfect thanks to concurrency).

Depesz reported "gaps" in the data, i.e. a missing chunk of data with the rows after it seemingly replicated. I did see such cases too, but most of the time I see a missing chunk of rows at the end (though maybe if the test continued a bit longer, it'd replicate some rows).

The report talks about replication between pg12->pg14, but I don't think the cross-version part is necessary - I'm able to reproduce the issue on individual versions since 12 (e.g. 12->12). I haven't tried 11, but I'd be surprised if it wasn't affected too.

The rows include pg_current_wal_lsn() to roughly track the LSN where the row was inserted, and the "gap" of missing rows for each table seems to match pg_subscription_rel.srsublsn, i.e. the LSN up to which tablesync copied data and after which the table should be replicated as usual.

Another interesting observation is that the issue only happens for "bulk insert" transactions, i.e.

  BEGIN; ... INSERT into all tables ... COMMIT;

but not when each insert is a separate transaction. A bit strange.

After quite a bit of debugging, I came to the conclusion this happens because we fail to invalidate caches on the publisher, so it does not realize it should start sending rows for that table.
In particular, we initially build the RelationSyncEntry when the table is not yet included in the publication, so we end up with pubinsert=false, thus not replicating the inserts. Which makes sense, but we then seem to fail to invalidate the entry after the table is added to the publication.

The other problem is that even if we happen to invalidate the entry, we then call GetRelationPublications() - and even if that happens long after the table got added to the publication (both in time and LSN terms), it still returns NIL as if the table had no publications. So we end up with pubinsert=false, skipping the inserts again.

Attached are three patches against master. 0001 adds some debug logging that I found useful when investigating the issue. 0002 illustrates the issue by forcefully invalidating the entry for each change, and by implementing a non-syscache variant of GetRelationPublications(). This makes the code unbearably slow, but with both changes in place I can no longer reproduce the issue. Undoing either of the two changes makes it reproducible again. (I'll talk about 0003 later.)

I suppose timing matters, so it's possible it gets "fixed" simply because of that, but I find that unlikely given the number of runs I did without observing any failure.

Overall, this looks, walks and quacks like a cache invalidation issue, likely a missing invalidation somewhere in the ALTER PUBLICATION code. If we fail to invalidate the pg_publication_rel syscache somewhere, that obviously explains why GetRelationPublications() returns stale data, but it would also explain why the RelationSyncEntry is not invalidated, as that happens in a syscache callback. But I tried to do various crazy things in the ALTER PUBLICATION code, and none of that worked, so I'm a bit confused/lost.

However, while randomly poking at different things, I realized that if I change the lock obtained on the relation in OpenTableList() from ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away. I don't know why it works, and I don't even recall what exactly led me to the idea of changing it. This is what 0003 does - it reverts 0002 and changes the lock level.

AFAIK the logical decoding code doesn't actually acquire locks on the decoded tables, so why would this change matter? The only place that does lock the relation is the tablesync, which gets RowExclusiveLock on it. And it's interesting that RowExclusiveLock does not conflict with ShareUpdateExclusiveLock, but does with ShareRowExclusiveLock. But why would this even matter, when the tablesync can only touch the table after it gets added to the publication?

regards

[1] https://www.postgresql.org/message-id/ZTu8GTDajCkZVjMs@depesz.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
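PS: For anyone trying to correlate the gaps with the tablesync cutoff themselves, something along these lines should do. It's only a sketch - "t1" and the "lsn" column (holding the pg_current_wal_lsn() value) are hypothetical names, and the literal LSN is a placeholder taken from the first query:

  -- subscriber: per-table sync state and the LSN up to which tablesync copied data
  SELECT srrelid::regclass AS tbl, srsubstate, srsublsn
    FROM pg_subscription_rel
   ORDER BY 1;

  -- publisher: rows inserted after that cutoff for one table, i.e. rows that
  -- should have arrived via normal replication rather than the initial copy
  SELECT count(*)
    FROM t1
   WHERE lsn > '0/01234567';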
Attachment
Hi,

On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:
> It seems there's a long-standing data loss issue related to the initial
> sync of tables in the built-in logical replication (publications etc.).

:(

> Overall, this looks, walks and quacks like a cache invalidation issue,
> likely a missing invalidation somewhere in the ALTER PUBLICATION code.

It could also be that pgoutput doesn't have sufficient invalidation handling.

One thing that looks bogus on the DDL side is how the invalidation handling interacts with locking.

For tables etc the invalidation handling works because we hold a lock on the relation before modifying the catalog and don't release that lock until transaction end. That part is crucial: We queue shared invalidations at transaction commit, *after* the transaction is marked as visible, but *before* locks are released. That guarantees that any backend processing invalidations will see the new contents. However, if the lock on the modified object is released before transaction commit, other backends can build and use a cache entry that hasn't processed invalidations (invalidations are processed when acquiring locks).

While there is such an object for publications, it seems to be acquired too late to actually do much good in a number of paths. And not at all in others.

E.g.:

    pubform = (Form_pg_publication) GETSTRUCT(tup);

    /*
     * If the publication doesn't publish changes via the root partitioned
     * table, the partition's row filter and column list will be used. So
     * disallow using WHERE clause and column lists on partitioned table in
     * this case.
     */
    if (!pubform->puballtables && publish_via_partition_root_given &&
        !publish_via_partition_root)
    {
        /*
         * Lock the publication so nobody else can do anything with it. This
         * prevents concurrent alter to add partitioned table(s) with WHERE
         * clause(s) and/or column lists which we don't allow when not
         * publishing via root.
         */
        LockDatabaseObject(PublicationRelationId, pubform->oid, 0,
                           AccessShareLock);

a) Another session could have modified the publication and made puballtables out-of-date

b) The LockDatabaseObject() uses AccessShareLock, so others can get past this point as well

b) seems like a copy-paste bug or such?

I don't see any locking of the publication around RemovePublicationRelById(), for example.

I might just be misunderstanding the way publication locking is intended to work.

> However, while randomly poking at different things, I realized that if I
> change the lock obtained on the relation in OpenTableList() from
> ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away.

That's odd. There's cases where changing the lock level can cause invalidation processing to happen because there is no pre-existing lock for the "new" lock level, but there was for the old. But OpenTableList() is used when altering the publications, so I don't see how that connects.

Greetings,

Andres Freund
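(For reference, a rough way to see how weak the publication object lock is in the path quoted above. This is only a sketch, assuming a pre-existing publication pb that was not created FOR ALL TABLES, on a branch that has the publish_via_partition_root check.)

  S1: BEGIN;
  S1: ALTER PUBLICATION pb SET (publish_via_partition_root = false);
  S2: SELECT locktype, classid::regclass, objid, mode, granted
        FROM pg_locks
       WHERE locktype = 'object'
         AND classid = 'pg_publication'::regclass;
      -- should show the AccessShareLock taken by LockDatabaseObject() above,
      -- which doesn't stop a concurrent ALTER PUBLICATION in S2 from getting
      -- past the same point
  S1: COMMIT;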
Hi,

On 2023-11-17 17:54:43 -0800, Andres Freund wrote:
> On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:
> > Overall, this looks, walks and quacks like a cache invalidation issue,
> > likely a missing invalidation somewhere in the ALTER PUBLICATION code.

I can confirm that something is broken with invalidation handling.

To test this I just used pg_recvlogical to stdout. It's just interesting whether something arrives; that's easy to discern even with binary output.

CREATE PUBLICATION pb;
src/bin/pg_basebackup/pg_recvlogical --plugin=pgoutput --start --slot test -d postgres -o proto_version=4 -o publication_names=pb -o messages=true -f -

S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN; INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: COMMIT
S2: INSERT INTO d VALUES('d3');
S1: INSERT INTO d VALUES('d4');
RL: <nothing>

Without the 'd2' insert in an in-progress transaction, pgoutput *does* react to the ALTER PUBLICATION.

I think the problem here is insufficient locking. The ALTER PUBLICATION pb ADD TABLE d basically modifies the catalog state of 'd', without a lock preventing other sessions from having a valid cache entry that they could continue to use. Due to this, decoding S2's transaction - whose insert happened before the ALTER - will populate the cache entry with the catalog state from before the ALTER, i.e. concluding there's no need to output the change.

The reason this can happen is because OpenTableList() uses ShareUpdateExclusiveLock. That allows the ALTER PUBLICATION to happen while there's an ongoing INSERT.

I think this isn't just a logical decoding issue. S2's cache state just after the ALTER PUBLICATION is going to be wrong - the table is already locked, therefore further operations on the table don't trigger cache invalidation processing - but the catalog state *has* changed. It's a bigger problem for logical decoding though, as it's a bit more lazy about invalidation processing than normal transactions, allowing the problem to persist for longer.

I guess it's not really feasible to just increase the lock level here though :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL would perhaps lead to new deadlocks and such? But it also seems quite wrong.

We could brute force this in the logical decoding infrastructure, by distributing invalidations from catalog modifying transactions to all concurrent in-progress transactions (like already done for historic catalog snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd be a fairly significant increase in overhead.

Greetings,

Andres Freund
On 11/18/23 02:54, Andres Freund wrote: > Hi, > > On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote: >> It seems there's a long-standing data loss issue related to the initial >> sync of tables in the built-in logical replication (publications etc.). > > :( > Yeah :-( > >> Overall, this looks, walks and quacks like a cache invalidation issue, >> likely a missing invalidation somewhere in the ALTER PUBLICATION code. > > It could also be be that pgoutput doesn't have sufficient invalidation > handling. > I'm not sure about the details, but it can't be just about pgoutput failing to react to some syscache invalidation. As described, just resetting the RelationSyncEntry doesn't fix the issue - it's the syscache that's not invalidated, IMO. But maybe that's what you mean. > > One thing that looks bogus on the DDL side is how the invalidation handling > interacts with locking. > > > For tables etc the invalidation handling works because we hold a lock on the > relation before modifying the catalog and don't release that lock until > transaction end. That part is crucial: We queue shared invalidations at > transaction commit, *after* the transaction is marked as visible, but *before* > locks are released. That guarantees that any backend processing invalidations > will see the new contents. However, if the lock on the modified object is > released before transaction commit, other backends can build and use a cache > entry that hasn't processed invalidations (invaliations are processed when > acquiring locks). > Right. > While there is such an object for publications, it seems to be acquired too > late to actually do much good in a number of paths. And not at all in others. > > E.g.: > > pubform = (Form_pg_publication) GETSTRUCT(tup); > > /* > * If the publication doesn't publish changes via the root partitioned > * table, the partition's row filter and column list will be used. So > * disallow using WHERE clause and column lists on partitioned table in > * this case. > */ > if (!pubform->puballtables && publish_via_partition_root_given && > !publish_via_partition_root) > { > /* > * Lock the publication so nobody else can do anything with it. This > * prevents concurrent alter to add partitioned table(s) with WHERE > * clause(s) and/or column lists which we don't allow when not > * publishing via root. > */ > LockDatabaseObject(PublicationRelationId, pubform->oid, 0, > AccessShareLock); > > a) Another session could have modified the publication and made puballtables out-of-date > b) The LockDatabaseObject() uses AccessShareLock, so others can get past this > point as well > > b) seems like a copy-paste bug or such? > > > I don't see any locking of the publication around RemovePublicationRelById(), > for example. > > I might just be misunderstanding things the way publication locking is > intended to work. > I've been asking similar questions while investigating this, but the interactions with logical decoding (which kinda happens concurrently in terms of WAL, but not concurrently in terms of time), historical snapshots etc. make my head spin. > >> However, while randomly poking at different things, I realized that if I >> change the lock obtained on the relation in OpenTableList() from >> ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away. > > That's odd. There's cases where changing the lock level can cause invalidation > processing to happen because there is no pre-existing lock for the "new" lock > level, but there was for the old. 
> But OpenTableList() is used when altering the publications, so I don't
> see how that connects.
>

Yeah, I had the idea that maybe the transaction already holds the lock on the table, and changing this to ShareRowExclusiveLock makes it different, possibly triggering a new invalidation or something. But I did check with gdb, and if I set a breakpoint at OpenTableList, there are no locks on the table.

But the effect is hard to deny - if I run the test 100 times, with ShareUpdateExclusiveLock I get maybe 80 failures. After changing it to ShareRowExclusiveLock I get 0. Sure, there's some randomness in cases like this, but that would be pretty unlikely.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/18/23 03:54, Andres Freund wrote:
> Hi,
>
> On 2023-11-17 17:54:43 -0800, Andres Freund wrote:
>> On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:
>>> Overall, this looks, walks and quacks like a cache invalidation issue,
>>> likely a missing invalidation somewhere in the ALTER PUBLICATION code.
>
> I can confirm that something is broken with invalidation handling.
>
> To test this I just used pg_recvlogical to stdout. It's just interesting
> whether something arrives; that's easy to discern even with binary output.
>
> CREATE PUBLICATION pb;
> src/bin/pg_basebackup/pg_recvlogical --plugin=pgoutput --start --slot test -d postgres -o proto_version=4 -o publication_names=pb -o messages=true -f -
>
> S1: CREATE TABLE d(data text not null);
> S1: INSERT INTO d VALUES('d1');
> S2: BEGIN; INSERT INTO d VALUES('d2');
> S1: ALTER PUBLICATION pb ADD TABLE d;
> S2: COMMIT
> S2: INSERT INTO d VALUES('d3');
> S1: INSERT INTO d VALUES('d4');
> RL: <nothing>
>
> Without the 'd2' insert in an in-progress transaction, pgoutput *does* react
> to the ALTER PUBLICATION.
>
> I think the problem here is insufficient locking. The ALTER PUBLICATION pb ADD
> TABLE d basically modifies the catalog state of 'd', without a lock preventing
> other sessions from having a valid cache entry that they could continue to
> use. Due to this, decoding S2's transaction - whose insert happened before the
> ALTER - will populate the cache entry with the catalog state from before the
> ALTER, i.e. concluding there's no need to output the change.
>
> The reason this can happen is because OpenTableList() uses
> ShareUpdateExclusiveLock. That allows the ALTER PUBLICATION to happen while
> there's an ongoing INSERT.
>

I guess this would also explain why changing the lock mode from ShareUpdateExclusiveLock to ShareRowExclusiveLock changes the behavior. INSERT acquires RowExclusiveLock, which conflicts with the latter but not with the former.

> I think this isn't just a logical decoding issue. S2's cache state just after
> the ALTER PUBLICATION is going to be wrong - the table is already locked,
> therefore further operations on the table don't trigger cache invalidation
> processing - but the catalog state *has* changed. It's a bigger problem for
> logical decoding though, as it's a bit more lazy about invalidation processing
> than normal transactions, allowing the problem to persist for longer.
>

Yeah. I'm wondering if there's some other operation acquiring a lock weaker than RowExclusiveLock that might be affected by this. Because then we'd need to get an even stronger lock ...

> I guess it's not really feasible to just increase the lock level here though
> :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
> would perhaps lead to new deadlocks and such? But it also seems quite wrong.
>

If this really is about the lock being too weak, then I don't see why it would be wrong? If it's required for correctness, it's not really wrong, IMO. Sure, stronger locks are not great ...

I'm not sure about the risk of deadlocks. If you do

  ALTER PUBLICATION ... ADD TABLE

it's not holding many other locks. It essentially just gets a lock on the pg_publication catalog, and then on the publication row. That's it.

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive, we're making it conflict with RowExclusive. Which is just DML, and I think we need to do that.

So maybe that's fine? For me, a detected deadlock is better than silently missing some of the data.
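FWIW the conflict difference is easy to see with two sessions and a throwaway table t (a sketch; t is any ordinary table, nothing publication-related is needed):

  S1: BEGIN; INSERT INTO t DEFAULT VALUES;                  -- takes RowExclusiveLock on t
  S2: BEGIN; LOCK TABLE t IN SHARE UPDATE EXCLUSIVE MODE;   -- granted immediately
  S2: ROLLBACK;
  S2: BEGIN; LOCK TABLE t IN SHARE ROW EXCLUSIVE MODE;      -- blocks until S1 commits
  S1: COMMIT;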
> > We could brute force this in the logical decoding infrastructure, by > distributing invalidations from catalog modifying transactions to all > concurrent in-progress transactions (like already done for historic catalog > snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd > be a fairly significant increase in overhead. > I have no idea what the overhead would be - perhaps not too bad, considering catalog changes are not too common (I'm sure there are extreme cases). And maybe we could even restrict this only to "interesting" catalogs, or something like that? (However I hate those weird differences in behavior, it can easily lead to bugs.) But it feels more like a band-aid than actually fixing the issue. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:
>> I guess it's not really feasible to just increase the lock level here though
>> :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
>> would perhaps lead to new deadlocks and such? But it also seems quite wrong.
>>
>
> If this really is about the lock being too weak, then I don't see why
> it would be wrong?

Sorry, that was badly formulated. The wrong bit is the use of ShareUpdateExclusiveLock.

> If it's required for correctness, it's not really wrong, IMO. Sure, stronger
> locks are not great ...
>
> I'm not sure about the risk of deadlocks. If you do
>
>   ALTER PUBLICATION ... ADD TABLE
>
> it's not holding many other locks. It essentially just gets a lock on the
> pg_publication catalog, and then on the publication row. That's it.
>
> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
> we're making it conflict with RowExclusive. Which is just DML, and I
> think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely independent of logical decoding. The way the cache stays coherent is catalog modifications conflicting with anything that builds cache entries. We have a few cases where we do use lower level locks, but for those we have explicit analysis for why that's ok (see e.g. reloptions.c) or we block until nobody could have an old view of the catalog (various CONCURRENTLY operations).

> So maybe that's fine? For me, a detected deadlock is better than
> silently missing some of the data.

That certainly is true.

>> We could brute force this in the logical decoding infrastructure, by
>> distributing invalidations from catalog modifying transactions to all
>> concurrent in-progress transactions (like already done for historic catalog
>> snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
>> be a fairly significant increase in overhead.
>>
>
> I have no idea what the overhead would be - perhaps not too bad,
> considering catalog changes are not too common (I'm sure there are
> extreme cases). And maybe we could even restrict this only to
> "interesting" catalogs, or something like that? (However I hate those
> weird differences in behavior, it can easily lead to bugs.)
>
> But it feels more like a band-aid than actually fixing the issue.

Agreed.

Greetings,

Andres Freund
On 11/18/23 19:12, Andres Freund wrote:
> Hi,
>
> On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:
>>> I guess it's not really feasible to just increase the lock level here though
>>> :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
>>> would perhaps lead to new deadlocks and such? But it also seems quite wrong.
>>>
>>
>> If this really is about the lock being too weak, then I don't see why
>> it would be wrong?
>
> Sorry, that was badly formulated. The wrong bit is the use of
> ShareUpdateExclusiveLock.
>

Ah, you meant the current lock mode seems wrong, not that changing the locks seems wrong. Yeah, true.

>> If it's required for correctness, it's not really wrong, IMO. Sure, stronger
>> locks are not great ...
>>
>> I'm not sure about the risk of deadlocks. If you do
>>
>>   ALTER PUBLICATION ... ADD TABLE
>>
>> it's not holding many other locks. It essentially just gets a lock on the
>> pg_publication catalog, and then on the publication row. That's it.
>>
>> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
>> we're making it conflict with RowExclusive. Which is just DML, and I
>> think we need to do that.
>
> From what I can tell it needs to to be an AccessExlusiveLock. Completely
> independent of logical decoding. The way the cache stays coherent is catalog
> modifications conflicting with anything that builds cache entries. We have a
> few cases where we do use lower level locks, but for those we have explicit
> analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
> could have an old view of the catalog (various CONCURRENTLY) operations.
>

Yeah, I got too focused on the issue I triggered, which seems to be fixed by using SRE (still don't understand why ...). But you're probably right there may be other cases where SRE would not be sufficient, I certainly can't prove it'd be safe.

>> So maybe that's fine? For me, a detected deadlock is better than
>> silently missing some of the data.
>
> That certainly is true.
>
>>> We could brute force this in the logical decoding infrastructure, by
>>> distributing invalidations from catalog modifying transactions to all
>>> concurrent in-progress transactions (like already done for historic catalog
>>> snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
>>> be a fairly significant increase in overhead.
>>>
>>
>> I have no idea what the overhead would be - perhaps not too bad,
>> considering catalog changes are not too common (I'm sure there are
>> extreme cases). And maybe we could even restrict this only to
>> "interesting" catalogs, or something like that? (However I hate those
>> weird differences in behavior, it can easily lead to bugs.)
>>
>> But it feels more like a band-aid than actually fixing the issue.
>
> Agreed.
>

... and it would not fix the other places outside logical decoding.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi, On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote: > On 11/18/23 19:12, Andres Freund wrote: > >> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive, > >> we're making it conflict with RowExclusive. Which is just DML, and I > >> think we need to do that. > > > > From what I can tell it needs to to be an AccessExlusiveLock. Completely > > independent of logical decoding. The way the cache stays coherent is catalog > > modifications conflicting with anything that builds cache entries. We have a > > few cases where we do use lower level locks, but for those we have explicit > > analysis for why that's ok (see e.g. reloptions.c) or we block until nobody > > could have an old view of the catalog (various CONCURRENTLY) operations. > > > > Yeah, I got too focused on the issue I triggered, which seems to be > fixed by using SRE (still don't understand why ...). But you're probably > right there may be other cases where SRE would not be sufficient, I > certainly can't prove it'd be safe. I think it makes sense here: SRE prevents the problematic "scheduling" in your test - with SRE no DML started before ALTER PUB ... ADD can commit after. I'm not sure there are any cases where using SRE instead of AE would cause problems for logical decoding, but it seems very hard to prove. I'd be very surprised if just using SRE would not lead to corrupted cache contents in some situations. The cases where a lower lock level is ok are ones where we just don't care that the cache is coherent in that moment. In a way, the logical decoding cache-invalidation situation is a lot more atomic than the "normal" situation. During normal operation locking is strictly required to prevent incoherent states when building a cache entry after a transaction committed, but before the sinval entries have been queued. But in the logical decoding case that window doesn't exist. Greetings, Andres Freund
On 11/18/23 22:05, Andres Freund wrote:
> Hi,
>
> On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:
>> On 11/18/23 19:12, Andres Freund wrote:
>>>> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
>>>> we're making it conflict with RowExclusive. Which is just DML, and I
>>>> think we need to do that.
>>>
>>> From what I can tell it needs to to be an AccessExlusiveLock. Completely
>>> independent of logical decoding. The way the cache stays coherent is catalog
>>> modifications conflicting with anything that builds cache entries. We have a
>>> few cases where we do use lower level locks, but for those we have explicit
>>> analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
>>> could have an old view of the catalog (various CONCURRENTLY) operations.
>>>
>>
>> Yeah, I got too focused on the issue I triggered, which seems to be
>> fixed by using SRE (still don't understand why ...). But you're probably
>> right there may be other cases where SRE would not be sufficient, I
>> certainly can't prove it'd be safe.
>
> I think it makes sense here: SRE prevents the problematic "scheduling" in your
> test - with SRE no DML started before ALTER PUB ... ADD can commit after.
>

If I understand correctly, with the current code (which only gets ShareUpdateExclusiveLock), we may end up in a situation like this (sessions A and B):

  A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
  A: writes the invalidation message(s) into WAL
  B: inserts into table "t"
  B: commit
  A: commit

With the stronger SRE lock, the commits would have to happen in the opposite order, because as you say it prevents the bad ordering.

But why would this matter for logical decoding? We accumulate the invalidations and execute them at transaction commit, or did I miss something? So what I think should happen is we get to apply B first, which won't see the table as part of the publication. It might even build the cache entries (syscache+relsync), reflecting that. But then we get to execute A, along with all the invalidations, and that should invalidate them.

I'm clearly missing something, because the SRE does change the behavior, so there has to be a difference (and by my reasoning it shouldn't be). Or maybe it's the other way around? Won't B get the invalidation, but use a historical snapshot that doesn't yet see the table in publication?

> I'm not sure there are any cases where using SRE instead of AE would cause
> problems for logical decoding, but it seems very hard to prove. I'd be very
> surprised if just using SRE would not lead to corrupted cache contents in some
> situations. The cases where a lower lock level is ok are ones where we just
> don't care that the cache is coherent in that moment.
>

Are you saying it might break cases that are not corrupted now? How could obtaining a stronger lock have such an effect?

> In a way, the logical decoding cache-invalidation situation is a lot more
> atomic than the "normal" situation. During normal operation locking is
> strictly required to prevent incoherent states when building a cache entry
> after a transaction committed, but before the sinval entries have been
> queued. But in the logical decoding case that window doesn't exist.
>

Because we apply the invalidations at commit time, so it happens as a single operation that can't interleave with other sessions?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote:
>
> On 11/18/23 22:05, Andres Freund wrote:
> > Hi,
> >
> > On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:
> >> On 11/18/23 19:12, Andres Freund wrote:
> >>>> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
> >>>> we're making it conflict with RowExclusive. Which is just DML, and I
> >>>> think we need to do that.
> >>>
> >>> From what I can tell it needs to to be an AccessExlusiveLock. Completely
> >>> independent of logical decoding. The way the cache stays coherent is catalog
> >>> modifications conflicting with anything that builds cache entries. We have a
> >>> few cases where we do use lower level locks, but for those we have explicit
> >>> analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
> >>> could have an old view of the catalog (various CONCURRENTLY) operations.
> >>>
> >>
> >> Yeah, I got too focused on the issue I triggered, which seems to be
> >> fixed by using SRE (still don't understand why ...). But you're probably
> >> right there may be other cases where SRE would not be sufficient, I
> >> certainly can't prove it'd be safe.
> >
> > I think it makes sense here: SRE prevents the problematic "scheduling" in your
> > test - with SRE no DML started before ALTER PUB ... ADD can commit after.
> >
>
> If understand correctly, with the current code (which only gets
> ShareUpdateExclusiveLock), we may end up in a situation like this
> (sessions A and B):
>
>   A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
>   A: writes the invalidation message(s) into WAL
>   B: inserts into table "t"
>   B: commit
>   A: commit

I don't think this is the problematic sequence - at least it's not what I had reproed in
https://postgr.es/m/20231118025445.crhaeeuvoe2g5dv6%40awork3.anarazel.de

Adding line numbers:

1) S1: CREATE TABLE d(data text not null);
2) S1: INSERT INTO d VALUES('d1');
3) S2: BEGIN; INSERT INTO d VALUES('d2');
4) S1: ALTER PUBLICATION pb ADD TABLE d;
5) S2: COMMIT
6) S2: INSERT INTO d VALUES('d3');
7) S1: INSERT INTO d VALUES('d4');
8) RL: <nothing>

The problem with the sequence is that the insert from 3) is decoded *after* 4) and that to decode the insert (which happened before the ALTER) the catalog snapshot and cache state is from *before* the ALTER PUBLICATION. Because the transaction started in 3) doesn't actually modify any catalogs, no invalidations are executed after decoding it. The result is that the cache looks like it did at 3), not like after 4). Undesirable timetravel...

It's worth noting that here the cache state is briefly correct, after 4); it's just that after 5) it's back to the old state.

If 4) instead uses a SRE lock, then S1 will be blocked until S2 commits, and everything is fine.

> > I'm not sure there are any cases where using SRE instead of AE would cause
> > problems for logical decoding, but it seems very hard to prove. I'd be very
> > surprised if just using SRE would not lead to corrupted cache contents in some
> > situations. The cases where a lower lock level is ok are ones where we just
> > don't care that the cache is coherent in that moment.

> Are you saying it might break cases that are not corrupted now? How
> could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative consequences for logical decoding. I.e. whether, from a logical decoding POV, it'd suffice to increase the lock level to just SRE instead of AE.
Since I don't see how it'd be correct otherwise, it's kind of a moot question. > > In a way, the logical decoding cache-invalidation situation is a lot more > > atomic than the "normal" situation. During normal operation locking is > > strictly required to prevent incoherent states when building a cache entry > > after a transaction committed, but before the sinval entries have been > > queued. But in the logical decoding case that window doesn't exist. > > > Because we apply the invalidations at commit time, so it happens as a > single operation that can't interleave with other sessions? Yea, the situation is much simpler during logical decoding than "originally" - there's no concurrency. Greetings, Andres Freund
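(For completeness: with the experimental switch to ShareRowExclusiveLock in OpenTableList(), the same sequence should play out as sketched below. This is the expected behaviour with the fix applied, not something any released branch does.)

1) S1: CREATE TABLE d(data text not null);
2) S1: INSERT INTO d VALUES('d1');
3) S2: BEGIN; INSERT INTO d VALUES('d2');
4) S1: ALTER PUBLICATION pb ADD TABLE d;   -- now blocks: SRE conflicts with S2's RowExclusiveLock
5) S2: COMMIT                              -- releases the lock, so 4) completes only after this
6) S2: INSERT INTO d VALUES('d3');
7) S1: INSERT INTO d VALUES('d4');
8) RL: 'd3' and 'd4' arrive; 'd2' still doesn't, but it was inserted before the table was added to the publication, so that's the expected behaviour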
Hi,

On 19.11.2023 09:18, Andres Freund wrote:
> Yea, the situation is much simpler during logical decoding than "originally" -
> there's no concurrency.
>
> Greetings,
>
> Andres Freund
>

We've encountered a similar error on our production server. The case: after adding a table to logical replication, table initialization proceeds normally, but new data from the publisher's table does not appear on the subscriber server. When we added the table, we checked and saw that the data was present on the subscriber and everything looked normal; we only discovered the error some time later.

I have attached scripts to the email. The patch from the first message also solves this problem.

-- 
Best regards,
Vadim Lakt
Attachment
On Sun, Nov 19, 2023 at 7:48 AM Andres Freund <andres@anarazel.de> wrote: > > On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote: > > > > If understand correctly, with the current code (which only gets > > ShareUpdateExclusiveLock), we may end up in a situation like this > > (sessions A and B): > > > > A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock > > A: writes the invalidation message(s) into WAL > > B: inserts into table "t" > > B: commit > > A: commit > > I don't think this the problematic sequence - at least it's not what I had > reproed in > https://postgr.es/m/20231118025445.crhaeeuvoe2g5dv6%40awork3.anarazel.de > > Adding line numbers: > > 1) S1: CREATE TABLE d(data text not null); > 2) S1: INSERT INTO d VALUES('d1'); > 3) S2: BEGIN; INSERT INTO d VALUES('d2'); > 4) S1: ALTER PUBLICATION pb ADD TABLE d; > 5) S2: COMMIT > 6) S2: INSERT INTO d VALUES('d3'); > 7) S1: INSERT INTO d VALUES('d4'); > 8) RL: <nothing> > > The problem with the sequence is that the insert from 3) is decoded *after* 4) > and that to decode the insert (which happened before the ALTER) the catalog > snapshot and cache state is from *before* the ALTER TABLE. Because the > transaction started in 3) doesn't actually modify any catalogs, no > invalidations are executed after decoding it. The result is that the cache > looks like it did at 3), not like after 4). Undesirable timetravel... > > It's worth noting that here the cache state is briefly correct, after 4), it's > just that after 5) it stays the old state. > > If 4) instead uses a SRE lock, then S1 will be blocked until S2 commits, and > everything is fine. > I agree, your analysis looks right to me. > > > > > I'm not sure there are any cases where using SRE instead of AE would cause > > > problems for logical decoding, but it seems very hard to prove. I'd be very > > > surprised if just using SRE would not lead to corrupted cache contents in some > > > situations. The cases where a lower lock level is ok are ones where we just > > > don't care that the cache is coherent in that moment. > > > Are you saying it might break cases that are not corrupted now? How > > could obtaining a stronger lock have such effect? > > No, I mean that I don't know if using SRE instead of AE would have negative > consequences for logical decoding. I.e. whether, from a logical decoding POV, > it'd suffice to increase the lock level to just SRE instead of AE. > > Since I don't see how it'd be correct otherwise, it's kind of a moot question. > We lost track of this thread and the bug is still open. IIUC, the conclusion is to use SRE in OpenTableList() to fix the reported issue. Andres, Tomas, please let me know if my understanding is wrong, otherwise, let's proceed and fix this issue. -- With Regards, Amit Kapila.
On 6/24/24 12:54, Amit Kapila wrote: > ... >> >>>> I'm not sure there are any cases where using SRE instead of AE would cause >>>> problems for logical decoding, but it seems very hard to prove. I'd be very >>>> surprised if just using SRE would not lead to corrupted cache contents in some >>>> situations. The cases where a lower lock level is ok are ones where we just >>>> don't care that the cache is coherent in that moment. >> >>> Are you saying it might break cases that are not corrupted now? How >>> could obtaining a stronger lock have such effect? >> >> No, I mean that I don't know if using SRE instead of AE would have negative >> consequences for logical decoding. I.e. whether, from a logical decoding POV, >> it'd suffice to increase the lock level to just SRE instead of AE. >> >> Since I don't see how it'd be correct otherwise, it's kind of a moot question. >> > > We lost track of this thread and the bug is still open. IIUC, the > conclusion is to use SRE in OpenTableList() to fix the reported issue. > Andres, Tomas, please let me know if my understanding is wrong, > otherwise, let's proceed and fix this issue. > It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I don't think we 'lost track' of it, but it's true we haven't done much progress recently. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 6/24/24 12:54, Amit Kapila wrote: > > ... > >> > >>>> I'm not sure there are any cases where using SRE instead of AE would cause > >>>> problems for logical decoding, but it seems very hard to prove. I'd be very > >>>> surprised if just using SRE would not lead to corrupted cache contents in some > >>>> situations. The cases where a lower lock level is ok are ones where we just > >>>> don't care that the cache is coherent in that moment. > >> > >>> Are you saying it might break cases that are not corrupted now? How > >>> could obtaining a stronger lock have such effect? > >> > >> No, I mean that I don't know if using SRE instead of AE would have negative > >> consequences for logical decoding. I.e. whether, from a logical decoding POV, > >> it'd suffice to increase the lock level to just SRE instead of AE. > >> > >> Since I don't see how it'd be correct otherwise, it's kind of a moot question. > >> > > > > We lost track of this thread and the bug is still open. IIUC, the > > conclusion is to use SRE in OpenTableList() to fix the reported issue. > > Andres, Tomas, please let me know if my understanding is wrong, > > otherwise, let's proceed and fix this issue. > > > > It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I > don't think we 'lost track' of it, but it's true we haven't done much > progress recently. > Okay, thanks for pointing to the CF entry. Would you like to take care of this? Are you seeing anything more than the simple fix to use SRE in OpenTableList()? -- With Regards, Amit Kapila.
On 6/25/24 07:04, Amit Kapila wrote: > On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 6/24/24 12:54, Amit Kapila wrote: >>> ... >>>> >>>>>> I'm not sure there are any cases where using SRE instead of AE would cause >>>>>> problems for logical decoding, but it seems very hard to prove. I'd be very >>>>>> surprised if just using SRE would not lead to corrupted cache contents in some >>>>>> situations. The cases where a lower lock level is ok are ones where we just >>>>>> don't care that the cache is coherent in that moment. >>>> >>>>> Are you saying it might break cases that are not corrupted now? How >>>>> could obtaining a stronger lock have such effect? >>>> >>>> No, I mean that I don't know if using SRE instead of AE would have negative >>>> consequences for logical decoding. I.e. whether, from a logical decoding POV, >>>> it'd suffice to increase the lock level to just SRE instead of AE. >>>> >>>> Since I don't see how it'd be correct otherwise, it's kind of a moot question. >>>> >>> >>> We lost track of this thread and the bug is still open. IIUC, the >>> conclusion is to use SRE in OpenTableList() to fix the reported issue. >>> Andres, Tomas, please let me know if my understanding is wrong, >>> otherwise, let's proceed and fix this issue. >>> >> >> It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I >> don't think we 'lost track' of it, but it's true we haven't done much >> progress recently. >> > > Okay, thanks for pointing to the CF entry. Would you like to take care > of this? Are you seeing anything more than the simple fix to use SRE > in OpenTableList()? > I did not find a simpler fix than adding the SRE, and I think pretty much any other fix is guaranteed to be more complex. I don't remember all the details without relearning all the details, but IIRC the main challenge for me was to convince myself it's a sufficient and reliable fix (and not working simply by chance). I won't have time to look into this anytime soon, so feel free to take care of this and push the fix. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 26, 2024 at 4:57 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 6/25/24 07:04, Amit Kapila wrote: > > On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> On 6/24/24 12:54, Amit Kapila wrote: > >>> ... > >>>> > >>>>>> I'm not sure there are any cases where using SRE instead of AE would cause > >>>>>> problems for logical decoding, but it seems very hard to prove. I'd be very > >>>>>> surprised if just using SRE would not lead to corrupted cache contents in some > >>>>>> situations. The cases where a lower lock level is ok are ones where we just > >>>>>> don't care that the cache is coherent in that moment. > >>>> > >>>>> Are you saying it might break cases that are not corrupted now? How > >>>>> could obtaining a stronger lock have such effect? > >>>> > >>>> No, I mean that I don't know if using SRE instead of AE would have negative > >>>> consequences for logical decoding. I.e. whether, from a logical decoding POV, > >>>> it'd suffice to increase the lock level to just SRE instead of AE. > >>>> > >>>> Since I don't see how it'd be correct otherwise, it's kind of a moot question. > >>>> > >>> > >>> We lost track of this thread and the bug is still open. IIUC, the > >>> conclusion is to use SRE in OpenTableList() to fix the reported issue. > >>> Andres, Tomas, please let me know if my understanding is wrong, > >>> otherwise, let's proceed and fix this issue. > >>> > >> > >> It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I > >> don't think we 'lost track' of it, but it's true we haven't done much > >> progress recently. > >> > > > > Okay, thanks for pointing to the CF entry. Would you like to take care > > of this? Are you seeing anything more than the simple fix to use SRE > > in OpenTableList()? > > > > I did not find a simpler fix than adding the SRE, and I think pretty > much any other fix is guaranteed to be more complex. I don't remember > all the details without relearning all the details, but IIRC the main > challenge for me was to convince myself it's a sufficient and reliable > fix (and not working simply by chance). > > I won't have time to look into this anytime soon, so feel free to take > care of this and push the fix. > Okay, I'll take care of this. -- With Regards, Amit Kapila.
On Thu, 27 Jun 2024 at 08:38, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jun 26, 2024 at 4:57 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 6/25/24 07:04, Amit Kapila wrote: > > > On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra > > > <tomas.vondra@enterprisedb.com> wrote: > > >> > > >> On 6/24/24 12:54, Amit Kapila wrote: > > >>> ... > > >>>> > > >>>>>> I'm not sure there are any cases where using SRE instead of AE would cause > > >>>>>> problems for logical decoding, but it seems very hard to prove. I'd be very > > >>>>>> surprised if just using SRE would not lead to corrupted cache contents in some > > >>>>>> situations. The cases where a lower lock level is ok are ones where we just > > >>>>>> don't care that the cache is coherent in that moment. > > >>>> > > >>>>> Are you saying it might break cases that are not corrupted now? How > > >>>>> could obtaining a stronger lock have such effect? > > >>>> > > >>>> No, I mean that I don't know if using SRE instead of AE would have negative > > >>>> consequences for logical decoding. I.e. whether, from a logical decoding POV, > > >>>> it'd suffice to increase the lock level to just SRE instead of AE. > > >>>> > > >>>> Since I don't see how it'd be correct otherwise, it's kind of a moot question. > > >>>> > > >>> > > >>> We lost track of this thread and the bug is still open. IIUC, the > > >>> conclusion is to use SRE in OpenTableList() to fix the reported issue. > > >>> Andres, Tomas, please let me know if my understanding is wrong, > > >>> otherwise, let's proceed and fix this issue. > > >>> > > >> > > >> It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I > > >> don't think we 'lost track' of it, but it's true we haven't done much > > >> progress recently. > > >> > > > > > > Okay, thanks for pointing to the CF entry. Would you like to take care > > > of this? Are you seeing anything more than the simple fix to use SRE > > > in OpenTableList()? > > > > > > > I did not find a simpler fix than adding the SRE, and I think pretty > > much any other fix is guaranteed to be more complex. I don't remember > > all the details without relearning all the details, but IIRC the main > > challenge for me was to convince myself it's a sufficient and reliable > > fix (and not working simply by chance). > > > > I won't have time to look into this anytime soon, so feel free to take > > care of this and push the fix. > > > > Okay, I'll take care of this. This issue is present in all supported versions. I was able to reproduce it using the steps recommended by Andres and Tomas's scripts. I also conducted a small test through TAP tests to verify the problem. Attached is the alternate_lock_HEAD.patch, which includes the lock modification(Tomas's change) and the TAP test. To reproduce the issue in the HEAD version, we cannot use the same test as in the alternate_lock_HEAD patch because the behavior changes slightly after the fix to wait for the lock until the open transaction completes. The attached issue_reproduce_testcase_head.patch can be used to reproduce the issue through TAP test in HEAD. The changes made in the HEAD version do not directly apply to older branches. For PG14, PG13, and PG12 branches, you can use the alternate_lock_PG14.patch. Regards, Vignesh
Attachment
On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote:
>
>
> This issue is present in all supported versions. I was able to
> reproduce it using the steps recommended by Andres and Tomas's
> scripts. I also conducted a small test through TAP tests to verify the
> problem. Attached is the alternate_lock_HEAD.patch, which includes the
> lock modification(Tomas's change) and the TAP test.
>

@@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();

-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);

The comment just above this code ("Open, share-lock, and check all the explicitly-specified relations") needs modification. It would be better to explain the reason why we need the SRE lock here.

> To reproduce the issue in the HEAD version, we cannot use the same
> test as in the alternate_lock_HEAD patch because the behavior changes
> slightly after the fix to wait for the lock until the open transaction
> completes.
>

But won't the test that reproduces the problem in HEAD be successful after the code change? If so, can't we use the same test instead of a slight modification to verify the lock mode?

> The attached issue_reproduce_testcase_head.patch can be
> used to reproduce the issue through TAP test in HEAD.
> The changes made in the HEAD version do not directly apply to older
> branches. For PG14, PG13, and PG12 branches, you can use the
> alternate_lock_PG14.patch.
>

Why didn't you include the test in the back branches? If it is due to background psql stuff, then won't commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=187b8991f70fc3d2a13dc709edd408a8df0be055)
address it?

-- 
With Regards,
Amit Kapila.
On Tue, 9 Jul 2024 at 17:05, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> >
> > This issue is present in all supported versions. I was able to
> > reproduce it using the steps recommended by Andres and Tomas's
> > scripts. I also conducted a small test through TAP tests to verify the
> > problem. Attached is the alternate_lock_HEAD.patch, which includes the
> > lock modification(Tomas's change) and the TAP test.
> >
>
> @@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
>  		/* Allow query cancel in case this takes a long time */
>  		CHECK_FOR_INTERRUPTS();
>
> -		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
> +		rel = table_openrv(t->relation, ShareRowExclusiveLock);
>
> The comment just above this code ("Open, share-lock, and check all the
> explicitly-specified relations") needs modification. It would be
> better to explain the reason of why we would need SRE lock here.

Updated comments for the same.

> > To reproduce the issue in the HEAD version, we cannot use the same
> > test as in the alternate_lock_HEAD patch because the behavior changes
> > slightly after the fix to wait for the lock until the open transaction
> > completes.
> >
>
> But won't the test that reproduces the problem in HEAD be successful
> after the code change? If so, can't we use the same test instead of
> slight modification to verify the lock mode?

Before the fix, the ALTER PUBLICATION command would succeed immediately. Now, the ALTER PUBLICATION command waits until it acquires the ShareRowExclusiveLock. This change means that in the test cases, previously we waited until the table was added to the publication, whereas now, after applying the patch, we wait until the ALTER PUBLICATION command is actively waiting for the ShareRowExclusiveLock. This waiting step ensures consistent execution and sequencing of the tests each time.

> > The attached issue_reproduce_testcase_head.patch can be
> > used to reproduce the issue through TAP test in HEAD.
> > The changes made in the HEAD version do not directly apply to older
> > branches. For PG14, PG13, and PG12 branches, you can use the
> > alternate_lock_PG14.patch.
> >
>
> Why didn't you include the test in the back branches? If it is due to
> background psql stuff, then won't commit
> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=187b8991f70fc3d2a13dc709edd408a8df0be055)
> can address it?

Indeed, I initially believed it wasn't available. Currently, I haven't incorporated the back-branch patch, but I plan to include it in a subsequent version once there are no review comments on the HEAD patch.

The updated v2 patch includes the fixes for these comments.

Regards,
Vignesh
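(In other words, instead of waiting for the table to show up in the publication, the test now polls the publisher for a condition along these lines - a sketch of the idea, not the exact TAP query:)

  SELECT count(*) = 1
    FROM pg_stat_activity
   WHERE query LIKE 'ALTER PUBLICATION%'
     AND wait_event_type = 'Lock';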
Attachment
On Tue, Jul 9, 2024 at 8:14 PM vignesh C <vignesh21@gmail.com> wrote: > > On Tue, 9 Jul 2024 at 17:05, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > This issue is present in all supported versions. I was able to > > > reproduce it using the steps recommended by Andres and Tomas's > > > scripts. I also conducted a small test through TAP tests to verify the > > > problem. Attached is the alternate_lock_HEAD.patch, which includes the > > > lock modification(Tomas's change) and the TAP test. > > > > > > > @@ -1568,7 +1568,7 @@ OpenTableList(List *tables) > > /* Allow query cancel in case this takes a long time */ > > CHECK_FOR_INTERRUPTS(); > > > > - rel = table_openrv(t->relation, ShareUpdateExclusiveLock); > > + rel = table_openrv(t->relation, ShareRowExclusiveLock); > > > > The comment just above this code ("Open, share-lock, and check all the > > explicitly-specified relations") needs modification. It would be > > better to explain the reason of why we would need SRE lock here. > > Updated comments for the same. > The patch missed to use the ShareRowExclusiveLock for partitions, see attached. I haven't tested it but they should also face the same problem. Apart from that, I have changed the comments in a few places in the patch. -- With Regards, Amit Kapila.
Attachment
On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 9, 2024 at 8:14 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Tue, 9 Jul 2024 at 17:05, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > > > This issue is present in all supported versions. I was able to > > > > reproduce it using the steps recommended by Andres and Tomas's > > > > scripts. I also conducted a small test through TAP tests to verify the > > > > problem. Attached is the alternate_lock_HEAD.patch, which includes the > > > > lock modification(Tomas's change) and the TAP test. > > > > > > > > > > @@ -1568,7 +1568,7 @@ OpenTableList(List *tables) > > > /* Allow query cancel in case this takes a long time */ > > > CHECK_FOR_INTERRUPTS(); > > > > > > - rel = table_openrv(t->relation, ShareUpdateExclusiveLock); > > > + rel = table_openrv(t->relation, ShareRowExclusiveLock); > > > > > > The comment just above this code ("Open, share-lock, and check all the > > > explicitly-specified relations") needs modification. It would be > > > better to explain the reason of why we would need SRE lock here. > > > > Updated comments for the same. > > > > The patch missed to use the ShareRowExclusiveLock for partitions, see > attached. I haven't tested it but they should also face the same > problem. Apart from that, I have changed the comments in a few places > in the patch. I could not hit the updated ShareRowExclusiveLock changes through the partition table, instead I could verify it using the inheritance table. Added a test for the same and also attaching the backbranch patch. Regards, Vignesh
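(For reference, the inheritance shape being exercised is roughly the following - a sketch with hypothetical object names, not the exact TAP test. The point is that OpenTableList() also opens and locks the children, so with the fix the ALTER has to wait for an open insert into the child:)

  S1: CREATE TABLE parent (a int);
  S1: CREATE TABLE child () INHERITS (parent);
  S2: BEGIN; INSERT INTO child VALUES (1);      -- RowExclusiveLock on child only
  S1: ALTER PUBLICATION pub ADD TABLE parent;   -- must ShareRowExclusive-lock child too,
                                                --   so it now blocks here
  S2: COMMIT;                                   -- the ALTER completes after this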
Attachment
- v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_HEAD.patch
- v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_PG14.patch
- v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_PG12.patch
- v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_PG13.patch
- v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_PG16.patch
- v2_issue_reproduce_testcase_head.patch
On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote: > > The patch missed to use the ShareRowExclusiveLock for partitions, see > > attached. I haven't tested it but they should also face the same > > problem. Apart from that, I have changed the comments in a few places > > in the patch. > > I could not hit the updated ShareRowExclusiveLock changes through the > partition table, instead I could verify it using the inheritance > table. Added a test for the same and also attaching the backbranch > patch. > Hi, I tested alternative-experimental-fix-lock.patch provided by Tomas (replaces SUE with SRE in OpenTableList). I believe there are a couple of scenarios the patch does not cover. 1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES IN SCHEMA <schema>". I took crash-test.sh provided by Tomas and modified it to add all tables in the schema to publication using the following command : ALTER PUBLICATION p ADD TABLES IN SCHEMA public The modified script is attached (crash-test-with-schema.sh). With this script, I can reproduce the issue even with the patch applied. This is because the code path to add a schema to the publication doesn't go through OpenTableList. I have also attached a script run-test-with-schema.sh to run crash-test-with-schema.sh in a loop with randomly generated parameters (modified from run.sh provided by Tomas). 2. The second issue is a deadlock which happens when the alter publication command is run for a comma separated list of tables. I created another script create-test-tables-order-reverse.sh. This script runs a command like the following : ALTER PUBLICATION p ADD TABLE test_2,test_1 Running the above script, I was able to get a deadlock error (the output is attached in deadlock.txt). In the alter publication command, I added the tables in the reverse order to increase the probability of the deadlock. But it should happen with any order of tables. I am not sure if the deadlock is a major issue because detecting the deadlock is better than data loss. The schema issue is probably more important. I didn't test it out with the latest patches sent by Vignesh but since the code changes in that patch are also in OpenTableList, I think the schema scenario won't be covered by those. Thanks & Regards, Nitin Motiani Google
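For clarity, the deadlock is an ordinary lock-ordering cycle once the stronger lock is used; a sketch of the interleaving (the statements are illustrative and assume the SRE-lock patch is applied):

    -- Session 1 (concurrent workload)
    BEGIN;
    INSERT INTO test_1 VALUES (1);      -- takes ROW EXCLUSIVE on test_1

    -- Session 2
    ALTER PUBLICATION p ADD TABLE test_2, test_1;
    -- locks test_2 in SHARE ROW EXCLUSIVE mode, then waits for test_1

    -- Session 1
    INSERT INTO test_2 VALUES (1);      -- waits for Session 2's lock on test_2
    COMMIT;                             -- not reached: deadlock detected

Whichever session the deadlock detector picks as the victim gets a "deadlock detected" error, as in the attached deadlock.txt.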
Attachment
On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > The patch missed to use the ShareRowExclusiveLock for partitions, see > > > attached. I haven't tested it but they should also face the same > > > problem. Apart from that, I have changed the comments in a few places > > > in the patch. > > > > I could not hit the updated ShareRowExclusiveLock changes through the > > partition table, instead I could verify it using the inheritance > > table. Added a test for the same and also attaching the backbranch > > patch. > > > > Hi, > > I tested alternative-experimental-fix-lock.patch provided by Tomas > (replaces SUE with SRE in OpenTableList). I believe there are a couple > of scenarios the patch does not cover. > > 1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES > IN SCHEMA <schema>". > > I took crash-test.sh provided by Tomas and modified it to add all > tables in the schema to publication using the following command : > > ALTER PUBLICATION p ADD TABLES IN SCHEMA public > > The modified script is attached (crash-test-with-schema.sh). With this > script, I can reproduce the issue even with the patch applied. This is > because the code path to add a schema to the publication doesn't go > through OpenTableList. > > I have also attached a script run-test-with-schema.sh to run > crash-test-with-schema.sh in a loop with randomly generated parameters > (modified from run.sh provided by Tomas). > > 2. The second issue is a deadlock which happens when the alter > publication command is run for a comma separated list of tables. > > I created another script create-test-tables-order-reverse.sh. This > script runs a command like the following : > > ALTER PUBLICATION p ADD TABLE test_2,test_1 > > Running the above script, I was able to get a deadlock error (the > output is attached in deadlock.txt). In the alter publication command, > I added the tables in the reverse order to increase the probability of > the deadlock. But it should happen with any order of tables. > > I am not sure if the deadlock is a major issue because detecting the > deadlock is better than data loss. The schema issue is probably more > important. I didn't test it out with the latest patches sent by > Vignesh but since the code changes in that patch are also in > OpenTableList, I think the schema scenario won't be covered by those. > Hi, I looked further into the scenario of adding the tables in schema to the publication. Since in that case, the entry is added to pg_publication_namespace instead of pg_publication_rel, the codepaths for 'add table' and 'add tables in schema' are different. And in the 'add tables in schema' scenario, the OpenTableList function is not called to get the relation ids. Therefore even with the proposed patch, the data loss issue still persists in that case. To validate this idea, I tried locking all the affected tables in the schema just before the invalidation for those relations (in ShareRowExclusiveLock mode). I am attaching the small patch for that (alter_pub_for_schema.patch) where the change is made in the function publication_add_schema in pg_publication.c. I am not sure if this is the best place to make this change or if it is the right fix. 
It is conceptually similar to the proposed change in OpenTableList, but here we are not just changing the lock mode; we are taking locks that were not taken before. With this change, the data loss errors went away in my test script. Another issue that persists with this change is the deadlock. Since multiple table locks are acquired, the test script detects a deadlock a few times. Therefore I'm also attaching another modified script which retries a few times in case of deadlock. The script is crash-test-with-retries-for-schema.sh. It runs ALTER PUBLICATION p ADD TABLES IN SCHEMA public in a retry loop. If the command fails, it sleeps for a random amount of time (bounded by a MAXWAIT parameter) and then retries the command. If the command still fails after the maximum number of retries, the final return value from the script is DEADLOCK, as we can't do a consistency check in this scenario. Also attached is another script, run-with-deadlock-detection.sh, which can run the above script for multiple iterations. I tried the test scripts with and without alter_pub_for_schema.patch. Without the patch, I get the final output ERROR the majority of the time, which means the publication was altered successfully but data was lost on the subscriber. When I run it with the patch, I get a mix of OK (no data loss) and DEADLOCK (the publication was not altered) but no ERROR. By changing the sleep time and the number of retries we can get different fractions of OK and DEADLOCK. I am not sure if this is the right or cleanest way to fix the issue, but I think conceptually this might be the right direction. Please let me know if my understanding is wrong or if I'm missing something. Thanks & Regards, Nitin Motiani Google
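For reference, the retry idea can also be sketched purely at the SQL level (the script itself retries by re-running the command from the shell; the publication name, retry limit, and sleep bound below are arbitrary):

    DO $$
    DECLARE
        attempts int := 0;
    BEGIN
        LOOP
            BEGIN
                EXECUTE 'ALTER PUBLICATION p ADD TABLES IN SCHEMA public';
                EXIT;                               -- succeeded
            EXCEPTION WHEN deadlock_detected THEN
                attempts := attempts + 1;
                IF attempts >= 5 THEN
                    RAISE;                          -- give up and report the deadlock
                END IF;
                PERFORM pg_sleep(random() * 5);     -- crude stand-in for MAXWAIT
            END;
        END LOOP;
    END;
    $$;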
Attachment
On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > > On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > The patch missed to use the ShareRowExclusiveLock for partitions, see > > > > attached. I haven't tested it but they should also face the same > > > > problem. Apart from that, I have changed the comments in a few places > > > > in the patch. > > > > > > I could not hit the updated ShareRowExclusiveLock changes through the > > > partition table, instead I could verify it using the inheritance > > > table. Added a test for the same and also attaching the backbranch > > > patch. > > > > > > > Hi, > > > > I tested alternative-experimental-fix-lock.patch provided by Tomas > > (replaces SUE with SRE in OpenTableList). I believe there are a couple > > of scenarios the patch does not cover. > > > > 1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES > > IN SCHEMA <schema>". > > > > I took crash-test.sh provided by Tomas and modified it to add all > > tables in the schema to publication using the following command : > > > > ALTER PUBLICATION p ADD TABLES IN SCHEMA public > > > > The modified script is attached (crash-test-with-schema.sh). With this > > script, I can reproduce the issue even with the patch applied. This is > > because the code path to add a schema to the publication doesn't go > > through OpenTableList. > > > > I have also attached a script run-test-with-schema.sh to run > > crash-test-with-schema.sh in a loop with randomly generated parameters > > (modified from run.sh provided by Tomas). > > > > 2. The second issue is a deadlock which happens when the alter > > publication command is run for a comma separated list of tables. > > > > I created another script create-test-tables-order-reverse.sh. This > > script runs a command like the following : > > > > ALTER PUBLICATION p ADD TABLE test_2,test_1 > > > > Running the above script, I was able to get a deadlock error (the > > output is attached in deadlock.txt). In the alter publication command, > > I added the tables in the reverse order to increase the probability of > > the deadlock. But it should happen with any order of tables. > > > > I am not sure if the deadlock is a major issue because detecting the > > deadlock is better than data loss. > > The deadlock reported in this case is an expected behavior. This is no different that locking tables or rows in reverse order. > > I looked further into the scenario of adding the tables in schema to > the publication. Since in that case, the entry is added to > pg_publication_namespace instead of pg_publication_rel, the codepaths > for 'add table' and 'add tables in schema' are different. And in the > 'add tables in schema' scenario, the OpenTableList function is not > called to get the relation ids. Therefore even with the proposed > patch, the data loss issue still persists in that case. > > To validate this idea, I tried locking all the affected tables in the > schema just before the invalidation for those relations (in > ShareRowExclusiveLock mode). > This sounds like a reasonable approach to fix the issue. However, we should check SET publication_object as well, especially the drop part in it. It should not happen that we miss sending the data for ADD but for DROP, we send data when we shouldn't have sent it. 
-- With Regards, Amit Kapila.
On Mon, 15 Jul 2024 at 15:31, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > > On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > > > > On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > The patch missed to use the ShareRowExclusiveLock for partitions, see > > > > > attached. I haven't tested it but they should also face the same > > > > > problem. Apart from that, I have changed the comments in a few places > > > > > in the patch. > > > > > > > > I could not hit the updated ShareRowExclusiveLock changes through the > > > > partition table, instead I could verify it using the inheritance > > > > table. Added a test for the same and also attaching the backbranch > > > > patch. > > > > > > > > > > Hi, > > > > > > I tested alternative-experimental-fix-lock.patch provided by Tomas > > > (replaces SUE with SRE in OpenTableList). I believe there are a couple > > > of scenarios the patch does not cover. > > > > > > 1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES > > > IN SCHEMA <schema>". > > > > > > I took crash-test.sh provided by Tomas and modified it to add all > > > tables in the schema to publication using the following command : > > > > > > ALTER PUBLICATION p ADD TABLES IN SCHEMA public > > > > > > The modified script is attached (crash-test-with-schema.sh). With this > > > script, I can reproduce the issue even with the patch applied. This is > > > because the code path to add a schema to the publication doesn't go > > > through OpenTableList. > > > > > > I have also attached a script run-test-with-schema.sh to run > > > crash-test-with-schema.sh in a loop with randomly generated parameters > > > (modified from run.sh provided by Tomas). > > > > > > 2. The second issue is a deadlock which happens when the alter > > > publication command is run for a comma separated list of tables. > > > > > > I created another script create-test-tables-order-reverse.sh. This > > > script runs a command like the following : > > > > > > ALTER PUBLICATION p ADD TABLE test_2,test_1 > > > > > > Running the above script, I was able to get a deadlock error (the > > > output is attached in deadlock.txt). In the alter publication command, > > > I added the tables in the reverse order to increase the probability of > > > the deadlock. But it should happen with any order of tables. > > > > > > I am not sure if the deadlock is a major issue because detecting the > > > deadlock is better than data loss. > > > > > The deadlock reported in this case is an expected behavior. This is no > different that locking tables or rows in reverse order. > > > > > I looked further into the scenario of adding the tables in schema to > > the publication. Since in that case, the entry is added to > > pg_publication_namespace instead of pg_publication_rel, the codepaths > > for 'add table' and 'add tables in schema' are different. And in the > > 'add tables in schema' scenario, the OpenTableList function is not > > called to get the relation ids. Therefore even with the proposed > > patch, the data loss issue still persists in that case. > > > > To validate this idea, I tried locking all the affected tables in the > > schema just before the invalidation for those relations (in > > ShareRowExclusiveLock mode). > > > > This sounds like a reasonable approach to fix the issue. 
However, we > should check SET publication_object as well, especially the drop part > in it. It should not happen that we miss sending the data for ADD but > for DROP, we send data when we shouldn't have sent it. There were a few other scenarios, similar to the one you mentioned, where the issue occurred. For example: a) When specifying a subset of the existing tables in the ALTER PUBLICATION ... SET TABLE command, the tables that were supposed to be removed from the publication were not locked in ShareRowExclusiveLock mode. b) The ALTER PUBLICATION ... DROP TABLES IN SCHEMA command did not lock the relations that would be removed from the publication in ShareRowExclusiveLock mode. Both of these scenarios resulted in data inconsistency due to inadequate locking. The attached patch addresses these issues. Regards, Vignesh
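To make scenario (a) concrete, here is an illustrative interleaving (publication and table names are made up) showing why the tables being dropped from the publication also need the stronger lock:

    -- Publication p currently publishes test_1 and test_2.

    -- Session 1
    BEGIN;
    INSERT INTO test_2 VALUES (1);        -- test_2 is still published at this point

    -- Session 2
    ALTER PUBLICATION p SET TABLE test_1; -- removes test_2 from the publication

    -- Session 1
    COMMIT;

    -- Without a ShareRowExclusiveLock on test_2, the walsender may decode the
    -- in-progress transaction with stale publication info and keep sending
    -- changes for test_2 even though it is no longer published.

Scenario (b) is the same pattern with ALTER PUBLICATION p DROP TABLES IN SCHEMA public in Session 2.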
Attachment
On Mon, Jul 15, 2024 at 11:42 PM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 15 Jul 2024 at 15:31, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > I looked further into the scenario of adding the tables in schema to > > > the publication. Since in that case, the entry is added to > > > pg_publication_namespace instead of pg_publication_rel, the codepaths > > > for 'add table' and 'add tables in schema' are different. And in the > > > 'add tables in schema' scenario, the OpenTableList function is not > > > called to get the relation ids. Therefore even with the proposed > > > patch, the data loss issue still persists in that case. > > > > > > To validate this idea, I tried locking all the affected tables in the > > > schema just before the invalidation for those relations (in > > > ShareRowExclusiveLock mode). > > > > > > > This sounds like a reasonable approach to fix the issue. However, we > > should check SET publication_object as well, especially the drop part > > in it. It should not happen that we miss sending the data for ADD but > > for DROP, we send data when we shouldn't have sent it. > > There were few other scenarios, similar to the one you mentioned, > where the issue occurred. For example: a) When specifying a subset of > existing tables in the ALTER PUBLICATION ... SET TABLE command, the > tables that were supposed to be removed from the publication were not > locked in ShareRowExclusiveLock mode. b) The ALTER PUBLICATION ... > DROP TABLES IN SCHEMA command did not lock the relations that will be > removed from the publication in ShareRowExclusiveLock mode. Both of > these scenarios resulted in data inconsistency due to inadequate > locking. The attached patch addresses these issues. > Hi, A couple of questions on the latest patch : 1. I see there is this logic in PublicationDropSchemas to first check if there is a valid entry for the schema in pg_publication_namespace psid = GetSysCacheOid2(PUBLICATIONNAMESPACEMAP, Anum_pg_publication_namespace_oid, ObjectIdGetDatum(schemaid), ObjectIdGetDatum(pubid)); if (!OidIsValid(psid)) { if (missing_ok) continue; ereport(ERROR, (errcode(ERRCODE_UNDEFINED_OBJECT), errmsg("tables from schema \"%s\" are not part of the publication", get_namespace_name(schemaid)))); } Your proposed change locks the schemaRels before this code block. Would it be better to lock the schemaRels after the error check? So that just in case, the publication on the schema is not valid anymore, the lock is not held unnecessarily on all its tables. 2. The function publication_add_schema explicitly invalidates cache by calling InvalidatePublicationRels(schemaRels). That is not present in the current PublicationDropSchemas code. Is that something which should be added in the drop scenario also? Please let me know if there is some context that I'm missing regarding why this was not added originally for the drop scenario. Thanks & Regards, Nitin Motiani Google
On Tue, Jul 16, 2024 at 12:48 AM Nitin Motiani <nitinmotiani@google.com> wrote: > > A couple of questions on the latest patch : > > 1. I see there is this logic in PublicationDropSchemas to first check > if there is a valid entry for the schema in pg_publication_namespace > > psid = GetSysCacheOid2(PUBLICATIONNAMESPACEMAP, > > Anum_pg_publication_namespace_oid, > > ObjectIdGetDatum(schemaid), > > ObjectIdGetDatum(pubid)); > if (!OidIsValid(psid)) > { > if (missing_ok) > continue; > > ereport(ERROR, > (errcode(ERRCODE_UNDEFINED_OBJECT), > errmsg("tables from schema > \"%s\" are not part of the publication", > > get_namespace_name(schemaid)))); > } > > Your proposed change locks the schemaRels before this code block. > Would it be better to lock the schemaRels after the error check? So > that just in case, the publication on the schema is not valid anymore, > the lock is not held unnecessarily on all its tables. > Good point. It is better to lock the relations in RemovePublicationSchemaById() where we are invalidating relcache as well. See the response to your next point as well. > 2. The function publication_add_schema explicitly invalidates cache by > calling InvalidatePublicationRels(schemaRels). That is not present in > the current PublicationDropSchemas code. Is that something which > should be added in the drop scenario also? Please let me know if there > is some context that I'm missing regarding why this was not added > originally for the drop scenario. > The required invalidation happens in the function RemovePublicationSchemaById(). So, we should lock in RemovePublicationSchemaById() as that would avoid calling GetSchemaPublicationRelations() multiple times. One related comment: @@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt *stmt, HeapTuple tup, oldrel = palloc(sizeof(PublicationRelInfo)); oldrel->whereClause = NULL; oldrel->columns = NIL; + + /* + * Data loss due to concurrency issues are avoided by locking + * the relation in ShareRowExclusiveLock as described atop + * OpenTableList. + */ oldrel->relation = table_open(oldrelid, - ShareUpdateExclusiveLock); + ShareRowExclusiveLock); Isn't it better to lock the required relations in RemovePublicationRelById()? -- With Regards, Amit Kapila.
On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > One related comment: > @@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt > *stmt, HeapTuple tup, > oldrel = palloc(sizeof(PublicationRelInfo)); > oldrel->whereClause = NULL; > oldrel->columns = NIL; > + > + /* > + * Data loss due to concurrency issues are avoided by locking > + * the relation in ShareRowExclusiveLock as described atop > + * OpenTableList. > + */ > oldrel->relation = table_open(oldrelid, > - ShareUpdateExclusiveLock); > + ShareRowExclusiveLock); > > Isn't it better to lock the required relations in RemovePublicationRelById()? > On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly added tests. It isn't worth adding this much extra time for one bug fix. Can we combine table and schema tests into one single test and avoid inheritance table tests as the code for those will mostly follow the same path as a regular table? -- With Regards, Amit Kapila.
On Tue, 16 Jul 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > One related comment: > > @@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt > > *stmt, HeapTuple tup, > > oldrel = palloc(sizeof(PublicationRelInfo)); > > oldrel->whereClause = NULL; > > oldrel->columns = NIL; > > + > > + /* > > + * Data loss due to concurrency issues are avoided by locking > > + * the relation in ShareRowExclusiveLock as described atop > > + * OpenTableList. > > + */ > > oldrel->relation = table_open(oldrelid, > > - ShareUpdateExclusiveLock); > > + ShareRowExclusiveLock); > > > > Isn't it better to lock the required relations in RemovePublicationRelById()? > > > > On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a > patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly > added tests. It isn't worth adding this much extra time for one bug > fix. Can we combine table and schema tests into one single test and > avoid inheritance table tests as the code for those will mostly follow > the same path as a regular table? Yes, that is better. The attached v6 version patch has the changes for the same. The patch also addresses the comments from [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1LZDW2AVDYFZdZcvmsKVGajH2-gZmjXr9BsYiy8ct_fEw%40mail.gmail.com Regards, Vignesh
Attachment
On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > On Tue, 16 Jul 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > One related comment: > > > @@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt > > > *stmt, HeapTuple tup, > > > oldrel = palloc(sizeof(PublicationRelInfo)); > > > oldrel->whereClause = NULL; > > > oldrel->columns = NIL; > > > + > > > + /* > > > + * Data loss due to concurrency issues are avoided by locking > > > + * the relation in ShareRowExclusiveLock as described atop > > > + * OpenTableList. > > > + */ > > > oldrel->relation = table_open(oldrelid, > > > - ShareUpdateExclusiveLock); > > > + ShareRowExclusiveLock); > > > > > > Isn't it better to lock the required relations in RemovePublicationRelById()? > > > > > > > On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a > > patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly > > added tests. It isn't worth adding this much extra time for one bug > > fix. Can we combine table and schema tests into one single test and > > avoid inheritance table tests as the code for those will mostly follow > > the same path as a regular table? > > Yes, that is better. The attached v6 version patch has the changes for the same. > The patch also addresses the comments from [1]. > Thanks, I don't see any noticeable difference in test timing with new tests. I have slightly modified the comments in the attached diff patch (please rename it to .patch). BTW, I noticed that we don't take any table-level locks for Create Publication .. For ALL TABLES (and Drop Publication). Can that create a similar problem? I haven't tested so not sure but even if there is a problem for the Create case, it should lead to some ERROR like missing publication. -- With Regards, Amit Kapila.
Attachment
On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Tue, 16 Jul 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > One related comment: > > > > @@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt > > > > *stmt, HeapTuple tup, > > > > oldrel = palloc(sizeof(PublicationRelInfo)); > > > > oldrel->whereClause = NULL; > > > > oldrel->columns = NIL; > > > > + > > > > + /* > > > > + * Data loss due to concurrency issues are avoided by locking > > > > + * the relation in ShareRowExclusiveLock as described atop > > > > + * OpenTableList. > > > > + */ > > > > oldrel->relation = table_open(oldrelid, > > > > - ShareUpdateExclusiveLock); > > > > + ShareRowExclusiveLock); > > > > > > > > Isn't it better to lock the required relations in RemovePublicationRelById()? > > > > > > > > > > On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a > > > patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly > > > added tests. It isn't worth adding this much extra time for one bug > > > fix. Can we combine table and schema tests into one single test and > > > avoid inheritance table tests as the code for those will mostly follow > > > the same path as a regular table? > > > > Yes, that is better. The attached v6 version patch has the changes for the same. > > The patch also addresses the comments from [1]. > > > > Thanks, I don't see any noticeable difference in test timing with new > tests. I have slightly modified the comments in the attached diff > patch (please rename it to .patch). > > BTW, I noticed that we don't take any table-level locks for Create > Publication .. For ALL TABLES (and Drop Publication). Can that create > a similar problem? I haven't tested so not sure but even if there is a > problem for the Create case, it should lead to some ERROR like missing > publication. I tested these scenarios, and as you expected, it throws an error for the create publication case: 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive data from WAL stream: ERROR: publication "pub1" does not exist CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/1510CD8 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker "logical replication apply worker" (PID 481526) exited with exit code 1 The steps for this process are as follows: 1) Create tables in both the publisher and subscriber. 2) On the publisher: Create a replication slot. 3) On the subscriber: Create a subscription using the slot created by the publisher. 4) On the publisher: 4.a) Session 1: BEGIN; INSERT INTO T1; 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES 4.c) Session 1: COMMIT; Since we are throwing out a "publication does not exist" error, there is no inconsistency issue here. However, an issue persists with DROP ALL TABLES publication, where data continues to replicate even after the publication is dropped. This happens because the open transaction consumes the invalidation, causing the publications to be revalidated using old snapshot. As a result, both the open transactions and the subsequent transactions are getting replicated. 
We can reproduce this issue by following these steps in a logical replication setup with an "ALL TABLES" publication: On the publisher: Session 1: BEGIN; INSERT INTO T1 VALUES (val1); In another session on the publisher: Session 2: DROP PUBLICATION Back in Session 1 on the publisher: COMMIT; Finally, in Session 1 on the publisher: INSERT INTO T1 VALUES (val2); Even after dropping the publication, both val1 and val2 are still being replicated to the subscriber. This means that both the in-progress concurrent transaction and the subsequent transactions are being replicated. I don't think locking all tables is a viable solution in this case, as it would require asking the user to refrain from performing any operations on any of the tables in the database while creating a publication. Thoughts? Regards, Vignesh
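The same sequence as a psql-style transcript (the publication name, table, and values are placeholders):

    -- Publisher, session 1 (an ALL TABLES publication exists and is subscribed)
    BEGIN;
    INSERT INTO t1 VALUES (1);

    -- Publisher, session 2
    DROP PUBLICATION pub_all;

    -- Publisher, session 1
    COMMIT;
    INSERT INTO t1 VALUES (2);

    -- Observed: both rows still arrive on the subscriber, because the
    -- walsender's cached publication data predates the DROP and the
    -- invalidation is not applied to the already in-progress transaction.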
On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Tue, 16 Jul 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > One related comment: > > > > > @@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt > > > > > *stmt, HeapTuple tup, > > > > > oldrel = palloc(sizeof(PublicationRelInfo)); > > > > > oldrel->whereClause = NULL; > > > > > oldrel->columns = NIL; > > > > > + > > > > > + /* > > > > > + * Data loss due to concurrency issues are avoided by locking > > > > > + * the relation in ShareRowExclusiveLock as described atop > > > > > + * OpenTableList. > > > > > + */ > > > > > oldrel->relation = table_open(oldrelid, > > > > > - ShareUpdateExclusiveLock); > > > > > + ShareRowExclusiveLock); > > > > > > > > > > Isn't it better to lock the required relations in RemovePublicationRelById()? > > > > > > > > > > > > > On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a > > > > patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly > > > > added tests. It isn't worth adding this much extra time for one bug > > > > fix. Can we combine table and schema tests into one single test and > > > > avoid inheritance table tests as the code for those will mostly follow > > > > the same path as a regular table? > > > > > > Yes, that is better. The attached v6 version patch has the changes for the same. > > > The patch also addresses the comments from [1]. > > > > > > > Thanks, I don't see any noticeable difference in test timing with new > > tests. I have slightly modified the comments in the attached diff > > patch (please rename it to .patch). > > > > BTW, I noticed that we don't take any table-level locks for Create > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > a similar problem? I haven't tested so not sure but even if there is a > > problem for the Create case, it should lead to some ERROR like missing > > publication. > > I tested these scenarios, and as you expected, it throws an error for > the create publication case: > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > data from WAL stream: ERROR: publication "pub1" does not exist > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > callback, associated LSN 0/1510CD8 > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > "logical replication apply worker" (PID 481526) exited with exit code > 1 > > The steps for this process are as follows: > 1) Create tables in both the publisher and subscriber. > 2) On the publisher: Create a replication slot. > 3) On the subscriber: Create a subscription using the slot created by > the publisher. > 4) On the publisher: > 4.a) Session 1: BEGIN; INSERT INTO T1; > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > 4.c) Session 1: COMMIT; > > Since we are throwing out a "publication does not exist" error, there > is no inconsistency issue here. > > However, an issue persists with DROP ALL TABLES publication, where > data continues to replicate even after the publication is dropped. > This happens because the open transaction consumes the invalidation, > causing the publications to be revalidated using old snapshot. 
As a > result, both the open transactions and the subsequent transactions are > getting replicated. > > We can reproduce this issue by following these steps in a logical > replication setup with an "ALL TABLES" publication: > On the publisher: > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > In another session on the publisher: > Session 2: DROP PUBLICATION > Back in Session 1 on the publisher: > COMMIT; > Finally, in Session 1 on the publisher: > INSERT INTO T1 VALUES (val2); > > Even after dropping the publication, both val1 and val2 are still > being replicated to the subscriber. This means that both the > in-progress concurrent transaction and the subsequent transactions are > being replicated. > Hi, I tried the 'DROP PUBLICATION' command even for a publication with a single table. And there also the data continues to get replicated. To test this, I did a similar experiment as the above but instead of creating publication on all tables, I did it for one specific table. Here are the steps : 1. Create table test_1 and test_2 on both the publisher and subscriber instances. 2. Create publication p for table test_1 on the publisher. 3. Create a subscription s which subscribes to p. 4. On the publisher 4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1); 4b) Session 2 : DROP PUBLICATION p; 4c) Session 1 : Commit; 5. On the publisher : INSERT INTO test_1 VALUES(val2); After these, when I check the subscriber, both val1 and val2 have been replicated. I tried a few more inserts on publisher after this and they all got replicated to the subscriber. Only after explicitly creating a new publication p2 for test_1 on the publisher, the replication stopped. Most likely because the create publication command invalidated the cache. My guess is that this issue probably comes from the fact that RemoveObjects in dropcmds.c doesn't do any special handling or invalidation for the object drop command. Please let me know if I'm missing something in my setup or if my understanding of the drop commands is wrong. Thanks
On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > I tested these scenarios, and as you expected, it throws an error for > > the create publication case: > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > data from WAL stream: ERROR: publication "pub1" does not exist > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > callback, associated LSN 0/1510CD8 > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > "logical replication apply worker" (PID 481526) exited with exit code > > 1 > > > > The steps for this process are as follows: > > 1) Create tables in both the publisher and subscriber. > > 2) On the publisher: Create a replication slot. > > 3) On the subscriber: Create a subscription using the slot created by > > the publisher. > > 4) On the publisher: > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > 4.c) Session 1: COMMIT; > > > > Since we are throwing out a "publication does not exist" error, there > > is no inconsistency issue here. > > > > However, an issue persists with DROP ALL TABLES publication, where > > data continues to replicate even after the publication is dropped. > > This happens because the open transaction consumes the invalidation, > > causing the publications to be revalidated using old snapshot. As a > > result, both the open transactions and the subsequent transactions are > > getting replicated. > > > > We can reproduce this issue by following these steps in a logical > > replication setup with an "ALL TABLES" publication: > > On the publisher: > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > In another session on the publisher: > > Session 2: DROP PUBLICATION > > Back in Session 1 on the publisher: > > COMMIT; > > Finally, in Session 1 on the publisher: > > INSERT INTO T1 VALUES (val2); > > > > Even after dropping the publication, both val1 and val2 are still > > being replicated to the subscriber. This means that both the > > in-progress concurrent transaction and the subsequent transactions are > > being replicated. > > > > Hi, > > I tried the 'DROP PUBLICATION' command even for a publication with a > single table. And there also the data continues to get replicated. > > To test this, I did a similar experiment as the above but instead of > creating publication on all tables, I did it for one specific table. > > Here are the steps : > 1. Create table test_1 and test_2 on both the publisher and subscriber > instances. > 2. Create publication p for table test_1 on the publisher. > 3. Create a subscription s which subscribes to p. > 4. On the publisher > 4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1); > 4b) Session 2 : DROP PUBLICATION p; > 4c) Session 1 : Commit; > 5. On the publisher : INSERT INTO test_1 VALUES(val2); > > After these, when I check the subscriber, both val1 and val2 have been > replicated. I tried a few more inserts on publisher after this and > they all got replicated to the subscriber. Only after explicitly > creating a new publication p2 for test_1 on the publisher, the > replication stopped. Most likely because the create publication > command invalidated the cache. > > My guess is that this issue probably comes from the fact that > RemoveObjects in dropcmds.c doesn't do any special handling or > invalidation for the object drop command. 
> I checked further and I see that RemovePublicationById does do cache invalidation but it is only done in the scenario when the publication is on all tables. This is done without taking any locks. But for the other cases (eg. publication on one table), I don't see any cache invalidation in RemovePublicationById. That would explain why the replication kept happening for multiple transactions after the drop publication command in my example.. Thanks & Regards Nitin Motiani Google
On Thu, Jul 18, 2024 at 3:25 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > the create publication case: > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > callback, associated LSN 0/1510CD8 > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > "logical replication apply worker" (PID 481526) exited with exit code > > > 1 > > > > > > The steps for this process are as follows: > > > 1) Create tables in both the publisher and subscriber. > > > 2) On the publisher: Create a replication slot. > > > 3) On the subscriber: Create a subscription using the slot created by > > > the publisher. > > > 4) On the publisher: > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > 4.c) Session 1: COMMIT; > > > > > > Since we are throwing out a "publication does not exist" error, there > > > is no inconsistency issue here. > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > data continues to replicate even after the publication is dropped. > > > This happens because the open transaction consumes the invalidation, > > > causing the publications to be revalidated using old snapshot. As a > > > result, both the open transactions and the subsequent transactions are > > > getting replicated. > > > > > > We can reproduce this issue by following these steps in a logical > > > replication setup with an "ALL TABLES" publication: > > > On the publisher: > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > In another session on the publisher: > > > Session 2: DROP PUBLICATION > > > Back in Session 1 on the publisher: > > > COMMIT; > > > Finally, in Session 1 on the publisher: > > > INSERT INTO T1 VALUES (val2); > > > > > > Even after dropping the publication, both val1 and val2 are still > > > being replicated to the subscriber. This means that both the > > > in-progress concurrent transaction and the subsequent transactions are > > > being replicated. > > > > > > > Hi, > > > > I tried the 'DROP PUBLICATION' command even for a publication with a > > single table. And there also the data continues to get replicated. > > > > To test this, I did a similar experiment as the above but instead of > > creating publication on all tables, I did it for one specific table. > > > > Here are the steps : > > 1. Create table test_1 and test_2 on both the publisher and subscriber > > instances. > > 2. Create publication p for table test_1 on the publisher. > > 3. Create a subscription s which subscribes to p. > > 4. On the publisher > > 4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1); > > 4b) Session 2 : DROP PUBLICATION p; > > 4c) Session 1 : Commit; > > 5. On the publisher : INSERT INTO test_1 VALUES(val2); > > > > After these, when I check the subscriber, both val1 and val2 have been > > replicated. I tried a few more inserts on publisher after this and > > they all got replicated to the subscriber. Only after explicitly > > creating a new publication p2 for test_1 on the publisher, the > > replication stopped. Most likely because the create publication > > command invalidated the cache. 
> > > > My guess is that this issue probably comes from the fact that > > RemoveObjects in dropcmds.c doesn't do any special handling or > > invalidation for the object drop command. > > > > I checked further and I see that RemovePublicationById does do cache > invalidation but it is only done in the scenario when the publication > is on all tables. This is done without taking any locks. But for the > other cases (eg. publication on one table), I don't see any cache > invalidation in RemovePublicationById. That would explain why the > replication kept happening for multiple transactions after the drop > publication command in my example.. > Sorry, I missed that for the individual table scenario, the invalidation would happen in RemovePublicationRelById. That is invalidating the cache for all relids. But this is also not taking any locks. So that would explain why dropping the publication on a single table doesn't invalidate the cache in an ongoing transaction. I'm not sure why the replication kept happening even in subsequent transactions. Either way I think the SRE lock should be taken for all relids in that function also before the invalidations. Thanks & Regards Nitin Motiani Google
On Thu, Jul 18, 2024 at 3:30 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > On Thu, Jul 18, 2024 at 3:25 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > > On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > > the create publication case: > > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > > callback, associated LSN 0/1510CD8 > > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > > "logical replication apply worker" (PID 481526) exited with exit code > > > > 1 > > > > > > > > The steps for this process are as follows: > > > > 1) Create tables in both the publisher and subscriber. > > > > 2) On the publisher: Create a replication slot. > > > > 3) On the subscriber: Create a subscription using the slot created by > > > > the publisher. > > > > 4) On the publisher: > > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > > 4.c) Session 1: COMMIT; > > > > > > > > Since we are throwing out a "publication does not exist" error, there > > > > is no inconsistency issue here. > > > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > > data continues to replicate even after the publication is dropped. > > > > This happens because the open transaction consumes the invalidation, > > > > causing the publications to be revalidated using old snapshot. As a > > > > result, both the open transactions and the subsequent transactions are > > > > getting replicated. > > > > > > > > We can reproduce this issue by following these steps in a logical > > > > replication setup with an "ALL TABLES" publication: > > > > On the publisher: > > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > > In another session on the publisher: > > > > Session 2: DROP PUBLICATION > > > > Back in Session 1 on the publisher: > > > > COMMIT; > > > > Finally, in Session 1 on the publisher: > > > > INSERT INTO T1 VALUES (val2); > > > > > > > > Even after dropping the publication, both val1 and val2 are still > > > > being replicated to the subscriber. This means that both the > > > > in-progress concurrent transaction and the subsequent transactions are > > > > being replicated. > > > > > > > > > > Hi, > > > > > > I tried the 'DROP PUBLICATION' command even for a publication with a > > > single table. And there also the data continues to get replicated. > > > > > > To test this, I did a similar experiment as the above but instead of > > > creating publication on all tables, I did it for one specific table. > > > > > > Here are the steps : > > > 1. Create table test_1 and test_2 on both the publisher and subscriber > > > instances. > > > 2. Create publication p for table test_1 on the publisher. > > > 3. Create a subscription s which subscribes to p. > > > 4. On the publisher > > > 4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1); > > > 4b) Session 2 : DROP PUBLICATION p; > > > 4c) Session 1 : Commit; > > > 5. On the publisher : INSERT INTO test_1 VALUES(val2); > > > > > > After these, when I check the subscriber, both val1 and val2 have been > > > replicated. 
I tried a few more inserts on publisher after this and > > > they all got replicated to the subscriber. Only after explicitly > > > creating a new publication p2 for test_1 on the publisher, the > > > replication stopped. Most likely because the create publication > > > command invalidated the cache. > > > > > > My guess is that this issue probably comes from the fact that > > > RemoveObjects in dropcmds.c doesn't do any special handling or > > > invalidation for the object drop command. > > > > > > > I checked further and I see that RemovePublicationById does do cache > > invalidation but it is only done in the scenario when the publication > > is on all tables. This is done without taking any locks. But for the > > other cases (eg. publication on one table), I don't see any cache > > invalidation in RemovePublicationById. That would explain why the > > replication kept happening for multiple transactions after the drop > > publication command in my example.. > > > > Sorry, I missed that for the individual table scenario, the > invalidation would happen in RemovePublicationRelById. That is > invalidating the cache for all relids. But this is also not taking any > locks. So that would explain why dropping the publication on a single > table doesn't invalidate the cache in an ongoing transaction. I'm not > sure why the replication kept happening even in subsequent > transactions. > > Either way I think the SRE lock should be taken for all relids in that > function also before the invalidations. > My apologies. I wasn't testing with the latest patch. I see this has already been done in the v6 patch file. Thanks & Regards Nitin Motiani Google
On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > BTW, I noticed that we don't take any table-level locks for Create > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > a similar problem? I haven't tested so not sure but even if there is a > > problem for the Create case, it should lead to some ERROR like missing > > publication. > > I tested these scenarios, and as you expected, it throws an error for > the create publication case: > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > data from WAL stream: ERROR: publication "pub1" does not exist > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > callback, associated LSN 0/1510CD8 > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > "logical replication apply worker" (PID 481526) exited with exit code > 1 > > The steps for this process are as follows: > 1) Create tables in both the publisher and subscriber. > 2) On the publisher: Create a replication slot. > 3) On the subscriber: Create a subscription using the slot created by > the publisher. > 4) On the publisher: > 4.a) Session 1: BEGIN; INSERT INTO T1; > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > 4.c) Session 1: COMMIT; > > Since we are throwing out a "publication does not exist" error, there > is no inconsistency issue here. > > However, an issue persists with DROP ALL TABLES publication, where > data continues to replicate even after the publication is dropped. > This happens because the open transaction consumes the invalidation, > causing the publications to be revalidated using old snapshot. As a > result, both the open transactions and the subsequent transactions are > getting replicated. > > We can reproduce this issue by following these steps in a logical > replication setup with an "ALL TABLES" publication: > On the publisher: > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > In another session on the publisher: > Session 2: DROP PUBLICATION > Back in Session 1 on the publisher: > COMMIT; > Finally, in Session 1 on the publisher: > INSERT INTO T1 VALUES (val2); > > Even after dropping the publication, both val1 and val2 are still > being replicated to the subscriber. This means that both the > in-progress concurrent transaction and the subsequent transactions are > being replicated. > > I don't think locking all tables is a viable solution in this case, as > it would require asking the user to refrain from performing any > operations on any of the tables in the database while creating a > publication. > Indeed, locking all tables in the database to prevent concurrent DMLs for this scenario also looks odd to me. The other alternative previously suggested by Andres is to distribute catalog modifying transactions to all concurrent in-progress transactions [1] but as mentioned this could add an overhead. One possibility to reduce overhead is that we selectively distribute invalidations for catalogs-related publications but I haven't analyzed the feasibility. We need more opinions to decide here, so let me summarize the problem and solutions discussed. As explained with an example in an email [1], the problem related to logical decoding is that it doesn't process invalidations corresponding to DDLs for the already in-progress transactions. 
We discussed preventing DMLs in the first place when concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in progress. The solution discussed was to acquire ShareRowExclusiveLock (rather than the ShareUpdateExclusiveLock taken today) for all the tables being added via such commands. Further analysis revealed that the same handling is required for ALTER PUBLICATION ... ADD TABLES IN SCHEMA, which means locking all the tables in the specified schemas. Then DROP PUBLICATION also seems to have similar symptoms, which means that in the worst case (where the publication is FOR ALL TABLES) we would have to lock all the tables in the database. We are not sure that is acceptable, so the other alternative we can pursue is to distribute invalidations in the logical decoding infrastructure [1], which has its own downsides. Thoughts? [1] - https://www.postgresql.org/message-id/20231118025445.crhaeeuvoe2g5dv6%40awork3.anarazel.de -- With Regards, Amit Kapila.
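The reason the stronger lock prevents the problem is simply the lock conflict table: ShareRowExclusiveLock conflicts with the RowExclusiveLock taken by INSERT/UPDATE/DELETE, while ShareUpdateExclusiveLock does not. A quick way to see the difference (t1 is an arbitrary table):

    -- Session 1: simulate a concurrent write
    BEGIN;
    INSERT INTO t1 VALUES (1);                      -- holds ROW EXCLUSIVE on t1

    -- Session 2
    BEGIN;
    LOCK TABLE t1 IN SHARE UPDATE EXCLUSIVE MODE;   -- old level: granted immediately
    ROLLBACK;

    BEGIN;
    LOCK TABLE t1 IN SHARE ROW EXCLUSIVE MODE;      -- new level: blocks until session 1 ends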
On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > BTW, I noticed that we don't take any table-level locks for Create > > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > > a similar problem? I haven't tested so not sure but even if there is a > > > problem for the Create case, it should lead to some ERROR like missing > > > publication. > > > > I tested these scenarios, and as you expected, it throws an error for > > the create publication case: > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > data from WAL stream: ERROR: publication "pub1" does not exist > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > callback, associated LSN 0/1510CD8 > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > "logical replication apply worker" (PID 481526) exited with exit code > > 1 > > > > The steps for this process are as follows: > > 1) Create tables in both the publisher and subscriber. > > 2) On the publisher: Create a replication slot. > > 3) On the subscriber: Create a subscription using the slot created by > > the publisher. > > 4) On the publisher: > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > 4.c) Session 1: COMMIT; > > > > Since we are throwing out a "publication does not exist" error, there > > is no inconsistency issue here. > > > > However, an issue persists with DROP ALL TABLES publication, where > > data continues to replicate even after the publication is dropped. > > This happens because the open transaction consumes the invalidation, > > causing the publications to be revalidated using old snapshot. As a > > result, both the open transactions and the subsequent transactions are > > getting replicated. > > > > We can reproduce this issue by following these steps in a logical > > replication setup with an "ALL TABLES" publication: > > On the publisher: > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > In another session on the publisher: > > Session 2: DROP PUBLICATION > > Back in Session 1 on the publisher: > > COMMIT; > > Finally, in Session 1 on the publisher: > > INSERT INTO T1 VALUES (val2); > > > > Even after dropping the publication, both val1 and val2 are still > > being replicated to the subscriber. This means that both the > > in-progress concurrent transaction and the subsequent transactions are > > being replicated. > > > > I don't think locking all tables is a viable solution in this case, as > > it would require asking the user to refrain from performing any > > operations on any of the tables in the database while creating a > > publication. > > > > Indeed, locking all tables in the database to prevent concurrent DMLs > for this scenario also looks odd to me. The other alternative > previously suggested by Andres is to distribute catalog modifying > transactions to all concurrent in-progress transactions [1] but as > mentioned this could add an overhead. One possibility to reduce > overhead is that we selectively distribute invalidations for > catalogs-related publications but I haven't analyzed the feasibility. > > We need more opinions to decide here, so let me summarize the problem > and solutions discussed. 
As explained with an example in an email [1], > the problem related to logical decoding is that it doesn't process > invalidations corresponding to DDLs for the already in-progress > transactions. We discussed preventing DMLs in the first place when > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in > progress. The solution discussed was to acquire > ShareUpdateExclusiveLock for all the tables being added via such > commands. Further analysis revealed that the same handling is required > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all > the tables in the specified schemas. Then DROP PUBLICATION also seems > to have similar symptoms which means in the worst case (where > publication is for ALL TABLES) we have to lock all the tables in the > database. We are not sure if that is good so the other alternative we > can pursue is to distribute invalidations in logical decoding > infrastructure [1] which has its downsides. > > Thoughts? Thank you for summarizing the problem and solutions! I think it's worth trying the idea of distributing invalidation messages, and we will see if there could be overheads or any further obstacles. IIUC this approach would resolve another issue we discussed before too[1]. Regards, [1] https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > BTW, I noticed that we don't take any table-level locks for Create > > > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > > > a similar problem? I haven't tested so not sure but even if there is a > > > > problem for the Create case, it should lead to some ERROR like missing > > > > publication. > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > the create publication case: > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > callback, associated LSN 0/1510CD8 > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > "logical replication apply worker" (PID 481526) exited with exit code > > > 1 > > > > > > The steps for this process are as follows: > > > 1) Create tables in both the publisher and subscriber. > > > 2) On the publisher: Create a replication slot. > > > 3) On the subscriber: Create a subscription using the slot created by > > > the publisher. > > > 4) On the publisher: > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > 4.c) Session 1: COMMIT; > > > > > > Since we are throwing out a "publication does not exist" error, there > > > is no inconsistency issue here. > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > data continues to replicate even after the publication is dropped. > > > This happens because the open transaction consumes the invalidation, > > > causing the publications to be revalidated using old snapshot. As a > > > result, both the open transactions and the subsequent transactions are > > > getting replicated. > > > > > > We can reproduce this issue by following these steps in a logical > > > replication setup with an "ALL TABLES" publication: > > > On the publisher: > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > In another session on the publisher: > > > Session 2: DROP PUBLICATION > > > Back in Session 1 on the publisher: > > > COMMIT; > > > Finally, in Session 1 on the publisher: > > > INSERT INTO T1 VALUES (val2); > > > > > > Even after dropping the publication, both val1 and val2 are still > > > being replicated to the subscriber. This means that both the > > > in-progress concurrent transaction and the subsequent transactions are > > > being replicated. > > > > > > I don't think locking all tables is a viable solution in this case, as > > > it would require asking the user to refrain from performing any > > > operations on any of the tables in the database while creating a > > > publication. > > > > > > > Indeed, locking all tables in the database to prevent concurrent DMLs > > for this scenario also looks odd to me. The other alternative > > previously suggested by Andres is to distribute catalog modifying > > transactions to all concurrent in-progress transactions [1] but as > > mentioned this could add an overhead. 
One possibility to reduce > > overhead is that we selectively distribute invalidations for > > catalogs-related publications but I haven't analyzed the feasibility. > > > > We need more opinions to decide here, so let me summarize the problem > > and solutions discussed. As explained with an example in an email [1], > > the problem related to logical decoding is that it doesn't process > > invalidations corresponding to DDLs for the already in-progress > > transactions. We discussed preventing DMLs in the first place when > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in > > progress. The solution discussed was to acquire > > ShareUpdateExclusiveLock for all the tables being added via such > > commands. Further analysis revealed that the same handling is required > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all > > the tables in the specified schemas. Then DROP PUBLICATION also seems > > to have similar symptoms which means in the worst case (where > > publication is for ALL TABLES) we have to lock all the tables in the > > database. We are not sure if that is good so the other alternative we > > can pursue is to distribute invalidations in logical decoding > > infrastructure [1] which has its downsides. > > > > Thoughts? > > Thank you for summarizing the problem and solutions! > > I think it's worth trying the idea of distributing invalidation > messages, and we will see if there could be overheads or any further > obstacles. IIUC this approach would resolve another issue we discussed > before too[1]. > Yes, and we also discussed having a similar solution at the time when that problem was reported. So, it is clear that even though locking tables can work for commands alter ALTER PUBLICATION ... ADD TABLE ..., we need a solution for distributing invalidations to the in-progress transactions during logical decoding for other cases as reported by you previously. Thanks for looking into this. -- With Regards, Amit Kapila.
On Wed, 31 Jul 2024 at 09:36, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > BTW, I noticed that we don't take any table-level locks for Create > > > > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > > > > a similar problem? I haven't tested so not sure but even if there is a > > > > > problem for the Create case, it should lead to some ERROR like missing > > > > > publication. > > > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > > the create publication case: > > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > > callback, associated LSN 0/1510CD8 > > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > > "logical replication apply worker" (PID 481526) exited with exit code > > > > 1 > > > > > > > > The steps for this process are as follows: > > > > 1) Create tables in both the publisher and subscriber. > > > > 2) On the publisher: Create a replication slot. > > > > 3) On the subscriber: Create a subscription using the slot created by > > > > the publisher. > > > > 4) On the publisher: > > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > > 4.c) Session 1: COMMIT; > > > > > > > > Since we are throwing out a "publication does not exist" error, there > > > > is no inconsistency issue here. > > > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > > data continues to replicate even after the publication is dropped. > > > > This happens because the open transaction consumes the invalidation, > > > > causing the publications to be revalidated using old snapshot. As a > > > > result, both the open transactions and the subsequent transactions are > > > > getting replicated. > > > > > > > > We can reproduce this issue by following these steps in a logical > > > > replication setup with an "ALL TABLES" publication: > > > > On the publisher: > > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > > In another session on the publisher: > > > > Session 2: DROP PUBLICATION > > > > Back in Session 1 on the publisher: > > > > COMMIT; > > > > Finally, in Session 1 on the publisher: > > > > INSERT INTO T1 VALUES (val2); > > > > > > > > Even after dropping the publication, both val1 and val2 are still > > > > being replicated to the subscriber. This means that both the > > > > in-progress concurrent transaction and the subsequent transactions are > > > > being replicated. > > > > > > > > I don't think locking all tables is a viable solution in this case, as > > > > it would require asking the user to refrain from performing any > > > > operations on any of the tables in the database while creating a > > > > publication. > > > > > > > > > > Indeed, locking all tables in the database to prevent concurrent DMLs > > > for this scenario also looks odd to me. 
The other alternative > > > previously suggested by Andres is to distribute catalog modifying > > > transactions to all concurrent in-progress transactions [1] but as > > > mentioned this could add an overhead. One possibility to reduce > > > overhead is that we selectively distribute invalidations for > > > catalogs-related publications but I haven't analyzed the feasibility. > > > > > > We need more opinions to decide here, so let me summarize the problem > > > and solutions discussed. As explained with an example in an email [1], > > > the problem related to logical decoding is that it doesn't process > > > invalidations corresponding to DDLs for the already in-progress > > > transactions. We discussed preventing DMLs in the first place when > > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in > > > progress. The solution discussed was to acquire > > > ShareUpdateExclusiveLock for all the tables being added via such > > > commands. Further analysis revealed that the same handling is required > > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all > > > the tables in the specified schemas. Then DROP PUBLICATION also seems > > > to have similar symptoms which means in the worst case (where > > > publication is for ALL TABLES) we have to lock all the tables in the > > > database. We are not sure if that is good so the other alternative we > > > can pursue is to distribute invalidations in logical decoding > > > infrastructure [1] which has its downsides. > > > > > > Thoughts? > > > > Thank you for summarizing the problem and solutions! > > > > I think it's worth trying the idea of distributing invalidation > > messages, and we will see if there could be overheads or any further > > obstacles. IIUC this approach would resolve another issue we discussed > > before too[1]. > > > > Yes, and we also discussed having a similar solution at the time when > that problem was reported. So, it is clear that even though locking > tables can work for commands alter ALTER PUBLICATION ... ADD TABLE > ..., we need a solution for distributing invalidations to the > in-progress transactions during logical decoding for other cases as > reported by you previously. > > Thanks for looking into this. > Thanks, I am working on to implement a solution for distributing invalidations. Will share a patch for the same. Thanks and Regards, Shlok Kyal
On Wed, 31 Jul 2024 at 11:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Wed, 31 Jul 2024 at 09:36, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > > > BTW, I noticed that we don't take any table-level locks for Create > > > > > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > > > > > a similar problem? I haven't tested so not sure but even if there is a > > > > > > problem for the Create case, it should lead to some ERROR like missing > > > > > > publication. > > > > > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > > > the create publication case: > > > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > > > callback, associated LSN 0/1510CD8 > > > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > > > "logical replication apply worker" (PID 481526) exited with exit code > > > > > 1 > > > > > > > > > > The steps for this process are as follows: > > > > > 1) Create tables in both the publisher and subscriber. > > > > > 2) On the publisher: Create a replication slot. > > > > > 3) On the subscriber: Create a subscription using the slot created by > > > > > the publisher. > > > > > 4) On the publisher: > > > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > > > 4.c) Session 1: COMMIT; > > > > > > > > > > Since we are throwing out a "publication does not exist" error, there > > > > > is no inconsistency issue here. > > > > > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > > > data continues to replicate even after the publication is dropped. > > > > > This happens because the open transaction consumes the invalidation, > > > > > causing the publications to be revalidated using old snapshot. As a > > > > > result, both the open transactions and the subsequent transactions are > > > > > getting replicated. > > > > > > > > > > We can reproduce this issue by following these steps in a logical > > > > > replication setup with an "ALL TABLES" publication: > > > > > On the publisher: > > > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > > > In another session on the publisher: > > > > > Session 2: DROP PUBLICATION > > > > > Back in Session 1 on the publisher: > > > > > COMMIT; > > > > > Finally, in Session 1 on the publisher: > > > > > INSERT INTO T1 VALUES (val2); > > > > > > > > > > Even after dropping the publication, both val1 and val2 are still > > > > > being replicated to the subscriber. This means that both the > > > > > in-progress concurrent transaction and the subsequent transactions are > > > > > being replicated. 
> > > > > > > > > > I don't think locking all tables is a viable solution in this case, as > > > > > it would require asking the user to refrain from performing any > > > > > operations on any of the tables in the database while creating a > > > > > publication. > > > > > > > > > > > > > Indeed, locking all tables in the database to prevent concurrent DMLs > > > > for this scenario also looks odd to me. The other alternative > > > > previously suggested by Andres is to distribute catalog modifying > > > > transactions to all concurrent in-progress transactions [1] but as > > > > mentioned this could add an overhead. One possibility to reduce > > > > overhead is that we selectively distribute invalidations for > > > > catalogs-related publications but I haven't analyzed the feasibility. > > > > > > > > We need more opinions to decide here, so let me summarize the problem > > > > and solutions discussed. As explained with an example in an email [1], > > > > the problem related to logical decoding is that it doesn't process > > > > invalidations corresponding to DDLs for the already in-progress > > > > transactions. We discussed preventing DMLs in the first place when > > > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in > > > > progress. The solution discussed was to acquire > > > > ShareUpdateExclusiveLock for all the tables being added via such > > > > commands. Further analysis revealed that the same handling is required > > > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all > > > > the tables in the specified schemas. Then DROP PUBLICATION also seems > > > > to have similar symptoms which means in the worst case (where > > > > publication is for ALL TABLES) we have to lock all the tables in the > > > > database. We are not sure if that is good so the other alternative we > > > > can pursue is to distribute invalidations in logical decoding > > > > infrastructure [1] which has its downsides. > > > > > > > > Thoughts? > > > > > > Thank you for summarizing the problem and solutions! > > > > > > I think it's worth trying the idea of distributing invalidation > > > messages, and we will see if there could be overheads or any further > > > obstacles. IIUC this approach would resolve another issue we discussed > > > before too[1]. > > > > > > > Yes, and we also discussed having a similar solution at the time when > > that problem was reported. So, it is clear that even though locking > > tables can work for commands alter ALTER PUBLICATION ... ADD TABLE > > ..., we need a solution for distributing invalidations to the > > in-progress transactions during logical decoding for other cases as > > reported by you previously. > > > > Thanks for looking into this. > > > > Thanks, I am working on to implement a solution for distributing > invalidations. Will share a patch for the same. Created a patch for distributing invalidations. Here we collect the invalidation messages for the current transaction and distribute it to all the inprogress transactions, whenever we are distributing the snapshots..Thoughts? Thanks and Regards, Shlok Kyal
On Thu, 8 Aug 2024 at 16:24, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Wed, 31 Jul 2024 at 11:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > On Wed, 31 Jul 2024 at 09:36, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > > > > > BTW, I noticed that we don't take any table-level locks for Create > > > > > > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > > > > > > a similar problem? I haven't tested so not sure but even if there is a > > > > > > > problem for the Create case, it should lead to some ERROR like missing > > > > > > > publication. > > > > > > > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > > > > the create publication case: > > > > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > > > > callback, associated LSN 0/1510CD8 > > > > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > > > > "logical replication apply worker" (PID 481526) exited with exit code > > > > > > 1 > > > > > > > > > > > > The steps for this process are as follows: > > > > > > 1) Create tables in both the publisher and subscriber. > > > > > > 2) On the publisher: Create a replication slot. > > > > > > 3) On the subscriber: Create a subscription using the slot created by > > > > > > the publisher. > > > > > > 4) On the publisher: > > > > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > > > > 4.c) Session 1: COMMIT; > > > > > > > > > > > > Since we are throwing out a "publication does not exist" error, there > > > > > > is no inconsistency issue here. > > > > > > > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > > > > data continues to replicate even after the publication is dropped. > > > > > > This happens because the open transaction consumes the invalidation, > > > > > > causing the publications to be revalidated using old snapshot. As a > > > > > > result, both the open transactions and the subsequent transactions are > > > > > > getting replicated. > > > > > > > > > > > > We can reproduce this issue by following these steps in a logical > > > > > > replication setup with an "ALL TABLES" publication: > > > > > > On the publisher: > > > > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > > > > In another session on the publisher: > > > > > > Session 2: DROP PUBLICATION > > > > > > Back in Session 1 on the publisher: > > > > > > COMMIT; > > > > > > Finally, in Session 1 on the publisher: > > > > > > INSERT INTO T1 VALUES (val2); > > > > > > > > > > > > Even after dropping the publication, both val1 and val2 are still > > > > > > being replicated to the subscriber. This means that both the > > > > > > in-progress concurrent transaction and the subsequent transactions are > > > > > > being replicated. 
> > > > > > > > > > > > I don't think locking all tables is a viable solution in this case, as > > > > > > it would require asking the user to refrain from performing any > > > > > > operations on any of the tables in the database while creating a > > > > > > publication. > > > > > > > > > > > > > > > > Indeed, locking all tables in the database to prevent concurrent DMLs > > > > > for this scenario also looks odd to me. The other alternative > > > > > previously suggested by Andres is to distribute catalog modifying > > > > > transactions to all concurrent in-progress transactions [1] but as > > > > > mentioned this could add an overhead. One possibility to reduce > > > > > overhead is that we selectively distribute invalidations for > > > > > catalogs-related publications but I haven't analyzed the feasibility. > > > > > > > > > > We need more opinions to decide here, so let me summarize the problem > > > > > and solutions discussed. As explained with an example in an email [1], > > > > > the problem related to logical decoding is that it doesn't process > > > > > invalidations corresponding to DDLs for the already in-progress > > > > > transactions. We discussed preventing DMLs in the first place when > > > > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in > > > > > progress. The solution discussed was to acquire > > > > > ShareUpdateExclusiveLock for all the tables being added via such > > > > > commands. Further analysis revealed that the same handling is required > > > > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all > > > > > the tables in the specified schemas. Then DROP PUBLICATION also seems > > > > > to have similar symptoms which means in the worst case (where > > > > > publication is for ALL TABLES) we have to lock all the tables in the > > > > > database. We are not sure if that is good so the other alternative we > > > > > can pursue is to distribute invalidations in logical decoding > > > > > infrastructure [1] which has its downsides. > > > > > > > > > > Thoughts? > > > > > > > > Thank you for summarizing the problem and solutions! > > > > > > > > I think it's worth trying the idea of distributing invalidation > > > > messages, and we will see if there could be overheads or any further > > > > obstacles. IIUC this approach would resolve another issue we discussed > > > > before too[1]. > > > > > > > > > > Yes, and we also discussed having a similar solution at the time when > > > that problem was reported. So, it is clear that even though locking > > > tables can work for commands alter ALTER PUBLICATION ... ADD TABLE > > > ..., we need a solution for distributing invalidations to the > > > in-progress transactions during logical decoding for other cases as > > > reported by you previously. > > > > > > Thanks for looking into this. > > > > > > > Thanks, I am working on to implement a solution for distributing > > invalidations. Will share a patch for the same. > > Created a patch for distributing invalidations. > Here we collect the invalidation messages for the current transaction > and distribute it to all the inprogress transactions, whenever we are > distributing the snapshots..Thoughts? In the v7 patch, I am looping through the reorder buffer of the current committed transaction and storing all invalidation messages in a list. Then I am distributing those invalidations. But I found that for a transaction we already store all the invalidation messages (see [1]). 
So we don't need to loop through the reorder buffer and store the invalidations. I have modified the patch accordingly and attached the same. [1]: https://github.com/postgres/postgres/blob/7da1bdc2c2f17038f2ae1900be90a0d7b5e361e0/src/include/replication/reorderbuffer.h#L384 Thanks and Regards, Shlok Kyal
On Thu, Aug 15, 2024 at 9:31 PM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 8 Aug 2024 at 16:24, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > On Wed, 31 Jul 2024 at 11:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > > > > Created a patch for distributing invalidations. > > Here we collect the invalidation messages for the current transaction > > and distribute it to all the inprogress transactions, whenever we are > > distributing the snapshots..Thoughts? > > Since we are applying invalidations to all in-progress transactions, > the publisher will only replicate half of the transaction data up to > the point of invalidation, while the remaining half will not be > replicated. > Ex: > Session1: > BEGIN; > INSERT INTO tab_conc VALUES (1); > > Session2: > ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc; > > Session1: > INSERT INTO tab_conc VALUES (2); > INSERT INTO tab_conc VALUES (3); > COMMIT; > > After the above the subscriber data looks like: > postgres=# select * from tab_conc ; > a > --- > 1 > (1 row) > > You can reproduce the issue using the attached test. > I'm not sure if this behavior is ok. At present, we’ve replicated the > first record within the same transaction, but the second and third > records are being skipped. > This can happen even without a concurrent DDL if some of the tables in the database are part of the publication and others are not. In such a case inserts for publicized tables will be replicated but other inserts won't. Sending the partial data of the transaction isn't a problem to me. Do you have any other concerns that I am missing? > Would it be better to apply invalidations > after the transaction is underway? > But that won't fix the problem reported by Sawada-san in an email [1]. BTW, we should do some performance testing by having a mix of DML and DDLs to see the performance impact of this patch. [1] - https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com -- With Regards, Amit Kapila.
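As a rough sketch of the kind of mixed workload meant here (the table, publication, and script names are only assumptions), the publisher could be driven with two pgbench custom scripts, one issuing DML and one issuing the DDL, while a subscription or pg_logical_slot_get_binary_changes() keeps decoding:

-- dml.sql: plain inserts into a published table
INSERT INTO tab_conc VALUES ((random() * 1000)::int);

-- ddl.sql: repeatedly drop and re-add the table to the publication
ALTER PUBLICATION pub DROP TABLE tab_conc;
ALTER PUBLICATION pub ADD TABLE tab_conc;

Running dml.sql with several clients and ddl.sql from a single client (so the DROP/ADD pair cannot race with itself), and comparing decoding times with and without the patch, should show the cost of distributing the invalidations under concurrency.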
On Tue, 20 Aug 2024 at 16:10, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 15, 2024 at 9:31 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Thu, 8 Aug 2024 at 16:24, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > > > On Wed, 31 Jul 2024 at 11:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > > > > > > > Created a patch for distributing invalidations. > > > Here we collect the invalidation messages for the current transaction > > > and distribute it to all the inprogress transactions, whenever we are > > > distributing the snapshots..Thoughts? > > > > Since we are applying invalidations to all in-progress transactions, > > the publisher will only replicate half of the transaction data up to > > the point of invalidation, while the remaining half will not be > > replicated. > > Ex: > > Session1: > > BEGIN; > > INSERT INTO tab_conc VALUES (1); > > > > Session2: > > ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc; > > > > Session1: > > INSERT INTO tab_conc VALUES (2); > > INSERT INTO tab_conc VALUES (3); > > COMMIT; > > > > After the above the subscriber data looks like: > > postgres=# select * from tab_conc ; > > a > > --- > > 1 > > (1 row) > > > > You can reproduce the issue using the attached test. > > I'm not sure if this behavior is ok. At present, we’ve replicated the > > first record within the same transaction, but the second and third > > records are being skipped. > > > > This can happen even without a concurrent DDL if some of the tables in > the database are part of the publication and others are not. In such a > case inserts for publicized tables will be replicated but other > inserts won't. Sending the partial data of the transaction isn't a > problem to me. Do you have any other concerns that I am missing? My main concern was about sending only part of the data from a transaction table and leaving out the rest. However, since this is happening elsewhere as well, I'm okay with it. Regards, Vignesh
On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > Next I am planning to test solely on the logical decoding side and > will share the results. > Thanks, the next set of proposed tests makes sense to me. It will also be useful to generate some worst-case scenarios where the number of invalidations is larger, to see the distribution cost in such cases. For example, Truncate/Drop a table with 100 or 1000 partitions. -- With Regards, Amit Kapila.
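For reference, a sketch of one such worst case (all names and counts are arbitrary): build a table with 1000 partitions, publish it, and truncate it while another transaction is being decoded, so that the invalidations for every partition have to be distributed.

CREATE TABLE part_tab (id int) PARTITION BY RANGE (id);

DO $$
BEGIN
  FOR i IN 0..999 LOOP
    EXECUTE format(
      'CREATE TABLE part_tab_%s PARTITION OF part_tab FOR VALUES FROM (%s) TO (%s)',
      i, i * 1000, (i + 1) * 1000);
  END LOOP;
END $$;

ALTER PUBLICATION pub ADD TABLE part_tab;

-- In another session: BEGIN; insert into some published table; ... (left open)
TRUNCATE part_tab;   -- emits invalidations for the parent and every partition
-- COMMIT the open transaction, then measure how long decoding takes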
On Tue, Aug 20, 2024 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 15, 2024 at 9:31 PM vignesh C <vignesh21@gmail.com> wrote: > > Since we are applying invalidations to all in-progress transactions, > > the publisher will only replicate half of the transaction data up to > > the point of invalidation, while the remaining half will not be > > replicated. > > Ex: > > Session1: > > BEGIN; > > INSERT INTO tab_conc VALUES (1); > > > > Session2: > > ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc; > > > > Session1: > > INSERT INTO tab_conc VALUES (2); > > INSERT INTO tab_conc VALUES (3); > > COMMIT; > > > > After the above the subscriber data looks like: > > postgres=# select * from tab_conc ; > > a > > --- > > 1 > > (1 row) > > > > You can reproduce the issue using the attached test. > > I'm not sure if this behavior is ok. At present, we’ve replicated the > > first record within the same transaction, but the second and third > > records are being skipped. > > > > This can happen even without a concurrent DDL if some of the tables in > the database are part of the publication and others are not. In such a > case inserts for publicized tables will be replicated but other > inserts won't. Sending the partial data of the transaction isn't a > problem to me. Do you have any other concerns that I am missing? > Hi, I think that the partial data replication for one table is a bigger issue than the case of data being sent for a subset of the tables in the transaction. This can lead to inconsistent data if the same row is updated multiple times or deleted in the same transaction. In such a case if only the partial updates from the transaction are sent to the subscriber, it might end up with the data which was never visible on the publisher side. Here is an example I tried with the patch v8-001 : I created following 2 tables on the publisher and the subscriber : CREATE TABLE delete_test(id int primary key, name varchar(100)); CREATE TABLE update_test(id int primary key, name varchar(100)); I added both the tables to the publication p on the publisher and created a subscription s on the subscriber. I run 2 sessions on the publisher and do the following : Session 1 : BEGIN; INSERT INTO delete_test VALUES(0, 'Nitin'); Session 2 : ALTER PUBLICATION p DROP TABLE delete_test; Session 1 : DELETE FROM delete_test WHERE id=0; COMMIT; After the commit there should be no new row created on the publisher. But because the partial data was replicated, this is what the select on the subscriber shows : SELECT * FROM delete_test; id | name ----+----------- 0 | Nitin (1 row) I don't think the above is a common use case. But this is still an issue because the subscriber has the data which never existed on the publisher. Similar issue can be seen with an update command. Session 1 : BEGIN; INSERT INTO update_test VALUES(1, 'Chiranjiv'); Session 2 : ALTER PUBLICATION p DROP TABLE update_test; Session 1: UPDATE update_test SET name='Eeshan' where id=1; COMMIT; After the commit, this is the state on the publisher : SELECT * FROM update_test; 1 | Eeshan (1 row) While this is the state on the subscriber : SELECT * FROM update_test; 1 | Chiranjiv (1 row) I think the update during a transaction scenario might be more common than deletion right after insertion. But both of these seem like real issues to consider. Please let me know if I'm missing something. Thanks & Regards Nitin Motiani Google
On Mon, Sep 2, 2024 at 9:19 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > I think that the partial data replication for one table is a bigger > issue than the case of data being sent for a subset of the tables in > the transaction. This can lead to inconsistent data if the same row is > updated multiple times or deleted in the same transaction. In such a > case if only the partial updates from the transaction are sent to the > subscriber, it might end up with the data which was never visible on > the publisher side. > > Here is an example I tried with the patch v8-001 : > > I created following 2 tables on the publisher and the subscriber : > > CREATE TABLE delete_test(id int primary key, name varchar(100)); > CREATE TABLE update_test(id int primary key, name varchar(100)); > > I added both the tables to the publication p on the publisher and > created a subscription s on the subscriber. > > I run 2 sessions on the publisher and do the following : > > Session 1 : > BEGIN; > INSERT INTO delete_test VALUES(0, 'Nitin'); > > Session 2 : > ALTER PUBLICATION p DROP TABLE delete_test; > > Session 1 : > DELETE FROM delete_test WHERE id=0; > COMMIT; > > After the commit there should be no new row created on the publisher. > But because the partial data was replicated, this is what the select > on the subscriber shows : > > SELECT * FROM delete_test; > id | name > ----+----------- > 0 | Nitin > (1 row) > > I don't think the above is a common use case. But this is still an > issue because the subscriber has the data which never existed on the > publisher. > I don't think that is the correct conclusion because the user has intentionally avoided sending part of the transaction changes. This can happen in various ways without the patch as well. For example, if the user has performed the ALTER in the same transaction. Publisher: ========= BEGIN postgres=*# Insert into delete_test values(0, 'Nitin'); INSERT 0 1 postgres=*# Alter Publication pub1 drop table delete_test; ALTER PUBLICATION postgres=*# Delete from delete_test where id=0; DELETE 1 postgres=*# commit; COMMIT postgres=# select * from delete_test; id | name ----+------ (0 rows) Subscriber: ========= postgres=# select * from delete_test; id | name ----+------- 0 | Nitin (1 row) This can also happen when the user has published only 'inserts' but not 'updates' or 'deletes'. -- With Regards, Amit Kapila.
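For instance, the publish-only-inserts case can be shown directly, reusing the delete_test table from the example above (the publication name here is made up):

CREATE PUBLICATION pub_ins FOR TABLE delete_test WITH (publish = 'insert');

-- On the publisher:
BEGIN;
INSERT INTO delete_test VALUES (0, 'Nitin');
DELETE FROM delete_test WHERE id = 0;
COMMIT;

The publisher ends up with no row, but because only inserts are published, a subscriber of pub_ins keeps (0, 'Nitin'), without any concurrent DDL being involved.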
On Thu, Sep 5, 2024 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Sep 2, 2024 at 9:19 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > > I think that the partial data replication for one table is a bigger > > issue than the case of data being sent for a subset of the tables in > > the transaction. This can lead to inconsistent data if the same row is > > updated multiple times or deleted in the same transaction. In such a > > case if only the partial updates from the transaction are sent to the > > subscriber, it might end up with the data which was never visible on > > the publisher side. > > > > Here is an example I tried with the patch v8-001 : > > > > I created following 2 tables on the publisher and the subscriber : > > > > CREATE TABLE delete_test(id int primary key, name varchar(100)); > > CREATE TABLE update_test(id int primary key, name varchar(100)); > > > > I added both the tables to the publication p on the publisher and > > created a subscription s on the subscriber. > > > > I run 2 sessions on the publisher and do the following : > > > > Session 1 : > > BEGIN; > > INSERT INTO delete_test VALUES(0, 'Nitin'); > > > > Session 2 : > > ALTER PUBLICATION p DROP TABLE delete_test; > > > > Session 1 : > > DELETE FROM delete_test WHERE id=0; > > COMMIT; > > > > After the commit there should be no new row created on the publisher. > > But because the partial data was replicated, this is what the select > > on the subscriber shows : > > > > SELECT * FROM delete_test; > > id | name > > ----+----------- > > 0 | Nitin > > (1 row) > > > > I don't think the above is a common use case. But this is still an > > issue because the subscriber has the data which never existed on the > > publisher. > > > > I don't think that is the correct conclusion because the user has > intentionally avoided sending part of the transaction changes. This > can happen in various ways without the patch as well. For example, if > the user has performed the ALTER in the same transaction. > > Publisher: > ========= > BEGIN > postgres=*# Insert into delete_test values(0, 'Nitin'); > INSERT 0 1 > postgres=*# Alter Publication pub1 drop table delete_test; > ALTER PUBLICATION > postgres=*# Delete from delete_test where id=0; > DELETE 1 > postgres=*# commit; > COMMIT > postgres=# select * from delete_test; > id | name > ----+------ > (0 rows) > > Subscriber: > ========= > postgres=# select * from delete_test; > id | name > ----+------- > 0 | Nitin > (1 row) > > This can also happen when the user has published only 'inserts' but > not 'updates' or 'deletes'. > Thanks for the clarification. I didn't think of this case. The change seems fine if this can already happen. Thanks & Regards Nitin Motiani Google
On Mon, 9 Sept 2024 at 10:41, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Mon, 2 Sept 2024 at 10:12, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > > > Next I am planning to test solely on the logical decoding side and > > > will share the results. > > > > > > > Thanks, the next set of proposed tests makes sense to me. It will also > > be useful to generate some worst-case scenarios where the number of > > invalidations is more to see the distribution cost in such cases. For > > example, Truncate/Drop a table with 100 or 1000 partitions. > > > > -- > > With Regards, > > Amit Kapila. > > Hi, > > I did some performance testing solely on the logical decoding side and > found some degradation in performance, for the following testcase: > 1. Created a publisher on a single table, say 'tab_conc1'; > 2. Created a second publisher on a single table say 'tp'; > 4. two sessions are running in parallel, let's say S1 and S2. > 5. Begin a transaction in S1. > 6. Now in a loop (this loop runs 'count' times): > S1: Insert a row in table 'tab_conc1' > S2: BEGIN; Alter publication DROP/ ADD tp; COMMIT > 7. COMMIT the transaction in S1. > 8. run 'pg_logical_slot_get_binary_changes' to get the decoding changes. > > Observation: > With fix a new entry is added in decoding. During debugging I found > that this entry only comes when we do a 'INSERT' in Session 1 after we > do 'ALTER PUBLICATION' in another session in parallel (or we can say > due to invalidation). Also, I observed that this new entry is related > to sending replica identity, attributes,etc as function > 'logicalrep_write_rel' is called. > > Performance: > We see a performance degradation as we are sending new entries during > logical decoding. Results are an average of 5 runs. > > count | Head (sec) | Fix (sec) | Degradation (%) > ------------------------------------------------------------------------------ > 10000 | 1.298 | 1.574 | 21.26348228 > 50000 | 22.892 | 24.997 | 9.195352088 > 100000 | 88.602 | 93.759 | 5.820410374 > > I have also attached the test script here. > For the above case I tried to investigate the inconsistent degradation and found out that Serialization was happening for a large number of 'count'. So, I tried adjusting 'logical_decoding_work_mem' to a large value, so that we can avoid serialization here. I ran the above performance test again and got the following results: count | Head (sec) | Fix (sec) | Degradation (%) ----------------------------------------------------------------------------------- 10000 | 0.415446 | 0.53596167 | 29.00874482 50000 | 7.950266 | 10.37375567 | 30.48312685 75000 | 17.192372 | 22.246715 | 29.39875312 100000 | 30.555903 | 39.431542 | 29.04721552 These results are an average of 3 runs. Here the degradation is consistent around ~30%. Thanks and Regards, Shlok Kyal
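For clarity, a compressed sketch of one iteration of the measured step above, together with the setting used to keep decoding in memory (the publication and slot names are assumptions):

-- In the session that runs the decoding, to avoid spilling to disk:
SET logical_decoding_work_mem = '2GB';

-- Session 1 (kept open across the whole loop):
BEGIN;
INSERT INTO tab_conc1 VALUES (1);

-- Session 2 (one loop iteration):
BEGIN; ALTER PUBLICATION pub_tp DROP TABLE tp; COMMIT;
BEGIN; ALTER PUBLICATION pub_tp ADD TABLE tp; COMMIT;

-- After COMMIT in session 1, time the decoding:
SELECT count(*) FROM pg_logical_slot_get_binary_changes(
  'test_slot', NULL, NULL,
  'proto_version', '1', 'publication_names', 'pub_conc');

The extra entries mentioned above appear because each invalidation forces the next change for tab_conc1 to resend the relation metadata (logicalrep_write_rel), which is where the regression comes from.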
From: "Hayato Kuroda (Fujitsu)"
Dear Shlok, > Hi, > > I tried to add changes to selectively invalidate the cache to reduce > the performance degradation during the distribution of invalidations. Thanks for improving the patch! >... > > Solution: > 1. When we alter a publication using commands like ‘ALTER PUBLICATION > pub_name DROP TABLE table_name’, first all tables in the publications > are invalidated using the function ‘rel_sync_cache_relation_cb’. Then > again ‘rel_sync_cache_publication_cb’ function is called which > invalidates all the tables. On my environment, rel_sync_cache_publication_cb() was called first and invalidate all the entries, then rel_sync_cache_relation_cb() was called and the specified entry is invalidated - hence second is NO-OP. > This happens because of the following > callback registered: > CacheRegisterSyscacheCallback(PUBLICATIONRELMAP, > rel_sync_cache_publication_cb, (Datum) 0); But even in this case, I could understand that you want to remove the rel_sync_cache_publication_cb() callback. > 2. When we add/drop a schema to/from a publication using command like > ‘ALTER PUBLICATION pub_name ADD TABLES in SCHEMA schema_name’, first > all tables in that schema are invalidated using > ‘rel_sync_cache_relation_cb’ and then again > ‘rel_sync_cache_publication_cb’ function is called which invalidates > all the tables. Even in this case, rel_sync_cache_publication_cb() was called first and then rel_sync_cache_relation_cb(). > > 3. When we alter a namespace using command like ‘ALTER SCHEMA > schema_name RENAME to new_schema_name’ all the table in cache are > invalidated as ‘rel_sync_cache_publication_cb’ is called due to the > following registered callback: > CacheRegisterSyscacheCallback(NAMESPACEOID, > rel_sync_cache_publication_cb, (Datum) 0); > > So, we added a new callback function ‘rel_sync_cache_namespacerel_cb’ > will be called instead of function ‘rel_sync_cache_publication_cb’ , > which invalidates only the cache of the tables which are part of that > particular namespace. For the new function the ‘namespace id’ is added > in the Invalidation message. Hmm, I feel this fix is too much. Unlike ALTER PUBLICATION statements, I think ALTER SCHEMA is rarely executed at the production stage. However, this approach requires adding a new cache callback system, which affects the entire postgres system; this is not very beneficial compared to the outcome. It should be discussed on another thread to involve more people, and then we can add the improvement after being accepted. > Performance Comparison: > I have run the same tests as shared in [1] and observed a significant > decrease in the degradation with the new changes. With selective > invalidation degradation is around ~5%. This results are an average of > 3 runs. IIUC, the executed workload did not contain ALTER SCHEMA command, so third improvement did not contribute this improvement. Best regards, Hayato Kuroda FUJITSU LIMITED
From: "Hayato Kuroda (Fujitsu)"
Dear Shlok, > I have addressed the comment for 0002 patch and attached the patches. > Also, I have moved the tests in the 0002 to 0001 patch. Thanks for updating the patch. 0002 patch seems to remove cache invalidations from publication_invalidation_cb(). Related to that, I found an issue and had a concern. 1. The replication continues even after ALTER PUBLICATION RENAME is executed. For example - assuming that a subscriber subscribes only "pub":
```
pub=# INSERT INTO tab values (1);
INSERT 0 1
pub=# ALTER PUBLICATION pub RENAME TO pub1;
ALTER PUBLICATION
pub=# INSERT INTO tab values (2);
INSERT 0 1
sub=# SELECT * FROM tab ; -- (2) should not be replicated however...
 a
---
 1
 2
(2 rows)
```
This happens because 1) the ALTER PUBLICATION RENAME statement does not invalidate the relation cache, and 2) publications are reloaded only when an invalid RelationSyncEntry is found. In the given example, the first INSERT creates the valid cache entry and the second INSERT reuses it, so the pubname check is skipped. For now, the actual renaming is done at AlterObjectRename_internal(), a generic function. I think we must implement a dedicated function for publications and invalidate relcaches there. 2. Similarly to the above, the relcache won't be invalidated when ALTER PUBLICATION OWNER TO is executed. This means that privilege checks may be ignored if the entry is still valid. Not sure, but is there a possibility this causes an inconsistency? Best regards, Hayato Kuroda FUJITSU LIMITED
From: "Hayato Kuroda (Fujitsu)"
> 2. > Similarly to the above, the relcache won't be invalidated when ALTER > PUBLICATION > OWNER TO is executed. This means that privilege checks may be ignored if the > entry > is still valid. Not sure, but is there a possibility this causes an inconsistency? Hmm, IIUC, the attribute pubowner is not used for now. The paragraph "There are currently no privileges on publications...." [1] may show the current status. However, to keep the current behavior, I suggest invalidating the relcache of the publication's relations when the owner is altered. [1]: https://www.postgresql.org/docs/devel/logical-replication-security.html Best regards, Hayato Kuroda FUJITSU LIMITED
On Mon, 30 Sept 2024 at 16:56, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > 2. > > Similarly to the above, the relcache won't be invalidated when ALTER > > PUBLICATION > > OWNER TO is executed. This means that privilege checks may be ignored if the > > entry > > is still valid. Not sure, but is there a possibility this causes an inconsistency? > > Hmm, IIUC, the attribute pubowner is not used for now. The paragraph > "There are currently no privileges on publications...." [1] may show the current > status. However, to keep the current behavior, I suggest invalidating the relcache > of the publication's relations when the owner is altered. > > [1]: https://www.postgresql.org/docs/devel/logical-replication-security.html I have shared the updated patch [1]. So now, when 'ALTER PUBLICATION ... OWNER TO ...' is executed, the relcache is invalidated for that specific publication. [1] : https://www.postgresql.org/message-id/CANhcyEWEXL3rxvKH9-Xtx-DgGX0D62EktHpW%2BnG%2BMSSaMVUVig%40mail.gmail.com Thanks and Regards, Shlok Kyal
On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > BTW, I noticed that we don't take any table-level locks for Create > > > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > > > a similar problem? I haven't tested so not sure but even if there is a > > > > problem for the Create case, it should lead to some ERROR like missing > > > > publication. > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > the create publication case: > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > callback, associated LSN 0/1510CD8 > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > "logical replication apply worker" (PID 481526) exited with exit code > > > 1 > > > > > > The steps for this process are as follows: > > > 1) Create tables in both the publisher and subscriber. > > > 2) On the publisher: Create a replication slot. > > > 3) On the subscriber: Create a subscription using the slot created by > > > the publisher. > > > 4) On the publisher: > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > 4.c) Session 1: COMMIT; > > > > > > Since we are throwing out a "publication does not exist" error, there > > > is no inconsistency issue here. > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > data continues to replicate even after the publication is dropped. > > > This happens because the open transaction consumes the invalidation, > > > causing the publications to be revalidated using old snapshot. As a > > > result, both the open transactions and the subsequent transactions are > > > getting replicated. > > > > > > We can reproduce this issue by following these steps in a logical > > > replication setup with an "ALL TABLES" publication: > > > On the publisher: > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > In another session on the publisher: > > > Session 2: DROP PUBLICATION > > > Back in Session 1 on the publisher: > > > COMMIT; > > > Finally, in Session 1 on the publisher: > > > INSERT INTO T1 VALUES (val2); > > > > > > Even after dropping the publication, both val1 and val2 are still > > > being replicated to the subscriber. This means that both the > > > in-progress concurrent transaction and the subsequent transactions are > > > being replicated. > > > > > > I don't think locking all tables is a viable solution in this case, as > > > it would require asking the user to refrain from performing any > > > operations on any of the tables in the database while creating a > > > publication. > > > > > > > Indeed, locking all tables in the database to prevent concurrent DMLs > > for this scenario also looks odd to me. The other alternative > > previously suggested by Andres is to distribute catalog modifying > > transactions to all concurrent in-progress transactions [1] but as > > mentioned this could add an overhead. 
One possibility to reduce > > overhead is that we selectively distribute invalidations for > > catalogs-related publications but I haven't analyzed the feasibility. > > > > We need more opinions to decide here, so let me summarize the problem > > and solutions discussed. As explained with an example in an email [1], > > the problem related to logical decoding is that it doesn't process > > invalidations corresponding to DDLs for the already in-progress > > transactions. We discussed preventing DMLs in the first place when > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in > > progress. The solution discussed was to acquire > > ShareUpdateExclusiveLock for all the tables being added via such > > commands. Further analysis revealed that the same handling is required > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all > > the tables in the specified schemas. Then DROP PUBLICATION also seems > > to have similar symptoms which means in the worst case (where > > publication is for ALL TABLES) we have to lock all the tables in the > > database. We are not sure if that is good so the other alternative we > > can pursue is to distribute invalidations in logical decoding > > infrastructure [1] which has its downsides. > > > > Thoughts? > > Thank you for summarizing the problem and solutions! > > I think it's worth trying the idea of distributing invalidation > messages, and we will see if there could be overheads or any further > obstacles. IIUC this approach would resolve another issue we discussed > before too[1]. > > Regards, > > [1] https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com > Hi Sawada-san, I have tested the scenario shared by you on the thread [1]. And I confirm that the latest patch [2] fixes this issue. [1] https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com [2] https://www.postgresql.org/message-id/CANhcyEWfqdUvn2d2KOdvkhebBi5VO6O8J%2BC6%2BOwsPNwCTM%3DakQ%40mail.gmail.com Thanks and Regards, Shlok Kyal
On Mon, Dec 9, 2024 at 10:27 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Oct 08, 2024 at 03:21:38PM +0530, Shlok Kyal wrote: > > I have tested the scenario shared by you on the thread [1]. And I > > confirm that the latest patch [2] fixes this issue. > > > > [1] https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com > > [2] https://www.postgresql.org/message-id/CANhcyEWfqdUvn2d2KOdvkhebBi5VO6O8J%2BC6%2BOwsPNwCTM%3DakQ%40mail.gmail.com > > Sawada-san, are you planning to look at that? It looks like this > thread is waiting for your input. Sorry I lost track of this thread. I'll check the test results and patch soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Oct 8, 2024 at 2:51 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > BTW, I noticed that we don't take any table-level locks for Create > > > > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > > > > a similar problem? I haven't tested so not sure but even if there is a > > > > > problem for the Create case, it should lead to some ERROR like missing > > > > > publication. > > > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > > the create publication case: > > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > > callback, associated LSN 0/1510CD8 > > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > > "logical replication apply worker" (PID 481526) exited with exit code > > > > 1 > > > > > > > > The steps for this process are as follows: > > > > 1) Create tables in both the publisher and subscriber. > > > > 2) On the publisher: Create a replication slot. > > > > 3) On the subscriber: Create a subscription using the slot created by > > > > the publisher. > > > > 4) On the publisher: > > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > > 4.c) Session 1: COMMIT; > > > > > > > > Since we are throwing out a "publication does not exist" error, there > > > > is no inconsistency issue here. > > > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > > data continues to replicate even after the publication is dropped. > > > > This happens because the open transaction consumes the invalidation, > > > > causing the publications to be revalidated using old snapshot. As a > > > > result, both the open transactions and the subsequent transactions are > > > > getting replicated. > > > > > > > > We can reproduce this issue by following these steps in a logical > > > > replication setup with an "ALL TABLES" publication: > > > > On the publisher: > > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > > In another session on the publisher: > > > > Session 2: DROP PUBLICATION > > > > Back in Session 1 on the publisher: > > > > COMMIT; > > > > Finally, in Session 1 on the publisher: > > > > INSERT INTO T1 VALUES (val2); > > > > > > > > Even after dropping the publication, both val1 and val2 are still > > > > being replicated to the subscriber. This means that both the > > > > in-progress concurrent transaction and the subsequent transactions are > > > > being replicated. > > > > > > > > I don't think locking all tables is a viable solution in this case, as > > > > it would require asking the user to refrain from performing any > > > > operations on any of the tables in the database while creating a > > > > publication. > > > > > > > > > > Indeed, locking all tables in the database to prevent concurrent DMLs > > > for this scenario also looks odd to me. 
The other alternative > > > previously suggested by Andres is to distribute catalog modifying > > > transactions to all concurrent in-progress transactions [1] but as > > > mentioned this could add an overhead. One possibility to reduce > > > overhead is that we selectively distribute invalidations for > > > catalogs-related publications but I haven't analyzed the feasibility. > > > > > > We need more opinions to decide here, so let me summarize the problem > > > and solutions discussed. As explained with an example in an email [1], > > > the problem related to logical decoding is that it doesn't process > > > invalidations corresponding to DDLs for the already in-progress > > > transactions. We discussed preventing DMLs in the first place when > > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in > > > progress. The solution discussed was to acquire > > > ShareUpdateExclusiveLock for all the tables being added via such > > > commands. Further analysis revealed that the same handling is required > > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all > > > the tables in the specified schemas. Then DROP PUBLICATION also seems > > > to have similar symptoms which means in the worst case (where > > > publication is for ALL TABLES) we have to lock all the tables in the > > > database. We are not sure if that is good so the other alternative we > > > can pursue is to distribute invalidations in logical decoding > > > infrastructure [1] which has its downsides. > > > > > > Thoughts? > > > > Thank you for summarizing the problem and solutions! > > > > I think it's worth trying the idea of distributing invalidation > > messages, and we will see if there could be overheads or any further > > obstacles. IIUC this approach would resolve another issue we discussed > > before too[1]. > > > > Regards, > > > > [1] https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com > > > > Hi Sawada-san, > > I have tested the scenario shared by you on the thread [1]. And I > confirm that the latest patch [2] fixes this issue. > I confirmed that the proposed patch fixes these issues. I have one question about the patch: In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the following code: /* * If we don't have a base snapshot yet, there are no changes in this * transaction which in turn implies we don't yet need a snapshot at * all. We'll add a snapshot when the first change gets queued. * * NB: This works correctly even for subtransactions because * ReorderBufferAssignChild() takes care to transfer the base snapshot * to the top-level transaction, and while iterating the changequeue * we'll get the change from the subtxn. */ if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid)) continue; Is there any case where we need to distribute inval messages to transactions that don't have the base snapshot yet but eventually need the inval messages? Overall, with this idea, we distribute invalidation messages to all concurrent decoded transactions. It could introduce performance regressions by several causes. For example, we could end up invalidating RelationSyncCache entries in more cases. While this is addressed by your selectively cache invalidation patch, there is still 5% regression. We might need to accept a certain amount of regressions for making it correct but it would be better to figure out where these regressions come from. 
Other than that, I think the performance regression could happen due to the costs of distributing invalidation messages. You've already observed there is 1~3% performance regression in cases where we distribute a large amount of invalidation messages to one concurrently decoded transaction[1]. I guess that the selectively cache invalidation idea would not help this case. Also, I think we might want to test other cases like where we distribute a small amount of invalidation messages to many concurrently decoded transactions. Regards, [1] https://www.postgresql.org/message-id/CANhcyEX%2BC3G68W51myHWfbpAdmSXDwHdMsWUa%2BzHBF_QKKvZMw%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
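For readers following along, here is a minimal SQL sketch of the DROP PUBLICATION scenario quoted earlier in this message. It is only an illustration: the table name t1 and publication name pub1 are taken from the quoted report, and it assumes an ALL TABLES publication with a working subscription already in place.

```
-- Publisher, session 1: open a transaction and queue a change while the
-- publication still exists.
BEGIN;
INSERT INTO t1 VALUES (1);          -- "val1"

-- Publisher, session 2: drop the ALL TABLES publication while session 1 is open.
DROP PUBLICATION pub1;

-- Publisher, session 1: commit, then insert again in a new transaction.
COMMIT;
INSERT INTO t1 VALUES (2);          -- "val2"

-- Reported behavior: both val1 and val2 are still replicated to the
-- subscriber, because decoding of the open transaction revalidated the
-- publication information with an old snapshot.
```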
On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I confirmed that the proposed patch fixes these issues. I have one > question about the patch: > > In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the > following code: > > /* > * If we don't have a base snapshot yet, there are no changes in this > * transaction which in turn implies we don't yet need a snapshot at > * all. We'll add a snapshot when the first change gets queued. > * > * NB: This works correctly even for subtransactions because > * ReorderBufferAssignChild() takes care to transfer the base snapshot > * to the top-level transaction, and while iterating the changequeue > * we'll get the change from the subtxn. > */ > if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid)) > continue; > > Is there any case where we need to distribute inval messages to > transactions that don't have the base snapshot yet but eventually need > the inval messages? > Good point. It is mentioned that for snapshots: "We'll add a snapshot when the first change gets queued.". I think we achieve this via builder->committed.xip array such that when we set a base snapshot for a transaction, we use that array to form a snapshot. However, I don't see any such consideration for invalidations. Now, we could either always add invalidations to xacts that don't have base_snapshot yet or have a mechanism similar committed.xid array. But it is better to first reproduce the problem. > Overall, with this idea, we distribute invalidation messages to all > concurrent decoded transactions. It could introduce performance > regressions by several causes. For example, we could end up > invalidating RelationSyncCache entries in more cases. While this is > addressed by your selectively cache invalidation patch, there is still > 5% regression. We might need to accept a certain amount of regressions > for making it correct but it would be better to figure out where these > regressions come from. Other than that, I think the performance > regression could happen due to the costs of distributing invalidation > messages. You've already observed there is 1~3% performance regression > in cases where we distribute a large amount of invalidation messages > to one concurrently decoded transaction[1]. I guess that the > selectively cache invalidation idea would not help this case. Also, I > think we might want to test other cases like where we distribute a > small amount of invalidation messages to many concurrently decoded > transactions. > +1. -- With Regards, Amit Kapila.
On Wed, 11 Dec 2024 at 09:13, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Tue, 8 Oct 2024 at 11:11, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > Hi Kuroda-san, > > > > > > I have also modified the tests in 0001 patch. These changes are only > > > > related to syntax of writing tests. > > > > > > LGTM. I found small improvements, please find the attached. > > > > I have applied the changes and updated the patch. > > > Patches needed a rebase. Attached the rebased patch. > Patches need a rebase. Attached the rebased patch. Thanks and regards, Shlok Kyal
Attachment
Hi, After reading the thread and doing a bit of testing, the problem seems significant and is still present. The fact that it's probably not well known makes it more concerning, in my opinion. I was wondering what could be done to help move this topic forward (given my limited abilities)? -- Benoit Lobréau Consultant http://dalibo.com
On Tue, Feb 25, 2025 at 3:56 PM Benoit Lobréau <benoit.lobreau@dalibo.com> wrote: > > After reading the thread and doing a bit of testing, the problem seems > significant and is still present. The fact that it's probably not well > known makes it more concerning, in my opinion. I was wondering what > could be done to help move this topic forward (given my limited abilities)? >

You can help with reviewing and testing the proposed patch, and also with assessing its performance impact, if possible. Shlok has done some performance testing of the patch, which you could repeat independently, and Sawada-San has asked for more performance tests in his last email (1), which you can also help with.

(1) - https://www.postgresql.org/message-id/CAD21AoDoWc8MWTyKtmNF_606bcW6J0gV%3D%3Dr%3DVmPXKUN-e3o9ew%40mail.gmail.com

--
With Regards,
Amit Kapila.
On Wed, 11 Dec 2024 at 12:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Oct 8, 2024 at 2:51 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > > > BTW, I noticed that we don't take any table-level locks for Create > > > > > > Publication .. For ALL TABLES (and Drop Publication). Can that create > > > > > > a similar problem? I haven't tested so not sure but even if there is a > > > > > > problem for the Create case, it should lead to some ERROR like missing > > > > > > publication. > > > > > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > > > the create publication case: > > > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > > > callback, associated LSN 0/1510CD8 > > > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > > > "logical replication apply worker" (PID 481526) exited with exit code > > > > > 1 > > > > > > > > > > The steps for this process are as follows: > > > > > 1) Create tables in both the publisher and subscriber. > > > > > 2) On the publisher: Create a replication slot. > > > > > 3) On the subscriber: Create a subscription using the slot created by > > > > > the publisher. > > > > > 4) On the publisher: > > > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > > > 4.c) Session 1: COMMIT; > > > > > > > > > > Since we are throwing out a "publication does not exist" error, there > > > > > is no inconsistency issue here. > > > > > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > > > data continues to replicate even after the publication is dropped. > > > > > This happens because the open transaction consumes the invalidation, > > > > > causing the publications to be revalidated using old snapshot. As a > > > > > result, both the open transactions and the subsequent transactions are > > > > > getting replicated. > > > > > > > > > > We can reproduce this issue by following these steps in a logical > > > > > replication setup with an "ALL TABLES" publication: > > > > > On the publisher: > > > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > > > In another session on the publisher: > > > > > Session 2: DROP PUBLICATION > > > > > Back in Session 1 on the publisher: > > > > > COMMIT; > > > > > Finally, in Session 1 on the publisher: > > > > > INSERT INTO T1 VALUES (val2); > > > > > > > > > > Even after dropping the publication, both val1 and val2 are still > > > > > being replicated to the subscriber. This means that both the > > > > > in-progress concurrent transaction and the subsequent transactions are > > > > > being replicated. 
> > > > > > > > > > I don't think locking all tables is a viable solution in this case, as > > > > > it would require asking the user to refrain from performing any > > > > > operations on any of the tables in the database while creating a > > > > > publication. > > > > > > > > > > > > > Indeed, locking all tables in the database to prevent concurrent DMLs > > > > for this scenario also looks odd to me. The other alternative > > > > previously suggested by Andres is to distribute catalog modifying > > > > transactions to all concurrent in-progress transactions [1] but as > > > > mentioned this could add an overhead. One possibility to reduce > > > > overhead is that we selectively distribute invalidations for > > > > catalogs-related publications but I haven't analyzed the feasibility. > > > > > > > > We need more opinions to decide here, so let me summarize the problem > > > > and solutions discussed. As explained with an example in an email [1], > > > > the problem related to logical decoding is that it doesn't process > > > > invalidations corresponding to DDLs for the already in-progress > > > > transactions. We discussed preventing DMLs in the first place when > > > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in > > > > progress. The solution discussed was to acquire > > > > ShareUpdateExclusiveLock for all the tables being added via such > > > > commands. Further analysis revealed that the same handling is required > > > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all > > > > the tables in the specified schemas. Then DROP PUBLICATION also seems > > > > to have similar symptoms which means in the worst case (where > > > > publication is for ALL TABLES) we have to lock all the tables in the > > > > database. We are not sure if that is good so the other alternative we > > > > can pursue is to distribute invalidations in logical decoding > > > > infrastructure [1] which has its downsides. > > > > > > > > Thoughts? > > > > > > Thank you for summarizing the problem and solutions! > > > > > > I think it's worth trying the idea of distributing invalidation > > > messages, and we will see if there could be overheads or any further > > > obstacles. IIUC this approach would resolve another issue we discussed > > > before too[1]. > > > > > > Regards, > > > > > > [1] https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com > > > > > > > Hi Sawada-san, > > > > I have tested the scenario shared by you on the thread [1]. And I > > confirm that the latest patch [2] fixes this issue. > > > > I confirmed that the proposed patch fixes these issues. I have one > question about the patch: > > In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the > following code: > > /* > * If we don't have a base snapshot yet, there are no changes in this > * transaction which in turn implies we don't yet need a snapshot at > * all. We'll add a snapshot when the first change gets queued. > * > * NB: This works correctly even for subtransactions because > * ReorderBufferAssignChild() takes care to transfer the base snapshot > * to the top-level transaction, and while iterating the changequeue > * we'll get the change from the subtxn. > */ > if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid)) > continue; > > Is there any case where we need to distribute inval messages to > transactions that don't have the base snapshot yet but eventually need > the inval messages? 
> > Overall, with this idea, we distribute invalidation messages to all > concurrent decoded transactions. It could introduce performance > regressions by several causes. For example, we could end up > invalidating RelationSyncCache entries in more cases. While this is > addressed by your selectively cache invalidation patch, there is still > 5% regression. We might need to accept a certain amount of regressions > for making it correct but it would be better to figure out where these > regressions come from. Other than that, I think the performance > regression could happen due to the costs of distributing invalidation > messages. You've already observed there is 1~3% performance regression > in cases where we distribute a large amount of invalidation messages > to one concurrently decoded transaction[1]. I guess that the > selectively cache invalidation idea would not help this case. Also, I > think we might want to test other cases like where we distribute a > small amount of invalidation messages to many concurrently decoded > transactions. >

Hi Sawada-san,

I have done the performance testing for cases where we distribute a small amount of invalidation messages to many concurrently decoded transactions. Here are the results:

Concurrent Txn | Head (sec) | Patch (sec) | Degradation in %
---------------------------------------------------------------------------------------------
50   | 0.2627734 | 0.2654608 | 1.022706256
100  | 0.4801048 | 0.4869254 | 1.420648158
500  | 2.2170336 | 2.2438656 | 1.210265825
1000 | 4.4957402 | 4.5282574 | 0.723289126
2000 | 9.2013082 | 9.21164   | 0.112286207

The steps I followed are:
1. Initially, logical replication is set up.
2. Then we start 'n' concurrent transactions. Each txn looks like:
   BEGIN;
   Insert into t1 values(11);
3. Now we add two invalidations, which will be distributed to each transaction, by running:
   ALTER PUBLICATION regress_pub1 DROP TABLE t1;
   ALTER PUBLICATION regress_pub1 ADD TABLE t1;
4. Then run an insert in each txn. It will build the relation cache in each txn.
5. Commit each transaction.

I have also attached the script.

Thanks and Regards,
Shlok Kyal
Attachment
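To visualize the workload Shlok describes in the steps above, here is a rough SQL transcript (my own sketch, not the attached script): each of the n concurrent sessions runs the uncommented statements, while a separate session issues the two ALTER PUBLICATION statements shown in the comments.

```
-- Each of the n concurrent sessions (steps 2, 4 and 5):
BEGIN;
INSERT INTO t1 VALUES (11);        -- step 2: queues the first change

-- A separate session, while all n transactions are still open (step 3):
--   ALTER PUBLICATION regress_pub1 DROP TABLE t1;
--   ALTER PUBLICATION regress_pub1 ADD TABLE t1;
-- these two invalidations are what get distributed to every open transaction.

INSERT INTO t1 VALUES (11);        -- step 4: rebuilds the relation cache in each txn
COMMIT;                            -- step 5
```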
On Wed, Feb 26, 2025 at 9:21 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > I have done the performance testing for cases where we distribute a > small amount of invalidation messages to many concurrently decoded > transactions. > Here are results: > > Concurrent Txn | Head (sec) | Patch (sec) | Degradation in % > --------------------------------------------------------------------------------------------- > 50 | 0.2627734 | 0.2654608 | 1.022706256 > 100 | 0.4801048 | 0.4869254 | 1.420648158 > 500 | 2.2170336 | 2.2438656 | 1.210265825 > 1000 | 4.4957402 | 4.5282574 | 0.723289126 > 2000 | 9.2013082 | 9.21164 | 0.112286207 > > The steps I followed is: > 1. Initially logical replication is setup. > 2. Then we start 'n' number of concurrent transactions. > Each txn look like: > BEGIN; > Insert into t1 values(11); > 3. Now we add two invalidation which will be distributed each > transaction by running command: > ALTER PUBLICATION regress_pub1 DROP TABLE t1 > ALTER PUBLICATION regress_pub1 ADD TABLE t1 > 4. Then run an insert for each txn. It will build cache for relation > in each txn. > 5. Commit Each transaction. > > I have also attached the script. > The tests are done using pub-sub setup which has some overhead of logical replication as well. Can we try this test by fetching changes via SQL API using pgoutput as plugin to see the impact? -- With Regards, Amit Kapila.
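For what it's worth, a decoding run without a real subscriber can be driven entirely through the SQL interface; a minimal sketch of what such a test could look like (slot and publication names are placeholders) is:

```
-- Create a logical slot that uses pgoutput directly, with no subscription.
SELECT pg_create_logical_replication_slot('regress_slot', 'pgoutput');

-- ... run the workload here (concurrent transactions, ALTER PUBLICATION, ...) ...

-- Fetch the decoded changes through the SQL API; timing this call measures
-- pure decoding cost, with no network or apply-worker overhead.
SELECT count(*)
FROM pg_logical_slot_get_binary_changes('regress_slot', NULL, NULL,
                                        'proto_version', '1',
                                        'publication_names', 'regress_pub1');
```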
RE: long-standing data loss bug in initial sync of logical replication
From: "Zhijie Hou (Fujitsu)"
On Monday, February 24, 2025 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > I confirmed that the proposed patch fixes these issues. I have one > > question about the patch: > > > > In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the > > following code: > > > > /* > > * If we don't have a base snapshot yet, there are no changes in this > > * transaction which in turn implies we don't yet need a snapshot at > > * all. We'll add a snapshot when the first change gets queued. > > * > > * NB: This works correctly even for subtransactions because > > * ReorderBufferAssignChild() takes care to transfer the base > snapshot > > * to the top-level transaction, and while iterating the changequeue > > * we'll get the change from the subtxn. > > */ > > if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid)) > > continue; > > > > Is there any case where we need to distribute inval messages to > > transactions that don't have the base snapshot yet but eventually need > > the inval messages? > > > > Good point. It is mentioned that for snapshots: "We'll add a snapshot > when the first change gets queued.". I think we achieve this via > builder->committed.xip array such that when we set a base snapshot for > a transaction, we use that array to form a snapshot. However, I don't > see any such consideration for invalidations. Now, we could either > always add invalidations to xacts that don't have base_snapshot yet or > have a mechanism similar committed.xid array. But it is better to > first reproduce the problem. I think distributing invalidations to a transaction that has not yet built a base snapshot is un-necessary. This is because, during the process of building its base snapshot, such a transaction will have already recorded the XID of the transaction that altered the publication information into its array of committed XIDs. Consequently, it will reflect the latest changes in the catalog from the beginning. In the context of logical decoding, this scenario is analogous to decoding a new transaction initiated after the catalog-change transaction has been committed. The original issue arises because the catalog cache was constructed using an outdated snapshot that could not reflect the latest catalog changes. However, this is not a problem in cases without a base snapshot. Since the existing catalog cache should have been invalidated upon decoding the committed catalog-change transaction, the subsequent transactions will construct a new cache with the latest snapshot. I also considered the scenario where only a sub-transaction has a base snapshot that has not yet been transferred to its top-level transaction. However, I think this is not problematic because a sub-transaction transfers its snapshot immediately upon building it (see ReorderBufferSetBaseSnapshot). The only exception is if the sub-transaction is independent (i.e., not yet associated with its top-level transaction). In such a case, the sub-transaction is treated as a top-level transaction, and invalidations will be distributed to this sub-transaction after applying the patch which is sufficient to resolve the issue. Considering the complexity of this topic, I think it would be better to add some comments like the attachment Best Regards, Hou zj
Attachment
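As a concrete (hypothetical) illustration of the "no base snapshot yet" case discussed above, consider the following interleaving on the publisher; the decoder only builds S1's base snapshot when it decodes S1's first change, which in WAL order comes after the ALTER's commit.

```
-- Session S1: the transaction is open but has produced no WAL yet, so the
-- decoder has no base snapshot (and no queued changes) for it.
BEGIN;

-- Session S2: the catalog-changing transaction commits first in WAL order.
ALTER PUBLICATION pub ADD TABLE t1;

-- Session S1: its first change is queued only now; the base snapshot is
-- built after the ALTER's commit has been decoded and its invalidations
-- executed, so S1 needs no distributed invalidation.
INSERT INTO t1 VALUES (1);
COMMIT;
```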
On Thu, Feb 27, 2025 at 12:14 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Monday, February 24, 2025 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada > > <sawada.mshk@gmail.com> wrote: > > > > > > I confirmed that the proposed patch fixes these issues. I have one > > > question about the patch: > > > > > > In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the > > > following code: > > > > > > /* > > > * If we don't have a base snapshot yet, there are no changes in this > > > * transaction which in turn implies we don't yet need a snapshot at > > > * all. We'll add a snapshot when the first change gets queued. > > > * > > > * NB: This works correctly even for subtransactions because > > > * ReorderBufferAssignChild() takes care to transfer the base > > snapshot > > > * to the top-level transaction, and while iterating the changequeue > > > * we'll get the change from the subtxn. > > > */ > > > if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid)) > > > continue; > > > > > > Is there any case where we need to distribute inval messages to > > > transactions that don't have the base snapshot yet but eventually need > > > the inval messages? > > > > > > > Good point. It is mentioned that for snapshots: "We'll add a snapshot > > when the first change gets queued.". I think we achieve this via > > builder->committed.xip array such that when we set a base snapshot for > > a transaction, we use that array to form a snapshot. However, I don't > > see any such consideration for invalidations. Now, we could either > > always add invalidations to xacts that don't have base_snapshot yet or > > have a mechanism similar committed.xid array. But it is better to > > first reproduce the problem. > > I think distributing invalidations to a transaction that has not yet built a > base snapshot is un-necessary. This is because, during the process of building > its base snapshot, such a transaction will have already recorded the XID of the > transaction that altered the publication information into its array of > committed XIDs. Consequently, it will reflect the latest changes in the catalog > from the beginning. In the context of logical decoding, this scenario is > analogous to decoding a new transaction initiated after the catalog-change > transaction has been committed. > > The original issue arises because the catalog cache was constructed using an > outdated snapshot that could not reflect the latest catalog changes. However, > this is not a problem in cases without a base snapshot. Since the existing > catalog cache should have been invalidated upon decoding the committed > catalog-change transaction, the subsequent transactions will construct a new > cache with the latest snapshot. I've also concluded it's not necessary but the reason and analysis might be somewhat different. IIUC in the original issue (looking at Andres's reproducer[1]), the fact that when replaying a non-catalog-change transaction, the walsender constructed the snapshot that doesn't reflect the catalog change is fine because the first change of that transaction was made before the catalog change. The problem is that the walsender process absorbed the invalidation message when replaying the change that happened before the catalog change, and ended up keeping replaying the subsequent changes with that snapshot. 
That is why we concluded that we need to distribute the invalidation messages to concurrently decoded transactions so that we can invalidate the cache again at that point. As the comment mentioned, the base snapshot is set before queuing any changes, so if the transaction doesn't have the base snapshot yet, there must be no queued change that happened before the catalog change. The transactions that initiated after the catalog change don't have this issue. Regards, [1] https://www.postgresql.org/message-id/20231118025445.crhaeeuvoe2g5dv6%40awork3.anarazel.de -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Feb 28, 2025 at 6:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Feb 27, 2025 at 12:14 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Monday, February 24, 2025 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada > > > <sawada.mshk@gmail.com> wrote: > > > > > > > > I confirmed that the proposed patch fixes these issues. I have one > > > > question about the patch: > > > > > > > > In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the > > > > following code: > > > > > > > > /* > > > > * If we don't have a base snapshot yet, there are no changes in this > > > > * transaction which in turn implies we don't yet need a snapshot at > > > > * all. We'll add a snapshot when the first change gets queued. > > > > * > > > > * NB: This works correctly even for subtransactions because > > > > * ReorderBufferAssignChild() takes care to transfer the base > > > snapshot > > > > * to the top-level transaction, and while iterating the changequeue > > > > * we'll get the change from the subtxn. > > > > */ > > > > if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid)) > > > > continue; > > > > > > > > Is there any case where we need to distribute inval messages to > > > > transactions that don't have the base snapshot yet but eventually need > > > > the inval messages? > > > > > > > > > > Good point. It is mentioned that for snapshots: "We'll add a snapshot > > > when the first change gets queued.". I think we achieve this via > > > builder->committed.xip array such that when we set a base snapshot for > > > a transaction, we use that array to form a snapshot. However, I don't > > > see any such consideration for invalidations. Now, we could either > > > always add invalidations to xacts that don't have base_snapshot yet or > > > have a mechanism similar committed.xid array. But it is better to > > > first reproduce the problem. > > > > I think distributing invalidations to a transaction that has not yet built a > > base snapshot is un-necessary. This is because, during the process of building > > its base snapshot, such a transaction will have already recorded the XID of the > > transaction that altered the publication information into its array of > > committed XIDs. Consequently, it will reflect the latest changes in the catalog > > from the beginning. In the context of logical decoding, this scenario is > > analogous to decoding a new transaction initiated after the catalog-change > > transaction has been committed. > > > > The original issue arises because the catalog cache was constructed using an > > outdated snapshot that could not reflect the latest catalog changes. However, > > this is not a problem in cases without a base snapshot. Since the existing > > catalog cache should have been invalidated upon decoding the committed > > catalog-change transaction, the subsequent transactions will construct a new > > cache with the latest snapshot. > > I've also concluded it's not necessary but the reason and analysis > might be somewhat different. IIUC in the original issue (looking at > Andres's reproducer[1]), the fact that when replaying a > non-catalog-change transaction, the walsender constructed the snapshot > that doesn't reflect the catalog change is fine because the first > change of that transaction was made before the catalog change. 
The > problem is that the walsender process absorbed the invalidation > message when replaying the change that happened before the catalog > change, and ended up keeping replaying the subsequent changes with > that snapshot. That is why we concluded that we need to distribute the > invalidation messages to concurrently decoded transactions so that we > can invalidate the cache again at that point. As the comment > mentioned, the base snapshot is set before queuing any changes, so if > the transaction doesn't have the base snapshot yet, there must be no > queued change that happened before the catalog change. The > transactions that initiated after the catalog change don't have this > issue. > I think both of you are saying the same thing with slightly different words. Hou-San's explanation goes into more detail at the code level, and you have said the same thing with a slightly higher-level view. Additionally, for streaming transactions where we would have already sent one or more streams, we don't need anything special since they behave similarly to a transaction having a base snapshot because we save the snapshot after sending each stream. -- With Regards, Amit Kapila.
On Fri, Feb 28, 2025 at 9:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Feb 28, 2025 at 6:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Feb 27, 2025 at 12:14 AM Zhijie Hou (Fujitsu) > > > > > > I think distributing invalidations to a transaction that has not yet built a > > > base snapshot is un-necessary. This is because, during the process of building > > > its base snapshot, such a transaction will have already recorded the XID of the > > > transaction that altered the publication information into its array of > > > committed XIDs. Consequently, it will reflect the latest changes in the catalog > > > from the beginning. In the context of logical decoding, this scenario is > > > analogous to decoding a new transaction initiated after the catalog-change > > > transaction has been committed. > > > > > > The original issue arises because the catalog cache was constructed using an > > > outdated snapshot that could not reflect the latest catalog changes. However, > > > this is not a problem in cases without a base snapshot. Since the existing > > > catalog cache should have been invalidated upon decoding the committed > > > catalog-change transaction, the subsequent transactions will construct a new > > > cache with the latest snapshot. > > > > I've also concluded it's not necessary but the reason and analysis > > might be somewhat different. > Based on the discussion on this point and Hou-San's proposed comment, I have tried to add/edit a few comments in 0001 patch. See, if those make sense to you, it is important to capture the reason and theory we discussed here in the form of comments so that it will be easy to remember the reason in the future. -- With Regards, Amit Kapila.
Attachment
Hi,

It took me a while but I ran the test on my laptop with 20 runs per test. I asked for a dedicated server and will re-run the tests if/when I have it.

count of partitions | Head (sec) | Fix (sec) | Degradation (%)
----------------------------------------------------------------------
1000  | 0,0265 | 0,028  | 5,66037735849054
5000  | 0,091  | 0,0945 | 3,84615384615385
10000 | 0,1795 | 0,1815 | 1,11420612813371

Concurrent Txn | Head (sec) | Patch (sec) | Degradation in %
---------------------------------------------------------------------
50   | 0,1797647  | 0,1920949  | 6,85907744957
100  | 0,3693029  | 0,3823425  | 3,53086856344
500  | 1,62265755 | 1,91427485 | 17,97158617972
1000 | 3,01388635 | 3,57678295 | 18,67676928162
2000 | 7,0171877  | 6,4713304  | 8,43500897435

I'll try to run test2.pl later (right now it fails).

hope this helps.

--
Benoit Lobréau
Consultant
http://dalibo.com
Attachment
On Mon, Feb 24, 2025 at 4:49 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > Patches need a rebase. Attached the rebased patch. > I would like to discuss 0002 patch: publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue) { publications_valid = false; - - /* - * Also invalidate per-relation cache so that next time the filtering info - * is checked it will be updated with the new publication settings. - */ - rel_sync_cache_publication_cb(arg, cacheid, hashvalue); } /* @@ -1970,18 +1964,6 @@ init_rel_sync_cache(MemoryContext cachectx) rel_sync_cache_publication_cb, (Datum) 0); - /* - * Flush all cache entries after any publication changes. (We need no - * callback entry for pg_publication, because publication_invalidation_cb - * will take care of it.) - */ - CacheRegisterSyscacheCallback(PUBLICATIONRELMAP, - rel_sync_cache_publication_cb, - (Datum) 0); - CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP, - rel_sync_cache_publication_cb, - (Datum) 0); In 0002 patch, we are improving the performance by avoiding invalidation processing in a number of cases. Basically, the claim is that we are unnecessarily invalidating all the RelSyncCache entries when a particular relation's entry could be invalidated. I have not verified it, but IIUC, this should be an independent improvement atop HEAD; if so, then we should start a separate thread to discuss it. Thoughts? -- With Regards, Amit Kapila.
RE: long-standing data loss bug in initial sync of logical replication
From: "Zhijie Hou (Fujitsu)"
On Friday, February 28, 2025 4:28 PM Benoit Lobréau <benoit.lobreau@dalibo.com> wrote: > > It took me a while but I ran the test on my laptop with 20 runs per test. I asked > for a dedicated server and will re-run the tests if/when I have it. > > count of partitions | Head (sec) | Fix (sec) | Degradation (%) > ---------------------------------------------------------------------- > 1000 | 0,0265 | 0,028 | 5,66037735849054 > 5000 | 0,091 | 0,0945 | 3,84615384615385 > 10000 | 0,1795 | 0,1815 | 1,11420612813371 > > Concurrent Txn | Head (sec) | Patch (sec) | Degradation in % > --------------------------------------------------------------------- > 50 | 0,1797647 | 0,1920949 | 6,85907744957 > 100 | 0,3693029 | 0,3823425 | 3,53086856344 > 500 | 1,62265755 | 1,91427485 | 17,97158617972 > 1000 | 3,01388635 | 3,57678295 | 18,67676928162 > 2000 | 7,0171877 | 6,4713304 | 8,43500897435 > > I'll try to run test2.pl later (right now it fails). > > hope this helps. Thank you for testing and sharing the data! A nitpick with the data for the Concurrent Transaction (2000) case. The results show that the HEAD's data appears worse than the patch data, which seems unusual. However, I confirmed that the details in the attachment are as expected, so, this seems to be a typo. (I assume you intended to use a decimal point instead of a comma in the data like (8,43500...)) The data suggests some regression, slightly more than Shlok’s findings, but it is still within an acceptable range for me. Since the test script builds a real subscription for testing, the results might be affected by network and replication factors, as Amit pointed out, we will share a new test script soon that uses the SQL API xxx_get_changes() to test. It would be great if you could verify the performance using the updated script as well. Best Regards, Hou zj
On 3/3/25 8:41 AM, Zhijie Hou (Fujitsu) wrote:
> A nitpick with the data for the Concurrent Transaction (2000) case. The results > show that the HEAD's data appears worse than the patch data, which seems > unusual. However, I confirmed that the details in the attachment are as expected, > so, this seems to be a typo. (I assume you intended to use a > decimal point instead of a comma in the data like (8,43500...))

Hi,

Argh, yes, sorry! I didn't pay enough attention and accidentally inverted the Patch and Head numbers in the last line when copying them from the ODS to the email to match the previous report layout. The comma is due to how decimals are written in my language (comma instead of dot). I forgot to "translate" it.

Concurrent Txn | Head (sec) | Patch (sec) | Degradation in %
---------------------------------------------------------------------
50   | 0.1797647  | 0.1920949  | 6.85907744957
100  | 0.3693029  | 0.3823425  | 3.53086856344
500  | 1.62265755 | 1.91427485 | 17.97158617972
1000 | 3.01388635 | 3.57678295 | 18.67676928162
2000 | 6.4713304  | 7.0171877  | 8.43500897435

> as Amit pointed out, we will share a new test script soon > that uses the SQL API xxx_get_changes() to test. It would be great if you could > verify the performance using the updated script as well.

Will do.

--
Benoit Lobréau
Consultant
http://dalibo.com
RE: long-standing data loss bug in initial sync of logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear hackers,

I found that the patch needs to be rebased due to ac4494; PSA the new version. It can be applied atop HEAD and the tests worked well.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
RE: long-standing data loss bug in initial sync of logical replication
From: "Hayato Kuroda (Fujitsu)"
Hi Hackers,

Our team (mainly Shlok) did performance testing with several workloads. Let me share the results on -hackers. We did it for the master and REL_17 branches; this post discusses master.

The observed trend is: the performance regression exists primarily during frequent execution of publication DDL statements that modify published tables. This is expected due to the necessary cache rebuild and distribution overhead involved. The regression is small or nearly nonexistent in scenarios where DDLs do not affect published tables or when the frequency of such DDL statements is low.

Used source
========
The base code was HEAD plus some modifications which could selectively invalidate relsync caches; these have since been pushed as 3abe9d. The compared patch was v16. We ran five benchmarks; let me share them one by one.

-----

Workload A: No DDL operation done in concurrent sessions
======================================
In this workload, the number of concurrent transactions was varied, but none of them contained DDL commands. The decoding time of all transactions was measured and compared. We expected that the performance would not change because no caches would be invalidated. The actual workload is noted in [1] and the runner is attached.

The table below contains the results. We could not find any notable degradation.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50   | 0.013196 | 0.013314 | 0.8968
100  | 0.014531 | 0.014728 | 1.3558
500  | 0.018079 | 0.017969 | -0.6066
1000 | 0.023087 | 0.023772 | 2.9670
2000 | 0.031311 | 0.031750 | 1.4010

-----

Workload B: DDL is happening but is unrelated to the publication
=======================================
In this workload, one of the concurrent transactions contained a DDL, but it was not related to the publication or the published tables. We again expected that the performance would not change. The actual workload is noted in [2] and the runner is attached.

The table below contains the results. The proposed patch distributes invalidation messages to concurrent decoding transactions, so the overhead is roughly proportional to the concurrency, and what we observed matches that theory. Since these inval messages do not invalidate relsync caches, the difference is not large.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50   | 0.013410 | 0.013217 | -1.4417
100  | 0.014694 | 0.015260 | 3.8496
1000 | 0.023211 | 0.025376 | 9.3289
2000 | 0.032954 | 0.036322 | 10.2213

-----

Workload C: DDL is happening on the publication but on an unrelated table
===========================================
In this workload, one of the concurrent transactions contained a DDL which altered the publication in use, but it only added/dropped a table which was not being decoded. The actual workload is noted in [3] and the runner is attached.

The table below contains the results. Since commit 3abe9dc, there is no need to rebuild the whole relsync cache for unrelated publish actions, so the degradation was mostly the same as in B.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50   | 0.013546 | 0.013409 | -1.0089
100  | 0.015225 | 0.015357 | 0.8648
500  | 0.017848 | 0.019300 | 8.1372
1000 | 0.023430 | 0.025152 | 7.3497
2000 | 0.032041 | 0.035877 | 11.9723

-----

Workload D: DDL is happening on the related published table, and one insert is done per invalidation
=========================================
In this workload, one of the concurrent transactions contained a DDL which altered the publication in use, and it dropped/added the table that was being decoded. The actual workload is noted in [4] and the runner is attached.

The table below contains the results. Unlike B and C, we expected this workload to show a huge degradation, because each distributed message requires a rebuild of the relsync caches. This means that the caches are discarded and re-built for every transaction. The result showed around 300% regression for 2000 concurrent transactions. IIUC it is difficult to avoid this regression with the current design.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50   | 0.013944 | 0.016460 | 18.0384
100  | 0.014952 | 0.020160 | 34.8322
500  | 0.018535 | 0.043122 | 132.6577
1000 | 0.023426 | 0.072215 | 208.2628
2000 | 0.032055 | 0.131884 | 311.4314

-----

Workload E: DDL is happening on the related published table, and 1000 inserts are done per invalidation
===========================================
This workload was mostly the same as D, but the number of inserted tuples was 1000x. We expected that rebuilding the caches would not be dominant in this workload, so the regression would be small. The actual workload is noted in [5] and the runner is attached.

The table below contains the results. Unlike D, there was no huge regression. This is a reasonable result because decoding 1000 insertions per transaction occupies much more CPU time.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50   | 0.093019 | 0.108820 | 16.9869
100  | 0.188367 | 0.199621 | 5.9741
500  | 0.967896 | 0.970674 | 0.2870
1000 | 1.658552 | 1.803991 | 8.7691
2000 | 3.482935 | 3.682771 | 5.7376

Thanks again to Shlok for measuring the data.

[1]:
1. Created a publication on a single table, say 'tab_conc1'.
2. 'n + 1' sessions are running in parallel.
3. Now:
   All 'n' sessions: BEGIN; Insert a row in table 'tab_conc1';
   In a session: Insert a row in table 'tab_conc1'; Insert a row in table 'tab_conc1';
   All 'n' sessions: Insert a row in table 'tab_conc1'; COMMIT;
4. Run 'pg_logical_slot_get_binary_changes' to get the decoded changes.

[2]:
1. Created a publication on a single table, say 'tab_conc1'.
2. 'n + 1' sessions are running in parallel.
3. Now:
   All 'n' sessions: BEGIN; Insert a row in table 'tab_conc1';
   In a session: BEGIN; ALTER TABLE t1 ADD COLUMN b int; COMMIT; BEGIN; ALTER TABLE t1 DROP COLUMN b; COMMIT;
   All 'n' sessions: Insert a row in table 'tab_conc1'; COMMIT;
4. Run 'pg_logical_slot_get_binary_changes' to get the decoded changes.

[3]:
1. Created a publication on two tables, say 'tab_conc1' and 't1'.
2. 'n + 1' sessions are running in parallel.
3. Now:
   All 'n' sessions: BEGIN; Insert a row in table 'tab_conc1';
   In a session: BEGIN; ALTER PUBLICATION regress_pub1 DROP TABLE t1; COMMIT; BEGIN; ALTER PUBLICATION regress_pub1 ADD TABLE t1; COMMIT;
   All 'n' sessions: Insert a row in table 'tab_conc1'; COMMIT;
4. Run 'pg_logical_slot_get_binary_changes' to get the decoded changes.

[4]:
1. Created a publication on a single table, say 'tab_conc1'.
2. 'n + 1' sessions are running in parallel.
3. Now:
   All 'n' sessions: BEGIN; Insert a row in table 'tab_conc1';
   In a session: BEGIN; Alter publication DROP 'tab_conc1'; COMMIT; BEGIN; Alter publication ADD 'tab_conc1'; COMMIT;
   All 'n' sessions: Insert a row in table 'tab_conc1'; COMMIT;
4. Run 'pg_logical_slot_get_binary_changes' to get the decoded changes.

[5]:
1. Created a publication on a single table, say 'tab_conc1'.
2. 'n + 1' sessions are running in parallel.
3. Now:
   All 'n' sessions: BEGIN; Insert 1000 rows in table 'tab_conc1';
   In a session: BEGIN; ALTER PUBLICATION regress_pub1 DROP 'tab_conc1'; COMMIT; BEGIN; ALTER PUBLICATION regress_pub1 ADD 'tab_conc1'; COMMIT;
   All 'n' sessions: Insert 1000 rows in table 'tab_conc1'; COMMIT;
4. Run 'pg_logical_slot_get_binary_changes' to get the decoded changes.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
RE: long-standing data loss bug in initial sync of logical replication
From: "Hayato Kuroda (Fujitsu)"
Hi hackers,

> Our team (mainly Shlok) did a performance testing with several workloads. Let > me > share them on -hackers. We did it for master/REL_17 branches, and in this post > master's one will be discussed.

I posted benchmark results for master [1]. This post contains the results for a back branch, specifically REL_17_STABLE. The observed trend is the same as on master: frequent DDL on published tables can cause a huge regression, but this is expected. For the other cases, the regression is small or does not exist.

Used source
===========
The base code was the HEAD of REL_17_STABLE, and the compared patch was v16. The large difference is that master tries to preserve relsync caches as much as possible, while REL_17_STABLE discards them more aggressively. Please refer to the recent commits 3abe9d and 588acf6. The executed workloads were mostly the same as in the master case.

-----

Workload A: No DDL operation done in concurrent sessions
======================================
No regression was observed in this workload.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50   | 0.013706 | 0.013398 | -2.2496
100  | 0.014811 | 0.014821 | 0.0698
500  | 0.018288 | 0.018318 | 0.1640
1000 | 0.022613 | 0.022622 | 0.0413
2000 | 0.031812 | 0.031891 | 0.2504

-----

Workload B: DDL is happening but is unrelated to the publication
========================================
A small regression was observed when the concurrency was high, because the DDL transaction sends inval messages to all the concurrent transactions.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50   | 0.013159 | 0.013305 | 1.1120
100  | 0.014718 | 0.014725 | 0.0476
500  | 0.018134 | 0.019578 | 7.9628
1000 | 0.022762 | 0.025228 | 10.8324
2000 | 0.032326 | 0.035638 | 10.2467

-----

Workload C: DDL is happening on the publication but on an unrelated table
============================================
We did not run this workload because we expected the results to be the same as D; 588acf6 is needed to optimize this workload.

-----

Workload D: DDL is happening on the related published table, and one insert is done per invalidation
=========================================
This workload showed a huge regression, the same as on the master branch. This is expected because distributed invalidation messages require all concurrent transactions to rebuild their relsync caches.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50   | 0.013496 | 0.015588 | 15.5034
100  | 0.015112 | 0.018868 | 24.8517
500  | 0.018483 | 0.038714 | 109.4536
1000 | 0.023402 | 0.063735 | 172.3524
2000 | 0.031596 | 0.110860 | 250.8720

-----

Workload E: DDL is happening on the related published table, and 1000 inserts are done per invalidation
============================================
The regression seen in D was not observed. This is the same as the master case and expected, because decoding 1000 tuples per transaction takes much more time.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50   | 0.093019 | 0.108820 | 16.9869
100  | 0.188367 | 0.199621 | 5.9741
500  | 0.967896 | 0.970674 | 0.2870
1000 | 1.658552 | 1.803991 | 8.7691
2000 | 3.482935 | 3.682771 | 5.7376

[1]: https://www.postgresql.org/message-id/OSCPR01MB149661EA973D65EBEC2B60D98F5D32%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Mar 13, 2025 at 2:12 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > Workload C. DDL is happening on publication but on unrelated table > ============================================ > We did not run the workload because we expected this could be same results as D. > 588acf6 is needed to optimize the workload. > > ----- > > Workload D. DDL is happening on the related published table, > and one insert is done per invalidation > ========================================= > This workload had huge regression same as the master branch. This is expected > because distributed invalidation messages require all concurrent transactions > to rebuild relsync caches. > > Concurrent txn | Head (sec) | Patch (sec) | Degradation (%) > ------------------ | ------------ | ------------ | ---------------- > 50 | 0.013496 | 0.015588 | 15.5034 > 100 | 0.015112 | 0.018868 | 24.8517 > 500 | 0.018483 | 0.038714 | 109.4536 > 1000 | 0.023402 | 0.063735 | 172.3524 > 2000 | 0.031596 | 0.110860 | 250.8720 > IIUC, workloads C and D will have regression in back branches, and HEAD will have regression only for workload D. We have avoided workload C regression in HEAD via commits 7c99dc587a and 3abe9dc188. We can backpatch those commits if required, but I think it is better not to do those as scenarios C and D won't be that common, and we should go ahead with the fix as it is. In the future, if we get any way to avoid regression due to scenario-D, then we can do that for the HEAD branch. Thoughts? -- With Regards, Amit Kapila.
RE: long-standing data loss bug in initial sync of logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear Amit,

> IIUC, workloads C and D will have regression in back branches, and > HEAD will have regression only for workload D. We have avoided > workload C regression in HEAD via commits 7c99dc587a and 3abe9dc188.

Right.

> We can backpatch those commits if required, but I think it is better > not to do those as scenarios C and D won't be that common, and we > should go ahead with the fix as it is. In the future, if we get any > way to avoid regression due to scenario-D, then we can do that for the > HEAD branch.

OK, let me share patches for the back branches. Mostly the same fix patch as for master can be used for PG14-PG17, as attached. Regarding PG13, it cannot be applied as-is, so some adjustments are needed. I will share it in upcoming posts.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
- v18_REL_14-0001-Distribute-invalidatons-if-change-in-cata.patch
- v18_REL_15-0001-Distribute-invalidatons-if-change-in-cata.patch
- v18_REL_16-0001-Distribute-invalidatons-if-change-in-cata.patch
- v18_REL_17-0001-Distribute-invalidatons-if-change-in-cata.patch
- v18-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch
On Mon, Mar 17, 2025 at 6:53 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > OK, let me share patched for back branches. Mostly the same fix patched as master > can be used for PG14-PG17, like attached. > A few comments: ============== 1. +SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid) { dlist_iter txn_i; ReorderBufferTXN *txn; + ReorderBufferTXN *curr_txn; + + curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false); The above is used to access invalidations from curr_txn. I am thinking about whether it would be better to expose a new function to get invalidations for a txn based on xid instead of getting ReorderBufferTXN. It would avoid any layering violation and misuse of ReorderBufferTXN by other modules. 2. The patch has a lot of tests to verify the same points. Can't we have one simple test using SQL API based on what Andres presented in an email [1]? 3. I have made minor changes in the comments in the attached. [1] - https://www.postgresql.org/message-id/20231119021830.d6t6aaxtrkpn743y%40awork3.anarazel.de -- With Regards, Amit Kapila.
Attachment
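On point 2, a single SQL-API test in the spirit of Andres's reproducer could look roughly like the sketch below (table, publication and slot names are placeholders; the exact output to assert on depends on the branch).

```
CREATE TABLE d(data text NOT NULL);
CREATE PUBLICATION pb;                         -- publication starts out empty
SELECT pg_create_logical_replication_slot('test_slot', 'pgoutput');

-- Session 1: start a transaction before the table is added to the publication.
BEGIN;
INSERT INTO d VALUES ('before ALTER');

-- Session 2: add the table to the publication while session 1 is still open.
ALTER PUBLICATION pb ADD TABLE d;

-- Session 1: insert again after the ALTER has committed, then commit.
INSERT INTO d VALUES ('after ALTER');
COMMIT;

-- Session 2: a later, independent transaction.
INSERT INTO d VALUES ('later transaction');

-- Inspect the decoded stream; without the fix, the rows inserted after the
-- ALTER can be missing from the output because the relation's cached
-- publication info was built from the pre-ALTER catalog state.
SELECT count(*)
FROM pg_logical_slot_get_binary_changes('test_slot', NULL, NULL,
                                        'proto_version', '1',
                                        'publication_names', 'pb');
```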
RE: long-standing data loss bug in initial sync of logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear Amit,

> A few comments: > ============== > 1. > +SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr > lsn, TransactionId xid) > { > dlist_iter txn_i; > ReorderBufferTXN *txn; > + ReorderBufferTXN *curr_txn; > + > + curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, > InvalidXLogRecPtr, false); > > The above is used to access invalidations from curr_txn. I am thinking > about whether it would be better to expose a new function to get > invalidations for a txn based on xid instead of getting > ReorderBufferTXN. It would avoid any layering violation and misuse of > ReorderBufferTXN by other modules.

Sounds reasonable. I introduced a new function, ReorderBufferGetInvalidations(), which obtains the number of invalidations and their list. ReorderBufferTXN() is not exported anymore.

> 2. The patch has a lot of tests to verify the same points. Can't we > have one simple test using SQL API based on what Andres presented in > an email [1]?

You meant that we need to test only the case reported by Andres, right? The new version does that. To make the test faster, it was migrated to the isolation tester instead of a TAP test.

> 3. I have made minor changes in the comments in the attached.

Thanks, I included them. PSA the new version for PG14 through master. Special thanks to Hou for minimizing the test code.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
- v19_REL_14-0001-Distribute-invalidatons-if-change-in-cata.patch
- v19_REL_15-0001-Distribute-invalidatons-if-change-in-cata.patch
- v19-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch
- v19_REL_17-0001-Distribute-invalidatons-if-change-in-cata.patch
- v19_REL_16-0001-Distribute-invalidatons-if-change-in-cata.patch
RE: long-standing data loss bug in initial sync of logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear hackers,

> Regarding the PG13, it cannot be > applied > as-is thus some adjustments are needed. I will share it in upcoming posts.

Here is a patch set for PG13. Unlike PG14-PG17, the patch could not be created as-is, because:

1. The WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS) does not exist.
2. Thus the ReorderBufferChange for the invalidation does not exist. Our patch tries to distribute it, but that cannot be done as-is.
3. The code assumed that invalidation messages can be added only once.
4. The timing when invalidation messages are consumed is limited:
   a. a COMMAND_ID change is popped,
   b. the start of decoding a transaction, or
   c. the end of decoding a transaction.

The above means that invalidations cannot be executed while a transaction is being decoded. I created two patch sets to resolve the data loss issue: 0001 has fewer code changes but resolves only a part of the issue; 0002 has huge changes but provides a complete solution.

0001 - mostly the same as the patches for the other versions. ReorderBufferAddInvalidations() was adjusted to allow being called several times. As I said above, 0001 cannot execute inval messages while decoding the transactions.
0002 - introduces a new ReorderBufferChange type to indicate inval messages. It would be handled like on PG14+.

Here is an example. Assume that the table foo exists on both nodes, a publication "pub" which publishes all tables, and a subscription "sub" which subscribes to "pub". What happens if the following workload is executed?

```
S1                                  S2
BEGIN;
INSERT INTO foo VALUES (1);
                                    ALTER PUBLICATION pub RENAME TO pub_renamed;
INSERT INTO foo VALUES (2);
COMMIT;
LR -> ?
```

With 0001, tuples (1) and (2) would be replicated to the subscriber, and an error "publication "pub" does not exist" would be raised when new changes are made later. 0001+0002 works more aggressively; the error would be raised when the S1 transaction is decoded. The behavior is the same as for patched PG14-PG17.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
RE: long-standing data loss bug in initial sync of logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear hackers,

The attached patch set contains a proper commit message, which briefly describes the background and the handling. Regarding PG13, the same commit message is used for 0001; 0002 is still rough. The backpatch patches were renamed to .txt to make cfbot happy. Thanks to Hou for working on this.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
- v20_REL_13-0001-Fix-data-loss-in-logical-replication.txt
- v20_REL_13-0002-Backpatch-introducing-invalidation-messag.txt
- v20_REL_14-0001-Fix-data-loss-in-logical-replication.txt
- v20_REL_15-0001-Fix-data-loss-in-logical-replication.txt
- v20_REL_16-0001-Fix-data-loss-in-logical-replication.txt
- v20_REL_17-0001-Fix-data-loss-in-logical-replication.txt
- v20-0001-Fix-data-loss-in-logical-replication.patch
On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > Regarding the PG13, it cannot be > > applied > > as-is thus some adjustments are needed. I will share it in upcoming posts. > > Here is a patch set for PG13. Apart from PG14-17, the patch could be created as-is, > because... > > 1. WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS) does not exist. > 2. Thus the ReorderBufferChange for the invalidation does not exist. > Our patch tries to distribute it but cannot be done as-is. > 3. Codes assumed that invalidation messages can be added only once. > 4. The timing when invalidation messages are consumed is limited: > a. COMMAND_ID change is poped, > b. start of decoding a transaction, or > c. end of decoding a transaction. > > Above means that invalidations cannot be executed while being decoded. > I created two patch sets to resolve the data loss issue. 0001 has less code > changes but could resolve a part of issue, 0002 has huge changes but provides a > complete solution. > > 0001 - mostly same as patches for other versions. ReorderBufferAddInvalidations() > was adjusted to allow being called several times. As I said above, > 0001 cannot execute inval messages while decoding the transacitons. > 0002 - introduces new ReorderBufferChange type to indicate inval messages. > It would be handled like PG14+. > > Here is an example. Assuming that the table foo exists on both nodes, a > publication "pub" which publishes all tables, and a subscription "sub" which > subscribes "pub". What if the workload is executed? > > ``` > S1 S2 > BEGIN; > INSERT INTO foo VALUES (1) > ALTER PUBLICATION pub RENAME TO pub_renamed; > INSERT INTO foo VALUES (2) > COMMIT; > LR -> ? > ``` > > With 0001, tuples (1) and (2) would be replicated to the subscriber. > An error "publication "pub" does not exist" would raise when new changes are done > later. > > 0001+0002 works more aggressively; the error would raise when S1 transaction is decoded. > The behavior is same as for patched PG14-PG17. > I understand that with 0001 the fix is partial in the sense that because invalidations are processed at the transaction end, the changes of concurrent DDL will only be reflected for the next transaction. Now, on one hand, it is prudent to not add a new type of ReorderBufferChange in the backbranch (v13) but the change is not that invasive, so we can go with it as well. My preference would be to go with just 0001 for v13 to minimize the risk of adding new bugs or breaking something unintentionally. Sawada-San, and others involved here, do you have any suggestions on this matter? -- With Regards, Amit Kapila.