Re: Small fixes needed by high-availability tools - Mailing list pgsql-hackers
From | Mihail Nikalayeu |
---|---|
Subject | Re: Small fixes needed by high-availability tools |
Date | |
Msg-id | CADzfLwWLUXXQ=rYg2L6ZREbBu+zsZSBXFk0rJoAPvXKV9=9+9Q@mail.gmail.com Whole thread Raw |
In response to | Re: Small fixes needed by high-availability tools (Ants Aasma <ants.aasma@cybertec.at>) |
Responses |
Re: Small fixes needed by high-availability tools
|
List | pgsql-hackers |
Hello, everyone! > On Mon, 12 May 2025 at 18:42, Andrey Borodin <x4mmm@yandex-team.ru> wrote: >> >> Problem: user might try to cancel locally committed transaction and if we do so we will show non-replicated data ascommitted. This leads to loosing data with UPSERTs. >> > > >> > > Could you explain why specifically UPSERTs would lose data (vs any >> > other user workload) in cancellations during SyncRepWaitForLSN? >> >> Upserts change data conditionally. That's where observed effect affect writtned data. But the root problem is observingnon-replicated data, it only becomes obvious when issuing: "INSERT ON CONFLICT DO >NOTHING" and retrying it. >> 1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication >> 2. JDBC cancels query by after default timeout >> 3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written > Right. I think upsert is a red herring here. Any system trying to > implement idempotency/exactly once delivery will be built around a > similar pattern. Check if a transaction has already been executed, if > not run the transaction, commit, on failure retry. This is > particularly vulnerable to the visibility issue because the retry is > likely to land on the partitioned off leader. I think UPSERT is just one specific case here. Any data that becomes visible and then disappears can cause a variety of issues. For example, the system receives a callback from a payment system, marks an order as "PAID," commits the transaction, and returns a 200 response to the payment system (so it won't retry the callback). However, if the transaction is lost due to a new primary, we end up with an order that is paid in the real world, but the system is unaware of it. And yes, that patch has actually been applied on top of HEAD by most PG cloud providers for over four years now.... [0]. > One idea to solve this problem could be that whenever we cancel > sync_rep_wait, we set some system-wide flag that indicates that any > new transaction must ensure that all the current data is replicated to > the synchronous standby. Once we ensure that we have waited for > pending transactions to replicate, we can toggle back that system-wide > flag. Now, if the system restarts for any reason during such a wait, > we can use your idea to disallow new connections until the standby > quorum is established. It might not necessarily be a flag—it could be some LSN value instead. Also, it's not just about a "new transaction," but about any new snapshot that could see data not yet replicated to the synchronous standby. Best regards, Mikhail. [0]: https://www.postgresql.org/message-id/flat/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD%3Dfh5wZgdkw%40mail.gmail.com#9c71a85cb6009eb60d0361de82772a50
pgsql-hackers by date: