Re: Small fixes needed by high-availability tools - Mailing list pgsql-hackers

From Mihail Nikalayeu
Subject Re: Small fixes needed by high-availability tools
Date
Msg-id CADzfLwWLUXXQ=rYg2L6ZREbBu+zsZSBXFk0rJoAPvXKV9=9+9Q@mail.gmail.com
Whole thread Raw
In response to Re: Small fixes needed by high-availability tools  (Ants Aasma <ants.aasma@cybertec.at>)
Responses Re: Small fixes needed by high-availability tools
List pgsql-hackers
Hello, everyone!

> On Mon, 12 May 2025 at 18:42, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> >> Problem: user might try to cancel locally committed transaction and if we do so we will show non-replicated data
ascommitted. This leads to loosing data with UPSERTs. 
>> > >
>> > > Could you explain why specifically UPSERTs would lose data (vs any
>> > other user workload) in cancellations during SyncRepWaitForLSN?
>>
>> Upserts change data conditionally. That's where observed effect affect writtned data. But the root problem is
observingnon-replicated data, it only becomes obvious when issuing: "INSERT ON CONFLICT DO >NOTHING" and retrying it. 
>> 1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
>> 2. JDBC cancels query by after default timeout
>> 3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written

> Right. I think upsert is a red herring here. Any system trying to
> implement idempotency/exactly once delivery will be built around a
> similar pattern. Check if a transaction has already been executed, if
> not run the transaction, commit, on failure retry. This is
> particularly vulnerable to the visibility issue because the retry is
> likely to land on the partitioned off leader.

I think UPSERT is just one specific case here. Any data that becomes
visible and then disappears can cause a variety of issues.

For example, the system receives a callback from a payment system,
marks an order as "PAID," commits the transaction, and returns a 200
response to the payment system (so it won't retry the callback).
However, if the transaction is lost due to a new primary, we end up
with an order that is paid in the real world, but the system is
unaware of it.

And yes, that patch has actually been applied on top of HEAD by most
PG cloud providers for over four years now.... [0].

> One idea to solve this problem could be that whenever we cancel
> sync_rep_wait, we set some system-wide flag that indicates that any
> new transaction must ensure that all the current data is replicated to
> the synchronous standby. Once we ensure that we have waited for
> pending transactions to replicate, we can toggle back that system-wide
> flag. Now, if the system restarts for any reason during such a wait,
> we can use your idea to disallow new connections until the standby
> quorum is established.

It might not necessarily be a flag—it could be some LSN value instead.
Also, it's not just about a "new transaction," but about any new
snapshot that could see data not yet replicated to the synchronous
standby.

Best regards,
Mikhail.

[0]:
https://www.postgresql.org/message-id/flat/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD%3Dfh5wZgdkw%40mail.gmail.com#9c71a85cb6009eb60d0361de82772a50



pgsql-hackers by date:

Previous
From: "Scott Mead"
Date:
Subject: Disable parallel query by default
Next
From: Greg Sabino Mullane
Date:
Subject: Re: Disable parallel query by default