Thread: Synchronous commit behavior during network outage
Hello all,
I would like to know your opinion on the following behaviour I see in a PostgreSQL setup with synchronous replication.
This behaviour happens in a special use case: two nodes with synchronous replication and the following config (truncated):
- 2 nodes
- synchronous_standby_names='*'
- synchronous_commit=remote_apply
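For reference, on the primary this corresponds to a postgresql.conf fragment along these lines (only the two settings above come from my setup; everything else is left at defaults):

# primary's postgresql.conf (relevant part)
synchronous_standby_names = '*'        # any connected standby can serve as the sync standby
synchronous_commit = remote_apply      # wait until the standby has applied the commit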
With this setup, run the following steps ("LAN down" means the LAN between master and replica is down):
-----------------
postgres=# truncate table a;
TRUNCATE TABLE
postgres=# insert into a values (1); -- LAN up, insert has been applied to replica.
INSERT 0 1
Now I turn off the LAN on the standby server:
postgres=# insert into a values (2); --LAN down, waiting for a confirmation from sync replica. In this situation cancel it (press CTRL+C)
^CCancel request sent
WARNING: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
INSERT 0 1
In the server log, there will be a warning that the commit was performed only locally:
2021-04-12 19:55:53.063 CEST [26104] WARNING: canceling wait for synchronous replication due to user request
2021-04-12 19:55:53.063 CEST [26104] DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
postgres=# insert into a values (2); --LAN down, waiting for a confirmation from sync replica. In this situation cancel it (press CTRL+C)
^CCancel request sent
WARNING: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
INSERT 0 1
postgres=# insert into a values (2); --LAN down, waiting for sync replica, second attempt, cancel it as well (CTRL+C)
^CCancel request sent
WARNING: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
INSERT 0 1
postgres=# update a set n=3 where n=2; --LAN down, waiting for sync replica, cancel it (CTRL+C)
^CCancel request sent
WARNING: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
UPDATE 2
postgres=# update a set n=3 where n=2; -- run the same update; because data from the previous attempt was committed on master, it succeeds, but changes no rows
UPDATE 0
postgres=# select * from a;
n
---
1
3
3
(3 rows)
postgres=#
------------------------
Now the sync replica's table contains only the value 1 (no other rows); the data is out of sync. That much is expected: after the LAN is restored, the data will sync again. But if the primary fails and we fail over to the replica before the LAN is back up, or if the primary's storage is destroyed before the data syncs to the replica, we lose data even though the client received a successful commit (with a warning).
From the synchronous_commit=remote_write level and "higher", I would expect that when the remote action (write, flush, or apply) has not happened, I would not receive a commit confirmation at all (not even one with a warning). In other words: no acknowledgement from the sync replica, no commit confirmation from the primary; if someone performs the steps above, the transaction should never report success.
This can cause issues if the application receives a confirmation of success and performs some follow-up steps, e.g. creates a user account and then sends a request to the mail system or VPN system to create a matching account. If the scenario above happens, a VPN account can exist that has no record in the central database, which can be a security issue.
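One way an application could guard such follow-up steps (a sketch of what I have in mind, not something PostgreSQL does automatically): record the WAL position right after the commit and proceed only once a sync standby has flushed past it:

-- on the primary, right after the commit:
SELECT pg_current_wal_lsn();   -- say it returns '0/3000208'

-- trigger the follow-up actions only once this returns a row:
SELECT application_name
FROM pg_stat_replication
WHERE sync_state = 'sync'
  AND flush_lsn >= '0/3000208'::pg_lsn;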
I hope I explained it sufficiently. :-)
Do you think it would be possible to implement a process that would solve this use case?
Thank you
Ondrej
Hi Ondřej,

Thanks for the report. It seems to be a clear violation of what is promised in the docs. Although it's unlikely that someone implemented an application which deals with important data and "pressed Ctrl+C" as it's done in psql, so this might not be such a critical issue after all. BTW, what version of PostgreSQL are you using?

On Mon, Apr 19, 2021 at 10:13 PM Ondřej Žižka <ondrej.zizka@stratox.cz> wrote:
> [...]

-- 
Best regards,
Aleksander Alekseev
On 4/20/21 6:23 PM, Aleksander Alekseev wrote:
> Thanks for the report. It seems to be a clear violation of what is
> promised in the docs. Although it's unlikely that someone implemented
> an application which deals with important data and "pressed Ctrl+C" as
> it's done in psql. So this might not be such a critical issue after
> all. BTW what version of PostgreSQL are you using?

Which part of the docs does this contradict? With Ctrl+C the application *did not* receive confirmation - the commit was interrupted before fully completing. In a way, it's about the same situation as if a regular commit was interrupted randomly: it might have happened before the commit log got updated, or maybe right after it, which determines the outcome.

What I find a bit strange is that this inserts 1, 2, 2, 2 locally, and yet we end up with just two rows with 2 (before the update). I don't see why a network outage should have such a consequence.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello Aleksander,

Thank you for the reaction. This was tested on version 13.2.

There are also other possible situations with the same setup and a similar issue:

-----------------
When the backend process on the server fails...

On postgresql1:
tecmint=# select * from a;   --> LAN on sync replica is OK
 id
----
  1
(1 row)

tecmint=# insert into a values (2);   ---> LAN on sync replica is DOWN and the insert is waiting. During this time, kill the backend process on the PostgreSQL server for this session
WARNING:  canceling the wait for synchronous replication and terminating connection due to administrator command
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
tecmint=# select * from a;
 id
----
  1
  2
(2 rows)

tecmint=#   ---> LAN on sync replica is still DOWN

The postgres session is restored after the backend process fails. When you run the select on the master, it still looks OK, but the data is still not replicated to the sync replica. If we lost the master now, we would lose this data as well.

**************
Another case
**************

Kill the client process.

tecmint=# select * from a;
 id
----
  1
  2
  3
(3 rows)
tecmint=#   --> Disconnect the sync replica now. LAN on replica is DOWN
tecmint=# insert into a values (4);   --> Kill the client process
Terminated
xzizka@service-vm:~$ psql -U postgres -h 192.168.122.6 -p 5432 -d tecmint
Password for user postgres:
psql (13.2 (Debian 13.2-1.pgdg100+1))
Type "help" for help.

tecmint=# select * from a;
 id
----
  1
  2
  3
(3 rows)

tecmint=#   --> Number 4 is not there. Now switch the LAN on the sync replica back ON.

----------

Result from the sync replica after the LAN is UP again:
tecmint=# select * from a;
 id
----
  1
  2
  3
  4
(4 rows)

In this situation, try to insert the number 4 into the table again:

tecmint=# select * from a;
 id
----
  1
  2
  3
(3 rows)

tecmint=# insert into a values (4);
ERROR:  duplicate key value violates unique constraint "a_pkey"
DETAIL:  Key (id)=(4) already exists.
tecmint=#

This is really strange... The application can be confused: it is not possible to insert a record which is apparently not there, yet some systems which use the sync node as a read replica may already have read that record from the sync replica database and taken steps that can cause issues and be hard to track down.

Even if one argues that it would be hard to send the Ctrl+C to the database from a client, the two situations I described here can happen for real.

What do you think?
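As a side note, the "kill the backend process" step can also be done from SQL in a second superuser session; a minimal sketch (the PID is whatever pg_stat_activity reports for the stuck session, so 12345 below is just a placeholder):

SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event = 'SyncRep';        -- the backend stuck waiting for the standby ACK

SELECT pg_terminate_backend(12345);  -- placeholder PID from the query above

This produces exactly the "terminating connection due to administrator command" warning shown above.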
Thank you and regards
Ondrej

On 20/04/2021 17:23, Aleksander Alekseev wrote:
> [...]
Hi!

This is a known issue with synchronous replication [1]. Without patching the source code, you can mitigate the negative side of such partially-committed transactions by injecting a dummy modification into otherwise unmodifying operations.

On 20.04.2021 19:23, Aleksander Alekseev wrote:
> Although it's unlikely that someone implemented
> an application which deals with important data and "pressed Ctrl+C" as
> it's done in psql.

Some client libraries have a feature to cancel a session after a client-specified deadline expires, which has an effect similar to Ctrl+C in psql [2]. Hence, this case might occur quite often when an application interacts with the database.

> On Mon, Apr 19, 2021 at 10:13 PM Ondřej Žižka <ondrej.zizka@stratox.cz> wrote:
> > From the synchronous_commit=remote_write level and "higher", I would expect that when the remote action (write, flush, or apply) has not happened, I would not receive a commit confirmation at all (not even one with a warning). [...]

The warning has to be taken into account here, and such a commit must not be treated as *successful*.

1. https://www.postgresql.org/message-id/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
2. https://www.postgresql.org/message-id/CANtu0ogbu%2By6Py963p-zKJ535b8zm5AOq7zkX7wW-tryPYi1DA%40mail.gmail.com

-- 
Regards,
Maksim Milyutin
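P.S. A minimal sketch of the dummy-modification workaround, assuming a small helper table created just for this purpose (the table and its name are illustrative):

CREATE TABLE syncrep_probe (id int PRIMARY KEY, ts timestamptz NOT NULL);
INSERT INTO syncrep_probe VALUES (1, now());

-- Before acting on a transaction that returned with the "committed
-- locally" warning, issue a dummy write. Its commit waits for the
-- synchronous standby; because WAL is replicated in order, a successful
-- return implies all earlier commits have reached the standby too.
UPDATE syncrep_probe SET ts = now() WHERE id = 1;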
Hello Maksim,

I know your post [1]; that thread is why we performed more tests (see my other email in this thread). We are trying to implement an RPO=0 solution using PostgreSQL. Knowing this... would it be possible to build an RPO=0 solution with PostgreSQL?

Ondrej

On 20/04/2021 18:51, Maksim Milyutin wrote:
> [...]
I am sorry, I forgot to mention that in the second situation I added a primary key to the table.

Ondrej

On 20/04/2021 18:49, Ondřej Žižka wrote:
> [...]
On 20.04.2021 19:38, Tomas Vondra wrote:
> Which part of the docs does this contradict?

I think Aleksander refers to the following phrase in the docs:

"The guarantee we offer is that the application will not receive explicit acknowledgment of the successful commit of a transaction until the WAL data is known to be safely received by all the synchronous standbys." [1]

And IMO the confusion here is about the notion of "successful commit". Does a warning attached to a received commit message make it not *successful*? I think we have to explicitly mention the cancellation and session-termination cases in the docs, to avoid ambiguity in understanding the phrase above.

1. https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA

-- 
Regards,
Maksim Milyutin
This can be an option for us in our case. But there also needs to be a process for detecting these "stuck commits" and for invalidating/removing them, because in reality, if the app/user does not see the change in the database, it will try to insert/delete it again. If commits just get stuck without management, they will form a queue which can contain two similar inserts/deletes, and that can again cause issues (like with the primary key I mentioned before).

So the process should in this case be:

- DBA receives information that write operations are stuck (DBA, in coordination with the infrastructure team, disconnects all clients and prevents new ones from creating connections).
- DBA recognizes that there is an issue in the communication between the primary and the sync replica (which caused the issue with the propagation of commits).
- DBA sees that there are some commits in the "stuck state".
- DBA removes these stuck commits. Note: because the client never received a confirmation of a successful commit, the changes the client tried to perform cannot be considered successful.
- DBA and infrastructure team restore the communication between the server nodes so commits can again propagate from the primary node to the sync replica.
- DBA and infrastructure team allow new connections to the database.

This approach would require external monitoring and alerting, but I would say that this is an acceptable solution. Would your patch be able to do that?
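For the monitoring part, I imagine something along these lines would work (a sketch; wait-event names as exposed by pg_stat_activity in PostgreSQL 13):

-- sessions stuck waiting for the standby acknowledgement:
SELECT pid, usename, query_start, query
FROM pg_stat_activity
WHERE wait_event_type = 'IPC' AND wait_event = 'SyncRep';

-- state of the replication link; no row with sync_state = 'sync' while
-- sessions above are waiting points at the outage:
SELECT application_name, client_addr, state, sync_state
FROM pg_stat_replication;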
Thank you
Ondrej
One idea here is to make the backend ignore query cancellation/backend termination while waiting for the synchronous commit ACK. This way the client never reads data that was never flushed remotely. The problem with this approach is that your backends get stuck until the commit log record is flushed on the remote side. Also, the client can see data not flushed remotely if the server crashes and comes back online. You can prevent the latter case by making a SyncRepWaitForLSN before opening up the connections to non-superusers. I have a working prototype of this logic; if there is enough interest I can post the patch.

On Tue, Apr 20, 2021 at 11:25 AM Ondřej Žižka <ondrej.zizka@stratox.cz> wrote:
> [...]
Hi Tomas,

> > Thanks for the report. It seems to be a clear violation of what is
> > promised in the docs. [...]
>
> Which part of the docs does this contradict?

The documentation for synchronous_commit = remote_apply explicitly states [1]:

"""
When set to remote_apply, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and applied it, so that it has become visible to queries on the standby(s), and also written to durable storage on the standbys.
"""

Here the commit on the master happened before receiving replies from the standby(s).

[1]: https://www.postgresql.org/docs/13/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT

-- 
Best regards,
Aleksander Alekseev
On Tue, 2021-04-20 at 18:49 +0100, Ondřej Žižka wrote:
> tecmint=# select * from a; --> LAN on sync replica is OK
> id
> ----
> 1
> (1 row)
>
> tecmint=# insert into a values (2); ---> LAN on sync replica is DOWN and
> insert is waiting. During this time kill the background process on the
> PostgreSQL server for this session
> WARNING: canceling the wait for synchronous replication and terminating
> connection due to administrator command
> DETAIL: The transaction has already committed locally, but might not
> have been replicated to the standby.
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
>
> tecmint=# select * from a;
> id
> ----
> 1
> 2
> (2 rows)
It is well known that synchronous replication is subject to that problem,
since it doesn't use the two-phase commit protocol.
What surprises me is that this is a warning.
In my opinion it should be an error.
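For reference, PostgreSQL does expose two-phase-commit primitives at the SQL level (they require max_prepared_transactions > 0), which an external coordinator could drive; a minimal sketch, with an arbitrarily chosen transaction identifier:

BEGIN;
INSERT INTO a VALUES (5);
PREPARE TRANSACTION 'txn_a_5';  -- durably prepared, survives a crash, not yet visible
-- an external coordinator records the decision, then:
COMMIT PREPARED 'txn_a_5';      -- or: ROLLBACK PREPARED 'txn_a_5';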
Yours,
Laurenz Albe
Hello,
> You can monitor the pg_stat_activity for the SYNC_REP_WAIT_FLUSH wait types to detect this.
I tried this and I see wait_event_type Client or IPC and wait_event ClientRead or SyncRep. In which situation can I see the SYNC_REP_WAIT_FLUSH value?
> You should consider these as in-doubt transactions and the client should retry. Again, this can happen in a normal server-crash case too; for example, a transaction committed on the server and the server crashed before sending the acknowledgement. *The client should know how to handle these cases.*
> This can be an option for us in our case. But there also needs to be a
> process for detecting these "stuck commits" and for invalidating/removing
> them, because in reality, if the app/user does not see the change in the
> database, it will try to insert/delete it again. If commits just get stuck
> without management, they will form a queue which can contain two similar
> inserts/deletes, and that can again cause issues (like with the primary
> key I mentioned before).

This shouldn't be a problem, as the previous transaction is still holding the locks and the new transaction is blocked behind it. Outside of sync replication, this can happen today too, with glitches/timeouts/retries between the client and the server. Am I missing something?

> So the process should in this case be:
>
> - DBA receives information that write operations are stuck (DBA, in
> coordination with the infrastructure team, disconnects all clients and
> prevents new ones from creating connections).

You can monitor pg_stat_activity for the SYNC_REP_WAIT_FLUSH wait types to detect this.

> - DBA recognizes that there is an issue in the communication between the
> primary and the sync replica (which caused the issue with the propagation
> of commits).
> - DBA sees that there are some commits in the "stuck state".
> - DBA removes these stuck commits. Note: because the client never received
> a confirmation of a successful commit, the changes the client tried to
> perform cannot be considered successful.

You should consider these as in-doubt transactions, and the client should retry. Again, this can happen in a normal server-crash case too; for example, a transaction committed on the server and the server crashed before sending the acknowledgement. The client should know how to handle these cases.

> - DBA and infrastructure team restore the communication between the server
> nodes so commits can again propagate from the primary node to the sync
> replica.
> - DBA and infrastructure team allow new connections to the database.
>
> This approach would require external monitoring and alerting, but I would
> say that this is an acceptable solution. Would your patch be able to do
> that?

My patch handles ignoring the cancel events. I ended up keeping the other logic (blocking superuser connections) in the client_authentication_hook.

There is a third problem that I didn't talk about in this thread: async clients (including logical decoding and replication clients) can get ahead of the new primary, and there is no easy way to undo those changes. For this problem, we need to implement some protocol in the WAL sender so that it sends the log to the consumer only up to the flush LSN of the standby/quorum replicas. This is something I am working on right now.
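To see how far ahead such async consumers can get, the positions involved are visible on the primary (a sketch; these pg_stat_replication columns exist since PostgreSQL 10):

SELECT application_name, sync_state,
       pg_current_wal_lsn() AS primary_insert_lsn,
       sent_lsn,    -- what this consumer has already received
       flush_lsn    -- what it has durably flushed
FROM pg_stat_replication;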
Hi Ondrej!

> On 19 Apr 2021, at 22:19, Ondřej Žižka <ondrej.zizka@stratox.cz> wrote:
>
> Do you think it would be possible to implement a process that would solve this use case?

Feel free to review the patch fixing this at [0]. It's classified as "Server Features", but I'm sure it's a bug fix.

Yandex.Cloud PG has been running with this patch for more than half a year, because we cannot afford losing data in HA clusters.

It's a somewhat incomplete solution, because a PG restart or crash recovery will make waiting transactions visible. But we protect against this on the HA tool's side.

Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/33/2402/
Hello Andrey,

I went through the thread for your patch, and it seems to me an acceptable solution...

> The only case the patch does not handle is a sudden backend crash - Postgres will recover without a restart.

We also use an HA tool (Patroni). If the whole machine fails, it will find a new master and it should be OK. We use a 4-node setup (2 sync replicas and 1 async from every replica). If there is an issue just with the sync replica (async operating normally) and the master fails completely in this situation, it will be solved by Patroni (the async replica becomes the new sync); but if it is just the backend process, the master will not fail over and the changes will still be visible...

If the sync replica outage is temporary, it resolves itself when the node establishes a replication slot again. If the outage is "long", Patroni will remove the "old" sync replica from the cluster and the async replica reading from the master becomes the new sync. So yes... in a 2-node setup this can be an issue, but in a 4-node setup this seems like a solution to me.

The only situation I can imagine is when the client connections use a different network than the replication network and the replication network goes down completely while the client network stays up. In that case, the master can be an "isolated island", and if it fails, we can lose the changed data. Is this situation also covered in your model: "transaction effects should not be observable on primary until requirements of synchronous_commit are satisfied"?

Do you agree with my thoughts? Maybe it would be possible to implement this in PostgreSQL with a note in the documentation that a multi-node (>= 3 nodes) cluster is necessary.

Regards
Ondrej

On 22/04/2021 05:55, Andrey Borodin wrote:
> [...]
Thanks for reviewing, Ondřej!

> On 26 Apr 2021, at 22:01, Ondřej Žižka <ondrej.zizka@stratox.cz> wrote:
>
> Hello Andrey,
>
> I went through the thread for your patch and it seems to me an acceptable solution...
>
>> The only case the patch does not handle is a sudden backend crash - Postgres will recover without a restart.
>
> We also use an HA tool (Patroni). If the whole machine fails, it will find a new master and it should be OK. We use a 4-node setup (2 sync replicas and 1 async replica from every sync replica). If there is an issue with just the sync replica (the async operating normally) and the master fails completely in this situation, it will be handled by Patroni (the async replica becomes the new sync). But if it is just the backend process that crashes, the master will not fail over and the changes will still be visible...
>
> If the sync replica outage is temporary, it will resolve itself when the node establishes a replication slot again... If the outage is "long", Patroni will remove the "old" sync replica from the cluster and the async replica reading from the master becomes the new sync. So yes... In a 2-node setup this can be an issue, but in a 4-node setup this looks like a solution to me.
>
> The only situation I can imagine is one where the client connections use a different network than the replication network, the replication network is down completely, but the client network is up. In that case the master can be an "isolated island", and if it fails, we can lose the changed data.

It is, in fact, a very common type of network partition.

> Is this situation also covered by your model: "transaction effects should not be observable on primary until requirements of synchronous_commit are satisfied"?

Yes - provided that synchronous_commit_cancelation = off, no backend crash occurs, and the HA tool does not start the PostgreSQL service when in doubt that another primary may exist.

> Do you agree with my thoughts?

I could not understand your reasoning about 2 and 4 nodes. Can you please clarify a bit how a 4-node setup can help prevent visibility of committed-locally-but-canceled transactions?

I do not think we can classify network partitions as "temporary" and "long". Due to the distributed nature of the system, network partitions are eternal and momentary - simultaneously. And if node A can access node B and node C, this neither implies that B can access C, nor that B can access A.

> Maybe it would be possible to implement this in PostgreSQL with a note in the documentation that a multinode (>=3 nodes) cluster is necessary.

PostgreSQL does not provide any fault detection or automatic failover. Documenting anything wrt failover is the responsibility of the HA tool.

Thanks!

Best regards, Andrey Borodin.
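For reference, synchronous_commit_cancelation is a GUC introduced by the proposed patch [0], not part of stock PostgreSQL. With the patch applied, enabling the stricter behavior would presumably look something like this (a sketch; the exact GUC context depends on the patch version):

  -- Sketch only: this GUC exists only with the patch from [0] applied.
  ALTER SYSTEM SET synchronous_commit_cancelation = off;
  SELECT pg_reload_conf();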
On 06/05/2021 06:09, Andrey Borodin wrote:
> I could not understand your reasoning about 2 and 4 nodes. Can you please clarify a bit how a 4-node setup can help prevent visibility of committed-locally-but-canceled transactions?

Hello Andrey,

The initial request (for us) was to have a geo cluster with 2 locations where it would be possible to have 2 sync replicas even in case of failure of one location. This means having 2 nodes in every location (4 together). If one location fails completely (broken network connection), Patroni will choose the working location (a 5-node etcd across 3 locations ensures this). In the initial state, there is 1 sync replica in each location and one async replica in each location using the sync replica in its location as a source.

Let's have the following initial situation:

1) Nodes pg11 and pg12 are in one location; nodes pg21 and pg22 are in another location.
2) Nodes pg11 and pg21 are the sync pair.
3) Node pg12 is an async replica from pg11.
4) Node pg22 is an async replica from pg21.
5) Master is pg11.

When the committed-locally-but-canceled situation happens and there is a problem only with node pg21 (not with the network between nodes), the async replica pg12 will receive the local commit from pg11 just after the local commit on pg11, even if the cancellation happens. So there will be a situation where the commit is present on both pg11 and pg12. If pg11 fails, the transaction already exists on pg12 and this node will be selected as the new leader (latest LSN). There is a window between the time the transaction is committed locally and the time it has been sent to the async replica during which we can lose data, but I expect this to be milliseconds (maybe less).

It will not prevent visibility, but it will ensure that the data is not lost; in that case data can be visible on the leader even if it is not present on the sync replica, because continuity of data persistence is ensured by the async replica.

I hope I explained it understandably.

Regards
Ondrej
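A minimal sketch of the replication settings this topology implies on pg11 while it is the leader (assuming each standby connects with an application_name equal to its node name; in practice Patroni manages these values itself):

  -- On pg11 (primary): pg21 is the synchronous standby.
  ALTER SYSTEM SET synchronous_standby_names = 'pg21';
  ALTER SYSTEM SET synchronous_commit = 'remote_apply';
  SELECT pg_reload_conf();

  -- pg12 and pg22 stay asynchronous; each standby points at its source
  -- via primary_conninfo, e.g. on pg22 (cascading from pg21):
  --   primary_conninfo = 'host=pg21 ... application_name=pg22'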
On Tue, 2021-04-20 at 14:19 -0700, SATYANARAYANA NARLAPURAM wrote:
> One idea here is to make the backend ignore query
> cancellation/backend termination while waiting for the synchronous
> commit ACK. This way the client never reads data that was never
> flushed remotely. The problem with this approach is that your
> backends get stuck until your commit log record is flushed on the
> remote side. Also, the client can see data not flushed remotely
> if the server crashes and comes back online. You can prevent the
> latter case by making a SyncRepWaitForLSN before opening up the
> connections to the non-superusers. I have a working prototype of this
> logic; if there is enough interest I can post the patch.

I didn't see a patch here yet, so I wrote a simple one for consideration (attached). The problem exists for both cancellation and termination requests.

The patch adds a GUC that makes SyncRepWaitForLSN keep waiting. It does not ignore the requests; for instance, a termination request will still be honored when it's done waiting for sync rep.

The idea of this GUC is not to wait forever (obviously), but to allow the administrator (or an automated network agent) to be in control of the logic: if the primary is non-responsive, the administrator can decide to fail over, knowing that all visible transactions on the primary are durable on the standby (because any transaction that didn't make it to the standby also didn't release its locks yet). If the standby is non-responsive, the administrator can intervene with something like:

  ALTER SYSTEM SET synchronous_standby_names = '';
  SELECT pg_reload_conf();

which will disable sync rep, allowing the primary to complete the query and continue on without the standby; but in that case the admin must be sure not to fail over until there's a new standby fully caught up.

The patch may be somewhat controversial, so I'll wait for feedback before documenting it properly.

Regards,
Jeff Davis
Attachment
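One way an administrator (or HA agent) might check, before taking either action, whether backends are actually stuck waiting on synchronous replication (a diagnostic sketch using the standard wait-event instrumentation):

  -- Backends blocked in SyncRepWaitForLSN show up with the 'SyncRep'
  -- IPC wait event in pg_stat_activity.
  SELECT pid, usename, state, query_start
  FROM pg_stat_activity
  WHERE wait_event_type = 'IPC' AND wait_event = 'SyncRep';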
> On 29 Jun 2021, at 03:56, Jeff Davis <pgsql@j-davis.com> wrote:
>
> The patch may be somewhat controversial, so I'll wait for feedback
> before documenting it properly.

The patch seems similar to [0]. But I like your wording :)
I'd be happy if we go with any version of these ideas.

Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/33/2402/
On Tue, 2021-06-29 at 11:48 +0500, Andrey Borodin wrote:
>> On 29 Jun 2021, at 03:56, Jeff Davis <pgsql@j-davis.com> wrote:
>>
>> The patch may be somewhat controversial, so I'll wait for feedback
>> before documenting it properly.
>
> The patch seems similar to [0]. But I like your wording :)
> I'd be happy if we go with any version of these ideas.

Thank you, somehow I missed that one; we should combine the CF entries.

My patch also covers the backend termination case. Is there a reason you left that case out?

Regards,
Jeff Davis
> On 29 Jun 2021, at 23:35, Jeff Davis <pgsql@j-davis.com> wrote:
>
> Thank you, somehow I missed that one; we should combine the CF entries.
>
> My patch also covers the backend termination case. Is there a reason
> you left that case out?

Yes, backend termination is used by the HA tool before rewinding the node. Initially I considered treating termination as PANIC and got a ton of core dumps during failovers on drills.

There is one more caveat we need to fix: we should prevent instant crash recovery from happening. The HA tool must know that our process was restarted. Consider the following scenario:

1. Node A is the primary with sync rep.
2. A goes through a network partition; somewhere node B is promoted.
3. All backends of A are stuck in sync rep until the HA tool discovers that A is the failed node.
4. One backend crashes with a segfault in some buggy extension, or OOM, or whatever.
5. The Postgres server performs restartless crash recovery, making local-but-not-replicated data visible.

We should prevent 5 just as we prevent cancels. The HA tool will discover the postmaster failure and will recheck in the coordination system whether it can bring Postgres up locally.

Thanks!

Best regards, Andrey Borodin.
On Wed, 2021-06-30 at 17:28 +0500, Andrey Borodin wrote:
>> My patch also covers the backend termination case. Is there a reason
>> you left that case out?
>
> Yes, backend termination is used by the HA tool before rewinding the
> node.

Can't you just disable sync rep first (using ALTER SYSTEM SET synchronous_standby_names = ''), which will unstick the backend, and then terminate it?

If you don't handle the termination case, then there's still a chance for the transaction to become visible to other clients before it's replicated.

> There is one more caveat we need to fix: we should prevent instant
> crash recovery from happening.

That can already be done with the restart_after_crash GUC.

Regards,
Jeff Davis
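restart_after_crash is a standard GUC, so this part needs no patch. As a sketch of the intended setup: with it off, a backend crash brings down the whole instance instead of triggering automatic restartless crash recovery, handing control back to the HA tool:

  -- With restart_after_crash = off, the postmaster exits after a backend
  -- crash instead of reinitializing, so local-but-not-replicated data is
  -- not silently made visible again; the HA tool decides what happens next.
  ALTER SYSTEM SET restart_after_crash = off;
  SELECT pg_reload_conf();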
> On 2 Jul 2021, at 10:59, Jeff Davis <pgsql@j-davis.com> wrote:
>
> Can't you just disable sync rep first (using ALTER SYSTEM SET
> synchronous_standby_names = ''), which will unstick the backend, and
> then terminate it?

If the failover happens due to an unresponsive node, we cannot just turn off sync rep. We need to have some spare connections for that (the number of stuck backends will skyrocket during network partitioning). We need available descriptors and some memory to fork a new backend. We will need to re-read the config. We need time to try, after all. In some failures we may lack any of these.

Partial degradation is already a hard task. Without the ability to easily terminate running Postgres, an HA tool will often resort to SIGKILL.

> If you don't handle the termination case, then there's still a chance
> for the transaction to become visible to other clients before it's
> replicated.

Termination is an admin command; they know what they are doing. Cancellation is part of the user protocol.

BTW, can we have two GUCs? So that HA tool developers can decide on their own which guarantees they provide?

>> There is one more caveat we need to fix: we should prevent instant
>> crash recovery from happening.
>
> That can already be done with the restart_after_crash GUC.

Oh, I didn't know about it; we will use it. Thanks!

Best regards, Andrey Borodin.
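As an aside (my suggestion, not something raised in the thread), the standard superuser_reserved_connections setting addresses part of the "no spare connections" concern by keeping a few slots usable only by superusers, so an HA tool can still connect when ordinary slots are exhausted:

  -- Postmaster-level parameter: takes effect only after a server restart.
  ALTER SYSTEM SET superuser_reserved_connections = 5;

It does not help with descriptor or memory exhaustion, though.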
On Fri, 2021-07-02 at 11:39 +0500, Andrey Borodin wrote:
> If the failover happens due to an unresponsive node, we cannot just turn
> off sync rep. We need to have some spare connections for that (the
> number of stuck backends will skyrocket during network partitioning).
> We need available descriptors and some memory to fork a new backend. We
> will need to re-read the config. We need time to try, after all.
> In some failures we may lack any of these.

I think it's a good point that, when things start to go wrong, they can go very wrong very quickly.

But until you've disabled sync rep, the primary will essentially be down for writes, whether using this new feature or not. Even if you can terminate some backends to try to free space, the application will just make new connections that will get stuck the same way.

You can avoid the "fork backend" problem by keeping a connection always open from the HA tool, or by editing the conf to disable sync rep and issuing SIGHUP instead. Granted, that still takes some memory.

> Partial degradation is already a hard task. Without the ability to
> easily terminate running Postgres, an HA tool will often resort to
> SIGKILL.

When the system is really wedged as you describe (waiting on sync rep, tons of connections, and low memory), what information do you expect the HA tool to be able to collect, and what actions do you expect it to take?

Presumably, you'd want it to disable sync rep at some point to get back online. Where does SIGTERM fit into the picture?

>> If you don't handle the termination case, then there's still a chance
>> for the transaction to become visible to other clients before it's
>> replicated.
>
> Termination is an admin command; they know what they are doing.
> Cancellation is part of the user protocol.

From the pg_terminate_backend() docs: "This is also allowed if the calling role is a member of the role whose backend is being terminated or the calling role has been granted pg_signal_backend", so it's not really an admin command. Even for an admin, it might be hard to understand why terminating a backend could result in losing a visible transaction.

I'm not really seeing two use cases here for two GUCs. Are you sure you want to disable only cancels but allow termination to proceed?

Regards,
Jeff Davis
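To make that concrete (standard PostgreSQL behavior, sketched with a hypothetical role name): any member of the predefined pg_signal_backend role can terminate other non-superuser backends, including ones still waiting on sync rep:

  -- app_role is hypothetical; pg_signal_backend is a predefined role.
  GRANT pg_signal_backend TO app_role;

  -- app_role could then cut short another session's sync-rep wait:
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE wait_event_type = 'IPC' AND wait_event = 'SyncRep';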
> On 3 Jul 2021, at 01:15, Jeff Davis <pgsql@j-davis.com> wrote:
>
> I think it's a good point that, when things start to go wrong, they can
> go very wrong very quickly.
>
> But until you've disabled sync rep, the primary will essentially be
> down for writes, whether using this new feature or not. Even if you can
> terminate some backends to try to free space, the application will just
> make new connections that will get stuck the same way.

Surely I'm talking about terminating the postmaster, not individual backends. But the postmaster will need to terminate each running query. We surely need a way to stop the whole instance without issuing a single query. And I do not like kill -9 for this purpose.

> You can avoid the "fork backend" problem by keeping a connection always
> open from the HA tool, or by editing the conf to disable sync rep and
> issuing SIGHUP instead. Granted, that still takes some memory.
>
> When the system is really wedged as you describe (waiting on sync rep,
> tons of connections, and low memory), what information do you expect
> the HA tool to be able to collect, and what actions do you expect it to
> take?

The HA tool is not going to collect anything. It just calls pg_ctl stop [0] or its equivalent.

> Presumably, you'd want it to disable sync rep at some point to get back
> online. Where does SIGTERM fit into the picture?

The HA tool is going to terminate the running instance, rewind it, switch it to the new timeline, and enroll it into the cluster again as a standby.

>>> If you don't handle the termination case, then there's still a chance
>>> for the transaction to become visible to other clients before it's
>>> replicated.
>>
>> Termination is an admin command; they know what they are doing.
>> Cancellation is part of the user protocol.
>
> From the pg_terminate_backend() docs: "This is also allowed if the
> calling role is a member of the role whose backend is being terminated
> or the calling role has been granted pg_signal_backend", so it's not
> really an admin command. Even for an admin, it might be hard to
> understand why terminating a backend could result in losing a visible
> transaction.

OK, I see that backend termination is not described as an admin command. We cannot prevent users from doing stupid things; they are able to delete their data anyway.

> I'm not really seeing two use cases here for two GUCs. Are you sure you
> want to disable only cancels but allow termination to proceed?

Yes, I'm sure. I ran production with termination disabled for some weeks; cluster repair was much slower. For some reason, kill-9-ed instances were successfully rewound much less often. But maybe I did something wrong. If we can stop the whole instance the same way as we did without the proposed GUC activated, there is no problem at all.

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/zalando/patroni/blob/master/patroni/postgresql/postmaster.py#L155
On Sat, 2021-07-03 at 14:06 +0500, Andrey Borodin wrote:
>> But until you've disabled sync rep, the primary will essentially be
>> down for writes, whether using this new feature or not. Even if you can
>> terminate some backends to try to free space, the application will just
>> make new connections that will get stuck the same way.
>
> Surely I'm talking about terminating the postmaster, not individual
> backends. But the postmaster will need to terminate each running query.
> We surely need a way to stop the whole instance without issuing a
> single query. And I do not like kill -9 for this purpose.

kill -6 would suffice.

I see the point that you don't want this to interfere with an administrative shutdown. But it seems like most shutdowns will need to escalate to SIGABRT for cases where things are going badly wrong (low memory, etc.) anyway. I don't see a better solution here.

I don't fully understand why you'd be concerned about cancellation but not concerned about similar problems with termination, but if you think two GUCs are important, I can do that.

Regards,
Jeff Davis
> On 3 Jul 2021, at 23:44, Jeff Davis <pgsql@j-davis.com> wrote:
>
>> Surely I'm talking about terminating the postmaster, not individual
>> backends. But the postmaster will need to terminate each running query.
>> We surely need a way to stop the whole instance without issuing a
>> single query. And I do not like kill -9 for this purpose.
>
> kill -6 would suffice.

SIGABRT is expected to generate a core dump, isn't it? Node failover is a somewhat expected state in an HA system.

> I see the point that you don't want this to interfere with an
> administrative shutdown. But it seems like most shutdowns will need to
> escalate to SIGABRT for cases where things are going badly wrong (low
> memory, etc.) anyway. I don't see a better solution here.

In my experience SIGTERM has coped fine so far.

> I don't fully understand why you'd be concerned about cancellation but
> not concerned about similar problems with termination, but if you think
> two GUCs are important, I can do that.

I think 2 GUCs are a better solution than 1 GUC disabling both cancellation and termination.

It would be great if some other HA tool developers would chime in.

Thanks!

Best regards, Andrey Borodin.
On Fri, 2021-07-09 at 23:10 +0500, Andrey Borodin wrote:
> In my experience SIGTERM has coped fine so far.

OK. I don't think ignoring SIGTERM the way my patch does it is a great solution, and it's not getting much support, so I think I'll back away from that idea.

I had a separate discussion with Andres, and he made a distinction between explicit and implicit actions. For instance, an explicit SIGTERM or SIGINT should not be ignored (nor should the functions that cause those to happen); but if we are waiting for sync rep, then it might be OK to ignore a cancel caused by statement_timeout or a termination due to a network disconnect.

Separately, I'm taking a vacation. Since there are two versions of the patch floating around, I will withdraw mine.

Regards,
Jeff Davis
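To make the explicit-vs-implicit distinction concrete (an illustration, not an example from the thread): a cancel triggered by statement_timeout arrives without any user pressing Ctrl+C, which is the kind of implicit action that could arguably be ignored during the sync-rep wait:

  -- An implicit cancel: the server cancels the statement itself after 5s.
  -- Under the idea above, this cancel - unlike an explicit Ctrl+C - might
  -- be ignored while the commit waits for the synchronous standby.
  SET statement_timeout = '5s';
  INSERT INTO a VALUES (4);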