Re: Synchronous commit behavior during network outage - Mailing list pgsql-hackers

From Ondřej Žižka
Subject Re: Synchronous commit behavior during network outage
Date
Msg-id 39d141cf-89c0-6aaa-c8e9-500c05d0d744@stratox.cz
Whole thread Raw
In response to Re: Synchronous commit behavior during network outage  (Ondřej Žižka <ondrej.zizka@stratox.cz>)
Responses Re: Synchronous commit behavior during network outage  (SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>)
List pgsql-hackers
I am sorry, I forgot mentioned, that in the second situation I added a 
primary key to the table.

Ondrej


On 20/04/2021 18:49, Ondřej Žižka wrote:
> Hello Aleksander,
>
> Thank you for the reaction. This was tested on version 13.2.
>
> There are also other possible situations with the same setup and 
> similar issue:
>
> -----------------
> When the background process on server fails....
>
> On postgresql1:
> tecmint=# select * from a; --> LAN on sync replica is OK
>  id
> ----
>   1
> (1 row)
>
> tecmint=# insert into a values (2); ---> LAN on sync replica is DOWN 
> and insert is waiting. During this time kill the background process on 
> the PostgreSQL server for this session
> WARNING:  canceling the wait for synchronous replication and 
> terminating connection due to administrator command
> DETAIL:  The transaction has already committed locally, but might not 
> have been replicated to the standby.
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
> (2 rows)
>
> tecmint=# ---> LAN on sync replica is still DOWN
>
> The potgres session will restore after the background process failed. 
> When you run select on master, it still looks OK. But data is still 
> not replicated on the sync replica. If we lost the master now, we 
> would lost this data as well.
>
> **************
> Another case
> **************
>
> Kill the client process.
>
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
>   3
> (3 rows)
> tecmint=#                --> Disconnect the sync replica now. LAN on 
> replica is DOWN
> tecmint=# insert into a values (4); --> Kill the client process
> Terminated
> xzizka@service-vm:~$ psql -U postgres -h 192.168.122.6 -p 5432 -d tecmint
> Password for user postgres:
> psql (13.2 (Debian 13.2-1.pgdg100+1))
> Type "help" for help.
>
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
>   3
> (3 rows)
>
> tecmint=# --> Number 4 is not there. Now switch the LAN on sync 
> replica ON.
>
> ----------
>
> Result from sync replica after the LAN is again UP:
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
>   3
>   4
> (4 rows)
>
>
> In this situation, try to insert the number 4 again to the table.
>
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
>   3
> (3 rows)
>
> tecmint=# insert into a values (4);
> ERROR:  duplicate key value violates unique constraint "a_pkey"
> DETAIL:  Key (id)=(4) already exists.
> tecmint=#
>
> This is really strange... Application can be confused, It is not 
> possible to insert record, which is not there, but some systems which 
> use the sync node as a read replica maybe already read that record 
> from the sync replica database and done some steps which can cause 
> issues and can be hard to track.
>
> If I say, that it would be hard to send the CTRL+C to the database 
> from the client, I need to say, that the 2 situations I described here 
> can happen in real.
>
> What do you think?
>
> Thank you and regards
> Ondrej
>
> On 20/04/2021 17:23, Aleksander Alekseev wrote:
>> Hi Ondřej,
>>
>> Thanks for the report. It seems to be a clear violation of what is
>> promised in the docs. Although it's unlikely that someone implemented
>> an application which deals with important data and "pressed Ctr+C" as
>> it's done in psql. So this might be not such a critical issue after
>> all. BTW what version of PostgreSQL are you using?
>>
>>
>> On Mon, Apr 19, 2021 at 10:13 PM Ondřej Žižka 
>> <ondrej.zizka@stratox.cz> wrote:
>>> Hello all,
>>> I would like to know your opinion on the following behaviour I see 
>>> for PostgreSQL setup with synchronous replication.
>>>
>>> This behaviour happens in a special use case. In this use case, 
>>> there are 2 synchronous replicas with the following config (truncated):
>>>
>>> - 2 nodes
>>> - synchronous_standby_names='*'
>>> - synchronous_commit=remote_apply
>>>
>>>
>>> With this setup run the following steps (LAN down - LAN between 
>>> master and replica):
>>> -----------------
>>> postgres=# truncate table a;
>>> TRUNCATE TABLE
>>> postgres=# insert into a values (1); -- LAN up, insert has been 
>>> applied to replica.
>>> INSERT 0 1
>>> Vypnu LAN na serveru se standby:
>>> postgres=# insert into a values (2); --LAN down, waiting for a 
>>> confirmation from sync replica. In this situation cancel it (press 
>>> CTRL+C)
>>> ^CCancel request sent
>>> WARNING:  canceling wait for synchronous replication due to user 
>>> request
>>> DETAIL:  The transaction has already committed locally, but might 
>>> not have been replicated to the standby.
>>> INSERT 0 1
>>> There will be warning that commit was performed only locally:
>>> 2021-04-12 19:55:53.063 CEST [26104] WARNING:  canceling wait for 
>>> synchronous replication due to user request
>>> 2021-04-12 19:55:53.063 CEST [26104] DETAIL:  The transaction has 
>>> already committed locally, but might not have been replicated to the 
>>> standby.
>>>
>>> postgres=# insert into a values (2); --LAN down, waiting for a 
>>> confirmation from sync replica. In this situation cancel it (press 
>>> CTRL+C)
>>> ^CCancel request sent
>>> WARNING:  canceling wait for synchronous replication due to user 
>>> request
>>> DETAIL:  The transaction has already committed locally, but might 
>>> not have been replicated to the standby.
>>> INSERT 0 1
>>> postgres=# insert into a values (2); --LAN down, waiting for sync 
>>> replica, second attempt, cancel it as well (CTRL+C)
>>> ^CCancel request sent
>>> WARNING:  canceling wait for synchronous replication due to user 
>>> request
>>> DETAIL:  The transaction has already committed locally, but might 
>>> not have been replicated to the standby.
>>> INSERT 0 1
>>> postgres=# update a set n=3 where n=2; --LAN down, waiting for sync 
>>> replica, cancel it (CTRL+C)
>>> ^CCancel request sent
>>> WARNING:  canceling wait for synchronous replication due to user 
>>> request
>>> DETAIL:  The transaction has already committed locally, but might 
>>> not have been replicated to the standby.
>>> UPDATE 2
>>> postgres=# update a set n=3 where n=2; -- run the same 
>>> update,because data from the previous attempt was commited on 
>>> master, it is sucessfull, but no changes
>>> UPDATE 0
>>> postgres=# select * from a;
>>>   n
>>> ---
>>>   1
>>>   3
>>>   3
>>> (3 rows)
>>> postgres=#
>>> ------------------------
>>>
>>> Now, there is only value 1 in the sync replica table (no other 
>>> values), data is not in sync. This is expected, after the LAN 
>>> restore, data will come sync again, but if the main/primary node 
>>> will fail and we failover to replica before the LAN is back up or 
>>> the storage for this node would be destroyed and data would not sync 
>>> to replica before it, we will lose data even if the client received 
>>> successful commit (with a warning).
>>>  From the synchronous_commit=remote_write level and "higher", I 
>>> would expect, that when the remote application (doesn't matter if 
>>> flush, write or apply) would not be applied I would not receive a 
>>> confirmation about the commit (even with a warning). Something like, 
>>> if there is no commit from sync replica, there is no commit on 
>>> primary and if someone performs the steps above, the whole 
>>> transaction will not send a confirmation.
>>>
>>> This can cause issues if the application receives a confirmation 
>>> about the success and performs some follow-up steps e.g. create a 
>>> user account and sends a request to the mail system to create an 
>>> account or create a VPN account. If the scenario above happens, 
>>> there can exist a VPN account that does not have any presence in the 
>>> central database and can be a security issue.
>>>
>>> I hope I explained it sufficiently. :-)
>>>
>>> Do you think, that would be possible to implement a process that 
>>> would solve this use case?
>>>
>>> Thank you
>>> Ondrej
>>
>>



pgsql-hackers by date:

Previous
From: Ondřej Žižka
Date:
Subject: Re: Synchronous commit behavior during network outage
Next
From: Maksim Milyutin
Date:
Subject: Re: Synchronous commit behavior during network outage