BUG #18457: Possible data loss needs to be evaluated - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #18457: Possible data loss needs to be evaluated
Date
Msg-id 18457-dc62fafe74c67bdd@postgresql.org
Whole thread Raw
Responses Re: BUG #18457: Possible data loss needs to be evaluated
Re: BUG #18457: Possible data loss needs to be evaluated
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18457
Logged by:          Julius Ernesti
Email address:      julius.ernesti@kpi4me.de
PostgreSQL version: 14.10
Operating system:   Rocky Linux 9
Description:

Our services connect to the database via libpq. It is ensured that
connections are not used by two threads at the same time, with several
physical connections to the database. Here's what happened:

A very large transaction, which makes inserts into different tables, could
not be executed due to a unique key constraint. Due to a not yet known
problem, this transaction was not rolled back properly and the connection
was in an invalid state. The Portgresql server log logged the following
messages every second over a period of hours:

2024-05-05 22:15:27.461 CEST [2287043] ERROR:  current transaction is
aborted, commands ignored until end of transaction block
2024-05-05 22:15:27.461 CEST [2287043] STATEMENT:  SAVEPOINT SP_264211

This is probably because a thread has continued to use the broken connection
to execute transactions. I assume that this thread also received
corresponding messages, which I cannot verify due to the lack of logs.

Other processes were still able to communicate successfully with the
database. Inserts were executed and generated rowids, which are generated
via sequences, were also returned. In retrospect, I can no longer find these
rows in the database. This means that inserts were apparently executed, but
no data exists in the final result.

The inserts correspond to the same tables as the failed transaction
mentioned above. As a result, we had a data loss of 1 day's work, which
unfortunately was only noticed afterwards.

Assumption:
I am sure that many unfavorable factors have led to a fatal error here. I
think that bugs from different systems have overlapped. We have not managed
to reproduce the invalid state described. I think that the transactions
described were successfully executed in the database. As we only need to
store data first and do not select it at that point, the error only became
apparent so late. At the moment I assume that the command “STATEMENT:
SAVEPOINT SP_264211” caused my data to be rolled back again without me
noticing in the service. I am sure that there is a link to the failed
transaction, that perhaps some tables were still locked and the rollback
droped those inserted rows.  

My knowledge of postgres has reached its limit here. I would like to
evaluate whether inserts from other connections, which have been
successfully executed, can be rolled back by another connection via the
statements 
2024-05-05 22:15:27.461 CEST [2287043] ERROR: current transaction is
aborted, commands ignored until end of transaction block
2024-05-05 22:15:27.461 CEST [2287043] STATEMENT: SAVEPOINT SP_264211

For the sake of simplicity, here is another example: There are two
connections A and B

Via A: insert into test (id, val) values (2, 'this will be a unique key
constraint error'); --> The Rollback statement will not be executed due to
an unknown error
Via B: insert into test (id, val) values (4, 'no error'); --> Successful
Transaction
Via A: insert into test (id, val) values (5, 'no error, but the is currently
a running transaction') --> Failed. Maybe a rollback is executed, I dont
know
Via B; select * from test where id = 4; --> Empty query 

We are still trying to find out how this could have happened. If the example
described actually leads to successful transactions being canceled again, I
would describe this as a bug in Postgres. Our codes have been running for 4
years and we have never seen an error like this.

Best regards.

Julius Ernesti


pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #18456: Trigger data in plpython3u trigger-function changes in AFTER UPDATE OR INSERT trigger
Next
From: PG Bug reporting form
Date:
Subject: BUG #18458: SQL Error [XX000]: ERROR: cache lookup failed for type 0