Thread: 7.4.5 losing committed transactions
The attached archive contains a script that I used to reproduce the error multiple times.

Setup:

* create database crashtest
* start 6 instances of testload.tcl as ./testload.tcl tN dbname=crashtest where N = 1..6
* frequently kill a backend to cause a postmaster restart.

The testload script creates a table and starts inserting rows, 2 to 6 per transaction. If the transaction succeeds, it increments the internal ID counter by the number of rows inserted. If the transaction fails with a dupkey error, it increments the ID counter by 1. On any failure, it reconnects to the database.

The duplicate key errors can happen because the transaction is committed, but the response doesn't reach the client before the crash. So that is normal and expected behaviour. But occasionally a gap will appear in the data. With the given logic of only incrementing the counter on a dupkey error or after a positive COMMIT response from the backend, IMHO there can only be a gap if we lose transactions after commit on a crash restart.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
> But occasionally there will appear a gap in the data. With the given
> logic only to increment the counter on a dupkey or after a positive
> COMMIT response by the backend, IMHO there can only be one if we lose
> transactions after commit on a crash restart.

Hmm. I was able to reproduce this in 7.4.5 but not (so far) in 8.0.

What I see in the 7.4.5 postmortem is that the "missing" rows are present in the table, as you'd expect, and are marked as inserted by a transaction number that *definitely did not commit* according to the WAL log --- there are WAL entries showing it inserting the missing rows, and their btree index entries, but no commit. The post-crash database state seems exactly consistent with what is in WAL.

This means either that the server sent a commit message before it had xlog'd the commit, or that Pgtcl mistakenly reported the command as successful when it was not. Any thoughts?

			regards, tom lane
On 9/24/2004 5:12 PM, Tom Lane wrote:
> This means either that the server sent a commit message before it had
> xlog'd the commit, or that Pgtcl mistakenly reported the command as
> successful when it was not. Any thoughts?

Is it somehow possible that the commit record was still sitting in the shared WAL buffers (unwritten) when the response got sent to the client? That would be the only possibility I can see right now, because Pgtcl as used in that script sits on top of libpq in blocking mode, and that pretty much rules out early reporting of COMMAND_OK. Fsync/disk-flush issues can't be the cause either, because it was only a PM restart without any OS crash involved, and I don't like the idea of <whatever*nix> forgetting a write().

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
> Is it somehow possible that the commit record was still sitting in the
> shared WAL buffers (unwritten) when the response got sent to the client?

I don't think so. What I see in the two cases I have now are:

(1) The backend that was doing the "lost" transaction is *not* the one I kill -9'd. I know this in both cases because I know which table has the missing entries, and I can see that that instance of the script got a "WARNING: terminating connection because of crash of another server process" message rather than just a connection closure.

(2) There's a pretty fair distance in the WAL log between the entries made by the "lost" transaction and the checkpoint made by recovery --- a dozen or so other transactions were made and committed in between. It seems unlikely that this transaction would have been the only one to lose a WAL record if something like that had happened.

What I'm currently speculating about is that there might be some weirdness associated with the very act of sending out the WARNING. quickdie() isn't doing anything to ensure that the system is in a good state before it calls ereport --- which is probably not so cool considering it is a signal handler. It might be wise to reset at least the elog.c state before doing this. Can you still reproduce the problem if you take out the ereport call in quickdie()?

BTW, what led you to develop this test setup ... had you already seen something that made you suspect a data loss problem?

			regards, tom lane
>> This means either that the server sent a commit message before it had
>> xlog'd the commit, or that Pgtcl mistakenly reported the command as
>> successful when it was not. Any thoughts?

Oh, fooey. exec_simple_query calls EndCommand before it calls finish_xact_command, and of course the latter is where the COMMIT really happens. But EndCommand puts the 'C' message for the COMMIT into the outbound message queue. Under normal circumstances this is no problem, because we don't pq_flush the data to the client until after the commit is done. But suppose quickdie() is called in between. It will call ereport, which will emit the WARNING message *and pq_flush it*. libpq gets the 'C' message and therefore reports that the COMMIT is complete.

More generally, any sort of warning message occurring during transaction commit would do the wrong thing. (Errors chance not to, because those will make libpq replace the pending COMMAND_OK result with an error result. I'm not sure what the internal logic in the JDBC driver is, but I guess it acts similarly, or else we'd have heard about this before from JDBC users.)

Not sure what the cleanest solution is ...

			regards, tom lane
On 9/24/2004 6:37 PM, Tom Lane wrote:
> Can you still reproduce the problem if you take out the ereport call
> in quickdie()?

Will check ...

> BTW, what led you to develop this test setup ... had you already seen
> something that made you suspect a data loss problem?

Good guess ... what actually happened was that after a couple of these getaddrinfo() SIGSEGVs on AIX, one of our Slony replicas was out of sync. Long story short, it all didn't make sense.

Now the scary thing is that not only did this crash roll back a committed transaction. Another session had enough time in between to receive a NOTIFY and select the data that got rolled back later.

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
> Now the scary thing is that not only did this crash rollback a committed
> transaction. Another session had enough time in between to receive a
> NOTIFY and select the data that got rolled back later.

Different session, or same session? NOTIFY is one of the cases that would cause the backend to emit messages within the trouble window between EndCommand and actual commit. I don't believe that that path will do a deliberate pq_flush, but it would be possible that the NOTIFY message fills the output buffer and causes the 'C' message to go out prematurely.

If you can actually prove that a *different session* was able to see as committed data that was not safely committed, then we have another problem to look for. I am hoping we have only one nasty bug today ;-)

			regards, tom lane
On 9/24/2004 10:24 PM, Tom Lane wrote:
> Different session, or same session? NOTIFY is one of the cases that
> would cause the backend to emit messages within the trouble window
> between EndCommand and actual commit. I don't believe that that path
> will do a deliberate pq_flush, but it would be possible that the NOTIFY
> message fills the output buffer and causes the 'C' message to go out
> prematurely.
>
> If you can actually prove that a *different session* was able to see as
> committed data that was not safely committed, then we have another
> problem to look for. I am hoping we have only one nasty bug today ;-)

I do mean *different session*. My current theory about how the subscriber got out of sync is this:

In Slony the chunks of serializable replication data are applied in one transaction, together with the SYNC event and the event's CONFIRM record, plus a notify on the confirm relation. The data provider (master or cascading node) listens on the subscriber's (slave's) confirm relation. So immediately after the subscriber commits, the provider will pick up the confirm record and know that the data has propagated and can be deleted.

If now the crash wipes out the committed transaction, the entire SYNC has to be redone. A problem that will be fixed in 1.0.3 can cause the replication engine not to restart immediately, and that probably gave the data provider's cleanup procedure enough time to purge the replication data.

That way it was possible that a direct subscriber was still in sync, but a cascaded subscriber behind it wasn't. That constellation automatically ruled out the possibility that the update simply wasn't captured on the master. And since the forwarded log is stored within the same transaction too, the direct subscriber, who had the correct data, must at that time have had the correct replication log as well.

I guess nobody ever relied that heavily on data being persistent at the microsecond the NOTIFY arrives ...

Jan
I said:
> Oh, fooey.
> exec_simple_query calls EndCommand before it calls finish_xact_command,

Fooey again --- that theory is all wrong. Back to the drawing board.

I have managed to reproduce the bug on CVS tip, btw. But it's very painful to make it happen. Have you got any tips for making it more probable?

			regards, tom lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> I guess nobody ever relied that heavily on data to be persistent at the
> microsecond the NOTIFY arrives ...

Sure they have. In theory you cannot see a NOTIFY before the sending transaction commits, because the sender is holding a lock on pg_notify, and you can't even find out that you've been notified until he releases it.

Your idea that the COMMIT WAL record is getting dropped would fit the facts, but I really am having a hard time believing it. Why would the COMMIT record be more prone to loss than any other record? All the cases I have seen so far have the right number of non-COMMIT records in the log, so the bogus transaction is not getting lost altogether.

			regards, tom lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 9/24/2004 10:24 PM, Tom Lane wrote:
>> If you can actually prove that a *different session* was able to see as
>> committed data that was not safely committed, then we have another
>> problem to look for. I am hoping we have only one nasty bug today ;-)

> I do mean *different session*.

Hmm. I have found another explanation for the problem I'm seeing ... but the problem I'm seeing is still just that the connected client thinks the transaction has committed when it has not. No one else could think so, because it hasn't.

I rewrote the test case in C to get rid of any question about whether this was a libpgtcl bug, and to be able to add more instrumentation. Using PQtrace tells the tale. Here is a normal commit sequence:

To backend> Msg Q
To backend> 'insert into t2 values (4751, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140509 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'insert into t2 values (4752, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140510 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'commit'
To backend> Msg complete, length 12
From backend> C
From backend (#4)> 11
From backend> "COMMIT"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> I

(The reason "Z" appears twice is that libpq stops reading the input and then re-reads that message. It's not really two Zs in the datastream.)
Here is the failure case:

To backend> Msg Q
To backend> 'insert into t2 values (4760, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140518 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'insert into t2 values (4761, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140519 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'commit'
To backend> Msg complete, length 12
-- this response is totally bogus:
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140519 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
-- here begins the I'm-crashing NOTICE message:
From backend> N
From backend (#4)> 386
From backend> S
From backend> "WARNING"
From backend> C
From backend> "57P02"
From backend> M
From backend> "terminating connection because of crash of another server process"
From backend> D
From backend> "The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory."
From backend> H
From backend> "In a moment you should be able to reconnect to the database and repeat your command."
From backend> F
From backend> "postgres.c"
From backend> L
From backend> "1880"
From backend> R
From backend> "quickdie"
From backend> :
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
disconnect

My present theory about the cause is that the backend lost its CPU quantum immediately after doing the send() that responded to the last INSERT, and was interrupted by SIGQUIT before it could continue.
If you look at pq_flush() you'll see that it does not reset PqSendPointer until it exits. This means that when quickdie() pushes the WARNING into the send buffer, it will be added onto what is there --- and then the pq_flush call at the end of ereport will transmit that bufferload a second time.

Dunno if you've done any testing yet with quickdie's ereport call removed, but this theory implies you won't be able to reproduce the bug without that.

It seems possible that there are related bugs that don't need quickdie() to trigger them --- any elog call from a routine called by pq_flush is a risk factor as well. (In theory all such messages should be COMMERROR so as not to try to send to the client, but in practice that's a fragile rule.)

The fix I am thinking about is to introduce some sort of reentrancy detection into pqcomm.c, so that it will ignore attempts to send messages at times when it is partway through sending a message.

			regards, tom lane

/*
 * test for commit-loss problem
 *
 * Invocation: losstest tablename "conn options"
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include "libpq-fe.h"

static void
exit_nicely(PGconn *conn)
{
	PQfinish(conn);
	exit(1);
}

static PGconn *
db_connect(const char *conninfo)
{
	PGconn	   *conn;

	for (;;)
	{
		conn = PQconnectdb(conninfo);
		if (PQstatus(conn) == CONNECTION_OK)
			return conn;
		printf("connect failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		sleep(1);
	}
}

static void
db_disconnect(PGconn *conn)
{
	PQfinish(conn);
	printf("disconnect\n");
}

static int
db_exec(PGconn *conn, const char *query, const char *qstatus)
{
	PGresult   *res;
	char	   *status;

	res = PQexec(conn, query);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		printf("%s failed: %s", query, PQerrorMessage(conn));
		PQclear(res);
		return 0;
	}
	status = PQcmdStatus(res);
	if (strcmp(status, qstatus) != 0)
	{
		printf("%s returned status '%s'\n", query, status);
	}
	PQclear(res);
	return 1;
}

int
main(int argc, char **argv)
{
	const char *tab_name;
	const char *conninfo;
	PGconn	   *conn;
	PGresult   *res;
	FILE	   *tracefile;
	char		query[1024];
	int			id_count = 1;

	/*
	 * If the user supplies a parameter on the command line, use it as the
	 * table name.
	 */
	if (argc > 1)
		tab_name = argv[1];
	else
		tab_name = "t1";

	/*
	 * If the user supplies a parameter on the command line, use it as the
	 * conninfo string; otherwise default to setting dbname=crashtest and
	 * using environment variables or defaults for all other connection
	 * parameters.
	 */
	if (argc > 2)
		conninfo = argv[2];
	else
		conninfo = "dbname = crashtest";

	/* Make a connection to the database */
	conn = db_connect(conninfo);

	/* Set up the target table */
	sprintf(query, "drop table %s", tab_name);
	(void) db_exec(conn, query, "DROP TABLE");
	sprintf(query, "create table %s (id int primary key, data text)", tab_name);
	if (!db_exec(conn, query, "CREATE TABLE"))
		exit_nicely(conn);

	/* Drop connection to enter loop */
	db_disconnect(conn);

	/* Set up file for tracing libpq activity */
	sprintf(query, "trace_%s_%d", tab_name, (int) getpid());
	tracefile = fopen(query, "w");
	if (!tracefile)
	{
		printf("couldn't open %s: %s\n", query, strerror(errno));
		exit(1);
	}

	for (;;)
	{
		/* Make a connection to the database */
		conn = db_connect(conninfo);

		/* Establish tracing */
		fprintf(tracefile, "\n-- NEW CONNECTION --\n\n");
		PQtrace(conn, tracefile);

		for (;;)
		{
			int			n = random() % 4 + 2;
			int			i;

			if (!db_exec(conn, "start transaction", "START TRANSACTION"))
				goto fail;
			for (i = 0; i < n; i++)
			{
				sprintf(query, "insert into %s values (%d, 'test data')",
						tab_name, id_count + i);
				res = PQexec(conn, query);
				if (PQresultStatus(res) != PGRES_COMMAND_OK)
				{
					printf("%s failed: %s", query, PQerrorMessage(conn));
					if (strstr(PQerrorMessage(conn), "duplicate key"))
						id_count += 1;
					PQclear(res);
					goto fail;
				}
				if (strncmp(PQcmdStatus(res), "INSERT ", 7) != 0)
				{
					printf("%s returned status '%s'\n", query, PQcmdStatus(res));
				}
				PQclear(res);
			}
			if (!db_exec(conn, "commit", "COMMIT"))
				goto fail;
			id_count += n;
			printf("\r%d rows done", id_count);
			fflush(stdout);
		}
fail:
		/* Drop dead connection */
		db_disconnect(conn);
	}
	return 0;
}
I wrote:
> My present theory about the cause is that the backend lost its CPU
> quantum immediately after doing the send() that responded to the last
> INSERT, and was interrupted by SIGQUIT before it could continue.

> If you look at pq_flush() you'll see that it does not reset
> PqSendPointer until it exits. This means that when quickdie()
> pushes the WARNING into the send buffer, it will be added onto
> what is there --- and then the pq_flush call at the end of ereport
> will transmit that bufferload a second time.

I have applied a patch to prevent this scenario in the 7.4 and 8.0 branches. Next question is whether your original problem was due to this, or something else. As far as I can see, this problem occurs *only* when a SIGQUIT arrives at just the right time, so it seems unlikely to explain any problems you'd have seen in the field.

			regards, tom lane