Re: 7.4.5 losing committed transactions - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: 7.4.5 losing committed transactions
Date:
Msg-id: 18705.1096141512@sss.pgh.pa.us
In response to: Re: 7.4.5 losing committed transactions (Jan Wieck <JanWieck@Yahoo.com>)
Responses: Re: 7.4.5 losing committed transactions
List: pgsql-hackers
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 9/24/2004 10:24 PM, Tom Lane wrote:
>> If you can actually prove that a *different session* was able to see as
>> committed data that was not safely committed, then we have another
>> problem to look for.  I am hoping we have only one nasty bug today ;-)

> I do mean *different session*.

Hmm.  I have found another explanation for the problem I'm seeing ...
but the problem I'm seeing is still just that the connected client
thinks the transaction has committed when it has not.  No one else
could think so, because it hasn't.

I rewrote the test case in C to get rid of any question about whether
this was a libpgtcl bug and to be able to add more instrumentation.
Using PQtrace tells the tale.  Here is a normal commit sequence:

To backend> Msg Q
To backend> 'insert into t2 values (4751, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140509 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'insert into t2 values (4752, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140510 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'commit'
To backend> Msg complete, length 12
From backend> C
From backend (#4)> 11
From backend> "COMMIT"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> I

(The reason "Z" appears twice is that libpq stops reading the input and
then re-reads that message.  It's not really two Zs in the datastream.)

Here is the failure case:

To backend> Msg Q
To backend> 'insert into t2 values (4760, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140518 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'insert into t2 values (4761, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140519 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'commit'
To backend> Msg complete, length 12
-- this response is totally bogus:
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140519 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
-- here begins the I'm-crashing NOTICE message:
From backend> N
From backend (#4)> 386
From backend> S
From backend> "WARNING"
From backend> C
From backend> "57P02"
From backend> M
From backend> "terminating connection because of crash of another server process"
From backend> D
From backend> "The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory."
From backend> H
From backend> "In a moment you should be able to reconnect to the database and repeat your command."
From backend> F
From backend> "postgres.c"
From backend> L
From backend> "1880"
From backend> R
From backend> "quickdie"
From backend> : server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
disconnect
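For reference, the traces above come from libpq's built-in PQtrace
facility.  A minimal, self-contained sketch of capturing such a trace
(the connection string and trace file name are placeholders; the
complete test program is attached at the end of this message) might
look like this:

/*
 * Sketch only: dump the client/server protocol traffic for one short
 * session to a file, the same mechanism used for the traces above.
 */
#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname = crashtest");  /* placeholder */
    FILE       *trace = fopen("trace.out", "w");           /* placeholder */

    if (PQstatus(conn) == CONNECTION_OK && trace != NULL)
    {
        PQtrace(conn, trace);       /* log all traffic to the file */
        PQclear(PQexec(conn, "select 1"));
        PQuntrace(conn);
    }
    if (trace)
        fclose(trace);
    PQfinish(conn);
    return 0;
}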
My present theory about the cause is that the backend lost its CPU
quantum immediately after doing the send() that responded to the last
INSERT, and was interrupted by SIGQUIT before it could continue.  If
you look at pq_flush() you'll see that it does not reset PqSendPointer
until it exits.  This means that when quickdie() pushes the WARNING
into the send buffer, it will be added onto what is there --- and then
the pq_flush call at the end of ereport will transmit that bufferload
a second time.

Dunno if you've done any testing yet with quickdie's ereport call
removed, but this theory implies you won't be able to reproduce the
bug without that.  It seems possible that there are related bugs that
don't need quickdie() to trigger them --- any elog call from a routine
called by pq_flush is a risk factor as well.  (In theory all such
messages should be COMMERROR so as not to try to send to the client,
but in practice that's a fragile rule.)

The fix I am thinking about is to introduce some sort of reentrancy
detection into pqcomm.c, so that it will ignore attempts to send
messages at times when it is partway through sending a message.

            regards, tom lane
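A minimal, self-contained sketch of that reentrancy-guard idea follows
(the buffer and flag names below are invented for illustration; this
is not the actual pqcomm.c code).  A flag set for the duration of a
flush makes any nested attempt to append or send a no-op, so an
interrupted flush can never be followed by a second transmission of
the same bufferload:

/*
 * Simplified sketch of a reentrancy guard on the send path (invented
 * names, not the real pqcomm.c code).  While a flush is in progress,
 * nested attempts to append or flush are silently ignored.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static char sketch_buf[8192];
static int  sketch_ptr = 0;         /* bytes queued but not yet sent */
static bool sketch_busy = false;    /* true while a send is in progress */

static void
sketch_putbytes(const char *data, int len)
{
    if (sketch_busy || sketch_ptr + len > (int) sizeof(sketch_buf))
        return;                     /* mid-flush (or full): refuse to append */
    memcpy(sketch_buf + sketch_ptr, data, len);
    sketch_ptr += len;
}

static void
sketch_flush(void)
{
    if (sketch_busy)
        return;                     /* already flushing: ignore nested call */
    sketch_busy = true;
    /* stand-in for the send() loop over sketch_buf[0 .. sketch_ptr) */
    fwrite(sketch_buf, 1, sketch_ptr, stdout);
    sketch_ptr = 0;                 /* reset only after the data went out */
    sketch_busy = false;
}

int
main(void)
{
    const char *msg = "C \"COMMIT\"\n";

    sketch_putbytes(msg, (int) strlen(msg));
    sketch_flush();
    return 0;
}

In the interrupted-flush scenario described above, the flag would
still be set when the SIGQUIT handler runs, so its WARNING could
neither be appended to the stale buffer nor trigger a resend of it.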
/*
 * test for commit-loss problem
 *
 * Invocation: losstest tablename "conn options"
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

#include "libpq-fe.h"

static void
exit_nicely(PGconn *conn)
{
    PQfinish(conn);
    exit(1);
}

static PGconn *
db_connect(const char *conninfo)
{
    PGconn     *conn;

    for (;;)
    {
        conn = PQconnectdb(conninfo);
        if (PQstatus(conn) == CONNECTION_OK)
            return conn;
        printf("connect failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        sleep(1);
    }
}

static void
db_disconnect(PGconn *conn)
{
    PQfinish(conn);
    printf("disconnect\n");
}

static int
db_exec(PGconn *conn, const char *query, const char *qstatus)
{
    PGresult   *res;
    char       *status;

    res = PQexec(conn, query);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        printf("%s failed: %s", query, PQerrorMessage(conn));
        PQclear(res);
        return 0;
    }
    status = PQcmdStatus(res);
    if (strcmp(status, qstatus) != 0)
    {
        printf("%s returned status '%s'\n", query, status);
    }
    PQclear(res);
    return 1;
}

int
main(int argc, char **argv)
{
    const char *tab_name;
    const char *conninfo;
    PGconn     *conn;
    PGresult   *res;
    FILE       *tracefile;
    char        query[1024];
    int         id_count = 1;

    /*
     * If the user supplies a parameter on the command line, use it as the
     * table name.
     */
    if (argc > 1)
        tab_name = argv[1];
    else
        tab_name = "t1";

    /*
     * If the user supplies a parameter on the command line, use it as the
     * conninfo string; otherwise default to setting dbname=crashtest and
     * using environment variables or defaults for all other connection
     * parameters.
     */
    if (argc > 2)
        conninfo = argv[2];
    else
        conninfo = "dbname = crashtest";

    /* Make a connection to the database */
    conn = db_connect(conninfo);

    /* Set up the target table */
    sprintf(query, "drop table %s", tab_name);
    (void) db_exec(conn, query, "DROP TABLE");
    sprintf(query, "create table %s (id int primary key, data text)", tab_name);
    if (!db_exec(conn, query, "CREATE TABLE"))
        exit_nicely(conn);

    /* Drop connection to enter loop */
    db_disconnect(conn);

    /* Set up file for tracing libpq activity */
    sprintf(query, "trace_%s_%d", tab_name, (int) getpid());
    tracefile = fopen(query, "w");
    if (!tracefile)
    {
        printf("couldn't open %s: %s\n", query, strerror(errno));
        exit(1);
    }

    for (;;)
    {
        /* Make a connection to the database */
        conn = db_connect(conninfo);

        /* Establish tracing */
        fprintf(tracefile, "\n-- NEW CONNECTION --\n\n");
        PQtrace(conn, tracefile);

        for (;;)
        {
            int         n = random() % 4 + 2;
            int         i;

            if (!db_exec(conn, "start transaction", "START TRANSACTION"))
                goto fail;
            for (i = 0; i < n; i++)
            {
                sprintf(query, "insert into %s values (%d, 'test data')",
                        tab_name, id_count + i);
                res = PQexec(conn, query);
                if (PQresultStatus(res) != PGRES_COMMAND_OK)
                {
                    printf("%s failed: %s", query, PQerrorMessage(conn));
                    if (strstr(PQerrorMessage(conn), "duplicate key"))
                        id_count += 1;
                    PQclear(res);
                    goto fail;
                }
                if (strncmp(PQcmdStatus(res), "INSERT ", 7) != 0)
                {
                    printf("%s returned status '%s'\n", query, PQcmdStatus(res));
                }
                PQclear(res);
            }
            if (!db_exec(conn, "commit", "COMMIT"))
                goto fail;
            id_count += n;
            printf("\r%d rows done", id_count);
            fflush(stdout);
        }

fail:
        /* Drop dead connection */
        db_disconnect(conn);
    }

    return 0;
}