Re: 7.4.5 losing committed transactions - Mailing list pgsql-hackers

From Tom Lane
Subject Re: 7.4.5 losing committed transactions
Date
Msg-id 18705.1096141512@sss.pgh.pa.us
Whole thread Raw
In response to Re: 7.4.5 losing committed transactions  (Jan Wieck <JanWieck@Yahoo.com>)
Responses Re: 7.4.5 losing committed transactions
List pgsql-hackers
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 9/24/2004 10:24 PM, Tom Lane wrote:
>> If you can actually prove that a *different session* was able to see as
>> committed data that was not safely committed, then we have another
>> problem to look for.  I am hoping we have only one nasty bug today ;-)

> I do mean *different session*.

Hmm.  I have found another explanation for the problem I'm seeing
... but the problem I'm seeing is still just that the connected client
thinks the transaction has committed when it has not.  No one else could
think so, because it hasn't.

I rewrote the test case in C to get rid of any question about whether
this was a libpgtcl bug and to be able to add more instrumentation.
Using PQtrace tells the tale.  Here is a normal commit sequence:

To backend> Msg Q
To backend> 'insert into t2 values (4751, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140509 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'insert into t2 values (4752, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140510 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'commit'
To backend> Msg complete, length 12
From backend> C
From backend (#4)> 11
From backend> "COMMIT"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> I

(The reason "Z" appears twice is that libpq stops reading the input and
then re-reads that message.  It's not really two Zs in the datastream.)

Here is the failure case:

To backend> Msg Q
To backend> 'insert into t2 values (4760, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140518 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'insert into t2 values (4761, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140519 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'commit'
To backend> Msg complete, length 12
-- this response is totally bogus:
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140519 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
-- here begins the I'm-crashing NOTICE message:
From backend> N
From backend (#4)> 386
From backend> S
From backend> "WARNING"
From backend> C
From backend> "57P02"
From backend> M
From backend> "terminating connection because of crash of another server process"
From backend> D
From backend> "The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared memory."
 
From backend> H
From backend> "In a moment you should be able to reconnect to the database and repeat your command."
From backend> F
From backend> "postgres.c"
From backend> L
From backend> "1880"
From backend> R
From backend> "quickdie"
From backend> : server closed the connection unexpectedly       This probably means the server terminated abnormally
  before or while processing the request.
 
disconnect

My present theory about the cause is that the backend lost its CPU
quantum immediately after doing the send() that responded to the last
INSERT, and was interrupted by SIGQUIT before it could continue.
If you look at pq_flush() you'll see that it does not reset
PqSendPointer until it exits.  This means that when quickdie()
pushes the WARNING into the send buffer, it will be added onto
what is there --- and then the pq_flush call at the end of ereport
will transmit that bufferload a second time.

Dunno if you've done any testing yet with quickdie's ereport call
removed, but this theory implies you won't be able to reproduce the
bug without that.

It seems possible that there are related bugs that don't need quickdie()
to trigger them --- any elog call from a routine called by pq_flush is a
risk factor as well.  (In theory all such messages should be COMMERROR
so as not to try to send to the client, but in practice that's a fragile
rule.)  The fix I am thinking about is to introduce some sort of
reentrancy detection into pqcomm.c, so that it will ignore attempts to
send messages at times when it is partway through sending a message.
        regards, tom lane


/** test for commit-loss problem** Invocation: losstest tablename "conn options"*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include "libpq-fe.h"


static void
exit_nicely(PGconn *conn)
{PQfinish(conn);exit(1);
}

static PGconn *
db_connect(const char *conninfo)
{PGconn       *conn;
for (;;){    conn = PQconnectdb(conninfo);    if (PQstatus(conn) == CONNECTION_OK)        return conn;
printf("connectfailed: %s", PQerrorMessage(conn));    PQfinish(conn);    sleep(1);}
 
}

static void
db_disconnect(PGconn *conn)
{PQfinish(conn);printf("disconnect\n");
}

static int
db_exec(PGconn *conn, const char *query, const char *qstatus)
{PGresult   *res;char *status;
res = PQexec(conn, query);if (PQresultStatus(res) != PGRES_COMMAND_OK){    printf("%s failed: %s", query,
PQerrorMessage(conn));   PQclear(res);    return 0;}status = PQcmdStatus(res);if (strcmp(status, qstatus) != 0){
printf("%sreturned status '%s'\n", query, status);}PQclear(res);return 1;
 
}

int
main(int argc, char **argv)
{const char *tab_name;const char *conninfo;PGconn       *conn;PGresult   *res;FILE       *tracefile;char
query[1024];int           id_count = 1;
 
/* * If the user supplies a parameter on the command line, use it as the * table name. */if (argc > 1)    tab_name =
argv[1];else   tab_name = "t1";
 
/* * If the user supplies a parameter on the command line, use it as the * conninfo string; otherwise default to
settingdbname=crashtest and * using environment variables or defaults for all other connection * parameters. */if (argc
>2)    conninfo = argv[2];else    conninfo = "dbname = crashtest";
 
/* Make a connection to the database */conn = db_connect(conninfo);
/* Set up the target table */sprintf(query, "drop table %s", tab_name);(void) db_exec(conn, query, "DROP
TABLE");sprintf(query,"create table %s (id int primary key, data text)",        tab_name);if (!db_exec(conn, query,
"CREATETABLE"))    exit_nicely(conn);
 
/* Drop connection to enter loop */db_disconnect(conn);
/* Set up file for tracing libpq activity */sprintf(query, "trace_%s_%d", tab_name, (int) getpid());tracefile =
fopen(query,"w");if (!tracefile){    printf("couldn't open %s: %s\n", query, strerror(errno));    exit(1);}
 
for (;;){    /* Make a connection to the database */    conn = db_connect(conninfo);
    /* Establish tracing */    fprintf(tracefile, "\n-- NEW CONNECTION --\n\n");    PQtrace(conn, tracefile);
    for (;;)    {        int        n = random() % 4 + 2;        int            i;
        if (!db_exec(conn, "start transaction", "START TRANSACTION"))            goto fail;        for (i = 0; i < n;
i++)       {            sprintf(query, "insert into %s values (%d, 'test data')",                    tab_name, id_count
+i);            res = PQexec(conn, query);            if (PQresultStatus(res) != PGRES_COMMAND_OK)            {
      printf("%s failed: %s", query, PQerrorMessage(conn));                if (strstr(PQerrorMessage(conn), "duplicate
key"))                   id_count += 1;                PQclear(res);                goto fail;            }
if(strncmp(PQcmdStatus(res), "INSERT ", 7) != 0)            {                printf("%s returned status '%s'\n", query,
PQcmdStatus(res));           }            PQclear(res);        }        if (!db_exec(conn, "commit", "COMMIT"))
  goto fail;
 
        id_count += n;        printf("\r%d rows done", id_count);        fflush(stdout);    }fail:
    /* Drop dead connection */    db_disconnect(conn);}
return 0;
}


pgsql-hackers by date:

Previous
From: "Marc G. Fournier"
Date:
Subject: Re: cvsup
Next
From: Manfred Spraul
Date:
Subject: Re: tweaking MemSet() performance - 7.4.5