Thread: 7.4.5 losing committed transactions

7.4.5 losing committed transactions

From: Jan Wieck
The attached archive contains a script that I used to reproduce the
error multiple times.

Setup:

   * create database crashtest
   * start 6 instances of testload.tcl as
         ./testload.tcl tN dbname=crashtest
     where N = 1..6
   * frequently kill a backend to cause a postmaster restart.

The testload script creates a table and starts inserting rows, 2 to 6
per transaction. If the transaction succeeds, it increments the internal
ID counter by the number of rows inserted. If the transaction fails with
a dupkey error, it increments the ID counter by 1. On any failure, it
reconnects to the database.

The duplicate key errors can happen because the transaction is
committed, but the response doesn't reach the client before the crash.
So that is normal and expected behaviour.

But occasionally a gap appears in the data. Given that the counter is
only incremented on a dupkey error or after a positive COMMIT response
from the backend, IMHO such a gap can only appear if we lose
transactions after commit on a crash restart.
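
For reference, a quick way to list the gaps after a run looks roughly
like the snippet below (illustrative only, not part of the attached
script; it assumes the table created by one testload instance, here t1,
and plain libpq). It prints every id whose successor is missing, i.e.
the start of each gap.

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname = crashtest");
    PGresult   *res;
    int         i;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        exit(1);
    }
    /* every id whose successor is missing marks the start of a gap */
    res = PQexec(conn,
                 "SELECT t.id FROM t1 t "
                 "WHERE NOT EXISTS (SELECT 1 FROM t1 x WHERE x.id = t.id + 1) "
                 "AND t.id < (SELECT max(id) FROM t1) "
                 "ORDER BY t.id");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        exit(1);
    }
    for (i = 0; i < PQntuples(res); i++)
        printf("gap starts after id %s\n", PQgetvalue(res, i, 0));
    PQclear(res);
    PQfinish(conn);
    return 0;
}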


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: 7.4.5 losing committed transactions

From: Tom Lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> But occasionally a gap appears in the data. Given that the counter is
> only incremented on a dupkey error or after a positive COMMIT response
> from the backend, IMHO such a gap can only appear if we lose
> transactions after commit on a crash restart.

Hmm.  I was able to reproduce this in 7.4.5 but not (so far) in 8.0.

What I see in the 7.4.5 postmortem is that the "missing" rows are
present in the table, as you'd expect, and are marked as inserted by a
transaction number that *definitely did not commit* according to the WAL
log --- there are WAL entries showing it inserting the missing rows,
and their btree index entries, but no commit.  The post-crash database
state seems exactly consistent with what is in WAL.
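
(For anyone who wants to repeat that check: a sketch of reading the
inserting xid of the suspect rows back through libpq. The table name
and id range are placeholders to be replaced with the ones from your
own run; the xid shown by the xmin system column is what you then look
for in the WAL dump.)

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname = crashtest");
    PGresult   *res;
    int         i;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        exit(1);
    }
    /* xmin is the xid of the transaction that inserted each row */
    res = PQexec(conn,
                 "SELECT xmin, id FROM t2 "
                 "WHERE id BETWEEN 4750 AND 4770 ORDER BY id");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        exit(1);
    }
    for (i = 0; i < PQntuples(res); i++)
        printf("id %s inserted by xid %s\n",
               PQgetvalue(res, i, 1), PQgetvalue(res, i, 0));
    PQclear(res);
    PQfinish(conn);
    return 0;
}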

This means either that the server sent a commit message before it had
xlog'd the commit, or that Pgtcl mistakenly reported the command as
successful when it was not.  Any thoughts?
        regards, tom lane


Re: 7.4.5 losing committed transactions

From: Jan Wieck
On 9/24/2004 5:12 PM, Tom Lane wrote:

> This means either that the server sent a commit message before it had
> xlog'd the commit, or that Pgtcl mistakenly reported the command as
> successful when it was not.  Any thoughts?

Is it somehow possible that the commit record was still sitting in the 
shared WAL buffers (unwritten) when the response got sent to the client? 
That would be the only possibility I can see right now, because Pgtcl
as used in that script sits on top of libpq in blocking mode, which
pretty much rules out early reporting of COMMAND_OK. Fsync/disk-flush
issues can't be the cause either, because it was only a PM restart
without any OS crash involved, and I don't like the idea of
<whatever*nix> forgetting a write().
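
(To spell out the ordering I'm wondering about, purely schematically
and with made-up names, nothing resembling the actual backend code:
the 'C' for the COMMIT must not leave the server before the commit
record has been flushed out of the shared WAL buffers to the WAL file.)

#include <stdio.h>

static long wal_inserted_upto = 0;    /* records accepted into shared WAL buffers */
static long wal_flushed_upto = 0;     /* records already written out to the WAL file */

static long
insert_commit_record(void)
{
    return ++wal_inserted_upto;       /* position of the commit record */
}

static void
flush_wal(long upto)
{
    if (wal_flushed_upto < upto)
        wal_flushed_upto = upto;      /* stands in for the write()/fsync() of WAL */
}

static void
send_command_complete(const char *tag)
{
    printf("C \"%s\"\n", tag);        /* the response the client is waiting for */
}

int
main(void)
{
    long        commit_rec = insert_commit_record();

    flush_wal(commit_rec);            /* has to happen first ... */
    send_command_complete("COMMIT");  /* ... before the client is told it committed */
    return 0;
}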


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: 7.4.5 losing committed transactions

From: Tom Lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> Is it somehow possible that the commit record was still sitting in the 
> shared WAL buffers (unwritten) when the response got sent to the client? 

I don't think so.  What I see in the two cases I have now are:

(1) The backend that was doing the "lost" transaction is *not* the one
I kill -9'd.  I know this in both cases because I know which table has
the missing entries, and I can see that that instance of the script got
a "WARNING: terminating connection because of crash of another server
process" message rather than just a connection closure.

(2) There's a pretty fair distance in the WAL log between the entries
made by the "lost" transaction and the checkpoint made by recovery ---
a dozen or so other transactions were made and committed in between.
It seems unlikely that this transaction would have been the only one to
lose a WAL record if something like that had happened.

What I'm currently speculating about is that there might be some
weirdness associated with the very act of sending out the WARNING.
quickdie() isn't doing anything to ensure that the system is in a good
state before it calls ereport --- which is probably not so cool
considering it is a signal handler.  It might be wise to reset at least
the elog.c state before doing this.

Can you still reproduce the problem if you take out the ereport call
in quickdie()?

BTW, what led you to develop this test setup ... had you already seen
something that made you suspect a data loss problem?
        regards, tom lane


Re: 7.4.5 losing committed transactions

From: Tom Lane
>> This means either that the server sent a commit message before it had
>> xlog'd the commit, or that Pgtcl mistakenly reported the command as
>> successful when it was not.  Any thoughts?

Oh, fooey.

exec_simple_query calls EndCommand before it calls finish_xact_command,
and of course the latter is where the COMMIT really happens.  But
EndCommand puts the 'C' message for the COMMIT into the outbound message
queue.

Under normal circumstances this is no problem because we don't pq_flush
the data to the client until after the commit is done.  But suppose
quickdie() is called in between.  It will call ereport, which will emit
the WARNING message *and pqflush it*.  libpq gets the 'C' message and
therefore reports that the COMMIT is complete.  More generally, any sort
of warning message occurring during transaction commit would do the wrong
thing.  (Errors chance not to, because those will make libpq replace the
pending COMMAND_OK result with an error result.  I'm not sure what the
internal logic in the JDBC driver is, but I guess it acts similarly, or
else we'd have heard about this before from JDBC users.)

Not sure what the cleanest solution is...
        regards, tom lane


Re: 7.4.5 losing committed transactions

From: Jan Wieck
On 9/24/2004 6:37 PM, Tom Lane wrote:

> Can you still reproduce the problem if you take out the ereport call
> in quickdie()?

Will check ...

> 
> BTW, what led you to develop this test setup ... had you already seen
> something that made you suspect a data loss problem?

Good guess ... what actually happened was that after a couple of these
getaddrinfo() SIGSEGVs on AIX, one of our Slony replicas was out of
sync. Long story short, it all didn't make sense.

Now the scary thing is that this crash did not just roll back a
committed transaction: another session had enough time in between to
receive a NOTIFY and select the data that got rolled back later.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: 7.4.5 losing committed transactions

From: Tom Lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> Now the scary thing is that this crash did not just roll back a
> committed transaction: another session had enough time in between to
> receive a NOTIFY and select the data that got rolled back later.

Different session, or same session?  NOTIFY is one of the cases that
would cause the backend to emit messages within the trouble window
between EndCommand and actual commit.  I don't believe that that path
will do a deliberate pq_flush, but it would be possible that the NOTIFY
message fills the output buffer and causes the 'C' message to go out
prematurely.

If you can actually prove that a *different session* was able to see as
committed data that was not safely committed, then we have another
problem to look for.  I am hoping we have only one nasty bug today ;-)
        regards, tom lane


Re: 7.4.5 losing committed transactions

From: Jan Wieck
On 9/24/2004 10:24 PM, Tom Lane wrote:

> Jan Wieck <JanWieck@Yahoo.com> writes:
>> Now the scary thing is that this crash did not just roll back a
>> committed transaction: another session had enough time in between to
>> receive a NOTIFY and select the data that got rolled back later.
> 
> Different session, or same session?  NOTIFY is one of the cases that
> would cause the backend to emit messages within the trouble window
> between EndCommand and actual commit.  I don't believe that that path
> will do a deliberate pq_flush, but it would be possible that the NOTIFY
> message fills the output buffer and causes the 'C' message to go out
> prematurely.
> 
> If you can actually prove that a *different session* was able to see as
> committed data that was not safely committed, then we have another
> problem to look for.  I am hoping we have only one nasty bug today ;-)

I do mean *different session*.

My current theory about how the subscriber got out of sync is this:

In Slony the chunks of serializable replication data are applied in one
transaction, together with the SYNC event and the event's CONFIRM record
plus a notify on the confirm relation. The data provider (master or
cascading node) listens on the subscriber's (slave's) confirm relation.
So immediately after the subscriber commits, the provider will pick up
the confirm record and then knows that the data has propagated and can
be deleted.

If the crash now wipes out the committed transaction, the entire SYNC
has to be redone. A problem that will be fixed in 1.0.3 can cause the
replication engine not to restart immediately, and that probably gave
the data provider's cleanup procedure enough time to purge the
replication data. That way it was possible that a direct subscriber was
still in sync while a cascaded subscriber behind it wasn't. That
constellation automatically ruled out the possibility that the update
was never captured on the master. And since the log forwarding is
stored within the same transaction too, the direct subscriber, which had
the correct data, must at that time have had the correct replication
log as well.

I guess nobody ever relied that heavily on data to be persistent at the 
microsecond the NOTIFY arrives ...


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: 7.4.5 losing committed transactions

From: Tom Lane
I said:
> Oh, fooey.

> exec_simple_query calls EndCommand before it calls finish_xact_command,

Fooey again --- that theory is all wrong.  Back to the drawing board.

I have managed to reproduce the bug on CVS tip, btw.  But it's very
painful to make it happen.  Have you got any tips for making it more
probable?
        regards, tom lane


Re: 7.4.5 losing committed transactions

From: Tom Lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> I guess nobody ever relied that heavily on data to be persistent at the 
> microsecond the NOTIFY arrives ...

Sure they have.

In theory you cannot see a NOTIFY before the sending transaction
commits, because the sender is holding a lock on pg_listener and you can't
even find out that you've been notified until he releases it.

Your idea that the COMMIT WAL record is getting dropped would fit the
facts, but I really am having a hard time believing it.  Why would the
COMMIT record be more prone to loss than any other record?  All the
cases I have seen so far have the right number of non-COMMIT records in
the log, so the bogus transaction is not getting lost altogether.
        regards, tom lane


Re: 7.4.5 losing committed transactions

From: Tom Lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 9/24/2004 10:24 PM, Tom Lane wrote:
>> If you can actually prove that a *different session* was able to see as
>> committed data that was not safely committed, then we have another
>> problem to look for.  I am hoping we have only one nasty bug today ;-)

> I do mean *different session*.

Hmm.  I have found another explanation for the problem I'm seeing
... but the problem I'm seeing is still just that the connected client
thinks the transaction has committed when it has not.  No one else could
think so, because it hasn't.

I rewrote the test case in C to get rid of any question about whether
this was a libpgtcl bug and to be able to add more instrumentation.
Using PQtrace tells the tale.  Here is a normal commit sequence:

To backend> Msg Q
To backend> 'insert into t2 values (4751, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140509 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'insert into t2 values (4752, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140510 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'commit'
To backend> Msg complete, length 12
From backend> C
From backend (#4)> 11
From backend> "COMMIT"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> I

(The reason "Z" appears twice is that libpq stops reading the input and
then re-reads that message.  It's not really two Zs in the datastream.)

Here is the failure case:

To backend> Msg Q
To backend> 'insert into t2 values (4760, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140518 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'insert into t2 values (4761, 'test data')'
To backend> Msg complete, length 47
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140519 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
To backend> Msg Q
To backend> 'commit'
To backend> Msg complete, length 12
-- this response is totally bogus:
From backend> C
From backend (#4)> 21
From backend> "INSERT 4140519 1"
From backend> Z
From backend (#4)> 5
From backend> Z
From backend (#4)> 5
From backend> T
-- here begins the I'm-crashing NOTICE message:
From backend> N
From backend (#4)> 386
From backend> S
From backend> "WARNING"
From backend> C
From backend> "57P02"
From backend> M
From backend> "terminating connection because of crash of another server process"
From backend> D
From backend> "The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared memory."
 
From backend> H
From backend> "In a moment you should be able to reconnect to the database and repeat your command."
From backend> F
From backend> "postgres.c"
From backend> L
From backend> "1880"
From backend> R
From backend> "quickdie"
From backend> : server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
disconnect

My present theory about the cause is that the backend lost its CPU
quantum immediately after doing the send() that responded to the last
INSERT, and was interrupted by SIGQUIT before it could continue.
If you look at pq_flush() you'll see that it does not reset
PqSendPointer until it exits.  This means that when quickdie()
pushes the WARNING into the send buffer, it will be added onto
what is there --- and then the pq_flush call at the end of ereport
will transmit that bufferload a second time.
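
(A toy model of that buffer behaviour, with made-up names and no
relation to the real pqcomm.c code, just to show how the already-sent
bytes go out a second time:)

#include <stdio.h>
#include <string.h>

static char send_buf[256];
static int  send_used = 0;      /* queued bytes; cleared only when a flush completes */

static void
queue_msg(const char *msg)
{
    memcpy(send_buf + send_used, msg, strlen(msg));
    send_used += strlen(msg);
}

static void
flush_all(void)
{
    fwrite(send_buf, 1, send_used, stdout);   /* stands in for send() to the client */
    send_used = 0;                            /* the pointer reset happens only here */
}

int
main(void)
{
    /* the response to the last INSERT is queued and sent ... */
    queue_msg("C \"INSERT 4140519 1\"\n");
    fwrite(send_buf, 1, send_used, stdout);   /* ... but this flush is interrupted by
                                               * SIGQUIT before send_used is reset */

    /* quickdie()'s ereport then appends the WARNING and flushes "everything", */
    /* so the INSERT response is transmitted a second time: */
    queue_msg("N WARNING: terminating connection\n");
    flush_all();
    return 0;
}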

Dunno if you've done any testing yet with quickdie's ereport call
removed, but this theory implies you won't be able to reproduce the
bug without that.

It seems possible that there are related bugs that don't need quickdie()
to trigger them --- any elog call from a routine called by pq_flush is a
risk factor as well.  (In theory all such messages should be COMMERROR
so as not to try to send to the client, but in practice that's a fragile
rule.)  The fix I am thinking about is to introduce some sort of
reentrancy detection into pqcomm.c, so that it will ignore attempts to
send messages at times when it is partway through sending a message.
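
(Roughly the shape of guard I have in mind, hypothetical code rather
than a patch: a flag that turns nested send attempts into no-ops while
a flush is underway.)

#include <stdio.h>
#include <string.h>

static char send_buf[256];
static int  send_used = 0;
static int  sending = 0;        /* nonzero while a flush is in progress */

static int
guarded_queue(const char *msg)
{
    if (sending)
        return 0;               /* refuse: we are partway through sending */
    memcpy(send_buf + send_used, msg, strlen(msg));
    send_used += strlen(msg);
    return 1;
}

static void
guarded_flush(void)
{
    if (sending)
        return;                 /* ignore reentrant flush attempts */
    sending = 1;
    fwrite(send_buf, 1, send_used, stdout);
    send_used = 0;
    sending = 0;
}

int
main(void)
{
    guarded_queue("C \"COMMIT\"\n");
    guarded_flush();
    return 0;
}
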
        regards, tom lane


/* test for commit-loss problem -- invocation: losstest tablename "conn options" */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include "libpq-fe.h"

static void
exit_nicely(PGconn *conn)
{
    PQfinish(conn);
    exit(1);
}

static PGconn *
db_connect(const char *conninfo)
{
    PGconn     *conn;
    for (;;)
    {
        conn = PQconnectdb(conninfo);
        if (PQstatus(conn) == CONNECTION_OK)
            return conn;
        printf("connect failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        sleep(1);
    }
}

static void
db_disconnect(PGconn *conn)
{
    PQfinish(conn);
    printf("disconnect\n");
}

static int
db_exec(PGconn *conn, const char *query, const char *qstatus)
{
    PGresult   *res;
    char       *status;
    res = PQexec(conn, query);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        printf("%s failed: %s", query, PQerrorMessage(conn));
        PQclear(res);
        return 0;
    }
    status = PQcmdStatus(res);
    if (strcmp(status, qstatus) != 0)
        printf("%s returned status '%s'\n", query, status);
    PQclear(res);
    return 1;
}

int
main(int argc, char **argv)
{
    const char *tab_name;
    const char *conninfo;
    PGconn     *conn;
    PGresult   *res;
    FILE       *tracefile;
    char        query[1024];
    int         id_count = 1;

    /* First command-line parameter (if any) is the table name */
    if (argc > 1)
        tab_name = argv[1];
    else
        tab_name = "t1";

    /*
     * Second parameter (if any) is the conninfo string; otherwise default
     * to dbname=crashtest plus environment variables/defaults.
     */
    if (argc > 2)
        conninfo = argv[2];
    else
        conninfo = "dbname = crashtest";

    /* Make a connection to the database and set up the target table */
    conn = db_connect(conninfo);
    sprintf(query, "drop table %s", tab_name);
    (void) db_exec(conn, query, "DROP TABLE");
    sprintf(query, "create table %s (id int primary key, data text)",
            tab_name);
    if (!db_exec(conn, query, "CREATE TABLE"))
        exit_nicely(conn);

    /* Drop connection to enter loop */
    db_disconnect(conn);

    /* Set up file for tracing libpq activity */
    sprintf(query, "trace_%s_%d", tab_name, (int) getpid());
    tracefile = fopen(query, "w");
    if (!tracefile)
    {
        printf("couldn't open %s: %s\n", query, strerror(errno));
        exit(1);
    }

    for (;;)
    {
        /* Make a connection to the database and establish tracing */
        conn = db_connect(conninfo);
        fprintf(tracefile, "\n-- NEW CONNECTION --\n\n");
        PQtrace(conn, tracefile);

        for (;;)
        {
            int         n = random() % 4 + 2;
            int         i;
            if (!db_exec(conn, "start transaction", "START TRANSACTION"))
                goto fail;
            for (i = 0; i < n; i++)
            {
                sprintf(query, "insert into %s values (%d, 'test data')",
                        tab_name, id_count + i);
                res = PQexec(conn, query);
                if (PQresultStatus(res) != PGRES_COMMAND_OK)
                {
                    printf("%s failed: %s", query, PQerrorMessage(conn));
                    if (strstr(PQerrorMessage(conn), "duplicate key"))
                        id_count += 1;
                    PQclear(res);
                    goto fail;
                }
                if (strncmp(PQcmdStatus(res), "INSERT ", 7) != 0)
                    printf("%s returned status '%s'\n", query,
                           PQcmdStatus(res));
                PQclear(res);
            }
            if (!db_exec(conn, "commit", "COMMIT"))
                goto fail;

            id_count += n;
            printf("\r%d rows done", id_count);
            fflush(stdout);
        }
fail:
        /* Drop dead connection */
        db_disconnect(conn);
    }
    return 0;
}


Re: 7.4.5 losing committed transactions

From: Tom Lane
I wrote:
> My present theory about the cause is that the backend lost its CPU
> quantum immediately after doing the send() that responded to the last
> INSERT, and was interrupted by SIGQUIT before it could continue.
> If you look at pq_flush() you'll see that it does not reset
> PqSendPointer until it exits.  This means that when quickdie()
> pushes the WARNING into the send buffer, it will be added onto
> what is there --- and then the pq_flush call at the end of ereport
> will transmit that bufferload a second time.

I have applied a patch to prevent this scenario in 7.4 and 8.0 branches.

Next question is whether your original problem was due to this, or
something else.  As far as I can see, this problem occurs *only* when
a SIGQUIT arrives at just the right time, so it seems unlikely to
explain any problems you'd have seen in the field.
        regards, tom lane