Re: 7.4.5 losing committed transactions - Mailing list pgsql-hackers

From Jan Wieck
Subject Re: 7.4.5 losing committed transactions
Date
Msg-id 4154DCBD.3090206@Yahoo.com
Whole thread Raw
In response to Re: 7.4.5 losing committed transactions  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: 7.4.5 losing committed transactions  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: 7.4.5 losing committed transactions  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 9/24/2004 10:24 PM, Tom Lane wrote:

> Jan Wieck <JanWieck@Yahoo.com> writes:
>> Now the scary thing is that not only did this crash rollback a committed 
>> transaction. Another session had enough time in between to receive a 
>> NOTIFY and select the data that got rolled back later.
> 
> Different session, or same session?  NOTIFY is one of the cases that
> would cause the backend to emit messages within the trouble window
> between EndCommand and actual commit.  I don't believe that that path
> will do a deliberate pq_flush, but it would be possible that the NOTIFY
> message fills the output buffer and causes the 'C' message to go out
> prematurely.
> 
> If you can actually prove that a *different session* was able to see as
> committed data that was not safely committed, then we have another
> problem to look for.  I am hoping we have only one nasty bug today ;-)

I do mean *different session*.

My current theory about how the subscriber got out of sync is this:

In Slony, the chunks of serializable replication data are applied in one 
transaction, together with the SYNC event and the event's CONFIRM record, 
plus a NOTIFY on the confirm relation. The data provider (master or 
cascading node) listens on the subscriber's (slave) confirm relation. 
So immediately after the subscriber commits, the provider picks up 
the confirm record and knows that the data has propagated and can 
be deleted.
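That round-trip can be sketched as a toy model (a hedged illustration only; none of these function or field names come from the actual Slony code or schema):

```python
# Toy model of the apply/confirm round-trip described above.
# Illustrative only -- not Slony's actual API or catalog layout.

def subscriber_apply_sync(db, sync_id, rows):
    """Apply one SYNC: the replication data and the CONFIRM record
    go into a single transaction; the NOTIFY on the confirm
    relation is (normally) delivered after commit."""
    db["data"].extend(rows)
    db["confirm"].append(sync_id)
    return ("notify", "confirm", sync_id)

def provider_on_confirm(provider_log, sync_id):
    """The provider listens on the confirm relation; once a SYNC is
    confirmed, its replication log entries become purgeable."""
    return [e for e in provider_log if e["sync"] != sync_id]

# Usage: the subscriber confirms SYNC 17, so the provider purges
# the log rows belonging to that SYNC.
db = {"data": [], "confirm": []}
note = subscriber_apply_sync(db, 17, ["row1", "row2"])
log = [{"sync": 17, "row": "row1"},
       {"sync": 17, "row": "row2"},
       {"sync": 18, "row": "row3"}]
log = provider_on_confirm(log, note[2])
```

The point is that the purge on the provider is driven entirely by the confirm record becoming visible on the subscriber.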

If the crash now wipes out the committed transaction, the entire SYNC 
has to be redone. A problem that will be fixed in 1.0.3 can cause the 
replication engine not to restart immediately, and that probably gave 
the data provider's cleanup procedure enough time to purge the 
replication data. That way it was possible that a direct subscriber was 
still in sync, but a cascaded subscriber behind it wasn't. That 
constellation automatically ruled out the possibility that the update 
was never captured on the master. And since the log forwarding is stored 
within the same transaction too, the direct subscriber, which had the 
correct data, must at that time have had the correct replication log as 
well.
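The failure window itself can be simulated in a few lines (again a made-up model, assuming, per your theory, that the NOTIFY escapes to the listener before the commit record is durable):

```python
# Simulation of the crash window: the confirm NOTIFY reaches the
# provider, but the subscriber's commit is lost in the crash.
# Purely illustrative; not PostgreSQL or Slony internals.

def crashing_commit(durable, pending, listeners):
    """Model a commit where the output buffer (carrying the NOTIFY)
    is flushed before the commit record hits disk, and the backend
    crashes in between: listeners see the confirm, storage doesn't."""
    for inbox in listeners:
        inbox.append(pending["confirm"])   # NOTIFY escapes prematurely
    # crash here: nothing from `pending` reaches durable storage
    return durable

provider_inbox = []
durable = crashing_commit(durable={"confirm": []},
                          pending={"confirm": 17, "rows": ["r1"]},
                          listeners=[provider_inbox])

# The provider saw CONFIRM 17 and may purge the log for SYNC 17,
# yet the subscriber never durably applied it: the SYNC must be
# redone, but the data needed to redo it may already be gone.
```

Which is exactly how a cascaded subscriber can end up permanently behind while its direct provider still looks consistent.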

I guess nobody ever relied that heavily on data being persistent at the 
microsecond the NOTIFY arrives ...


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

