Re: 7.4.5 losing committed transactions - Mailing list pgsql-hackers
From | Jan Wieck |
---|---|
Subject | Re: 7.4.5 losing committed transactions |
Date | |
Msg-id | 4154DCBD.3090206@Yahoo.com |
In response to | Re: 7.4.5 losing committed transactions (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: 7.4.5 losing committed transactions; Re: 7.4.5 losing committed transactions |
List | pgsql-hackers |
On 9/24/2004 10:24 PM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> Now the scary thing is that not only did this crash rollback a committed
>> transaction. Another session had enough time in between to receive a
>> NOTIFY and select the data that got rolled back later.
>
> Different session, or same session? NOTIFY is one of the cases that
> would cause the backend to emit messages within the trouble window
> between EndCommand and actual commit. I don't believe that that path
> will do a deliberate pq_flush, but it would be possible that the NOTIFY
> message fills the output buffer and causes the 'C' message to go out
> prematurely.
>
> If you can actually prove that a *different session* was able to see as
> committed data that was not safely committed, then we have another
> problem to look for. I am hoping we have only one nasty bug today ;-)

I do mean *different session*. My current theory about how the subscriber got out of sync is this:

In Slony, the chunks of serializable replication data are applied in one transaction, together with the SYNC event and the event's CONFIRM record, plus a NOTIFY on the confirm relation. The data provider (master or cascading node) listens on the subscriber's (slave's) confirm relation. So immediately after the subscriber commits, the provider will pick up the confirm record and knows that the data has propagated and can be deleted. If the crash now wipes out the committed transaction, the entire SYNC has to be redone.

A problem that will be fixed in 1.0.3 can cause the replication engine not to restart immediately, and that probably gave the data provider's cleanup procedure enough time to purge the replication data. That way it was possible that a direct subscriber was still in sync, but a cascaded subscriber behind it wasn't. That constellation automatically ruled out the possibility that the update was never captured on the master. And since the log forwarding is stored within the same transaction too, the direct subscriber, which had the correct data, must at that time have had the correct replication log as well.

I guess nobody ever relied that heavily on data being persistent at the microsecond the NOTIFY arrives ...

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
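To make the failure mode concrete, here is a minimal sketch of the pattern described above, written with psycopg2. The table names (replica_log, replica_confirm), the channel name and the helper functions are illustrative assumptions, not Slony's actual code or schema. The subscriber applies the replication data, records the confirmation and issues the NOTIFY in one transaction; the provider treats the arrival of that notification as proof that the SYNC is durable and purges the corresponding replication log.

```python
# Minimal sketch of the apply-and-confirm pattern described above.
# All table, channel and function names here are illustrative assumptions.

import select

import psycopg2
import psycopg2.extensions


def apply_sync(subscriber_dsn, sync_statements, sync_id):
    """Subscriber side: apply one SYNC worth of replication data, record
    the confirmation and send the NOTIFY, all in one transaction."""
    conn = psycopg2.connect(subscriber_dsn)
    try:
        cur = conn.cursor()
        for sql, params in sync_statements:      # the replicated changes
            cur.execute(sql, params)
        cur.execute("INSERT INTO replica_confirm (sync_id) VALUES (%s)",
                    (sync_id,))
        cur.execute("NOTIFY replica_confirm")
        # The notification is only delivered to listeners after this commit,
        # so nothing above should be visible before the commit is durable.
        conn.commit()
    finally:
        conn.close()


def provider_cleanup_loop(subscriber_dsn, provider_dsn):
    """Provider side: listen on the subscriber's confirm relation and purge
    the replication log once a SYNC is reported as confirmed."""
    sub = psycopg2.connect(subscriber_dsn)
    sub.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    sub_cur = sub.cursor()
    sub_cur.execute("LISTEN replica_confirm")

    prov = psycopg2.connect(provider_dsn)
    prov_cur = prov.cursor()

    while True:
        # Wait for a notification from the subscriber's backend.
        if select.select([sub], [], [], 60) == ([], [], []):
            continue                              # timeout, just wait again
        sub.poll()
        while sub.notifies:
            sub.notifies.pop(0)

        # The notification is taken as proof that the subscriber has durably
        # committed up to the confirmed SYNC ...
        sub_cur.execute("SELECT max(sync_id) FROM replica_confirm")
        confirmed = sub_cur.fetchone()[0]
        if confirmed is None:
            continue

        # ... so the provider purges the replication log up to that point.
        # If the subscriber's "commit" is later lost in a crash, the log
        # needed to redo the SYNC is already gone.
        prov_cur.execute("DELETE FROM replica_log WHERE sync_id <= %s",
                         (confirmed,))
        prov.commit()
```

If the server reports the commit (and the NOTIFY gets out) before the transaction is actually durable, the final DELETE removes exactly the log entries that would be needed to redo the lost SYNC, which matches what happened to the cascaded subscriber.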