Re: Sync Rep: First Thoughts on Code - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Sync Rep: First Thoughts on Code
Date
Msg-id 1228245678.20796.410.camel@hp_dx2400_1
In response to Sync Rep: First Thoughts on Code  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Tue, 2008-12-02 at 11:08 -0800, Jeff Davis wrote:
> On Tue, 2008-12-02 at 13:09 +0000, Simon Riggs wrote:
> > > Is it dangerous to abort the transaction with replication continued when
> > > the timeout occurs? I think that the WAL consistency between two servers
> > > might be broken. Because the WAL writing and sending are done concurrently,
> > > and the backend might already write the WAL to disk on the primary when
> > > waiting for walsender.
> > 
> > The issue I see is that we might want to keep wal_sender_delay small so
> > that transaction times are not increased. But we also want
> > wal_sender_delay high so that replication never breaks. It seems better
> > to have the action on wal_sender_delay configurable if we have an
> > unsteady network (like the internet). Marcus made some comments on line
> > dropping that seem relevant here; we should listen to his experience.
> > 
> > Hmmm, dangerous? Well assuming we're linking commits with replication
> > sends then it sounds it. We might end up committing to disk and then
> > deciding to abort instead. But remember we don't remove the xid from
> > procarray or mark the result in clog until the flush is over, so it is
> > possible. But I think we should discuss this in more detail when the
> > main patch is committed.
> > 
> 
> What is the "it" in "it is possible"? It seems like there's still a
> problem window in there.

Marking a transaction as aborted after we have written a commit record,
but before we have removed it from the proc array and marked it in clog.
We'd need a special kind of WAL record to do that.
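
To make that window concrete, here is a minimal sketch in Python, not
actual backend code. The stage names and the abort_action() helper are
invented for illustration; only the ordering (commit record written,
then the xid removed from the proc array and marked in clog once the
flush is over) comes from the discussion above.

```python
from enum import Enum, auto


class CommitStage(Enum):
    """Hypothetical stages of a commit, in the order discussed above."""
    IN_PROGRESS = auto()             # no commit record written yet
    COMMIT_RECORD_WRITTEN = auto()   # commit record in WAL, flush/send pending
    VISIBLE = auto()                 # xid removed from proc array, clog marked


def abort_action(stage: CommitStage) -> str:
    """Describe what aborting would take at each stage (illustrative only)."""
    if stage is CommitStage.IN_PROGRESS:
        return "ordinary abort"
    if stage is CommitStage.COMMIT_RECORD_WRITTEN:
        # Other backends still see the xid as running, so an abort is
        # conceivable, but WAL already holds a commit record that
        # something would have to override.
        return "needs a special WAL record"
    # Once the commit is visible to other backends it cannot be undone.
    return "too late to abort"
```

The middle stage is the problem window: the commit record may already
be on disk on the primary while the backend is still waiting on the
walsender.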

> Even if that could be made safe, in the event of a real network failure,
> you'd just wait the full timeout every transaction, because it still
> thinks it's replicating.

True, but I did suggest having two timeouts.

There are good reasons to reduce the timeout and good reasons to
increase it, both at the same time.
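
One way to read the two-timeout idea, as a hedged sketch only: a short
timeout bounds how long any single commit waits, and a longer one
decides when the standby is considered gone, so later commits stop
waiting at all. The class, field and method names below are
hypothetical and are not parameters from the patch; the details of the
actual proposal are not spelled out in this message.

```python
from dataclasses import dataclass


@dataclass
class ReplicationWaitPolicy:
    """Illustrative two-timeout policy; all names are assumptions."""
    commit_wait_timeout: float   # short: max seconds one commit waits
    link_down_timeout: float     # long: seconds without an ack before giving up
    seconds_since_last_ack: float = 0.0
    replicating: bool = True

    def wait_budget(self) -> float:
        """Seconds the next commit may wait for the standby."""
        # Once the standby is declared gone, commits no longer wait.
        return self.commit_wait_timeout if self.replicating else 0.0

    def record_timeout(self, waited: float) -> None:
        """A commit waited `waited` seconds and received no ack."""
        self.seconds_since_last_ack += waited
        if self.seconds_since_last_ack >= self.link_down_timeout:
            self.replicating = False

    def record_ack(self) -> None:
        """The standby acknowledged; reset the silence counter."""
        self.seconds_since_last_ack = 0.0
```

With something like ReplicationWaitPolicy(commit_wait_timeout=1.0,
link_down_timeout=3.0), a dead link costs a few slow commits rather
than the full timeout on every transaction, while a brief stall never
reaches the longer limit and so does not break replication.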

Anyway, let's wait for some user experience following commit.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


