On Tue, 2008-12-02 at 11:08 -0800, Jeff Davis wrote:
> On Tue, 2008-12-02 at 13:09 +0000, Simon Riggs wrote:
> > > Is it dangerous to abort the transaction with replication continued when
> > > the timeout occurs? I think that the WAL consistency between two servers
> > > might be broken. Because the WAL writing and sending are done concurrently,
> > > and the backend might already write the WAL to disk on the primary when
> > > waiting for walsender.
> >
> > The issue I see is that we might want to keep wal_sender_delay small so
> > that transaction times are not increased. But we also want
> > wal_sender_delay high so that replication never breaks. It seems better
> > to have the action on wal_sender_delay configurable if we have an
> > unsteady network (like the internet). Marcus made some comments on line
> > dropping that seem relevant here; we should listen to his experience.
> >
> > Hmmm, dangerous? Well assuming we're linking commits with replication
> > sends then it sounds it. We might end up committing to disk and then
> > deciding to abort instead. But remember we don't remove the xid from
> > procarray or mark the result in clog until the flush is over, so it is
> > possible. But I think we should discuss this in more detail when the
> > main patch is committed.
> >
>
> What is the "it" in "it is possible"? It seems like there's still a
> problem window in there.
Marking a transaction aborted after we have written a commit record, but
before we have removed it from proc array and marked in clog. We'd need
a special kind of WAL record to do that.
> Even if that could be made safe, in the event of a real network failure,
> you'd just wait the full timeout every transaction, because it still
> thinks it's replicating.
True, but I did suggest having two timeouts.
There is considerable reason to reduce the timeout as well as reason to
increase it - at the same time.
Anyway, lets wait for some user experience following commit.
-- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support