Re: loss of transactions in streaming replication - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: loss of transactions in streaming replication
Msg-id CAHGQGwFqEvHEZjgbefNWrxs9WCVKP9OE8x8L+==PKKT-Xab7MA@mail.gmail.com
In response to Re: loss of transactions in streaming replication  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Oct 19, 2011 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Convince me.  :-)

Yeah, I'll try.

> My reading of the situation is that you're talking about a problem
> that will only occur if, while the master is in the process of
> shutting down, a network error occurs.

No. This happens even if a network error doesn't occur. I can
reproduce the issue by doing the following:

1. Set up a streaming replication master and standby with archiving enabled.
2. Run pgbench -i
3. Shut down the master in fast mode.

Then I can see that the latest WAL file in the master's pg_xlog
doesn't exist in the standby's pg_xlog. The WAL record that was
lost is the shutdown checkpoint record.
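
For anyone who wants to check this on their own setup, here is a rough
Python sketch of the steps above and of the comparison of the two pg_xlog
directories. The helper names (repro_commands, wal_segments,
missing_on_standby), the data directory path and the database name are
placeholders I made up for illustration; they are not part of any patch.
It only assumes that WAL segment files are named with 24 hexadecimal
digits, and it compares file names, not file contents.

```python
import os
import re

# WAL segment files are named with 24 hexadecimal digits
# (timeline, log id, segment number). Everything else in pg_xlog
# (archive_status, backup history files, ...) is ignored.
WAL_SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")


def repro_commands(master_datadir, dbname="postgres"):
    """Steps 2 and 3 as argument lists (hypothetical helper).

    master_datadir and dbname are placeholders; the commands are only
    built here, not executed.
    """
    return [
        ["pgbench", "-i", dbname],
        ["pg_ctl", "stop", "-m", "fast", "-D", master_datadir],
    ]


def wal_segments(xlog_dir):
    """Return the sorted WAL segment file names found in xlog_dir."""
    return sorted(n for n in os.listdir(xlog_dir) if WAL_SEGMENT_RE.match(n))


def missing_on_standby(master_xlog, standby_xlog):
    """Return segments present in the master's pg_xlog but not the standby's."""
    standby = set(wal_segments(standby_xlog))
    return [n for n in wal_segments(master_xlog) if n not in standby]
```

After the master has shut down, calling missing_on_standby() with the two
pg_xlog paths should return an empty list if everything was replicated.
In my case I would expect it to list the segment that holds the shutdown
checkpoint record.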

When a smart or fast shutdown is requested, the master tries to
write and send the WAL switch record (if archiving is enabled) and
the shutdown checkpoint record. Because of the problem I described,
the WAL switch record arrives at the standby but the shutdown
checkpoint record does not.

> I am not sure it's a good idea
> to convolute the code to handle that case, because (1) there are going
> to be many similar situations where nothing within our power is
> sufficient to prevent WAL from failing to make it to the standby and

Shutting down the master is not a rare case. So I think it's worth
doing something.

> (2) for this marginal improvement, you're giving up including
> PQerrorMessage(streamConn) in the error message that ultimately gets
> omitted, which seems like a substantial regression as far as
> debuggability is concerned.

I think that it's possible to include PQerrorMessage() in the error
message. Will change the patch.

> Even if we do decide that we want the
> change in behavior, I see no compelling reason to back-patch it.
> Stable releases are supposed to be stable, not change behavior because
> we thought of something we like better than what we originally
> released.

The original behavior, in 9.0, is that all outstanding WAL is
replicated to the standby when the master shuts down normally.
But ISTM the behavior was changed unintentionally in 9.1. So
I think that the fix should be back-patched to 9.1 to restore the
original behavior.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

