Thread: Master-slave visibility order

Master-slave visibility order

From
Ants Aasma
Date:
I'm currently implementing commit sequence number (CSN) based
snapshots and I hit a design decision that I would like to resolve
before I have too much code to rewrite.

The issue is commit visibility ordering on slaves. As a couple of
threads on hackers have already noted, currently commit order on
slaves can differ from what is seen on master. This arises from the
fact that on master commit visibility is determined by the order of
ProcArrayLock acquisition by ProcArrayEndTransaction(). On slaves
commit visibility is exactly the order of commit records in WAL.
Because XLogInsert() in RecordTransactionCommit() is not interlocked
with ProcArrayEndTransaction() these orders can differ. In case of
mixed sync and async transactions they in fact are quite likely to
differ due to the durability wait in RecordTransactionCommit().
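
A toy model of the race described above may help. It is only a sketch and not PostgreSQL code: the function name `simulate`, the xids, and the single flush at the end are all assumptions made for illustration. Each transaction inserts its commit record into the WAL immediately, but a synchronous one becomes visible only after its durability wait.

```python
def simulate(transactions):
    """transactions: list of (xid, synchronous) in commit-record order."""
    wal = []                # order of commit records, i.e. slave visibility order
    visible = []            # order in which the master makes commits visible
    waiting_for_flush = []  # sync commits still in their durability wait

    for xid, synchronous in transactions:
        wal.append(xid)                    # commit record inserted into WAL
        if synchronous:
            waiting_for_flush.append(xid)  # visibility delayed by the flush
        else:
            visible.append(xid)            # async: visible right away

    # The flush completes later; only then do sync commits become visible.
    visible.extend(waiting_for_flush)
    return wal, visible


slave_order, master_order = simulate([(100, True), (101, False)])
print("slave :", slave_order)   # [100, 101]
print("master:", master_order)  # [101, 100]
```

With one sync and one async transaction the two orders already disagree.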

It's not possible to change master commit order to match WAL order,
because then one of three things must happen: async transactions must
wait behind sync transactions before returning, losing the point of
async; or async transactions must return without becoming visible,
changing user visible semantics; or sync transactions must become
visible before they become durable, again changing user visible
semantics.

As it's not possible to change master commit order, the slave
visibility order must change for the orders to be consistent. WAL
currently doesn't have the information to reconstruct master commit
order. Either we need to add a new WAL record for the commit order
(only necessary when wal_level=hot_standby) or add a side channel to
replication connections to communicate commit order information.

One more consideration here is the wish expressed by several hackers
that commit record LSNs could be used as CSNs. One of the most
interesting benefits of this is the property of LSNs being the same
over the whole cluster, meaning that it would be relatively simple to
create cluster wide consistent snapshots.

I currently see the following courses of action:

1. Do nothing about the inconsistency, use a transient global counter
for master commit order and commit record LSN for slaves.
   Pro: doesn't change any semantics
   Con: we are not making any progress towards cluster wide snapshots
or even serializable transactions on slaves.

2. Create a new WAL record type that is inserted when a transaction
becomes visible. LSN of this record determines transaction visibility
order. Async transactions can be optimized to skip this record. This
record does not need to be flushed.
   Pro: cluster wide consistency, replication method agnostic
   Con: one extra WAL record insertion per writing transaction. (32
bytes of WAL per tx)

3. Use a transient global counter on master, send xid-csn pairs to
slave via a side channel on the replication connection.
   Pro: Less overhead than WAL records
   Con: replication protocol needs (possibly invasive) changes, WAL
shipping based replication can't use this mechanism, lots of extra
code required.

4. Make the choice between 1 and 2 user configurable (it seems to me
that it could even be changed without a restart).
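
To make option 2 concrete, here is a minimal sketch of how a slave could derive visibility order from a separate visibility record. The record kinds `"commit"` and `"visible"`, the function `replay_visibility_order`, and the use of list positions as LSNs are hypothetical, and the async optimization mentioned in option 2 is not modelled.

```python
def replay_visibility_order(wal):
    """Return xids in the order a slave would make them visible.

    wal is a list of (kind, xid) records; the list index stands in for the LSN.
    """
    committed = set()
    order = []
    for kind, xid in wal:
        if kind == "commit":
            committed.add(xid)   # durable commit, not yet visible
        elif kind == "visible":
            if xid not in committed:
                raise ValueError(f"visibility record before commit of {xid}")
            order.append(xid)    # position of this record decides the order
    return order


wal = [
    ("commit", 100),   # sync transaction, starts its durability wait
    ("commit", 101),   # async transaction
    ("visible", 101),  # async transaction became visible first on master
    ("visible", 100),  # sync transaction became visible after the flush
]
standby_order = replay_visibility_order(wal)
print(standby_order)  # [101, 100]
```

The slave now sees 101 before 100, matching the master, at the cost of one extra record per transaction.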

Thoughts?

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Master-slave visibility order

From
Robert Haas
Date:
On Wed, Aug 28, 2013 at 10:58 AM, Ants Aasma <ants@cybertec.at> wrote:
> I currently see the following courses of action:
>
> 1. Do nothing about the inconsistency, use a transient global counter
> for master commit order and commit record LSN for slaves.
>    Pro: doesn't change any semantics
>    Con: we are not making any progress towards cluster wide snapshots
> or even serializable transactions on slaves.
>
> 2. Create a new WAL record type that is inserted when a transaction
> becomes visible. LSN of this record determines transaction visibility
> order. Async transactions can be optimized to skip this record. This
> record does not need to be flushed.
>    Pro: cluster wide consistency, replication method agnostic
>    Con: one extra WAL record insertion per writing transaction. (32
> bytes of WAL per tx)
>
> 3. Use a transient global counter on master, send xid-csn pairs to
> slave via a side channel on the replication connection.
>    Pro: Less overhead than WAL records
>    Con: replication protocol needs (possibly invasive) changes, WAL
> shipping based replication can't use this mechanism, lots of extra
> code required.
>
> 4. Make the choice between 1 and 2 user configurable (it seems to me
> that it could even be changed without a restart).
>
> Thoughts?

I think approach #2 is dead on arrival, at least as a default policy.
It essentially amounts to requiring two commit records per transaction
rather than one, and I think that has no chance of being acceptable.
It's not just or even primarily the *volume* of WAL that I'm concerned
about so much as the feeling that hitting WAL twice rather than once
at the end of a transaction that may have only written one or two WAL
records to begin with is going to slow things down pretty
substantially, especially in high-concurrency scenarios.

I wouldn't entirely dismiss the idea of changing the user-visible
semantics.  In addition to a WAL insertion pointer and a WAL flush
pointer, you'd have a WAL snapshot pointer, which could run ahead of
the flush pointer if the transactions were all asynchronous, but which
for synchronous transactions could not advance faster than the flush
pointer.  Only users running a mix of synchronous_commit=on and
synchronous_commit=off would be harmed, and maybe we could convince
ourselves that's OK.
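
One possible reading of the snapshot pointer idea, as a sketch only: the function name `snapshot_pointer`, integer LSNs, and the rule that a commit is visible when its LSN is at or below the pointer are assumptions for illustration and are not spelled out above.

```python
def snapshot_pointer(commit_records, insert_ptr, flush_ptr):
    """commit_records: iterable of (lsn, synchronous) pairs.

    Returns the highest LSN up to which commits may be shown to new
    snapshots: everything, unless a synchronous commit is not flushed yet.
    """
    for lsn, synchronous in sorted(commit_records):
        if synchronous and lsn > flush_ptr:
            # Stop just before the first sync commit that is not durable.
            return lsn - 1
    return insert_ptr


all_async = [(10, False), (30, False)]
mixed = [(10, False), (20, True), (30, False)]

print(snapshot_pointer(all_async, insert_ptr=30, flush_ptr=0))  # 30
print(snapshot_pointer(mixed, insert_ptr=30, flush_ptr=15))     # 19
print(snapshot_pointer(mixed, insert_ptr=30, flush_ptr=25))     # 30
```

In the mixed case the async commit at LSN 30 stays invisible until the sync commit at LSN 20 is flushed, which is the harm to mixed workloads mentioned above.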

Still, there's no doubt that there is a downside there.  Therefore,
I'm inclined to suggest that you implement #1.  If, at a later time,
we want to make progress on the issue of cluster-wide snapshot
consistency, you could implement #2 or #3 as an optional feature that
can be turned on via some flag.  However, I would recommend against
trying to do that in the initial patch; I think that doing either #2
or #3 is really a separate feature, and I think if you try to
incorporate all of that code into the main CSN patch it's just going
to be a distraction from what figures to be a very complicated patch
even in minimal form.

If you did choose to implement #2 as an option at some point, it would
probably be worth optimizing for the case where commit ordering and
visibility ordering match, and try to find a design where you only
need the extra WAL record when the orderings don't match.  I'm not
sure exactly how to do that, but it might be worth investigating.  I
don't think that's enough to save #2 as a default behavior, but it
might make it more palatable as an option.

I agree with what others have said insofar as it would be nifty if we
could use the commit LSN as the commit sequence number.  But I think
you've put your finger on why that's not likely to work out well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Master-slave visibility order

From
Ants Aasma
Date:
Hi, thanks for your reply.

On Thu, Aug 29, 2013 at 6:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think approach #2 is dead on arrival, at least as a default policy.
> It essentially amounts to requiring two commit records per transaction
> rather than one, and I think that has no chance of being acceptable.
> It's not just or even primarily the *volume* of WAL that I'm concerned
> about so much as the feeling that hitting WAL twice rather than once
> at the end of a transaction that may have only written one or two WAL
> records to begin with is going to slow things down pretty
> substantially, especially in high-concurrency scenarios.

Heikki's excellent work on WAL insert scaling improves this so the hit
might not be all that big, considering that the visibility record only
needs to be inserted - relatively cheap compared to a WAL sync. But
it's still not likely to be free. I guess the only way to know for
sure would be to build it and bench it.

> I wouldn't entirely dismiss the idea of changing the user-visible
> semantics.  In addition to a WAL insertion pointer and a WAL flush
> pointer, you'd have a WAL snapshot pointer, which could run ahead of
> the flush pointer if the transactions were all asynchronous, but which
> for synchronous transactions could not advance faster than the flush
> pointer.  Only users running a mix of synchronous_commit=on and
> synchronous_commit=off would be harmed, and maybe we could convince
> ourselves that's OK.

Do you mean that mixed durability workloads with replication would
make async transactions wait or delay the visibility? We have the
additional complication of different synchronous_commit levels, so
this decision also affects different levels of synchronous commits.

> Still, there's no doubt that there is a downside there.  Therefore,
> I'm inclined to suggest that you implement #1.  If, at a later time,
> we want to make progress on the issue of cluster-wide snapshot
> consistency, you could implement #2 or #3 as an optional feature that
> can be turned on via some flag.  However, I would recommend against
> trying to do that in the initial patch; I think that doing either #2
> or #3 is really a separate feature, and I think if you try to
> incorporate all of that code into the main CSN patch it's just going
> to be a distraction from what figures to be a very complicated patch
> even in minimal form.

I'll go with #1. I agree that snapshot consistency is a separate feature
that is mostly orthogonal to CSN snapshots. I wanted to get this
decision out of the way, so when it's time to discuss the actual patch
we don't have the distraction of discussing why LSNs are not workable
for determining visibility order.

> If you did choose to implement #2 as an option at some point, it would
> probably be worth optimizing for the case where commit ordering and
> visibility ordering match, and try to find a design where you only
> need the extra WAL record when the orderings don't match.  I'm not
> sure exactly how to do that, but it might be worth investigating.  I
> don't think that's enough to save #2 as a default behavior, but it
> might make it more palatable as an option.

Without a side channel the extra WAL record is necessary. Suppose that
we want to determine the ordering with a single commit record. The
slave must be able to deduce from the single record whether it can
make the commit immediately visible or should wait for additional
information. If it waits for additional information, that may never
come as the master could have committed and then gone idle. If it
doesn't wait, then an async transaction could arrive on master, commit
and would want to become visible, but the master can't make it visible
without either violating the visibility order or letting the async
transaction wait behind the sync. In other words, without an oracle
(in the computer science sense :) ) master can't determine at the time
of commit record generation if the orderings can differ, and as WAL is
the only communication channel, neither can the slave. Timeouts won't
help either as that would need clock synchronization between servers,
similarly to Google's F1 system.

Speaking of F1, they solve the same problem by having clients be aware
of how fresh they want their snapshot to be. If we add this capability
then clients aware of this functionality could shift the visibility
wait from commit to the start of next transaction that needs to see
the changes.
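
A rough sketch of what shifting the wait to the start of the next transaction could look like. The class `Server`, its methods, and the idea of handing the client a CSN token at commit are all hypothetical; nothing like this exists in PostgreSQL or is specified above.

```python
class Server:
    def __init__(self):
        self.next_csn = 1
        self.visible_csn = 0  # all commits with csn <= this are visible

    def commit(self):
        """Return at once with the commit's CSN; visibility may lag behind."""
        csn = self.next_csn
        self.next_csn += 1
        return csn

    def advance_visibility(self, csn):
        """Called once commits up to csn may be shown to snapshots."""
        self.visible_csn = max(self.visible_csn, csn)

    def can_begin(self, min_csn):
        """A transaction that must see min_csn waits until this is True."""
        return self.visible_csn >= min_csn


server = Server()
token = server.commit()             # client remembers its own commit
before = server.can_begin(token)    # False: must wait to see its own write
unrelated = server.can_begin(0)     # True: no freshness requirement
server.advance_visibility(token)
after = server.can_begin(token)     # True: wait is over
```

Only the transaction that asks for freshness pays for the wait; the commit itself returns without waiting for visibility.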

Regards,
Ants Aasma



Re: Master-slave visibility order

From
Andres Freund
Date:
On 2013-08-30 00:22:49 +0300, Ants Aasma wrote:
> Hi, thanks for your reply.
> 
> On Thu, Aug 29, 2013 at 6:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I think approach #2 is dead on arrival, at least as a default policy.
> > It essentially amounts to requiring two commit records per transaction
> > rather than one, and I think that has no chance of being acceptable.
> > It's not just or even primarily the *volume* of WAL that I'm concerned
> > about so much as the feeling that hitting WAL twice rather than once
> > at the end of a transaction that may have only written one or two WAL
> > records to begin with is going to slow things down pretty
> > substantially, especially in high-concurrency scenarios.
> 
> Heikki's excellent work on WAL insert scaling improves this so the hit
> might not be all that big, considering that the visibility record only
> needs to be inserted - relatively cheap compared to a WAL sync. But
> it's still not likely to be free. I guess the only way to know for
> sure would be to build it and bench it.

FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
per CPU overhead actually minimally increased (at least in my tests), it
just scales noticeably better than before.

But I think that actually coordinating a consistent visibility order
between commit, wal insertion and the procarray would have bigger
scalability impact than the second record. I might be missing some
clever tricks here though.

> > If you did choose to implement #2 as an option at some point, it would
> > probably be worth optimizing for the case where commit ordering and
> > visibility ordering match, and try to find a design where you only
> > need the extra WAL record when the orderings don't match.  I'm not
> > sure exactly how to do that, but it might be worth investigating.  I
> > don't think that's enough to save #2 as a default behavior, but it
> > might make it more palatable as an option.
> 
> Without a side channel the extra WAL record is necessary. Suppose that
> we want to determine the ordering with a single commit record. The
> slave must be able to deduce from the single record if it can make the
> commit immediately visible or should it wait for additional
> information. If it waits for additional information, that may never
> come as the master could have committed and then went idle.

Well, we relatively easily could offload the task of sending such
information to the bgwriter or similar. I don't think that's a
particularly good idea, but it certainly is a possibility.

Andres



Re: Master-slave visibility order

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> But I think that actually coordinating a consistent visibility order
> between commit, wal insertion and the procarray would have bigger
> scalability impact than the second record. I might be missing some
> clever tricks here though.

Yeah.  ISTM the only way to really guarantee that the visible commit
order is the same would be for transactions to hold the ProcArrayLock
while they're inserting that WAL record.  Needless to say, that would
be absolutely disastrous performance-wise.

Or at least, that's true as long as we rely on the current procarray-based
mechanism for noting that a transaction is still in progress.  Maybe
there's some other approach altogether.
        regards, tom lane



Re: Master-slave visibility order

From
Ants Aasma
Date:
On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
> per CPU overhead actually minimally increased (at least in my tests), it
> just scales noticeably better than before.

Interesting. Do you have any insight into what is behind the CPU overhead?
Maybe the solution is to make WAL insertion cheap enough to not
matter. That won't be easy, but neither are the alternatives.

Regards,
Ants Aasma



Re: Master-slave visibility order

From
Ants Aasma
Date:
On Fri, Aug 30, 2013 at 12:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> But I think that actually coordinating a consistent visibility order
>> between commit, wal insertion and the procarray would have bigger
>> scalability impact than the second record. I might be missing some
>> clever tricks here though.
>
> Yeah.  ISTM the only way to really guarantee that the visible commit
> order is the same would be for transactions to hold the ProcArrayLock
> while they're inserting that WAL record.  Needless to say, that would
> be absolutely disastrous performance-wise.
>
> Or at least, that's true as long as we rely on the current procarray-based
> mechanism for noting that a transaction is still in progress.  Maybe
> there's some other approach altogether.

This is exactly what I'm working on. Under my scheme snapshots can be
taken completely lock free, without consulting the procarray at all,
and commits only need to exclude other commits from the moment that
visibility order is determined to when it's safe to become visible. If
we don't have any constraints on visibility order this is only a
matter of looking up the transaction's slot in a shared memory
structure and writing the next commit sequence number there. I
described the approach in a lot more detail a couple of months ago.
[1] For now I'm going to leave the semantics as is and be content that
we will have a better foundation to do something about it later.
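
A single-process sketch of the basic CSN idea from the paragraph above, under heavy simplification: the class `CsnStore`, the dict standing in for the shared memory structure, and all method names are assumptions, and the locking and atomicity that the real design needs are not modelled. The actual proposal is the one described in [1].

```python
import itertools


class CsnStore:
    def __init__(self):
        self._counter = itertools.count(1)
        self.last_csn = 0  # highest CSN assigned so far
        self.slots = {}    # xid -> CSN, or None while in progress

    def begin(self, xid):
        self.slots[xid] = None

    def commit(self, xid):
        """Write the next commit sequence number into the xid's slot."""
        csn = next(self._counter)
        self.slots[xid] = csn
        self.last_csn = csn
        return csn

    def take_snapshot(self):
        """A snapshot is a single number; no scan of running transactions."""
        return self.last_csn

    def is_visible(self, xid, snapshot):
        csn = self.slots.get(xid)
        return csn is not None and csn <= snapshot


store = CsnStore()
store.begin(100)
store.begin(101)
store.commit(101)             # gets CSN 1
snap = store.take_snapshot()  # snapshot at CSN 1
store.commit(100)             # gets CSN 2, after the snapshot

print(store.is_visible(101, snap))  # True
print(store.is_visible(100, snap))  # False
```

Visibility order here is simply the order in which CSNs are handed out, independent of where the commit records landed in the WAL.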

[1] http://www.postgresql.org/message-id/CA+CSw_tEpJ=md1zgxPkjH6CWDnTDft4gBi=+P9SnoC+Wy3pKdA@mail.gmail.com

Regards,
Ants Aasma