Thread: SSI and Hot Standby

SSI and Hot Standby

From: Kevin Grittner
Here's an issue for feedback from the community -- do we want to
support truly serializable transactions on hot standby machines?
The best way Dan and I have been able to think to do this is to
build on the SERIALIZABLE READ ONLY DEFERRABLE behavior.  We are
able to obtain a snapshot and then check whether it is at a place
in the transaction processing where it would be guaranteed to be
serializable without participating in predicate locking, rw-conflict
detection, etc.  If it's not, we block until a READ WRITE
transaction completes, and then check again.  Repeat.  We may reach
a point where we determine that the snapshot can't work, and we get
a new one and start over.  Due to the somewhat complex rules for
this, you are likely to see a safe snapshot fairly quickly even in a
mix which always has short-lived READ WRITE transactions running,
although a single long-running READ WRITE transaction can block
things until it completes.
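In rough terms, that retry loop looks something like the following toy Python model. (Every name here is invented for illustration; none of these are actual backend symbols.)

```python
# Toy model of the SERIALIZABLE READ ONLY DEFERRABLE retry loop described
# above.  Purely illustrative; not backend code.

def acquire_safe_snapshot(take_snapshot, classify, wait_for_rw_commit):
    """Loop until we hold a snapshot proven immune to anomalies."""
    snap = take_snapshot()
    while True:
        verdict = classify(snap)        # "SAFE", "UNSAFE", or "PENDING"
        if verdict == "SAFE":
            return snap                 # usable without predicate locking
        if verdict == "UNSAFE":
            snap = take_snapshot()      # provably bad: get a new one
        else:
            wait_for_rw_commit()        # block, then check again

# Driver: the snapshot becomes safe once every read-write transaction
# concurrent with it has committed.
pending, committed = ["rw1", "rw2"], set()

def take_snapshot():
    return {"concurrent_rw": list(pending)}

def classify(snap):
    unresolved = [x for x in snap["concurrent_rw"] if x not in committed]
    return "PENDING" if unresolved else "SAFE"

def wait_for_rw_commit():
    committed.add(pending.pop())

snap = acquire_safe_snapshot(take_snapshot, classify, wait_for_rw_commit)
```

The real logic for deciding safety is considerably subtler, of course; the point is only the shape of the wait-and-recheck cycle.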
The idea is that whenever we see a valid snapshot which would yield
a truly serializable view of the data for a READ ONLY transaction,
we add a WAL record with that snapshot information.  Of course, we
might want some limit of how often they are sent, to avoid WAL
bloat.  A hot standby could just keep the most recently received of
these and use it when a SERIALIZABLE transaction is requested. 
Perhaps DEFERRABLE in this context could mean that it waits for the
*next* one and uses it, to assure "freshness".
Actually, we could try to get tricky to avoid sending a complete
snapshot by having two WAL messages with no payload -- one would
mean "the snapshot you would get now is being tested for
serializability".  If it failed to reach that state we would send
another when we started working on a new snapshot.  The other type of
message would mean "the snapshot you built when we last told you we
were starting to test one is good."  I *think* that can work, and it
may require less WAL space.
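A toy sketch of how the standby side might track those two marker types (names invented for illustration; this is not actual WAL-replay code):

```python
# Toy sketch of the standby side of the two payload-free WAL markers:
# "testing a snapshot now" and "that snapshot turned out good".

class StandbySnapshotTracker:
    def __init__(self):
        self.candidate = None    # snapshot built at the last "testing" marker
        self.latest_safe = None  # newest snapshot known to be serializable

    def on_testing_marker(self, build_snapshot):
        # Master says: "the snapshot you would get now is being tested."
        self.candidate = build_snapshot()

    def on_safe_marker(self):
        # Master says: "the snapshot from the last testing marker is good."
        if self.candidate is not None:
            self.latest_safe = self.candidate

tracker = StandbySnapshotTracker()
tracker.on_testing_marker(lambda: "snapshot@lsn100")
tracker.on_testing_marker(lambda: "snapshot@lsn250")  # first candidate failed
tracker.on_safe_marker()
```

A "safe" marker ratifies whatever snapshot the standby built at the most recent "testing" marker, so no snapshot payload ever needs to travel in WAL.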
If we don't do something like this, do we just provide REPEATABLE
READ on the standby as the strictest level of transaction isolation?
If so, do we generate an error on a request for SERIALIZABLE, warn
and provide degraded behavior, or just quietly give them REPEATABLE
READ behavior?
Thoughts?
-Kevin


Re: SSI and Hot Standby

From: Simon Riggs
On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:

> Here's an issue for feedback from the community -- do we want to
> support truly serializable transactions on hot standby machines?

In this release? Maybe? In later releases? Yes.

If it blocks your excellent contribution in this release, then from me,
"no". If you can achieve this in this release, yes. However, if this is
difficult or complex, then I would rather say "not yet" quickly now,
than spend months working out the weirdnesses and possibly still get
them wrong.

--
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: SSI and Hot Standby

From: Kevin Grittner
Simon Riggs <simon@2ndQuadrant.com> wrote:
> In this release? Maybe? In later releases? Yes.
> 
> If it blocks your excellent contribution in this release, then
> from me, "no". If you can achieve this in this release, yes.
> However, if this is difficult or complex, then I would rather say
> "not yet" quickly now, than spend months working out the
> weirdnesses and possibly still get them wrong.
We already have a mechanism for generating a good snapshot, the hard
part (for me at least) would be to get that snapshot over to the hot
standby and have it use the latest one on a request for a
serializable transaction.  I have no experience with WAL file
output, and don't know what it would take for hot standby to use it
as I describe.
I agree it's pretty late in the cycle, but I'm going through all the
loose ends and found this one -- which has been hanging out on the
Wiki page as an R&D item for over a full year without discussion. 
:-(  If we provide the snapshots (which we can safely and easily
do), can someone else who knows what they're doing with WAL and HS
get the rest of it safely into the release?  That seems to me to be
the only way it can still happen for 9.1.
If not, I agree this can be 9.2 material.  We just have to decide
how to document it and answer the questions near the bottom of my
initial post of the thread.
-Kevin


Re: SSI and Hot Standby

From: Robert Haas
On Wed, Jan 19, 2011 at 8:34 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> I agree it's pretty late in the cycle, but I'm going through all the
> loose ends and found this one -- which has been hanging out on the
> Wiki page as an R&D item for over a full year without discussion.
> :-(  If we provide the snapshots (which we can safely and easily
> do), can someone else who knows what they're doing with WAL and HS
> get the rest of it safely into the release?  That seems to me to be
> the only way it can still happen for 9.1.

I think it's way too late to be embarking on what will probably turn
out to be a reasonably complex and possibly controversial new
development arc.  I don't have a strong position on what we should do
instead, but let's NOT do that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: SSI and Hot Standby

From: Dan Ports
Kevin's suggestion seems eminently reasonable to me and probably the
best approach one can do for SSI and hot standby. Pulling it off in
time for 9.1 would be a stretch; 9.2 seems quite doable.

It's worth noting that one way or another, the semantics of
SERIALIZABLE transactions on hot standby replicas could be surprising
to some. There's no getting around this; serializability in distributed
systems is just a hard problem in general. Either we go with Kevin's
suggestion of treating SERIALIZABLE transactions as DEFERRABLE (whether
now or for 9.2), causing them to have to use an older snapshot or block
until an acceptable snapshot becomes available; or we require them to
be downgraded to REPEATABLE READ either implicitly or explicitly.

Now, neither of these is as alarming as it might sound, given that
replication lag is a fact of life for hot standby systems and
REPEATABLE READ is exactly the same as the current (9.0) SERIALIZABLE
behavior. But it's definitely something that should be addressed in
documentation.

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: SSI and Hot Standby

From: Simon Riggs
On Wed, 2011-01-19 at 19:34 -0600, Kevin Grittner wrote:
> I agree it's pretty late in the cycle, but I'm going through all the
> loose ends and found this one -- which has been hanging out on the
> Wiki page as an R&D item for over a full year without discussion. 
> :-(  If we provide the snapshots (which we can safely and easily
> do), can someone else who knows what they're doing with WAL and HS
> get the rest of it safely into the release?  That seems to me to be
> the only way it can still happen for 9.1.

I gave you a quick response to let you know that HS need not be a
blocker, for this release. If you are saying you have knowingly ignored
a requirement for a whole year, then I am shocked. How exactly did you
think this would ever be committed?

--
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: SSI and Hot Standby

From: Kevin Grittner
Robert Haas  wrote:
> Kevin Grittner  wrote:
>> I agree it's pretty late in the cycle, but I'm going through all
>> the loose ends and found this one -- which has been hanging out on
>> the Wiki page as an R&D item for over a full year without
>> discussion.  :-(  If we provide the snapshots (which we can safely
>> and easily do), can someone else who knows what they're doing with
>> WAL and HS get the rest of it safely into the release? That seems
>> to me to be the only way it can still happen for 9.1.
> 
> I think it's way too late to be embarking on what will probably
> turn out to be a reasonably complex and possibly controversial new
> development arc. I don't have a strong position on what we should
> do instead, but let's NOT do that.
If that can't reasonably be done for 9.1, well, my next sentence was:
>> If not, I agree this can be 9.2 material.
It'd be sweet if it could still happen in 9.1, but hardly a shock if it
can't.  I didn't want to presume to make the call.
Like I said at the start, the alternative is to decide how noisy we
want to be about providing snapshot isolation on hot standbys when
SERIALIZABLE is requested, and figuring out where to document it.
-Kevin


Re: SSI and Hot Standby

From: Stephen Frost
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> I gave you a quick response to let you know that HS need not be a
> blocker, for this release. If you are saying you have knowingly ignored
> a requirement for a whole year, then I am shocked. How exactly did you
> think this would ever be committed?

Erm, to be perfectly honest, I think the answer is probably "I was
busy.", and "no one provided any feedback on *how* to deal with it."
Given the amount of work that Kevin's put into this patch (which has
been beyond impressive, imv), I have a hard time finding fault with
him not getting time to implement a solution for Hot Standby for this.

As you say, it's not a blocker, I agree completely with that, regardless
of when it was identified as an issue.  What we're talking about is
right now, and right now is too late to fix it for HS, and to be
perfectly frank, fixing it for HS isn't required or even a terribly
important factor in whether it should be committed.

I'll refrain from casting stones about issues brought up nearly a year
ago on certain other patches which are apparently not going to include
what I, at least, consider extremely important to PG acceptance by
others.
Thanks,
    Stephen

Re: SSI and Hot Standby

From: Kevin Grittner
Simon Riggs  wrote:
> I gave you a quick response to let you know that HS need not be a
> blocker, for this release. If you are saying you have knowingly
> ignored a requirement for a whole year, then I am shocked. How
> exactly did you think this would ever be committed?
I was asked not to discuss this effort on list for most of that time,
and while it was on the Wiki page, I just lost track of it -- not
maliciously or intentionally.  I really apologize.  By the time the
9.0 release was out and it was deemed OK for me to discuss things, I
started getting feedback on problems which needed response, and I got
into the mode of reacting to that rather than ticking through my
issues list.
-Kevin


Re: SSI and Hot Standby

From: Jeff Davis
On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:
> If we don't do something like this, do we just provide REPEATABLE
> READ on the standby as the strictest level of transaction isolation?
> If so, do we generate an error on a request for SERIALIZABLE, warn
> and provide degraded behavior, or just quietly give them REPEATABLE
> READ behavior?
>  
> Thoughts?

Hopefully there is a better option available. We don't want to silently
give wrong results.

Maybe we should bring back the compatibility GUC? It could throw an
error unless the user sets the compatibility GUC to turn "serializable"
into "repeatable read".

Regards,
Jeff Davis



Re: SSI and Hot Standby

From: Heikki Linnakangas
On 20.01.2011 03:05, Kevin Grittner wrote:
> If we don't do something like this, do we just provide REPEATABLE
> READ on the standby as the strictest level of transaction isolation?
> If so, do we generate an error on a request for SERIALIZABLE, warn
> and provide degraded behavior, or just quietly give them REPEATABLE
> READ behavior?

+1 for generating an error.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: SSI and Hot Standby

From: Kevin Grittner
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 20.01.2011 03:05, Kevin Grittner wrote:
>> If we don't do something like this, do we just provide REPEATABLE
>> READ on the standby as the strictest level of transaction
>> isolation?  If so, do we generate an error on a request for
>> SERIALIZABLE, warn and provide degraded behavior, or just quietly
>> give them REPEATABLE READ behavior?
> 
> +1 for generating an error.
Before I go do that, I want to be sure everyone is clear about the
state of things.
If SSI is used to provide data integrity on the master, it will
prevent any serialization anomalies from being persisted on any hot
standby *long term*.  For example, whenever the standby is at a
point in the transaction stream where no read/write transactions
were active, no anomalies can be observed.  (That isn't the
*only* time; it's just the simplest one to describe as an example.) 
Queries on the standby can, however, see *transient* anomalies when
they run queries which would cause a serialization failure if run on
the master at the same point in the transaction stream.  This can
only occur when, of two concurrent transactions, the one which
*appears* to run second because the other can't read what it wrote,
*commits* first.
The most common and alarming situation where this occurs, in my
opinion, is batch processing.  This is extremely common in financial
applications, and tends to show up in a lot of other places, too. 
(The receipting query set is an instance of this type of problem,
but I'm going to keep it more general in hopes that people can see
where it impacts them.)  Imagine an application which has some small
control record in a table, and inserts to some other table are
assigned to a batch based on the control record.  The batches are
normally identified by ascending dates or serial numbers. 
Periodically a new batch is opened and the old batch is closed by
updating a "current batch id" column in the control table.  If the
batch ID is updated and the transaction in which that update was
executed commits while a transaction which read the old batch ID is
still in flight, a read of the database will show that the batch is
closed, but if you look at the detail of the batch, it will not yet
be complete.
Under SSI, one of these transactions will be canceled to prevent
this.  Our implementation will always allow the update which closes
the batch to complete, and either the insert or the select of the
detail will be rolled back with a serialization failure, depending
on the timing of the actions inside those transactions.  If the
insert fails, it can be retried, and will land in the new batch --
making the listing of the old batch, which omits it, correct.  If
the listing of the
batch details is canceled, it will be because the insert into the
old batch committed before it recognized the problem, so an
immediate retry of the select will see the complete batch contents.
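To make the timing concrete, here is a toy Python model of the snapshot visibility involved. (The table and column names are invented for illustration; this is a simulation, not backend code.)

```python
# Toy model of the batch-close race above.  Each commit gets a sequence
# number; a snapshot sees exactly the commits that precede it.

commits = []  # list of (seqno, description)

def commit(desc):
    commits.append((len(commits), desc))

def take_snapshot():
    return len(commits)

def visible(snap):
    return [desc for seq, desc in commits if seq < snap]

commit("insert receipt 41 into batch 7")      # earlier work in batch 7
# Two concurrent transactions: closing the batch *appears* to run second
# (the concurrent insert never saw the new batch id), yet it *commits*
# first.
commit("update control set current_batch = 8")
snap = take_snapshot()                        # a query starts here
commit("insert receipt 42 into batch 7")      # the insert commits late

seen = visible(snap)
# seen shows batch 7 closed but missing receipt 42 -- the transient
# anomaly SSI prevents on the master by cancelling one side.
```

A query using `snap` sees the batch as closed yet incomplete; a query starting after the final commit sees a consistent state again.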
A hot standby can't really take part in the predicate locking and
transaction cancellation on the master.
Dan and I have both come to the conclusion that the only reasonable
way to allow hot standby to work with SSI is for the WAL (when
wal_level = hot_standby) to contain information identifying
snapshots which are guaranteed not to see such a state.  In the above
example, barring some throttling mechanism skipping these particular
snapshots, or other problematic conflicts around the same time, the
master would tell the standby that the snapshot before either of the
two problem transactions was OK, and then it would tell them that
the snapshot after both had committed was OK.  It would not suggest
using the snapshot available between the commit of the control
record update and the commit of the insert into the batch.
This seems to me to be not completely unrelated to the snapshot
synchronization patch.  It is clearly closely related to the READ
ONLY DEFERRABLE mode, which also looks for a snapshot which is
immune to serialization anomalies without predicate locking,
conflict detection, transaction cancellation, etc.  Melding these
two things with hot standby seems to be beyond what can reasonably
happen for 9.1 without delaying the release.
If someone is using one feature and not the other, they really don't
have a problem.  Like anyone else, if a hot standby user has been
using SERIALIZABLE mode under 9.0 or earlier, they will need to
switch to REPEATABLE READ.  A SERIALIZABLE user who doesn't set up
hot standby has no issue.  Opinions so far seem to be in favor of
reporting an error on the standby if SERIALIZABLE is requested, so
that people don't silently get less protection than they expect. 
The most annoying thing about that is that if the user would *like*
to use truly serializable transactions on the standby, and will do
so when they get it in 9.2, they must switch to REPEATABLE READ now,
and switch back to SERIALIZABLE with the next release.
So, based on a more complete description of the issues, any more
opinions on whether to generate the error, as suggested by Heikki? 
Does anyone think this justifies the compatibility GUC as suggested
by Jeff?  It seems to me that this deserves documentation in the
MVCC chapter under both the "Serializable Isolation Level" and
"Enforcing Consistency With Serializable Transactions" sections.  I
think it probably deserves a note in the SET TRANSACTION reference
page, too.  Agreed?  Anywhere else?
-Kevin


Re: SSI and Hot Standby

From: Josh Berkus
Kevin,

> So, based on a more complete description of the issues, any more
> opinions on whether to generate the error, as suggested by Heikki? 

If it's a choice between generating an error and letting users see
inconsistent data, I'll take the former.

> Does anyone think this justifies the compatibility GUC as suggested
> by Jeff?  

I think it might, yes.  Since someone could simply turn on the backwards
compatibility flag for 9.1 and turn it off for 9.2, rather than trying
to mess with transaction states which might be set in application code.

Unfortunately, people have not responded to our survey :-(
http://www.postgresql.org/community/survey.77

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
 


Re: SSI and Hot Standby

From: Simon Riggs
On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:

> The idea is that whenever we see a valid snapshot which would yield
> a truly serializable view of the data for a READ ONLY transaction,
> we add a WAL record with that snapshot information. 

You haven't explained why this approach is the way forwards. What other
options have been ruled out, and why. The above approach doesn't sound
particularly viable to me.

It's not clear to me what the reason is that this doesn't just work on
HS already. If you started there it might help.

--
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: SSI and Hot Standby

From: Kevin Grittner
Simon Riggs <simon@2ndQuadrant.com> wrote:
> On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:
> 
>> The idea is that whenever we see a valid snapshot which would
>> yield a truly serializable view of the data for a READ ONLY
>> transaction, we add a WAL record with that snapshot information. 
> 
> You haven't explained why this approach is the way forwards. What
> other options have been ruled out, and why. The above approach
> doesn't sound particularly viable to me.
Why not?  We already generate appropriate snapshots for this in SSI,
so is the problem in getting the appropriate information into the
WAL stream or in having a request for a snapshot within a
serializable transaction while running in hot standby the problem?
> It's not clear to me what the reason is that this doesn't just
> work on HS already. If you started there it might help.
Because the standby would need to bombard the server with a stream
of predicate lock information, we would need to allow transactions
on the master to be canceled due in part to activity on the standby,
and I don't even know how you would begin to track read/write
conflicts between transactions on two different clusters.
If any of that didn't make sense, it would probably be more
efficient for everyone involved if those interested browsed the
Overview section of the Wiki page than to have me duplicate its
contents in a post.
http://wiki.postgresql.org/wiki/Serializable
-Kevin


Re: SSI and Hot Standby

From: Florian Pflug
On Jan21, 2011, at 00:11 , Simon Riggs wrote:
> It's not clear to me what the reason is that this doesn't just work on
> HS already. If you started there it might help.


The problem is that snapshots taken on the master sometimes represent a
state of the database which cannot occur under any (valid) serial schedule.
Hence, if you use that snapshot to read the *whole* database, you've
surely violated serializability. If you read only parts of the database,
things may or may not be fine, depending on the parts you read.

To have the same stringent guarantees that SERIALIZABLE provides on the
master also for queries run against the slave, you somehow need to prevent
this. The easiest way is to only use snapshots on the slave which *cannot*
produce such anomalies. We already know how to generate such snapshots -
SERIALIZABLE READ ONLY DEFERRABLE does exactly that. So the open question
is mainly how to transfer such snapshots to the slave, and how often we
transmit a new one.

best regards,
Florian Pflug



Re: SSI and Hot Standby

From: Kevin Grittner
I wrote:
> Why not?  We already generate appropriate snapshots for this in
> SSI, so is the problem in getting the appropriate information into
> the WAL stream or in having a request for a snapshot within a
> serializable transaction while running in hot standby the problem?
I dropped a few words.
That was supposed to ask whether the problem was in getting hot
standby to *use such a snapshot*.
I'm open to other suggestions on how else we might do this.  I don't
see any alternatives, but maybe you're seeing some possibility that
eludes me.
-Kevin


Re: SSI and Hot Standby

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:
>> The idea is that whenever we see a valid snapshot which would yield
>> a truly serializable view of the data for a READ ONLY transaction,
>> we add a WAL record with that snapshot information. 

> You haven't explained why this approach is the way forwards. What other
> options have been ruled out, and why. The above approach doesn't sound
> particularly viable to me.

I'm pretty concerned about the performance implications, too.  In
particular that sounds like you could get an unbounded amount of WAL
emitted from a *purely read only* transaction flow.  Which is not
going to fly.
        regards, tom lane


Re: SSI and Hot Standby

From: Simon Riggs
On Fri, 2011-01-21 at 00:26 +0100, Florian Pflug wrote:
> On Jan21, 2011, at 00:11 , Simon Riggs wrote:
> > It's not clear to me what the reason is that this doesn't just work on
> > HS already. If you started there it might help.
> 
> 
> The problem is that snapshots taken on the master sometimes represent a
> state of the database which cannot occur under any (valid) serial schedule.
> Hence, if you use that snapshot to read the *whole* database, you've
> surely violated serializability. If you read only parts of the database,
> things may or may not be fine, depending on the parts you read.
> 
> To have the same stringent guarantees that SERIALIZABLE provides on the
> master also for queries run against the slave, you somehow need to prevent
> this. The easiest way is to only use snapshots on the slave which *cannot*
> produce such anomalies. We already know how to generate such snapshots -
> SERIALIZABLE READ ONLY DEFERRABLE does exactly that. So the open question
> is mainly how to transfer such snapshots to the slave, and how often we
> transmit a new one.

Thank you for explaining a little more.

What I'm still not clear on is why HS is different. Whatever rules
apply on the master must also apply on the standby, immutably. Why is it
we need to pass explicit snapshot information from master to standby? We
don't do that, except at startup for normal HS. Why do we need that?

I hear, but do not yet understand, that the SSI transaction sequence on
the master may differ from the WAL transaction sequence. Is it important
that the ordering on the master would differ from the standby?

--
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: SSI and Hot Standby

From: Kevin Grittner
Tom Lane  wrote:
> I'm pretty concerned about the performance implications, too. In
> particular that sounds like you could get an unbounded amount of
> WAL emitted from a *purely read only* transaction flow.
No.  Read only transactions wouldn't create any flow at all.  And I
suggested that we might want some kind of throttle on how often we
generate snapshots even from the read write transactions.  I'm not
at all clear on how you got to the concerns you have.  Is there
something in particular I could clear up for you that isn't already
mentioned in the previous emails?
-Kevin




Re: SSI and Hot Standby

From: Dan Ports
> What I'm still not clear on is why HS is different. Whatever rules
> apply on the master must also apply on the standby, immutably. Why is it
> we need to pass explicit snapshot information from master to standby? We
> don't do that, except at startup for normal HS. Why do we need that?
> 
> I hear, but do not yet understand, that the SSI transaction sequence on
> the master may differ from the WAL transaction sequence. Is it important
> that the ordering on the master would differ from the standby?

The logical serializable ordering of transactions in SSI doesn't
necessarily match the commit time ordering (i.e. the WAL sequence). For
example, with two concurrent transactions, T1 might commit after T2,
even though it didn't see the changes made by T2 and thus has to be
considered "earlier".

It doesn't matter whether T1 committed before T2 or the other way
around, as long as no other transaction can tell the difference. If
someone saw the changes made by T1 but not those made by T2, they'd see
T2 as happening before T1, violating serializability. Our SSI code
ensures that doesn't happen by tracking read dependencies. If it
detects that such a read is happening, it rolls back one of the
transactions involved.

Now, if we extend this to hot standby, if T2 commits before T1 on the
master, it obviously will on the slave too. A transaction run on the
slave at the right time might be able to see that T2 has happened but
not T1, which is unserializable. If that transaction had been run on the
master, then it would have been detected and something would have been
rolled back, but the master has no way to know what data is being read
on the slave.

What Kevin is suggesting is that we already have a mechanism for
identifying snapshots where serialization failures like these will
never happen. If we pass that information to the slave and allow it to
run transactions only on those snapshots, serializability is safe.

Hopefully that made some more sense...

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: SSI and Hot Standby

From: Florian Pflug
On Jan21, 2011, at 01:28 , Simon Riggs wrote:
> What I'm still not clear on is why HS is different. Whatever rules
> apply on the master must also apply on the standby, immutably. Why is it
> we need to pass explicit snapshot information from master to standby? We
> don't do that, except at startup for normal HS. Why do we need that?

> I hear, but do not yet understand, that the SSI transaction sequence on
> the master may differ from the WAL transaction sequence. Is it important
> that the ordering on the master would differ from the standby?

The COMMIT order in the actual, concurrent schedule doesn't necessarily
represent the order of the transactions in an equivalent serial schedule.
Here's an example:

T1: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
T1: UPDATE D1 ... ;
T2: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
T2: SELECT * FROM D1 ... ;
T2: UPDATE D2 ... ;
T1: COMMIT;
T3: SELECT * FROM D1, D2;
T2: COMMIT;

Now, the COMMIT order is T1, T3, T2.  Let's check whether there is an
equivalent serial schedule.  In any such schedule:

T2 must run before T1 because T2 didn't see T1's changes to D1
T3 must run after T1 because T3 did see T1's changes to D1
T3 must run before T2 because T3 didn't see T2's changes to D2

This is obviously impossible - if T3 runs before T2 and T2 runs before T1
then T3 runs before T1, contradicting the second requirement. There is thus
no equivalent serial schedule and we must abort one of these transactions with
a serialization error.

Note that aborting T3 is sufficient, even though T3 is READ ONLY!  With
T3 gone, an equivalent serial schedule is T2, T1.
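The three "must run before" constraints form a directed graph, and no equivalent serial schedule exists exactly when that graph contains a cycle. A toy checker, purely illustrative:

```python
# Detect whether the "must run before" constraints admit any serial order.

def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, []))
    return any(visit(node, frozenset()) for node in graph)

constraints = [("T2", "T1"),   # T2 didn't see T1's changes to D1
               ("T1", "T3"),   # T3 did see T1's changes to D1
               ("T3", "T2")]   # T3 didn't see T2's changes to D2
# All three together form a cycle (serialization failure); with T3's
# edge removed, T2 -> T1 alone is satisfiable (schedule T2, T1).
```

The real SSI code doesn't build this graph explicitly, of course; it looks for the dangerous structures that imply such a cycle.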

On the master, these "run before" requirements are tracked by remembering
which transaction read which parts of the data via the SIREAD-lock mechanism
(these are more flags than locks, since nobody ever blocks on them).

Since we do not want to report SIREAD locks back to the master, the slave has
to prevent this another way. Kevin's proposed solution does that by only using
those snapshots on the slave for which reading the *whole* database is safe. The
downside is that whether or not a snapshot is safe can only be decided after all
concurrent transactions have finished. The snapshot is thus always a bit outdated,
but shows a state that is known to be possible in some serial schedule.

The very same mechanism can also be used on the master by setting the
isolation level to SERIALIZABLE READ ONLY DEFERRABLE.

best regards,
Florian Pflug



Re: SSI and Hot Standby

From: Kevin Grittner
Tom Lane  wrote:
> Simon Riggs  writes:
>> On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:
>>> The idea is that whenever we see a valid snapshot which would
>>> yield a truly serializable view of the data for a READ ONLY
>>> transaction, we add a WAL record with that snapshot information.
>
>> You haven't explained why this approach is the way forwards. What
>> other options have been ruled out, and why. The above approach
>> doesn't sound particularly viable to me.
>
> I'm pretty concerned about the performance implications, too. In
> particular that sounds like you could get an unbounded amount of
> WAL emitted from a *purely read only* transaction flow. Which is
> not going to fly.
Ah, coming back to this and re-reading, I think I see the point of
confusion.  The technique we're suggesting is based on the fact that
the *standby* is read only.  The flow of information about snapshots
(which might be done as actual snapshots with xid values, or possibly
as marker records saying when a candidate snapshot is being
considered and when the last one has been found acceptable) would be
from the master *to* the standby, based on *read write transactions
on the master*.  These records would inform the slave of snapshots
guaranteed not to show a serialization anomaly to a read only
transaction.
As I mentioned in another email, we might want to throttle this.  My
thinking was that we could start a timer on capturing a snapshot, and
continue to gather new ones as they become available.  When you hit
the timer limit (maybe 100ms?) you send the latest snapshot, if you
have a new one; otherwise you keep trying and send one as soon as you
get it.
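Something like this toy model of the throttle, purely to pin down the timing rule (the API here is hypothetical):

```python
# Toy model of the throttle: gather candidate safe snapshots as they
# appear, but emit at most one per interval.
import time

class SnapshotThrottle:
    def __init__(self, emit, interval=0.1):      # ~100 ms, as suggested
        self.emit = emit
        self.interval = interval
        self.deadline = None                     # timer not yet running
        self.latest = None

    def on_safe_snapshot(self, snap):
        now = time.monotonic()
        if self.deadline is None:
            self.deadline = now + self.interval  # start the timer
        self.latest = snap                       # keep only the newest
        if now >= self.deadline:
            self.emit(self.latest)               # hit the limit: send it
            self.deadline = None                 # re-arm on the next one
            self.latest = None

sent = []
fast = SnapshotThrottle(sent.append, interval=0.0)  # zero delay for demo
fast.on_safe_snapshot("snap1")                      # emitted immediately

slow = SnapshotThrottle(sent.append, interval=3600)
slow.on_safe_snapshot("snapA")                      # gathered, not sent yet
```

The important property is that a burst of safe snapshots produces at most one WAL record per interval, while a drought produces one as soon as the next snapshot arrives.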
-Kevin


Re: SSI and Hot Standby

From: Robert Haas
On Thu, Jan 20, 2011 at 8:54 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> As I mentioned in another email, we might want to throttle this.  My
> thinking was that we could start a timer on capturing a snapshot, and
> continue to gather new ones as they become available.  When you hit
> the timer limit (maybe 100ms?) you send the latest snapshot, if you
> have a new one; otherwise you keep trying and send one as soon as you
> get it.

I think this is likely to suck.  That's introducing 10 not-small XLOG
records per second just in case someone happens to try to start a
serializable transaction on a standby server.

A possibly-viable alternative would be to build something into the SR
protocol to allow the standby to request a workable snapshot from the
master, and the master to send it (out-of-band with respect to the WAL
stream) when so requested.  Then it wouldn't work if you lose the
connection to the master, but maybe that's OK.  Even with that, it
seems like there could be starvation problems - is there an upper
bound on the length of time it would take the master to generate a
safe snapshot for the standby to use?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: SSI and Hot Standby

From
"Kevin Grittner"
Date:
Robert Haas  wrote:
> Kevin Grittner  wrote:
>> As I mentioned in another email, we might want to throttle this.
>> My thinking was that we could start a timer on capturing a
>> snapshot, and continue to gather new ones as they become
>> available. When you hit the timer limit (maybe 100ms?) you send
>> the latest snapshot, if you have a new one; otherwise you keep
>> trying and send one as soon as you get it.
>
> I think this is likely to suck. That's introducing 10 not-small
> XLOG records per second just in case someone happens to try to
> start a serializable transaction on a standby server.
That depends on whether we can pull off the idea for not sending the
snapshot itself which I mentioned.  But that idea is pretty sketchy
at the moment.  I can't swear we can make that work, but if we can,
it should use a lot less WAL space.
> A possibly-viable alternative would be to build something into the
> SR protocol to allow the standby to request a workable snapshot
> from the master, and the master to send it (out-of-band with
> respect to the WAL stream) when so requested.
If we can make that work, that has advantages.
> it seems like there could be starvation problems - is there an
> upper bound on the length of time it would take the master to
> generate a safe snapshot for the standby to use?
Unfortunately, to get a safe snapshot you need to grab a candidate
snapshot and wait for all serializable read write transactions which
were active at the time to complete.  At a minimum.  If any of them
develop the wrong pattern of rw-dependencies you have to discard it,
grab a new snapshot, and try again.  I suspect that most of the time
you will succeed on the first snapshot, and so will be able to call
it safe when the last concurrent serializable read write transaction
completes, but that *could* be a long time, and there is no upper
bound.
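The retry loop just described can be sketched as follows (illustrative
Python only; all of the callback names are invented for the sketch and
none of this reflects the patch's actual interfaces):

```python
def get_safe_snapshot(take_snapshot, active_rw_txns, wait_for, develops_conflict):
    """Keep grabbing candidate snapshots until one proves safe.

    A candidate is safe once every serializable read write transaction
    that was active when it was taken has completed without developing
    the wrong pattern of rw-dependencies.  Note that the loop has no
    upper bound on how long it can run.
    """
    while True:
        snap = take_snapshot()
        ok = True
        for txn in active_rw_txns():      # concurrent serializable RW txns
            wait_for(txn)                 # block until it commits or aborts
            if develops_conflict(txn):    # bad rw-dependency pattern seen
                ok = False                # discard candidate, start over
                break
        if ok:
            return snap

# Toy driver: the first candidate has a concurrent RW transaction that
# develops a conflict, so it is discarded; the second candidate has no
# concurrent RW transactions and is returned.
candidates = iter([1, 2])
active = iter([["rw1"], []])
snap = get_safe_snapshot(
    take_snapshot=lambda: next(candidates),
    active_rw_txns=lambda: next(active),
    wait_for=lambda txn: None,
    develops_conflict=lambda txn: True,
)
print(snap)  # 2
```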
That's why I was suggesting that we try to keep a fairly current safe
snapshot sitting on the standby and use it when a serializable
transaction is requested.  Unless you request DEFERRABLE, in which
case you would wait for the *next* one to arrive.
I keep looking for another angle on this, but I'm not finding it.
I've *thought* I had something a couple times this evening while
tossing it around in my head, but the ideas fall apart on closer
inspection.  :-(
-Kevin


Re: SSI and Hot Standby

From
Heikki Linnakangas
Date:
On 21.01.2011 03:19, Dan Ports wrote:
>> What I'm still not clear on is why that HS is different. Whatever rules
>> apply on the master must also apply on the standby, immutably. Why is it
>> we need to pass explicit snapshot information from master to standby? We
>> don't do that, except at startup for normal HS. Why do we need that?
>>
>> I hear, but do not yet understand, that the SSI transaction sequence on
>> the master may differ from the WAL transaction sequence. Is it important
>> that the ordering on the master would differ from the standby?
>
> The logical serializable ordering of transactions in SSI doesn't
> necessarily match the commit time ordering (i.e. the WAL sequence). For
> example, with two concurrent transactions, T1 might commit after T2,
> even though it didn't see the changes made by T2 and thus has to be
> considered "earlier".
>
> It doesn't matter whether T1 committed before T2 or the other way
> around, as long as no other transaction can tell the difference. If
> someone saw the changes made by T1 but not those made by T2, they'd see
> T2 as happening before T1, violating serializability. Our SSI code
> ensures that doesn't happen by tracking read dependencies. If it
> detects that such a read is happening, it rolls back one of the
> transactions involved.
>
> Now, if we extend this to hot standby, if T2 commits before T1 on the
> master, it obviously will on the slave too. A transaction run on the
> slave at the right time might be able to see that T2 has happened but
not T1, which is unserializable. If that transaction had been run on the
> master, then it would have been detected and something would have been
> rolled back, but the master has no way to know what data is being read
> on the slave.

We have enough information in the standby to reconstruct all writes done 
in the master. I gather that's not enough, in order to roll back 
read-only transaction T3 on the standby which would see an anomaly, we'd 
also need to know what reads T1 and T2 did in the master. Is that correct?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: SSI and Hot Standby

From
Dan Ports
Date:
On Fri, Jan 21, 2011 at 08:44:59AM +0200, Heikki Linnakangas wrote:
> We have enough information in the standby to reconstruct all writes done 
> in the master. I gather that's not enough, in order to roll back 
> read-only transaction T3 on the standby which would see an anomaly, we'd 
> also need to know what reads T1 and T2 did in the master. Is that correct?

That's some of the information we need, but it's not enough...

The problem is that the conflict might not be discovered until after T3
(the reader) commits. In that case, it's too late to abort T3, so you'd
need to roll back T2 instead. But that means a read-only transaction on
the slave has to be able to cause a concurrent read-write transaction
on the master to abort, which brings with it no end of problems.

To make that a little more concrete, let me borrow Kevin's favorite
batch processing example...
[master] T2: BEGIN
[master] T2: SELECT FROM control
[master] T1: BEGIN
[master] T1: UPDATE control
[master] T1: COMMIT
[slave]  T3: BEGIN
[slave]  T3: SELECT FROM control, receipt
[slave]  T3: COMMIT
[master] T2: INSERT INTO receipt
[master] T2: COMMIT

If this all happened at the master, T2 would get rolled back when it
tries to do its INSERT. (I just tried it.) But if T3 happened on the
slave, the master doesn't know that it read both tables, nor does the
slave know at the time it's executing T3 that it's going to conflict
with T2. 
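Dan's schedule can be checked mechanically: write down each "must appear
to run before" constraint implied by what each transaction saw, and look
for a cycle. A small illustrative Python sketch (the constraints are
read off the schedule above by hand; none of this is PostgreSQL code):

```python
def has_cycle(edges):
    """Detect a cycle in the 'must appear to run before' graph (DFS)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {n: "unvisited" for n in graph}

    def visit(n):
        state[n] = "in-progress"
        for m in graph[n]:
            if state[m] == "in-progress":
                return True               # back edge: cycle found
            if state[m] == "unvisited" and visit(m):
                return True
        state[n] = "done"
        return False

    return any(state[n] == "unvisited" and visit(n) for n in graph)

# Ordering constraints read off Dan's schedule (T3 ran on the slave):
edges = [
    ("T2", "T1"),  # T2 read control without seeing T1's concurrent update
    ("T1", "T3"),  # T3 did see T1's update to control
    ("T3", "T2"),  # T3 read receipt without seeing T2's insert
]
print(has_cycle(edges))                                # True: unserializable
print(has_cycle([e for e in edges if "T3" not in e]))  # False: drop T3, fine
```

Dropping T3's edges removes the cycle, which is exactly why the anomaly
is invisible to a master that doesn't know what T3 read on the slave.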

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: SSI and Hot Standby

From
Simon Riggs
Date:
On Fri, 2011-01-21 at 02:32 -0500, Dan Ports wrote:
> On Fri, Jan 21, 2011 at 08:44:59AM +0200, Heikki Linnakangas wrote:
> > We have enough information in the standby to reconstruct all writes done 
> > in the master. I gather that's not enough, in order to roll back 
> > read-only transaction T3 on the standby which would see an anomaly, we'd 
> > also need to know what reads T1 and T2 did in the master. Is that correct?
> 
> That's some of the information we need, but it's not enough...
> 
> The problem is that the conflict might not be discovered until after T3
> (the reader) commits. In that case, it's too late to abort T3, so you'd
> need to roll back T2 instead. But that means a read-only transaction on
> the slave has to be able to cause a concurrent read-write transaction
> on the master to abort, which brings with it no end of problems.

So T1 and T2 are already potentially unserialized and the presence of T3
causes the sequence to be "caught out" from which we must then abort T2.
The witness does not create the offence, they just report the crime.

So any xid that commits in a different sequence to the order in which
the xid was assigned creates a potential for unserialization? Or?

On HS we know the order of arrival of xids, and we know the order of
commits, so we should be able to work out which are the potentially
unserializable snapshots. That would allow us to make the standby
independent of the master, thereby avoiding all this messy information
flow.

-- 
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: SSI and Hot Standby

From
Heikki Linnakangas
Date:
On 21.01.2011 11:10, Simon Riggs wrote:
> So any xid that commits in a different sequence to the order in which
> the xid was assigned creates a potential for unserialization? Or?

It's not the order in which the xid was assigned that matters, but the 
order the transactions started and got their snapshots. The xids might 
be assigned a lot later, after the transactions have already read data.

> On HS we know the order of arrival of xids, and we know the order of
> commits, so we should be able to work out which are the potentially
> unserializable snapshots. That would allow us to make the standby
> independent of the master, thereby avoiding all this messy information
> flow.

Unfortunately although we know the order of arrival of xids, it doesn't 
tell us the order the transactions started.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: SSI and Hot Standby

From
Simon Riggs
Date:
On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote:
> On 21.01.2011 11:10, Simon Riggs wrote:
> > So any xid that commits in a different sequence to the order in which
> > the xid was assigned creates a potential for unserialization? Or?
> 
> It's not the order in which the xid was assigned that matters, but the 
> order the transactions started and got their snapshots. The xids might 
> be assigned a lot later, after the transactions have already read data.

So if a read-write transaction assigns an xid before it takes a snapshot
then we'll be OK? That seems much easier to arrange than passing chunks
of snapshot data backwards and forwards. Optionally.

-- 
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: SSI and Hot Standby

From
Anssi Kääriäinen
Date:
On 01/21/2011 03:25 AM, Florian Pflug wrote:
> The COMMIT order in the actual, concurrent schedule doesn't necessarily
> represent the order of the transactions in an equivalent serial schedule. Here's
> an example
>
> T1: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
> T1: UPDATE D1 ... ;
> T2: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
> T2: SELECT * FROM D1 ... ;
> T2: UPDATE D2 ... ;
> T1: COMMIT;
> T3: SELECT * FROM D1, D2;
> T2: COMMIT;
>
> Now, the COMMIT order is T1, T3, T2. Let's check if there is an equivalent
> serial schedule. In any such schedule
>
> T2 must run before T1 because T2 didn't see T1's changes to D1
> T3 must run after T1 because T3 did see T1's changes to D1
> T3 must run before T2 because T3 didn't see T2's changes to D2
>
> This is obviously impossible - if T3 runs before T2 and T2 runs before T1
> then T3 runs before T1, contradicting the second requirement. There is thus
> no equivalent serial schedule and we must abort one of these transactions with
> a serialization error.
>
> Note that aborting T3 is sufficient, even though T3 is READ ONLY! With T3 gone,
> an equivalent serial schedule is T2,T1!
Sorry for bothering all of you, but I just don't get this. What if T2 
rolls back instead of committing? Then the snapshot of T3 would have 
been valid, right? Now, for the snapshot of T3 it doesn't matter if T2 
commits or if it doesn't, because it can't see the changes of T2 in any 
case. Thus, it would seem that the snapshot is valid. On the other hand 
I can't see anything wrong in the logic in your post. What am I missing? 
I am feeling stupid...

At least for dumps I don't see how T2 can matter (assuming T3 is the 
pg_dump's snapshot). Because if you reload from the dump, T2 never 
happened in that dump. In the reloaded database it just did not exist at 
all.
 - Anssi


Re: SSI and Hot Standby

From
Florian Pflug
Date:
On Jan21, 2011, at 12:55 , Anssi Kääriäinen wrote:
> On 01/21/2011 03:25 AM, Florian Pflug wrote:
>> The COMMIT order in the actual, concurrent schedule doesn't necessarily
>> represent the order of the transactions in an equivalent serial schedule. Here's
>> an example
>>
>> T1: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
>> T1: UPDATE D1 ... ;
>> T2: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
>> T2: SELECT * FROM D1 ... ;
>> T2: UPDATE D2 ... ;
>> T1: COMMIT;
>> T3: SELECT * FROM D1, D2;
>> T2: COMMIT;
>>
>> Now, the COMMIT order is T1, T3, T2. Let's check if there is an equivalent
>> serial schedule. In any such schedule
>>
>> T2 must run before T1 because T2 didn't see T1's changes to D1
>> T3 must run after T1 because T3 did see T1's changes to D1
>> T3 must run before T2 because T3 didn't see T2's changes to D2
>>
>> This is obviously impossible - if T3 runs before T2 and T2 runs before T1
>> then T3 runs before T1, contradicting the second requirement. There is thus
>> no equivalent serial schedule and we must abort one of these transactions with
>> a serialization error.
>>
>> Note that aborting T3 is sufficient, even though T3 is READ ONLY! With T3 gone,
>> an equivalent serial schedule is T2,T1!

> Sorry for bothering all of you, but I just don't get this. What if T2
> rolls back instead of committing? Then the snapshot of T3 would have
> been valid, right?

Yeah. If T2 is removed from the picture, the only remaining ordering
constraint is "T3 must run after T1 because T3 did see T1's changes to
D1", and thus T1,T3 is an equivalent serial schedule.

> Now, for the snapshot of T3 it doesn't matter if T2 commits or if it
> doesn't, because it can't see the changes of T2 in any case. Thus, it
> would seem that the snapshot is valid. On the other hand I can't see
> anything wrong in the logic in your post. What am I missing? I am
> feeling stupid...

The problem is that T3 sees the effects of T1 but not those of T2. Since
T2 must run *before* T1 in any equivalent serial schedule, that is
impossible. In other words, if you look at an equivalent serial schedule
of the *writing* transactions T1 and T2 you won't find a place to insert
T3 such that it gives the same answer as in the concurrent schedule.

It isn't really T3's snapshot that is invalid, it's the interleaving of
T1,T2,T3, because there is no equivalent serial schedule (a serial
schedule that produces the same results). If, for example, T3 reads only
*one* of the tables D1,D2 then the whole thing suddenly *is*
serializable! If T3 reads only D1, an equivalent serial schedule must
run T3 after T1, and if it reads only D2 then it must run before T2.

That "validity" of snapshots comes into play if you attempt to
distinguish safe and unsafe interleaved schedules *without* taking the
data set inspected by T3 into account. So you simply assume that T3
reads the *whole* database (since that's the worst case), and it must
thus run *after* all transactions it did see as COMMITTED in any serial
schedule.

The root of the whole issue is that this might not be possible! Some
not-yet-committed transaction (T2 in the example) may have to be placed
*before* some transaction seen as COMMITTED by T3 (T1 in our example).
Since T3 needs to run *after* T1 (since it saw it as committed) it'd
thus also see T2 in any serial schedule. But since it didn't see T2 in
the interleaved schedule, we're hosed.
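One way to convince yourself of these claims is to brute-force the
candidate serial schedules. An illustrative Python sketch (not SSI code;
the constraint lists are transcribed by hand from Florian's example):

```python
from itertools import permutations

def equivalent_serial_orders(constraints):
    """All total orders of T1,T2,T3 satisfying every (a, b) = 'a before b' pair."""
    txns = ["T1", "T2", "T3"]
    return [
        order for order in permutations(txns)
        if all(order.index(a) < order.index(b) for a, b in constraints)
    ]

base = [("T2", "T1")]      # T2 didn't see T1's update to D1
reads_d1 = [("T1", "T3")]  # T3 did see T1's change to D1
reads_d2 = [("T3", "T2")]  # T3 didn't see T2's change to D2

# T3 reads both tables: no serial order exists -- the anomaly.
print(equivalent_serial_orders(base + reads_d1 + reads_d2))  # []
# T3 reads only D1: T2,T1,T3 works.
print(equivalent_serial_orders(base + reads_d1))
# T3 reads only D2: T3,T2,T1 works.
print(equivalent_serial_orders(base + reads_d2))
```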

> At least for dumps I don't see how T2 can matter (assuming T3 is the
> pg_dump's snapshot). Because if you reload from the dump, T2 never
> happened in that dump. In the reloaded database it just did not exist
> at all.
Still, the dump would reflect a database state that *logically* never
existed (i.e. not in any serial schedule). If you dump for disaster
recovery, you might not care. If you dump to copy the data onto some
reporting server you might.

best regards,
Florian Pflug



Re: SSI and Hot Standby

From
Florian Pflug
Date:
On Jan21, 2011, at 10:19 , Heikki Linnakangas wrote:
> On 21.01.2011 11:10, Simon Riggs wrote:
>> So any xid that commits in a different sequence to the order in which
>> the xid was assigned creates a potential for unserialization? Or?
>
> It's not the order in which the xid was assigned that matters, but the
> order the transactions started and got their snapshots. The xids might
> be assigned a lot later, after the transactions have already read data.


Any pair of concurrent transactions on the master between which
r/w-dependencies exist is a potential risk. If their order in every
equivalent serial schedule doesn't match their commit order, any
snapshot taken between the two commits doesn't represent a fully
consistent view of the database.

Hm, wait a minute... Thinks...

Since transactions on the slave don't write, they really don't need to
be aware of any SIREAD locks taken on the master, right? We'd still need
to detect conflicts between SIREAD locks taken on the slaves and writes
by transactions on the master, but that could be handled by the recovery
process without having to report anything back to the master, and
without logging SIREAD lock acquisitions. So, how about the following:

A) We log r/w-dependencies between transactions committed on the master
in the WAL, probably in the COMMIT record.
B) SERIALIZABLE queries on the slave use the SIREAD lock machinery like
they'd do on the master. The recovery process does the necessary
conflict flagging in case the write happens (in wall clock time) after
the read on the slave, mimicking what the writing transaction would have
done on the master had it seen the SIREAD lock.
C) By combining the r/w-dependency information from the WAL with the
r/w-dependency information generated on the slave, we can detect
dangerous situations on the slave and abort the offending query there.

(A) and (B) seem quite straight-forward. Regarding (C), I'm not familiar
enough with the inner workings of the SSI patch to judge that.

best regards,
Florian Pflug



Re: SSI and Hot Standby

From
Nicolas Barbier
Date:
2011/1/21 Anssi Kääriäinen <anssi.kaariainen@thl.fi>:

> Sorry for bothering all of you, but I just don't get this. What if T2 rolls
> back instead of committing? Then the snapshot of T3 would have been valid,
> right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it
> doesn't, because it can't see the changes of T2 in any case. Thus, it would
> seem that the snapshot is valid. On the other hand I can't see anything
> wrong in the logic in your post. What am I missing? I am feeling stupid...
>
> At least for dumps I don't see how T2 can matter (assuming T3 is the
> pg_dump's snapshot). Because if you reload from the dump, T2 never happened
> in that dump. In the reloaded database it just did not exist at all.

This has been discussed before; in [1] I summarized:

"IOW, one could say that the backup is consistent only if it were
never compared against the system as it continued running after the
dump took place."

Nicolas

[1] <URL:http://archives.postgresql.org/pgsql-hackers/2010-09/msg01763.php>


Re: SSI and Hot Standby

From
"Kevin Grittner"
Date:
Heikki Linnakangas  wrote:
> It's not the order in which the xid was assigned that matters, but
> the order the transactions started and got their snapshots. The
> xids might be assigned a lot later, after the transactions have
> already read data.
From the "Apparent Serial Order of Execution" sub-section of the
Overview section of the Wiki page:
http://wiki.postgresql.org/wiki/Serializable#Apparent_Serial_Order_of_Execution
| The order in which transactions appear to have executed is
| determined by something more subtle than in S2PL: read/write
| dependencies. If a transaction attempts to read data which is not
| visible to it because the transaction which wrote it (or will later
| write it) is concurrent (one of them was running when the other
| acquired its snapshot), then the reading transaction appears to
| have executed first, regardless of the actual sequence of
| transaction starts or commits (since it sees a database state prior
| to that in which the other transaction leaves it).
It is the fact that it generates results consistent with some serial
order of execution which makes this truly serializable.  With SSI
that order isn't tied to transaction start or transaction commit when
transactions overlap -- the one which *appears* to have run first
might have started later *and* committed later.  And as Dan pointed
out with a repeatable example, a read only transaction can contribute
to a cycle in the apparent order of execution, even if it commits
ahead of one of the writers.
There really are only two ways a read only transaction can be
guaranteed to see consistent data under SSI:
(1)  A read write transaction might need to be cancelled to prevent
the view of the data a committed read only transaction has already
seen from becoming inconsistent.  (Dan's example)
(2)  The read only transaction needs to get a snapshot which the
master has determined is safe.
There really aren't any other alternatives under SSI.  The question
is which we support on hot standbys and how we achieve it.  Or
whether we just don't, I guess; but that doesn't seem very
satisfactory as a long term solution.
-Kevin


Re: SSI and Hot Standby

From
Robert Haas
Date:
On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
> 2011/1/21 Anssi Kääriäinen <anssi.kaariainen@thl.fi>:
>
>> Sorry for bothering all of you, but I just don't get this. What if T2 rolls
>> back instead of committing? Then the snapshot of T3 would have been valid,
>> right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it
>> doesn't, because it can't see the changes of T2 in any case. Thus, it would
>> seem that the snapshot is valid. On the other hand I can't see anything
>> wrong in the logic in your post. What am I missing? I am feeling stupid...
>>
>> At least for dumps I don't see how T2 can matter (assuming T3 is the
>> pg_dump's snapshot). Because if you reload from the dump, T2 never happened
>> in that dump. In the reloaded database it just did not exist at all.
>
> This has been discussed before; in [1] I summarized:
>
> "IOW, one could say that the backup is consistent only if it were
> never compared against the system as it continued running after the
> dump took place."

But that's a pretty fair way to look at it, isn't it?  I mean, I guess
it's a question of what you plan to use that backup for, but if it's
disaster recovery, everything that happened after the dump is gone, so
no such comparison will occur.  And that's probably the most common
reason for taking a dump.

It occurs to me that focusing on how this is going to work on Hot
Standby might be looking at the question too narrowly.  The general
issue is - does this technique generalize to a distributed computing
environment, with distributed transactions across multiple PostgreSQL
databases?  For example, what if the control record in Kevin's example
is stored in another database, or on another server.  Or what if some
tables are replicated via Slony?  I realize this is all outside the
scope of the patch, but that's exactly the point: making this stuff
work across multiple databases (even if they are replicas of each
other) is much more complex than getting it to work on just one
machine.  Even if we could agree on how to do it, coming up with some
hack that can only ever possibly work in the Hot Standby case might
not be the best thing to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: SSI and Hot Standby

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> (1)  A read write transaction might need to be canceled to
> prevent the view of the data a committed read only transaction has
> already seen from becoming inconsistent.  (Dan's example)
And this one seems entirely a theoretical possibility.  I spent a
little time looking it over, and I don't see how it could be made to
work from hot standbys without an unbounded flow of predicate lock
information from all standbys to the master *plus* blocking commits
on the master for the duration of the longest round trip latency to
any standby.  I think we can call this one dead on arrival.
-Kevin


Re: SSI and Hot Standby

From
Anssi Kääriäinen
Date:
On 01/21/2011 02:21 PM, Florian Pflug wrote:
> Still, the dump would reflect a database state that *logically* never
> existed (i.e. not in any serial schedule). If you dump for disaster
> recovery, you might not care. If you dump to copy the data onto some
> reporting server you might.
>
> best regards,
> Florian Pflug

I am beginning to understand the problem. If you don't mind, here is a
complete example in case somebody else is having trouble understanding this.

Let's say we have tables D1 and D2. Both contain a single column, id, 
and a single row. The data in the beginning is as follows:

D1:  id = 1
D2:  id = 1

The constraints: D1.id can only be incremented. Whenever D2.id is 
updated, it must be updated to D1.id + 1.

The transactions:
T1: begin; update D1 set id = id + 1;
T2: begin; update D2 set id = (select id+1 from D1);
T1: commit;
T3: begin; select id from D1; select id from D2; commit;
    -- Data seen: (2, 1) -- this is a possible state
T2: commit;
T4: begin; select id from D1; select id from D2; commit;
    -- Data seen: (2, 2)
This is again a possible state. But if we compare it to the state seen
by T3 it is not valid: from state (2, 1) we cannot get to state (2, 2)
without breaking one of the constraints. Thus, the state seen by T3 is
not valid in the database.
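The impossible transition can be checked mechanically. A tiny Python
sketch of the two constraints (purely illustrative; `consistent_step` is
an invented name, and states are modeled as (D1.id, D2.id) pairs):

```python
def consistent_step(before, after):
    """Could the database move from state `before` to `after`?

    Constraints from the example: D1.id only ever increments, and any
    update of D2.id must set it to D1.id + 1 as of that update.  Since
    D1.id only grows, its value at the moment D2 was updated lies
    somewhere between the old and new D1 values.
    """
    d1_old, d2_old = before
    d1_new, d2_new = after
    if d1_new < d1_old:
        return False                   # D1.id can only be incremented
    if d2_new == d2_old:
        return True                    # D2 untouched
    return any(d2_new == d1 + 1 for d1 in range(d1_old, d1_new + 1))

print(consistent_step((1, 1), (2, 1)))  # True: T3's view is reachable
print(consistent_step((2, 1), (2, 2)))  # False: T3 -> T4 is impossible
print(consistent_step((1, 1), (2, 2)))  # True: T4's view alone is fine
```

Each view is valid in isolation; only the *pair* of views (2, 1) then
(2, 2) is inconsistent, which is the whole point of the example.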

So, I finally got it! :-) I hope this example will help somebody else
understand the problem. The problem I had understanding this was that
the state in T3 is in fact perfectly valid. I thought that there must be
some problem with that state alone. There isn't, unless you compare it
to the state after T2 has committed.

Thanks to all for explaining this to me,
 - Anssi


Re: SSI and Hot Standby

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier
>> This has been discussed before; in [1] I summarized:
>>
>> "IOW, one could say that the backup is consistent only if it were
>> never compared against the system as it continued running after
>> the dump took place."
> 
> But that's a pretty fair way to look at it, isn't it?  I mean, I
> guess it's a question of what you plan to use that backup for, but
> if it's disaster recovery, everything that happened after the dump
> is gone, so no such comparison will occur.  And that's probably
> the most common reason for taking a dump.
It's not, however, a reason for having a hot standby (versus a warm
standby or PITR backup).
> It occurs to me that focusing on how this is going to work on Hot
> Standby might be looking at the question too narrowly.  The
> general issue is - does this technique generalize to a distributed
> computing environment, with distributed transactions across
> multiple PostgreSQL databases?
No, and I can pretty much guarantee that you can't have such a
solution without blocking on all masters at commit time.  What
you're suggesting goes *way* beyond two phase commit, which just
guarantees the integrity rules of each database are honored and
that all transactions either commit or don't.  You're talking about
sharing lock information across high-latency communication links
which in SSI are communicated via LW locking.  Expect any such
generalized "and world peace!" solution to be rather slow.
> For example, what if the control record in Kevin's example is
> stored in another database, or on another server.  Or what if some
> tables are replicated via Slony?  I realize this is all outside
> the scope of the patch
Yep.  Again, the patch achieves true serializability with minimal
cost and *no blocking*.  Spend a few minutes thinking about how you
might coordinate what you propose, and you'll see it's going to
involve blocking based on waiting for messages from across the wire.
> but that's exactly the point: making this stuff work across
> multiple databases (even if they are replicas of each other) is
> much more complex than getting it to work on just one machine. 
> Even if we could agree on how to do it, coming up with some hack
> that can only ever possibly work in the Hot Standby case might not
> be the best thing to do.
I don't see it as a hack.  It's the logical extension of SSI onto
read only replicas.  If you're looking for something more than that
(as the above suggests), it's not a good fit; but I suspect that
there are people besides me who would want to use hot standby for
reporting and read only web access who would want a serializable
view.  What this proposal does is to say that there are two time
streams to look at on the standby -- how far along you are for
purposes of recovery, and how far along you are for purposes of
seeing a view of the data sure to be consistent with the later state of
the master.  With SSI they can't be the same.  If someone wants them
to be, they could implement a traditional S2PL serializable mode,
complete with blocking and deadlocks, and then you'd have it
automatically on the replicas, because with S2PL the apparent order
of execution matches the commit order.
-Kevin


Re: SSI and Hot Standby

From
Nicolas Barbier
Date:
2011/1/21 Robert Haas <robertmhaas@gmail.com>:

> On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier
> <nicolas.barbier@gmail.com> wrote:
>
>> 2011/1/21 Anssi Kääriäinen <anssi.kaariainen@thl.fi>:
>>
>>> Sorry for bothering all of you, but I just don't get this. What if T2 rolls
>>> back instead of committing? Then the snapshot of T3 would have been valid,
>>> right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it
>>> doesn't, because it can't see the changes of T2 in any case. Thus, it would
>>> seem that the snapshot is valid. On the other hand I can't see anything
>>> wrong in the logic in your post. What am I missing? I am feeling stupid...
>>>
>>> At least for dumps I don't see how T2 can matter (assuming T3 is the
>>> pg_dump's snapshot). Because if you reload from the dump, T2 never happened
>>> in that dump. In the reloaded database it just did not exist at all.
>>
>> This has been discussed before; in [1] I summarized:
>>
>> "IOW, one could say that the backup is consistent only if it were
>> never compared against the system as it continued running after the
>> dump took place."
>
> But that's a pretty fair way to look at it, isn't it?

Indeed, I just wanted to point Anssi to the previous discussion.

> It occurs to me that focusing on how this is going to work on Hot
> Standby might be looking at the question too narrowly.  The general
> issue is - does this technique generalize to a distributed computing
> environment, with distributed transactions across multiple PostgreSQL
> databases?  For example, what if the control record in Kevin's example
> is stored in another database, or on another server.  Or what if some
> tables are replicated via Slony?  I realize this is all outside the
> scope of the patch, but that's exactly the point: making this stuff
> work across multiple databases (even if they are replicas of each
> other) is much more complex than getting it to work on just one
> machine.  Even if we could agree on how to do it, coming up with some
> hack that can only ever possibly work in the Hot Standby case might
> not be the best thing to do.

You seem to be questioning whether the normal (?) way of using 2PC on
multiple DBs (just start transactions, and let 2PC coordinate the
commits) that all use SERIALIZABLE isolation mode always results in
globally serializable behavior.

I must say that I don't immediately see the answer (my gut feeling
says "nope"), but it sure is an interesting question.

In the special case where all databases use SS2PL as their concurrency
mechanism, the answer is "yes". (Because any "possibly conflicting"
local transactions of any global transactions that touch the same
tables ("conflict") will necessarily commit in some globally
consistent order, which in the case of SS2PL is consistent with the
"must-have-happened-before-in-any-serialized-order" relation, i.e.,
using the terminology from the literature: "because SS2PL is CO".)

I don't know whether it is generally considered a requirement for a
2PC environment to guarantee global serializability.

Nicolas


Re: SSI and Hot Standby

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote:
>> It's not the order in which the xid was assigned that matters, but the 
>> order the transactions started and got their snapshots. The xids might 
>> be assigned a lot later, after the transactions have already read data.

> So if a read-write transaction assigns an xid before it takes a snapshot
> then we'll be OK? That seems much easier to arrange than passing chunks
> of snapshot data backwards and forwards. Optionally.

No, that idea is DOA from a performance standpoint.  We sweated blood to
avoid having to assign XIDs to read-only transactions, and we're not
going back.  If SSI requires that, SSI is not getting committed.
        regards, tom lane


Re: SSI and Hot Standby

From
Robert Haas
Date:
On Fri, Jan 21, 2011 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote:
>>> It's not the order in which the xid was assigned that matters, but the
>>> order the transactions started and got their snapshots. The xids might
>>> be assigned a lot later, after the transactions have already read data.
>
>> So if a read-write transaction assigns an xid before it takes a snapshot
>> then we'll be OK? That seems much easier to arrange than passing chunks
>> of snapshot data backwards and forwards. Optionally.
>
> No, that idea is DOA from a performance standpoint.  We sweated blood to
> avoid having to assign XIDs to read-only transactions, and we're not
> going back.  If SSI requires that, SSI is not getting committed.

So far I think all of the ideas proposed for generalizing this across
the master-standby connection seem likely to be DOA from a performance
perspective.  But I think we have a pretty broad consensus that it's
OK to punt this issue for 9.1.  We can always add this in 9.2 if it
can be demonstrated to work well, but it's all vapor-ware right now
anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: SSI and Hot Standby

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote:
>>> It's not the order in which the xid was assigned that matters,
>>> but the order the transactions started and got their snapshots.
>>> The xids might be assigned a lot later, after the transactions
>>> have already read data.
> 
>> So if a read-write transaction assigns an xid before it takes a
>> snapshot then we'll be OK? That seems much easier to arrange than
>> passing chunks of snapshot data backwards and forwards.
We're not talking about passing them backwards.  I'm suggesting that
we probably don't even need to pass them forward, but that
suggestion has been pretty handwavy so far.  I guess I should fill
it out, because everyone's been ignoring it so far.
> No, that idea is DOA from a performance standpoint.  We sweated
> blood to avoid having to assign XIDs to read-only transactions,
> and we're not going back.  If SSI requires that, SSI is not
> getting committed.
SSI doesn't require that.  The suggestion that it would in *any* way
help with the interaction with hot standby is off-base.
-Kevin


Re: SSI and Hot Standby

From
"Kevin Grittner"
Date:
Anssi Kääriäinen<anssi.kaariainen@thl.fi> wrote:
> I am beginning to understand the problem. If you don't mind, here
> is a complete example if somebody else is having troubles
> understanding this.
> 
> Let's say we have tables D1 and D2. Both contain a single column,
> id, and a single row. The data in the beginning is as follows:
> 
> D1:  id = 1
> D2:  id = 1
> 
> The constrains: D1.id can only be incremented. Whenever D2.id is 
> updated, it must be updated to D1.id + 1.
> 
> The transactions:
> T1: begin; update D1 set id = id + 1;
> T2: begin; update D2 set id = (select id+1 from D1);
> T1: commit;
> T3: begin; select id from D1; select id from D2; commit; Data
> seen: (2, 1) -- this is a possible state
> T2: commit;
> T4: begin; select id from D1; select id from D2; commit; Data
> seen: (2, 2)
> This is again a possible state. But if we compare this to the
> state seen 
> by T3 this is not valid. From state (2, 1) we cannot get to state
> (2, 2) without breaking one of the constraints. Thus, the state of
> T3 is not valid in the database.
> 
> So, I finally got it! :-) I hope this example will help somebody
> else understand the problem.
Yeah, interesting example.  Under SSI, once T3 selects from D2 you
have a dangerous structure, and either T2 or T3 must fail to prevent
the possibility of the sort of anomaly your example demonstrates. 
We would prefer to see T2 fail, because if T3 fails it will continue
to fail on retry until T2 completes.  We're trying to avoid that
kind of thrashing.  If T2 fails and is retried, it will immediately
succeed and generate results consistent with what T3 saw.
When I test your example, though, I'm getting the serialization
failure on T3 rather than T2, so I'd call that a bug.  Will
investigate.  Thanks again for your tests!  You seem to be able to
shake out issues better than anyone else!  Once found, fixing them
is not usually very hard, it's coming up with that creative usage
pattern to *find* the problem which is the hard part.
OK if I add this one to our dcheck test suite, too?
-Kevin


Re: SSI and Hot Standby

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> When I test your example, though, I'm getting the serialization
> failure on T3 rather than T2, so I'd call that a bug.  Will
> investigate.  Thanks again for your tests!
Fixed with this:
http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=b91460812396b68362c812d6e4fb67799fc6147e
Thanks again!
-Kevin


Re: SSI and Hot Standby

From
"Kevin Grittner"
Date:
I wrote:
> We're not talking about passing them backwards.  I'm suggesting
> that we probably don't even need to pass them forward, but that
> suggestion has been pretty handwavy so far.  I guess I should fill
> it out, because everyone's been ignoring it so far.
It's been too hectic today to flesh this out very well, but I can at
least do a better brain dump -- you know, wave my hands a little
less vaguely.
The idea of communicating about a safe snapshot through the WAL
without actually *sending* snapshot XIDs through the WAL might work
something like this:
(1)  We communicate when we are starting to consider a snapshot. 
This would always be related to the commit or rollback of a
serializable read-write transaction, so perhaps we could include the
information in an existing WAL record.  We would need to find one
free bit somewhere, or make room for it.  Alternatively, we could
send a new WAL record type to communicate this.  At the point that a
standby processes such a WAL record, it would grab a snapshot in
effect after the commit and save it as the "latest candidate", releasing
the previous candidate, if any.
(2)  If a snapshot fails to make it to a safe status on the master,
it will pick a new candidate and repeat (1) -- there's no need to
explicitly quash a failed candidate.
(3)  We communicate when we find that the last candidate made it to
"safe" status.  Again, this would be related to the commit or
rollback of a serializable read-write transaction.  Same issues
about needing (another) bit or using a new record type.  When a
standby receives this, it promotes the latest candidate to the new
"safe snapshot" to be used when a serializable transaction asks for
a snapshot, replacing the previous value, if any.  Any transactions
waiting for a snapshot (either because there previously wasn't a
safe snapshot on record or because they requested DEFERRABLE) could
be provided the new snapshot and turned loose.
(4)  It's not inconceivable that we might want to send both (1) and
(3) with the same commit.
(5)  Obviously, we can pick our heuristics for how often we try to
refresh this, limiting it to avoid too much overhead, at the cost of
less frequent snapshot updates for serializable transactions on the
standbys.
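The marker handling in (1) through (3) amounts to a small standby-side state machine.  Here is one possible sketch; the class, method, and marker names are all hypothetical, and this is not actual PostgreSQL code:

```python
# Hypothetical standby-side bookkeeping for the two proposed WAL markers:
# a "candidate" marker ("we are starting to test the snapshot you would
# take now") and a "safe" marker ("that candidate reached safe status").
# Invented names throughout; not PostgreSQL internals.

class StandbySafeSnapshots:
    def __init__(self):
        self.candidate = None   # snapshot currently being tested on the master
        self.safe = None        # most recent snapshot proven safe
        self.waiters = []       # callbacks for DEFERRABLE (or blocked) requests

    def on_candidate_marker(self, take_snapshot):
        # Steps (1)/(2): snapshot the standby as of this WAL record and make
        # it the latest candidate; a previously failed candidate is simply
        # overwritten, so no explicit "quash" message is needed.
        self.candidate = take_snapshot()

    def on_safe_marker(self):
        # Step (3): promote the latest candidate and release anyone waiting.
        if self.candidate is not None:
            self.safe = self.candidate
            for notify in self.waiters:
                notify(self.safe)
            self.waiters.clear()

    def request_snapshot(self, deferrable, on_ready):
        # Ordinary serializable transactions take the current safe snapshot;
        # DEFERRABLE ones (or any request before the first safe snapshot
        # arrives) wait for the *next* one.
        if not deferrable and self.safe is not None:
            on_ready(self.safe)
        else:
            self.waiters.append(on_ready)

standby = StandbySafeSnapshots()
got = []
standby.request_snapshot(deferrable=True, on_ready=got.append)  # blocks
standby.on_candidate_marker(lambda: "snapshot@lsn-100")
standby.on_safe_marker()   # releases the deferrable request
print(got)
```

Note that keeping only the single latest candidate is what makes step (2) free: a failed candidate is never referenced again, it is just replaced the next time a (1) marker arrives.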
My assumption is that when we have a safe snapshot (which should be
pretty close to all the time), we immediately provide it to any
serializable transaction requesting a snapshot, except it seems to
make sense to use the new DEFERRABLE mode to mean that you want to
use the *next* one to arrive.
This would effectively cause the point in time which was visible to
serializable transactions to lag behind what is visible to other
transactions by a variable amount, but would ensure that a
serializable transaction couldn't see any serialization anomalies. 
It would also be immune to serialization failures from SSI logic;
but obviously, standby-related cancellations would be in play.  I
don't know whether the older snapshots would tend to increase the
standby-related cancellations, but it wouldn't surprise me.
Hopefully this is enough for people to make something of it.
-Kevin


Re: SSI and Hot Standby

From
Jeff Davis
Date:
On Fri, 2011-01-21 at 18:52 -0600, Kevin Grittner wrote:
> My assumption is that when we have a safe snapshot (which should be
> pretty close to all the time), we immediately provide it to any
> serializable transaction requesting a snapshot, except it seems to
> make sense to use the new DEFERRABLE mode to mean that you want to
> use the *next* one to arrive.

How would it handle this situation:
1. Standby has safe snapshot S1
2. Primary does a VACUUM which removes some stuff visible in S1
3. Standby can't replay the VACUUM because it still has S1, but also
can't get a new S2 because the WAL needed for that is behind the VACUUM
So, S1 needs to be discarded. What do we do on the standby while there
is no safe snapshot? I suppose throw errors -- I can't think of anything
else.

> This would effectively cause the point in time which was visible to
> serializable transactions to lag behind what is visible to other
> transactions by a variable amount, but would ensure that a
> serializable transaction couldn't see any serialization anomalies. 
> It would also be immune to serialization failures from SSI logic;
> but obviously, standby-related cancellations would be in play.  I
> don't know whether the older snapshots would tend to increase the
> standby-related cancellations, but it wouldn't surprise me.

I'm also a little concerned about the user-understandability here. Is it
possible to make the following guarantees in this approach:

1. If transactions are completing on the primary, new snapshots will be
taken on the standby; and
2. If no write transactions are in progress on the primary, then the
standby will get a snapshot that represents the exact same data as on
the primary?

That would be fairly easy to explain to users. If there is a visibility
lag, then we just say "finish the write transactions, and progress will
be made". And if the system is idle, they should see identical data.

Regards,
Jeff Davis



Re: SSI and Hot Standby

From
"Kevin Grittner"
Date:
> Jeff Davis  wrote:
> On Fri, 2011-01-21 at 18:52 -0600, Kevin Grittner wrote:
>> My assumption is that when we have a safe snapshot (which should
>> be pretty close to all the time), we immediately provide it to any
>> serializable transaction requesting a snapshot, except it seems to
>> make sense to use the new DEFERRABLE mode to mean that you want to
>> use the *next* one to arrive.
> 
> How would it handle this situation:
> 1. Standby has safe snapshot S1
> 2. Primary does a VACUUM which removes some stuff visible in S1
> 3. Standby can't replay the VACUUM because it still has S1, but
> also can't get a new S2 because the WAL needed for that is behind
> the VACUUM
> 
> So, S1 needs to be discarded. What do we do on the standby while
> there is no safe snapshot? I suppose throw errors -- I can't think
> of anything else.
We could wait for the next safe snapshot to arrive.  I don't know how
often that combination would occur, particularly in a situation where
there were long-running serializable read-write transactions on the
master which would prevent a new safe snapshot from being generated. 
It seems as though a long-running transaction on the master would
also block vacuum activity.
I'm not sure how we can *really* know the frequency without field
experience.
>> This would effectively cause the point in time which was visible
>> to serializable transactions to lag behind what is visible to
>> other transactions by a variable amount, but would ensure that a
>> serializable transaction couldn't see any serialization anomalies.
>> It would also be immune to serialization failures from SSI logic;
>> but obviously, standby-related cancellations would be in play. I
>> don't know whether the older snapshots would tend to increase the
>> standby-related cancellations, but it wouldn't surprise me.
> 
> I'm also a little concerned about the user-understandability here.
> Is it possible to make the following guarantees in this approach:
> 
> 1. If transactions are completing on the primary, new snapshots
> will be taken on the standby; and
The rules there are rather complicated.  Based on previous responses
to posts where I've gotten into that detail, I fear that specifying
it with complete accuracy would cause so many eyes to glaze over that
nobody would get to the end of the description.  I will do it if
anybody asks, but without that I'll just say that the conditions for
blocking a safe snapshot in a mix of short-lived read-write
transactions are so esoteric that I expect that they would be
uncommon in practical use.  On the other hand, one long-running
read-write transaction could block generation of a new safe snapshot
indefinitely.  Transactions declared as read-only or running at an
isolation level other than serializable would have no impact on
generation of a safe snapshot.
> 2. If no write transactions are in progress on the primary, then
> the standby will get a snapshot that represents the exact same data
> as on the primary?
A snapshot taken while there are no serializable read-write
transactions active can immediately be declared safe.  Whether such a
snapshot is always available on the standby depends on what sort of
throttling, if any, is used.
> That would be fairly easy to explain to users. If there is a
> visibility lag, then we just say "finish the write transactions,
> and progress will be made". And if the system is idle, they should
> see identical data.
Well, unless it's sync rep, you'll always have some latency between
the master and the standby.  And any throttling to control resource
utilization could also cause latency between other transactions and
serializable ones.  But other than that, you're exactly on target.
-Kevin


Re: SSI and Hot Standby

From
Kääriäinen Anssi
Date:
"When I test your example, though, I'm getting the serialization
failure on T3 rather than T2, so I'd call that a bug.  Will
investigate.  Thanks again for your tests!  You seem to be able to
shake out issues better than anyone else!  Once found, fixing them
is not usually very hard, it's coming up with that creative usage
pattern to *find* the problem which is the hard part."

Thank you very much, but I do not deserve this honor. I was just constructing an example for myself so that I could
understand why a read only transaction might pose a problem. I posted it to help other people to see a concrete
example of the problem. I had no idea this would show an actual bug in the code.

"OK if I add this one to our dcheck test suite, too?"

It is of course OK. And if you want to add this as an example in the documentation, it would be great. This is a
simple but concrete example of why a read only serializable transaction might cause an anomaly. If I am not mistaken,
there isn't any complete example in the documentation. It was hard for me to grasp why there might be a problem and I
don't think I am alone.
- Anssi