Re: Serializable snapshot isolation error logging - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: Serializable snapshot isolation error logging
Msg-id 4C989DA70200002500035A70@gw.wicourts.gov
In response to Re: Serializable snapshot isolation error logging  (Dan S <strd911@gmail.com>)
Dan S <strd911@gmail.com> wrote:
> A starvation scenario is what worries me:
> 
> Let's say we have a slow complex transaction with many tables
> involved.  Concurrently, smaller transactions begin and commit.
> 
> Wouldn't it be possible for a starvation scenario where the slower
> transaction will never run to completion but give a serialization
> failure over and over again on retry ?
At least theoretically, yes.  One of the reasons I want to try
converting the single conflict reference to a list is to make for a
better worst-case situation.  Since anomalies can only occur when
the TN transaction (convention used in earlier post) commits first,
and by definition TN has done writes, with a list of conflicts you
could make sure that some transaction which writes has successfully
committed before any transaction rolls back.  So progress with
writes would be guaranteed.  There would also be a guarantee that if
you restart a canceled transaction, it would not immediately fail
again on conflicts *with the same transactions*.  Unfortunately,
with the single field for tracking conflicts, the self-reference on
multiple conflicting transactions loses detail, and you lose these
guarantees.
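
To make the retry scenario concrete, here is a minimal client-side sketch in Python. It is not part of the patch. `SerializationFailure`, `run_with_retry`, and `make_flaky_txn` are hypothetical names; a real client would instead check its driver's error for SQLSTATE 40001 (serialization_failure). The attempt cap is an assumption, added so that a starved transaction surfaces as an error instead of looping forever.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""


def run_with_retry(txn, max_attempts=5, base_delay=0.0):
    """Run txn() and retry it from the start on serialization failure.

    Returns (result, attempts_used). Re-raises the failure once
    max_attempts is reached, so starvation is visible to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn(), attempt
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # Jittered exponential backoff; base_delay=0 disables sleeping.
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


def make_flaky_txn(failures):
    """Build a fake transaction that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def txn():
        if state["left"] > 0:
            state["left"] -= 1
            raise SerializationFailure("could not serialize access")
        return "committed"

    return txn
```

With `make_flaky_txn(2)` the loop succeeds on the third attempt. A transaction that fails on every attempt exhausts `max_attempts` and the error propagates, which is the starvation case discussed above.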

Now, could the large, long-running transaction still be the
transaction canceled?  Yes.  Are there ways to ensure it can
complete?  Yes.  Some are prettier than others.  I've already come
up with some techniques to avoid some classes of rollbacks with
transactions flagged as READ ONLY, and with the conflict lists there
would be a potential to recognize de facto read only transactions
and apply similar logic, so a long-running transaction which didn't
write to any permanent tables (or at least not to ones which other
transactions were reading) would be pretty safe -- and with one of
our R&D points, you could guarantee its safety by blocking the
acquisition of its snapshot until certain conditions were met.

With conflict lists we would also always have two candidates for
cancellation at the point where we found something needed to be
canceled.  Right now I'm taking the coward's way out and always
canceling the transaction active in the process which detects the
need to roll something back.  As long as one process can cancel
another, we can use other heuristics for that.  Several possible
techniques come to mind to try to deal with the situation you raise.
If all else fails, the transaction could acquire explicit table
locks up front, but that sort of defeats the purpose of having an
isolation level which guarantees full serializable behavior without
adding any blocking to snapshot isolation.  :-(
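
Purely as an illustration of what such a heuristic might look like: given the two candidates, cancel the one that has been retried fewer times, and break ties by cancelling the one that has been running for less time. The `Candidate` fields and `choose_victim` are hypothetical and do not correspond to anything in the patch, which as described above always cancels the transaction in the detecting process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of a transaction that could be canceled."""
    xid: int
    retries: int            # times this work has already been rolled back
    elapsed_seconds: float  # how long the current attempt has been running


def choose_victim(a, b):
    """Pick which of two candidates to cancel.

    Prefers the one with fewer prior retries, then the one with less
    elapsed time, so a repeatedly canceled long transaction is
    eventually spared. Remaining ties go to the higher xid, which is
    an arbitrary choice.
    """
    def cost_of_cancel(c):
        return (c.retries, c.elapsed_seconds, -c.xid)

    return min(a, b, key=cost_of_cancel)
```

For example, a transaction that has already been rolled back three times and has run for two minutes would be spared in favor of cancelling a fresh sub-second transaction, regardless of which process detected the conflict.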
> If I know at what sql-statement the serialization failure occurs
> can I then conclude that some of the tables in that exact query
> were involved in the conflict ?
No.  It could be related to any statements which had executed in the
transaction up to that point.
> If the serialization failure occurs at commit time what can I
> conclude then ?
That a dangerous combination of read-write dependencies occurred
which involved this transaction.
> They can occur at commit time, right?
Yes.  Depending on the heuristics chosen, it could happen while
"idle in transaction".  (We can kill transactions in that state now,
right?)
> What is the likelihood that there exists an update pattern that
> always gives the failure in the slow transaction?
I don't know how to quantify that.  I haven't seen it yet in
testing, but many of my tests so far have been rather contrived.  We
desperately need more testing of this patch with realistic
workloads.
> How would one break such a recurring pattern ?
As mentioned above, the conflict list enhancement would help ensure
that *something* is making progress.  We could also tweak the
heuristics on *what* gets canceled to try to deal with this.
> You could maybe try to lock each table used in the slow
> transaction but that would be prohibitively costly for
> concurrency.
Exactly.
> But what else if there is no way of knowing what the slow
> transaction conflicts against.
Well, that is supposed to be the situation where this type of
approach is a good thing.  The trick is to get enough experience
with different loads to make sure we're using good heuristics to
deal with various loads well.  Ultimately, there may be some loads
for which this technique is just not appropriate.  Hopefully those
cases can be addressed with the techniques made possible with
Florian's patch.
> As things with concurrency involved have a tendency to pop up in
> production and not in test I think it is important to start
> thinking about them as soon as possible.
Oh, I've been thinking about it a great deal for quite a while.  The
problem is exactly as you state -- it is very hard to construct
tests which give a good idea of what the impact will be in
production loads.  I'm sure I could construct a test which would
make the patch look glorious.  I'm sure I could construct a test
which would make the patch look horrible.  Neither would really mean
much, other than to illustrate loads with which you might want to
avoid SSI.  The fairest tests I've done have indicated that it
isn't anywhere near either extreme for most workloads.

Based on benchmarks from the papers, some of which were
independently confirmed by ACM testers, and tests by Dan Ports and
myself, I suspect that most common workloads will pay a 2% to 20%
cost for SSI over straight snapshot isolation.  (The high end of
that is generally with more active connections than you should be
using anyway.)  Further, I have reason to believe that whether the
techniques which Florian's patch allows or the SSI techniques give
better performance will depend on the workload.  I have seen some
tests which suggest in some workloads SSI beats snapshot isolation
with SELECT FOR SHARE / UPDATE, although they weren't done
rigorously or repeated enough times to really trust them just yet.

By the way, thanks for your interest in this patch!  :-)
-Kevin

