Thread: Serializable snapshot isolation patch

Serializable snapshot isolation patch

From
Jeff Davis
Date:
This is based on Kevin's git repo at:
 git://git.postgresql.org/git/users/kgrittn/postgres.git SHA1: 729541fa5ea94d66e6f4b22fb65bfef92214cd6b

* Trivial stuff:

I get a compiler warning:
   indexfsm.c: In function ‘RecordFreeIndexPage’:
   indexfsm.c:55: warning: implicit declaration of function ‘PageIsPredicateLocked’

* Open issues, as I see it:

1. 2PC and SSI don't mix (this may be a known issue, because there's not
really any code in the current implementation to deal with 2PC):
  Session1:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    select count(*) from a;
    insert into a values(1);
    PREPARE TRANSACTION 't1';

  Session2:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    select count(*) from a;
    insert into a values(1);
    COMMIT;

  Session1:
    COMMIT PREPARED 't1';

Looks like we need to track information about prepared transactions in
shared memory. I think you'll need to keep the information in the 2PC
state file as well, so that it can be rebuilt after a crash or restart.
It all looks solvable at first glance, but it looks like it might be
some work.

2. I think there's a GiST bug (illustrating with PERIOD type):
  create table foo(p period);
  create index foo_idx on foo using gist (p);
  insert into foo select period(
      '2009-01-01'::timestamptz + g * '1 microsecond'::interval,
      '2009-01-01'::timestamptz + (g+1) * '1 microsecond'::interval)
    from generate_series(1,2000000) g;

Session1:
  begin isolation level serializable;
  select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
  insert into foo values('[2009-01-01, 2009-01-01]'::period);

Session2:
  begin isolation level serializable;
  select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
  insert into foo values('[2009-01-01, 2009-01-01]'::period);
  commit;

Session1:
  commit;

In pg_locks (didn't paste here due to formatting), it looks like the
SIRead locks are holding locks on different pages. Can you clarify your
design for GiST and the interaction with page-level locks? It looks like
you're making some assumption about which pages will be visited when
searching for conflicting values which doesn't hold true. However, that
seems odd, because even if the value is actually inserted in one
transaction, the other doesn't seem to find the conflict. Perhaps the
bug is simpler than that? Or perhaps I have some kind of odd bug in
PERIOD's gist implementation?

Also, it appears to be non-deterministic, to a degree at least, so you
may not observe the problem in the exact way that I do.

3. Limited shared memory space to hold information about committed
transactions that are still "interesting". Relevant thread:
   http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php

It's a challenging problem, however, and the current solution is less
than ideal. Idle transactions can mean that all new serializable
transactions fail until the idle transactions start to terminate. I
don't like that very much, because people expect to have to retry
serializable transactions, but retrying here has no real hope (except
that some time has elapsed, and maybe the other transaction decided to
commit).
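The retry expectation mentioned above is the standard client-side pattern for SQLSTATE 40001: rerun the whole transaction from scratch. A minimal sketch of that pattern, using a stand-in exception class rather than a real driver's error type (with psycopg2, for example, you would catch its serialization-failure error instead) — the point being that when failures stem from shared-memory exhaustion rather than a genuine conflict, this loop just spins:

```python
# Generic retry wrapper for serializable transactions. The exception
# class is a stand-in for a driver's serialization-failure error.
class SerializationFailure(Exception):
    pass

def run_serializable(txn_fn, max_retries=5):
    for _attempt in range(max_retries):
        try:
            return txn_fn()  # run the whole transaction body
        except SerializationFailure:
            continue         # retry from scratch with a fresh snapshot
    raise SerializationFailure("gave up after %d attempts" % max_retries)

# A transaction that fails twice before succeeding:
attempts = []
def txn():
    attempts.append(1)
    if len(attempts) < 3:
        raise SerializationFailure
    return "committed"

print(run_serializable(txn))  # prints "committed" (on the third attempt)
```

If the "interesting transaction" list is full, every retry hits the same wall until some idle transaction finally ends, which is exactly why the retry convention offers no real hope here.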

A comparison is made (in the aforementioned thread) to the existing
limitation on the number of locks. However, it's much easier to
understand normal locks, and for a given workload usually you can put an
upper bound on the number of locks required (right?).

Does it make sense to kill the existing transactions that are holding
everything up, rather than the new transaction? Or would that just
confuse matters more? This does not necessarily guarantee that progress
can be made, either, but intuitively it seems more likely.

4. A few final details:
 a. We should probably have a compatibility GUC that makes SERIALIZABLE
equal to REPEATABLE READ. My opinion is that this should be only for
compatibility, and should default to "off" (i.e. SSI code enabled)
either in 9.1 or soon after.
 b. Docs.

* Questions:

1. For TargetTagIsCoveredBy(), why is it impossible for the covering tag
to have an offset?

2. The explanation for SerializablePredicateLockListLock is a little
confusing. It makes it sound like other processes can't walk the list,
but they can modify it?

* Summary

Great patch! I didn't make it through the patch in as much detail as I
would have liked, because the theory behind it is quite complex and it
will take longer for me to absorb. But the implementation looks good and
the use case is very important.

Regards,
    Jeff Davis




Re: Serializable snapshot isolation patch

From
"Kevin Grittner"
Date:
First off, thanks for the review!  I know that it's a lot of work,
and I do appreciate it.
Jeff Davis <pgsql@j-davis.com> wrote:
> * Trivial stuff:
> 
> I get a compiler warning:
> 
>     indexfsm.c: In function *RecordFreeIndexPage*:
>     indexfsm.c:55: warning: implicit declaration of function
> *PageIsPredicateLocked*
Oops.  Fixed on my git repo now.
> * Open issues, as I see it:
> 
> 1. 2PC and SSI don't mix (this may be a known issue, because
> there's not really any code in the current implementation to deal
> with 2PC):
> Looks like we need to track information about prepared
> transactions in shared memory. I think you'll need to keep the
> information in the 2PC state file as well, so that it can be
> rebuilt after a crash or restart.  It all looks solvable at first
> glance, but it looks like it might be some work.
Argh!  I didn't anticipate an interaction with prepared
transactions, probably because I've never used them or taken a close
look at them.  I'll take a look.
> 2. I think there's a GiST bug (illustrating with PERIOD type):
> Can you clarify your design for GiST and the interaction with
> page-level locks? It looks like you're making some assumption
> about which pages will be visited when searching for conflicting
> values which doesn't hold true. However, that seems odd, because
> even if the value is actually inserted in one transaction, the
> other doesn't seem to find the conflict. Perhaps the bug is
> simpler than that? Or perhaps I have some kind of odd bug in
> PERIOD's gist implementation?
My assumptions for GiST were that:
(1) A search for a matching value could bail out at any level in the
tree; there is no requirement for the search to proceed to the leaf
level to get a negative index probe.
(2) An index insertion which would cause a row to be found if an
earlier search was repeated must modify some page which was read by
the earlier search.
(3) Because of the above, it is necessary and sufficient to acquire
an SIRead lock on all pages visited in a search of a GiST index, and
flag a conflict on insertion into a locked page at any level of the
index.
That logic still seems sound to me, so if someone else sees a flaw
in it, please point it out.  Assuming that logic is sound, I'll poke
around to see where the flaw in implementation may be.  If you have
a full self-contained test case to demonstrate the failure here,
could you send it to me?
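For concreteness, the three assumptions can be modeled as a toy predicate-lock table: a reader locks every index page it visits (at any level), and a writer that later modifies one of those pages flags a rw-conflict. This is purely an illustrative sketch of the rule as stated, with invented names, not the patch's actual data structures:

```python
# Toy model of the GiST SIRead-lock rule: lock all pages a search
# visits; flag a conflict when an insert touches a locked page.
class PredicateLockTable:
    def __init__(self):
        self.locks = {}  # page number -> set of reader transaction ids

    def lock_pages_for_search(self, xid, pages_visited):
        # Assumption (3): lock *all* pages visited, at every level,
        # because per (1) a search may bail out above the leaf.
        for page in pages_visited:
            self.locks.setdefault(page, set()).add(xid)

    def check_insert(self, xid, pages_modified):
        # Assumption (2): an insert that would change an earlier
        # search's result must modify some page that search read.
        conflicts = set()
        for page in pages_modified:
            conflicts |= self.locks.get(page, set()) - {xid}
        return conflicts

table = PredicateLockTable()
table.lock_pages_for_search("T1", [1, 4, 9])  # root -> internal -> leaf
table.lock_pages_for_search("T2", [1, 4])     # bailed out above the leaf
print(table.check_insert("T1", [4]))          # prints {'T2'}
```

If the observed behavior diverges from this model — two searches for overlapping values locking disjoint page sets — that would point at either the lock-acquisition coverage or the index's consistency function.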
> Also, it appears to be non-deterministic, to a degree at least, so
> you may not observe the problem in the exact way that I do.
Yeah, I have tested this area without seeing the failure, so that
self-contained example would be a big help.
> 3. Limited shared memory space to hold information about committed
> transactions that are still "interesting". Relevant thread:
> 
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php
> 
> It's a challenging problem, however, and the current solution is
> less than ideal.
I'd go further than that and say that it clearly needs to be fixed. 
The scale of the issue was somewhat disguised in my testing when I
was using the hash table for managing acquisition of shared memory
for what was essentially an unordered list.  Now that I've made it a
proper list with a hard limit on entries, the problem is pretty easy
to provoke, and just reserving space for A Very Large Number Of
Entries is clearly not an adequate solution.  :-(
> Idle transactions can mean that all new serializable transactions
> fail until the idle transactions start to terminate. I don't like
> that very much, because people expect to have to retry
> serializable transactions, but retrying here has no real hope
> (except that some time has elapsed, and maybe the other
> transaction decided to commit).
Agreed.
> Does it make sense to kill the existing transactions that are
> holding everything up, rather than the new transaction? Or would
> that just confuse matters more? This does not necessarily
> guarantee that progress can be made, either, but intuitively it
> seems more likely.
Canceling an old transaction to allow new transactions to begin
*might* be better than the current situation, but I think we can and
should do better.  The best I have been able to come up with is in
this post:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg01724.php
There's a fair amount of work there, so I was hoping for some
confirmation that this combination of steps was both sane and had
some chance of being considered sufficient and acceptable before
diving in to code it.  I also *really* hope to add the "SERIALIZABLE
READ ONLY DEFERRABLE" mode so that pg_dump and other read-only
transactions don't push things into a state where the rollback rate
spikes:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php
> 4. A few final details:
> 
>   a. We should probably have a compatibility GUC that makes
> SERIALIZABLE equal to REPEATABLE READ. My opinion is that this
> should be only for compatibility, and should default to "off"
> (i.e. SSI code enabled) either in 9.1 or soon after.
I'm inclined to agree with you, with the strong preference for a 9.1
setting of off.  This was previously discussed, and there were
people who felt that we should avoid a "behavior-changing" GUC like
that, so I didn't add it.  It wouldn't be hard to put it in, if
that's the community consensus.
>   b. Docs.
> 
> * Questions:
> 
> 1. For TargetTagIsCoveredBy(), why is it impossible for the
> covering tag to have an offset?
Because a "covering" lock is a lock at a coarser granularity than
the "covered" lock, which includes what the finer-grained lock
covers.  Since the tuple level is as fine as we take it, and we only
use the offset for tuple locks, a covering lock would not have that
set.  (We also check for duplicate locks.)  I'll expand that comment
a bit, since it obviously wasn't clear enough.
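The granularity argument can be illustrated as a toy coverage check: a tag covers another only if it is strictly coarser on the same object, and since tuple is the finest level (the only one carrying an offset), a covering tag never has one. Field and level names here are invented for illustration, not the patch's actual tag layout:

```python
# Toy lock-tag coverage check: coarser levels cover finer ones.
LEVELS = ["relation", "page", "tuple"]  # coarse -> fine

def covers(covering, covered):
    # The covering tag must be strictly coarser and agree on the
    # fields that exist at its own (coarser) level.
    ci = LEVELS.index(covering["level"])
    di = LEVELS.index(covered["level"])
    if ci >= di:
        return False  # same level or finer: not a covering tag
    keys = {"relation": ["rel"], "page": ["rel", "page"]}[covering["level"]]
    return all(covering[k] == covered[k] for k in keys)

page_lock  = {"level": "page",  "rel": "foo_idx", "page": 4}
tuple_lock = {"level": "tuple", "rel": "foo_idx", "page": 4, "offset": 2}
print(covers(page_lock, tuple_lock))   # prints True: covering tag has no offset
print(covers(tuple_lock, tuple_lock))  # prints False: a lock never covers itself
```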
> 2. The explanation for SerializablePredicateLockListLock is a
> little confusing. It makes it sound like other processes can't
> walk the list, but they can modify it?
That's a bit unconventional, I admit, but it was a big part of
bringing the benchmarks for a fully cached, read-only transaction
load down to running just 1.8% longer than REPEATABLE READ (which is
the same as current SERIALIZABLE).  I'm open to other locking
strategies, or an explanation of where this one doesn't work, but
I'd want to benchmark carefully.  Since you took it to mean the
right thing, but found that surprising enough to doubt that it was
accurate, do you have any particular suggestions about how to
clarify it?  (Maybe just, "Yeah, that's unconventional, but it works
and is faster than the alternatives we've tried."?)
> * Summary
> 
> Great patch!
Thanks!
> I didn't make it through the patch in as much detail as I would
> have liked, because the theory behind it is quite complex and it
> will take longer for me to absorb.
Yeah, there are some mind-bending ideas in there.  If I recall
correctly, Dan said he spent one week reading the academic papers on
which it was based and another week reading the patch before he felt
comfortable starting to code the parts he wrote.  I spent a lot more
time than that reading papers, and I still didn't quite have my
head around some of the concepts until I went back to some of
Fekete's work which led up to this innovation.  While many of the
concepts go back to the '80s and '90s, I found Fekete's work
"clicked" with me better than some of the earlier, more theoretical
work because he was looking at real-world examples of how these
manifest in actual, working production software and what techniques
could be used to identify and mitigate the problems.  Sometimes the
theory is hard for me to properly digest without such concrete
details.
> But the implementation looks good and the use case is very
> important.
It certainly is for our shop, and I've heard from others who are
very interested in it.  I think that it's generally of more
benefit to "big shops" than situations where you have just a few
programmers coding against a schema with just a few dozen tables;
although it's also somewhat a matter of style: I would rather deal
with *one* issue (serialization failures) in one place than scatter
defensive code all over my application codebase, even on a moderate
scale.
Anyway, given the issues raised and the date, it seems reasonable to
mark this as "Returned with Feedback" for CF management purposes. 
I'll get fixes out as soon as I can.
-Kevin


Re: Serializable snapshot isolation patch

From
Jeff Davis
Date:
On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote: 
> > 2. I think there's a GiST bug (illustrating with PERIOD type):
>  
> My assumptions for GiST were that:
>  
> (1) A search for a matching value could bail out at any level in the
> tree; there is no requirement for the search to proceed to the leaf
> level to get a negative index probe.
>  
> (2) An index insertion which would cause a row to be found if an
> earlier search was repeated must modify some page which was read by
> the earlier search.
>  
> (3) Because of the above, it is necessary and sufficient to acquire
> an SIRead lock on all pages visited in a search of a GiST index, and
> flag a conflict on insertion into a locked page at any level of the
> index.
>  
> That logic still seems sound to me, so if someone else sees a flaw
> in it, please point it out.

Looks sound to me, as well.

> > Also, it appears to be non-deterministic, to a degree at least, so
> > you may not observe the problem in the exact way that I do.
>  
> Yeah, I have tested this area without seeing the failure, so that
> self-contained example would be a big help.

I assume here that you mean that you _did_ see the failure
(serialization error) and therefore did not see the problem? Also, are
you sure it was using the GiST index for the searches and didn't just
get a full table lock or full index lock?

I'll try to narrow it down. It could be a problem with PERIOD, or maybe
it wasn't using the GiST index for a search when I thought it was (I
didn't run EXPLAIN at every point, so I'll double-check). Clearly, a
problem exists somewhere though. 

> > 3. Limited shared memory space to hold information about committed
> > transactions that are still "interesting". Relevant thread:
> > 
> > http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php
> > 
> > It's a challenging problem, however, and the current solution is
> > less than ideal.
>  
> I'd go further than that and say that it clearly needs to be fixed. 

OK, this will remain an open issue then.

> Canceling an old transaction to allow new transactions to begin
> *might* be better than the current situation, but I think we can and
> should do better.  The best I have been able to come up with is in
> this post:
>  
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01724.php

At first I didn't like that approach much, but after re-reading it, I am
more optimistic. What I like most about it is that a transaction won't
be canceled if it doesn't interfere at all (e.g. operating on different
tables). There are still edge cases where it can be canceled, like when
the memory is so exhausted that it can't even track a couple table-level
locks, but that doesn't sound as bad (closer to the current restriction
on the total number of locks).

Ultimately I think this will be one of those problems with no 100% clean
solution. However, I think it's important that we try to minimize these
degenerate cases. Eventually, we may need to keep statistics about the
number of conflicts happening, and start to rate-limit new serializable
transactions (effectively reducing parallelism) to ensure that
reasonable progress can be made (hopefully faster than serial
execution).
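One way to picture that rate-limiting idea: track recent serialization-failure outcomes and stop admitting new serializable transactions while the conflict rate is high, trading parallelism for progress. This is purely a sketch of the policy with made-up thresholds; nothing like it exists in the patch:

```python
# Toy admission-control policy: throttle new serializable
# transactions when the recent conflict rate is too high.
class AdmissionController:
    def __init__(self, window=100, max_conflict_rate=0.5):
        self.outcomes = []   # recent outcomes: True = conflicted
        self.window = window
        self.max_conflict_rate = max_conflict_rate

    def record(self, conflicted):
        self.outcomes.append(conflicted)
        self.outcomes = self.outcomes[-self.window:]  # sliding window

    def admit(self):
        if not self.outcomes:
            return True
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate <= self.max_conflict_rate

ac = AdmissionController()
for _ in range(10):
    ac.record(conflicted=True)  # a burst of serialization failures
print(ac.admit())               # prints False: throttle new transactions
```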

> There's a fair amount of work there, so I was hoping for some
> confirmation that this combination of steps was both sane and had
> some chance of being considered sufficient and acceptable before
> diving in to code it.  I also *really* hope to add the "SERIALIZABLE
> READ ONLY DEFERRABLE" mode so that pg_dump and other read-only
> transactions don't push things into a state where the rollback rate
> spikes:
>  
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php

One potential issue there is starvation. I don't see how you would
guarantee that there will ever be a point to grab a safe snapshot, even
if all of the other transactions are completing.

> > 4. A few final details:
> > 
> >   a. We should probably have a compatibility GUC that makes
> > SERIALIZABLE equal to REPEATABLE READ. My opinion is that this
> > should be only for compatibility, and should default to "off"
> > (i.e. SSI code enabled) either in 9.1 or soon after.
>  
> I'm inclined to agree with you, with the strong preference for a 9.1
> setting of off.  This was previously discussed, and there were
> people who felt that we should avoid a "behavior-changing" GUC like
> that, so I didn't add it.  It wouldn't be hard to put it in, if
> that's the community consensus.

I think that was me, but I'm OK with behavior-changing GUCs as long as
they are there for compatibility and we intend to push people toward the
new behavior in the long run (like standard_conforming_strings, but
hopefully not as long-lived).

Maybe it's not necessary at all, because your patch offers strictly
better guarantees (at the cost of more serialization failures), and if
they want the old behavior then REPEATABLE READ is already there.

> > 1. For TargetTagIsCoveredBy(), why is it impossible for the
> > covering tag to have an offset?
>  
> Because a "covering" lock is a lock at a coarser granularity than
> the "covered" lock, which includes what the finer-grained lock
> covers.  Since the tuple level is as fine as we take it, and we only
> use the offset for tuple locks, a covering lock would not have that
> set.  (We also check for duplicate locks.)  I'll expand that comment
> a bit, since it obviously wasn't clear enough.

I see. So a lock can't ever cover itself?

> > 2. The explanation for SerializablePredicateLockListLock is a
> > little confusing. It makes it sound like other processes can't
> > walk the list, but they can modify it?
>  
> That's a bit unconventional, I admit, but it was a big part of
> bringing the benchmarks for a fully cached, read-only transaction
> load down to running just 1.8% longer than REPEATABLE READ (which is
> the same as current SERIALIZABLE).  I'm open to other locking
> strategies, or an explanation of where this one doesn't work, but
> I'd want to benchmark carefully.  Since you took it to mean the
> right thing, but found that surprising enough to doubt that it was
> accurate, do you have any particular suggestions about how to
> clarify it?  (Maybe just, "Yeah, that's unconventional, but it works
> and is faster than the alternatives we've tried."?)

When using locks in an unconventional way, it would be helpful to
describe the invalid schedules that you're preventing. Perhaps an
example if you think it would be reasonably simple? Also some indication
of how another process is intended to modify the list without walking
it.
> It certainly is for our shop, and I've heard from others who are
> very interested in it.  I think that it's generally of more
> benefit to "big shops" than situations where you have just a few
> programmers coding against a schema with just a few dozen tables;
> although it's also somewhat a matter of style: I would rather deal
> with *one* issue (serialization failures) in one place than scatter
> defensive code all over my application codebase, even on a moderate
> scale.

It's a complexity-reducing and correctness-increasing feature, which is
generally good.

Regards,
    Jeff Davis



Re: Serializable snapshot isolation patch

From
"Kevin Grittner"
Date:
> Jeff Davis  wrote:
> On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote:
> I assume here that you mean that you _did_ see the failure
> (serialization error) and therefore did not see the problem?
Yeah.
> Also, are you sure it was using the GiST index for the searches and
> didn't just get a full table lock or full index lock?
I think so, but I'm at home and don't have details here.  Will check
tomorrow.
> I'll try to narrow it down. It could be a problem with PERIOD, or
> maybe it wasn't using the GiST index for a search when I thought it
> was (I didn't run EXPLAIN at every point, so I'll double-check).
> Clearly, a problem exists somewhere though.
Hmmm...  When Joe was looking at the patch he exposed an intermittent
problem with btree indexes which turned out to be related to improper
handling of the predicate locks during index page clean-up caused by a
vacuum.  Easy to fix once found, but it didn't always happen, even
with identical runs.  (I'm guessing that was due to the randomness in
picking a page to split when inserting into a group of identical
keys.)  Perhaps a similar bug lurks in the GiST predicate locking.
> Eventually, we may need to keep statistics about the number of
> conflicts happening, and start to rate-limit new serializable
> transactions (effectively reducing parallelism) to ensure that
> reasonable progress can be made (hopefully faster than serial
> execution).
Ah, you've exposed just how self-serving my interest in an admission
control policy mechanism is!  ;-)
http://archives.postgresql.org/pgsql-hackers/2009-12/msg02189.php
>> I also *really* hope to add the "SERIALIZABLE READ ONLY
>> DEFERRABLE" mode so that pg_dump and other read-only transactions
>> don't push things into a state where the rollback rate spikes:
>>
>> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php
>
> One potential issue there is starvation. I don't see how you would
> guarantee that there will ever be a point to grab a safe snapshot,
> even if all of the other transactions are completing.
After mulling it over in greater detail than previously, I see your
point.  I'll think about it some more, but this particular idea might
be a dead end.
>>> a. We should probably have a compatibility GUC that makes
>>> SERIALIZABLE equal to REPEATABLE READ. My opinion is that this
>>> should be only for compatibility, and should default to "off"
>>> (i.e. SSI code enabled) either in 9.1 or soon after.
>>
>> I'm inclined to agree with you, with the strong preference for a
>> 9.1 setting of off. This was previously discussed, and there were
>> people who felt that we should avoid a "behavior-changing" GUC
>> like that, so I didn't add it. It wouldn't be hard to put it in,
>> if that's the community consensus.
>
> I think that was me, but I'm OK with behavior-changing GUCs as long
> as they are there for compatibility and we intend to push people
> toward the new behavior in the long run (like
> standard_conforming_strings, but hopefully not as long-lived).
>
> Maybe it's not necessary at all, because your patch offers strictly
> better guarantees (at the cost of more serialization failures), and
> if they want the old behavior then REPEATABLE READ is already
> there.
Anyone else with an opinion on this?
> So a lock can't ever cover itself?
No.  If a transaction requests an SIRead lock identical to one it
already holds, we ignore the request.
> When using locks in an unconventional way, it would be helpful to
> describe the invalid schedules that you're preventing. Perhaps an
> example if you think it would be reasonably simple? Also some
> indication of how another process is intended to modify the list
> without walking it.
I will review that comment and add something along those lines.
-Kevin


Re: Serializable snapshot isolation patch

From
Jeff Davis
Date:
On Mon, 2010-10-18 at 22:12 -0500, Kevin Grittner wrote:
> Hmmm...  When Joe was looking at the patch he exposed an intermittent
> problem with btree indexes which turned out to be related to improper
> handling of the predicate locks during index page clean-up caused by a
> vacuum.  Easy to fix once found, but it didn't always happen, even
> with identical runs.  (I'm guessing that was due to the randomness in
> picking a page to split when inserting into a group of identical
> keys.)  Perhaps a similar bug lurks in the GiST predicate locking.

I briefly looked into this when I woke up this morning, and I think I'm
close. I can reproduce it every time, so I should be able to fix this as
soon as I can find some free time (tomorrow night, probably).

I might also be able to help with the 2PC issue, but it will be at least
a week or two before I have the free time to dig into that one.
> > Eventually, we may need to keep statistics about the number of
> > conflicts happening, and start to rate-limit new serializable
> > transactions (effectively reducing parallelism) to ensure that
> > reasonable progress can be made (hopefully faster than serial
> > execution).
>  
> Ah, you've exposed just how self-serving my interest in an admission
> control policy mechanism is!  ;-)
>  
> http://archives.postgresql.org/pgsql-hackers/2009-12/msg02189.php

Cool!
> >> I also *really* hope to add the "SERIALIZABLE READ ONLY
> >> DEFERRABLE" mode so that pg_dump and other read-only transactions
> >> don't push things into a state where the rollback rate spikes:
> >>
> >> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php
> >
> > One potential issue there is starvation. I don't see how you would
> > guarantee that there will ever be a point to grab a safe snapshot,
> > even if all of the other transactions are completing.
>  
> After mulling it over in greater detail than previously, I see your
> point.  I'll think about it some more, but this particular idea might
> be a dead end.

I didn't quite mean that it's a dead-end. It still seems tempting to at
least try to get a safe snapshot, especially for a transaction that you
know will be long-running. Maybe a timeout or something?

Regards,
    Jeff Davis



Re: Serializable snapshot isolation patch

From
"Kevin Grittner"
Date:
Jeff Davis <pgsql@j-davis.com> wrote:
> I briefly looked into this when I woke up this morning, and I
> think I'm close. I can reproduce it every time, so I should be
> able to fix this as soon as I can find some free time (tomorrow
> night, probably).
OK, I'll focus on other areas.
> I might also be able to help with the 2PC issue, but it will be at
> least a week or two before I have the free time to dig into that
> one.
If I get through the other points you raised, I'll start reading up
on 2PC.
> I didn't quite mean that it's a dead-end. It still seems tempting
> to at least try to get a safe snapshot, especially for a
> transaction that you know will be long-running. Maybe a timeout or
> something?
But what would you do when you hit the timeout?
One thing that would work, but I really don't think I like it, is
that a request for a snapshot for such a transaction would not only
block until it could get a "clean" snapshot (no overlapping
serializable non-read-only transactions which overlap serializable
transactions which wrote data and then committed in time to be
visible to the snapshot being acquired), but it would *also* block
*other* serializable transactions, if they were non-read-only, on an
attempt to acquire a snapshot.  That would at least guarantee that
the serializable read only deferrable transaction could get its
snapshot as soon as the initial set of problem overlapping
transactions completed, but it would be an exception to the "SSI
introduces no new blocking" guarantee.  :-(  I was OK with that for
the particular transaction where DEFERRABLE was requested, but to
have that block other serializable transactions seems pretty iffy to
me.
Short of that, I think you would just have to wait for completion of
all known problem transactions and then try again.  On a server with
a heavy write load, particularly if the length of the writing
transactions was highly variable, you could go through a lot of
iterations before getting that clean snapshot.
-Kevin


Re: Serializable snapshot isolation patch

From
Robert Haas
Date:
On Tue, Oct 19, 2010 at 6:28 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> One thing that would work, but I really don't think I like it, is
> that a request for a snapshot for such a transaction would not only
> block until it could get a "clean" snapshot (no overlapping
> serializable non-read-only transactions which overlap serializable
> transactions which wrote data and then committed in time to be
> visible to the snapshot being acquired), but it would *also* block
> *other* serializable transactions, if they were non-read-only, on an
> attempt to acquire a snapshot.

This seems pretty close to guaranteeing serializability by running
transactions one at a time (i.e. I don't think it's likely to be
acceptable from a performance standpoint).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Serializable snapshot isolation patch

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 19, 2010 at 6:28 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>> One thing that would work, but I really don't think I like it, is
>> that a request for a snapshot for such a transaction would not
>> only block until it could get a "clean" snapshot (no overlapping
>> serializable non-read-only transactions which overlap
>> serializable transactions which wrote data and then committed in
>> time to be visible to the snapshot being acquired), but it would
>> *also* block *other* serializable transactions, if they were
>> non-read-only, on an attempt to acquire a snapshot.
> 
> This seems pretty close to guaranteeing serializability by running
> transactions one at a time (i.e. I don't think it's likely to be
> acceptable from a performance standpoint).
It absolutely makes no sense except for long-running read-only
transactions, and would only be used when explicitly requested; and
like I said, I really don't think I like it even on that basis --
just putting it out there as the only alternative I've found so far
to either tolerating possible serialization anomalies in pg_dump
output (albeit only when compared to the state the database reached
after the dump's snapshot) or waiting indefinitely for a clean
snapshot to become available.
FWIW from a brainstorming perspective, while waiting for problem
transactions to clear so we could get a clean snapshot for the dump
I think it would work even better to block the *commit* of
serializable transactions which *had done* writes than to block
snapshot acquisition for serializable transactions which were not
read-only.  Still pretty icky, though.  I am loathe to compromise
the "no new blocking" promise of SSI.
[thinks]
Actually, maybe we can reduce the probability of needing to retry
at each iteration of the non-blocking alternative by checking the
conflict information for the problem transactions after they commit.
Any transaction which didn't *actually* generate a read-write
conflict out to a transaction visible to the dump's candidate
snapshot could not cause an anomaly.  If none of the problem
transactions actually generates a rw-conflict we can use the
candidate snapshot.  Adding that logic to the non-blocking
alternative might actually work pretty well.
There might be some workloads where conflicts would be repeatedly
generated, but there would be a lot where they wouldn't.  If we add
a switch to pg_dump to allow users to choose, I think this algorithm
works.  It never affects a transaction unless it has explicitly
requested SERIALIZABLE READ ONLY DEFERRABLE, and the only impact is
that startup may be deferred until a snapshot can be acquired which
ensures serializable behavior without worrying about SIRead locks.
-Kevin


Re: Serializable snapshot isolation patch

From
Jeff Davis
Date:
On Sun, 2010-10-17 at 22:53 -0700, Jeff Davis wrote:
> 2. I think there's a GiST bug (illustrating with PERIOD type):
> 
>   create table foo(p period);
>   create index foo_idx on foo using gist (p);
>   insert into foo select period(
>       '2009-01-01'::timestamptz + g * '1 microsecond'::interval,
>       '2009-01-01'::timestamptz + (g+1) * '1 microsecond'::interval)
>     from generate_series(1,2000000) g;
> 
> Session1:
>   begin isolation level serializable;
>   select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
>   insert into foo values('[2009-01-01, 2009-01-01]'::period);
> 
> Session2:
>   begin isolation level serializable;
>   select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
>   insert into foo values('[2009-01-01, 2009-01-01]'::period);
>   commit;
> 
> Session1:
>   commit;
> 
> In pg_locks (didn't paste here due to formatting), it looks like the
> SIRead locks are holding locks on different pages. Can you clarify your
> design for GiST and the interaction with page-level locks? It looks like
> you're making some assumption about which pages will be visited when
> searching for conflicting values which doesn't hold true. However, that
> seems odd, because even if the value is actually inserted in one
> transaction, the other doesn't seem to find the conflict. Perhaps the
> bug is simpler than that? Or perhaps I have some kind of odd bug in
> PERIOD's gist implementation?
> 
> Also, it appears to be non-deterministic, to a degree at least, so you
> may not observe the problem in the exact way that I do.
> 

I have more information on this failure. Everything in GiST actually
looks fine. I modified the example slightly:
T1: begin isolation level serializable;
T2: begin isolation level serializable;
T1: select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
T2: select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
T2: commit;
T1: commit;

The SELECTs only look at the root and the predicate doesn't match. So
each SELECT sets an SIReadLock on block 0 and exits the search. Looks
good so far.

T1 then inserts, and it has to modify page 0, so it does
FlagRWConflict(). That sets writer->inConflict = reader and
reader->outConflict = writer (where writer is T1 and reader is T2); and
T1->outConflict and T2->inConflict remain NULL.

Then T2 inserts, and I didn't catch that part in as much detail in gdb,
but it apparently has no effect on that state, so we still have
T1->inConflict = T2, T1->outConflict = NULL, T2->inConflict = NULL, and
T2->outConflict = T1.

That looks like a reasonable state to me, but I'm not sure exactly what
the design calls for. I am guessing that the real problem is in
PreCommit_CheckForSerializationFailure(), where there are 6 conditions
that must be met for an error to be thrown. T2 falls out right away at
condition 1. T1 falls out on condition 4. I don't really understand
condition 4 at all -- can you explain it? And can you explain conditions
5 and 6 too?

Regards,
    Jeff Davis


Re: Serializable snapshot isolation patch

From
"Kevin Grittner"
Date:
Jeff Davis <pgsql@j-davis.com> wrote:
>> Also, it appears to be non-deterministic, to a degree at least,
>> so you may not observe the problem in the exact way that I do.
> The SELECTs only look at the root and the predicate doesn't match.
> So each SELECT sets an SIReadLock on block 0 and exits the search.
> Looks good so far.
> 
> T1 then inserts, and it has to modify page 0, so it does
> FlagRWConflict(). That sets writer->inConflict = reader and
> reader->outConflict = writer (where writer is T1 and reader is
> T2); and T1->outConflict and T2->inConflict remain NULL.
> 
> Then T2 inserts, and I didn't catch that part in as much detail in
> gdb, but it apparently has no effect on that state, so we still
> have T1->inConflict = T2, T1->outConflict = NULL, T2->inConflict =
> NULL, and T2->outConflict = T1.
I now see where the wheels fall off.  The GiST query initially stops
at a high level, so predicate locks only go that deep, and the
*first* insert of a conflicting row must ripple up and modify a
locked page; but *subsequent* inserts may only need to modify the
leaf level.  Even though your particular example doesn't involve a
cycle and therefore doesn't require a rollback for correctness
(although it might tend to generate a false positive if index page
locks were working correctly), you've exposed a flaw in the
GiST AM implementation of predicate locks.
On a first glance, it appears that we would need to check for
conflicts as we move down through the index to find the right spot
for an insert, not as we modify pages for the insert.  I hope
there's some more subtle technique or some way to qualify it;
otherwise a search which stops at the root page would generate a
conflict out to just about any index insertion from a concurrent
transaction.
I will add this to my list of issues to fix based on your review,
unless it's something you would like to tackle -- I'm not going to
chase away anyone who wants to help with this.  :-)
-Kevin


Re: Serializable snapshot isolation patch

From
"Kevin Grittner"
Date:
Jeff Davis <pgsql@j-davis.com> wrote:
> That looks like a reasonable state to me, but I'm not sure exactly
> what the design calls for. I am guessing that the real problem is
> in PreCommit_CheckForSerializationFailure(), where there are 6
> conditions that must be met for an error to be thrown. T2 falls
> out right away at condition 1. T1 falls out on condition 4. I
> don't really understand condition 4 at all -- can you explain it?
> And can you explain conditions 5 and 6 too?
Since most transactions are rolled back on a conflict detection
during a read or write attempt, there are only a few very specific
conditions which can "slip through" to where they need to be
detected on commit.  Here's the code with the six conditions:
if (MySerializableXact->inConflict != InvalidSerializableXact
    && MySerializableXact->inConflict != MySerializableXact
    && !(MySerializableXact->inConflict->rolledBack)
    && MySerializableXact->inConflict->inConflict != InvalidSerializableXact
    && !SxactIsCommitted(MySerializableXact->inConflict)
    && !SxactIsCommitted(MySerializableXact->inConflict->inConflict))
Condition 4 is testing whether MySerializableXact is on the "out"
side of a pivot -- in the parlance of most examples, is
MySerializableXact TN?
Condition 5 and 6 confirm that neither T0 nor T1 have committed
first; we can only have a problem if TN commits first.
Basically, when we already have a pivot, but no transaction has yet
committed, we wait to see if TN commits first.  If so, we have a
problem; if not, we don't.  There's probably some room for improving
performance by cancelling T0 or T1 instead of TN, at least some of
the time; but in this pass we are always cancelling the transaction
in whose process we detect the need to cancel something.
-Kevin


Re: Serializable snapshot isolation patch

From
Jeff Davis
Date:
On Thu, 2010-10-21 at 10:29 -0500, Kevin Grittner wrote:
> Basically, when we already have a pivot, but no transaction has yet
> committed, we wait to see if TN commits first.  If so, we have a
> problem; if not, we don't.  There's probably some room for improving
> performance by cancelling T0 or T1 instead of TN, at least some of
> the time; but in this pass we are always cancelling the transaction
> in whose process we detect the need to cancel something.

Well, in this case we do clearly have a problem, because the result is
not equal to the serial execution of the transactions in either order.

So the question is: at what point is the logic wrong? It's either:
  1. PreCommit_CheckForSerializationFailure() is missing a failure
     case.
  2. The state prior to entering that function (which I believe I
     sufficiently described) is wrong.

If it's (2), then what should the state look like, and how is the GiST
code supposed to result in that state?

I know some of these questions are answered in the relevant research,
but I'd like to discuss this concrete example specifically.

Regards,
    Jeff Davis



Re: Serializable snapshot isolation patch

From
"Kevin Grittner"
Date:
Jeff Davis <pgsql@j-davis.com> wrote:
> in this case we do clearly have a problem, because the result is
> not equal to the serial execution of the transactions in either
> order.
Yeah, you're right.  I misread that example -- newbie with the
PERIOD type.
> So the question is: at what point is the logic wrong? It's either:
>   1. PreCommit_CheckForSerializationFailure() is missing a failure
> case.
>   2. The state prior to entering that function (which I believe I
> sufficiently described) is wrong.
It's (2), for the reasons I described in my previous email.  Even
though I misread the specifics of your example, I was close enough
to see where the problem was accurately.  :-/
> If it's (2), then what should the state look like, and how is the
> GiST code supposed to result in that state?
The second insert should create conflicts similar to what the first
did, but in the other direction -- simple write skew.  How GiST is
supposed to catch this is the big question.  My logic that a
conflicting insert will modify a page read by the other transaction
only holds until someone inserts a conflicting entry.  That's why it
wasn't reliably failing until you had an example where both
transactions accessed the same leaf page.
In your example, session 1's insert creates the leaf entry and
propagates entries up to the root.  When session 2 inserts, it
can just modify the leaf, so the conflict is missed.  As I said, the
most obvious way to fix this is to look for conflicts while
descending to the leaf for an insert.  I'm almost sure we can do
better than that, but I haven't finished thinking it through.  A
rough idea might be that when we find a conflict on an insert, we
acquire additional predicate locks on everything between the lowest
point of conflict and the leaf; the rest of the logic would remain
as-is.  I haven't finished mulling that over, but it seems likely to
work.  If we did that, session 2 would detect the conflict on the
insert to the leaf, and all would be well.
-Kevin


Re: Serializable snapshot isolation patch

From
"Kevin Grittner"
Date:
Jeff Davis <pgsql@j-davis.com> wrote:
> When using locks in an unconventional way, it would be helpful to
> describe the invalid schedules that you're preventing. Perhaps an
> example if you think it would be reasonably simple? Also some
> indication of how another process is intended to modify the list
> without walking it.
I've just pushed some comment changes intended to address this.  Did
I hit the mark?
-Kevin
P.S.  Sorry for the delay in responding to such simple requests --
I've been tied up with a family medical crisis; I hope to crank
through much of what you've raised this weekend.


Re: Serializable snapshot isolation patch

From
"Kevin Grittner"
Date:
Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote:
>>> 3. Limited shared memory space to hold information about
>>> committed transactions that are still "interesting".
>>> It's a challenging problem, however, and the current solution is
>>> less than ideal.
>>  
>> I'd go further than that and say that it clearly needs to be
>> fixed. 
> 
> OK, this will remain an open issue then.
This seems to me to be by far the biggest problem with the patch. 
I've been working through various ideas, and I think I see the light
at the end of the tunnel.  I'm posting the ideas for a reality
check.  I'm hoping for comments.
It seems clear to me that we need to attack this from two
directions:
(1) Mitigation, by more aggressively identifying transactions which
are no longer interesting so they can be cleaned up.
(2) Graceful degradation, by somehow summarizing information as we
approach the hard limit, so that we incrementally increase the
probability of false positives rather than resorting to either of
the two "simple" solutions (refusing new serializable transactions
or canceling the oldest running serializable transactions).
(Suggestions for other ways to attack this are welcome.)
It seemed to me likely that investigating (1) would help to clarify
how to do (2), so here's what I've got in the mitigation department:
(1A) A committed transaction TX which hasn't written data can be
cleaned up when there is no overlapping non-read-only transaction
which is active and which overlaps a committed transaction which
wrote data and committed too soon to overlap TX.
(1B) Predicate locks and information about rw-conflicts *in* for a
committed transaction can be released when there are no overlapping
non-read-only transactions active.  Except for transactions
described in (1A), the fact that the transaction was a serializable
transaction with a rw-conflict *out* is significant, but nothing
else about the transaction matters, and we don't need to know
details of the conflict or maintain a reverse pointer.
(1C) A committing transaction which has written data can clear its
conflict out pointer if it points to an active transaction which
does not overlap a committed transaction which wrote data. 
(Obviously, the corresponding pointer in the other direction can
also be cleared.)
(1D) A committed transaction with no rw-conflict out cannot become a
pivot in a dangerous structure, because the transaction on the "out"
side of a pivot must commit first for there to be a problem.
(Obviously, two committed transactions cannot develop a new
conflict.)  Since a read-only transaction can only participate in a
dangerous structure through a conflict with a pivot, transactions in
this state can be ignored when a read-only transaction is checking
for conflicts.
That's all I've been able to come up with so far.  Let's see how
that plays with the SSI worst case -- a long running transaction
concurrent with many faster transactions.
(2A) If the long-running transaction is read-only, it looks pretty
good.  We can clear concurrent transactions on a reasonable schedule
and just maintain a list of committed serializable transactions with
rw-conflicts out which wrote data and have xids above the
serializable global xmin.  We can probably make room for such a list
of xids somehow -- I could even see potentially using the SLRU
mechanism for this without killing performance -- the long-running
read-only transaction would just need to look up a particular xid in
the list whenever it read a non-visible tuple.  If it exists, the
long-running transaction must roll back with a serialization
failure.  Normally the list should be short enough to stay in RAM.
(2B) It gets more complicated if the long-running transaction can
also write.  This is both because the predicate lock information
must be maintained and associated with something, and because the
long running transaction can become a pivot with an old short-lived
transaction on the rw-conflict *in* side.  The rw-conflicts *out*
can be handled the same as for a long-running read-only transaction
(described above).
I think that tempered with the above, the performance impact will be
minimal if we apply the technique Heikki described here to the
oldest committed transactions when the list becomes full:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg01477.php
I don't think this change will affect predicate.h or any of the code
which uses its API.  The only thing which uses the internal
structures outside of predicate.c is the code to show the SIReadLock
entries in the pg_locks view.  I'm not sure how we should show locks
for which we no longer have an associated virtual transaction ID. 
Does anyone have thoughts on that?  Just leave virtualtransaction
NULL, and leave the rest as-is?
-Kevin