Thread: Serializable snapshot isolation patch
This is based on Kevin's git repo at:

git://git.postgresql.org/git/users/kgrittn/postgres.git
SHA1: 729541fa5ea94d66e6f4b22fb65bfef92214cd6b

* Trivial stuff:

I get a compiler warning:

indexfsm.c: In function ‘RecordFreeIndexPage’:
indexfsm.c:55: warning: implicit declaration of function
‘PageIsPredicateLocked’

* Open issues, as I see it:

1. 2PC and SSI don't mix (this may be a known issue, because there's
not really any code in the current implementation to deal with 2PC):

Session1:
  BEGIN ISOLATION LEVEL SERIALIZABLE;
  select count(*) from a;
  insert into a values(1);
  PREPARE TRANSACTION 't1';

Session2:
  BEGIN ISOLATION LEVEL SERIALIZABLE;
  select count(*) from a;
  insert into a values(1);
  COMMIT;

Session1:
  COMMIT PREPARED 't1';

Looks like we need to track information about prepared transactions in
shared memory. I think you'll need to keep the information in the 2PC
state file as well, so that it can be rebuilt after a crash or restart.
It all looks solvable at first glance, but it looks like it might be
some work.

2. I think there's a GiST bug (illustrating with PERIOD type):

create table foo(p period);
create index foo_idx on foo using gist (p);
insert into foo select period(
    '2009-01-01'::timestamptz + g * '1 microsecond'::interval,
    '2009-01-01'::timestamptz + (g+1) * '1 microsecond'::interval)
  from generate_series(1,2000000) g;

Session1:
  begin isolation level serializable;
  select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
  insert into foo values('[2009-01-01, 2009-01-01]'::period);

Session2:
  begin isolation level serializable;
  select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
  insert into foo values('[2009-01-01, 2009-01-01]'::period);
  commit;

Session1:
  commit;

In pg_locks (didn't paste here due to formatting), it looks like the
SIRead locks are holding locks on different pages. Can you clarify your
design for GiST and the interaction with page-level locks?
It looks like you're making some assumption about which pages will be
visited when searching for conflicting values which doesn't hold true.
However, that seems odd, because even if the value is actually inserted
in one transaction, the other doesn't seem to find the conflict.
Perhaps the bug is simpler than that? Or perhaps I have some kind of
odd bug in PERIOD's gist implementation?

Also, it appears to be non-deterministic, to a degree at least, so you
may not observe the problem in the exact way that I do.

3. Limited shared memory space to hold information about committed
transactions that are still "interesting". Relevant thread:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php

It's a challenging problem, however, and the current solution is less
than ideal. Idle transactions can mean that all new serializable
transactions fail until the idle transactions start to terminate. I
don't like that very much, because people expect to have to retry
serializable transactions, but retrying here has no real hope (except
that some time has elapsed, and maybe the other transaction decided to
commit).

A comparison is made (in the aforementioned thread) to the existing
limitation on the number of locks. However, it's much easier to
understand normal locks, and for a given workload you can usually put
an upper bound on the number of locks required (right?).

Does it make sense to kill the existing transactions that are holding
everything up, rather than the new transaction? Or would that just
confuse matters more? This does not necessarily guarantee that progress
can be made, either, but intuitively it seems more likely.

4. A few final details:

a. We should probably have a compatibility GUC that makes SERIALIZABLE
equal to REPEATABLE READ. My opinion is that this should be only for
compatibility, and should default to "off" (i.e. SSI code enabled)
either in 9.1 or soon after.

b. Docs.

* Questions:

1. For TargetTagIsCoveredBy(), why is it impossible for the covering
tag to have an offset?

2. The explanation for SerializablePredicateLockListLock is a little
confusing. It makes it sound like other processes can't walk the list,
but they can modify it?

* Summary

Great patch! I didn't make it through the patch in as much detail as I
would have liked, because the theory behind it is quite complex and it
will take longer for me to absorb. But the implementation looks good
and the use case is very important.

Regards,
Jeff Davis
First off, thanks for the review! I know that it's a lot of work, and I
do appreciate it.

Jeff Davis <pgsql@j-davis.com> wrote:

> * Trivial stuff:
>
> I get a compiler warning:
>
> indexfsm.c: In function ‘RecordFreeIndexPage’:
> indexfsm.c:55: warning: implicit declaration of function
> ‘PageIsPredicateLocked’

Oops. Fixed on my git repo now.

> * Open issues, as I see it:
>
> 1. 2PC and SSI don't mix (this may be a known issue, because
> there's not really any code in the current implementation to deal
> with 2PC):
>
> Looks like we need to track information about prepared
> transactions in shared memory. I think you'll need to keep the
> information in the 2PC state file as well, so that it can be
> rebuilt after a crash or restart. It all looks solvable at first
> glance, but it looks like it might be some work.

Argh! I didn't anticipate an interaction with prepared transactions,
probably because I've never used them or taken a close look at them.
I'll take a look.

> 2. I think there's a GiST bug (illustrating with PERIOD type):
>
> Can you clarify your design for GiST and the interaction with
> page-level locks? It looks like you're making some assumption
> about which pages will be visited when searching for conflicting
> values which doesn't hold true. However, that seems odd, because
> even if the value is actually inserted in one transaction, the
> other doesn't seem to find the conflict. Perhaps the bug is
> simpler than that? Or perhaps I have some kind of odd bug in
> PERIOD's gist implementation?

My assumptions for GiST were that:

(1) A search for a matching value could bail out at any level in the
tree; there is no requirement for the search to proceed to the leaf
level to get a negative index probe.

(2) An index insertion which would cause a row to be found if an
earlier search was repeated must modify some page which was read by
the earlier search.
(3) Because of the above, it is necessary and sufficient to acquire an
SIRead lock on all pages visited in a search of a GiST index, and flag
a conflict on insertion into a locked page at any level of the index.

That logic still seems sound to me, so if someone else sees a flaw in
it, please point it out. Assuming that logic is sound, I'll poke around
to see where the flaw in implementation may be. If you have a full
self-contained test case to demonstrate the failure here, could you
send it to me?

> Also, it appears to be non-deterministic, to a degree at least, so
> you may not observe the problem in the exact way that I do.

Yeah, I have tested this area without seeing the failure, so that
self-contained example would be a big help.

> 3. Limited shared memory space to hold information about committed
> transactions that are still "interesting". Relevant thread:
>
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php
>
> It's a challenging problem, however, and the current solution is
> less than ideal.

I'd go further than that and say that it clearly needs to be fixed. The
scale of the issue was somewhat disguised in my testing when I was
using the hash table for managing acquisition of shared memory for what
was essentially an unordered list. Now that I've made it a proper list
with a hard limit on entries, the problem is pretty easy to provoke,
and just reserving space for A Very Large Number Of Entries is clearly
not an adequate solution. :-(

> Idle transactions can mean that all new serializable transactions
> fail until the idle transactions start to terminate. I don't like
> that very much, because people expect to have to retry
> serializable transactions, but retrying here has no real hope
> (except that some time has elapsed, and maybe the other
> transaction decided to commit).

Agreed.

> Does it make sense to kill the existing transactions that are
> holding everything up, rather than the new transaction? Or would
> that just confuse matters more? This does not necessarily
> guarantee that progress can be made, either, but intuitively it
> seems more likely.

Canceling an old transaction to allow new transactions to begin *might*
be better than the current situation, but I think we can and should do
better. The best I have been able to come up with is in this post:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg01724.php

There's a fair amount of work there, so I was hoping for some
confirmation that this combination of steps was both sane and had some
chance of being considered sufficient and acceptable before diving in
to code it. I also *really* hope to add the "SERIALIZABLE READ ONLY
DEFERRABLE" mode so that pg_dump and other read-only transactions don't
push things into a state where the rollback rate spikes:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php

> 4. A few final details:
>
> a. We should probably have a compatibility GUC that makes
> SERIALIZABLE equal to REPEATABLE READ. My opinion is that this
> should be only for compatibility, and should default to "off"
> (i.e. SSI code enabled) either in 9.1 or soon after.

I'm inclined to agree with you, with the strong preference for a 9.1
setting of off. This was previously discussed, and there were people
who felt that we should avoid a "behavior-changing" GUC like that, so I
didn't add it. It wouldn't be hard to put it in, if that's the
community consensus.

> b. Docs.
>
> * Questions:
>
> 1. For TargetTagIsCoveredBy(), why is it impossible for the
> covering tag to have an offset?

Because a "covering" lock is a lock at a coarser granularity than the
"covered" lock, which includes what the finer-grained lock covers.
Since the tuple level is as fine as we take it, and we only use the
offset for tuple locks, a covering lock would not have that set. (We
also check for duplicate locks.) I'll expand that comment a bit, since
it obviously wasn't clear enough.

> 2.
> The explanation for SerializablePredicateLockListLock is a little
> confusing. It makes it sound like other processes can't walk the
> list, but they can modify it?

That's a bit unconventional, I admit, but it was a big part of bringing
the benchmarks for a fully cached, read-only transaction load down to
running just 1.8% longer than REPEATABLE READ (which is the same as
current SERIALIZABLE). I'm open to other locking strategies, or an
explanation of where this one doesn't work, but I'd want to benchmark
carefully. Since you took it to mean the right thing, but found it
surprising enough to doubt that it was accurate, do you have any
particular suggestions about how to clarify it? (Maybe just, "Yeah,
that's unconventional, but it works and is faster than the alternatives
we've tried."?)

> * Summary
>
> Great patch!

Thanks!

> I didn't make it through the patch in as much detail as I would
> have liked, because the theory behind it is quite complex and it
> will take longer for me to absorb.

Yeah, there are some mind-bending ideas in there. If I recall
correctly, Dan said he spent one week reading the academic papers on
which it was based and another week reading the patch before he felt
comfortable starting to code the parts he wrote. I spent a lot more
time than that reading papers, and I still didn't quite have my head
around some of the concepts until I went back to some of Fekete's work
which led up to this innovation. While many of the concepts go back to
the '80s and '90s, I found that Fekete's work "clicked" with me better
than some of the earlier, more theoretical work, because he was looking
at real-world examples of how these problems manifest in actual,
working production software and what techniques could be used to
identify and mitigate them. Sometimes the theory is hard for me to
properly digest without such concrete details.

> But the implementation looks good and the use case is very
> important.

It certainly is for our shop, and I've heard from others who are very
interested in it. I think that it's generally of more benefit to "big
shops" than situations where you have just a few programmers coding
against a schema with just a few dozen tables; although it's also
somewhat a matter of style: I would rather deal with *one* issue
(serialization failures) in one place than scatter defensive code all
over my application codebase, even on a moderate scale.

Anyway, given the issues raised and the date, it seems reasonable to
mark this as "Returned with Feedback" for CF management purposes. I'll
get fixes out as soon as I can.

-Kevin
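Kevin's answer above about why a covering tag can never carry an offset
can be made concrete with a toy model. This is an illustrative sketch,
not the patch's actual TargetTagIsCoveredBy() code; the tag
representation and field names are invented:

```python
# Toy model of predicate-lock tag coverage: a coarser-granularity tag
# "covers" a finer one when the finer tag matches it on every field the
# coarse tag specifies.  Only tuple-level tags carry an offset, and the
# tuple level is the finest granularity -- so a *covering* tag can never
# have an offset set, which is the invariant the question was about.

def covers(coarse, fine):
    """True if `coarse` is a strictly coarser tag that includes `fine`."""
    return len(coarse) < len(fine) and all(
        coarse[k] == fine.get(k) for k in coarse)

rel_tag   = {"db": 1, "rel": 16384}                            # relation lock
page_tag  = {"db": 1, "rel": 16384, "blkno": 0}                # page lock
tuple_tag = {"db": 1, "rel": 16384, "blkno": 0, "offset": 7}   # tuple lock

print(covers(rel_tag, page_tag))    # True: relation covers page
print(covers(page_tag, tuple_tag))  # True: page covers tuple
print(covers(tuple_tag, tuple_tag)) # False: a tag never covers itself
                                    # (duplicates are handled separately,
                                    # per the thread)
```

The strict `len(coarse) < len(fine)` check mirrors the point made later
in the thread that a lock can't cover itself; an identical re-request is
simply ignored.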
On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote:
> > 2. I think there's a GiST bug (illustrating with PERIOD type):
>
> My assumptions for GiST were that:
>
> (1) A search for a matching value could bail out at any level in the
> tree; there is no requirement for the search to proceed to the leaf
> level to get a negative index probe.
>
> (2) An index insertion which would cause a row to be found if an
> earlier search was repeated must modify some page which was read by
> the earlier search.
>
> (3) Because of the above, it is necessary and sufficient to acquire
> an SIRead lock on all pages visited in a search of a GiST index, and
> flag a conflict on insertion into a locked page at any level of the
> index.
>
> That logic still seems sound to me, so if someone else sees a flaw
> in it, please point it out.

Looks sound to me, as well.

> > Also, it appears to be non-deterministic, to a degree at least, so
> > you may not observe the problem in the exact way that I do.
>
> Yeah, I have tested this area without seeing the failure, so that
> self-contained example would be a big help.

I assume here that you mean that you _did_ see the failure
(serialization error) and therefore did not see the problem? Also, are
you sure it was using the GiST index for the searches and didn't just
get a full table lock or full index lock?

I'll try to narrow it down. It could be a problem with PERIOD, or maybe
it wasn't using the GiST index for a search when I thought it was (I
didn't run EXPLAIN at every point, so I'll double-check). Clearly, a
problem exists somewhere though.

> > 3. Limited shared memory space to hold information about committed
> > transactions that are still "interesting". Relevant thread:
> >
> > http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php
> >
> > It's a challenging problem, however, and the current solution is
> > less than ideal.
>
> I'd go further than that and say that it clearly needs to be fixed.
OK, this will remain an open issue then. > Canceling an old transaction to allow new transactions to begin > *might* be better than the current situation, but I think we can and > should do better. The best I have been able to come up with is in > this post: > > http://archives.postgresql.org/pgsql-hackers/2010-09/msg01724.php At first I didn't like that approach much, but after re-reading it, I am more optimistic. What I like most about it is that a transaction won't be canceled if it doesn't interfere at all (e.g. operating on different tables). There are still edge cases where it can be canceled, like when the memory is so exhausted that it can't even track a couple table-level locks, but that doesn't sound as bad (closer to the current restriction on the total number of locks). Ultimately I think this will be one of those problems with no 100% clean solution. However, I think it's important that we try to minimize these degenerate cases. Eventually, we may need to keep statistics about the number of conflicts happening, and start to rate-limit new serializable transactions (effectively reducing parallelism) to ensure that reasonable progress can be made (hopefully faster than serial execution). > There's a fair amount of work there, so I was hoping for some > confirmation that this combination of steps was both sane and had > some chance of being considered sufficient and acceptable before > diving in to code it. I also *really* hope to add the "SERIALIZABLE > READ ONLY DEFERRABLE" mode so that pg_dump and other read-only > transactions don't push things into a state where the rollback rate > spikes: > > http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php One potential issue there is starvation. I don't see how you would guarantee that there will ever be a point to grab a safe snapshot, even if all of the other transactions are completing. > > 4. A few final details: > > > > a. 
We should probably have a compatibility GUC that makes > > SERIALIZABLE equal to REPEATABLE READ. My opinion is that this > > should be only for compatibility, and should default to "off" > > (i.e. SSI code enabled) either in 9.1 or soon after. > > I'm inclined to agree with you, with the strong preference for a 9.1 > setting of off. This was previously discussed, and there were > people who felt that we should avoid a "behavior-changing" GUC like > that, so I didn't add it. It wouldn't be hard to put it in, if > that's the community consensus. I think that was me, but I'm OK with behavior-changing GUCs as long as they are there for compatibility and we intend to push people toward the new behavior in the long run (like standard_conforming_strings, but hopefully not as long-lived). Maybe it's not necessary at all, because your patch offers strictly better guarantees (at the cost of more serialization failures), and if they want the old behavior then REPEATABLE READ is already there. > > 1. For TargetTagIsCoveredBy(), why is it impossible for the > > covering tag to have an offset? > > Because a "covering" lock is a lock at a coarser granularity than > the "covered" lock, which includes what the finer-grained lock > covers. Since the tuple level is as fine as we take it, and we only > use the offset for tuple locks, a covering lock would not have that > set. (We also check for duplicate locks.) I'll expand that comment > a bit, since it obviously wasn't clear enough. I see. So a lock can't ever cover itself? > > 2. The explanation for SerializablePredicateLockListLock is a > > little confusing. It makes it sound like other processes can't > > walk the list, but they can modify it? > > That's a bit unconventional, I admit, but it was a big part of > bringing the benchmarks for a fully cached, read-only transaction > load down to running just 1.8% longer than REPEATABLE READ (which is > the same as current SERIALIZABLE). 
> the same as current SERIALIZABLE). I'm open to other locking
> strategies, or an explanation of where this one doesn't work, but
> I'd want to benchmark carefully. Since you took it to mean the
> right thing, but found that surprising enough to doubt that it was
> accurate, do you have any particular suggestions about how to
> clarify it? (Maybe just, "Yeah, that's unconventional, but it works
> and is faster than the alternatives we've tried."?)

When using locks in an unconventional way, it would be helpful to
describe the invalid schedules that you're preventing. Perhaps an
example, if you think it would be reasonably simple? Also some
indication of how another process is intended to modify the list
without walking it.

> It certainly is for our shop, and I've heard from others who are
> very interested in it. I think that it's generally of more
> benefit to "big shops" than situations where you have just a few
> programmers coding against a schema with just a few dozen tables;
> although it's also somewhat a matter of style: I would rather deal
> with *one* issue (serialization failures) in one place than scatter
> defensive code all over my application codebase, even on a moderate
> scale.

It's a complexity-reducing and correctness-increasing feature, which is
generally good.

Regards,
Jeff Davis
> Jeff Davis wrote: > On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote: > I assume here that you mean that you _did_ see the failure > (serialization error) and therefore did not see the problem? Yeah. > Also, are you sure it was using the GiST index for the searches and > didn't just get a full table lock or full index lock? I think so, but I'm at home and don't have details here. Will check tomorrow. > I'll try to narrow it down. It could be a problem with PERIOD, or > maybe it wasn't using the GiST index for a search when I thought it > was (I didn't run EXPLAIN at every point, so I'll double-check). > Clearly, a problem exists somewhere though. Hmmm... When Joe was looking at the patch he exposed an intermittent problem with btree indexes which turned out to be related to improper handling of the predicate locks during index page clean-up caused by a vacuum. Easy to fix once found, but it didn't always happen, even with identical runs. (I'm guessing that was due to the randomness in picking a page to split when inserting into a group of identical keys.) Perhaps a similar bug lurks in the GiST predicate locking. > Eventually, we may need to keep statistics about the number of > conflicts happening, and start to rate-limit new serializable > transactions (effectively reducing parallelism) to ensure that > reasonable progress can be made (hopefully faster than serial > execution). Ah, you've exposed just how self-serving my interest in an admission control policy mechanism is! ;-) http://archives.postgresql.org/pgsql-hackers/2009-12/msg02189.php >> I also *really* hope to add the "SERIALIZABLE READ ONLY >> DEFERRABLE" mode so that pg_dump and other read-only transactions >> don't push things into a state where the rollback rate spikes: >> >> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php > > One potential issue there is starvation. 
> I don't see how you would guarantee that there will ever be a
> point to grab a safe snapshot, even if all of the other
> transactions are completing.

After mulling it over in greater detail than previously, I see your
point. I'll think about it some more, but this particular idea might be
a dead end.

>>> a. We should probably have a compatibility GUC that makes
>>> SERIALIZABLE equal to REPEATABLE READ. My opinion is that this
>>> should be only for compatibility, and should default to "off"
>>> (i.e. SSI code enabled) either in 9.1 or soon after.
>>
>> I'm inclined to agree with you, with the strong preference for a
>> 9.1 setting of off. This was previously discussed, and there were
>> people who felt that we should avoid a "behavior-changing" GUC
>> like that, so I didn't add it. It wouldn't be hard to put it in,
>> if that's the community consensus.
>
> I think that was me, but I'm OK with behavior-changing GUCs as long
> as they are there for compatibility and we intend to push people
> toward the new behavior in the long run (like
> standard_conforming_strings, but hopefully not as long-lived).
>
> Maybe it's not necessary at all, because your patch offers strictly
> better guarantees (at the cost of more serialization failures), and
> if they want the old behavior then REPEATABLE READ is already
> there.

Anyone else with an opinion on this?

> So a lock can't ever cover itself?

No. If a transaction requests an SIRead lock identical to one it
already holds, we ignore the request.

> When using locks in an unconventional way, it would be helpful to
> describe the invalid schedules that you're preventing. Perhaps an
> example if you think it would be reasonably simple? Also some
> indication of how another process is intended to modify the list
> without walking it.

I will review that comment and add something along those lines.

-Kevin
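The unconventional locking rule being discussed can be checked against
the ordinary reader/writer compatibility matrix. The following is my
reading of the convention described in the thread (shared mode to touch
your own list, exclusive mode to touch anyone else's), sketched with
invented names; it is not code from the patch:

```python
# One LWLock protects every backend's predicate-lock list.  Protocol:
# a backend takes it SHARED to walk or modify its *own* list, and
# EXCLUSIVE to modify any *other* backend's list.  Since shared
# conflicts with exclusive, a foreign writer can never run concurrently
# with the list's owner -- yet owners working only on their own lists
# (the common, performance-critical case) all proceed in parallel.

def lock_mode(backend, list_owner):
    return "shared" if backend == list_owner else "exclusive"

def concurrent_ok(mode_a, mode_b):
    # standard rule: only shared+shared may be held at the same time
    return mode_a == "shared" and mode_b == "shared"

# Two backends each touching their own list: both shared, fully parallel.
print(concurrent_ok(lock_mode("A", "A"), lock_mode("B", "B")))  # True

# Backend B modifying A's list while A walks it: exclusive vs. shared,
# so B waits -- A's walk is safe even though "other processes can
# modify the list".
print(concurrent_ok(lock_mode("A", "A"), lock_mode("B", "A")))  # False
```

This is the invalid schedule being prevented: a foreign modification
interleaved with the owner's walk. Everything else may proceed
concurrently, which is where the benchmark win comes from.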
On Mon, 2010-10-18 at 22:12 -0500, Kevin Grittner wrote: > Hmmm... When Joe was looking at the patch he exposed an intermittent > problem with btree indexes which turned out to be related to improper > handling of the predicate locks during index page clean-up caused by a > vacuum. Easy to fix once found, but it didn't always happen, even > with identical runs. (I'm guessing that was due to the randomness in > picking a page to split when inserting into a group of identical > keys.) Perhaps a similar bug lurks in the GiST predicate locking. I briefly looked into this when I woke up this morning, and I think I'm close. I can reproduce it every time, so I should be able to fix this as soon as I can find some free time (tomorrow night, probably). I might also be able to help with the 2PC issue, but it will be at least a week or two before I have the free time to dig into that one. > > Eventually, we may need to keep statistics about the number of > > conflicts happening, and start to rate-limit new serializable > > transactions (effectively reducing parallelism) to ensure that > > reasonable progress can be made (hopefully faster than serial > > execution). > > Ah, you've exposed just how self-serving my interest in an admission > control policy mechanism is! ;-) > > http://archives.postgresql.org/pgsql-hackers/2009-12/msg02189.php Cool! > >> I also *really* hope to add the "SERIALIZABLE READ ONLY > >> DEFERRABLE" mode so that pg_dump and other read-only transactions > >> don't push things into a state where the rollback rate spikes: > >> > >> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php > > > > One potential issue there is starvation. I don't see how you would > > guarantee that there will ever be a point to grab a safe snapshot, > > even if all of the other transactions are completing. > > After mulling it over in greater detail the previously, I see your > point. I'll think about it some more, but this particular idea might > be a dead end. 
I didn't quite mean that it's a dead end. It still seems tempting to at
least try to get a safe snapshot, especially for a transaction that you
know will be long-running. Maybe a timeout or something?

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote:

> I briefly looked into this when I woke up this morning, and I
> think I'm close. I can reproduce it every time, so I should be
> able to fix this as soon as I can find some free time (tomorrow
> night, probably).

OK, I'll focus on other areas.

> I might also be able to help with the 2PC issue, but it will be at
> least a week or two before I have the free time to dig into that
> one.

If I get through the other points you raised, I'll start reading up on
2PC.

> I didn't quite mean that it's a dead end. It still seems tempting
> to at least try to get a safe snapshot, especially for a
> transaction that you know will be long-running. Maybe a timeout or
> something?

But what would you do when you hit the timeout?

One thing that would work, but I really don't think I like it, is that
a request for a snapshot for such a transaction would not only block
until it could get a "clean" snapshot (no overlapping serializable
non-read-only transactions which overlap serializable transactions
which wrote data and then committed in time to be visible to the
snapshot being acquired), but it would *also* block *other*
serializable transactions, if they were non-read-only, on an attempt to
acquire a snapshot. That would at least guarantee that the serializable
read only deferrable transaction could get its snapshot as soon as the
initial set of problem overlapping transactions completed, but it would
be an exception to the "SSI introduces no new blocking" guarantee. :-(
I was OK with that for the particular transaction where DEFERRABLE was
requested, but to have it block other serializable transactions seems
pretty iffy to me.

Short of that, I think you would just have to wait for completion of
all known problem transactions and then try again. On a server with a
heavy write load, particularly if the length of the writing
transactions was highly variable, you could go through a lot of
iterations before getting that clean snapshot.

-Kevin
On Tue, Oct 19, 2010 at 6:28 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> One thing that would work, but I really don't think I like it, is
> that a request for a snapshot for such a transaction would not only
> block until it could get a "clean" snapshot (no overlapping
> serializable non-read-only transactions which overlap serializable
> transactions which wrote data and then committed in time to be
> visible to the snapshot being acquired), but it would *also* block
> *other* serializable transactions, if they were non-read-only, on an
> attempt to acquire a snapshot.

This seems pretty close to guaranteeing serializability by running
transactions one at a time (i.e. I don't think it's likely to be
acceptable from a performance standpoint).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 19, 2010 at 6:28 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>> One thing that would work, but I really don't think I like it, is
>> that a request for a snapshot for such a transaction would not
>> only block until it could get a "clean" snapshot (no overlapping
>> serializable non-read-only transactions which overlap
>> serializable transactions which wrote data and then committed in
>> time to be visible to the snapshot being acquired), but it would
>> *also* block *other* serializable transactions, if they were
>> non-read-only, on an attempt to acquire a snapshot.
>
> This seems pretty close to guaranteeing serializability by running
> transactions one at a time (i.e. I don't think it's likely to be
> acceptable from a performance standpoint).

It absolutely makes no sense except for long-running read-only
transactions, and would only be used when explicitly requested; and
like I said, I really don't think I like it even on that basis -- just
putting it out there as the only alternative I've found so far to
either tolerating possible serialization anomalies in pg_dump output
(albeit only when compared to the state the database reached after the
dump's snapshot) or waiting indefinitely for a clean snapshot to become
available.

FWIW, from a brainstorming perspective: while waiting for problem
transactions to clear so we could get a clean snapshot for the dump, I
think it would work even better to block the *commit* of serializable
transactions which *had done* writes than to block snapshot acquisition
for serializable transactions which were not read-only. Still pretty
icky, though. I am loath to compromise the "no new blocking" promise of
SSI.

[thinks]

Actually, maybe we can reduce the probability of needing to retry at
each iteration of the non-blocking alternative by checking the conflict
information for the problem transactions after they commit.
Any transaction which didn't *actually* generate a read-write conflict out to a transaction visible to the dump's candidate snapshot could not cause an anomaly. If none of the problem transactions actually generates a rw-conflict we can use the candidate snapshot. Adding that logic to the non-blocking alternative might actually work pretty well. There might be some workloads where conflicts would be repeatedly generated, but there would be a lot where they wouldn't. If we add a switch to pg_dump to allow users to choose, I think this algorithm works. It never affects a transaction unless it has explicitly requested SERIALIZABLE READ ONLY DEFERRABLE, and the only impact is that startup may be deferred until a snapshot can be acquired which ensures serializable behavior without worrying about SIRead locks. -Kevin
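The non-blocking algorithm Kevin describes above -- take a candidate
snapshot, wait out the problem transactions, and keep the snapshot if
none of them actually generated a rw-conflict out to a transaction
visible to it -- can be written as a short retry loop. Purely
illustrative; all function and field names here are invented, not from
the patch:

```python
# Sketch of acquiring a snapshot for SERIALIZABLE READ ONLY DEFERRABLE
# without blocking other transactions: retry until a candidate snapshot
# is proven safe, either immediately or retroactively.

def acquire_safe_snapshot(take_snapshot, problem_txns_for,
                          caused_rw_conflict_out):
    while True:
        snap = take_snapshot()
        problems = problem_txns_for(snap)  # overlapping serializable r/w txns
        if not problems:
            return snap                    # safe from the start
        for t in problems:
            t.wait_for_completion()        # waits, but blocks nobody else
        # Retroactive check: the snapshot is still usable unless some
        # problem transaction generated a rw-conflict out to a
        # transaction visible to this snapshot.
        if not any(caused_rw_conflict_out(t, snap) for t in problems):
            return snap
        # otherwise loop: grab a fresh candidate and try again

# Tiny demonstration: the first candidate is spoiled, the second survives.
class FakeTxn:
    def __init__(self, conflicted): self.conflicted = conflicted
    def wait_for_completion(self): pass

snap_counter = [0]
def take_snapshot():
    snap_counter[0] += 1
    return snap_counter[0]

def problem_txns_for(snap):
    return [FakeTxn(conflicted=(snap == 1))]

def caused_rw_conflict_out(t, snap):
    return t.conflicted

safe = acquire_safe_snapshot(take_snapshot, problem_txns_for,
                             caused_rw_conflict_out)
print(safe)  # 2
```

Note the starvation concern raised earlier in the thread: nothing
bounds the number of iterations, so under a steady write load with
repeated conflicts the loop may never return.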
On Sun, 2010-10-17 at 22:53 -0700, Jeff Davis wrote:
> 2. I think there's a GiST bug (illustrating with PERIOD type):
>
> create table foo(p period);
> create index foo_idx on foo using gist (p);
> insert into foo select period(
>     '2009-01-01'::timestamptz + g * '1 microsecond'::interval,
>     '2009-01-01'::timestamptz + (g+1) * '1 microsecond'::interval)
>   from generate_series(1,2000000) g;
>
> Session1:
>   begin isolation level serializable;
>   select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
>   insert into foo values('[2009-01-01, 2009-01-01]'::period);
>
> Session2:
>   begin isolation level serializable;
>   select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
>   insert into foo values('[2009-01-01, 2009-01-01]'::period);
>   commit;
>
> Session1:
>   commit;
>
> In pg_locks (didn't paste here due to formatting), it looks like the
> SIRead locks are holding locks on different pages. Can you clarify
> your design for GiST and the interaction with page-level locks? It
> looks like you're making some assumption about which pages will be
> visited when searching for conflicting values which doesn't hold
> true. However, that seems odd, because even if the value is actually
> inserted in one transaction, the other doesn't seem to find the
> conflict. Perhaps the bug is simpler than that? Or perhaps I have
> some kind of odd bug in PERIOD's gist implementation?
>
> Also, it appears to be non-deterministic, to a degree at least, so
> you may not observe the problem in the exact way that I do.

I have more information on this failure. Everything in GiST actually
looks fine. I modified the example slightly:

T1: begin isolation level serializable;
T2: begin isolation level serializable;
T1: select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
T2: select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
T2: commit;
T1: commit;

The SELECTs only look at the root and the predicate doesn't match.
So each SELECT sets an SIReadLock on block 0 and exits the search. Looks good so far.

T1 then inserts, and it has to modify page 0, so it does FlagRWConflict(). That sets writer->inConflict = reader and reader->outConflict = writer (where writer is T1 and reader is T2); T1->outConflict and T2->inConflict remain NULL.

Then T2 inserts. I didn't catch that part in as much detail in gdb, but it apparently has no effect on that state, so we still have T1->inConflict = T2, T1->outConflict = NULL, T2->inConflict = NULL, and T2->outConflict = T1.

That looks like a reasonable state to me, but I'm not sure exactly what the design calls for. I am guessing that the real problem is in PreCommit_CheckForSerializationFailure(), where there are six conditions that must all be met for an error to be thrown. T2 falls out right away at condition 1. T1 falls out on condition 4. I don't really understand condition 4 at all -- can you explain it? And can you explain conditions 5 and 6 too?

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote: >> Also, it appears to be non-deterministic, to a degree at least, >> so you may not observe the problem in the exact way that I do. > The SELECTs only look at the root and the predicate doesn't match. > So each SELECT sets an SIReadLock on block 0 and exits the search. > Looks good so far. > > T1 then inserts, and it has to modify page 0, so it does > FlagRWConflict(). That sets writer->inConflict = reader and > reader->outConflict = writer (where writer is T1 and reader is > T2); and T1->outConflict and T2->inConflict remain NULL. > > Then T2 inserts, and I didn't catch that part in as much detail in > gdb, but it apparently has no effect on that state, so we still > have T1->inConflict = T2, T1->outConflict = NULL, T2->inConflict = > NULL, and T2->outConflict = T1. I now see where the wheels fall off. The GiST query initially stops at a high level, so predicate locks only go that deep, and the *first* insert of a conflicting row must ripple up and modify a locked page; but *subsequent* inserts may only need to modify the leaf level. Even though your particular example doesn't involve a cycle and therefore doesn't require a rollback for correctness (although it might tend to generate a false positive if index page locks were working correctly), you've exposed a flaw in the GiST AM implementation of predicate locks. On a first glance, it appears that we would need to check for conflicts as we move down through the index to find the right spot for an insert, not as we modify pages for the insert. I hope there's some more subtle technique or some way to qualify it; otherwise a search which stops at the root page would generate a conflict out to just about any index insertion from a concurrent transaction. I will add this to my list of issues to fix based on your review, unless it's something you would like to tackle -- I'm not going to chase away anyone who wants to help with this. :-) -Kevin
Jeff Davis <pgsql@j-davis.com> wrote:
> That looks like a reasonable state to me, but I'm not sure exactly
> what the design calls for. I am guessing that the real problem is
> in PreCommit_CheckForSerializationFailure(), where there are 6
> conditions that must be met for an error to be thrown. T2 falls
> out right away at condition 1. T1 falls out on condition 4. I
> don't really understand condition 4 at all -- can you explain it?
> And can you explain conditions 5 and 6 too?

Since most transactions are rolled back on conflict detection during a read or write attempt, there are only a few very specific conditions which can "slip through" to where they need to be detected on commit. Here's the code with the six conditions:

    if (MySerializableXact->inConflict != InvalidSerializableXact
        && MySerializableXact->inConflict != MySerializableXact
        && !(MySerializableXact->inConflict->rolledBack)
        && MySerializableXact->inConflict->inConflict != InvalidSerializableXact
        && !SxactIsCommitted(MySerializableXact->inConflict)
        && !SxactIsCommitted(MySerializableXact->inConflict->inConflict))

Condition 4 is testing whether MySerializableXact is on the "out" side of a pivot -- in the parlance of most examples, is MySerializableXact TN? Conditions 5 and 6 confirm that neither T0 nor T1 has committed first; we can only have a problem if TN commits first.

Basically, when we already have a pivot, but no transaction has yet committed, we wait to see if TN commits first. If so, we have a problem; if not, we don't. There's probably some room for improving performance by cancelling T0 or T1 instead of TN, at least some of the time; but in this pass we are always cancelling the transaction in whose process we detect the need to cancel something.

-Kevin
On Thu, 2010-10-21 at 10:29 -0500, Kevin Grittner wrote:
> Basically, when we already have a pivot, but no transaction has yet
> committed, we wait to see if TN commits first. If so, we have a
> problem; if not, we don't. There's probably some room for improving
> performance by cancelling T0 or T1 instead of TN, at least some of
> the time; but in this pass we are always cancelling the transaction
> in whose process we detect the need to cancel something.

Well, in this case we do clearly have a problem, because the result is not equal to the serial execution of the transactions in either order. So the question is: at what point is the logic wrong? It's either:

1. PreCommit_CheckForSerializationFailure() is missing a failure case.
2. The state prior to entering that function (which I believe I sufficiently described) is wrong.

If it's (2), then what should the state look like, and how is the GiST code supposed to result in that state? I know some of these questions are answered in the relevant research, but I'd like to discuss this concrete example specifically.

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote:
> in this case we do clearly have a problem, because the result is
> not equal to the serial execution of the transactions in either
> order.

Yeah, you're right. I misread that example -- newbie with the PERIOD type.

> So the question is: at what point is the logic wrong? It's either:
> 1. PreCommit_CheckForSerializationFailure() is missing a failure
> case.
> 2. The state prior to entering that function (which I believe I
> sufficiently described) is wrong.

It's (2), for the reasons I described in my previous email. Even though I misread the specifics of your example, I was close enough to see accurately where the problem was. :-/

> If it's (2), then what should the state look like, and how is the
> GiST code supposed to result in that state?

The second insert should create conflicts similar to what the first did, but in the other direction -- simple write skew. How GiST is supposed to catch this is the big question. My logic that a conflicting insert will modify a page read by the other transaction only holds until someone inserts a conflicting entry. That's why it wasn't reliably failing until you had an example where both transactions access the same leaf page. In your example, session 1's insert creates the leaf entry and propagates entries up to the root. When session 2 inserts, it can just modify the leaf, so the conflict is missed.

As I said, the most obvious way to fix this is to look for conflicts while descending to the leaf for an insert. I'm almost sure we can do better than that, but I haven't finished thinking it through. A rough idea might be that when we find a conflict on an insert, we acquire additional predicate locks on everything between the lowest point of conflict and the leaf; the rest of the logic would remain as-is. I haven't finished mulling that over, but it seems likely to work. If we did that, session 2 would detect the conflict on the insert to the leaf, and all would be well.

-Kevin
Jeff Davis <pgsql@j-davis.com> wrote: > When using locks in an unconventional way, it would be helpful to > describe the invalid schedules that you're preventing. Perhaps an > example if you think it would be reasonably simple? Also some > indication of how another process is intended to modify the list > without walking it. I've just pushed some comment changes intended to address this. Did I hit the mark? -Kevin P.S. Sorry for the delay in responding to such simple requests -- I've been tied up with a family medical crisis; I hope to crank through much of what you've raised this weekend.
Jeff Davis <pgsql@j-davis.com> wrote: > On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote: >>> 3. Limited shared memory space to hold information about >>> committed transactions that are still "interesting". >>> It's a challenging problem, however, and the current solution is >>> less than ideal. >> >> I'd go further than that and say that it clearly needs to be >> fixed. > > OK, this will remain an open issue then. This seems to me to be by far the biggest problem with the patch. I've been working through various ideas, and I think I see the light at the end of the tunnel. I'm posting the ideas for a reality check. I'm hoping for comments. It seems clear to me that we need to attack this from two directions: (1) Mitigation, by more aggressively identifying transactions which are no longer interesting so they can be cleaned up. (2) Graceful degradation, by somehow summarizing information as we approach the hard limit, so that we incrementally increase the probability of false positives rather than resorting to either of the two "simple" solutions (refusing new serializable transactions or canceling the oldest running serializable transactions). (Suggestions for other ways to attack this are welcome.) It seemed to me likely that investigating (1) would help to clarify how to do (2), so here's what I've got in the mitigation department: (1A) A committed transaction TX which hasn't written data can be cleaned up when there is no overlapping non-read-only transaction which is active and which overlaps a committed transaction which wrote data and committed too soon to overlap TX. (1B) Predicate locks and information about rw-conflicts *in* for a committed transaction can be released when there are no overlapping non-read-only transactions active. 
Except for transactions described in (1A), the fact that the transaction was a serializable transaction with a rw-conflict *out* is significant, but nothing else about the transaction matters, and we don't need to know details of the conflict or maintain a reverse pointer. (1C) A committing transaction which has written data can clear its conflict out pointer if it points to an active transaction which does not overlap a committed transaction which wrote data. (Obviously, the corresponding pointer in the other direction can also be cleared.) (1D) A committed transaction with no rw-conflict out cannot become a pivot in a dangerous structure, because the transaction on the "out" side of a pivot must commit first for there to be a problem. (Obviously, two committed transactions cannot develop a new conflict.) Since a read-only transaction can only participate in a dangerous structure through a conflict with a pivot, transactions in this state can be ignored when a read-only transaction is checking for conflicts. That's all I've been able to come up with so far. Let's see how that plays with the SSI worst case -- a long running transaction concurrent with many faster transactions. (2A) If the long-running transaction is read-only, it looks pretty good. We can clear concurrent transactions on a reasonable schedule and just maintain a list of committed serializable transactions with rw-conflicts out which wrote data and have xids above the serializable global xmin. We can probably make room for such a list of xids somehow -- I could even see potentially using the SLRU mechanism for this without killing performance -- the long-running read-only transaction would just need to look up a particular xid in the list whenever it read a non-visible tuple. If it exists, the long-running transaction must roll back with a serialization failure. Normally the list should be short enough to stay in RAM. (2B) It gets more complicated if the long-running transaction can also write. 
This is both because the predicate lock information must be maintained and associated with something, and because the long running transaction can become a pivot with an old short-lived transaction on the rw-conflict *in* side. The rw-conflicts *out* can be handled the same as for a long-running read-only transaction (described above). I think that tempered with the above, the performance impact will be minimal if we apply the technique Heikki described here to the oldest committed transactions when the list becomes full: http://archives.postgresql.org/pgsql-hackers/2010-09/msg01477.php I don't think this change will affect predicate.h or any of the code which uses its API. The only thing which uses the internal structures outside of predicate.c is the code to show the SIReadLock entries in the pg_locks view. I'm not sure how we should show locks for which we no longer have an associated virtual transaction ID. Does anyone have thoughts on that? Just leave virtualtransaction NULL, and leave the rest as-is? -Kevin