Thread: Documenting when to retry on serialization failure
"Applications using this level must be prepared to retry transactions due to serialization failures." ... "When an application receives this error message, it should abort the current transaction and retry the whole transaction from the beginning." I note that the specific error codes this applies to are not documented, so lets discuss what the docs for that would look like. I had a conversation with Kevin Grittner about retry some years back and it seemed clear that the application should re-execute application logic from the beginning, rather than just slavishly re-execute the same SQL. But that is not documented either. Is *automatic* retry possible? In all cases? None? Or maybe Some? ISTM that we can't retry anything where a transaction has replied to a user and then the user issued a subsequent SQL statement, since we have no idea whether the subsequent SQL was influenced by the initial reply. But what about the case of a single statement transaction? Can we just re-execute then? I guess if it didn't run anything other than IMMUTABLE functions then it should be OK, assuming the inputs themselves were immutable, which we've no way for the user to declare. Could we allow a user-defined auto_retry parameter? We don't mention that a transaction might just repeatedly fail either. Anyway, know y'all would have some opinions on this. Happy to document whatever we agree. -- Simon Riggs http://www.EnterpriseDB.com/
On Thu, Dec 9, 2021 at 7:43 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > I had a conversation with Kevin Grittner about retry some years back > and it seemed clear that the application should re-execute application > logic from the beginning, rather than just slavishly re-execute the > same SQL. But that is not documented either. Yeah, that would be good to mention somehow. > Is *automatic* retry possible? In all cases? None? Or maybe Some? > > ISTM that we can't retry anything where a transaction has replied to a > user and then the user issued a subsequent SQL statement, since we > have no idea whether the subsequent SQL was influenced by the initial > reply. I agree. > But what about the case of a single statement transaction? Can we just > re-execute then? I guess if it didn't run anything other than > IMMUTABLE functions then it should be OK, assuming the inputs > themselves were immutable, which we've no way for the user to declare. > Could we allow a user-defined auto_retry parameter? I suppose in theory a user-defined parameter is possible, but I think it's fundamentally better for this to be managed on the application side. Even if the transaction is a single query, we don't know how expensive that query is, and it's at least marginally possible that the user might care about that. For example, if the user has set a 10-minute timeout someplace, and the query fails after 8 minutes, they may want to retry. But if we retry automatically then they might hit their timeout, or just be confused about why things are taking so long. And they can always decide not to retry after all, but give up, save it for a less busy period, or whatever. > We don't mention that a transaction might just repeatedly fail either. True. I think that's another good argument against an auto-retry system. The main thing that worries me about an auto-retry system is something else: I think it would rarely be applicable, and people would try to apply it to situations where it won't actually work properly. I believe most users who need to retry transactions that fail due to serialization problems will need some real application logic to make sure that they do the right thing. People with single-statement transactions that can be blindly retried probably aren't using higher isolation levels anyway, and probably won't have many failures even if they are. SSI is really for sophisticated applications, and I think trying to make it "just work" for people with dumb applications will, well, just not work. -- Robert Haas EDB: http://www.enterprisedb.com
Fwiw I think the real problem with automatic retries is that the SQL interface doesn't lend itself to it because the server never really knows if the command is going to be followed by a commit or more commands. I actually think if that problem were tackled it would very likely be a highly appreciated option. Because I think there's a big overlap between the set of users interested in higher isolation levels and the set of users writing stored procedures defining their business logic. They're both kind of "traditional" SQL engine approaches and both lend themselves to the environment where you have a lot of programmers working on a system and you're not able to do things like define strict locking and update orderings. So a lot of users are probably looking at something like "BEGIN; SELECT create_customer_order(....); COMMIT" and wondering why the server can't handle automatically retrying the query if they get an isolation failure. There are actually other reasons why providing the whole logic for the transaction up front with a promise that it'll be the whole transaction is attractive. E.g. vacuum could ignore a transaction if it knows the transaction will never look at the table it's processing... Or automatic deadlock testing tools could extract the list of tables being accessed and suggest "lock table" commands to put at the head of the transaction sorted in a canonical order. These things may not be easy but they're currently impossible for the same reasons automatically retrying is. The executor doesn't know what subsequent commands will be coming after the current one and doesn't know whether it has the whole transaction.
On Fri, Dec 10, 2021 at 1:43 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > "Applications using this level must be prepared to retry transactions > due to serialization failures." > ... > "When an application receives this error message, it should abort the > current transaction and retry the whole transaction from the > beginning." > > I note that the specific error codes this applies to are not > documented, so lets discuss what the docs for that would look like. +1 for naming the error. > I had a conversation with Kevin Grittner about retry some years back > and it seemed clear that the application should re-execute application > logic from the beginning, rather than just slavishly re-execute the > same SQL. But that is not documented either. Right, the result of the first statement could cause the application to do something completely different the second time through. I personally think the best way for applications to deal with this problem (and at least also deadlock, serialisation failure's pessimistic cousin) is to represent transactions as blocks of code that can be automatically retried, however that looks in your client language. It might be that you pass a function/closure/whatever-you-call-it to the transaction management code so it can rerun it if necessary, or that a function is decorated in some way that some magic infrastructure understands, but that's a little tricky to write about in a general enough way for our manual. (A survey of how this looks with various different libraries and tools might make a neat conference talk though.) But isn't that exactly what that existing sentence "... from the beginning" is trying to say, especially with the follow sentence ("The second time through...")? Hhm, yeah, perhaps that next sentence could be clearer. > Is *automatic* retry possible? In all cases? None? Or maybe Some? I'm aware of a couple of concrete cases that confound attempts to retry automatically: sometimes we report a unique constraint violation or an exclusion constraint failure, when we have the information required to diagnose a serialisation anomaly. In those cases, we really should figure out how to spit out 40001 (otherwise what is general purpose auto retry code supposed to do with UCV?). We fixed a single-index variant of this problem in commit fcff8a57. I have an idea for how this might be fixed for the multi-index UCV[1] and exclusion constraint[2] variants of the problem, but haven't actually tried yet. If there are other things that stand in the way of reliable automated retry (= a list of error codes a client library could look for) then I'd love to have a list of them. > But what about the case of a single statement transaction? Can we just > re-execute then? I guess if it didn't run anything other than > IMMUTABLE functions then it should be OK, assuming the inputs > themselves were immutable, which we've no way for the user to declare. > Could we allow a user-defined auto_retry parameter? I've wondered about that too, but so far it didn't seem worth the effort, since application developers need another solution for multi-statement retry anyway. > We don't mention that a transaction might just repeatedly fail either. According to the VLDB paper, the "safe retry" property (§ 5.4) means that a retry won't abort for the same reason (due to a cycle with the same set of other transactions as your last attempt), unless prepared transactions are involved (§ 7.1). This means that the whole system continues to make some kind of progress in the absence of 2PC, though of course your transaction might or might not fail because of a cycle with some other set of transactions. Maybe that is too technical for our manual, which already provides the link to that paper, but it's interesting to note that you can suffer from a stuck busy-work loop until conflicting prepared xacts go away, with a naive automatic-retry-forever system. [1] https://www.postgresql.org/message-id/flat/CAGPCyEZG76zjv7S31v_xPeLNRuzj-m%3DY2GOY7PEzu7vhB%3DyQog%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/CAMTXbE-sq9JoihvG-ccC70jpjMr%2BDWmnYUj%2BVdnFRFSRuaaLZQ%40mail.gmail.com
On Wed, 29 Dec 2021 at 03:30, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Dec 10, 2021 at 1:43 AM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > "Applications using this level must be prepared to retry transactions > > due to serialization failures." > > ... > > "When an application receives this error message, it should abort the > > current transaction and retry the whole transaction from the > > beginning." > > > > I note that the specific error codes this applies to are not > > documented, so lets discuss what the docs for that would look like. > > +1 for naming the error. I've tried to sum up the various points from everybody into this doc patch. Thanks all for replies. -- Simon Riggs http://www.EnterpriseDB.com/
Attachment
On Thu, 16 Dec 2021 at 06:05, Greg Stark <stark@mit.edu> wrote: > So a lot of users are probably looking at something like "BEGIN; > SELECT create_customer_order(....); COMMIT" and wondering why the > server can't handle automatically retrying the query if they get an > isolation failure. I agree with you that it would be desirable to retry for the simple case of an autocommit/single statement transaction run with default_transaction_isolation = 'serializability'. The most important question before we take further action is whether this would be correct to do so, in all cases. Some problem cases would help us decide either way. -- Simon Riggs http://www.EnterpriseDB.com/
Simon Riggs <simon.riggs@enterprisedb.com> writes: > I've tried to sum up the various points from everybody into this doc > patch. Thanks all for replies. This seemed rather badly in need of copy-editing. How do you like the attached text? regards, tom lane diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml index da07f3f6c6..cd659dd994 100644 --- a/doc/src/sgml/mvcc.sgml +++ b/doc/src/sgml/mvcc.sgml @@ -588,7 +588,7 @@ ERROR: could not serialize access due to concurrent update applications using this level must be prepared to retry transactions due to serialization failures. In fact, this isolation level works exactly the same as Repeatable - Read except that it monitors for conditions which could make + Read except that it also monitors for conditions which could make execution of a concurrent set of serializable transactions behave in a manner inconsistent with all possible serial (one at a time) executions of those transactions. This monitoring does not @@ -1720,6 +1720,60 @@ SELECT pg_advisory_lock(q.id) FROM </sect2> </sect1> + <sect1 id="mvcc-serialization-failure-handling"> + <title>Serialization Failure Handling</title> + + <indexterm> + <primary>serialization failure</primary> + </indexterm> + <indexterm> + <primary>retryable error</primary> + </indexterm> + + <para> + Both Repeatable Read and Serializable isolation levels can produce + errors that are designed to prevent serialization anomalies. As + previously stated, applications using these levels must be prepared to + retry transactions that fail due to serialization errors. Such an + error's message text will vary according to the precise circumstances, + but it will always have the SQLSTATE code <literal>40001</literal> + (<literal>serialization_failure</literal>). + </para> + + <para> + It may also be advisable to retry deadlock failures. + These have the SQLSTATE code <literal>40P01</literal> + (<literal>deadlock_detected</literal>). + </para> + + <para> + In some circumstances, a failure that is arguably a serialization + problem may manifest as a unique-key failure, with SQLSTATE + code <literal>23505</literal> (<literal>unique_violation</literal>), + or as an exclusion constraint failure, with SQLSTATE + code <literal>23P01</literal> (<literal>exclusion_violation</literal>). + Therefore, retrying these cases may also be advisable, although one must + be careful that such an error could be persistent. + </para> + + <para> + It is important to retry the complete transaction, including all logic + that decides which SQL to issue and/or which values to use. + Therefore, <productname>PostgreSQL</productname> does not offer an + automatic retry facility, since it cannot do so with any guarantee of + correctness. + </para> + + <para> + Transaction retry does not guarantee that the retried transaction will + complete; multiple retries may be needed. In cases with very high + contention, it is possible that completion of a transaction may take + many attempts. In cases involving a conflicting prepared transaction, + it may not be possible to make progress until the prepared transaction + commits or rolls back. + </para> + </sect1> + <sect1 id="mvcc-caveats"> <title>Caveats</title>
On Wed, 23 Mar 2022 at 19:50, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Simon Riggs <simon.riggs@enterprisedb.com> writes: > > I've tried to sum up the various points from everybody into this doc > > patch. Thanks all for replies. > > This seemed rather badly in need of copy-editing. How do you > like the attached text? Seems clear and does the job. The unique violation thing is worryingly general. Do we know enough to say that this is thought to occur only with a) multiple unique constraints, b) exclusion constraints? -- Simon Riggs http://www.EnterpriseDB.com/
On Thu, Mar 24, 2022 at 11:44 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > The unique violation thing is worryingly general. Do we know enough to > say that this is thought to occur only with a) multiple unique > constraints, b) exclusion constraints? I'm aware of 3 cases. The two you mentioned, which I think we can fix (as described in the threads I posted upthread), and then there is a third case that I'm still confused about, in the last line of read-write-unique-4.spec.
On Thu, 24 Mar 2022 at 11:01, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Thu, Mar 24, 2022 at 11:44 PM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > The unique violation thing is worryingly general. Do we know enough to > > say that this is thought to occur only with a) multiple unique > > constraints, b) exclusion constraints? > > I'm aware of 3 cases. The two you mentioned, which I think we can fix > (as described in the threads I posted upthread), and then there is a > third case that I'm still confused about, in the last line of > read-write-unique-4.spec. I don't see any confusion - it is clearly a serialization error. What is more, I see this as a confusing bug that we should fix. If we were updating the row rather than inserting it, we would get "ERROR: could not serialize access due to concurrent update", as documented. The type of command shouldn't affect whether it is a serialization error or not. (Attached patch proves it does throw serializable error for UPDATE). Solving this requires us to alter the Index API to pass down a snapshot to allow us to test whether the concurrent insert is visible or not. The test is shown in the attached patch, but this doesn't attempt the major task of tweaking the APIs to allow this check to be made. -- Simon Riggs http://www.EnterpriseDB.com/
Attachment
Thomas Munro <thomas.munro@gmail.com> writes: > On Thu, Mar 24, 2022 at 11:44 PM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: >> The unique violation thing is worryingly general. Do we know enough to >> say that this is thought to occur only with a) multiple unique >> constraints, b) exclusion constraints? > I'm aware of 3 cases. The two you mentioned, which I think we can fix > (as described in the threads I posted upthread), and then there is a > third case that I'm still confused about, in the last line of > read-write-unique-4.spec. That test is modeling the case where the application does an INSERT with values based on some data it read earlier. There is no way for the server to know that there's any connection, so I think if you try to throw a serialization error rather than a uniqueness error, you're basically lying to the client by claiming something you do not know to be true. And the lie is not without consequences: if the application believes it, it might iterate forever vainly trying to commit a transaction that will never succeed. regards, tom lane
On Thu, 24 Mar 2022 at 14:05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > > On Thu, Mar 24, 2022 at 11:44 PM Simon Riggs > > <simon.riggs@enterprisedb.com> wrote: > >> The unique violation thing is worryingly general. Do we know enough to > >> say that this is thought to occur only with a) multiple unique > >> constraints, b) exclusion constraints? > > > I'm aware of 3 cases. The two you mentioned, which I think we can fix > > (as described in the threads I posted upthread), and then there is a > > third case that I'm still confused about, in the last line of > > read-write-unique-4.spec. > > That test is modeling the case where the application does an INSERT > with values based on some data it read earlier. There is no way for > the server to know that there's any connection, so I think if you > try to throw a serialization error rather than a uniqueness error, > you're basically lying to the client by claiming something you do not > know to be true. And the lie is not without consequences: if the > application believes it, it might iterate forever vainly trying to > commit a transaction that will never succeed. OK, I see what you mean. There are 2 types of transaction, one that reads inside the transaction, one that decides what value to use some other way. So now we have 2 cases, both of which generate uniqueness violations, but only one of which might succeed if retried. The patch does cover this, I guess, by saying be careful, but I would be happier if we can also add "this is thought to occur only with multiple unique constraints and/or an exclusion constraints" -- Simon Riggs http://www.EnterpriseDB.com/
Simon Riggs <simon.riggs@enterprisedb.com> writes: > OK, I see what you mean. There are 2 types of transaction, one that > reads inside the transaction, one that decides what value to use some > other way. > So now we have 2 cases, both of which generate uniqueness violations, > but only one of which might succeed if retried. The patch does cover > this, I guess, by saying be careful, but I would be happier if we can > also add > "this is thought to occur only with multiple unique constraints and/or > an exclusion constraints" Um, what's that got to do with it? The example in read-write-unique-4.spec involves only a single pkey constraint. We could add something trying to explain that if the application inserts a value into a constrained column based on data it read earlier, then any resulting constraint violation might be effectively a serialization failure. regards, tom lane
On Thu, 24 Mar 2022 at 14:56, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Simon Riggs <simon.riggs@enterprisedb.com> writes: > > OK, I see what you mean. There are 2 types of transaction, one that > > reads inside the transaction, one that decides what value to use some > > other way. > > > So now we have 2 cases, both of which generate uniqueness violations, > > but only one of which might succeed if retried. The patch does cover > > this, I guess, by saying be careful, but I would be happier if we can > > also add > > > "this is thought to occur only with multiple unique constraints and/or > > an exclusion constraints" > > Um, what's that got to do with it? The example in > read-write-unique-4.spec involves only a single pkey constraint. Yes, but as you explained, its not actually a serializable case, it just looks a bit like one. That means we are not currently aware of any case where the situation is serializable but the error message is uniqueness violation, unless we have 2 or more unique constraints and/or an exclusion constraint. > We could add something trying to explain that if the application inserts a > value into a constrained column based on data it read earlier, then any > resulting constraint violation might be effectively a serialization > failure. We could do that as well. -- Simon Riggs http://www.EnterpriseDB.com/
Simon Riggs <simon.riggs@enterprisedb.com> writes: > On Thu, 24 Mar 2022 at 14:56, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Um, what's that got to do with it? The example in >> read-write-unique-4.spec involves only a single pkey constraint. > Yes, but as you explained, its not actually a serializable case, it > just looks a bit like one. > That means we are not currently aware of any case where the situation > is serializable but the error message is uniqueness violation, unless > we have 2 or more unique constraints and/or an exclusion constraint. Meh. I'm disinclined to document it at that level of detail, both because it's subject to change and because we're not sure that that list is exhaustive. I think a bit of handwaving is preferable. How about the attached? (Only the third new para is different.) regards, tom lane diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml index da07f3f6c6..176325247d 100644 --- a/doc/src/sgml/mvcc.sgml +++ b/doc/src/sgml/mvcc.sgml @@ -588,7 +588,7 @@ ERROR: could not serialize access due to concurrent update applications using this level must be prepared to retry transactions due to serialization failures. In fact, this isolation level works exactly the same as Repeatable - Read except that it monitors for conditions which could make + Read except that it also monitors for conditions which could make execution of a concurrent set of serializable transactions behave in a manner inconsistent with all possible serial (one at a time) executions of those transactions. This monitoring does not @@ -1720,6 +1720,71 @@ SELECT pg_advisory_lock(q.id) FROM </sect2> </sect1> + <sect1 id="mvcc-serialization-failure-handling"> + <title>Serialization Failure Handling</title> + + <indexterm> + <primary>serialization failure</primary> + </indexterm> + <indexterm> + <primary>retryable error</primary> + </indexterm> + + <para> + Both Repeatable Read and Serializable isolation levels can produce + errors that are designed to prevent serialization anomalies. As + previously stated, applications using these levels must be prepared to + retry transactions that fail due to serialization errors. Such an + error's message text will vary according to the precise circumstances, + but it will always have the SQLSTATE code <literal>40001</literal> + (<literal>serialization_failure</literal>). + </para> + + <para> + It may also be advisable to retry deadlock failures. + These have the SQLSTATE code <literal>40P01</literal> + (<literal>deadlock_detected</literal>). + </para> + + <para> + In some cases it is also appropriate to retry unique-key failures, + which have SQLSTATE code <literal>23505</literal> + (<literal>unique_violation</literal>), and exclusion constraint + failures, which have SQLSTATE code <literal>23P01</literal> + (<literal>exclusion_violation</literal>). For example, if the + application selects a new value for a primary key column after + inspecting the currently stored keys, it could get a unique-key + failure because another application instance selected the same new key + concurrently. This is effectively a serialization failure, but the + server will not detect it as such because it cannot <quote>see</quote> + the connection between the inserted value and the previous reads. + There are also some corner cases in which the server will issue a + unique-key or exclusion constraint error even though in principle it + has enough information to determine that a serialization problem + exists. While it's recommendable to just + retry <literal>40001</literal> errors unconditionally, more care is + needed when retrying these other error codes, since they might + represent persistent error conditions rather than transient failures. + </para> + + <para> + It is important to retry the complete transaction, including all logic + that decides which SQL to issue and/or which values to use. + Therefore, <productname>PostgreSQL</productname> does not offer an + automatic retry facility, since it cannot do so with any guarantee of + correctness. + </para> + + <para> + Transaction retry does not guarantee that the retried transaction will + complete; multiple retries may be needed. In cases with very high + contention, it is possible that completion of a transaction may take + many attempts. In cases involving a conflicting prepared transaction, + it may not be possible to make progress until the prepared transaction + commits or rolls back. + </para> + </sect1> + <sect1 id="mvcc-caveats"> <title>Caveats</title>
On Thu, 24 Mar 2022 at 16:29, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Simon Riggs <simon.riggs@enterprisedb.com> writes: > > On Thu, 24 Mar 2022 at 14:56, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Um, what's that got to do with it? The example in > >> read-write-unique-4.spec involves only a single pkey constraint. > > > Yes, but as you explained, its not actually a serializable case, it > > just looks a bit like one. > > > That means we are not currently aware of any case where the situation > > is serializable but the error message is uniqueness violation, unless > > we have 2 or more unique constraints and/or an exclusion constraint. > > Meh. I'm disinclined to document it at that level of detail, both > because it's subject to change and because we're not sure that that > list is exhaustive. I think a bit of handwaving is preferable. > How about the attached? (Only the third new para is different.) It's much better, thanks. -- Simon Riggs http://www.EnterpriseDB.com/
Simon Riggs <simon.riggs@enterprisedb.com> writes: > On Thu, 24 Mar 2022 at 16:29, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> How about the attached? (Only the third new para is different.) > It's much better, thanks. Pushed then. regards, tom lane