Thread: Implementing SQL ASSERTION
Hi all,

I’m wondering if there are other people out there working on implementing SQL ASSERTION functionality?

I’ve recently spent a bit of time looking to implement the execution models described in “Applied Mathematics for Database Professionals” by Toon Koppelaars and Lex de Haan. I’ve gotten as far as execution model 3 and am now looking at deriving polarity of involved tables to do EM4 (described in some detail in “Deriving Production Rules for Constraint Maintenance”, Ceri & Widom, VLDB Conference 1990, p555-577). EM5 & EM6 look rather more difficult but I’m intending to try and implement those, too.

If there are other people working on this stuff it would be great to collaborate.

Regards,
-Joe
On Thu, Apr 30, 2015 at 6:36 PM, Joe Wildish <joe-postgresql.com@elusive.cx> wrote:
> I’m wondering if there are other people out there working on implementing SQL ASSERTION functionality?
>
> I’ve recently spent a bit of time looking to implement the execution models described in “Applied Mathematics for Database Professionals” by Toon Koppelaars and Lex de Haan. I’ve gotten as far as execution model 3 and am now looking at deriving polarity of involved tables to do EM4 (described in some detail in “Deriving Production Rules for Constraint Maintenance”, Ceri & Widom, VLDB Conference 1990, p555-577). EM5 & EM6 look rather more difficult but I’m intending to try and implement those, too.
>
> If there are other people working on this stuff it would be great to collaborate.

I don't know of anyone working on this. It sounds very difficult.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 1 May 2015, at 19:51, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Apr 30, 2015 at 6:36 PM, Joe Wildish
> <joe-postgresql.com@elusive.cx> wrote:
>> I’m wondering if there are other people out there working on implementing SQL ASSERTION functionality?
>>
>> I’ve recently spent a bit of time looking to implement the execution models described in “Applied Mathematics for Database Professionals” by Toon Koppelaars and Lex de Haan. I’ve gotten as far as execution model 3 and am now looking at deriving polarity of involved tables to do EM4 (described in some detail in “Deriving Production Rules for Constraint Maintenance”, Ceri & Widom, VLDB Conference 1990, p555-577). EM5 & EM6 look rather more difficult but I’m intending to try and implement those, too.
>>
>> If there are other people working on this stuff it would be great to collaborate.
>
> I don't know of anyone working on this. It sounds very difficult.

The book I mention details a series of execution models, where each successive model aims to validate the assertion in a more efficient manner than the last. This is achieved by performing static analysis of the assertion's expression to determine under what circumstances the assertion need be (re)checked. Briefly:

EM1: after all DML statements;
EM2: only after DML statements involving tables mentioned in the assertion expression;
EM3: only after DML statements involving the columns mentioned in the assertion expression;
EM4: only after DML statements involving the columns, plus if the statement has a “polarity” that may affect the assertion expression.

“Polarity” here means that one is able to (statically) determine if only INSERTs and not DELETEs can affect an expression, or vice-versa.

EMs 5 and 6 are further enhancements that make use of querying the “transition effect” data of what actually changed in a statement, to determine if the assertion expression need be validated. I’ve not done as much reading around this topic yet so am concentrating on EMs 1-4.

I agree it is a difficult problem but there are a fair number of published academic papers relating to this topic. The AM4DP book draws a lot of this research together and presents the execution models.

I may start writing up on a blog of where I get to, and then post further to this list, if there is interest.

Regards,
-Joe
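For illustration, a minimal sketch of what “polarity” means in practice (assuming the SQL-standard CREATE ASSERTION syntax, which stock PostgreSQL does not accept; table and constraint names are made up):

CREATE TABLE orders (
  id  INTEGER PRIMARY KEY,
  qty INTEGER NOT NULL CHECK (qty > 0)
);

CREATE ASSERTION total_qty_bounded CHECK
  (1000 >= (SELECT COALESCE(SUM(qty), 0) FROM orders));

-- With qty constrained to be positive, only an INSERT (or an UPDATE that
-- increases qty) can raise the sum; a DELETE can never invalidate the
-- expression, so under EM4 no re-check is needed after a DELETE.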
On Sat, May 02, 2015 at 10:42:24PM +0100, Joe Wildish wrote:
> > On 1 May 2015, at 19:51, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Apr 30, 2015 at 6:36 PM, Joe Wildish
> > <joe-postgresql.com@elusive.cx> wrote:
> >> I’m wondering if there are other people out there working on implementing SQL ASSERTION functionality?
> >>
> >> I’ve recently spent a bit of time looking to implement the execution models described in “Applied Mathematics for Database Professionals” by Toon Koppelaars and Lex de Haan. I’ve gotten as far as execution model 3 and am now looking at deriving polarity of involved tables to do EM4 (described in some detail in “Deriving Production Rules for Constraint Maintenance”, Ceri & Widom, VLDB Conference 1990, p555-577). EM5 & EM6 look rather more difficult but I’m intending to try and implement those, too.
> >>
> >> If there are other people working on this stuff it would be great to collaborate.
> >
> > I don't know of anyone working on this. It sounds very difficult.
>
> The book I mention details a series of execution models, where each successive model aims to validate the assertion in a more efficient manner than the last. This is achieved by performing static analysis of the assertion's expression to determine under what circumstances the assertion need be (re)checked. Briefly:
>
> EM1: after all DML statements;
> EM2: only after DML statements involving tables mentioned in the assertion expression;
> EM3: only after DML statements involving the columns mentioned in the assertion expression;
> EM4: only after DML statements involving the columns, plus if the statement has a “polarity” that may affect the assertion expression.
>
> “Polarity” here means that one is able to (statically) determine if only INSERTs and not DELETEs can affect an expression, or vice-versa.
>
> EMs 5 and 6 are further enhancements that make use of querying the “transition effect” data of what actually changed in a statement, to determine if the assertion expression need be validated. I’ve not done as much reading around this topic yet so am concentrating on EMs 1-4.
>
> I agree it is a difficult problem but there are a fair number of published academic papers relating to this topic. The AM4DP book draws a lot of this research together and presents the execution models.
>
> I may start writing up on a blog of where I get to, and then post further to this list, if there is interest.

I suspect that you would get a lot further with a PoC patch including
the needed documentation.  Remember to include how this would work at
all the transaction isolation levels and combinations of same that we
support.  Recall also to include the lock strength needed.  Just about
anything can be done with a database-wide lock :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
> On 3 May 2015, at 02:42, David Fetter <david@fetter.org> wrote:
>
> On Sat, May 02, 2015 at 10:42:24PM +0100, Joe Wildish wrote:
>>
>> I may start writing up on a blog of where I get to, and then post further to this list, if there is interest.
>
> I suspect that you would get a lot further with a PoC patch including
> the needed documentation.  Remember to include how this would work at
> all the transaction isolation levels and combinations of same that we
> support.  Recall also to include the lock strength needed.  Just about
> anything can be done with a database-wide lock :)

Thanks David. I’m obviously new here so I’m not that familiar with how one starts contributing.

Once I get to a decent level with the EM4 PoC I will post the details to this list. The general idea is that upon assertion creation, the expression is analysed to determine when it needs to be validated — corresponding internal “after statement” triggers are then created. There will definitely need to be some serialisation taking place on the basis of when an assertion has been validated, but I’ve not got that far yet. I’ll be sure to include the details when I post though.

Regards,
-Joe
On 4/30/15 6:36 PM, Joe Wildish wrote:
> I’m wondering if there are other people out there working on implementing SQL ASSERTION functionality?

I was the last one, probably: <http://www.postgresql.org/message-id/1384486216.5008.17.camel@vanquo.pezone.net>. I intend to pick up that work sometime, but feel free to review the thread for a start. The main question was how to manage transaction isolation.
Hackers,

Attached is a WIP patch for SQL ASSERTION. I am posting it for anyone who might be interested in seeing it, for comments/feedback, and to see if others are keen to collaborate on taking it further. It is not near production-ready (see thoughts on that below).

The patch builds on the work posted by Peter back in 2013. I've taken his code and updated it to conform to some general changes made to the codebase since then. The bulk of the new work I have done is around when an assertion needs to be checked. Essentially it is an implementation of the algorithm described by Ceri & Widom in "Deriving Production Rules for Constraint Maintenance" — http://infolab.stanford.edu/pub/papers/constraint-maintenance.ps

The general idea is to traverse the expression tree and derive the set of potentially invalidating operations. These operations are used to determine when the constraint trigger fires and causes a re-check. The detail is in the paper but some examples are:

* insertion into the subject of an exists cannot be invalidating;
* deletion from the subject of a not exists cannot be invalidating;
* update of columns in the target list of an exists cannot be invalidating;
* certain combinations of aggregates with comparison operations cannot be invalidating.

As an example of the last point, the expression "CHECK (10 > (SELECT COUNT(*) FROM t))" cannot be invalidated by a delete or an update but can be invalidated by an insert.

I have implemented most of the optimisations mentioned in the paper. There are one or two that I am unsure about, specifically how to deal with set operations that are the subject of an exists. According to the paper, these are optimisable when they're the subject of an exists, but I think it is only applicable for union and not intersect or except, so I have skipped that particular optimisation for the time being.

The algorithm works under the assumption that when a recheck occurs the previous check result was true (the research report by Ceri & Widom does acknowledge this assumption). However, unfortunately the SQL specification requires that both true and unknown be valid results for an assertion's check expression. This doesn't play too well with the algorithm, so for the time being I have disallowed null. I think the solution here may be that when a null result for a check occurs, the assertion is changed to trigger on all operations against the involved tables; once it returns to true, the triggers can be returned to fire only on the derived invalidating operations. More thought required though. (Note: having just written this paragraph, I've realised I can't right now think of a concrete example illustrating the point, so it may be that I'm wrong on this.)

The paper does mention a set of optimisations that I have not yet attempted to implement. These are essentially the technique of evaluating the expression against the deltas of a change rather than the full tables. Clearly there is a large overlap with incremental maintenance of views, and in fact the two authors of the paper have a similarly named paper called "Deriving Production Rules for Incremental View Maintenance". Although I have yet to finish reviewing all the literature on the subject, I suspect that realistically, for this to make it into production, we'd need some implementation of these techniques to make the performance palatable.

Cheers,
-Joe
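For illustration, the COUNT example written out in full (a sketch assuming the CREATE ASSERTION syntax used elsewhere in this thread; the table is made up):

CREATE TABLE t (x INTEGER PRIMARY KEY);

-- Only an INSERT into t can increase COUNT(*), so INSERT is the only
-- potentially invalidating operation the analysis derives here; DELETE
-- and UPDATE statements against t never trigger a re-check.
CREATE ASSERTION t_has_fewer_than_ten_rows CHECK
  (10 > (SELECT COUNT(*) FROM t));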
Hello Joe,

Just a reaction to the example, which is maybe addressed in the patch, which I have not investigated.

> * certain combinations of aggregates with comparison operations cannot
> be invalidating.
>
> As an example of the last point, the expression "CHECK (10 > (SELECT
> COUNT(*) FROM t))" cannot be invalidated by a delete or an update but
> can be invalidated by an insert.

I'm wondering about the effect of MVCC on this: if the check is performed when the INSERT is done, concurrent inserting transactions would count the current status, which would be ok, but on commit all concurrent inserts would be there and the count could not be ok anymore?

Maybe if the check was deferred, but this is not currently possible with pg (eg the select can simply be put in a function), and I think there might be race conditions.

ISTM that such a check would imply non-trivial locking to be okay; it is not just a matter of deciding whether to invoke the check or not.

-- 
Fabien.
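A sketch of the race being described (two hypothetical concurrent sessions, each checking the assertion against its own snapshot at statement time):

-- t currently holds 8 rows.
-- Session A: INSERT INTO t VALUES (...);  -- A's re-check sees 9 rows, 10 > 9 holds
-- Session B: INSERT INTO t VALUES (...);  -- B's re-check also sees 9 rows (A not yet committed)
-- Both commit: t now holds 10 rows, "10 > (SELECT COUNT(*) FROM t)" is false,
-- yet neither transaction's check failed.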
Hi Fabien,
Yes, there was quite a bit of discussion in the original thread about concurrency. See here:
>> * certain combinations of aggregates with comparison operations cannot be invalidating.
>>
>> As an example of the last point, the expression "CHECK (10 > (SELECT COUNT(*) FROM t))" cannot be invalidated by a delete or an update but can be invalidated by an insert.
>
> I'm wondering about the effect of MVCC on this: if the check is performed when the INSERT is done, concurrent inserting transactions would count the current status which would be ok, but on commit all concurrent inserts would be there and the count could not be ok anymore?
The patch doesn’t attempt to address concurrency (beyond the obvious benefit of reducing the circumstances under which the assertion is checked). I am working under the assumption that we will find some acceptable way for that to be resolved :-) And at the moment, working in serialisable mode addresses this issue. I think that is suggested in the thread actually (essentially, if you want to use assertions, you require that transactions be performed at serialisable isolation level).
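A minimal sketch of that workaround as it stands today (plain PostgreSQL syntax, reusing the hypothetical table t from the COUNT example; nothing assertion-specific):

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
INSERT INTO t (x) VALUES (42);
COMMIT;  -- under SSI this may fail with a serialization error (SQLSTATE 40001)
         -- and the application must be prepared to retry the transaction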
> Maybe if the check was deferred, but this is not currently possible with pg (eg the select can simply be put in a function), and I think there might be race conditions. ISTM that such a check would imply non-trivial locking to be okay; it is not just a matter of deciding whether to invoke the check or not.
I traverse into SQL functions so that the analysis can capture invalidating operations from the expression inside the function. Only internal and SQL functions are considered legal. Other languages are rejected.
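For illustration, a sketch of the sort of SQL function the analysis can see through (hypothetical names, and assuming the patch's CREATE ASSERTION syntax; a PL/pgSQL body could not be analysed this way and would be rejected):

CREATE TABLE employee (id INTEGER PRIMARY KEY, dept INTEGER NOT NULL);

CREATE FUNCTION dept_headcount(d INTEGER) RETURNS BIGINT
LANGUAGE SQL STABLE AS
$$ SELECT COUNT(*) FROM employee WHERE dept = d $$;

-- The traversal descends into the function body, so INSERT/DELETE on
-- employee and UPDATE of employee.dept are still derived as potentially
-- invalidating operations for this assertion.
CREATE ASSERTION small_departments CHECK
  (NOT EXISTS (SELECT 1 FROM employee e WHERE dept_headcount(e.dept) > 100));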
-Joe
>> I'm wondering about the effect of MVCC on this: if the check is
>> performed when the INSERT is done, concurrent inserting transactions
>> would count the current status which would be ok, but on commit all
>> concurrent inserts would be there and the count could not be ok
>> anymore?
>
> The patch doesn’t attempt to address concurrency (beyond the obvious
> benefit of reducing the circumstances under which the assertion is
> checked). I am working under the assumption that we will find some
> acceptable way for that to be resolved :-) And at the moment, working in
> serialisable mode addresses this issue. I think that is suggested in the
> thread actually (essentially, if you want to use assertions, you require
> that transactions be performed at serialisable isolation level).

Thanks for the pointers. The "serializable" isolation level restriction sounds reasonable.

-- 
Fabien.
On Mon, Jan 15, 2018 at 03:40:57PM +0100, Fabien COELHO wrote:
>
>>> I'm wondering about the effect of MVCC on this: if the check is
>>> performed when the INSERT is done, concurrent inserting transactions
>>> would count the current status which would be ok, but on commit all
>>> concurrent inserts would be there and the count could not be ok anymore?
>
>> The patch doesn’t attempt to address concurrency (beyond the obvious
>> benefit of reducing the circumstances under which the assertion is
>> checked). I am working under the assumption that we will find some
>> acceptable way for that to be resolved :-) And at the moment, working in
>> serialisable mode addresses this issue. I think that is suggested in the
>> thread actually (essentially, if you want to use assertions, you require
>> that transactions be performed at serialisable isolation level).
>
> Thanks for the pointers. The "serializable" isolation level restriction
> sounds reasonable.

It sounds reasonable enough that I'd like to make a couple of Modest
Proposals™, to wit:

- We follow the SQL standard and make SERIALIZABLE the default
  transaction isolation level, and

- We disallow writes at isolation levels other than SERIALIZABLE when
  any ASSERTION could be in play.

That latter could range in implementation from crashingly unsubtle to
very precise.

Crashingly Unsubtle: Disallow writes at any isolation level other than
SERIALIZABLE.

Very Precise: Disallow writes at any other isolation level when the
ASSERTION could come into play, using the same machinery that enforces
the ASSERTION in the first place.

What say?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
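For illustration, the first proposal can already be expressed with existing settings (mydb and app_user are placeholder names; the per-ASSERTION enforcement in the second proposal would need new code):

-- postgresql.conf
default_transaction_isolation = 'serializable'

-- or per database / per role:
ALTER DATABASE mydb SET default_transaction_isolation = 'serializable';
ALTER ROLE app_user SET default_transaction_isolation = 'serializable';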
Hi David,
> On 15 Jan 2018, at 16:35, David Fetter <david@fetter.org> wrote:
>
> It sounds reasonable enough that I'd like to make a couple of Modest
> Proposals™, to wit:
>
> - We follow the SQL standard and make SERIALIZABLE the default
>   transaction isolation level, and
>
> - We disallow writes at isolation levels other than SERIALIZABLE when
>   any ASSERTION could be in play.
Certainly it would be easy to put a test into the assertion check function to require the isolation level be serialisable. I didn’t realise that that was also the default level as per the standard. That need not necessarily be changed, of course; it would be obvious to the user that it was a requirement as the creation of an assertion would fail without it, as would any subsequent attempts to modify the involved tables.
-Joe
On Mon, Jan 15, 2018 at 09:14:02PM +0000, Joe Wildish wrote:
> Hi David,
>
>> On 15 Jan 2018, at 16:35, David Fetter <david@fetter.org> wrote:
>>
>> It sounds reasonable enough that I'd like to make a couple of Modest
>> Proposals™, to wit:
>>
>> - We follow the SQL standard and make SERIALIZABLE the default
>>   transaction isolation level, and
>>
>> - We disallow writes at isolation levels other than SERIALIZABLE when
>>   any ASSERTION could be in play.
>
> Certainly it would be easy to put a test into the assertion check
> function to require the isolation level be serialisable. I didn’t
> realise that that was also the default level as per the standard.
> That need not necessarily be changed, of course; it would be obvious
> to the user that it was a requirement as the creation of an
> assertion would fail without it, as would any subsequent attempts to
> modify the involved tables.

This patch no longer applies. Any chance of a rebase?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Hi David,

> This patch no longer applies. Any chance of a rebase?

Of course. I’ll look at it this weekend.

Cheers,
-Joe
On Thu, Mar 08, 2018 at 09:11:58PM +0000, Joe Wildish wrote:
> Hi David,
>
>> This patch no longer applies. Any chance of a rebase?
>
> Of course. I’ll look at it this weekend.

Much appreciated!

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Mon, Jan 15, 2018 at 11:35 AM, David Fetter <david@fetter.org> wrote:
> - We follow the SQL standard and make SERIALIZABLE the default
>   transaction isolation level, and

The consequences of such a decision would include:

- pgbench -S would run up to 10x slower, at least if these old
  benchmark results are still valid:

  https://www.postgresql.org/message-id/CA+TgmoZog1wFbyrqzJUkiLSXw5sDUjJGUeY0c2BqSG-tciSB7w@mail.gmail.com

- pgbench without -S would fail outright, because it doesn't have
  provision to retry failed transactions.

  https://commitfest.postgresql.org/16/1419/

- Many user applications would probably also experience similar difficulties.

- Parallel query would no longer work by default, unless this patch
  gets committed:

  https://commitfest.postgresql.org/17/1004/

I think a good deal of work to improve the performance of serializable
would need to be done before we could even think about making it the
default -- and even then, the fact that it really requires the
application to be retry-capable seems like a pretty major obstacle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 10, 2018 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 15, 2018 at 11:35 AM, David Fetter <david@fetter.org> wrote:
>> - We follow the SQL standard and make SERIALIZABLE the default
>>   transaction isolation level, and
>
> The consequences of such a decision would include:
>
> - pgbench -S would run up to 10x slower, at least if these old
>   benchmark results are still valid:
>
>   https://www.postgresql.org/message-id/CA+TgmoZog1wFbyrqzJUkiLSXw5sDUjJGUeY0c2BqSG-tciSB7w@mail.gmail.com
>
> - pgbench without -S would fail outright, because it doesn't have
>   provision to retry failed transactions.
>
>   https://commitfest.postgresql.org/16/1419/
>
> - Many user applications would probably also experience similar difficulties.
>
> - Parallel query would no longer work by default, unless this patch
>   gets committed:
>
>   https://commitfest.postgresql.org/17/1004/
>
> I think a good deal of work to improve the performance of serializable
> would need to be done before we could even think about making it the
> default -- and even then, the fact that it really requires the
> application to be retry-capable seems like a pretty major obstacle.

Also:

- It's not available on hot standbys. Experimental patches have been
  developed based on the read only safe snapshot concept, but some
  tricky problems remain unsolved.

- Performance is terrible (conflicts are maximised) if you use any
  index type except btree, unless some of these get committed:

  https://commitfest.postgresql.org/17/1172/
  https://commitfest.postgresql.org/17/1183/
  https://commitfest.postgresql.org/17/1466/

-- 
Thomas Munro
http://www.enterprisedb.com
>> This patch no longer applies. Any chance of a rebase?

Attached is a rebased version of this patch. It takes into account the ACL checking changes and a few other minor amendments.

Cheers,
-Joe
On Sun, Mar 18, 2018 at 12:29:50PM +0000, Joe Wildish wrote:
>>> This patch no longer applies. Any chance of a rebase?
>
> Attached is a rebased version of this patch. It takes into account the ACL checking changes and a few other minor amendments.

Thanks!

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Sun, Mar 18, 2018 at 12:29:50PM +0000, Joe Wildish wrote:
>>> This patch no longer applies. Any chance of a rebase?
>
> Attached is a rebased version of this patch. It takes into account
> the ACL checking changes and a few other minor amendments.

Sorry to bother you again, but this now doesn't compile atop master.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 28 Mar 2018, at 16:13, David Fetter <david@fetter.org> wrote:
>
> Sorry to bother you again, but this now doesn't compile atop master.

Attached is a rebased patch for the prototype.

Cheers,
-Joe
On Sun, Apr 29, 2018 at 07:18:00PM +0100, Joe Wildish wrote:
> On 28 Mar 2018, at 16:13, David Fetter <david@fetter.org> wrote:
>>
>> Sorry to bother you again, but this now doesn't compile atop master.
>
> Attached is a rebased patch for the prototype.

Thanks! This is great timing for the 12 cycle :)

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 29/04/2018 20:18, Joe Wildish wrote:
> On 28 Mar 2018, at 16:13, David Fetter <david@fetter.org> wrote:
>>
>> Sorry to bother you again, but this now doesn't compile atop master.
>
> Attached is a rebased patch for the prototype.

I took a look at this.

This has been lying around for a few months, so it will need to be rebased again. I applied this patch on top of 68e7e973d22274a089ce95200b3782f514f6d2f8, which was the HEAD around the time this patch was created, and it applies cleanly there.

Please check your patch for whitespace errors:

warning: squelched 13 whitespace errors
warning: 18 lines add whitespace errors.

Also, reduce the amount of useless whitespace changes in the patch.

There are some compiler warnings:

constraint.c: In function 'CreateAssertion':
constraint.c:1211:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
constraint.c: In function 'oppositeDmlOp':
constraint.c:458:1: error: control reaches end of non-void function [-Werror=return-type]

The version check in psql's describeAssertions() needs to be updated. Also, you should use formatPGVersionNumber() to cope with two-part and one-part version numbers.

All this new code in constraint.c that checks the assertion expression needs more comments and documentation.

Stuff like this isn't going to work:

static int
funcMaskForFuncOid(Oid funcOid)
{
    char *name = get_func_name(funcOid);

    if (name == NULL)
        return OTHER_FUNC;
    else if (strncmp(name, "min", strlen("min")) == 0)
        return MIN_AGG_FUNC;
    else if (strncmp(name, "max", strlen("max")) == 0)
        return MAX_AGG_FUNC;

You can't assume from the name of a function what it's going to do. Solving this properly might be hard.

The regression test crashes for me around here:

frame #4: 0x000000010d3a4cdc postgres`castNodeImpl(type=T_SubLink, ptr=0x00007ff27006d230) at nodes.h:582
frame #5: 0x000000010d3a61c6 postgres`visitSubLink(node=0x00007ff270034040, info=0x00007ffee2a23930) at constraint.c:843

This ought to be reproducible for you if you build with assertions.

My feeling is that if we want to move forward on this topic, we need to solve the concurrency question first. All these optimizations for when we don't need to check the assertion are cool, but they are just optimizations that we can apply later on, once we have solved the critical problems.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Peter,

> On 24 Sep 2018, at 15:06, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 29/04/2018 20:18, Joe Wildish wrote:
>>
>> Attached is a rebased patch for the prototype.
>
> I took a look at this.

Thank you for reviewing.

> This has been lying around for a few months, so it will need to be
> rebased again.
>
> 8< - - - snipped for brevity - - - 8<
>
> All this new code in constraint.c that checks the assertion expression
> needs more comments and documentation.

All agreed. I’ll give the patch some TLC and get a new version that addresses the above.

> Stuff like this isn't going to work:
>
> static int
> funcMaskForFuncOid(Oid funcOid)
> {
>     char *name = get_func_name(funcOid);
>
>     if (name == NULL)
>         return OTHER_FUNC;
>     else if (strncmp(name, "min", strlen("min")) == 0)
>         return MIN_AGG_FUNC;
>     else if (strncmp(name, "max", strlen("max")) == 0)
>         return MAX_AGG_FUNC;
>
> You can't assume from the name of a function what it's going to do.
> Solving this properly might be hard.

Agreed. My assumption was that we would record in the data dictionary the behaviour (or “polarity”) of each aggregate function with respect to the various operators. Column in pg_aggregate? I don’t know how we’d record it exactly. A bitmask would be a possibility. Also, I don’t know what we’d do with custom aggregate functions (or indeed custom operators). Allowing end users to determine the value would potentially lead to assertion checks being incorrectly skipped. Maybe we’d say that custom aggregates always have a neutral polarity and are therefore not subject to this optimisation.

> This ought to be reproducible for you if you build with assertions.

Yes. I shall correct this when I do the aforementioned rebase and application of TLC.

> My feeling is that if we want to move forward on this topic, we need to
> solve the concurrency question first. All these optimizations for when
> we don't need to check the assertion are cool, but they are just
> optimizations that we can apply later on, once we have solved the
> critical problems.

I obviously agree that the concurrency issue needs solving. But I don’t see that at all as a separate matter from the algos. Far from being merely optimisations, the research indicates we can go a lot further toward reducing the need for rechecks and, therefore, reducing the chance of concurrency conflicts occurring in the first place. This is true regardless of whatever mechanism we use to enforce correct behaviour under concurrent modifications -- e.g. a lock on the ASSERTION object itself, enforced use of SERIALIZABLE, etc.

By way of example (lifted directly from the AM4DP book):

CREATE TABLE employee (
  id   INTEGER PRIMARY KEY,
  dept INTEGER NOT NULL,
  job  TEXT NOT NULL
);

CREATE ASSERTION department_managers_need_administrators CHECK
  (NOT EXISTS
    (SELECT dept
     FROM employee a
     WHERE EXISTS (SELECT *
                   FROM employee b
                   WHERE a.dept = b.dept
                     AND b.job IN ('Manager', 'Senior Manager'))
       AND NOT EXISTS (SELECT *
                       FROM employee b
                       WHERE a.dept = b.dept
                         AND b.job = 'Administrator')));

The current implementation derives "DELETE(employee), INSERT(employee) and UPDATE(employee.dept, employee.job)" as the set of invalidating operations and triggers accordingly.
However, in this case, we can supplement the triggers by having them inspect the transition tables to see if the actual data from the triggering DML statement could in fact affect the truth of the expression: specifically, only do the recheck on DELETE of an "Administrator", INSERT of a "Manager" or "Senior Manager", or UPDATE when the new job is a "Manager" or "Senior Manager" or the old job was an "Administrator".

Now, if this is a company with 10,000 employees then it presumably only requires a handful of managers, right? ;-) So the potential for a concurrency conflict is massively reduced compared to rechecking every time the employee table is touched.

(This optimisation has some caveats and is reliant upon being able to derive the key of an expression from the underlying base tables, plus some stuff about functional dependencies. I have started work on it but sadly not had time to progress it in recent months.)

Having said all that: there are obviously going to be some expressions that cannot be proven to have no potential for invalidating the assertion truth. I guess this is the prime concern from a concurrency PoV? Example:

CREATE TABLE t (
  b BOOLEAN NOT NULL,
  n INTEGER NOT NULL,
  PRIMARY KEY (b, n)
);

CREATE ASSERTION sum_per_b_less_than_10 CHECK
  (NOT EXISTS
    (SELECT FROM (SELECT b, SUM(n)
                  FROM t
                  GROUP BY b) AS v(b, sum_n)
     WHERE sum_n > 10));

Invalidating operations are "INSERT(t) and UPDATE(t.b, t.n)". I guess the interesting case, from a concurrency perspective, is how do we avoid an INSERT WHERE b IS TRUE from blocking an INSERT WHERE b IS FALSE? I don’t have an answer to that unfortunately. My understanding is that SSI could help in these sorts of cases, but I really haven't read or looked into the detail (yet).

Thoughts?

-Joe
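For illustration, a hand-written approximation of the transition-table re-check described in the message above, using PostgreSQL's existing statement-level triggers with transition tables (hypothetical names; the patch itself creates internal triggers rather than anything user-visible like this):

-- Re-check only when the statement's delta could matter: a manager-type
-- row was added, or an administrator row was removed.
CREATE FUNCTION recheck_dept_managers() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  IF EXISTS (SELECT 1 FROM new_rows
             WHERE job IN ('Manager', 'Senior Manager'))
     OR EXISTS (SELECT 1 FROM old_rows WHERE job = 'Administrator')
  THEN
    IF EXISTS (SELECT 1 FROM employee a
               WHERE EXISTS (SELECT 1 FROM employee b
                             WHERE a.dept = b.dept
                               AND b.job IN ('Manager', 'Senior Manager'))
                 AND NOT EXISTS (SELECT 1 FROM employee b
                                 WHERE a.dept = b.dept
                                   AND b.job = 'Administrator'))
    THEN
      RAISE EXCEPTION 'assertion department_managers_need_administrators violated';
    END IF;
  END IF;
  RETURN NULL;
END $$;

-- Shown for UPDATE only; INSERT and DELETE would need their own triggers
-- referencing just the NEW or OLD transition table respectively.
CREATE TRIGGER employee_assertion_recheck
  AFTER UPDATE ON employee
  REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
  FOR EACH STATEMENT EXECUTE FUNCTION recheck_dept_managers();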
>>>>> "Joe" == Joe Wildish <joe-postgresql.org@elusive.cx> writes: Joe> Agreed. My assumption was that we would record in the data Joe> dictionary the behaviour (or “polarity") of each aggregate Joe> function with respect to the various operators. Column in Joe> pg_aggregate? I don’t know how we’d record it exactly. I haven't looked at the background of this, but if what you want to know is whether the aggregate function has the semantics of min() or max() (and if so, which) then the place to look is pg_aggregate.aggsortop. (For a given aggregate foo(x), the presence of an operator oid in aggsortop means something like "foo(x) is equivalent to (select x from ... order by x using OP limit 1)", and the planner will replace the aggregate by the applicable subquery if it thinks it'd be faster.) As for operators, you can only make assumptions about their meaning if the operator is a member of some opfamily that assigns it some semantics. For example, the planner can assume that WHERE x=y AND x=1 implies that y=1 (assuming x and y are of appropriate types) not because it assumes that "=" is the name of a transitive operator, but because the operators actually selected for (x=1) and (x=y) are both "equality" members of the same btree operator family. Likewise proving that (a>2) implies (a>1) requires knowing that > is a btree comparison op. -- Andrew (irc:RhodiumToad)
On 25/09/2018 01:04, Joe Wildish wrote:
> Having said all that: there are obviously going to be some expressions
> that cannot be proven to have no potential for invalidating the assertion
> truth. I guess this is the prime concern from a concurrency PoV?

Before we spend more time on this, I think we need to have at least a plan for that. Perhaps we should disallow cases that we can't handle otherwise. But even that would need some analysis of which practical cases we can and cannot handle, how we could extend support in the future, etc.

In the meantime, I have committed parts of your gram.y changes that seem to come up every time someone dusts off an assertions patch. Keep that in mind when you rebase.

-- 
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 25, 2018 at 12:04:12AM +0100, Joe Wildish wrote:
> Hi Peter,
>
>> My feeling is that if we want to move forward on this topic, we need to
>> solve the concurrency question first. All these optimizations for when
>> we don't need to check the assertion are cool, but they are just
>> optimizations that we can apply later on, once we have solved the
>> critical problems.
>
> Having said all that: there are obviously going to be some expressions
> that cannot be proven to have no potential for invalidating the assertion
> truth. I guess this is the prime concern from a concurrency PoV? Example:
>
> CREATE TABLE t (
>   b BOOLEAN NOT NULL,
>   n INTEGER NOT NULL,
>   PRIMARY KEY (b, n)
> );
>
> CREATE ASSERTION sum_per_b_less_than_10 CHECK
>   (NOT EXISTS
>     (SELECT FROM (SELECT b, SUM(n)
>                   FROM t
>                   GROUP BY b) AS v(b, sum_n)
>      WHERE sum_n > 10));
>
> Invalidating operations are "INSERT(t) and UPDATE(t.b, t.n)".

So would DELETE(t), assuming n can be negative.

Is there some interesting and fairly easily documented subset of ASSERTIONs that wouldn't have the "can't prove" property?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 26 Sep 2018, at 12:36, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 25/09/2018 01:04, Joe Wildish wrote:
>> Having said all that: there are obviously going to be some expressions
>> that cannot be proven to have no potential for invalidating the assertion
>> truth. I guess this is the prime concern from a concurrency PoV?
>
> Before we spend more time on this, I think we need to have at least a
> plan for that.

Having thought about this some more: the answer could lie in using predicate locks, and enforcing that the transaction be SERIALIZABLE whenever an ASSERTION is triggered.

To make use of the predicate locks we'd do a transformation on the ASSERTION expression. I believe that there is a derivation, similar to the one mentioned up-thread re: "managers and administrators", that would essentially push predicates into the expression on the basis of the changed data. The semantics of the expression would remain unchanged, but it would mean that when the expression is rechecked, the minimal set of data is read and would therefore not conflict with other DML statements that had triggered the same ASSERTION but had modified unrelated data. Example:

CREATE TABLE t (
  n INTEGER NOT NULL,
  m INTEGER NOT NULL,
  k INTEGER NOT NULL,
  PRIMARY KEY (n, m)
);

CREATE ASSERTION sum_k_at_most_10 CHECK
  (NOT EXISTS
    (SELECT *
     FROM (SELECT n, sum(k) FROM t GROUP BY n) AS r(n, ks)
     WHERE ks > 10));

On an INSERT/DELETE/UPDATE of "t", we would transform the inner-most expression of the ASSERTION to have a predicate of "WHERE n = NEW.n". In my experiments I can see that doing so allows concurrent transactions to COMMIT that have modified unrelated segments of "t" (assuming the planner uses an Index Scan). The efficacy of this would be dictated by the granularity of the SIREAD locks; my understanding is that this can be as low as tuple-level in the case where Index Scans are used (and this is borne out in my experiments - i.e. you don't want a Seq Scan).

> Perhaps we should disallow cases that we can't
> handle otherwise. But even that would need some analysis of which
> practical cases we can and cannot handle, how we could extend support in
> the future, etc.

The optimisation I mentioned up-thread, plus the one hypothesised here, both rely on being able to derive the key of an expression from the underlying base tables/other expressions. We could perhaps disallow ASSERTIONs that don't have such properties? Beyond that I think it starts to get difficult (impossible?) to know which expressions are likely to be costly on the basis of static analysis. It could be legitimate to have an ASSERTION defined over what turns out to be a small subset of a very large table, for example.

-Joe
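For illustration, a sketch of the transformed re-check described in the message above, for a statement that touched rows with n = 42 (a hypothetical value). Only the affected group is read, so under SERIALIZABLE the SIREAD locks cover just that index range, assuming an index scan on the (n, m) primary key:

SELECT 1
FROM (SELECT n, sum(k)
      FROM t
      WHERE n = 42
      GROUP BY n) AS r(n, ks)
WHERE ks > 10;

-- Any row returned means the assertion would be violated for that group.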
Hi Andrew,

On 25 Sep 2018, at 01:51, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> I haven't looked at the background of this, but if what you want to know
> is whether the aggregate function has the semantics of min() or max()
> (and if so, which) then the place to look is pg_aggregate.aggsortop.

Thanks for the pointer. I've had a quick look at pg_aggregate, and back at my code, but I think there is more to it than just the sorting property. Specifically, we need to know about the aggregate function when combined with the connectors <, <=, < ANY, <= ANY, < ALL and <= ALL (and their equivalents with ">" and ">="). Also, it looks like COUNT and SUM don't have a sortop (the other aggregates I've catered for do though). When I come to do the rework of the patch I'll take a more in-depth look though, and see if this can be utilised.

> As for operators, you can only make assumptions about their meaning if
> the operator is a member of some opfamily that assigns it some
> semantics.

I had clocked the BT semantics stuff when doing the PoC patch. I have used the "get_op_btree_interpretation" function for determining operator meaning.

-Joe
Hi David,

> On 26 Sep 2018, at 19:47, David Fetter <david@fetter.org> wrote:
>
>> Invalidating operations are "INSERT(t) and UPDATE(t.b, t.n)".
>
> So would DELETE(t), assuming n can be negative.

Oops, right you are. Bug in my implementation :-)

> Is there some interesting and fairly easily documented subset of
> ASSERTIONs that wouldn't have the "can't prove" property?

We can certainly know at the time the ASSERTION is created if we can use the transition table optimisation, as that relies upon the expression being written in such a way that a key can be derived for each expression. We could warn or disallow the creation on that basis.

Ceri & Widom mention this actually in their papers, and their view is that most real-world use cases do indeed allow themselves to be optimised using the transition tables.

-Joe
>>>>> "Joe" == Joe Wildish <joe-postgresql.org@elusive.cx> writes: >> I haven't looked at the background of this, but if what you want to >> know is whether the aggregate function has the semantics of min() or >> max() (and if so, which) then the place to look is >> pg_aggregate.aggsortop. Joe> Thanks for the pointer. I've had a quick look at pg_aggregate, and Joe> back at my code, but I think there is more to it than just the Joe> sorting property. Specifically we need to know about the aggregate Joe> function when combined with connectors <, <=, < ANY, <= ANY, < ALL Joe> and <= ALL (and their equivalents with ">" and ">="). The presence of an aggsortop means "this aggregate function is interchangeable with (select x from ... order by x using OP limit 1)", with all of the semantic consequences that implies. Since OP must be the "<" or ">" member of a btree index opclass, the semantics of its relationships with other members of the same opfamily can be deduced from that. Joe> Also, it looks like COUNT and SUM don't have a sortop Right, because those currently have no semantics that PG needs to know about or describe. -- Andrew (irc:RhodiumToad)
> On Tue, Sep 25, 2018 at 1:04 AM Joe Wildish <joe-postgresql.org@elusive.cx> wrote:
>
> All agreed. I’ll give the patch some TLC and get a new version that
> addresses the above.

Hi,

Just a reminder that the patch still needs to be rebased. Could you please do this? I'm moving the item to the next CF.
Hi,

On 2018-11-29 16:54:14 +0100, Dmitry Dolgov wrote:
>> On Tue, Sep 25, 2018 at 1:04 AM Joe Wildish <joe-postgresql.org@elusive.cx> wrote:
>>
>> All agreed. I’ll give the patch some TLC and get a new version that
>> addresses the above.
>
> Hi,
>
> Just a reminder that the patch still needs to be rebased. Could you please
> do this? I'm moving the item to the next CF.

As nothing has happened, I'm marking this patch as returned with feedback.

Greetings,

Andres Freund