Re: CREATE ASSERTION: database level assertions feature - Mailing list pgsql-hackers
| From | Joe Wildish |
|---|---|
| Subject | Re: CREATE ASSERTION: database level assertions feature |
| Date | |
| Msg-id | 9ab31430-fa94-4aae-87fd-4454a5246ec7@app.fastmail.com Whole thread Raw |
| In response to | CREATE ASSERTION: database level assertions feature (Marcos Magueta <maguetamarcos@gmail.com>) |
| List | pgsql-hackers |
Hi Marcos,
I have been interested in this topic for several years and can share what I know
about it.
> I can foresee that in a naive implementation of tracking all the objects that
> involve a constraint (likely how views do) and evaluating the constraint
> everytime something is modified on such objects, might incur into a
> performance hit for those using it, but I believe that it gives one
> extraordinary ergonomics, expressive and descriptive power that otherwise goes
> hidden with, say, triggers.
There are a few papers and other resources out there that describe methods aimed
at making a viable implementation. I'll list some at the end of this message. I
can briefly describe here the gist of the techniques. (This is also how Oracle
have implemented it).
Leaving aside the concurrency aspects for a moment, I would say there are two
main areas of concern.
1. when should an assertion be re-checked, and;
2. what should actually be executed to perform a re-check.
For point 1, it boils down to discovering any constants that may appear in the
assertion expression, and also discovering the polarity under which a table
reference appears in the assertion expression. By way of example, imagine the
following:
CREATE ASSERTION con1 CHECK
(NOT EXISTS (SELECT n FROM r WHERE x = 'Z' EXCEPT SELECT n FROM s))
There is no need to re-check this expression for mutations of table "r" unless
that mutation involves a row whose value of "r.x" is equal to "Z". Moreover,
there is no need to re-check mutations of table "r" unless the triggering
operation was INSERT or UPDATE, and likewise a mutation of table "s" only needs
to be re-checked for DELETE or UPDATE. This is because the polarity under which
"r" occurs is negative and for "s" it is positive. This can be determined with
a query tree walker where the polarity is toggled when you pass through negating
nodes (e.g. EXCEPT, NOT). References appearing under OUTER JOIN have both
polarities. Discovery of constants is already implemented of course in the
planner.
For point 2, things get a bit more complicated. The gist is that universally
quantified variables appearing in the original expression can have their
transaction-specific values substituted into it, and this modified expression,
rather than the original one, can be used for re-check purposes. This has the
effect of dramatically reducing the scope of the expression so that it queries
only the minimal number of rows to determine the truth of the constraint. In
other words, it makes the re-check be incremental.
Continuing with the example above, we can can eliminate the use of EXCEPT to
better see the involved variables:
NOT EXISTS
(SELECT FROM r
WHERE x = 'Z'
AND NOT EXISTS
(SELECT FROM s
WHERE r.n = s.n))
As the table reference "r" appears under NOT EXISTS (which is equivalent to the
universal quantifier), the variable "r.n" is universally quantified. Moreover,
"s.n" is equal to "r.n", so "s.n" is also universally quantified. We can
therefore focus any re-check expressions on the specific values of "r.n" and
"s.n" for any given triggering transaction. That can be done using the trigger
transition tables.
Bringing all that together then, the following circumstances and expressions are
sufficient to enforce the constraint. Essentially, query the relevant transition
table, with it filtered for any discovered constants, and also filtered with a
negated version of the expression, where the universal variables have been bound
to the relevant column in the transition table. If the queries yield any rows
then the assertion does not hold and an error should be thrown to the caller.
-- On INSERT or UPDATE of "r"
SELECT * FROM inserted_r
WHERE inserted_r.x = 'Z'
AND EXISTS
(SELECT FROM r
WHERE inserted_r.n = r.n
AND x = 'Z'
AND NOT EXISTS
(SELECT FROM s
WHERE inserted_r.n = s.n
AND r.n = s.n))
-- On DELETE or UPDATE of "s"
SELECT * FROM deleted_s
WHERE EXISTS
(SELECT FROM r
WHERE deleted_s.n = r.n
AND x = 'Z'
AND NOT EXISTS
(SELECT FROM s
WHERE deleted_s.n = s.n
AND r.n = s.n))
This is the gist of the techniques I have read about. There are some other
optimisations for specific types of query (e.g. aggregates), plus things like
avoiding UPDATE re-checks unless an involved column from the expression has
actually changed. But I won't go into them here.
I said earlier there are concurrency considerations. The technique would work
under SERIALIZABLE because the re-check expressions have predicates in them that
focus on only the involved rows, but obviously that isn't a widely used
isolation level. The Oracle guys realised this too, so they spent some time
figuring out how to get concurrency implemented under READ COMMITTED. There is
a published paper on how to do this which exploits the universal variable
insight above. This is already a long e-mail so I won't go into details (let me
know if you are interested though and I can try to write up that information).
> I would be willing to at least get started with a patch for this, but before
> that, I want to assess the interest and thoughts on how to properly implement
> it.
I had a go back in 2015 when I was still learning about it all. But it wasn't a
serious attempt. Peter E actually did a patch back in 2013, where he did much
of the plumbing work for things like parsing of the statement, changes to all
the places that make reference to constraint types, tab completion, etc. I took
that and tried to implement the polarity discovery as described above. At
that time, I didn't know about the universal variables, and was also unfamiliar
with the PG codebase, so if you are wanting a decent starting point, I would
rebase Peter's patch against master, and go from there. I occasionally return
to this topic myself, and have some ideas on how to implement, but I think it
would be good to get some buy-in from a committer or other senior member of the
community first if you are going to expend serious effort.
> If there's already a thread on the issue, forgive me, but I couldn't find it
> in the archives.
Here are links to the two threads I mentioned:
https://www.postgresql.org/message-id/flat/FD62F1DE-4227-4D67-A1A5-C17350B8CAA1%40elusive.cx
https://www.postgresql.org/message-id/1384486216.5008.17.camel@vanquo.pezone.net
Book of interest:
Applied Mathematics for Database Professionals
Lex de Hann and Toon Koppelaars
(specifically chapter 11, "Implementing Database Designs in Oracle")
BTW, Toon now works at Oracle and is the architect of their solution.
Papers of interest:
Translating Advanced Integrity Checking Technology to SQL Database
by Hendrik Decker
Logic for Improving Integrity Checking
by J.M.Nicolas
Deriving Production Rules for Constraint Maintenance
by Ceri & Widom
The Decker paper is basically a very detailed description of what I have
outlined here. Hendrik worked a lot with deductive database systems from what I
can tell, so it's all written in terms of pure predicate logic, but in the paper
he does a good job of explaining how that applies to SQL.
Cheers,
-Joe
pgsql-hackers by date: