Trying to gain peace of mind by using constraints in PostgreSQL
This time we decided to create a blog post on constraints using one of the most popular PGConf.Online talks explaining how you can try to gain peace of mind by using constraints in PostgreSQL.
“Just a dumb storage” - is it enough?
In the recent few years, software developers more and more often state that their database is just a “dumb storage without any logic”. Why? It happens due to multiple difficulties with deployment, migration, source code control systems, and we shouldn’t forget the mere laziness typical of all humans: “If it seems to work okay, so why implement logic in the DBMS?” One can go create the tables (or delegate this to ORM), and it’s enough.
NoSQL = no brainer?
It’s even simpler with NoSQL databases - you don’t need to create or control anything, no headache at all - everything has already been automated for you. It’s enough if you just pull documents from your storage using their IDs. If you need more complex operations, then a SQL database has advantages. However, very often SQL database usage is limited to the creation of tables and indexes and writing queries for them. Not many companies take further steps and leverage their DBMS to improve business processes. In most cases, the logic seems somewhat redundant.
Database is not just for data, database is for business, too
This approach is very common (let’s not forget that humans are lazy!). However, it’s very naive to avoid something useful because your employees or yourself don’t fancy new skills. A database is very sophisticated storage (read about isolation levels or backup procedures to understand what I mean by saying ‘sophisticated’). However, it is also something that can help you orchestrate your business processes and avoid losses related to organizational issues.
Ecommerce: where constraints could save the day
Of course, if you have just one SKU in stock, and it has been sold twice to two different customers, it won’t ruin your business. You might lose two loyal clients, which is not good. However, it’s not the worst case. In the fall of 2020, I heard about a scandal inside a large retail company. Piles of various goods crowded their warehouses as they were awaiting delivery since spring. It wasn’t clear who should take the responsibility for this, just as always, everyone could be blamed and no one could be punished…
No code is safe forever
If we are stick to the “dumb storage” concept, we cannot be protected from such situations. Moreover, we cannot avoid them - it’s just a matter of time. Errare humanum est (“to err is human”), so even the best software developer will sooner or later leave a bug somewhere. What is even worse, this bug can remain undetected for years, and then the error will finally occur. Even your good old code may surprise you with a sudden crash.
Why is it better to see the database as ‘smart’?
What can be done about such incidents? How can we mitigate these risks? We can control it to a certain extent, but you need smart storage in this case. Yes, we need to get back to what is neglected so often and learn how a database can prevent being put in a clearly incorrect state. Of course, it is quite hard to set things in stone from the very beginning and define various states as “correct” and “incorrect”, but you definitely need to evaluate the right and the wrong. At least, you need to come up with a list of conditions that your database should never experience.
Constraints can save you money
Once I witnessed a software developer calculating the cryptocurrency exchange rates and making a mistake that resulted in sending 300,000 USD to several payees, while the correct amount was equal to just 300 USD. The simplest boring check was missing. “For the defined pair the amount sent should not be larger than 1/10 of the amount received”, - such a constraint could easily save this developer a lot of time and nerves. It is way better to deal with the payment that did not go through than with the payment that went to the wrong place or with the wrong amount.
Performance is not the only god to worship
These real-world stories look too familiar to me. Therefore, usual objections to setting integrity constraints seem rather absurd to yours truly. Of course, constraints do affect performance, but you won’t have to think of selling your kidney if something goes wrong. (Fines at work are never fun.) Is it clear that the younger generation of developers for some reason consider their code super reliable, but let me quote Mike Tyson: "Everybody has a plan until they get punched in the mouth."
Duplicating checks for everything that costs people money
I would go further and strongly recommend at least two duplicating checks for each real-world application working with money or material assets. It might also be important to have separate authors for each of these checks. It would also be great if they were unfamiliar with each other and had different grounds to apply this or that approach. (In the above-mentioned example with cryptocurrency rates, the first check should have provided a rough estimate - “no more than 1/10”, and the second check should have compared the calculated exchange rate and the actual exchange rate to keep the “difference within 2%”, larger fluctuations should have been labeled as “error”.) Why duplicate? Just because one of these checks will get broken sooner or later, and if there’s no backup one, you’ll have to proceed without any checks.
Paranoia is good when it comes to business
Let me share a practical case from the past. Once I was working on the order placement code and needed to make sure that each stock keeping unit (SKU) was ordered only once. So I created a unique index for SKU IDs and applied an additional check to see if the same SKU had been included in previous orders. It looks too insanely paranoid, as the check for uniqueness isn’t very complex, but it isn’t trivial either. My supervisors weren’t overly strict. However, with time my code was rewritten, and my unique index was accidentally replaced by an ordinary one. So the duplicates weren’t forbidden anymore. We managed to discover it only after 10 days. However, no SKUs were sold twice thanks to the additional check. So my paranoia appeared to be practical. I recommend having such double checks though they might be not good for all cases. However, for situations where disruption of a business process means noticeable losses, it is highly prudent to have two or more checks.
Why ‘flexible’ is not always good?
Another common objection to constraints usage is the fact that the database “loses its flexibility”. Well, it does lose its flexibility, but when we apply constraints, a less flexible database is our goal. For example, when we describe a database with adjacent date ranges (“from -infinty till 2020-01-01, from 2020-01-01 till 2020-06-01, from 2020-06-01 till 2021-01-01, from 2021-01-01 till infinity”), it is hard to insert a separate date range between them. To do so, you will either need to update the entire table or apply deferrable checks (in the second case, you will have to change the adjacent rows, not just insert one row). However, this is our goal to avoid non-adjacent date ranges at all costs. We always need to ensure that all date ranges remain adjacent but do not overlap with each other.
My experience says that in real life such strict constraints are often missing, though I would like to see more companies, institutions, and organizations applying them.
Orphaned rows and uncertainty
Usually, each table has a primary key, though I face some tables without it way too often. Not null constraint is a rarer case, but still, it is quite common for a table with a primary key. As for foreign keys, you can find databases without them quite often. If an old and large database doesn’t have foreign keys, you can be sure that its descendant tables have orphaned rows. This is bad news, and moreover, if one row doesn’t have a parent row, how can you be sure that another row has a correct parent row?
“Exotic” constraints for ecommerce
Some constraints are less common (exclude constraints are among such “exotic” examples, though there are many others). They help avoid overlapping ranges or check if json or xml documents comply with the database schema, and these are rare cases. Checks in triggers are intended for non-standard situations, so they are only applied to meet some unique business requirements. Sometimes you cannot get the integrity constraints you need using standard database tools. The good examples may come from ecommerce - “every customer may have no more than three unpaid orders” or “the amount of unpaid orders may not be more than X”. To implement such a constraint, you will need a trigger checking the update or addition a row in the “Orders” table. The trigger should contain a lock - you should either block a row in the transaction to avoid the user’s update, or apply an advisory lock using the hash of user ID as a lock key. Of course, the usage of advisory locks isn’t somewhat standard, however, if you have lots of modification operations, an advisory lock will ensure better performance.
Triggers for banking
You will have to use triggers while checking the values of certain parameters. For example, there are certain codes reflecting the type of organization in the banking industry. There are special codes for banks, commercial non-financial organizations, non-profit organizations, individuals, etc. If you need to check something simple (whether the bank account belongs to an individual), you can use CHECK as a constraint. If you have to deal with a more complex check, you will have to work with a table where all codes for all types of organizations are listed. The good news is that for most cases you will need a very simple trigger.
Assertion: the non-existant heaven
The SQL language standard describes one more useful constraint applied at the database level. It’s ASSERTION. If it was implemented, many constraints would have become trivial. For example, you could have easily set constraints for the number of unpaid orders, amounts of the customer’s unpaid orders, dates of order shipments, etc. Unfortunately, ASSERTION has not been implemented in Postgres. To be fair to Postgres, we need to remark that this constraint hasn’t been implemented in any of the popular databases. If we take a closer look at this issue, we’ll understand that ASSERTION is hard to implement in an efficient way.
Why do you need to write triggers?
So if you need a non-trivial constraints for your business logic, you will have to create additional triggers or run special scripts. (I mean the situations like “no more than three unpaid orders”, “a paid order should not remain unshipped after three days”, etc.) To check if some orders are going to have delays in shipments, you will have to run scripts checking the correctness of shipment dates. Why do we need these additional scripts? Because correct states may become incorrect with time. SKUs intended for shipment cannot stay in the warehouse longer than date X, however, the database doesn’t have built-in triggers able to track time - so this is what you need to implement on your own.
So what the key takeaways should be like in this case? I would name a few:
- We should stop perceiving the database as a “dumb storage”.
- Constraints should be applied to make impossible clearly incorrect operations in the database.
- The database should be able to detect incorrect states.
Of course, if you want to go ahead with the “database as a smart storage” approach, it will require additional effort. However, if data in your database backs serious business, this effort is unavoidable. Of course, checks/constraints will not help you to avoid problems, but they will help you to detect them fast. And time is money.