Thread: Disable vacuuming to provide data history
Hey, It depends on the scenario, but there are many use cases where a malicious data change by somebody with admin privileges could be a disaster. That is where data history could come to help. A basic solution would be a trigger which writes the previous version of a record to some other table. A trigger, however, can be disabled or removed (a crazy solution would be to provide permanent triggers and tables which can only ever be inserted into). Then there is also the possibility of modifying the tablespace directly on disk. But Postgres has the ability to not overwrite records when two concurrent transactions modify data, to provide MVCC. So what about permanently non-vacuumable tables? Add some xid log tables with a hash of each record chained to the hash of the previous one. I think that would be a serious additional advantage for the best open source relational database. Best regards, Marek Mosiewicz
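For concreteness, a minimal sketch of the trigger-based history table described in the post above. The accounts table, the column names, and the md5-based chaining are purely illustrative assumptions, and the hash lookup ignores concurrent writers; this is the baseline being discussed, not a tamper-proof design.

CREATE TABLE accounts (
    id      bigint PRIMARY KEY,
    balance numeric NOT NULL
);

-- Append-only audit table; each row's hash is chained to the previous row's hash.
CREATE TABLE accounts_audit (
    audit_id   bigserial PRIMARY KEY,
    id         bigint,
    balance    numeric,
    changed_at timestamptz NOT NULL DEFAULT clock_timestamp(),
    changed_by text        NOT NULL DEFAULT current_user,
    xid        xid8        NOT NULL DEFAULT pg_current_xact_id(),  -- PostgreSQL 13+
    prev_hash  text,
    row_hash   text NOT NULL
);

CREATE FUNCTION accounts_audit_fn() RETURNS trigger AS $$
DECLARE
    last_hash text;
BEGIN
    -- Fetch the hash of the most recent audit row to chain onto.
    -- (Simplified: concurrent writers would need locking or serialization here.)
    SELECT row_hash INTO last_hash
      FROM accounts_audit
     ORDER BY audit_id DESC
     LIMIT 1;

    INSERT INTO accounts_audit (id, balance, prev_hash, row_hash)
    VALUES (OLD.id, OLD.balance, last_hash,
            md5(coalesce(last_hash, '') || OLD.id::text || OLD.balance::text));

    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_history_trg
AFTER UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE FUNCTION accounts_audit_fn();

The weakness, as the post points out, is that an administrator can simply drop the trigger or rewrite accounts_audit, which is exactly the gap a permanently non-vacuumable, hash-chained table would be meant to close.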
On Thu, Feb 23, 2023 at 6:04 AM <marekmosiewicz@gmail.com> wrote:
> Hey,
>
> It depends on the scenario, but there are many use cases where a malicious
> data change by somebody with admin privileges could be a disaster.
> That is where data history could come to help. A basic solution would be
> a trigger which writes the previous version of a record to some other table.
> A trigger, however, can be disabled or removed (a crazy solution would be
> to provide permanent triggers and tables which can only ever be inserted into).
> Then there is also the possibility of modifying the tablespace directly on disk.
> But Postgres has the ability to not overwrite records when two concurrent
> transactions modify data, to provide MVCC.
> So what about permanently non-vacuumable tables? Add some xid log
> tables with a hash of each record chained to the hash of the previous one.
> I think that would be a serious additional advantage for the best open source
> relational database.
> Best regards,
> Marek Mosiewicz
What you are describing sounds like the "system versioning" flavor of "temporal" tables. It's a part of the SQL Standard, but PostgreSQL has yet to implement it in core. Basically, every row has a start_timestamp and end_timestamp field. Updating a row sets the end_timestamp of the old version and inserts a new row with a start_timestamp matching the end_timestamp of the previous row. Once a record has a non-null [1] end_timestamp, it is not possible to update that row via SQL. Regular SQL statements effectively have an "AND end_timestamp IS NULL" filter on them, so the old rows are not visible without specifically invoking temporal features to get point-in-time queries. At the implementation level, this probably means a table with 2 partitions, one for live rows all having null end_timestamps, and one for archived rows which is effectively append-only.
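For reference, a rough sketch of how that pattern is usually hand-rolled today, since core has no system versioning. The employees table and all column names are invented for illustration; real implementations add partitioning, exclusion constraints, and concurrency handling.

CREATE TABLE employees (
    emp_id          bigint      NOT NULL,
    name            text        NOT NULL,
    salary          numeric     NOT NULL,
    start_timestamp timestamptz NOT NULL DEFAULT now(),
    end_timestamp   timestamptz             -- NULL means "current row"
);

-- Regular queries only see rows whose end_timestamp is still NULL.
CREATE VIEW employees_current AS
    SELECT emp_id, name, salary
      FROM employees
     WHERE end_timestamp IS NULL;

-- An "update" closes the old version and inserts the new one in one transaction.
-- now() is constant within a transaction, so the new start_timestamp matches
-- the old row's end_timestamp.
BEGIN;
UPDATE employees
   SET end_timestamp = now()
 WHERE emp_id = 42
   AND end_timestamp IS NULL;

INSERT INTO employees (emp_id, name, salary)
VALUES (42, 'New Name', 95000);
COMMIT;

-- A point-in-time query is then just a filter on the two timestamps.
SELECT emp_id, name, salary
  FROM employees
 WHERE start_timestamp <= timestamptz '2023-01-01'
   AND (end_timestamp IS NULL OR end_timestamp > timestamptz '2023-01-01');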
This strategy is common practice for chain of custody and auditing purposes, either as a feature of the RDBMS or home-rolled. I have also seen it used for developing forecasting models (e.g., "what would this model have told us to do if we had run it a year ago?").
A few years ago, I personally thought about implementing a hash-chain feature, but my research at the time concluded that:
* Few customers were interested in going beyond what was required for regulatory compliance
* Once compliant, any divergence from established procedures, even if it was an unambiguous improvement, only invited re-examination of it and adjacent procedures, and they would avoid that
* They could get the same validation by comparing against a secured backup and out-of-band audit "logs" (most would call them "reports")
* They were of the opinion that if a bad actor got admin access, it was "game over" anyway
The world may have changed since then, but even if there is now interest, I wonder if that isn't better implemented at the OS level rather than the RDBMS level.
[1] Some implementations don't use null; they use an end_timestamp set to a date implausibly far in the future (3999-12-31, for example), but the concept remains that once the column is set to a real timestamp, the row isn't visible to update statements.
On 2/24/23 22:06, Corey Huinker wrote:
> On Thu, Feb 23, 2023 at 6:04 AM <marekmosiewicz@gmail.com> wrote:
>
> [1] some implementations don't use null, they use an end-timestamp set to
> a date implausibly far in the future (3999-12-31 for example),

The specification is, "At any point in time, all rows that have their system-time period end column set to the highest value supported by the data type of that column are known as current system rows; all other rows are known as historical system rows."

I would like to see us use 'infinity' for this.

The main design blocker for me is how to handle dump/restore. The standard does not bother thinking about that.
--
Vik Fearing
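A small illustration of why 'infinity' fits that wording, using an invented table layout (accounts_versioned and its columns are assumptions for the example):

-- 'infinity' sorts above every finite timestamp, matching the spec's
-- "highest value supported by the data type":
SELECT 'infinity'::timestamptz > timestamptz '9999-12-31 23:59:59';  -- true

CREATE TABLE accounts_versioned (
    id        bigint,
    balance   numeric,
    sys_start timestamptz NOT NULL DEFAULT now(),
    sys_end   timestamptz NOT NULL DEFAULT 'infinity'
);

-- "Current system rows" per the spec wording:
SELECT * FROM accounts_versioned WHERE sys_end = 'infinity';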
On Sat, 25.02.2023 at 03:11 +0100, Vik Fearing wrote:
> On 2/24/23 22:06, Corey Huinker wrote:
>
> The main design blocker for me is how to handle dump/restore. The
> standard does not bother thinking about that.

That would be a little difficult. Most probably you would need to operate on a history view to dump/restore.

Best regards,
Marek Mosiewicz
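Purely as a speculative sketch of what "operating on a history view" might mean, building on the invented accounts_versioned layout earlier in the thread:

-- Split current and historical rows into views that a dump could target separately.
CREATE VIEW accounts_current AS
    SELECT * FROM accounts_versioned WHERE sys_end = 'infinity';

CREATE VIEW accounts_history AS
    SELECT * FROM accounts_versioned WHERE sys_end < 'infinity';

-- From psql, the historical rows could then be exported out of band, e.g.:
--   \copy (SELECT * FROM accounts_history) TO 'accounts_history.copy'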
There is also another blocker - our timestamp resolution is 1 microsecond, and we are dangerously close to speeds where one could update a row twice in the same microsecond.

I have been thinking about this, and what is needed is:

1. A nanosecond-resolution "abstime" type - not absolutely necessary, but it would help with corner cases.
2. VACUUM should be able to "freeze" rows by replacing xmin/xmax values with commit timestamps, or adding tmin/tmax where necessary.
3. Optionally, VACUUM could move historic rows to archive tables with explicit tmin/tmax columns (this also solves the pg_dump problem).

Most of the above design - apart from the timestamp resolution and VACUUM being the one stamping in the commit timestamps - is not really new: up to version 6.2, PostgreSQL had tmin/tmax instead of xmin/xmax, and you could specify the timestamp you wanted to query any table at. The original Postgres design was a Full History Database where you could say "SELECT name, population FROM cities['epoch' .. 'now']" to get all historic population values, and historic data was meant to be moved to the WORM optical drives which had just arrived on the market.

---
Hannu

On Sat, Feb 25, 2023 at 3:11 AM Vik Fearing <vik@postgresfriends.org> wrote:
>
> On 2/24/23 22:06, Corey Huinker wrote:
> > On Thu, Feb 23, 2023 at 6:04 AM <marekmosiewicz@gmail.com> wrote:
> >
> > [1] some implementations don't use null, they use an end-timestamp set to
> > a date implausibly far in the future (3999-12-31 for example),
>
> The specification is, "At any point in time, all rows that have their
> system-time period end column set to the highest value supported by the
> data type of that column are known as current system rows; all other
> rows are known as historical system rows."
>
> I would like to see us use 'infinity' for this.
>
> The main design blocker for me is how to handle dump/restore. The
> standard does not bother thinking about that.
> --
> Vik Fearing
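To make item 3 above concrete, a rough sketch of what such an archive table and a point-in-time query over it might look like from the user's side. The cities_archive name and columns are invented, and the actual stamping of commit timestamps would of course happen inside the server, not in user SQL.

-- Hypothetical archive table that VACUUM would append historic row versions to;
-- column names follow the old tmin/tmax convention.
CREATE TABLE cities_archive (
    name       text,
    population bigint,
    tmin       timestamptz NOT NULL,  -- commit time of the transaction that created the version
    tmax       timestamptz NOT NULL   -- commit time of the transaction that ended it
);

-- A point-in-time query over archived versions becomes an ordinary filter,
-- roughly the modern spelling of the old "FROM cities['epoch' .. 'now']" syntax:
SELECT name, population
  FROM cities_archive
 WHERE tmin <= timestamptz '2020-01-01'
   AND tmax >  timestamptz '2020-01-01';

-- (Rows still live in "cities" would additionally need their creation time,
-- e.g. via track_commit_timestamp and pg_xact_commit_timestamp(xmin).)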