Thread: Disable vacuuming to provide data history

Disable vacuuming to provide data history

From
marekmosiewicz@gmail.com
Date:
Hey,

It depends on the scenario, but there are many use cases in which a
malicious data change by somebody with admin privileges could be a
disaster. That is where data history could help. A basic solution
would be a trigger that writes the previous version of each record to
some other table. A trigger, however, can be disabled or removed (a
crazy solution would be to provide permanent triggers and tables that
are permanently append-only).
There is also the possibility of modifying the tablespace directly on disk.

But Postgres has the ability to not overwrite records when two
concurrent transactions modify data, in order to provide MVCC.

So what about permanently non-vacuumable tables? Add some xid log
tables in which each record's hash is chained to the hash of the
previous record. I think that would be a serious additional advantage
for the best open source relational database.
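
A rough sketch of what I have in mind (purely hypothetical; nothing
like this exists in core PostgreSQL, and my_table with its id column
is just a placeholder):

    -- Hypothetical append-only chain log; names are illustrative.
    CREATE TABLE xid_log (
        seq        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        xid        xid8  NOT NULL,  -- transaction that made the change
        rec_hash   bytea NOT NULL,  -- sha256 of the row image
        chain_hash bytea NOT NULL   -- sha256(rec_hash || previous chain_hash)
    );

    -- Each new entry is linked to the tip of the chain, so rewriting any
    -- historical row would invalidate every later chain_hash:
    INSERT INTO xid_log (xid, rec_hash, chain_hash)
    SELECT pg_current_xact_id(),
           sha256(convert_to(t::text, 'UTF8')),
           sha256(sha256(convert_to(t::text, 'UTF8')) ||
                  coalesce((SELECT chain_hash FROM xid_log
                            ORDER BY seq DESC LIMIT 1),
                           '\x'::bytea))
    FROM   my_table AS t
    WHERE  t.id = 42;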

Best regards,
   Marek Mosiewicz




Re: Disable vacuuming to provide data history

From
Corey Huinker
Date:
On Thu, Feb 23, 2023 at 6:04 AM <marekmosiewicz@gmail.com> wrote:
> Hey,
>
> It depends on the scenario, but there are many use cases in which a
> malicious data change by somebody with admin privileges could be a
> disaster. That is where data history could help. A basic solution
> would be a trigger that writes the previous version of each record to
> some other table. A trigger, however, can be disabled or removed (a
> crazy solution would be to provide permanent triggers and tables that
> are permanently append-only).
> There is also the possibility of modifying the tablespace directly on disk.
>
> But Postgres has the ability to not overwrite records when two
> concurrent transactions modify data, in order to provide MVCC.
>
> So what about permanently non-vacuumable tables? Add some xid log
> tables in which each record's hash is chained to the hash of the
> previous record. I think that would be a serious additional advantage
> for the best open source relational database.
>
> Best regards,
>    Marek Mosiewicz

What you are describing sounds like the "system versioning" flavor of "temporal" tables. It is part of the SQL standard, but PostgreSQL has yet to implement it in core. Basically, every row has a start_timestamp and an end_timestamp field. Updating a row sets the end_timestamp of the old version and inserts a new one with a start_timestamp matching the end_timestamp of the previous row. Once a record has a non-null [1] end_timestamp, it is not possible to update that row via SQL. Regular SQL statements effectively have an "AND end_timestamp IS NULL" filter on them, so the old rows are not visible without specifically invoking temporal features to get point-in-time queries. At the implementation level, this probably means a table with two partitions: one for live rows, all having null end_timestamps, and one for archived rows, which is effectively append-only.
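
A minimal sketch of how a home-rolled version might look (illustrative
names only; this is not the standard's syntax, which PostgreSQL does
not implement):

    -- Live and historical rows in one partitioned table.
    CREATE TABLE account (
        id              bigint      NOT NULL,
        balance         numeric     NOT NULL,
        start_timestamp timestamptz NOT NULL DEFAULT now(),
        end_timestamp   timestamptz             -- NULL means "current row"
    ) PARTITION BY LIST ((end_timestamp IS NULL));

    CREATE TABLE account_live    PARTITION OF account FOR VALUES IN (true);
    CREATE TABLE account_history PARTITION OF account FOR VALUES IN (false);

    -- Setting end_timestamp moves the closed-out row version into the
    -- history partition. A point-in-time query then looks like:
    SELECT id, balance
    FROM   account
    WHERE  start_timestamp <= timestamptz '2022-01-01'
    AND    (end_timestamp IS NULL OR end_timestamp > timestamptz '2022-01-01');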

This strategy is common practice for chain-of-custody and auditing purposes, either as a feature of the RDBMS or home-rolled. I have also seen it used for developing forecasting models (e.g., "what would this model have told us to do if we had run it a year ago?").

A few years ago, I personally thought about implementing a hash-chain feature, but my research at the time concluded that:

* Few customers were interested in going beyond what was required for regulatory compliance
* Once compliant, any divergence from established procedures, even if it was an unambiguous improvement, only invited re-examination of that procedure and adjacent ones, so they would avoid it
* They could get the same validation by comparing against a secured backup and out-of-band audit "logs" (most would call them "reports")
* They were of the opinion that if a bad actor got admin access, it was "game over" anyway

The world may have changed since then, but even if there is now interest, I wonder if that isn't better implemented at the OS level rather than the RDBMS level.

 [1] Some implementations don't use null; they use an end_timestamp set to a date implausibly far in the future (3999-12-31, for example), but the concept remains that once the column is set to a real timestamp, the row isn't visible to update statements.

Re: Disable vacuuming to provide data history

From
Vik Fearing
Date:
On 2/24/23 22:06, Corey Huinker wrote:
> On Thu, Feb 23, 2023 at 6:04 AM <marekmosiewicz@gmail.com> wrote:
> 
>   [1] Some implementations don't use null; they use an end_timestamp set to
> a date implausibly far in the future (3999-12-31, for example),

The specification is, "At any point in time, all rows that have their 
system-time period end column set to the highest value supported by the 
data type of that column are known as current system rows; all other 
rows are known as historical system rows."

I would like to see us use 'infinity' for this.
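
For illustration, 'infinity' already compares above every finite
timestamp:

    -- 'infinity' sorts above any finite timestamptz value:
    SELECT timestamptz 'infinity' > timestamptz '9999-12-31';  -- true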

The main design blocker for me is how to handle dump/restore.  The 
standard does not bother thinking about that.
-- 
Vik Fearing




Re: Disable vacuuming to provide data history

From
marekmosiewicz@gmail.com
Date:
On Sat, 25.02.2023 at 03:11 +0100, Vik Fearing wrote:
> On 2/24/23 22:06, Corey Huinker wrote:
>
> The main design blocker for me is how to handle dump/restore.  The
> standard does not bother thinking about that.

That would be a little difficult. Most probably you would need to
operate on a history view to dump/restore.

Best regards,
    Marek Mosiewicz




Re: Disable vacuuming to provide data history

From
Hannu Krosing
Date:
There is also another blocker: our timestamp resolution is 1
microsecond, and we are dangerously close to speeds where one could
update a row twice in the same microsecond.
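
To illustrate (a sketch; the displayed value depends on your TimeZone
setting):

    -- timestamptz keeps microsecond precision; extra fractional digits
    -- are rounded away, so two row versions stamped within the same
    -- microsecond would be indistinguishable:
    SELECT timestamptz '2023-02-25 03:11:00.1234567+01';
    -- => 2023-02-25 03:11:00.123457+01  (rounded to 1 microsecond)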

I have been thinking about this, and what is needed is:

1. a nanosecond-resolution "abstime" type - not absolutely necessary,
but would help with corner cases.
2. VACUUM should be able to "freeze" by replacing xmin/xmax values
with commit timestamps, or adding tmin/tmax where necessary.
3. Optionally VACUUM could move historic rows to archive tables with
explicit tmin/tmax columns (this also solves the pg_dump problem); a
sketch of such a table follows below.
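
A hypothetical shape for such an archive table (names are illustrative,
borrowing the cities example below):

    -- Sketch of point 3: history rows carry explicit commit-time bounds.
    CREATE TABLE cities_history (
        LIKE cities,                 -- same columns as the live table
        tmin timestamptz NOT NULL,   -- commit time of the creating transaction
        tmax timestamptz NOT NULL    -- commit time of the superseding transaction
    );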

Most of the above design (apart from the timestamp resolution, and
VACUUM being the one stamping in commit timestamps) is not really new:
up to version 6.2, PostgreSQL had tmin/tmax instead of xmin/xmax, and
you could specify the timestamp at which you wanted to query any table.

And the original Postgres design was a Full History Database, where you
could say "SELECT name, population FROM cities['epoch' .. 'now']" to
get all historic population values.

And historic data was meant to be moved to WORM optical drives, which
had just arrived on the market.


--
Hannu


On Sat, Feb 25, 2023 at 3:11 AM Vik Fearing <vik@postgresfriends.org> wrote:
>
> On 2/24/23 22:06, Corey Huinker wrote:
> > On Thu, Feb 23, 2023 at 6:04 AM <marekmosiewicz@gmail.com> wrote:
> >
> >   [1] Some implementations don't use null; they use an end_timestamp set to
> > a date implausibly far in the future (3999-12-31, for example),
>
> The specification is, "At any point in time, all rows that have their
> system-time period end column set to the highest value supported by the
> data type of that column are known as current system rows; all other
> rows are known as historical system rows."
>
> I would like to see us use 'infinity' for this.
>
> The main design blocker for me is how to handle dump/restore.  The
> standard does not bother thinking about that.
> --
> Vik Fearing
>
>
>