Thread: journaling / time travel

journaling / time travel

From
Willy-Bas Loos
Date:
Hi,

In a joint effort with some parties, we have created a large database (1.1 TB)  of records that change only occasionally. Mainly, more of them are added to the database in a slow rate (1.1 TB built up in 10 years).
The records can have some significance in legal cases. Since records can be changed afterwards, it has been argued that we should have "journaling", meaning that every change to the data is saved in a separate schema that holds a "journaling" copy of each table in the datamodel, including lookups. So theoretically it is possible to go back to the data at one exact point in time. This is stored in the same database.
By now, this journaling schema takes up 624GB.
I once looked into a contrib module called time travel that does something similar, hence the title.

The use case of legal disputes being fought with our data as evidence and digging up the exact data from a certain point of time never occurred in those 10 years, and it is unlikely that it ever will.
But it might, if anyone could reasonably expect this to be possible.

Also, it has been argued that this journal is a good thing, because it could save us in case of a unforeseen creeping corruption of data or some other catastrophe that isn't covered  by our backup system. Someone described this as a good practice, resulting in the fact that we still do this.

Now, i don't like this blown up journal that we don't ever use, so i would love to lose it.
But indeed we should do anything we reasonably can to protect that data.

My question to you all is:
* Is the legal thing actualy something one could expect of us?
* Is the security thing really a good practice?
* Is this a common use case that is normally solved with standard components?

One thing: PITR would be a candidate for a standard component, but it takes up much more space. We could double the retention time if the journal were to be axed, but that would probably not be enough.

Cheers,

--
Willy-Bas Loos

Re: journaling / time travel

From
Willy-Bas Loos
Date:
On Mon, Sep 19, 2016 at 5:48 PM, Willy-Bas Loos <willybas@gmail.com> wrote:

The use case of legal disputes being fought with our data as evidence and digging up the exact data from a certain point of time never occurred in those 10 years, and it is unlikely that it ever will.
But it might, if anyone could reasonably expect this to be possible.


Just to clarify, this is not a specific feature that we are bound to in an SLA, or anything like that.
It is just argued that we should keep the feature because a judge may ask for it.
I'm not asking for official legal advice on this list, just for common sense.
We are based in the Netherlands and any legal action of relevance to this question would take place here.

Cheers,
--
Willy-Bas Loos

Re: journaling / time travel

From
Paul Jungwirth
Date:
On 09/19/2016 08:48 AM, Willy-Bas Loos wrote:
> Since records can
> be changed afterwards, it has been argued that we should have
> "journaling", meaning that every change to the data is saved in a
> separate schema that holds a "journaling" copy of each table

I don't think this is especially unusual. I've worked on similar
projects that maintain history for regulatory reasons. I can't speak to
your legal situation, but I don't think keeping history is a strange choice.

Re the security argument: it seems to me it depends on whether you can
restrict access to the journal while providing access to the normal
tables. I guess journaling would help for some threats but not others,
right? If regular users can't touch the journal, then I agree keeping
one could make sense.

Are you saying the journaling portion is 624GB on top of 1.1TB? Or that
of the 1.1TB, 624GB of it is from journaling? Either way it doesn't seem
like a massive cost to me.

I don't think PITR is an appropriate solution to keeping a 10-year
history of changes.

It sounds like you're not looking for a green-field solution, but just
trying to get perspective on what others are doing. Some resources for
this that might help you:

https://www.youtube.com/watch?v=TRgni5q0YM8
https://github.com/arkhipov/temporal_tables (aka
http://pgxn.org/dist/temporal_tables/)
http://www.cs.arizona.edu/~rts/tdbbook.pdf (also available in print)
https://www.amazon.com/Bitemporal-Data-Practice-Tom-Johnston/dp/0124080677

The two books contain multiple approaches to storing history, each with
tradeoffs for ease-of-use, disk space, etc. Reading them might be
overkill for you. If you want to choose one, I'd read Snodgrass. The
first few chapters are tedious, but it gets better.

Temporal databases are in a funny in-between zone where there is lots of
research, but standard tools are fairly underdeveloped. Postgres
recently added range types and exclusion constraints, which are
important primitives for building a temporal system, but it is still a
long way from SQL:2011, and SQL:2011 is itself a long way from
everything you might want. One thing that seems lacking to me, even in
the research, is how to handle DDL changes. You should be glad that you
only care about audit history and not subject history too, because going
bi-temporal is where you really cross over into lack of available tools,
outside of a few commercial offerings. (Teradata has temporal support,
using a Snodgrass-like approach that pre-dates the standard.)

Paul



Re: journaling / time travel

From
Paul Jungwirth
Date:
 > On 09/19/2016 10:56 AM, Willy-Bas Loos wrote:
 > > On Mon, Sep 19, 2016 at 6:26 PM, Paul Jungwirth
 > > <pj@illuminatedcomputing.com <mailto:pj@illuminatedcomputing.com>>
wrote:
 > >     I've worked on similar
 > >     projects that maintain history for regulatory reasons.
 > Can you explain "regulatory reasons" please?

I mean government regulations, like HIPAA and Sarbanes-Oxley. I'm in the
U.S. Since you are in the Netherlands you'll have different rules to
deal with.

Paul



Re: journaling / time travel

From
phb07
Date:

On 09/19/2016 08:48 AM, Willy-Bas Loos wrote:
> Since records can
> be changed afterwards, it has been argued that we should have
> "journaling", meaning that every change to the data is saved in a
> separate schema that holds a "journaling" copy of each table

I don't think this is especially unusual. I've worked on similar 
projects that maintain history for regulatory reasons. I can't speak to 
your legal situation, but I don't think keeping history is a strange choice.

Re the security argument: it seems to me it depends on whether you can 
restrict access to the journal while providing access to the normal 
tables. I guess journaling would help for some threats but not others, 
right? If regular users can't touch the journal, then I agree keeping 
one could make sense.

Are you saying the journaling portion is 624GB on top of 1.1TB? Or that 
of the 1.1TB, 624GB of it is from journaling? Either way it doesn't seem 
like a massive cost to me.

I don't think PITR is an appropriate solution to keeping a 10-year 
history of changes.

It sounds like you're not looking for a green-field solution, but just 
trying to get perspective on what others are doing. Some resources for 
this that might help you:

https://www.youtube.com/watch?v=TRgni5q0YM8
https://github.com/arkhipov/temporal_tables (aka 
http://pgxn.org/dist/temporal_tables/)
http://www.cs.arizona.edu/~rts/tdbbook.pdf (also available in print)
https://www.amazon.com/Bitemporal-Data-Practice-Tom-Johnston/dp/0124080677

The two books contain multiple approaches to storing history, each with 
tradeoffs for ease-of-use, disk space, etc. Reading them might be 
overkill for you. If you want to choose one, I'd read Snodgrass. The 
first few chapters are tedious, but it gets better.

Temporal databases are in a funny in-between zone where there is lots of 
research, but standard tools are fairly underdeveloped. Postgres 
recently added range types and exclusion constraints, which are 
important primitives for building a temporal system, but it is still a 
long way from SQL:2011, and SQL:2011 is itself a long way from 
everything you might want. One thing that seems lacking to me, even in 
the research, is how to handle DDL changes. You should be glad that you 
only care about audit history and not subject history too, because going 
bi-temporal is where you really cross over into lack of available tools, 
outside of a few commercial offerings. (Teradata has temporal support, 
using a Snodgrass-like approach that pre-dates the standard.)

Paul

You may also have a look on the E-Maj extension. It is available for instance here on pgxn.
Its purpose is to record all updates on tables either to examine them or to "rollback" them later. But it should be also usable for auditing.
The documentation in the package includes a detailed reference guide, but also a presentation that allows to quickly have a view on the extention.
Even if you think E-Maj doesn't fit your need, feel free to give me a feedback. It may give me thoughts for enhancements ;-)

Regards. Philippe.
 

Re: journaling / time travel

From
George Neuner
Date:
Coming late to this, but ...

On Mon, 19 Sep 2016 17:48:20 +0200, Willy-Bas Loos
<willybas@gmail.com> wrote:

>The use case of legal disputes being fought with our data as evidence and
>digging up the exact data from a certain point of time never occurred in
>those 10 years, and it is unlikely that it ever will.
>But it might, if anyone could reasonably expect this to be possible.
>
>    :
>
>My question to you all is:
>* Is the legal thing actualy something one could expect of us?
>* Is the security thing really a good practice?
>* Is this a common use case that is normally solved with standard
>components?

I am not a lawyer.

You don't say where you are specifically, but in the US, there is a
legal notion that changes/deletions done in the "normal course of
business" generally are permitted.  That is, e.g., if once a month you
routinely purge records older than 3 years, then you can't be expected
to produce records from 4 years ago.  But you have to prove to the
court that this is normal for your business: e.g., show documentation
of your record keeping procedures.

The problem comes when you do get notice of a legal action.  From that
moment forward you must preserve any data that might be relevent to
the case ... including any new data that is created ... in the event
that it ever is subpoenaed by the court.

This can become a major issue when you realize that a court case may
drag on for many years, and you may not know exactly what data has to
be preserved.  Lawyers often go on "fishing expeditions", asking for
data in many different ways [by different keywords, etc.], hoping to
find something by comparing the results.

Journaling solves the retention problem and may provide other nice
features like an audit trail of who made the changes.  Of course,
journaling may take a lot of extra space unless it stores only deltas.


Many locales have similar requirements for data preservation in the
face of a legal action.  You need to find out what is expected where
you are.  I'd have to advise that you talk to your lawyers rather than
ask here.

At least in the US, the "normal course of business" applies to archive
data as well as to live data, so you may be able to limit how long you
need to keep the journals.


Hope this ... doesn't further confuse.
George