Re: table versioning approach (not auditing) - Mailing list pgsql-general

From Gavin Flower
Subject Re: table versioning approach (not auditing)
Date
Msg-id 54332150.3020401@archidevsys.co.nz
In response to Re: table versioning approach (not auditing)  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Responses Re: table versioning approach (not auditing)
List pgsql-general
On 07/10/14 10:47, Jim Nasby wrote:
> On 10/2/14, 9:27 AM, Adam Brusselback wrote:
>> i've also tried to implement a database versioning using JSON to log
>> changes in tables. Here it is:
>>
>> https://github.com/fxku/audit
>>
>> I've got two versioning tables, one storing information about all
>> transactions that happened and one where i put the JSON logs of row
>> changes of each table. I'm only logging old values and not complete
>> rows.
>>
>> Then I got a function that recreates a database state at a given time
>> into a separate schema - either to VIEWs, MVIEWs or TABLES. This
>> database state could then be indexed in order to work with it. You
>> can also reset the production state to the recreated past state.
>>
>> Unfortunately I've got no time to further work on it at the moment +
>> I have not done tests with many changes in the database so I can't
>> say if the recreation process scales well. One downside I've realised
>> is that the json_agg function hits limits when I've got binary data -
>> it gets too long. So I'm really looking forward to using JSONB.
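
A minimal sketch of that kind of JSON logging trigger (hypothetical
table and column names - not the actual code from the repository):

```sql
-- Hypothetical sketch: a generic audit trigger that logs only the
-- old row values as JSON, roughly as described above.
CREATE TABLE row_log (
    log_id     bigserial   PRIMARY KEY,
    table_name text        NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now(),
    old_values json
);

CREATE OR REPLACE FUNCTION log_old_row() RETURNS trigger AS $$
BEGIN
    INSERT INTO row_log (table_name, old_values)
    VALUES (TG_TABLE_NAME, row_to_json(OLD));
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- Attached per table, e.g.:
-- CREATE TRIGGER customer_log AFTER UPDATE OR DELETE ON customer
--     FOR EACH ROW EXECUTE PROCEDURE log_old_row();
```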
>>
>> There are more plans in my mind. By having a Transaction_Log table it
>> should be possible to revert only certain transactions. I'm also
>> thinking of parallel versioning, e.g. different users are all working
>> with their version of the database and commit their changes to the
>> production state. As I've got a unique history ID for each table and
>> each row, I should be able to map the affected records.
> Sorry I'm coming late to this thread. I agree that getting interested
> people together would be a good idea. Is there another mailing list we
> can do that with?
>
> Versioning is also something I've interested in, and have put a lot of
> thought into (if not much actual code :( ). I'll also make some
> general comments, if I may...
>
>
> I think timestamps should be *heavily avoided* in versioning, because
> they are frequently the wrong way to solve a problem. There are many
> use cases where you're trying to answer "What values were in place
> when X happened", and the simplest, most fool-proof way to answer that
> is that when you create a record for X, part of that record is a
> "history ID" that shows you the exact data used. For example, if
> you're creating an invoicing system that has versioning of customer
> addresses, you would not try to join an invoice with its address
> using a timestamp; you would put an actual address_history_id in the
> invoice table.
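
As a sketch of that idea (hypothetical table names), the invoice
carries the history key directly, so no timestamp arithmetic is needed:

```sql
-- Hypothetical sketch: each address change inserts a new history row,
-- and the invoice records exactly which version it was created with.
CREATE TABLE address_history (
    address_history_id bigserial   PRIMARY KEY,
    customer_id        bigint      NOT NULL,
    address            text        NOT NULL,
    valid_from         timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE invoice (
    invoice_id         bigserial   PRIMARY KEY,
    customer_id        bigint      NOT NULL,
    address_history_id bigint      NOT NULL REFERENCES address_history,
    created_at         timestamptz NOT NULL DEFAULT now()
);

-- The address used on an invoice is then an exact join:
SELECT i.invoice_id, a.address
FROM   invoice i
JOIN   address_history a USING (address_history_id);
```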
>
> I thought I saw a reference to versioning sets of information. This is
> perhaps the trickiest part. You first have to think about the
> non-versioned sets (ie: a customer may have many phone numbers) before
> you think about versioning the set. In this example, you want the
> history of the *set* of phone numbers, not of each individual number.
> Design it with full duplication of data first, don't think about
> normalizing until you have the full set versioning design.
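
One (hypothetical, fully denormalised) way to version the *set* rather
than each individual number might look like:

```sql
-- Hypothetical sketch: every change to a customer's phone numbers
-- creates a new set version, and the whole set is duplicated under it.
CREATE TABLE phone_set_history (
    phone_set_id bigserial   PRIMARY KEY,
    customer_id  bigint      NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE phone_set_member (
    phone_set_id bigint NOT NULL REFERENCES phone_set_history,
    phone_number text   NOT NULL,
    PRIMARY KEY (phone_set_id, phone_number)
);

-- Any record needing "the numbers as they were" stores one phone_set_id;
-- adding or removing a single number means writing a new set version.
```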
>
> I understand the generic appeal of using something like JSON, but in
> reality I don't see it working terribly well. It's likely to be on the
> slow side, and it'll also be difficult to query from. Instead, I think
> it makes more sense to create actual history tables that derive their
> definition from the base table. I've got code that extracts
> information (column_name, data type, nullability) from a table (or
> even a table definition), and it's not that complex. With the work
> that's been done on capturing DDL changes it shouldn't be too hard to
> handle that automatically.
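
Extracting enough of a table's definition to build a matching history
table can be done from the catalogs, e.g. (sketch, hypothetical base
table name):

```sql
-- Sketch: pull column name, type and nullability for a base table,
-- which is most of what a generated history table needs.
SELECT column_name,
       data_type,
       is_nullable
FROM   information_schema.columns
WHERE  table_schema = 'public'
  AND  table_name   = 'customer'   -- hypothetical base table
ORDER  BY ordinal_position;
```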
>
>
Yeah, my design was quite extensive and ensured all relevant information
was associated with the 'history id' (you still need timestamps to find
the appropriate value); the powers that be watered it down somewhat, but
that was outside my control. Performance was not too critical - probably
less than 10 transactions per second at peak times. JSON had yet to be
invented, but we would not have used it anyhow.

Even if timestamps are used extensively, you'd have to be careful
joining on them. You may have information valid at T1 and changing at
T3, but the transaction is stamped T2, where T1 < T2 < T3 - the
appropriate set of data would be associated with T1, so you would not
get anywhere trying to find data with a timestamp exactly equal to T2
(unless you were very lucky!).
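
If you do join on timestamps, the lookup has to be a range test for the
version in effect at T2, never an equality - something like this sketch
(assuming hypothetical valid_from/valid_to columns):

```sql
-- Sketch: find the address version in effect at transaction time :t2.
-- A half-open interval [valid_from, valid_to) avoids gaps and overlaps.
SELECT a.address
FROM   address_history a
WHERE  a.customer_id = 42                 -- hypothetical customer
  AND  a.valid_from <= :t2
  AND  (a.valid_to IS NULL OR :t2 < a.valid_to);
```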

Actually, things like phone numbers are tricky. Sometimes you may want
to use the current phone number, and not the one extant at that time
(as you want to phone the contact now), and sometimes you may still
want the old phone number (was the call to a specific number at that
date/time legitimate, and who do we charge the cost of the call to?).

