Thread: table versioning approach (not auditing)

table versioning approach (not auditing)

From
Abelard Hoffman
Date:
Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).

I have studied these two audit trigger examples:

I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.

My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.

But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record

#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.

#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).

So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?

Thanks


Re: table versioning approach (not auditing)

From
Gavin Flower
Date:
On 29/09/14 15:00, Abelard Hoffman wrote:
Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).

I have studied these two audit trigger examples:

I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.

My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.

But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record

#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.

#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).

So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?

Thanks


I implemented a 2 table approach over 15 years ago for an insurance application.  I used both an effective_date & and an as_at_date, no triggers were involved.  I think a 2 table approach gives you more flexibility.

The effective_date allowed changes to be made to the table in advance of when they were to become effective.

The as_at_date allowed quotes to be made, valid for a period starting at the as_at_date.

End users did not query the database directly, all queries were precoded in a 4GL called Progress backed by an Oracle database.  The same could be done with a WildFly Java Enterprise AppSever (or some other middleware) and a PostgreSQL backend.

Different use case, but the concept is probably adaptable to your situation.

You may want a change table, that has a change_number that is in each type of table affected by a change.  This would help for query type #2.

I would be quite happy to contract to work out the appropriate schema and develop some SQL scripts to query & update the database, if you were interested.  My approach would be to create a minimal database with sample data to validate the schema design and SQL scripts.

Using a flag to indicate current record, seems inflexible.  As some changes may not take affect until some time in the future, and you can't query the database to see what was the situation at a particular point in the past.  For example: somebody complains about something that happened last Saturday near noon, how would you query the database to what it was like then?


Cheers,
Gavin

Re: table versioning approach (not auditing)

From
"Felix Kunde"
Date:
Hey
 
i've also tried to implement a database versioning using JSON to log changes in tables. Here it is:
https://github.com/fxku/audit
I've got two versioning tables, one storing information about all transactions that happened and one where i put the
JSONlogs of row changes of each table. I'm only logging old values and not complete rows. 
 
Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs
orTABLES. This database state could then be indexed in order to work with it. You can also reset the production state
tothe recreated past state. 
 
Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the
databaseso I can't say if the recreation process scales well. On downside I've realised is that using the json_agg
functionhas limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB. 

There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain
transactions.I'm also thinking of parallel versioning, e.g. different users are all working with their version of the
databaseand commit their changes to the production state. As I've got a unique history ID for each table and each row,
Ishould be able to map the affected records. 

Have a look and tell me what you think of it.

Cheers
Felix
 

Gesendet: Montag, 29. September 2014 um 04:00 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com>
An: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: [GENERAL] table versioning approach (not auditing)

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes
whennecessary (customer service makes an incorrect edit, etc.). 
 
I have studied these two audit trigger examples:
https://wiki.postgresql.org/wiki/Audit_trigger
https://wiki.postgresql.org/wiki/Audit_trigger_91plus
 
I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update
orinsert is done. 
 
My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the
rowchanges (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of
allversioned changes very easily. 
 
But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record
 
#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
 
#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g.,
findall changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is
inthe json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all
thoserelationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form). 
 
So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate
versioningtables associated with each real table? Or another approach? 
 
Thanks
 
 


Re: table versioning approach (not auditing)

From
Jonathan Vanasco
Date:

In the past, to accomplish the same thing I've done this:

- store the data in hstore/json.  instead of storing snapshots, I store deltas.  i've been using a second table though, because it's improved performance on reads and writes.
- use a "transaction" log.  every write session gets logged into the transaction table (serial, timestamp, user_id).  all updates to the recorded tables include the transaction's serial.  then there is a "transactions" table, that is just "transaction_serial ,  object_id , object_action".  

whenever I have needs for auditing or versioning, I can just query the transaction table for the records I want... then use that to grab the data out of hstore.



On Sep 28, 2014, at 10:00 PM, Abelard Hoffman wrote:

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).

I have studied these two audit trigger examples:

I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.

My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.

But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record

#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.

#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).

So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?

Thanks


Re: table versioning approach (not auditing)

From
Nick Guenther
Date:

On September 29, 2014 11:08:55 AM EDT, Jonathan Vanasco <postgres@2xlp.com> wrote:
>
>- use a "transaction" log.  every write session gets logged into the
>transaction table (serial, timestamp, user_id).  all updates to the
>recorded tables include the transaction's serial.  then there is a
>"transactions" table, that is just "transaction_serial ,  object_id ,
>object_action".

A newbie tangent question: how do you access the transaction serial? Is it txid_current() as listed in
http://www.postgresql.org/docs/9.3/static/functions-info.html?

And how do you actually make use of that information? I know from Bruce Momjians's excellent MVCC talk
<http://momjian.us/main/writings/pgsql/mvcc.pdf>that postgres internally has a secret txid column on each row; can you
somehowquery on the secret column? And does your implementation worry about multiple timelines?  

My use case is dynamically allocated replication. Broadly, my algorithm is that for each client
1) download a full copy of the current table
2) keep the connection open and send deltas (which are just inserts and deletes, for me)

I need 2 to begin *as if immediately* after 1.  txids sound like they are exactly what I need but without knowing how
tohandle them, I fudged it by opening a query for 1 and for 2 immediately after each other so that they should be
pluggedto the same txid but before reading them.  There's definitely a race condition that will show under load,
though.I think the correct algorithm is: 

1) ask the current txid X
2) start buffering deltas with txid > X
3) download the table as of X
4) download the buffer of deltas and listen for future ones
--


Re: table versioning approach (not auditing)

From
Abelard Hoffman
Date:
Thank you Felix, Gavin, and Jonathan for your responses.

Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record with the original row? Where's the PK stored, if it wasn't part of the delta?

Felix, thank you very much for the example code. I took a look at your table schemas. I need to study it more, but it looks like the way you're handling the PK, is you're adding a separate synthethic key (audit_id) to each table that's being versioned. And then storing that key along with the delta.

So then to find all the versions of a given row, you just need to join the audit row with the schema_name.table_name.audit_id column. Is that right? The only potential drawback there is there's no referential integrity between the audit_log.audit_id and the actual table.

I do like that approach very much though, in that it eliminates the need to interrogate the json data in order to perform most queries.

AH



On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-kunde@gmx.de> wrote:
Hey
 
i've also tried to implement a database versioning using JSON to log changes in tables. Here it is: https://github.com/fxku/audit
I've got two versioning tables, one storing information about all transactions that happened and one where i put the JSON logs of row changes of each table. I'm only logging old values and not complete rows.
 
Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs or TABLES. This database state could then be indexed in order to work with it. You can also reset the production state to the recreated past state.
 
Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the database so I can't say if the recreation process scales well. On downside I've realised is that using the json_agg function has limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB.

There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain transactions. I'm also thinking of parallel versioning, e.g. different users are all working with their version of the database and commit their changes to the production state. As I've got a unique history ID for each table and each row, I should be able to map the affected records.

Have a look and tell me what you think of it.

Cheers
Felix
 

Gesendet: Montag, 29. September 2014 um 04:00 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com>
An: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: [GENERAL] table versioning approach (not auditing)

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).
 
I have studied these two audit trigger examples:
https://wiki.postgresql.org/wiki/Audit_trigger
https://wiki.postgresql.org/wiki/Audit_trigger_91plus
 
I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.
 
My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.
 
But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record
 
#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
 
#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).
 
So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?
 
Thanks
 
 

Re: table versioning approach (not auditing)

From
Jonathan Vanasco
Date:
On Sep 29, 2014, at 4:06 PM, Nick Guenther wrote:

> A newbie tangent question: how do you access the transaction serial? Is it txid_current() as listed in
http://www.postgresql.org/docs/9.3/static/functions-info.html?

My implementations were ridiculously simple/naive in design, and existed entirely with under defined serials.  i'd just
createa new record + id on a write operation, and then use it when logging all operations. 

I had read up on a lot of (possibly better) ways to handle this using pg internals.  They all seemed more advanced than
Ineeded. 


> And does your implementation worry about multiple timelines?

Not sure I understand this... but every object is given a revision id.  edits between consecutive revisions are
allowed,edits spanning multiple revisions are rejected. 


On Sep 29, 2014, at 5:25 PM, Abelard Hoffman wrote:

> Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record
withthe original row? Where's the PK stored, if it wasn't part of the delta? 

The logic I decided on, is this:

Revision 0
     Only the original record is stored
Revision 1
    • Copy the original record into revision store
Revision 1+
    • Update the original record, store the deltas in the revision store

The reason why I chose this path, is that in my system:
    • most records are not edited
    • the records that are edited, are heavily edited

We use an ORM and it was simple to implement this pattern with it, and then write some functions in postgres to ensure
itis adhered to. 

When I need to pull data out:

    • I can pull exact revisions out of the htstore for a given table/row using the revision ids as a key
    • the revisions all contain the transaction id
    • if i need to get more info about a given transaction, i can query the transactions table and get a list of all
theobjects that were edited within that transaction 

if i wanted to ensure referential integrity, i could have used a table instead of an hstore (or json).  If the
applicationgrows much larger, it will probably be migrated to a model like that.  This approach just gave a lot of
flexibility, minimized  tables in the database, and was very easy to pull off.  i went with hstore because json didn't
allowin-place updates at the time (i think it does now).   




Re: table versioning approach (not auditing)

From
"Felix Kunde"
Date:
Hey
 
yes i'm adding an additional key to each of my tables. First i wanted to use the primary key as one column in my
audit_logtable, but in some of my tables the PK consists of more than one column. Plus it's nice to have one key that
iscalled the same over all tables. 
 
To get a former state for one row at date x I need to join the latest delta BEFORE date x with each delta AFTER date x.
IfI would log complete rows, this joining part would not be neccessary, but as I usually work with spatial databases
thathave complex geometries and also image files, this strategy is too harddisk consuming. 
 
If there are more users following a similar approach, I wonder why we not throw all the good ideas together, to have
onesolution that is tested, maintained and improved by more developpers. This would be great. 
 
Felix
 

Gesendet: Montag, 29. September 2014 um 23:25 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com>
An: "Felix Kunde" <felix-kunde@gmx.de>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Thank you Felix, Gavin, and Jonathan for your responses.
 
Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record
withthe original row? Where's the PK stored, if it wasn't part of the delta? 
 
Felix, thank you very much for the example code. I took a look at your table schemas. I need to study it more, but it
lookslike the way you're handling the PK, is you're adding a separate synthethic key (audit_id) to each table that's
beingversioned. And then storing that key along with the delta. 
 
So then to find all the versions of a given row, you just need to join the audit row with the
schema_name.table_name.audit_idcolumn. Is that right? The only potential drawback there is there's no referential
integritybetween the audit_log.audit_id and the actual table. 
 
I do like that approach very much though, in that it eliminates the need to interrogate the json data in order to
performmost queries. 
 
AH
 
 
 
On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-kunde@gmx.de> wrote:Hey
 
i've also tried to implement a database versioning using JSON to log changes in tables. Here it is:
https://github.com/fxku/audit[https://github.com/fxku/audit]
I've got two versioning tables, one storing information about all transactions that happened and one where i put the
JSONlogs of row changes of each table. I'm only logging old values and not complete rows. 
 
Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs
orTABLES. This database state could then be indexed in order to work with it. You can also reset the production state
tothe recreated past state. 
 
Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the
databaseso I can't say if the recreation process scales well. On downside I've realised is that using the json_agg
functionhas limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB. 

There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain
transactions.I'm also thinking of parallel versioning, e.g. different users are all working with their version of the
databaseand commit their changes to the production state. As I've got a unique history ID for each table and each row,
Ishould be able to map the affected records. 

Have a look and tell me what you think of it.

Cheers
Felix
 

Gesendet: Montag, 29. September 2014 um 04:00 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com>
An: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: [GENERAL] table versioning approach (not auditing)

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes
whennecessary (customer service makes an incorrect edit, etc.). 
 
I have studied these two audit trigger examples:
https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger]
https://wiki.postgresql.org/wiki/Audit_trigger_91plus
 
I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update
orinsert is done. 
 
My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the
rowchanges (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of
allversioned changes very easily. 
 
But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record
 
#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
 
#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g.,
findall changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is
inthe json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all
thoserelationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form). 
 
So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate
versioningtables associated with each real table? Or another approach? 
 
Thanks
 
 


Re: table versioning approach (not auditing)

From
Adam Brusselback
Date:
Felix, I'd love to see a single, well maintained project. For example, I just found yours, and gave it a shot today after seeing this post.  I found a bug when an update command is issued, but the old and new values are all the same.  The trigger will blow up.  I've got a fix for that, but if we had one project that more than a handful of people used, stuff like that would be quashed very quickly.

I love the design of it by the way. Any idea what it will take to move to JSONB for 9.4? 


On Tue, Sep 30, 2014 at 7:22 AM, Felix Kunde <felix-kunde@gmx.de> wrote:
Hey
 
yes i'm adding an additional key to each of my tables. First i wanted to use the primary key as one column in my audit_log table, but in some of my tables the PK consists of more than one column. Plus it's nice to have one key that is called the same over all tables.
 
To get a former state for one row at date x I need to join the latest delta BEFORE date x with each delta AFTER date x. If I would log complete rows, this joining part would not be neccessary, but as I usually work with spatial databases that have complex geometries and also image files, this strategy is too harddisk consuming.
 
If there are more users following a similar approach, I wonder why we not throw all the good ideas together, to have one solution that is tested, maintained and improved by more developpers. This would be great.
 
Felix
 

Gesendet: Montag, 29. September 2014 um 23:25 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com>
An: "Felix Kunde" <felix-kunde@gmx.de>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Thank you Felix, Gavin, and Jonathan for your responses.
 
Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record with the original row? Where's the PK stored, if it wasn't part of the delta?
 
Felix, thank you very much for the example code. I took a look at your table schemas. I need to study it more, but it looks like the way you're handling the PK, is you're adding a separate synthethic key (audit_id) to each table that's being versioned. And then storing that key along with the delta.
 
So then to find all the versions of a given row, you just need to join the audit row with the schema_name.table_name.audit_id column. Is that right? The only potential drawback there is there's no referential integrity between the audit_log.audit_id and the actual table.
 
I do like that approach very much though, in that it eliminates the need to interrogate the json data in order to perform most queries.
 
AH
 
 
 
On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-kunde@gmx.de> wrote:Hey
 
i've also tried to implement a database versioning using JSON to log changes in tables. Here it is: https://github.com/fxku/audit[https://github.com/fxku/audit]
I've got two versioning tables, one storing information about all transactions that happened and one where i put the JSON logs of row changes of each table. I'm only logging old values and not complete rows.
 
Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs or TABLES. This database state could then be indexed in order to work with it. You can also reset the production state to the recreated past state.
 
Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the database so I can't say if the recreation process scales well. On downside I've realised is that using the json_agg function has limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB.

There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain transactions. I'm also thinking of parallel versioning, e.g. different users are all working with their version of the database and commit their changes to the production state. As I've got a unique history ID for each table and each row, I should be able to map the affected records.

Have a look and tell me what you think of it.

Cheers
Felix
 

Gesendet: Montag, 29. September 2014 um 04:00 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com>
An: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: [GENERAL] table versioning approach (not auditing)

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).
 
I have studied these two audit trigger examples:
https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger]
https://wiki.postgresql.org/wiki/Audit_trigger_91plus
 
I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.
 
My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.
 
But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record
 
#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
 
#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).
 
So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?
 
Thanks
 
 


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: table versioning approach (not auditing)

From
"Felix Kunde"
Date:
Hey there. Thank you very much for that fix! Thats why I'd like to have a joint development and joint testing. It's way
moreconvincing for users to go for a solution that is tested by some experts than just by a random developer :)  

I'm open to create a new project and push the code there. Don't care about the name. Then we might figure out which
partsare already good, which parts could be improved and where to go next. I think switching to JSONB for example will
beeasy, as it offers the same functions than JSON afaik. 
 

Gesendet: Dienstag, 30. September 2014 um 21:16 Uhr
Von: "Adam Brusselback" <adambrusselback@gmail.com>
An: "Felix Kunde" <felix-kunde@gmx.de>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Felix, I'd love to see a single, well maintained project. For example, I just found yours, and gave it a shot today
afterseeing this post.  I found a bug when an update command is issued, but the old and new values are all the same. 
Thetrigger will blow up.  I've got a fix for that, but if we had one project that more than a handful of people used,
stufflike that would be quashed very quickly. 
 
I love the design of it by the way. Any idea what it will take to move to JSONB for 9.4? 
 
 
On Tue, Sep 30, 2014 at 7:22 AM, Felix Kunde <felix-kunde@gmx.de> wrote:Hey
 
yes i'm adding an additional key to each of my tables. First i wanted to use the primary key as one column in my
audit_logtable, but in some of my tables the PK consists of more than one column. Plus it's nice to have one key that
iscalled the same over all tables. 
 
To get a former state for one row at date x I need to join the latest delta BEFORE date x with each delta AFTER date x.
IfI would log complete rows, this joining part would not be neccessary, but as I usually work with spatial databases
thathave complex geometries and also image files, this strategy is too harddisk consuming. 
 
If there are more users following a similar approach, I wonder why we not throw all the good ideas together, to have
onesolution that is tested, maintained and improved by more developpers. This would be great. 
 
Felix
 

Gesendet: Montag, 29. September 2014 um 23:25 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com[abelardhoffman@gmail.com]>
An: "Felix Kunde" <felix-kunde@gmx.de[felix-kunde@gmx.de]>
Cc: "pgsql-general@postgresql.org[pgsql-general@postgresql.org]"
<pgsql-general@postgresql.org[pgsql-general@postgresql.org]>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Thank you Felix, Gavin, and Jonathan for your responses.
 
Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record
withthe original row? Where's the PK stored, if it wasn't part of the delta? 
 
Felix, thank you very much for the example code. I took a look at your table schemas. I need to study it more, but it
lookslike the way you're handling the PK, is you're adding a separate synthethic key (audit_id) to each table that's
beingversioned. And then storing that key along with the delta. 
 
So then to find all the versions of a given row, you just need to join the audit row with the
schema_name.table_name.audit_idcolumn. Is that right? The only potential drawback there is there's no referential
integritybetween the audit_log.audit_id and the actual table. 
 
I do like that approach very much though, in that it eliminates the need to interrogate the json data in order to
performmost queries. 
 
AH
 
 
 
On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-kunde@gmx.de[felix-kunde@gmx.de]> wrote:Hey
 
i've also tried to implement a database versioning using JSON to log changes in tables. Here it is:
https://github.com/fxku/audit[https://github.com/fxku/audit][https://github.com/fxku/audit[https://github.com/fxku/audit]]
I've got two versioning tables, one storing information about all transactions that happened and one where i put the
JSONlogs of row changes of each table. I'm only logging old values and not complete rows. 
 
Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs
orTABLES. This database state could then be indexed in order to work with it. You can also reset the production state
tothe recreated past state. 
 
Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the
databaseso I can't say if the recreation process scales well. On downside I've realised is that using the json_agg
functionhas limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB. 

There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain
transactions.I'm also thinking of parallel versioning, e.g. different users are all working with their version of the
databaseand commit their changes to the production state. As I've got a unique history ID for each table and each row,
Ishould be able to map the affected records. 

Have a look and tell me what you think of it.

Cheers
Felix
 

Gesendet: Montag, 29. September 2014 um 04:00 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com[abelardhoffman@gmail.com]>
An: "pgsql-general@postgresql.org[pgsql-general@postgresql.org]"
<pgsql-general@postgresql.org[pgsql-general@postgresql.org]>
Betreff: [GENERAL] table versioning approach (not auditing)

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes
whennecessary (customer service makes an incorrect edit, etc.). 
 
I have studied these two audit trigger examples:

https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger][https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger]]
https://wiki.postgresql.org/wiki/Audit_trigger_91plus
 
I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update
orinsert is done. 
 
My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the
rowchanges (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of
allversioned changes very easily. 
 
But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record
 
#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
 
#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g.,
findall changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is
inthe json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all
thoserelationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form). 
 
So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate
versioningtables associated with each real table? Or another approach? 
 
Thanks
 
 

 

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org[pgsql-general@postgresql.org])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general[http://www.postgresql.org/mailpref/pgsql-general]


Re: table versioning approach (not auditing)

From
Adam Brusselback
Date:
I know we're kinda hijacking this thread, so sorry for that.  If you'd like to do that, i'd be more than happy to use it and push any fixes / changes upstream.  I don't have much of a preference on the name either, as long as it's something that makes sense. 

I would consider myself far from an expert though! Either way, more people using a single solution is a good thing.

As a side note, I did some benchmarking this morning and wanted to share the results:
pgbench -i -s 140 -U postgres pgbench

pgbench -c 4 -j 4 -T 600 -U postgres pgbench
no auditing tps: 2854
NOTE: Accounts are audited
auditing tps: 1278

pgbench -c 2 -j 2 -N -T 300 -U postgres pgbench
no auditing tps: 2504
NOTE: Accounts are audited
auditing tps: 822

pgbench -c 2 -j 2 -T 300 -U postgres pgbench
no auditing tps: 1836
NOTE: branches and tellers are audited, accounts are not
auditing tps: 505

I'd love to see if there are some easy wins to boost the performance.

On Wed, Oct 1, 2014 at 5:19 AM, Felix Kunde <felix-kunde@gmx.de> wrote:
Hey there. Thank you very much for that fix! Thats why I'd like to have a joint development and joint testing. It's way more convincing for users to go for a solution that is tested by some experts than just by a random developer :)

I'm open to create a new project and push the code there. Don't care about the name. Then we might figure out which parts are already good, which parts could be improved and where to go next. I think switching to JSONB for example will be easy, as it offers the same functions than JSON afaik.
 

Gesendet: Dienstag, 30. September 2014 um 21:16 Uhr
Von: "Adam Brusselback" <adambrusselback@gmail.com>
An: "Felix Kunde" <felix-kunde@gmx.de>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Felix, I'd love to see a single, well maintained project. For example, I just found yours, and gave it a shot today after seeing this post.  I found a bug when an update command is issued, but the old and new values are all the same.  The trigger will blow up.  I've got a fix for that, but if we had one project that more than a handful of people used, stuff like that would be quashed very quickly.
 
I love the design of it by the way. Any idea what it will take to move to JSONB for 9.4? 
 
 
On Tue, Sep 30, 2014 at 7:22 AM, Felix Kunde <felix-kunde@gmx.de> wrote:Hey
 
yes i'm adding an additional key to each of my tables. First i wanted to use the primary key as one column in my audit_log table, but in some of my tables the PK consists of more than one column. Plus it's nice to have one key that is called the same over all tables.
 
To get a former state for one row at date x I need to join the latest delta BEFORE date x with each delta AFTER date x. If I would log complete rows, this joining part would not be neccessary, but as I usually work with spatial databases that have complex geometries and also image files, this strategy is too harddisk consuming.
 
If there are more users following a similar approach, I wonder why we not throw all the good ideas together, to have one solution that is tested, maintained and improved by more developpers. This would be great.
 
Felix
 

Gesendet: Montag, 29. September 2014 um 23:25 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com[abelardhoffman@gmail.com]>
An: "Felix Kunde" <felix-kunde@gmx.de[felix-kunde@gmx.de]>
Cc: "pgsql-general@postgresql.org[pgsql-general@postgresql.org]" <pgsql-general@postgresql.org[pgsql-general@postgresql.org]>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Thank you Felix, Gavin, and Jonathan for your responses.
 
Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record with the original row? Where's the PK stored, if it wasn't part of the delta?
 
Felix, thank you very much for the example code. I took a look at your table schemas. I need to study it more, but it looks like the way you're handling the PK, is you're adding a separate synthethic key (audit_id) to each table that's being versioned. And then storing that key along with the delta.
 
So then to find all the versions of a given row, you just need to join the audit row with the schema_name.table_name.audit_id column. Is that right? The only potential drawback there is there's no referential integrity between the audit_log.audit_id and the actual table.
 
I do like that approach very much though, in that it eliminates the need to interrogate the json data in order to perform most queries.
 
AH
 
 
 
On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-kunde@gmx.de[felix-kunde@gmx.de]> wrote:Hey
 
i've also tried to implement a database versioning using JSON to log changes in tables. Here it is: https://github.com/fxku/audit[https://github.com/fxku/audit][https://github.com/fxku/audit[https://github.com/fxku/audit]]
I've got two versioning tables, one storing information about all transactions that happened and one where i put the JSON logs of row changes of each table. I'm only logging old values and not complete rows.
 
Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs or TABLES. This database state could then be indexed in order to work with it. You can also reset the production state to the recreated past state.
 
Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the database so I can't say if the recreation process scales well. On downside I've realised is that using the json_agg function has limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB.

There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain transactions. I'm also thinking of parallel versioning, e.g. different users are all working with their version of the database and commit their changes to the production state. As I've got a unique history ID for each table and each row, I should be able to map the affected records.

Have a look and tell me what you think of it.

Cheers
Felix
 

Gesendet: Montag, 29. September 2014 um 04:00 Uhr
Von: "Abelard Hoffman" <abelardhoffman@gmail.com[abelardhoffman@gmail.com]>
An: "pgsql-general@postgresql.org[pgsql-general@postgresql.org]" <pgsql-general@postgresql.org[pgsql-general@postgresql.org]>
Betreff: [GENERAL] table versioning approach (not auditing)

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).
 
I have studied these two audit trigger examples:
https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger][https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger]]
https://wiki.postgresql.org/wiki/Audit_trigger_91plus
 
I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.
 
My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.
 
But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record
 
#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
 
#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).
 
So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?
 
Thanks
 
 

 

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org[pgsql-general@postgresql.org])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general[http://www.postgresql.org/mailpref/pgsql-general]

Re: table versioning approach (not auditing)

From
Adam Brusselback
Date:
Testing that now.  Initial results are not looking too performant. 
I have one single table which had 234575 updates done to it.  I am rolling back 13093 of them.  It's been running 20 min now, using 100% of a single core, and almost 0 disk.  No idea how long it'll run at this point.

This is on an i5 desktop with 16 gigs of ram and an ssd.

This is a pretty good test though, as it's a real world use case (even if the data was generated with PGBench).  We now know that area needs some work before it can be used for anything more than a toy database.

Thanks,
-Adam

On Thu, Oct 2, 2014 at 7:52 AM, Felix Kunde <felix-kunde@gmx.de> wrote:
Hey there
 
Thanks again for the fix. I was able to merge it into my repo.
Also thanks for benchmarking audit. Very interesting results.
I wonder how the recreation of former database states scales when processing many deltas.
Haven’t done a lot of testing in that direction.
 
I will transfer the code soon to a more public repo on GitHub. As far as I see I have to create an organization for that.
 
Cheers
Felix
 
Gesendet: Mittwoch, 01. Oktober 2014 um 17:09 Uhr

Von: "Adam Brusselback" <adambrusselback@gmail.com>
An: "Felix Kunde" <felix-kunde@gmx.de>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: Re: [GENERAL] table versioning approach (not auditing)
I know we're kinda hijacking this thread, so sorry for that.  If you'd like to do that, i'd be more than happy to use it and push any fixes / changes upstream.  I don't have much of a preference on the name either, as long as it's something that makes sense. 

I would consider myself far from an expert though! Either way, more people using a single solution is a good thing.
 
As a side note, I did some benchmarking this morning and wanted to share the results:
pgbench -i -s 140 -U postgres pgbench

pgbench -c 4 -j 4 -T 600 -U postgres pgbench
no auditing tps: 2854
NOTE: Accounts are audited
auditing tps: 1278

pgbench -c 2 -j 2 -N -T 300 -U postgres pgbench
no auditing tps: 2504
NOTE: Accounts are audited
auditing tps: 822

pgbench -c 2 -j 2 -T 300 -U postgres pgbench
no auditing tps: 1836
NOTE: branches and tellers are audited, accounts are not
auditing tps: 505
 
I'd love to see if there are some easy wins to boost the performance.
 
On Wed, Oct 1, 2014 at 5:19 AM, Felix Kunde <felix-kunde@gmx.de> wrote:
Hey there. Thank you very much for that fix! Thats why I'd like to have a joint development and joint testing. It's way more convincing for users to go for a solution that is tested by some experts than just by a random developer :)

I'm open to create a new project and push the code there. Don't care about the name. Then we might figure out which parts are already good, which parts could be improved and where to go next. I think switching to JSONB for example will be easy, as it offers the same functions than JSON afaik.
 

Gesendet: Dienstag, 30. September 2014 um 21:16 Uhr
Von: "Adam Brusselback" <adambrusselback@gmail.com>
An: "Felix Kunde" <felix-kunde@gmx.de>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Felix, I'd love to see a single, well maintained project. For example, I just found yours, and gave it a shot today after seeing this post.  I found a bug when an update command is issued, but the old and new values are all the same.  The trigger will blow up.  I've got a fix for that, but if we had one project that more than a handful of people used, stuff like that would be quashed very quickly.
 
I love the design of it by the way. Any idea what it will take to move to JSONB for 9.4? 
 
 
On Tue, Sep 30, 2014 at 7:22 AM, Felix Kunde <felix-kunde@gmx.de> wrote:Hey
 
yes i'm adding an additional key to each of my tables. First i wanted to use the primary key as one column in my audit_log table, but in some of my tables the PK consists of more than one column. Plus it's nice to have one key that is called the same over all tables.
 
To get a former state for one row at date x I need to join the latest delta BEFORE date x with each delta AFTER date x. If I would log complete rows, this joining part would not be neccessary, but as I usually work with spatial databases that have complex geometries and also image files, this strategy is too harddisk consuming.
 
If there are more users following a similar approach, I wonder why we not throw all the good ideas together, to have one solution that is tested, maintained and improved by more developpers. This would be great.
 
Felix
 

Gesendet: Montag, 29. September 2014 um 23:25 Uhr

Von: "Abelard Hoffman" <abelardhoffman@gmail.com[abelardhoffman@gmail.com]>
An: "Felix Kunde" <felix-kunde@gmx.de[felix-kunde@gmx.de]>
Cc: "pgsql-general@postgresql.org[pgsql-general@postgresql.org]" <pgsql-general@postgresql.org[pgsql-general@postgresql.org]>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Thank you Felix, Gavin, and Jonathan for your responses.
 
Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record with the original row? Where's the PK stored, if it wasn't part of the delta?
 
Felix, thank you very much for the example code. I took a look at your table schemas. I need to study it more, but it looks like the way you're handling the PK, is you're adding a separate synthethic key (audit_id) to each table that's being versioned. And then storing that key along with the delta.
 
So then to find all the versions of a given row, you just need to join the audit row with the schema_name.table_name.audit_id column. Is that right? The only potential drawback there is there's no referential integrity between the audit_log.audit_id and the actual table.
 
I do like that approach very much though, in that it eliminates the need to interrogate the json data in order to perform most queries.
 
AH
 
 
 

On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-kunde@gmx.de[felix-kunde@gmx.de]> wrote:Hey
 
i've also tried to implement a database versioning using JSON to log changes in tables. Here it is: https://github.com/fxku/audit[https://github.com/fxku/audit][https://github.com/fxku/audit[https://github.com/fxku/audit]]
I've got two versioning tables, one storing information about all transactions that happened and one where i put the JSON logs of row changes of each table. I'm only logging old values and not complete rows.
 
Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs or TABLES. This database state could then be indexed in order to work with it. You can also reset the production state to the recreated past state.
 
Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the database so I can't say if the recreation process scales well. On downside I've realised is that using the json_agg function has limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB.

There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain transactions. I'm also thinking of parallel versioning, e.g. different users are all working with their version of the database and commit their changes to the production state. As I've got a unique history ID for each table and each row, I should be able to map the affected records.

Have a look and tell me what you think of it.

Cheers
Felix
 

Gesendet: Montag, 29. September 2014 um 04:00 Uhr

Von: "Abelard Hoffman" <abelardhoffman@gmail.com[abelardhoffman@gmail.com]>
An: "pgsql-general@postgresql.org[pgsql-general@postgresql.org]" <pgsql-general@postgresql.org[pgsql-general@postgresql.org]>
Betreff: [GENERAL] table versioning approach (not auditing)

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).
 
I have studied these two audit trigger examples:

https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger][https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger]]
https://wiki.postgresql.org/wiki/Audit_trigger_91plus
 
I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.
 
My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.
 
But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record
 
#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
 
#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).
 
So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?
 
Thanks
 
 

 

--

Sent via pgsql-general mailing list (pgsql-general@postgresql.org[pgsql-general@postgresql.org])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general[http://www.postgresql.org/mailpref/pgsql-general]

Re: table versioning approach (not auditing)

From
Adam Brusselback
Date:
Ended up running for 28 min, but it did work as expected.

On Thu, Oct 2, 2014 at 10:27 AM, Adam Brusselback <adambrusselback@gmail.com> wrote:
Testing that now.  Initial results are not looking too performant. 
I have one single table which had 234575 updates done to it.  I am rolling back 13093 of them.  It's been running 20 min now, using 100% of a single core, and almost 0 disk.  No idea how long it'll run at this point.

This is on an i5 desktop with 16 gigs of ram and an ssd.

This is a pretty good test though, as it's a real world use case (even if the data was generated with PGBench).  We now know that area needs some work before it can be used for anything more than a toy database.

Thanks,
-Adam

On Thu, Oct 2, 2014 at 7:52 AM, Felix Kunde <felix-kunde@gmx.de> wrote:
Hey there
 
Thanks again for the fix. I was able to merge it into my repo.
Also thanks for benchmarking audit. Very interesting results.
I wonder how the recreation of former database states scales when processing many deltas.
Haven’t done a lot of testing in that direction.
 
I will transfer the code soon to a more public repo on GitHub. As far as I see I have to create an organization for that.
 
Cheers
Felix
 
Gesendet: Mittwoch, 01. Oktober 2014 um 17:09 Uhr

Von: "Adam Brusselback" <adambrusselback@gmail.com>
An: "Felix Kunde" <felix-kunde@gmx.de>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: Re: [GENERAL] table versioning approach (not auditing)
I know we're kinda hijacking this thread, so sorry for that.  If you'd like to do that, i'd be more than happy to use it and push any fixes / changes upstream.  I don't have much of a preference on the name either, as long as it's something that makes sense. 

I would consider myself far from an expert though! Either way, more people using a single solution is a good thing.
 
As a side note, I did some benchmarking this morning and wanted to share the results:
pgbench -i -s 140 -U postgres pgbench

pgbench -c 4 -j 4 -T 600 -U postgres pgbench
no auditing tps: 2854
NOTE: Accounts are audited
auditing tps: 1278

pgbench -c 2 -j 2 -N -T 300 -U postgres pgbench
no auditing tps: 2504
NOTE: Accounts are audited
auditing tps: 822

pgbench -c 2 -j 2 -T 300 -U postgres pgbench
no auditing tps: 1836
NOTE: branches and tellers are audited, accounts are not
auditing tps: 505
 
I'd love to see if there are some easy wins to boost the performance.
 
On Wed, Oct 1, 2014 at 5:19 AM, Felix Kunde <felix-kunde@gmx.de> wrote:
Hey there. Thank you very much for that fix! Thats why I'd like to have a joint development and joint testing. It's way more convincing for users to go for a solution that is tested by some experts than just by a random developer :)

I'm open to create a new project and push the code there. Don't care about the name. Then we might figure out which parts are already good, which parts could be improved and where to go next. I think switching to JSONB for example will be easy, as it offers the same functions than JSON afaik.
 

Gesendet: Dienstag, 30. September 2014 um 21:16 Uhr
Von: "Adam Brusselback" <adambrusselback@gmail.com>
An: "Felix Kunde" <felix-kunde@gmx.de>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Felix, I'd love to see a single, well maintained project. For example, I just found yours, and gave it a shot today after seeing this post.  I found a bug when an update command is issued, but the old and new values are all the same.  The trigger will blow up.  I've got a fix for that, but if we had one project that more than a handful of people used, stuff like that would be quashed very quickly.
 
I love the design of it by the way. Any idea what it will take to move to JSONB for 9.4? 
 
 
On Tue, Sep 30, 2014 at 7:22 AM, Felix Kunde <felix-kunde@gmx.de> wrote:Hey
 
yes i'm adding an additional key to each of my tables. First i wanted to use the primary key as one column in my audit_log table, but in some of my tables the PK consists of more than one column. Plus it's nice to have one key that is called the same over all tables.
 
To get a former state for one row at date x I need to join the latest delta BEFORE date x with each delta AFTER date x. If I would log complete rows, this joining part would not be neccessary, but as I usually work with spatial databases that have complex geometries and also image files, this strategy is too harddisk consuming.
 
If there are more users following a similar approach, I wonder why we not throw all the good ideas together, to have one solution that is tested, maintained and improved by more developpers. This would be great.
 
Felix
 

Gesendet: Montag, 29. September 2014 um 23:25 Uhr

Von: "Abelard Hoffman" <abelardhoffman@gmail.com[abelardhoffman@gmail.com]>
An: "Felix Kunde" <felix-kunde@gmx.de[felix-kunde@gmx.de]>
Cc: "pgsql-general@postgresql.org[pgsql-general@postgresql.org]" <pgsql-general@postgresql.org[pgsql-general@postgresql.org]>
Betreff: Re: [GENERAL] table versioning approach (not auditing)

Thank you Felix, Gavin, and Jonathan for your responses.
 
Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record with the original row? Where's the PK stored, if it wasn't part of the delta?
 
Felix, thank you very much for the example code. I took a look at your table schemas. I need to study it more, but it looks like the way you're handling the PK, is you're adding a separate synthethic key (audit_id) to each table that's being versioned. And then storing that key along with the delta.
 
So then to find all the versions of a given row, you just need to join the audit row with the schema_name.table_name.audit_id column. Is that right? The only potential drawback there is there's no referential integrity between the audit_log.audit_id and the actual table.
 
I do like that approach very much though, in that it eliminates the need to interrogate the json data in order to perform most queries.
 
AH
 
 
 

On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-kunde@gmx.de[felix-kunde@gmx.de]> wrote:Hey
 
i've also tried to implement a database versioning using JSON to log changes in tables. Here it is: https://github.com/fxku/audit[https://github.com/fxku/audit][https://github.com/fxku/audit[https://github.com/fxku/audit]]
I've got two versioning tables, one storing information about all transactions that happened and one where i put the JSON logs of row changes of each table. I'm only logging old values and not complete rows.
 
Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs or TABLES. This database state could then be indexed in order to work with it. You can also reset the production state to the recreated past state.
 
Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the database so I can't say if the recreation process scales well. On downside I've realised is that using the json_agg function has limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB.

There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain transactions. I'm also thinking of parallel versioning, e.g. different users are all working with their version of the database and commit their changes to the production state. As I've got a unique history ID for each table and each row, I should be able to map the affected records.

Have a look and tell me what you think of it.

Cheers
Felix
 

Gesendet: Montag, 29. September 2014 um 04:00 Uhr

Von: "Abelard Hoffman" <abelardhoffman@gmail.com[abelardhoffman@gmail.com]>
An: "pgsql-general@postgresql.org[pgsql-general@postgresql.org]" <pgsql-general@postgresql.org[pgsql-general@postgresql.org]>
Betreff: [GENERAL] table versioning approach (not auditing)

Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).
 
I have studied these two audit trigger examples:

https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger][https://wiki.postgresql.org/wiki/Audit_trigger[https://wiki.postgresql.org/wiki/Audit_trigger]]
https://wiki.postgresql.org/wiki/Audit_trigger_91plus
 
I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the current version
2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.
 
My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.
 
But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record
 
#1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
 
#2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).
 
So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?
 
Thanks
 
 

 

--

Sent via pgsql-general mailing list (pgsql-general@postgresql.org[pgsql-general@postgresql.org])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general[http://www.postgresql.org/mailpref/pgsql-general]


Re: table versioning approach (not auditing)

From
Jim Nasby
Date:
On 10/2/14, 9:27 AM, Adam Brusselback wrote:
> i've also tried to implement a database versioning using JSON to log changes in tables. Here it is:
https://github.com/fxku/audit[https://github.com/fxku/audit][https://github.com/fxku/audit[https://github.com/fxku/audit]]
<https://github.com/fxku/audit%5Bhttps://github.com/fxku/audit%5D%5Bhttps://github.com/fxku/audit%5Bhttps://github.com/fxku/audit%5D%5D>
> I've got two versioning tables, one storing information about all transactions that happened and one where i put the
JSONlogs of row changes of each table. I'm only logging old values and not complete rows. 
>
> Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs,
MVIEWsor TABLES. This database state could then be indexed in order to work with it. You can also reset the production
stateto the recreated past state. 
>
> Unfortunately I've got no time to further work on it at the moment + I have not done tests with many changes in the
databaseso I can't say if the recreation process scales well. On downside I've realised is that using the json_agg
functionhas limits when I've got binary data. It gets too long. So I'm really looking forward using JSONB. 
>
> There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain
transactions.I'm also thinking of parallel versioning, e.g. different users are all working with their version of the
databaseand commit their changes to the production state. As I've got a unique history ID for each table and each row,
Ishould be able to map the affected records. 
Sorry I'm coming late to this thread. I agree that getting interested people together would be a good idea. Is there
anothermailing list we can do that with? 

Versioning is also something I've interested in, and have put a lot of thought into (if not much actual code :( ). I'll
alsomake some general comments, if I may... 


I think timestamps should be *heavily avoided* in versioning, because they are frequently the wrong way to solve a
problem.There are many use cases where you're trying to answer "What values were in place when X happened", and the
simplest,most fool-proof way to answer that is that when you create a record for X, part of that record is a "history
ID"that shows you the exact data used. For example, if you're creating an invoicing system that has versioning of
customeraddresses you would not try and join an invoice with it's address using a timestamp; you would put an actual
address_history_idin the invoice table. 

I thought I saw a reference to versioning sets of information. This is perhaps the trickiest part. You first have to
thinkabout the non-versioned sets (ie: a customer may have many phone numbers) before you think about versioning the
set.In this example, you want the history of the *set* of phone numbers, not of each individual number. Design it with
fullduplication of data first, don't think about normalizing until you have the full set versioning design. 

I understand the generic appeal of using something like JSON, but in reality I don't see it working terribly well. It's
likelyto be on the slow side, and it'll also be difficult to query from. Instead, I think it makes more sense to create
actualhistory tables that derive their definition from the base table. I've got code that extracts information
(column_name,data type, nullability) from a table (or even a table definition), and it's not that complex. With the
workthat's been done on capturing DDL changes it shouldn't be too hard to handle that automatically. 


Re: table versioning approach (not auditing)

From
Gavin Flower
Date:
On 07/10/14 10:47, Jim Nasby wrote:
> On 10/2/14, 9:27 AM, Adam Brusselback wrote:
>> i've also tried to implement a database versioning using JSON to log
>> changes in tables. Here it is:
>>
https://github.com/fxku/audit[https://github.com/fxku/audit][https://github.com/fxku/audit[https://github.com/fxku/audit]]

>>
<https://github.com/fxku/audit%5Bhttps://github.com/fxku/audit%5D%5Bhttps://github.com/fxku/audit%5Bhttps://github.com/fxku/audit%5D%5D>

>>
>> I've got two versioning tables, one storing information about all
>> transactions that happened and one where i put the JSON logs of row
>> changes of each table. I'm only logging old values and not complete
>> rows.
>>
>> Then I got a function that recreates a database state at a given time
>> into a separate schema - either to VIEWs, MVIEWs or TABLES. This
>> database state could then be indexed in order to work with it. You
>> can also reset the production state to the recreated past state.
>>
>> Unfortunately I've got no time to further work on it at the moment +
>> I have not done tests with many changes in the database so I can't
>> say if the recreation process scales well. On downside I've realised
>> is that using the json_agg function has limits when I've got binary
>> data. It gets too long. So I'm really looking forward using JSONB.
>>
>> There are more plans in my mind. By having a Transaction_Log table it
>> should be possible to revert only certain transactions. I'm also
>> thinking of parallel versioning, e.g. different users are all working
>> with their version of the database and commit their changes to the
>> production state. As I've got a unique history ID for each table and
>> each row, I should be able to map the affected records.
> Sorry I'm coming late to this thread. I agree that getting interested
> people together would be a good idea. Is there another mailing list we
> can do that with?
>
> Versioning is also something I've interested in, and have put a lot of
> thought into (if not much actual code :( ). I'll also make some
> general comments, if I may...
>
>
> I think timestamps should be *heavily avoided* in versioning, because
> they are frequently the wrong way to solve a problem. There are many
> use cases where you're trying to answer "What values were in place
> when X happened", and the simplest, most fool-proof way to answer that
> is that when you create a record for X, part of that record is a
> "history ID" that shows you the exact data used. For example, if
> you're creating an invoicing system that has versioning of customer
> addresses you would not try and join an invoice with it's address
> using a timestamp; you would put an actual address_history_id in the
> invoice table.
>
> I thought I saw a reference to versioning sets of information. This is
> perhaps the trickiest part. You first have to think about the
> non-versioned sets (ie: a customer may have many phone numbers) before
> you think about versioning the set. In this example, you want the
> history of the *set* of phone numbers, not of each individual number.
> Design it with full duplication of data first, don't think about
> normalizing until you have the full set versioning design.
>
> I understand the generic appeal of using something like JSON, but in
> reality I don't see it working terribly well. It's likely to be on the
> slow side, and it'll also be difficult to query from. Instead, I think
> it makes more sense to create actual history tables that derive their
> definition from the base table. I've got code that extracts
> information (column_name, data type, nullability) from a table (or
> even a table definition), and it's not that complex. With the work
> that's been done on capturing DDL changes it shouldn't be too hard to
> handle that automatically.
>
>
Yeah, my design was quite extensive and ensured all relevant information
was associated with the 'history id' (still need timestamps to find the
appropriate value), the powers that be watered it down somewhat (but
that was outside my control). Performance was not too critical, probably
less than 10 transactions per second at peak times.  JSON, had yet to be
invented, but we would not have used it anyhow.

Even if timestamps are used extensively, you'd have to be careful
joining on them. You may have information valid at T1 and changing at
T3, but the transaction has T2, where T1 < T2 < T3 - the appropriate set
of data would be associated with T1, would would not get anywhere trying
to find data with a timestamp of T2 (unless you were very lucky!).

Actually things like phone numbers are tricky.  Sometimes you may want
to use the current phone number, and not the one extant at that time (as
you want to phone the contact now), or you may still want the old phone
number (was the call to a specific number at date/time legitimate & who
do we charge the cost of the call too).


Re: table versioning approach (not auditing)

From
Jim Nasby
Date:
On 10/6/14, 6:10 PM, Gavin Flower wrote:
> Even if timestamps are used extensively, you'd have to be careful joining on them. You may have information valid at
T1and changing at T3, but the transaction has T2, where T1 < T2 < T3 - the appropriate set of data would be associated
withT1, would would not get anywhere trying to find data with a timestamp of T2 (unless you were very lucky!). 

Yeah, this is why I think timestamps need to be shunned in favor of explicit pointers. Anyone that thinks timestamps
aregood enough hasn't thought the problem through completely. :) 

I also think there's potential value to storing full transaction information (presumably in a separate table):
txid_current(),txid_current_snapshot(), now(), current_user, maybe some other stuff (client IP address?). That way you
cantell exactly what created a history record. With appropriate shenanigans you can theoretically determine exactly
whatother history data would be visible at that time without using pointers (but man would that bu ugly!) 

> Actually things like phone numbers are tricky.  Sometimes you may want to use the current phone number, and not the
oneextant at that time (as you want to phone the contact now), or you may still want the old phone number (was the call
toa specific number at date/time legitimate & who do we charge the cost of the call too). 

Yeah, I'm pretty convinced at this point that history/versioning should be built on top of a schema that always
containsthe current information, if for no other reason than so you always have a PK that points to what's current in
additionto your history PKs. 
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: table versioning approach (not auditing)

From
Gavin Flower
Date:
On 08/10/14 13:29, Jim Nasby wrote:
On 10/6/14, 6:10 PM, Gavin Flower wrote:
Even if timestamps are used extensively, you'd have to be careful joining on them. You may have information valid at T1 and changing at T3, but the transaction has T2, where T1 < T2 < T3 - the appropriate set of data would be associated with T1, would would not get anywhere trying to find data with a timestamp of T2 (unless you were very lucky!).

Yeah, this is why I think timestamps need to be shunned in favor of explicit pointers. Anyone that thinks timestamps are good enough hasn't thought the problem through completely. :)

I also think there's potential value to storing full transaction information (presumably in a separate table): txid_current(), txid_current_snapshot(), now(), current_user, maybe some other stuff (client IP address?). That way you can tell exactly what created a history record. With appropriate shenanigans you can theoretically determine exactly what other history data would be visible at that time without using pointers (but man would that bu ugly!)

Actually things like phone numbers are tricky.  Sometimes you may want to use the current phone number, and not the one extant at that time (as you want to phone the contact now), or you may still want the old phone number (was the call to a specific number at date/time legitimate & who do we charge the cost of the call too).

Yeah, I'm pretty convinced at this point that history/versioning should be built on top of a schema that always contains the current information, if for no other reason than so you always have a PK that points to what's current in addition to your history PKs.
One of the motivations for having an effective_date, was being able to put changes into the database ahead of time.

Finding the current value uses the same logic a find the value at any other date/time - so you don't need a special schema to distinguish the current state from anything else.  For example:

DROP TABLE IF EXISTS stock;

CREATE TABLE stock
(
    id              text,
    effective_date  timestamptz,
    price           numeric
);

INSERT INTO stock
(
   id,
   effective_date,
   price
)

VALUES
    ('y88', '2014-10-01', 12.0),
    ('x42', '2014-10-01', 12.1),
    ('x42', '2014-10-08', 12.2),
    ('x42', '2014-10-10', 12.3),
    ('x42', '2014-10-16', 12.4),
    ('z42', '2014-10-19', 12.5),
    ('z49', '2014-10-01', 12.6),
    ('z49', '2014-10-30', 12.7),
    ('z77', '2014-10-01', 12.8);
   
CREATE UNIQUE INDEX primary_key ON stock (id ASC, effective_date DESC);
       
SELECT
    s.price
FROM
    stock s
WHERE
        s.id = 'x42'
    AND s.effective_date <= '2014-10-11'
ORDER BY
    s.effective_date DESC
LIMIT 1;


Cheers,
Gavin

Re: table versioning approach (not auditing)

From
Jim Nasby
Date:
On 10/7/14, 10:40 PM, Gavin Flower wrote:
>> Yeah, I'm pretty convinced at this point that history/versioning should be built on top of a schema that always
containsthe current information, if for no other reason than so you always have a PK that points to what's current in
additionto your history PKs. 
> One of the motivations for having an effective_date, was being able to put changes into the database ahead of time.

Yeah, allowing for future data makes things more interesting. My first inclination is that it's a completely separate
requirement,and you would track the history of all records that you had at a point in time. Doing that means you can
seethings like someone changing the effective date from Nov. 1 to Dec. 1. But clearly this is an area where you have to
takethe business case into account. 
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com