Thread: Temporal extensions
Hi all, My apologies I couldn't directly respond to the earlier thread on this subject (http://www.postgresql.org/message-id/50D99278.3030704@dc.baikal.ru) but I wasn't subscribed to the list at that point. I've been working on a conversion of several utilities I wrote for another engine, and was wondering if there was any interest in seeing any of them added to contrib/ at some point in the vague undefined future? The github repo for the source is here: https://github.com/waveform80/oliphant The documentation for the extensions (currently) resides here: http://oliphant.readthedocs.org/en/latest/ (Obviously the docs would need conversion to docbook for any official sort of patch; I've used docbook in the past, I mainly wrote the docs in rst because that's what I'm most used to at the moment and it's awfully convenient :) The major extension is "history" which is intended for the tracking of temporal data. This comprises various functions for the creation of history tables, their associated triggers, and utility views for providing alternate transformations of the historical data. The other extensions ("assert", "auth", "merge") all really exist in support of "history" (to some greater or lesser degree) but I'd split them out previously as on the original engine (DB2) they served some purpose and perhaps they can serve some here. However, I'd have no issue removing such dependence and simply merging the relevant code into the "history" extension if that was deemed appropriate (as you can guess, it's "history" I'm really interested in.). The extensions are all written in plpgsql (no C), and there's some rudimentary tests of their functionality under the tests/ dir (which again would need converting/expanding in the event they were to become "official"). That said, having read through the former thread (referenced above) I get the impression there's still plenty I need to think about. I predict some questions are bound to arise, so I'll provide some brief answers introductory below: Q. Why not build on the existing work? Honestly, I didn't think to go looking for it until I considered posting to this list, and it's been enlightening seeing what others have done in this space (doh!). Personally, I started tinkering with this sort of stuff long ago, in a database engine far far away* and the stuff I had, had fulfilled all my needs in this space. So, when it came time to move onto postgres that's where I started (for better or worse). * http://groups.google.com/group/comp.databases.ibm-db2/msg/e84aeb1f6ac87e6c Q. Why are you using two timestamp fields instead of a range? Short answer: performance. I did some tests with a relatively large history data set using the two-field timestamp method, and a range method before starting work on the "history" extension. These tests seemed to indicate that using ranges significantly impacted performance (both writes, and queries). It didn't look terribly difficult to convert the code between the two systems so for the time being I pressed ahead with the two-field method, but I'd love to find out if I was doing something stupid with ranges. Further discussion definitely wanted! Q. Why AFTER triggers instead of BEFORE? Largely because on the original engine BEFORE triggers were limited in functionality, so they had to be AFTER triggers. After reading some of the discussion on the linked thread, this may be a decision I have to re-visit. Q. What about official SQL:2011 syntax? Application time, etc.? Sure - all stuff I'd love to see in Postgres at some point, but not stuff I'm qualified to even begin looking at (given it requires engine alterations). Anyway, that's probably enough for now. Questions, suggestions, criticisms all gratefully received! Dave.
On 4/25/15 7:49 PM, Dave Jones wrote: > I've been working on a conversion of several utilities I wrote for > another engine, and was wondering if there was any interest in seeing > any of them added to contrib/ at some point in the vague undefined future? Not in contrib, no, because there's no reason for these to be tied to specific versions of Postgres. Adding to PGXN would make sense though. (Though, I dislike using timestamps to do change/history tracking, but that's just me...) -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Hi Jim, On 27/04/15 21:48, Jim Nasby wrote: > On 4/25/15 7:49 PM, Dave Jones wrote: >> I've been working on a conversion of several utilities I wrote for >> another engine, and was wondering if there was any interest in seeing >> any of them added to contrib/ at some point in the vague undefined >> future? > > Not in contrib, no, because there's no reason for these to be tied to > specific versions of Postgres. Adding to PGXN would make sense though. Thanks for the reply and suggestion - I've just run across PGXN having finished reading through the all the temporal-related e-mails I could find on pgsql-hackers for the last 5 years or so (specifically, Jeff Davis' temporal extension). PGXN looks like a good place to park this (once I've sorted out a few of the things I've run across while reading - TRUNCATE triggers for it for starters!). > (Though, I dislike using timestamps to do change/history tracking, but > that's just me...) I've been playing around with history tracking (in the context of BI, typically with batch loaded reporting databases) for about 7-8 years now and always found timestamps perfect for the purpose, but are you perhaps referring to using it for audit purposes? If that's the case I'd agree entirely - this is absolutely the wrong tool for such things (which is something I need to put a bit more prominently in the docs - it's buried in the design section at the moment). Or did you mean ranges would be better? They certainly looked intriguing when I started moving this stuff to postgres, and I'd like to re-visit them in the near future as they offer capabilities I don't have with timestamps (such as guaranteeing no overlapping ranges via exclusion constraints) but my initial tests suggested some rather major performance degradation so I put it on the back-burner at first. Thanks, Dave.
On 4/27/15 6:08 PM, Dave Jones wrote: >> (Though, I dislike using timestamps to do change/history tracking, but >> >that's just me...) > I've been playing around with history tracking (in the context of BI, > typically with batch loaded reporting databases) for about 7-8 years now > and always found timestamps perfect for the purpose, but are you perhaps > referring to using it for audit purposes? If that's the case I'd agree > entirely - this is absolutely the wrong tool for such things (which is > something I need to put a bit more prominently in the docs - it's buried > in the design section at the moment). Most warehouses dumb things down to a day level, so it's probably OK there. What I specifically don't like is that using a timestamp to try and determine the order in which something happened is just fraught with gotchas. For starters, now() is locked in when you do a BEGIN, but maybe a newer transaction modifies a table before an older one does. Now the ordering is *backwards*. You have the same problem with using an XID. The only way I've thought of to make this guaranteed safe is to somehow serialize the logging with something like CREATE TABLE customer_history( customer_hid serial primary key -- hid == history_id , previous_customer_hid int referencescustomer_history , customer_id int NOT NULL references customer ... ); CREATE UNIQUE INDEX ... ON customer_history(previous_customer_hid) WHERE previous_customer_hid IS NOT NULL; CREATE UNIQUE INDEX ... ON customer_history(customer_hid) WHERE previous_customer_hid IS NULL; and then have a before trigger enforce NEW.previous_customer_hid := customer_history__get_latest(customer_id) where customer_history__get_latest() will 'walk the chain' starting with the first link customer_id = blah AND previous_customer_id = NULL Because of the indexes that will serialize inserts on a per-customer basis. You could still run into problems with a newer snapshot creating a history record before a transaction with an older snapshot does though. :( Though, if you included txid_current_snapshot() with each record you could probably detect when that happens. > Or did you mean ranges would be better? They certainly looked intriguing > when I started moving this stuff to postgres, and I'd like to re-visit > them in the near future as they offer capabilities I don't have with > timestamps (such as guaranteeing no overlapping ranges via exclusion > constraints) but my initial tests suggested some rather major > performance degradation so I put it on the back-burner at first. If you're going to keep both a start and end for each record you'd definitely want to do it with a range. If you're only keeping the change time then you can handle it differently. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Hi Jim, On 28/04/15 03:49, Jim Nasby wrote: > On 4/27/15 6:08 PM, Dave Jones wrote: >>> (Though, I dislike using timestamps to do change/history tracking, but >>> >that's just me...) >> I've been playing around with history tracking (in the context of BI, >> typically with batch loaded reporting databases) for about 7-8 years now >> and always found timestamps perfect for the purpose, but are you perhaps >> referring to using it for audit purposes? If that's the case I'd agree >> entirely - this is absolutely the wrong tool for such things (which is >> something I need to put a bit more prominently in the docs - it's buried >> in the design section at the moment). > > Most warehouses dumb things down to a day level, so it's probably OK there. That's certainly how I've always used this stuff (and how I've always encouraged it to be used), although I must admit everyone's first reaction is "great, let's use microsecond resolution to capture _everything_!" followed in a few years time by "good grief, my table's huge and I don't need most of the detail in it!". Oh well :) > What I specifically don't like is that using a timestamp to try and > determine the order in which something happened is just fraught with > gotchas. For starters, now() is locked in when you do a BEGIN, but maybe > a newer transaction modifies a table before an older one does. Now the > ordering is *backwards*. You have the same problem with using an XID. > The only way I've thought of to make this guaranteed safe is to somehow > serialize the logging with something like > > CREATE TABLE customer_history( > customer_hid serial primary key -- hid == history_id > , previous_customer_hid int references customer_history > , customer_id int NOT NULL references customer > ... > ); > CREATE UNIQUE INDEX ... ON customer_history(previous_customer_hid) WHERE > previous_customer_hid IS NOT NULL; > CREATE UNIQUE INDEX ... ON customer_history(customer_hid) WHERE > previous_customer_hid IS NULL; > > and then have a before trigger enforce > NEW.previous_customer_hid := customer_history__get_latest(customer_id) > > where customer_history__get_latest() will 'walk the chain' starting with > the first link customer_id = blah AND previous_customer_id = NULL > > Because of the indexes that will serialize inserts on a per-customer > basis. You could still run into problems with a newer snapshot creating > a history record before a transaction with an older snapshot does > though. :( Though, if you included txid_current_snapshot() with each > record you could probably detect when that happens. Yes, I noticed in my read through the older temporal threads that one of the solutions used clock_timestamp() rather than current_timestamp which seemed to be in order to avoid this sort of thing. Can't say I liked the sound of it though - seemed like that would lead to even more inconsistencies (as the timestamp for a changeset would potentially fall in the middle of the transaction that made it ... urgh!). I should make clear in the docs, that this sort of system isn't good for ordering at that level (i.e. it's only accurate for history resolutions significantly longer than any individual transaction length). >> Or did you mean ranges would be better? They certainly looked intriguing >> when I started moving this stuff to postgres, and I'd like to re-visit >> them in the near future as they offer capabilities I don't have with >> timestamps (such as guaranteeing no overlapping ranges via exclusion >> constraints) but my initial tests suggested some rather major >> performance degradation so I put it on the back-burner at first. > > If you're going to keep both a start and end for each record you'd > definitely want to do it with a range. If you're only keeping the change > time then you can handle it differently. Yup, it's keeping a start and end (some of the routines in the extension provide alternate transformations, but the base storage is always a range of timestamps simply because that seems the easiest structure to transform into others in practice). I'll work on a decent test case for the performance issues I ran into (unfortunately the original one is full of proprietary data so I'll have to come up with something similarly sized full of random bits). Dave.