Thread: Temporal extensions

Temporal extensions

From

Dave Jones

Date:

26 April 2015, 00:49:41

Hi all,

My apologies I couldn't directly respond to the earlier thread on this
subject
(http://www.postgresql.org/message-id/50D99278.3030704@dc.baikal.ru) but
I wasn't subscribed to the list at that point.

I've been working on a conversion of several utilities I wrote for
another engine, and was wondering if there was any interest in seeing
any of them added to contrib/ at some point in the vague undefined future?

The github repo for the source is here:
 https://github.com/waveform80/oliphant

The documentation for the extensions (currently) resides here:
 http://oliphant.readthedocs.org/en/latest/

(Obviously the docs would need conversion to docbook for any official
sort of patch; I've used docbook in the past, I mainly wrote the docs in
rst because that's what I'm most used to at the moment and it's awfully
convenient :)

The major extension is "history" which is intended for the tracking of
temporal data. This comprises various functions for the creation of
history tables, their associated triggers, and utility views for
providing alternate transformations of the historical data.

The other extensions ("assert", "auth", "merge") all really exist in
support of "history" (to some greater or lesser degree) but I'd split
them out previously as on the original engine (DB2) they served some
purpose and perhaps they can serve some here. However, I'd have no issue
removing such dependence and simply merging the relevant code into the
"history" extension if that was deemed appropriate (as you can guess,
it's "history" I'm really interested in.).

The extensions are all written in plpgsql (no C), and there's some
rudimentary tests of their functionality under the tests/ dir (which
again would need converting/expanding in the event they were to become
"official"). That said, having read through the former thread
(referenced above) I get the impression there's still plenty I need to
think about.

I predict some questions are bound to arise, so I'll provide some brief
answers introductory below:

Q. Why not build on the existing work?

Honestly, I didn't think to go looking for it until I considered posting
to this list, and it's been enlightening seeing what others have done in
this space (doh!). Personally, I started tinkering with this sort of
stuff long ago, in a database engine far far away* and the stuff I had,
had fulfilled all my needs in this space. So, when it came time to move
onto postgres that's where I started (for better or worse).

* http://groups.google.com/group/comp.databases.ibm-db2/msg/e84aeb1f6ac87e6c

Q. Why are you using two timestamp fields instead of a range?

Short answer: performance. I did some tests with a relatively large
history data set using the two-field timestamp method, and a range
method before starting work on the "history" extension. These tests
seemed to indicate that using ranges significantly impacted performance
(both writes, and queries). It didn't look terribly difficult to convert
the code between the two systems so for the time being I pressed ahead
with the two-field method, but I'd love to find out if I was doing
something stupid with ranges. Further discussion definitely wanted!

Q. Why AFTER triggers instead of BEFORE?

Largely because on the original engine BEFORE triggers were limited in
functionality, so they had to be AFTER triggers. After reading some of
the discussion on the linked thread, this may be a decision I have to
re-visit.

Q. What about official SQL:2011 syntax? Application time, etc.?

Sure - all stuff I'd love to see in Postgres at some point, but not
stuff I'm qualified to even begin looking at (given it requires engine
alterations).


Anyway, that's probably enough for now. Questions, suggestions,
criticisms all gratefully received!

Dave.

Re: Temporal extensions

From

Jim Nasby

Date:

27 April 2015, 20:49:14

On 4/25/15 7:49 PM, Dave Jones wrote:
> I've been working on a conversion of several utilities I wrote for
> another engine, and was wondering if there was any interest in seeing
> any of them added to contrib/ at some point in the vague undefined future?

Not in contrib, no, because there's no reason for these to be tied to 
specific versions of Postgres. Adding to PGXN would make sense though. 
(Though, I dislike using timestamps to do change/history tracking, but 
that's just me...)
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: Temporal extensions

From

Dave Jones

Date:

27 April 2015, 23:08:52

Hi Jim,

On 27/04/15 21:48, Jim Nasby wrote:
> On 4/25/15 7:49 PM, Dave Jones wrote:
>> I've been working on a conversion of several utilities I wrote for
>> another engine, and was wondering if there was any interest in seeing
>> any of them added to contrib/ at some point in the vague undefined
>> future?
> 
> Not in contrib, no, because there's no reason for these to be tied to
> specific versions of Postgres. Adding to PGXN would make sense though.

Thanks for the reply and suggestion - I've just run across PGXN having
finished reading through the all the temporal-related e-mails I could
find on pgsql-hackers for the last 5 years or so (specifically, Jeff
Davis' temporal extension). PGXN looks like a good place to park this
(once I've sorted out a few of the things I've run across while reading
- TRUNCATE triggers for it for starters!).

> (Though, I dislike using timestamps to do change/history tracking, but
> that's just me...)

I've been playing around with history tracking (in the context of BI,
typically with batch loaded reporting databases) for about 7-8 years now
and always found timestamps perfect for the purpose, but are you perhaps
referring to using it for audit purposes? If that's the case I'd agree
entirely - this is absolutely the wrong tool for such things (which is
something I need to put a bit more prominently in the docs - it's buried
in the design section at the moment).

Or did you mean ranges would be better? They certainly looked intriguing
when I started moving this stuff to postgres, and I'd like to re-visit
them in the near future as they offer capabilities I don't have with
timestamps (such as guaranteeing no overlapping ranges via exclusion
constraints) but my initial tests suggested some rather major
performance degradation so I put it on the back-burner at first.

Thanks,

Dave.

Re: Temporal extensions

From

Jim Nasby

Date:

28 April 2015, 02:49:49

On 4/27/15 6:08 PM, Dave Jones wrote:
>> (Though, I dislike using timestamps to do change/history tracking, but
>> >that's just me...)
> I've been playing around with history tracking (in the context of BI,
> typically with batch loaded reporting databases) for about 7-8 years now
> and always found timestamps perfect for the purpose, but are you perhaps
> referring to using it for audit purposes? If that's the case I'd agree
> entirely - this is absolutely the wrong tool for such things (which is
> something I need to put a bit more prominently in the docs - it's buried
> in the design section at the moment).

Most warehouses dumb things down to a day level, so it's probably OK there.

What I specifically don't like is that using a timestamp to try and 
determine the order in which something happened is just fraught with 
gotchas. For starters, now() is locked in when you do a BEGIN, but maybe 
a newer transaction modifies a table before an older one does. Now the 
ordering is *backwards*. You have the same problem with using an XID. 
The only way I've thought of to make this guaranteed safe is to somehow 
serialize the logging with something like

CREATE TABLE customer_history(  customer_hid serial primary key -- hid == history_id  , previous_customer_hid int
referencescustomer_history  , customer_id int NOT NULL references customer

...
);
CREATE UNIQUE INDEX ... ON customer_history(previous_customer_hid) WHERE 
previous_customer_hid IS NOT NULL;
CREATE UNIQUE INDEX ... ON customer_history(customer_hid) WHERE 
previous_customer_hid IS NULL;

and then have a before trigger enforce
NEW.previous_customer_hid := customer_history__get_latest(customer_id)

where customer_history__get_latest() will 'walk the chain' starting with 
the first link customer_id = blah AND previous_customer_id = NULL

Because of the indexes that will serialize inserts on a per-customer 
basis. You could still run into problems with a newer snapshot creating 
a history record before a transaction with an older snapshot does 
though. :( Though, if you included txid_current_snapshot() with each 
record you could probably detect when that happens.

> Or did you mean ranges would be better? They certainly looked intriguing
> when I started moving this stuff to postgres, and I'd like to re-visit
> them in the near future as they offer capabilities I don't have with
> timestamps (such as guaranteeing no overlapping ranges via exclusion
> constraints) but my initial tests suggested some rather major
> performance degradation so I put it on the back-burner at first.

If you're going to keep both a start and end for each record you'd 
definitely want to do it with a range. If you're only keeping the change 
time then you can handle it differently.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: Temporal extensions

From

Dave Jones

Date:

29 April 2015, 01:46:53

Hi Jim,

On 28/04/15 03:49, Jim Nasby wrote:
> On 4/27/15 6:08 PM, Dave Jones wrote:
>>> (Though, I dislike using timestamps to do change/history tracking, but
>>> >that's just me...)
>> I've been playing around with history tracking (in the context of BI,
>> typically with batch loaded reporting databases) for about 7-8 years now
>> and always found timestamps perfect for the purpose, but are you perhaps
>> referring to using it for audit purposes? If that's the case I'd agree
>> entirely - this is absolutely the wrong tool for such things (which is
>> something I need to put a bit more prominently in the docs - it's buried
>> in the design section at the moment).
> 
> Most warehouses dumb things down to a day level, so it's probably OK there.

That's certainly how I've always used this stuff (and how I've always
encouraged it to be used), although I must admit everyone's first
reaction is "great, let's use microsecond resolution to capture
_everything_!" followed in a few years time by "good grief, my table's
huge and I don't need most of the detail in it!". Oh well :)

> What I specifically don't like is that using a timestamp to try and
> determine the order in which something happened is just fraught with
> gotchas. For starters, now() is locked in when you do a BEGIN, but maybe
> a newer transaction modifies a table before an older one does. Now the
> ordering is *backwards*. You have the same problem with using an XID.
> The only way I've thought of to make this guaranteed safe is to somehow
> serialize the logging with something like
> 
> CREATE TABLE customer_history(
>   customer_hid serial primary key -- hid == history_id
>   , previous_customer_hid int references customer_history
>   , customer_id int NOT NULL references customer
> ...
> );
> CREATE UNIQUE INDEX ... ON customer_history(previous_customer_hid) WHERE
> previous_customer_hid IS NOT NULL;
> CREATE UNIQUE INDEX ... ON customer_history(customer_hid) WHERE
> previous_customer_hid IS NULL;
> 
> and then have a before trigger enforce
> NEW.previous_customer_hid := customer_history__get_latest(customer_id)
> 
> where customer_history__get_latest() will 'walk the chain' starting with
> the first link customer_id = blah AND previous_customer_id = NULL
> 
> Because of the indexes that will serialize inserts on a per-customer
> basis. You could still run into problems with a newer snapshot creating
> a history record before a transaction with an older snapshot does
> though. :( Though, if you included txid_current_snapshot() with each
> record you could probably detect when that happens.

Yes, I noticed in my read through the older temporal threads that one of
the solutions used clock_timestamp() rather than current_timestamp which
seemed to be in order to avoid this sort of thing. Can't say I liked the
sound of it though - seemed like that would lead to even more
inconsistencies (as the timestamp for a changeset would potentially fall
in the middle of the transaction that made it ... urgh!).

I should make clear in the docs, that this sort of system isn't good for
ordering at that level (i.e. it's only accurate for history resolutions
significantly longer than any individual transaction length).

>> Or did you mean ranges would be better? They certainly looked intriguing
>> when I started moving this stuff to postgres, and I'd like to re-visit
>> them in the near future as they offer capabilities I don't have with
>> timestamps (such as guaranteeing no overlapping ranges via exclusion
>> constraints) but my initial tests suggested some rather major
>> performance degradation so I put it on the back-burner at first.
> 
> If you're going to keep both a start and end for each record you'd
> definitely want to do it with a range. If you're only keeping the change
> time then you can handle it differently.

Yup, it's keeping a start and end (some of the routines in the extension
provide alternate transformations, but the base storage is always a
range of timestamps simply because that seems the easiest structure to
transform into others in practice).

I'll work on a decent test case for the performance issues I ran into
(unfortunately the original one is full of proprietary data so I'll have
to come up with something similarly sized full of random bits).

Dave.