Re: Temporal extensions - Mailing list pgsql-hackers

From Dave Jones
Subject Re: Temporal extensions
Msg-id 55403803.7070904@waveform.org.uk
In response to Re: Temporal extensions  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
List pgsql-hackers
Hi Jim,

On 28/04/15 03:49, Jim Nasby wrote:
> On 4/27/15 6:08 PM, Dave Jones wrote:
>>> (Though, I dislike using timestamps to do change/history tracking, but
>>> that's just me...)
>> I've been playing around with history tracking (in the context of BI,
>> typically with batch loaded reporting databases) for about 7-8 years now
>> and always found timestamps perfect for the purpose, but are you perhaps
>> referring to using it for audit purposes? If that's the case I'd agree
>> entirely - this is absolutely the wrong tool for such things (which is
>> something I need to put a bit more prominently in the docs - it's buried
>> in the design section at the moment).
> 
> Most warehouses dumb things down to a day level, so it's probably OK there.

That's certainly how I've always used this stuff (and how I've always
encouraged it to be used), although I must admit everyone's first
reaction is "great, let's use microsecond resolution to capture
_everything_!" followed in a few years' time by "good grief, my table's
huge and I don't need most of the detail in it!". Oh well :)

> What I specifically don't like is that using a timestamp to try and
> determine the order in which something happened is just fraught with
> gotchas. For starters, now() is locked in when you do a BEGIN, but maybe
> a newer transaction modifies a table before an older one does. Now the
> ordering is *backwards*. You have the same problem with using an XID.
> The only way I've thought of to make this guaranteed safe is to somehow
> serialize the logging with something like
> 
> CREATE TABLE customer_history(
>   customer_hid serial primary key -- hid == history_id
>   , previous_customer_hid int references customer_history
>   , customer_id int NOT NULL references customer
> ...
> );
> CREATE UNIQUE INDEX ... ON customer_history(previous_customer_hid) WHERE
> previous_customer_hid IS NOT NULL;
> CREATE UNIQUE INDEX ... ON customer_history(customer_hid) WHERE
> previous_customer_hid IS NULL;
> 
> and then have a before trigger enforce
> NEW.previous_customer_hid := customer_history__get_latest(customer_id)
> 
> where customer_history__get_latest() will 'walk the chain' starting with
> the first link customer_id = blah AND previous_customer_id = NULL
> 
> Because of the indexes that will serialize inserts on a per-customer
> basis. You could still run into problems with a newer snapshot creating
> a history record before a transaction with an older snapshot does
> though. :( Though, if you included txid_current_snapshot() with each
> record you could probably detect when that happens.
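
For what it's worth, here's my reading of that scheme as actual DDL - the
trigger body is my guess at how the chain lookup would go (names are from
your example; this is a sketch, not tested):

```sql
-- Assumes the customer_history table and partial unique indexes from
-- the example above. The trigger finds the tail of this customer's
-- chain (the row no other row references as previous_customer_hid)
-- and links the new row to it.
CREATE OR REPLACE FUNCTION customer_history__link() RETURNS trigger AS $$
BEGIN
    SELECT h.customer_hid INTO NEW.previous_customer_hid
    FROM customer_history h
    WHERE h.customer_id = NEW.customer_id
      AND NOT EXISTS (
          SELECT 1 FROM customer_history n
          WHERE n.previous_customer_hid = h.customer_hid);
    -- Two concurrent inserts that pick the same tail will then collide
    -- on the unique index over previous_customer_hid, so one of them
    -- fails and has to retry: per-customer serialization.
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_history_link
    BEFORE INSERT ON customer_history
    FOR EACH ROW EXECUTE PROCEDURE customer_history__link();
```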

Yes, I noticed in my read through the older temporal threads that one of
the solutions used clock_timestamp() rather than current_timestamp,
apparently to avoid this sort of thing. Can't say I liked the sound of
it though - it seemed like that would lead to even more inconsistencies
(the timestamp for a changeset could fall in the middle of the
transaction that made it ... urgh!).
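
The difference is easy to demonstrate: current_timestamp is frozen at
transaction start, while clock_timestamp() keeps advancing inside it:

```sql
-- current_timestamp is fixed for the whole transaction;
-- clock_timestamp() reads the wall clock at each call.
BEGIN;
SELECT current_timestamp, clock_timestamp();  -- roughly equal
SELECT pg_sleep(2);
SELECT current_timestamp, clock_timestamp();  -- now ~2 seconds apart
COMMIT;
```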

I should make clear in the docs that this sort of system isn't good for
ordering at that level (i.e. it's only accurate for history resolutions
significantly longer than any individual transaction length).

>> Or did you mean ranges would be better? They certainly looked intriguing
>> when I started moving this stuff to postgres, and I'd like to re-visit
>> them in the near future as they offer capabilities I don't have with
>> timestamps (such as guaranteeing no overlapping ranges via exclusion
>> constraints) but my initial tests suggested some rather major
>> performance degradation so I put it on the back-burner at first.
> 
> If you're going to keep both a start and end for each record you'd
> definitely want to do it with a range. If you're only keeping the change
> time then you can handle it differently.

Yup, it's keeping a start and end (some of the routines in the extension
provide alternate transformations, but the base storage is always a
range of timestamps simply because that seems the easiest structure to
transform into others in practice).
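
The overlap guarantee I mentioned would look something like this (names
are illustrative, not the extension's actual schema; the integer
equality part of the constraint needs btree_gist):

```sql
-- Range-based history storage with non-overlap enforced declaratively:
-- no two history rows for the same customer may have intersecting
-- validity ranges.
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE customer_history (
    customer_id int NOT NULL,
    valid       tstzrange NOT NULL,
    -- ... tracked attributes ...
    EXCLUDE USING gist (customer_id WITH =, valid WITH &&)
);
```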

I'll work on a decent test case for the performance issues I ran into
(unfortunately the original one is full of proprietary data so I'll have
to come up with something similarly sized full of random bits).
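
Something along these lines should get the row counts into the right
ballpark (purely synthetic, nothing like the real schema):

```sql
-- Fake a similarly-sized data set: N customers, each with a run of
-- consecutive one-day history ranges built from generate_series().
INSERT INTO customer_history (customer_id, valid)
SELECT c, tstzrange(d, d + interval '1 day')
FROM generate_series(1, 100000) AS c,
     generate_series(timestamptz '2015-01-01',
                     timestamptz '2015-03-31',
                     interval '1 day') AS d;
```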

Dave.
