Thread: Option to ensure monotonic timestamps

Option to ensure monotonic timestamps

From
Brent Kerby
Date:
Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a new feature to make it possible to ensure that Postgres-generated timestamps never decrease even if the system clock may step backwards. My use case is that I'm implementing a form of temporal tables based on transaction commit timestamps (as returned by pg_xact_commit_timestamp), and to ensure the integrity of the system I need to know that the ordering of the commit timestamps will always be consistent with the order in which the transactions actually committed. I don't need the timestamps to be unique; i.e., if transactions occur close together in time, then it's fine for them to have the same timestamp -- just if the timestamps are different then they must be in the right order. I would guess there may be other scenarios where users may want to ensure the timestamps are monotonic, and in general it would probably be desired for the monotonicity to apply across all timestamps generated by a given Postgres server, not only the commit timestamps.

I'm aware of the obvious alternative, which is simply to try to configure the system clock so that it can't go backwards (e.g., using the option "stepback 0" for ntpd). However, in virtual environments this could potentially be difficult to achieve in a reliable way. And in any case, since in my application the integrity of the data history hinges on the timestamps being monotonic, I think it makes sense that this be enforceable on the database level.

What I propose is that we could add a boolean configuration option, say 'ensure_monotonic_timestamps', that enables the following behavior: when GetCurrentTimestamp is called (https://doxygen.postgresql.org/backend_2utils_2adt_2timestamp_8c.html#a9822cdf3fd41b15851c0c18ddc80143c), before it returns it checks if `result` is less than what was returned last time (if any) that GetCurrentTimestamp was called, and if so it returns the result from the previous call (after logging a warning), otherwise it proceeds as normal. In its simplest form, this could be accomplished by adding a global variable lastGetCurrentTimestamp that stores the result of the previous call. Since GetCurrentTimestamp appears to be the source of all of the significant system-generated timestamps, including commit timestamps, this should produce the behavior I'm looking for. 

One tricky thing is to figure out how to make this reliable even in the situation where the database engine has to be restarted. When we're starting up and have to initialize lastGetCurrentTimestamp, we need to make sure to make sure we initialize it to be at least as large as the largest previous result of GetCurrentTimestamp that made its way into the WAL before shutdown, i.e., the largest previous result of GetCurrentTimestamp that has the potential to be written out to tables upon recovery. What's fuzzy to me is whether this would require writing new data to the WAL specifically for this, or whether there are already timestamps (e.g., as part of WAL metadata) that could serve this purpose.

Any thoughts?

- Brent Kerby

Re: Option to ensure monotonic timestamps

From
Tom Lane
Date:
Brent Kerby <blkerby@gmail.com> writes:
> What I propose is that we could add a boolean configuration option, say
> 'ensure_monotonic_timestamps', that enables the following behavior: when
> GetCurrentTimestamp is called (
> https://doxygen.postgresql.org/backend_2utils_2adt_2timestamp_8c.html#a9822cdf3fd41b15851c0c18ddc80143c),
> before it returns it checks if `result` is less than what was returned last
> time (if any) that GetCurrentTimestamp was called, and if so it returns the
> result from the previous call (after logging a warning), otherwise it
> proceeds as normal. In its simplest form, this could be accomplished by
> adding a global variable lastGetCurrentTimestamp that stores the result of
> the previous call.

The "global" variable would actually need to be cluster-wide, ie in shared
memory, which would imply contention and the need for locks.  I think the
overhead of this would be mighty high, and the return pretty low.

It's also worth pointing out that if you don't trust the kernel clock,
simply clamping to the last returned value isn't likely to be terribly
satisfactory.  What if $idiotsysadmin steps the clock back an hour?
We've had actual problems of that sort, for example with the stats
collector going AWOL for awhile because it thought it'd already written a
sufficiently new stats file.  There's now an explicit check for clock-
went-backwards in pgstat_recv_inquiry, which will be broken in that sort
of scenario if you cause GetCurrentTimestamp to do clamping internally.

On the whole, the alternative of solving the problem at the kernel level
seems preferable to me.

            regards, tom lane


Re: Option to ensure monotonic timestamps

From
Andres Freund
Date:
Hi,

Leaving Tom's concerns aside:

On 2018-02-19 13:42:31 -0700, Brent Kerby wrote:
> Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a
> new feature to make it possible to ensure that Postgres-generated
> timestamps never decrease even if the system clock may step backwards. My
> use case is that I'm implementing a form of temporal tables based on
> transaction commit timestamps (as returned by pg_xact_commit_timestamp),
> and to ensure the integrity of the system I need to know that the ordering
> of the commit timestamps will always be consistent with the order in which
> the transactions actually committed.

The acquiration of the commit timestamp and the actual visibility of the
commit will not necessarily be sufficient for many things. A backend can
theoretically sleep for an hour between

static TransactionId
RecordTransactionCommit(void)
{
...
        SetCurrentTransactionStopTimestamp();
/* here */
        XactLogCommitRecord(xactStopTimestamp,
                            nchildren, children, nrels, rels,
                            nmsgs, invalMessages,
                            RelcacheInitFileInval, forceSyncCommit,
                            MyXactFlags,
                            InvalidTransactionId /* plain commit */ );
}

static void
CommitTransaction(void)
{
...
        /*
         * We need to mark our XIDs as committed in pg_xact.  This is where we
         * durably commit.
         */
        latestXid = RecordTransactionCommit();

/* here */

    /*
     * Let others know about no transaction in progress by me. Note that this
     * must be done _before_ releasing locks we hold and _after_
     * RecordTransactionCommit.
     */
    ProcArrayEndTransaction(MyProc, latestXid);

whether that affects your approach I do not know.


> Any thoughts?

Why are you looking to do something timestamp based in the first place?
It's a bit hard to give good advice without further information...

Greetings,

Andres Freund


Re: Option to ensure monotonic timestamps

From
Patrick Krecker
Date:
On Tue, Feb 20, 2018 at 9:51 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> Leaving Tom's concerns aside:
>
> On 2018-02-19 13:42:31 -0700, Brent Kerby wrote:
>> Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a
>> new feature to make it possible to ensure that Postgres-generated
>> timestamps never decrease even if the system clock may step backwards. My
>> use case is that I'm implementing a form of temporal tables based on
>> transaction commit timestamps (as returned by pg_xact_commit_timestamp),
>> and to ensure the integrity of the system I need to know that the ordering
>> of the commit timestamps will always be consistent with the order in which
>> the transactions actually committed.
>
> The acquiration of the commit timestamp and the actual visibility of the
> commit will not necessarily be sufficient for many things. A backend can
> theoretically sleep for an hour between
>
> static TransactionId
> RecordTransactionCommit(void)
> {
> ...
>                 SetCurrentTransactionStopTimestamp();
> /* here */
>                 XactLogCommitRecord(xactStopTimestamp,
>                                                         nchildren, children, nrels, rels,
>                                                         nmsgs, invalMessages,
>                                                         RelcacheInitFileInval, forceSyncCommit,
>                                                         MyXactFlags,
>                                                         InvalidTransactionId /* plain commit */ );
> }
>
> static void
> CommitTransaction(void)
> {
> ...
>                 /*
>                  * We need to mark our XIDs as committed in pg_xact.  This is where we
>                  * durably commit.
>                  */
>                 latestXid = RecordTransactionCommit();
>
> /* here */
>
>         /*
>          * Let others know about no transaction in progress by me. Note that this
>          * must be done _before_ releasing locks we hold and _after_
>          * RecordTransactionCommit.
>          */
>         ProcArrayEndTransaction(MyProc, latestXid);
>
> whether that affects your approach I do not know.
>
>
>> Any thoughts?
>
> Why are you looking to do something timestamp based in the first place?
> It's a bit hard to give good advice without further information...
>
> Greetings,
>
> Andres Freund
>

Hi Brent --

I haven't heard of temporal tables before, but I guess it's a feature
of SQL Server 2016. It sounds similar to some recent work in progress
to add "AS OF" to SELECT statements:
https://www.postgresql.org/message-id/78aadf6b-86d4-21b9-9c2a-51f1efb8a499@postgrespro.ru

Patrick


Re: Option to ensure monotonic timestamps

From
Andres Freund
Date:
Hi,

On 2018-02-20 12:32:22 -0500, Tom Lane wrote:
> The "global" variable would actually need to be cluster-wide, ie in shared
> memory, which would imply contention and the need for locks.  I think the
> overhead of this would be mighty high, and the return pretty low.

I think if we wanted to go for something like this (which I doubt), we'd
have that global variable as an atomic 64bit variable in shmem, and
*only* use it for stuff where the ordering actually matters. I.e. not
for transaction start times etc...


> It's also worth pointing out that if you don't trust the kernel clock,
> simply clamping to the last returned value isn't likely to be terribly
> satisfactory.  What if $idiotsysadmin steps the clock back an hour?
> We've had actual problems of that sort, for example with the stats
> collector going AWOL for awhile because it thought it'd already written a
> sufficiently new stats file.  There's now an explicit check for clock-
> went-backwards in pgstat_recv_inquiry, which will be broken in that sort
> of scenario if you cause GetCurrentTimestamp to do clamping internally.

I guess you could hack something together with CLOCK_MONOTONIC or such,
but brrrr.

Greetings,

Andres Freund


Re: Option to ensure monotonic timestamps

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-02-20 12:32:22 -0500, Tom Lane wrote:
>> The "global" variable would actually need to be cluster-wide, ie in shared
>> memory, which would imply contention and the need for locks.  I think the
>> overhead of this would be mighty high, and the return pretty low.

> I think if we wanted to go for something like this (which I doubt), we'd
> have that global variable as an atomic 64bit variable in shmem, and
> *only* use it for stuff where the ordering actually matters. I.e. not
> for transaction start times etc...

Then you've got problems with figuring out which usages "matter" and which
don't, and being sure you don't ever compare timestamps from the two
different sources.  Seems mighty fragile to me, and reminiscent of the
replication problems that forced us to drop support for float timestamps.

In any case I'd not much like a system that mostly reported in system
clock time except transaction commit timestamps are on some other
timescale.

But really, the killer point here is your upthread comment that even if
GetCurrentTimestamp were guaranteed to return monotonic time, that would
not guarantee that commit timestamps match physical commit order, which
was the OP's goal.  At least not unless we read the clock while holding
WALWriteLock, which I'm pretty sure everyone will say Ain't Happening.

I think your not-very-explicit suggestion that he should work in
commit LSNs, not timestamps at all, is a far superior answer.

            regards, tom lane


Re: Option to ensure monotonic timestamps

From
Brent Kerby
Date:
Right, I'm talking about temporal tables in the sense of the SQL:2011 standard. I know there's a Postgres extension temporal_tables by Vlad Arkhipov (https://github.com/arkhipov/temporal_tables/) that approximates this. There's also a way of doing it using only triggers written in pgplsql, by Paolo Chiodi (https://www.nearform.com/blog/time-travel-with-postgresql-on-amazon-rds/). In these solutions, however, as well as in the SQL Server 2016 and many other implementations, the transaction start time (as opposed to commit time) is used as the time at which the data is considered to have changed, which does not ensure consistency of the historical data: for instance, you can end up with a situation where, when viewed "AS OF" certain time points, the database will appear to have had non-unique primary keys and broken foreign key references (e.g., see https://dba.stackexchange.com/questions/143241/why-do-temporal-tables-log-the-begin-time-of-the-transaction/198204#198204).

I wasn't aware of that recent work. The "AS OF" syntax seems useful, although if I understand it correctly it doesn't provide the full power of the temporal tables. With a full implementation of temporal tables, for each temporal table there's a corresponding history table that can be directly accessed by queries, making it possible for instance to see a list of all changes that have affected rows satisfying certain conditions, or to see the data "AS OF" not just constant times but "AS OF" some variable time given by a column in another table that is being joined with (The ability to do this is important in my application). 

I agree with Tom's points and don't think that what I originally proposed is a very good solution, but it still makes me uncomfortable to trust blindly in the kernel clock when the integrity of the data hangs in the balance. How about the following alternative proposal?: Instead of trying to enforce monotonicity of all Postgres-generated timestamps, we look only at the commit timestamps, and if at the time that we are about to commit we detect that a violation occurs, instead of clamping (which I agree is ugly) we abort the transaction with an error. And this should happen only if a configuration option, say 'monotonic_commit_timestamp', is enabled. With this approach, we only need to keep track of the previous commit timestamp, which is already being done if "track_commit_timestamp" is enabled (which should probably be a prerequisite for enabling 'monotonic_commit_timestamp'), so that should impose minimal overhead -- no need for any additional locking or including anything more in the WAL, right?


On Tue, Feb 20, 2018 at 11:09 AM, Patrick Krecker <pkrecker@gmail.com> wrote:
On Tue, Feb 20, 2018 at 9:51 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> Leaving Tom's concerns aside:
>
> On 2018-02-19 13:42:31 -0700, Brent Kerby wrote:
>> Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a
>> new feature to make it possible to ensure that Postgres-generated
>> timestamps never decrease even if the system clock may step backwards. My
>> use case is that I'm implementing a form of temporal tables based on
>> transaction commit timestamps (as returned by pg_xact_commit_timestamp),
>> and to ensure the integrity of the system I need to know that the ordering
>> of the commit timestamps will always be consistent with the order in which
>> the transactions actually committed.
>
> The acquiration of the commit timestamp and the actual visibility of the
> commit will not necessarily be sufficient for many things. A backend can
> theoretically sleep for an hour between
>
> static TransactionId
> RecordTransactionCommit(void)
> {
> ...
>                 SetCurrentTransactionStopTimestamp();
> /* here */
>                 XactLogCommitRecord(xactStopTimestamp,
>                                                         nchildren, children, nrels, rels,
>                                                         nmsgs, invalMessages,
>                                                         RelcacheInitFileInval, forceSyncCommit,
>                                                         MyXactFlags,
>                                                         InvalidTransactionId /* plain commit */ );
> }
>
> static void
> CommitTransaction(void)
> {
> ...
>                 /*
>                  * We need to mark our XIDs as committed in pg_xact.  This is where we
>                  * durably commit.
>                  */
>                 latestXid = RecordTransactionCommit();
>
> /* here */
>
>         /*
>          * Let others know about no transaction in progress by me. Note that this
>          * must be done _before_ releasing locks we hold and _after_
>          * RecordTransactionCommit.
>          */
>         ProcArrayEndTransaction(MyProc, latestXid);
>
> whether that affects your approach I do not know.
>
>
>> Any thoughts?
>
> Why are you looking to do something timestamp based in the first place?
> It's a bit hard to give good advice without further information...
>
> Greetings,
>
> Andres Freund
>

Hi Brent --

I haven't heard of temporal tables before, but I guess it's a feature
of SQL Server 2016. It sounds similar to some recent work in progress
to add "AS OF" to SELECT statements:
https://www.postgresql.org/message-id/78aadf6b-86d4-21b9-9c2a-51f1efb8a499@postgrespro.ru

Patrick

Re: Option to ensure monotonic timestamps

From
Brent Kerby
Date:
The issue is that presence of timestamps is fundamental to the functionality of temporal tables. The users need to.be able to make queries on temporal tables in terms of timestamps; LSNs won't mean anything to them. It would be an option to implement the temporal tables using LSNs under the hood, but then it is still required to construct a monotonic mapping between LSNs and timestamps in order for it to be usable. The most natural method for constructing such a mapping would be to use the stored commit timestamps; if these are monotonic, then that works, but we gain little by storing the LSNs, since they will just be converted to timestamps anyway when they're used. But if the timestamps aren't monotonic, then we're faced with the same problem as before; we could try to patch up the non-monotonicity in the mapping after-the-fact, using clamping or possibly some more sophisticated method, but this is inefficient and could get ugly fast. It would seem preferable to just ensure that the timestamps are monotonic to begin with. And based on your observations of why we shouldn't try to enforce this on every application of GetCurrentTimestamp, I think maybe it would be cleaner to just enforce this on commit timestamps (and only if enabled by a configuration option, of course), since this is all that is needed and should be simpler and less expensive. And if a violation occurs, we can just abort the transaction with an error, rather than clamping the timestamp. This way we don't end up with an ugly scenario like you pointed out, where we have one set of timestamps that are clamped (i.e., for commit timestamps) and others that are not; and also this way, if the $idiotsysadmin sets the clock back an hour, then we get errors immediately instead of a whole hour of temporal table history being messed up.

On Tue, Feb 20, 2018 at 11:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
> On 2018-02-20 12:32:22 -0500, Tom Lane wrote:
>> The "global" variable would actually need to be cluster-wide, ie in shared
>> memory, which would imply contention and the need for locks.  I think the
>> overhead of this would be mighty high, and the return pretty low.

> I think if we wanted to go for something like this (which I doubt), we'd
> have that global variable as an atomic 64bit variable in shmem, and
> *only* use it for stuff where the ordering actually matters. I.e. not
> for transaction start times etc...

Then you've got problems with figuring out which usages "matter" and which
don't, and being sure you don't ever compare timestamps from the two
different sources.  Seems mighty fragile to me, and reminiscent of the
replication problems that forced us to drop support for float timestamps.

In any case I'd not much like a system that mostly reported in system
clock time except transaction commit timestamps are on some other
timescale.

But really, the killer point here is your upthread comment that even if
GetCurrentTimestamp were guaranteed to return monotonic time, that would
not guarantee that commit timestamps match physical commit order, which
was the OP's goal.  At least not unless we read the clock while holding
WALWriteLock, which I'm pretty sure everyone will say Ain't Happening.

I think your not-very-explicit suggestion that he should work in
commit LSNs, not timestamps at all, is a far superior answer.

                        regards, tom lane