Thread: Data version idea (please discuss)

Data version idea (please discuss)

From
webb
Date:
(I did a *little* bit of searching on this, but I am not even sure what
keywords to use, so forgive me if I should just RTFM...)

I am interested in using PG for large datasets, like census records,
insurance claims, mortality occurences, etc, etc.  The updates and
inserts would (I think) be batch oriented; we clean a bunch of records,
convert to a nice text file, do a big insert, repeat.

What I am curious about is versioning the data that goes into this
database using something that I want to call a "checkpoint".  The
"use-case" would be that you do an insert of something like 100 records,
updating secondary tables as necessary, check it to the best of your
ability, then run Function_1.  This function increments the version
number and returns it, storing whatever is necessary so that Function_2
can reset the database to any of the version numbers returned by
Function_1.  I guess there would be a Function_3 that would add tables
to the checkpointing system.

Questions:

0.  Does that make sense?

1.  Is there some literature on this, so I don't have to keep bothering
the list with beginners questions?

2.  Has somebody done the work already?  I would think it would be
possible, using rules on all insert/delete/update, plus storing a few
other pieces of information with each row, plus maybe a table or two to
keep version information.  You would never actually delete from a
versioned table, just change the current view, same with update.  I
don't think you would ever actually have to write C, but you might have
to write some dynamic SQL so that you can iterate over lists of tables.

3.  If the work has *not* been done, would it help anybody else to have
me do it?  If so, please give feedback.

This seems related to replication, but I haven't looked into that yet.
I am fairly bright, but I have only a cursory background in the
theoretical stuff behind transactions and concurrent stuff.

Thanks to all.


Re: Data version idea (please discuss)

From
Richard Huxton
Date:
webb wrote:
> What I am curious about is versioning the data that goes into this
> database using something that I want to call a "checkpoint".  The
> "use-case" would be that you do an insert of something like 100 records,
> updating secondary tables as necessary, check it to the best of your
> ability, then run Function_1.  This function increments the version
> number and returns it, storing whatever is necessary so that Function_2
> can reset the database to any of the version numbers returned by
> Function_1.  I guess there would be a Function_3 that would add tables
> to the checkpointing system.
>
> Questions:
>
> 0.  Does that make sense?

Yes

> 1.  Is there some literature on this, so I don't have to keep bothering
> the list with beginners questions?

Temporal databases - replace mentions of timestamps with version-numbers
and it matches your requirements precisely. There's various stuff on the
web, plenty of academic research and a couple of books too.

HTH
--
   Richard Huxton
   Archonet Ltd

Re: Data version idea (please discuss)

From
Michael Glaesemann
Date:
On Aug 6, 2004, at 4:20 PM, Richard Huxton wrote:

> webb wrote:
>
>> 1.  Is there some literature on this, so I don't have to keep
>> bothering the list with beginners questions?
>
> Temporal databases - replace mentions of timestamps with
> version-numbers and it matches your requirements precisely. There's
> various stuff on the web, plenty of academic research and a couple of
> books too.

This is an area I'm very  interested in, in particular Nikos Lorentzos'
IXSQL (Interval Extension to SQL) work proposed during the TSQL2
discussions.
<http://portal.acm.org/citation.cfm?id=627848>
(I haven't found the full text of this paper available freely online,
though it can be purchased from the publisher for download.)

Further work along these lines was published in "Temporal Data and the
Relational Model" by Hugh Darwn, CJ Date, and Nikos Lorentzos. An
overview of an early draft of the book can be found here:
<http://www.hughdarwen.freeola.com/TheThirdManifesto.web/
TemporalData.Warwick.pdf>

Valid-time and transaction-time proposals that appear to specifically
address the "timestamping" issues (similar in spirit I gather to the
"time travel" feature in older versions of Postgres) appear to be
limited to only that specific case, while a more full approach as
proposed by Lorentzos et al would allow versioning and timestamping and
much more. The following is a link to a critique by Darwen and Date of
various TSQL2 proposals.
<http://www.hughdarwen.freeola.com/TheThirdManifesto.web/OnTSQL2.pdf>

 From what I gather, the SQL TSQL2 discussions ended without any
conclusion as to extending SQL in this direction.

Michael Glaesemann
grzm myrealbox com


Re: Data version idea (please discuss)

From
Mike Mascari
Date:
Michael Glaesemann wrote:

>  From what I gather, the SQL TSQL2 discussions ended without any
> conclusion as to extending SQL in this direction.

Darwen's critique of TSQL2 is here:

http://www.hughdarwen.freeola.com/TheThirdManifesto.web/OnTSQL2.pdf

I'm not sure if Snodgrass ever replied to it.

The working draft's Part 7 was SQL/Temporal:


http://www.jtc1sc32.org/sc32/jtc1sc32.nsf/f3b9a582bbf35d33852566210054191a/115c37d71e64bfd188256a5b00442b3e?OpenDocument

There's a bunch of temporal related doucments from the TimeCenter:

http://www.cs.auc.dk/TimeCenter/pub.htm

Partial indexes get me close to where I want with temporal features.
I just wish the RI constraints had the ability to supply a WHERE
clause. Between the two, it might get me were I want, rather than
having to write triggers to ensure temporal integrity.

Mike Mascari


Re: Data version idea (please discuss)

From
Michael Glaesemann
Date:
Mike

Thanks for the links! I remember coming across a TimeCenter paper. The
TimeCenter reference page is quite extensive and I look forward to
reading more of the work Darwen and Date critiqued.

On Aug 6, 2004, at 6:37 PM, Mike Mascari wrote:

> Partial indexes get me close to where I want with temporal features. I
> just wish the RI constraints had the ability to supply a WHERE clause.
> Between the two, it might get me were I want, rather than having to
> write triggers to ensure temporal integrity.

Would you mind going into more depth into how you're doing this?

Michael Glaesemann
grzm myrealbox com


Re: Data version idea (please discuss)

From
Marius Andreiana
Date:
On Mon, 2004-08-02 at 16:09 -0700, webb wrote:
> What I am curious about is versioning the data that goes into this
> database using something that I want to call a "checkpoint".
:-)
How about using CHECKPOINT or SAVEPOINT in postgresql 8.0? (in beta now)
http://developer.postgresql.org/docs/postgres/sql-checkpoint.html
http://developer.postgresql.org/docs/postgres/sql-savepoint.html

There are some examples in postgresql 7.5 new features announcements
over the net.

--
Marius Andreiana
Galuna - Solutii Linux in Romania
http://www.galuna.ro


Re: Data version idea (please discuss)

From
Mike Mascari
Date:
Michael Glaesemann wrote:

> Thanks for the links! I remember coming across a TimeCenter paper. The
> TimeCenter reference page is quite extensive and I look forward to
> reading more of the work Darwen and Date critiqued.

Sorry for the redundant Darwen critique link.

>
> On Aug 6, 2004, at 6:37 PM, Mike Mascari wrote:
>
>> Partial indexes get me close to where I want with temporal features. I
>> just wish the RI constraints had the ability to supply a WHERE clause.
>> Between the two, it might get me were I want, rather than having to
>> write triggers to ensure temporal integrity.
>
>
> Would you mind going into more depth into how you're doing this?

It's rather crude. I've a start and end date on all temporal
relations. I've a surrogate key as well. The uniqueness of the
actual candidate key is enforced by a partial index only over tuples
whose end date IS NULL. I maintain "temporal integrity" through ON
UPDATE triggers. When a temporal tuple is deactivated, all relations
that have the soon-to-be deactivated tuple's surrogate key are also
deactivated. I also have a corresponding view for each temporal
relation that is queried by non-reporting user interfaces. They see
the active tuples, while the reporting component queries the base
relations for the entire history.

However, this is all maintained by a series of spaghetti C-language
SPI routines, and my own home-brewed version of pg_constraint, as
opposed to nice declared referential integrity, for the obvious reasons.

Mike Mascari




Re: Data version idea (please discuss)

From
webb
Date:
> :-)
> How about using CHECKPOINT or SAVEPOINT in postgresql 8.0? (in beta now)
> http://developer.postgresql.org/docs/postgres/sql-checkpoint.html
> http://developer.postgresql.org/docs/postgres/sql-savepoint.html

I guess checkpoint is the wrong term (duh...), since what I want is to
save all possible versions.  "REVERT TO id" would be my vote for SQL
2010 if this feature is enabled.

Thansk to the folks above for the links.  I have the snodgrass book -- I
will check it again.
W
>
> There are some examples in postgresql 7.5 new features announcements
> over the net.
>

Re: Data version idea (please discuss)

From
"Scott Marlowe"
Date:
Have you looked in contrib/spi/README.timetravel ?

On Fri, 2004-08-06 at 10:56, webb wrote:
> > :-)
> > How about using CHECKPOINT or SAVEPOINT in postgresql 8.0? (in beta now)
> > http://developer.postgresql.org/docs/postgres/sql-checkpoint.html
> > http://developer.postgresql.org/docs/postgres/sql-savepoint.html
>
> I guess checkpoint is the wrong term (duh...), since what I want is to
> save all possible versions.  "REVERT TO id" would be my vote for SQL
> 2010 if this feature is enabled.
>
> Thansk to the folks above for the links.  I have the snodgrass book -- I
> will check it again.
> W
> >
> > There are some examples in postgresql 7.5 new features announcements
> > over the net.
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>       joining column's datatypes do not match
>