Thread: Data version idea (please discuss)
(I did a *little* bit of searching on this, but I am not even sure what keywords to use, so forgive me if I should just RTFM...) I am interested in using PG for large datasets, like census records, insurance claims, mortality occurences, etc, etc. The updates and inserts would (I think) be batch oriented; we clean a bunch of records, convert to a nice text file, do a big insert, repeat. What I am curious about is versioning the data that goes into this database using something that I want to call a "checkpoint". The "use-case" would be that you do an insert of something like 100 records, updating secondary tables as necessary, check it to the best of your ability, then run Function_1. This function increments the version number and returns it, storing whatever is necessary so that Function_2 can reset the database to any of the version numbers returned by Function_1. I guess there would be a Function_3 that would add tables to the checkpointing system. Questions: 0. Does that make sense? 1. Is there some literature on this, so I don't have to keep bothering the list with beginners questions? 2. Has somebody done the work already? I would think it would be possible, using rules on all insert/delete/update, plus storing a few other pieces of information with each row, plus maybe a table or two to keep version information. You would never actually delete from a versioned table, just change the current view, same with update. I don't think you would ever actually have to write C, but you might have to write some dynamic SQL so that you can iterate over lists of tables. 3. If the work has *not* been done, would it help anybody else to have me do it? If so, please give feedback. This seems related to replication, but I haven't looked into that yet. I am fairly bright, but I have only a cursory background in the theoretical stuff behind transactions and concurrent stuff. Thanks to all.
webb wrote: > What I am curious about is versioning the data that goes into this > database using something that I want to call a "checkpoint". The > "use-case" would be that you do an insert of something like 100 records, > updating secondary tables as necessary, check it to the best of your > ability, then run Function_1. This function increments the version > number and returns it, storing whatever is necessary so that Function_2 > can reset the database to any of the version numbers returned by > Function_1. I guess there would be a Function_3 that would add tables > to the checkpointing system. > > Questions: > > 0. Does that make sense? Yes > 1. Is there some literature on this, so I don't have to keep bothering > the list with beginners questions? Temporal databases - replace mentions of timestamps with version-numbers and it matches your requirements precisely. There's various stuff on the web, plenty of academic research and a couple of books too. HTH -- Richard Huxton Archonet Ltd
On Aug 6, 2004, at 4:20 PM, Richard Huxton wrote: > webb wrote: > >> 1. Is there some literature on this, so I don't have to keep >> bothering the list with beginners questions? > > Temporal databases - replace mentions of timestamps with > version-numbers and it matches your requirements precisely. There's > various stuff on the web, plenty of academic research and a couple of > books too. This is an area I'm very interested in, in particular Nikos Lorentzos' IXSQL (Interval Extension to SQL) work proposed during the TSQL2 discussions. <http://portal.acm.org/citation.cfm?id=627848> (I haven't found the full text of this paper available freely online, though it can be purchased from the publisher for download.) Further work along these lines was published in "Temporal Data and the Relational Model" by Hugh Darwn, CJ Date, and Nikos Lorentzos. An overview of an early draft of the book can be found here: <http://www.hughdarwen.freeola.com/TheThirdManifesto.web/ TemporalData.Warwick.pdf> Valid-time and transaction-time proposals that appear to specifically address the "timestamping" issues (similar in spirit I gather to the "time travel" feature in older versions of Postgres) appear to be limited to only that specific case, while a more full approach as proposed by Lorentzos et al would allow versioning and timestamping and much more. The following is a link to a critique by Darwen and Date of various TSQL2 proposals. <http://www.hughdarwen.freeola.com/TheThirdManifesto.web/OnTSQL2.pdf> From what I gather, the SQL TSQL2 discussions ended without any conclusion as to extending SQL in this direction. Michael Glaesemann grzm myrealbox com
Michael Glaesemann wrote: > From what I gather, the SQL TSQL2 discussions ended without any > conclusion as to extending SQL in this direction. Darwen's critique of TSQL2 is here: http://www.hughdarwen.freeola.com/TheThirdManifesto.web/OnTSQL2.pdf I'm not sure if Snodgrass ever replied to it. The working draft's Part 7 was SQL/Temporal: http://www.jtc1sc32.org/sc32/jtc1sc32.nsf/f3b9a582bbf35d33852566210054191a/115c37d71e64bfd188256a5b00442b3e?OpenDocument There's a bunch of temporal related doucments from the TimeCenter: http://www.cs.auc.dk/TimeCenter/pub.htm Partial indexes get me close to where I want with temporal features. I just wish the RI constraints had the ability to supply a WHERE clause. Between the two, it might get me were I want, rather than having to write triggers to ensure temporal integrity. Mike Mascari
Mike Thanks for the links! I remember coming across a TimeCenter paper. The TimeCenter reference page is quite extensive and I look forward to reading more of the work Darwen and Date critiqued. On Aug 6, 2004, at 6:37 PM, Mike Mascari wrote: > Partial indexes get me close to where I want with temporal features. I > just wish the RI constraints had the ability to supply a WHERE clause. > Between the two, it might get me were I want, rather than having to > write triggers to ensure temporal integrity. Would you mind going into more depth into how you're doing this? Michael Glaesemann grzm myrealbox com
On Mon, 2004-08-02 at 16:09 -0700, webb wrote: > What I am curious about is versioning the data that goes into this > database using something that I want to call a "checkpoint". :-) How about using CHECKPOINT or SAVEPOINT in postgresql 8.0? (in beta now) http://developer.postgresql.org/docs/postgres/sql-checkpoint.html http://developer.postgresql.org/docs/postgres/sql-savepoint.html There are some examples in postgresql 7.5 new features announcements over the net. -- Marius Andreiana Galuna - Solutii Linux in Romania http://www.galuna.ro
Michael Glaesemann wrote: > Thanks for the links! I remember coming across a TimeCenter paper. The > TimeCenter reference page is quite extensive and I look forward to > reading more of the work Darwen and Date critiqued. Sorry for the redundant Darwen critique link. > > On Aug 6, 2004, at 6:37 PM, Mike Mascari wrote: > >> Partial indexes get me close to where I want with temporal features. I >> just wish the RI constraints had the ability to supply a WHERE clause. >> Between the two, it might get me were I want, rather than having to >> write triggers to ensure temporal integrity. > > > Would you mind going into more depth into how you're doing this? It's rather crude. I've a start and end date on all temporal relations. I've a surrogate key as well. The uniqueness of the actual candidate key is enforced by a partial index only over tuples whose end date IS NULL. I maintain "temporal integrity" through ON UPDATE triggers. When a temporal tuple is deactivated, all relations that have the soon-to-be deactivated tuple's surrogate key are also deactivated. I also have a corresponding view for each temporal relation that is queried by non-reporting user interfaces. They see the active tuples, while the reporting component queries the base relations for the entire history. However, this is all maintained by a series of spaghetti C-language SPI routines, and my own home-brewed version of pg_constraint, as opposed to nice declared referential integrity, for the obvious reasons. Mike Mascari
> :-) > How about using CHECKPOINT or SAVEPOINT in postgresql 8.0? (in beta now) > http://developer.postgresql.org/docs/postgres/sql-checkpoint.html > http://developer.postgresql.org/docs/postgres/sql-savepoint.html I guess checkpoint is the wrong term (duh...), since what I want is to save all possible versions. "REVERT TO id" would be my vote for SQL 2010 if this feature is enabled. Thansk to the folks above for the links. I have the snodgrass book -- I will check it again. W > > There are some examples in postgresql 7.5 new features announcements > over the net. >
Have you looked in contrib/spi/README.timetravel ? On Fri, 2004-08-06 at 10:56, webb wrote: > > :-) > > How about using CHECKPOINT or SAVEPOINT in postgresql 8.0? (in beta now) > > http://developer.postgresql.org/docs/postgres/sql-checkpoint.html > > http://developer.postgresql.org/docs/postgres/sql-savepoint.html > > I guess checkpoint is the wrong term (duh...), since what I want is to > save all possible versions. "REVERT TO id" would be my vote for SQL > 2010 if this feature is enabled. > > Thansk to the folks above for the links. I have the snodgrass book -- I > will check it again. > W > > > > There are some examples in postgresql 7.5 new features announcements > > over the net. > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match >