Thread: Surviving transaction-ID wraparound, take 2
This is an attempt to flesh out the ideas of my earlier proposal (http://fts.postgresql.org/db/mw/msg.html?mid=67786) into a complete description of how things should work. It's still largely the same proposal, but I have adopted a couple of the ideas from the followup discussion --- notably Vadim's suggestion that pg_log should be divided into segments.

I still think that expanding transaction IDs (XIDs) to 8 bytes is no help. Aside from portability and performance issues, allowing pg_log to grow without bound just isn't gonna do. So, the name of the game is to recycle XIDs in an appropriate fashion.

The intent of this design is to allow XID recycling in a true 24x7 environment (ie, postmaster uptime must be able to exceed 4G transactions --- no "restart" events are required). This plan still depends on periodic VACUUMs, which was a point some people didn't like the last time around. However, given that we now have a lightweight VACUUM that's meant to be run frequently, I don't think this is a significant objection anymore.

Here's the plan:

1. XIDs will be divided into two kinds, "permanent" and "normal". There will be just two permanent XIDs: BootstrapXID (= 1) and FrozenXID (= 2). (Actually we could get along with only one permanent XID, but it seems useful for debugging to distinguish the bootstrap XID from transactions frozen later.) All XIDs >= 3 are "normal". The XID generator starts at 3, and wraps around to 3 when it overflows its 32-bit limit.

2. Permanent XIDs are always considered committed and are older than all normal XIDs. Two normal XIDs are compared using modulo-2^31 arithmetic, ie, x < y if ((int32) (y - x)) > 0. This will work as long as no normal XID survives in the database for more than 2^31 (2 billion) transactions; if it did, it would suddenly appear to be "in the future" and thus be considered uncommitted.
To allow a tuple to live for more than 2 billion transactions, we must replace its xmin with a permanent XID sometime before its initial normal XID expires. FrozenXID is used for this purpose.

3. VACUUM will have the responsibility of replacing old normal XIDs with FrozenXID. It will do this whenever a committed-good tuple has xmin less than a cutoff XID. (There is no need to replace xmax, since if xmax is committed good then the tuple itself will be removed.) The cutoff XID could be anything less than XmaxRecent (the oldest XID that might be considered still running by any current transaction). I believe that by default it ought to be pretty old, say 1 billion transactions in the past. This avoids expending I/O to update tuples that are unlikely to live long; furthermore, keeping real XIDs around for some period of time is useful for debugging.

4. To make this work, VACUUM must be run on every table at least once every billion transactions. To help keep track of this maintenance requirement, we'll add two columns to pg_database. Upon successful completion of a database-wide (all tables) VACUUM, VACUUM will update the current database's row in pg_database with the cutoff XID and XmaxRecent XID that it used. Inspection of pg_database will then show which databases are in need of re-vacuuming. The use of the XmaxRecent entry will be explained below.

5. There should be a VACUUM option ("VACUUM FREEZE", unless someone can come up with a better name) that causes the cutoff XID to be exactly XmaxRecent, not far in the past. Running VACUUM FREEZE in an otherwise idle database guarantees that every surviving tuple is frozen. I foresee two uses for this:

A. Doing VACUUM FREEZE at completion of initdb ensures that template1 and template0 will have no unfrozen tuples. This is particularly critical for template0, since ordinarily one couldn't connect to it to vacuum it.

B. VACUUM FREEZE would be useful for pg_upgrade, since pg_log is no longer critical data after a FREEZE.

6.
The oldest XmaxRecent value shown in pg_database tells us how far back pg_log is interesting; we know that all tuples with XIDs older than that are marked committed in the database, so we won't be probing pg_log to verify their commit status anymore. Therefore, we can discard pg_log entries older than that. To make this feasible, we should change pg_log to a segmented representation, with segments of say 256KB (1 million transaction entries). A segment can be discarded once oldest-XmaxRecent advances past it. At completion of a database-wide VACUUM, in addition to updating our own pg_database row we'll scan the other rows to determine the oldest XmaxRecent, and then remove no-longer-needed pg_log segments.

7. A small difficulty with the above is that if template0 is never vacuumed after initdb, its XmaxRecent entry would always be the oldest and would keep us from discarding any of pg_log. A brute-force answer is to ignore template0 while calculating oldest-XmaxRecent, but perhaps someone can see a cleaner way.

8. Currently, pg_log is accessed through the buffer manager as if it were an ordinary relation. It seems difficult to continue this approach if we are going to allow removal of segments before the current segment (md.c will not be happy with that). Instead, I plan to build a special access mechanism for pg_log that will buffer a few pages of pg_log in shared memory. I think much of this can be lifted verbatim from the WAL access code.

9. WAL redo for pg_log updates will be handled like this: (A) Whenever a transaction is assigned the first XID in a new pg_log page's worth of XIDs, we will allocate and zero that page of pg_log, and enter a record into the WAL that reports having done so. (We must do this while holding the XidGenLock lock, which is annoying, but it only happens once every 32K transactions. Note that no actual I/O need happen; we are just zeroing a buffer in shared memory and emitting a WAL record.)
Now, before any transaction commit can modify that page of pg_log, we are guaranteed that the zero-the-page WAL entry will be flushed to disk. On crash and restart, we re-zero the page when we see the zeroing WAL entry, and then reapply the transaction commit and abort operations shown later in WAL. AFAICS we do not need to maintain page LSN or SUI information for pg_log pages if we do it this way (Vadim, do you agree?).

NOTE: unless I'm missing something, 7.1's WAL code fails to guard against loss of pg_log pages at all, so this should provide an improvement in reliability.

10. It'd be practical to keep a checksum on pg_log pages with this implementation (the checksum would be updated just before writing out a pg_log page, and checked on read). Not sure if this is worth doing, but given the critical nature of pg_log it might be a good idea.

Things to do later
------------------

It's looking more and more like an automatic VACUUM scheduler would be a good idea --- aside from the normal use of VACUUM for space reclamation, the scheduler could be relied on to dispatch VACUUM to databases whose cutoff XID was getting too far back. I don't think I have time to work on this for 7.2, but it should be a project for 7.3.

			regards, tom lane
> 3. VACUUM will have the responsibility of replacing old normal XIDs with
> FrozenXID. It will do this whenever a committed-good tuple has xmin less
> than a cutoff XID. (There is no need to replace xmax, since if xmax is
> committed good then the tuple itself will be removed.) The cutoff XID
> could be anything less than XmaxRecent (the oldest XID that might be
> considered still running by any current transaction). I believe that by
> default it ought to be pretty old, say 1 billion transactions in the past.
> This avoids expending I/O to update tuples that are unlikely to live long;
> furthermore, keeping real XIDs around for some period of time is useful
> for debugging.
>
> 4. To make this work, VACUUM must be run on every table at least once
> every billion transactions. To help keep track of this maintenance
> requirement, we'll add two columns to pg_database. Upon successful
> completion of a database-wide (all tables) VACUUM, VACUUM will update the
> current database's row in pg_database with the cutoff XID and XmaxRecent
> XID that it used. Inspection of pg_database will then show which
> databases are in need of re-vacuuming. The use of the XmaxRecent entry
> will be explained below.

I like the 1-billion-in-the-past idea, and recording it in pg_database so we can quickly know how far back we can go to recycle xids. Nice.

-- 
Bruce Momjian                         |  http://candle.pha.pa.us
pgman@candle.pha.pa.us                |  (610) 853-3000
  +  If your life is a hard drive,    |  830 Blythe Avenue
  +  Christ can be your backup.       |  Drexel Hill, Pennsylvania 19026
On Tuesday 14 August 2001 02:25, you wrote:
> I still think that expanding transaction IDs (XIDs) to 8 bytes is no help.
> Aside from portability and performance issues, allowing pg_log to grow
> without bound just isn't gonna do. So, the name of the game is to recycle

But what about all of us who need to establish a true long-term audit trail? For us, the most elegant solution would still be a quasi-unlimited supply of unique row identifiers. 64 bit would be a huge help (and will be ubiquitous in a few years' time anyway). Everything else will have a far greater performance impact for us. While it might not suit everybody, I can't see why it should be a problem to offer this as an *option*.

There are other applications where we need database-wide unique row identifiers; in our project, for example, we allow row-level encryption on non-indexed attributes across all tables. How would you implement such a thing without having unique row identifiers across your whole database? You would need to reference both row AND table, and that must certainly be more expensive in terms of performance.

Horst
Horst Herb <hherb@malleenet.net.au> writes:
> On Tuesday 14 August 2001 02:25, you wrote:
>> I still think that expanding transaction IDs (XIDs) to 8 bytes is no help.

> But what about all of us who need to establish a true long term audit trail?
> For us, still the most elegant solution would be a quasi unlimited supply of
> unique row identifiers. 64 bit would be a huge help (and will be ubiquitous
> in a few years time anyway).

Uh, that has nothing to do with transaction identifiers ...

			regards, tom lane
Tom Lane wrote:
> Horst Herb <hherb@malleenet.net.au> writes:
> > On Tuesday 14 August 2001 02:25, you wrote:
> >> I still think that expanding transaction IDs (XIDs) to 8 bytes is no help.
> >
> > But what about all of us who need to establish a true long term audit trail?
> > For us, still the most elegant solution would be a quasi unlimited supply of
> > unique row identifiers. 64 bit would be a huge help (and will be ubiquitous
> > in a few years time anyway).
>
> Uh, that has nothing to do with transaction identifiers ...

And he who needs that kind of long term row identifiers would be better off with 8-byte sequences anyway - IMNSVHO.

Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com
Jan Wieck <JanWieck@yahoo.com> writes:
> And he who needs that kind of long term row identifiers would
> be better off with 8-byte sequences anyway - IMNSVHO.

Indeed. I've been looking at switching sequences to be int8, and I see just one little problem, which is to uphold the promise that they'll still work as if int4 on a machine that has no int64 C datatype. The difficulty is that sequence.h has

typedef struct FormData_pg_sequence
{
	NameData	sequence_name;
	int32		last_value;
	int32		increment_by;
	int32		max_value;
	int32		min_value;
	int32		cache_value;
	int32		log_cnt;
	char		is_cycled;
	char		is_called;
} FormData_pg_sequence;

If I just change "int32" to "int64" here, all is well on machines where sizeof(int64) is 8. But if there's no 64-bit C datatype, int64 is typedef'd as "long int", so sizeof(int64) is only 4. Result: the struct declaration won't agree with the heaptuple layout --- since the tuple routines will believe that the datatype of these columns has size 8.

What I need is a way to pad the struct declaration so that it leaves 8 bytes per int64 column, no matter what. I thought of

typedef struct FormData_pg_sequence
{
	NameData	sequence_name;
	int64		last_value;
#ifdef INT64_IS_BUSTED
	int32		pad1;
#endif
	int64		increment_by;
#ifdef INT64_IS_BUSTED
	int32		pad2;
#endif
	int64		max_value;
#ifdef INT64_IS_BUSTED
	int32		pad3;
#endif
	int64		min_value;
#ifdef INT64_IS_BUSTED
	int32		pad4;
#endif
	int64		cache_value;
#ifdef INT64_IS_BUSTED
	int32		pad5;
#endif
	int64		log_cnt;
#ifdef INT64_IS_BUSTED
	int32		pad6;
#endif
	char		is_cycled;
	char		is_called;
} FormData_pg_sequence;

This would work, I think, but my goodness it's an ugly solution. Has any hacker got a better one?

			regards, tom lane
----- Original Message -----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Tuesday, August 14, 2001 11:28 AM

> "Serguei Mokhov" <sa_mokho@alcor.concordia.ca> writes:
> >> This would work, I think, but my goodness it's an ugly solution.
>
> > Is anything wrong with just having two int32 per value for this case?
>
> Well, we do want it to be int64 on machines where int64 is properly
> defined. Or are you suggesting
>
> #ifdef INT64_IS_BUSTED
>     int32 last_value;
>     int32 pad1;
> #else
>     int64 last_value;
> #endif
>
> That does seem marginally more robust, now that you mention it...

Yes, this version is more robust, but you still have to cope with all those #ifdef INT64_IS_BUSTED / #else / #endif. I guess if you explicitly want the int64 type in there for those platforms that do support it, then maybe there is no other way.

What I was thinking (for this particular struct only!) is just to have padded int32's for every value, which will always be correct with no marginal problems. The accessor functions using the struct would then just employ int64, whatever it means.

S.
Tom Lane wrote:
[clip]
> This would work, I think, but my goodness it's an ugly solution.
> Has any hacker got a better one?
>
> 			regards, tom lane

How about:

#ifdef INT64_IS_BUSTED
#define int64aligned(name)	int32 name##_; int64 name
#else
#define int64aligned(name)	int64 name
#endif

typedef struct FormData_pg_sequence
{
	NameData	sequence_name;
	int64aligned(last_value);
	int64aligned(increment_by);
	int64aligned(max_value);
	int64aligned(min_value);
	int64aligned(cache_value);
	int64aligned(log_cnt);
	char		is_cycled;
	char		is_called;
} FormData_pg_sequence;

Neil

-- 
Neil Padgett
Red Hat Canada Ltd.                       E-Mail: npadgett@redhat.com
2323 Yonge Street, Suite #300,
Toronto, ON M4P 2C9
"Joe Conway" <joseph.conway@home.com> writes: >> What I need is a way to pad the struct declaration so that it leaves >> 8 bytes per int64 column, no matter what. I thought of > What if you defined int64 as a union made up of one "long int" member and > one 8 byte char member, and then always refer to the "long int"? Well, that'd remove the notational ugliness from the struct definition, at the cost of adding it to the code that uses the struct. I think I'd prefer to uglify the struct and keep the code simple. But it's a good thought. regards, tom lane
"Serguei Mokhov" <sa_mokho@alcor.concordia.ca> writes: >> This would work, I think, but my goodness it's an ugly solution. > Is anything wrong with just having two int32 per value for this case? Well, we do want it to be int64 on machines where int64 is properly defined. Or are you suggesting #ifdef INT64_IS_BUSTEDint32 last_value;int32 pad1; #elseint64 last_value; #endif That does seem marginally more robust, now that you mention it... regards, tom lane
On Tue, 14 Aug 2001, Tom Lane wrote:

> Jan Wieck <JanWieck@yahoo.com> writes:
> > And he who needs that kind of long term row identifiers would
> > be better off with 8-byte sequences anyway - IMNSVHO.
>
> What I need is a way to pad the struct declaration so that it leaves
> 8 bytes per int64 column, no matter what. I thought of
>
> This would work, I think, but my goodness it's an ugly solution.
> Has any hacker got a better one?

The only thing I could think of is using a struct to hide the padding details instead of directly using int64, but then you'd have to add a '.value' or something to the references. I'm not sure that's really any cleaner.
> typedef struct FormData_pg_sequence
> {
> 	NameData	sequence_name;
> 	int32		last_value;
> 	int32		increment_by;
> 	int32		max_value;
> 	int32		min_value;
> 	int32		cache_value;
> 	int32		log_cnt;
> 	char		is_cycled;
> 	char		is_called;
> } FormData_pg_sequence;
>
> If I just change "int32" to "int64" here, all is well on machines where
> sizeof(int64) is 8. But if there's no 64-bit C datatype, int64 is
> typedef'd as "long int", so sizeof(int64) is only 4. Result: the struct
> declaration won't agree with the heaptuple layout --- since the tuple
> routines will believe that the datatype of these columns has size 8.
>
> What I need is a way to pad the struct declaration so that it leaves
> 8 bytes per int64 column, no matter what. I thought of

What if you defined int64 as a union made up of one "long int" member and one 8 byte char member, and then always refer to the "long int"?

-- Joe
----- Original Message -----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Tuesday, August 14, 2001 10:09 AM

> typedef struct FormData_pg_sequence
> {
> 	NameData	sequence_name;
> 	int64		last_value;
> #ifdef INT64_IS_BUSTED
> 	int32		pad1;
[snip]
> } FormData_pg_sequence;
>
> This would work, I think, but my goodness it's an ugly solution.

Is anything wrong with just having two int32 per value for this case?

typedef struct FormData_pg_sequence
{
	int32		last_value;
	int32		pad1;
	...
} FormData_pg_sequence;

S.
Stephan Szabo wrote:
> On Tue, 14 Aug 2001, Tom Lane wrote:
>
> > Jan Wieck <JanWieck@yahoo.com> writes:
> > > And he who needs that kind of long term row identifiers would
> > > be better off with 8-byte sequences anyway - IMNSVHO.
> >
> > What I need is a way to pad the struct declaration so that it leaves
> > 8 bytes per int64 column, no matter what. I thought of
> >
> > This would work, I think, but my goodness it's an ugly solution.
> > Has any hacker got a better one?
>
> The only thing I could think of is using a struct to hide the
> padding details instead of directly using int64, but then you'd have to
> add a '.value' or something to the references. I'm not sure that's really
> any cleaner.

What I'm asking myself all the time is "which platforms do we support that don't have 8-byte integers?". Could someone enlighten me please? And what does int8 do on these platforms?

Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #