Thread: Surviving transaction-ID wraparound, take 2

Surviving transaction-ID wraparound, take 2

From
Tom Lane
Date:
This is an attempt to flesh out the ideas of my earlier proposal
(http://fts.postgresql.org/db/mw/msg.html?mid=67786) into a complete
description of how things should work.  It's still largely the same
proposal, but I have adopted a couple of the ideas from the followup
discussion --- notably Vadim's suggestion that pg_log should be divided
into segments.

I still think that expanding transaction IDs (XIDs) to 8 bytes is no help.
Aside from portability and performance issues, allowing pg_log to grow
without bound just isn't gonna do.  So, the name of the game is to recycle
XIDs in an appropriate fashion.  The intent of this design is to allow XID
recycling in a true 24x7 environment (ie, postmaster uptime must be able
to exceed 4G transactions --- no "restart" events are required).

This plan still depends on periodic VACUUMs, which was a point some people
didn't like the last time around.  However, given that we now have a
lightweight VACUUM that's meant to be run frequently, I don't think this
is a significant objection anymore.

Here's the plan:

1. XIDs will be divided into two kinds, "permanent" and "normal".  There
will be just two permanent XIDs: BootstrapXID (= 1) and FrozenXID (= 2).
(Actually we could get along with only one permanent XID, but it seems
useful for debugging to distinguish the bootstrap XID from transactions
frozen later.)  All XIDs >= 3 are "normal".  The XID generator starts at
3, and wraps around to 3 when it overflows its 32-bit limit.

2. Permanent XIDs are always considered committed and are older than all
normal XIDs.  Two normal XIDs are compared using modulo-2^31 arithmetic,
ie, x < y if ((int32) (y - x)) > 0.  This will work as long as no normal
XID survives in the database for more than 2^31 (2 billion) transactions;
if it did, it would suddenly appear to be "in the future" and thus be
considered uncommitted.  To allow a tuple to live for more than 2 billion
transactions, we must replace its xmin with a permanent XID sometime
before its initial normal XID expires.  FrozenXID is used for this
purpose.
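
To make the comparison rule concrete, here is a minimal sketch in C
(the type and function names are illustrative only, not a commitment
to what the code will actually call things):

typedef uint32 TransactionId;   /* uint32, int32, bool as in c.h */

#define BootstrapXID    ((TransactionId) 1)
#define FrozenXID       ((TransactionId) 2)
#define FirstNormalXID  ((TransactionId) 3)

/* Does xid1 precede xid2?  Permanent XIDs (< 3) are older than all
 * normal XIDs; two normal XIDs compare modulo 2^31. */
static bool
xid_precedes(TransactionId xid1, TransactionId xid2)
{
    if (xid1 < FirstNormalXID || xid2 < FirstNormalXID)
        return xid1 < xid2;
    return ((int32) (xid2 - xid1)) > 0;
}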

3. VACUUM will have the responsibility of replacing old normal XIDs with
FrozenXID.  It will do this whenever a committed-good tuple has xmin less
than a cutoff XID.  (There is no need to replace xmax, since if xmax is
committed good then the tuple itself will be removed.)  The cutoff XID
could be anything less than XmaxRecent (the oldest XID that might be
considered still running by any current transaction).  I believe that by
default it ought to be pretty old, say 1 billion transactions in the past.
This avoids expending I/O to update tuples that are unlikely to live long;
furthermore, keeping real XIDs around for some period of time is useful
for debugging.
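
In pseudo-C, using the comparison sketch above, the freeze step for one
tuple would be roughly (the helper names are hypothetical):

/* for each tuple whose xmin is known committed good: */
if (xmin >= FirstNormalXID && xid_precedes(xmin, cutoff_xid))
    set_tuple_xmin(tuple, FrozenXID);   /* tuple now survives wraparound */

where cutoff_xid would by default trail XmaxRecent by about 1 billion
transactions, per the above.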

4. To make this work, VACUUM must be run on every table at least once
every billion transactions.  To help keep track of this maintenance
requirement, we'll add two columns to pg_database.  Upon successful
completion of a database-wide (all tables) VACUUM, VACUUM will update the
current database's row in pg_database with the cutoff XID and XmaxRecent
XID that it used.  Inspection of pg_database will then show which
databases are in need of re-vacuuming.  The use of the XmaxRecent entry
will be explained below.

5. There should be a VACUUM option ("VACUUM FREEZE", unless someone can
come up with a better name) that causes the cutoff XID to be exactly
XmaxRecent, not far in the past.  Running VACUUM FREEZE in an otherwise
idle database guarantees that every surviving tuple is frozen.  I foresee
two uses for this:

A. Doing VACUUM FREEZE at completion of initdb ensures that
   template1 and template0 will have no unfrozen tuples.
   This is particularly critical for template0, since ordinarily
   one couldn't connect to it to vacuum it.

B. VACUUM FREEZE would be useful for pg_upgrade, since pg_log
   is no longer critical data after a FREEZE.


6. The oldest XmaxRecent value shown in pg_database tells us how far back
pg_log is interesting; we know that all tuples with XIDs older than that
are marked committed in the database, so we won't be probing pg_log to
verify their commit status anymore.  Therefore, we can discard pg_log
entries older than that.  To make this feasible, we should change pg_log
to a segmented representation, with segments of say 256KB (1 million
transaction entries).  A segment can be discarded once oldest-XmaxRecent
advances past it.  At completion of a database-wide VACUUM, in addition
to updating our own pg_database row we'll scan the other rows to determine
the oldest XmaxRecent, and then remove no-longer-needed pg_log segments.
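
To illustrate the arithmetic (assuming we keep the existing two status
bits per transaction, so a 256KB segment holds exactly 1048576 entries;
the macro and function names below are mine):

#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  4
#define CLOG_SEG_BYTES       (256 * 1024)
#define CLOG_XACTS_PER_SEG   (CLOG_SEG_BYTES * CLOG_XACTS_PER_BYTE)

/* Which pg_log segment holds this XID's status ... */
static uint32
xid_to_segment(TransactionId xid)
{
    return xid / CLOG_XACTS_PER_SEG;
}

/* ... and at which byte offset within that segment? */
static uint32
xid_to_byte(TransactionId xid)
{
    return (xid % CLOG_XACTS_PER_SEG) / CLOG_XACTS_PER_BYTE;
}

A segment becomes removable as soon as every XID it covers precedes the
oldest XmaxRecent.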

7. A small difficulty with the above is that if template0 is never
vacuumed after initdb, its XmaxRecent entry would always be the oldest and
would keep us from discarding any of pg_log.  A brute force answer is to
ignore template0 while calculating oldest-XmaxRecent, but perhaps someone
can see a cleaner way.

8. Currently, pg_log is accessed through the buffer manager as if it were
an ordinary relation.  It seems difficult to continue this approach if we
are going to allow removal of segments before the current segment (md.c
will not be happy with that).  Instead, I plan to build a special access
mechanism for pg_log that will buffer a few pages of pg_log in shared
memory.  I think much of this can be lifted verbatim from the WAL access
code.
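
Roughly, I envision something like this living in shared memory (the
sizes and field names are guesses for illustration, not a final design):

#define NUM_CLOG_BUFFERS  8      /* a few pages should be plenty */
#define CLOG_PAGE_SIZE    8192

typedef struct ClogBufferCtl
{
    char    page[NUM_CLOG_BUFFERS][CLOG_PAGE_SIZE]; /* page images */
    uint32  pageno[NUM_CLOG_BUFFERS];  /* which pg_log page each slot holds */
    bool    dirty[NUM_CLOG_BUFFERS];   /* needs write-back before eviction */
    /* plus a lock and per-slot LRU counters for victim selection */
} ClogBufferCtl;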

9. WAL redo for pg_log updates will be handled like this: (A) Whenever a
transaction is assigned the first XID in a new pg_log page's worth of
XIDs, we will allocate and zero that page of pg_log, and enter a record
into the WAL that reports having done so.  (We must do this while holding
the XidGenLock lock, which is annoying but it only happens once every 32K
transactions.  Note that no actual I/O need happen, we are just zeroing a
buffer in shared memory and emitting a WAL record.)  Now, before any
transaction commit can modify that page of pg_log, we are guaranteed that
the zero-the-page WAL entry will be flushed to disk.  On crash and
restart, we re-zero the page when we see the zeroing WAL entry, and then
reapply the transaction commit and abort operations shown later in WAL.
AFAICS we do not need to maintain page LSN or SUI information for pg_log
pages if we do it this way (Vadim, do you agree?).  NOTE: unless I'm
missing something, 7.1's WAL code fails to guard against loss of pg_log
pages at all, so this should provide an improvement in reliability.
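
In outline (the function names are hypothetical, and I'm leaving the
WAL record layout unspecified):

/* Called with XidGenLock held, when the newly assigned XID is the
 * first one on a not-yet-existing pg_log page. */
void
clog_zero_page(uint32 pageno)
{
    memset(clog_page_buffer(pageno), 0, CLOG_PAGE_SIZE); /* memory only */
    emit_clog_zeropage_record(pageno);  /* WAL entry; no data-file I/O */
}

/* WAL redo: re-zero the page; the commit/abort records that follow in
 * the WAL stream then repopulate its status bits. */
void
clog_redo_zero_page(uint32 pageno)
{
    memset(clog_page_buffer(pageno), 0, CLOG_PAGE_SIZE);
}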

10. It'd be practical to keep a checksum on pg_log pages with this
implementation (the checksum would be updated just before writing out a
pg_log page, and checked on read).  Not sure if this is worth doing,
but given the critical nature of pg_log it might be a good idea.
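
Something on this order, perhaps (everything here, including where the
checksum lives and how it's computed, is hypothetical):

/* just before writing out a pg_log page: */
hdr->checksum = 0;
hdr->checksum = compute_checksum(page, CLOG_PAGE_SIZE);

/* on reading it back: */
saved = hdr->checksum;
hdr->checksum = 0;
if (compute_checksum(page, CLOG_PAGE_SIZE) != saved)
    elog(STOP, "corrupted pg_log page %u", pageno);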


Things to do later
------------------

It's looking more and more like an automatic VACUUM scheduler would be a
good idea --- aside from the normal use of VACUUM for space reclamation,
the scheduler could be relied on to dispatch VACUUM to databases whose
cutoff XID was getting too far back.  I don't think I have time to work on
this for 7.2, but it should be a project for 7.3.
        regards, tom lane


Re: Surviving transaction-ID wraparound, take 2

From
Bruce Momjian
Date:
> 3. VACUUM will have the responsibility of replacing old normal XIDs with
> FrozenXID.  It will do this whenever a committed-good tuple has xmin less
> than a cutoff XID.  (There is no need to replace xmax, since if xmax is
> committed good then the tuple itself will be removed.)  The cutoff XID
> could be anything less than XmaxRecent (the oldest XID that might be
> considered still running by any current transaction).  I believe that by
> default it ought to be pretty old, say 1 billion transactions in the past.
> This avoids expending I/O to update tuples that are unlikely to live long;
> furthermore, keeping real XIDs around for some period of time is useful
> for debugging.
> 
> 4. To make this work, VACUUM must be run on every table at least once
> every billion transactions.  To help keep track of this maintenance
> requirement, we'll add two columns to pg_database.  Upon successful
> completion of a database-wide (all tables) VACUUM, VACUUM will update the
> current database's row in pg_database with the cutoff XID and XmaxRecent
> XID that it used.  Inspection of pg_database will then show which
> databases are in need of re-vacuuming.  The use of the XmaxRecent entry
> will be explained below.

I like the 1 billion in the past idea, and recording it in pg_database
so we can quickly know how far back we can go to recycle xids.  Nice.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Surviving transaction-ID wraparound, take 2

From
Horst Herb
Date:
On Tuesday 14 August 2001 02:25, you wrote:

> I still think that expanding transaction IDs (XIDs) to 8 bytes is no help.
> Aside from portability and performance issues, allowing pg_log to grow
> without bound just isn't gonna do.  So, the name of the game is to recycle

But what about all of us who need to establish a true long term audit trail? 
For us, still the most elegant solution would be a quasi unlimited supply of 
unique row identifiers. 64 bit would be a huge help (and will be ubiquitous 
in a few years time anyway). 

Everything else will have a far greater performance impact for us. While it
might not suit everybody, I can't see why it should be a problem to offer
this as an *option*.

There are other applications where we need database-wide unique row
identifiers; in our project, for example, we allow row-level encryption on
non-indexed attributes across all tables. How would you implement such a
thing without having unique row identifiers across your whole database? You
would need to reference both the row AND the table, and that must certainly
be more expensive in terms of performance.

Horst


Re: Surviving transaction-ID wraparound, take 2

From
Tom Lane
Date:
Horst Herb <hherb@malleenet.net.au> writes:
> On Tuesday 14 August 2001 02:25, you wrote:
>> I still think that expanding transaction IDs (XIDs) to 8 bytes is no help.

> But what about all of us who need to establish a true long term audit trail? 
> For us, still the most elegant solution would be a quasi unlimited supply of 
> unique row identifiers. 64 bit would be a huge help (and will be ubiquitous 
> in a few years time anyway). 

Uh, that has nothing to do with transaction identifiers ...
        regards, tom lane


Re: Surviving transaction-ID wraparound, take 2

From
Jan Wieck
Date:
Tom Lane wrote:
> Horst Herb <hherb@malleenet.net.au> writes:
> > On Tuesday 14 August 2001 02:25, you wrote:
> >> I still think that expanding transaction IDs (XIDs) to 8 bytes is no help.
>
> > But what about all of us who need to establish a true long term audit trail?
> > For us, still the most elegant solution would be a quasi unlimited supply of
> > unique row identifiers. 64 bit would be a huge help (and will be ubiquitous
> > in a few years time anyway).
>
> Uh, that has nothing to do with transaction identifiers ...
   And he who needs that kind of long term row identifiers would
   be better off with 8-byte sequences anyway - IMNSVHO.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






int8 sequences --- small implementation problem

From
Tom Lane
Date:
Jan Wieck <JanWieck@yahoo.com> writes:
>     And he who needs that kind of long term row identifiers would
>     be better off with 8-byte sequences anyway - IMNSVHO.

Indeed.  I've been looking at switching sequences to be int8, and I see
just one little problem, which is to uphold the promise that they'll
still work as if int4 on a machine that has no int64 C datatype.
The difficulty is that sequence.h has

typedef struct FormData_pg_sequence
{
    NameData    sequence_name;
    int32       last_value;
    int32       increment_by;
    int32       max_value;
    int32       min_value;
    int32       cache_value;
    int32       log_cnt;
    char        is_cycled;
    char        is_called;
} FormData_pg_sequence;

If I just change "int32" to "int64" here, all is well on machines where
sizeof(int64) is 8.  But if there's no 64-bit C datatype, int64 is
typedef'd as "long int", so sizeof(int64) is only 4.  Result: the struct
declaration won't agree with the heaptuple layout --- since the tuple
routines will believe that the datatype of these columns has size 8.

What I need is a way to pad the struct declaration so that it leaves
8 bytes per int64 column, no matter what.  I thought of

typedef struct FormData_pg_sequence
{
    NameData    sequence_name;
    int64       last_value;
#ifdef INT64_IS_BUSTED
    int32       pad1;
#endif
    int64       increment_by;
#ifdef INT64_IS_BUSTED
    int32       pad2;
#endif
    int64       max_value;
#ifdef INT64_IS_BUSTED
    int32       pad3;
#endif
    int64       min_value;
#ifdef INT64_IS_BUSTED
    int32       pad4;
#endif
    int64       cache_value;
#ifdef INT64_IS_BUSTED
    int32       pad5;
#endif
    int64       log_cnt;
#ifdef INT64_IS_BUSTED
    int32       pad6;
#endif
    char        is_cycled;
    char        is_called;
} FormData_pg_sequence;

This would work, I think, but my goodness it's an ugly solution.
Has any hacker got a better one?
        regards, tom lane


Re: int8 sequences --- small implementation problem

From
"Serguei Mokhov"
Date:
----- Original Message ----- 
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Tuesday, August 14, 2001 11:28 AM


> "Serguei Mokhov" <sa_mokho@alcor.concordia.ca> writes:
> >> This would work, I think, but my goodness it's an ugly solution.
> 
> > Is anything wrong with just having two int32 per value for this case?
> 
> Well, we do want it to be int64 on machines where int64 is properly
> defined.  Or are you suggesting
> 
> #ifdef INT64_IS_BUSTED
> int32 last_value;
> int32 pad1;
> #else
> int64 last_value;
> #endif
> 
> That does seem marginally more robust, now that you mention it...

Yes, this version is more robust, but you still have to cope with all
those #ifdef INT64_IS_BUSTED #else #endif blocks. I guess if you want an
explicit int64 type in here for those platforms that do support it, then
maybe there is no other way. What I was thinking (for this particular
struct only!) is to just have padded int32's for every value, which will
always be correct, with no marginal cases. And the accessor functions
using the struct would just employ int64, whatever that means on the
platform.

S.






Re: int8 sequences --- small implementation problem

From
Neil Padgett
Date:
Tom Lane wrote:
[clip]
> 
> This would work, I think, but my goodness it's an ugly solution.
> Has any hacker got a better one?
> 
>                         regards, tom lane

How about:

#ifdef INT64_IS_BUSTED
#define int64aligned(name) int32 name##_; int64 name
#else
#define int64aligned(name) int64 name
#endif

typedef struct FormData_pg_sequence
{
       NameData        sequence_name;
       int64aligned(last_value);
       int64aligned(increment_by);
       int64aligned(max_value);
       int64aligned(min_value);
       int64aligned(cache_value);
       int64aligned(log_cnt);
       char            is_cycled;
       char            is_called;
} FormData_pg_sequence;

Neil

-- 
Neil Padgett
Red Hat Canada Ltd.                       E-Mail:  npadgett@redhat.com
2323 Yonge Street, Suite #300, 
Toronto, ON  M4P 2C9


Re: int8 sequences --- small implementation problem

From
Tom Lane
Date:
"Joe Conway" <joseph.conway@home.com> writes:
>> What I need is a way to pad the struct declaration so that it leaves
>> 8 bytes per int64 column, no matter what.  I thought of

> What if you defined int64 as a union made up of one "long int" member and
> one 8 byte char member, and then always refer to the "long int"?

Well, that'd remove the notational ugliness from the struct definition,
at the cost of adding it to the code that uses the struct.  I think I'd
prefer to uglify the struct and keep the code simple.  But it's a good
thought.
        regards, tom lane


Re: int8 sequences --- small implementation problem

From
Tom Lane
Date:
"Serguei Mokhov" <sa_mokho@alcor.concordia.ca> writes:
>> This would work, I think, but my goodness it's an ugly solution.

> Is anything wrong with just having two int32 per value for this case?

Well, we do want it to be int64 on machines where int64 is properly
defined.  Or are you suggesting

#ifdef INT64_IS_BUSTED
int32 last_value;
int32 pad1;
#else
int64 last_value;
#endif

That does seem marginally more robust, now that you mention it...
        regards, tom lane


Re: int8 sequences --- small implementation problem

From
Stephan Szabo
Date:
On Tue, 14 Aug 2001, Tom Lane wrote:

> Jan Wieck <JanWieck@yahoo.com> writes:
> >     And he who needs that kind of long term row identifiers would
> >     be better off with 8-byte sequences anyway - IMNSVHO.
> 
> What I need is a way to pad the struct declaration so that it leaves
> 8 bytes per int64 column, no matter what.  I thought of
> 
> This would work, I think, but my goodness it's an ugly solution.
> Has any hacker got a better one?

The only thing I could think of is using a struct to hide the
padding details instead of directly using int64, but then you'd have to
add a '.value' or something to the references.  I'm not sure that's really
any cleaner.
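
Something like this, say (rendering the idea concretely; the names are
placeholders):

typedef struct
{
    int64   value;
#ifdef INT64_IS_BUSTED
    int32   pad;    /* fill the field out to 8 bytes */
#endif
} int64field;

and then references become seq->last_value.value instead of
seq->last_value.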



Re: int8 sequences --- small implementation problem

From
"Joe Conway"
Date:
> typedef struct FormData_pg_sequence
> {
> NameData sequence_name;
> int32 last_value;
> int32 increment_by;
> int32 max_value;
> int32 min_value;
> int32 cache_value;
> int32 log_cnt;
> char is_cycled;
> char is_called;
> } FormData_pg_sequence;
>
> If I just change "int32" to "int64" here, all is well on machines where
> sizeof(int64) is 8.  But if there's no 64-bit C datatype, int64 is
> typedef'd as "long int", so sizeof(int64) is only 4.  Result: the struct
> declaration won't agree with the heaptuple layout --- since the tuple
> routines will believe that the datatype of these columns has size 8.
>
> What I need is a way to pad the struct declaration so that it leaves
> 8 bytes per int64 column, no matter what.  I thought of
>

What if you defined int64 as a union made up of one "long int" member and
one 8 byte char member, and then always refer to the "long int"?
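
That is, for the INT64_IS_BUSTED case, presumably something like
(illustrative only):

typedef union
{
    long int    val;    /* what the code actually reads and writes */
    char        pad[8]; /* forces the union to occupy 8 bytes */
} int64;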

-- Joe




Re: int8 sequences --- small implementation problem

From
"Serguei Mokhov"
Date:
----- Original Message ----- 
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Tuesday, August 14, 2001 10:09 AM


> typedef struct FormData_pg_sequence
> {
> NameData sequence_name;
> int64 last_value;
> #ifdef INT64_IS_BUSTED
> int32 pad1;
[snip]
> } FormData_pg_sequence;
> 
> This would work, I think, but my goodness it's an ugly solution.

Is anything wrong with just having two int32 per value for this case?

typedef struct FormData_pg_sequence
{
    int32 last_value;
    int32 pad1;
...
} FormData_pg_sequence;

S.




Re: int8 sequences --- small implementation problem

From
Jan Wieck
Date:
Stephan Szabo wrote:
> On Tue, 14 Aug 2001, Tom Lane wrote:
>
> > Jan Wieck <JanWieck@yahoo.com> writes:
> > >     And he who needs that kind of long term row identifiers would
> > >     be better off with 8-byte sequences anyway - IMNSVHO.
> >
> > What I need is a way to pad the struct declaration so that it leaves
> > 8 bytes per int64 column, no matter what.  I thought of
> >
> > This would work, I think, but my goodness it's an ugly solution.
> > Has any hacker got a better one?
>
> The only thing I could think of is using a struct to hide the
> padding details instead of directly using int64, but then you'd have to
> add a '.value' or something to the references.  I'm not sure that's really
> any cleaner.
   What I'm asking myself all the time is "which platforms do we
   support that don't have 8-byte integers?".  Could someone
   enlighten me please?

   And what does int8 do on these platforms?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


