Freezing without write I/O - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Freezing without write I/O
Msg-id 51A7553E.5070601@vmware.com
List pgsql-hackers
Since we're bashing around ideas about freezing, let me write down the
idea I've been pondering and discussing with various people for years. I
don't think I invented this myself; apologies to whoever did for not
giving credit.

The reason we have to freeze is that otherwise our 32-bit XIDs wrap
around and become ambiguous. The obvious solution is to extend XIDs to
64 bits, but that would waste a lot of space. The trick is to add a field
to the page header indicating the 'epoch' of the XID, while keeping the
XIDs in the tuple headers 32 bits wide (*).
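
To make the ambiguity concrete, here is a minimal sketch in Python. It
assumes XIDs are compared by the sign of their 32-bit difference; the
names precedes32 and precedes64 are made up for illustration and are not
PostgreSQL functions.

```python
XID_MASK = 0xFFFFFFFF  # low 32 bits of an XID


def precedes32(a, b):
    """Modular comparison: a is older than b if the 32-bit
    difference (a - b), read as a signed number, is negative."""
    diff = (a - b) & XID_MASK
    return diff >= 0x80000000


def precedes64(a, b):
    """With 64-bit XIDs an ordinary comparison is enough."""
    return a < b


old_xid = 100
new_xid = 100 + 2**31 + 5  # more than 2^31 transactions later

# The 64-bit comparison still orders the two XIDs correctly.
print(precedes64(old_xid, new_xid))

# Truncated to 32 bits, the old XID no longer appears to precede the
# new one: this is the ambiguity that freezing has to prevent.
print(precedes32(old_xid & XID_MASK, new_xid & XID_MASK))
```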

The other reason we freeze is to truncate the clog. But with 64-bit 
XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. 
Instead, we could implicitly treat anything older than relfrozenxid as 
frozen.
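
A rough sketch of the implicit-freeze check, in Python. The clog is
modelled as a plain dict and xid_is_committed is a hypothetical helper,
not an existing function. It also assumes vacuum has already removed
every dead tuple older than relfrozenxid, so any XID below that point
that is still on a page must have committed.

```python
def xid_is_committed(xid64, relfrozenxid64, clog):
    """Decide whether a 64-bit XID committed.

    clog is a dict mapping XID -> "committed" or "aborted"; it only
    needs entries for XIDs >= relfrozenxid64 (assumption: older
    entries have been truncated away).
    """
    if xid64 < relfrozenxid64:
        # Implicitly frozen: no clog lookup, and the on-disk XID
        # never had to be rewritten.
        return True
    return clog[xid64] == "committed"


# The clog has been truncated below XID 1000.
clog = {1000: "committed", 1001: "aborted"}
relfrozenxid = 1000

print(xid_is_committed(500, relfrozenxid, clog))   # older than relfrozenxid
print(xid_is_committed(1001, relfrozenxid, clog))  # needs a clog lookup
```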

That's the basic idea. Vacuum freeze would only need to remove dead
tuples; it wouldn't need to dirty pages that contain no dead tuples.

Since we're not storing 64-bit wide XIDs on every tuple, we'd still need 
to replace the XIDs with FrozenXid whenever the difference between the 
smallest and largest XID on a page exceeds 2^31. But that would only 
happen when you're updating the page, in which case the page is dirtied 
anyway, so it wouldn't cause any extra I/O.
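
As a sketch of that rule, in Python: freeze_on_update and the FROZEN
marker are hypothetical names, tuple XIDs are shown as full 64-bit
values for clarity, and the updating XID is assumed to be the largest
on the page. The exact boundary (exceeds vs. reaches 2^31) would need
care in a real implementation.

```python
XID_SPAN_LIMIT = 1 << 31  # widest span that 32-bit XIDs can represent
FROZEN = None             # stand-in for FrozenXid in this sketch


def freeze_on_update(tuple_xids, new_xid):
    """Return the page's tuple XIDs as they should look after an
    update by new_xid, freezing any XID that is too far behind."""
    live = [x for x in tuple_xids if x is not FROZEN]
    if not live or new_xid - min(live) <= XID_SPAN_LIMIT:
        # The span still fits: nothing to freeze.
        return list(tuple_xids)
    # The page is being dirtied by the update anyway, so freezing
    # the old XIDs here costs no extra write.
    return [
        FROZEN if (x is not FROZEN and new_xid - x > XID_SPAN_LIMIT) else x
        for x in tuple_xids
    ]


# XIDs close together: nothing changes.
print(freeze_on_update([100, 200], 300))

# One very old XID on the page: it is frozen as part of the update.
print(freeze_on_update([100, 2**31 + 150], 2**31 + 200))
```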

This would also be the first step in allowing the clog to grow larger 
than 2 billion transactions, eliminating the need for anti-wraparound 
freezing altogether. You'd still want to truncate the clog eventually, 
but it would be nice to not be pressed against the wall with "run vacuum 
freeze now, or the system will shut down".

(*) "Adding an epoch" is inaccurate, but I like to use that as my mental
model. If you just add a 32-bit epoch field, then you cannot have XIDs
from different epochs on the page, which would be a problem. In reality,
you would store one 64-bit XID value in the page header, and use that as
the "reference point" for all the 32-bit XIDs on the tuples. See the
existing convert_txid() function for how that works. Another method is
to store the 32-bit XID values in tuple headers as offsets from the
per-page 64-bit value, but then you'd always need to have the 64-bit
value at hand when interpreting the XIDs, even if they're all recent.
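
Here is a minimal Python sketch of the "reference point" method. The
function names are invented for illustration, and the special XID
values (invalid, bootstrap, frozen) are ignored. It picks the 64-bit
XID closest to the page's reference value whose low 32 bits match the
tuple's XID.

```python
XID_MASK = (1 << 32) - 1


def to_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= XID_MASK
    return value - (1 << 32) if value >= (1 << 31) else value


def expand_xid(xid32, page_ref_xid64):
    """Expand a 32-bit tuple XID to 64 bits, using the 64-bit value
    in the page header as the reference point."""
    delta = to_signed32(xid32 - (page_ref_xid64 & XID_MASK))
    return page_ref_xid64 + delta


# Reference point just after a wraparound: "epoch" 5, low bits 10.
page_ref = (5 << 32) + 10

# A tuple XID slightly ahead of the reference stays in epoch 5.
print(hex(expand_xid(20, page_ref)))

# A tuple XID with high low-order bits is resolved to epoch 4, which a
# plain per-page epoch field could not express.
print(hex(expand_xid(0xFFFFFFF0, page_ref)))
```

The offset method would instead compute page_ref_xid64 + offset for
every tuple, which is simpler but always needs the page header value.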

- Heikki


