From: Heikki Linnakangas <hlinnakangas@vmware.com>
Subject: Re: Freezing without write I/O
Msg-id: 51B621B9.8030903@vmware.com
In response to: Re: Freezing without write I/O (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

On 01.06.2013 23:21, Robert Haas wrote:
> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
>>> When this bit is set, it means there are no unfrozen tuples on the
>>> page with XIDs that predate the current half-epoch.  Whenever we know
>>> this to be true, we set the bit.  If the page LSN crosses more than
>>> one half-epoch boundary at a time, we freeze the page and set the bit.
>>>    If the page LSN crosses exactly one half-epoch boundary, then (1) if
>>> the bit is set, we clear it and (2) if the bit is not set, we freeze
>>> the page and set the bit.
>>
>> Yep, I think that would work. Want to write the patch, or should I? ;-)
>
> Have at it.

Here's a first draft. A lot of stuff missing and broken, but "make
check" passes :-).

In the patch, instead of working with "half-epochs", there are "XID-LSN
ranges", which can be of arbitrary size. An XID-LSN range consists of
three values:

minlsn: The point in WAL where this range begins.
minxid - maxxid: The range of XIDs allowed in this range.
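
(In C terms, each tracked entry could look roughly like this; the struct
and field names are mine, not necessarily what the patch uses:)

    typedef struct XidLSNRange
    {
        XLogRecPtr      minlsn;     /* point in WAL where this range begins */
        TransactionId   minxid;     /* oldest XID allowed under this range */
        TransactionId   maxxid;     /* newest XID allowed under this range */
    } XidLSNRange;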

Every point in WAL belongs to exactly one range. The XID ranges
(minxid - maxxid) of different entries can overlap. For example:

1. XIDs 25000942 - 27000003 LSN 0/3BB9938
2. XIDs 23000742 - 26000003 LSN 0/2AB9288
3. XIDs 22000721 - 25000003 LSN 0/1AB8BE0
4. XIDs 22000002 - 24000003 LSN 0/10B1550

The invariant with the ranges is that a page with a certain LSN is only
allowed to contain XIDs that belong to the range specified by that LSN.
For example, if a page has LSN 0/3500000, it belongs to the 2nd range,
and can only contain XIDs between 23000742 - 26000003. If a backend
updates the page, so that its LSN advances to, say, 0/3D12345, all
XIDs on the page older than 25000942 (the minxid of the 1st range)
must be frozen first, to avoid violating the rule.
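
To illustrate, here's a rough sketch of the lookup and the pre-update
freeze check, assuming a shared array "ranges" ordered newest-first as
in the list above (RangeForLSN and FreezeXidsBefore are hypothetical):

    /* Find the range a given LSN falls in; ranges[0] is the newest. */
    static XidLSNRange *
    RangeForLSN(XidLSNRange *ranges, int nranges, XLogRecPtr lsn)
    {
        int     i;

        for (i = 0; i < nranges; i++)
        {
            if (lsn >= ranges[i].minlsn)
                return &ranges[i];
        }
        return NULL;    /* predates all tracked ranges: a "mature" page */
    }

    /*
     * Before modifying a page: if its current LSN predates the newest
     * range, the new LSN will move it into that range, so freeze all
     * XIDs older than the newest range's minxid first.
     */
    if (PageGetLSN(page) < ranges[0].minlsn)
        FreezeXidsBefore(page, ranges[0].minxid);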

The system keeps track of a small number of these XID-LSN ranges. Where
we currently truncate clog, we can also truncate the ranges with maxxid
< the clog truncation point. Vacuum removes any dead tuples and updates
relfrozenxid as usual, to make sure that there are no dead tuples or
aborted XIDs older than the minxid of the oldest tracked XID-LSN range.
It no longer needs to freeze old committed XIDs, however - that's the
gain from this patch. (The exception is upholding the invariant: if
vacuum removes some dead tuples from a page and thereby updates its
LSN, it must still freeze the page's remaining old XIDs.)

A new range is created whenever we reach the maxxid on the current one.
The new range's minxid is set to the current global oldest xmin value,
and maxxid is just the old maxxid plus a fixed number (1 million in the
patch, but we probably want a higher value in reality). This ensures
that if you modify a page and update its LSN, all the existing XIDs on
the page that cannot be frozen yet are greater than the minxid of the
latest range. In other words, you can always freeze old XIDs on a page,
so that any remaining non-frozen XIDs are within the minxid-maxxid of
the latest range.
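
A sketch of that range-switch step, again with made-up names
(RANGE_XID_STEP stands in for the 1 million constant from the patch):

    #define RANGE_XID_STEP  1000000     /* per-range XID budget in the patch */

    /* Called when the next XID to assign would exceed ranges[0].maxxid. */
    static void
    AdvanceXidLSNRange(XidLSNRange *ranges, XLogRecPtr insertpos,
                       TransactionId oldestXmin)
    {
        XidLSNRange newrange;

        newrange.minlsn = insertpos;    /* current WAL insert position */
        newrange.minxid = oldestXmin;   /* current global oldest xmin */
        newrange.maxxid = ranges[0].maxxid + RANGE_XID_STEP;

        PushRange(ranges, &newrange);   /* hypothetical: shift older entries down */
    }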

The HeapTupleSatisfies functions are modified to look at the page's LSN
first. If it's old enough, they don't look at the XIDs on the page at
all, and just consider everything on the page visible to everyone
(I'm calling this state a "mature page").
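
For instance, at the top of HeapTupleSatisfiesMVCC() and friends,
something like this (PageIsMature is a hypothetical macro comparing the
page LSN against the oldest tracked range):

    Page page = BufferGetPage(buffer);

    /*
     * If the page's LSN predates the oldest tracked XID-LSN range,
     * every tuple on it is known to be visible to everyone: skip the
     * per-tuple XID and clog checks entirely.
     */
    if (PageIsMature(page))
        return true;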

> I think the tricky part is going to be figuring out the
> synchronization around half-epoch boundaries.

Yep. I skipped all those difficult parts in this first version. There
are two race conditions that need to be fixed:

1. When a page is updated, we check whether it needs to be frozen, by
testing whether its LSN is greater than the latest range's minlsn -
IOW, whether we've already modified the page, and thus frozen all older
tuples, within the current range.
However, it's possible that a new range is created immediately after
we've checked that. When we then proceed to do the actual update on the
page and WAL-log that, the new LSN falls within the next range, and we
should've frozen the page. I'm planning to fix that by adding a "parity
bit" to the page header. Each XID-LSN range is assigned a parity bit, 0
or 1. When we check whether a page needs to be frozen on update, we make
note of the latest range's parity bit, and write it in the page header.
Later, when we look at the page's LSN to determine which XID-LSN range
it belongs to, we compare the parity. If it doesn't match, we know that
the race condition happened, so we treat the page as belonging to the
previous range, not the one it would normally belong to per the LSN.
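
On the reading side, that check could go roughly like this (RangeForLSN
as sketched earlier; the other names are hypothetical too):

    XidLSNRange *r = RangeForLSN(ranges, nranges, PageGetLSN(page));

    /*
     * Detect the race: a new range was created between the pre-update
     * freeze check and the WAL insertion.  The parity bit stored in
     * the page header records which range the updater actually saw.
     */
    if (r != NULL && PageGetParityBit(page) != RangeParityBit(r))
        r = RangeBefore(r);     /* hypothetical: the next-older range */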

2. When we look at a page and determine that it's not old enough to be
"mature", we then check the clog as usual. (A page is considered mature
if its XID-LSN range, and the corresponding clog pages, have already
been truncated away.) It's possible that between those two steps the
XID-LSN range and clog are truncated away, so that the backend tries to
access a clog page that doesn't exist anymore. To fix that, the XID-LSN range and
clog truncation needs to be done in two steps. First, mark the
truncation point in shared memory. Then somehow wait until all backends
see the new value, and go ahead with actually truncating the clog only
after that.
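
In outline, with hypothetical names for everything except TruncateCLOG
(the "wait" in step 2 is the part that needs a real mechanism):

    /* Step 1: advertise the new truncation point in shared memory. */
    SetSharedXidTruncationPoint(newOldestXid);

    /*
     * Step 2: wait until every backend is guaranteed to have seen the
     * new value, before anything is physically removed.
     */
    WaitForBackendsToSeeTruncationPoint();

    /* Step 3: only now physically truncate clog and drop old ranges. */
    TruncateCLOG(newOldestXid);
    TruncateXidLSNRanges(newOldestXid);     /* hypothetical */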


Aside from those two race conditions, there are plenty of scalability
issues remaining. Currently, the shared XID-LSN range array is checked
every time a page is accessed, so it could quickly become a bottleneck.
Need to cache that information in each backend. Oh, and I didn't
implement the PD_RECENTLY_FROZEN bit in the page header yet, so you will
get a freezing frenzy right after a new XID-LSN range is created.

I'll keep hacking away on those things, but please let me know if you
see some fatal flaw with this plan.

- Heikki
