Re: Freezing without write I/O - Mailing list pgsql-hackers
From: Heikki Linnakangas <hlinnakangas@vmware.com>
Subject: Re: Freezing without write I/O
Msg-id: 51B621B9.8030903@vmware.com
In response to: Re: Freezing without write I/O (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Freezing without write I/O; Re: Freezing without write I/O
List: pgsql-hackers
On 01.06.2013 23:21, Robert Haas wrote:
> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
>>> When this bit is set, it means there are no unfrozen tuples on the
>>> page with XIDs that predate the current half-epoch. Whenever we know
>>> this to be true, we set the bit. If the page LSN crosses more than
>>> one half-epoch boundary at a time, we freeze the page and set the
>>> bit. If the page LSN crosses exactly one half-epoch boundary, then
>>> (1) if the bit is set, we clear it and (2) if the bit is not set, we
>>> freeze the page and set the bit.
>>
>> Yep, I think that would work. Want to write the patch, or should I? ;-)
>
> Have at it.

Here's a first draft. A lot of stuff is still missing and broken, but "make check" passes :-).

In the patch, instead of working with "half-epochs", there are "XID-LSN ranges", which can be of arbitrary size. An XID-LSN range consists of three values:

minlsn: the point in WAL where this range begins
minxid - maxxid: the range of XIDs allowed in this range

Every point in WAL belongs to exactly one range. The minxid-maxxid ranges can overlap. For example:

1. XIDs 25000942 - 27000003    LSN 0/3BB9938
2. XIDs 23000742 - 26000003    LSN 0/2AB9288
3. XIDs 22000721 - 25000003    LSN 0/1AB8BE0
4. XIDs 22000002 - 24000003    LSN 0/10B1550

The invariant with the ranges is that a page with a certain LSN is only allowed to contain XIDs that belong to the range specified by that LSN. For example, if a page has LSN 0/3500000, it belongs to the 2nd range, and can only contain XIDs between 23000742 - 26000003. If a backend updates the page, so that its LSN is updated to, say, 0/3D12345, all XIDs on the page older than 25000942 must be frozen first, to avoid violating the rule.

The system keeps track of a small number of these XID-LSN ranges. Where we currently truncate the clog, we can also truncate the ranges with maxxid < the clog truncation point.

Vacuum removes any dead tuples and updates relfrozenxid as usual, to make sure that there are no dead tuples or aborted XIDs older than the minxid of the oldest tracked XID-LSN range. It no longer needs to freeze old committed XIDs, however - that's the gain from this patch (except to uphold the invariant, if it has to remove some dead tuples on the page and update its LSN).

A new range is created whenever we reach the maxxid of the current one. The new range's minxid is set to the current global oldest-xmin value, and its maxxid is just the old maxxid plus a fixed number (1 million in the patch, but we probably want a higher value in reality). This ensures that if you modify a page and update its LSN, all the existing XIDs on the page that cannot be frozen yet are greater than the minxid of the latest range. In other words, you can always freeze the old XIDs on a page, so that any remaining non-frozen XIDs are within the minxid-maxxid of the latest range.

The HeapTupleSatisfies functions are modified to look at the page's LSN first. If it's old enough, they don't look at the XIDs on the page at all, and just consider everything on the page visible to everyone (I'm calling this state a "mature page"; the first sketch below shows this bookkeeping).

> I think the tricky part is going to be figuring out the
> synchronization around half-epoch boundaries.

Yep. I skipped all those difficult parts in this first version. There are two race conditions that need to be fixed (rough sketches of the bookkeeping and of both fixes follow right after this list):

1. When a page is updated, we check whether it needs to be frozen, by checking whether its LSN is greater than the latest range's LSN - IOW, whether we've already modified the page, and thus frozen all older tuples, within the current range. However, it's possible that a new range is created immediately after we've made that check. When we then proceed to do the actual update on the page and WAL-log it, the new LSN falls within the next range, and we should have frozen the page after all.

I'm planning to fix that by adding a "parity bit" to the page header. Each XID-LSN range is assigned a parity bit, 0 or 1. When we check whether a page needs to be frozen on update, we make note of the latest range's parity bit, and write it in the page header. Later, when we look at the page's LSN to determine which XID-LSN range it belongs to, we compare the parity. If the parity doesn't match, we know that this race condition happened, so we treat the page as belonging to the previous range, not the one it would normally belong to per its LSN.

2. When we look at a page and determine that it's not old enough to be "mature", we then check the clog as usual. A page is considered mature if its XID-LSN range (and the corresponding clog pages) have already been truncated away. It's possible that between those two steps, the XID-LSN range and clog are truncated away, so that the backend tries to access a clog page that doesn't exist anymore. To fix that, the XID-LSN range and clog truncation need to be done in two steps: first mark the truncation point in shared memory, then somehow wait until all backends see the new value, and go ahead with actually truncating the clog only after that.
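First, to make the range bookkeeping and the mature-page test concrete, here is a minimal sketch - not code from the patch. XLogRecPtr and TransactionId are simplified to plain integers, XID wraparound is ignored, and the names, the fixed-size array, and its contents (taken from the example list above) are all hypothetical:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for PostgreSQL's XLogRecPtr and TransactionId;
 * XID wraparound is ignored throughout this sketch. */
typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

/* One tracked XID-LSN range.  The parity bit alternates with each new
 * range; its role is explained under race condition 1 above. */
typedef struct XidLSNRange
{
    XLogRecPtr    minlsn;    /* point in WAL where this range begins */
    TransactionId minxid;    /* oldest XID allowed on pages in this range */
    TransactionId maxxid;    /* newest XID allowed on pages in this range */
    uint8_t       parity;    /* alternating 0/1 */
} XidLSNRange;

/* Hypothetical shared array of tracked ranges, newest first, filled
 * with the example values from the list above. */
#define NUM_TRACKED_RANGES 4
static const XidLSNRange ranges[NUM_TRACKED_RANGES] = {
    { 0x3BB9938, 25000942, 27000003, 1 },    /* 1. (latest) */
    { 0x2AB9288, 23000742, 26000003, 0 },    /* 2. */
    { 0x1AB8BE0, 22000721, 25000003, 1 },    /* 3. */
    { 0x10B1550, 22000002, 24000003, 0 },    /* 4. (oldest tracked) */
};

/* Every point in WAL belongs to exactly one range: the newest range
 * whose minlsn <= lsn.  NULL means the LSN predates all tracked
 * ranges, i.e. the range and its clog have been truncated away. */
static const XidLSNRange *
range_for_lsn(XLogRecPtr lsn)
{
    for (int i = 0; i < NUM_TRACKED_RANGES; i++)
    {
        if (lsn >= ranges[i].minlsn)
            return &ranges[i];
    }
    return NULL;
}

/* The short-circuit in the HeapTupleSatisfies functions: on a mature
 * page everything is visible to everyone, and neither the XIDs on the
 * page nor the clog are looked at. */
static bool
page_is_mature(XLogRecPtr page_lsn)
{
    return range_for_lsn(page_lsn) == NULL;
}

With these example values, a page with LSN 0/3500000 resolves to the 2nd range, matching the example above.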
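Continuing the same sketch, the freeze-on-update rule could look roughly like this; the page representation and the function name are hypothetical glue, though XID 2 matches PostgreSQL's real FrozenTransactionId:

/* Hypothetical page representation: only what the invariant needs. */
typedef struct PageSketch
{
    XLogRecPtr    lsn;
    uint8_t       parity;      /* parity stamped at the last freeze check */
    TransactionId xids[32];    /* XIDs of the tuples on the page */
    int           nxids;
} PageSketch;

/* PostgreSQL's permanent frozen XID is 2 (FrozenTransactionId). */
#define FROZEN_XID ((TransactionId) 2)

/* Called before a page is modified; the update will push the page LSN
 * into the latest range, so to uphold the invariant every XID older
 * than that range's minxid must be frozen first.  The latest range's
 * parity is stamped on the page for race condition 1. */
static void
prepare_page_for_update(PageSketch *page)
{
    const XidLSNRange *latest = &ranges[0];

    page->parity = latest->parity;

    /* Already modified within the latest range: the older tuples were
     * frozen then, so there is nothing left to freeze.  (This is the
     * check that race condition 1 makes unsafe without the parity bit.) */
    if (page->lsn >= latest->minlsn)
        return;

    /* Freeze everything that would fall outside the latest range. */
    for (int i = 0; i < page->nxids; i++)
    {
        if (page->xids[i] != FROZEN_XID && page->xids[i] < latest->minxid)
            page->xids[i] = FROZEN_XID;
    }
}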
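Next, a hedged guess at the parity-based fix for race condition 1 - on a parity mismatch, the page is attributed to the previous, older range:

/* Resolve the range a page belongs to, compensating for race
 * condition 1.  If a new range was created between the freeze check
 * and the WAL insertion, the page LSN lands in the new range even
 * though the page was only prepared for the previous one; the
 * mismatched parity reveals this. */
static const XidLSNRange *
range_for_page(XLogRecPtr page_lsn, uint8_t page_parity)
{
    const XidLSNRange *r = range_for_lsn(page_lsn);

    if (r == NULL)
        return NULL;                       /* mature page */

    if (page_parity != r->parity)
    {
        /* Treat the page as belonging to the previous, older range.
         * If that range is no longer tracked, the page is mature. */
        if (r == &ranges[NUM_TRACKED_RANGES - 1])
            return NULL;
        return r + 1;
    }
    return r;
}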
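And finally a rough sketch of the two-step truncation that fixes race condition 2. All names here are hypothetical, and the wait mechanism is deliberately a stub, since the "somehow wait until all backends see the new value" part is still an open question above:

#include <stdatomic.h>

/* Shared truncation point: pages whose range falls entirely below it
 * are mature, and their XIDs must never be looked up in the clog. */
static _Atomic TransactionId clog_trunc_xid;

/* Hypothetical stubs; the wait mechanism is left open on purpose. */
static void wait_until_all_backends_see(TransactionId xid) { (void) xid; }
static void unlink_old_clog_segments(TransactionId xid)    { (void) xid; }

/* Step 1: publish the new truncation point in shared memory.
 * Step 2: only after every backend is known to have observed it,
 * truncate the clog physically.  A backend that decided "not mature,
 * check the clog" against the old value can thus still find the clog
 * pages it needs. */
static void
truncate_ranges_and_clog(TransactionId new_trunc_xid)
{
    atomic_store(&clog_trunc_xid, new_trunc_xid);    /* publish */
    wait_until_all_backends_see(new_trunc_xid);      /* barrier, TBD */
    unlink_old_clog_segments(new_trunc_xid);         /* now safe */
}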
Aside from those two race conditions, there are plenty of scalability issues remaining. Currently, the shared XID-LSN range array is checked every time a page is accessed, so it could quickly become a bottleneck. That information needs to be cached in each backend.

Oh, and I didn't implement the PD_RECENTLY_FROZEN bit in the page header yet, so you will get a freezing frenzy right after a new XID-LSN range is created.

I'll keep hacking away on those things, but please let me know if you see some fatal flaw with this plan.

- Heikki