Re: Freezing without write I/O - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Re: Freezing without write I/O
Msg-id: 521CCC1F.6040904@vmware.com
In response to: Re: Freezing without write I/O (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: Freezing without write I/O (2 replies)
List: pgsql-hackers
On 10.06.2013 21:58, Heikki Linnakangas wrote:
> On 01.06.2013 23:21, Robert Haas wrote:
>> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
>>>> When this bit is set, it means there are no unfrozen tuples on the
>>>> page with XIDs that predate the current half-epoch. Whenever we know
>>>> this to be true, we set the bit. If the page LSN crosses more than
>>>> one half-epoch boundary at a time, we freeze the page and set the bit.
>>>> If the page LSN crosses exactly one half-epoch boundary, then (1) if
>>>> the bit is set, we clear it and (2) if the bit is not set, we freeze
>>>> the page and set the bit.
>>>
>>> Yep, I think that would work. Want to write the patch, or should I? ;-)
>>
>> Have at it.
>
> Here's a first draft. A lot of stuff missing and broken, but "make
> check" passes :-).
>
> In the patch, instead of working with "half-epochs", there are "XID-LSN
> ranges", which can be of arbitrary size. An XID-LSN range consists of
> three values:
>
> minlsn: The point in WAL where this range begins.
> minxid - maxxid: The range of XIDs allowed in this range.
>
> Every point in WAL belongs to exactly one range. The minxid-maxxid of
> the ranges can overlap. For example:
>
> 1. XIDs 25000942 - 27000003 LSN 0/3BB9938
> 2. XIDs 23000742 - 26000003 LSN 0/2AB9288
> 3. XIDs 22000721 - 25000003 LSN 0/1AB8BE0
> 4. XIDs 22000002 - 24000003 LSN 0/10B1550
>
> The invariant with the ranges is that a page with a certain LSN is only
> allowed to contain XIDs that belong to the range specified by that LSN.
> For example, if a page has LSN 0/3500000, it belongs to the 2nd range,
> and can only contain XIDs between 23000742 - 26000003. If a backend
> updates the page, so that its LSN is updated to, say, 0/3D12345, all
> XIDs on the page older than 25000942 must be frozen first, to avoid
> violating the rule.
> The system keeps track of a small number of these XID-LSN ranges. Where
> we currently truncate clog, we can also truncate the ranges with maxxid
> < the clog truncation point. Vacuum removes any dead tuples and updates
> relfrozenxid as usual, to make sure that there are no dead tuples or
> aborted XIDs older than the minxid of the oldest tracked XID-LSN range.
> It no longer needs to freeze old committed XIDs, however - that's the
> gain from this patch (except to uphold the invariant, if it has to
> remove some dead tuples on the page and update its LSN).
>
> A new range is created whenever we reach the maxxid on the current one.
> The new range's minxid is set to the current global oldest xmin value,
> and maxxid is just the old maxxid plus a fixed number (1 million in the
> patch, but we probably want a higher value in reality). This ensures
> that if you modify a page and update its LSN, all the existing XIDs on
> the page that cannot be frozen yet are greater than the minxid of the
> latest range. In other words, you can always freeze old XIDs on a page,
> so that any remaining non-frozen XIDs are within the minxid-maxxid of
> the latest range.
>
> The HeapTupleSatisfies functions are modified to look at the page's LSN
> first. If it's old enough, they don't look at the XIDs on the page at
> all, and just consider everything on the page visible to everyone
> (I'm calling this state a "mature page").
>
>> I think the tricky part is going to be figuring out the
>> synchronization around half-epoch boundaries.
>
> Yep. I skipped all those difficult parts in this first version. There
> are two race conditions that need to be fixed:
>
> 1. When a page is updated, we check if it needs to be frozen, i.e.
> whether its LSN is greater than the latest range's minlsn - IOW,
> whether we've already modified the page, and thus frozen all older
> tuples, within the current range. However, it's possible that a new
> range is created immediately after we've checked that.
> When we then proceed to do the actual update on the page and WAL-log
> that, the new LSN falls within the next range, and we should've frozen
> the page. I'm planning to fix that by adding a "parity bit" on the page
> header. Each XID-LSN range is assigned a parity bit, 0 or 1. When we
> check if a page needs to be frozen on update, we make note of the
> latest range's parity bit, and write it in the page header. Later, when
> we look at the page's LSN to determine which XID-LSN range it belongs
> to, we compare the parity. If the parity doesn't match, we know that
> the race condition happened, so we treat the page as belonging to the
> previous range, not the one it would normally belong to, per the LSN.
>
> 2. When we look at a page, and determine that it's not old enough to be
> "matured", we then check the clog as usual. A page is considered
> mature, if the XID-LSN range (and corresponding clog pages) has already
> been truncated away. It's possible that between those steps, the
> XID-LSN range and clog are truncated away, so that the backend tries to
> access a clog page that doesn't exist anymore. To fix that, the XID-LSN
> range and clog truncation need to be done in two steps. First, mark the
> truncation point in shared memory. Then somehow wait until all backends
> see the new value, and go ahead with actually truncating the clog only
> after that.
>
> Aside from those two race conditions, there are plenty of scalability
> issues remaining. Currently, the shared XID-LSN range array is checked
> every time a page is accessed, so it could quickly become a bottleneck.
> Need to cache that information in each backend. Oh, and I didn't
> implement the PD_RECENTLY_FROZEN bit in the page header yet, so you
> will get a freezing frenzy right after a new XID-LSN range is created.
>
> I'll keep hacking away on those things, but please let me know if you
> see some fatal flaw with this plan.

Here's an updated patch.
The race conditions I mentioned above have been fixed. This is still
definitely work-in-progress, but overall I think this is quite
promising. The patch basically works, although there are a bunch of
TODO items like handling unlogged tables.

This patch is also available in my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch
"freeze-by-xid-lsn-ranges".

- Heikki