Multixid hindsight design - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Multixid hindsight design |
Date | |
Msg-id | 55511D1F.7050902@iki.fi |
Responses | Re: Multixid hindsight design |
List | pgsql-hackers |
I'd like to discuss how we should've implemented the infamous 9.3 multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight is always 20/20 - I'll readily admit that I didn't understand the problems until well after the release - so this isn't meant to bash what's been done. Rather, let's think of the future.

The main problem with the infamous multixid changes was that they made pg_multixact a permanent, critical piece of data. Without it, you cannot decipher whether some rows have been deleted or not. The 9.3 changes uncovered pre-existing issues with vacuuming and wraparound, but the fact that multixids are now critical turned those otherwise relatively harmless bugs into data loss.

We have pg_clog, which is a similar critical data structure. That's a pain too - you need VACUUM, and you can't easily move tables from one cluster to another, for example - but we've learned to live with it. But we certainly don't need any more such data structures.

So the lesson here is that having a permanent pg_multixact is not nice, and we should get rid of it. Here's how to do that:

Looking at the tuple header, the CID and CTID fields are only needed when either xmin or xmax is running. Almost: in a HOT-updated tuple, CTID is required even after xmax has committed, but since it's a HOT update, the new tuple is always on the same page, so you only need the offset number part. That leaves us with 8 bytes that are always available for storing "ephemeral" information. By ephemeral, I mean that it is only needed while xmin or xmax is in progress. After that, e.g. after a shutdown, it's never looked at.

Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed by a 64-bit pointer, which means that it never wraps around. That 64-bit pointer is stored in the tuple header, in those 8 ephemeral bytes currently used for CID and CTID.
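
To make the addressing concrete, here is a minimal sketch in C of how a 64-bit TED pointer could be split across the 8 ephemeral header bytes and mapped to a location in an SLRU-style file layout. None of this exists in PostgreSQL: every name (TedPointer, EphemeralBytes, ted_pack, ted_unpack, ted_locate) is hypothetical. It also assumes that the pointer is a plain byte offset, that the 8 bytes are the 4-byte CID field plus the 4-byte block-number part of CTID, and that pages are 8 kB with 32 pages per segment file.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants, chosen to resemble an SLRU with 8 kB pages. */
#define TED_PAGE_SIZE         8192u
#define TED_PAGES_PER_SEGMENT 32u

/* Assumed to be a byte offset into the TED area; 64 bits, so no wraparound. */
typedef uint64_t TedPointer;

/* The 8 header bytes that are free once xmin/xmax are no longer running. */
typedef struct EphemeralBytes
{
    uint32_t cid_field;   /* 4 bytes normally holding the CID */
    uint32_t ctid_block;  /* 4-byte block-number part of CTID */
} EphemeralBytes;

static_assert(sizeof(EphemeralBytes) == 8, "must fit in the 8 ephemeral bytes");

/* Where a TED pointer lands in the segment/page/offset layout. */
typedef struct TedLocation
{
    uint64_t segment;  /* segment file number */
    uint32_t page;     /* page within the segment */
    uint32_t offset;   /* byte offset within the page */
} TedLocation;

/* Split the pointer into its high and low 32-bit halves. */
static EphemeralBytes
ted_pack(TedPointer ptr)
{
    EphemeralBytes e;

    e.cid_field = (uint32_t) (ptr >> 32);
    e.ctid_block = (uint32_t) (ptr & 0xFFFFFFFFu);
    return e;
}

/* Reassemble the pointer from the two header fields. */
static TedPointer
ted_unpack(EphemeralBytes e)
{
    return ((TedPointer) e.cid_field << 32) | (TedPointer) e.ctid_block;
}

/* Translate a pointer into segment, page and in-page offset. */
static TedLocation
ted_locate(TedPointer ptr)
{
    TedLocation loc;
    uint64_t    pageno = ptr / TED_PAGE_SIZE;

    loc.segment = pageno / TED_PAGES_PER_SEGMENT;
    loc.page = (uint32_t) (pageno % TED_PAGES_PER_SEGMENT);
    loc.offset = (uint32_t) (ptr % TED_PAGE_SIZE);
    return loc;
}
```

The offset-number part of CTID is deliberately left out of EphemeralBytes, since a HOT-updated tuple still needs it after xmax has committed.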

Whenever a tuple is deleted/updated and locked at the same time, a TED entry is created for it in the new SLRU, and the pointer to the entry is put on the tuple. In the TED entry, we can use as many bytes as we need to store the ephemeral data. It would include the CID (or possibly both CMIN and CMAX separately, now that we have the space), CTID, and the locking XIDs. The list of locking XIDs could be stored there directly, replacing multixids completely, or we could store a multixid there, and use the current pg_multixact system to decode it. Or we could store the multixact offset in the TED, replacing the multixact offset SLRU, but keep the multixact member SLRU as is.

The XMAX stored on the tuple header would always be a real transaction ID, not a multixid. Hence locked-only tuples don't need to be frozen afterwards.

The beauty of this would be that the TED entries can be zapped at restart, just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be WAL-logged, and we are free to change its on-disk layout even in a minor release.

Further optimizations are possible. If the TED entry fits in 8 bytes, it can be stored directly in the tuple header. Like today, if a tuple is locked but not deleted/updated, you only need to store the locker XID, and you can store the locking XID directly on the tuple. Or if it's deleted and locked, CTID is not needed, only CID and locker XID, so you can store those directly on the tuple. Plus some spare bits to indicate what is stored. And if the XMIN is older than global-xmin, you could also steal the XMIN field for storing TED data, making it possible to store 12 bytes directly in the tuple header. Plus some spare bits again to indicate that you've done that.

Now, given where we are, how do we get there? Upgrade is a pain, because even if we no longer generate any new multixids, we'll have to be able to decode the existing ones after pg_upgrade.
Perhaps condense pg_multixact into a simpler pg_clog-style bitmap at pg_upgrade, to make it small and simple to read, but it would nevertheless be a fair amount of code just to deal with pg_upgraded databases.

I think this is worth doing, even after we've fixed all the acute multixid bugs, because this would be more robust in the long run. It would also remove the need to do anti-wraparound multixid vacuums, and the newly-added tuning knobs related to that.

- Heikki