Multixid hindsight design

I'd like to discuss how we should've implemented the infamous 9.3 
multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight 
is always 20/20 - I'll readily admit that I didn't understand the 
problems until well after the release - so this isn't meant to bash 
what's been done. Rather, let's think of the future.

The main problem with the infamous multixid changes was that they made 
pg_multixact a permanent, critical piece of data. Without it, you 
cannot decipher whether some rows have been deleted or not. The 9.3 
changes uncovered pre-existing issues with vacuuming and wraparound, but 
the fact that multixids are now critical turned those otherwise 
relatively harmless bugs into data loss.

We have pg_clog, which is a similarly critical data structure. That's a 
pain too - you need VACUUM, and you can't easily move tables from one 
cluster to another, for example - but we've learned to live with it. But 
we certainly don't need any more such data structures.

So the lesson here is that having a permanent pg_multixact is not nice, 
and we should get rid of it. Here's how to do that:


Looking at the tuple header, the CID and CTID fields are only needed 
while either xmin or xmax is running. Almost: in a HOT-updated tuple, 
CTID is required even after xmax has committed, but since it's a HOT 
update, the new tuple is always on the same page, so you only need the 
OffsetNumber part. That leaves us with 8 bytes that are always available 
for storing "ephemeral" information. By ephemeral, I mean that it is 
only needed while xmin or xmax is in progress. After that, e.g. after a 
shutdown, it's never looked at.
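
To illustrate, the overlay could look something like this. This is just 
a sketch, not the actual HeapTupleHeaderData definition, and all the 
names here are invented; only the standard typedefs (CommandId, 
BlockNumber, uint64) are real:

#include "postgres.h"
#include "storage/block.h"      /* BlockNumber */

/*
 * The CID (4 bytes) and the block-number half of CTID (4 bytes) are
 * only meaningful while xmin or xmax is in progress, so together they
 * can double as one 64-bit TED pointer. The OffsetNumber half of CTID
 * is kept separate, because HOT chains need it even after xmax has
 * committed.
 */
typedef union TupleEphemeral
{
    struct
    {
        CommandId   t_cid;      /* inserting/deleting command ID */
        BlockNumber t_blkno;    /* block-number part of CTID */
    }           fields;
    uint64      ted_pointer;    /* 64-bit pointer into the TED SLRU */
} TupleEphemeral;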

Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed 
by a 64-bit pointer, which means that it never wraps around. That 64-bit 
pointer is stored in the tuple header, in those 8 ephemeral bytes 
currently used for CID and CTID. Whenever a tuple is deleted/updated and 
locked at the same time, a TED entry is created for it in the new SLRU, 
and the pointer to the entry is put on the tuple. In the TED entry, we 
can use as many bytes as we need to store the ephemeral data. It would 
include the CID (or possibly both CMIN and CMAX separately, now that we 
have the space), CTID, and the locking XIDs. The list of locking XIDs 
could be stored there directly, replacing multixids completely, or we 
could store a multixid there and use the current pg_multixact system to 
decode it. Or we could store the multixact offset in the TED, 
replacing the multixact offset SLRU, but keep the multixact member SLRU 
as is.
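
For concreteness, a TED entry could look something along these lines. 
Purely illustrative; in the variant that keeps the multixact member 
SLRU, the locker array would be replaced by a multixid or a member 
offset:

#include "postgres.h"
#include "storage/itemptr.h"    /* ItemPointerData */

/*
 * Hypothetical TED entry layout. Entries are variable-size, so the
 * number of locking XIDs is stored explicitly.
 */
typedef struct TedEntry
{
    CommandId       cmin;       /* inserting command ID */
    CommandId       cmax;       /* deleting/updating command ID */
    ItemPointerData ctid;       /* link to the updated tuple version */
    int             nlockers;   /* number of locking XIDs */
    TransactionId   lockers[FLEXIBLE_ARRAY_MEMBER];
} TedEntry;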

The XMAX stored in the tuple header would always be a real transaction 
ID, never a multixid. Hence locked-only tuples don't need to be frozen 
afterwards.

The beauty of this would be that the TED entries can be zapped at 
restart, just like pg_subtrans, and like pg_multixact before 9.3. The 
TED doesn't need to be WAL-logged, and we are free to change its on-disk 
layout even in a minor release.

Further optimizations are possible. If the TED entry fits in 8 bytes, it 
can be stored directly in the tuple header. Like today, if a tuple is 
locked but not deleted/updated, only the locker XID is needed, and it 
can be stored directly on the tuple. Or if it's deleted and locked, CTID 
is not needed, only the CID and locker XID, so you can store those 
directly on the tuple. Plus some spare bits to indicate what is stored. 
And if the XMIN is older than global-xmin, you could also steal the XMIN 
field for storing TED data, making it possible to store 12 bytes 
directly in the tuple header. Plus some spare bits again to indicate 
that you've done that.
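
The spare bits might be encoded along these lines. Again just a sketch 
with invented names; where exactly the bits would live (t_infomask?) is 
left open:

/*
 * Two format bits plus one flag bit, saying how the ephemeral bytes
 * are to be interpreted.
 */
#define TED_FORMAT_MASK         0x03
#define TED_FORMAT_POINTER      0x00    /* the 8 bytes hold a TED pointer */
#define TED_FORMAT_LOCK_ONLY    0x01    /* a single locker XID, inline */
#define TED_FORMAT_CID_LOCKER   0x02    /* deleted+locked: CID + locker XID */
#define TED_XMIN_STOLEN         0x04    /* xmin field reused for TED data,
                                         * 12 bytes available inline */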


Now, given where we are, how do we get there? Upgrade is a pain, because 
even if we no longer generate any new multixids, we'll have to be able 
to decode old ones after pg_upgrade. Perhaps we could condense 
pg_multixact into a simpler pg_clog-style bitmap at pg_upgrade, to make 
it small and simple to read, but it would nevertheless be a fair amount 
of code just to deal with pg_upgraded databases.
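
Presumably the condensed format would only need to record, for each old 
multixid, whether it contained an update and whether the updater 
committed - two status bits per multixid, pg_clog style. Reading it 
could then be as simple as this (invented names, and glossing over 
segment handling):

#include "postgres.h"

#define CONDENSED_LOCKS_ONLY        0x00    /* no update member */
#define CONDENSED_UPDATE_ABORTED    0x01
#define CONDENSED_UPDATE_COMMITTED  0x02

#define ENTRIES_PER_BYTE    4       /* 2 bits per multixid */

/* Look up the condensed status of an old multixid in the bitmap. */
static int
CondensedMultiXactStatus(const uint8 *bitmap, MultiXactId multi)
{
    uint8   byte = bitmap[multi / ENTRIES_PER_BYTE];
    int     shift = (multi % ENTRIES_PER_BYTE) * 2;

    return (byte >> shift) & 0x03;
}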

I think this is worth doing, even after we've fixed all the acute 
multixid bugs, because this would be more robust in the long run. It 
would also remove the need for anti-wraparound multixid vacuums, and for 
the newly-added tuning knobs related to them.

- Heikki


