Re: Multixid hindsight design - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Multixid hindsight design
Date
Msg-id CANP8+jKppa-S+qBJQtDmK5SYJcsRacHKVVE1TWZyPDaP_3H6sw@mail.gmail.com
In response to Re: Multixid hindsight design  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Multixid hindsight design  (Robert Haas <robertmhaas@gmail.com>)
Re: Multixid hindsight design  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On 24 June 2015 at 14:57, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 5, 2015 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It would be a great deal nicer if we didn't have to keep ANY of the
> transactional data for a tuple around once it's all-visible.  Heikki
> defined ephemeral as "only needed when xmin or xmax is in-progress",
> but if we extended that definition slightly to "only needed when xmin
> or xmax is in-progress or committed but not all-visible" then the
> amount of non-ephemeral data in the tuple header is 5 bytes (infomasks +
> t_hoff).

OK, I was wrong here: if you only have that stuff, you can't
distinguish between a tuple that is visible to everyone and a tuple
that is visible to no one.  I think the minimal amount of data we need
in order to distinguish visibility once no relevant transactions are
in progress is one XID: either XMIN, if the tuple was never updated at
all or only by the inserting transaction or one of its subxacts; or
XMAX, if the inserting transaction committed.  The other visibility
information -- including (1) the other of XMIN and XMAX, (2) CMIN and
CMAX, and (3) the CTID -- are only interesting until the transactions
involved are no longer running and, if they committed, visible to all
running transactions.

Heikki's proposal is basically to merge the 4-byte CID field and the
first 4 bytes of the CTID that currently store the block number into
one 8-byte field that can store a pointer into this new TED structure.
And after mulling it over, that sounds pretty good to me.  It's true
(as has been pointed out by several people) that the TED will need to
be persistent because of prepared transactions.  But it would still be
a big improvement over the status quo, because:

(1) We would no longer need to freeze MultiXacts.  TED wouldn't need
to be frozen either.  You'd just truncate it whenever RecentGlobalXmin
advances.

(2) If the TED becomes horribly corrupted, you can recover by
committing or aborting any prepared transactions, shutting the system
down, and truncating it, with no loss of data integrity.  Nothing in
the TED is required to determine whether tuples are visible to an
unrelated transaction - you only need it (a) to determine whether
tuples are visible to a particular command within a transaction that
has inserted, updated, or deleted the tuple and (b) determine whether
tuples are locked.

(3) As a bonus, we'd eliminate combo CIDs, because the TED could have
space to separately store CMIN and CMAX.  Combo CIDs required special
handling for logical decoding, and they are one of the nastier
barriers to making parallelism support writes (because they are stored
in backend-local memory of unbounded size and therefore can't easily
be shared with workers), so it wouldn't be very sad if they went away.
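
As a rough illustration of the layout described above, here is a minimal Python sketch. The thread only says that the 4-byte CID field and the 4-byte block-number part of the CTID would merge into one 8-byte TED pointer, and that the TED could store CMIN and CMAX separately. Everything else here (the names `TedEntry`, `encode_ted_pointer`, `decode_ted_pointer`, the `lockers` field, and the little-endian byte order) is an assumption made for the example, not part of the proposal.

```python
import struct
from dataclasses import dataclass

# Assumed encodings, for illustration only.
OLD_FIELDS = struct.Struct("<II")   # 4-byte CID + 4-byte block number
TED_POINTER = struct.Struct("<Q")   # one 8-byte pointer into the TED


@dataclass
class TedEntry:
    """Hypothetical TED record; field names are not from the thread."""
    cmin: int            # command id that inserted the tuple
    cmax: int            # command id that deleted it, kept separately
    lockers: tuple = ()  # xids holding locks on the tuple


def encode_ted_pointer(offset: int) -> bytes:
    """Pack a TED offset into the 8 bytes freed by the two old fields."""
    return TED_POINTER.pack(offset)


def decode_ted_pointer(raw: bytes) -> int:
    """Recover the TED offset from the 8-byte field."""
    return TED_POINTER.unpack(raw)[0]


# A toy TED: offset -> entry. CMIN and CMAX sit side by side,
# so no combo CID lookup is needed in this model.
ted = {4096: TedEntry(cmin=1, cmax=3)}
raw = encode_ted_pointer(4096)
entry = ted[decode_ted_pointer(raw)]
```

The point of the sketch is only that the merged field costs no extra header space: both encodings occupy 8 bytes.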

I'm not quite sure how to decide whether something like this is worth (a)
the work and (b) the risk of creating new bugs, but the more I think
about it, the more the principle of the thing seems sound to me.

Splitting multitrans into persistent (xmax) and ephemeral (TED) is something I already proposed, so I support the concept; TED is a much better suggestion, so I support TED.

Your addition of removing combo CIDs is also good, since everything is public.

I think we need to see a detailed design, and we also need to understand the size of this new beast. I'm worried it might become very big very quickly, causing problems for us in other ways. We would need to be certain that truncation can actually occur reasonably frequently and that there are no edge cases that cause it to bloat.
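
To make the bloat concern concrete, here is a toy Python model. It is a sketch under loud assumptions: `ToyTed` and `horizon` are hypothetical names, entries are keyed by a plain integer xid, xid wraparound is ignored, and `horizon` is only a crude stand-in for RecentGlobalXmin (the oldest xid still running, or the next xid if nothing is running).

```python
class ToyTed:
    """Toy TED: maps the xid that created an entry to its payload."""

    def __init__(self):
        self.entries = {}

    def add(self, xid, payload=None):
        self.entries[xid] = payload

    def truncate(self, horizon_xid):
        """Drop entries older than the horizon; return how many were dropped."""
        old = [xid for xid in self.entries if xid < horizon_xid]
        for xid in old:
            del self.entries[xid]
        return len(old)


def horizon(running_xids, next_xid):
    """Crude stand-in for RecentGlobalXmin (assumption, no wraparound)."""
    return min(running_xids, default=next_xid)


ted = ToyTed()
for xid in range(100, 110):
    ted.add(xid)

# One old transaction (xid 100) is still running: nothing can be truncated.
removed_while_blocked = ted.truncate(horizon({100}, next_xid=110))

# Once it finishes, the horizon advances and everything can go.
removed_after_finish = ted.truncate(horizon(set(), next_xid=110))
```

In this model a single long-running transaction pins the horizon and the structure only grows, which is the kind of edge case a detailed design would need to address.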

Though TED sounds nice, the way to avoid another round of on-disk bugs is by using a new kind of testing, not simply by moving the bits around.

It might be argued that we are increasing the diagnostic/forensic capabilities by making CIDs more public. We can use that...

The good thing I see from TED is that it allows us to test the on-disk outcome of concurrent activity. Currently we have isolationtester, but it is not tied in any way to the on-disk state, so isolationtester can pass while the on-disk state is corrupted. We should specify the on-disk tuple representation as a state machine and work out how to recheck that the new on-disk state matches the state transition that we performed.
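
A minimal sketch of what such a state-machine check could look like, in Python. The states and the transition table below are a hypothetical simplification invented for illustration; a real specification would have to be derived from the actual tuple header bits, and `TupleState`, `ALLOWED`, `check_transition` and `check_history` are not existing PostgreSQL names.

```python
from enum import Enum


class TupleState(Enum):
    INSERT_IN_PROGRESS = "insert in progress"
    LIVE = "live"
    LOCKED = "locked"
    DELETE_IN_PROGRESS = "delete in progress"
    DEAD = "dead"


S = TupleState

# Assumed legal transitions between on-disk states.
ALLOWED = {
    S.INSERT_IN_PROGRESS: {S.LIVE, S.DEAD},     # inserter commits or aborts
    S.LIVE: {S.LOCKED, S.DELETE_IN_PROGRESS},
    S.LOCKED: {S.LIVE, S.DELETE_IN_PROGRESS},   # lock released or upgraded
    S.DELETE_IN_PROGRESS: {S.DEAD, S.LIVE},     # deleter commits or aborts
    S.DEAD: set(),
}


def check_transition(before, after):
    """Return True if moving from `before` to `after` is a legal step."""
    return after in ALLOWED[before]


def check_history(states):
    """Return the index of the first illegal step, or None if all are legal."""
    for i, (before, after) in enumerate(zip(states, states[1:])):
        if not check_transition(before, after):
            return i
    return None


good = [S.INSERT_IN_PROGRESS, S.LIVE, S.LOCKED, S.LIVE,
        S.DELETE_IN_PROGRESS, S.DEAD]
bad = [S.INSERT_IN_PROGRESS, S.LIVE, S.DEAD]  # skips the delete step
```

A test harness could read each tuple's state from disk after every step of an isolation test and feed the observed sequence to `check_history`, so that a passing test also implies a legal on-disk history.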

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
