Thread: Multixid hindsight design

Multixid hindsight design

From
Heikki Linnakangas
Date:
I'd like to discuss how we should've implemented the infamous 9.3 
multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight 
is always 20/20 - I'll readily admit that I didn't understand the 
problems until well after the release - so this isn't meant to bash 
what's been done. Rather, let's think of the future.

The main problem with the infamous multixid changes was that it made 
pg_multixact a permanent, critical, piece of data. Without it, you 
cannot decipher whether some rows have been deleted or not. The 9.3 
changes uncovered pre-existing issues with vacuuming and wraparound, but 
the fact that multixids are now critical turned those otherwise 
relatively harmless bugs into data loss.

We have pg_clog, which is a similar critical data structure. That's a 
pain too - you need VACUUM and you can't easily move tables from one 
cluster to another for example - but we've learned to live with it. But 
we certainly don't need any more such data structures.

So the lesson here is that having a permanent pg_multixact is not nice, 
and we should get rid of it. Here's how to do that:


Looking at the tuple header, the CID and CTID fields are only needed, 
when either xmin or xmax is running. Almost: in a HOT-updated tuple, 
CTID is required even after xmax has committed, but since it's a HOT 
update, the new tuple is always on the same page so you only need the 
offsetnumber part. That leaves us with 8 bytes that are always available 
for storing "ephemeral" information. By ephemeral, I mean that it is 
only needed when xmin or xmax is in-progress. After that, e.g. after a 
shutdown, it's never looked at.
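
To make that concrete, here's a rough sketch of the idea. This is 
illustrative only, with invented field names - the real 
HeapTupleHeaderData packs these fields differently:

#include <stdint.h>

/* Illustrative layout only; not the struct from htup_details.h. */
typedef struct
{
    uint32_t xmin;          /* inserting XID (permanent) */
    uint32_t xmax;          /* deleting/locking XID (permanent) */

    /* 8 "ephemeral" bytes, only meaningful while xmin/xmax are running: */
    uint32_t cid;           /* command id of inserter/deleter */
    uint32_t ctid_block;    /* block number of the next tuple version */

    uint16_t ctid_offset;   /* offset number; still needed after a HOT
                             * update to find the successor on the page */
    uint16_t infomask2;
    uint16_t infomask;
    uint8_t  hoff;
} IllustrativeTupleHeader;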

Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed 
by a 64-bit pointer, which means that it never wraps around. That 64-bit 
pointer is stored in the tuple header, in those 8 ephemeral bytes 
currently used for CID and CTID. Whenever a tuple is deleted/updated and 
locked at the same time, a TED entry is created for it, in the new SLRU, 
and the pointer to the entry is put on the tuple. In the TED entry, we 
can use as many bytes as we need to store the ephemeral data. It would 
include the CID (or possibly both CMIN and CMAX separately, now that we 
have the space), CTID, and the locking XIDs. The list of locking XIDs 
could be stored there directly, replacing multixids completely, or we 
could store a multixid there, and use the current pg_multixact system to 
decode them. Or we could store the multixact offset in the TED, 
replacing the multixact offset SLRU, but keep the multixact member SLRU 
as is.
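
As a rough sketch, a TED entry could look something like this. All the 
names here are hypothetical, and the exact contents are an open question 
as noted above:

#include <stdint.h>

typedef uint64_t TedPointer;    /* 64-bit, never wraps around; stored in
                                 * the 8 ephemeral bytes of the header */

/* Hypothetical layout of one TED entry. */
typedef struct
{
    uint32_t cmin;          /* inserting command id */
    uint32_t cmax;          /* deleting command id; no combo CIDs needed */
    uint32_t ctid_block;    /* forward link to the next tuple version */
    uint16_t ctid_offset;
    uint16_t nlockers;      /* number of locking XIDs that follow */
    uint32_t lockers[];     /* locker XIDs; alternatively a multixid or
                             * a multixact offset, as discussed above */
} TedEntry;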

The XMAX stored on the tuple header would always be a real transaction 
ID, not a multixid. Hence locked-only tuples don't need to be frozen 
afterwards.

The beauty of this would be that the TED entries can be zapped at 
restart, just like pg_subtrans, and pg_multixact before 9.3. It doesn't 
need to be WAL-logged, and we are free to change its on-disk layout even 
in a minor release.

Further optimizations are possible. If the TED entry fits in 8 bytes, it 
can be stored directly in the tuple header. Like today, if a tuple is 
locked but not deleted/updated, you only need to store the locker XID, 
and you can store the locking XID directly on the tuple. Or if it's 
deleted and locked, CTID is not needed, only CID and locker XID, so you 
can store those directly on the tuple. Plus some spare bits to indicate 
what is stored. And if the XMIN is older than global-xmin, you could 
also steal the XMIN field for storing TED data, making it possible to 
store 12 bytes directly in the tuple header. Plus some spare bits again 
to indicate that you've done that.
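
For illustration, the spare bits could be a couple of infomask-style 
flags along these lines (names and values invented here; they don't 
exist in the current infomask definitions):

/* Hypothetical flags describing how the 8 ephemeral bytes are used. */
#define EPHEMERAL_TED_POINTER   0x0001  /* 64-bit TED pointer */
#define EPHEMERAL_LOCKER_XID    0x0002  /* locked-only: locker XID inline */
#define EPHEMERAL_CID_AND_XID   0x0004  /* deleted+locked: CID + locker XID */
#define EPHEMERAL_XMIN_REUSED   0x0008  /* xmin older than global xmin,
                                         * reused for 4 more inline bytes */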


Now, given where we are, how do we get there? Upgrade is a pain, because 
even if we no longer generate any new multixids, we'll have to be able 
to decode them after pg_upgrade. Perhaps condense pg_multixact into a 
simpler pg_clog-style bitmap at pg_upgrade, to make it small and simple 
to read, but it would nevertheless be a fair amount of code just to deal 
with pg_upgraded databases.

I think this is worth doing, even after we've fixed all the acute 
multixid bugs, because this would be more robust in the long run. It 
would also remove the need to do anti-wraparound multixid vacuums, and 
the newly-added tuning knobs related to that.

- Heikki



Re: Multixid hindsight design

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> So the lesson here is that having a permanent pg_multixact is not nice, 
> and we should get rid of it. Here's how to do that:

That would be cool, but ...

> Looking at the tuple header, the CID and CTID fields are only needed, 
> when either xmin or xmax is running. Almost: in a HOT-updated tuple, 
> CTID is required even after xmax has committed, but since it's a HOT 
> update, the new tuple is always on the same page so you only need the 
> offsetnumber part.

I think this is totally wrong.  HOT update or not, you need the forward
link represented by ctid not just until xmin/xmax commit, but until
they're in the past according to all active snapshots.  That's so that
other transactions can chain forward to the "current" version of a tuple
which they found according to their snapshots.

It might be you can salvage the idea anyway, since it's still true that
the forward links wouldn't be needed after a crash and restart.  But the
argument as stated is wrong.

(There's also the question of forensic requirements, although I'm aware
that it's barely worth bringing that up since nobody else here seems to
put much weight on that.)
        regards, tom lane



Re: Multixid hindsight design

From
Heikki Linnakangas
Date:
On 05/12/2015 01:51 AM, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> So the lesson here is that having a permanent pg_multixact is not nice,
>> and we should get rid of it. Here's how to do that:
>
> That would be cool, but ...
>
>> Looking at the tuple header, the CID and CTID fields are only needed,
>> when either xmin or xmax is running. Almost: in a HOT-updated tuple,
>> CTID is required even after xmax has committed, but since it's a HOT
>> update, the new tuple is always on the same page so you only need the
>> offsetnumber part.
>
> I think this is totally wrong.  HOT update or not, you need the forward
> link represented by ctid not just until xmin/xmax commit, but until
> they're in the past according to all active snapshots.  That's so that
> other transactions can chain forward to the "current" version of a tuple
> which they found according to their snapshots.
>
> It might be you can salvage the idea anyway, since it's still true that
> the forward links wouldn't be needed after a crash and restart.  But the
> argument as stated is wrong.

Ah yes, I stated that wrong. What I meant was that they are not needed 
after xmin and xmax are older than global xmin.

> (There's also the question of forensic requirements, although I'm aware
> that it's barely worth bringing that up since nobody else here seems to
> put much weight on that.)

I do care about that. In this scheme, you would always have the 
updater/deleter XMAX on the tuple itself, which IMO is more useful for 
forensic purposes than a multixid. You lose the CID and CTID in the 
tuple (for tuples that are updated and locked at the same time), but if 
you keep the TED around longer, you have all the information still 
there. On the whole, I don't think this is much worse than the current 
situation.

- Heikki



Re: Multixid hindsight design

From
Robert Haas
Date:
On Mon, May 11, 2015 at 5:20 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> The main problem with the infamous multixid changes was that it made
> pg_multixact a permanent, critical, piece of data. Without it, you cannot
> decipher whether some rows have been deleted or not. The 9.3 changes
> uncovered pre-existing issues with vacuuming and wraparound, but the fact
> that multixids are now critical turned those otherwise relatively
> harmless bugs into data loss.

Agreed.

> We have pg_clog, which is a similar critical data structure. That's a pain
> too - you need VACUUM and you can't easily move tables from one cluster to
> another for example - but we've learned to live with it. But we certainly
> don't need any more such data structures.

Yes.

> So the lesson here is that having a permanent pg_multixact is not nice, and
> we should get rid of it. Here's how to do that:
>
> Looking at the tuple header, the CID and CTID fields are only needed, when
> either xmin or xmax is running. Almost: in a HOT-updated tuple, CTID is
> required even after xmax has committed, but since it's a HOT update, the new
> tuple is always on the same page so you only need the offsetnumber part.
> That leaves us with 8 bytes that are always available for storing
> "ephemeral" information. By ephemeral, I mean that it is only needed when
> xmin or xmax is in-progress. After that, e.g. after a shutdown, it's never
> looked at.
>
> Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed by
> a 64-bit pointer, which means that it never wraps around. That 64-bit
> pointer is stored in the tuple header, in those 8 ephemeral bytes currently
> used for CID and CTID. Whenever a tuple is deleted/updated and locked at the
> same time, a TED entry is created for it, in the new SLRU, and the pointer
> to the entry is put on the tuple. In the TED entry, we can use as many bytes
> as we need to store the ephemeral data. It would include the CID (or
> possibly both CMIN and CMAX separately, now that we have the space), CTID,
> and the locking XIDs. The list of locking XIDs could be stored there
> directly, replacing multixids completely, or we could store a multixid
> there, and use the current pg_multixact system to decode them. Or we could
> store the multixact offset in the TED, replacing the multixact offset SLRU,
> but keep the multixact member SLRU as is.
>
> The XMAX stored on the tuple header would always be a real transaction ID,
> not a multixid. Hence locked-only tuples don't need to be frozen afterwards.
>
> The beauty of this would be that the TED entries can be zapped at restart,
> just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
> WAL-logged, and we are free to change its on-disk layout even in a minor
> release.
>
> Further optimizations are possible. If the TED entry fits in 8 bytes, it can
> be stored directly in the tuple header. Like today, if a tuple is locked but
> not deleted/updated, you only need to store the locker XID, and you can
> store the locking XID directly on the tuple. Or if it's deleted and locked,
> CTID is not needed, only CID and locker XID, so you can store those directly
> on the tuple. Plus some spare bits to indicate what is stored. And if the
> XMIN is older than global-xmin, you could also steal the XMIN field for
> storing TED data, making it possible to store 12 bytes directly in the tuple
> header. Plus some spare bits again to indicate that you've done that.
>
> Now, given where we are, how do we get there? Upgrade is a pain, because
> even if we no longer generate any new multixids, we'll have to be able to
> decode them after pg_upgrade. Perhaps condense pg_multixact into a simpler
> pg_clog-style bitmap at pg_upgrade, to make it small and simple to read, but
> it would nevertheless be a fair amount of code just to deal with pg_upgraded
> databases.
>
> I think this is worth doing, even after we've fixed all the acute multixid
> bugs, because this would be more robust in the long run. It would also
> remove the need to do anti-wraparound multixid vacuums, and the newly-added
> tuning knobs related to that.

One danger is that in rearranging all of this stuff we may introduce
lots of new bugs.  I do agree that making multixacts need to survive a
server crash was not a good idea.  We liked freezing xmin so much, we
decided to freeze xmax, too?  Uggh.  As painful as that's been,
though, we're 18 months into it at this point.  If we do another
reorganization, are we going to end up back at month 0, where pretty much
everybody had corruption all the time rather than only some people on
some workloads?  Maybe not, but it's certainly something to worry
about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multixid hindsight design

From
Thomas Munro
Date:
On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> The beauty of this would be that the TED entries can be zapped at restart,
> just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
> WAL-logged, and we are free to change its on-disk layout even in a minor
> release.

What about prepared transactions?  They can lock rows FOR SHARE that
survive server restarts.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Multixid hindsight design

From
Greg Stark
Date:
On Mon, Jun 1, 2015 at 8:53 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> What about prepared transactions?  They can lock rows FOR SHARE that
> survive server restarts.

And they can make update chains that are still uncommitted after a restart.

I think we should think separately about what information we want to
store in the tuple ideally and only then worry about how to encode it
concisely as an optimization exercise. If you just grew every tuple by
64 bits, would this scheme be trivial? Perhaps it would be worth
implementing that way first and then worrying about how to recoup
that space later?


-- 
greg



Re: Multixid hindsight design

From
Simon Riggs
Date:
On 1 June 2015 at 20:53, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> The beauty of this would be that the TED entries can be zapped at restart,
> just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
> WAL-logged, and we are free to change its on-disk layout even in a minor
> release.

What about prepared transactions?  They can lock rows FOR SHARE that
survive server restarts.

Interesting comment. I'm not aware that we do.

If we do support row locking that survives server restart, how did it work before 9.3? 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multixid hindsight design

From
Andres Freund
Date:
On 2015-06-05 10:45:09 +0100, Simon Riggs wrote:
> On 1 June 2015 at 20:53, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> 
> > On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas <hlinnaka@iki.fi>
> > wrote:
> > > The beauty of this would be that the TED entries can be zapped at
> > restart,
> > > just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
> > > WAL-logged, and we are free to change its on-disk layout even in a minor
> > > release.
> >
> > What about prepared transactions?  They can lock rows FOR SHARE that
> > survive server restarts.
> >
> 
> Interesting comment. I'm not aware that we do.
> 
> If we do support row locking that survives server restart, how did it work
> before 9.3?

Multixacts were persistent before 9.3 as well. A good number of the bugs
existed then as well, but their effect was much more limited. The
difference is that now multixacts don't just have to survive till the
last locker isn't running anymore (which was determined by a horizon),
but that they have to live till they're vacuumed away, since xmax might
be stored in the multixact.



Re: Multixid hindsight design

From
Simon Riggs
Date:
On 11 May 2015 at 22:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I'd like to discuss how we should've implemented the infamous 9.3 multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight is always 20/20 - I'll readily admit that I didn't understand the problems until well after the release - so this isn't meant to bash what's been done. Rather, let's think of the future.

The main problem with the infamous multixid changes was that it made pg_multixact a permanent, critical, piece of data. Without it, you cannot decipher whether some rows have been deleted or not. The 9.3 changes uncovered pre-existing issues with vacuuming and wraparound, but the fact that multixids are now critical turned those otherwise relatively harmless bugs into data loss.

We have pg_clog, which is a similar critical data structure. That's a pain too - you need VACUUM and you can't easily move tables from one cluster to another for example - but we've learned to live with it. But we certainly don't need any more such data structures.

So the lesson here is that having a permanent pg_multixact is not nice, and we should get rid of it. Here's how to do that:

I think we should think back to exactly what we are trying to store, why and for how long. A clear definition of what we are trying to achieve is essential to solving the problem.

In my understanding we need to store
* at most one xid - the Updating Xid
* potentially many Locking Xids

The current design has two SLRUs and puts all of those xids in the Members SLRU, causing it to need to be persistent.

The problems come from having significant numbers of locking xids. My understanding is that any change in the number of lockers requires the full array to be rewritten. So with N lockers we end up with 2N-1 arrays, each with an average of N/2 members, or roughly N^2 entries in total, i.e. an O(N^2) algorithm, which makes it a bad thing to persist. Assuming that design continues mostly unchanged in its core points...

An alternate proposal:

1. Store only the Locking xids in the Members SLRU
2. In the Offsets SLRU store: 1) the Updating Xid and 2) the offset to the Locking xids in the Members SLRU. 

This means the Offsets SLRU will be around twice the size it was before BUT since we reduce the size of each Members array by one, there is a balanced saving there, so this change is disk-space-neutral.

That way if we need to make Offsets SLRU persistent it won't bloat.
We then leave the Members SLRU as non-persistent, just as it was <9.3
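
As a rough sketch (names invented), the Offsets entry then becomes a pair, 
and only the locker xids live in the non-persistent Members SLRU:

#include <stdint.h>

/* Hypothetical pg_multixact/offsets entry under this proposal. */
typedef struct
{
    uint32_t updating_xid;      /* at most one updater/deleter, or zero */
    uint32_t member_offset;     /* start of the locker-XID array in the
                                 * (non-persistent) members SLRU */
} MultiXactOffsetEntry;         /* twice the old 4-byte offset entry, but
                                 * each members array shrinks by one */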

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multixid hindsight design

From
Simon Riggs
Date:
On 5 June 2015 at 11:02, Andres Freund <andres@anarazel.de> wrote:
On 2015-06-05 10:45:09 +0100, Simon Riggs wrote:
> On 1 June 2015 at 20:53, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> > On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas <hlinnaka@iki.fi>
> > wrote:
> > > The beauty of this would be that the TED entries can be zapped at
> > restart,
> > > just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
> > > WAL-logged, and we are free to change its on-disk layout even in a minor
> > > release.
> >
> > What about prepared transactions?  They can lock rows FOR SHARE that
> > survive server restarts.
> >
>
> Interesting comment. I'm not aware that we do.
>
> If we do support row locking that survives server restart, how did it work
> before 9.3?

Multixacts were persistent before 9.3 as well. A good number of the bugs
existed then as well, but their effect was much more limited. The
difference is that now multixacts don't just have to survive till the
last locker isn't running anymore (which was determined by a horizon),
but that they have to live till they're vacuumed away, since xmax might
be stored in the multixact.

Phew! Had me worried for a minute.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multixid hindsight design

From
Robert Haas
Date:
On Fri, Jun 5, 2015 at 6:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I think we should think back to exactly what we are trying to store, why and
> for how long. A clear definition of what we are trying to achieve is
> essential to solving the problem.
>
> In my understanding we need to store
> * at most one xid - the Updating Xid
> * potentially many Locking Xids
>
> The current design has two SLRUs and puts all of those xids in the Members
> SLRU, causing it to need to be persistent.
>
> The problems come from having significant numbers of locking xids. My
> understanding is that any change in the number of lockers requires the full
> array to be rewritten. So with N lockers we end up with 2N-1 arrays, each
> array has an average of N/2 members, or N^2  entries, i.e. an O(N^2)
> algorithm, which makes it a bad thing to persist. Assuming that design
> continues mostly unchanged in its core points...
>
> An alternate proposal:
>
> 1. Store only the Locking xids in the Members SLRU
> 2. In the Offsets SLRU store: 1) the Updating Xid and 2) the offset to the
> Locking xids in the Members SLRU.
>
> This means the Offsets SLRU will be around twice the size it was before BUT
> since we reduce the size of each Members array by one, there is a balanced
> saving there, so this change is disk-space-neutral.
>
> That way if we need to make Offsets SLRU persistent it won't bloat.
> We then leave the Members SLRU as non-persistent, just as it was <9.3

Hmm, this is a neat idea.  It would have been easier to implement if
we'd thought of it before we released 9.3, though.  At this point, I
guess we'd have to either have a pg_upgrade compatibility break, or
teach pg_upgrade to rejigger the old files into the new file format,
or some other fix that's not immediately apparent to me.  And it also
sounds like a fair amount of work.  But it might be worth it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multixid hindsight design

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 11 May 2015 at 22:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> So the lesson here is that having a permanent pg_multixact is not nice,
>> and we should get rid of it. Here's how to do that:

> An alternate proposal:

> 1. Store only the Locking xids in the Members SLRU
> 2. In the Offsets SLRU store: 1) the Updating Xid and 2) the offset to the
> Locking xids in the Members SLRU.

> This means the Offsets SLRU will be around twice the size it was before BUT
> since we reduce the size of each Members array by one, there is a balanced
> saving there, so this change is disk-space-neutral.

> That way if we need to make Offsets SLRU persistent it won't bloat.
> We then leave the Members SLRU as non-persistent, just as it was <9.3

I don't think you can do that, because it supposes that locking XIDs need
not be remembered across a crash.  Don't prepared transactions break that
assumption?
        regards, tom lane



Re: Multixid hindsight design

From
Robert Haas
Date:
On Fri, Jun 5, 2015 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That way if we need to make Offsets SLRU persistent it won't bloat.
>> We then leave the Members SLRU as non-persistent, just as it was <9.3
>
> I don't think you can do that, because it supposes that locking XIDs need
> not be remembered across a crash.  Don't prepared transactions break that
> assumption?

Well, that issue existed before 9.3, too.  It's possible our old
releases weren't entirely correct either, but the big change in 9.3 is
that we have to keep MultiXacts around until they are frozen, rather
than just until their member transactions are no longer running.  If I
understand correctly, Simon's proposal would mean that
pg_multixact/offsets would still need to survive until freezing, but
pg_multixact/members would only need to survive until the member
transactions were no longer running.  That might span a crash or
restart, in the case of prepared transactions, but we could clean up
the member offsets when the prepared transactions were committed,
rather than having to scan every table in the cluster first.  That
only eliminates half the need for multixact vacuuming, but it's
something.

It would be a great deal nicer if we didn't have to keep ANY of the
transactional data for a tuple around once it's all-visible.  Heikki
defined ephemeral as "only needed when xmin or xmax is in-progress",
but if we extended that definition slightly to "only needed when xmin
or xmax is in-progress or commited but not all-visible" then the
amount non-ephemeral data in the tuple header is 5 bytes (infomasks +
t_hoff).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multixid hindsight design

From
Robert Haas
Date:
On Fri, Jun 5, 2015 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It would be a great deal nicer if we didn't have to keep ANY of the
> transactional data for a tuple around once it's all-visible.  Heikki
> defined ephemeral as "only needed when xmin or xmax is in-progress",
> but if we extended that definition slightly to "only needed when xmin
> or xmax is in-progress or committed but not all-visible" then the
> amount of non-ephemeral data in the tuple header is 5 bytes (infomasks +
> t_hoff).

OK, I was wrong here: if you only have that stuff, you can't
distinguish between a tuple that is visible to everyone and a tuple
that is visible to no one.  I think the minimal amount of data we need
in order to distinguish visibility once no relevant transactions are
in progress is one XID: either XMIN, if the tuple was never updated at
all or only by the inserting transaction or one of its subxacts; or
XMAX, if the inserting transaction committed.  The other visibility
information -- including (1) the other of XMIN and XMAX, (2) CMIN and
CMAX, and (3) the CTID -- are only interesting until the transactions
involved are no longer running and, if they committed, are visible to all
running transactions.

Heikki's proposal is basically to merge the 4-byte CID field and the
first 4 bytes of the CTID that currently store the block number into
one 8-byte field that can store a pointer into this new TED structure.
And after mulling it over, that sounds pretty good to me.  It's true
(as has been pointed out by several people) that the TED will need to
be persistent because of prepared transactions.  But it would still be
a big improvement over the status quo, because:

(1) We would no longer need to freeze MultiXacts.  TED wouldn't need
to be frozen either.  You'd just truncate it whenever RecentGlobalXmin
advances.

(2) If the TED becomes horribly corrupted, you can recover by
committing or aborting any prepared transactions, shutting the system
down, and truncating it, with no loss of data integrity.  Nothing in
the TED is required to determine whether tuples are visible to an
unrelated transaction - you only need it (a) to determine whether
tuples are visible to a particular command within a transaction that
has inserted, updated, or deleted the tuple and (b) to determine whether
tuples are locked.

(3) As a bonus, we'd eliminate combo CIDs, because the TED could have
space to separately store CMIN and CMAX.  Combo CIDs required special
handling for logical decoding, and they are one of the nastier
barriers to making parallelism support writes (because they are stored
in backend-local memory of unbounded size and therefore can't easily
be shared with workers), so it wouldn't be very sad if they went away.
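
To illustrate point (1), truncation could be as simple as something like 
the following.  Every name here is hypothetical; in reality this would go 
through the usual SLRU truncation machinery:

#include <stdint.h>

typedef uint64_t TedPointer;

static TedPointer ted_oldest_interesting;   /* current truncation horizon */

/*
 * Called whenever RecentGlobalXmin advances; 'horizon' is the oldest TED
 * pointer that any still-running or prepared transaction could have
 * written.  Nothing below it is needed again, and nothing is ever frozen.
 */
static void
ted_truncate(TedPointer horizon)
{
    if (horizon > ted_oldest_interesting)
        ted_oldest_interesting = horizon;
    /* ... then discard the underlying TED segments below the horizon ... */
}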

I'm not quite sure how to decide whether something like this is worth (a)
the work and (b) the risk of creating new bugs, but the more I think
about it, the more the principle of the thing seems sound to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multixid hindsight design

From
Simon Riggs
Date:
On 24 June 2015 at 14:57, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 5, 2015 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It would be a great deal nicer if we didn't have to keep ANY of the
> transactional data for a tuple around once it's all-visible.  Heikki
> defined ephemeral as "only needed when xmin or xmax is in-progress",
> but if we extended that definition slightly to "only needed when xmin
> or xmax is in-progress or committed but not all-visible" then the
> amount of non-ephemeral data in the tuple header is 5 bytes (infomasks +
> t_hoff).

OK, I was wrong here: if you only have that stuff, you can't
distinguish between a tuple that is visible to everyone and a tuple
that is visible to no one.  I think the minimal amount of data we need
in order to distinguish visibility once no relevant transactions are
in progress is one XID: either XMIN, if the tuple was never updated at
all or only by the inserting transaction or one of its subxacts; or
XMAX, if the inserting transaction committed.  The other visibility
information -- including (1) the other of XMIN and XMAX, (2) CMIN and
CMAX, and (3) the CTID -- are only interesting until the transactions
involved are no longer running and, if they committed, are visible to all
running transactions.

Heikki's proposal is basically to merge the 4-byte CID field and the
first 4 bytes of the CTID that currently store the block number into
one 8-byte field that can store a pointer into this new TED structure.
And after mulling it over, that sounds pretty good to me.  It's true
(as has been pointed out by several people) that the TED will need to
be persistent because of prepared transactions.  But it would still be
a big improvement over the status quo, because:

(1) We would no longer need to freeze MultiXacts.  TED wouldn't need
to be frozen either.  You'd just truncate it whenever RecentGlobalXmin
advances.

(2) If the TED becomes horribly corrupted, you can recover by
committing or aborting any prepared transactions, shutting the system
down, and truncating it, with no loss of data integrity.  Nothing in
the TED is required to determine whether tuples are visible to an
unrelated transaction - you only need it (a) to determine whether
tuples are visible to a particular command within a transaction that
has inserted, updated, or deleted the tuple and (b) to determine whether
tuples are locked.

(3) As a bonus, we'd eliminate combo CIDs, because the TED could have
space to separately store CMIN and CMAX.  Combo CIDs required special
handling for logical decoding, and they are one of the nastier
barriers to making parallelism support writes (because they are stored
in backend-local memory of unbounded size and therefore can't easily
be shared with workers), so it wouldn't be very sad if they went away.

I'm not quite sure how to decide whether something like this is worth (a)
the work and (b) the risk of creating new bugs, but the more I think
about it, the more the principle of the thing seems sound to me.

Splitting multixacts into a persistent part (xmax) and an ephemeral part (TED) is something I already proposed, so I support the concept; TED is a much better suggestion, so I support TED.

Your addition of removing combo CIDs is good also, since everything is then publicly visible.

I think we need to see a detailed design, and we also need to understand the size of this new beast. I'm worried it might become very big very quickly, causing problems for us in other ways. We would need to be certain that truncation can actually occur reasonably frequently and that there are no edge cases that cause it to bloat.

Though TED sounds nice, the way to avoid another round of on-disk bugs is by using a new kind of testing, not simply by moving the bits around.

It might be argued that we are increasing the diagnostic/forensic capabilities by making CIDs more public. We can use that...

The good thing I see from TED is that it allows us to test the on-disk outcome of concurrent activity. Currently we have isolationtester, but that is not married in any way to the on-disk state, which allows the situation where isolationtester can pass yet we have corrupted on-disk state. We should specify the on-disk tuple representation as a state machine and work out how to recheck that the new on-disk state matches the state transition that we performed.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multixid hindsight design

From
Robert Haas
Date:
On Wed, Jun 24, 2015 at 11:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Though TED sounds nice, the way to avoid another round of on-disk bugs is by
> using a new kind of testing, not simply by moving the bits around.

I agree that we can do much better at testing than we traditionally
have done, and I think pretty much everyone in the room for the
developer unconference session on testing at PGCon was also in
agreement with that.  I really like the idea of taking purpose-built
testing frameworks - like the one that Heikki created for the WAL
format changes - and polishing them to the point where they can go
into core.  That's more work, of course, but very beneficial.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multixid hindsight design

From
Simon Riggs
Date:
On 24 June 2015 at 16:30, Simon Riggs <simon@2ndquadrant.com> wrote:
 
Though TED sounds nice, the way to avoid another round of on-disk bugs is by using a new kind of testing, not simply by moving the bits around.

It might be argued that we are increasing the diagnostic/forensic capabilities by making CIDs more public. We can use that...

The good thing I see from TED is that it allows us to test the on-disk outcome of concurrent activity. Currently we have isolationtester, but that is not married in any way to the on-disk state, which allows the situation where isolationtester can pass yet we have corrupted on-disk state. We should specify the on-disk tuple representation as a state machine and work out how to recheck that the new on-disk state matches the state transition that we performed.

To put some more flesh on this idea...

What I'm suggesting is moving from a 2-session isolationtester to a 3-session isolationtester:

1. Session 1
2. Session 2
3. After-action confirmation that the planned state change exists correctly on disk, rather than simply having the correct behaviour

The absence of the third step in our current testing is what has led us to various bugs in the past (IMHO)

A fourth step would be to define the isolationtests in such a way that we can run them as burn-in tests for billions of executions.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Fwd: Multixid hindsight design

From
Craig Ringer
Date:
On 25 June 2015 at 00:52, Robert Haas <robertmhaas@gmail.com> wrote:

> I agree that we can do much better at testing than we traditionally
> have done, and I think pretty much everyone in the room for the
> developer unconference session on testing at PGCon was also in
> agreement with that.  I really like the idea of taking purpose-built
> testing frameworks - like the one that Heikki created for the WAL
> format changes - and polishing them to the point where they can go
> into core.  That's more work, of course, but very beneficial.

Something that'd really help with that, IMO, would be weakening the
requirement that everything be C or be core Perl. Instead require that
configure detect whether or not facilities for some tests are present,
and have them fail with a clean warning indicating they were skipped
for lack of pre-requisites at 'make' time.

I don't see that as significantly different to having some buildfarm
animals not bother to configure or test the PLs, SSL, etc. I
understand why adding to the mix required for the core server build
isn't acceptable, but hopefully separate test suites can be more
flexible. A free-for-all of languages and tools doesn't make sense,
but I'd like to see, say, python and the 'unittest' module added, and
to see acceptance of tests using psycopg2 to stream and decode WAL
from a slot.

Thoughts?

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Multixid hindsight design

From
Robert Haas
Date:
On Mon, Nov 9, 2015 at 2:02 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 25 June 2015 at 00:52, Robert Haas <robertmhaas@gmail.com> wrote:
>> I agree that we can do much better at testing than we traditionally
>> have done, and I think pretty much everyone in the room for the
>> developer unconference session on testing at PGCon was also in
>> agreement with that.  I really like the idea of taking purpose-built
>> testing frameworks - like the one that Heikki created for the WAL
>> format changes - and polishing them to the point where they can go
>> into core.  That's more work, of course, but very beneficial.
>
> Something that'd really help with that, IMO, would be weakening the
> requirement that everything be C or be core Perl. Instead require that
> configure detect whether or not facilities for some tests are present,
> and have them fail with a clean warning indicating they were skipped
> for lack of pre-requisites at 'make' time.
>
> I don't see that as significantly different to having some buildfarm
> animals not bother to configure or test the PLs, SSL, etc. I
> understand why adding to the mix required for the core server build
> isn't acceptable, but hopefully separate test suites can be more
> flexible. A free-for-all of languages and tools doesn't make sense,
> but I'd like to see, say, python and the 'unittest' module added, and
> to see acceptance of tests using psycopg2 to stream and decode WAL
> from a slot.
>
> Thoughts?

I actually kind of hate this sort of thing.  It's already the case
that some of the TAP tests are skipped if you don't have the
prerequisites, and while that solves the problem of spurious test
failures, it also means that the tests which have those dependencies
are run in fewer and fewer places.

Now I'm not dead set against changing anything at all here, but I want
to point out that we just added the TAP framework and already there
are proposals (like Kevin's snapshot too old patch) to require DBD::Pg
for some tests, which a lot of people aren't going to have, and I know
you and probably a few other people would like python, which is fair
enough, but other people are going to want other things, and pretty
soon we end up with a situation - if we're not already there -
where for a developer or packager to get a machine to the point where
it runs "all the tests" is a major undertaking.

Accepting more tools also increases the breadth of knowledge expected
of committers and patch authors.  If I commit a patch and it turns out
something breaks, I'm expected to fix that.  Now if it's in C or Perl,
I can.  If it's in Python, I can't.  If we add Python to the list of
things we test with, there will (I'm sure) be other committers who can
fix the C and Python stuff, but have no clue about the Perl.  The more
frameworks we support, the worse this gets.  And woe betide a patch
author whose feature requires adjustment of existing tests - if that
person actually notices the problem before commit, as opposed to the
buildfarm finding it, that person now needs to learn enough of every
language in which we support tests to update the existing tests and
add any new ones which are needed.  I'm starting to get the hang of
isolation tests, but I still don't *really* understand TAP tests,
really, and anything new we add is a whole new obstacle.

I think this is sort of an inherent problem with long-running
projects.  Everybody knows, for example, that the toolchain we're
using to build our documentation is pretty crufty.  But if we switched
to something else that used a different input format, then every
back-patchable doc fix would have to be done twice, once for the old
toolchain and once for the new, which means that at least some people
would have to understand both, for a good 5-6 years.  We'd also have
to get that new toolchain working on every platform we support, and
all of our packagers would have to switch to it, and of course we'd
want it to support all the output formats we have today.  Yet, if we
were starting over, I think there's about zero chance we'd pick this
approach.  It's just that switching now is a job and a half, and we
don't want to take that on lightly.  Similarly with testing frameworks
- except those are even worse, because nobody's saying that any of the
stuff we have now will ever go away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multixid hindsight design

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 9, 2015 at 2:02 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> Something that'd really help with that, IMO, would be weakening the
>> requirement that everything be C or be core Perl. Instead require that
>> configure detect whether or not facilities for some tests are present,
>> and have them fail with a clean warning indicating they were skipped
>> for lack of pre-requisites at 'make' time.

> I actually kind of hate this sort of thing.

FWIW, I agree with Robert's position.  Expanding the infrastructure
required for tests may make life simpler for the initial author of a test,
but we pay for that over and over again when you consider the long run.

I think insisting on having e.g. DBD::Pg is simply exporting the author's
work to other people; there isn't anything that that enables that you
can't do without it.  We have added test infrastructure when there simply
wasn't any other way to test something.  Both the isolation framework and
the TAP framework represent substantial investments of that sort.  I'm
fine with both of those.  But the bar to adding new requirements ought to
be pretty darn high, and not just "I was too lazy to write it without".
        regards, tom lane



Re: Multixid hindsight design

From
Craig Ringer
Date:
On 10 November 2015 at 02:26, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'd like to see, say, python and the 'unittest' module added, and
>> to see acceptance of tests using psycopg2 to stream and decode WAL
>> from a slot.
>
> I actually kind of hate this sort of thing.

For what it's worth, I don't actually like it either. However, I
suspect that the current restricted tools are a significant impediment
to test tooling and infrastructure improvement.

> Now I'm not dead set against changing anything at all here, but I want
> to point out that we just added the TAP framework and already there
> are proposals (like Kevin's snapshot too old patch) to require DBD::Pg
> for some tests, which a lot of people aren't going to have

The alternative seems to be driving psql -qAtz as a coprocess over a
pipeline as a poor man's database driver. I really doubt that's any
kind of improvement when it comes to the complexity of maintaining
tests, fixing them when they break, etc. I'd rather just write tests
in C against libpq despite the verbosity, annoyance of maintaining
them, etc.

Part of why I'm somewhat interested in seeing python and psycopg2
added is that I'd personally quite like to see widely used client
drivers covered by the buildfarm to some extent specifically to see if
we break them. Though really, it'd make sense to add the most
important drivers like psqlODBC and PgJDBC first if doing that.... and
that'd be a whole new barrel of fun.

> and I know
> you and probably a few other people would like python, which is fair
> enough, but other people are going to want other things, and pretty
> soon we we end up with a situation - if we're not already there -
> where for a developer or packager to get a machine to the point where
> it runs "all the tests" is a major undertaking.

That's the bit I find odd, I guess. It's *trivial* to add DBD::Pg on
any remotely sane host. Similarly Python and psycopg2. One package
each, or a very simple compile. It's nothing in comparison to setting
up to build our documentation... or getting pretty much any of the
basic libraries in place on Windows.

(However, the streaming replication support for psycopg2 hasn't landed
in mainline yet, so you can't just grab it from PGDG repos).

Adding more tools does not make the existing ones easier, of course,
so that's not really an argument in favour. I just suspect the cost of
adding them is being somewhat overblown. That said ... I don't
maintain a 1990s-era MIPS box, an old ARM 6, or whatever.

It's a pity in a way that the Java toolchain and build process is so
heavy, because otherwise it'd make sense to just use Java and PgJDBC.
PgJDBC development happens closer to core Pg than most drivers, within
the main Pg community. It's also crucial for PostgreSQL's success,
with huge adoption and a lot of people relying on it. JUnit is one of
the least awful test frameworks I've worked with. Java is widely
understood and relatively easy to follow even if you don't know it
well. Unfortunately I see around zero chance of getting a JVM, maven
or ant, and a bajillion libraries onto buildfarm members.

> Accepting more tools also increases the breadth of knowledge expected
> of committers and patch authors.

That's the real problem IMO. The cognitive and knowledge load.

It's exactly the same reason I just explained to someone else (the
not-UTF32 network run by the secret gov't conspiracy/aliens dude, as
it turns out) why the codebase requires English identifiers, variable
names, comments, etc. While I'd argue that the cost of adding Python
localized to some tests is lower than the cost of adding (say) code in
Russian to a contrib, it's not zero.

So I guess I've already argued against myself in another form.

Still, the alternative seems to be writing new bespoke tools on top of
libpq. Need something to analyse and decode a WAL stream? Write it in
C on top of libpq, add it to the test infrastructure. Probably write
new tests as plugins in C, too, or use an embedded Perl runtime, or
XS. Or concoct yet another custom language / test description, like
isolationtester, for the purpose. Those tools aren't much easier to
get rid of, once added, than new languages etc are.

Eventually it'd be good to be able to actually test things like
streaming replication in an automated manner. We're going to hit the
limitations of 'do some stuff and diff the output files' pretty
quickly there. Adding a new scripting language isn't exactly going to
fix that though.

> Similarly with testing frameworks
> - except those are even worse, because nobody's saying that any of the
> stuff we have now will ever go away.

Yeah. I guess that's the issue, isn't it. When something gets added
we're generally stuck with it forever. We still have PL/TCL!

Even if it were desirable (and ignoring the horrific backpatching pain
that'd result), nobody's going to leap up and volunteer to rewrite all
the Perl like the Windows build system so we could remove Perl. It's
there essentially forever. The concern that if we added
python+psycopg2 we'd be stuck with it too is more than valid.

Drat.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Multixid hindsight design

From
Robert Haas
Date:
On Mon, Nov 9, 2015 at 9:57 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 10 November 2015 at 02:26, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I'd like to see, say, python and the 'unittest' module added, and
>>> to see acceptance of tests using psycopg2 to stream and decode WAL
>>> from a slot.
>>
>> I actually kind of hate this sort of thing.
>
> For what it's worth, I don't actually like it either. However, I
> suspect that the current restricted tools are a significant impediment
> to test tooling and infrastructure improvement.

But that's exactly the argument that was made to justify adding the
TAP testing framework.  It didn't work?

>> Now I'm not dead set against changing anything at all here, but I want
>> to point out that we just added the TAP framework and already there
>> are proposals (like Kevin's snapshot too old patch) to require DBD::Pg
>> for some tests, which a lot of people aren't going to have
>
> The alternative seems to be driving psql -qAtz as a coprocess over a
> pipeline as a poor man's database driver. I really doubt that's any
> kind of improvement when it comes to the complexity of maintaining
> tests, fixing them when they break, etc. I'd rather just write tests
> in C against libpq despite the verbosity, annoyance of maintaining
> them, etc.

I thought about an idea like this: have some sort of dummy program,
and then have a module written in Perl that provides an interface to
it, so that we don't have to depend on DBD::Pg.  I'm not at all
convinced that's a bad plan.

> Part of why I'm somewhat interested in seeing python and psycopg2
> added is that I'd personally quite like to see widely used client
> drivers covered by the buildfarm to some extent specifically to see if
> we break them. Though really, it'd make sense to add the most
> important drivers like psqlODBC and PgJDBC first if doing that.... and
> that'd be a whole new barrel of fun.

EnterpriseDB has something like this, and it does catch some bugs, but
it is an unbelievable amount of work to keep it working.  When
something fails, it's unlikely that any individual developer has
exactly the right configuration on their machine to reproduce the
failure.  Now maybe we could come up with something that provides
similar test coverage but is less work to maintain - I'm certainly not
going to rule out the possibility that it can be done better than
EnterpriseDB has done it.  However, I bet it will still be a lot of
work, and unlike at EnterpriseDB where we've got entire teams of people
who spend major amounts of time on this kind of stuff, PostgreSQL
relies on its committers to keep the buildfarm green.

And if I commit something and it breaks the Win32 version of some test
that is written in Python and relies on psqlODBC, there is no way in
heck I'm committing myself to reproducing that environment.  Even if
it's Linux64 and JDBC I'm not doing it.  If you want to build such a
thing and maintain it in perpetuity, that is fine, and you don't need
my permission, and I and I'm sure many others will greatly appreciate
both the effort and any bugs that get found and fixed that way.  But
when you talk about putting this kind of thing to core you are asking
the committers to be responsible for it, and I'm still not very
confident we've got the TAP tests in a state where maintaining them is
going to be a reasonable level of effort.  Those tests failed on my
machine for months, and the only reason they eventually got fixed is
because I knew enough Perl to track the problem down to some screwy
coding in one of the Perl modules MacPorts installs.  If the same
thing happens with Python, it would be extremely difficult for me to
find it, because I don't actually know Python.

>> Accepting more tools also increases the breadth of knowledge expected
>> of committers and patch authors.
>
> That's the real problem IMO. The cognitive and knowledge load.

That's definitely a big part of it, but I think you're selling short
the amount of havoc that build dependencies cause.  On a Linux system
where you can install whatever you need via 'yum install', sure, it's
easy.  It's not so easy on Windows.  It's less easy on other UNIX
systems without modern package managers.  But that's just the tip of
the iceberg.  Installing your operating system's DBD::Pg package is
going to install a version of DBD::Pg compiled against the system's
libpq, not the latest PostgreSQL libpq that you just built.  Now
perhaps, you will say, that's OK: that way we test cross-version libpq
compatibility, too!  And on a certain level I can't argue with that.
But now you could have a situation where the tests pass for one user
and fail for another user even though both are compiling from the same
PostgreSQL commit and the same DBD::Pg version, and that kind of stuff
is a lot of work to track down.  And it may not really be a bug: maybe
the person who added the test intended the test to exercise some new
libpq feature which is indeed supported by DBD::Pg, but your build of
DBD::Pg doesn't have that because it was compiled against an older
version of libpq that didn't have that feature, so in your
environment, the test fails.

>> Similarly with testing frameworks
>> - except those are even worse, because nobody's saying that any of the
>> stuff we have now will ever go away.
>
> Yeah. I guess that's the issue, isn't it. When something gets added
> we're generally stuck with it forever. We still have PL/TCL!
>
> Even if it were desirable (and ignoring the horrific backpatching pain
> that'd result), nobody's going to leap up and volunteer to rewrite all
> the Perl like the Windows build system so we could remove Perl. It's
> there essentially forever. The concern that if we added
> python+psycopg2 we'd be stuck with it too is more than valid.
>
> Drat.

Yes, it sucks.  I think no matter how we try to improve testing, it's
likely to be a lot of work.  But personally I think it's better to put
that work into tools that live inside our tree and that we have full
control over, rather than adding dependencies.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company