Re: Deriving Recovery Snapshots - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Deriving Recovery Snapshots
Msg-id 1224187241.3808.347.camel@ebony.2ndQuadrant
In response to Re: Deriving Recovery Snapshots  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: Deriving Recovery Snapshots
List pgsql-hackers
On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Each backend that existed on the master is represented by a PROC
> > structure in the ProcArray. These are known as "recovery procs" and are
> > similar to the dummy procs used for prepared transactions. All recovery
> > procs are "owned by" the Startup process. So there is no process for
> > which MyProc points to one of the recovery procs. This structure allows
> > us to record the top-level transactions and then put subtransactions in
> > the proc's subtransaction cache. A fixed one-to-one correspondence
> > allows efficient maintenance of the structures. We emulate all
> > transactional backends, including autovac.
> 
> We'll need to know the max_connections setting in the master, in order 
> to size the array correctly. Not a show-stopper, but would be nicer if 
> we didn't need to.

Yes. We'll probably need to add checks/configurability later. Unless you
have a way...

> > * The backend slot may not be reused for some time, so we should take
> > additional actions to keep state current and true. So we choose to log a
> > snapshot from the master into WAL after each checkpoint. This can then
> > be used to clean up any unobserved xids. It also provides us with our
> > initial state data, see later.
> 
> We don't need to log a complete snapshot, do we? Just oldestxmin should 
> be enough.

Possibly, but you're thinking of the case where we're already up and
running, when we can use less info.

The trouble is, you don't know when or if the standby will crash or be
shut down.
So we need regular full snapshots to allow it to re-establish full
information at regular points. So we may as well drop the whole snapshot
to WAL every checkpoint. To do otherwise would mean more code and less
flexibility.

With default settings that is at most 25600 bytes for the subxid cache,
plus maybe 2000 bytes for other info. In most cases, we will use less
than one WAL buffer.
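
As a rough check of the figures above, the arithmetic can be sketched as follows. The constants are assumptions rather than values stated in this message: 100 connections, 64 cached subxids per backend, 4-byte xids and 8 kB WAL buffers. The function names are made up for illustration.

```python
import math

# Assumed values, not taken from the message itself.
MAX_CONNECTIONS = 100          # assumed default max_connections
CACHED_SUBXIDS_PER_PROC = 64   # assumed size of each proc's subxid cache
XID_BYTES = 4                  # assumed width of a transaction id
OTHER_INFO_BYTES = 2000        # the "maybe 2000 bytes" estimate above
WAL_BUFFER_BYTES = 8192        # assumed WAL buffer size


def subxid_cache_bytes(max_connections=MAX_CONNECTIONS,
                       subxids_per_proc=CACHED_SUBXIDS_PER_PROC):
    """Bytes needed to log the subxid cache of every backend."""
    return max_connections * subxids_per_proc * XID_BYTES


def snapshot_bytes(subxids_per_proc):
    """Estimated size of one logged snapshot, including other info."""
    return subxid_cache_bytes(subxids_per_proc=subxids_per_proc) + OTHER_INFO_BYTES


def wal_buffers_needed(nbytes, buffer_bytes=WAL_BUFFER_BYTES):
    """Whole WAL buffers needed to hold nbytes."""
    return math.ceil(nbytes / buffer_bytes)


print(subxid_cache_bytes())  # 25600
```

Under these assumptions the upper bound only applies when every backend has a full subxid cache. A workload with a handful of subxids per backend stays inside a single buffer.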

> > UnobservedXids is maintained as a sorted array. This comes for free
> > since xids are always added in xid assignment order. This allows xids to
> > be removed via bsearch when WAL records arrive for the missing xids. It
> > also allows us to stop searching for xids once we reach
> > latestCompletedXid.
> 
> If we're going to have an UnobservedXids array, why don't we just treat 
> all in-progress transactions as Unobserved, and forget about the dummy 
> PROC entries?

That's a good question and I expected some debate on that.

The main problem is fatal errors that don't write abort records. By
reusing the PROC entries we can keep those to a manageable limit. If we
don't have that, the number of fatal errors could cause that list to
grow uncontrollably and we might overflow any setting, causing snapshots
to stall and new queries to hang. We really must have a way to place an
upper bound on the number of unobserved xacts. So we really need the
proc approach. But we also need the UnobservedXids array.
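
The bounding argument can be illustrated with a minimal sketch. It is not the patch's code: the class and method names are hypothetical, and it assumes each WAL record identifies the master backend slot that wrote it. Because a slot holds at most one top-level xid, seeing a new xid in a slot implies the previous one ended, even if a fatal error meant no abort record was written.

```python
class RecoveryProcs:
    """Hypothetical model: one slot per master backend, one xid per slot."""

    def __init__(self, max_slots):
        # slot id -> in-progress top-level xid, or None if the slot is idle
        self.slots = [None] * max_slots

    def observe(self, slot, xid):
        """Record that WAL showed `xid` running in `slot`.

        Returns the xid that this observation implicitly ended, or None.
        """
        previous = self.slots[slot]
        self.slots[slot] = xid
        if previous is not None and previous != xid:
            return previous
        return None

    def finish(self, slot):
        """A commit or abort record arrived for the xact in `slot`."""
        self.slots[slot] = None

    def in_progress(self):
        """Top-level xids currently believed to be running."""
        return sorted(x for x in self.slots if x is not None)


procs = RecoveryProcs(max_slots=3)
procs.observe(0, 100)
procs.observe(1, 101)
lost = procs.observe(0, 105)  # xid 100 died without an abort record
```

However many transactions fail silently, `in_progress()` can never return more entries than there are slots, which is the upper bound being argued for.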

It's definitely more code to have both, so I would not have chosen that
route if there was another way. The simple approach just doesn't cover
all possible cases, and we need to cover them all.

Having only an UnobservedXids array was my first thought and I said
earlier I would do it without using procs. Bad idea. Using the
UnobservedXids array means every xact removal requires a bsearch,
whereas with procs we can do a direct lookup, removing all xids in one
stroke. Much better for typical cases. Also, if we have procs we can use
the "no locks" approach in some cases, as per current practice on new
xid insertions.
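
The cost difference between the two removal paths can be sketched like this. It is a simplified illustration with hypothetical names: xid wraparound and locking are ignored, and a proc is modelled as a plain dict.

```python
import bisect


def remove_unobserved(unobserved, xid):
    """Remove one xid from a sorted list using binary search.

    Returns True if the xid was present. One search per xid removed.
    """
    i = bisect.bisect_left(unobserved, xid)
    if i < len(unobserved) and unobserved[i] == xid:
        del unobserved[i]
        return True
    return False


def end_transaction(procs, slot):
    """Clear a proc slot by direct index.

    Returns the top-level xid and all cached subxids removed in one step.
    """
    proc = procs[slot]
    removed = []
    if proc["xid"] is not None:
        removed = [proc["xid"]] + proc["subxids"]
    proc["xid"] = None
    proc["subxids"] = []
    return removed


unobserved = [200, 201, 205, 209]
remove_unobserved(unobserved, 205)

procs = [{"xid": 300, "subxids": [301, 302, 303]},
         {"xid": None, "subxids": []}]
gone = end_transaction(procs, 0)
```

With the array alone, ending a transaction with three subxids would take four separate searches. With the proc, a single indexed access clears them all.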

> Also, I can't help thinking that this would be a lot simpler if we just 
> treated all subtransactions the same as top-level transactions. The only 
> problem with that is that there can be a lot of subtransactions, which 
> means that we'd need a large UnobservedXids array to handle the worst 
> case, but maybe it would still be acceptable?

Yes, you see the problem. Without subtransactions, this would be a
simple issue to solve.

In one sense, I do as you say. When we make a snapshot we stuff the
UnobservedXids into the snapshot *somewhere*. We don't know whether they
are top level or subxacts. But we need a solution for when we run out of
top-level xid places in the snapshot. Which has now been provided,
luckily.

If we have no upper bound on snapshot size then *all* backends would
need a variable size snapshot. We must solve that problem or accept
having people wait maybe minutes for a snapshot in the worst case. I've
found one way of placing a bound on the number of xids we need to keep
in the snapshot. If there is another, better way of keeping it bounded I
will happily adopt it. I spent about 2 weeks sweating this issue...

I'm available tomorrow to talk in real time if people in the Dev room at
PGday want to discuss this, or have me explain the patch(es).

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support