Re: Summary and Plan for Hot Standby - Mailing list pgsql-hackers

Simon Riggs wrote:
> There are two remaining areas of significant thought/effort:

Here's a list of other TODO items I've collected so far. Some of them
are just improvements or nice-to-have stuff, but some are more serious:

- If WAL recovery runs out of lock space while acquiring an
AccessExclusiveLock on behalf of a transaction that ran in the master,
it will FATAL and abort recovery, bringing down the standby. Seems like
it should wait/cancel queries instead.

- When switching from standby mode to normal operation, we momentarily
hold all AccessExclusiveLocks held by prepared xacts twice, needing
twice the lock space. You can run out of lock space at that point,
causing failover to fail.

- When replaying b-tree deletions, we currently wait out/cancel all
running (read-only) transactions. We take the ultra-conservative stance
because we don't know how recent the tuples being deleted are. If we
could store a better estimate for latestRemovedXid in the WAL record, we
could make that less conservative.
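To illustrate the idea, here's a toy sketch (function and variable names are mine, not PostgreSQL's): a standby backend only conflicts with a b-tree deletion if its snapshot could still see one of the removed heap tuples, i.e. its xmin is no newer than latestRemovedXid. With a precise estimate, only old snapshots need to be waited out or canceled; with no estimate, everyone does.

```python
# Hypothetical sketch of the conflict test. Backends are represented
# by the xmin of their snapshot; transaction IDs are plain integers
# here, ignoring real XID wraparound arithmetic.

def conflicting_backends(backend_xmins, latest_removed_xid):
    """Return the backends whose snapshots may still see a tuple
    that the deletion record removes."""
    return [xmin for xmin in backend_xmins
            if xmin <= latest_removed_xid]

# A precise latestRemovedXid lets the newer snapshots survive:
conflicting_backends([100, 205, 310], 200)  # only the xmin=100 backend
# With no estimate we must assume the worst and conflict with all
# running queries - the ultra-conservative behaviour described above.
```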

- The assumption that b-tree vacuum records don't need conflict
resolution, because we already did that with the additional
cleanup-info record, holds ATM, but it hinges on the fact that we
don't delete any tuples marked as killed while we vacuum. Deleting
those too seems like low-hanging fruit that I'd actually like to pick
now that I've spotted it, but then the b-tree vacuum records will need
to be fixed accordingly. We'd probably need to do something about the
previous item first to keep performance acceptable.

- There's the optimization to replay of b-tree vacuum records that we
discussed earlier: Replay has to touch all leaf pages because of the
interlock between heap scans, to ensure that we don't vacuum away a heap
tuple that a concurrent index scan is about to visit. Instead of
actually reading in and pinning all pages, during replay we could just
check that the pages needing no other work are not currently pinned in
the buffer cache.
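A toy sketch of that check (all names here are illustrative, not the real buffer manager API): a leaf page that is absent from the buffer cache, or resident but unpinned, cannot be mid-visit by a concurrent index scan, so replay would only need to do real work for the pinned ones.

```python
# Simplified model: the buffer cache is a dict mapping page number
# -> current pin count for pages that are resident in shared buffers.

def pages_needing_interlock(leaf_pages, buffer_cache_pins):
    """Return the leaf pages that replay cannot skip: those currently
    pinned by some backend. Unpinned or non-resident pages need no
    read-and-pin cycle under this scheme."""
    return [page for page in leaf_pages
            if buffer_cache_pins.get(page, 0) > 0]

# Pages 1 and 3 are not in the cache, page 4 is resident but unpinned,
# so only page 2 needs the full interlock:
pages_needing_interlock([1, 2, 3, 4], {2: 1, 4: 0})
```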

- Do we do the b-tree page pinning explained in the previous point
correctly at the end of index vacuum? ISTM we're not visiting any
pages after the last page that had dead tuples on it.

- Code structure. I moved much of the added code into a new standby.c
module that now takes care of replaying standby-related WAL records,
but there's code elsewhere too. I'm not sure this is a good division,
but it seems better than the original ad hoc arrangement where, e.g.,
lock-related WAL handling was in inval.c.

- The "standby delay" is measured as current timestamp - timestamp of
the last replayed commit record. If there's little activity in the
master, that can lead to surprising results. For example, imagine that
max_standby_delay is set to 8 hours. The standby is fully up-to-date
with the master, and there's no write activity in the master. After 10
hours, a long reporting query is started in the standby. Ten minutes
later, a small transaction is executed in the master that conflicts with
the reporting query. I would expect the reporting query to be canceled 8
hours after the conflicting transaction began, but it is in fact
canceled immediately, because more than 8 hours have passed since the
last commit record was replayed.
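The arithmetic of that scenario, as a toy timeline (hours since the standby caught up; all names are mine, for illustration only):

```python
# Why measuring "standby delay" against the last replayed commit
# record misfires when the master is idle.

MAX_STANDBY_DELAY = 8            # hours, the configured limit

last_commit_replayed = 0         # master went idle at hour 0
conflict_arrives = 10 + 10 / 60  # conflicting xact ~10h10m later

# Current rule: delay = now - timestamp of last replayed commit record.
delay = conflict_arrives - last_commit_replayed

# The delay is already over the limit the instant the conflict arrives,
# so the reporting query is canceled immediately instead of getting
# its 8 hours of grace measured from the conflicting transaction.
cancel_immediately = delay > MAX_STANDBY_DELAY
```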

- ResolveRecoveryConflictWithVirtualXIDs polls until the victim
transactions have ended. It would be much nicer to sleep. We'd need a
version of LockAcquire with a timeout. Hmm, IIRC someone submitted a
patch for lock timeouts recently. Maybe we could borrow code from that?
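The difference between the two approaches can be sketched with Python's threading primitives standing in for the real lock manager (PostgreSQL's LockAcquire API differs; this is only an analogy):

```python
import threading
import time

lock = threading.Lock()

def poll_for_lock(max_wait, interval=0.01):
    """What a polling loop effectively does: wake up repeatedly to
    re-check the lock, burning cycles on every poll."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if lock.acquire(blocking=False):
            return True
        time.sleep(interval)
    return False

def wait_for_lock(max_wait):
    """A LockAcquire with a timeout: block until the lock is granted
    or the deadline passes, with no intermediate wakeups."""
    return lock.acquire(timeout=max_wait)

got = wait_for_lock(0.1)  # True: the lock is free, granted at once
lock.release()
```

The timed wait lets the waiter sleep until either the victim transactions end (releasing the lock) or the deadline expires, instead of waking on a fixed interval to re-check.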


--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


