Re: [RFC] GSoC Work on readonly queries done so far - Mailing list pgsql-hackers

From Florian G. Pflug
Subject Re: [RFC] GSoC Work on readonly queries done so far
Date
Msg-id 46687077.9030305@phlo.org
Whole thread Raw
In response to Re: [RFC] GSoC Work on readonly queries done so far  ("Simon Riggs" <simon@2ndquadrant.com>)
List pgsql-hackers
Simon Riggs wrote:
> On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote:
> 
>> .) Added a new GUC operational_mode, which can be set to either
>>     readwrite or readonly. If it is set to readwrite (the default),
>>     postgres behaves as usual. All the following changes are only
>>     in effect if operational_mode is set to readonly.
> 
> Do we need this? We are already InArchiveRecovery.
If I understand you correctly, you suggest that readonly queries
are allways allowed during archive recovery - so upon startup
postgres step through these states:  .) Initial recovery (Until we reach a consistent state)  .) Allow readonly queries
.) Finish recovery in the background (might mean recovering "forever"     on a PITR slave)  .) Allow readwrite queries
 

My plan was to have a global switch, which lets you choose between  .) All queries are readonly (Until the next
postmasterrestart at least),     but you get background replay  .) No background replay, but once replay is done,
readwritequeries     can be execute (Just what PG does now).
 

The main reason why I invented that global switch operational_mode was
to remove to need to switch between readonly mode and readwrite mode
on the fly.

>> .) Created a macro ASSUME_OPMODE_READWRITE that does elog(ERROR)
>>     if postgre is not in readwrite mode. This macro protects the
>>     following functions to make sure that no writes occur in
>>     readonly mode.
>>       SimpleLruWritePage, SLruPhysicalWritePage
>>       EndPrepare, FinishPreparedTransaction
>>       XLogInsert, XLogWrite, ShutdownXLog
>>       CreateCheckpoint
>>       MarkBufferDirty.
> 
> These are Asserts?
The macro ASSUME_OPMODE_READWRITE just does
if (!OperationalModeReadWrite)  elog(ERROR, ...)

>>   .) Don't start autovacuum and bgwriter. Instead of bgwriter, bgreplay
>>      is started, and it takes over that role that bgwriter play in the
>>      shutdown process.
> 
> Autovacuum -> understood.
> 
> What does bgreplay do? Why not just start bgwriter earlier and disable
> some of its other functionality while InRecovery?
See above - it seemed simpler to clearly seperate

>>   .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that
>>      is considered to be "later" than any other xid.
> 
> So you are bumping FirstNormalTransactionId up by one for this?
In fact I changed MaxTransactionId to 0xFFFFFFFE, and set
ReadOnlyTransactionId to 0xFFFFFFFF. Additionally, I changed
IsNormalTransactionId to test not only for >= FirstNormalTransactionid,
but also for <= MaxTransactionId.

> You're assuming then that we will "freeze" replay while we run a query?
No. My plan is to first get to a point where replay is freezes while
queries are running, and to then figure out a more intelligent way to do this.
I already have a few ideas how to do this, but I want to complete the "simple 
version", before I start with that work.

> Otherwise doing this will mean the snapshot changes as a query executes.
Why? It's only the xid of the transaction, not it's xmin and xmax that are
set to ReadOnlyTransactionId.

>>   .) A global ReadOnlySnapshot is maintained in shared memory. This is
>>      copied into backend local memory by GetReadonlySnapshotData (which
>>      replaces GetSnapshotData in readonly mode).
>>    .) Crash recovery is not performed in readonly mode - instead, postgres
>>       PANICs, and tells the DBA to restart in readwrite mode. Archive
>>       recovery of course *will* be allowed, but I'm not that far yet.
> 
> This is the very heart of the matter. This isn't just a technical issue,
> it goes to the heart of the use case for this feature. Can we recover
> while running queries?
Yes. My comment only applies only to crash recovery - i.e, recovery that happens
*without* a recovery.conf present, after a crash.
It only really matters if you do following  .) Start pg in readwrite mode.  .) Kill it / It crashes  .) Restart in
readonlymode.
 

The main different between crash recovery, and recovery from a filesystem-level
backup is the additional information that the backup label gives us in the
second case - more specifically, the minRecoveryLoc that we read from the
backup label. Only with that knowledge is "recovering until we reach
a consistent state" a welldefined operation. And readonly queries
can only be executed *after* we did this "minimal recovery". So if there is
crash recovery to be done, we best we could do is to recover, and then start
in readonly mode. If this is *really* what the DBA wants, he can just start
in readwrite mode first, then cleanly shut PG down, and restart in readonly mode.
> If not, how much time will we spend in replay
> mode v query mode? Will we be able to run long running queries *and*
> maintain a reasonable time to recover? Is this a mechanism for providing
> HA and additional query capacity, or is it just a mechanism for
> additional query capacity only? Those are open questions to which I
> don't have any answers yet myself.
My goal is to allow replay and queries to run concurrently, at least as
long as only inserts, updates and deletes to non-system tables are replayed
(Which I assume is 99% of the wal-traffic on a typical productive pg
installation).

> Will we switch back and forth between replay and query mode. 
> Do we connect to the master, or to the slave?
> If we connect to the slave will we accept new queries when in replay
> mode and pause them before we switch back to query mode.

*If* the replay process has to pause query execution, it will
acquire some global lock in write mode. That same lock will be acquired
in read mode if a client starts a transaction. So waiting for the recovery
process will "feel" just the same as waiting for another query to release
a lock.

>> Open Problems:
>> --------------
>>    .) Protecting MarkBufferDirty with ASSUME_OPMODE_READWRITE is troublesome,
>>       because callers usually call MarkBufferDirty from within a critical
>>       section, and thus elog(ERRROR) is turned into elog(PANIC). This e.g.
>>       happens with my patch if you call nextval() in readonly mode.
>>       Does anyone see a better solution then adding checks into
>>       all callers that are not otherwise protected from being called
>>       in readonly mode?
> 
> Do we need to do this at all?
We need a waterproof protection against writes in readonly mode. The current
"set transaction readonly" is far away from being that - it e.g. allows 
nextval(). So protecting MarkBufferDirty seems like a good last-resort 
protection - even a PANIC is *much* better than just silently changing data
on the slave I think.

>>    .) Since the slaves needs to track an Snapshot in shared memory, it cannot
>>       resize that snapshot to accomodate however many concurrent transactions
>>       might have been running on the master. My current plan is to detect if
>>       that global snapshot overflows, and to lock out readonly queries on the
>>       slave (and therefore remove the need of keeping the snapshot current)
>>       until the number of active xids on the master has dropped below
>>       max_connections on the slave. A warning will be written to the postgres
>>       log that suggest that the DBA increases the max_connections value on
>>       the slave.
> 
> Sized according to max_connections on the master?
Well, but how do I get that value? And the master might be restarted with
a different max_connections value while the slave keeps running...

greetings, Florian Pflug


pgsql-hackers by date:

Previous
From: Greg Smith
Date:
Subject: Re: Controlling Load Distributed Checkpoints
Next
From: "Joshua D. Drake"
Date:
Subject: Re: Controlling Load Distributed Checkpoints