Re: synchronous_commit = apply - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: synchronous_commit = apply
Date
Msg-id CANP8+j+jCpNoOjc-KQLtt4PDyOX2Sq6wYWqCSy6aaHWkvNa0hw@mail.gmail.com
Whole thread Raw
In response to synchronous_commit = apply  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: synchronous_commit = apply  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On 1 September 2015 at 20:25, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
 
Do you think it's reasonable to want to COMMIT a particular transaction on a master node, and then immediately run a read-only query on a hot standby node that is guaranteed to see that transaction?

Yes, that is reasonable and we've been discussing it for a few years now.
 
A friend of mine who works with a different RDBMS technology that can do that asked me how to achieve this with Postgres, and I suggested waiting for the standby's pg_last_xlog_replay_location() to be >= the master's pg_current_xlog_location() after COMMIT, which might involve some looping and sleeping.

As a quick weekend learning exercise/hack I recently went looking into how we could support $SUBJECT.  I discovered we already report the apply progress back to the master, and the synchronous waiting facility seemed to be all ready to support this.  In fact it seemed a little too easy so something tells me it must be wrong!  But anyway, please see the attached toy POC patch which does that.

As you say, that is the easy part.
 
The next problem is that the master can be waiting quite a long time for a reply from the remote walreceiver containing the desired apply LSN: in the best case it learns of apply progress from replies to subsequent unrelated records (which might be very soon on a busy system but still involves waiting for the next transaction's WAL flush), and in the worst case it needs to wait for wal_receiver_status_interval (10 seconds by default), which makes for a long COMMIT delay.  I was thinking that the solution to that may be to teach StartupLOG to signal the walreceiver after it updates XLogCtl->lastReplayedEndRecPtr, which should cause walrcv_receive to be interrupted and return early, and then walreceiver could send a reply if it sees that lastReplayedEndRecPtr has moved.  Maybe that would generate an unacceptably high frequency of signals, and maybe there is a better form of IPC for this.  Without introducing any new IPC, the walreceiver could instead simply report apply progress to the master whenever it sees that the apply LSN has moved after its regular NAPTIME_PER_CYCLE wait (100ms), but that would still introduces bogus latency.  A quick and dirty way to see that on top of the attached patch is to set requestReply = true in WalReceiverMain to force a send after every nap.

This problem is exactly why I wrote my recent patch to make WALWriter work in recovery.

Currently, the WALReceiver issues regular fsyncs that prevent it from replying in time. Also, the WALReceiver waits on incoming data only, so we can't (yet) set a latch when the Startup process has applied some records.

I've solved the first problem and know how to solve the second, just haven't coded it yet. I was expecting to do that for CF3 or CF4.

I don't think we should be using signals, nor would I expect them to work effectively while in an fsync. 

I can see that using synchronous_commit = apply in the practice might prove difficult:  how does a client know which node is the synchronous standby?  Perhaps those sorts of practical problems are the reason no one has done or wanted this.

It means we need quorum sync rep as well, to make this useful in practice without sacrificing HA.

Bringing my patch and Beena's patch together will solve this for us in 9.6

So yes, 1) we have thought of it and want it, 2) the basic patch is trivial, 3) but it isn't the main problem.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

Previous
From: ghanshyamb
Date:
Subject: PostgreSQL streaming replication doubt
Next
From: Masahiko Sawada
Date:
Subject: Re: [PROPOSAL] VACUUM Progress Checker.