Re: Proposal: "Causal reads" mode for load balancing reads without stale data - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Msg-id: CAEepm=3prAR4LXnj+ju_ZuiM72tdG37soWzSdqUqc1ojy3C31Q@mail.gmail.com
In response to: Re: Proposal: "Causal reads" mode for load balancing reads without stale data (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 11 November 2015 at 09:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
>> 1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what BerkeleyDB does in "read-your-writes" mode.  It means that application developers have the responsibility for correctly identifying transactions with causal dependencies and dealing with LSNs (or whatever equivalent tokens), potentially even passing them to other processes where the transactions are causally dependent but run by multiple communicating clients (for example, communicating microservices).  This makes it difficult to retrofit load balancing to pre-existing applications and (like anything involving concurrency) difficult to reason about as applications grow in size and complexity.  It is efficient if done correctly, but it is a tax on application complexity.

> Agreed. This works if you have a single transaction connected through a pool that does statement-level load balancing, so it works in both session and transaction mode.
>
> I was in favour of a scheme like this myself, earlier, but have more thoughts now.
>
> We must also consider the need for serialization across sessions or transactions.
>
> In transaction pooling mode, an application could get assigned a different session, so a token would be much harder to pass around.
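
To make (1) concrete, here is roughly the dance the application (or a pooler acting on its behalf) has to perform by hand.  This is only an illustrative Python/psycopg2 sketch using the 9.x function names pg_current_xlog_location() and pg_last_xlog_replay_location(); the table name, timeout and error handling are made up, and nothing like this is in the patch.

import time
import psycopg2

def write_and_capture_token(primary_dsn):
    """Do a write on the primary and capture an LSN at or beyond its commit
    record, to be handed to later readers as a causality token."""
    conn = psycopg2.connect(primary_dsn)
    try:
        cur = conn.cursor()
        cur.execute("INSERT INTO my_table (v) VALUES (42)")  # hypothetical table
        conn.commit()
        cur.execute("SELECT pg_current_xlog_location()")     # 9.x name
        return cur.fetchone()[0]
    finally:
        conn.close()

def wait_for_token(standby_dsn, token, timeout=10.0):
    """Block until the standby has replayed at least up to the token; after
    that it is safe to run reads that depend on the earlier write."""
    conn = psycopg2.connect(standby_dsn)
    conn.autocommit = True
    try:
        cur = conn.cursor()
        deadline = time.time() + timeout
        while time.time() < deadline:
            cur.execute(
                "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), %s) >= 0",
                (token,))
            if cur.fetchone()[0]:
                return
            time.sleep(0.01)
        raise TimeoutError("standby did not replay up to %s in time" % token)
    finally:
        conn.close()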

>> 2.  Reader waits for a conservatively chosen LSN.  This is roughly what MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait = 1" modes.  Read transactions would start off by finding the current end of WAL on the primary, since that must be later than any commit that already completed, and then waiting for that to apply locally.  That means every read transaction waits for a complete replication lag period, potentially unnecessarily.  This is a tax on readers with unnecessary waiting.

> This tries to make it easier for users by forcing all users to experience a causality delay. Given the whole purpose of multi-node load balancing is performance, referencing the master again simply defeats any performance gain, so you couldn't ever use it for all sessions. It could be a USERSET parameter, so it could be turned off in most cases that didn't need it.  But it's easier to use than (1).
>
> Though this should be implemented in the pooler.
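
For comparison, (2) boils down to doing something like the following at the start of every read transaction, whether in the client or a pooler: grab the primary's current end of WAL and wait for the standby to replay past it, paying up to a full replication-lag wait even when it isn't needed.  Again just an illustrative sketch, not code from the patch.

import time
import psycopg2

def conservative_read_barrier(primary_dsn, standby_conn, timeout=10.0):
    """Option (2): before reading on the standby, wait until it has replayed
    everything the primary had written at this moment; any commit that has
    already returned to some client must be at or before that point."""
    p = psycopg2.connect(primary_dsn)
    try:
        pcur = p.cursor()
        pcur.execute("SELECT pg_current_xlog_location()")  # 9.x name
        horizon = pcur.fetchone()[0]
    finally:
        p.close()

    cur = standby_conn.cursor()
    deadline = time.time() + timeout
    while time.time() < deadline:
        cur.execute(
            "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), %s) >= 0",
            (horizon,))
        if cur.fetchone()[0]:
            return  # standby has caught up; reads will see all prior commits
        time.sleep(0.01)
    raise TimeoutError("standby lagging too far behind the primary")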

>> 3.  Writer waits, as proposed.  In this model, there is no tax on readers (they have zero overhead, aside from the added complexity of dealing with the possibility of transactions being rejected when a standby falls behind and is dropped from 'available' status; but database clients must already deal with certain types of rare rejected queries/failures such as deadlocks, serialization failures, server restarts etc).  This is a tax on writers.

> This would seem to require that all readers must first check with the master as to which standbys are now considered available, so it looks like (2).

No -- in (3), that is, this proposal, standbys don't check with the primary when you run a transaction.  Instead, the primary sends a constant stream of authorizations (in the form of keepalives sent every causal_reads_timeout / 2 in the current patch) to the standby, allowing it to consider itself available for a short time into the future (currently now + causal_reads_timeout - max_tolerable_clock_skew, to be specific -- I can elaborate on that logic in a separate email).  At the start of a transaction in causal reads mode (the first call to GetTransactionSnapshot, to be specific), the standby knows immediately, without communicating with the primary, whether it can proceed or must raise the error.  In the happy case, the reader simply compares the most recently received authorization's expiry time with the system clock and proceeds.

In the worst case, when contact is lost between primary and standby, the primary must stall causal_reads commits for causal_reads_timeout (see CausalReadsBeginStall).  Doing that makes sure that no causal reads commit can return (see CausalReadsCommitCanReturn) before the lost standby has definitely started raising the error for causal_reads queries (because its most recent authorization has expired), in case it is still alive and handling requests from clients.
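
In pseudo-code, the standby-side part of that is roughly the following -- a deliberately simplified Python sketch of the logic with invented names, not the patch's actual C code, which lives in the transaction-start and commit paths and uses the GUCs mentioned above.

import time

# Hypothetical stand-ins for the GUCs discussed above.
CAUSAL_READS_TIMEOUT = 4.0       # seconds
MAX_TOLERABLE_CLOCK_SKEW = 1.0   # seconds

class CausalReadsLease:
    """Tracks the most recent authorization received from the primary."""

    def __init__(self):
        self.expiry = 0.0

    def on_authorization(self, primary_timestamp):
        # The primary sends these keepalives every CAUSAL_READS_TIMEOUT / 2;
        # each one lets the standby consider itself available until
        # primary_timestamp + CAUSAL_READS_TIMEOUT - MAX_TOLERABLE_CLOCK_SKEW.
        self.expiry = (primary_timestamp
                       + CAUSAL_READS_TIMEOUT
                       - MAX_TOLERABLE_CLOCK_SKEW)

    def check_at_transaction_start(self):
        # No round trip to the primary: just compare the latest lease expiry
        # with the local clock and either proceed or raise the error.
        if time.time() > self.expiry:
            raise RuntimeError("standby is not available for causal reads")

# On the primary side, when contact with an available standby is lost, causal
# reads commits are stalled for CAUSAL_READS_TIMEOUT before they may return,
# so the lost standby's lease has certainly expired (and it is raising the
# error) by the time any such commit is acknowledged to a client.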

It is not at all like (2), which introduces a conservative wait at the start of every read transaction, slowing all readers down.  In (3), readers don't wait; they run (or are rejected) as fast as possible, but instead the primary has to do extra things.  Hence my categorization of (2) as a 'tax on readers', and of (3) as a 'tax on writers'.  The idea is that a site with a high ratio of reads to writes would prefer zero-overhead reads.
 
> The alternative is that we simply send readers to any standby and allow the pool to work out separately whether the standby is still available, which mostly works, but it doesn't handle sporadic slowdowns on particular standbys very well (if at all).

This proposal does handle sporadic slowdowns on standbys: it drops them from the set of available standbys if they don't apply fast enough, all the while maintaining the guarantee.  Though it occurs to me that it probably needs some kind of defence against too much flapping between available and unavailable (maybe some kind of back-off on the 'joining' phase that standbys go through when they transition from unavailable to available in the current patch, which I realize I haven't described yet -- but I don't want to get bogged down in details while we're talking about the 30,000 foot view).
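
Purely as an illustration of that kind of defence -- a hypothetical sketch, not something in the patch -- a rejoin back-off might look like this:

import time

class RejoinBackoff:
    """Hypothetical anti-flapping rule: each time a standby is dropped from
    the available set, make it wait longer before starting the 'joining'
    phase again; reset once it has stayed available for a while."""

    def __init__(self, initial_delay=1.0, max_delay=60.0):
        self.initial_delay = initial_delay
        self.max_delay = max_delay
        self.delay = initial_delay
        self.not_before = 0.0

    def on_dropped(self):
        # Standby failed to apply fast enough and was marked unavailable.
        self.not_before = time.time() + self.delay
        self.delay = min(self.delay * 2, self.max_delay)

    def may_start_joining(self):
        return time.time() >= self.not_before

    def on_stable(self):
        # Standby has been available long enough; forget the penalty.
        self.delay = self.initial_delay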
 
> I think we need to look at whether this does actually give us anything, or whether we are missing the underlying Heisenberg reality.

--
