Re: Proposal: "Causal reads" mode for load balancing reads without stale data - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date
Msg-id CAEepm=3X40PP-JDW2cnWyRQHod9VAu-oMBTXy7wZqdRMtyhi0w@mail.gmail.com
Whole thread Raw
In response to Re: Proposal: "Causal reads" mode for load balancing reads without stale data  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Proposal: "Causal reads" mode for load balancing reads without stale data  (Ants Aasma <ants.aasma@eesti.ee>)
Re: Proposal: "Causal reads" mode for load balancing reads without stale data  (Peter Eisentraut <peter_e@gmx.net>)
Re: Proposal: "Causal reads" mode for load balancing reads without stale data  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 11/11/2015 10:23 AM, Simon Riggs wrote:
On 11 November 2015 at 05:37, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

Many sites use hot standby servers to spread read-heavy workloads over more
hardware, or at least would like to.  This works well today if your
application can tolerate some time lag on standbys.  The problem is that
there is no guarantee of when a particular commit will become visible for
clients connected to standbys.  The existing synchronous commit feature is
no help here because it guarantees only that the WAL has been flushed on
another server before commit returns.  It says nothing about whether it has
been applied or whether it has been applied on the standby that you happen
to be talking to.

Thanks for working on this issue.

+1.

3.  Commit on the primary with "causal_reads = on" waits for all
'available' standbys either to apply the commit record, or to cease to be
'available' and begin raising the error if they are still alive (because
their authorizations have expired).


This causes every writer to wait.

What we want is to isolate the wait only to people performing a write-read
sequence, so I think it should be readers that wait. Let's have that debate
up front before we start reviewing the patch.

Agreed. And in the write-read sequence, you don't need to wait at the write either, it's enough that you wait just before you start doing the read. An application might do a lot of other things between the two, so that in most cases, there would in fact be no waiting as the record is already applied when you perform the read.

I'm thinking the client should get some kind of a token back from the commit, and it could use the token on the standby, to wait for that commit to be applied. The token could be just the XID, or the LSN of the commit record. Or the application could generate the token and pass it to the server in the commit, similar to how 2PC works. So the interaction would be something like:

In master:
BEGIN;
INSERT INTO FOO ...;
COMMIT;
Server returns: COMMITted with token 1234

Later, in standby:
BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
SELECT * FROM foo;
...

I thought about this question, and considered three different approaches:

1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what BerkeleyDB does in "read-your-writes" mode.  It means that application developers have the responsibility for correctly identifying transactions with causal dependencies and dealing with LSNs (or whatever equivalent tokens), potentially even passing them to other processes where the transactions are causally dependent but run by multiple communicating clients (for example, communicating microservices).  This makes it difficult to retrofit load balancing to pre-existing applications and (like anything involving concurrency) difficult to reason about as applications grow in size and complexity.  It is efficient if done correctly, but it is a tax on application complexity.

2.  Reader waits for a conservatively chosen LSN.  This is roughly what MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait = 1" modes.  Read transactions would start off by finding the current end of WAL on the primary, since that must be later than any commit that already completed, and then waiting for that to apply locally.  That means every read transaction waits for a complete replication lag period, potentially unnecessarily.  This is tax on readers with unnecessary waiting.

3.  Writer waits, as proposed.  In this model, there is no tax on readers (they have zero overhead, aside from the added complexity of dealing with the possibility of transactions being rejected when a standby falls behind and is dropped from 'available' status; but database clients must already deal with certain types of rare rejected queries/failures such as deadlocks, serialization failures, server restarts etc).  This is a tax on writers.

My thinking was that the reason for wanting to load balance over a set of hot standbys is because you have a very read-heavy workload, so it makes sense to tax the writers and leave the many dominant readers unburdened, so (3) should be better than (2) for the majority of users who want such a configuration.  (Note also that it's not a requirement to tax every write; with this proposal you can set causal_reads to off for those transactions where you know there is no possibility of a causally dependent read).

As for (1), my thinking was that most application developers would probably prefer not to have to deal with that type of interface.  For users who do want to do that, it would be comparatively simple to make that possible, and would not conflict with this proposal.  This proposal could be used by people retrofitting load balancing to an existing applications with relative ease, or simply not wanting to have to deal with LSNs and complexity.  (I have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout) separately, which could be called on a standby with the lsn obtained from pg_current_xlog_location() on the primary any time after a COMMIT completes, but I was thinking of that as a different feature addressing a different user base: people prepared to do more work to squeeze out some extra performance.)

--

pgsql-hackers by date:

Previous
From: Atri Sharma
Date:
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Next
From: Marko Tiikkaja
Date:
Subject: proposal: numeric scale functions